Monday, February 15, 2010

Semantic Web: an alternative for RDFa

A few years ago I thought that XHTML would eventually be widely used. When the W3C decided to standardize on HTML5 (which I love for non-Semantic Web reasons), that may have been the beginning of the end for RDFa, because RDFa as currently standardized is defined for XHTML, which is an XML application.

I believe that a better alternative in an HTML5 world is to keep RDF separate from web pages but have a clear set of rules for finding the RDF data files that correspond to web pages (either static or generated). One rule might be to look for a file named index.rdf for site root URLs; for example, see if http://markwatson.com/index.rdf exists for http://markwatson.com. For a URL like http://markwatson.com/hobbies, look for http://markwatson.com/hobbies.rdf or http://markwatson.com/hobbies/index.rdf.
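As a rough illustration of these lookup rules, here is a minimal Python sketch. The function name candidate_rdf_urls is hypothetical, and dropping any query string or fragment is my own simplifying assumption; the rules above do not say what to do with them.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_rdf_urls(page_url):
    """Return the RDF URLs to try for a page URL, in order of preference.

    Hypothetical helper: it applies only the two naming rules sketched
    above and ignores any query string or fragment (an assumption).
    """
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/")
    if not path:
        # Site root: look for index.rdf at the top level.
        paths = ["/index.rdf"]
    else:
        # Any other page: try <path>.rdf, then <path>/index.rdf.
        paths = [path + ".rdf", path + "/index.rdf"]
    return [urlunsplit((parts.scheme, parts.netloc, p, "", "")) for p in paths]
```

A client would request each candidate in turn and use the first one that exists.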

Although CMS support (e.g., Drupal) for RDFa and helper libraries like the RDFa Rails plugin might make it fairly easy for some web sites to provide RDFa, I think that we need something simpler that might be adopted by more web sites.

I am writing an open source tool (that will be an example program in the Semantic Web book I am writing) that will generate RDF data from web pages. I'll post a link when the code is ready.

6 comments:

dmitry_vk said...

Wouldn't using link rel="alternate" be more consistent with HTML?

Mark Watson, author and consultant said...

Hello Dmitry, I think that rel= is limited. Are you suggesting a link tag with rel="alternate" that specifies the URI of an RDF file?
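To make the question concrete, here is a minimal sketch of what a client might do if a page advertised its RDF file through a link element with rel="alternate". The names RdfLinkFinder and find_rdf_links are hypothetical, and checking for type="application/rdf+xml" is an assumption about how a site would label the file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower()
        # Assumption: the site labels the file with the RDF/XML MIME type.
        if "alternate" in rels and mime == "application/rdf+xml" and a.get("href"):
            # Resolve relative hrefs against the URL of the page itself.
            self.rdf_urls.append(urljoin(self.base_url, a["href"]))


def find_rdf_links(html, base_url):
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_urls
```

Calling find_rdf_links with the page's HTML and its URL returns a list of absolute RDF URLs, which is empty when the page advertises none.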

Also, I am not against embedding RDFa if that is what you want to do. I would just like to see another (even ad hoc) standard for associating RDF files with specific web pages.

Paul said...

As somebody trying to build with web services - many disparate web services, mind you - it is *desperately* frustrating that in the year 2010 we are still at step 1: discussing how to get at the parsable/semantic representation of a given html document!

Users know how to copy/paste URLs. Is it so much to ask that they could paste in a URL they found with their web browser, and then my app can parse out a <link rel="alternate" or do content negotiation to get at an RDF/XML/whatever representation that's parseable?

Nearly every site I'm trying to work with has a half-decent web UI that users are already familiar with. I do *not* want to re-invent a brand new search UI within my application for each one of these data sources. I just want them to copy-paste the URL they found with their web browser!

I also do *not* want to have to accommodate all these hacks that each site seems to provide to get at the RDF/XML: "oh, you just append .rdf!" or "oh, just do ?xsl-template=foobar" or "oh, just /slightly/different/URL"

"Standardising" on .rdf is, I am sorry, *not* a good idea - we need a solid best practice based on either content negotiation or a more robust variation of the <link rel hacks or *something* - but the URL naming convention is the wrong place to solve these problems.

So. Very. Depressing.

Mark Watson, author and consultant said...

Paul: I was actually promoting separate RDF files as an alternative to embedding RDFa. Using embedded RDFa is also fine if that is what web developers want to do.

Since I wrote the original blog article 4 months ago, I have done an RDFa project with a customer and my appreciation for RDFa has increased slightly.

Anyway, the important takeaway is this: for pages that are automatically generated from a database, it is easy enough to also generate either embedded RDFa or a separate RDF file automatically. With a consistent URL scheme, it is obvious from the URL of the HTML content what the RDF URL should be. For sites with manually created content, hand editing RDFa is a nuisance, and I would rather create a separate RDF file by hand using something like Protégé.

Paul said...

Mark, I feel I should apologise for the robust rant I left here - must've been having a bad day.

I stumbled across your post as yet-more-vague-advice that had no actual useful information for me to take home - just like everyone else writing about the semantic web; it just keeps getting more cloud-like ;-)

I'm not a huge fan of RDFa, but using it to point at an out-of-band RDF resource actually seems like a good idea - except you provided no example markup for which to advocate this!

Looking forward to a post from you that might clarify how you're doing it: robust examples would make this cynic particularly happy.

Paul said...

Maybe I should explain a tiny little bit: I have many data providers exposing their stuff in only a handful of formats (a number that is workable, anyway: I should reasonably expect to be able to understand data from an unfamiliar provider).

Here's the problem: my app is not the starting point for discovering information - it's a collaboration area. People use Google, 3rd-party aggregators, other collaboration projects and direct queries with these data providers. My app is completely out of the loop, until they want to start using something they've found.

So they found something with their web browser, from a provider that my app has never seen before. That shouldn't matter; we use the same formats.

Why do I have this mess on my hands:

If I'm lucky, after being given a URL, I can use content negotiation to get a redirect to what I'm looking for.

Failing that, let's parse out the (X)HTML response, maybe <link rel="... nobody I'm working with so far seems to be doing that.

What about chucking a .rdf on the end of the URL - ok, at least one site is doing this.

Crazy! Can I beg everyone to use RDFa to point at a dedicated resource like you say? If so, what are the specifics? owl:sameAs? dc:source? Did you do something like this, and if so, what did you decide on? Are you talking to anyone that has to parse your stuff?
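For illustration only, the fallback order described in this comment (content negotiation, then a link rel="alternate" tag, then appending .rdf) could be sketched in Python as below. Every name here (http_fetch, discover_rdf, the injectable fetch parameter) is hypothetical, the application/rdf+xml type is an assumption about what providers serve, and appending .rdf naively ignores query strings.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

RDF_TYPE = "application/rdf+xml"  # assumed MIME type for the RDF representation


def http_fetch(url, accept):
    """Fetch url with an Accept header; return (content_type, body, final_url) or None."""
    try:
        with urlopen(Request(url, headers={"Accept": accept}), timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # geturl() reflects any redirect that content negotiation caused.
            return resp.headers.get_content_type(), body, resp.geturl()
    except OSError:  # URLError and HTTPError are both subclasses of OSError
        return None


class _AlternateLinks(HTMLParser):
    """Collect hrefs of <link rel="alternate" type="application/rdf+xml"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if (tag == "link" and "alternate" in rels
                and (a.get("type") or "").lower() == RDF_TYPE and a.get("href")):
            self.hrefs.append(a["href"])


def discover_rdf(page_url, fetch=http_fetch):
    """Return (rdf_url, how_it_was_found), or None if every step fails."""
    # Step 1: content negotiation on the URL the user pasted.
    first = fetch(page_url, RDF_TYPE + ", text/html;q=0.5")
    if first is not None:
        content_type, body, final_url = first
        if content_type == RDF_TYPE:
            return final_url, "content negotiation"
        # Step 2: look for an advertised alternate link in the HTML.
        parser = _AlternateLinks()
        parser.feed(body)
        if parser.hrefs:
            return urljoin(final_url, parser.hrefs[0]), "link rel=alternate"
    # Step 3: last resort, append .rdf to the pasted URL (naive: ignores query strings).
    guess = page_url.rstrip("/") + ".rdf"
    second = fetch(guess, RDF_TYPE)
    if second is not None and second[0] == RDF_TYPE:
        return second[2], "appended .rdf"
    return None
```

The fetch argument is injectable so the lookup order can be exercised without touching the network. A real client would call discover_rdf with just the pasted URL.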