Web Identity Crisis: Topic Maps and RDF
I have been working my way through the "identity crisis" on the web, trying to get my head around how I can get a system to work with both RDF and Topic Maps. In a nutshell, I'm interested in creating XTM and RDF feeds out of a CMS which will have some knowledge of subject indicators. I'm expecting users to use a lookup service (Sindice or similar) or manually enter URLs to equate subjects in the system (tags, categories, pages, posts) with public subjects outside it (Wikipedia, PSI repositories, linked data). What will be the identity of these URIs though? I don't want to dwell on the pros and cons of the various solutions; I'm mainly interested in how I can pick URIs which will work in both systems. Maybe you can help.
You are probably all familiar with the basic problem: a URI can represent an information resource or a concept represented by that resource. This "identity crisis" is well known and has been discussed by Steve Pepper. Topic Maps have a good solution in the distinction between "Subject Indicators" and "Subject Locators". The W3C recommends a system of content negotiation with 303 redirects from the subject (the non-information resource) to a representation (an information resource). So both camps have a solution, but it appears difficult to get them to work together in the real world.
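The W3C recipe boils down to one rule (the httpRange-14 resolution): a 2xx answer means the URI names an information resource, while a 303 means it may name something else, with the redirect target pointing at a description of it. A minimal sketch of that rule (the function name and return strings are mine, not from any spec):

```python
def classify(status_code, location=None):
    """Classify what a URI can denote under the httpRange-14 rule."""
    if 200 <= status_code < 300:
        # a direct 2xx pins the URI down as an information resource
        return "information resource"
    if status_code == 303 and location:
        # a 303 leaves room for the URI to name a concept
        return "non-information resource, described by " + location
    return "undetermined"

# http://dbpedia.org/resource/Berlin answers with a 303 to a description:
print(classify(303, "http://dbpedia.org/page/Berlin"))
# http://en.wikipedia.org/wiki/Berlin answers 200 directly, so in RDF
# terms it can only ever be the page, never the city:
print(classify(200))
```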
One of the RDF examples in How to Publish Linked Data on the Web concerns the city of Berlin, which has three URIs:
1. http://dbpedia.org/resource/Berlin - the non-information resource identifying the subject itself
2. http://dbpedia.org/page/Berlin - the related information resource in HTML
3. http://dbpedia.org/data/Berlin - the related information resource in RDF
If you take a look at the RDF representation, you see a triple linking the resource to the original wikipedia.org page via foaf:page:
http://dbpedia.org/resource/Berlin - foaf:page - http://en.wikipedia.org/wiki/Berlin
The range of foaf:page is foaf:Document, which "represents those things which are, broadly conceived, 'documents'", i.e. an information resource rather than a concept. BTW, I also note that psi.ontopedia.net does a similar thing and uses a "Web Page" occurrence to link to Wikipedia rather than asserting any subject equivalence.
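The inference that types Wikipedia URLs as documents is just the rdfs:range entailment applied to foaf:page. A toy version of that entailment, with triples as plain tuples (no RDF library assumed):

```python
FOAF_PAGE = "http://xmlns.com/foaf/0.1/page"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"

def entail_range(triples):
    """Toy rdfs:range entailment: anything in the object position of a
    foaf:page triple is inferred to be a foaf:Document."""
    inferred = {}
    for s, p, o in triples:
        if p == FOAF_PAGE:
            inferred[o] = FOAF_DOCUMENT
    return inferred

triples = [("http://dbpedia.org/resource/Berlin",
            FOAF_PAGE,
            "http://en.wikipedia.org/wiki/Berlin")]
# the Wikipedia URL comes out typed as a document, not as the city:
print(entail_range(triples))
```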
What does this mean? It means that Wikipedia pages may well form decent PSIs for Topic Maps, but they are no good for RDF because the URI identifies an information resource. There has been some talk in the TM community about the suitability of Wikipedia URIs as PSIs, and that is fair enough. However, the point of this post is to ask how we can get them to work with RDF given the above limitation; i.e. in an RDF system, Wikipedia URLs will not represent the concept.
Maximise Interoperability or Ease of Use?
Which Subject Identity should I use in order to maximise interoperability and ease of use?
- http://en.wikipedia.org/wiki/Berlin which is fine as a PSI (human readable) but considered a document in the RDF world, or
- http://dbpedia.org/resource/Berlin which is less well known (its HTML representation is a bit technical for lay people) but describes the concept for RDF?
Personally, I would like to use the former since it provides a human-readable representation of the subject, i.e. it is a good URI for a PSI and is much better for the lay person than http://dbpedia.org/page/Berlin. However, it seems I must go with http://dbpedia.org/resource/Berlin to play nicely with LOD and RDF. Alexandre Passant has a little bit of SPARQL to show where these inconsistencies exist. I don't want to add to them :) Topic Maps may have solved the identity problem, but we still need to interoperate with RDF data sources. This may mean picking URIs which are expected in RDF recipes. Any thoughts on this?
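I haven't reproduced Passant's query here, but the kind of inconsistency it hunts for can be sketched as a SPARQL ASK. The query below is my own guess at the shape (a URI used both as a document and as a place), embedded as a string:

```python
# Hypothetical query, not Passant's own: is any URI typed both as a
# foaf:Document (a page) and as a dbo:Place (a concept)?
INCONSISTENCY_CHECK = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
ASK WHERE {
  ?uri a foaf:Document .
  ?uri a dbo:Place .
}
"""

print(INCONSISTENCY_CHECK)
```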
Will PSIs work in an RDF world?
The corollary of the above is that I am forced to turn my back on PSIs, since they can only represent information resources. Is there any way to make PSIs work in an RDF world? At the moment RDF people cannot (AFAICT) use PSIs for concepts because a PSI represents the information resource rather than the concept. Mind you, I haven't seen any interest from RDF people in reusing PSIs from the TM world - I'm considering it though :).
Steve Pepper's intro to PSIs says that
Human-readable as well as machine-processable metadata can be included in the Subject Indicator itself (e.g., as RDF metadata) or in a separate information resource referenced from the Subject Indicator (e.g., as XTM metadata).
I note that Ontopedia does exactly this. This metadata is helpful for RDF people, but it doesn't follow their recipes closely enough to be useful to machines. The outcome is that PSIs developed by the TM community will probably not be used by the RDF world.
One possible workaround is to use a service such as thing-described-by.org, which will do the content negotiation for you; i.e. the service URI represents the subject, and the HTML resource stays as is. RDF folks could reuse a PSI for the Berlin example above in this way, but it doesn't feel very natural.
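A sketch of what that reuse might look like. The "?<url>" pattern is how I understand the thing-described-by.org service to work (the target URL is appended after the question mark), and the triple is purely illustrative:

```python
def as_concept_uri(url):
    """Wrap a URL in a thing-described-by.org URI, which 303-redirects
    back to the URL and so can stand for the concept in RDF."""
    return "http://thing-described-by.org/?" + url

psi = "http://psi.ontopedia.net/Berlin"
concept = as_concept_uri(psi)

# an illustrative triple: the concept is the primary topic of the PSI page
triple = (concept, "http://xmlns.com/foaf/0.1/isPrimaryTopicOf", psi)
print(concept)
```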
This could also be worked around by adding content negotiation to PSI servers, so that a GET on the subject URI redirects to an HTML version. The subject URI stays the same but the human-readable content lives elsewhere:
- http://psi.ontopedia.net/Berlin - the subject
- http://data.ontopedia.net/Berlin - the RDF info resource
- http://page.ontopedia.net/Berlin - the human-readable info resource
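A minimal sketch of what such a PSI server would do on a GET, using the subdomains from the list above (the dispatch logic and media types are my assumptions):

```python
def psi_server(topic, accept):
    """Return (status, location) for GET http://psi.ontopedia.net/<topic>.
    The subject URI always 303s; the Accept header picks the target."""
    if "application/rdf+xml" in accept or "text/turtle" in accept:
        return 303, "http://data.ontopedia.net/" + topic
    return 303, "http://page.ontopedia.net/" + topic

print(psi_server("Berlin", "text/html"))
# → (303, 'http://page.ontopedia.net/Berlin')
```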
This proposal works for a PSI server because the people who run them are generally technically savvy and can set up the content negotiation. It does raise the bar for ordinary people who want to create their own PSI datasets, though. Worse, it breaks all other uses of Subject Identifiers in TMs, leaving Topic Map authors limited in how they can use them. There is no reason why TMs should have to bow to the limitations of RDF.
As a developer who is interested in making these things work, I can only shake my head at how complicated it has all become. Take a look at the LOD mailing list and the massive ".htaccess a major bottleneck to Semantic Web adoption" thread discussing the pitfalls of setting up content negotiation. Nuts. How are everyday people going to get this stuff to work if LOD boffins can't? Recipes may be OK for those creating massive sets of Linked Data subjects, but for people wanting to talk about stuff outside these datasets things become more tricky. TMs have more of a grassroots feel to them in this case. The emerging consensus on the LOD mailing list seems to be that RDFa is a way forward to overcome these problems, though content negotiation still plays a role for those who can do it. Looking into the RDFa solution a bit more closely, the outcome doesn't appear that positive, because the content author must resort to creating blank nodes which are the primary topic of the Wikipedia info resource.
Question for Subj3ct developers
Subj3ct.com is a web service where content publishers can register and equate subjects, as well as add related representations for those subjects. It potentially allows for massive aggregation around URIs in the future. I wonder if this identity crisis will be a worry for the service. To the designers of subj3ct.com: do you see a shadow world where these two views collide? E.g. http://dbpedia.org/resource/Berlin and http://en.wikipedia.org/wiki/Berlin may both be registered as concepts. You will not be able to auto-equate the two for fear of breaking the consistency of RDF models. How do you get around this? Maybe you could:
- have a 'thing-described-by' flag set to true in feeds.
- GET the resource headers yourself to see if it 303s.
- be aware of 'thing-described-by' services and set the flag internally
- ignore it.
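The "GET the resource headers yourself" option above is easy enough to sketch in Python: suppressing urllib's redirect handling lets the raw 303 show through. (A sketch only; I haven't tested this against what Subj3ct would actually need.)

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so the original status stays visible."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def redirect_status(url):
    """HEAD a URL and return its raw HTTP status; 303 hints at a concept URI."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # with redirects suppressed, urllib raises on the 3xx itself
        return err.code

# e.g. redirect_status("http://dbpedia.org/resource/Berlin") should be 303,
# while a plain Wikipedia article URL should come back 200
```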
The problems with RDF and TMs are not just in the mapping. As Lars M says, identity is a "thorny" issue. I'm getting the feeling that PSIs may really only be helpful in the TM world. Surely not? OTOH, RDF non-information resources do work well in TMs, but they have an overly technical HTML representation for lay people, and they have to be conjured up with mysterious recipes and redirecting services.
How am I going to proceed? I think that I will do the following:
- search LOD and use DBpedia, etc. links
- maybe include a 'thing described by' flag set to true by default if the user is entering a URL manually (yuck). This can be used for PSIs.
I'm not 100% happy with this because my (thing-described-by) URIs are probably not going to match those used by others in services such as subj3ct.com.
Hopefully, I am not too far off with all this, but there is every chance that I am, as I'm just coming to grips with it myself (again). If you have any experience or guidance then please add your thoughts to the comments. I could sure use some help with this. I'd really be interested to hear from those with PSI servers or aggregators :)