July 04, 2006

Where RDF falls short: URIs, local files, and the structure of everything

Although I mock its complexity on a daily basis, RDF is a great thing. It isn't exactly simple, at least as it's normally serialized in XML, but it satisfies the Web's need for a standardized language for expressing metadata. It makes possible not only the expression of a vast quantity of data about a document, but also the formalization of complex relationships among documents and the metadata that describes them.

At least, as long as those documents are accessible via HTTP.

RDF and URIs

As the RDF specification is currently laid out, everything, from the item being described to the tags and types describing it, is a URI. This approach simultaneously streamlines the disambiguation of item properties and complicates the enumeration of metadata elements. This isn't too much of a price to pay. It's only a slight step beyond the complexity of simple XML namespacing, and its benefits justify its conceptual overhead.

URIs don't cut it, however, when resources reside not on the World Wide Web but in a private data store. There are many situations, from the expression of metadata regarding a resource intended for future web publication to the formalization of relationships between elements in the data of a local application, where RDF's well-designed standard structure would be a welcome improvement over the variety of proprietary, vendor-specific mechanisms in use today. Unfortunately, there is no way to harness this power without going beyond the standards, and thus the standardized format loses most of its benefits.

URIs, URLs, and URNs

As clarified by the W3C, there are many types of URIs, or Uniform Resource Identifiers. There are the http:, ftp:, and mailto: URI schemes that virtually every user has, at some point, encountered. Since these schemes identify items by the information needed to retrieve them from their location (in this case, on the Internet), rather than some other attribute, one may also properly refer to addresses conforming to them as URLs, or Uniform Resource Locators. However, there also exist non-URL URIs. Many of these are URNs, or Uniform Resource Names. The IETF has designed the urn: URI scheme specifically to handle identifiers that represent specific resources, but do not specify their means of access. The URN specification provides for namespaces, making URNs useful in many different applications without risk of namespace collisions, and ensuring that a given URN references one and only one item. Examples of valid URNs include urn:isbn:067972110X and urn:oasis:PartyId:Type:ISO9735:8.
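To make the distinction concrete, here is a minimal Python sketch that sorts a URI into "URL", "URN", or "other URI" by looking at nothing but its scheme, and splits a URN into its namespace and namespace-specific part. The function names and the short list of locator schemes are my own assumptions for illustration, not anything defined by the W3C or the IETF.

```python
from urllib.parse import urlsplit

# Schemes treated as locators here. This short list is an assumption
# for illustration, taken from the schemes mentioned in the text.
LOCATOR_SCHEMES = {"http", "ftp", "mailto"}

def classify(uri):
    """Return 'URL', 'URN', or 'other URI' based only on the scheme."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme in LOCATOR_SCHEMES:
        return "URL"
    if scheme == "urn":
        return "URN"
    return "other URI"

def split_urn(urn):
    """Split 'urn:<namespace>:<specific part>' into (namespace, specific part)."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn":
        raise ValueError("not a URN: " + urn)
    return parts[1], parts[2]

print(classify("http://example.org/"))
print(classify("urn:isbn:067972110X"))
print(split_urn("urn:oasis:PartyId:Type:ISO9735:8"))
```

Note that only the first two colons matter when splitting: everything after the namespace belongs to the namespace's own syntax, which is why the OASIS example keeps its inner colons intact.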

Unfortunately, getting a URN namespace is difficult. Each new namespace requires a separate RFC. Considering these documents are typically many pages long, and must describe both the purpose of a namespace and all its possible uses, it is simply too much work for many organizations to get a formal namespace. A quick glance at the IANA's official list of URN namespaces shows that, as of this date, there are only 27, many of which belong to Internet governing bodies. There are also informal namespaces, which benefit from a streamlined application process. Because of IETF restrictions, however, they lose many of the benefits of a namespace in the first place. All informal namespaces are simply "urn-" followed by a sequential number. They're not easily remembered, not easily understood, and not frequently used. In fact, despite the simpler application process, there are only six of them.
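The naming rule for informal namespaces is simple enough to check mechanically. The sketch below assumes the pattern described above, "urn-" followed by a number, and nothing more; the function name is hypothetical, and the check says nothing about whether a given number has actually been assigned.

```python
import re

# "urn-" followed by one or more digits, per the description above.
INFORMAL_NAMESPACE = re.compile(r"urn-\d+", re.IGNORECASE)

def is_informal_namespace(namespace):
    """True if the namespace identifier has the informal 'urn-<number>' shape."""
    return INFORMAL_NAMESPACE.fullmatch(namespace) is not None

print(is_informal_namespace("urn-1"))
print(is_informal_namespace("isbn"))
```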

URIs and Local Files

When distributing a website on disk, the traditional approach has been to use relative path names for everything. By using relative path names, the site author can ensure that everything will still reference what it's intended to reference, even if the disk is renamed, or the files are copied.

Unfortunately, this approach does not translate into RDF. RDF requires a URI, and a URI is, by nature, absolute. While some RDF parsers will automatically assume any URI missing a scheme references a relative path, the RDF specification does not require this behavior. The only valid way to refer to a file is using a file: URI, which changes if a file is moved, or copied from one filesystem to another.
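A short Python sketch shows how fragile the absolute form is. It resolves the same relative reference against two different base locations using the standard library's urljoin; the disk and directory names are made up for the example.

```python
from urllib.parse import urljoin

def resolve(base_uri, reference):
    """Resolve a relative reference against the URI of the referring document."""
    return urljoin(base_uri, reference)

# Hypothetical locations of the same document before and after copying.
on_disk = resolve("file:///Volumes/SiteDisk/index.rdf", "images/logo.png")
copied = resolve("file:///Users/simon/copy/index.rdf", "images/logo.png")

print(on_disk)
print(copied)
```

The relative reference never changes, but each copy of the document yields a different absolute file: URI, so any metadata keyed on the absolute form stops matching once the files move.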

One could even make a case that addresses conforming to the file: URI scheme are not, in fact, URLs. They do reference locations in the filesystem, but the access path is less direct than on the web. A web browser sends a request containing the information in the URL and receives a web page back; to access a file: URI, the accessing agent (or the operating system upon which it is running) must first look up the appropriate inode, then read from this inode. Thus, a true URL would refer to the inode, and not the file path. Such an approach would provide a URI that would not change even if the file were moved; it is the basis for aliases on the Mac OS (and the chief difference between Mac aliases and POSIX symlinks). However, any implementation of this URI scheme would rely on the implicit (and probably false) assumption that all filesystems use an ID number of some kind to reference files, and would still fail to solve the problem of files copied from filesystem to filesystem.
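The path-versus-inode difference is easy to observe. The following Python sketch, which assumes a POSIX-style filesystem and uses made-up file names, renames a scratch file and compares its file: URI and inode number before and after.

```python
import os
import tempfile
from pathlib import Path

def identify(path):
    """Return (file: URI, inode number) for an existing file."""
    resolved = Path(path).resolve()
    return resolved.as_uri(), os.stat(resolved).st_ino

def rename_demo():
    """Rename a scratch file and report its identifiers before and after."""
    with tempfile.TemporaryDirectory() as scratch:
        old_path = Path(scratch) / "notes.rdf"    # hypothetical file name
        old_path.write_text("placeholder")
        before = identify(old_path)
        new_path = Path(scratch) / "renamed.rdf"  # same directory, same filesystem
        os.rename(old_path, new_path)
        after = identify(new_path)
    return before, after

(uri_before, inode_before), (uri_after, inode_after) = rename_demo()
print("URI changed:", uri_before != uri_after)
print("inode kept:", inode_before == inode_after)
```

The rename stays within one filesystem on purpose. Copying the file to another volume would create a new inode, which is exactly the case an inode-based identifier cannot handle.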

Impact

As a result of these limitations, very few organizations have picked RDF as a component of their local data interchange formats. Among the few that have, implementations violate one or more RFCs, since, in order to compensate for the lack of a URI to describe a given resource, developers must choose their own, unregistered and thus invalid, URI scheme.

I present as an example the Mozilla Foundation. Mozilla is, in most ways, very respectful of standards. It is the single best browser in terms of supporting all of the various W3C and IETF recommendations out there today. As part of the Foundation's mission of standards compliance, they chose to use RDF as the format for Mozilla's data files. Unfortunately, in order to use RDF, it was necessary to completely ignore the IETF's namespacing rules. Browsing through the Mozilla RDF data files, I see that the Mozilla developers have seen fit to design a host of new URN namespaces, including urn:search, urn:mimetype, and urn:mozilla. One of my Mozilla extensions has even created its own URN, in order to use Mozilla's RDF data source apparatus. The Mozilla Foundation's flouting of URN namespace registration procedures seems to indicate that these procedures are simply too complex.

Future Outlook

The W3C's position on this problem seems unclear. While the W3C understands the power of RDF, its current position, as stated in the RDF primer, suggests that it sees RDF as a tool to describe web-accessible resources and nothing else. Since the IETF/IANA does not appear to be devising any additional URI schemes, it seems those trying to extend the scope of RDF beyond the semantic web are out of luck.

While the assumption that the RDF parser will use the local document path as the in-scope base URI solves many of the problems regarding relative path names, the RDF specification does not mandate this behavior. Until the IANA relaxes its URN namespace registration procedures, it will remain impossible to access internal application resources by a valid URI in RDF. All in all, the future of RDF outside of the semantic web doesn't look so bright.

Posted by Simon at 05:44 PM