Rhiza Blog

On open government data, Tim Berners-Lee is almost right

Tim Berners-Lee gave a great talk at the recent Gov 2.0 Expo in which he described the criteria for creating open and linked government data. At the beginning of his talk he described a star-based rating system for publishing data: putting it up in a machine-readable format, using open formats such as CSV, and so on. As with many things Tim does, he almost completely had me until he started describing what “linked data format” means in his mind. His notion of linked data is that the values of attributes in a data table would be URLs pointing to some web page somewhere that serves as the “definitive” source of data about that thing. There are several reasons why this is incredibly short-sighted and wrong:

  1. URLs link to a specific HTML page on a specific web server. They are only as permanent as the web server’s owner decides to keep that server running. We’ve all encountered “404 Not Found” errors when a web page is no longer where it used to be, and I certainly wouldn’t want vital government information that needs to persist for decades, if not centuries, to be reliant on the HTML link standard.
  2. Where’s the one definitive URL for all of the information about a city, country, or any other place for that matter? Do we really expect government agencies to solve this problem when so many have tried and failed before?
  3. URLs tend not to be good for multilingual content. Where on Wikipedia is the single definitive URL for Paris, France? If you are an English speaker, it is http://en.wikipedia.org/wiki/Paris, but if you are a French speaker it is http://fr.wikipedia.org/wiki/Paris. Those are different URLs that contain different information, and there are dozens of others about Paris on that one website alone. Which one would Tim link to?
  4. The entire system of URLs relies on the Web’s HTML and HTTP standards, which Tim invented, so it is understandable that he is partial to it, but those of us who care about open government data also want to ensure that the data is archived so that people a thousand years from now can easily use it. No offense to Tim and his amazing accomplishment of creating the Web as we know it, but there’s no way a URL is going to be valid in a thousand years.

So what is the right type of linked data? The answer has been in place for a long time and just needs to be used more consistently (much as CSV as a data format should be used more consistently): unique identifiers, or keys. The US government has been doing this for years. Political entities in the US all have FIPS codes, and named places in the United States have been assigned GNIS Feature IDs. The US EPA publishes the Facility Registry System, which uniquely identifies every EPA-regulated facility in the United States.

None of these identifier systems is perfect, but they do provide a common way to refer to unique entities within the context of a given agency’s data. Yes, this will require a lot of work, but it is without a doubt the easiest path forward that yields the best results. For extra credit, agencies should use Universally Unique Identifiers (UUIDs) that have absolutely no semantic content embedded in them, so that everyone, regardless of language or location, can share them.
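
To make that concrete, here is a quick sketch of minting such an identifier; the record fields are purely illustrative, not any agency’s actual schema:

    # Quick illustrative sketch: mint a random, semantics-free UUID for a record.
    # Field names here are made up for illustration, not an agency schema.
    import uuid

    facility = {
        "id": str(uuid.uuid4()),        # e.g. "9f1c3c2a-..." -- no language or location baked in
        "name": "Example Treatment Plant",
        "state_fips": "42",             # Pennsylvania
    }
    print(facility["id"])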

Once government agencies (and the private sector, too) start publishing their data with unique identifiers for common references, a real ecosystem of data can begin to emerge. The connectors, or links, in this ecosystem are the unique identifiers, not server- or location-dependent URLs.
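
As a rough sketch of that kind of linking (the numbers here are made up), two agencies’ tables can be joined on a shared county FIPS code rather than on any URL:

    # Rough sketch with made-up values: two datasets from different agencies,
    # linked purely by a shared county FIPS code.
    epa_air_quality = {"42003": {"pm25_annual": 10.2}}        # keyed by county FIPS
    census_population = {"42003": {"population": 1223000}}    # keyed by the same FIPS

    fips = "42003"  # Allegheny County, PA
    combined = {**epa_air_quality[fips], **census_population[fips]}
    print(fips, combined)  # the identifier, not a URL, is the link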

My company, Rhiza Labs, specializes in helping government agencies, non-profits and corporations future-proof and publish their data. We’ve found that once a few simple steps are followed, there’s a huge payoff in more data being used in decision making, planning and collaboration.

     by Josh Knauer | Analysis & Commentary, Blog

Conversation

Michael Higgins said:

The web community has been aware of this problem for a long time, and they’ve tried to solve it by introducing URIs, a more generic identifier scheme that isn’t necessarily tied to a server location the way URLs are. Unfortunately, as a practical matter URIs are rarely used, because we’ve got all this network infrastructure devoted to resolving URLs to data objects (web pages), and we don’t have any infrastructure that will take a URI and find you a data object.

(ObPlug: of course the UUIDs we use in our systems *can* be resolved and are *not* dependent on a single server location, but that’s part of why we’re cool ;-)

But back on the URL front: my recommendation would be for people to have their cake and eat it too. Use neutral identifiers (like FIPS codes or UUIDs) as the canonical identification scheme, and go ahead and embed those codes into URLs in a simple and obvious way. That way, given a URL, it’s easy to see what the “real” identifier is, and given an identifier, it’s easy to construct a URL and find the needed object. If a URL goes dark (which it will), you can migrate all your links to some new server by simply changing the way you construct your URLs.
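
Something like this, just to illustrate the pattern (the host name is hypothetical):

    # Illustrative only: embed a neutral identifier (a county FIPS code here)
    # into a URL in an obvious way, so either form can be recovered from the other.
    BASE = "https://data.example.gov/places/"   # hypothetical host; change it if the server goes dark

    def to_url(fips: str, base: str = BASE) -> str:
        return base + fips

    def to_identifier(url: str, base: str = BASE) -> str:
        return url.removeprefix(base)

    url = to_url("42003")
    print(url)                 # https://data.example.gov/places/42003
    print(to_identifier(url))  # 42003 -- the "real" identifier survives any rehosting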

The identifier is the FIPS code (or the UUID in our systems). The URL is a little extra syntax to help the network infrastructure do its job.

In the meantime, we and like-minded folks (and I would count the people pushing more pure URIs in that camp) should continue to advocate for mechanisms that can resolve “pure” identity schemes to data objects without the extra URL baggage. If you do use a system like ours, you get a lot of benefits: caching, redundancy in case of failure, and clarity of reference (e.g., you avoid the French vs. English Wikipedia problem).

Kingsley Idehen said:

Explaining Linked Data is a mercurial undertaking. Tim was trying to simplify the message (based on the audience he was dealing with) but in the process inadvertently generated confusion for people like you who understand “Identifiers”.

I wrote a blog post titled “Data 3.0 Manifesto”; you should find that it addresses your concerns.

Bottom line: in the world of Linked Data, UUIDs can co-exist with HTTP-scheme-based identifiers. There’s even a powerful semantic for this in the OWL realm, the Inverse Functional Property assertion, which is what you would apply to properties or attributes with UUID values.
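
Here’s a minimal sketch of that pattern with Python’s rdflib (the ex: names and the UUID value are placeholders, not a published vocabulary):

    # Minimal sketch: declare a UUID-valued property as owl:InverseFunctionalProperty,
    # so any two resources sharing the same uuid value denote the same individual.
    # The ex: names are placeholders, not a published vocabulary.
    from rdflib import Graph, Literal, Namespace, RDF, URIRef
    from rdflib.namespace import OWL

    EX = Namespace("http://example.org/ns#")
    g = Graph()

    g.add((EX.uuid, RDF.type, OWL.InverseFunctionalProperty))

    # An HTTP-scheme identifier that also carries a location-independent UUID.
    paris = URIRef("http://example.org/place/paris-france")
    g.add((paris, EX.uuid, Literal("6f1a2b3c-4d5e-4789-abcd-0123456789ef")))

    print(g.serialize(format="turtle"))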

Links:

1. http://bit.ly/bmdv5N – Data 3.0 Manifesto

Kingsley

Ellie K said:

I just watched the video of Tim B-L advocating government data standards, and your response that followed. While Tim is a dynamic speaker and justifiably has the respect of many, his premise that data should link to URLs, along with his justification for it, reduced to glibness.

I’ve worked as a Data Governance manager for two government managed-care programs, one federal, the other Medicare-funded at the state level. You are correct: uniform identifiers, such as those used by CMS for medical coding (HCPCS, ICD-9) or medical specialties (a well-documented taxonomy of 11-byte alphanumeric codes), are the way to go. UUIDs would be best. And the ICD-9 and specialty taxonomies are usable with OR without electronic health records.

URLs are not robust. Links often break within 10 years, let alone 1,000, and are far too vulnerable.

Tim B-L’s appeal for standardization, and his ability to hold the attention of an audience that is never excited about data policy (who is, except those who do it and those who suffer from the lack of it?), are certainly helpful. I hope that the spirit of his message is what gets acted upon, rather than the letter of it being used as a starting point for implementation.

David Karger said:

Tim is right, and so are you. Existing UUIDs for objects are great. But it’s also really useful to have a URL you can resolve to find out more about your ID. It’s easy to arrange for both by creating a simple transformation from UUID to URL; a perfect example of this is the DOI system (http://www.doi.org/) of UUIDs and corresponding URLs for scientific articles. Having the URL makes the DOI much more useful, but if URLs ever go out of fashion you still have the useful UUID embedded in the URL.

Josh Knauer said:

Thank you all for your feedback!

Kingsley, you might enjoy reading a few of the papers that we have posted here: http://www.rhiza.com/about/infocommons/magic/, as I think you’ll find thinking there that is very compatible with what you’ve written up in your Data 3.0 Manifesto. We’re always looking to hire like-minded folks interested in helping us build the future of networked data.

Thanks again to Mike, David, Ellie and Kingsley for sharing your thoughts on this topic.
