Tuesday, June 11, 2013

UUID for Legal Text

There has been a lot of interest in, and I have gotten great feedback on, the post about the book on legislative data that I'm writing with Grant.

Data standards are always a hot topic (relatively hot: we normalize against interest in this field in general, not against interest in the Kardashians).

One data-standards question that has sparked interest is how to assign unique identifiers to legal text. These are needed for many reasons, in a variety of contexts. The most straightforward is to be able to hyperlink to a specific subsection of a bill or law.

Some options for creating the unique identifier include:

  • A unique randomish code (e.g. based on the current datetime)
  • A hash of the text of the section
  • A URN or URL identifier based on a standard, human-readable path to the section (e.g. us/uscode/title26/section100)
  • Some combination of the above

Version control is a very important consideration: Section 100 of title 26 may be amended, and the identifier should tell us which version we're citing. Some very technically savvy minds at the Law Revision Counsel of the U.S. House of Representatives have suggested a combined approach, with one identifier for the Code section and one that specifies the version (e.g. the version as amended by P.L. 114-XYZ).
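
To make the options concrete, here is a minimal Python sketch of each one. The function names and the "@" separator between path and version are my own assumptions for illustration, not part of any proposal from the Law Revision Counsel, and the version label simply reuses the placeholder from the example above.

```python
import hashlib
import uuid


def random_id() -> str:
    """Option 1: an opaque code. uuid1() is derived from the current time and host."""
    return str(uuid.uuid1())


def content_hash(text: str) -> str:
    """Option 2: a SHA-256 digest of the section text (changes whenever the text does)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def path_id(*parts: str) -> str:
    """Option 3: a human-readable path such as us/uscode/title26/section100."""
    return "/".join(parts)


def versioned_id(path: str, version: str) -> str:
    """Combination: section path plus a version label.

    The '@' separator is a hypothetical convention chosen for this sketch.
    """
    return f"{path}@{version}"


section = path_id("us", "uscode", "title26", "section100")
print(section)
print(versioned_id(section, "PL-114-XYZ"))
```

Note the trade-off: the path stays stable across amendments but says nothing about the text, while the hash pins the exact text but is unreadable and breaks on every amendment.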

Another question is whether the id should itself carry information about the text. In the case of a hash, we could use a similarity-preserving hash, e.g. simhash, so that related texts would produce hashes that are close to each other. This might have advantages, for example, in citing court documents. Text in one court opinion that is similar to text in another may provide useful precedent; a search algorithm could collect similar text sections based on these simhashes.
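
For readers who haven't met simhash, here is a simplified sketch of the idea in Python. It assumes unweighted word tokens and a 64-bit fingerprint taken from MD5; a real deployment would more likely use weighted features or word shingles, so treat this as an illustration rather than a reference implementation.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an assumption for this sketch


def simhash(text: str) -> int:
    """Build a 64-bit simhash fingerprint from the words in `text`."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:  # the majority vote decides each bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar fingerprints."""
    return bin(a ^ b).count("1")


a = simhash("The court held that the statute applies retroactively.")
b = simhash("The court held that the statute applies prospectively.")
print(hamming(a, b))
```

Texts that share most of their words tend to agree on most bit votes, so their fingerprints tend to differ in few bits. That is a tendency, not a guarantee, which is one reason a simhash alone would not serve as a unique identifier.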

Rather than get ahead of myself and draft out the entire chapter on unique identifiers, I'll stop here and invite your comments.
  • What is important to preserve in a unique identifier for legal texts?
  • What id schemes have proven successful in other document-based structures?
  • What would Google (or Linus Torvalds) do?

If you have insights, or connections to People With Insights, please comment here or let me or Grant know.


  1. I used a structure-based identifier (similar to an XPath using numbers and mnemonics for each tag) for Tasmanian legislation way back in 1995. It was coded so that it could be a legal SGML/XML/HTML ID attribute, but other approaches (e.g. a strict subset of XPath or a URL-like encoding) may prove more useful in particular circumstances. I argued (successfully) for a similar approach in Canadian legislation and also the US Code but, in the latter case, in addition to UUIDs. In Singapore we also use both. The US Code and Singapore require both because UUIDs allow provisions to be moved from Bills to Acts to the Code (or, in Singapore, Revised Editions) and then subsequently renumbered without their provenance being lost, whereas structure-based IDs can be more transitory. However, structure-based IDs are still required because you need to be able to resolve a citation to an ID, and nearly all legislative citation schemes are based on a combination of structure and number "attributes" (whether stored as content or attributes in XML).
    The advantage of using both schemes is that it is relatively easy to automatically generate comparative tables (tables which match the provision number in one version of the Code to another or in one Revised Edition to another) and, for non-positive law, to manage references and automatic generation of footnotes in the code to the sources of the provisions.
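
As a rough illustration of why carrying both kinds of ID makes comparative tables easy, here is a Python sketch. It assumes each version of the Code is available as a mapping from structural ID to UUID; the function name and the sample IDs are hypothetical, not taken from any of the systems described in the comment above.

```python
from __future__ import annotations


def comparative_table(
    old: dict[str, str], new: dict[str, str]
) -> list[tuple[str | None, str | None]]:
    """Match provisions across two versions by their persistent UUID.

    `old` and `new` map structural ID -> UUID. Each row is
    (old structural ID, new structural ID); None marks a provision
    that is absent from that version.
    """
    new_by_uuid = {uid: struct for struct, uid in new.items()}
    old_uuids = set(old.values())
    # Provisions in the old version, with their new number if they survive.
    rows = [(struct, new_by_uuid.get(uid)) for struct, uid in old.items()]
    # Provisions that exist only in the new version.
    rows += [(None, struct) for struct, uid in new.items() if uid not in old_uuids]
    return rows


# Hypothetical data: "s6" was renumbered to "s7", "s9" was dropped, "s8" is new.
old_version = {"s5": "uuid-a", "s6": "uuid-b", "s9": "uuid-d"}
new_version = {"s5": "uuid-a", "s7": "uuid-b", "s8": "uuid-c"}
for row in comparative_table(old_version, new_version):
    print(row)
```

The structural IDs change with renumbering, so on their own they could not tell a renumbered provision from a repealed one plus a new one; the UUID carries that provenance.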

  2. Thanks Tim! It's an honor to have you commenting on this post, and your experiences in each of these cases are extremely relevant to the work we're doing. In retrospect, were these the right choices for each of these jurisdictions? Rob Richards mentioned URN/Lex, but it seems to solve only a subset of the issues you mention. Thoughts?

    There are clearly a number of different goals for the UIDs and it may be best to separate them, as you suggest, into more than one attribute. It seems that at a minimum, we need (a) structure-type attributes and (b) version or datestamp-type attributes. I would also like self-validating id attributes, like a hash, so that ids can be used for arbitrary stretches of text (e.g. one sentence in a judicial opinion, or one phrase in a bill subsection). I will expand on this in my next post and would love to get your thoughts or references to where this type of scheme has been implemented.
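
To show what I mean by a self-validating id for an arbitrary stretch of text, here is a small Python sketch. The whitespace normalization and the 16-character truncation are assumptions I made for the example; any real scheme would need to agree on both.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not change the id."""
    return " ".join(text.split())


def span_id(text: str, length: int = 16) -> str:
    """Id for a span of text: a truncated SHA-256 of the normalized text."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    return digest[:length]


def verify(text: str, claimed_id: str) -> bool:
    """Check that `text` really is the span that `claimed_id` refers to."""
    return span_id(text, len(claimed_id)) == claimed_id


sentence = "No person shall be deprived of\n  property without due process."
sid = span_id(sentence)
print(sid, verify("No person shall be deprived of property without due process.", sid))
```

Anyone holding the cited text can recompute the id and confirm it matches, with no registry lookup. The cost is that such an id identifies text, not a location, so it complements rather than replaces the structure and version attributes.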

    1. The ID schemes we chose in all these jurisdictions solved the problems we were trying to solve. Tasmania deliberately chooses not to renumber and doesn't have a Code and hasn't Revised since the so UUID/GUID solutions are simply not required (although they could possibly be relevant to repeal-and-replace scenarios). Likewise, Canada has not done a Revision (preferring to actively maintain informal consolidations), although they do renumber a bit more, so the UUID solution might be useful in addition to the structural IDs. Canada is complicated a little by bilingual requirements, so IDs in English and French versions must match where the structures match.
      We are pretty happy with the solutions in Singapore, which is the first jurisdiction where we have managed transaction time and valid time and used both structural and UUID schemes side by side, and the first to automate the creation of comparative tables.
      We apply in-force (valid) version timestamp information at the document level (@validStart and @validEnd); our technology manages this at the fragment level automatically using that information. Transaction version timestamp information (transactionStart and transactionEnd) is managed outside the XML as record metadata (although it is sometimes based on attributes in the documents or related amending documents).
      I believe the correct solution is to use a UUID to identify the entire version series (to adopt DMA language) either at the document or fragment level and use additional temporal attributes (you need valid and transaction to uniquely identify a version because of retrospective amendment and prospective access) to pick a particular version.
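
A rough Python sketch of that selection rule follows: one UUID names the version series, and a valid-time point plus a transaction-time point pick out a single version. The field names mirror the validStart/validEnd and transactionStart/transactionEnd attributes mentioned above, but the class, the half-open intervals, and the sample dates are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Version:
    series_id: str               # UUID shared by every version in the series
    valid_start: date            # in force from (inclusive)
    valid_end: Optional[date]    # in force until (exclusive); None = open-ended
    tx_start: date               # recorded in the system from (inclusive)
    tx_end: Optional[date]       # superseded in the system at (exclusive)
    text: str


def _covers(start: date, end: Optional[date], point: date) -> bool:
    return start <= point and (end is None or point < end)


def pick_version(versions, series_id: str, valid_at: date, known_at: date):
    """Return the version in force on `valid_at`, as recorded on `known_at`."""
    matches = [
        v for v in versions
        if v.series_id == series_id
        and _covers(v.valid_start, v.valid_end, valid_at)
        and _covers(v.tx_start, v.tx_end, known_at)
    ]
    if len(matches) > 1:
        raise ValueError("overlapping versions in series " + series_id)
    return matches[0] if matches else None


# Hypothetical retrospective amendment: recorded on 2013-06-01, but changing
# the text with effect from 2012-01-01.
history = [
    Version("uuid-a", date(2010, 1, 1), None, date(2010, 1, 1), date(2013, 6, 1), "A"),
    Version("uuid-a", date(2010, 1, 1), date(2012, 1, 1), date(2013, 6, 1), None, "A"),
    Version("uuid-a", date(2012, 1, 1), None, date(2013, 6, 1), None, "B"),
]
print(pick_version(history, "uuid-a", date(2012, 6, 1), date(2013, 1, 1)).text)
print(pick_version(history, "uuid-a", date(2012, 6, 1), date(2013, 7, 1)).text)
```

Both queries ask what was in force on the same day, yet they return different text depending on when the question is asked, which is why valid time alone cannot uniquely identify a version.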

  3. Two things worth a look re UUIDs for legal text: