Thursday, June 27, 2013

UUID for legal texts: Part 451fe00e-c2fe-4c11-9f10-5f96395e2523

Creating a data-friendly reference for legal texts can be far from straightforward, as I pointed out in reference to the Supreme Court's decision this week to overturn DOMA section 3 (aka 1 USC 7).

As Tim Arnold-Moore pointed out in response to my last post on unique identifiers, not all issues can be addressed in a single identifier, and not all applications need to address all issues. Tim, who has developed legislative data systems for Tasmania, Canada (French & English) and Singapore, among others, noted that "[t]he ID schemes we chose in all these jurisdictions solved the problems we were trying to solve." Indeed, it is a lot to ask for an id scheme to solve all problems in all contexts for legal documents. But I believe it is important to identify the big categories of problems that will need solving, and to develop common id schemes for these cases. In particular, the solution Tim describes for Singapore, which "used both structural and UUID schemes side by side" and accounts for section validity, merits further amplification. And I hope we can explore it as an example in our book.

The following goals, in some combination, are required for effective referencing in a variety of legal contexts:
  • Identify: Accurately and uniquely identify the source text from the assigned id.
  • Find: The id should fit into a common lookup scheme to allow retrieval of the identified text (URL is the obvious example). Ideally, the text itself is just one link away from its surrounding context (e.g. a section of an Act, embedded in the Act itself).
  • Validate: Confirm that the text found is the one referred to by the id.
  • Create: Creation of the identifier should be straightforward, applying a set of unambiguous rules. In my ideal, these rules would be localized to the text itself and, if necessary, its immediate textual surroundings.
  • Update: In many circumstances, the id should distinguish between a legal object (e.g. Section 3) and its current instance (e.g. Section 3 as of 12pm on January 2, 2013). This information may include changes due to the legislature's "in force" or sunsetting provisions, repeal, amendment or, in the case of DOMA section 3, invalidation by a judicial authority.
No single identifier can deal with all of these requirements, but there should be a family of id specifications that can provide a buffet to choose from for a particular legal reference.

A round-up of other comments on legislative ids:

+Robert Richards referred to the LEX:URN standard (anyone know what the current status is, or a link to a "live" version?).  The standard uses a FRBR-like style, and requires a "Jurisdictional Registrar" to create uniform names for jurisdictions (e.g. 'eu', 'us', 'fr'). The elements within the reference are to be defined and standardized by the "national Authority". It is not clear to me how this will apply to non-national jurisdictions. Examples of the LEX:URN format (from the spec) include:
  • urn:lex:es:estado:ley:2002-07-12;123 (Spanish act)
  • urn:lex:ch;glarus:regiere:erlass:2007-10-15;963 (Glarus Swiss decree)
  • urn:lex:eu:commission:directive:2010-03-09;2010-19-EU (EU Directive)
  • urn:lex:us:federal.supreme.court:decision:1963-03-18; (US FSC decision)
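To make the structure of these examples concrete, here is a minimal Python sketch that splits a LEX:URN-style string into its parts. It is based only on the shape of the four sample identifiers above, not on the full grammar of the spec; the class and field names (`LexUrn`, `local_units`, `doc_type`) are my own labels, not terms from the standard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LexUrn:
    """Parts of a LEX:URN-style identifier (field names are illustrative)."""
    country: str
    local_units: Tuple[str, ...]  # e.g. ("glarus",) for "ch;glarus"
    authority: str
    doc_type: str
    date: str
    number: Optional[str]


def parse_lex_urn(urn: str) -> LexUrn:
    """Split an identifier shaped like the samples above.

    Assumes exactly six colon-separated fields:
    urn:lex:<jurisdiction>:<authority>:<type>:<date>;<number>
    """
    parts = urn.split(":")
    if len(parts) != 6 or parts[0].lower() != "urn" or parts[1].lower() != "lex":
        raise ValueError(f"not a recognised LEX:URN-style identifier: {urn!r}")
    _, _, jurisdiction, authority, doc_type, details = parts
    # A jurisdiction may carry sub-units after a semicolon ("ch;glarus").
    country, *local_units = jurisdiction.split(";")
    # The last field is a date, then an optional number after a semicolon.
    date, _, number = details.partition(";")
    return LexUrn(country, tuple(local_units), authority, doc_type, date, number or None)


glarus = parse_lex_urn("urn:lex:ch;glarus:regiere:erlass:2007-10-15;963")
us_case = parse_lex_urn("urn:lex:us:federal.supreme.court:decision:1963-03-18;")
```

Anything beyond these four shapes (versions, languages, partitions) would need the real grammar from the spec.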
+Rinke Hoekstra pointed to the CEN MetaLex standard, used to represent UK and Dutch legislation. A sample reference provides linked data about the "Rome Statute of the International Criminal Court", including a link to a text source (slow to load). According to Rinke, this id scheme includes a (SHA-1) hash of the document contents, as well as a versioning mechanism (apparently a date or datetime stamp). This approach has a lot to recommend it, including the potential to connect a reference to a body of metadata, which can address other goals outlined above.
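As a rough illustration of how a content hash supports the "Validate" goal, the Python sketch below builds an identifier from a work name, a version date and a SHA-1 hash of the text, then checks a retrieved text against it. The layout `work/date/hash`, the whitespace normalization and the function names are all assumptions of mine; this is not the actual MetaLex identifier format.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-1 of the text after collapsing runs of whitespace (an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def make_versioned_id(work: str, version_date: str, text: str) -> str:
    """Hypothetical layout: <work>/<version date>/<hash of contents>."""
    return f"{work}/{version_date}/{content_hash(text)}"


def validate(identifier: str, text: str) -> bool:
    """Confirm that `text` is the text the identifier refers to."""
    expected_hash = identifier.rsplit("/", 1)[-1]
    return content_hash(text) == expected_hash


statute_text = "The States Parties to this Statute ..."
ref = make_versioned_id("rome-statute", "2013-06-27", statute_text)
```

With this layout, a reader who follows the reference can recompute the hash and confirm they have the same version of the text that the author cited.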

+Sean McGrath  referred to the PRESTO (Public REST Object oriented) architecture (pdf at O'Reilly), and I would be interested to know how this relates to the proposed LEX:URN standard or other existing standards.

And Franklin Siler (@franksiler) mentioned the difficulty of applying an id scheme to unpublished court opinions.

So no single solution, but a number of considerations and some existing standards to help define id(s) for legal texts.

Wednesday, June 26, 2013

DOMA Section 3: How to Cite it Now?

The "Defense of Marriage Act" (DOMA) Section 3 has been struck down. That may not be news to you by now. If you ask me, striking it down was the easy part. Much harder is defending the Act on the grounds that the Supreme Court should show deference to the wisdom of Congress, in the same week that you vote to strike down the core of the Voting Rights Act; the dictionary entry for "chutzpah" just got a new example. (For more on this, see Laurence Tribe's analysis.)

Somewhere in between, on the hardness scale, is figuring out how to cite DOMA section 3 now. Wikipedia admirably shows the full, correct legal citation [for DOMA], with links: Pub.L. 104–199, 110 Stat. 2419, enacted September 21, 1996, 1 U.S.C. § 7 and 28 U.S.C. § 1738C. This shows how many ways there were to cite the law, and how tricky the legal citation problem was, even before today's Court opinion. But now that section has been invalidated by the Supreme Court. That doesn't take it out of the U.S. Code or affect its legislative history. So where to put the information that it is no longer valid under U.S. law? Lawyers will use the time-worn tradition of parentheticals, like: 1 U.S.C. 7 (nixed by the Supreme Court) or 28 U.S.C. 1738C (squashed like a bug, cf. United States v. Windsor).

But these parentheticals are not standardized, and are not logically part of the citation unit.  More concretely, in assigning a UUID to DOMA section 3, how should the court's opinion be incorporated? Assuming an XML model, is this a separate attribute on the reference element (e.g. validity="invalid")? Should there be a flag in the id itself? (e.g. href="DF3Ae8362-invalid") And should invalidation by the Court be distinguished, in the data, from repeal of the section by Congress? As was pointed out to me, this information may be added, in the future, as a Constitutionality note such as 19 U.S.C. 535 note.
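To make the first option concrete, here is a small Python sketch of a reference element that keeps validity in attributes rather than in the id. Only `validity="invalid"` and the sample id come from the question above; the element name `ref` and the `reason` and `source` attributes are hypothetical, added to show one way judicial invalidation could be told apart from repeal.

```python
import xml.etree.ElementTree as ET


def make_ref(target_id, citation, validity="valid", reason=None, source=None):
    """Build a <ref> element; `reason` and `source` are hypothetical attributes."""
    ref = ET.Element("ref", {"href": target_id, "validity": validity})
    if reason is not None:
        ref.set("reason", reason)   # e.g. "judicial" vs. "repeal"
    if source is not None:
        ref.set("source", source)   # the authority that changed the status
    ref.text = citation
    return ref


doma_ref = make_ref(
    "DF3Ae8362",
    "1 U.S.C. 7",
    validity="invalid",
    reason="judicial",
    source="United States v. Windsor",
)
xml_text = ET.tostring(doma_ref, encoding="unicode")
```

One argument for this shape is that the id stays stable: a later repeal, or a Constitutionality note, changes the attributes without breaking anything that already points at the id.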

Your thoughts are welcome. I plan to incorporate them, along with the excellent feedback (from Tim Arnold-Moore, +Robert Richards, +Sean McGrath and others) that I've gotten on my previous post on UUIDs, into a follow-up post on UUIDs for legal texts.

A note on the legislative history of DOMA section 3 that points to the more general need for a *unique* identifier for legal documents: For starters, DOMA was 104 H.R. 3396 (pdf), and passed as Public Law 104-199 in 1996. Its Section 2 carries the specific instruction to amend "Chapter 115 of title 28, United States Code ... by adding after section 1738B" a new section, 28 U.S.C. 1738C, while Section 3 amends Chapter 1 of Title 1 of the U.S. Code (which itself was passed as the Dictionary Act) by adding a new section 7.
Note: Following comments I received from a U.S. Code expert, this post has been corrected to reflect the correct structure of the Act and its effect on the U.S. Code.

Tuesday, June 11, 2013

UUID for Legal Text

There has been a lot of interest and I have gotten great feedback on the post about the book I'm writing with Grant about legislative data.

Data standards are always a hot topic (relatively hot-- we normalize against interest in this field in general, not against interest in the Kardashians).

Among the questions on data standards that have sparked interest is the question of how to assign unique identifiers to legal text. These are needed for many reasons, in a variety of contexts. The most straightforward is to be able to hyperlink to a specific subsection of a bill or law.

Some options for creating the unique identifier include:

  • A unique randomish code (e.g. based on the current  datetime)
  • A hash of the text of the section
  • A URN or URL identifier based on a standard, human-readable path to the section (e.g. us/uscode/title26/section100)
  • Some combination of the above
Version control is a very important consideration: Section 100 of title 26 may be amended and the identifier should tell us which version we're citing. Some very technically savvy minds at the Law Revision Counsel of the U.S. House of Representatives have suggested a combined approach with one identifier for the Code section, and one that specifies the version (e.g. the version as amended by P.L. 114-XYZ).
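The options above, and the combined section-plus-version approach, can be sketched in a few lines of Python. Everything here is illustrative: the function names, the choice of SHA-256, and the `@` separator between section and version are my assumptions, not part of any proposal from the Law Revision Counsel.

```python
import hashlib
import uuid


def randomish_id() -> str:
    """A time-based UUID (version 1), i.e. derived from the current datetime."""
    return str(uuid.uuid1())


def text_hash_id(text: str) -> str:
    """A hash of the section text; SHA-256 is an arbitrary choice here."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def path_id(*parts: str) -> str:
    """A human-readable path to the section."""
    return "/".join(parts)


def versioned_id(section_id: str, version: str) -> str:
    """Combine a section id with a version label (the '@' is hypothetical)."""
    return f"{section_id}@{version}"


section = path_id("us", "uscode", "title26", "section100")
cited = versioned_id(section, "pl-114-XYZ")  # placeholder law number from the text
```

The path stays the same across amendments while the version label changes, so a link can point either at "whatever Section 100 says now" or at one specific version.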

Another question is whether the id should itself carry information about the text. In the case of a hash, we could use a similarity-preserving hash, e.g. simhash, so that texts that are related would result in hashes that are close to each other. This might have advantages, for example, in citing to court documents. Text in one court opinion that is similar to text in another may provide useful precedent; a search algorithm could collect similar text sections based on these simhashes.

Rather than get ahead of myself and draft out the entire chapter on unique identifiers, I'll stop here and invite your comments.
  • What is important to preserve in a unique identifier for legal texts?
  • What id schemes have proven successful in other document-based structures?
  • What would Google (or Linus Torvalds) do?

If you have Insights or connections to People With Insights-- please comment here or let me or Grant know.

Tuesday, June 4, 2013

First Commit: Legislative Data, the Book

I'm writing a book with +Grant. This may be a surprise to him. We've discussed the book, we're planning on it, we've even begun to flesh out many of the ideas in our blogs. But we hadn't said anything publicly about it until now. Grant's in Hong Kong this week for work, so I figured it's a perfect time for me to commit us publicly to this project and deal with the consequences when he's back.

By the time he's in California again, I'm hoping that expectations have grown such that we just have to bite the bullet and write. I am anticipating a reading audience of dozens, but hope for an impact on millions. And that is where I'm counting on you.  In typical esoteric policy tech fashion, I've created a +GitHub repository with our first commit.  And a wiki with my very first draft of a table of contents:

We'll cover legislative data standards (e.g. Akoma Ntoso, SLIM), data format wars (html, xml, json, rdf), policy (e.g. DATA Act) and drafting decisions, positive law codification, open government and transparency, tools of the trade and more. Take a look and see what I've missed or what I've messed up.

Because it's on GitHub, you can make a branch, make suggestions or even a pull request. Suggest a new chapter, suggest a better title or subtitle for an existing chapter. Write a first draft or prepare to comment on our drafts (which may or may not be committed first on GitHub before publishing-- a lot may change after Grant reads this post). Or leave your comments here. And if you make extensive comments or edits, maybe that means that you should go ahead and write your own d#&!n book. Or join us as a co-author.