Friday, November 18, 2011

Legislative Model: How Much to Open Source?

Should legislative data schemas be open source?  That is the question that Grant Vergottini raises in his blog post today, To Go Open Source or Not.  It's a thought-provoking topic, and I encourage you to join the discussion on Grant's blog. Some background and a bit of my thinking are below:

Some states (e.g. Oregon) have claimed copyright in the organization of state statutes in order to protect contracts that the state has with legal publishers or other monopoly arrangements to publish the state's laws. That is not the case with most states or jurisdictions, whose bills and statutes themselves are indisputably part of the public domain.

However, even when the legislative text and organization is part of the public domain, access is limited by inconsistent publishing formats and lack of common standards.  Anyone who has tried to use the public internet to search for information on state or even federal laws realizes how difficult this can be.  I have discussed the situation with the U.S. tax code, which my company, Tabulaw, is working to make more accessible at tax26.com.

I have also discussed the difficulty of accessing California's laws, which gave rise to a hackathon to improve the situation.  California, thanks to Grant's work, does have an underlying XML-based data structure, SLIM, that allows California's legislature to easily research and modify the laws and makes the technical process of writing bills more efficient.  However, this benefit has not--until recently--translated into improved access for the public.  Grant and his company have recently open-sourced SLIM which, in theory, could make it easier to make California's laws more accessible to the public, and also make the model available to use with legislation in other jurisdictions.  This could move us toward a standardization of legislative data.
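
To make the payoff of a shared schema concrete, here is a minimal sketch in Python. The XML below is invented for illustration -- it is not SLIM's actual markup, and the element names are my own stand-ins -- but it shows how, once a standard exists, a few lines of code can pull structure out of any conforming bill, from any jurisdiction.

```python
# Sketch of why a common XML schema for legislation matters.
# The markup below is a hypothetical stand-in, NOT the real SLIM schema.
import xml.etree.ElementTree as ET

BILL = """\
<bill jurisdiction="CA" session="2011-2012" number="AB-100">
  <title>An act to amend Section 123 of the Education Code.</title>
  <section id="1">
    <heading>Findings</heading>
    <text>The Legislature finds and declares the following...</text>
  </section>
  <section id="2">
    <heading>Amendment</heading>
    <text>Section 123 of the Education Code is amended to read...</text>
  </section>
</bill>"""

def section_headings(xml_text):
    """Return (id, heading) pairs -- trivial once the markup is standard."""
    root = ET.fromstring(xml_text)
    return [(s.get("id"), s.findtext("heading")) for s in root.iter("section")]

print(section_headings(BILL))
```

The point of the sketch: with a shared schema, this same five-line function works against every jurisdiction's bills; without one, each state needs its own parser.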

On the one hand, that would be a big step forward for public access, but it does raise some concerns, as Grant points out: it would mean that one company (in this case Grant's) would own the basic data structure for public laws.  Something similar already happens, de facto, with large swaths of government documents stored in PDF, a format that began as a proprietary Adobe product and has since become an open standard.  I am also disturbed by the claim, by the private publishers of The Bluebook, to copyright in the principal standard for citing legal sources, and by other copyright claims that encumber the basic ways that legal citations are written. So there are clear potential problems with a privately owned standard, even one that is open-sourced.

Wouldn't it be nice if governments at all levels would collaborate to create a single nationwide public domain data standard for legislation? That would, for example, make it easier to identify all state laws related to abortion or to compare education laws across jurisdictions.  It might be nice, but it's also less likely than the Congressional SuperCommittee reaching a compromise.  I won't be holding my breath.

I do think that a privately created, widely adopted, and open-sourced standard is the next best option.  The value of having a standard set of metadata in legislation outweighs the risks of private ownership.  And I believe it is in the interests of all involved, including the owners of the standard, to make its open source licensing clear and permanent, in order to encourage the widest possible use of the standard.

7 comments:

  1. You ask if it wouldn't be nice if governments would collaborate to publish laws in a consistent manner. The difficulty has always been in finding a need - legislatures haven't historically had much need to "collaborate" on anything.

    If the requirement is to make legislative bodies' documents accessible to the people, that does not really call for consistency. As long as the PDFs or HTML documents are put up on a website and a reasonable search engine is provided, that's good enough.

    So we have a chicken and egg problem. Until we have broadly defined analysis tools that create a demand for consistent data, there won't be a need. But those tools will remain very difficult to put together until the data becomes available in a consistent manner. The trick is finding a way to get around this.

  2. The states aren't consistent on anything, and this isn't the thing that will make them start.

    I agree with Grant that it's a chicken and egg problem. So someone needs to take it upon themselves to make the data consistent, and then show how much can be accomplished with such a data set.

    Sounds difficult and expensive. But maybe, just maybe, there's someone out there crazy enough to try it.

  3. I'll put up my hand (again) and point out that the Zotero reference manager has a robust mechanism for collecting uniform metadata from heterogeneous sources in a personal library. All we need to get it going for law is an agreed layout for the metadata in Zotero. I have a project for that purpose here: http://citationstylist.org/. There is a discussion forum on the site. Participation by people with a stake in this problem is welcome. Feel free to write to me for access, at the address given at the top of the forum.

  4. @Karen, I sure hope there's someone crazy enough to take it on; definitely the best solution to the chicken and egg problem.

    @Frank, one of the issues we are dealing with in legislative information is that important metadata is not published at all. So, for example, there is no metadata to identify section headings in bills from most states, and that information has to be inferred (or parsed) from the bill formatting, context and keywords.
    Once that metadata is available, linking it to Zotero's reference management system would be a great way to make that metadata more accessible to users.
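
    To illustrate the kind of inference described above, here is a toy Python sketch. The patterns and sample text are invented examples -- no state actually formats its bills exactly this way -- but the approach (regular expressions keyed to formatting conventions and keywords) is the sort of fragile guesswork required when the metadata isn't published.

    ```python
    # Toy heuristic for inferring section headings from flat bill text.
    # The formatting conventions here are hypothetical examples.
    import re

    BILL_TEXT = """\
    SECTION 1. Definitions.
    As used in this act, "agency" means...
    SEC. 2. Reporting requirements.
    Each agency shall file an annual report...
    """

    # Match lines that look like heading markers, e.g. "SECTION 1." or "SEC. 2."
    HEADING_RE = re.compile(r"^\s*(SECTION|SEC\.)\s+(\d+)\.\s+(.*)$", re.MULTILINE)

    def infer_headings(text):
        """Guess (section number, heading) pairs from formatting cues alone."""
        return [(m.group(2), m.group(3)) for m in HEADING_RE.finditer(text)]

    print(infer_headings(BILL_TEXT))
    ```

    Heuristics like this break as soon as a legislature changes its drafting style -- which is exactly why published metadata beats inference.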

  5. Parsing stuff out from page content is one of Zotero's strengths. We don't like to do it, but there are helper functions to ease the pain when screen scraping is necessary. There is a unit testing framework for the translators, which helps to cope with the fragility of page parsing. It's also possible to navigate across pages behind the scenes to chase after content (some Japanese government sites have essential metadata scattered at random across multiple frames, which was ... interesting).

  6. Your comment about waiting for states to agree on a standard reminded me of the similar effort under the Streamlined Sales Tax initiative. That was basically a project aimed at a standard among states with a deep economic need (making taxation of internet sales easier), and the states still can't agree. If the states can't agree on that, a standard for XML markup of legislation or other legal data has no hope at all.

  7. @Ari & Frank, I would go farther than Ari and say that a lot of metadata is not available on the face of the document either. One of my favorite examples is Connecticut court releases (http://www.jud.ct.gov/external/supapp/Cases/AROcr/CR303/303CR109.pdf). They have very little data on the face of the document -- the name of the court isn't even there. Another very common piece of missing metadata is the topic, something not traditionally published on court decisions or other legal documents, but highly important as soon as you put any collection of data together.
