Monday, August 1, 2011

Hackathon Anyone? California Opens Legislative Database

So it's not quite Wikileaks, and it's actually officially sanctioned by California's Legislative Counsel. But an email I recently received on Sunlight Foundation's OpenStates listserv could be the first step to fully opening up California's legislation to the public. In the world of legislative transparency, this counts as exciting stuff.

I've written before about my efforts to wrangle text files of California's codes into structured data that is easier to navigate than the official site (leginfo.ca.gov). I've posted the new version of California's codes on calaw.tabulaw.com and made the computer code available on GitHub. Now California's Legislative Counsel has made the raw data, in XML format, available for FTP download here. What is remarkable is that the FTP data comes with all of the SQL scripts and a guide to set up your own database of California's laws and bills, *updated each day*.

Now, I've been to the FTP site before, and perhaps this information was all there (though I don't recall seeing it). [UPDATE: The files were posted in an obscure corner of the site about a year ago as the result of a lawsuit by maplight.org] In any case, this makes it possible to create a model site for California that goes beyond what has been possible for other state legislation to date. Much of the work that has made this possible was done by Grant Vergottini, who runs legisweb.com, and whose team developed the authoring system that California's legislature uses to write bills.

The CA site could:
1. Show a "point-in-time" version of California's law.
2. Show a redlined version of California's Codes, for any bill that would amend them.
3. Immediately update California's Codes when a new bill is passed.
4. Feature modern search and navigational tools to smoothly get from any place in the codes to any other.
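To make the first two items concrete, here is a minimal Python sketch of a point-in-time lookup and a word-level redline. Everything in it is an assumption for illustration: the `HISTORY` data, the function names `text_as_of` and `redline`, and the bracket markup are hypothetical and do not reflect the layout of the Legislative Counsel's data or the Calaw code.

```python
import bisect
import difflib
from datetime import date

# Hypothetical history for a single code section: (effective_date, text),
# sorted by date. Real data would have to be derived from the official dumps.
HISTORY = [
    (date(2009, 1, 1), "A person shall file the form within 30 days."),
    (date(2011, 1, 1), "A person shall file the form within 60 days."),
]

def text_as_of(history, when):
    """Return the version in effect on `when`, or None if none existed yet."""
    dates = [d for d, _ in history]
    i = bisect.bisect_right(dates, when)  # number of versions effective by `when`
    return history[i - 1][1] if i else None

def redline(old, new):
    """Word-level redline: deletions as [-...-], insertions as {+...+}."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        if op in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

if __name__ == "__main__":
    print(text_as_of(HISTORY, date(2010, 6, 1)))
    print(redline(HISTORY[0][1], HISTORY[1][1]))
```

A real site would need the same two operations per section, with the amending bill's text standing in for `new` when redlining a pending bill against the current code.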

A group is now forming to hack on this site and make it a reality, with a Calaw hackathon in the near future. If you're interested, contact me directly (aih at tabulaw dot com) or leave a comment.

8 comments:

  1. great idea Ari. i'd like to help and participate. SF has "Third Thursdays" where folks gather to discuss OpenGov + Technology, which might be a good venue to spread interest.

  2. Thanks, Ryan! (And to Greg Wilson, who was the inspiration for the hackathon.) We'll count you both in and plan to spread word to the Third Thursday folks--I hadn't heard of the group and hope I can make the next gathering.

  3. See my comments on the CA leginfo data at http://groups.google.com/group/fifty-state-project/browse_thread/thread/7d8aeb44d69aa813

    The SF Sunlight meetup is the group Ryan refers to; it meets alongside the OpenSF Third Thursdays. A link to it is also in the posting at the above link.

  4. Would love to help with this event, and to participate. I've worked on apps for the New York State legislature in the past, and there are tremendous opportunities for all kinds of useful applications with data like this.

  5. Here's the post so you don't have to follow the link:

    RE: leginfo data

    The raw dumps and the scripts to load the data into a MySQL db that
    you have been discussing in this thread, have been there (updated
    daily, weekly, yearly) for quite some time; they are not new.

    In fact you will notice the prior years data at the same location you
    describe.

    We've talked about this in the San Francisco Sunlight Meetup group. We
    currently already meet overlapping with the OpenSF Third Thursdays
    monthly.

    Great that the data is there, but the issue with the leginfo raw dumps
    is that while they technically deliver legislative documents into a
    MySQL db as SQLscript loadable - the documents loaded are data blobs
    - i.e. there is no actual schema for the information.

    It's just the raw documents, each loaded as a monolithic blob
    into a database, not a whole lot different than grabbing the same
    documents right off the surface Web, where they are available at the site.

    In the current blob format with no useful schema (other than to
    effectively collate the documents in their entirety by what the
    document is and by date), it would be more immediately useful to load
    the blobs into a repository and to apply search / IR (information
    retrieval) technology - which can out-of-the-box come along with a
    choice of JSON or XML format for return of result sets.

    In fact if you really wanted to get efficient about it and you felt
    you had to have traditional RDBMS with an RDB schema - then you could
    even use that as an interim step to get there without doing it all by
    hand.

    Check out : http://www.meetup.com/SunlightFoundation/San-Francisco-CA/308991/#ini...

    or @vividsocialnet on twitter

  6. Thanks, @Mark and Lauren!

    At this point, the raw data should be enough to build a new "showcase" CA site. I've parsed the section information and internal links. This can be plugged in to the ftp site's historical data, and connected with a time machine-like front end.

    I've signed up for the next OpenSF meeting, and would love to work with folks from SF Sunlight and Open SF to plan and participate in this hackathon.

  7. Great news @Yousuf! Now I know the hackathon can count on some creative Python hacking!
