Wednesday, December 7, 2011

Tax Experts vs. Robots?

In the battle between robots and legal experts, I have a stake in both camps. On the one hand, the site we're developing at Tabulaw relies heavily on automated parsing of tax law information, in order to organize and display it intelligently for professionals.  On the other hand, this endeavor, and everything I have done professionally, require varying levels of expertise in many, usually esoteric, domains.  In fact, when my 5-year-old is asked what her parents do, she says that her mom is a teacher and her dad is an expert.  Expert in what, none of us is quite sure.

So it should be no surprise that I am intrigued by the esoteric post about this battle by John Barker of Wolters Kluwer (a large legal publisher and owner of CCH, an information service for tax professionals). Barker argues that CCH's product, backed by legal experts who understand the context of arguments in a tax law case, is better positioned than Google Scholar to provide a meaningful result for professional users. Edward Bryant, one of those experts at CCH, writes on his personal blog that the tradeoff between expertise and automation is foremost a question of money: "(1) is an automated algorithm cheaper and (2) is the accuracy level I get through automation acceptable to my customers"?

I have understood these tradeoffs most clearly in discussions with Itai Gurari, formerly of Google Scholar, creator of Tracelaw, and one of the world's experts in automation of caselaw analysis.  What Itai does is really hard, and there aren't many people who can do it: while caselaw has an underlying structure due to its logical place in legal canon, finding that structure is an intrinsically hard problem to automate, because every judge thinks he's a poet, novelist or comedian.  True experts in a given area of caselaw are rare.  Experts who can formalize their knowledge into computer code are even more rare, so automation has a pretty large up-front cost.  And there are many subtleties that automated analysis will miss, no matter how many expert hours went into building it.

Where automation shines --or more broadly, computer assisted information processing-- is in the display and facilitated navigation of information.  Yes, experts' hand-drawn maps have been useful for a very long time, but even the Thomas Guide (remember those) has given way to Google Maps for most uses.

And that's where we're going. The tax map feature is just the tip of the iceberg of what can be done to map the geography of legal information. Not that this will replace experts any time soon, or that my 5-year-old will have to come up with a new title for me.

Wednesday, November 30, 2011

Data and Law: Two interesting new posts

What is the role of data in law?  I was trained as a scientist, so I have a tendency to look for evidence and falsifiable hypotheses in law and my daily life.  On this blog, I have discussed one aspect of data in law, the value of adding structure and metadata to legal texts.  In a similar vein, Grant Vergottini throws down the gauntlet, in his latest post,  for the creation of a uniform semantic web for legal documents.  He writes that current publishing practices provide "no uniformity between jurisdictions, minimal analysis capability (typically word search), and links connecting references and citations between documents are most often missing."

Grant asks a number of thought-provoking questions about creating a uniform semantic web for legal documents, which I think need to be addressed by the legal technology community, and the broader legal community:
"What standards would be required? What services would be required?...Should the legal entities that are sources of law assume responsibility for publishing legal documents or should this be left to third party providers?"
Edward Bryant takes a different approach in his post about using data in law, focusing on the value of using data to make policy decisions, which are then implemented in laws or regulations.  He discusses recommendations by the Ohio tax board to streamline processing of challenges to the state authority's valuations of residential property. The tax board apparently recommends streamlining challenges on residential property, but not commercial property, on the assumption that the residential claims will be less complex.  Bryant points out that the board's recommendation would be more credible if it used a little bit of data to correlate case complexity with the type or amount of claim.

I would expand Bryant's point to suggest that many of our leading decision-makers are not equipped to make data-driven decisions. Often, policy decisions like this are made with little or no relevant data-- or even in the face of contrary data.  Requiring some data-intensive technical training for lawyers would be a good start.  How about one semester of Evidence that focused, not on the FRE, but on how to gather and evaluate objective evidence in support of policy or legal decisions? I suspect that if lawyers, in general, were more data-literate, we'd have an easier time answering the questions that Grant poses above, on the way to creating a uniform semantic web for law.

Monday, November 28, 2011

How to Convert HTML to Text, With Formatting

My current best answer: this html2text package from Germany.  It can be installed easily on a Mac with MacPorts ($ sudo port install html2text), and on other Unix-like systems through their package managers.  It has a number of useful options, and I use it like this:

html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html
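To run that same command over a whole folder of bills, a small Python wrapper is one option. This is only a sketch: the function names and the one-folder layout are my own assumptions, and it assumes html2text is installed and on the PATH. The flags are exactly the ones in the command above, with each file fed in on standard input.

```python
import subprocess
from pathlib import Path


def html2text_command(html_path):
    """Build the html2text command line for one file (reads HTML from stdin)."""
    txt_path = Path(html_path).with_suffix(".txt")
    return [
        "html2text", "-nobs", "-ascii",
        "-width", "200",
        "-style", "pretty",
        "-o", str(txt_path),
        "-",  # "-" tells html2text to read the HTML from standard input
    ]


def convert_directory(directory):
    """Convert every .html file in a directory (hypothetical layout)."""
    for html_path in sorted(Path(directory).glob("*.html")):
        with open(html_path, "rb") as source:
            # check=True raises an error if html2text exits with a failure
            subprocess.run(html2text_command(html_path), stdin=source, check=True)
```

Calling convert_directory("bills") would leave a .txt file next to each .html file in that folder.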

So now that you know my current answer, here's the problem: not all html is created equal.  Legislative data is published in a variety of formats, some uglier than others.  Extracting information requires cleaning these formats up.

When I started to work with California legislation, I had the problem of converting the state's plain text into a simple html for use on web pages.  To do that, I used a Perl text2html module.  While it takes many steps to produce web-friendly html from California's laws, at least the plain text was not cluttered with formatting symbols and tags that could interfere with the core text.

The problem in other states is far worse.  Some versions of Iowa's bills, for example, appear to be published directly from Microsoft Word to the web, which means that they're littered with a maze of formatting information--sometimes positioning each word on the page--that is not related to the text of the bill.  Other states use hundreds of cells of an html table (or multiple tables) to format the bill.  Looking at the file on the state's website, you wouldn't know that the underlying data is so messy.

Simply stripping all of the html tags won't work, because that eliminates all the formatting information, including information that can change the bill's meaning (spaces, paragraphs). That's unfortunate, because there are many html libraries that would make stripping out the tags easy (e.g. Beautiful Soup for python, or similar libraries in other languages).  What I want to do is preserve the formatting, but do it with spaces and paragraphs, not tables or graphically positioning words.
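One way to approach this with nothing but Python's standard library is to walk the tags and translate only the ones that carry meaning. The sketch below is illustrative rather than a finished converter: the list of block-level tags and the plain-text markers (underscores for underline, [- -] for strikeout) are assumptions of mine, not a convention that any state uses.

```python
from html.parser import HTMLParser

# Tags treated as paragraph or line boundaries (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}
STRIKE_TAGS = {"strike", "s", "del"}


class FormattedText(HTMLParser):
    """Collect text, turning meaningful tags into plain-text markers."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "u":
            self.parts.append("_")
        elif tag in STRIKE_TAGS:
            self.parts.append("[-")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "u":
            self.parts.append("_")
        elif tag in STRIKE_TAGS:
            self.parts.append("-]")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = FormattedText()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    kept = []
    for line in lines:
        # Keep text lines, and at most one blank line between paragraphs.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

All other tags, including the table and positioning markup, are simply dropped, while paragraphs, underlines and strikeouts survive as plain text.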

Ironically, the most effective way to clean this messy data is also the easiest: copy and paste the bill displayed on your web browser.  After all, the formatting was made for the browser to interpret, and the copy-paste function (at least on a Mac) is quite faithful to the formatting.  However, automating this copy and paste process is far from simple and, with one exception, I have not seen any programs that make use of this native browser capability to convert files in bulk.  The exception is the text-based web browser, Lynx, which has a "lynx -dump" option.  However, this converter apparently has a number of faults, including an inability to process tables.  Anyone know how to use Chrome or Firefox to automate conversion of html to text for large numbers of files? This is still the solution I'd prefer.

But barring that, I found a close second, in the form of the html2text program.  Although it's relatively old (2004), it's fast and deals reasonably with tables and other formatting such as underlining and strikeouts.

Edit: Upon the suggestion by Frank Bennett, below, I installed the w3m text browser and used it to produce formatted text from html using the following command-line syntax:
w3m filename.html -dump > file.txt
Like html2text, it is fast and produces clean output, actually somewhat too clean.  The saved file strips some important formatting information, like <u> (underline) tags, so some caution is in order when using this method.
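One cheap way to exercise that caution is to check, before converting, whether a file contains the kind of tags that a plain dump would silently lose. The snippet below is a rough sketch: I only observed the problem with underline tags, so the rest of the watch list is my assumption.

```python
import re
from collections import Counter

# Tags whose loss could change the meaning of a bill (assumed list).
WATCHED_TAGS = ("u", "strike", "s", "del", "ins")

# Matches opening tags only; closing tags such as </u> are skipped.
_OPEN_TAG = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def formatting_at_risk(html):
    """Return {tag: count} for watched tags that appear in the HTML."""
    counts = Counter(m.group(1).lower() for m in _OPEN_TAG.finditer(html))
    return {tag: counts[tag] for tag in WATCHED_TAGS if counts[tag]}
```

An empty result suggests the file is probably safe to dump as plain text; anything else is a sign to use a converter that keeps those marks.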

Friday, November 18, 2011

Legislative Model: How Much to Open Source?

Should legislative data schemes be open source?  That is the question that Grant Vergottini raises in his blog post today, To Go Open Source or Not.  It's a thought-provoking topic and I encourage you to join the discussion on Grant's blog. Some background and a bit of my thinking is below:

Some states (e.g. Oregon) have claimed copyright in the organization of state statutes in order to protect contracts that the state has with legal publishers or other monopoly arrangements to publish the state's laws. That is not the case with most states or jurisdictions, whose bills and statutes themselves are indisputably part of the public domain.

However, even when the legislative text and organization are part of the public domain, access is limited by inconsistent publishing formats and lack of common standards.  Anyone who has tried to use the public internet to search for information on state or even federal laws realizes how difficult this can be.  I have discussed the situation with the U.S. tax code, which my company, Tabulaw, is working to make more accessible.

I have also discussed the difficulty of accessing California's laws, which gave rise to a hackathon to improve the situation.  California, thanks to Grant's work, does have an underlying XML-based data structure, SLIM, that allows California's legislature to easily research and modify the laws and makes the technical process of writing bills more efficient.  However, this benefit has not--until recently--translated into improved access for the public.  Grant and his company have recently open-sourced SLIM which, in theory, could make it easier to make California's laws more accessible to the public, and also make the model available to use with legislation in other jurisdictions.  This could move us toward a standardization of legislative data.

On the one hand, that would be a big step forward for public access, but it does raise some concerns, as Grant points out: it would mean that one company (in this case Grant's) would own the basic data structure for public laws.  This is something that already happens, de facto, with large swaths of government documents stored in pdf, a proprietary but open sourced format.   I am also disturbed by the claim, by the private publishers of the BlueBook, to copyright in a principal standard that has been adopted for citations of legal sources, and other copyright claims that encumber the basic ways that legal citations are written. So there are clear potential problems with a privately owned standard even if open-sourced.

Wouldn't it be nice if governments at all levels would collaborate to create a single nationwide public domain data standard for legislation? That would, for example, make it easier to identify all state laws related to abortion or to compare education laws across jurisdictions.  It might be nice, but it's also less likely than the Congressional SuperCommittee reaching a compromise.  I won't be holding my breath.

I do think that a privately created, widely adopted, and open sourced standard is the next best option.  I think that the value of having a standard set of metadata in legislation outweighs the risks of private ownership of the standard.  And I believe that it is in the interests of all involved, including the owners of the standard, to make the open source licensing of the standard clear and permanent, in order to encourage the widest possible use of the standard.

Monday, November 14, 2011

Legal Informatics' New Blog, from Grant Vergottini

I'm excited to see the start of a new blog on legal informatics (and more), from Grant Vergottini.  Grant, a key participant and organizer of the California Law Hackathon, and his business partner, Bradlee Chang, developed the authoring system that California's legislature uses to write our state laws.  So he knows a thing or two about legislative data.

Grant's vision, which I share, is that at some point, legislation from around the world will be published in a standard format so that "you or your business can easily research the laws to which you are subject" due to the growth of an industry that "caters to the needs of the legal profession based on open worldwide standards."

There are a number of questions of how that vision will come to be.  I touched on some of these questions in my answer on Quora about the non-technical barriers to using version control for legislation, which stimulated a lively discussion.  I'm hopeful, with Grant's new blog, that we can have more of those discussions to work out both the non-technical (mostly political) and technical challenges in the way of open legislative data standards.

Thursday, October 13, 2011

Blog experiment over for now

I've gone back to the previous design for this blog.  I like Blogger's new "Dynamic Templates", but I suspect that they are changing how easy it is to search and find the blog.  So for now, I'm going back to the previous template.  And after this post, back to our previously scheduled legal technology programming...

Sunday, October 9, 2011

New Blog Views: Try it out

As you can see, the blog looks different.  Google has rolled out a number of new ways to view and navigate blogs, and I'm trying them out here. I like the Timeslide view the best (selection on the right), to see blog topics at a glance.

If you have strong opinions one way or the other, leave a comment and let me know.

Saturday, October 1, 2011

CA Hackathon Update: Mashup of Maplight data and California Statutes

Ever wonder who is responsible for a particular law?  Often the supporters and opponents of a bill are forgotten once the bill is codified into law.

Now, however, as one of the results of the California Law Hackathon held two weeks ago, we can get more insight into California's laws, section by section. Mike Tahani built a mashup of Maplight data and California statutes, which shows, for a given section of California's codes, the organizations that supported and opposed its enactment.

If you're technically inclined, check it out on GitHub, here.  It is currently a command line utility, and only applies to a very, very small subset of California's codes, because we're only working with legislation passed in the last two sessions of the California legislature.  If you're less technically inclined, wait for version 2 or 3.  Mike plans to add a one-click user interface and data visualization to show the links.
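For a sense of what such a mashup involves, here is a minimal sketch of the join. It is not Mike's code: the function name is hypothetical, and every section, bill and organization in it is a made-up placeholder rather than real Maplight or statute data. The idea is simply that a code section points to the bills that enacted or amended it, and each bill points to its supporters and opponents.

```python
# Placeholder data: code section -> bills that enacted or amended it.
SECTION_TO_BILLS = {
    "Example Code 100": ["AB 1"],
    "Example Code 200": ["SB 2", "AB 3"],
}

# Placeholder data: bill -> organizations that took a position on it.
BILL_POSITIONS = {
    "AB 1": {"support": ["Organization A"], "oppose": ["Organization B"]},
    "SB 2": {"support": ["Organization C"], "oppose": []},
}


def positions_for_section(section, section_to_bills, bill_positions):
    """Merge the positions on every bill behind one code section."""
    merged = {"support": set(), "oppose": set()}
    for bill in section_to_bills.get(section, []):
        positions = bill_positions.get(bill, {})  # bills with no data are skipped
        for side in merged:
            merged[side].update(positions.get(side, []))
    return {side: sorted(names) for side, names in merged.items()}
```

The hard part in practice is building the first table, since it requires knowing which bill touched which section.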

The rich possibilities for open legislative data are just starting to emerge, and Mike's Maplight mashup joins the new wave of data presentation for state codes, that includes Robb Shecter's work on Oregon's and now California's statutes, and Waldo Jaquith's work on Virginia and soon other states, as part of the Knight Award project, State Decoded.

Thursday, September 22, 2011

Google Search: What are Uses for the Internet?

If you were wondering what the internet is for, you've come to the right place.  Apparently, this blog is one of the top ten results for the search "what are uses for the internet".

How do I know? The Google Analytics report for this blog showed that phrase as one that led here.

So if you're wondering, my 2 cents on the subject: one great use for the internet is to make legal information more accessible. While I think that the internet can be valuable to share photos, videos, tweets, tumbles, sparks and other gems, it can also be used to share our basic legal rulebooks and court decisions in a way that is accessible to everyone who is bound by them.

California Laws: Continued progress

Our work continues to translate the results of the first California Laws Hackathon for public consumption:

A small core of hackers, consisting of Grant Vergottini, Greg Willson, Mike Tahani and myself (with support from Karen Suhaka's excellent team at BillTrack50) is moving forward to apply the sample timeline to all  sections of California's codes, and to link external data to code sections.  In particular, Matt has written functions to link's lobbyist and bill positions data to California statutes.

Meanwhile, we're working with Common Cause (Philip Ung), Sunlight Foundation (Laurenellen McCann) and Maplight (Jeff ErnstFriedman and team) to debrief hackathon results and apply this momentum to strengthen Open Government initiatives in California.

If you want to chip in, contact me, or add to the growing wiki here.

Sunday, September 18, 2011

California Hackathon Update

Seven Google+ Hangouts, four countries, dozens of tweets, many coffees, discussions and coding sessions later, I'm ready to call it a night for the California Law Hackathon (Twitter hashtag #calawhack).

Special guest appearance by John Sheridan, architect of and guidance on California data APIs from Grant Vergottini, architect of California's LegisWeb and, and the Maplight team. Amazing cross-country coordination and promotion by Robert Richards.

Participants, photos, thanks, some of my embarrassing source code and early results are on the wiki and will continue to be updated:

Improved documentation of the event and access to California legislative data coming soon.

Stay tuned...

Wednesday, September 14, 2011

California Laws Hackathon Details

If you haven't yet heard the news, we're having a Hackathon for California Legislation. Feel free to spread the announcement below to friends and lists who might be interested.

Join us this Saturday to hack world-class apps for California's legislation. This hackathon was born on the Sunlight Foundation's Open State project listserve, to extend the great work there! Please forward to your lists and groups!

When:      12pm -6pm Saturday, September 17
Where:     Maplight Foundation, 2223 Shattuck Ave., Berkeley (contact me for virtual participation)
What:       Build apps for California legislation, including a legislative time machine. 
                  Play with the new CA legislative APIs from and
RSVP: or contact me directly.

Sponsors Include: Sunlight Foundation, Maplight Foundation, LegisWeb, Common Cause, BillTrack50, NationBuilder, Tabulaw

Monday, September 12, 2011

California Laws: Great new ideas

Waldo Jaquith of the StateDecoded, and Matt Carey have added a number of excellent ideas for organizing California's laws as part of our California Law Hackathon (RSVP here).

Do you have ideas for the hackathon?  Add them here*:

Current ideas include:

  • Cluster related code sections, for search and navigation
  • Create a timeline view for each code section
  • Bulk downloads for codes and legislation
  • Create identifiers for useful legislative units (e.g. language on "unfair practices")
  • Track movement of statutory text from one place in the code to another
*Write to me (aih at tabulaw dot com) for access to the wiki.
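To make the last idea in the list concrete, tracking moved text, here is one rough way it could be sketched with Python's difflib. The function name, the 0.9 similarity threshold and the dictionary layout (section number mapped to text) are all assumptions for illustration, not something the group has settled on.

```python
from difflib import SequenceMatcher


def find_moved_sections(old_code, new_code, threshold=0.9):
    """Return (old_id, new_id) pairs where text appears to have moved."""
    moves = []
    for old_id, old_text in old_code.items():
        if new_code.get(old_id) == old_text:
            continue  # still in the same place, unchanged
        best_id, best_ratio = None, 0.0
        for new_id, new_text in new_code.items():
            if new_id == old_id:
                continue
            ratio = SequenceMatcher(None, old_text, new_text).ratio()
            if ratio > best_ratio:
                best_id, best_ratio = new_id, ratio
        if best_id is not None and best_ratio >= threshold:
            moves.append((old_id, best_id))
    return moves
```

Comparing every old section against every new one is slow for a full code, so a real version would need some indexing first.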

Thursday, September 1, 2011

Legal Technology: Change is Coming

A transformation is taking place in the legal technology industry. An article by Paula Hane in InfoToday entitled "Upstart Legal Services Gain Traction", highlights a number of new online services that are challenging traditional models - both in technology and legal practice. (Disclosure: there is a nice section on Tabulaw's research and writing platform and our site.)

Pressures on lawyers and law firms to become more efficient, and to adopt advances in technology are now becoming publicly visible in a number of ways. One of them is the rise in online legal services that Hane describes in her article, another is the turmoil surrounding high law school tuitions and the weak market for new lawyers, a third is the growing interest in legal information from technologists and technology companies (e.g. legal content on Google Scholar and Google Ventures' investments in LawPivot and RocketLawyer).

These changes highlight two essential components of law: information and judgment. A comment on judgment first:

Engineers often make the mistake of assuming that the entire function of law can be outsourced to technology. That thinking is fed by a certain line of thinking that runs straight up to the Supreme Court, that judgment is just the application of law to facts, like Chief Justice Roberts' "balls and strikes" analogy at his confirmation hearing. That suggests that judgment can be replaced by an algorithm. That, I hope, is not the direction of improved legal technology.

Where legal technology shines is in distilling information in a form that makes it easier for a decision maker to apply good judgment, and which clears out much of the information overload that surrounds many legal issues. The tech world is only now touching the surface of what can be done to distill the information of law, which is just text after all. As an example, a friend of mine, Itai Gurari, is building an engine that can identify the relevant legal points in a court opinion (check out his search engine, Tracelaw, here). If you want to get involved in this exciting field, a good place to start is with state statutes and by helping us with the first ever California Law Hackathon.

Wednesday, August 31, 2011

California Hackathon: Sunlight, Common Cause and More

[Edit: Sign up to join us here (Facebook event)]

The California Codes Hackathon, now scheduled for September 17 in San Francisco and Denver, is gathering steam.
In addition to the excellent hackers who've signed up to prepare data for the hackathon (check out the wiki), supporters now include:

Sunlight Foundation (thanks, Laurenellen McCann and James Turk), Common Cause (Philip Ung), Nation Builder (Adriel Hampton). We'll be announcing more soon...
Sign up now to join us: Facebook Event or through NationBuilder (same list).

Monday, August 22, 2011

California Codes: Everything Down is Live Again

Last week, Amazon Web Services notified me that something went wrong with the hardware hosting our California Codes site. Bad timing, given our preparation for a California Law Hackathon. I built this originally as a side project and didn't have a back-up (note to future self), so I had to rebuild it from the sources. After a few trials and errors, the site is now live again, and I've added more notes to the CAlaw github Readme about how to host this Django site on Amazon EC2. In theory, that should allow anyone to build and host their own site for California Legislation, but it's not so easy, so carpe hacker.

Thursday, August 18, 2011

California Laws Hackathon + Calaw site down

Plans for the California laws hackathon are moving forward, with some sponsorships secured, thanks to Philip Ung (@philrung) at Common Cause (@CommonCauseCA). Generally updates are added at More details soon...

Also wanted to note that my California laws site is down, due to an Amazon Web Services interruption.  I will restore it soon on a new AWS instance.  If you're interested in such things, the notice is below the fold.

Thursday, August 11, 2011

California Law API: Preparation for Hackathon

[Edit: Sign up to join us here (Facebook event)]

It's true, California's laws now come with an unofficial RESTful API. This is a great boost for California Law Hackathon plans: now programmers can dive right in and develop innovative ways of presenting and navigating the data in their favorite format: JSON, XML, or RDF, among others.  If you want to jump ahead to see the API specifications, they are available on the site here and I've posted them here on the California Law Hackathon wiki.
California Hackathon update
This Sunday, Greg Willson (Granicus) and I were joined by Alex Hendler (Ontolawgy, LLC) in Botswana on a Google+ Hangout (my first) to help set the groundwork for the hackathon. Main points of the discussion:
  • Target date for the hackathon: September 17-18 (updated from September 3-4)
  • Prepare data and tools for hackathon participants
  • Prepare a list of projects and goals (e.g. legislative time-machine, before and after redlining for bills)
Notes from the meeting are here. Twitter hash: #calawhack
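For anyone wondering what a "legislative time-machine" might look like in code, here is a minimal sketch of the core lookup: store each version of a section with its effective date, then ask for the text as of any day. The class and method names are hypothetical, and real data would also need repeal dates and pending amendments.

```python
from bisect import bisect_right
from datetime import date


class SectionHistory:
    """Versions of one code section, kept in effective-date order."""

    def __init__(self):
        self._dates = []
        self._texts = []

    def add_version(self, effective, text):
        # Insert in sorted position so versions can be added in any order.
        index = bisect_right(self._dates, effective)
        self._dates.insert(index, effective)
        self._texts.insert(index, text)

    def as_of(self, when):
        """Return the text in force on a given date, or None if none yet."""
        index = bisect_right(self._dates, when)
        return self._texts[index - 1] if index else None
```

The binary search keeps lookups fast even for sections that have been amended many times.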
How does the API fit in?
I spoke yesterday with Grant Vergottini, founder of Xcential, who foresees a transformation in legislative technology like the one he helped to usher in to the graphic design world, with the development of Computer Aided Design (CAD) software. He has developed a web-accessible interface to the California laws, that can provide legislative data in a wide variety of formats. This data is updated daily, and should lend itself well to the kind of "time machine" presentation we've discussed for the hackathon.  In preparation for the hackathon, Grant put together the API specifications linked above.  Test them out and share any feedback you have on this API and other tools or data you'd like to see available for California legislation.
That's great, but: What is an API?
A web API (Application Programming Interface) tells programmers how to access data from a website.  Facebook, Google, Twitter, LinkedIn, Apple and pretty much any "Web 2.0" site provide some API to their web services.  In a truly Mr. Jobs Goes to Washington moment, the Federal Register announced last week that it was releasing a fully RESTful API, complete with Github account and developer's page.  I am excited that California's laws now have their own (unofficial) API, too.
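In practice, using a RESTful API usually comes down to building a URL that names the thing you want and then parsing what comes back. The sketch below shows that pattern with Python's standard library. The base URL, the path layout and the JSON field names are all hypothetical placeholders; the real California endpoints are in Grant's specification, not here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.org/california/codes"


def build_url(code, section, fmt="json"):
    """Name the resource in the URL path, REST style."""
    return f"{BASE_URL}/{quote(code)}/{quote(section)}.{fmt}"


def parse_section(payload):
    """Pull the section number and text out of a JSON response (assumed fields)."""
    record = json.loads(payload)
    return record["section"], record["text"]


def fetch_section(code, section):
    """Request one code section and return (section, text)."""
    with urlopen(build_url(code, section)) as response:
        return parse_section(response.read().decode("utf-8"))
```

Asking for XML or RDF instead would, in a design like this, just mean changing the format at the end of the URL.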

Wednesday, August 3, 2011

CA Hackathon Planning Meeting

This Sunday, August 7 at 1pm PST, we are having a virtual + in-person planning meeting for the hackathon, with a twitter hashtag #CaLawHack.

I've set up a collaborative editor so we can jot notes for the planning meeting itself.

Look for more details on, where Luke Fretwell has generously offered to publicize the meeting.

Monday, August 1, 2011

Hackathon Anyone? California Opens Legislative Database

So it's not quite Wikileaks, and it's actually officially sanctioned by California's Legislative Counsel. But an email I recently received on Sunlight Foundation's OpenStates listserv could be the first step to fully opening up California's legislation to the public. In the world of legislative transparency, this counts as exciting stuff.

I've written before about my efforts to wrangle text files of California's codes into structured data that is easier to navigate than the official site. I've posted the new version of California's codes online, and made the computer code available on GitHub. Now, California's Legislative Counsel has made the raw data, in XML format, available for FTP download here. What is remarkable is that the FTP data comes with all of the SQL scripts and a guide to set up your own database of California's laws and bills, *updated each day*.

Now, I've been to the ftp site before, and perhaps this information was all there (though I don't recall seeing it). [UPDATE: The files were posted in an obscure corner of the site about a year ago as the result of a lawsuit by] But in any case, this makes it possible to create a model site for California that goes beyond what has been possible for other state legislation to date. Much of the work that has made this possible was done by Grant Vergottini, who runs, and whose team developed the authoring system that California's legislature uses to write bills.

The CA site could:
1. Show a "point-in-time" version of California's law.
2. Show a redlined version of California's Codes, for any bill that would amend them.
3. Immediately update California's Codes when a new bill is passed.
4. Feature modern search and navigational tools to smoothly get from any place in the codes to any other.
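The redlining in item 2 is the easiest of these to prototype. As a rough sketch, Python's difflib can compare the current text of a section with the text a bill proposes, word by word. The function name and the markers ([-deleted-] and {+added+}) are my own choices for plain-text output, and a real site would render them as strikeouts and underlines.

```python
from difflib import SequenceMatcher


def redline(old_text, new_text):
    """Return one string marking words deleted from old and added in new."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pieces.extend(old_words[i1:i2])
        if op in ("delete", "replace"):
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)
```

Splitting on whitespace throws away the original spacing and paragraph breaks, so this is only suitable for a quick look, not for publishing the official redline.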

A group is now forming to hack on this site and make it a reality, with a Calaw hackathon in the near future. If you're interested, contact me directly (aih at tabulaw dot com) or leave a comment.

Monday, July 25, 2011

Law's Lost Generation?

Last week's NYT OpEds on law school reform continue to reverberate in the legal community - especially among law students and recent graduates.

One answer for graduates, as I suggested, is to proactively define and promote their own expertise.  Another is for this generation of new graduates to join with other new graduates and use technology to their advantage.  This generation may face the worst economic situation for lawyers since the 1930s, and its members are largely being shut out of traditional firms.  But they are also (by definition) the most wired generation ever and have access to technologies that can bring tremendous value and efficiency to legal practice.  These technologies include:

1. Social networking to bring in business from around the corner and around the globe.
2. Better, inexpensive and free online research platforms.
3. Virtual law firms, which can lower overhead and increase transparency by providing links to the public work product of firm lawyers, and facilitate rating or referrals from clients.
4. Workflow technologies to offer better and more efficient service to clients.

We're working on a couple of these technologies, and I believe that the next couple of years will bring many more.

If you are part of this new generation of lawyers, what role do you see for technology in law?  What technologies would you like to see for lawyers?

Friday, July 22, 2011

NYT: Law School Reform and Innovation in Law

Following a number of scathing articles over the last few weeks about law school graduates accumulating tremendous debt with few job prospects, the New York Times today published opinions from a variety of legal pundits about reforming law school. Many of the opinions revolved around the theme of increasing experiential learning and reducing class time.

David Lat, founder of Above the Law, argues for replacing the third year of law school with the beginning of an apprenticeship. Perhaps not surprisingly, but somewhat disappointingly, three law school professors say that the current model is fine and that law school is a good opportunity, regardless of career prospects, to become "citizen scholars" (with > $120k in debt?).

This discussion has mirrored many conversations I have had recently with lawyers and Bay Area entrepreneurs who are looking not just at law school, but at the creaking wheels of law, and are working on ways to innovate.  Just this afternoon, I met with Tim Hwang, U.C. Berkeley Law student and partner in the fictional Robot, Robot & Hwang law firm.  Earlier this year, Hwang organized a conference of technologists and entrepreneurs in the legal space, and is currently working on projects that reinvent the relationship between law and technology.  We spoke about the generation of lawyers who are not following the traditional path from school into law firms--whether by choice or, more often, due to the downturn in the economy. Will this generation just disappear, or will it push for a transformation of legal practice?

I've also been speaking with Vivi Hoang, a recent law school graduate, who is working on an innovative plan to develop a startup law clinic at a Bay Area law school that would bring together students with entrepreneurs and law firms in this area.  This would seem to be a win-win all around: law students would get hands-on experience with legal issues that arise in startups; startups would be able to trim their legal bills, and participating law firms would be able to work with promising startups without taking on the full risk of a fee deferral arrangement.

Perhaps most encouraging was a meeting I had with Avlok Kohli and Kevin O'Keefe this morning, during Kevin's brief stop in the Bay Area from Seattle. Avlok is a brilliant strategist, software engineer and co-founder of a startup in the legal space that I've been working with. Kevin is the founder of the LexBlog network (tagline: "Real Lawyers Have Blogs") and has built a considerable following by helping lawyers to develop a thoughtful and effective approach to social media.  Kevin, like many of the others with whom I've spoken recently, recognizes the need for a cultural shift among lawyers--in the way we communicate with each other and with the public, and in our use of technology.  The skills that Kevin has emphasized for practicing lawyers--to develop and share their expertise in blogs and elsewhere online--are even more critical for the current generation of law school graduates.  These graduates will have to fend for themselves more and more, as the nature of the legal market and legal services inevitably changes.

So while law students wait for the reforms that David Lat and others are calling for in the New York Times OpEds, they would do well to follow Kevin's advice and start now building their individual reputation online. In other words: Learn to stop worrying and love to blog.

Tuesday, July 12, 2011

139D: What Congress Does When it Runs Out of Numbers

So, while Congress debates whether to raise the debt limit or cause catastrophic damage to the country's economy, we diligently worked on making further improvements in the way we display tax law on  And one of the items that is causing our parsing engine to choke, is Section 139D of the Tax Code.  Search for it on any site you like: (26 U.S.C. 139D).  (Thanks to Sergey, author of the parsing engine, for pointing this out!)

It's not easy to find, and if you find the section at all, you may become a bit puzzled.  There are, in fact, two section 139D's:

One deals with school vouchers, and the other with "Indian health care benefits".  How did this happen?

Well, the D is what happens when Congress runs out of numbers.  Here, there was a 139 and a 140, and Congress stuck in a few additional sections between them.  And it apparently added a second 139D, without realizing that the number was already taken.  As the Law Revision Counsel's note states: "3! So in original. Two sections 139D have been enacted."

I pointed out before the thousands of errors I found when parsing California's electronic statutes. This error, and others like it (including 28 U.S.C. 1932 and 5 U.S.C. 5757), are in the U.S. Code itself.  Another bump on the road to digitizing legislative information.

Update: Upon further investigation, it seems that Congress has now repealed (at least one) section 139D.  An update on the House website notes that "Section repealed by Pub. L. 112-10, sec. 1858(b)(2)(A)". This update raises its own questions, since the bill that became Pub. L. 112-10 (HR 1473) does not seem to have a section 1858.

Friday, July 8, 2011

Obama's Negotiations on the Debt Ceiling and Tax Code Reform

A very interesting analysis by Kevin Drawbaugh of Reuters argues that--despite rumblings of reforming the tax code-- negotiations between President Obama and Congress over the debt ceiling are unlikely to make the kind of overhauls called for by the Fiscal Commission (National Commission on Fiscal Responsibility and Reform). Drawbaugh notes that there simply is not enough time to make those changes before the deadline of August 2.

It does look, however, like reform of many provisions of the tax code will be central to securing a deal, in which case the changes can serve as a kind of dry run for the larger overhaul that President Obama called for in his State of the Union Address.

A great opportunity to simplify the tax code not only by closing loopholes, but also by writing any new tax legislation so that humans can better understand it and computers can better process it.  See recommendations elsewhere on this blog and at start by using plain language in writing any new tax legislation.

Wednesday, June 29, 2011

Tax Law Access in the 21st Century: Guest Post on 21stCenturytaxation blog

Professor Annette Nellen generously invited me to write a guest article on her blog about technological changes that can make a difference for tax law.  Nellen is a tax professor and Director of the MS Taxation Program at San Jose State University, and a thought leader on reform of tax law and practice.

Check out the post and leave your comments here:

Monday, June 27, 2011

Quora Post: What are nontechnical barriers to adopting version control for legislation?

I was invited to answer the question above at Quora, which touches on a number of themes on this blog. It generated an interesting discussion, and I reproduce my answer here:

This Venn diagram explains the most fundamental barrier:
If you squint, you might be able to find a couple of intersections, but not many. I think that this is a problem that can be solved largely by providing a clean, obvious, technical solution for lawmakers. To borrow from the Godfather: offer legislators a solution they can't refuse (more below).

But this question asks about the non-technical* barriers, and these are largely inertial. The legal community is unaware of the powerful text-based tools that could make legal work more accessible to the public and more efficient. Meanwhile, there is no "version control" lobby in Congress. So although adding version control would make a tremendous difference to the efficiency of the legal process, few people understand the value that it would bring. I've written about the potential benefits in a couple of specific cases: What can lawmakers learn from computer science? and Open Source Tax Law

Much of the current system for drafting, publishing and updating U.S. laws is more than two hundred years old, depending on how you count. It is internally consistent (mostly) and is actually quite sensible for organizing legislation into printed books.**

In the case of U.S. Federal legislation, the significant burden of writing, compiling and publishing U.S. laws is divided among three different institutions:  the Office of Legislative Counsel of the U.S. House is in charge of formatting and printing legislative drafts and proposed legislation; the Law Revision Counsel of the U.S. House maintains and updates the U.S. Code on a 6 year schedule***; and the Government Printing Office is in charge of printing the official version of the U.S. Code. When these roles were originally established, they provided the human resources and Quality Assurance to maintain an organized body of law. The challenge is to move from this system to one that is suited for an electronic age.

Each of the three institutions works with legislation in a different primary format. Where metadata has been added, e.g. to create an XML or HTML version, the formats are not consistent with each other. This is a technical barrier that will require a non-technical solution (choosing one format and responsible institution over the other). It's a question of awareness and political will.

This year has seen some progress on both counts. Just a couple of months ago, Speaker of the House John Boehner and majority leader Eric Cantor wrote a letter to the Clerk of the House, calling for e-formats for legislation.**** The Sunlight Foundation has been doing great work in pushing for transparency in government, including more consistency in e-formats for legislation.

This is where I think a technical solution (and technical people) can make a difference. We can develop a solution that "just works": showing a redlined version of laws for any bill, accurately showing changes in the U.S. Code as soon as an amendment is enacted, and browsing of legislative history like the MacOS Time Machine. A non-partisan solution that could save money and increase transparency, all at the push of a button. I still wouldn't underestimate the power of inertia, but having an elegant and simple technical solution close at hand will make it much more likely that legislators will make the change.
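To make the "redlined version" idea concrete, here is a minimal sketch using Python's standard difflib module. The two sentences are invented sample text, not real statutory language, and the [-deleted-] / {+inserted+} markers are just one possible notation; a real system would need to work on structured sections rather than bare strings.

```python
import difflib


def redline(old_text: str, new_text: str) -> str:
    """Return a word-level redline of new_text against old_text.

    Deleted words are wrapped as [-...-] and inserted words as {+...+}.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(old_words[i1:i2])
            continue
        if i1 != i2:  # words removed (or replaced) in the old text
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j1 != j2:  # words added (or substituted) in the new text
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


# Hypothetical "current law" and "as amended" text, for illustration only.
current = "The credit shall not exceed $500 per taxable year."
amended = "The credit shall not exceed $1,000 per taxable year."
print(redline(current, amended))
```

Keeping every enacted version of a section and running a comparison like this between any two of them is, roughly, what version control would offer for legislation.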

*By "technical" I assume the question refers to the algorithm that would actually be used to implement version control, and "non-technical", I assume, means the political or historical resistance to change.

**Legislators, and the legal community as a whole, have yet to make the transition from print-centered formatting to electronic. Legal documents--even if originated and consumed electronically--are still formatted as if destined primarily for print.

***The U.S. Code is a compilation of U.S. Federal laws into 50 Titles, divided by subject area.

****I highlight this letter, and some of the technical challenges to converting legislation into a version-control friendly format, on my blog:


Thursday, June 23, 2011

California Laws Android App in 5 Minutes

I've wrapped the California Laws website in an Android application for even faster access from your mobile phone.

[UPDATE: I have shut down the live CA Laws demo website; provides the internal hyperlinks that I had built into my site, and is kept up-to-date. The Android app is also not working now.] Download the new California Laws app here for free, test it out and let me know what you think.  To install, you need to download directly to your Android device and open from the System tray.


In theory, going from a mobile-friendly website to a web application for Android, iPhone or iPad should be relatively simple.  And in practice, it now is, within limits.  I used a new web-based service to create this "version 1" California codes Android application. The website that I used to make this app, Appgeyser, is a bit slower than I'd like for normal page rendering, but otherwise they offer an impressive service. I have no relation to Appgeyser, but this looks like the fastest way to go from website to application, and is a good way to test how your site would look as an application.

A few downsides which can be cured in future versions:
- Tables of contents require scrolling across the screen
- Appgeyser puts an ad at the bottom of the application for their service

Wednesday, June 22, 2011

California Laws Go Mobile, With Headings

You can now browse California legislation from your mobile device at This is an extension of the work, described in these posts, to parse and display California's laws for more user-friendly navigation. I implemented this as a simple web application, using mobile-targeted style sheets, not (yet) a native application. On the devices I've tested, though, it's fast and convenient. Let me know how it works on your iPad, iPhone, Android, Blackberry, or other mobile toy device.

I also implemented an idea by Jason Wilson, and seconded (or at least retweeted) by Robert Richards, to add headings to California law sections, to help provide context. Wilson, of Jones McClure, a legal publisher, has given a lot of thought to legal technology and has many interesting ideas on how to make legal technology better. His suggestion on the California Laws site is just the kind of exchange I was hoping to generate. If you have ideas or suggestions to improve navigation of California's laws (or the Internal Revenue Code), let me know on twitter (@arihersh or @tabulaw) or in comments below.
Or make the changes yourself and send me a pull request on Github, where I've put the code for the California Law website. In some ways, using online collaborative technologies like Github brings things full circle for law: lawyers have been doing "open source" collaboration for millennia, taking branches and merging Biblical laws, Hammurabi's Code, the Roman Code and others to create new laws. I'd love to see Jones McClure and other legal publishers join in an effort to provide a truly open source repository of primary laws and court opinions, upon which secondary content and proprietary analysis tools can be built.

In future posts I will flesh out details of how this could work, in the context of California law and in open sourcing the Internal Revenue Code.

Friday, June 17, 2011

Open Source the Tax Code

This week, the U.S. government released a major update to the online version of the Tax Code. For some reason this didn't make headlines.
Here, and in future posts, I will discuss why the text of the Tax Code needs to be "open sourced", and how we're approaching this challenge at Tabulaw with The work to introduce structural metadata to California's laws (ongoing, now open sourced on Github, and available at, was a warm-up for this discussion.

This week's update of the Tax Code illustrates the challenge ahead: The update, by the Law Revision Counsel, a small, dedicated office of Congress, incorporates all of the changes that Congress has made to the Code from 2006 through the end of 2010. The Internal Revenue Code is the Federal law that arguably has the greatest impact on the lives of most Americans. And for the time being, the public has an up-to-date version* of this law.

The impact of the nearly 10,000 sections of this law is one reason that President Obama emphasized the need to simplify the Tax Code in his State of the Union Address, saying, "It makes no sense, and it has to change."

Yet this lack of clarity is itself a major impediment to change. If and when Congress takes up the battle over what the tax code should say, we will need as much clarity as possible about what current tax law actually says. The effort will raise a fierce debate about important issues of tax policy, fairness and the future of this country. However, these issues become clouded by a mire of laws, regulations and guidance that even leading experts (and the IRS) struggle to understand and explain. Technology cannot cut through all of the fog, but there are non-partisan, technical solutions that can help make the task easier. An Open Government bill introduced today by Representative Darrell Issa (R), includes an important open data provision that would impact IRS (and other) agency rulings and guidance. I believe that open-sourcing the law itself is a natural corollary to this bill.

By open sourcing, I mean to:
  1. Introduce meaningful metadata into the text.
  2. Parse or draft new tax-related bills so that they can be:
    1. instantly compared to existing laws and, when passed,
    2. used to immediately update a public, online version of the new law.
  3. Create a platform that experts and professionals can use to research, debate and explain the law.
The first two principles are essentially a subset of common-sense "open data" principles such as these from the Sunlight Foundation or these from the initiative. The third is a focus of our work at Tabulaw to improve online tools for legal professionals (more on this soon). We are at an exciting time for initiatives to reinvent participation in government (e.g. PopVox, OpenCongress, Sunlight Foundation's OpenStates etc.). I believe that, especially with respect to the Internal Revenue Code, there is much groundwork that needs to be done by the professional tax community--in clarifying, explaining and simplifying the Code--in order to make public participation in the policy-making process more meaningful.
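As a rough illustration of the first two principles, a section of law could be stored as a small structured record that carries its own metadata and amendment history. Everything in this sketch is hypothetical: the Section class, its field names and the sample text are mine, not a format used by Congress or by Tabulaw.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """One section of a code, with minimal structural metadata (hypothetical schema)."""
    title: int                      # e.g. 26 for the Internal Revenue Code
    number: str                     # section numbers can include letters, e.g. "139D"
    text: str
    history: List[Tuple[str, str]] = field(default_factory=list)

    def citation(self) -> str:
        return f"{self.title} U.S.C. {self.number}"

    def amend(self, new_text: str, source: str) -> None:
        """Replace the text, keeping the prior version and the amending source."""
        self.history.append((source, self.text))
        self.text = new_text


# Placeholder text and a placeholder bill name, for illustration only.
section = Section(title=26, number="139D", text="Original text of the section.")
section.amend("Amended text of the section.", source="hypothetical amending bill")
print(section.citation(), "has", len(section.history), "prior version(s)")
```

With records like this, comparing a bill to existing law and publishing the updated text become operations on data rather than manual editing jobs.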

*The LRC version is up-to-date through January 2011.

Thursday, June 9, 2011

Free Advice to Congress: 5 Better Uses for the Internet

As the Weiner scandal reminds us, Congress doesn't quite have the hang of this internet thing. So I take the liberty here to provide 5 suggestions of better things Congressmen could be doing with their access to the web and our tax dollars:

Monday, June 6, 2011

California Laws: Now with search

I've made some improvements to, which has all of California's legal codes with internal links for easy navigation of the laws. It now also has a fast search engine, powered by Sphinx.

Know anyone who works with California state law? Pass this on to them. Anyone in the legislature? They might want to replace the aging

Wednesday, June 1, 2011

Cleaning Up California Law: Errors in online sections

I found more than a thousand errors in the course of parsing the online version of California's legal codes. At first, I thought there might be something wrong with my parsing algorithms -- I had, indeed, gone through a number of rounds of bug-fixing. These repeated sections were carried over to the site I've published ( Having parsed the sections, it would take just a few minutes to clean up the duplicates, but just to make sure I looked back at the California legislature's website.

When I looked at the original data on the California legislature's website, I saw the sections repeated verbatim. I've collected the 1,368 repeated sections (about 2%), and most look like errors in California's original conversion from print to electronic document.
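For anyone who wants to run a similar check, here is a minimal sketch of one way to detect verbatim repeats. It is not the script I used; it assumes the sections have already been parsed into (location, section number, text) tuples, and the function name and sample data are made up for illustration.

```python
from collections import defaultdict


def find_repeated_sections(sections):
    """Group parsed sections that repeat verbatim.

    `sections` is an iterable of (location, section_number, text) tuples.
    Two entries count as a repeat when the section number and the text
    (after collapsing whitespace) are identical.
    Returns a list of (section_number, [locations]) for each repeat.
    """
    seen = defaultdict(list)
    for location, number, text in sections:
        normalized = " ".join(text.split())
        seen[(number, normalized)].append(location)
    return [(number, locations)
            for (number, _), locations in seen.items()
            if len(locations) > 1]


# Made-up sample data standing in for parsed files.
sample = [
    ("file_a", "100.", "The  writ of mandamus"),
    ("file_b", "100.", "The writ of mandamus"),
    ("file_b", "101.", "Other text"),
]
print(find_repeated_sections(sample))
```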

Want to see for yourself? Check out these sample sections:

There were also printer's errors that apparently crept in during the conversion from print to electronic format. For example:

Ý1084.] Section Ten Hundred and Eighty-four. The writ of mandamus

may be denominated a writ of mandate.
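Stray characters like the one at the start of that section can be flagged automatically. The sketch below assumes the statute text is expected to be plain printable ASCII, which is only a heuristic: legitimate accented characters (and tabs) would be flagged too and would need a manual look.

```python
import re

# Anything outside the printable ASCII range (space through tilde).
STRAY = re.compile(r"[^\x20-\x7E]")


def flag_stray_characters(lines):
    """Return (line_number, [characters]) for each line with unexpected characters."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        found = STRAY.findall(line.rstrip("\n"))
        if found:
            hits.append((line_number, found))
    return hits


sample = [
    "Ý1084.] Section Ten Hundred and Eighty-four. The writ of mandamus\n",
    "may be denominated a writ of mandate.\n",
]
print(flag_stray_characters(sample))
```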

Do any of these errors cause confusion about what the law is? Maybe not, but they make navigating the law that much more confusing. With almost all legal research now being done electronically, I think it's reasonable to expect official government electronic sources that can be relied upon.

Friday, May 27, 2011

CA Legislation New site

CA legislation transformed (w. new website--check it out!) If you want to skip the more technical post below altogether, just go to It has no styling or search function yet, but compare the navigational flow to California's official legislative site:

How to Convert All Files in a Directory: CA Legislation

Starting with the unstructured data in California's legislation, it takes many steps to add structure to a single Section. Or rather, to add back in the metadata that the Section's original drafters intended, to help a reader understand and navigate the law. The next step is to apply the transformations to all of the Sections in the law.
California helpfully makes all of its codes available for FTP download in a set of nested folders. It would be great if more government agencies made their data available in bulk. But we still have a problem: how to recursively iterate through all the files and folders in the directory (29 folders, 50,000 sections in total) and apply the parsing transformations to each file. Each file consists of a (variable) number of sections, e.g. here.

For this task, I went back to another old Linux utility: find. If you type "find /" from a command prompt in Linux (also MacOS), you get a list of all of the files and folders on your computer. Don't do this. It will take a long time, and is not really useful for anything. But you can use this powerful command within a single directory, and send the list of file names to a program that will operate on each one. In this case, I wrapped this all in a Python program, using the Popen() function to run any Linux commands that I wanted. Gory details below the fold.
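The general shape of that approach, as a simplified sketch rather than the actual program: run find through subprocess.Popen, then apply a transformation to every file it lists. The function names, the mirrored output directory and the transform argument are placeholders of mine, and the sketch assumes file names contain no newlines.

```python
import subprocess
from pathlib import Path


def find_files(root):
    """List every regular file under `root` using the system `find` command."""
    proc = subprocess.Popen(
        ["find", str(root), "-type", "f"],
        stdout=subprocess.PIPE,
        text=True,
    )
    output, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"find exited with status {proc.returncode}")
    return sorted(line for line in output.splitlines() if line)


def transform_tree(src_root, dst_root, transform):
    """Apply `transform` (text -> text) to each file, mirroring the folder layout."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    count = 0
    for name in find_files(src_root):
        src = Path(name)
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        text = src.read_text(encoding="utf-8", errors="replace")
        dst.write_text(transform(text), encoding="utf-8")
        count += 1
    return count
```

Calling transform_tree("codes", "codes_html", some_function) would then write a transformed copy of every file under "codes" (both directory names are hypothetical) without touching the originals.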
CA Codes After

If you want to skip the details and go straight to the results, I've put the newly transformed California code sections on a website ( Currently, the design is very simple and has no styling, whatsoever. But I welcome you to do a before and after comparison and let me know what you think in the comments.

In my view, converting CA Legislation to structured data makes navigating the code much easier. It also reveals some problems with the version on California's website-- repeated sections, stray text markings--that should probably be cleaned up. More about these anomalies, and the brave new world that structured data can bring to law, in future posts.

Tuesday, May 24, 2011

How to Convert Citations to Hyperlinks: CA Laws

Steps 3 and 4 in converting the California legal Codes to structured HTML involve identifying references within the text (e.g. "pursuant to Section 480" or "under Section 15000 of the Vehicles Code"). This presents two challenges: (1) identifying the correct Code (the high level subject matter of the law), and (2) identifying the section in that Code.
This becomes more complex than it would seem, because California's legislature uses a variety of different forms to refer to other Sections and Codes. The most straightforward is of the form, "Section X of the Y Code". But there are many, many variants. An example:
"pursuant to the provisions of Part 2.5 (commencing with Section 18901) of Division 13 of the Health and Safety Code"
To deal with these variations, I started by identifying all Code references. I used the Linux sed utility to do this and to enclose each Code reference with html tags. This is a simplified version of the RegEx for one Code:
s_Health and Safety Code_<a href="/Code-hsc">Health and Safety Code</a>_
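For readers more comfortable with Python than sed, roughly the same substitution can be sketched as below. The two Code names and their link targets come from the examples in this post; the dictionary and function name are my own, and a full version would list every California Code.

```python
import re

# Code name -> slug used in the link target (only two Codes shown here).
CODE_SLUGS = {
    "Health and Safety Code": "hsc",
    "Government Code": "gov",
}

# Allow any whitespace (including a line break) between the words of a name.
CODE_PATTERN = re.compile(
    "|".join(r"\s+".join(re.escape(word) for word in name.split())
             for name in CODE_SLUGS)
)


def link_code_names(text):
    """Wrap each recognized Code name in a link to that Code."""
    def replace(match):
        name = " ".join(match.group(0).split())  # normalize whitespace for lookup
        return f'<a href="/Code-{CODE_SLUGS[name]}">{match.group(0)}</a>'
    return CODE_PATTERN.sub(replace, text)


print(link_code_names("of Division 13 of the Health and Safety Code"))
```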
To identify the Section number(s), I compiled a list of the most common forms of reference, and created a regular expression for each. There is an additional problem, though: many of the references contain several subreferences and cover more than one line of the text:
pursuant to Chapter 3.5 (commencing with Section
11340), Chapter 4 (commencing with Section 11370), or Chapter 5
(commencing with Section 11500), of Part 1 of Division 3 of Title 2
of the Government Code
Hmm. A worthy challenge.
The Chapter, Part, Division and Title references do not seem to add any independent information for our purposes. So I look for, and skip over, anything of the form [Part OR Division OR Title] [number] of [Part OR Division OR Title]...
Now we have:
11340),...Section 11370),...
...Section 11500),...
of the <a href="/Code-gov">Government Code</a>
With the Code reference previously identified, we can now focus on finding the various Section references and associating them with the right Code. I go into a bit more technical detail on this after the fold and in the next post on how I put it all together to run through all of the Code sections (18k files; 50k sections) in one sitting.
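A simplified sketch of that last step, under some loud assumptions: the reference has already been isolated as one passage, it names exactly one Code, and section links take the form /Code-gov#sec-11340. (that URL shape is my guess at combining the Code links and section ids shown in these posts, not a documented format). Only numbers preceded by the word "Section" are picked up, so Chapter, Part, Division and Title numbers are skipped.

```python
import re

SECTION_RE = re.compile(r"Section\s+(\d+(?:\.\d+)?)")


def section_links(passage, code_slugs):
    """Return link targets for every 'Section N' in a multi-line reference.

    `code_slugs` maps Code names to slugs, e.g. {"Government Code": "gov"}.
    """
    flat = " ".join(passage.split())  # undo line breaks inside the reference
    slug = next((s for name, s in code_slugs.items() if name in flat), None)
    if slug is None:
        return []
    return [f"/Code-{slug}#sec-{number}." for number in SECTION_RE.findall(flat)]


passage = """pursuant to Chapter 3.5 (commencing with Section
11340), Chapter 4 (commencing with Section 11370), or Chapter 5
(commencing with Section 11500), of Part 1 of Division 3 of Title 2
of the Government Code"""
print(section_links(passage, {"Government Code": "gov"}))
```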

Monday, May 23, 2011

How to: Convert Sections Into Hyperlink Targets

How to find section headings in a text document and convert them to targets for hyperlinks?

If you have ever had this burning question, you'll want to read on. Or you can take my word for it that it would have been better for this information to be included in the documents when they were originally published.

This post describes Step 2 of 5 to convert California statutes to structured html: Identify section, subsection and subdivision headings. To do this, I am using an old (1970s) Unix program called "sed" (stream editor).

There are lots of ways to do this using more modern programming languages, but sed has the advantages that it is VERY fast and that it has the operations of opening, editing and closing a file built in. It's basically a "find and replace" function on steroids, without the need for Congressional hearings. I must admit that once I got the hang of sed, and its improved cousin, "Super Sed", it was pretty addictive: with one command, you can change all capital letters in a document to lower case, or replace all vowels with a *, or mark all numbers and letters at the beginning of a paragraph as section and subsection headings. Sed goes through a file one line at a time and makes these substitutions. It is quite powerful, and there are a number of other things you can do with it, operating one line at a time through a text. If this sounds like fun to you, look here for a good tutorial.
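To give a feel for the line-at-a-time model without installing anything, here is a tiny imitation in Python. It is only a sketch of the idea, not a replacement for sed: the stream_edit function and the rule list are invented for this example.

```python
import io
import re


def stream_edit(lines, rules):
    """Apply each (pattern, replacement) rule, in order, to every line.

    `replacement` may be a string or a function, as accepted by re.sub.
    Lines are processed one at a time, the way a stream editor works.
    """
    compiled = [(re.compile(pattern), replacement) for pattern, replacement in rules]
    for line in lines:
        for pattern, replacement in compiled:
            line = pattern.sub(replacement, line)
        yield line


# One rule: replace every vowel with an asterisk.
rules = [(r"[aeiouAEIOU]", "*")]
source = io.StringIO("The writ of mandamus\nmay be denominated a writ of mandate.\n")
print("".join(stream_edit(source, rules)), end="")
```

Passing a function as the replacement, for example lambda m: m.group(0).lower() with the pattern [A-Z], gives the capitals-to-lower-case trick.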

I was working with California state statutes, which I had earlier converted to html. Fortunately, the statute text has a very regular structure: sections, subdivisions and other levels of the document were marked at the beginning of lines, with consistent spacing setting them apart.

So to find the section headings, I just needed to create a set of rules (using RegEx), that describe each kind of section heading. California statutes use headings with the following levels:

100.1 (a)
100.1 (a) (1)

So I needed to describe each of these section headings in a way that they could be identified and separated from any other numbers and letters that are found within the statutes. Here's an example of a rule that does this:

s_^<p>([1-9]\d*)\._<p><span class="section level1" id="sec-\1\.">\1\.<\/span>_

It looks gory, but is actually pretty tame. In essence, it says to substitute (s_) any number at the beginning of a line (^) and beginning of a paragraph (<p>) with a label (<span>) that will identify this number as a section heading. Each kind of heading requires another rule to describe it, and then all of these rules are applied to the file using the ssed (Super Sed) command. The result converts a section heading like this:

<p>15210. Notwithstanding any other provision of this code, as used in

to something like this:

<p><span class="section level1" id="sec-15210.">15210.</span> Notwithstanding any other provision of this code, as used in

Not rocket science, but one step closer to structured data. The <span> will allow us to separate out this section from the rest of the text in order, for example, to link to this section from another section that references it.
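The same top-level rule can be written in Python, which may be easier to experiment with. This sketch covers only the first heading level and mirrors the sed rule above (including its "sec-" id prefix); the function name is mine, and the other heading levels would each need their own pattern.

```python
import re

# A number followed by a period, at the start of a line that begins a paragraph.
SECTION_HEADING = re.compile(r"^<p>([1-9]\d*)\.", re.MULTILINE)


def mark_section_headings(html):
    """Wrap top-level section numbers in a span that can serve as a link target."""
    return SECTION_HEADING.sub(
        r'<p><span class="section level1" id="sec-\1.">\1.</span>', html
    )


line = "<p>15210. Notwithstanding any other provision of this code, as used in"
print(mark_section_headings(line))
```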

The next step is to find all of the references to other sections that are found inside the statute text and to place links from those references to the sections they refer to. Unfortunately, because those references may cross over more than one line, it is harder to use a line-by-line editor such as sed to do the job. For this, I put together a short search and replace program in the Python programming language, which is more flexible and has a lot of tools for working with text. That will be step 3 in the 5 step process, for a future post.

As I mentioned earlier, I will be publishing the final scripts on Github, and will be publishing the hyperlinked version of California legislative information. And hopefully this can inspire California's legislature to publish the statutes in a structured data format to begin with, which can be combined with the OpenStates data to make it easier to see the changes that would be made by any proposed legislation.