California Laws: Converting Plain Text to HTML

How to convert legislation from plain text to structured html? For those more interested in results than process, you're in luck. This post focuses primarily on the results of the transformation.

I am working here with California's statutes, published in plain text on the legislature's website. There are many layers of meaning that could be added to the raw text, and as a start, I'm focusing on elements that make reading and navigating the statute easier. In particular:

identifying where sections, subdivisions and other elements start and end and
adding hyperlinks from a reference to the section referenced (adding a hyperlink from references like this: "as defined in Section 203 of the Government Code").

For example, here are sections of the CA Vehicle Code that set out definitions for the sections that follow. Here are the same sections in html, after the transformations described below.

Before (no links)

After (now with links)

Nothing earth-shattering, but for even this level of metadata, it took a number of steps to add the structural information back in to the statutes (see an outline of the process below the fold). After a bit more polishing, I will upload my scripts to Github, in the hopes that my hacks can be improved upon.

For those who want to skip straight to the conclusion, here it is: automated transformations can add back in much of the metadata that is needed to navigate statutes. But the automated methods will not catch all of the relevant information--even all of the relevant references to other primary legal sources. To add the rest of this information into a public domain electronic format will require (a) that governments publish the data in a structured format to begin with, (b) a Wikipedia-like platform for expert crowdsourcing of legal sources, (c) a fundamental change in the current pay model for publishing of legal information or (d) all of the above.

Now to see, in more detail, what was gained from this first layer of transformations of the text.
What works:

Sections (e.g. 15210.) , subdivisions (e.g. 15210(b)) and sub-subdivisions (e.g. 15210(b)(1)) identified.
References to each of the 29 California Codes are linked.
Most references to other Sections are hyperlinked.

What doesn't (yet) work:

I haven't yet posted the linked documents online.
Further subdivisions (e.g. 15210(b)(2)(A)) have not yet been identified in the text.
The parser does not yet recognize some forms of reference to other Sections. E.g. where the reference is set out as a list of three or more: "in the manner described under Section 2800.1, 2800.2 or 2800.3..."
References to separate legislative Acts are not linked (e.g. "the Commercial Motor Vehicle Safety Act")
References outside the CA Codes are not yet linked, e.g. references to U.S. Federal statute or regulations.

To address the points above will require a few more layers of filtering.

This is an outline of the process I used:

Convert from text to html (Perl text2html module)
Identify section, subsection and subdivision headings. (Linux Super Sed utility)
Identify references to external Codes (Linux Super Sed utility)
Identify references to Sections (Python)
Recursively apply steps #1-4 above to each file in nested folders to convert all California's Codes. (Linux Find utility)

To choose a tool for each step, my preference was for a ready-made script or program that does as much of the job as possible, while still providing the flexibility to tweak the output as needed. Details of each tool, how I installed and put them all together, in future posts.