Friday, May 27, 2011
CA Legislation New site
How to Convert All Files in a Directory: CA Legislation
For this task, I went back to another old Linux utility: find. If you type "find /" from a command prompt in Linux (also MacOS), you get a list of all of the files and folders on your computer. Don't do this. It will take a long time and is not really useful for anything. But you can use this powerful command within a single directory and send the list of file names to a program that will operate on each one. In this case, I wrapped all of this in a Python program, using the Popen() function from Python's subprocess module to run any Linux commands that I wanted. Gory details below the fold.
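To make the idea concrete, here is a minimal sketch of the pattern described above: Python runs find inside one directory through subprocess.Popen and hands each file name to a function. The names list_files, convert_all and the convert callback are hypothetical, chosen for illustration only; this is not the actual script.

```python
import subprocess


def list_files(directory):
    """Run `find` inside one directory and return the regular files it reports."""
    # -type f limits the listing to regular files (no directories).
    proc = subprocess.Popen(
        ["find", directory, "-type", "f"],
        stdout=subprocess.PIPE,
        text=True,
    )
    output, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"find exited with status {proc.returncode}")
    return sorted(line for line in output.splitlines() if line)


def convert_all(directory, convert):
    """Apply `convert` (a hypothetical per-file function) to every file found."""
    return [convert(path) for path in list_files(directory)]
```

Passing the command as a list, rather than as one shell string, means file names with spaces need no special quoting.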
[Image: CA Codes After]
If you want to skip the details and go straight to the results, I've put the newly transformed California code sections on a website (calaw.tabulaw.com). Currently, the design is very simple and has no styling whatsoever. But I invite you to do a before and after comparison and let me know what you think in the comments.
In my view, converting CA Legislation to structured data makes navigating the code much easier. It also reveals some problems with the version on California's website (repeated sections, stray text markings) that should probably be cleaned up. More about these anomalies, and the brave new world that structured data can bring to law, in future posts.
Tuesday, May 24, 2011
How to Convert Citations to Hyperlinks: CA Laws
"pursuant to the provisions of Part 2.5 (commencing with Section 18901) of Division 13 of the Health and Safety Code"
s_Health and Safety Code_<a href="/Code-hsc">Health and Safety Code</a>_
pursuant to Chapter 3.5 (commencing with Section
11340), Chapter 4 (commencing with Section 11370), or Chapter 5
(commencing with Section 11500), of Part 1 of Division 3 of Title 2
of the Government Code
Section
11340),...Section 11370),...
...Section 11500),...
of the <a href="/Code-gov">Government Code</a>
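The examples above can be sketched in Python. This is a rough illustration, not the actual program: the CODE_SLUGS mapping is an assumption that contains only the two codes shown in this post, and the function names are hypothetical. The key point is that \s+ matches a line break as well as a space, so "Section" and its number can sit on different lines.

```python
import re

# Assumed mapping from code names to URL slugs; only these two appear in the post.
CODE_SLUGS = {
    "Health and Safety Code": "hsc",
    "Government Code": "gov",
}

# \s+ lets "Section" and its number sit on different lines.
SECTION_REF = re.compile(r"Section\s+(\d+(?:\.\d+)?)")


def link_code_names(text):
    """Wrap each known code name in a link, like the substitution rule above."""
    for name, slug in CODE_SLUGS.items():
        # Allow any whitespace (including a line break) between the words.
        pattern = re.compile(r"\s+".join(map(re.escape, name.split())))
        text = pattern.sub(f'<a href="/Code-{slug}">{name}</a>', text)
    return text


def find_section_numbers(text):
    """Return every section number that follows the word "Section"."""
    return SECTION_REF.findall(text)
```

On the Government Code passage quoted above, find_section_numbers picks up 11340 even though it sits on the line after "Section".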
Monday, May 23, 2011
How to: Convert Sections Into Hyperlink Targets
How to find section headings in a text document and convert them to targets for hyperlinks?
If you have ever had this burning question, you'll want to read on. Or you can take my word for it that it would have been better for this information to be included in the documents when they were originally published.
This post describes Step 2 of 5 to convert California statutes to structured html: Identify section, subsection and subdivision headings. To do this, I am using an old (1970s) Unix program called "sed" (stream editor), which is available on any Linux system.
There are lots of ways to do this using more modern programming languages, but sed has the advantages that it is VERY fast, and that opening, editing and closing a file are built in. It's basically a "find and replace" function on steroids, without the need for Congressional hearings.
I must admit that once I got the hang of sed, and its improved cousin, "Super Sed", it was pretty addictive: with one command, you can change all capital letters in a document to lower case, or replace all vowels with a *, or mark all numbers and letters at the beginning of a paragraph as section and subsection headings. Sed goes through a file one line at a time and makes these substitutions. It is quite powerful, and there are a number of other things you can do with it, always operating one line at a time through a text. If this sounds like fun to you, look here for a good tutorial.
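For readers who don't have sed handy, the line-at-a-time idea can be imitated in a few lines of Python. This is only a sketch of the concept, not sed itself; the function names are made up for illustration.

```python
import re


def lowercase(line):
    """Change all capital letters in the line to lower case."""
    return line.lower()


def star_vowels(line):
    """Replace every vowel in the line with a *."""
    return re.sub(r"[aeiouAEIOU]", "*", line)


def stream_edit(lines, *edits):
    """Apply each edit to each line in turn, one line at a time."""
    for line in lines:
        for edit in edits:
            line = edit(line)
        yield line
```

Because stream_edit is a generator, it can be fed an open file object and never needs to hold the whole document in memory, which is part of what makes the line-by-line approach fast.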
I was working with California state statutes, which I had earlier converted to html. Fortunately, the statute text has a very regular structure: sections, subdivisions and other levels of the document were marked at the beginning of lines, with consistent spacing setting them apart.
So to find the section headings, I just needed to create a set of rules (using RegEx) that describe each kind of section heading. California statutes use headings with the following levels:
100
100.1
100.1 (a)
100.1 (a) (1)
So I needed to describe each of these section headings in a way that they could be identified and separated from any other numbers and letters that are found within the statutes. Here's an example of a rule that does this:
s_^<p>([1-9]\d*)\._<p><span class="section level1" id="sec-\1\.">\1\.<\/span>_
It looks gory, but is actually pretty tame. In essence, it says to substitute (s_) any number followed by a period at the beginning of a line (^) and beginning of a paragraph (<p>) with a label (<span>) that identifies this number as a section heading. Each kind of heading requires its own rule, and all of these rules are then applied to the file using the ssed (Super Sed) command. The result converts a section heading like this:
<p>15210. Notwithstanding any other provision of this code, as used in
to something like this:
<p><span class="section level1" id="sec-15210.">15210.</span> Notwithstanding any other provision of this code, as used in
Not rocket science, but one step closer to structured data. The <span> will allow us to separate out this section from the rest of the text in order, for example, to link to this section from another section that references it.
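For comparison, here is roughly what the same rule looks like in Python's re module. It is a sketch of the top-level rule only, with a hypothetical function name; the id format follows the sed rule as written above, and the deeper heading levels would each need their own pattern.

```python
import re

# Same idea as the sed rule: a number followed by a period, right after a <p>
# at the start of a line. re.MULTILINE makes ^ match at every line start.
SECTION_HEADING = re.compile(r"^<p>([1-9]\d*)\.", re.MULTILINE)


def mark_sections(html):
    """Wrap each top-level section number in a span that can be a link target."""
    return SECTION_HEADING.sub(
        r'<p><span class="section level1" id="sec-\1.">\1.</span>', html
    )
```

Numbers that appear in the middle of a line, or paragraphs that start with a word, are left alone because the pattern is anchored to the start of the line.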
The next step is to find all of the references to other sections inside the statute text and to place links from those references to the sections they refer to. Unfortunately, because those references may span more than one line, it is harder to use a line-by-line editor such as sed to do the job. For this, I put together a short search and replace program in the Python programming language, which is more flexible and has a lot of tools for working with text. That will be step 3 in the 5 step process, for a future post.
As I mentioned earlier, I will be publishing the final scripts on GitHub, along with the hyperlinked version of California legislative information. Hopefully this can inspire California's legislature to publish the statutes in a structured data format to begin with, which can be combined with the OpenStates data to make it easier to see the changes that would be made by any proposed legislation.
Wednesday, May 18, 2011
How to convert Text to HTML: Using txt2html Perl Module
txt2html --explicitheadings --indentparbreak --maketables --make_anchors --xhtml --outfile /path/to/file.html /path/to/file
Before:
15210. Notwithstanding any other provision of this code, as used in
this chapter, the following terms have the following meanings:
(a) "Commercial driver's license" means a driver's license issued
by a state or other jurisdiction, in accordance with the standards
contained in Part 383 of Title 49 of the Code of Federal Regulations,
which authorizes the licenseholder to operate a class or type of
commercial motor vehicle.
(b) (1) "Commercial motor vehicle" means any vehicle or
After:
<p>15210. Notwithstanding any other provision of this code, as used in this chapter, the following terms have the following meanings:
<br/> (a) "Commercial driver's license" means a driver's license issued
by a state or other jurisdiction, in accordance with the standards
contained in Part 383 of Title 49 of the Code of Federal Regulations, which authorizes the licenseholder to operate a class or type of
commercial motor vehicle.
<br/> (b) (1) "Commercial motor vehicle" means any vehicle or
Monday, May 16, 2011
California Laws: Converting Plain Text to HTML
- identifying where sections, subdivisions and other elements start and end and
- adding hyperlinks from a reference to the section referenced (adding a hyperlink from references like this: "as defined in Section 203 of the Government Code").
[Image: Before (no links)]
[Image: After (now with links)]
Nothing earth-shattering, but even for this level of metadata, it took a number of steps to add the structural information back into the statutes (see an outline of the process below the fold). After a bit more polishing, I will upload my scripts to GitHub, in the hopes that my hacks can be improved upon.
For those who want to skip straight to the conclusion, here it is: automated transformations can add back in much of the metadata that is needed to navigate statutes. But the automated methods will not catch all of the relevant information, or even all of the relevant references to other primary legal sources. To add the rest of this information into a public domain electronic format will require (a) that governments publish the data in a structured format to begin with, (b) a Wikipedia-like platform for expert crowdsourcing of legal sources, (c) a fundamental change in the current pay model for publishing of legal information, or (d) all of the above.
What works:
- Sections (e.g. 15210.), subdivisions (e.g. 15210(b)) and sub-subdivisions (e.g. 15210(b)(1)) identified.
- References to each of the 29 California Codes are linked.
- Most references to other Sections are hyperlinked.
What doesn't work yet:
- I haven't yet posted the linked documents online.
- Further subdivisions (e.g. 15210(b)(2)(A)) have not yet been identified in the text.
- The parser does not yet recognize some forms of reference to other Sections. E.g. where the reference is set out as a list of three or more: "in the manner described under Section 2800.1, 2800.2 or 2800.3..."
- References to separate legislative Acts are not linked (e.g. "the Commercial Motor Vehicle Safety Act")
- References outside the CA Codes are not yet linked, e.g. references to U.S. Federal statute or regulations.
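One of the gaps above, lists of sections such as "Section 2800.1, 2800.2 or 2800.3", could probably be handled with a pattern along the following lines. This is an untested-on-real-statutes sketch with hypothetical names, not part of the current parser, and it would also pick up an unrelated number that happens to follow a section number and a comma.

```python
import re

NUM = r"\d+(?:\.\d+)?"
# "Section" (or "Sections") followed by one number, then any further numbers
# joined by commas and/or the words "or" / "and".
SECTION_LIST = re.compile(
    rf"Sections?\s+({NUM}(?:(?:\s*,\s*(?:(?:or|and)\s+)?|\s+(?:or|and)\s+){NUM})*)"
)


def section_numbers(text):
    """Return every section number in each "Section A, B or C" style reference."""
    found = []
    for match in SECTION_LIST.finditer(text):
        found.extend(re.findall(NUM, match.group(1)))
    return found
```

A single reference followed by a comma and ordinary words, as in "Section 12804.9, which has been issued", still yields just the one number.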
Friday, May 13, 2011
California Law: Recovering Meaning and Metadata with RegEx

(b) Any person entitled to the exemption contained in subdivision (a), while operating, within this state, a commercial vehicle, as
defined in subdivision (b) of Section 15210, shall have in his or her
possession a current medical certificate of a type described in
subdivision (c) of Section 12804.9, which has been issued within two
years of the date of operation of that vehicle.
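As an illustration of the kind of RegEx involved, here is a small Python sketch that finds references of the form "subdivision (b) of Section 15210" in a passage like the one above and wraps them in links. The function name and the "#sec-" anchor format are assumptions made up for this example.

```python
import re

# "subdivision (b) of Section 15210", allowing line breaks between the words.
SUBDIV_REF = re.compile(
    r"subdivision\s+\(([a-z])\)\s+of\s+Section\s+(\d+(?:\.\d+)?)"
)


def link_subdivision_refs(text):
    """Wrap each "subdivision (x) of Section N" reference in a hyperlink."""
    def to_link(match):
        letter, section = match.groups()
        # The "#sec-" anchor format is an assumption for illustration.
        return f'<a href="#sec-{section}-{letter}">{match.group(0)}</a>'
    return SUBDIV_REF.sub(to_link, text)
```

The bare "subdivision (a)" in the passage is not linked, since it does not name a section and would need a separate rule that resolves it against the current section.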
Friday, May 6, 2011
Better Access to State Legislatures: Sunlight Foundation's Open States Project

Congratulations to Sunlight and the Open States team on this milestone!
Wednesday, May 4, 2011
Better Access to Court Opinions: GPO Announces Pilot
Court opinions are already available from the courts' websites, and -- for those with an access account -- from PACER, the Federal courts' electronic filing system. The difference now, presumably, is that GPO will introduce some uniformity to the electronic format for published court opinions.
That is a good thing. Even better would be for these opinions to include metadata about the document structure. As I discussed yesterday, most court opinions today are published online in pdf format, which scrambles much of the information about document structure and loses much of the value of publishing in electronic form.
Tuesday, May 3, 2011
Losing Data in PDF: All the King's Sources
- Find the Supreme Court opinion in AT&T Mobility v. Concepcion (hint: look here), the recent case on contracts that block class action suits,
- Find all of the (nearly 30) briefs that were submitted to the Court (hint: look here), and
- Determine which arguments from the briefs were discussed in the Court opinion.
You can see that by trying to convert a pdf to text or to web format (html). Google Documents has a nice feature that does this, and here is Google's web-converted version of the opinion above (AT&T Mobility v. Concepcion). The way that Google presents the converted document shows the original pdf image of each page, followed by the converted version. A few items jump out from the first page of converted text. Words that were divided at the end of a line in the original, e.g. 'uncon- scionable', are still broken even in the middle of the paragraph. Text formatting, such as italics for case citations, is gone, and formatting of some paragraphs has been significantly disrupted: the top paragraph on page 3 is right-justified in the Google Docs version. Even more problematic is the title section, where the names of the Supreme Court justices are broken up:
Key information about the case (who joined which opinions) has been lost. This information can be recovered in a variety of ways, including by manually coding the vote of each justice in the case, but how wasteful, considering that all of that information was available in the original (electronic) version of the document. In fact, the original sets out the Justices' names in all caps to set them apart visually:
Ironically, the Court's extra effort to provide a distinctive visual layout that highlights the Justices' names actually breaks Google's algorithm for parsing the pdf text. With a little bit of forethought, the Court could preserve both the layout and the key structural information, to make its opinions more accessible to the general public, as well as to meet Federal government accessibility standards. (Though these standards are not directly binding on the courts, which is another sad irony.)
So, for now, we have the technical challenge of converting pdfs to structured text, which is tough enough. Google Documents misses many of the most important text features. In another post, I will discuss other (imperfect) options for pdf to text conversion, including the pdftotext and pdftohtml programs, and the open source Apache PDFBox and Tika projects.
But for a lawyer, or anyone who cares about the "official" or binding version of the court opinion, the problem goes beyond the encoding of the pdf opinion that the court publishes on its website. As the Court website explains, there are six different versions of opinions published by the Court.
Prior to the issuance of (1) bound volumes of the U.S. Reports, the Court's official decisions appear in three temporary printed forms: (2) bench opinions (which are transmitted electronically to subscribers over the Court's Project Hermes service); (3) slip opinions (which are posted on this website); and (4) preliminary prints.