How to find section headings in a text document and convert them to targets for hyperlinks?

If you have ever had this burning question, you'll want to read on. Or you can take my word for it that it would have been better for this information to be included in the documents when they were originally published.

This post describes Step 2 of 5 to convert California statutes to structured html: Identify section, subsection and subdivision headings. To do this, I am using an old (1970s) Linux program called "sed" (stream editor).

There are lots of ways to do this using more modern programming languages, but sed has the advantages that it is VERY fast, and it has built in the operations of opening, editing and closing a file. It's basically a "find and replace" function on steroids, without the need for Congressional hearings.I must admit, that once I got the hang of sed, and its improved cousin, "Super Sed", it was pretty addictive: with one command, you can change all capital letters in a document to lower case, or replace all vowels with a *, or mark all numbers and letters at the beginning of a paragraph as section and subsection headings. Sed goes through a file one line at a time and makes these substitutions. Sed is quite powerful and there are actually a number of other things you can do with sed, operating one line at a time through a text. If this sounds like fun to you, look here for a good tutorial.

I was working with California state statutes, which I had earlier converted to html. Fortunately, the statute text has a very regular structure: sections, subdivisions and other levels of the document were marked at the beginning of lines, with consistent spacing setting them apart.

So to find the section headings, I just needed to create a set of rules (using RegEx), that describe each kind of section heading. California statutes use headings with the following levels:

100.1 (a)
100.1 (a) (1)

So I needed to describe each of these section headings in a way that they could be identified and separated from any other numbers and letters that are found within the statutes. Here's an example of a rule that does this:

s_^<p>([1-9]\d*)\._<p><span class="section level1" id="sec-\1\.">\1\.<\/span>_

It looks gory, but is actually pretty tame. In essence, it says to substitute (s_) any number at the beginning of a line (^) and beginning of a paragraph (<p>) with a label (<span>) that will identify this number as a section heading. Each kind of heading requires another rule to describe it, and then all of these rules are applied to the file using the ssed (Super Sed) command. The result converts a section heading like this:

<p>15210. Notwithstanding any other provision of this code, as used in

to something like this:

<p><span class="section level1" id="sec15210.">15210.</span> Notwithstanding any other provision of this code, as used in

Not rocket science, but one step closer to structured data. The <span> will allow us to separate out this section from the rest of the text in order, for example, to link to this section from another section that references it.

The next step is to find all of the references to other sections that are found inside the statute text and to place links from those references to the sections they refer to. Unfortunately, those references may cross over more than one line, it is harder to use a line-by-line editor such as sed to do the job. For this, I put together a short search and replace program in the Python programming language, which is more flexible and has a lot of tools to for working with text. That will be step 3 in the 5 step process, for a future post.

As I mentioned earlier, I will be publishing the final scripts on Github, and will be publishing the hyperlinked version of California legislative information. And hopefully this can inspire California's legislature to publish the statutes in a structured data format to begin with, which can be combined with the OpenStates data to make it easier to see the changes that would be made by any proposed legislation.