Steps 3 and 4 in converting the California legal Codes to structured HTML involve identifying references within the text (e.g., "pursuant to Section 480" or "under Section 15000 of the Vehicle Code"). This presents two challenges: (1) identifying the correct Code (the high-level subject matter of the law), and (2) identifying the section in that Code.
This becomes more complex than it would seem, because California's legislature uses a variety of different forms to refer to other Sections and Codes. The most straightforward is of the form, "Section X of the Y Code". But there are many, many variants. An example:
"pursuant to the provisions of Part 2.5 (commencing with Section 18901) of Division 13 of the Health and Safety Code"
To deal with these variations, I started by identifying all Code references. I used the Linux sed utility to do this and to enclose each Code reference in HTML tags. This is a simplified version of the sed substitution for one Code:
s_Health and Safety Code_<a href="/Code-hsc">Health and Safety Code</a>_
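For readers more at home in Python than sed, the same substitution can be sketched with the re module. This is only an illustration, not the script I ran: the CODE_SLUGS table and the tag_codes function are hypothetical names, and only the "hsc" and "gov" slugs come from the examples in this post.

```python
import re

# Hypothetical lookup table: Code name -> URL slug.
# Only "hsc" and "gov" appear in this post; other Codes would be added here.
CODE_SLUGS = {
    "Health and Safety Code": "hsc",
    "Government Code": "gov",
}

# Try longer names first so a short name never shadows a longer one.
_names = sorted(CODE_SLUGS, key=len, reverse=True)
CODE_RE = re.compile("|".join(re.escape(name) for name in _names))


def tag_codes(text):
    """Wrap every known Code name in an anchor tag, like the sed rule above."""
    def repl(match):
        name = match.group(0)
        return '<a href="/Code-%s">%s</a>' % (CODE_SLUGS[name], name)
    return CODE_RE.sub(repl, text)


if __name__ == "__main__":
    print(tag_codes("of Division 13 of the Health and Safety Code"))
```

Like the sed version, this sketch does not handle a Code name that is itself broken across two lines.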
To identify the Section number(s), I compiled a list of the most common forms of reference, and created a regular expression for each. There is an additional problem, though: many of the references contain several subreferences and span more than one line of the text:
pursuant to Chapter 3.5 (commencing with Section
11340), Chapter 4 (commencing with Section 11370), or Chapter 5
(commencing with Section 11500), of Part 1 of Division 3 of Title 2
of the Government Code
Hmm. A worthy challenge.
The Chapter, Part, Division and Title references do not seem to add any independent information for our purposes. So I look for, and skip over, anything of the form [Part OR Division OR Title] [number] of [Part OR Division OR Title]...
Now we have:
Section
11340),...Section 11370),...
...Section 11500),...
of the <a href="/Code-gov">Government Code</a>
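The skipping step can be sketched in Python roughly as follows. This is a minimal sketch under my own assumptions, not the exact expression I used: HIERARCHY_RE and skip_hierarchy are hypothetical names, and the pattern only covers the "[Part OR Division OR Title] [number] of" form described above (Chapter could be added to the alternation in the same way).

```python
import re

# One or more units such as "Part 1 of", "Division 3 of", "Title 2 of".
# Numbers may have a decimal part ("Part 2.5"). \s+ also matches newlines,
# so a chain that wraps onto the next line is still caught.
HIERARCHY_RE = re.compile(
    r"(?:\b(?:Part|Division|Title)\s+\d+(?:\.\d+)?\s+of\s+)+"
)


def skip_hierarchy(text):
    """Drop 'Part N of Division N of ...' chains, keeping what follows."""
    return HIERARCHY_RE.sub("", text)


if __name__ == "__main__":
    sample = (
        "(commencing with Section 11500), of Part 1 of Division 3 of Title 2\n"
        "of the Government Code"
    )
    print(skip_hierarchy(sample))
```

A unit that is followed by a parenthetical rather than "of", such as "Part 2.5 (commencing with Section 18901)", is left alone, so the Section number inside the parentheses survives.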
With the Code reference already identified, we can now focus on finding the various Section references and associating them with the right Code. I go into a bit more technical detail on this after the fold, and in the next post I describe how I put it all together to run through all of the Code sections (18k files; 50k sections) in one sitting.
Dealing with multi-line references using the one-line-at-a-time editing method inherent to sed is beyond my (limited) ability to bend Linux to my will. Time to pull out a more modern (though slower) programming language.
I placed all of the regular expressions that identify Section numbers into a Python script, using Python's re module. Though this allowed me to look many lines ahead to find the Code reference that goes with each Section reference, the approach has a downside. With sed, I was able to put together a sequence of more than a dozen expressions and run them together, line by line, on each file. That way, I only needed to scan each file once. My Python script, on the other hand, reads through the entire file for each and every kind of Section reference to perform its "find and replace" function. Not very efficient, but this may be a necessary evil to capture information that is spread across many lines of a file.
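To give a flavor of the look-ahead, here is a simplified sketch of a single such pass. It assumes the Code names have already been wrapped in anchor tags, and it is not my full script: link_sections, the 300-character window, the default_slug argument and the "/Code-xxx/NNN" URL scheme for Sections are all hypothetical choices made for illustration.

```python
import re

# The anchor produced by the earlier Code-tagging pass.
CODE_ANCHOR_RE = re.compile(r'<a href="/Code-(\w+)">[^<]*</a>')
# "Section" plus a number such as 11340 or 1798.3; \s+ also spans a newline.
SECTION_RE = re.compile(r"\bSection(\s+)(\d+(?:\.\d+)?)")


def link_sections(text, default_slug, window=300):
    """Link each Section number to the next Code anchor found within
    `window` characters after it, falling back to the file's own Code."""
    def repl(match):
        # Match positions refer to the original text, so we can peek ahead,
        # across line breaks, from the end of this Section reference.
        lookahead = text[match.end():match.end() + window]
        code = CODE_ANCHOR_RE.search(lookahead)
        slug = code.group(1) if code else default_slug
        number = match.group(2)
        return 'Section%s<a href="/Code-%s/%s">%s</a>' % (
            match.group(1), slug, number, number)
    return SECTION_RE.sub(repl, text)


if __name__ == "__main__":
    sample = (
        "pursuant to Chapter 3.5 (commencing with Section\n"
        "11340), Chapter 4 (commencing with Section 11370), or Chapter 5\n"
        "(commencing with Section 11500), of Part 1 of Division 3 of Title 2\n"
        'of the <a href="/Code-gov">Government Code</a>'
    )
    print(link_sections(sample, default_slug="bpc"))
```

The fixed window is a crude heuristic: a bare "Section 480" that happens to sit shortly before an unrelated Code reference would be attributed to the wrong Code, which is part of why separate expressions for each form of reference are still needed.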