Starting with the unstructured data in California's legislation, it takes many steps to add structure to a single Section. Or rather, to add back in the metadata that the Section's original drafters intended, to help a reader understand and navigate the law. The next step is to apply the transformations to all of the Sections in the law.
California helpfully makes all of its codes available for FTP download in a set of nested folders. It would be great if more government agencies made their data available in bulk. But we still have a problem: how to recursively iterate through all the files and folders in the directory (29 folders, 50,000 sections in total) and apply the parsing transformations to each file. Each file consists of a variable number of sections, e.g. here.
For this task, I went back to another old Linux utility: find. If you type "find /" at a command prompt in Linux (or macOS), you get a list of all of the files and folders on your computer. Don't do this. It will take a long time, and is not really useful for anything. But you can use this powerful command within a single directory, and send the list of file names to a program that will operate on each one. In this case, I wrapped this all in a Python program, using the Popen() function from the subprocess module to run any Linux commands that I wanted. Gory details below the fold.
[Image: CA Codes After]
If you want to skip the details and go straight to the results, I've put the newly transformed California code sections on a website (calaw.tabulaw.com). Currently, the design is very simple and has no styling whatsoever. But I invite you to do a before-and-after comparison and let me know what you think in the comments.
In my view, converting CA Legislation to structured data makes navigating the code much easier. It also reveals some problems with the version on California's website (repeated sections, stray text markings) that should probably be cleaned up. More about these anomalies, and the brave new world that structured data can bring to law, in future posts.
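import os
import shlex
from subprocess import Popen, PIPE

# List everything under the target directory, one path per line
cmd = "find path/to/file -print"
fileslist = Popen(cmd, shell=True, stdout=PIPE, text=True)
for line in fileslist.stdout:
    file = line.rstrip("\n")  # strip the trailing newline, or isfile() never matches
    if os.path.isfile(file):
        print("THE FILE IS: " + file)
        # Runs Linux commands and channels output to the PIPE output
        # (shlex.quote keeps file names with spaces from breaking the shell command)
        parsedfile = Popen("txt2html --explicit_headings --indent_par_break --make_tables --make_anchors --xhtml " + shlex.quote(file) + " | ssed -R -n -f File_to_Parse_Sections", bufsize=-1, stdout=PIPE, shell=True)
        ...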
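For comparison, here is a minimal sketch of the same two ideas (walk every file in the tree, then pipe one command into another) using only Python's standard library, without shelling out to find or building a shell string. This is not the code I ran for the site; the helper names iter_files and run_pipeline are made up for illustration, and the approach assumes the two commands can be passed as argument lists.

```python
import os
import subprocess


def iter_files(root):
    """Yield the path of every regular file under root, much like `find root -type f`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def run_pipeline(first_cmd, second_cmd):
    """Run `first_cmd | second_cmd` without a shell and return the final output as text.

    Both commands are lists of arguments, so file names containing spaces
    or shell metacharacters need no quoting.
    """
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
    second = subprocess.Popen(second_cmd, stdin=first.stdout, stdout=subprocess.PIPE)
    first.stdout.close()  # lets first_cmd see a closed pipe if second_cmd exits early
    output, _ = second.communicate()
    first.wait()
    return output.decode("utf-8", errors="replace")
```

With these helpers, the loop above could become "for path in iter_files('path/to/file')", and each file could be handled with run_pipeline(["txt2html", "--xhtml", path], ["ssed", "-R", "-n", "-f", "File_to_Parse_Sections"]), adding the remaining txt2html flags as extra list items. The trade-off is a little more Python in exchange for not depending on how the shell splits the command line.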