Friday, February 24, 2012

IRS and Tech: Cudgel or Lever?

The National Taxpayer Advocate, Nina Olson, has a posse.  And for good reason. I attended the American Bar Association Tax Section mid-year meeting in San Diego last week and had the opportunity to hear Olson speak.  She wields statistics, legal provisions and specific taxpayer examples to show how the IRS has steadily de-personalized tax administration.  This has dramatically increased inequities for small business owners and lower income taxpayers.  Automated systems have replaced individual judgement and human contact throughout the IRS, while tax laws, regulations and guidance become more impenetrable.

The problem with technology at the IRS, according to Olson, is emphasis.  The relatively small technology budget has been applied primarily to enforcement and to creating a distance between taxpayers and individuals at the IRS.  I asked about technologies to *improve* customer service, and Olson said that her office was starting to take some steps in that direction (e.g., video conferencing with taxpayers at remote Taxpayer Advocate offices), and that more needs to be done.

She suggested putting together a conference of technologists and tax experts to discuss ways that technology could be used on the taxpayer's behalf.  I think that there are a host of creative and capable consumer-facing companies in Silicon Valley and the Bay Area that could take up this task.

What about an online tax dispute system that reduces the barriers for a taxpayer challenging an assessment? The IRS has a list of FAQs on its site, but what about a more comprehensive Q&A site, with technologies like Quora or to identify the relevant taxpayer issue?  These are just the tip of the iceberg, and I am confident that there are dozens of other technologies that could help cut through the complexity of tax law, if civic hackers and consumer internet entrepreneurs set their minds to the task.

What do you think? Olson also now has a blog.  If you have ideas for technology and tax, you can let her know there, or post your comments here.

Thursday, February 16, 2012

US Code in Standardized XML

Grant Vergottini has done it again.  He has taken the XML of the U.S. Code, published by the Law Revision Counsel, and converted it to a format (obscurely) called Akoma Ntoso, which is growing to be the basis for an international standard for legislation. (See his post here.)

Standards for their own sake have little meaning.  What we'd like is a standard that would allow easy sharing and comparison of legislative information from various jurisdictions, while being flexible enough to integrate the kinds of metadata that Jim Harper of the Cato Institute has called for.  By focusing on core structural elements, Grant has shown that translation between the different data formats is not only possible, but can be relatively straightforward. The U.S. legislative process is unique, as some experts at the House conference on legislative data pointed out.

True enough: every legislative process is, in some ways, unique. But there is enough overlap that a robust standard is possible.  We're still far off from having a "computable" body of legislation, but this is a major step forward for making the code machine readable.
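
To make the idea of translating between formats by way of core structural elements concrete, here is a minimal sketch in Python. It is not Grant's converter. The source tag names (sect, enum, header, text) are hypothetical, and the target names are only loosely modeled on Akoma Ntoso element names, so the output is not schema-valid Akoma Ntoso.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source tag names to target tag names.
# The target names are loosely modeled on Akoma Ntoso; this is an assumption.
TAG_MAP = {
    "sect": "section",
    "enum": "num",
    "header": "heading",
    "text": "content",
}

# Hypothetical source fragment, invented for illustration.
SOURCE = (
    '<sect id="s101"><enum>101</enum><header>Definitions</header>'
    "<text>In this title...</text></sect>"
)


def translate(elem):
    """Recursively copy an element, renaming tags according to TAG_MAP.

    Tags without a mapping are passed through unchanged.
    """
    new = ET.Element(TAG_MAP.get(elem.tag, elem.tag), dict(elem.attrib))
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(translate(child))
    return new


def convert(xml_string):
    """Parse a source XML string and return the translated XML as a string."""
    return ET.tostring(translate(ET.fromstring(xml_string)), encoding="unicode")


if __name__ == "__main__":
    print(convert(SOURCE))
```

A real conversion has to handle much more than renaming (nesting rules, identifiers, metadata), but the structural core can be this simple.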

Wednesday, February 15, 2012

Convert PDF to Text, HTML, Word...

I've put together a small demonstration site to convert pdfs to clean html: you can try it out here.  There are many caveats that go along with this (e.g. the current server is not very stable, it only works with JavaScript-enabled browsers, only 5 documents at a time, limited size on each document, no OCR, etc.).  But I thought I'd get it out there for all the legal data fans to try, and to get a conversation started about data encoding. Do you have a favorite way of getting text out of pdfs?

PDF documents are the only available starting point for a lot of government legal information.  I've discussed some of the problems with this before, and suffice it to say that this is a recurring problem in legal informatics.  To extract useful metadata, and to make the documents web-accessible, it is usually necessary to convert the PDF to a more portable format. The devil is in the details.

While there are many programs available that make the conversion from pdf to text, html or MS Word, there are many trade-offs, the biggest of which is whether to preserve layout or to make it easier to extract metadata.  Most of the converters to html that I have found, for example, include a huge number of extra tags that clutter up the text, break up sentences and paragraphs and generally make it very hard to extract meaningful metadata from the document.

I've combined a couple of open source programs (pdf2text -> txt2html) and an open source tool to upload documents, to make this small site.  If you find it useful, or need to convert large volumes of pdf documents to clean html, get in touch.
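
For a sense of what "clean html" means here, the following is a minimal Python sketch of the second stage only (text to html). It is a stand-in of my own and not how txt2html works. It assumes the text has already been extracted from the pdf and that paragraphs are separated by blank lines; the function name text_to_html is hypothetical.

```python
import html
import re


def text_to_html(text):
    """Turn plain text into minimal HTML: one <p> per blank-line-separated block.

    Assumption: paragraphs in the extracted text are separated by blank lines.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Rejoin lines that were hard-wrapped by the pdf layout.
        joined = " ".join(
            line.strip() for line in block.splitlines() if line.strip()
        )
        if joined:
            # Escape &, < and > so the text cannot break the markup.
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "Sec. 1. Short\ntitle.\n\nA & B"
    print(text_to_html(sample))
```

The output contains only paragraph tags, which throws away layout but keeps sentences and paragraphs whole for later metadata extraction.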

Thursday, February 2, 2012

Roundup: House Legislative Data and Transparency

Many kudos to those who put together today's House Legislative Data & Transparency conference. I was impressed with the high-level and high-quality line-up of speakers and participants, and very grateful to the Committee on House Administration, which provided a livestreaming feed, and to all the Tweeters in the room and around the country who helped fill in the blanks (search: #ldtc).

The Conference provided a clear picture of the current state of play with legislative data, and some very clear recommendations from the audience and some participants about where things should move. What is needed now is a commitment to make those improvements.

John Wonderlich (also at Sunlight) expressed <understatement>disappointment</understatement> in the government's lack of commitment, after many years of requests, to providing bulk data.  I agree, though there are some bright spots: I have been pleased with the bulk data being provided by the LRC for the U.S. Code Prelim, and the regular url scheme at, which is not far off from providing bulk data.

One of the most underappreciated statements of the day came from the Law Revision Counsel.  On a question, I believe, related to authentication, the LRC highlighted the importance of positive law codification.  

I don't think most people realize: there is no single, authoritative publication of Federal statutory law.  The printed version of the U.S. Code is six years out of date. The online USC Prelim is up to date, for now. But neither one is the current law of the United States.

The Conference opened with a strong showing of bipartisanship.  That is exactly what is needed to move codification legislation forward.

XML Standard from Bill to Code: Legislative Recommendation #4

The fourth of my structural recommendations for the U.S. House conference on legislative data and transparency (being held now) is to establish a consistent XML standard from publication of a bill to incorporation in the Code.

I'll keep this one short, since others, particularly Jim Harper of the Cato Institute, have described in great detail what should go into this standard and why it is important.

Here, I want to focus on the importance of having a single XML standard from the first drafting of a bill to its codification.  Lest you think this is already being done, or is an easy task to accomplish, Alex Howard (@digifile) of O'Reilly Media has posted a helpful flowchart [and here] of the various offices that are involved in the first part of the legislative process (until the bill becomes law and is published by the GPO).  The process of codification takes place after that.

A key element of any XML standard for legislation is that it be consistent throughout this process, as it passes from the jurisdiction of one office to another.

Wednesday, February 1, 2012

Positive Law Codification: Legislative Recommendation #3

Positive law codification is probably the most under-appreciated facet of legislative transparency.  It is hard work, and requires fighting against the entropy of legislative history. But ultimately, any effort to create more accessible legislation will be limited without positive law codification.

The best description comes from the Office of Law Revision Counsel (LRC), here. There are many minutiae of the process that I don't know or understand, so my discussion here will necessarily be an approximation.  I welcome any corrections in the comments.

The LRC is charged with codifying Federal statutes, which is essentially organizing them into the Titles of the U.S. Code.  However, unless Congress passes a Title as law, and replaces the various laws which make up the Title, the Title will live in a parallel universe from the laws that Congress actually passed.  So relying on text in the Title alone can often lead to trouble. The whole Code is revised on a 6-year schedule, so some sections can be as much as 6 years out of date.  

The LRC is moving ever faster, and has started to release a "USCPrelim" version, which updates Titles on a faster cycle.  However, as long as (1) the Code is not positive law, and (2) changes are not made in a consistent manner, this codification process will continue to require a great deal of manual work and artistry.

At the same time, the LRC has taken up a number of projects to ask Congress to pass certain Titles into "positive law", so that the text of that Title in the Code is the law.  There are currently 8 Titles listed on the LRC's website that are, it appears, ready for Congressional action.

Congress could make tremendous progress toward legislative transparency by prioritizing positive law codification, and committing to completion of the process by a certain date (2014?).  

Now is a terrific time to start, for a number of reasons:
  1. The 6-year cycle completed in January.  So the Code is almost completely "up to date".
  2. Legislative gridlock on other issues creates a space for passing legislation of this technical nature that has few policy implications, but could offer great gains in efficiency and transparency.
  3. Data technologies have advanced to the point that the process of codification can be accelerated. Quality and completeness could be ensured by a number of automated, as well as manual, tests. And the benefits of codification would be immediately visible, in the ability to update the Code in real time, just as is currently being done for the Code of Federal Regulations.
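
As one illustration of the kind of automated test mentioned in the third point, here is a minimal Python sketch. It is my own assumption about what such a check might look like, not a description of any LRC process. It assumes both the enacted text and the codified text are available as dictionaries keyed by section identifier; the function names and data layout are hypothetical.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so layout differences do not count as changes."""
    return re.sub(r"\s+", " ", text).strip()


def check_codification(enacted, codified):
    """Compare two dicts of section id -> text.

    Returns (missing, changed): ids absent from the codified text, and ids
    whose text differs after whitespace normalization.
    """
    missing = sorted(sid for sid in enacted if sid not in codified)
    changed = sorted(
        sid
        for sid in enacted
        if sid in codified and normalize(enacted[sid]) != normalize(codified[sid])
    )
    return missing, changed


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    enacted = {"101": "The  term\nmeans X.", "102": "Y applies.", "103": "Z."}
    codified = {"101": "The term means X.", "102": "Y shall apply."}
    print(check_codification(enacted, codified))
```

A check like this only flags candidates for manual review; deciding whether a change is a permissible restatement still takes a person.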

This is an exciting time for legislative data, and the House can make changes now, for a relatively modest investment, that will yield benefits for years to come.