Monday, October 25, 2021

An API for Similar Bills Using FastAPI

In my previous post, I discussed how to define similarity for bills in Congress. Given a new bill, which other bills are similar, and how are they related?

We're building applications to let researchers look up any bill and get a list of similar bills. To do this, we separate the hard question (which bills are similar) from an easier problem (how to show the list of similar bills). By separating the problems, we can make progress on both at the same time.

The key to separating the two problems is an Application Programming Interface (API). An API is a technical specification that defines what response the application will return for a given request. When a researcher asks "What bills are similar to H.R. 200, the National Intersection and Interchange Safety Construction Program Act?", the API will respond with a list of related bills, and a predictable set of data for each bill: its number, title, the 'relatedness' score, etc. The API does not determine *which* bills are listed (that is the harder problem), but what information we return for each of the bills we do list.

Creating an API requires us to define, ahead of time, the relevant information that a researcher will want. Do they want the date of introduction of the bill? The length of each bill? This information is gathered by talking with the researchers themselves to know what is relevant to their research.

As a starting point, I've created an API that returns a list of bills with the bill number, bill title, similarity scores, and the reasons for considering the bills related. These reasons include the bills being identical, nearly identical, or having only some sections in common. With this starting point, we hope to discover what set of information a researcher actually wants when looking for related bills.
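As a sketch of what such a response might look like, here is the shape in plain Python. The field names and example values are my illustration, not the API's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RelatedBill:
    bill_number: str                 # hypothetical identifier format
    title: str
    score: float                     # 'relatedness' score
    reasons: list = field(default_factory=list)  # e.g. ["nearly identical"]

@dataclass
class RelatedBillsResponse:
    bill: str                        # the bill that was queried
    related: list = field(default_factory=list)

# Hypothetical example values for illustration only.
resp = RelatedBillsResponse(
    bill="117hr200",
    related=[
        RelatedBill(
            "116hr200",
            "National Intersection and Interchange Safety Construction Program Act",
            92.5,
            ["nearly identical"],
        )
    ],
)
```

Fixing a shape like this up front is what lets the front end be built while the harder "which bills" question is still being worked on.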

A live demonstration, with data up to date through September 2021, is here (see sample response below):

Sample API response, bills similar to H.R. 200 (117th)

Technical notes

The code for this API is here:

The API was built with FastAPI, a terrific Python library that makes it easy to define and document an API. It has become my favorite way to quickly build a prototype "back end" for an application and to communicate ideas with other developers.

Tuesday, August 31, 2021

Finding similar bills in Congress

Bills are often recycled. Most bills introduced in Congress are very similar to bills that have been introduced in a previous session. This reflects the imbalance between the number of bills that are proposed and the number that actually become law. Many bills are introduced for "messaging" purposes, and their sponsors do not plan to get them passed; others simply don't make it through the gauntlet of committees and votes required to make a new law. In each two-year session, up to 10,000 bills are introduced, of which a few hundred may pass.

When someone is reviewing a 'new' piece of legislation, they often want to know which other bills are similar to this one. They may want to know who sponsored the previous bills, what happened to those bills, and whether a bill's content has already passed as a part of another bill. This research all starts with the question: which other bills are similar to this one?

I've been working on the problem of finding similar bills for a while, most recently with the BillMap project. The first challenge is figuring out what people really mean when they say two bills are 'similar'. We often have an intuitive sense of what we mean when we say two things are similar, but rarely consider what objective factors make up that intuitive sense. And what we mean by 'similar' may change, depending on context. When we say 'comparing apples and oranges', we're assuming a context in which those two common edible fruit are not considered similar. In that context, maybe an orange is similar to a lemon, but not to an apple. We'd need to investigate more to answer other questions: is a pear similar to an apple, an orange or neither?

For legislation, people often mean a variety of things when looking for 'similar' bills. Bills may share a legislative history, share sponsors, have similar titles, deal with similar subjects, or include much of the same text. Depending on the context, some combination of these factors makes two bills 'similar'. At the same time, bills may differ in many of these factors, and still be considered similar. The most surprising case is where experts consider two bills to be similar, despite significant differences in their text.

For example, 38 bills have become law so far in the 117th Congress (the current one). One of these, H.R. 1448, has the title “Puppies Assisting Wounded Servicemembers for Veterans Therapy Act,” or PAWS Act, in the House. There is also a bill in the Senate, S. 613, with the same title. These are considered similar because they have the same title, serve the same purpose, and contain similar text in one section that creates a pilot program for dog training. However, comparing the text of the two small bills shows significant differences: the Senate version (below on the left) has a "Findings" section that makes up much of the bill and is not found in the House version. The Senate bill also has a section at the end that is not in the House version. So two of the four sections are different, and the key section on dog training also has many changes. Nonetheless, both are considered versions of the same 'PAWS' bill; because they are in different chambers, they may also be considered 'companion' legislation. Our goal is to be able to identify bills like this as 'similar', despite their many textual differences.

Similar Text, Maybe Similar Bills

The difference between the House and Senate versions of the PAWS Act is a mild example of bills that are similar, but have many text mismatches. Sometimes the text of a bill may be completely rewritten, but still be considered 'similar' to the earlier version. In fact, comparing the text of bills is often not as helpful as other factors in finding similar bills. It's also computationally very costly. 

For the 10,000 or so bills introduced in any Congress, directly comparing the text of each pair of bills would require 10,000 x 10,000, or 100,000,000 comparisons. An average text comparison takes about 1 second, so comparing all bills could take 27,000 hours of computing time. Looking back over previous Congresses would take even longer.
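The back-of-the-envelope arithmetic works out as follows:

```python
# All-pairs text comparison cost for one Congress.
n_bills = 10_000
comparisons = n_bills * n_bills          # 100,000,000
seconds_per_comparison = 1               # rough average from the text
hours = comparisons * seconds_per_comparison / 3600
print(f"{comparisons:,} comparisons ≈ {hours:,.0f} hours")  # ≈ 27,778 hours
```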

We looked for a way to narrow down the group of bills to evaluate. The first approach we tried, looking at similar bill titles, goes a long way toward solving the problem.

Similar Titles, Similar Bills

It turns out, when a new bill is introduced with the same purpose (and often by the same sponsors), the new bill is often given the same title. Sometimes only the year is changed. So we have the 'SHOP SAFE Act of 2020' and the 'SHOP SAFE Act of 2021'; the 'Freedom for Farmers Act of 2018', the 'Freedom for Farmers Act of 2019' and the 'Freedom for Farmers Act of 2021'. If we simply remove the year from the title, and search through an index of titles, we can find many of the bills that are similar to a given bill.
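A sketch of the year-stripping idea (this is an illustration, not the exact normalization BillMap uses):

```python
import re

def normalize_title(title: str) -> str:
    """Strip a trailing 'of <year>' so that re-introduced bills
    normalize to the same title key."""
    return re.sub(r"\s+of\s+(19|20)\d{2}$", "", title.strip())

print(normalize_title("SHOP SAFE Act of 2020"))        # SHOP SAFE Act
print(normalize_title("SHOP SAFE Act of 2021"))        # SHOP SAFE Act
print(normalize_title("Freedom for Farmers Act of 2018"))  # Freedom for Farmers Act
```

With titles normalized this way, an exact-match index lookup finds most re-introduced versions of a bill.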

Searching for titles highlights another kind of relationship between bills: one bill incorporated into another. Larger bills, particularly in recent years, may roll up many smaller bills. The overall bill may have one or two titles, but there are many titles within the bill that refer to only a portion of the bill. The fact that another, smaller, bill shares a title with only part of the larger bill is itself important information. For example, H.R. 1 (117th) has the overall title 'For the People Act'. Within it are dozens of other titles which refer to a smaller portion of the bill. Many of those titles are shared with other, independent bills. The 'For the People Act' may share only a small part of its text with each of those bills, while the entire smaller bill is included by the larger act. Below is an entry in the BillMap application showing many of these titles within the 'For the People Act'.

Our first pass at processing bills now looks at the two kinds of title separately. If a bill matches the 'main' title of another bill, it is probably a better match for the whole bill. If it matches one of the titles for a portion of the bill, it is likely that the smaller bill (or a version of it) was included by the larger bill.

Matching bills by title alone captures a large part of what people mean by finding similar legislation.

However, we would still like to track bills where the title is changed, or only portions of a bill are included in another bill (without the title). For those cases, we use a specialized 'more like this' search, a kind of Spotify or Amazon 'Recommendations for You', made for bills.

Searching bills by section

While comparing bills two at a time is very resource-intensive, there are shortcuts. We start by creating an index of bills and use a specialized algorithm to find text similarity (see 'More Like This' query in Elasticsearch).

We search the index with parts of each new bill. We don't search the index using the whole bill text, for the same reason you don't usually type a large text into Google's search box: we'd get too many false positives. Instead, we break each bill down into its logical building blocks, sections. We then search each section against the index, find which bills have similar sections and then combine the results at the section level to find the whole bills that match. This approach has a number of advantages: we get granular information about what sections of other bills are similar to the sections of the current bill; and by combining that information, we can find the bills that are more similar overall. Below is an example from H.R. 1 (117th) showing, for each section of the bill, a list of other bills that have similar sections.
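Sketched in Python, the two pieces look roughly like this. The 'more_like_this' query shape is standard Elasticsearch syntax, but the field name and the scores below are my assumptions, not BillMap's actual schema:

```python
from collections import defaultdict

def mlt_query(section_text: str) -> dict:
    """Hypothetical 'More Like This' query body for one bill section."""
    return {
        "query": {
            "more_like_this": {
                "fields": ["section_text"],   # assumed field name
                "like": section_text,
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        }
    }

def combine_section_hits(hits_per_section):
    """Combine section-level hits into whole-bill scores: bills that
    match more sections, with higher scores, float to the top.
    Each hit is a (matched_bill_id, score) pair."""
    totals = defaultdict(float)
    for hits in hits_per_section:
        for bill_id, score in hits:
            totals[bill_id] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative scores for two sections of a bill.
ranked = combine_section_hits([
    [("116hr1", 9.0), ("115s42", 3.0)],   # hits for section 1
    [("116hr1", 7.5)],                    # hits for section 2
])
print(ranked)  # [('116hr1', 16.5), ('115s42', 3.0)]
```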

Round 3: bill-to-bill comparison

The first two stages of our process find bills with similar titles and similar sections. The first stage is narrow, finding only bills that have the same title (minus the year). The second stage is wider, finding bills that have some similar text in their sections. That text search is, by its nature, over-inclusive, and returns some bills that don't actually share much with the original bill other than some phrases. To add precision to our results, we take that list of similar bills (about 20-30 bills) and do a pairwise comparison with our original bill. While this is still time consuming (up to 30 seconds for some bills), it takes much less time than comparing against the tens of thousands of bills we started with. This last stage has another advantage: it allows us to evaluate the textual similarity of the bills. We can describe this on a scale (e.g., 0-100) or with categories. For BillMap, we label the pairs 'identical' (more than 95% text match), 'nearly identical' (more than 80% text match), 'some similarity' (between 10-80% text match) or 'unrelated' (less than 10% text match). Additionally, when one smaller bill is completely included in another, larger bill, we can determine that it is 'included by' the other bill, and that the larger bill 'includes' the smaller one.
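The labeling step can be sketched directly from the thresholds above:

```python
def similarity_label(pct_match: float) -> str:
    """Map a 0-100 text-match score to the BillMap categories."""
    if pct_match > 95:
        return "identical"
    if pct_match > 80:
        return "nearly identical"
    if pct_match >= 10:
        return "some similarity"
    return "unrelated"

print(similarity_label(97))  # identical
print(similarity_label(85))  # nearly identical
print(similarity_label(40))  # some similarity
print(similarity_label(3))   # unrelated
```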

Depending on the kind of information a person is looking for, each of these categories, as well as the information about title or section matching, may be relevant. In the BillMap application, we show this information in a table: for each bill, we return up to 30 bills that meet one or more of these criteria, and list all of the categories in a list of 'reasons' we consider the bills similar. 

Build Your Own

One of the goals of the BillMap project is to create processes that could be used in other projects to show information about bills. Toward that end, I've built a separate API interface for the bill title and related bill data. Using the interface, you can search by bill number to find the bill's titles, search by title to find other bills that share the same title, and find bills that have matching section data (results shown in JSON below):

I will describe this interface and its potential uses in more detail in a future post. In the meantime, you can read the technical details of the API in the project documentation on Github.

Wednesday, July 14, 2021

Legislative Technology in 2021

When I tell people that I help Congress become more transparent and efficient, they invariably say: "Good luck with that." I'm happy to report that we have, actually, had some good luck with that.

I work with Xcential, a company that has deep expertise modernizing legislative systems, from building the drafting system that the California Legislature uses, to structuring the United States Code in a standard (XML) data format, United States Legislative Markup.

More recently, the "Comparative Prints Project," which Xcential is working on for the House Office of the Clerk and the Office of Legislative Counsel, was highlighted by the House Modernization Committee (June 2020 report (pdf)) and is one of the few areas in Congress that has bipartisan support. 

For this work, we've put together experts in Natural Language Processing, Customer Experience design, and legislative data systems. As a lawyer who now builds software, I work to integrate the parts into a suite of software tools, to help answer the question: "What's going on as a bill becomes law?"

Bill language is often opaque and difficult to interpret, both for technical and political reasons. The words may be straightforward, but understanding their impact currently requires expertise and extensive research. A major change to the law may be written: In subsection 501(c)(3) of such Act, strike "religious,". Understanding this requires finding the earlier reference to 'such Act'; knowing what subsection 501(c)(3) says; and interpreting how the proposed change would affect the current law. It may also involve following a trail of such instructions earlier in the bill.

To make this information more accessible, we've built a tool that automatically processes bill language to show how a bill would change current law. We've also built a tool to compare two versions of a bill. This second tool is being used to track changes that are made to legislative drafts (like the Covid-19 relief bills) as they make their way through Congress. 

Both tools track changes in natural language documents. This, in itself, is not a new endeavor. Academics have worked for decades on algorithms to detect changes in documents. Change tracking has an even longer history in legislatures: the legislative drafter's art, developed over generations, is to 'patch' legislation with language that achieves its aims, and can still pass a divided House. We combine these two, algorithmic analysis and the art of drafting, to produce document comparisons that are both machine-readable and human-understandable.

One of the more interesting aspects of this project is the modeling of amendatory language by Sela Mador-Haim, on our team. He developed a formal grammar, Amendment Modeling and Processing Language (AMPL) after analyzing hundreds of thousands of amendment phrases. From these phrases, he identified a small set of building blocks ("strike", "insert"; "before", "after"; "and", "or", etc) and developed a recursively-defined syntax to combine them. The vast majority of Congressional amendments can be parsed and converted into a syntactically valid combination of these AMPL building blocks. The AMPL statements are then interpreted by our automated tools to show changes in the law:

An amendment from H.R. 1500 (116th Congress), is interpreted and automatically applied to the law

The system can handle a wide variety of amendment types
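As a toy illustration of the simplest case only (this is not the AMPL grammar or Xcential's implementation, and the statutory text below is paraphrased), a bare 'strike' instruction amounts to a targeted text replacement:

```python
def apply_strike(law_text: str, target: str, replacement: str = "") -> str:
    """Apply a toy 'strike' (or 'strike and insert') instruction:
    remove the quoted target text once, inserting the replacement if
    one is given. The real grammar handles far more: positions like
    'before'/'after', conjunctions, nesting, and chained instructions."""
    return law_text.replace(target, replacement, 1)

# Paraphrase of the example above: strike "religious," in 501(c)(3).
law = 'organized and operated exclusively for religious, charitable purposes'
print(apply_strike(law, 'religious, '))
# organized and operated exclusively for charitable purposes
```

The hard part that AMPL solves is parsing free-text amendment phrases into a valid combination of such operations in the first place.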

As with any legislative system, there are limits to what can be automated. We recognized this early on, and have worked with Charlotte Lee and her Customer Experience team at Monday Design to bring drafters and other experts into the creation of the system. The goal is to provide tools that help these experts review changes in the law and to communicate the changes in a way that non-experts can better understand.

We owe our "luck" in recent years to some persistent people in government. Among them is Kirsten Gullicksen at the Clerk's Office, who manages the Comparative Prints Project. Kirsten won a 2021 Service to the Citizen Award, for her work on this and other initiatives to support and modernize systems in the House.

To see Kirsten present the Comparative Project, and to hear others who are working to help Congress become more transparent and efficient, tune in today at the virtual meeting of the Bulk Data Task Force on July 14 (RSVP required).

Over the coming months, I hope to discuss some of the other projects that were presented at the meeting and provide more detail about the technologies we are using.

Thursday, March 7, 2019

Converting Daily Temperature Data to Sound with TwoTone

I was distracted yesterday. I saw a tweet by Alberto Cairo (@albertocairo) about a new web-based tool to convert data to sound. I have thought a lot about 'visualization' in sound since my days doing electrophysiology of locust brains at Caltech (I was even acknowledged for 'helpful comments' in a Nature paper). The tool, TwoTone, was "made by Datavized Technologies with support from Google News Initiative".

My masterpiece is here: Chicago vs. Redwood City (temperatures, 2015, SoundCloud)

I wanted to try it, and the first thoughts I had were to use data from a fitness tracker (e.g. steps per day), but I don't have one. Next I thought to nerdily plot my Github contributions over time. But for now I've done something a little more conventional: plot the minimum temperature over a year in Chicago, and in the Bay Area.

I started by getting data from Google's BigQuery accessing NOAA's Global Historical Climate Network (GHCN). After a little trial and error based on Google's existing examples, I figured out that the daily minimum temperature in Chicago for 2015 can be found with:

  SELECT
    wx.value/10.0 AS min_temperature
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_2015` AS wx
  WHERE
    id = 'USW00094846'
    AND qflag IS NULL
    AND element = 'TMIN'

To do the same for Redwood City, CA (the nearest station to me), I used id = 'USC00047339' (found here).

Lots of caveats: TwoTone probably normalizes the dynamic range, so we don't get a fair comparison of the two cities. I'd also like more control over features of the audio: the maximum tempo is 300 bpm, but audio data can be processed much faster by the brain; I'd also like to be able to combine tracks with an eye to meaningfully comparing the data. I did this in GarageBand, but by then, the data had been converted to audio waves. One of the great features of TwoTone is to manipulate sound visualizations while still looking at the source data. I look forward to the evolution of this tool, and generally to the field of audio 'visualizations' of data.