Monday, April 9, 2018

Pope Shows the Way to Modernity in "Call to Holiness"

Pope Francis published a remarkable document today, "Gaudete et Exsultate", an exhortation to holiness. He says its goal is to "repropose the call to holiness in a practical way for our own time". The philosophy of the document is bold and modern.
Pope Francis, Call to Holiness

Each of its 177 numbered paragraphs, organized into five named chapters, contains an expression of wisdom and a practical view on what it is to live a holy life. It is very much worth reading and meditating upon, whatever your faith.

The document resonates on many levels. It calls for acting with "Joy and a Sense of Humor", for "Going Against the Flow", and, importantly, emphasizes that charity is the highest virtue. A summary in the New York Times highlights the teaching, found in paragraph 101, that caring for the poor and the immigrant is as holy as opposing abortion:

...Our defence of the innocent unborn, for example, needs to be clear, firm and passionate, for at stake is the dignity of a human life, which is always sacred and demands love for each person, regardless of his or her stage of development. Equally sacred, however, are the lives of the poor, those already born, the destitute, the abandoned and the underprivileged, the vulnerable infirm and elderly exposed to covert euthanasia, the victims of human trafficking, new forms of slavery, and every form of rejection.[84] We cannot uphold an ideal of holiness that would ignore injustice in a world where some revel, spend with abandon and live only for the latest consumer goods, even as others look on from afar, living their entire lives in abject poverty.
It is also refreshingly modern in its form and format. It is published in HTML, with a clear and consistent structure (numbered paragraphs, chapters). Its references all have hyperlinks to the footnotes, and the footnotes themselves are hyperlinked to original sources!

At the top of the document is a list of social media links (Facebook, Twitter, Google+, email...), as well as links to print and PDF versions for old-school applications. It has also been translated into many languages, with links to those versions at the top.

It even has breadcrumbs placing this document in the context of the other documents on the website.

It comes in a mobile version, and the document redirects to the correct format for your device.

Each of these components was clearly well thought out, and underscores, subtly, Pope Francis' call to modernity. While he warns us not to be "caught up in superficial information, instant communication and virtual reality", he does not reject technology itself, and indeed has published an (almost) thoroughly modern document. The one step that is missing is a hyperlinkable structure: if each numbered paragraph had an id, it would be possible to link directly to that paragraph.
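Such an anchor could be as simple as an id attribute on each paragraph's element. A hypothetical sketch (the element names and URL here are illustrative, not the Vatican's actual markup):

```html
<!-- Hypothetical markup: one id per numbered paragraph -->
<p id="para-101">101. Our defence of the innocent unborn ...</p>

<!-- which would make deep links to a specific paragraph possible: -->
<a href="https://w2.vatican.va/.../gaudete-et-exsultate.html#para-101">Gaudete et Exsultate, paragraph 101</a>
```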

If the Vatican is interested in taking this additional step, I would be very happy to consult on how to convert this and other papal declarations into fully standardized Akoma Ntoso.

Tuesday, May 2, 2017

XML Editing is Hard

I was reminded this week that our best customers at Xcential are those who first try to build a legislative data system themselves. And today's xkcd perfectly sums up why:

Thursday, October 22, 2015

Git for Law Revisited

Laws change. Each time a new U.S. law is enacted, it enters a backdrop of approximately 22 million words of existing law. The new law may strike some text, add some text, and make other adjustments that trickle through the legal corpus. Seeing these changes in context would help lawmakers and the public better understand their impact.

To software engineers, this problem sounds like a perfect application for automated change management. Input an amendment, output tracked changes (see sample below). In the ideal system such changes could be seen as soon as the law is enacted -- or even while a bill is being debated. We are now much closer to this ideal.

Changes to 16 U.S.C. 3835 by law 113-79

On Quora, on this blog, and elsewhere, I've discussed some of the challenges of using git, an automated change management system, to track laws. The biggest technical challenge has been that most laws, and most amendments to those laws, have not been structured in a computer-friendly way. But that is changing.

The Law Revision Counsel (LRC) compiles the U.S. Code through careful analysis of each new law, identifying the parts of existing law that will be changed (in a process called Classification) and making those changes by hand. The drafting and revision process takes great skill and legal expertise.

So, for example, the LRC makes changes to the current U.S. Code, following the language of a law such as this one:
Sample provision, 113-79 section 2006(a)
LRC attorneys identify the affected provisions of the U.S. Code and then carry out each of these instructions (strike "The Secretary", insert "During fiscal year", and so on). Since 2011, the LRC has been performing this analysis and publishing the final result in XML format. One consequence of this format change is that it becomes feasible to automatically match the "before" text to the "after" text and produce a redlined version, as seen above, showing the changes in context.

To produce this redlined version, I ran xml_diff, an open-source program written by Joshua Tauberer, who also works with my company, Xcential, on modernization projects for the U.S. House. The results can be remarkably accurate. As prerequisites, it is necessary to have "before" and "after" versions in XML format and a small enough stretch of text to make the comparison manageable.

Automating this analysis is in its infancy, and won't (yet) work for every law. However, the progress that has been made points the way toward a future when such redlining can be shown in real-time for laws around the world.

Wednesday, September 16, 2015

More Elasticsearch: Flexibility without duplicates

People want everything. When they're searching, they want flexibility and they want precision, too. Legal researchers, especially, show this cognitive dissonance: in their personal lives they are used to Google's flexibility ("show me that hairy dog that looks like a mop"), and at work they use 'Advanced' search interfaces that can find the right legal document, if only they write a search query that is sufficiently complex ("show me the rule between September 1981-1983 that has the words 'excessive' and 'sanctions' within 4 words of each other, and does not have the word 'contraband'").

To search legal documents, precision is important: 42 U.S.C. 2000e-5 (a section of the United States Code) is not the same as 42 U.S.C. 2000e. At the same time, a text search for 'discriminate' should probably also return results containing 'discrimination'. Handling this in Elasticsearch (ES) seemed simple at first: create two indexes, or two 'types' within a single index. In essence, we'd index the documents once with a permissive analyzer that doesn't discriminate between 'discriminate' and 'discrimination' (an English-language analyzer), and once with a strict analyzer that breaks words on whitespace and will only match exact terms (read more on ES analyzers here). Search the first index when you want a flexible match and the second when you want an exact match. So far so good.
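That two-index setup can be sketched as follows (the index, type, and field names are placeholders; the 'string' field type matches the ES 1.x/2.x era this post describes):

```
PUT /flex_index
{
  "mappings": {
    "doc": {
      "properties": {
        "docText": { "type": "string", "analyzer": "english" }
      }
    }
  }
}

PUT /exact_index
{
  "mappings": {
    "doc": {
      "properties": {
        "docText": { "type": "string", "analyzer": "whitespace" }
      }
    }
  }
}
```

The english analyzer stems 'discriminate' and 'discrimination' down to a common token, while the whitespace analyzer indexes each exact term as-is.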

None or too many

But what about combining a flexible match with a strict one ("section 2000e-5" AND discriminate)? You either get no results or duplicates. No results are returned if you're looking for the overlap of the two terms: by design, the two indexes were created separately. On the other hand, if you're looking for matches of either term, you get duplicates, one from each index. Back to the drawing board.

To remove duplicates, the internet suggests field collapsing: index each document using the same ID value in both indexes, group by ID, and set 'top_hits' to 1 to get just one of the two duplicates. Unfortunately, grouping also breaks the nice results pagination that comes with ES. So you can de-duplicate results, but you can't easily paginate them. This is a problem for searches that return hundreds or thousands of results. For a nice afternoon detour, you can read why pagination and aggregation don't play well together.
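The field-collapsing approach looks something like this (a sketch; 'docId' is an assumed shared ID field, and the index names follow the two-index setup above):

```
GET /flex_index,exact_index/_search
{
  "size": 0,
  "aggs": {
    "dedup_by_id": {
      "terms": { "field": "docId" },
      "aggs": {
        "single_hit": { "top_hits": { "size": 1 } }
      }
    }
  }
}
```

Because the hits now arrive inside aggregation buckets rather than as a normal hit list, ES's from/size pagination no longer applies to them.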

Two fields in one index

O.K., then: how about indexing each field twice within the same document in the index? The two copies should have different names and should be analyzed differently. For example, one could be called 'flex_docText' and the other 'exact_docText'. Combined flexible and exact searches will point to the same document. And while each field is indexed and searched differently, the original text that ES stores will be the same, so we only need to return one of these fields (it doesn't matter which) to the user.


The first step is to create the new index with a 'mapping' for the two fields that defines the different analyzer to use for each:

POST myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "flex_docText":  { "type": "string", "analyzer": "english" },
        "exact_docText": { "type": "string", "analyzer": "whitespace" }
      }
    }
  }
}

Next, index the documents, making sure to index the 'docText' field twice, once under each name. This can be as easy as including the content twice when creating the document:

PUT /myindex/mytype/1
{
  "flex_docText": "This is the text to be indexed.",
  "exact_docText": "This is the text to be indexed."
}
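With both fields in the same document, a combined flexible-and-exact search hits a single document set, so no duplicates can arise. A sketch (the field names mirror the mapping above; match_phrase against the whitespace-analyzed field is one reasonable way to express the exact match):

```
GET /myindex/_search
{
  "query": {
    "bool": {
      "must": [
        { "match":        { "flex_docText": "discriminate" } },
        { "match_phrase": { "exact_docText": "section 2000e-5" } }
      ]
    }
  }
}
```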

Indexing from SQL

An additional complication arises when importing data from a SQL database. As described in my earlier post, a nice open source JDBC import tool was built for this purpose. So nice, in fact, that it directly takes the output of a SQL query and sends it to Elasticsearch to be indexed. The downside is that the data is indexed with just the name it has in the SQL query.  So, if your database column is named 'docText', in a table named 'myTable', you might use this query:

SELECT docText FROM myTable

The JDBC import tool would then index one field, called docText. If you want to create two parallel fields in the index, alias the database column and extract it twice, using the following syntax:

SELECT docText as flex_docText, docText as exact_docText FROM myTable
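In the importer's definition, that aliased query simply replaces the original one. A sketch, assuming the JSON shape used by the elasticsearch-jdbc importer (the connection details are placeholders; check your version's documentation for the exact schema):

```
{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:postgresql://localhost:5432/mydb",
    "user": "importer",
    "password": "...",
    "sql": "SELECT docText AS flex_docText, docText AS exact_docText FROM myTable",
    "index": "myindex",
    "type": "mytype"
  }
}
```

The field names produced by the SQL aliases must match the names in the index mapping, so that each copy picks up its intended analyzer.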

In fact, you can extract the same data as many times as you want, under different names, and apply different analysis to each copy in the index mapping. Does that really work? Yes, it really works. Now, if you want to highlight search results and avoid duplicates, that's a story for another day.