As luck would have it, soon has arrived already: “Juriscraper”, it seems, is the CourtListener library for scraping court websites, and doesn’t actually have anything to do with extracting citations embedded in individual cases. Oops. So I put on my copywriter’s hat, and chased up a brand new name. So let’s try that again….
Say hello to the Free Law Ferret: a Firefox plugin that has emerged from the CitationStylist skunkworks with a ferocious curiosity and a full set of tiny adorable bibliographic teeth.
The tool depends on some code supplied by Zotero or MLZ, so you need to have one of those installed in Firefox for starters. Then install Free Law Ferret. Next, visit a page in your browser. Any page that contains citations to US case law will do, including court judgments from the service of your choice (the case shown in the sample to the right is from Google Scholar, but the source really doesn’t matter). Right-clicking in the page will bring up the context menu with “Free Law Ferret” at the bottom, as shown to the right. Click on it and see what happens (it may take several seconds for the parse to complete on a large document). Bear in mind that the code for this is only a couple of hours old as of this writing: if you get an error, let me know and I’ll check into it.
The Ferret will scan the document in the browser window (be it law case, legal brief, blog post or whatever), and present a list of citations in a dialog box like that shown to the right. Note that the parser presently supports US case law only: cites to the courts of other countries, to regulations, to statutory law and to international instruments and tribunals will not be recognized. Select cites in the dialog and click OK to search for each case in the CourtListener repository and open it in a separate browser tab. If the search for a case fails, you can either broaden the search terms in the CourtListener page, or search for it (manually) elsewhere.
You may notice that the cites in the list are more cleanly formatted than those in the original document. In a 2007 interview, Dan Chudnov (citing Dan Hillis) mentions that one of the tasks of library service is to restore information to metadata that has been corrupted by “noise”. This happens even with electronic records: data gets mistyped, things get entered in the wrong field, and data can be corrupted in transfer or when migrating across platforms. Handwritten citations, on which the US national legal infrastructure largely depends, offer an exceptionally rich variety of opportunities for variation and error. The solution, as Dan indicates, is to leverage the pieces we can trust, and refresh bad metadata from reliable records that can be inferred from what we have to hand.
This is exactly what the CourtListener citator does, and it explains the uniform appearance of the listed references. The tool contains a large pool of variants of the standard reporter abbreviations, derived from painstaking corpus analysis. These are mapped back to their canonical forms by the parser. The parser then explores the text before and after the match in search of data for associated case name, volume, page, year and (optionally) court name. When sufficient details can be identified, the citation details are finalised in canonical form, and a clean citation can be reproduced from the cleansed record. The citation details can also be used to compose a search query for arbitrary search engines, such as CourtListener itself: and voilà, case text on screen.
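To make the idea concrete, here is a minimal Python sketch of that kind of normalization. It is not the actual CourtListener or Free Law Ferret code: the variant table is a tiny made-up subset, the names (REPORTER_VARIANTS, Citation, find_citations) are invented for illustration, and the case-name lookup is a deliberately naive guess at how the text before a match might be explored.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical variant table: a tiny illustrative subset, not the real data.
REPORTER_VARIANTS = {
    "U.S.": "U.S.", "U. S.": "U.S.", "US": "U.S.",
    "F.2d": "F.2d", "F. 2d": "F.2d", "F.2d.": "F.2d",
    "S. Ct.": "S. Ct.", "S.Ct.": "S. Ct.",
}

# Longest variants first, so a longer spelling is tried before a shorter one.
_REPORTERS = "|".join(
    re.escape(v) for v in sorted(REPORTER_VARIANTS, key=len, reverse=True)
)

# volume, reporter variant, first page, optional pin cite, optional "(court year)".
CITE = re.compile(
    r"(?P<volume>\d+)\s+(?P<reporter>" + _REPORTERS + r")\s+(?P<page>\d+)"
    r"(?:,\s*\d+)?"
    r"(?:\s*\((?P<court>[^()]*?)\s*(?P<year>\d{4})\))?"
)

# Naive case-name pattern: capitalized words (plus a few connectors) around "v."
# that end right where the citation begins.
_CAP = r"[A-Z][\w.'&-]*"
_PARTY = _CAP + r"(?:\s+(?:" + _CAP + r"|of|the|and))*"
NAME_BEFORE = re.compile(r"(" + _PARTY + r"\s+v\.\s+" + _PARTY + r"),?\s*$")


@dataclass
class Citation:
    name: Optional[str]
    volume: str
    reporter: str  # canonical form, looked up from the variant table
    page: str
    court: Optional[str]
    year: Optional[str]

    def canonical(self) -> str:
        """Rebuild a clean citation string from the cleansed record."""
        core = f"{self.volume} {self.reporter} {self.page}"
        paren = " ".join(p for p in (self.court, self.year) if p)
        head = f"{self.name}, {core}" if self.name else core
        return f"{head} ({paren})" if paren else head


def find_citations(text: str) -> List[Citation]:
    found = []
    for m in CITE.finditer(text):
        # Look at a short window of text before the match for a case name.
        before = text[max(0, m.start() - 120):m.start()]
        name = NAME_BEFORE.search(before)
        found.append(Citation(
            name=name.group(1) if name else None,
            volume=m.group("volume"),
            reporter=REPORTER_VARIANTS[m.group("reporter")],
            page=m.group("page"),
            court=m.group("court") or None,
            year=m.group("year"),
        ))
    return found
```

Fed a sentence that cites "Roe v. Wade, 410 U. S. 113 (1973)", the sketch yields a record whose canonical() form is "Roe v. Wade, 410 U.S. 113 (1973)". A real parser needs a far larger variant table and much more careful name detection: this one will happily swallow a capitalized signal word such as "See" into the case name.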
The specific query composed by the initial version of Free Law Ferret actually ignores the core citation details themselves. The details used instead are: the case name with any abbreviated terms removed; a date range beginning on January 1st of the year given in the citation and ending on December 31st of the following year; and the courts of the specific jurisdiction, if known. The date scope is extended in this way because a case may be published in the year following the date of decision, and a citation might contain either date—there is no way to be sure, so we look for the case under both.
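As a rough illustration of how those three ingredients might be assembled, here is a small Python sketch. The parameter names (q, filed_after, filed_before, court) and the court code in the example are placeholders of my own choosing, not a statement of what the CourtListener interface accepts, and the rule "drop every token that ends in a period" is only one guess at what removing abbreviated terms could mean.

```python
from datetime import date
from typing import Dict, Iterable, Optional
from urllib.parse import urlencode


def strip_abbreviations(case_name: str) -> str:
    # Assumption: any token ending in a period ("v.", "Co.", "Mfg.") is an
    # abbreviation and is dropped from the search terms.
    return " ".join(t for t in case_name.split() if not t.endswith("."))


def build_query(case_name: str, year: int,
                courts: Optional[Iterable[str]] = None) -> Dict[str, str]:
    params = {
        "q": strip_abbreviations(case_name),
        # January 1st of the cited year ...
        "filed_after": date(year, 1, 1).isoformat(),
        # ... through December 31st of the following year.
        "filed_before": date(year + 1, 12, 31).isoformat(),
    }
    if courts:
        # Restrict to the courts of the jurisdiction only when it is known.
        params["court"] = " ".join(courts)
    return params


def to_query_string(params: Dict[str, str]) -> str:
    return urlencode(params)
```

For a cite to "Smith v. Acme Mfg. Co." dated 1990, build_query produces the search terms "Smith Acme" and a window running from 1990-01-01 to 1991-12-31, which is the two-year span described above.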
These search terms are obviously a heuristic. The core citation metadata would seem to be more precise, but in a strange twist, official citations are not useful for retrieving cases supplied directly by courts themselves, because judgments are finalised before their volume and page location in the relevant commercial reporter (published primarily by Thomson Reuters) are known. Things are moving along in the free law sector, and public mapping tables to connect official citations to official judgments will eventually become a reality. In the meantime, however, you must take appropriate steps to confirm that a judgment retrieved via Free Law Ferret is in fact the judgment intended by the underlying citation.
Judgments of some courts are not yet covered by the CourtListener search query interface, and these are excluded from the results returned by Free Law Ferret. At the state level, the missing jurisdictions are: Alabama; Colorado; Connecticut; Delaware; Florida; Georgia; Iowa; Illinois; Kansas; Kentucky; Louisiana; Massachusetts; Maryland; Maine; Minnesota; Missouri; North Carolina; New Hampshire; New York; Ohio; Oklahoma; Pennsylvania; Rhode Island; South Carolina; Tennessee; Virginia; and Vermont. Quite a mouthful, but a dramatic expansion of CourtListener coverage is in the works, and will be released in the rather near future. When it comes out, cite recognition in Free Law Ferret can be expanded accordingly.
The initial version of Free Law Ferret is sufficiently functional to be useful, but there is plenty of scope for improvement. The query mechanism that has been hard-coded to CourtListener is a primitive implementation of OpenURL. Properly extended, it will be possible to support fallback to additional free-access services (such as Google Scholar and court sites themselves), as well as to commercial providers such as Westlaw, Lexis and Fastcase. With modest cooperation from archive maintainers, adding new services can be made instantly configurable, and the priority of multiple services can be made controllable by the user. There is also plenty of scope for improving the interface, and for improving interoperation with Zotero or MLZ (which is all but non-existent in the initial version).
Many possibilities—so many, in fact, that I would like to close by inviting anyone with an interest to fork the Free Law Ferret code repository on BitBucket and submit pull requests back to the project. There is more work here than I can imagine at the moment, and contributions will be most welcome.
Update: Minor rewrite of introduction (2013-08-21).