This is the first in a series of newsy posts about progress in MLZ support for legal resources. Following close on the heels of the master plan I jotted down yesterday, we have a new MLZ site translator for statutes hosted by the National Assembly of the Lao PDR. The navigation path to statutes on the site is shown in the annotated screenshot to the right.
As with most of the countries represented among our student body, I have visited only the capital city of Laos. Vientiane was the site of my first traffic hop by tuk-tuk, and I am a sucker for markets; but I have no first-hand knowledge of many of the serious issues addressed in student research. Much to learn.
The National Assembly site translator is limited to statutes in the Lao original. The site also hosts the text of the Constitution and English translations of a number of statutes, but neither is covered. The former is left for a future exercise. The latter are deliberately kept out of the translator’s reach, for reasons given below.
I’ll confess that I started with the Lao National Assembly site because I thought it would be quick and simple to cover. It didn’t quite turn out that way. The presentation is simple enough in appearance, with the national legislation presented in three lists totaling just 97 items. The markup, however, is a surprisingly chaotic forest of <span> and <br/> tags, with structures (for example) that look like this:
<span style="font-family: Saysettha OT;">37. <a href="/docs/lao/laws/econ/Industrial%20Processing%20Law%20%281999%29%20Lao.pdf" target="_blank"> ກົດໝາຍວ່າດ້ວຍ ອຸດສະຫະກໍາປຸງແຕ່ງ <br /> </a>38. <a href="/docs/lao/laws/econ/101_Protection.pdf" target="_blank"> ກົດໝາຍວ່າດ້ວຍ ການປົກປ້ອງຜູ້ຊົມໃຊ້ </a> <br />39. <a href="/docs/lao/laws/econ/102_final_statictics_law_09_8_2010.pdf" target="_blank"> ກົດໝາຍວ່າດ້ວຍ ສະຖິຕິ </a> <br />40. <a href="/docs/lao/laws/econ/100_law_hiv_8.2010_last.pdf" target="_blank"> ກົດໝາຍວ່າດ້ວຍ ການສະກັດກັ້ນ ແລະຕ້ານເຊື້ອພະຍາດ ເອສໄອວີ/ພະຍາດເອດ </a> </span>
As this small sample illustrates, the nesting of <span> tags in the source is erratic, and although line breaks are uniformly set with a <br/>, that tag can occur within any element, at any nesting level. To make matters worse, three statutory instruments in the listing turned out to be linked with href on empty <a> tags, making them invisible to the naked eye, and virtually impossible to associate with their respective descriptive titles, which are buried in entirely separate “nearby” <span> elements.
It quickly became apparent that XPath wasn’t going to cut it. After toying with the idea of contacting the National Assembly offices directly to see if the source could be cleaned up, I had a go with regular expressions against the serialized HTML source. This worked out better than expected, and I was at least able to finish the job within the day.
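To give a feel for the approach, here is a minimal sketch of regular-expression parsing against serialized HTML of the kind shown in the sample above. It is not the translator's actual code: the function name extractEntries, the pattern, and the shape of the returned objects are all assumptions made for illustration.

```javascript
// Hypothetical helper (not taken from the translator itself): pull
// (number, href, title) triples out of a serialized HTML string by keying
// on the visible "NN. <a ...>title</a>" pattern of the listing.
function extractEntries(html) {
  // A list number and a dot, then an anchor carrying an href. The anchor
  // body is matched lazily so that stray <br /> tags inside it are captured
  // and can be stripped afterwards.
  const entryPattern = /(\d+)\.\s*<a\s[^>]*?href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/g;
  const entries = [];
  for (const match of html.matchAll(entryPattern)) {
    const title = match[3]
      .replace(/<[^>]*>/g, " ") // drop nested tags such as <br />
      .replace(/\s+/g, " ")     // collapse runs of whitespace
      .trim();
    entries.push({ number: Number(match[1]), href: match[2], title: title });
  }
  return entries;
}

// Two entries borrowed from the sample markup above.
const sample =
  '<span style="font-family: Saysettha OT;">37. ' +
  '<a href="/docs/lao/laws/econ/Industrial%20Processing%20Law%20%281999%29%20Lao.pdf" target="_blank">' +
  ' ກົດໝາຍວ່າດ້ວຍ ອຸດສະຫະກໍາປຸງແຕ່ງ <br /> </a>38. ' +
  '<a href="/docs/lao/laws/econ/101_Protection.pdf" target="_blank">' +
  ' ກົດໝາຍວ່າດ້ວຍ ການປົກປ້ອງຜູ້ຊົມໃຊ້ </a> <br /></span>';

console.log(extractEntries(sample));
```

Because the pattern ignores <span> nesting entirely, it does not care where the <br /> tags fall. The links sitting on empty <a> tags would need extra handling that this sketch does not attempt.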
A small peculiarity of the translator is that it works only in “multiple” mode (showing the folder icon when the browser is displaying a statute list). This is unavoidable, since the statutes themselves are PDF files, with no intermediate page to provide the details of an individual instrument. The listing itself provides (just) enough information to generate a citation: the title is extracted from the listing; the year seems to be included as an element of the filename more often than not; and the jurisdiction is known, of course, to be “Lao People’s Democratic Republic”.
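As a rough illustration of how a citation can be pieced together from the listing alone, the sketch below guesses the year from the PDF filename and fills in the fixed jurisdiction. The function names and the plain-object field names (title, year, jurisdiction, url) are hypothetical, and so is the rule that the first standalone four-digit number starting with 19 or 20 is the year. None of this is taken from the translator itself.

```javascript
// The jurisdiction is fixed for every item on the site.
const JURISDICTION = "Lao People’s Democratic Republic";

// Hypothetical helper: guess the year from the last path segment of an href.
// Returns a four-digit string, or null when the filename carries no year.
function yearFromHref(href) {
  const rawName = href.split("/").pop();
  let fileName;
  try {
    fileName = decodeURIComponent(rawName); // e.g. "%281999%29" becomes "(1999)"
  } catch (e) {
    fileName = rawName; // malformed escapes: fall back to the raw name
  }
  // Digit lookarounds are used instead of \b because underscores count as
  // word characters, so "_2010." would not sit on a word boundary.
  const match = fileName.match(/(?<!\d)(?:19|20)\d{2}(?!\d)/);
  return match ? match[0] : null;
}

// Hypothetical helper: assemble the metadata for one listing entry, where
// entry is assumed to be an object of the form { href, title }.
function buildCitation(entry) {
  return {
    title: entry.title,
    year: yearFromHref(entry.href),
    jurisdiction: JURISDICTION,
    url: entry.href
  };
}

console.log(buildCitation({
  href: "/docs/lao/laws/econ/102_final_statictics_law_09_8_2010.pdf",
  title: "ກົດໝາຍວ່າດ້ວຍ ສະຖິຕິ"
}));
```

Of the four filenames in the sample above, 101_Protection.pdf has no year at all, so the year has to be treated as optional.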
Apart from the year element, the file naming scheme for statutes seems to follow no particular pattern within the Lao and the English lists, nor across them. Because the Lao and the (possibly outdated) English versions are presented on separate pages, and are listed in a different sequence on each, it is not possible to automatically associate a Lao statutory text with its English counterpart. Items grabbed from each list would therefore end up as independent items in MLZ, whereas the Lao metadata should be treated as official, with the English metadata registered as variant fields on the same item. Since the translator is intended for use by researchers versed in Lao in any case, I’ll be limiting the translator to the Lao listings. If required, English metadata must be added by hand after items are grabbed with the translator.
Despite my whiny little complaints about the site, the regexp parsing strategy (which basically leverages the visual structure of the listings) should be reasonably robust. It recovers the statutory texts that are “hidden” by the markup, and when I get to it I’ll look into extending the translator to work with the Web Archive images of the same site, for sorta-point-in-time support of the nation’s statutes.
For those interested, here is the translator code.