{"id":4893,"date":"2011-04-11T16:55:52","date_gmt":"2011-04-11T21:55:52","guid":{"rendered":"http:\/\/journal.code4lib.org\/?p=4893"},"modified":"2011-04-11T16:59:04","modified_gmt":"2011-04-11T21:59:04","slug":"from-isis-to-couchdb-databases-and-data-models-for-bibliographic-records","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/4893","title":{"rendered":"From ISIS to CouchDB: Databases and Data Models for Bibliographic Records"},"content":{"rendered":"<p>by Luciano G. Ramalho<\/p>\n<h2><a name=\"section1\"><\/a>1. Introduction<\/h2>\n<p>The relational database model is well grounded in theory and well supported by the software industry. Relational database management systems (RDBMS) replaced legacy hierarchical and network database products in nearly every enterprise and became the mainstream. Object-oriented DBMS have not taken the leading role from relational systems, even after object-oriented programming became the norm, and in spite of the so-called \u201cimpedance mismatch\u201d perceived when one needs to store nested objects into a collection of flat tables.<\/p>\n<p>After resisting the object-oriented wave, relational databases became so dominant that they seemed like the only game in town, or at least the only game in most shops. But in the last five years, Google published a paper on BigTable (<a href=\"#chang2006\">Chang, 2006<\/a>), the proprietary, distributed non-relational database behind many of its online properties. Then Amazon.com revealed Dynamo, the non-relational database that powers some of their AWS cloud-based services (<a href=\"#decandia2007\">Decandia, 2007<\/a>). Facebook open-sourced Cassandra, the distributed non-relational database created to enable its crucial in-box search feature. 
Cassandra is now also used by Twitter, Digg, Rackspace and Cisco, among many others, and has such a vibrant community that it became a top-level project of the Apache Software Foundation in 2010, less than three years after its initial public release (<a href=\"#apache2011a\">Apache, 2011a<\/a>). These developments are part of a trend that became known, in 2009, as the \u201cNoSQL\u201d movement. NoSQL stands for \u201cNo SQL\u201d or \u201cNot Only SQL\u201d, depending on who you ask, but everyone agrees it&#8217;s not really about the SQL language, but about seeking alternative answers to problems that are not amenable to relational solutions.<\/p>\n<p>Meanwhile, thousands of small to medium libraries in developing countries have been oblivious to all this and continue using their ISIS databases for daily operations. ISIS is a family of non-relational database systems with a history that goes back to the 1970s. It was developed by UNESCO \u2013 the United Nations Educational, Scientific and Cultural Organization \u2013 specifically for bibliographic data, and it is still officially distributed and widely used (<a href=\"#hopkinson2005\">Hopkinson, 2005<\/a>).<\/p>\n<p>ISIS had a very positive impact, allowing libraries with limited resources to computerize their catalogs on simple, stand-alone PCs. Some ISIS products became Web-enabled, but progress has been very slow in the last 10 years. Meanwhile, a couple of new Open Source NoSQL databases present a possible migration path for ISIS users, a path that will not require wholesale normalization of decades of bibliographic data. This paper describes how BIREME\/PAHO\/WHO, a digital library that is part of the Pan American Health Organization, and a large-scale user of ISIS databases, has studied the viability of migrating one of its main databases to Apache CouchDB, a modern document database.<\/p>\n<p>The rest of the paper is structured as follows. 
<a href=\"#section2\">Section 2<\/a> discusses data models for bibliographic records, explains why it is tempting for librarians to look beyond the relational model and compares the structure of MARC records to the semistructured data model formalized in the 1990s. <a href=\"#section3\">Section 3<\/a> briefly presents ISIS, then compares two recent products that support semistructured records, CouchDB and MongoDB, and shows our criteria for choosing between them. <a href=\"#section4\">Section 4<\/a> discusses different formats for representing ISIS records in CouchDB, then describes the tool we developed for converting records and the process used to load them into CouchDB. <a href=\"#section5\">Section 5<\/a> explains how indexing works in ISIS and CouchDB, and gives examples of queries on the latter. Finally, <a href=\"#section6\">Section 6<\/a> summarizes what we have done, our conclusions and ongoing work.<\/p>\n<h2><a name=\"section2\"><\/a>2. Data models for bibliographic records<\/h2>\n<p>One of the defining characteristics of the relational data model, the First Normal Form (1NF), dictates that attributes must be atomic. In other words, in the industry-standard flat relational model<sup><a name=\"noteorig1\" href=\"#note1\">1<\/a><\/sup>, fields can neither be structured into subfields nor contain multiple values.<\/p>\n<p>On the other hand, multivalued fields are useful and common for representing bibliographic data. Books often have more than one author and cover multiple subjects. A publisher, while logically a single attribute of a book record, has attributes of its own, like a name and a place. It is therefore useful to split a publisher field into parts. That is why MARC supports subfields and multivalued fields through repeating tags (similar to XML). 
The MARC structure, or \u201cempty container\u201d<sup><a name=\"noteorig2\" href=\"#note2\">2<\/a><\/sup>, carried over to the ISO-2709 standard, which describes a generic data interchange format allowing applications to attach any meaning to the tags and subfield markers. Here we use the term \u201cISO-2709 data model\u201d for a generalization of the data model implied by the MARC record structure.<\/p>\n<p>Of course, library data is often stored in relational databases, but normalization means that the description of a single item is spread over multiple records in several tables. That may work well within the context of one library, but when exchanging data, it is more practical to have just one record per item. That is one reason why MARC is still with us, and why XML became so important in a world dominated by relational databases.<\/p>\n<p>If the ISO-2709 data model does not adhere to the flat relational model, then what is the theory behind it? At first there was none, but in the mid-1990s, while XML was still under development, researchers from UPenn, Stanford, AT&#038;T Labs and the INRIA French R&#038;D agency conceived the semistructured<sup><a name=\"noteorig3\" href=\"#note3\">3<\/a><\/sup> data model (<a href=\"#abiteboul1999\">Abiteboul, 1999<\/a>). Their work supports reasoning not only about MARC and ISO-2709 records but also about the richer data models of XML and JSON<sup><a name=\"noteorig4\" href=\"#note4\">4<\/a><\/sup> documents.<\/p>\n<p>A useful definition by Dan Suciu, a pioneer of semistructured data research, appears in the \u201cEncyclopedia of Database Systems\u201d (<a href=\"#suciu2009\">Suciu, 2009<\/a>):<\/p>\n<blockquote><p> The semi-structured data model is designed as an evolution of the relational data model that allows the representation of data with a flexible structure. Some items may have missing attributes, others may have extra attributes, some items may have two or more occurrences of the same attribute. 
The type of an attribute is also flexible: it may be an atomic value or it may be another record or collection. Moreover, collections may be heterogeneous, i.e., they may contain items with different structures. The semi-structured data model is a self-describing data model, in which the data values and the schema components co-exist. <\/p><\/blockquote>\n<p>Thanks to the semistructured data model, we don&#8217;t need to feel ashamed of our denormalized bibliographic records any more. Even better, we have a framework to evaluate databases of the current NoSQL crop with regard to our needs as caretakers of denormalized datasets.<\/p>\n<p>The book \u201cData on the Web: From Relations to Semistructured Data and XML\u201d (<a href=\"#abiteboul1999\">Abiteboul, 1999<\/a>) introduces a notation for representing semistructured data. It is called \u201cssd-expression\u201d and it looks like this:<\/p>\n<pre>\r\n  {name: {first: \"Lewis\", last: \"Carroll\"},\r\n   tel: 5553457,\r\n   email: \"cld@ox.ac.uk\",\r\n   email: \"lcarroll@pobox.com\"\r\n  }\r\n<\/pre>\n<p>Except for the unquoted keys and the repetition of the <code>email<\/code> key, it&#8217;s very similar to JSON (<a href=\"#crockford2006a\">Crockford, 2006a<\/a>). The presence of keys describing fields like <code>name<\/code>, <code>first<\/code> etc., and the storage of those keys alongside the data, is what the encyclopedia entry means by coexisting data values and schema components (<a href=\"#suciu2009\">Suciu, 2009<\/a>). It is a characteristic shared by ISO-2709, XML, and JSON (with the limitation that ISO-2709 fields are identified by tags composed of three alphanumeric ASCII characters, though most applications only use numeric tags). 
The repeated <code>email<\/code> key could be represented similarly in XML and in ISO-2709, but in JSON the keys must be unique within an object structure, so multivalued fields are often represented as lists of values:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n  {&quot;name&quot;: {&quot;first&quot;: &quot;Lewis&quot;, &quot;last&quot;: &quot;Carroll&quot;},\r\n   &quot;tel&quot;: 5553457,\r\n   &quot;email&quot;: &#x5B;&quot;cld@ox.ac.uk&quot;, &quot;lcarroll@pobox.com&quot;]\r\n  }\r\n<\/pre>\n<p>The ISO-2709 record format influenced the record structure of the ISIS family of databases, introduced by UNESCO in the 1970s. ISIS records follow the structure of the ISO-2709 standard, and most ISIS systems import from and export to that format. Here is the same record from the examples above, as seen in a common ISIS display format:<\/p>\n<pre>\r\n  10 \u00ab^fLewis ^lCarroll\u00bb\r\n  20 \u00ab5553457\u00bb\r\n  30 \u00abcld@ox.ac.uk\u00bb\r\n  30 \u00ablcarroll@pobox.com\u00bb\r\n<\/pre>\n<p>In an ISIS record the field names are replaced by numeric tags. The \u00ab\u00bb delimiters are not part of the content, but a feature of the display format. Here there are two occurrences of tag #30, the e-mail field. Field #10 is split into two subfields (&#8220;f&#8221; and &#8220;l&#8221;) by special markers. Instead of the subfield delimiter control character used in MARC systems, ISIS uses the circumflex accent: ^ (ASCII 94). A subfield identifier is case-insensitive and consists of just one ASCII letter or digit. The ISIS data model is simpler than that of ISO-2709 because it does not have field indicators<sup><a name=\"noteorig5\" href=\"#note5\">5<\/a><\/sup> and subfields are not repeatable<sup><a name=\"noteorig6\" href=\"#note6\">6<\/a><\/sup>.<\/p>\n<p>Clearly the ISIS data model is not as flexible as the general semistructured data model. Given the syntax of subfield markers, ISIS fields are limited to one level of nesting only. 
In other words, to borrow XML jargon, subfields cannot have child elements, just character data. Also, an ISIS field may have mixed content<sup><a name=\"noteorig7\" href=\"#note7\">7<\/a><\/sup>: some unmarked text may appear before the first subfield marker. For example, consider this field:<\/p>\n<pre>\r\n  18 \u00abLords of Finance^bThe Bankers Who Broke the World\u00bb\r\n<\/pre>\n<p>The main title, &#8220;Lords of Finance&#8221;, is not preceded by a subfield marker. The semistructured data model, as described by Abiteboul et al., has no concept of mixed content as XML does. So we will call that part \u201cthe main subfield\u201d and pretend it is preceded by an implied &#8220;^_&#8221; (underscore) subfield marker whenever we need to refer to it in code. With this simple arrangement, any ISIS record can be represented as an ssd-expression or a JSON record.<\/p>\n<h2><a name=\"section3\"><\/a>3. Semistructured database systems<\/h2>\n<h3>3.1. The ISIS family<\/h3>\n<p>In 1985, Giampaolo Del Bigio ported UNESCO&#8217;s CDS\/ISIS mainframe database system to PC\/DOS (<a href=\"#lopes2010\">Lopes, 2010<\/a>). The result, called MicroISIS, was distributed free of charge and became the <i>de facto<\/i> standard for computerized library catalogs in developing countries. Its Windows version, WinISIS, is still distributed by UNESCO and remains widely used in small to medium libraries (<a href=\"#hopkinson2005\">Hopkinson, 2005<\/a>). Beyond local OPACs, ISIS is also used by two large regional bibliographic databases in Latin America and the Caribbean: SciELO (Scientific Electronic Library Online) and LILACS (Latin American and Caribbean Health Sciences index), both built by BIREME\/PAHO\/WHO, which is a specialized center of the Pan American Health Organization located in S\u00e3o Paulo, Brazil. Together, SciELO and LILACS routinely handle millions of bibliographic records using CISIS, a custom version of ISIS developed by BIREME\/PAHO\/WHO. 
MicroISIS, WinISIS, CISIS and the more recent J-ISIS from UNESCO are all interoperable and form the ISIS family of database systems.<\/p>\n<p>The legacy ISIS codebases are showing their age, and after years of closed-source development, none of the derivatives has managed to become a successful Open Source project. Meanwhile, the NoSQL movement has highlighted some Open Source non-relational databases implementing variations of the semistructured data model. This has motivated BIREME\/PAHO\/WHO to look for alternatives.<\/p>\n<h3>3.2. CouchDB and MongoDB<\/h3>\n<p>Among the recent crop of NoSQL systems, two products stand out for implementing the semistructured data model in a way that matches the operational needs of BIREME\/PAHO\/WHO: Apache CouchDB and MongoDB (<a href=\"#apache2011b\">Apache, 2011b<\/a>, <a href=\"#mongodb2011\">MongoDB.org, 2011<\/a>). Their data model is not as flat as that of key-value stores (optimized for retrieving blobs given a primary key), nor as deep as that of graph databases (designed to allow general queries over paths of nested objects). What they offer is somewhere in between: JSON-like records allowing nested structures, and expressive query languages to index and retrieve those records. Those rich records are called \u201cdocuments\u201d, and CouchDB and MongoDB call themselves \u201cdocument databases\u201d.<\/p>\n<p>In the case of CouchDB, the document format is JSON. MongoDB uses BSON, a binary format inspired by JSON but offering more data types, such as int32, datetime, and even a type called \u201cJavaScript code w\/ scope\u201d (<a href=\"#dirolf2010\">Dirolf, 2010<\/a>). JSON can be trivially converted to BSON, but the reverse may not be so easy, depending on the data. 
While the difference between those formats can be considered an implementation detail, it reflects different priorities.<\/p>\n<p>MongoDB is optimized for fast updating, by aggressively caching writes and by overwriting updated records in place whenever possible: the BSON structure is designed to allow updating specific fields of an existing document. The flip-side of this optimization is that a software crash can cause data loss, and a minimum deployment of two servers in master-slave configuration is recommended if durability is required<sup><a name=\"noteorig8\" href=\"#note8\">8<\/a><\/sup>. In contrast, CouchDB is fault-tolerant: it only appends to the database file on disk, and that is an atomic operation in modern operating systems. As a result, the database file remains consistent even in the event of a software crash, and backups can be made while the system is running. The drawbacks of the append-only design are the need for periodic database compaction \u2013 a time-consuming batch operation \u2013 and slower updates.<\/p>\n<p>CouchDB was influenced by Lotus Notes, a networked, collaborative application suite designed to support users who are often off-line. Accordingly, CouchDB allows master-master replication, that is, synchronization between peer nodes which have received inserts and updates independently. MongoDB supports only master-slave replication: only one node receives updates and inserts, and replicates to the slaves.<\/p>\n<p>While CouchDB can be used in large clusters, it is also well-suited as a database for small desktop apps. For that reason it has come pre-installed on Ubuntu Linux since 2009. It even runs on Android mobile devices (a free, one-click installer can be found in the Android Marketplace). 
Because CouchDB supports HTTP and JSON natively, and can run JavaScript procedures on the server, applications can be developed without any middleware: browsers communicate directly with the database, and all the logic is written in JavaScript.<\/p>\n<p>A more detailed comparison appears in the informative post \u201cComparing Mongo DB and Couch DB\u201d, written by Dwight Merriman, CEO of 10gen, the company behind MongoDB (<a href=\"#merriman2010\">Merriman, 2010<\/a>). Given the characteristics of these systems, we chose CouchDB for our LILACS experiments for the following reasons:<\/p>\n<ul>\n<li>Easier deployment in a durable configuration.<\/li>\n<li>Support for master-master replication, useful for distributed cooperative cataloging applications.<\/li>\n<li>Direct support for JSON over HTTP, enabling Web Services and AJAX applications without middleware.<\/li>\n<\/ul>\n<p>However, we do envision scenarios in which we would like to use both MongoDB and CouchDB. For example, the canonical store of bibliographic data could be CouchDB, while a cluster of MongoDB instances could be used for user-provided content, tracking, recommendations and other features demanding faster database writes.<\/p>\n<h2><a name=\"section4\"><\/a>4. Loading ISIS records into CouchDB<\/h2>\n<h3>4.1. Alternative representations of ISIS records as JSON documents<\/h3>\n<p>The ISIS data model, a subset of the ISO-2709 model, is not as expressive as the JSON data model. There are several ways to represent the same ISIS record in JSON. 
Consider this abridged example of a LILACS record:<\/p>\n<pre>\r\n   1 \u00abBR1.1\u00bb\r\n   2 \u00ab538886\u00bb\r\n   4 \u00abLILACS\u00bb\r\n   4 \u00abLLXPEDT\u00bb\r\n   5 \u00abS\u00bb\r\n   6 \u00abas\u00bb\r\n   8 \u00abInternet^ihttp:\/\/\u2026\/imageBank\/PDF\/v3n3a04.pdf?aid2=168&amp;\u2026\u00bb\r\n  10 \u00abKanda, Paulo Afonso de Medeiros^1University of S\u00e3o Paulo\r\n      ^2School of Medicine^3Cognitive Disorders of Clinicas\r\n      Hospital Reference Center^pBrasil ^cS\u00e3o Paulo^rorg\u00bb\r\n  10 \u00abAnghinah, Renato^1University of S\u00e3o Paulo^2School of\r\n      Medicine ^3Cognitive Disorders of Clinicas Hospital\r\n      Reference Center ^pBrasil^cS\u00e3o Paulo^rorg\u00bb\r\n  12 \u00abThe Clinical use of quantitative EEG in cognitive disorders\r\n      ^ien\u00bb\r\n  12 \u00abA utiliza\u00e7\u00e3o cl\u00ednica do EEG quantitativo nos transtornos\r\n      cognitivos^ipt\u00bb\r\n  30 \u00abDement. neuropsychol\u00bb\r\n  31 \u00ab3\u00bb\r\n  32 \u00ab3\u00bb\r\n  35 \u00ab1980-5764\u00bb\r\n<\/pre>\n<p>Tag #2 is the LILACS identifier, a non-repeating, numeric field, and tag #10 is an author field in an analytic (article-level) record. The LILACS data dictionary (<a href=\"#bireme2008\">BIREME, 2008<\/a>) defines #10 as a repeating field. 
In addition, it is composed of the following subfields:<\/p>\n<table>\n<tr>\n<th>subfield<\/th>\n<th>description<\/th>\n<th>sample content<\/th>\n<\/tr>\n<tr>\n<th>main<\/th>\n<td>author name<\/td>\n<td>Kanda, Paulo Afonso de Medeiros<\/td>\n<\/tr>\n<tr>\n<th>^1<\/th>\n<td>affiliation level 1<\/td>\n<td>University of S\u00e3o Paulo<\/td>\n<\/tr>\n<tr>\n<th>^2<\/th>\n<td>affiliation level 2<\/td>\n<td>School of Medicine<\/td>\n<\/tr>\n<tr>\n<th>^3<\/th>\n<td>affiliation level 3<\/td>\n<td>Cognitive Disorders of Clinicas Hospital Reference Center<\/td>\n<\/tr>\n<tr>\n<th>^p<\/th>\n<td>country<\/td>\n<td>Brasil<\/td>\n<\/tr>\n<tr>\n<th>^c<\/th>\n<td>city<\/td>\n<td>S\u00e3o Paulo<\/td>\n<\/tr>\n<tr>\n<th>^r<\/th>\n<td>degree of responsibility<\/td>\n<td>org (organizer)<\/td>\n<\/tr>\n<\/table>\n<p>One way to represent such a record in JSON is shown below (only fields #10 and #2 are shown; some contents shortened for clarity):<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n{\r\n    &quot;10&quot;: &#x5B;\r\n        {\r\n            &quot;_&quot;: &quot;Kanda, Paulo Afonso&quot;,\r\n            &quot;1&quot;: &quot;University of S\u00e3o Paulo&quot;,\r\n            &quot;2&quot;: &quot;School of Medicine&quot;,\r\n            &quot;3&quot;: &quot;Cognitive Disorders of \u2026 Reference Center&quot;,\r\n            &quot;c&quot;: &quot;S\u00e3o Paulo&quot;,\r\n            &quot;p&quot;: &quot;Brasil&quot;,\r\n            &quot;r&quot;: &quot;org&quot;\r\n        },\r\n        {\r\n            &quot;_&quot;: &quot;Smidth, Magali Taino&quot;,\r\n            &quot;1&quot;: &quot;University of S\u00e3o Paulo&quot;,\r\n            &quot;2&quot;: &quot;School of Medicine&quot;,\r\n            &quot;3&quot;: &quot;Cognitive Disorders of \u2026 Reference Center&quot;,\r\n            &quot;c&quot;: &quot;S\u00e3o Paulo&quot;,\r\n            &quot;p&quot;: &quot;Brasil&quot;,\r\n            &quot;r&quot;: &quot;org&quot;\r\n        }\r\n    ],\r\n    
&quot;2&quot;: &#x5B;\r\n        {\r\n            &quot;_&quot;: &quot;538886&quot;\r\n        }\r\n    ]\r\n}\r\n<\/pre>\n<p>In the representation above, the record is a JSON mapping between tags (e.g. \u201c10\u201d) and lists of values (or occurrences in ISIS jargon). The use of lists allows any field to be repeatable, so any possible ISIS record fits this scheme. Also, within the list of occurrences, each one is represented by a mapping of subfield keys and values. The special key \u201c_\u201d (underscore) is used to denote the main subfield, and when there are no subfields that is the only key. This creates unnecessary nesting, as demonstrated by field #2, which is the LILACS identifier field, non-repeating and devoid of subfields as defined by the LILACS data dictionary. But it does make the structure homogeneous, an important feature when dealing with legacy semistructured datasets where not all records adhere to the current schema. During our research we called the representation above ISIS-JSON type 3 (<a href=\"#bireme2010\">BIREME, 2010<\/a>).<\/p>\n<p>The main drawback of ISIS-JSON type 3 is that, by definition, JSON mappings are unordered<sup><a name=\"noteorig9\" href=\"#note9\">9<\/a><\/sup>. Therefore, neither the tag order within the record nor the subfield order within a field occurrence is preserved. In the LILACS database, tag order within a record is only relevant when there are repeating tags, such as #10 (author, analytical level). Fortunately, keeping repeated occurrences in a list does preserve their relative ordering, even if the overall field order is not kept. For example, in the original ISIS record under analysis, field #2 appeared before all occurrences of field #10, while in the JSON representation the key \u201c10\u201d precedes key \u201c2\u201d. This is irrelevant in practice. 
But, crucially, the order of the authors is the same in both formats: first \u201cKanda, Paulo Afonso\u201d, then \u201cSmidth, Magali Taino\u201d.<\/p>\n<p>Regarding subfield ordering within a field, the LILACS data dictionary does establish a canonical ordering. For example, in the case of field #10 the order is _, 1, 2, 3, p, c, r (the main subfield is always first, by definition). However, as Jason Thomale points out in \u201cInterpreting MARC: Where\u2019s the Bibliographic Data?\u201d (<a href=\"#thomale2010\">Thomale, 2010<\/a>), subfield markers should be interpreted as textual markup, and their ordering is significant. Particularly when cleaning up or converting legacy data, maintaining the order of subfields from the original record may be important to ascertain the original cataloger&#8217;s intention, even if LILACS subfields do have a prescribed ordering.<\/p>\n<p>To preserve subfield ordering, two alternative JSON representations have been tried at BIREME\/PAHO\/WHO. The first is ISIS-JSON type 1 (line breaks added within long strings for clarity):<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n{\r\n    &quot;10&quot;: &#x5B;\r\n        &quot;Kanda, Paulo Afonso de Medeiros^1University of S\u00e3o Paulo\r\n          ^2School of Medicine^3Cognitive Disorders of Clinicas\r\n          Hospital Reference Center^pBrasil ^cS\u00e3o Paulo^rorg&quot;,\r\n        &quot;Smidth, Magali Taino^1University of S\u00e3o Paulo ^2School\r\n          of Medicine^3Cognitive Disorders of Clinicas Hospital\r\n          Reference Center^pBrasil ^cS\u00e3o Paulo^rorg&quot;\r\n    ],\r\n    &quot;2&quot;: &#x5B;&quot;538886&quot;]\r\n}\r\n<\/pre>\n<p>In this format, each field occurrence is a single string, with the subfield delimiters embedded. This essentially \u201cpunts\u201d on the issue of how to represent subfields, leaving the parsing to be done by the database. 
Because CouchDB uses JavaScript as its default language for creating indexes, this is a viable approach. Another possibility is ISIS-JSON type 2:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n{\r\n    &quot;10&quot;: &#x5B;\r\n        &#x5B;\r\n            &#x5B;&quot;_&quot;, &quot;Kanda, Paulo Afonso&quot;],\r\n            &#x5B;&quot;1&quot;, &quot;University of S\u00e3o Paulo&quot;],\r\n            &#x5B;&quot;2&quot;, &quot;School of Medicine&quot;],\r\n            &#x5B;&quot;3&quot;, &quot;Cognitive Disorders of \u2026 Reference Center&quot;],\r\n            &#x5B;&quot;p&quot;, &quot;Brasil&quot;],\r\n            &#x5B;&quot;c&quot;, &quot;S\u00e3o Paulo&quot;],\r\n            &#x5B;&quot;r&quot;, &quot;org&quot;]\r\n        ],\r\n        &#x5B;\r\n            &#x5B;&quot;_&quot;, &quot;Smidth, Magali Taino&quot;],\r\n            &#x5B;&quot;1&quot;, &quot;University of S\u00e3o Paulo&quot;],\r\n            &#x5B;&quot;2&quot;, &quot;School of Medicine&quot;],\r\n            &#x5B;&quot;3&quot;, &quot;Cognitive Disorders of \u2026 Reference Center&quot;],\r\n            &#x5B;&quot;p&quot;, &quot;Brasil&quot;],\r\n            &#x5B;&quot;c&quot;, &quot;S\u00e3o Paulo&quot;],\r\n            &#x5B;&quot;r&quot;, &quot;org&quot;]\r\n        ]\r\n    ],\r\n    &quot;2&quot;: &#x5B;\r\n        &#x5B;\r\n            &#x5B;&quot;_&quot;, &quot;538886&quot;]\r\n        ]\r\n    ]\r\n}\r\n<\/pre>\n<p>Here each field occurrence is an association list<sup><a name=\"noteorig10\" href=\"#note10\">10<\/a><\/sup>, that is, a mapping represented as a list of lists, where each inner list contains a key-value pair. The advantage of an association list over a JSON object mapping is that the order of items is preserved, at the cost of a linear search to locate a key within a field occurrence. Because of how indexing works in CouchDB, this is not as costly as it may seem, as will be seen in the next section. 
For MARC-style records, the association list has the added advantage of allowing repeated mappings to represent repeating subfields.<\/p>\n<h3>4.2. The isis2json conversion tool<\/h3>\n<p>To convert ISIS records into JSON structures, we developed a Python script called <code>isis2json.py<\/code> (<a href=\"#bireme2010b\">BIREME, 2010b<\/a>). It can be executed with both the Python and Jython interpreters, versions 2.5 through 2.7. When running under Python it can read only ISO-2709 files, but as a Jython script it leverages the Bruma Java library (<a href=\"#barbieri2011\">Barbieri, 2011<\/a>) and can also read binary ISIS files in .MST format directly. Several options control the structure of the JSON output. For example, the command line below generates output suitable for batch importing to CouchDB:<\/p>\n<pre>\r\n$ .\/isis2json.py cds.iso -c -t 3 -q 100 > cds1.json\r\n<\/pre>\n<p>The arguments used in the above example are:<\/p>\n<p><code>-c<\/code> to generate records within a \u201cdocs\u201d list, as required by the CouchDB <code>_bulk_docs<\/code> API;<\/p>\n<p><code>-t 3<\/code> for ISIS-JSON type 3 export (fields as dictionaries of subfields);<\/p>\n<p><code>-q 100<\/code> to output only 100 records;<\/p>\n<p>Here is the help screen of isis2json.py:<\/p>\n<pre>\r\n$ .\/isis2json.py -h\r\nusage: isis2json.py [-h] [-o OUTPUT.json] [-c] [-m] [-t ISIS_JSON_TYPE]\r\n                    [-q QTY] [-s SKIP] [-i TAG_NUMBER] [-u] [-p PREFIX] [-n]\r\n                    [-k TAG:VALUE]\r\n                    INPUT.(mst|iso)\r\n\r\nConvert an ISIS .mst or .iso file to a JSON array\r\n\r\npositional arguments:\r\n  INPUT.(mst|iso)       .mst or .iso file to read\r\n\r\noptional arguments:\r\n  -h, --help            show this help message and exit\r\n  -o OUTPUT.json, --out OUTPUT.json\r\n                        the file where the JSON output should be written\r\n                        (default: write to stdout)\r\n  -c, --couch           output array within a \"docs\" 
item in a JSON document\r\n                        for bulk insert to CouchDB via POST to db\/_bulk_docs\r\n  -m, --mongo           output individual records as separate JSON\r\n                        dictionaries, one per line for bulk insert to MongoDB\r\n                        via mongoimport utility\r\n  -t ISIS_JSON_TYPE, --type ISIS_JSON_TYPE\r\n                        ISIS-JSON type, sets field structure: 1=string,\r\n                        2=alist, 3=dict\r\n  -q QTY, --qty QTY     maximum quantity of records to read (default=ALL)\r\n  -s SKIP, --skip SKIP  records to skip from start of .mst (default=0)\r\n  -i TAG_NUMBER, --id TAG_NUMBER\r\n                        generate an \"_id\" from the given unique TAG field\r\n                        number for each record\r\n  -u, --uuid            generate an \"_id\" with a random UUID for each record\r\n  -p PREFIX, --prefix PREFIX\r\n                        concatenate prefix to every numeric field tag (ex. 99\r\n                        becomes \"v99\")\r\n  -n, --mfn             generate an \"_id\" from the MFN of each record\r\n                        (available only for .mst input)\r\n  -k TAG:VALUE, --constant TAG:VALUE\r\n                        Include a constant tag:value in every record (ex. -k\r\n                        type:AS)\r\n<\/pre>\n<h3>4.3. Interacting with CouchDB<\/h3>\n<p>HTTP is the main protocol for interacting with CouchDB<sup><a name=\"noteorig11\" href=\"#note11\">11<\/a><\/sup>. 
Most database administration is done via Futon, its built-in web interface.<\/p>\n<div style=\"width: 500px; margin: auto;\"><a href=\"\/media\/issue13\/ramalho\/fig1-futon.png\"><img decoding=\"async\" class=\"caption\" src=\"\/media\/issue13\/ramalho\/fig1-futon-500x422.png\" alt=\"Futon screen shot: viewing a record\" \/><\/a><\/div>\n<p class=\"caption\"><strong>Figure 1.<\/strong> CouchDB Futon web interface showing a LILACS record in ISIS-JSON type 2 format.<\/p>\n<p>Nearly every task that can be done via Futon can also be automated using the RESTful API of CouchDB via any HTTP client. In fact, interacting with CouchDB is a great way to learn, in practice, what REST and RESTful interfaces are all about. Here we use cURL, a command-line HTTP client available for many systems. For example, a database is created with a simple PUT request:<\/p>\n<pre>\r\n$ curl -X PUT http:\/\/admin_user:password@127.0.0.1:5984\/lilacs\r\n<\/pre>\n<p>One operation that cannot be done via Futon is uploading records in bulk. Continuing the previous example, we could convert an ISO-2709 file to JSON, then upload it to CouchDB using cURL in two steps (the <code>@<\/code> prefix tells cURL to read the request body from the named file):<\/p>\n<pre>\r\n$ .\/isis2json.py cds.iso -c > cds1.json\r\n$ curl -d @cds1.json -H\"Content-Type: application\/json\" \\\r\n       -X POST http:\/\/127.0.0.1:5984\/lilacs\/_bulk_docs\r\n<\/pre>\n<p>When used together, the <code>-q<\/code> and the <code>-s<\/code> (skip) options allow splitting the output into several batches, which results in faster loading for large datasets on multi-core servers. 
For example, loading 20,000 full LILACS records, totaling 64MB, took 147 seconds using this shell command:<\/p>\n<pre>\r\n$ isis2json.py -c -p v -i 2 -q 20000 lilacs100k.iso | \\\r\n  curl -d @- -H\"Content-Type: application\/json\" \\\r\n       -X POST http:\/\/127.0.0.1:5984\/lilacs\/_bulk_docs\r\n<\/pre>\n<p>Loading the same data in two 10,000-record batches took 100 seconds when taking advantage of two cores with a simple shell script like this one:<\/p>\n<pre>\r\n#!\/bin\/bash\r\nisis2json.py -c -p v -i 2 -q 10000 lilacs100k.iso | \\\r\ncurl -d @- -H\"Content-Type: application\/json\" \\\r\n     -X POST http:\/\/127.0.0.1:5984\/lilacs\/_bulk_docs &\r\nisis2json.py -c -p v -i 2 -q 10000 -s 10000 lilacs100k.iso | \\\r\ncurl -d @- -H\"Content-Type: application\/json\" \\\r\n     -X POST http:\/\/127.0.0.1:5984\/lilacs\/_bulk_docs\r\n<\/pre>\n<p>If an <code>_id<\/code> attribute is not present in an inserted document, CouchDB provides one with a UUID (Universally Unique Identifier) like <code>\"ead3af23a4459b2d7a1aef05cb0012a9\"<\/code>. It is highly recommended that an <code>_id<\/code> is given when adding documents to prevent inadvertent duplication of records if a bulk loading process is interrupted and restarted. 
Therefore <code>isis2json.py<\/code> provides the <code>-i<\/code> option, used in the examples above, to fetch the value of one field in the ISIS input and use it as the <code>_id<\/code> attribute.<\/p>\n<p>So, for example, using the <code>-i 2<\/code> option this (partial) ISIS structure:<\/p>\n<pre>\r\n   1 \u00abBR1.1\u00bb\r\n   2 \u00ab538886\u00bb\r\n   4 \u00abLILACS\u00bb\r\n   4 \u00abLLXPEDT\u00bb\r\n<\/pre>\n<p>Can be converted into this JSON document record, ready to import into CouchDB:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n{\r\n    &quot;_id&quot;: &quot;538886&quot;,\r\n    &quot;1&quot;: &#x5B;\r\n        &#x5B;\r\n            &#x5B;&quot;_&quot;, &quot;BR1.1&quot;]\r\n        ]\r\n    ],\r\n    &quot;2&quot;: &#x5B;\r\n        &#x5B;\r\n            &#x5B;&quot;_&quot;, &quot;538886&quot;]\r\n        ]\r\n    ],\r\n    &quot;4&quot;: &#x5B;\r\n        &#x5B;\r\n            &#x5B;&quot;_&quot;, &quot;LILACS&quot;]\r\n        ],\r\n        &#x5B;\r\n            &#x5B;&quot;_&quot;, &quot;LLXPEDT&quot;]\r\n        ]\r\n    ]\r\n}\r\n<\/pre>\n<p>Note how the content of the first occurrence of the #2 tag was used as <code>_id<\/code>.<\/p>\n<h2><a name=\"section5\"><\/a>5. Indexing and querying<\/h2>\n<h3>5.1. Building indexes<\/h3>\n<p>For efficient retrieval, records must be indexed. It is useful to distinguish between a primary index and secondary indexes. A primary index allows retrieval via the primary key. Every NoSQL database uses a primary index to allow queries on the primary key. In fact, \u201ckey-value\u201d databases often provide only this means of retrieval. Document databases also support secondary indexes, allowing fast access via arbitrary attributes of the records. In order to index those attributes, some form of data extraction must be performed on the records. By default, JavaScript is the language used for data extraction in CouchDB<sup><a name=\"noteorig12\" href=\"#note12\">12<\/a><\/sup>. 
In ISIS databases, a simple but terse \u201cISIS Formatting Language\u201d is used to generate secondary indexes, also called inverted files.<\/p>\n<p>Index definition in ISIS is done in a special file called the \u201cField Select Table\u201d (FST). Here are just a few lines from the large FST used to generate indexes for the LILACS database:<\/p>\n<pre>\r\n98 0 v1,\"-\"v2\r\n10 8 ,mpl,'|AU_|'(v10^*|%|\/),\r\n38 0 if s(mpu,v38^a,mpl):'CD' then 'SP_CD-ROM' fi,\r\n<\/pre>\n<p>Without going into all the details, here is what those lines do (the numbering is not relevant to this discussion<sup><a name=\"noteorig13\" href=\"#note13\">13<\/a><\/sup>):<\/p>\n<ul>\n<li>line 98 indexes the contents of tags #1 (cooperating center code) and #2 (record id) concatenated with a hyphen;<\/li>\n<li>line 10 indexes each word of the main subfield of field #10 (author), concatenated with an <code>'AU_'<\/code> prefix;<\/li>\n<li>line 38 indexes the string &#8216;SP_CD-ROM&#8217; if subfield ^a of field #38 contains the string &#8216;CD&#8217;.<\/li>\n<\/ul>\n<p>To generate an inverted file, an ISIS system applies each line of the FST to each record in the database, generating one or more index entries (or postings) per record. Documents can also be indexed incrementally, but BIREME\/PAHO\/WHO does a lot of batch processing and usually does a \u201cfull inversion\u201d, that is, it regenerates the inverted files from scratch.<\/p>\n<p>The overall process is very similar in CouchDB. Instead of an FST, CouchDB allows us to define \u201cviews\u201d, which are the result of running JavaScript functions to generate indexes. The simplest view contains only one function, the <code>map<\/code> function, which receives a document as an argument, extracts some data from it, and calls a special <code>emit<\/code> function to add entries to the index.<\/p>\n<p>The <code>emit<\/code> function takes two arguments: <code>key<\/code> and <code>value<\/code>. 
The <code>key<\/code> argument is the one actually indexed; queries will be made on it. It is usually unstructured, like a string, but sometimes it is useful to have an array as a key, for instance <code>[country, province]<\/code>, to allow multi-level grouping of results. The <code>value<\/code> may be any JSON structure, and it is used to display the results of a query. For example, in an OPAC application we may have an index where each entry has an ISBN as a key, and the book title and year of publication as the value.<\/p>\n<p>If you know some SQL, you may find it easier to think in terms of a simple SELECT statement:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nSELECT title, year FROM books WHERE isbn=&#039;9781594201820&#039;\r\n<\/pre>\n<p>Here we have a <code>books<\/code> table with <code>title<\/code>, <code>year<\/code> and <code>isbn<\/code> fields. To support a query like the one above we would need an index on <code>isbn<\/code>. The <code>title<\/code> and <code>year<\/code> fields have no influence on the search, but they are mentioned in the query because we want to display them in the search result.<\/p>\n<p>In CouchDB terms, to support a similar query we would need a view with a <code>map<\/code> function emitting the ISBN as key, and the value would be a structure like <code>{\"title\":\"Lords of Finance\", \"year\":2009}<\/code>. The <code>map<\/code> function for that view would look like this:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\nfunction(doc) {\r\n  if (doc.type == &quot;book&quot;) {\r\n    emit(doc.isbn, {title: doc.title, year: doc.year});\r\n  }\r\n}\r\n<\/pre>\n<p>Because CouchDB has no concept of tables like SQL databases have, it is common to have different types of records mixed in the same database. Therefore, <code>map<\/code> functions often have an <code>if<\/code> statement to select which records to index. 
In the example above, we index only documents of type <code>\"book\"<\/code>. For each book, the key will be the ISBN and the value will be an object with the <code>title<\/code> and <code>year<\/code> attributes.<\/p>\n<p>In addition to a <code>map<\/code> function, CouchDB views may have a <code>reduce<\/code> function, which is used to aggregate results. Views with <code>reduce<\/code> functions will be discussed later.<\/p>\n<h3>Indexing ISIS-JSON type 2 records: first approach<\/h3>\n<p>Being able to use a powerful language like JavaScript to create <code>map<\/code> functions gives a lot of flexibility when creating views. We can dig as deep as necessary into the structure of our documents, and massage the data we find in sophisticated ways. For one report, we created an index of all the LILACS records containing fields with repeating subfield markers. That would not be viable in an ISIS FST. However, one important limitation of <code>map<\/code> functions is that we have no access to data that is not part of the document being indexed. It is impossible, for example, to look up some value in another record, like a join operation in an RDBMS (ISIS also allows this, via the <code>l\/lookup<\/code> function of the ISIS Formatting Language).<\/p>\n<p>To start with a simple example, let us create a view to index LILACS records by tag #1 (cooperating center code), the id of the cataloging institution which created the record. From the LILACS data dictionary we know that tag #1 is non-repeating and has no subfields, which means all the content of that field resides in the main subfield (the first one) and only the first occurrence matters (because there are no repetitions). 
Here is a view function that would create an index with field #1 as key and field #12 (title, analytic) as value:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\nfunction(doc) {\r\n  emit(doc&#x5B;&quot;1&quot;]&#x5B;0]&#x5B;0]&#x5B;1], doc&#x5B;&quot;12&quot;]&#x5B;0]&#x5B;0]&#x5B;1]);\r\n}\r\n<\/pre>\n<p>Here we begin to see the price we pay for the generality of the ISIS-JSON type 2 structure. The LILACS field #1 is non-repeating and has no subfields, so it could be represented like this:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n{\r\n    &quot;1&quot; : &quot;UY6.1&quot;,\r\n    ...\r\n}\r\n<\/pre>\n<p>And then we could reach it within a <code>map<\/code> function with the simple expression <code>doc[\"1\"]<\/code>. Even better, we could add an alpha prefix to the field name (the <code>--prefix<\/code> option of <code>isis2json.py<\/code> does that). Then we could use dot notation (<code>doc.v1<\/code>) to access tag v1 in the following:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n{\r\n    &quot;v1&quot; : &quot;UY6.1&quot;,\r\n    ...\r\n}\r\n<\/pre>\n<p>However, to deal with ISIS records of any kind in the absence of schema information, we must assume that every field may have more than one occurrence, therefore its value cannot be just a string, but an array of occurrences. In addition, we must assume that every field may have subfields, therefore each occurrence must be structured as an associative list (ISIS-JSON type 2) or a dictionary (ISIS-JSON type 3), unless we want to parse the subfields every time when indexing. 
In an ISIS-JSON type 2 record, field #1 is represented as:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\n{\r\n    &quot;1&quot;: &#x5B;\r\n        &#x5B;\r\n            &#x5B;\r\n                &quot;_&quot;,\r\n                &quot;UY6.1&quot;\r\n            ]\r\n        ]\r\n    ],\r\n    ...\r\n}\r\n<\/pre>\n<p>And to access its value we must write <code>doc[\"1\"][0][0][1]<\/code>. Here is why:<\/p>\n<pre>\r\ndoc   ->   {\"1\": [[[\"_\", \"UY6.1\"]]]}     # entire document\r\ndoc[\"1\"]   ->   [[[\"_\", \"UY6.1\"]]]     # list of occurrences of field #1\r\ndoc[\"1\"][0]  ->  [[\"_\", \"UY6.1\"]]    # list of subfields of occurrence 0 of #1\r\ndoc[\"1\"][0][0] -> [\"_\", \"UY6.1\"]   # subfield 0 of occurrence 0 of #1 (key \"_\")\r\ndoc[\"1\"][0][0][1]  ->  \"UY6.1\"   # value of subfield 0 of occurrence 0 of #1\r\n<\/pre>\n<p>This quickly becomes tedious, and very burdensome in more complex cases: for instance, to retrieve a specific subfield other than the main one, we would need to iterate over the subfield keys. We will soon show a library to make it easier to handle ISIS-JSON type 2 records, but first let&#8217;s return to our simple <code>map<\/code> function:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\nfunction(doc) {\r\n  emit(doc&#x5B;&quot;1&quot;]&#x5B;0]&#x5B;0]&#x5B;1], doc&#x5B;&quot;12&quot;]&#x5B;0]&#x5B;0]&#x5B;1]);\r\n}\r\n<\/pre>\n<p>This function is part of a view in a &#8220;design document&#8221; in CouchDB. A design document is written in JSON and stored in CouchDB like any other document, but its identifier starts with <code>_design\/<\/code>. It may contain JavaScript functions for creating views as well as formatting output, validating document inserts\/updates, etc. 
Each CouchDB database may have several design documents, and each design document may have several views.<\/p>\n<p>When trying out CouchDB initially, the easiest way to create a view is by using the &#8220;Temporary view&#8230;&#8221; option of the view dropdown in the top right area of the Futon interface. Then you can run the <code>map\/reduce<\/code> functions and quickly see their results (if your dataset is not too large). To save your work to a permanent view, you will be prompted to provide a design document name and a view name.<\/p>\n<p>For any serious work, the CouchApp tool is highly recommended (<a href=\"#anderson2009\">Anderson, 2009<\/a>). With it you can develop your design documents in your local filesystem, using your favorite editor and version control system, then push your code to a local or remote CouchDB instance with a command. All of the views for this article were developed in this manner. The code is in Bitbucket (<a href=\"#ramalho2011\">Ramalho, 2011<\/a>).<\/p>\n<div style=\"width: 500px; margin: auto;\"><a href=\"\/media\/issue13\/ramalho\/fig2-tempview.png\"><img decoding=\"async\" class=\"caption\" src=\"\/media\/issue13\/ramalho\/fig2-tempview-500x399.png\" alt=\"Futon screen shot: testing a temporary view\" \/><\/a><\/div>\n<p class=\"caption\"><strong>Figure 2.<\/strong> Editing and running a temporary view in CouchDB Futon.<\/p>\n<p>When a view is first visited via HTTP, CouchDB indexes all of the documents by applying the <code>map<\/code> function to each of them. CouchDB also incrementally updates the indexes if documents are inserted or updated. The indexing is only done on demand, when a view is actually requested, and not when documents are created or changed.<\/p>\n<p>To install the <code>map<\/code> function above, we created a design document called <code>lilacs<\/code> and within it a view called <code>center<\/code> (for &#8220;cooperating center&#8221;). 
Here is part of the result of requesting that view with the cURL utility:<\/p>\n<pre>\r\n$ curl -s http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/center | head -5\r\n{\"total_rows\":926,\"offset\":0,\"rows\":[\r\n{\"id\":\"559682\",\"key\":\"AR1.1\",\"value\":\"Hemorragia digestiva baja: [revisi\\u00f3n]\"},\r\n{\"id\":\"559683\",\"key\":\"AR1.1\",\"value\":\"La radiolog\\u00eda del colon por enema en el siglo XXI\"},\r\n{\"id\":\"559684\",\"key\":\"AR1.1\",\"value\":\"El s\\u00edndrome an\\u00e9mico en las neoplasias del colon derecho\"},\r\n{\"id\":\"559685\",\"key\":\"AR1.1\",\"value\":\"Met\\u00e1stasis cerebrales del carcinoma epidermoide del ano: comunicaci\\u00f3n de un caso\"},\r\n<\/pre>\n<p>Note that the result comes as a JSON object with three properties: <code>total_rows<\/code>, <code>offset<\/code> and <code>rows<\/code>, the latter being an array of <code>map<\/code> function results. Besides the key and value generated by the <code>emit<\/code> function, each result row also has an <code>id<\/code> attribute, which carries the <code>_id<\/code> property of the corresponding indexed document. By default, the result is sorted in ascending key order. Here we used the <code>head<\/code> shell command to crop the displayed results; we will soon see a way of limiting the results actually sent by CouchDB.<\/p>\n<p>By the way, the URL shown here is public and you should be able to access it to try the examples by editing the URL:<\/p>\n<p><a href=\"http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/center\" >http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/center<\/a><\/p>\n<p>However, please note that the <code>lilcouch<\/code> database there contains only a sample of 1000 records, not the full LILACS database.<\/p>\n<h3>5.2. Querying a view<\/h3>\n<p>As we have just seen, CouchDB queries are executed by making HTTP requests on views. If no arguments are passed, all rows of the result are returned. 
However, there are several arguments that can be used to filter the results. For example, the following query uses the <code>descending<\/code> and <code>limit<\/code> arguments. Note the use of single quotes around the URL, necessary because &#8220;&#038;&#8221; is a shell operator and we are using the command line. In this case the <code>head<\/code> shell command was not used. The <code>total_rows<\/code> property still counts 926, but only the last three rows are returned because of the <code>limit=3<\/code> option.<\/p>\n<pre>\r\n$ curl -s 'http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/center?descending=true&limit=3'\r\n{\"total_rows\":926,\"offset\":0,\"rows\":[\r\n{\"id\":\"560003\",\"key\":\"UY6.1\",\"value\":\"Vacunaci\\u00f3n antigripal en personal de salud del Hospital Pedi\\u00e1trico del Centro Hospitalario Pereira Rossell: evoluci\\u00f3n de la cobertura del a\\u00f1o 2006 al 2008\"},\r\n{\"id\":\"560002\",\"key\":\"UY6.1\",\"value\":\"Contin\\u00faa descendiendo la mortalidadpor asma en Uruguay: per\\u00edodo 1984-2008\"},\r\n{\"id\":\"560001\",\"key\":\"UY6.1\",\"value\":\"Localizaci\\u00f3n de lesiones mamarias subcl\\u00ednicas con marcador met\\u00e1lico (arponaje): an\\u00e1lisis de los m\\u00e1rgenes quir\\u00fargicos\"}\r\n]}\r\n<\/pre>\n<p>Another possibility is to filter by key. 
The next query returns only the documents created by the cooperating center with code &#8220;CO113&#8221;:<\/p>\n<pre>\r\n$ curl -s 'http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/center?key=\"CO113\"'\r\n{\"total_rows\":926,\"offset\":744,\"rows\":[\r\n{\"id\":\"559986\",\"key\":\"CO113\",\"value\":\"De serendipia al tratamiento quir\\u00fargico de la diabetes mellitus tipo II\"},\r\n{\"id\":\"559987\",\"key\":\"CO113\",\"value\":\"La Academia Nacional de Medicina se pronuncia sobre la emergencia social, la medicina y la salud\"},\r\n{\"id\":\"559988\",\"key\":\"CO113\",\"value\":\"Presentaci\\u00f3n inicial de las pacientes con diagn\\u00f3stico de c\\u00e1ncer de seno en el Centro Javeriano de Oncolog\\u00eda, Hospital Universitario San Ignacio\"},\r\n{\"id\":\"559989\",\"key\":\"CO113\",\"value\":\"\\u00bfEs la diabetes mellitus tipo 2 una enfermedad de tratamiento quir\\u00fargico?\"},\r\n{\"id\":\"559990\",\"key\":\"CO113\",\"value\":\"P\\u00f3lipos de la ves\\u00edcula\"},\r\n{\"id\":\"559991\",\"key\":\"CO113\",\"value\":\"Ruptura espl\\u00e9nica espont\\u00e1nea asociada a linfoma perif\\u00e9rico de c\\u00e9lulas T, presentaci\\u00f3n de un caso y revisi\\u00f3n de la literatura\"},\r\n{\"id\":\"559992\",\"key\":\"CO113\",\"value\":\"Diverticulosis del yeyuno: complicaciones y manejo; reporte de caso y revisi\\u00f3n de la literatura\"},\r\n{\"id\":\"559993\",\"key\":\"CO113\",\"value\":\"Actinomicosis abdominal y p\\u00e9lvica: reto diagn\\u00f3stico y quir\\u00fargico para el cirujano general\"}\r\n]}\r\n<\/pre>\n<p>It is also possible to limit by starting and ending keys. For example, this query returns the documents created by centers starting with the &#8220;CO&#8221; prefix (for Colombia), up to and including the &#8220;CO149&#8221; center. 
Note that the <code>offset<\/code> property tells us that 744 rows were skipped to get to the first one with <code>key=\"CO\"<\/code>:<\/p>\n<pre>\r\n$ curl -s 'http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/center?startkey=\"CO\"&endkey=\"CO149\"'\r\n{\"total_rows\":926,\"offset\":744,\"rows\":[\r\n{\"id\":\"559986\",\"key\":\"CO113\",\"value\":\"De serendipia al tratamiento quir\\u00fargico de la diabetes mellitus tipo II\"},\r\n{\"id\":\"559987\",\"key\":\"CO113\",\"value\":\"La Academia Nacional de Medicina se pronuncia sobre la emergencia social, la medicina y la salud\"},\r\n[... several rows omitted ...]\r\n{\"id\":\"560469\",\"key\":\"CO149\",\"value\":\"Diagn\\u00f3stico sanitario de diversos zoocriaderos helic\\u00edcolas en Colombia: determinaci\\u00f3n de los principales agentes pat\\u00f3genos que afectan el caracol Helix aspersa (O.F. Muller, 1774) en cada etapa de ciclo biol\\u00f3gico\"},\r\n{\"id\":\"560470\",\"key\":\"CO149\",\"value\":\"La salud animal y la globalizaci\\u00f3n: el desaf\\u00edo de pol\\u00edticas sostenibles y equitativas en el contexto de los pa\\u00edses en desarrollo\"}\r\n]}\r\n<\/pre>\n<h3>5.3. Indexing ISIS-JSON type 2 records: using an API<\/h3>\n<p>Now we will develop a view called <code>au_countries<\/code> to list the countries of origin of the authors cited in the sample. This entails fetching the value of the &#8220;^p&#8221; subfield of field #10. 
A first stab at the <code>map<\/code> function came out like this:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\nfunction(doc) {\r\n  var occur, subf;\r\n  if (doc.hasOwnProperty(&#039;10&#039;)) {\r\n    for (occur=0; occur&lt;doc&#x5B;&#039;10&#039;].length; occur++) {\r\n      for (subf=0; subf&lt;doc&#x5B;&#039;10&#039;]&#x5B;occur].length; subf++) {\r\n        if (doc&#x5B;&#039;10&#039;]&#x5B;occur]&#x5B;subf]&#x5B;0] === &#039;p&#039;) {\r\n          emit(doc&#x5B;&#039;10&#039;]&#x5B;occur]&#x5B;subf]&#x5B;1], 1);\r\n          break;\r\n        }\r\n      }\r\n    }\r\n  } else {\r\n    emit(null, 1);\r\n  }\r\n}\r\n<\/pre>\n<p>Yes, that is quite ugly. Much of the ugliness comes from the fact that ISIS-JSON type 2 arranges subfields in an associative list, where a key can only be found through a linear search.<\/p>\n<p>So here is the next solution, much simpler thanks to the use of a custom library:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\nfunction(doc) {\r\n  \/\/ !code vendor\/isisdm_t2\/_attachments\/isisdm.js\r\n  var i, occurs = ISIS.getallsub(doc, 10, &quot;p&quot;);\r\n  for (i=0; i&lt;occurs.length; i++) {\r\n    if (occurs&#x5B;i] !== undefined) {\r\n      emit(occurs&#x5B;i], 1);\r\n    }\r\n  }\r\n}\r\n<\/pre>\n<p>The first line of the function body is a special <code>!code<\/code> comment that works as an include, and is processed by the CouchApp utility. This is actually a workaround because at this time CouchDB does not support the use of the CommonJS <code>require<\/code> function to import modules in <code>map<\/code> or <code>reduce<\/code> functions (<code>require<\/code> is supported elsewhere in CouchDB, in show and list functions for example). The <code>isisdm.js<\/code> module used here is part of the ISIS Data Model APIs we are developing at BIREME\/PAHO\/WHO to handle ISIS records in JavaScript and Python. 
Currently <code>isisdm.js<\/code> defines four main functions:<\/p>\n<dl>\n<dt><code>get(record, tag, missing)<\/code><\/dt>\n<dd>returns the first occurrence of <code>tag<\/code> in <code>record<\/code>; if <code>tag<\/code> is not found, returns the value of the last argument or <code>undefined<\/code> if that is not given;<\/dd>\n<dt><code>getall(record, tag)<\/code><\/dt>\n<dd>returns an array with all occurrences of <code>tag<\/code> in <code>record<\/code> or an empty array if <code>tag<\/code> is not found;<\/dd>\n<dt><code>getsub(record, tag, key, missing)<\/code><\/dt>\n<dd>returns the subfield identified by <code>key<\/code> in the first occurrence of <code>tag<\/code> in <code>record<\/code>; if <code>tag<\/code> or <code>key<\/code> is not found, returns the value of the last argument or <code>undefined<\/code> if that is not given;<\/dd>\n<dt><code>getallsub(record, tag, key, missing)<\/code><\/dt>\n<dd>returns an array with the contents of the subfield identified by <code>key<\/code> in all occurrences of <code>tag<\/code> in <code>record<\/code>.<\/dd>\n<\/dl>\n<p>We are using QUnit (<a href=\"#jquery2011\">JQuery, 2011<\/a>) to perform unit tests on the <code>isisdm.js<\/code> module. Here is the test page:<\/p>\n<div style=\"width: 500px; margin: auto;\"><a href=\"\/media\/issue13\/ramalho\/fig3-qunit.png\"><img decoding=\"async\" class=\"caption\" src=\"\/media\/issue13\/ramalho\/fig3-qunit-500x394.png\" alt=\"Futon screen shot: viewing a record\" \/><\/a><\/div>\n<p class=\"caption\"><strong>Figure 3.<\/strong> QUnit test results page, showing that all tests are passing. The tests for the <code>getallsub<\/code> function are expanded. 
This page is live at <a href=\"http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/tests.html\" >http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/tests.html<\/a><\/p>\n<p>Now, back to the <code>map<\/code> function, note that the call to <code>emit<\/code> uses the subfield occurrence as key and the value is just a number 1: <code>emit(occurs[i], 1);<\/code>. This is because the intent of this view is to produce an aggregate count of each different key. To achieve this, we need a <code>reduce<\/code> function to sum the values emitted, like this:<\/p>\n<pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\r\nfunction(keys, values, rereduce) {\r\n  return sum(values);\r\n}\r\n<\/pre>\n<p>Now that our <code>au_countries<\/code> view has both a <code>map<\/code> and <code>reduce<\/code> function, we can query it:<\/p>\n<pre>\r\n$ curl -s 'http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/au_countries'\r\n{\"rows\":[\r\n{\"key\":null,\"value\":2368}\r\n]}\r\n<\/pre>\n<p>The simplest, no-argument request returns null as the key and 2368 as the value. This is the result of the <code>reduce<\/code> function: 2368 is the sum of all the 1&#8217;s emitted by the <code>map<\/code> function. In other words, there are 2368 occurrences of the #10 tag with a <code>^p<\/code> subfield.<\/p>\n<p>The <code>reduce<\/code> function can be disabled if the <code>reduce=false<\/code> option is given in the request. Below are the first 5 and the last 5 lines of querying the <code>au_countries<\/code> view without applying the <code>reduce<\/code>. 
What you see is just the result of the <code>map<\/code> function:<\/p>\n<pre>\r\n$ curl -s 'http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/au_countries?reduce=false' | head -5\r\n{\"total_rows\":2368,\"offset\":0,\"rows\":[\r\n{\"id\":\"560162\",\"key\":\"\\u00c1frica do Sul\",\"value\":1},\r\n{\"id\":\"560162\",\"key\":\"\\u00c1frica do Sul\",\"value\":1},\r\n{\"id\":\"560013\",\"key\":\"Alemania\",\"value\":1},\r\n{\"id\":\"559596\",\"key\":\"Algeria\",\"value\":1},\r\n$ curl -s 'http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/au_countries?reduce=false' | tail -5\r\n{\"id\":\"560034\",\"key\":\"USA\",\"value\":1},\r\n{\"id\":\"560325\",\"key\":\"USA\",\"value\":1},\r\n{\"id\":\"559604\",\"key\":\"Venezuela\",\"value\":1},\r\n{\"id\":\"559957\",\"key\":\"Venezuela\",\"value\":1}\r\n]}\r\n<\/pre>\n<p>A more interesting result can be obtained if we pass the <code>group=true<\/code> option:<\/p>\n<pre>$ curl -s 'http:\/\/ramalho.couchone.com\/lilcouch\/_design\/lilacs\/_view\/au_countries?group=true&limit=10'\r\n{\"rows\":[\r\n{\"key\":\"\\u00c1frica do Sul\",\"value\":2},\r\n{\"key\":\"Alemania\",\"value\":1},\r\n{\"key\":\"Algeria\",\"value\":1},\r\n{\"key\":\"Argentina\",\"value\":289},\r\n{\"key\":\"Austr\\u00e1lia\",\"value\":2},\r\n{\"key\":\"Bolivia\",\"value\":1},\r\n{\"key\":\"Brasi\\u00e7\",\"value\":1},\r\n{\"key\":\"Brasil\",\"value\":1185},\r\n{\"key\":\"Brazil\",\"value\":26},\r\n{\"key\":\"Brsil\",\"value\":1}\r\n]}\r\n<\/pre>\n<p>The result set above, limited to 10 rows, shows the number of occurrences of each key. Obviously there are four entries for Brazil, with alternate and incorrect spellings. This would have to be dealt with elsewhere. The key point here is that the <code>reduce<\/code> function and the group option work together to produce aggregate results, similar to the ones we can produce with the SQL GROUP BY clause in a relational DBMS.<\/p>\n<h2><a name=\"section6\"><\/a>6. 
Results and Conclusion<\/h2>\n<h3>6.1. Results<\/h3>\n<p>We have identified in CouchDB and MongoDB two modern, Open Source database systems that are suitable for semistructured records like those defined by the ISO-2709 standard and the ISIS family of systems, serving the needs of MARC and LILACS datasets.<\/p>\n<p>Furthermore, we created a tool to convert ISIS records from the ISO-2709 format to JSON documents suitable for loading into CouchDB (or MongoDB, though we have not shown that in this paper).<\/p>\n<p>We considered a number of alternative representations for ISIS data in JSON format. In this paper we used the type 2 representation which, though somewhat awkward to work with, preserves subfield ordering and allows for repeated subfields, a feature of MARC records. ISIS fields do not have indicators, so we have not discussed how to represent them in JSON. One possibility would be to use special subfield markers, like &#8220;_1&#8221; and &#8220;_2&#8221;, attached to field occurrences where the indicators are present.<\/p>\n<p>Finally, we developed a JavaScript library to make it easy to handle ISIS-JSON type 2 records when creating CouchDB views.<\/p>\n<h3>6.2. Conclusion<\/h3>\n<p>These experiments and developments have shown that it is easy to import ISO-2709 data into a document database like CouchDB or MongoDB. After doing so, it becomes almost trivial to create Web services to publish the data, using just JavaScript, particularly in CouchDB thanks to its native support for JSON over HTTP.<\/p>\n<p>While the semistructured data model was only formalized in the mid-1990&#8217;s, the ISO-2709 and ISIS record formats have always been concrete, albeit limited, examples of it. Research into that model includes results such as algorithms to extract a formal schema from actual datasets, methods for dealing with shared or duplicate data, and a normal form adapted to semistructured schemas (<a href=\"#tok2005\">Tok, 2005<\/a>). 
We have much to learn from semistructured data research and to apply in our daily work with bibliographic records.<\/p>\n<p>At BIREME\/PAHO\/WHO we continue investigating the challenges and opportunities of converting from the ISIS legacy systems to modern Open Source document databases. Meanwhile, we are also developing new applications \u2013 not limited to the ISIS data model and legacy data \u2013 using CouchDB, JavaScript, Python and the Pyramid framework. Pyramid is a new web framework resulting from the merger of the Pylons and Repoze projects, and it includes Deform, a form generation and validation library with strong support for repeating fields and nested forms, important features when building interfaces for semistructured data entry (<a href=\"#pylons2011\">Pylons, 2011<\/a>). These new developments allow us to think about how we want the LILACS bibliographic database to operate in the year 2015, when its 30th anniversary will be celebrated.<\/p>\n<h2>Notes<\/h2>\n<p><a name=\"note1\" href=\"#noteorig1\"><sup>1<\/sup><\/a> The qualification \u201c<em>flat<\/em> relational model\u201d is used here because there is actually a controversy about the meaning of the First Normal Form. C. J. Date states that the atomicity requirement of the 1NF is pointless, and that since \u201ctypes can be anything\u201d it follows that \u201call relations are in first normal form by definition\u201d (<a href=\"#date2005\">Date, 2005<\/a>, p. 37). But E. F. Codd did define normalization as the process of removing non-simple domains from relations (<a href=\"#codd1970\">Codd, 1970<\/a>), and current database textbooks define the 1NF as Codd did (<a href=\"#elmasri2006\">Elmasri, 2006<\/a>; <a href=\"#silberschatz2006\">Silberschatz, 2006<\/a>). 
Leaving aside the theory, although some RDBMS \u2013 like PostgreSQL and Oracle \u2013 support array values and even user-defined composite field values, modern database-independent access methods, such as the Ruby on Rails ActiveRecord ORM, often target the lowest common denominator, and that is the flat relational model.<\/p>\n<p><a name=\"note2\" href=\"#noteorig2\"><sup>2<\/sup><\/a> \u201cThe structure, or &#8216;empty container&#8217;, the content designators (tags, indicators, and subfield codes) used to explicitly identify or additionally characterize the data elements, and the content, the data itself (author&#8217;s name, titles, etc.) are the three components of the [MARC II] format.\u201d (<a href=\"#avram1975\">Avram, 1975<\/a>)<\/p>\n<p><a name=\"note3\" href=\"#noteorig3\"><sup>3<\/sup><\/a> The term \u201csemistructured\u201d often appears as \u201csemi-structured\u201d in the literature. Searching for both spellings is recommended because Google returns more results for the hyphenated spelling, but some key proponents of the model use just one word (<a href=\"#abiteboul1999\">Abiteboul, 1999<\/a>; <a href=\"#tok2005\">Tok, 2005<\/a>; <a href=\"#suciu2001\">Suciu, 2001<\/a>). Here I use one word, but keep the source spelling in quotations.<\/p>\n<p><a name=\"note4\" href=\"#noteorig4\"><sup>4<\/sup><\/a> JSON is JavaScript Object Notation, a data exchange format described as a light-weight alternative to XML (<a href=\"#crockford2006b\">Crockford, 2006b<\/a>). 
Since 2006 JSON has been formally specified by RFC 4627 (<a href=\"#crockford2006a\">Crockford, 2006a<\/a>).<\/p>\n<p><a name=\"note5\" href=\"#noteorig5\"><sup>5<\/sup><\/a> In MARC records, indicators are two single-digit positions located between the tag number and the field content, which have different uses depending on the tag number.<\/p>\n<p><a name=\"note6\" href=\"#noteorig6\"><sup>6<\/sup><\/a> ISIS datasets sometimes contain repeated subfields due to typos and processing errors. However, the ISIS Formatting Language, used to generate indexes and displays, is only capable of extracting the first occurrence of a subfield.<\/p>\n<p><a name=\"note7\" href=\"#noteorig7\"><sup>7<\/sup><\/a> From the XML standard (<a href=\"#w3c1998\">W3C, 1998<\/a>): \u201cAn element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.\u201d Mixed content also exists in HTML and in SGML, their common ancestor.<\/p>\n<p><a name=\"note8\" href=\"#noteorig8\"><sup>8<\/sup><\/a> Here we mean \u201cdurability\u201d as in the \u201cD\u201d in the ACID database properties, meaning that once an insert or update is committed, it is written to disk. For a discussion of why the designers of MongoDB initially compromised on single-server durability, see \u201cWhat about Durability?\u201d (<a href=\"#mongodb2010\">MongoDB, 2010<\/a>). Update: MongoDB 1.8 was released on March 16, 2011, and includes write-ahead journaling. \u201cWith journaling enabled, crash recovery is fast and safe\u201d (<a href=\"#mongodb2011\">MongoDB, 2011<\/a>).<\/p>\n<p><a name=\"note9\" href=\"#noteorig9\"><sup>9<\/sup><\/a> \u201cAn object is an unordered collection of zero or more name\/value pairs\u201d (<a href=\"#crockford2006a\">Crockford, 2006a<\/a>).<\/p>\n<p><a name=\"note10\" href=\"#noteorig10\"><sup>10<\/sup><\/a> Association lists should not be confused with associative arrays, such as those in PHP. 
A PHP associative array is like a hash in Perl or Ruby, a Python dictionary or a JavaScript or JSON object. In contrast, an association list, or <em>alist<\/em>, is not a primitive type in those languages, but can be built as an array or list of key-value pairs, where each pair is also an array or list. In Python, a more readable and convenient representation would be a list of tuples: <code>[(key1, value1), (key2, value2), ...]<\/code>. Association lists originated in Lisp and are common in Scheme (<a href=\"#mit2010\">MIT, 2010<\/a>).<\/p>\n<p><a name=\"note11\" href=\"#noteorig11\"><sup>11<\/sup><\/a> As far as we know, apart from HTTP, the only other way to interact with CouchDB is via Hovercraft, an Erlang library which provides direct access to the database.<\/p>\n<p><a name=\"note12\" href=\"#noteorig12\"><sup>12<\/sup><\/a> CouchDB indexes are generated by \u201cview servers\u201d which can be written in any language. An Erlang view server is bundled with recent versions of CouchDB, but is not enabled by default. Python can be easily configured for that purpose, and Java is also known to be used. A list of view server implementations can be found at <a href=\"http:\/\/wiki.apache.org\/couchdb\/View_server\" >http:\/\/wiki.apache.org\/couchdb\/View_server<\/a>.<\/p>\n<p><a name=\"note13\" href=\"#noteorig13\"><sup>13<\/sup><\/a> The numbers in the first column (98, 10, 38 in the example) identify a specific index, usually coinciding with the indexed tag number, but not necessarily. Some versions of ISIS support advanced querying using those numbers, others do not.<\/p>\n<h2>References<\/h2>\n<p><a name=\"abiteboul1999\"><\/a>ABITEBOUL, Serge; BUNEMAN, Peter; SUCIU, Dan. Data on the Web: From Relations to Semistructured Data and XML. San Francisco: Morgan Kaufmann, 1999.<\/p>\n<p><a name=\"anderson2009\"><\/a>ANDERSON, C.; CHESNEAU, B. CouchApp: Standalone CouchDB Application Development Made Simple, 2009 [cited 2011 Mar. 26]. 
Available from: <a href=\"https:\/\/github.com\/couchapp\/couchapp\/\" >https:\/\/github.com\/couchapp\/couchapp\/<\/a>.<\/p>\n<p><a name=\"apache2011a\"><\/a>APACHE FOUNDATION. The Apache Cassandra Project [cited 2011 Mar. 26]. Available from: <a href=\"http:\/\/cassandra.apache.org\/\">http:\/\/cassandra.apache.org\/<\/a>.<\/p>\n<p><a name=\"apache2011b\"><\/a>APACHE FOUNDATION. The Apache CouchDB Project [cited 2011 Mar. 26]. Available from: <a href=\"http:\/\/couchdb.apache.org\/\">http:\/\/couchdb.apache.org\/<\/a>.<\/p>\n<p><a name=\"avram1975\"><\/a>AVRAM, Henriette D. MARC: Its History and Implications. Washington, DC: U.S. Government Printing Office, 1975.<\/p>\n<p><a name=\"barbieri2011\"><\/a>BARBIERI, Heitor. Bruma (Java library source code). S\u00e3o Paulo: BIREME\/OPAS\/OMS, 2010 [cited 2011 Mar. 26]. Available from: <a href=\"http:\/\/reddes.bvsalud.org\/projects\/isisnbp\/browser\/Bruma\" >http:\/\/reddes.bvsalud.org\/projects\/isisnbp\/browser\/Bruma<\/a>.<\/p>\n<p><a name=\"bireme2008\"><\/a>BIREME\/OPAS\/OMS. Diccionario de Datos del Modelo LILACS Versi\u00f3n 1.6a. S\u00e3o Paulo: BIREME\/OPAS\/OMS, 2008 [cited 2011 Mar. 26]. Available from: <a href=\"http:\/\/bvsmodelo.bvs.br\/download\/lilacs\/LILACS-5-DicionarioDados-es.pdf\" >http:\/\/bvsmodelo.bvs.br\/download\/lilacs\/LILACS-5-DicionarioDados-es.pdf<\/a><\/p>\n<p><a name=\"bireme2010\"><\/a>BIREME\/OPAS\/OMS. ISIS-JSON types. S\u00e3o Paulo: BIREME\/OPAS\/OMS, 2010 [cited 2011 Mar. 21]. Available from: <a href=\"http:\/\/reddes.bvsalud.org\/projects\/isisnbp\/wiki\/ISIS-JSON_types\" >http:\/\/reddes.bvsalud.org\/projects\/isisnbp\/wiki\/ISIS-JSON_types<\/a>.<\/p>\n<p><a name=\"bireme2010b\"><\/a>BIREME\/OPAS\/OMS. ISIS-DM: The ISIS Data Model API. S\u00e3o Paulo: BIREME\/OPAS\/OMS, 2010 [cited 2011 Mar. 29]. 
Available from: <a href=\"http:\/\/github.com\/bireme\/isisdm\">http:\/\/github.com\/bireme\/isisdm<\/a>.<\/p>\n<p><a name=\"chang2006\"><\/a>CHANG, Fay; DEAN, Jeffrey; GHEMAWAT, Sanjay; et al. Bigtable: A Distributed Storage System for Structured Data. In: OSDI&#8217;06: Seventh Symposium on Operating Systems Design and Implementation, Seattle, WA, 2006 [cited 2011 Mar. 21]. Available from: <a href=\"http:\/\/labs.google.com\/papers\/bigtable.html\" >http:\/\/labs.google.com\/papers\/bigtable.html<\/a>.<\/p>\n<p><a name=\"codd1970\"><\/a>CODD, E. F. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, v. 13, n. 6, p. 377-387, Jun. 1970 [cited 2011 Mar. 26]. Available from: <a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.98.5286&#038;rep=rep1&#038;type=pdf\" >http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.98.5286&#038;rep=rep1&#038;type=pdf<\/a>.<\/p>\n<p><a name=\"crockford2006a\"><\/a>CROCKFORD, D. RFC4627: The application\/json Media Type for JavaScript Object Notation (JSON). The Internet Society, 2006 [cited 2011 Mar. 21]. Available from: <a href=\"http:\/\/tools.ietf.org\/html\/rfc4627\" >http:\/\/tools.ietf.org\/html\/rfc4627<\/a>.<\/p>\n<p><a name=\"crockford2006b\"><\/a>CROCKFORD, D. JSON: The Fat-Free Alternative to XML. JSON.ORG, 2006 [cited 2011 Mar. 21]. Available from: <a href=\"http:\/\/www.json.org\/fatfree.html\">http:\/\/www.json.org\/fatfree.html<\/a>.<\/p>\n<p><a name=\"date2005\"><\/a>DATE, C. J. Database in Depth: Relational Theory for Practitioners. Sebastopol: O&#8217;Reilly Media, 2005.<\/p>\n<p><a name=\"decandia2007\"><\/a>DECANDIA, Giuseppe; HASTORUN, Deniz; JAMPANI, Madan; et al. Dynamo: Amazon&#8217;s Highly Available Key-Value Store. In: Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, 2007.<\/p>\n<p><a name=\"dirolf2010\"><\/a>DIROLF, M. Binary JSON. BSONSPEC.ORG, 2010 [cited 2011 Mar. 21]. 
Available from: <a href=\"http:\/\/bsonspec.org\/#\/specification\" >http:\/\/bsonspec.org\/#\/specification<\/a>.<\/p>\n<p><a name=\"elmasri2006\"><\/a>ELMASRI, R.; NAVATHE, S. B. Fundamentals of Database Systems. 5<sup>th<\/sup> ed. Reading, MA: Addison-Wesley, 2006.<\/p>\n<p><a name=\"hopkinson2005\"><\/a>HOPKINSON, Alan. CDS\/ISIS: The second decade. Information Development. February 2005 vol. 21 no. 1 p.31-37 [cited 2011 Mar. 26]. Available from: <a href=\"http:\/\/eprints.mdx.ac.uk\/2700\/3\/isisseconddecade.pdf\" >http:\/\/eprints.mdx.ac.uk\/2700\/3\/isisseconddecade.pdf<\/a>.<\/p>\n<p><a name=\"jquery2011\"><\/a>JQUERY PROJECT. QUnit, 2011. [cited 2011 Mar. 21]. Available from: <a href=\"http:\/\/docs.jquery.com\/Qunit\" >http:\/\/docs.jquery.com\/Qunit<\/a><\/p>\n<p><a name=\"lopes2010\"><\/a>LOPES, Francisco. Historia_ISIS.doc (internal memo). S\u00e3o Paulo: BIREME\/OPAS\/OMS, 2010.<\/p>\n<p><a name=\"merriman2010\"><\/a>MERRIMAN, Dwight. Comparing Mongo DB and Couch DB. MongoDB.org. [cited 2011 Mar. 21]. Available from: <a href=\"http:\/\/www.mongodb.org\/display\/DOCS\/Comparing+Mongo+DB+and+Couch+DB\" >http:\/\/www.mongodb.org\/display\/DOCS\/Comparing+Mongo+DB+and+Couch+DB<\/a>.<\/p>\n<p><a name=\"mit2010\"><\/a>MIT. MIT\/GNU Scheme &#8211; Association Lists. Cambridge, MA: Massachusetts Institute of Technology, 2010 [cited 2011 Mar. 22]. Available from: <a href=\"http:\/\/web.mit.edu\/scheme_v9.0.1\/doc\/mit-scheme-ref\/Association-Lists.html\" >http:\/\/web.mit.edu\/scheme_v9.0.1\/doc\/mit-scheme-ref\/Association-Lists.html<\/a>.<\/p>\n<p><a name=\"mongodb2010\"><\/a>MONGODB.ORG. What about Durability? MongoDB Blog, 2010 [cited 2011 Mar. 22]. Available from: <a href=\"http:\/\/blog.mongodb.org\/post\/381927266\/what-about-durability\" >http:\/\/blog.mongodb.org\/post\/381927266\/what-about-durability<\/a>.<\/p>\n<p><a name=\"mongodb2011\"><\/a>MONGODB.ORG. MongoDB 1.8 Released. MongoDB Blog, 2011 [cited 2011 Mar. 22]. 
Available from: <a href=\"http:\/\/blog.mongodb.org\/post\/3903149313\/mongodb-1-8-released\" >http:\/\/blog.mongodb.org\/post\/3903149313\/mongodb-1-8-released<\/a>.<\/p>\n<p><a name=\"pylons2011\"><\/a>PYLONS PROJECT. Deform. Agendaless Consulting, 2011 [cited 2011 Mar. 22]. Available from: <a href=\"http:\/\/docs.pylonsproject.org\/projects\/deform\/dev\/#\" >http:\/\/docs.pylonsproject.org\/projects\/deform\/dev\/<\/a>.<\/p>\n<p><a name=\"ramalho2011\"><\/a>RAMALHO, L. LILACS on CouchDB, 2011 [cited 2011 Mar. 26]. Available from: <a href=\"https:\/\/bitbucket.org\/ramalho\/lilcouch\" >https:\/\/bitbucket.org\/ramalho\/lilcouch<\/a><\/p>\n<p><a name=\"silberschatz2006\"><\/a>SILBERSCHATZ, A.; KORTH, H.; SUDARSHAN, S. Database System Concepts, 5<sup>th<\/sup> ed. New York: McGraw-Hill, 2006.<\/p>\n<p><a name=\"suciu2001\"><\/a>SUCIU, Dan. Managing XML and Semistructured Data &#8211; Lecture Series (digital slides). Seattle: Department of Computer Science &#038; Engineering, University of Washington, 2001 [cited 2011 Mar. 21]. Available from: <a href=\"http:\/\/www.cs.washington.edu\/homes\/suciu\/COURSES\/590DS\/\" >http:\/\/www.cs.washington.edu\/homes\/suciu\/COURSES\/590DS\/<\/a><\/p>\n<p><a name=\"suciu2009\"><\/a>SUCIU, Dan. Semi-Structured Data Model. In: LIU, L.; \u00d6ZSU, M. T. Encyclopedia of Database Systems: Springer, 2009. p. 2601-2605.<\/p>\n<p><a name=\"thomale2010\"><\/a>THOMALE, Jason. Interpreting MARC: Where&#8217;s the Bibliographic Data? Code4Lib Journal, Issue 11, 2010-09-21 [cited 2011 Mar. 26]. Available from: <a href=\"https:\/\/journal.code4lib.org\/articles\/3832\" >http:\/\/journal.code4lib.org\/articles\/3832<\/a>.<\/p>\n<p><a name=\"tok2005\"><\/a>TOK, Wang Ling; LEE, Mong Li; DOBBIE, Gillian. Semistructured Database Design. Boston: Springer Science, 2005.<\/p>\n<p><a name=\"w3c1998\"><\/a>W3C. Extensible Markup Language (XML) 1.0 (Fifth Edition), 2008. [cited 2011 Mar. 26]. 
Available from: <a href=\"http:\/\/www.w3.org\/TR\/REC-xml\/#sec-mixed-content\" >http:\/\/www.w3.org\/TR\/REC-xml\/#sec-mixed-content<\/a><\/p>\n<h2 class=\"abouttheauthor\">About the Author<\/h2>\n<p><a href=\"mailto:luciano.ramalho@bireme.org\">Luciano Ramalho<\/a> was designing large-scale Web publishing systems before the Netscape IPO and the first release of Internet Explorer. He has a B.S. in Library Sciences from the University of S\u00e3o Paulo, and is a software development supervisor at BIREME\/PAHO\/WHO, a digital library that is part of the Knowledge Management and Communication area of the Pan American Health Organization. His English-language blog is at <a href=\"http:\/\/standupprogrammer.blogspot.com\" >http:\/\/standupprogrammer.blogspot.com<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>For decades bibliographic data has been stored in non-relational databases, and thousands of libraries in developing countries still use ISIS databases to run their OPACs. Fast forward to 2010 and the NoSQL movement has shown that non-relational databases are good enough for Google, Amazon.com and Facebook. Meanwhile, several Open Source NoSQL systems have appeared.<\/p>\n<p>This paper discusses the data model of one class of NoSQL products, semistructured, document-oriented databases exemplified by Apache CouchDB and MongoDB, and why they are well-suited to collective cataloging applications. 
Also shown are the methods, tools, and scripts used to convert, from ISIS to CouchDB, bibliographic records of LILACS, a key Latin American and Caribbean health sciences index operated by the Pan-American Health Organization.<\/p>\n","protected":false},"author":192,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[69],"tags":[],"class_list":["post-4893","post","type-post","status-publish","format-standard","hentry","category-issue13"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/4893","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/192"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=4893"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/4893\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=4893"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=4893"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=4893"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}