Stacks - Medium

Integrating the Wellcome Collection knowledge graph

Stepan Brychta — Mon, 19 Jan 2026 09:09:28 GMT

A dragonfly on a lotus flower (Nelumbo species) held above the water. Watercolour.
Source: Wellcome Collection Public Domain Mark

It’s January 2025, and you’re browsing the Wellcome Collection online catalogue, which contains over a million items exploring art, health, culture and what it means to be human. You come across an image depicting a dragonfly on a lotus flower (shown above). Its catalogue page (called a work page) includes rich metadata and is tagged with relevant subjects, including Dragonflies, Ponds, and Insecta. Clicking on the Insecta tag takes you to a theme page, which aggregates hundreds of items associated with this subject.

After a bit more browsing, you discover several other bug-themed pages, including Insect and Insects. To your surprise, each page lists a different set of items. This kind of duplication is common across the collection, with separate pages for Aging and Ageing, Physicians and Doctors, and Manuscripts, Medieval and Medieval Manuscripts.

Returning to the Insecta theme page, you hope to find other images showing interactions between plants and insects, but come up empty-handed. Unknown to you, the collection offers a dedicated Insect-plant relationships theme page, as well as other related pages, such as Entomology, Arthropods, and Insects — Anatomy. But the only way to discover these pages is to stumble across them by chance.

Towards a more connected collection

If you had browsed the collection a year later, your experience would be different. The duplicate Insects and Insect pages have been merged into the Insecta page, which now includes a section highlighting related pages, as well as a link to the broader Arthropods page. Similar improvements have been applied to hundreds of thousands of pages, allowing you to traverse the collection laterally by navigating between theme pages.

These features are enabled by the Wellcome Collection knowledge graph, which stores millions of connections between items and themes in our collection. We previously wrote a blog post about the graph, explaining why and how we built it, with an interactive sample of the graph itself. This post is a sequel, focusing on how we integrated the knowledge graph into our production system.

Beneath the surface

Five antiquaries look through magnifying glasses at objects. Coloured lithograph after L. Boilly, 1823.
Source: Wellcome Collection Public Domain Mark

Our public API, known as the catalogue API, is at the core of our public collection, supplying data displayed on both work pages and theme pages. The API is backed by several Elasticsearch indexes, populated by our main production pipeline, called the catalogue pipeline. This pipeline continuously processes updates from source systems to ensure that the API always stays up to date.

Surfacing relationships from the knowledge graph through the catalogue API meant integrating the graph into this pipeline while preserving its performance characteristics, including near real-time updates. To achieve this, we created a new set of services collectively known as the graph pipeline.

Graph pipeline architecture diagram.

The graph pipeline is organised as a directed acyclic graph (DAG) and orchestrated via state machines in AWS Step Functions. A set of upstream services (called extractors and bulk loaders) keeps the graph up to date, while another set of services (called ingestors) queries the graph and produces final documents indexed into the catalogue API.

The pipeline begins by extracting raw data from source systems, including Wellcome Collection and external knowledge bases such as Wikidata, Library of Congress, and MeSH. Each extractor downloads items from a single source, transforms each item into a graph node, and converts its relationships with other items into graph edges. Resulting graph entities are streamed to CSV files stored in an S3 bucket, where they are picked up by the bulk loader and inserted into the knowledge graph. At this stage, we also run remover services which delete unused nodes and edges from the graph. Finally, ingestor services query the graph for the relationships we’re interested in and embed them into catalogue API indexes.
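As a rough illustration of the extractor step described above, the following sketch turns raw items into node and edge rows and streams them as CSV. The field names (`id`, `label`, `broader`), the `NARROWER_THAN` edge type, and the CSV column layout are assumptions made for this example, not the pipeline's actual schema or the format expected by the bulk loader. An in-memory buffer stands in for the S3 bucket.

```python
import csv
import io
from typing import Iterable, Iterator

# Hypothetical raw items, as an extractor might receive them from one source.
RAW_ITEMS = [
    {"id": "insecta", "label": "Insecta", "broader": ["arthropods"]},
    {"id": "arthropods", "label": "Arthropods", "broader": []},
]


def to_node(raw: dict) -> dict:
    """Convert a raw item into a flat node row."""
    return {"id": raw["id"], "label": raw["label"]}


def to_edges(raw: dict) -> Iterator[dict]:
    """Convert an item's relationships into edge rows, one per related item."""
    for parent_id in raw["broader"]:
        yield {"from_id": raw["id"], "to_id": parent_id, "type": "NARROWER_THAN"}


def write_csv(rows: Iterable[dict], fieldnames: list[str]) -> str:
    """Stream rows into a CSV document, one row at a time."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


nodes_csv = write_csv((to_node(item) for item in RAW_ITEMS), ["id", "label"])
edges_csv = write_csv(
    (edge for item in RAW_ITEMS for edge in to_edges(item)),
    ["from_id", "to_id", "type"],
)
```

Keeping nodes and edges in separate files mirrors the split described above, where each item becomes one node and its relationships become edges.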

A small sample of the knowledge graph showing how we consolidate the Insecta theme page. Wellcome themes reference relevant MeSH or Library of Congress concepts, which are connected through Wikidata.

The pipeline comes in two modes, depending on the data source. Wellcome Collection data is processed in incremental mode every 15 minutes, with each run handling only items which have changed since the previous run. Data from external sources is reprocessed in full once a month.

All services are written in Python and deployed as AWS Lambda functions or ECS tasks, depending on their memory and runtime requirements. Each service publishes a set of CloudWatch metrics and uploads artefacts to an S3 bucket, making it easier to investigate failures. We monitor the system using a Grafana dashboard, displaying service execution times, failure counts, custom metrics, and logs. When something goes wrong, we receive alerts via Slack.

A screenshot of our Grafana monitoring dashboard.

Code structure

The following sections describe a few general principles which guided us during development and helped improve the quality and maintainability of the codebase. All of our code is publicly available on GitHub in case you’d like to explore further.

Data quality

Data quality is our top priority. While we cannot control the quality of data sent to us from source systems, we can enforce strict guarantees about the structure of the data we process at every stage of the pipeline. All functions require type annotations, and we use mypy for static type checking.

More importantly, we avoid using untyped dictionaries where possible in favour of Pydantic models, which let us define the shape of our data and validate it while it’s processed. Every node and edge loaded into the graph and every document indexed into Elasticsearch is first converted into a Pydantic object. This ensures that all required fields are present and correctly typed. Adhering to this principle makes our code easier to understand and helps us catch problems early. On the rare occasions we’ve introduced bugs to the production pipeline, validation failures stopped the pipeline before any damage was done.
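A minimal sketch of this kind of validation is shown below. The `ConceptNode` model and its fields are hypothetical and chosen only for illustration; the real models live in the public codebase. The point is that a malformed row raises an error before anything is written downstream.

```python
from pydantic import BaseModel, ValidationError


class ConceptNode(BaseModel):
    """Hypothetical shape of a graph node; field names are illustrative."""

    id: str
    label: str
    source: str


def validate_nodes(raw_rows: list[dict]) -> list[ConceptNode]:
    """Convert raw dictionaries into validated models, failing fast on bad data."""
    return [ConceptNode(**row) for row in raw_rows]


good = validate_nodes([{"id": "abc123", "label": "Insecta", "source": "mesh"}])

try:
    # The required "label" field is missing, so validation raises.
    validate_nodes([{"id": "def456", "source": "mesh"}])
    failed = False
except ValidationError:
    failed = True
```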

Abstraction

Individual services follow the ETL (extract, transform, load) pattern. A source class retrieves data from a source system and yields individual raw items as Pydantic objects. A transformer class then consumes these items, transforms them into the desired shape, and loads them into a selected destination.

A data flow diagram showing source classes, transformer classes, and destinations.

We use inheritance to support multiple data sources while promoting abstraction and code reuse. For example, the extractor service (which creates graph nodes and edges from raw items) includes a transformer base class, which defines all logic for consuming raw items from source classes and for streaming transformed nodes and edges to supported destinations. Subclasses focus only on source-specific transformation logic, defining functions accepting a single raw item and producing a transformed graph node. This allows developers to easily add support for new sources without needing to understand how the resulting nodes and edges are ultimately written to the graph.

This structure also keeps components loosely coupled. For example, if we ever wanted to replace Elasticsearch with another storage system, we could do so with minimal changes.
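The inheritance pattern described in this section might look roughly like the sketch below. The class names, the `ui` and `name` keys, and the use of a plain list as the destination are all assumptions for illustration, not the actual classes in the codebase.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class BaseTransformer(ABC):
    """Owns the generic flow: consume raw items, emit transformed nodes."""

    def __init__(self, source: Iterable[dict]):
        self.source = source

    @abstractmethod
    def transform_node(self, raw: dict) -> dict:
        """Source-specific logic: one raw item in, one node out."""

    def stream_nodes(self) -> Iterator[dict]:
        """Lazily transform every raw item yielded by the source."""
        for raw in self.source:
            yield self.transform_node(raw)

    def load(self, destination: list) -> int:
        """Write all nodes to a destination (a list stands in for a real sink)."""
        count = 0
        for node in self.stream_nodes():
            destination.append(node)
            count += 1
        return count


class MeshTransformer(BaseTransformer):
    """Hypothetical subclass for a source whose items use 'ui' and 'name' keys."""

    def transform_node(self, raw: dict) -> dict:
        return {"id": raw["ui"], "label": raw["name"], "source": "mesh"}


sink: list = []
loaded = MeshTransformer([{"ui": "example-1", "name": "Insecta"}]).load(sink)
```

Supporting a new source in this sketch means writing one more subclass with its own `transform_node`; the streaming and loading code is inherited unchanged.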

Robustness

Each graph pipeline run involves many API calls to other services such as Neptune or Elasticsearch, which are prone to transient failures. To minimise the impact of such failures, we employ several layers of retries. We use the backoff package to automatically retry all API calls using an exponential backoff mechanism, and we also retry failed steps at the state machine level. As a result, we’re only alerted when there is a persistent issue rather than a brief malfunction or network partition.
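To show the idea behind exponential backoff, here is a hand-rolled decorator written with the standard library only. It is a stand-in for illustration and is not the API of the backoff package mentioned above; the parameter names (`max_tries`, `base_delay`, `sleep`) are assumptions of this sketch.

```python
import time
from functools import wraps


def retry_with_backoff(max_tries=4, base_delay=0.01, sleep=time.sleep):
    """Retry a function on any exception, doubling the wait after each failure."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_tries - 1:
                        raise  # Persistent failure: let the caller see it.
                    sleep(base_delay * 2**attempt)

        return wrapper

    return decorator


calls = {"count": 0}
delays = []  # Records requested waits instead of actually sleeping.


@retry_with_backoff(max_tries=4, base_delay=0.01, sleep=delays.append)
def flaky_query():
    """Simulated API call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_query()
```

Two transient failures are absorbed silently here, and only a failure on the final attempt would propagate, which is the behaviour that keeps alerts limited to persistent issues.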

When making API calls that involve large numbers of items, we split the work into smaller parallel requests instead of sending a single large one. We’ve found this approach to be more reliable for both Neptune and Elasticsearch. The multi-threading logic used for this purpose is abstracted into shared utilities, so developers don’t have to think about concurrency each time they modify a service.
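A shared utility of this kind could be sketched as follows, using a thread pool from the standard library. The function names, chunk size, and worker count are assumptions, and `fake_lookup` stands in for a real call to Neptune or Elasticsearch.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def run_in_parallel(
    items: Iterable,
    request: Callable[[list], list],
    size: int = 100,
    workers: int = 4,
) -> list:
    """Send one request per chunk on a thread pool and flatten results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(request, chunked(items, size))
        return [row for batch in results for row in batch]


def fake_lookup(ids: list) -> list:
    """Stand-in for a batched call to an external service."""
    return [f"doc-{i}" for i in ids]


docs = run_in_parallel(range(250), fake_lookup, size=100)
```

Callers pass a flat iterable and a per-chunk function, so the concurrency details stay inside the utility.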

Generators all the way down

To minimise memory usage and speed up execution, all pipeline services stream data wherever possible. Source classes that read raw data from zipped files stream file contents in chunks and yield extracted items one by one using Python generators (which produce values lazily, rather than all at once). Transformer classes consume these generators, yielding transformed Pydantic models.

This approach allows services to process very large files with minimal memory overhead: while data is still being read at one end, processed output is already being written out at the other end. Using generators consistently throughout the codebase means we can nearly always treat source data as a simple, flat iterable and only worry about batching when interacting with external services.
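The chained-generator pattern can be sketched as below. The gzipped JSON Lines format and the `id` and `label` fields are assumptions for this example; real source files may be structured differently.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator


def read_items(stream: BinaryIO) -> Iterator[dict]:
    """Lazily yield one parsed item per line of a gzipped JSON Lines stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


def transform(items: Iterator[dict]) -> Iterator[dict]:
    """Lazily reshape each item; nothing is read until a consumer asks for it."""
    for item in items:
        yield {"id": item["id"], "label": item["label"].strip()}


# Build a small gzipped file in memory to stand in for a real source dump.
raw = b'{"id": "a", "label": " Insecta "}\n{"id": "b", "label": "Arthropods"}\n'
compressed = io.BytesIO(gzip.compress(raw))

pipeline = transform(read_items(compressed))
first = next(pipeline)  # Only the first item has been parsed at this point.
rest = list(pipeline)
```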

Small Cypher queries

When extracting relationships from the knowledge graph for inclusion in catalogue API indexes, we use the Cypher query language. Our initial implementation relied on large, complex queries that encoded all the logic for computing relationships. Our most complex query, which recommended related theme pages based on shared items, grew to over 100 lines of Cypher code. These queries were hard to understand, time-consuming to modify, and caused performance issues when running on large batches.

For this reason, we now favour using small, modular Cypher queries which run in parallel and whose results are combined in Python. For example, one query might return all themes that are considered synonyms of a given theme, and another query might find related themes for each of those, with Python logic assembling and deduplicating the final results. The overall complexity of the system doesn’t change, but it’s distributed more evenly. The result is code that’s easier to understand and maintain, and queries that perform more reliably at scale.

A Cypher query retrieving synonymous themes (called concepts for historical reasons) via their connections to items from external knowledge bases (called source concepts). Inputting the ID of the Insecta theme would output the IDs of the Insect and Insects themes.
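The sketch below illustrates the approach of combining small queries in Python. The Cypher strings are illustrative only: the node labels and relationship types are assumptions rather than the real graph schema. The `run_query` function returns canned results in place of a real graph client, and the queries run sequentially here for simplicity rather than in parallel.

```python
# Illustrative Cypher only: the labels and relationship types are assumptions.
SYNONYMS_QUERY = """
MATCH (c:Concept {id: $id})-[:HAS_SOURCE_CONCEPT]->(:SourceConcept)
      -[:SAME_AS*0..2]-(:SourceConcept)<-[:HAS_SOURCE_CONCEPT]-(other:Concept)
WHERE other.id <> $id
RETURN DISTINCT other.id AS id
"""
RELATED_QUERY = """
MATCH (c:Concept {id: $id})-[:HAS_SOURCE_CONCEPT]->(:SourceConcept)
      -[:RELATED_TO]-(:SourceConcept)<-[:HAS_SOURCE_CONCEPT]-(other:Concept)
RETURN DISTINCT other.id AS id
"""

# Canned results standing in for a live graph database.
FAKE_RESULTS = {
    (SYNONYMS_QUERY, "insecta"): ["insect", "insects"],
    (RELATED_QUERY, "insecta"): ["entomology"],
    (RELATED_QUERY, "insect"): ["entomology", "arthropods"],
    (RELATED_QUERY, "insects"): ["insecta"],
}


def run_query(query: str, concept_id: str) -> list[str]:
    """Stand-in for a graph client call; returns canned rows."""
    return FAKE_RESULTS.get((query, concept_id), [])


def related_themes(concept_id: str) -> list[str]:
    """Combine two small queries: find synonyms, then related themes for each."""
    same_as = {concept_id, *run_query(SYNONYMS_QUERY, concept_id)}
    related: list[str] = []
    for member in sorted(same_as):
        for candidate in run_query(RELATED_QUERY, member):
            # Skip synonyms of the input and anything already collected.
            if candidate not in same_as and candidate not in related:
                related.append(candidate)
    return related


result = related_themes("insecta")
```

Each query stays short enough to read at a glance, while the deduplication and merging logic lives in ordinary Python that can be unit tested without a database.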

In search of ancestors

Integrating a graph database into the catalogue pipeline proved useful as a means of simplifying other parts of the pipeline. In particular, it allowed us to replace a set of services collectively known as the relation embedder subsystem. These services were responsible for constructing hierarchical relationships between catalogue items, allowing us to render complex archive trees in the frontend, showing all ancestors, siblings, and descendants of each item.

A tree populated by nine different specimen of the family of Scividae (squirrels). Coloured etching by W. Warwick after Captain T. Brown.
Source: Wellcome Collection Public Domain Mark

Previously, when processing a given item, the relation embedder used source metadata to find the item’s ancestors and embed them into its Elasticsearch document. This was more complicated than it sounds, since some source systems only provide an item’s immediate parent, and so constructing the full ancestry required repeatedly traversing the hierarchy. The relation embedder stored intermediate results in a flat Elasticsearch index, computing an ancestor list for each item in a denormalised way.

A graph database offers a much more natural way to model such hierarchies, with parent relationships represented as graph edges. This allowed us to remove the relation embedder entirely and extract parent relationships directly using the extractor service. Simplifying the pipeline also improved data quality by eliminating a long-standing race condition.

Modelling hierarchical relationships in the graph. Items are associated with identifiers, and identifiers reference their parents. By combining these two relationship types, we can reconstruct the full ancestry using a single Cypher query.
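To make the ancestry idea concrete, the following sketch walks parent edges held in a plain dictionary. The identifiers are invented, and the dictionary is a stand-in for the graph: in a graph database the same walk would typically be expressed as a variable-length path match rather than a Python loop.

```python
# Hypothetical parent edges: each identifier points at its parent's identifier.
PARENT_OF = {"A/1/2": "A/1", "A/1": "A", "A/2": "A"}


def ancestors(identifier: str, parent_of: dict[str, str]) -> list[str]:
    """Walk parent edges upward, returning ancestors from nearest to root."""
    path: list[str] = []
    seen = {identifier}
    current = identifier
    while current in parent_of:
        current = parent_of[current]
        if current in seen:
            # Guard against malformed source data that points back on itself.
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        path.append(current)
    return path


lineage = ancestors("A/1/2", PARENT_OF)
```

Only immediate parents are stored, yet the full lineage falls out of following the edges, which is the property that made the flat intermediate index unnecessary.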

From events to windows

This work is part of a broader shift from an event-based architecture to a batch-oriented one. Historically, the catalogue pipeline was entirely event driven, with individual services communicating via message queues. This approach has several benefits, such as horizontal scalability and low latency (since items can be processed individually in near real-time).

However, event-based systems become expensive and inefficient during full reindex operations (where the entire dataset must be reprocessed), since each processed item generates its own messages. Event-based systems can also make it harder to reason about state, and the ephemeral nature of messages complicates debugging when something goes wrong.

To avoid these limitations, the graph pipeline operates on batches of items, with each service accepting a time window and processing all items which were modified within it. This approach works well for both small batches (as part of regular incremental runs) and large batches (as part of full reindex runs). It also simplifies retries, supports local runs, and provides an audit trail by associating each run with artefacts stored in S3.
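A window-based service interface might be sketched like this. The half-open window convention, the `modified` field, and both function names are assumptions of the example rather than details taken from the pipeline.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator


def items_in_window(items: list[dict], start: datetime, end: datetime) -> list[dict]:
    """Select items modified in the half-open window [start, end)."""
    return [item for item in items if start <= item["modified"] < end]


def split_into_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[tuple[datetime, datetime]]:
    """Split a long range into consecutive windows, for example for a reindex."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


t0 = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
ITEMS = [
    {"id": "a", "modified": t0 + timedelta(minutes=5)},
    {"id": "b", "modified": t0 + timedelta(minutes=15)},
    {"id": "c", "modified": t0 + timedelta(minutes=20)},
]

# A regular incremental run covers one 15-minute window.
incremental = items_in_window(ITEMS, t0, t0 + timedelta(minutes=15))

# A larger range is just a sequence of windows handled the same way.
reindex_windows = list(
    split_into_windows(t0, t0 + timedelta(minutes=40), timedelta(minutes=15))
)
```

Because a run is fully described by its window, retrying it or reproducing it locally amounts to calling the same function with the same two timestamps.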

Into the future

The Wellcome Collection knowledge graph opens up many possibilities, and this work is only the first step. Ongoing efforts include using machine learning techniques to further reduce duplicate theme pages, and experimenting with new ways to use knowledge graph relationships to recommend related content and improve browsing across the collection. If all goes well, your experience of exploring the collection in January 2027 will be better still.

A hot-air balloon leaves the ground in driving rain. Coloured wood engraving.
Source: Wellcome Collection Public Domain Mark

Integrating the Wellcome Collection knowledge graph was originally published in Stacks on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Archive Production Line: a Kinaesthetic Approach.

Angela Saward — Thu, 08 Jan 2026 13:05:24 GMT

Heni Hale undertook a three-month placement with us in the Research and Enquiry team at Wellcome Collection. Below, Heni shares an account of her approach to research and her PhD project, ‘Relational Practice and the Tavistock Institute Archive: Embodiment and Social Relations’.

Words by Heni Hale

Context

Image still from a video assemblage. Archive materials from Draft Report on Socio-Technical Systems in Higher Mechanisation and Automation SA/TIH/B/2/8/4/2, Wellcome Collection, 1959. Featured performers Ben Ash and Marina Collard.

Between April and July 2025, I carried out a short placement at Wellcome Collection with the Research & Enquiry team. This placement sat alongside and fed into my PhD project, ‘Relational Practice and the Tavistock Institute Archive: Embodiment and Social Relations’ conducted with Coventry University’s Centre for Dance Research (C-DaRE) and Tavistock Institute of Human Relations (TIHR). TIHR’s 75-year archive is held at Wellcome, and my research explores kinaesthetic and choreographic approaches to reading historical materials, most of which are paper-based typed or handwritten documents and fieldnotes from action research studies of human relations in work cultures.

Alongside this archival immersion, it felt important to acknowledge the relationship with Wellcome, both embodied and social, that shapes how I access, interact with and make sense of these materials. Sharing parts of my creative practice as research became a form of reciprocal knowledge exchange with staff from the institution.

Background to my research

I approach the study and reading of historical materials through the prisms of kinaesthetic sensibility and choreographic processes drawn from my background as a dancer and performance-maker.

Kinaesthetic sensibility means encountering archival materials with attention to bodies — mine and historical others — and to movement, materiality, sensory detail, social relationality, positionality, and spatial organisation. This contrasts with traditional archival research that seeks evidence to fill gaps.

Choreographic processes involve organising movement as performance. My choreographic work involves selecting archive fragments and making facsimile copies, re-arranging them in a scrambling process that involves moving collages, vocal re-readings, assemblages, stagings, and re-enactments. These re-enactments are not literal or representational but are creative re-readings of the past as embodied performance (Crawley, 2020; Schneider, 2011).

The Tavistock archive material that my project engages with focuses on the late 1950s and contains intricately observed descriptions of factory workers and their relationships to technology and automation. Detailed theoretical analyses of the relationships between ‘man and machine as a total system’ became the foundation for Socio-Technical Systems theories (Emery, 1959). Working creatively with the scrambling, embodying and reassembling of materials affects and reanimates how I see current relationships to work and technology. This brings sensation to the ways that tensions from them resonate, or clash, with how I see work now.

Encountering Wellcome’s System

With kinaesthetic sensibility sharpened by Tavistock’s socio-technical systems thinking, I have the opportunity to observe the production-line of archival items as a whole system within Wellcome Collection. This provokes considerations about how TIHR’s materials came to exist in the collection at all, and how they “perform” through my embodied encounter with them.

The Building

Most meetings with Wellcome team members happen on the ground floor of the Wellcome Trust building, known as The Street. It has the insistent, continual but irregular hum of the coffee machine firing frothed oat milk, pounding coffee grinds, crockery stacking, chairs grating…

The space is cathedral-like, the central ceiling rising all the way up to the ninth floor, forming a resonant echo chamber. Voices are like a river, babbles of indecipherable consonants and vowels, a laugh pitches and strikes out occasionally to collide with metal and glass. From street level we can sense the tiers of work activity that look down into it from balconies of office desks. I work on floor 3 with the Research & Enquiry team and can look across to Investments and up to Finance. I’m accompanied by the clicking of mice, and tapping of keys, paper flapping occasionally. I have the view of the gliding soundless lifts flowing up and down, and outside the tree-leaves breathe and dance above the Euston Road traffic.

This is a field of awareness that we sense together, it is layers of texture, it seeps into our pores, it is the air we share.

Preciousness, gatekeeping, object handling, contamination

Meetings with the research development leads (Angela Saward, Elma Brenner, Julia Nurse and Eris Williams-Reed) reveal the depth and rigour of the historical research they engage and support, connecting communities with invested interest in the search and use of distinct materials. I begin to share in a wonder at the paradox of preciousness and impermanence of the materials. Each item’s survival and continued existence in these conditions of protected access feels improbably contingent. A scrawled doodle or memo in TIHR’s archive prompts the question ‘how is this here, against all the odds?’ To have been selected to remain, each item has been through so many stages of gatekeeping: decisions, permissions, encounters, accidents.

But the team concerns itself primarily with who is participating in the use of archive materials and how interaction or lack of interaction enlivens but also changes an object. Users create the archive and are involved in its meaning. So whilst collection items are to be cherished and revered, they can be changed depending on how they are handled.

Archival practices of handling and permission for use have changed over time and impacted what we now receive in collections today. Many older documents are no longer simply original pieces but bear the annotations and traces of past researchers and archivists. Old books or scrolls have memory lines. Contamination comes not only from human users but also from other creatures that nest (beetles, moths, silverfish) and from other atmospheric invaders (mould, red rot turning leather to powder). The archive even destroys itself; iron gall ink perforates and eats through pages. Archives survive and self-destruct simultaneously.

The archive production line: care in the system

Shadowing Beth Milton from the Library Engagement & Experience team, I navigate the choreography and ‘backstage’ work of the archive production line as it weaves through the various architectures of the building to bring archive materials to readers. We start in the Zebra print room, where an automated machine prints out the item orders onto tickets, descend to the stacks in the basement and on to the production desk, and weave back up with trolleys and cardboard boxes via lifts to the back entrance of the Rare Materials Reading Room.

Witnessing and feeling through this routine corresponds with my investigations into embodied systems thinking in factories within the TIHR archive. I spiral into a constellation where Tavistock socio-technical systems (STS) collide with choreographic thinking and connect to embodied practices of care and access (here I see the routine, the layout of the building, trolleys, etc. as aspects of technology in the system).

Beth’s process is a routine, but I’m interested in the points of decision, discretion, and possibility for error that would be described in STS theories as the ‘free parameters of the system’ (Murray, 1959). These are the points of bifurcation at which there may be human variability in how a procedure is enacted: the person making the decision brings all of themselves, their histories, wants and moods on the day, and has an effect on the system that gives it dynamics and vitality.

In the basement stacks, where most archive items are stored, Beth explains how she needs to take her time. The reference checking, locating and transferring of items into boxes requires concentration; the codes are in small print and she needs to be precise. Small details demonstrate the discretion within the system to care for other workers in the team who will pick up the workflow later. Leaving the shelf in a state of legibility for the next person, signing out of the computer system, are some of her free parameters of the system. Within this tactile work with archives and the labour of thought it involves, I begin to recognise the systems nature of all work. Even my own research process is its own production line with variable moments of drift, choice, order, and care.

Still from performance video. Fragments from Murray, H. (1959) Note on Task Structure and Work Relations in Automated Units: General references, notes and articles on automation project. SA/TIH/B/2/8/4/3 Wellcome Collection. Performer Ali Baybutt.

Cataloguing as affective work

Conversations with cataloguing teams revealed the complexity and variability in how collections are described and made accessible. Efforts toward consistency, vital for accessibility and institutional coherence, might sometimes be at odds with the diverse ways people make sense of materials, a tension mirrored in my own research practices.

Observing the cataloguing work of archivist Hannah Nagle made visible a sensibility to aesthetics and affects in her work. She guided me through two projects: processing a set of deeply personal diaries written by the brother of a patient of a psychiatrist donor, and conducting metadata work, listening to recordings and establishing correct dates to put onto a catalogue. The diaries, chronicling the author’s mental health journey toward taking their own life, rouse strong emotional stirrings and require her to guard her own mental health in an effort to reduce vicarious trauma. For this reason, she structures her day-to-day work by reserving sensitive work on the diaries for the morning and leaving the mechanistic metadata work to the afternoon.

The delicate conflicts that her encounter with this sensitive material provokes can be recognised as the emotional labour inherent in working with materials from the past. Even my experiences with the less overtly sensitive TIHR archive regularly produce migraines and exhaustion. The politics, values and injustices that are surfaced when reading accounts from past work cultures stir up and seep into my own contemporary narratives.

Hannah’s role sits at a charged frontline of meaning-making: deciding what matters, what to name or hide. At stake are questions such as: Does a decision to name something increase or minimise the original voice? How do we know the author’s intentions? How do we acknowledge that this material was never intended to be part of a public facing collection, and decide whether a breach of personal privacy is outweighed by the value in sharing a lived experience?

Cataloguing work, often framed as objective, in reality requires a lot of discretion based on individual sensitivity. The role is a weighty site of responsibility, agency and decision-making. In a Tavistockian system this work would be a major control point in the system of archive production. Hannah practices resilience and self-care through discomfort; slow decision-making as resistance to pressures to reach an outcome; she moulds the routine jobs to rhythms that suit her needs; and orients to decolonisation — in a broad sense — becoming sensitised to the kinaesthetic feelings that extraction and exploitation produce in us. All these demonstrate her navigation of delicate parameters of freedom where human sensitivity shapes the system.

Still from performance video. Fragment from Murray, H. (1959) Note on Task Structure and Work Relations in Automated Units: General references, notes and articles on automation project. SA/TIH/B/2/8/4/3 Wellcome Collection. Performer Ali Baybutt.

Digital archives, replication and misuse

My practice involves making digital copies of archival materials and reconfiguring them into new assemblages. This raises a complexity of ethical issues about digital archives and their re-use and I discussed with Hannah notions of archive hacking and the weird reconstitutions that are possible with the image traces of the past. The retelling of stories through mediation of appropriated images can be both a speculative re-envisioning and a potentially weaponised mistruth. Hito Steyerl’s notion of the poor image (2009) — images that circulate widely, degraded yet potent — resonates strongly. Digital reproduction evokes ‘swarm circulation, digital dispersion, fractured and flexible temporalities’ but in doing so can generate ‘defiance and appropriation’ as much as ‘conformism and exploitation.’ (p.8) Archival filmmaker Miranda Pennell suggests ‘all re-use is mis-use’ but misuse can be generative if it exposes power structures or unsettles assumptions.

The ethics of working with people’s spoken words and images weighs heavily. Whilst they may have given consent to the original research and its purposes (although I have no record of this), I cannot gain consent from those no longer here for my excavations, intrusions, mediations, and speculations. So, following Hannah’s example, I deepen my commitment to slow thinking (weighing consequences, making decisions deliberately, resisting a rush toward output) when using and mis-using historical documents.

Sharing creative practices in the Viewing Room

The final stage of my placement involved sharing my methodology with colleagues through a performance research event in the Viewing Room. I presented raw fragments, practice rituals, and working scores, inviting staff from Wellcome, researchers from Tavistock communities, and colleagues from my dance research network. The aim was to open a conversation across these three distinct communities about what embodied, creative research with archives can produce as knowledge.

Parallel to the placement, I had worked with a group of artists as a temporary organisation that we named ‘Our Tiny Factory’, a nod to the assemblies and production lines I witnessed in TIHR fieldnotes and at Wellcome. Drawing on TIHR’s study of a 1950s sulphur recovery factory, we used kinaesthetic and choreographic methods to re-read field notes by Hugh Murray and David Armstrong. Our mediations and re-enactments became poetic forms of memory making, akin to what Saidiya Hartman (2008) calls ‘critical fabulation.’ The performance, ‘Memory Machine: Acts of Witnessing’, layered video documentation, re-staged fieldnotes, and embodied readings with archival fragments.

What I witnessed at Wellcome is that an archive lives through the hands, bodies, and decisions that gather around it. Archives, like factories, are socio-technical systems animated by human judgement, emotion, and attention. To meet the archive kinaesthetically is to recognise these dynamics in their ongoing production. I see my research as a momentary movement in a much longer dance.

Heni Hale January 2026 https://linktr.ee/heni_hale

Stills from Memory Machine: Acts of Witnessing performance sharing at Wellcome Viewing Room. Photographer: Tony Wadham. Performers: Alice Gale-Feeney, Ali Baybutt and Rachel Lopez de la Nieta.

References

Crawley, M.-L. (2020). Dance as Radical Archaeology. Dance Research Journal, 52(2), 88–100. https://doi.org/10.1017/S0149767720000194

Emery, F.E. (1959). ‘Characteristics of Socio-Technical Systems’. Tavistock Documents #527. Wellcome Collection SA/TIH/B/1/1/4

Hartman, S. (2008). Venus in Two Acts. small axe, 26, 1–14. https://doi.org/10.1215/-12-2-1

Murray, H. (1959). ‘A Working Note on Task Structure and Work Relations in Automated Units’. Wellcome Collection SA/TIH/B/2/8/4/3

Schneider, R. (2011). Performing Remains: Art and War in Times of Theatrical Reenactment. Routledge.

Steyerl, H. (2009). In Defence of the Poor Image. E-flux Journal #10

The Archive Production Line: a Kinaesthetic Approach. was originally published in Stacks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Enhancing Discovery and Exploration: Leveraging Graph Technology for Wellcome Collection

Antonia Langfelder — Tue, 01 Apr 2025 15:51:08 GMT

Nervous system in a fruit fly larva, serial section TEM by Albert Cardona, HHMI Janelia Research Campus. Source: Wellcome Collection CC BY-NC 4.0

Wellcome Collection’s catalogue records are tagged with concepts — keywords for things like subjects, contributors, languages, genres — during the cataloguing process to categorise works effectively. These concepts are showcased on the website via dedicated theme pages which display the records tagged with them.

Examples of theme pages for the concepts ‘Medicine’ and ‘Elizabeth Garrett Anderson’ (as of March 2025)

However, there is an opportunity to enrich and further improve these theme pages. Some concepts are derived from external sources like Medical Subject Headings (MeSH) and the Library of Congress, while others are manually generated keywords without external identifiers, leading to duplication issues. Problems such as these hinder the effectiveness of theme pages in terms of helping people discover the breadth of the collection online. Consistent with this, recent user research has also revealed that theme pages are currently under-explored.

To address these challenges, we are developing a graph database that captures enriched information connected to our concepts. By integrating external data sources, including MeSH, the Library of Congress, and Wikidata, we aim to provide useful metadata and connections to other related concepts. Our vision for this interconnected graph of concepts and catalogue content is to ultimately enhance the usability of our theme pages, unlock new levels of user engagement, and facilitate a more profound exploration of the collection.

What is a graph and why do we need one?

A graph consists of two main components:

Nodes, which represent the unique entities within the graph. For example, there can be a node for a catalogue record (work), and another node for a concept that the work has been tagged with. Each of these nodes can have other metadata attached to them, such as brief descriptions (these are also called node properties).
Edges, which are the connections between the nodes. An edge represents some type of relationship between two nodes, which is either directed or undirected. For example, there can be an undirected edge between two related concepts which have the same semantic meaning but come from different ontologies. Or a directed edge from a work to a concept that it has been tagged with in the catalogue.
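The node and edge definitions above can be sketched in a few lines of Python. This is only an illustration of the idea, not the data model used in production: the class names, the identifiers and the `HAS_CONCEPT` edge type are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str  # e.g. "Work" or "Concept"
    properties: dict = field(default_factory=dict)  # node properties


@dataclass
class Edge:
    from_id: str
    to_id: str
    edge_type: str
    directed: bool = True


# Hypothetical identifiers and edge type names, for illustration only.
nodes = [
    Node("work-1", "Work", {"title": "An example catalogue record"}),
    Node("concept-a", "Concept", {"label": "Medicine"}),
    Node("concept-b", "Concept", {"label": "medicine"}),
]
edges = [
    # Directed: the work is tagged with the concept.
    Edge("work-1", "concept-a", "HAS_CONCEPT"),
    # Undirected: two concepts with the same meaning from different ontologies.
    Edge("concept-a", "concept-b", "SAME_AS", directed=False),
]


def neighbours(node_id, edges):
    """Return ids reachable in one step, respecting edge direction."""
    found = set()
    for edge in edges:
        if edge.from_id == node_id:
            found.add(edge.to_id)
        elif edge.to_id == node_id and not edge.directed:
            found.add(edge.from_id)
    return found
```

Calling `neighbours("work-1", edges)` follows the directed edge to the concept, while calling it on either concept follows the undirected edge in both directions.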

Using this approach, we can create a map of all the works and concepts in the collection, with meaningful connections between them. In general, modelling data as a graph has several advantages, such as:

A graph can capture complex relationships between diverse sources of information with multiple types of edges and properties. This can include both structured and unstructured data.
Graph databases are optimised for executing complex queries and traversal of relationships, revealing new paths and connections.
The graph structure enables mathematical functions and machine learning, such as finding shortest paths or patterns, clustering, predicting missing attributes, recommending new concepts, and more.
Graphs can be visualised, making it possible to show patterns, outliers, or clusters within the data.

Our graph development journey

Before going ahead and creating a graph database, we carefully considered how best to represent our internal and external data sources effectively. We also wanted to ensure our graph data model supports various downstream tasks like ingestion into Elasticsearch, analytics, and visualisation.

We have incorporated the following node types into our graph data model:

Work: This node type represents catalogue records, encompassing all the materials indexed during the cataloguing process.

Concept: These nodes represent the various concepts tagged during cataloguing. This includes anything from thematic categories, genres, to the names of people who contributed to a work.

SourceName, SourceConcept, and SourceLocation: These nodes capture the comprehensive data fetched from external sources, i.e. MeSH, Library of Congress (subjects and names), and Wikidata. The differentiation into multiple node types allows us to store relevant information specific to each type. For instance, SourceName nodes include details like a person’s date and place of birth, information which does not apply to the other source node types.

Connecting Work and Concept nodes

In our graph database, we create edges between Work and Concept nodes to represent which catalogue records are tagged with which concepts. We also store the section in which the concept was referenced, which tells us whether it represents a contributor to a work, a subject, or a genre.
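As a rough sketch of why the referenced section is worth storing on the edge, the snippet below groups the works tagged with a concept by the section in which the concept appeared. The tuple layout, identifiers and section names are assumptions for the example and do not reflect the real schema.

```python
from collections import defaultdict

# (work id, concept id, section in which the concept was referenced)
# All values are invented for illustration.
WORK_CONCEPT_EDGES = [
    ("work-1", "concept-medicine", "subject"),
    ("work-1", "concept-person", "contributor"),
    ("work-2", "concept-medicine", "subject"),
    ("work-2", "concept-portraits", "genre"),
]


def works_by_section(concept_id, edges):
    """Group the works tagged with a concept by the referenced section."""
    grouped = defaultdict(list)
    for work_id, target_id, section in edges:
        if target_id == concept_id:
            grouped[section].append(work_id)
    return dict(grouped)
```

With the section stored as an edge property, the same concept can be shown as a subject of some works and as a contributor to others without duplicating the concept node.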

Connecting Concept nodes to external sources

We generate edges between Concept nodes and our different source node types (currently SourceConcept, SourceName, and SourceLocation). These connections can be made in multiple ways. The simplest method is when the catalogue concept already includes an identifier from an external source, allowing us to create an edge to the corresponding source node via its identifier.

However, many of our concepts, known as label-derived concepts, do not come with external source identifiers. One of our goals is to reduce these instances by linking as many label-derived concepts to external sources as possible. Currently, we generate these links based on the concept label: a link is created when the concept label exactly matches a label from an external source or one of the synonyms provided by the source ontologies.
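Label-based matching of this kind can be sketched as a dictionary lookup over source labels and synonyms. The source identifiers and labels below are invented, and ignoring case and extra whitespace is an assumption made for the example; the post only states that labels must match exactly.

```python
def normalise(label):
    # Assumption: case and surrounding or repeated whitespace are ignored.
    return " ".join(label.split()).casefold()


def build_label_index(source_labels):
    """Map each normalised label or synonym to the source ids that use it."""
    index = {}
    for source_id, labels in source_labels.items():
        for label in labels:
            index.setdefault(normalise(label), set()).add(source_id)
    return index


def match_label_derived(concept_label, index):
    """Return the source ids whose label or synonym equals the concept label."""
    return index.get(normalise(concept_label), set())


# Hypothetical source nodes: the preferred label followed by synonyms.
SOURCE_LABELS = {
    "source-1": ["Insecta", "Insects", "Insect"],
    "source-2": ["Arthropods"],
}
LABEL_INDEX = build_label_index(SOURCE_LABELS)
```

Here `match_label_derived("insects", LABEL_INDEX)` returns `{"source-1"}`, while a label with no exact counterpart returns an empty set and stays unlinked.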

Our data model and pipeline design allow for the introduction of more complex matching methods in the future, such as those generated by machine learning. By storing the source of each edge and tracking how it was introduced, we maintain the integrity and traceability of our data connections. We have processes in place which allow us to effectively overrule individual wrong matches, which also means we can manually exclude connections which are potentially problematic.
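One way to picture edge provenance and manual overrides is shown below. This is a sketch under assumptions: the `matched_by` field, its values and the exclusion set are hypothetical names, not the actual mechanism used in the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchEdge:
    concept_id: str
    source_id: str
    matched_by: str  # how the edge was introduced, e.g. "identifier" or "label"


# Hypothetical manual overrides: (concept id, source id) pairs to exclude.
EXCLUDED_MATCHES = {("concept-2", "source-b")}


def apply_overrides(edges, excluded):
    """Drop any edge that has been manually overruled."""
    return [
        edge for edge in edges
        if (edge.concept_id, edge.source_id) not in excluded
    ]


candidate_edges = [
    MatchEdge("concept-1", "source-a", "identifier"),
    MatchEdge("concept-2", "source-b", "label"),
    MatchEdge("concept-3", "source-c", "label"),
]
kept_edges = apply_overrides(candidate_edges, EXCLUDED_MATCHES)
```

Because every edge records how it was created, a new matching method could be added with its own `matched_by` value and audited or rolled back separately from the existing ones.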

Connecting external source nodes to each other

We store various edge types between source concepts. Broadly speaking, these can be divided into two categories. One of these is SAME_AS relationships between concepts from different sources which all represent the same entity. For example, there is a SAME_AS edge between the concept Medicine from MeSH and the concept medicine from Wikidata. The other category is edges from one concept to another, different concept. This includes broader/narrower or otherwise related terms, as well as family members and fields of work. An example of this is Elizabeth Garrett Anderson, whose field of work is medicine and who is related to Louisa Garrett Anderson. All these edges are derived from the various source ontologies and can be extended in the future based on research into which onward journeys we want to enable from theme pages.
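To illustrate how SAME_AS edges let several source concepts be treated as one entity, the sketch below groups nodes connected by such edges using a small union-find. The identifiers are made up, and in practice this kind of traversal would more likely be expressed as a graph query than as Python code.

```python
def group_same_as(node_ids, same_as_edges):
    """Group nodes that are connected, directly or indirectly, by SAME_AS."""
    parent = {node_id: node_id for node_id in node_ids}

    def find(node_id):
        # Follow parent links to the group representative (path halving).
        while parent[node_id] != node_id:
            parent[node_id] = parent[parent[node_id]]
            node_id = parent[node_id]
        return node_id

    for left, right in same_as_edges:
        parent[find(left)] = find(right)

    groups = {}
    for node_id in node_ids:
        groups.setdefault(find(node_id), set()).add(node_id)
    # Sort the groups so the output order is deterministic.
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical identifiers for the same concept in three sources.
SOURCE_NODES = ["mesh:medicine", "wikidata:medicine", "lc:medicine", "wikidata:person"]
SAME_AS_EDGES = [
    ("mesh:medicine", "wikidata:medicine"),
    ("wikidata:medicine", "lc:medicine"),
]
UNIFIED = group_same_as(SOURCE_NODES, SAME_AS_EDGES)
```

The three medicine nodes end up in one group even though only two pairs are linked directly, which is the behaviour a unified theme page needs.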

Implementation

As expected for such a large-scale project, there were many things to consider when it came to its implementation. For example, we decided to host the graph in AWS Neptune, which enables us to load all entities in bulk, meaning we can load millions of nodes and edges into the database within minutes. Another key aspect was to design our graph pipeline in a modular way and remove dependencies as much as possible. This ultimately enabled multiple developers in the team to easily add new data sources and modify existing ones without having to know about other parts of the pipeline, such as the Neptune bulk loader.
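Bulk loading generally means writing nodes and edges to files first and handing those files to the loader. The sketch below writes CSV using the column headers of Neptune's Gremlin load format (`~id`, `~label`, `~from`, `~to`). The post does not say which load format the pipeline uses, so treat that choice, the property column and the edge id scheme as assumptions.

```python
import csv
import io


def nodes_to_csv(nodes):
    """Serialise nodes as CSV with Gremlin-style bulk load headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["~id", "~label", "label:String"])
    for node in nodes:
        writer.writerow([node["id"], node["type"], node["label"]])
    return buffer.getvalue()


def edges_to_csv(edges):
    """Serialise edges as CSV; the edge id scheme is an arbitrary choice."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for edge in edges:
        edge_id = f"{edge['type']}:{edge['from']}-->{edge['to']}"
        writer.writerow([edge_id, edge["from"], edge["to"], edge["type"]])
    return buffer.getvalue()


# Hypothetical sample data.
sample_nodes = [{"id": "c1", "type": "Concept", "label": "Medicine"}]
sample_edges = [{"from": "w1", "to": "c1", "type": "HAS_CONCEPT"}]
node_csv = nodes_to_csv(sample_nodes)
edge_csv = edges_to_csv(sample_edges)
```

Keeping serialisation in small functions like these is one way a new data source could be added without its author needing to know how the files are later loaded.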

All of this made populating the catalogue graph very efficient, to the point where we managed to go from an empty Neptune cluster to a fully-fledged graph in a matter of a few months. The graph is set to replace the concepts data provided via Elasticsearch to the Wellcome Collection website.

catalogue graph sample | Graph Commons

Watch this space…

Ultimately, we want to use the graph to enable various improvements to theme pages on our website, such as:

Filtering and aggregating works related to a single, unified concept which exists in multiple source ontologies. This extends to label-derived concepts which can be matched to these, as described above.
Displaying relevant information from external data sources on concept pages, such as descriptions, birth dates, and links to other data.
Providing onward journeys from concept pages to related, broader/narrower concepts and concepts that frequently co-occur on works.
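The last item in the list, concepts that frequently co-occur on works, comes down to counting. A minimal sketch, with invented works and tags and a hypothetical `co_occurring` helper:

```python
from collections import Counter

# Hypothetical mapping from work id to the concepts it is tagged with.
WORK_CONCEPTS = {
    "work-1": {"insecta", "ponds", "dragonflies"},
    "work-2": {"insecta", "ponds"},
    "work-3": {"insecta", "ponds", "entomology"},
    "work-4": {"medicine"},
}


def co_occurring(concept_id, work_concepts, top_n=3):
    """Count the concepts that appear on the same works as concept_id."""
    counts = Counter()
    for concepts in work_concepts.values():
        if concept_id in concepts:
            counts.update(c for c in concepts if c != concept_id)
    return counts.most_common(top_n)
```

In this toy data, `co_occurring("insecta", WORK_CONCEPTS, top_n=1)` returns `[("ponds", 3)]`, so a theme page for the first concept could suggest the second as an onward journey.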

While enrichment of concept pages via source ontologies is the current focus of the graph, it is only one of its possible use cases. For example, as mentioned earlier, a graph can also facilitate visualisation of the Collection and support ML tasks via graph embeddings. Additionally, having a graph enables us to analyse the effects of any substantial pipeline changes before applying them in production, such as how the various ways of matching label-derived concepts to external ontologies affect theme pages further downstream.

It is important to keep in mind that the intention behind the graph is not to make any assumptions on what should eventually be displayed on theme pages and how. Instead, it should be seen as an enabling tool which can be adapted and extended based on design decisions centred around user research. Some work has already been done on this, and we are excited for the possibilities our new graph will bring.

Early modern recycling: waste material in book bindings

Alexandra Hill — Tue, 25 Mar 2025 14:16:14 GMT

This post first appeared on The Book & Paper Gathering

One of the most rewarding things to find in a rare book is waste material. This term refers to recycled paper or parchment, often with printed or manuscript text, that was used in the construction of historical book bindings. It can be visible as part of the outer binding, used as flyleaves or hidden beneath the boards. As part of the ongoing inventory of the pre-1851 printed rare materials at Wellcome Collection, we are recording evidence of waste material as we go along. This will improve the catalogue and create opportunities for further research. Here is a sneak peek of some of our findings.

Researchers are increasingly interested in what waste material can reveal about the history of rare books and of the printing and bookbinding trade. The materials in these bindings are sometimes unique, and they can offer insight into texts which, ironically, survived by being thrown away. Often it is only when the binding is damaged or is being repaired that this evidence is revealed.

Examples of waste material in Basilica chymica (Frankfort: J., F. Weiss for G. Tampach, [1623]) [EPB/B/1677/1] and A map of the microcosme (London: T. Harper for J. Williams, 1642) [EPB/A/15685]. The first example is a leaf from a mediaeval Latin manuscript, possibly thirteenth-century, being used as a cover. The second is a page of early-seventeenth-century printed musical notation for a hymn based on the Book of Romans. There are multiple pages of this musical notation being used as flyleaves at the front and back of the text.

Waste material can reveal how material was reused and how books were constructed. In Wellcome’s copy of Lexicon graecolatinum (Basle: H. Curo for H. Petrus, 1548), you can see the parchment waste material used for support between the sewing cords. You can also see the stain from where it was pasted down onto the board.

Lexicon graecolatinum [EPB/D/3768] with fragments of multiple parchment manuscripts used in the binding. The larger piece is used as a front flyleaf, with the stain suggesting that it was originally pasted onto the board. The smaller, lighter-coloured pieces are from a different manuscript and are acting as extensions of the spine linings.

The parchment is covered in Latin text which, fortunately, has a clear running title at the top of the page: ‘De [co]gnitione anime sep[ar]ate’. This is likely a manuscript commentary on Aristotle’s De Anima [On the Soul] by Thomas Aquinas, possibly from the thirteenth or fourteenth century. There is also parchment from another manuscript in a different, likely earlier hand, with corrections and marginalia, acting as supports. Wellcome’s copy of Lexicon graecolatinum was bought from Stevens’ auction house in 1923, and, unsurprisingly, there is little recorded information about the waste material except that it is from an ‘early’ manuscript.

Two different styles of the letter a in the manuscript waste of Lexicon graecolatinum. The styles are from different time periods. The left a is from the earlier manuscript used for the spine linings. The right a is from the later De Anima text used as a flyleaf. Graphics after photography by Alexandra Hill.

Waste material can show us what kind of material was lying around in early modern bookbinding workshops, and it was more than just mediaeval manuscripts. Wellcome’s copy of Trattato di peste (Asti: V. Giangrandi, 1598) is intriguing because the waste material in it is an uncut page of printed playing cards. Playing cards are highly ephemeral items that rarely survive, so it is exciting to find an example in such an unusual place. On this sheet, different numbered cards have been printed side by side. The book was printed in Asti, in northern Italy, and the design of the card suit suggests that it was also likely bound in the same area. That part of Italy, Piedmont, used the French-style clubs suit, similar to what we recognise as clubs today, whereas further south, the equivalent card suit would have had a baton or cudgel design.

A sheet of playing cards in Trattato di peste [EPB/B/390]. Book bought from G. T. Vicenzi in Modena in 1933.

Some examples even show a mix of materials from multiple sources in the same book. One of Wellcome’s copies of De secretis mulierum libellus ([Lyons?]: [publisher not identified], [1566]) has both printed and manuscript waste material hidden within the binding. Because the fragments are small and lack a caption title, it is difficult to identify the source text. One fragment of printed Latin text appears to be from a religious commentary. The fragment has two sections of text, with the biblical text in larger gothic type and the commentary in smaller type to the side. Part of the biblical text refers to the son of Moses, while the commentary refers to conversion and persecution. The other fragment of printed text is from an ephemeris, a book which predicted the movement of the planets over the year. Ephemerides are another highly ephemeral item, as they covered a single calendar year and soon went out of date. The manuscript is more difficult to identify than the printed texts, as it is only visible in the gutter and most of it is covered with the printed pastedown.

Waste material in De secretis mulierum libellus [EPB/A/139/2]. Bought at Stevens Auction with two other items on 12/9/22. Lot 401.

These examples show how waste material can preserve items that generally have low survival rates. Strangely, being used as waste material is sometimes the main reason a text survives. One such example has been identified and catalogued at Wellcome. Fragments of A true and most dreadfull discourse of a woman [Margaret Cooper] possessed with deuill; who in the likenesse of a headlesse beare fetched her out of her bedd … on the fower and twentie of May last. 1584, at Dichet in Sommersetshire, a pamphlet published in 1584, were discovered as waste material in an edition of The treasury of Health from 1585 and catalogued as a separate item bound in the same volume. This shows how transient such a pamphlet was: within a year of publication, it was already being used to bind new books in a different printer’s workshop. This was not unusual. In my analysis of lost books (ones that cannot be traced to an existing copy) printed in early modern England, I found that newsprint on events in Britain had a lower survival rate than news on events on the continent and further afield. According to the English Short Title Catalogue, an international online union catalogue of books printed in British languages, there is only one other recorded copy of the pamphlet in the world, stored at the British Library [C.27.a.6.].

The pamphlet was printed for bookseller Thomas Nelson, while the book it was used to bind was printed by Thomas East. Even though there was no direct link between the bookseller and printer, they both had workshops near the London bookbinders’ district, within walking distance of each other. They were also both members of the Stationers’ Company, which held a monopoly over the printing industry in London during this period. Even though it is not clear who bound the book, and whether it was done in house or by an independent bookbinder, it is not difficult to imagine how the pamphlet and book might have come together in such a close-knit industry.

The pamphlet A true and most dreadfull discourse of a woman [Margaret Cooper] possessed with deuill (EPB/A/4957.2) used as waste material in another book. The clear caption title at the top, ‘Strange news out of Somersetshire’, helped identify the pamphlet, as did the fragment of the dedication to the reader. The book was bought with seven other works from Maggs Bros in August 1906.

Waste material is a fascinating topic that brings together aspects of history, cataloguing and conservation. It is important to record and preserve this unique material as we find it. Books should be photographed while they are being repaired, and any material removed should be stored alongside the book. This way, the historical connection is preserved. Wellcome has already recorded 3072 catalogued items with waste material, and those are just the books where the material is visible. Who knows how many more treasures remain hidden under the boards.

The Social Justice Curriculum — what have we learned?

Selene Burn — Mon, 19 Feb 2024 11:55:42 GMT

“I now have a wealth of resources to read and watch, and a wider understanding of how systemic inequalities operate. I am already seeing a shift in my practice, or at least more consciousness when it comes to making choices.” Participant

The Social Justice Curriculum

Over 150 staff have participated in Wellcome Collection’s unique, bespoke Social Justice Curriculum (SJC). What have we learned about the impact of this significant personal, professional and collective learning journey led by facilitators with lived and learned experience? This article will respond to this question through participant quotes and reflections from Facilitators Amy Kavanagh and Natasha Trotman.

“Understanding the origins of racism and ableism and how they are perpetuated feels foundational to so many aspects of my work.” Participant

Ableism and racism are ingrained in our history, our society and our institutions. Wellcome Collection has an uncomfortable history, with a collection rooted in injustice. With this context in mind, we co-designed a programme to support staff to:

learn about historic ableism and racism and its relationship to society and our collection
understand core concepts and theories relating to social justice.

You can find out more in a previous article about the SJC.

“(The Social Justice Curriculum) really gave me a deeper understanding of racism and ableism as structural problems in society.” Participant

In order to make meaningful change, we also needed to find a way to support colleagues to recognise their own relationship with injustice, including how they might benefit from structural inequity.

“It has been a fantastic learning experience. It has made me sharply aware of the biases I hold, and more critical of the world around me.” Participant

Facilitators with lived and learned experience

A crucial element of this learning programme has been the expert facilitation by a small team of freelance professionals. Over the time this programme has been running, Michèle Taylor, Natasha Trotman, Alex Wanjiku Kelbert and Amy Kavanagh have provided facilitation and so much more.

“The facilitators were simply amazing, incredibly knowledgeable and dedicated.” Participant

They impart knowledge, share experience and create a held space for groups of Wellcome Collection staff members to explore historical, societal and institutional ableism and racism. The original Facilitators Natasha and Michèle were also Co-Designers of the participants’ learning experience and all facilitators have shaped and evolved the SJC over time. I wanted to hear about what they’d learned while delivering this programme and invited Amy and Natasha to share their reflections.

From your perspective, what is the purpose of the Social Justice Curriculum?

AK: The SJC is a refreshing approach to tackling the systemic oppressions and barriers experienced by disabled and racially minoritised people in museum and heritage spaces. It seeks to address ableism and racism as experienced in the museum, library, gallery, the archive and in curatorial practices. Through inviting introspection and reflexive learning, participants are supported to understand their own role in dismantling ableism and racism.

“I feel I now have a better foundation to recognise racism and ableism whether in the workplace today or through our historic collections. I think this learning is going to stay with me, I really feel like it helped me think about things in a new way.” Participant

NT: In a time where the relevancy and need for museums are being explored from multiple lenses and the rise in contested objects, for example — the SJC is a timely learning intervention. From my standpoint, I see it as a vehicle to aid the carving out and holding of space for change. The SJC is a pioneering, refreshingly new approach with transformative methods for equitable practice; offering pathways for participants to consider enduring and systemic barriers experienced by disabled and racialised individuals in museums, heritage spaces and beyond.

Participants are provided with tools and designed engagements to interrogate systemic barriers experienced by disabled and racialised individuals, then introduced to authentic, alternative and accessible methods for overcoming those barriers.

What are your thoughts about the approach?

AK: It’s great to facilitate a course that gives the time for participants to really explore a complex issue like ableism. On paper it looks complicated, but the length of the sessions really enables us to establish safe and reflective spaces to explore the key issues.

“It exceeded my hopes and expectations by opening my eyes to the insidious nature of institutional racism and ableism and how deeply it is still ingrained in our society” Participant

NT: As one of the Co-Designers, I’ve also had the opportunity to witness the inception, creation, delivery and iterative development of the SJC’s framework, methods and approaches. Having co-designed the learning journey and ‘experience interventions’, as well as delivering the SJC to multiple cohorts, this has afforded me various angles and perspectives of the Social Justice Curriculum.

Taking the entire SJC (concept to delivery) into consideration, I can say part of the beauty of it and what makes it resonate so powerfully is the delicate balance of the established SJC framework and the (carved-out and held) emergent element. This gives rise to a unique blend of lived experiences and adaptable methodologies; fostering and engaging a held learning environment that nurtures growth and deep reflections in-step with the cohort — making it the first step of a life-long learning journey and continual professional development. It reminds me of a quote from Maya Angelou: “We delight in the beauty of the butterfly but rarely admit the changes it has gone through to achieve that beauty”.

“I did feel uncomfortable at times, but this felt necessary.” Participant

What are your reflections on the SJC from your experience as a facilitator?

AK: Even though I’ve now facilitated several cohorts, each group offers different and new perspectives, which is really refreshing. I enjoy learning about all the different projects and workstreams already engaging marginalised groups at Wellcome. It gives me confidence as a disabled person that there are people who are already doing the work and want to understand the barriers and exclusions I face. It’s always helpful to feel like you’re pushing on an open door as an accessibility and inclusion professional.

NT: I’ve delivered the SJC to several cohorts; each group provides new, unique, and often fascinating responses to the learning and experience interventions. It’s great to witness the individual and collective learning journeys. I’ve enjoyed facilitating sessions and holding space for cohorts to unpack and reframe the official and unofficial systems we traverse; this highlights some of the beauty of emergent learning. The SJC’s strong emphasis on sharing knowledge, experiences and building values across departments and boundaries underscores the importance of transdisciplinary work that occurs during the process. With each cohort completing the SJC, I leave forever changed. Contributing to various cohorts’ learning journeys leaves me with a deep appreciation for the transformative power of co-creation and growth on this professional, personal, and collective learning journey. It also highlights that my practice and what I impart each time as a facilitator is a verb, a constant action, rather than a static noun; I am learning from the cohort as they engage with the module/s; we are learning together.

What impacts have you observed on participants of the SJC?

AK: One of the recent changes to the SJC has been adding a practice development session. These discussions are an opportunity to present ideas about how to make change in the spirit of the values and ideas of the SJC. It is in these discussions that I have truly felt the impact on the participants. There is usually honesty about recognising maybe where barriers have been in place, but also real tangible enthusiasm in the solutions proposed.

“The practice development session was so useful for thinking about how we can practically use our learning from SJC and how we can keep this learning at the forefront of our minds” Participant

Truthfully the impacts are sometimes also personal and this is why we try to create a supportive environment. Learning about ableism and disability justice can provoke reflection and self discovery about individual experiences. However, I always hope this process will ultimately be an empowering experience and equip participants with the tools to embrace their own identities.

NT: The impact(s) I have observed concerning the participants is illustrated by the contemplative ‘Rose, Bud, Thorn’ exercise that participants partake in during the final phase of the learning journey. This exercise not only underscores the positive aspects, but also highlights areas of intrigue or curiosity and identifies areas of challenge which may need further exploration. In my view, it symbolises the entire learning journey — fertile ground where aspirations and hopes are sown and then grow, flourish and encounter some challenging terrain along the way. Participants are receptive to learning, setting the Rose life cycle and learning journey into action.

“I’m learning so much about how to create a space in which people can feel safe to say things that they might not feel they have the ‘right’ words for.” Participant

What are the most important take-aways, particularly for organisations that aren’t able to run a programme of this scale?

AK: The resounding message we always get shared through feedback is the value of lived experience. As Facilitators we are willing and professionally prepared to share stories of discrimination and structural inequality. Our lived reality resonates stronger than statistics or abstract identities. If an organisation does not have the capacity for a long in-depth course like the SJC, even having a ‘lunch and learn’ facilitated by someone with lived experience will have significant impact.

NT: The most significant takeaways are that approaches are scalable and created for our future selves and tomorrow’s practitioners rather than just where we are now as a sector and professionals. The perspectives and insights shared during this process bring the value of the SJC’s dynamic principles into focus; this includes adaptability, diversity, access, inclusion, innovation, communication and resourcefulness; these principles can be applied to any organisation, regardless of size. For example, a Lunch and Learn series can provide rich insights using a bite-sized approach.

A facilitated learning space with ring-fenced time for engagement can provide room for sharing lived experiences, open new communication pathways, and foster collaboration and new opportunities to create meaningful change. It’s not necessarily the programme’s scale that determines its success but the principles and shared values.

Another important takeaway is the value of reflective practice, transdisciplinary methods and approaches. We are in an era where resources are rapidly changing, and tomorrow’s challenges mean we may find ourselves in uncharted terrain (locally and globally). The ability to pivot, collaborate and work across disciplines, foster shared values and responsive methods and approaches is the gold.

Thanks to Amy and Natasha for taking the time to share their insights and learning. Get in touch (socialjustice@wellcome.org) if you have questions or reflections or if you would like to access the Social Justice Curriculum resources. We would love to hear from you.

When is a duplicate not a duplicate? Multiple copies and discoveries in the Early Printed Books

Alexandra Hill — Thu, 24 Aug 2023 09:27:23 GMT

In the depths of the Wellcome Collection lies a selection of early printed books (books printed pre-1851) referred to simply as ‘duplicates’. But what is a duplicate when it comes to rare books? These are items with hundreds of years of unique history under their covers. Why have these books been kept separate from the main collections? And what treasures can be revealed by the inventory team taking a closer look?

What is a duplicate?

Individual libraries and museums have different definitions of a duplicate, usually influenced by the types of items in their collections. For books, a basic definition, as described by the London Library, would be ‘identical copies of the same edition’. An edition of a book is the form in which it is published, while a copy is a single specimen of that book. So, if a library had three identical copies of the same edition, two of those copies could be described as duplicates.

For libraries with a focus on modern texts, unless the same book is required by multiple students or readers at a time, having too many identical copies can be a poor use of limited space and resources.

The British Museum use a slightly broader definition, describing a duplicate as ‘an object that is identical in every significant respect to one or more other objects in the Collection, not merely of the same or a similar kind’. This is closer to the attitude we would take with early printed books at Wellcome. As this definition suggests, some things can be identical and therefore a duplicate, whilst others can be ‘of the same kind’ without being a duplicate.

Below are three frontispieces (the illustration opposite a book’s title page) from three copies of the same edition. All have been hand-coloured in multiple ways. All were used by different owners. And all were acquired by Wellcome Collection from different sources. Can any of these copies be described as identical or duplicates?

Three frontispieces from three copies of Le texte d’alchymie et le songe-verd (Paris: Laurent d’Houry, 1695). [EPB/A/50960/1; EPB/A/50960/3; EPB/A/50960/4]

A quote from a previous rare book curator at Wellcome, John Symons, speaking at an event in the 1990s may reveal why any rare books at the collection were ever described as duplicates: ‘the Wellcome Library’s principle has always been that the books are for reading, not museum exhibits’.

This suggests the focus for Wellcome in the past was on text and content, fairly standard in the world of book history at the time. If you had a perfect copy of a medical text, even one from the 17th century, why would you keep another one? While the word duplicate may have made sense when the items were first brought together, with the way we look at rare books nowadays, this term is detrimental to our understanding of the collection and the books within it.

What is the ‘duplicates’ collection at Wellcome?

The Early Printed Book Duplicates collection at Wellcome is not an actual collection but sections of the store designated for what were originally seen as unwanted books, separated by format and size. So far, I have found only limited information about these books, why they were kept separate and how attitudes towards them have changed over the years:

· When were they first separated?

Sometime in the 1950s, these books were separated from others. This coincides with the library first opening to readers, but the number of separated books continued to grow as new large collections were acquired, such as the books of the British Medical Association.

· Why were they separated?

They were separated out for the same reasons that countless other collections separate items: to clear space, make money, or to remove items that no longer fit within current collecting themes.

· How were duplicates decided?

Items were usually deemed duplicates if there was another copy catalogued, or if the book was deemed imperfect.

· What might we find in the collection?

The collection contains books with interesting provenance, annotations, and bindings. Collection documents suggest these ‘duplicates’ were worth investigating rather than selling.

Part of the uncatalogued 8vo/B Duplicates collection

Apart from a small handful of items, all the books are uncatalogued and I have been unable to find a list of what books are there. I estimate there are around 2500 items, though, dating from the 16th to the mid-19th century.

In March 2023 the inventory team started to record these books. While the focus of inventory is on Museum Accreditation, the project provides an amazing opportunity to open up the stories hidden within the covers. This is even more important for these thousands of unrecorded multiple copies, labelled as duplicates and put at risk of sale or disposal.

What have we discovered during inventory?

The most important part of the work is attention to detail. For one book we came across in the ‘duplicates’, not only was there no other copy held at Wellcome, but it appears to be only the second recorded surviving copy in the world. The other copy is held at Corpus Christi College in Oxford. While the book had the same title, place and date of publication as another edition held at Wellcome, the copy we discovered had a different bookseller in the imprint, making it extremely rare.

On the left, the more common imprint ‘Printed for J.M. & J.A. & T.D.’ [EPB/A/29472] and on the right, the extremely rare ‘duplicate’ with the imprint ‘Printed for Anthony VVilliamson’. [EPB/A/66552 — currently uncatalogued]

But it is not just the rarity of an edition which makes a copy unique. A collection may have multiple copies of an edition but each individual copy will have hundreds of years of unique history under the cover or even within the binding.

On opening one book printed in Latin in 1712 on the treatment of diseases, I came across a text covered in additions and corrections as well as 13 pages of handwritten recipes in English. The previous owner clearly understood Latin, interacting with the text, but also saw the book as a repository for their own knowledge of health and medicine.

The text is written beautifully with explanations of how and when to use the recipe written in black ink, and the ingredients written in red ink. The recipes include cures for the King’s evil, ‘cystick tumours’ and a list of ‘emeticks’.

Handwritten recipes in the book, Processus integri in morbis ferè omnibus curandis … quibus accessit graphica symptomatum delineatio, unà cum quamplurimis observatu dignis, necnon de phthisi tractatulo (London: J.Knapton and G. Innys, 1712) [EPB/A/66531 — currently uncatalogued]

The book is small, making it easy to carry around, and was clearly important to at least one former owner. Unfortunately, there is no inscription to suggest who the former owner was, and Wellcome records simply show that it was bought from an auction house in 1906. However, it is interesting to note how this person’s story, and this object that the owner put time and effort into adapting, was deemed surplus to requirements back in the 1950s and is only now being recorded.

History is not just found inside a book but also in the materiality of the book itself. One of my favourite things to come across in a rare book is waste material. Until the nineteenth century, books did not automatically come with a cover — covers were added later according to the desires of the customer. Often, to cut costs, a binder would re-use material to create a binding.

On one ‘duplicate’ of a book on plague and diseases hitting the besieged town of Breda and printed in 1627, I found a beautiful manuscript covering the boards.

Images of the twelfth-century binding covering De morbis et symptomatibus popularibus Bredanis tempore obsidionis, et eorum immutationibus pro anni victusque diversitate, deque medicamentis in summa rerum inopia adhibitis, tractatus duo (Antwerp: Ex officina Plantiniana Balthasaris Moreti, 1627) [EPB/B/66855 — currently uncatalogued]

Like the contents of the book, the manuscript is written in Latin. From a bit of transcription and translation, and input from Professor Susan Boynton, we identified the binding as a notated missal (a liturgical text used by a priest when celebrating mass). The text refers to the celebration of Saint Blaise, a former physician and patron saint of wool combers, and ear, nose and throat illnesses, as well as to Saint Agatha, patron saint of breast cancer patients, martyrs, wet nurses, bell-founders and bakers. Both have feast days in February. From the handwriting, particularly the straight S’s, this missal is highly likely to have been written in the twelfth century. Interestingly, this appears to be a popular time for the worship of Saint Blaise.

Not only does the document look beautiful with the different forms of handwriting and colourful capitals but it opens up questions on why a liturgical text on Saint Blaise ends up wrapped around a text on illnesses during a siege centuries later. Why was this text just lying around? Where did it come from? How did the bookbinder get hold of it? As a patron saint of ear, nose and throat illnesses, is it by design that a religious work for Saint Blaise was used to cover a book on plague?

Once again there is limited information as to its former owners: we know only that it was bought for the Wellcome collection from an antiquarian bookseller in Amsterdam in 1927.

Conclusion

The term duplicate can be extremely useful for institutions with identical items that are struggling for space and resources. However, for rare books, when each copy can be filled with so much history and when not even the covers are the same, the word duplicate is harmful.

At Wellcome, labelling items as duplicates has led to some truly amazing items being left unrecorded and vulnerable to sale and disposal.

We now use the phrase “multiple copy” to describe these items, but even this can mask the uniqueness of each book which should be treated on its own merit. Our aim, once all the items have been recorded, is to move them into the main sequences and add them to the cataloguing process.

Finally, these objects will be given the attention they deserve and the opportunity to be shared with new researchers and audiences.

When is a duplicate not a duplicate? Multiple copies and discoveries in the Early Printed Books was originally published in Stacks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Holding Space: The Indigenous Knowledges Project at the Halfway Point

sjfrench — Thu, 09 Mar 2023 08:48:06 GMT

A blog post by Rhiannon Sorrell, visiting research fellow for the AHRC-NEH Indigenous Knowledges grant, to share her perspectives on the project.

View near the Canyon del Muerto, Navajo Nation.

As a new semester approaches us, and an entire new year, for that matter, begins, reflection on the previous twelve months is inevitable… as is this long overdue blogpost on the Indigenous Knowledges Project, my trip to London, Sarah’s visit to the Navajo Nation, and our joint trip to the International Conference of Indigenous Archives, Libraries, and Museums (ATALM). It has also admittedly taken this long to process all the meetings (planned or by chance), all the conversations (formal and informal), and all the spaces and places these encounters occurred. Contrary to western academic and grant deadlines, projects involving Indigenous nations and materials are slow, contemplative work, and this has been a common theme when talking about the progress and direction of this pilot.

Leading up to the submission of this project’s proposal to the National Endowment for the Humanities (on the U.S. side) and the Arts and Humanities Research Council (on the U.K. side), I had been (and continue to be) involved in the Tribesourcing Southwest Film Project, as the coordinator of the Diné region of films. My work on this project allowed me to see the opportunity in digital humanities projects to give new life to early 20th century A/V materials featuring Indigenous peoples by embedding them back in their respective communities, on their terms. The Indigenous Knowledges project’s primary aim (a model of collaborative practice for future digital exchanges over reciprocal and collaborative digital curation between U.K. GLAM and North American tribal college LAM professionals) spoke to the optimism I have around the role of digital technologies in giving agency over materials and collections back to tribal communities. Marisa Elena Duarte stated in her book Network Sovereignty, “Indigenous peoples resist and subvert colonizing systems, rules and practices, and state power through social and political engagement and mobilization. Indigenous peoples conscript digital devices and systems to do this work, and it is not necessarily in conceptual contradiction to Indigenous philosophies, spiritualities, and everyday practices.” In working with my own community I’ve seen how digital technologies could be viewed with caution and skepticism while also being powerful tools for mobilization on the most pressing issues around language revitalization, cultural education, political organization, and media representation.

Knowing that relationship building is at the heart of any Indigenous project, however, it was written into the grant to have a reciprocal residency exchange followed by a year-long virtual one. This meant that as the research fellow representing the Kinyaa’áanii Library, a tribal college library, I would visit the UK and Sarah French, the research fellow out of the University of Kent, would visit Diné College and the Navajo Nation. Prior to this trip, I’d never been to Europe and the United Kingdom seemed a far stretch for a potential visit of this caliber, given that the Navajos’ primary encounter with Europe was with Spanish colonists, then the Mexican and United States military. What insight could I possibly have that other Indigenous Nations haven’t already communicated to European heritage institutions? What place do tribal college libraries have in these transatlantic and international exchanges, debates, decisions? What form will this project take at the end of these trips? At the end of the project timeline?

Whatever anxieties I had upon arrival to London and to Wellcome were quickly assuaged on the first day, which coincided with a day-long meeting with the Murrup Barak group from the University of Melbourne. To meet a wonderful group of Indigenous scholars, storytellers, and artists showcasing their work far away from home (a documentary film called Warriors) on the first day was a source of inspiration and set the tone for the rest of the trip and for the project: relationship building and creating space for Indigenous thought and presence. Through the rest of my visit, which included meetings with staff and scholars from the Wellcome Collection, the British Library, and the British Museum (all controversial and mammoth institutions compared to the tiny tribal college library), I kept those thoughts at the forefront, holding space and reiterating that more Indigenous voices were crucial in order to meaningfully move forward toward many of the well-intentioned but lofty goals of the often misunderstood concept of “decolonization.” Among some of the most memorable moments of the trip to London were the conversations and debriefings that took place over coffee/tea breaks, over meals, and between meetings. One conversation in particular stands out: after meeting with staff at the British Library, Kent research fellow Sarah, logistics lead Sophie, and I continued the conversation on Indigenous Knowledges, challenges, and considerations in the Treasures of the British Library’s exhibit space, amongst works and artefacts from the Western canon.

Research fellow, Rhiannon Sorrell, with members of Murrup Barak in the Wellcome Collection Viewing Room, July 2022

After my return home (and amidst the start of a new semester) I knew I had my work cut out for me to put together appointments and meetings for Sarah’s visit. Also of big concern was the vast geographical space that we were going to cover as part of our visits and the logistics around transportation. Public transportation infrastructure is already not something the U.S. is known for and even less so on the Navajo Nation. Thankfully, with the help of friends and community, we were able to make the necessary transportation arrangements and appointments. Starting with meetings here at Tsaile, including staff from the Diné Policy Institute, the Navajo Cultural Arts Program, the Provost and Vice Provost for Research at Diné College, and friends/colleagues at Diné College Branch Libraries, we were able to explore the role, challenges, and opportunities that tribal colleges see themselves playing in a transatlantic Indigenous knowledges project such as this. Historically, tribal colleges and their libraries played a multifaceted role in their communities: education at the forefront and a means of preserving and perpetuating tribal languages and culture. To accommodate this latter goal, the tribal college library often served as a de facto archive and cultural heritage institution for the tribe. Since Diné College (formerly Navajo Community College) was first chartered in 1968, it and several other tribal colleges have made strides toward university status, including the development of graduate programs and increasing their profile of stand-alone externally funded projects and programs.

With issues surrounding open access, digital repositories, and data sovereignty becoming increasingly pertinent to tribal colleges, leaders at these institutions are becoming aware that standing on the sidelines of these debates is not an option if they want to keep to the missions of the institution and uphold their responsibility to their communities with regard to cultural/traditional knowledge.

Lunch in Farmington, New Mexico. L-R: Sarah French (IK research assistant), Rhiannon Sorrell (IK research fellow), Samanthi Hewakapuge (director of library services, San Juan College), Clyde Henderson (librarian, Diné College, Shiprock).

I think Sarah would agree with me that one of the breakthrough moments of her visit was the trip to Flagstaff, where we met with staff from NAU’s Cline Library and the Museum of Northern Arizona. Along with an overview of their predecessors’ development of the Protocols for Native American Archival Materials, archivists Peter Runge and Sam Meier discussed experiences and lessons learned in working with tribal communities and fostering respectful relationships with individuals and tribal institutions, as well as with Indigenous students working there at the Cline Library. In the afternoon, we visited the Museum of Northern Arizona’s Easton Collection Center, which was built specifically to house the museum’s Indigenous collections in a manner that was welcoming of tribal communities and respectful of the items, which are considered to be living entities and not merely inanimate objects. In this building, we were told by Director of Research and Collections Anthony Thibodeau, there are no human remains housed, and there is a UV-filtered skylight in the main collection holding area, incorporated with the idea that since the objects are living entities, they should be able to “see” the light of day and the changing of the seasons, as opposed to being locked away in the dark and forgotten. Interaction is highly encouraged and instead of keeping the object safe from people, people are kept safe from the objects (some textiles having undergone preservation treatment in the past with now-banned toxic chemicals). Wrapping up that day was a visit to the museum’s Native Peoples of the Colorado Plateau exhibit space. There, we met Samantha Honanie, visitor experience manager, and discussed our hopes for more Indigenous people to join us in this field and in this work.

The final week of Sarah’s visit was spent in Temecula, California at the International Conference of Indigenous Archives, Libraries, and Museums, hosted by the Pechanga Band of Luiseño Indians. Here, we were reunited with friends and colleagues we had visited in the previous few weeks and also met some new faces who had much to share about their work relating to Indigenous archives in a digital setting. Sarah met with Melissa Dollman, the digital projects manager for the Tribesourcing Southwest Film Project, who shared an overview of her work on the project, including incorporating TK labels into the site and the creation of an Indigenous language dictionary to describe key terms in the site’s metadata description. Throughout the conference, Sarah and I did our best to cover a wide range of presentations between the two of us, but with so many concurrent sessions, it was difficult to catch everything we wanted to. While there were many sessions on digital initiatives and one on international repatriation, the overall message of these practices came down to ethical, collaborative, and inclusive partnerships with Indigenous communities.

Although October marks the new year for the Diné people, this new Gregorian year brings us closer to our final grant deadlines. The team will bring all these overarching themes to the table and begin our planning for the project’s symposium, to occur in Spring 2023. In the meantime, we will continue our reading/discussion group (the first of the year), focusing on the Protocols for Native American Archival Materials. In closing, I think of how none of the Protocols can be acted upon and truly followed unless tribes and Indigenous peoples are actively and respectfully part of the process. Any project involving Indigenous materials, even digital ones, must hold space for Indigenous peoples and thought.

Entrance to ATALM, the conference of the Association of Tribal Archives, Libraries and Museums, Pechanga Resort, Temecula, CA.

Text by Rhiannon Sorrell.

Holding Space: The Indigenous Knowledges Project at the Halfway Point was originally published in Stacks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Memoirs of an Arabic Manuscript Cataloguer

Rosie Maxton — Tue, 07 Mar 2023 08:50:57 GMT

Decorative miniature from MS Arabic 748, ff. 118v-119r

Read this article in Arabic.

My first encounter with the Arabic manuscripts of Wellcome Collection was in August 2018. I was hired on a short-term basis to review some descriptions for a portion of the collection, which comprises in its entirety over 1000 manuscripts. Now, almost exactly four years and several hundred manuscripts later, I feel immensely fortunate to have had the opportunity to catalogue such a rich and endlessly surprising array of materials. Below I share some reflections on my work with the Wellcome Arabic collection over the past few years.

Medicine through the ages

My first ever assignment was, in more precise terms, to update the digital descriptions for the Wellcome Arabic manuscripts being hosted by the online manuscript catalogue Fihrist. These descriptions, numbering 79 at the time, were created using the Text Encoding Initiative (TEI), a standard for developing metadata for textual materials using the mark-up language XML. The descriptions were originally a collaboration between Bibliotheca Alexandrina and King’s College London, and required updating to align with the new guidelines issued by Fihrist. Being an Arabist by training and having already done some cataloguing for Fihrist, this work with the Wellcome Arabic collection was an ideal opportunity to build on my experience in the digital humanities field.

It was not just my technical skills which would benefit from this work. Having previously only dealt with literary and historical texts in Arabic, the Wellcome manuscripts introduced me to a very different sphere of knowledge: that of medicine and its many branches, particularly pharmacology. Within this genre, the 79 manuscripts formed a rich fabric of historical periods and intellectual traditions, from Arabic translations of classical Greek scholars such as Galen (d. 216) and Hippocrates (d. c. 370 BCE), to the seminal works of the Golden Era, such as those of Ibn Sīnā (d. 1037) and Ibn al-Nafīs (d. 1288), to the treatises of pioneering Ottoman physicians, such as Haci Pasha (d. 1417) and Ibn Sallūm (d. 1670).

While I had access to digitised images of the manuscripts, I was grateful that identification of most of the authors and texts had already been done in the catalogue of Albert Zaki Iskandar, A Catalogue of Arabic Manuscripts on Medicine and Science in the Wellcome Historical Medical Library (1967). As such, much of my work at this stage involved using the catalogue to link these authors and works to digital authority files — a process which, as we confront the bewildering amount of data to be extracted from scribal materials, is becoming increasingly crucial to cataloguing projects.

Rethinking cataloguing practices

By this point, I had the impression of a collection of Arabic manuscripts meticulously constructed around a prescribed area of interest. I had no idea that the manuscripts included in the Iskandar catalogue — numbering 197 — were just one segment of the Arabic manuscripts within Wellcome Collection. This became apparent when I was offered the opportunity to continue my work with the Wellcome Arabic collection, this time on a set of unpublished TEI descriptions. These descriptions were informed by another, more recent printed catalogue: Nikolaj Serikoff’s Arabic Medical Manuscripts of the Wellcome Library: A Descriptive Catalogue of the Haddad Collection (WMS 401–487) (2005). The 87 manuscripts included in the catalogue form a subset within the Wellcome Arabic Collection acquired in 1985 and known as the ‘Haddad Collection’, having all been previously owned by Lebanese physician and academic Dr Sami Ibrahim Haddad (d. 1957).

As the title of the catalogue indicates, the majority of these manuscripts were, like those I had previously worked on, related to the study of medicine. But what struck me as I began to edit these TEI descriptions was the level of detail at which the catalogue engaged with the manuscripts — particularly when compared to the much more economical descriptions of Iskandar. The catalogue painstakingly dissected the content of each Haddad manuscript — parts, chapters, sections, sub-sections — in addition to providing information on physical features of the manuscript (binding, paper, handwriting, ‘codicological miscellanea’), and bibliographical references for the represented works. One description sometimes filled as many as ten pages. While undoubtedly an exceptional piece of scholarship, transferring this volume of data into the TEI descriptions seemed an overwhelming task, and frankly — given the open availability of digitised copies of these manuscripts — somewhat unnecessary.

Working on the Haddad collection was the moment at which I really began to consider my role as a manuscript cataloguer in a broader context, and how projects like TEI fit into the ever-shifting scales of cataloguing practices. How could legacy catalogues be used to maximise the current accessibility and discoverability of this material? And where should the boundaries of bibliographic data lie? I found sharing ideas and experiences on this topic with colleagues from both Wellcome Collection and Fihrist to be incredibly productive as I progressed with my cataloguing work. In fact, far from being exclusive to Arabic material, these conversations around legacy data are pertinent to the many different manuscript collections comprising Wellcome Collection.

Unravelling the collection

A major turning point in my cataloguing of the collection occurred around one year later. I had completed TEI descriptions for the manuscripts included in the printed catalogues, and was now given the opportunity to catalogue the remainder of the 430 digitised Wellcome Arabic manuscripts. While many of these manuscripts had pre-existing TEI descriptions, the data within them varied significantly, and at times required cataloguing from scratch. Though this felt like a sizeable step-up from my previous work, like any other manuscript enthusiast, I was above all excited for the task ahead. I felt the most compelling discovery during this phase of cataloguing was the vast array of genres which the collection unfolded. Having so far only come across medical works, I marvelled to find texts on astronomy, mathematics, history, poetry, grammar, Hadith and Islamic jurisprudence, as well as Qur’ans, Christian scriptures and talismans. The medical focus of the Wellcome Arabic collection dimmed in the face of this remarkable diversity.

During this period, my professional experience outside this role began to directly influence my cataloguing of the Wellcome Arabic collection. In September 2019 I joined the European Research Council project ‘Stories of Survival’ in the history department of Oxford University, which explored the mobility of Eastern Christians in the early modern Ottoman world through the lens of manuscript production and circulation. The project particularly emphasised the historical value of paratext — the extra notes and scribbles left in the margins of manuscripts over time by their various owners and readers. As a result, I became acutely aware of how paratextual notes in the Wellcome Arabic manuscripts could increase our understanding of the collection’s origins. In addition to indicators such as handwriting and binding, I found that scribal colophons and ownership notes traced the Wellcome Arabic manuscripts to multiple locations across North Africa, the Middle East and Central Asia, and multiple timeframes, spanning the 15th to the 20th century CE. The flexible structure of TEI allowed these notes to be easily categorised and recorded in original Arabic script and in English translation.
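To give a rough sense of what recording a paratextual note in both scripts can look like, the sketch below builds a small TEI-style fragment in Python. It is a minimal illustration under stated assumptions, not the encoding actually used for the Wellcome descriptions: the element names (provenance, foreign, note) come from the TEI vocabulary, while the helper ownership_note, the type value "translation" and the note text itself are hypothetical placeholders rather than a transcription from any manuscript.

```python
import xml.etree.ElementTree as ET

# Namespace URI behind the standard xml:lang attribute.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def ownership_note(original: str, translation: str, lang: str = "ar") -> ET.Element:
    """Build a <provenance> element holding a note in its original script
    next to an English translation. Hypothetical helper, for illustration only."""
    provenance = ET.Element("provenance")
    # Original wording, tagged with its language code.
    foreign = ET.SubElement(provenance, "foreign", {f"{{{XML_NS}}}lang": lang})
    foreign.text = original
    # English rendering kept alongside the original.
    note = ET.SubElement(provenance, "note", {"type": "translation"})
    note.text = translation
    return provenance


# Placeholder text, not taken from a real manuscript.
element = ownership_note("تملّكه فلان", "Owned by so-and-so")
print(ET.tostring(element, encoding="unicode"))
```

Keeping the original script and the translation as sibling elements means either can be searched or displayed on its own, which is one plausible reading of what "easily categorised and recorded" involves in practice.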

Ownership notes from title page of MS Arabic 836

In this cataloguing phase, some curious and wonderful snippets of information about the lives of the Wellcome Arabic manuscripts surfaced: the prophetic biography witnessed by multiple scholars of the al-Azhar Mosque in Cairo (MS Arabic 776); the Arabic grammars copied in a school in the Khanate of Crimea during the seventeenth century (MS Arabic 809); the tracts on Islamic inheritance emanating from a small village in the Persian Safavid Empire (MS Arabic 821). Taken together, these details form an elaborate patchwork of individual and communal stories across time and space.

Authorisation statements (ijazat) from MS Arabic 776, ff. 42v-43r

The Wellcome Arabic collection had further surprises to offer. As I continued cataloguing over the next two years, I encountered manuscripts both partially and fully written in languages other than Arabic, such as Persian and Ottoman Turkish. Though I was able to complete basic TEI descriptions for these manuscripts, they would certainly benefit from specialist interest.

Further insight into the linguistic and cultural breadth of the collection came with a recent discovery of several manuscripts in Karshuni (that is, Arabic language written in the Syriac script, used predominantly among Eastern Christians in the Ottoman era). Due to my research for both the ERC project and for the PhD which I began in October 2020, I had managed to develop proficiency in reading sources in Arabic Karshuni. As such, cataloguing the Wellcome Karshuni manuscripts was an exciting task — and a memorable moment to see my work as a cataloguer and my research interests so closely intertwined. Like a pocket-sized mirror on the entire Arabic collection, the Karshuni manuscripts reflected an assortment of genres: physiognomy, medicine, astrology and Christian theology.

A month before writing this, I completed the last of the TEI descriptions for the Wellcome Arabic manuscripts, ready to be showcased in the new Wellcome Collection online catalogue. Despite all the knowledge I gained during the process of cataloguing, I still struggle to define such a rich, multi-layered tangle of manuscripts. In the most general terms, I believe its existence is testament to the overwhelming cultural, intellectual, linguistic and religious diversity which characterised the early modern Arabic-speaking world — and, of course, beyond. There are, as always, more aspects to be explored, more histories to be unlocked. Like the lives of these manuscripts, the craft of the cataloguer is a never-ending journey.

Memoirs of an Arabic Manuscript Cataloguer was originally published in Stacks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Memoirs of an Arabic Manuscript Cataloguer

Rosie Maxton — Tue, 07 Mar 2023 08:43:39 GMT

Wellcome Library, Arabic manuscripts collection, manuscript no. 748, pages 118–119

Click here for the English text.

My first encounter with the Arabic manuscripts at the Wellcome Library, which holds more than a thousand Arabic manuscripts, was in August 2018, when my task for a short period was to review the existing catalogue records for some of the manuscripts in its collection. Today, four years and hundreds of manuscripts later, I feel proud to have contributed to cataloguing and classifying these precious and varied materials, which still astonish me. In what follows I share some thoughts and reflections on my cataloguing of the manuscript collection at the Wellcome Library over the past four years.

Medicine through the ages

My work for the Wellcome Library initially involved reviewing the catalogue records that had been published online in the electronic catalogue Fihrist. These manuscripts, which numbered 79 at the time, had been catalogued using the Text Encoding Initiative, a tool for describing textual data using the markup language XML.

The earlier cataloguing had been carried out in collaboration with Bibliotheca Alexandrina and King’s College London, and it needed updating in line with the new guidelines governing the Fihrist site. Given my specialism in Arabic and my previous experience of working with Fihrist, this was a good opportunity to develop my expertise in the digital humanities by working with the Wellcome Library.

Most of the Wellcome Library manuscripts I worked on during this period related to medicine and its many branches, such as pharmacology. The task therefore benefited me on a technical level, but also through texts I had not known before. These manuscripts represent many intellectual traditions across the ages: for example, translations of the works of Greek scholars such as Galen (d. 216) and Hippocrates (d. 370 BCE), the works of Ibn Sīnā (d. 1037) and Ibn al-Nafīs (d. 1288) from the Abbasid era, and the writings of pioneering physicians of the Ottoman period such as Haci Pasha (d. 1417) and Ibn Sallūm (d. 1670). I used the digital images of the manuscripts alongside the printed catalogue of Albert Zaki Iskandar, published in English in 1967 under the title A Catalogue of Arabic Manuscripts on Medicine and Science in the Wellcome Historical Medical Library. The main part of my task was to search for digital authority files for the names and works represented in the manuscripts. These files have recently become an important part of cataloguing projects, as a way of handling the growing quantity of information extracted from these manuscripts.

Rethinking the art of cataloguing

I had assumed that the Wellcome Library's Arabic manuscript collection centred specifically on medicine. I did not know then that the manuscripts presented in Iskandar's catalogue (197 manuscripts) were only a small part of the huge collection held at the Wellcome Library. But I soon grasped the sheer size of the collection when I was given the opportunity to continue cataloguing. This time my task was to review the digital cataloguing of a sub-collection of eighty-seven manuscripts before it was published online. This collection, acquired by the Wellcome Library in 1985, is known as the “Haddad Collection” after its former owner, the Lebanese physician and writer Dr Sami Ibrahim Haddad (d. 1957). The digital cataloguing had been transcribed from the printed catalogue compiled by Nikolaj Serikoff, entitled “Arabic Medical Manuscripts of the Wellcome Library: A Descriptive Catalogue of the Ḥaddād Collection (WMS Arabic 401–487)” (2005).

Most of the manuscripts in the Haddad Collection relate to medicine, as the title of Serikoff's catalogue makes clear. But as soon as I began cataloguing, I was struck by the level of detail in Serikoff's catalogue, especially in comparison with Iskandar's catalogue mentioned above. In his book, Serikoff describes the contents of each manuscript precisely (its sections, parts and chapters), gives extensive information about the physical characteristics of each manuscript (for example binding, paper, script and other aspects of codicology), and provides a bibliography to help researchers. The description of a single manuscript in Serikoff's catalogue typically runs to ten pages. Although this catalogue is an outstanding piece of scholarship, it was not useful to transfer all of this data into the digital catalogue, especially as digital images of the manuscripts were available online.

While working on the Haddad Collection, many questions occurred to me about cataloguing Arabic manuscripts. How can tools such as the Text Encoding Initiative be adapted to suit the constantly changing needs of this cataloguing? What is the best way to use old printed catalogues? And in how much detail should we describe each manuscript? I discussed and exchanged views and ideas on these topics with my colleagues at the Wellcome Library and at Fihrist. These discussions matter greatly not only in the context of Arabic manuscripts, but also for manuscripts in any other language.

Uncovering the collection

Completing the digital cataloguing of the manuscripts in the printed catalogues mentioned above was a turning point: I then began the digital cataloguing of the remaining Arabic manuscripts that had been photographed and uploaded online, which at that time numbered 430. Many of these manuscripts had been digitally catalogued before, but the quality of the available data varied from one manuscript to another, and sometimes I had to start work on them from scratch. This was a major step on from my earlier work, and above all, as someone who loves studying manuscripts, I was excited by the task ahead. What interested me most during this period was the variety of literary genres I saw in the manuscripts. While most of the manuscripts I had seen previously related to medicine, the new manuscripts included texts on astronomy, arithmetic, incantations, history, poetry, Arabic grammar, hadith and jurisprudence, as well as copies of the Qur'an and the Gospels. This great diversity in Wellcome's Arabic manuscript collection means that medicine is no longer its sole subject.

In September 2019 my professional experience grew when I began working in the history department at the University of Oxford on the research project “Stories of Survival”, funded by the European Research Council. The project addressed travel and migration among Eastern Christian communities in the Ottoman period by studying the production and circulation of manuscripts. It emphasised the historical importance of the additional texts written over the ages by owners and users of these manuscripts in the margins of their pages. In this way I came to see how studying the additional texts in Wellcome's Arabic manuscripts reveals information about the history of the collection. Alongside physical features of the manuscripts such as script and binding, I found that texts such as scribes' signatures and ownership marks helped to identify where manuscripts originated, whether in North Africa, the Middle East, Central Asia or elsewhere. They also helped in dating the manuscripts to different historical periods between the fifteenth and twentieth centuries. The Text Encoding Initiative made it easier to classify and record these additional texts clearly in the digital catalogue.

Wellcome Library, Arabic manuscript collection, manuscript no. 836, ownership notes, title page

As a result, some remarkable facts emerged about the manuscripts' past: for example, a manuscript of the Prophet's biography that was certified by numerous scholars at al-Azhar (Arabic manuscript no. 776), grammatical texts copied in a school in the Crimean Khanate during the seventeenth century (Arabic manuscript no. 809), and texts on inheritance law written in a small village within the Safavid state (Arabic manuscript no. 821). These details form a rich and complex mixture of stories of individuals and communities across time and place.

Wellcome Library, Arabic manuscript collection, manuscript no. 776, audition certificates (ijāzāt al-samāʿ), pages 42–43

The Wellcome Arabic collection has continued to surprise. Over the past two years I have found manuscripts written partly or wholly in languages other than Arabic, such as Persian and Ottoman Turkish. I catalogued these manuscripts digitally in a basic way so that they would come to the attention of specialists in these languages. Indeed, we came to appreciate the collection's linguistic and cultural diversity more deeply when we recently discovered several manuscripts copied in Garshuni script (that is, Arabic written in Syriac letters, used mostly by Christians in the Middle East during the Ottoman era). Since I had begun doctoral study at the University of Oxford, I needed to learn to read Garshuni, and so cataloguing the Garshuni manuscripts was a special opportunity to bring together my professional experience as a manuscript cataloguer and my research interests. The diversity of texts in the Wellcome Arabic collection was also evident in the Garshuni manuscripts, though on a narrower scale: they covered subjects such as physiognomy, medicine, astronomy and Christian theology.

A month before writing this article, I completed the remaining digital cataloguing of Wellcome's Arabic manuscripts, which will be displayed on the Wellcome Library's new website. I consider the existence of this rich and complex collection to be evidence of the cultural, intellectual, linguistic and religious diversity that characterised the Arab world during the early modern period. As always, there are many more aspects and historical stories about the lives of these manuscripts to explore in the future. And so we come to see that the craft of cataloguing is a journey without end.

Memoirs of an Arabic manuscript cataloguer (ذكريات مُفهرِسة المخطوطات العربية) was originally published in Stacks on Medium, where people are continuing the conversation by highlighting and responding to this story.

How moving to the cloud took our digital collections to new heights

Alex Chan — Thu, 09 Feb 2023 13:12:26 GMT

We want to provide free and unrestricted access to our collections, and we digitise our collections so that we can make them available online. We’ve already digitised hundreds of thousands of items, and put millions of images online — but there’s plenty more to do!

Over the last few years, we’ve moved a lot of our back-end systems into the cloud. This has opened new possibilities for how we manage our digital collections — it’s not just a drop-in replacement, it’s a step change in what we can do.

Digitisation has come a long way since the days of video discs and CRT monitors. Photo: Wellcome Collection.

How we manage our digital collections

We receive files from three sources: our in-house digitisation team, our external digitisation vendors, and donors who give us born-digital files.

These go through workflow tools which do certain processing steps, like file format identification and adding fixity checksums. For digitised material we use Goobi; for born-digital material we use Archivematica. This creates additional metadata which is stored with the files.
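To make the idea of a fixity checksum concrete, here is a minimal sketch of computing and later re-verifying one. It is an illustration only, not how Goobi or Archivematica implement it: the function names are hypothetical, and SHA-256 is an assumption, since the post doesn't say which algorithm is used.

```python
import hashlib
from pathlib import Path


def fixity_checksum(path: Path, algorithm: str = "sha256",
                    chunk_size: int = 1024 * 1024) -> str:
    """Return the hex digest of a file, reading it in chunks so that
    very large files never have to fit in memory at once."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, expected: str, algorithm: str = "sha256") -> bool:
    """Check that a file still matches the checksum recorded at ingest."""
    return fixity_checksum(path, algorithm) == expected.lower()
```

The checksum is recorded once when the file is processed; recomputing it later and comparing the two values shows whether the stored bytes have changed.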

The processed files are then uploaded to permanent storage, and where possible published online through open, freely-available IIIF APIs. (Not all of our digital collections are available; a subset is restricted or closed in line with our Access Policy, e.g. if the files contain personal information about living people.)

Previously, all of this processing happened on-site, and our files were kept in a mixture of on-premise RAID storage and Amazon S3. We’ve moved this entire process to the cloud, and that gives us a number of very tangible benefits.

What do we get from the cloud?

We can store much bigger files

We replaced our on-premise storage with a new, open-source storage service that’s backed by Amazon S3 and Azure Blob. These cloud services allow you to upload as much data as you like, which is more flexible than our on-prem setup. Previously, new storage had to be ordered weeks or months in advance — but never again will we have to pause our collecting while we wait for new hard drives.

One place where we’ve taken advantage of this flexibility is in our audiovisual collections. When we started processing and ingesting AV material in the cloud, we could move from compressed MPEG2 files to higher quality 2K or 4K files. The higher quality files are much bigger, and obviously more desirable from a preservation point of view, but with our on-premise storage we didn’t have the capacity for that much data. The cloud can not only store these larger files, but it can store them in a cost-effective way.

This is one example of how moving to the cloud has allowed us to make more decisions based on what’s best for the collections, not our technical limitations.

We can go much faster

Our on-premise workflow tools had a fixed capacity, and we’d often hit it! If we got a particularly large batch of new files, those services would get backed up, and sometimes take days or weeks to clear their queues. The queues would clear eventually, but this bottleneck was a limit on our ability to increase our rate of digitisation.

All our cloud services are “elastic”: they can scale up or down based on the work available. At midday on a Tuesday, they’ll be using lots of resources. If you come back at 2am on Sunday, nothing will be running. This means that when a big batch of material arrives, they’ll add more processing capacity to deal with it, and then take it away when it’s done. Now that this bottleneck is removed, new material will appear much sooner after it’s digitised.
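As a rough sketch of what “elastic” means in practice, the function below picks a worker count from the amount of queued work. It is not our actual scaling policy: the function name, the items-per-worker figure and the cap are all hypothetical values chosen for illustration.

```python
import math


def desired_workers(queue_depth: int, items_per_worker: int = 100,
                    max_workers: int = 50) -> int:
    """Pick how many workers to run for the current queue.

    An empty queue scales to zero; otherwise run enough workers to cover
    the queue, up to a fixed cap (both numbers are assumptions).
    """
    if queue_depth <= 0:
        return 0
    return min(max_workers, math.ceil(queue_depth / items_per_worker))
```

With these example numbers, an empty queue at 2am on Sunday needs no workers, while a sudden batch of a million files runs at the cap until the queue drains.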

This also means we don’t need to pass files around the on-site network or download them to our laptops; we can work on them directly in the cloud. This is a big win, especially in the age of hybrid working, when not everybody has a fast Internet connection at home: we spend less time waiting to move files around.

We can reprocess content at scale

Because we have so much capacity, we can do the sort of bulk processing that simply wasn’t feasible in our old system. We have ~100M objects in our permanent storage, but we could reprocess them all in a matter of hours.

One benefit of this is that we don’t need to fixate on picking the one and only “correct” file format for our digital objects. We can convert files into new formats whenever we like, so now we store the highest-quality preservation files, and then create access copies from the preservation copy in different formats. As file format fashions change, we can create new access copies.

For example, we have a growing pile of Word documents and PowerPoint decks in our born-digital collections. Currently we just store the original file, but at some point we might create PDF derivatives as access copies — and we could do that for all our existing files, not just new ingests. We’re not bound by decisions we’ve already made; we can migrate our digital collections as our needs evolve.
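A bulk migration like this starts with deciding which stored files need a derivative. The sketch below plans that step only and does not perform any conversion. The function name, the list of file extensions and the choice to put the PDF next to the original are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Extensions treated as Word or PowerPoint files (an assumed, incomplete list).
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx"}


def plan_pdf_derivatives(keys):
    """Map each Word/PowerPoint object key to the key of its future PDF
    access copy. Other files are skipped, and originals are never modified."""
    plan = {}
    for key in keys:
        path = PurePosixPath(key)
        if path.suffix.lower() in OFFICE_SUFFIXES:
            plan[key] = str(path.with_suffix(".pdf"))
    return plan
```

Because the plan is computed from whatever is already in storage, the same function covers existing files and new ingests alike.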

All this means we can handle more files, larger files, and more complex digital objects. We can be more ambitious about the sort of digitisation projects we undertake.

It’s easier to receive files

A few months after I started at Wellcome, we got a new batch of files from a vendor in Cambridge. I drove over to their office, collected a box full of hard drives, wrapped it in towels and bubble wrap, and carried it back to London on the train. Being jostled on the Tube has never been more stressful.

The files arrived safely, but this was obviously sub-optimal.

Vendors can now deliver files directly into our S3 buckets, which is simpler, more secure, and allows us to have lots of small deliveries rather than waiting for one massive batch.

Where next?

It’s important to acknowledge the complexity of moving all our storage and data management to the cloud. It was a multi-year process that required a lot of collaboration between teams, new skills and people, and careful planning and delivery. This wasn’t a quick win.

And it’s not “done” — running this sort of platform is a continuous process. We’ll continue to maintain and extend this infrastructure, and we’ll continue to grow our digital collections — adding 4 to 5 million new images a year. New bottlenecks and bugs will emerge, and we’ll address them as they do.

The cost of this setup is very manageable. Our storage cost scales linearly at a steady ~$25/TB, and that’s for all three copies of our content. Our processing cost is pretty stable, and doesn’t vary much month-to-month.
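Linear scaling makes the storage bill easy to estimate. Here is a back-of-the-envelope sketch using the ~$25/TB figure above; the function name is hypothetical, and since this post doesn't state the billing period the rate applies to, the result is per whatever period that rate covers.

```python
RATE_PER_TB_USD = 25.0  # approximate figure from this post; covers all three copies


def storage_cost_usd(terabytes: float, rate_per_tb: float = RATE_PER_TB_USD) -> float:
    """Estimate storage cost under a purely linear model: cost = size * rate."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * rate_per_tb
```

Under this model, doubling the size of the collection doubles the storage cost, with no step changes for buying new hardware.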

We’re about to roll out performance improvements to our image servers: as more and more people visit our website, we want to keep it fast. We’re also looking at presenting born-digital material on our website, so you can browse those files as easily as our digitised collections. And there’s plenty more to come after that.

Looking back, the benefits of building our own platform are clear. It’s more than just numbers on a spreadsheet or abstract technical advances. We’ve removed technical limitations, which means we can make more decisions based on what’s best for the collections, and not be limited by our digital infrastructure.

How moving to the cloud took our digital collections to new heights was originally published in Stacks on Medium, where people are continuing the conversation by highlighting and responding to this story.