Borealis Preservation Plan

Introduction

Borealis, the Canadian Dataverse Repository, is a bilingual, multidisciplinary, secure Canadian research data repository. Borealis repository infrastructure is a shared academic service provided in partnership with academic library consortia, participating institutions, and the Digital Research Alliance of Canada. Technical infrastructure hosting and service operations are provided by Scholars Portal and the University of Toronto Libraries (UTL). The Borealis Steering Committee, established in 2024, brings together regional library consortia (OCUL, COPPUL, CAAL, PBUQ) to support national governance and ongoing commitments to data stewardship. Learn more about Borealis academic library partners, services, and governance.

Borealis uses the open-source Dataverse software, which is developed and maintained by the Institute for Quantitative Social Science (IQSS) at Harvard University together with community members, users, and the Global Dataverse Community Consortium (GDCC).

The Borealis Preservation Plan outlines the primary objectives, roles and responsibilities, strategies, and actions for preserving the digital files uploaded by users and stored in the repository, and complements digital preservation activities and other services provided by academic libraries and institutions.

Objectives

The objectives of the preservation plan activities for the Borealis repository are as follows:

  1. Ensure a minimum level of fixity assurance for all files uploaded by users.
    • The priority for this strategy is to protect against the loss of data in the form of accidental deletion, corruption or modification of user-submitted content over time.
    • Commonly referred to as “bit-level preservation,” this strategy is focused on monitoring the integrity of the whole repository through monthly fixity reports and remediating any errors that may arise in a uniform, scalable, and efficient manner. This strategy does not guarantee any form of future usability/accessibility based on the intellectual contents or format of the files in question.
  2. Store files uploaded by users using a secure, reliable and scalable preservation storage strategy.
  3. Install and maintain all preservation features that are core to the Dataverse application (e.g. file format identification, file checksums, metadata exports, preservation metadata, tabular data conversion, other preservation features).
  4. Support Participating Institutions who wish to implement advanced preservation processing and management of datasets in their institutional collections in Borealis, through dataset exports and/or processing in external systems.

Preservation Strategies

Level 1

Description: The first level of preservation combines two broad sets of activities: bit-level preservation, via regular independent file fixity checking and safe storage in the Ontario Library Research Cloud (OLRC); and maintaining and improving the preservation features that are part of the Borealis repository. As the technical service provider, Borealis is not directly responsible for validating the contents or quality of user-uploaded files.

Level 1 preservation addresses Objectives 1, 2 and 3 (noted above): that user-uploaded files are safe from loss and that minimum preservation functions are run as a necessary precursor to additional preservation strategies.

Scope: Bit-level preservation is conducted for all user-uploaded files in Borealis. This includes files associated with all versions of draft and published datasets (open or restricted). It does not include files generated by the Dataverse application itself, such as derivatives, thumbnails, or system-generated metadata. Basic preservation activities are largely managed by the Borealis service team on behalf of Participating Institutions, except in cases of fixity check failures, when intervention and remediation are required.

Term: Borealis will maintain Level 1 preservation activities for all user data as long as an institution is a subscriber to the Borealis service. In the event that the service agreement between Borealis and a Participating Institution is terminated, or stewardship for a sub-collection is no longer viable, Borealis will support transfer processes, such as data exports, to facilitate external collection management as required by the Institution to implement their plans. This may include return of data to the Institution or succession by another party identified by the Institution. Data will not be deleted except at the express request of the Institution or if reasonable attempts to ensure ongoing stewardship have not succeeded.

Activities:

  • Primary storage of all Borealis data files in the OLRC, with replicated copies stored at three of the five institutional partners' storage nodes located in Ontario, Canada

  • Daily export and backup of all files to local disk storage and tape using industry-standard tape backup software

    • For active files:
      • Multiple versions of each file are available for restore for 30 days.
      • If a file has not been modified for over 30 days, the most recent version of the file is retained permanently in backup.
    • For deleted files, the latest version of that file is available for restore for 60 days.
    • For deaccessioned datasets, a tombstone record landing page including dataset metadata and reason for deaccession is retained in the repository and backups.
    • At least two copies of the backup are retained onsite and one copy is retained offsite.
  • Regular independent fixity validation checks

    • When users upload files to the Dataverse application, MD5 checksums are automatically generated and stored in the Dataverse application database
    • The Dataverse Native API includes the Physical Files Validation in a Dataset API call, which downloads a file from storage and validates its checksum against the value stored in the database
    • Borealis runs this API call against all files with an assigned File ID every 30 days
    • The record of each fixity check (both positive and negative) is stored in an internal database and a monthly summary is generated and sent to Borealis service staff
    • Any errors identified during this process are triaged for correction by retrieving an unaffected copy of the file from the backup or communicating with the Participating Institution and/or depositing User(s) for replacement
      • Fixity remediation scripts replace any corrupted file with a good copy from backup or a replacement provided by the Participating Institution, and apply provenance metadata recording the steps taken in each remediation case.
  • Maintenance of additional preservation-supporting functionality available as part of the Dataverse application:

    • File format identification using an internal format registry (MIME types) and JHOVE
    • Transformation of tabular data formats into non-proprietary tab-separated values text files (Dataverse .tab format). Tabular data ingest also extracts file-level and variable-level metadata (e.g. Data Documentation Initiative (DDI) Codebook metadata), which is available alongside other Dataverse metadata exports in JSON, XML and PDF format.
    • Generation of UNFs (Universal Numeric Fingerprints) for tabular data files
      • UNFs are designed to validate the semantic content of tabular data regardless of format and are assigned at the dataset and file level
      • The Dataverse application provides a UNF when tabular data ingest has been successful. As a result, UNFs (and the derived .tab files) do not require subsequent checks unless this value is missing, in which case the Dataverse application notifies the individual user of the failed ingest.
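
The fixity validation steps listed under Activities above can be sketched roughly as follows. This is a minimal illustration, not the Borealis implementation: it recomputes MD5 checksums on in-memory bytes rather than calling the Dataverse API, and the names `FixityRecord`, `run_fixity_check`, and `summarize`, as well as the record layout, are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class FixityRecord:
    """One fixity check outcome (hypothetical record layout)."""
    file_id: int
    checked_on: date
    expected_md5: str
    actual_md5: str

    @property
    def passed(self) -> bool:
        return self.expected_md5 == self.actual_md5


def run_fixity_check(stored_md5, contents, checked_on):
    """Recompute the MD5 of each file and compare it with the stored value.

    stored_md5: file ID -> checksum recorded at upload time.
    contents:   file ID -> bytes currently in storage (stand-in for a download).
    """
    records = []
    for file_id, expected in sorted(stored_md5.items()):
        actual = hashlib.md5(contents[file_id]).hexdigest()
        records.append(FixityRecord(file_id, checked_on, expected, actual))
    return records


def summarize(records):
    """Build a summary: counts plus the file IDs that need triage."""
    failed = [r.file_id for r in records if not r.passed]
    return {"checked": len(records), "passed": len(records) - len(failed), "failed": failed}


# Example data: file 2 has been corrupted in storage since upload.
uploaded = {1: b"survey responses", 2: b"interview transcript"}
stored = {fid: hashlib.md5(data).hexdigest() for fid, data in uploaded.items()}
in_storage = {1: b"survey responses", 2: b"interview transcripX"}

records = run_fixity_check(stored, in_storage, date(2025, 4, 14))
summary = summarize(records)
print(summary)
```

Both passing and failing checks are kept as records, so the summary can report totals while the `failed` list identifies the files to restore from backup or replace.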

Level 2

Description: This level of preservation is intended for Participating Institutions implementing active preservation through advanced preservation processing and/or the export of independent packages for management in external/institutional preservation systems.

  • Advanced preservation functions may be conducted when Borealis is paired with the Archivematica preservation processing system. Using the Dataverse Data Access API, Archivematica can retrieve datasets from Borealis and create independent preservation packages. Preservation tasks in the Archivematica workflow include signature-based file format identification, and format validation and characterization, as well as functions specific to Dataverse datasets, such as the inclusion of Dataverse system metadata and tabular file metadata in the Data Documentation Initiative (DDI) standard. Package and file relationships as well as the outcomes of preservation activities conducted are expressed using METS and PREMIS metadata. Packages created by Archivematica are sent to the preservation storage location of the institution’s choice. If an institution is accessing Archivematica through Permafrost, preservation packages are stored in the OLRC.
  • Alternatively, Institutions may opt to request exports of packages from Borealis (or any Dataverse) in the RDA BagIt format to receive independent copies of datasets for additional preservation processing and local management.

In combination with institutional preservation policies and strategies, the Borealis technical preservation workflows such as those noted above can support a Participating Institution’s application for trustworthy repository certification, such as CoreTrustSeal Repository Certification. Additional information about these preservation features and strategies is described below.

Scope: Participating Institutions are responsible for defining their preservation policies, approaches, and activities, and determining which datasets are eligible for additional processing, export, and long-term managed preservation. Administrators, curators, or other preservation staff and designates at Participating Institutions may select the complete contents of their institutional collections or a subset as guided by internal appraisals, selection criteria, and preservation policies. Borealis can provide technical support and setup as requested by institutions.

Activities:

  • Borealis will assist Participating Institutions in the setup of connections to Archivematica instances
    • If Participating Institutions are using Permafrost, the functional connection between Borealis and Archivematica will be set up as part of Permafrost technical service support activities
    • Datasets processed via Permafrost are subject to the functionality and limitations of this service
    • If a Participating Institution is using another hosted Archivematica service, or a locally-hosted instance of Archivematica, Borealis will provide advice and consultation on setup to connect Borealis and Archivematica
  • Borealis will assist Participating Institutions wishing to export BagIt-formatted packages from Borealis
    • BagIt packages produced by the Dataverse application are conformant with the RDA-endorsed BagIt profile, and contain:

      • user-uploaded files, and
      • citation, dataset-level, and file-level metadata in the form of an OAI-ORE compliant JSON-LD file, a text-based manifest file, and a DataCite XML file.

      Note: in the case of tabular data uploads, only the original version of the file is retained in the BagIt package. The export does not include the tabular derivative files (.tab) or variable-level metadata (DDI variable metadata in XML format), unless these were uploaded with the deposited dataset.

    • BagIt exports are conducted by the Borealis team at the request of a Participating Institution. Exports are conducted at the dataset level and require a structured list with the DOI and Version Number of each requested dataset from the Participating Institution.

    • Bags may be transferred to external S3 storage managed by the Participating Institution.
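
As an illustration of the structured list required for a BagIt export request, the sketch below validates a small CSV of DOIs and version numbers. The CSV layout, the column names `doi` and `version`, the `MAJOR.MINOR` version pattern, and the function `parse_export_request` are all assumptions made for this example; the DOIs are placeholders, and the format Borealis actually accepts should be confirmed with the service team.

```python
import csv
import io
import re

# Assumed shapes: "doi:10.<registrant>/<suffix>" and "MAJOR.MINOR" versions.
DOI_PATTERN = re.compile(r"^doi:10\.\d{4,9}/\S+$")
VERSION_PATTERN = re.compile(r"^\d+\.\d+$")


def parse_export_request(text):
    """Return (valid_rows, errors) for a CSV with 'doi' and 'version' columns."""
    valid, errors = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        doi = (row.get("doi") or "").strip()
        version = (row.get("version") or "").strip()
        if not DOI_PATTERN.match(doi):
            errors.append(f"line {line_no}: invalid DOI {doi!r}")
        elif not VERSION_PATTERN.match(version):
            errors.append(f"line {line_no}: invalid version {version!r}")
        else:
            valid.append((doi, version))
    return valid, errors


# Placeholder DOIs for illustration only.
request = """doi,version
doi:10.5072/FK2/EXAMPLE1,1.0
doi:10.5072/FK2/EXAMPLE2,2.1
10.5072/FK2/EXAMPLE3,1.0
doi:10.5072/FK2/EXAMPLE4,latest
"""

valid, errors = parse_export_request(request)
print(valid)
print(errors)
```

Checking the list before submission catches entries such as a DOI without its `doi:` prefix or a non-numeric version, which would otherwise surface only when the export is attempted.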

Roles and Responsibilities

Users: responsible for uploading data files and metadata to the Borealis repository, as well as viewing, downloading, and accessing data files and metadata in the repository. Users create an account and must adhere to the Borealis Terms of Use as well as any policies and procedures governing their use of the service as set by Participating Institutions.

Participating Institutions: responsible for administering the collections and use of Borealis at their institution. Institutions subscribe to Borealis via consortial agreements and are allocated storage space and administrative rights for local staff to manage their institutional collections within the Borealis repository. Institutions are responsible for oversight and stewardship of the data uploaded to their institutional collections by setting policies and deposit guidelines, administering users and user rights, and handling takedown and copyright decisions. Institutions may also validate data deposits for quality and completeness via curation and preservation activities, or provide guidance to depositors about preferred file formats and data documentation for deposit, sharing, and long-term preservation, as well as metadata to support discovery, understandability, reproducibility, and FAIR data now and in the future.

Preservation policies and managed preservation activities are also defined by Participating Institutions for their collections, or selected sub-collections or datasets, to facilitate long-term preservation and access. In the event that the agreement between Borealis and Participating Institutions is terminated, or that an institution is otherwise no longer able to steward some or all of their data, Institutions are responsible for determining exit strategies and succession plans for their data.

Borealis: responsible for the technical repository and storage infrastructure, including maintenance, client support, and administration of the Dataverse repository software and service. Borealis ensures the Dataverse application is functional, secure, and updated. Borealis maintains the connected components, including storage infrastructure for files and data in the repository, integrated applications, such as Data Explorer, and customizations. Borealis supports administrators and users at Participating Institutions through training, documentation, guides, and administrator community calls. Borealis maintains no oversight over the quality, completeness, or format of files uploaded by users but will assist in identifying and remediating fixity issues in collaboration with Participating Institutions as they arise.

Definitions

Archivematica: an open source, standards-based processing tool for creating well-formed packages for preservation storage. Archivematica performs signature-based file format identification, validation and characterization functions; can normalize copies of files to preservation and access formats; and creates preservation metadata files using the METS and PREMIS standards. The Dataverse - Archivematica Integration supports processing of data packages from a Dataverse instance using the Archivematica UI tools, workflows, and Dataverse APIs.

BagIt: a set of formatting conventions that guide creating checksums for, and verifying the fixity of, collections of files. Files contained in a BagIt-formatted directory (commonly called a “bag”) include a manifest of checksums that can be used to ensure that the contents of the directory have retained fixity after transfer or in storage.
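
A minimal sketch of how a bag's manifest can be used to verify fixity after transfer. It assumes a payload manifest named `manifest-sha256.txt` whose lines hold a checksum followed by a path relative to the bag root; the helper names `verify_bag` and `make_demo_bag` are hypothetical, and the checksum algorithm used in bags exported from Dataverse may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_bag(bag_dir, algorithm="sha256"):
    """Check every payload file listed in manifest-<algorithm>.txt.

    Returns the relative paths that are missing or whose checksum differs.
    """
    bag = Path(bag_dir)
    problems = []
    manifest = bag / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            problems.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append(rel_path)
    return problems


def make_demo_bag(root):
    """Write a tiny bag with one payload file (illustration only)."""
    bag = Path(root) / "demo_bag"
    (bag / "data").mkdir(parents=True)
    payload = bag / "data" / "readme.txt"
    payload.write_bytes(b"hello bag\n")
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag / "manifest-sha256.txt").write_text(
        f"{digest}  data/readme.txt\n", encoding="utf-8"
    )
    return bag


with tempfile.TemporaryDirectory() as tmp:
    bag = make_demo_bag(tmp)
    before = verify_bag(bag)  # intact bag: no problems expected
    (bag / "data" / "readme.txt").write_bytes(b"hello bag!\n")  # simulate corruption
    after = verify_bag(bag)

print(before, after)
```

The sketch reads each file fully into memory for brevity; a production check on large research data files would read in chunks instead.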

Bit-level preservation: one type of digital preservation strategy, focused on ensuring that files retain fixity in storage through checksum validation and backup of multiple copies to multiple locations, to protect against accidental loss or corruption and to support disaster recovery. Bit-level preservation does not guarantee any form of future usability/accessibility based on the contents or format of the files in question.

Checksum: a numeric or alphanumeric string, effectively unique to a file's contents, produced by running a checksum-generating algorithm against a file. When the contents of the file are altered in any way, the checksum value will change, indicating that the file no longer has fixity and therefore should be replaced from a good copy. Checksum algorithms include MD5, SHA-1 and SHA-256.
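
A short sketch of checksum generation with the three algorithms named above, using Python's standard `hashlib` module. The helper name `checksum` and the sample data are illustrative only.

```python
import hashlib
import io


def checksum(stream, algorithm="md5", chunk_size=65536):
    """Compute a hex digest by reading a binary stream in chunks."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


original = b"temperature,depth\n4.1,10\n"
altered = b"temperature,depth\n4.2,10\n"  # a single character differs

for name in ("md5", "sha1", "sha256"):
    first = checksum(io.BytesIO(original), name)
    again = checksum(io.BytesIO(original), name)
    other = checksum(io.BytesIO(altered), name)
    # Same content gives the same value; changed content gives a different one.
    print(name, len(first), first == again, first != other)
```

Reading in chunks keeps memory use constant regardless of file size, which matters for large data files.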

CoreTrustSeal: an international, community-based, non-governmental, and non-profit organization promoting sustainable and trustworthy data repositories. Certified repositories are recognized as being sustainable, transparent, and trustworthy from organizational, resourcing, and technical standpoints. Requirements for certification are based on the OAIS (Open Archival Information System) Reference Model for preserving and making available digital information.

Dataset: a container for a group of related files. For example, a dataset can include the original source data, code, and/or documentation related to a single study or publication. A dataset must also include metadata added by the user to describe the files, including a title, author(s), description and subject.

Dataverse: the open-source research data repository software application with which the Borealis repository is hosted and operated. Dataverse is developed by the Institute for Quantitative Social Science (IQSS) at Harvard University. Borealis infrastructure is based on a locally hosted instance of Dataverse.

Digital preservation: “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary” (DPC Glossary). Digital preservation activities can include active and ongoing monitoring of files and formats, regular fixity checks, and refreshing of storage media.

Fixity: the property of a digital file having remained unaltered. Fixity is established by computing a checksum and comparing it with a previously recorded value. Fixity information can help establish the integrity of files via evidence that files have remained physically unchanged over time.

Ontario Library Research Cloud (OLRC): a five-node academic library community cloud storage network maintained by Scholars Portal and institutional partners. The OLRC uses the OpenStack Swift software and ORION network to manage and connect five storage nodes located at the University of Toronto, the University of Guelph, the University of Ottawa, York University, and Queen’s University, in Ontario, Canada. Borealis uses the OLRC as its repository storage and deposited files are replicated across three of the five nodes for reliability and integrity at any given time. If one of these copies becomes unreadable, a new copy is created by the system from the two remaining good copies. The OLRC service has a connection with DuraCloud, an open-source application integrated with the OLRC for advanced file preservation management of packages. Information about the technology and security of the OLRC is contained in the Borealis Technology Infrastructure and Security Information.
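
The idea of recreating an unreadable copy from the remaining good copies can be illustrated with the simplified sketch below. It is not a description of how OpenStack Swift or the OLRC performs replication internally; the function `repair_replicas`, the majority rule, and the node names are assumptions for illustration.

```python
import hashlib
from collections import Counter


def repair_replicas(replicas):
    """Replace unreadable or divergent replicas with the majority copy.

    replicas: node name -> bytes, or None when the copy cannot be read.
    Returns the list of node names that were repaired.
    """
    digests = {
        node: hashlib.md5(data).hexdigest()
        for node, data in replicas.items()
        if data is not None
    }
    if not digests:
        raise ValueError("no readable replica left")
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority among replicas")
    good_copy = next(replicas[n] for n, d in digests.items() if d == majority_digest)
    repaired = []
    for node in replicas:
        if digests.get(node) != majority_digest:
            replicas[node] = good_copy  # recreate the copy from a good replica
            repaired.append(node)
    return repaired


# Hypothetical node names; the copy on node-b has become unreadable.
replicas = {"node-a": b"dataset", "node-b": None, "node-c": b"dataset"}
repaired = repair_replicas(replicas)
print(repaired)
```

With three replicas, two matching readable copies form a majority, so a single lost or damaged copy can be restored without consulting backups.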

Permafrost: a hosted digital preservation service offered by Scholars Portal to members of the Ontario Council of University Libraries (OCUL). Permafrost pairs Archivematica with the OLRC to provide access to technical infrastructure, support, and training to enable OCUL members to actively process digital objects for long-term preservation and access.

Acknowledgements

The Preservation Plan is updated and maintained by Borealis, Scholars Portal, and the University of Toronto Libraries, in collaboration with its national governance and participating institutions. Thank you to the Alliance’s former Dataverse North Policy Working Group for creating the initial policy framework for this document. The Alliance’s RDM Preservation Expert Group’s report Preservation for Dataverse in Canada: Recommendations provides key requirements for the preservation strategies outlined above. Additional sources of inspiration were the Texas Digital Library Digital Preservation Policy and the Harvard Dataverse Preservation Policy.

Published June 23, 2022. Updated April 14, 2025.