Persistent ID (will always link to the latest version): <http://w3id.org/ldac/pilars>
To cite this document (pending a publication), please use this:
Sefton, P., et al. (2024). Protocols for Implementing Long-term Archival Repositories Services (PILARS). Retrieved from http://w3id.org/ldac/pilars.
We collected feedback until the end of June 2024 on GitHub.
More information and background is available at RRKive.org.
Protocols for Implementing Long-term Archival Repositories Services (PILARS) by Sefton et al. is licensed under CC BY 4.0.
Peter Sefton, p.sefton@uq.edu.au, The University of Queensland, 0000-0002-3545-944X
Moises Sacal Bonequi, m.sacalbonequi@uq.edu.au, The University of Queensland, 0000-0002-4438-2755
Alex Ip, alex.ip@aarnet.edu.au, AARNet, 0000-0001-8937-8904
Michael Lynch, m.lynch@sydney.edu.au, University of Sydney, 0000-0001-5152-5307
Amanda Lawrence, amanda.lawrence@rmit.edu.au, RMIT, 0000-0003-2194-8178
Julia Colleen Miller, julia.miller@anu.edu.au, Australian National University, 0000-0002-8827-3825
Sam Hames, s.hames@uq.edu.au, The University of Queensland, 0000-0002-1824-2361
Marissa Takahashi, marissa.takahashi@qut.edu.au, Queensland University of Technology, 0000-0002-6695-7660
River Tae Smith, river.smith@monash.edu, Monash University, 0000-0002-2118-3147
Annie Cameron, anniec@wangkamaya.org.au, Wangka Maya PALC, 0009-0007-5522-7121
Mark Raadgever, m.raadgever@uq.edu.au, The University of Queensland
Nick Thieberger, thien@unimelb.edu.au, University of Melbourne, 0000-0001-8797-1018
Ben Foley, b.foley@uq.edu.au, The University of Queensland, 0000-0003-0879-9251
Adam Bell, adam.bell@aarnet.edu.au, AARNet, 0000-0003-2129-4776
Janet McDougall, janet.mcdougall@anu.edu.au, Australian National University, 0000-0002-2151-2190
Michael Haugh, michael.haugh@uq.edu.au, The University of Queensland, 0000-0003-4870-0850
This document sets out protocols for the design and implementation of sustainable Archival Repository services to achieve “CAREful FAIRness”, i.e. to support the CARE (Carroll et al. (2020)) and FAIR (Wilkinson et al. (2016)) principles.
The PILARS aim to guide the design and implementation of data storage services, referred to as Archival Repositories, for a range of purposes, including core use cases of:
supporting research that follows the FAIR (Wilkinson et al. (2016)) principles in any discipline, and
archiving cultural heritage.
These protocols are designed to work alongside the CARE principles (Carroll et al. (2020)), which operate at a governance level, and the Reference Model for an Open Archival Information System (OAIS) (“OAIS Reference Model (ISO 14721)” (n.d.)).
The high-level aims of the PILARS are to maximize:
autonomy for Data Stewards or Custodians
return on investment in data and data infrastructure
long-term sustainability for data and for data systems and management.
The technical goals to support the aims are:
Data is portable and not locked into a particular storage system.
Data can be stored and described in systems based on Open Specifications.
Services such as authorized access interfaces, catalogues and finding aids can be built and rebuilt from data in a storage system using Open Source Software solutions, services and tools.
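The third goal can be illustrated with a minimal sketch, assuming the storage system is a plain directory tree and that each object directory holds a JSON metadata file. The file name `metadata.jsonld`, the `@id` and `name` fields, and the function `build_catalogue` are all hypothetical choices made for this illustration; they are not prescribed by these protocols.

```python
import json
from pathlib import Path

METADATA_FILE = "metadata.jsonld"  # assumed file name, not mandated by PILARS


def build_catalogue(root: Path) -> list[dict]:
    """Rebuild a simple catalogue by scanning storage for metadata files.

    Nothing is read from a database: the catalogue is derived entirely
    from the metadata stored next to the data.
    """
    catalogue = []
    for meta_path in sorted(root.rglob(METADATA_FILE)):
        record = json.loads(meta_path.read_text(encoding="utf-8"))
        catalogue.append(
            {
                "id": record.get("@id"),      # assumed identifier field
                "name": record.get("name"),   # assumed title field
                "location": meta_path.parent.relative_to(root).as_posix(),
            }
        )
    return catalogue
```

Because the catalogue is a pure function of what is in storage, it can be discarded and regenerated at any time, for example after migrating to a different portal or search tool.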
This set of protocols is inspired by the continuing success of the technical approach taken over two decades by PARADISEC (the Pacific and Regional Archive for Digital Sources in Endangered Cultures) (Harris et al. (2015)), which houses cultural heritage material from more than 1,360 languages. PARADISEC uses standard metadata and stores data in commodity services (initially files on disk, now objects in a cloud storage service), with metadata adjacent to the data. The protocols also draw on work by the Language Data Commons of Australia to generalize the PARADISEC approach to other disciplines.
These protocols are aimed at IT practitioners, archivists, librarians, researchers and infrastructure managers involved in long-term data management, and are intended to be complementary to the existing practices and principles of those disciplines.
In a research context, it is important to be able to support the FAIR principles (Wilkinson et al. (2016)), ensuring that:
Data is well described by metadata.
Data is identified with persistent identifiers.
Shared services with good governance are in place to store interoperable data, to make it findable and provide appropriate access controls.
These protocols could form the basis for the design, evaluation or procurement of Archival Repository services. They also allow Data Stewards or Custodians to begin organizing data in a format ready for archiving and digital preservation by using a range of tools, as long as they have access to some kind of commodity storage.
Divide data files into Storage Objects that form meaningful units of the smallest practical size.
Each Storage Object is a directory (or storage object equivalent) containing the files that make up an Object, including metadata and administrative files such as checksums.
Document and implement an ID resolution system that maps IDs to storage locations (FAIR-F1).
Store documentation about the conventions and standards used in a data store (such as those in Protocol 1.3) within the root of the storage service itself.
Data storage of well-described data objects is considered separately from the current uses to which the data is put.
Data files use open or standard formats where possible, independent of particular software (FAIR-I).
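The storage protocols above can be sketched as follows, assuming a plain file system as the commodity storage. Everything named here is an assumption for illustration: `id_to_path` shows one possible ID resolution convention (hashing the ID), `README.txt` is a hypothetical name for the conventions document kept at the root, and `create_storage_object` is a hypothetical helper. None of these are specified by the protocols themselves.

```python
import hashlib
import shutil
from pathlib import Path


def id_to_path(root: Path, object_id: str) -> Path:
    """Map an ID to a storage location (one hypothetical convention).

    The ID is hashed with SHA-256 and the hex digest is split into
    short segments so that no single directory grows too large.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    return root / digest[0:3] / digest[3:6] / digest


def init_store(root: Path) -> None:
    """Create the store and document its conventions at the root."""
    root.mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"  # assumed file name
    if not readme.exists():
        readme.write_text(
            "ID resolution: sha256(ID) as a hex digest, stored at "
            "<root>/<digest[0:3]>/<digest[3:6]>/<digest>/\n",
            encoding="utf-8",
        )


def create_storage_object(root: Path, object_id: str, files: list[Path]) -> Path:
    """Copy files into the directory that represents one Storage Object."""
    target = id_to_path(root, object_id)
    target.mkdir(parents=True, exist_ok=True)
    for source in files:
        shutil.copy2(source, target / source.name)
    return target
```

Because the mapping is deterministic and documented inside the store, any tool that reads `README.txt` can locate an object from its ID without consulting a separate database.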
Data is Annotated: Contents, structure, provenance, and access and reuse permissions are comprehensively described with metadata and licenses.
For each Storage Object, store metadata that describes (annotates) the object and (optionally) the files that make up the object. The metadata should be stored in a file or files with the data files.
For Protocol 2.1, use interoperable general-purpose linked-data metadata stored in a file format which has an Open Specification. This may be extended with domain-specific or ad hoc metadata, which might be in non linked-data formats (FAIR-F1 & FAIR-F2) and might be stored in additional files.
For each Storage Object, include at least one license document linked from the metadata using the appropriate property for a ‘license’ from the core vocabulary (e.g. http://schema.org/license), setting out in plain language how data may be used and/or redistributed and by whom (CARE & FAIR-R1.1).
Do not expose data, for example via a portal, without access controls, and do not disseminate confidential license or other governance information. Licenses might change over time: they might be withdrawn and new ones added. Note, however, that once data has been distributed under an Open Access license, it cannot be withdrawn from those who have already downloaded it.
Store checksum metadata in a documented standard format alongside data files to help ensure data integrity.
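The annotation protocols above can be sketched as follows, assuming JSON-LD with the schema.org vocabulary as the general-purpose linked-data format and SHA-256 for checksums. The file names `metadata.jsonld` and `manifest-sha256.txt`, the manifest layout (digest, two spaces, relative path) and all function names are hypothetical choices for this illustration; an Archival Repository would document its own conventions.

```python
import hashlib
import json
from pathlib import Path

METADATA_FILE = "metadata.jsonld"      # assumed file name
MANIFEST_FILE = "manifest-sha256.txt"  # assumed file name


def write_metadata(obj_dir: Path, object_id: str, name: str, license_file: str) -> Path:
    """Write linked-data metadata next to the data files."""
    doc = {
        "@context": "http://schema.org/",
        "@id": object_id,
        "@type": "Dataset",
        "name": name,
        # schema.org 'license' property, pointing at the license document
        "license": {"@id": license_file},
    }
    path = obj_dir / METADATA_FILE
    path.write_text(json.dumps(doc, indent=2), encoding="utf-8")
    return path


def write_manifest(obj_dir: Path) -> Path:
    """Record a SHA-256 checksum for every file in the Storage Object."""
    lines = []
    for path in sorted(obj_dir.rglob("*")):
        if path.is_file() and path.name != MANIFEST_FILE:
            # Reads each file fully into memory; fine for a sketch.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(obj_dir).as_posix()}")
    manifest = obj_dir / MANIFEST_FILE
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(obj_dir: Path) -> list[str]:
    """Return relative paths that are missing or whose checksum changed."""
    failed = []
    text = (obj_dir / MANIFEST_FILE).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line:
            continue
        digest, rel = line.split("  ", 1)
        target = obj_dir / rel
        if not target.is_file() or hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            failed.append(rel)
    return failed
```

Writing the manifest after the metadata and license files means those files are covered by integrity checks too; `verify_manifest` returns an empty list when nothing has changed.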
Governance is in place for each Archival Repository.
The following terms (used in capitalized form) are defined.
The term Archival Repository is used to cover any system that is designed to keep data securely for a defined period of time (often forever), and to make it findable by and available to appropriate parties. The terms Repository and Archive have different nuances and are used in a variety of ways in different communities, but here we want to emphasize the commonalities and focus on advice that is relevant to the audience of these protocols.
A Data Steward or Custodian is an individual or organization with the authority to make decisions regarding data under management. This decision-making process is assumed to take place with good governance, in line with the CARE principles.
The Digital Preservation Coalition page “What is digital preservation?” defines Digital Preservation as:
The series of managed activities necessary to ensure continued access to digital materials for as long as necessary, refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological and organizational change.
A computer file is an aggregation of data on a storage device, identified by a name.
(This definition comes from a discussion thread on Wikipedia (TalkComputerFile2024).)
A File Format is the organizational schema for a file; this might be formally defined in a specification or be ad hoc. File formats might be considered at various layers of specificity: for example, a text file might be plain text with a specific encoding, such as UTF-8, and also be an XML file conforming to a particular schema.
Metadata is data that describes other data. Linked Data Metadata follows the principles set out by Tim Berners-Lee for Linked Data, so that metadata terms and references to the entities described are URIs (URLs) (Berners-Lee (n.d.)).
An Open Specification is a versioned, published, openly available description of a set of precise requirements (e.g. for a format, system or protocol), which might or might not be endorsed by a standards authority.
Open Source Software is freely distributable software, according to the definition of the Open Source Initiative (OSI).
The term License is used here inclusively to refer to a document which captures the terms under which data in an Archival Repository may be shared, used, reused or deposited. This includes documents such as Data Sharing Agreements or other contracts which might be negotiated at various times, which give certain parties licence to use data in defined ways.
The term Repository Collection is used here to reference the Collection class from the Portland Common Data Model (“Duraspace/Pcdm” (n.d.)), which was conceived as an interchange format for repositories and digital libraries. The definition for a collection includes this:
“A Collection is a group of resources. Collections have descriptive metadata, access metadata, and may links (sic) to works and/or collections.”
The term Repository Object is used here in line with the Portland Common Data Model definition (“Duraspace/Pcdm” (n.d.)), which refers to an abstract object:
“An Object is an intellectual entity, sometimes called a ‘work’, ‘digital object’, etc. Objects have descriptive metadata, access metadata and might contain files and other Objects as member ‘components’. Each level of a work is therefore represented by an Object instance, and is capable of standing on its own, being linked to from Collections and other Objects.”
A Standard is a Specification published by a recognized standards body, such as the ISO or W3C. Standards are not always Open Access, so might have barriers to adoption.
A Storage Object is a discrete unit in a physical storage service. This might represent, for example, a Repository Object or a Repository Collection, which are abstract structural concepts. This concept is similar to an OCFL Object, and the concept of a Package in OAIS.
See the Notes and Guidance for more detail about implementing the PILARS.