Status

Persistent ID (will always link to the latest version): <http://w3id.org/ldac/pilars>

To cite this document (pending a publication), please use this:

Sefton, P., et al. (2024). Protocols for Implementing Long-term Archival Repositories Services (PILARS). Retrieved from http://w3id.org/ldac/pilars.

We collected feedback on GitHub until the end of June 2024.

More information and background are available at RRKive.org.

Protocols for Implementing Long-term Archival Repositories Services (PILARS) by Sefton et al. is licensed under CC BY 4.0.

Editor

Peter Sefton, p.sefton@uq.edu.au, The University of Queensland, 0000-0002-3545-944X

Contributors

Moises Sacal Bonequi, m.sacalbonequi@uq.edu.au, The University of Queensland, 0000-0002-4438-2755

Alex Ip, alex.ip@aarnet.edu.au, AARNet, 0000-0001-8937-8904

Michael Lynch, m.lynch@sydney.edu.au, University of Sydney, 0000-0001-5152-5307

Amanda Lawrence, amanda.lawrence@rmit.edu.au, RMIT, 0000-0003-2194-8178

Julia Colleen Miller, julia.miller@anu.edu.au, Australian National University, 0000-0002-8827-3825

Sam Hames, s.hames@uq.edu.au, The University of Queensland, 0000-0002-1824-2361

Marissa Takahashi, marissa.takahashi@qut.edu.au, Queensland University of Technology, 0000-0002-6695-7660

River Tae Smith, river.smith@monash.edu, Monash University, 0000-0002-2118-3147

Annie Cameron, anniec@wangkamaya.org.au, Wangka Maya PALC, 0009-0007-5522-7121

Mark Raadgever, m.raadgever@uq.edu.au, The University of Queensland

Nick Thieberger, thien@unimelb.edu.au, University of Melbourne, 0000-0001-8797-1018

Ben Foley, b.foley@uq.edu.au, The University of Queensland, 0000-0003-0879-9251

Adam Bell, adam.bell@aarnet.edu.au, AARNet, 0000-0003-2129-4776

Janet McDougall, janet.mcdougall@anu.edu.au, Australian National University, 0000-0002-2151-2190

Michael Haugh, michael.haugh@uq.edu.au, The University of Queensland, 0000-0003-4870-0850


Overview

This document sets out protocols for the design and implementation of sustainable Archival Repository services to achieve “CAREful FAIRness”, i.e. to support the CARE (Carroll et al. (2020)) and FAIR (Wilkinson et al. (2016)) principles.

The PILARS aim to guide the design and implementation of data storage services, referred to as Archival Repositories, for a range of purposes, including core use cases of:

  • supporting research that follows the FAIR (Wilkinson et al. (2016)) principles in any discipline, and

  • archiving cultural heritage.

These protocols are designed to work alongside the CARE principles (Carroll et al. (2020)), which operate at a governance level, and the Reference Model for an Open Archival Information System (OAIS) (“OAIS Reference Model (ISO 14721)” (n.d.)).

The high-level aims of the PILARS are to maximize:

  • autonomy for Data Stewards or Custodians

  • return on investment in data and data infrastructure

  • long-term sustainability for data and for data systems and management.

The technical goals to support the aims are:

  • Data is portable and not locked into a particular storage system.

  • Data can be stored and described in systems based on Open Specifications.

  • Services such as authorized access interfaces, catalogues and finding aids can be built and rebuilt from data in a storage system using Open Source Software solutions, services and tools.

Background

This set of protocols is inspired by the continuing success of the technical approach taken over two decades by PARADISEC (the Pacific and Regional Archive for Digital Sources in Endangered Cultures) (Harris et al. (2015)). PARADISEC houses cultural heritage material from more than 1,360 languages, described with standard metadata and stored in commodity services (initially files on disk, now objects in a cloud storage service), with the metadata kept adjacent to the data. The protocols also draw on work by the Language Data Commons of Australia to generalize the PARADISEC approach to other disciplines.

Audience

These protocols are aimed at IT practitioners, archivists, librarians, researchers and infrastructure managers involved in long-term data management, and are intended to be complementary to the existing practices and principles of those disciplines.

Rationale

In a research context, it is important to be able to support the FAIR principles (Wilkinson et al. (2016)), ensuring that:

  • Data is well described by metadata.

  • Data is identified with persistent identifiers.

  • Shared services with good governance are in place to store interoperable data, to make it findable and provide appropriate access controls.

These protocols could form the basis for the design, evaluation or procurement of Archival Repository services. They also allow Data Stewards or Custodians to begin organizing data in a format ready for archiving and digital preservation using a range of tools, as long as they have access to some kind of commodity storage.


The Protocols

1
Data is Portable: Assets are not locked in to a particular mode of storage, interface or service.
1.1
Keep data in one or more general-purpose commodity IT storage systems.
1.1.1
 The storage system has a method to store and retrieve File-like datastreams using hierarchical file-paths.
1.1.2
 The storage system has a method to list all the File-paths in the storage system.
1.2

Divide up data files into Storage Objects that form meaningful units, of smallest practical size.

1.2.2

 Each Storage Object is a directory (or storage object equivalent) containing the files, including metadata and administrative files such as checksums that make up an Object.

1.2.3
 Storage Objects can be located by inspecting the contents of the storage hierarchy by listing the paths (Protocol 1.1.2), for example, by the presence of a file with a defined name in the hierarchy.
1.3

Document and implement an ID resolution mapping system to map IDs to storage locations (FAIR-F1).

1.4

Store documentation about the conventions and standards used in a data store, such as the ID resolution mapping (Protocol 1.3), within the root of the storage service itself.

1.5

Data storage of well-described data objects is considered separately from the current uses to which the data is put.

1.6

Data files use open or standard formats where possible, independent of particular software (FAIR-I).

1.7
If data resides in systems, such as content management systems or database applications, that do not inherently support all of Protocols 1 and 2, then put processes in place to export data to a system that does.
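Protocols 1.1.2, 1.2.3 and 1.3 can be illustrated with a minimal sketch, assuming a local file system stands in for the commodity storage. The marker file name `storage-object.json`, its `id` field and all function names are hypothetical choices made for this example; the protocols do not prescribe them.

```python
import json
import os
import tempfile

MARKER = "storage-object.json"  # hypothetical marker file name (assumption)


def list_paths(root):
    """List every file path in the store, relative to its root (Protocol 1.1.2)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(paths)


def find_storage_objects(paths):
    """Locate Storage Objects by the presence of the marker file (Protocol 1.2.3)."""
    return sorted(
        p[: -len(MARKER)].rstrip("/")
        for p in paths
        if p == MARKER or p.endswith("/" + MARKER)
    )


def build_id_map(root):
    """Map each object's ID to its storage location (Protocol 1.3)."""
    id_map = {}
    for obj_dir in find_storage_objects(list_paths(root)):
        with open(os.path.join(root, obj_dir, MARKER), encoding="utf-8") as f:
            id_map[json.load(f)["id"]] = obj_dir
    return id_map


def make_demo_store():
    """Create a throwaway store with two Storage Objects (illustration only)."""
    root = tempfile.mkdtemp()
    demo = {
        "collection/item-1": "https://example.org/id/item-1",
        "collection/item-2": "https://example.org/id/item-2",
    }
    for obj_dir, obj_id in demo.items():
        os.makedirs(os.path.join(root, obj_dir))
        with open(os.path.join(root, obj_dir, MARKER), "w", encoding="utf-8") as f:
            json.dump({"id": obj_id}, f)
        with open(os.path.join(root, obj_dir, "recording.txt"), "w", encoding="utf-8") as f:
            f.write("example data")
    return root


if __name__ == "__main__":
    store = make_demo_store()
    for object_id, location in build_id_map(store).items():
        print(object_id, "->", location)
```

Because the ID map is rebuilt purely from a path listing and the files in the store, it can be regenerated at any time, which is in keeping with the goal that services can be built and rebuilt from the data itself.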
2

Data is Annotated: Contents, structure, provenance, and access and reuse permissions are comprehensively described with metadata and licenses.

2.1

For each Storage Object, store metadata that describes (annotates) the object and (optionally) the files that make up the object. The metadata should be stored in a file or files with the data files.

2.2

For Protocol 2.1, use interoperable general-purpose linked-data metadata stored in a file format which has an Open Specification. This may be extended with domain-specific or ad hoc metadata, which might be in formats other than linked data (FAIR-F1 & FAIR-F2) and might be stored in additional files.

2.3

For each Storage Object, include at least one license document linked from the metadata using the appropriate property for a ‘license’ from the core vocabulary (e.g. http://schema.org/license), setting out in plain language how data may be used and/or redistributed and by whom (CARE & FAIR-R1.1).

2.3.1

 Do not expose data, for example via a portal, without access controls, or disseminate confidential license or other governance information. Over time, licenses might change or be withdrawn, and new licenses might be added. Note, however, that once data has been distributed under an Open Access license, it cannot be withdrawn from those who have already downloaded it.

2.3.2
 Documentation about licenses for deposit and archive-wide accession policies may also be stored with an object.
2.4

Store checksum-metadata in a documented standard format alongside data files to help ensure data integrity.

2.5
Represent Repository Collections, such as archival series or other organizing entities, as Storage Objects; either self-contained with their member data within the Storage Object, or as metadata-only Storage Objects referencing or referenced by other Storage Objects.
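Protocols 2.1, 2.3 and 2.4 can be sketched together, assuming a Storage Object is a directory on a local file system. The file names `metadata.json`, `LICENSE.txt` and `manifest-sha512.txt`, the `@context` value, the choice of SHA-512 and the function names are all assumptions for illustration; the protocols only ask that the formats have an Open Specification and are documented.

```python
import hashlib
import json
import os

METADATA_FILE = "metadata.json"        # hypothetical file name (assumption)
LICENSE_FILE = "LICENSE.txt"           # hypothetical file name (assumption)
MANIFEST_FILE = "manifest-sha512.txt"  # hypothetical file name (assumption)


def write_metadata(obj_dir, object_id, name):
    """Store a linked-data description next to the data files (Protocol 2.1).

    The 'license' property points at a license document held in the same
    Storage Object (Protocol 2.3).
    """
    doc = {
        "@context": "https://schema.org/",
        "@id": object_id,
        "@type": "Dataset",
        "name": name,
        "license": {"@id": LICENSE_FILE},
    }
    with open(os.path.join(obj_dir, METADATA_FILE), "w", encoding="utf-8") as f:
        json.dump(doc, f, indent=2)
    return doc


def sha512_of(path):
    """Return the SHA-512 hex digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(obj_dir):
    """Write one 'digest  path' line per file in the object (Protocol 2.4)."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(obj_dir):
        for filename in filenames:
            if filename == MANIFEST_FILE:
                continue  # the manifest does not list itself
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, obj_dir).replace(os.sep, "/")
            entries.append((rel, sha512_of(full)))
    with open(os.path.join(obj_dir, MANIFEST_FILE), "w", encoding="utf-8") as f:
        for rel, digest in sorted(entries):
            f.write(f"{digest}  {rel}\n")


def verify_manifest(obj_dir):
    """Return the paths that are missing or whose checksum no longer matches."""
    problems = []
    with open(os.path.join(obj_dir, MANIFEST_FILE), encoding="utf-8") as f:
        for line in f:
            digest, rel = line.rstrip("\n").split("  ", 1)
            full = os.path.join(obj_dir, rel)
            if not os.path.isfile(full) or sha512_of(full) != digest:
                problems.append(rel)
    return problems
```

In this sketch the manifest covers the metadata and license files as well as the data files, so a later change to any of them shows up when `verify_manifest` is run.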
3

Governance is in place for each Archival Repository.

3.1
The purpose of the Archival Repository holding the data is articulated.
3.2
Management systems are in place to sustain the Archival Repository.
3.3
Deposit agreements are in place and documented, setting out the rights needed for the Archival Repository as an organization to manage data.
3.4
Processes are in place for ensuring data persistence for the defined periods that meet the repository purpose (including indefinitely).
3.5
Processes are in place for disposal/deaccessioning, if appropriate to the purpose.


Definitions (glossary):

The following terms (used in capitalized form) are defined.

Archival Repository

Used to cover any system that is designed to keep data securely for a defined period of time (often forever), and to make it findable by and available to appropriate parties. The terms Repository and Archive have different nuances and are used in a variety of ways in different communities, but here we want to emphasize the commonalities and focus on advice that is relevant to the audience of these protocols.

Data Steward or Custodian

An individual or organization with the authority to make decisions regarding data under management. This decision-making process is assumed to take place with good governance, in line with the CARE principles.

Digital Preservation

The Digital Preservation Coalition page “What is digital preservation?” defines Digital Preservation as:

The series of managed activities necessary to ensure continued access to digital materials for as long as necessary; this refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological and organizational change.

File

A computer file is an aggregation of data on a storage device, identified by a name.

(This definition comes from a discussion thread on Wikipedia (TalkComputerFile2024).)

File Format

The organizational schema for a file — this might be formally defined in a specification or be ad hoc. File formats might be considered at various layers of specificity — for example, a text file might be plain text with a specific encoding, such as UTF-8, and also be an XML file conforming to a particular schema.

Linked Data Metadata

Metadata is data that describes other data. Linked Data Metadata follows the principles set out by Tim Berners-Lee for Linked Data, so that all metadata terms and references to the entities described are URIs (URLs) (Berners-Lee (n.d.)).
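As an illustration of the definition above, the following sketch expands short property names into full URIs using a context mapping. It is a deliberately simplified stand-in for the idea behind JSON-LD contexts, not an implementation of JSON-LD; the `CONTEXT` mapping, the `expand` function and the example identifiers are assumptions made for this example.

```python
# Hypothetical context: short term -> full property URI (assumption).
CONTEXT = {
    "name": "http://schema.org/name",
    "license": "http://schema.org/license",
    "author": "http://schema.org/author",
}


def expand(description, context=CONTEXT):
    """Replace each short property name with the URI that defines it.

    Keys starting with '@' (such as '@id') are kept unchanged. A term with
    no URI in the context is rejected, because it would not be linked data.
    """
    expanded = {}
    for key, value in description.items():
        if key.startswith("@"):
            expanded[key] = value
        elif key in context:
            expanded[context[key]] = value
        else:
            raise KeyError(f"no URI defined for term {key!r}")
    return expanded


# Example description: the entity, its license and its author are all
# referenced by URI (the example.org identifiers are placeholders).
description = {
    "@id": "https://example.org/id/item-1",
    "name": "Example recording",
    "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
    "author": {"@id": "https://example.org/id/person-1"},
}

if __name__ == "__main__":
    for prop, value in expand(description).items():
        print(prop, value)
```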

Open Specification

A versioned, published, openly available description of a set of precise requirements (e.g. for a format, system or protocol), which might or might not be endorsed by a standards authority.

Open Source Software

Freely distributable software, according to the Open Source Definition of the Open Source Initiative (OSI).

License

The term License is used here inclusively to refer to a document which captures the terms under which data in an Archival Repository may be shared, used, reused or deposited. This includes documents such as Data Sharing Agreements or other contracts which might be negotiated at various times, which give certain parties licence to use data in defined ways.

Repository Collection

The term Repository Collection is used here to reference the Collection class from the Portland Common Data Model (“Duraspace/Pcdm” (n.d.)), which was conceived as an interchange format for repositories and digital libraries. The definition for a collection includes this:

“A Collection is a group of resources. Collections have descriptive metadata, access metadata, and may links (sic) to works and/or collections.”

Repository Object

The term Repository Object is used here in line with the Portland Common Data Model definition (“Duraspace/Pcdm” (n.d.)), which refers to an abstract object:

“An Object is an intellectual entity, sometimes called a ‘work’, ‘digital object’, etc. Objects have descriptive metadata, access metadata and might contain files and other Objects as member ‘components’. Each level of a work is therefore represented by an Object instance, and is capable of standing on its own, being linked to from Collections and other Objects.”

Standard

A Specification published by a recognized standards body, such as the ISO or W3C. Standards are not always Open Access, so might have barriers to adoption.

Storage Object

A discrete unit in a physical storage service. This might represent, for example, a Repository Object or a Repository Collection, which are abstract structural concepts. This concept is similar to an OCFL Object, and the concept of a Package in OAIS.


See the Notes and Guidance for more detail about implementing the PILARS.


References

Berners-Lee, T. n.d. Linked Data, 2006. http://www.w3.org/DesignIssues/LinkedData.html.
Carroll, Stephanie Russo, Ibrahim Garba, Oscar L. Figueroa-Rodríguez, et al. 2020. The CARE Principles for Indigenous Data Governance. https://doi.org/10.5334/dsj-2020-043.
“Duraspace/Pcdm.” n.d. In GitHub. Accessed May 16, 2016. https://github.com/duraspace/pcdm.
Harris, Amanda, Nick Thieberger, and Linda Barwick. 2015. Research, Records and Responsibility: Ten Years of PARADISEC. Sydney University Press. https://doi.org/10.30722/sup.9781743324431.
“OAIS Reference Model (ISO 14721).” n.d. In OAIS Reference Model (ISO 14721). Accessed April 16, 2024. http://www.oais.info/.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (March): 160018. https://doi.org/10.1038/sdata.2016.18.

