This document contains implementation notes and guidance on the PILARS.
PILARS implementation and guidance by Sefton et al. is licensed under CC BY 4.0.
Very sensitive data might need to be stored on physically secured storage, such as hard drives stored in a safe or an air-gapped server, and a variety of storage systems might need to be combined for a large collection (e.g. public data stored in the cloud, access-controlled data kept locally). Encrypting sensitive data can allow technologists to work on the servers without being able to see the data, but the data might not be recoverable if the encryption keys are lost.
Directory-like storage hierarchies might group data together in collections or by rights holder on similar paths, to aid in moving data between services using file-system tools (unlike approaches which completely obscure all meaning in paths by using hash algorithms, for reasons such as optimising the storage address space).
However, to avoid ambiguity in implementing Protocol 1.2.2, a specification such as OCFL might require that object paths under the storage root are not nested within each other.
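As an illustration of this non-nesting rule, the following sketch (in Python, using hypothetical relative-path conventions that are not part of any specification) checks that no Storage Object path under a storage root sits inside another:

```python
from pathlib import PurePosixPath

def check_no_nesting(object_paths):
    """Raise ValueError if any Storage Object path is nested inside another.

    object_paths are Storage Object paths relative to the storage root
    (a hypothetical convention used only for this sketch).
    """
    roots = {PurePosixPath(p) for p in object_paths}
    for path in roots:
        for ancestor in path.parents:
            if ancestor in roots:
                raise ValueError(f"{path} is nested inside Storage Object {ancestor}")

check_no_nesting(["collection-a/obj-1", "collection-a/obj-2"])       # passes
# check_no_nesting(["collection-a/obj-1", "collection-a/obj-1/sub"])  # would raise
```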
Deciding on the granularity of objects (Protocol 1.2) involves considering a number of factors:
Files with the same licensing conditions for re-use should be grouped together.
If some content might need to be withdrawn or withheld for cultural, ethical or legal reasons, then objects should be packaged so that this can be done without having to create new versions of objects.
Regarding Protocol 1.7 (If data resides in systems, such as content management systems or database applications which do not inherently support all of the Protocols 1 and 2, then put processes in place to export data to a system that does.):
Don’t design or build systems that don’t have this exit pathway designed in from the start.
Don’t place data in systems that don’t already have this exit pathway.
Work to add this capability straight away if existing systems lack it.
Protocol 1.3 (that an ID resolution method is in place and documented) ensures that data can be referenced remotely and that inter-object relationships within a repository are supported. This supports Protocol 2.5: using the ID-to-path algorithm, a Storage Object can be located without requiring the repository to be indexed.
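One possible ID-to-path algorithm is sketched below; it is illustrative only, and the protocols do not prescribe one (OCFL storage layout extensions and pairtree are existing alternatives). The identifier is percent-encoded so it is filesystem-safe, then split into short segments:

```python
from urllib.parse import quote

def id_to_path(object_id: str) -> str:
    """Map an object ID to a relative storage path.

    The ID is percent-encoded, then split into two-character segments
    to keep directory fan-out manageable.
    """
    encoded = quote(object_id, safe="")
    return "/".join(encoded[i:i + 2] for i in range(0, len(encoded), 2))

# e.g. "https://example.org/item/42" -> "ht/tp/s%/3A/%2/F%/2F/ex/am/pl/e./or/g%/2F/it/em/%2/F4/2"
```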
The Oxford Common File Layout (OCFL) specification was developed to support a similar set of requirements to Protocol 1, and is being widely implemented in repository and digital library systems. LDaCA uses OCFL and has demonstrated that it works at the scale of tens of thousands of objects and millions of files.
OCFL is a compliant implementation choice for Protocol 1 and helps ensure data portability.
A simpler alternative storage approach, based on different standards, would be to:
Use a directory hierarchy on a POSIX-compliant storage system.
Signal that a directory is a Storage Object (Protocol 1.2) by the presence of a BagIt manifest in it.
Bags nested below this Storage Object would not be considered Storage Objects in their own right in the context of the repository.
BagIt provides checksums as per Protocol 2.4.
Use a documented and implementable algorithm to map object IDs to paths, and store a summary of it in the root of the directory hierarchy.
Note that neither this example using BagIt nor OCFL implements Protocol 2.
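A minimal sketch of how such a hierarchy might be walked to find Storage Objects, assuming the convention above and using the standard bagit.txt declaration file as the marker, without descending into nested bags:

```python
import os

def find_storage_objects(storage_root: str):
    """Yield directories that declare themselves as bags (contain bagit.txt).

    Once a bag is found, its subdirectories are not searched, so nested bags
    are not treated as Storage Objects in their own right.
    """
    for dirpath, dirnames, filenames in os.walk(storage_root):
        if "bagit.txt" in filenames:
            yield dirpath
            dirnames.clear()  # stop os.walk from descending below this object

for obj in find_storage_objects("/data/archival-storage"):  # hypothetical root
    print(obj)
```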
Linked Data allows:
any conceivable data structure to be described in metadata, with no limit on the size and scope of the description
extensibility: vocabularies to be mixed in as needed, from a core set for all data to domain-specific to project or even dataset-specific terms; this can be formalised using Profiles
interoperability with contemporary global research information systems architectures, discovery services, etc.
inter-object relationships to be expressed – so that the physical storage layout does not define the only way that resources can be explored.
Note: Linked Data is often conflated with Open Access; the phrase Linked Open Data is very common. But the Linked Data principles can also be applied to non-open, access-controlled materials. Non-open materials do have some impact on how catalogues and indexes are implemented; for example, using linked-data graphs such as triple stores that contain multiple items might make access-control-safe queries impractical to implement for reasons of complexity and performance, as the reasoning needed to work out whether a particular metadata statement can be seen by a particular user could be very computationally expensive.
Communities can maintain schemas/vocabularies for specific domains (FAIR-R1.3) and document metadata profiles.
Documenting a profile can be as simple as writing a natural-language document to describe what is expected in a particular context, but can be further codified using Linked Data Schemas (or Ontologies or Vocabularies), from which documentation and validation services might be derived.
The FAIR principles mandate a License for data (FAIR-R1.1).
Licenses might be based on copyright law, or rely on other mechanisms based on other rights (such as privacy or trade secrets). Licenses should reflect the will of rights holders, and will likely be administered, chosen or written by a Data Steward or Custodian authorised to act on behalf of the rights holders, or by the rights holder themselves.
It is important to have a license document that is independent of algorithms or configuration file(s), and independent of any particular authorisation system that might be built into a software application. Implemented processes are, of course, needed to make data available, and these might include some automation. But to make data truly portable across systems and time, it is essential that human-readable documentation is available in case authorisation processes need to be rebuilt or redesigned.
For Protocol 2.3.1, an access-control system might need to be put in place if there is data under non-open licenses in the Archival Repository. An individual data access system might implement a range of solutions:
lists of approved data licensees associated with logins local to the system
a separate license management system (or systems) that more than one data access system can consult to see whether a user holds a particular license
manual offline access, with direct transfer of data to approved applicants.
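As a sketch of the first option above (all user names and license URLs here are hypothetical), a data access system might keep a local mapping from license to approved users:

```python
# Local list of approved data licensees, keyed by license identifier.
APPROVED_LICENSEES = {
    "https://example.org/licenses/restricted-speech-corpus-v1": {"alice", "bob"},
}

OPEN_LICENSES = {"https://creativecommons.org/licenses/by/4.0/"}

def can_access(user_id: str, object_license: str) -> bool:
    """Return True if the object's license is open, or the user has been
    approved as a licensee for that license."""
    if object_license in OPEN_LICENSES:
        return True
    return user_id in APPROVED_LICENSEES.get(object_license, set())

print(can_access("alice", "https://example.org/licenses/restricted-speech-corpus-v1"))  # True
```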
The concept of a Collection is very commonly used in repositories and digital archives to represent the ‘backbone’ of the way resources, including sub-collections and repository Objects, are organized. This structure is described by the Portland Common Data Model (PCDM) (“Duraspace/Pcdm” (n.d.)). PCDM was defined to facilitate interchange of data between repositories, which makes it a sensible core organizing principle for PILARS; describing an Archival Repository using PCDM means it will be possible to move data to other systems in the future.
These protocols assume that once storage is taken care of, with data divided into Storage Objects and well described, it will be made available for consumption via one or more indexes, which could take the form of web portals or discovery databases, or be as simple as a spreadsheet.
To reduce confusion from the overloading of the term “Object” in various contexts, the RO-Crate standard renames the PCDM Collection and Object classes as RepositoryCollection and RepositoryObject respectively. A RepositoryCollection may be stored in a single Storage Object with its member RepositoryObjects, or each RepositoryObject may be stored in a separate Storage Object, or in some cases be fragmented into multiple Storage Objects.
The choice of whether to store Collections as a single Storage Object (e.g. a directory or directory-like node in a file hierarchy) or as a number of Storage Objects might be influenced by several factors:
the size of the RepositoryObject; finer granularity might be preferred to keep Storage Objects at a manageable size
how likely the content is to change or be withdrawn, for example, because a participant wishes to have content by or about them removed from an Archival Repository; a granularity of an Object per participant/creator or cohort might make withdrawing access more manageable, by updating a License
whether there is a separation made between archival and dissemination-ready copies of files, as per OAIS (“OAIS Reference Model (ISO 14721)” (n.d.))
whether files are stored with, or separately from, annotations on those files (such as transcripts)
licensing of objects; from an implementation point of view, it is much simpler to have a single reuse license per Object than to try to administer very granular permissions.
To comply with Protocol 2.5, RepositoryCollection / RepositoryObject relationships might be described using either exhaustive lists of members with a property such as the Portland Common Data Model’s pcdm:hasMember, or by referencing the containing Collection from sub-collections or Repository Objects using the inverse property pcdm:memberOf.
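For illustration, a minimal RO-Crate-style JSON-LD fragment (built here in Python; the identifiers are hypothetical, and in practice only one of the two properties is needed) expressing both directions of membership:

```python
import json

graph = [
    {
        "@id": "https://example.org/collection/interviews",
        "@type": "RepositoryCollection",
        "pcdm:hasMember": [{"@id": "https://example.org/object/interview-01"}],
    },
    {
        "@id": "https://example.org/object/interview-01",
        "@type": "RepositoryObject",
        # The inverse link back to the containing collection.
        "pcdm:memberOf": {"@id": "https://example.org/collection/interviews"},
    },
]
print(json.dumps(graph, indent=2))
```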
RepositoryCollections and RepositoryObjects which have been stored (fragmented) across multiple Storage Objects can be linked back together using an index for presentation and access (this is one of the strengths of Linked Data). For example, a recording and its verbatim transcript might be stored together in a Storage Object available to a very limited cohort, while anonymised transcripts and audio might be in another Storage Object made broadly available; these can be cross-linked and presented as a single entity to authorised users.
A variety of virtual Collection-like aggregations of Objects can be created via metadata indexes using user-interface devices such as search facets. To comply with Protocol 1 and Protocol 2, these aggregations should be designed so that the metadata on which they depend is stored with the data object, not solely in an application or workspace environment.
The specifics of good governance are out of scope for these protocols, so we confine ourselves to some notes.
Protocol 3.3 and Protocol 3.4 together mean (at least) two things:
robust appraisal or sentencing practices are in place
robust preservation processes have been established.
The Digital Preservation Handbook (Digital Preservation Coalition (2015)) has detailed practical advice about preservation-focussed archival practice.
Research Object Crate (RO-Crate) (Soiland-Reyes et al. (2022)) is a linked-data metadata specification based on widely used open Linked Data specifications. RO-Crate was developed as a packaging method for describing datasets and their contents, which makes it a good match for describing Storage Objects, and it is now being widely adopted in various research contexts. RO-Crate has been demonstrated to work at scale in the Language Data Commons of Australia (LDaCA) and PARADISEC as the basis for Archival Repositories.
RO-Crate supports Protocol 2.5 via the inclusion of the Portland Common Data Model, a schema for describing repositories with classes for Collections, Objects and Files (“Duraspace/Pcdm” (n.d.)).
RO-Crate is a compliant choice for Protocol 2 in a PILARS implementation, and is used in both PARADISEC and LDaCA.
There are two important implementation approaches to storing metadata in storage systems, which are discussed in this blog post (jrochkind (2023)):
The metadata in the store is treated as the source of truth (this is the approach implemented in the set of tools used by LDaCA).
A catalogue application maintains metadata, which is the source of truth, and exports data to the data store (the approach taken by PARADISEC).
Protocol 1.7 deals with this; it is important for the overall goals of these protocols that there are systems in place to write well-described (annotated) data to commodity storage.
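A minimal sketch of such an export step is shown below; the metadata file name follows the RO-Crate convention, while the record structure and paths are hypothetical. The catalogue's metadata is written into the Storage Object alongside the data files, so the description travels with the data.

```python
import json
import shutil
from pathlib import Path

def export_object(catalogue_record: dict, data_files: list, storage_object_dir: str) -> None:
    """Write a catalogue record and its data files into a Storage Object directory."""
    target = Path(storage_object_dir)
    target.mkdir(parents=True, exist_ok=True)
    # The metadata is stored with the data, not only in the catalogue application.
    (target / "ro-crate-metadata.json").write_text(
        json.dumps(catalogue_record, indent=2), encoding="utf-8"
    )
    for src in data_files:
        shutil.copy2(src, target / Path(src).name)
```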
See the RRKive Implementations page for a list of compliant software that implements the PILARS.