Data Package
Authors | Rufus Pollock, Paul Walsh, Adam Kariv, Evgeny Karev, Peter Desmet, Data Package Working Group |
---|---|
Profile | /profiles/2.0/datapackage.json |
A simple container format for describing a coherent collection of data in a single package. It provides the basis for convenient delivery, installation and management of datasets.
Language
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
Introduction
A Data Package consists of:
- Metadata that describes the structure and contents of the package
- Resources such as data files that form the contents of the package
The Data Package metadata is stored in a “descriptor”. This descriptor is what makes a collection of data a Data Package. The structure of this descriptor is the main content of the specification below.
In addition to this descriptor a data package will include other resources such as data files. The Data Package specification does NOT impose any requirements on their form or structure and can therefore be used for packaging any kind of data.
The data included in the package can be provided as:
- Files bundled locally with the package descriptor
- Remote resources, referenced by URL
- “Inline” data (see below) which is included directly in the descriptor
Structure
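A minimal data package on disk would be a directory containing a single file, the descriptor datapackage.json.
Lacking even a single external source of data, such a package would be of limited use. A slightly less minimal version would add one data file (for example, a CSV file) alongside the descriptor.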
Additional files such as a README, scripts (for processing or analyzing the data) and other material may be provided. By convention, scripts go in a scripts directory, so a more elaborate data package could look like this:
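The layout described above can be sketched by building it in a temporary directory. This is a minimal illustration, not a normative layout: the file names README.md, data/example.csv and scripts/prepare.py, the package name, and the resource entry are all assumptions chosen for the example. Only datapackage.json and the scripts directory convention come from the text.

```python
import json
import tempfile
from pathlib import Path


def build_example_package(root: Path) -> Path:
    """Create a hypothetical package layout under root; return the descriptor path."""
    (root / "data").mkdir()
    (root / "scripts").mkdir()
    # Optional supporting material (names are illustrative assumptions).
    (root / "README.md").write_text("# Example package\n", encoding="utf-8")
    (root / "data" / "example.csv").write_text("id,value\n1,10\n", encoding="utf-8")
    (root / "scripts" / "prepare.py").write_text("# placeholder script\n", encoding="utf-8")
    # The descriptor sits at the top level of the package directory.
    descriptor = {
        "name": "example-package",
        "resources": [{"name": "example", "path": "data/example.csv"}],
    }
    descriptor_path = root / "datapackage.json"
    descriptor_path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return descriptor_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_example_package(root)
    # Collect every entry relative to the package root, using POSIX separators.
    layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print("\n".join(layout))
```

The data file could equally live in the top-level directory; the data subdirectory is just one possible arrangement.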
Several example data packages can be found in the datasets organization on GitHub, including:
Descriptor
A Data Package descriptor MUST be a descriptor as per the Descriptor definition. A list of standard properties that can be included in a descriptor is defined in the Properties section.
When a data package is available as a physical or virtual directory containing files, the descriptor MUST be named datapackage.json and it MUST be placed at the top level of the file structure (relative to any other resources provided as part of the data package).
The descriptor is the central file in a Data Package. It provides:
- General metadata such as the package’s title, license, publisher etc
- A list of the data “resources” that make up the package including their location on disk or online and other relevant information (including, possibly, schema information about these data resources in a structured form)
An example of a Data Package descriptor:
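As an illustrative sketch, a small descriptor can be written as a Python dict and serialized to JSON. Every value here (package name, title, license entry, resource name and path) is an assumption made for the example; only the property names and the profile URL come from this specification.

```python
import json

# All values are illustrative assumptions; property names follow the specification.
descriptor = {
    "$schema": "https://datapackage.org/profiles/2.0/datapackage.json",
    "name": "example-package",
    "title": "An Example Data Package",
    "licenses": [
        {
            "name": "ODC-PDDL-1.0",
            "path": "http://opendatacommons.org/licenses/pddl/",
            "title": "Open Data Commons Public Domain Dedication and License v1.0",
        }
    ],
    "resources": [
        {"name": "example", "path": "data/example.csv"},
    ],
}

# The serialized form is what would be stored in datapackage.json.
descriptor_json = json.dumps(descriptor, indent=2)
print(descriptor_json)
```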
Properties
A Data Package descriptor MUST have a resources property and SHOULD have name, id, licenses, and profile properties.
resources
[required]
The resources property is REQUIRED, with at least one resource.
Packaged data resources are described in the resources property of the package descriptor. This property MUST be an array of objects. Each object MUST follow the Data Resource specification.
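The structural rules for resources can be checked with a few lines of code. The sketch below uses a hypothetical helper, check_resources, and covers only what is stated here: the property is present, is an array, is non-empty, and contains objects. It does not validate each object against the Data Resource specification.

```python
def check_resources(descriptor: dict) -> list:
    """Return a list of violations of the resources rules; empty if none are found."""
    if "resources" not in descriptor:
        return ["resources property is REQUIRED"]
    resources = descriptor["resources"]
    if not isinstance(resources, list):
        return ["resources MUST be an array"]
    if not resources:
        return ["resources MUST contain at least one resource"]
    # Each item must be an object (a dict once parsed from JSON).
    return [
        f"resources[{i}] MUST be an object"
        for i, item in enumerate(resources)
        if not isinstance(item, dict)
    ]


# Illustrative inputs (names and paths are assumptions).
print(check_resources({"resources": [{"name": "example", "path": "data.csv"}]}))
print(check_resources({"resources": []}))
```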
$schema
A root level Data Package descriptor MAY have a $schema property that MUST be a profile as per the Profile definition that MUST include all the metadata constraints required by this specification.
The default value is https://datapackage.org/profiles/1.0/datapackage.json and the recommended value is https://datapackage.org/profiles/2.0/datapackage.json.
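One way to read the default rule: a consumer that finds no $schema property falls back to the 1.0 profile. A minimal sketch, assuming a hypothetical helper named effective_profile:

```python
DEFAULT_PROFILE = "https://datapackage.org/profiles/1.0/datapackage.json"
RECOMMENDED_PROFILE = "https://datapackage.org/profiles/2.0/datapackage.json"


def effective_profile(descriptor: dict) -> str:
    """Return the declared $schema, or the default profile when it is absent."""
    return descriptor.get("$schema", DEFAULT_PROFILE)


# A descriptor without $schema is treated as using the default profile.
print(effective_profile({"resources": []}))
print(effective_profile({"$schema": RECOMMENDED_PROFILE, "resources": []}))
```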
name
The name is a simple name or identifier to be used for this package in relation to any registry in which this package will be deposited.
- It SHOULD be human-readable and consist only of lowercase English alphanumeric characters plus period (.), hyphen (-) and underscore (_).
- It SHOULD be unique in relation to any registry in which this package will be deposited (and preferably globally unique).
- It SHOULD be invariant, meaning that it SHOULD NOT change when a data package is updated, unless the new package version SHOULD be considered a distinct package, e.g. due to significant changes in structure or interpretation. Version distinction SHOULD be left to the version property. As a corollary, the name also SHOULD NOT include an indication of time range covered.
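The character rule for name maps directly onto a regular expression. The sketch below uses a hypothetical helper, is_recommended_name, and checks only the allowed characters. Uniqueness and invariance cannot be checked from the string alone.

```python
import re

# Lowercase English alphanumerics plus ".", "-" and "_".
NAME_PATTERN = re.compile(r"[a-z0-9._-]+")


def is_recommended_name(name: str) -> bool:
    """Return True if name uses only the characters the specification recommends."""
    return NAME_PATTERN.fullmatch(name) is not None


# Illustrative names (assumptions, not taken from a registry).
print(is_recommended_name("country-codes"))
print(is_recommended_name("Country Codes"))
```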
id
A property reserved for globally unique identifiers. Examples of identifiers that are unique include UUIDs and DOIs.
A common usage pattern for Data Packages is as a packaging format within the bounds of a system or platform. In these cases, a unique identifier for a package is desired for common data handling workflows, such as updating an existing package. While global uniqueness cannot be validated at the level of the specification, consumers using the id property MUST ensure identifiers are globally unique.
Examples:
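As a sketch of the UUID case, an identifier can be generated with Python's standard uuid module and stored in the id property. The resource entry is an illustrative assumption. A DOI could be used in the same position but would have to be issued by a registration agency rather than generated locally.

```python
import uuid

# Generate a random (version 4) UUID and use its string form as the id.
descriptor = {
    "id": str(uuid.uuid4()),
    "resources": [{"name": "example", "path": "data/example.csv"}],
}

print(descriptor["id"])
```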
licenses
The license(s) under which the package is provided.
licenses MUST be an array. Each item in the array is a License. Each MUST be an object. The object MUST contain a name property and/or a path property, and it MAY contain a title property:
- name: A string containing an Open Definition license ID.
- path: A URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
- title: A string containing a human-readable title.
An example of using the licenses property:
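The following sketch pairs a sample licenses value with a small check of the MUST rules above. The helper name check_licenses is hypothetical and the license entry is illustrative. The check covers only the structural rules (array, objects, name and/or path), not whether name is a valid Open Definition license ID.

```python
# Illustrative licenses value.
licenses = [
    {
        "name": "ODC-PDDL-1.0",
        "path": "http://opendatacommons.org/licenses/pddl/",
        "title": "Open Data Commons Public Domain Dedication and License v1.0",
    }
]


def check_licenses(value) -> list:
    """Return a list of violations of the licenses rules; empty if none are found."""
    if not isinstance(value, list):
        return ["licenses MUST be an array"]
    errors = []
    for i, item in enumerate(value):
        if not isinstance(item, dict):
            errors.append(f"licenses[{i}] MUST be an object")
        elif "name" not in item and "path" not in item:
            errors.append(f"licenses[{i}] MUST contain name and/or path")
    return errors


print(check_licenses(licenses))
```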
title
A string providing a title or one-sentence description for this package.
description
A description of the package in CommonMark Markdown syntax. As any sequence of characters is a valid CommonMark document, this allows for simple plain text. The first paragraph SHOULD be separated by a blank line from following content to be usable as summary information of the package.
homepage
A URL for the home on the web that is related to this data package.
image
An image to use for this data package. For example, when showing the package in a listing.
The value of the image property MUST be a string pointing to the location of the image. The string MUST be a URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
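The "URL or Path" rule can be approximated with the standard library. The helper below, is_url_or_path, is a hypothetical sketch: it accepts http and https URLs with a host, and non-empty relative POSIX paths. Treating https as a "fully qualified HTTP address" is an assumption of this sketch, and any further path restrictions defined elsewhere are not covered.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_url_or_path(value: str) -> bool:
    """Return True for an http(s) URL with a host, or a relative POSIX path."""
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        return bool(parsed.netloc)
    if parsed.scheme:
        # Any other scheme (ftp, file, ...) is rejected by this sketch.
        return False
    return value != "" and not PurePosixPath(value).is_absolute()


# Illustrative image locations (assumptions).
print(is_url_or_path("https://example.org/logo.png"))
print(is_url_or_path("images/logo.png"))
print(is_url_or_path("/absolute/logo.png"))
```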
version
A version string identifying the version of the package. It SHOULD conform to the Semantic Versioning requirements and SHOULD follow the Data Package Version recipe.
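A simplified check for the core MAJOR.MINOR.PATCH form of Semantic Versioning is sketched below. The helper name is hypothetical, and the pattern deliberately ignores pre-release and build metadata suffixes, so it is stricter than the full Semantic Versioning grammar.

```python
import re

# Core form only: three dot-separated non-negative integers without leading zeros.
CORE_SEMVER = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_core_semver(version: str) -> bool:
    """Return True if version has the plain MAJOR.MINOR.PATCH form."""
    return CORE_SEMVER.fullmatch(version) is not None


print(is_core_semver("1.0.0"))
print(is_core_semver("1.0"))
```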
created
The datetime on which this was created.
Note: semantics may vary between publishers — for some this is the datetime the data was created, for others the datetime the package was created.
The datetime MUST conform to the string formats for datetime as described in RFC3339. Example:
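As a sketch, a timezone-aware datetime serialized with Python's isoformat yields a string with an explicit UTC offset, which is one of the forms RFC3339 permits. The date used here is an arbitrary assumption.

```python
from datetime import datetime, timezone

# An arbitrary, timezone-aware timestamp (the date itself is an assumption).
moment = datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)

# isoformat() on an aware datetime includes the UTC offset.
created = moment.isoformat()
descriptor_fragment = {"created": created}

print(descriptor_fragment)
```

A naive datetime (without tzinfo) would serialize without an offset and would not satisfy RFC3339, so attaching a timezone before serializing matters.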
keywords
An array of string keywords to assist users searching for the package in catalogs.
contributors
The people or organizations who contributed to this Data Package. It MUST be an array. Each entry is a Contributor and MUST be an object. A Contributor MUST have at least one property. A Contributor is RECOMMENDED to have a title property and MAY contain givenName, familyName, path, email, roles, and organization properties:
- title: A string containing a name of the contributor.
- givenName: A string containing the name a person has been given, if the contributor is a person.
- familyName: A string containing the familial name that a person inherits, if the contributor is a person.
- path: A fully qualified URL pointing to a relevant location online for the contributor.
- email: A string containing an email address.
- roles: An array of strings describing the roles of the contributor. A role is RECOMMENDED to follow an established vocabulary, such as DataCite Metadata Schema’s contributorRole or CRediT. Useful roles to indicate are: creator, contact, rightsHolder, and dataCurator.
- organization: A string describing the organization this contributor is affiliated to.
An example of the object structure is as follows:
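The sketch below shows a contributors value with one person and one organization, together with a check of the MUST rules (array of objects, each with at least one property). All names, addresses and URLs are invented for illustration, and check_contributors is a hypothetical helper.

```python
# Illustrative contributors; every value is an assumption made for the example.
contributors = [
    {
        "title": "Jane Doe",
        "givenName": "Jane",
        "familyName": "Doe",
        "path": "https://example.org/people/jane",
        "email": "jane@example.org",
        "roles": ["creator", "contact"],
        "organization": "Example Organization",
    },
    {"title": "Example Organization", "roles": ["rightsHolder"]},
]


def check_contributors(value) -> list:
    """Return a list of violations of the contributors rules; empty if none are found."""
    if not isinstance(value, list):
        return ["contributors MUST be an array"]
    errors = []
    for i, item in enumerate(value):
        if not isinstance(item, dict):
            errors.append(f"contributors[{i}] MUST be an object")
        elif not item:
            errors.append(f"contributors[{i}] MUST have at least one property")
    return errors


print(check_contributors(contributors))
```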
Use of the creator role does not imply that the person was the original creator of the data in the data package, merely that they created and/or maintain the data package. It is common for data packages to “package” up data from elsewhere. The origin of the data can be indicated with the sources property (see below).
References:
sources
The raw sources for this data package. It MUST be an array of Source objects. A Source object MUST have at least one property. A Source object is RECOMMENDED to have a title property and MAY have path, email, and version properties:
- title: A string containing a title of the source (e.g. document or organization name).
- path: A URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
- email: A string containing an email address.
- version: A string containing a version of the source.
An example of the object structure is as follows:
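A sources value might look like the sketch below, which also reports Source objects that lack the RECOMMENDED title. The source entries are invented for illustration, and sources_missing_title is a hypothetical helper. A missing title is worth a warning but is not a violation, since a Source only needs at least one property.

```python
# Illustrative sources; every value is an assumption made for the example.
sources = [
    {
        "title": "Example Statistics Office",
        "path": "https://example.org/statistics",
        "email": "stats@example.org",
        "version": "2024-01",
    },
    # Valid (it has one property) but lacks the recommended title.
    {"path": "raw/survey.csv"},
]


def sources_missing_title(value: list) -> list:
    """Return the indexes of Source objects without a title property."""
    return [i for i, source in enumerate(value) if "title" not in source]


print(sources_missing_title(sources))
```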
Related Work
Data Package draws content and/or inspiration from, among others, the following specifications and implementations: