Data Catalog
Authors: Michael Joseph Rosenthal
There are scenarios where one needs to describe a collection of data packages, such as when building an online registry, or when building a pipeline that ingests multiple datasets.
In these scenarios, the collection can be described using a “Catalog”, where each dataset is represented as a single resource which has:
{ "profile": "data-package", "format": "json"}
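As an illustration, the resource entry described above could be built programmatically. This is a minimal sketch, not part of the specification: the helper name catalog_entry, the catalog name "example-catalog", and the resource name "co2-ppm" are assumptions made for the example.

```python
import json


def catalog_entry(path, name=None):
    """Build the resource object that stands for one data package."""
    # "profile" and "format" take the values given in the text above.
    entry = {"profile": "data-package", "format": "json", "path": path}
    if name is not None:
        entry["name"] = name
    return entry


catalog = {
    "profile": "data-package-catalog",
    "name": "example-catalog",  # hypothetical catalog name
    "resources": [
        catalog_entry(
            "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json",
            name="co2-ppm",  # hypothetical resource name
        )
    ],
}

print(json.dumps(catalog, indent=2))
```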
Specification
The Data Package Catalog builds directly on the Data Package specification. Thus a Data Package Catalog MUST be a Data Package and conform to the Data Package specification.
The Data Package Catalog has the following requirements over and above those imposed by Data Package:
- There MUST be a "profile" property with the value "data-package-catalog", or a "profile" that extends it
- Each resource MUST also be a Data Package
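The two requirements above can be approximated with a shallow structural check. The sketch below rests on several assumptions: the specification does not say how a profile that "extends" the catalog profile is detected, so a hypothetical allowlist (CATALOG_PROFILES) stands in for that; a resource is treated as "looking like" a Data Package if it has either a "path" or inline "resources"; and nothing is fetched or validated against the full Data Package specification.

```python
# Hypothetical allowlist: the spec does not define how "extends" is detected.
CATALOG_PROFILES = {"data-package-catalog", "tabular-data-package-catalog"}


def check_catalog(descriptor):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []

    # Rule 1: the profile must be the catalog profile or an extension of it.
    if descriptor.get("profile") not in CATALOG_PROFILES:
        problems.append("profile is not a known catalog profile")

    # Rule 2 (shallow): each resource must look like a Data Package, i.e. it
    # either points at a descriptor via "path" or carries inline "resources".
    resources = descriptor.get("resources")
    if not isinstance(resources, list) or not resources:
        problems.append("resources must be a non-empty list")
        return problems
    for index, resource in enumerate(resources):
        if not isinstance(resource, dict):
            problems.append(f"resource {index} is not an object")
        elif "path" not in resource and "resources" not in resource:
            problems.append(
                f"resource {index} has neither path nor inline resources"
            )
    return problems
```

A complete validator would also resolve each "path" and validate the referenced descriptor as a Data Package; that step is left out here on purpose.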
Examples
A generic package catalog:
{
  "profile": "data-package-catalog",
  "name": "climate-change-packages",
  "resources": [
    {
      "profile": "json-data-package",
      "format": "json",
      "name": "beacon-network-description",
      "path": "http://beacon.berkeley.edu/hypothetical_deployment_description.json"
    },
    {
      "profile": "tabular-data-package",
      "format": "json",
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "profile": "tabular-data-package",
      "name": "co2-fossil-global",
      "format": "json",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}
A minimal tabular data catalog:
{
  "profile": "tabular-data-package-catalog",
  "name": "datahub-climate-change-packages",
  "resources": [
    {
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}
Data packages can also be declared inline in the data catalog. In that case, the nested "resources" array lists the data files in the dataset:
{
  "profile": "tabular-data-package-catalog",
  "name": "my-data-catalog",
  "resources": [
    {
      "profile": "tabular-data-package",
      "name": "my-dataset",
      "resources": [
        {
          "profile": "tabular-data-resource",
          "name": "resource-name",
          "data": [
            { "id": 1, "first_name": "Louise" },
            { "id": 2, "first_name": "Julia" }
          ],
          "schema": {
            "fields": [
              { "name": "id", "type": "integer" },
              { "name": "first_name", "type": "string" }
            ],
            "primaryKey": "id"
          }
        }
      ]
    }
  ]
}
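Because a catalog entry may either reference a descriptor by "path" or embed the package inline, a consumer has to tell the two forms apart. The following is a minimal sketch of one way to do that. The function name list_packages and the "inline"/"referenced" labels are assumptions made for the example, and mixing both forms in one catalog is assumed to be allowed because nothing in the text above forbids it.

```python
def list_packages(catalog):
    """Yield (name, kind, detail) for each data package in a catalog.

    kind is "inline" when the package descriptor is embedded in the catalog,
    and "referenced" when only a path to a descriptor is given.
    """
    for resource in catalog.get("resources", []):
        name = resource.get("name")  # may be None, as in the minimal example
        if "resources" in resource:
            # Inline package: report how many data files it declares.
            yield name, "inline", len(resource["resources"])
        else:
            # Referenced package: report where its descriptor lives.
            yield name, "referenced", resource.get("path")


example = {
    "profile": "tabular-data-package-catalog",
    "name": "my-data-catalog",
    "resources": [
        {
            "name": "my-dataset",
            "resources": [{"name": "resource-name", "data": [{"id": 1}]}],
        },
        {"path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"},
    ],
}

for name, kind, detail in list_packages(example):
    print(name, kind, detail)
```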
Implementations
None known.