Table Dialect
Authors | Rufus Pollock, Paul Walsh, Adam Kariv, Evgeny Karev, Peter Desmet, Kyle Husmann, Data Package Working Group |
---|---|
Profile | /profiles/2.0/tabledialect.json |
Table Dialect describes how tabular data is stored in a file. It supports delimited text files like CSV, semi-structured formats like JSON and YAML, and spreadsheets like Microsoft Excel. The specification is designed to be expressible as a single JSON-compatible descriptor.
Language
The key words MUST
, MUST NOT
, REQUIRED
, SHALL
, SHALL NOT
, SHOULD
, SHOULD NOT
, RECOMMENDED
, MAY
, and OPTIONAL
in this document are to be interpreted as described in RFC 2119.
Introduction
Table Dialect defines set of properties that can be used by data producers and data consumers to ensure data interoperability in various Tabular Data formats such as CSV, JSON, or Excel. The main goal of this specification is to define a common language for defining tabular data dialects.
It is not expected that all the properties are supported by all the Data Package implementations. An implementation MUST
choose the most suitable strategy for communicating to the users if some relevant feature is not supported.
Table Dialect is useful for programmes which might have to deal with multiple dialects of tabular files, but which can rely on being told out-of-band which dialect will be used in a given input stream. This reduces the need for heuristic inference of dialects, and simplifies the implementation of readers, which must juggle dialect inference, schema inference, unseekable input streams, character encoding issues, and the lazy reading of very large input streams.
Descriptor
Table Dialect descriptor MUST
be a descriptor as per Descriptor definition. A list of standard properties that can be included into a descriptor is defined in the Properties section.
An example of a Table Dialect descriptor:
Tabular Data Formats
Table Dialect can be used for different data formats, such as delimited text files, semi-structured formats and spreadsheets. Some properties are generic and can be used for multiple formats, while others are specific to one format.
A property MUST
be ignored if it is no applicable for an arbitrary data format. For example, SQL databases do not have a concept of a header row.
For the sake of simplicity, most of examples are written in the CSV data format. For example, this data file without providing any Table Dialect properties:
SHOULD
output this data:
Delimited
Delimited formats is a group of textual formats such as CSV and TSV. Their charactistics can be expressed the following properties:
- $schema:
https://datapackage.org/profiles/1.0/tabledialect.json
by default - header:
true
by default - headerRows:
1
by default - headerJoin:
- commentRows: undefined by default
- commentChar: undefined by default
- delimiter:
,
by default - lineTerminator:
\r\n
by default - quoteChar:
"
by default - doubleQuote:
true
by default - escapeChar: undefined by default
- nullSequence: undefined by default
- skipInitialSpace:
false
by default
An example of a well-defined Table Dialect descriptor for a CSV format:
Structured
Structured formats is a group of structured or semi-structured formats such as JSON and YAML. Their charactistics can be expressed the following properties:
- $schema:
https://datapackage.org/profiles/1.0/tabledialect.json
by default - header:
true
by default - property: undefined by default
- itemType: undefined by default
- itemKeys: undefined by default
Spreadsheet
Spreadsheet formats is a group of sheet-based formats such as Excel or ODS. Their charactistics can be expressed the following properties:
- $schema:
https://datapackage.org/profiles/1.0/tabledialect.json
by default - header:
true
by default - headerRows:
1
by default - headerJoin:
- commentRows: undefined by default
- commentChar: undefined by default
- sheetNumber:
1
by default - sheetName: undefined by default
Database
Database formats is a group of formats accessing data from databases like SQLite. Their charactistics can be expressed the following properties:
- $schema:
https://datapackage.org/profiles/1.0/tabledialect.json
by default - table: undefined by default
Properties
$schema
A root level Table Dialect descriptor MAY
have a $schema
property that MUST
be a profile as per Profile definition that MUST
include all the metadata constraints required by this specification.
The default value is https://datapackage.org/profiles/1.0/tabledialect.json
and the recommended value is https://datapackage.org/profiles/2.0/tabledialect.json
.
header
A Table Dialect descriptor MAY
have the header
property that MUST
be boolean with default value true
. This property indicates whether the file includes a header row. If true
the first row in the file MUST
be interpreted as a header row, not data.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
Where field1
and field2
names are implementation-specific and used here only for illustrative purpose.
headerRows
A Table Dialect descriptor MAY
have the headerRows
property that MUST
be an array of positive integers starting from 1 with default value [1]
. This property specifies the row numbers for the header. It is RECOMMENDED
to be used for multiline-header files.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
headerJoin
A Table Dialect descriptor MAY
have the headerJoin
property that MUST
be a string with default value " "
. This property specifies how multiline-header files have to join the resulting header rows.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
commentRows
A Table Dialect descriptor MAY
have the commentRows
property that MUST
be an array of positive integers starting from 1; undefined by default. This property specifies what rows have to be omitted from the data.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
commentChar
A Table Dialect descriptor MAY
have the commentChar
property that MUST
be a string of one or more characters; undefined by default. This property specifies what rows have to be omitted from the data based on the row’s first characters.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
delimiter
A Table Dialect descriptor MAY
have the delimiter
property that MUST
be a string; with default value ,
(comma). This property specifies the character sequence which separates fields in the data file.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
lineTerminator
A Table Dialect descriptor MAY
have the lineTerminator
property that MUST
be a string; with default value \r\n
. This property specifies the character sequence which terminates rows.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
quoteChar
A Table Dialect descriptor MAY
have the quoteChar
property that MUST
be a string of one character length with default value "
(double quote). This property specifies a character to use for quoting in case the delimiter
needs to be used inside a data cell.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
doubleQuote
A Table Dialect descriptor MAY
have the doubleQuote
property that MUST
be boolean with default value true
. This property controls the handling of quoteChar
inside data cells. If true, two consecutive quotes are interpreted as one.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
escapeChar
A Table Dialect descriptor MAY
have the escapeChar
property that MUST
be a string of one character length; undefined by default. This property specifies a one-character string to use for escaping, for example, \
, mutually exclusive with quoteChar
.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
nullSequence
A Table Dialect descriptor MAY
have the nullSequence
property that MUST
be a string; undefined by default. This property specifies the null sequence, for example, \N
.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
skipInitialSpace
A Table Dialect descriptor MAY
have the skipInitialSpace
property that MUST
be boolean with default value false
. This property specifies how to interpret whitespace which immediately follows a delimiter; if false
, it means that whitespace immediately after a delimiter is treated as part of the following field.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
property
A Table Dialect descriptor MAY
have the property
property that MUST
be a string; undefined by default. This property specifies where a data array is located in the data structure.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
itemType
A Table Dialect descriptor MAY
have the itemType
property that MUST
be a string with value array
or object
; undefined by default. This property specifies whether the data property
contains an array of arrays or an array of objects.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
itemKeys
A Table Dialect descriptor MAY
have the itemKeys
property that MUST
be array of strings; undefined by default. This property specifies the way of extracting rows from data arrays with itemType
is object
.
For example, this data file:
With this dialect definition:
SHOULD
output this data:
sheetNumber
A Table Dialect descriptor MAY
have the sheetNumber
property that MUST
be an integer with default value 1
. This property specifies a sheet number of a table in the spreadsheet file.
For example, this data file:
With this dialect definition:
SHOULD
output the data from the second sheet.
sheetName
A Table Dialect descriptor MAY
have the sheetName
property that MUST
be a string; undefined by default. This property specifies a sheet name of a table in the spreadsheet file.
For example, this data file:
With this dialect definition:
SHOULD
output the data from the second sheet.
table
A Table Dialect descriptor MAY
have the table
property that MUST
be a string; undefined by default. This property specifies a name of the table in the database.
For example, the database with the tables below:
With this dialect definition:
SHOULD
output the data from the second table.
Excluded
Table Dialect has nothing to do with the names, contents or types of the headers or data within the CSV file (see Table Schema instead), only how it is formatted. However, Table Dialect does allow the presence or absence of a header to be specified, similarly to RFC4180.
Table Dialect is also orthogonal to the character encoding used in the CSV file. Note that it is possible for files in CSV format to contain data in more than one encoding.
References
Some related work can be found in similar JSON format and in Python’s PEP 305.