An Open Knowledge Foundation Labs Project

A Data Package (or DataPackage) is a simple way of "packaging" up data.

To create a Data Package, all you need to do is place a "descriptor" file named datapackage.json in the top-level directory of your set of data files.

Why Data Packages? How Can They Help?

Imagine a situation such as the following:

  • You have a set of data files in a directory (and subdirectories) - whether locally or online
  • You want to provide basic info for this collection (the "dataset") - perhaps you want to publish this data for others, or simply to manage it better yourself
  • Information like: author, license, list of files in the dataset (and possibly info on those files, like a schema)

The Data Package approach provides a very simple, web-friendly, standardized and extensible way for you to do this.

Full Spec

There is a full RFC-style specification of the Data Package format on the Data Protocols website to complement this quick introduction.

Tabular Data

Tabular Data Package extends Data Packages for tabular data. It supports providing additional information such as data types of columns.

Tools

There is a growing set of online and offline tools for working with Data Packages, including tools for creating, viewing and validating them.

Getting Started

A minimal example Data Package would look like this on disk:

datapackage.json
# a data file (CSV in this case but could be any type of data)
data.csv
# (Optional!) A README (in markdown format)
README.md

Any number of additional files such as more data files, scripts (for processing or analyzing the data) and other material may be provided but are not required.

datapackage.json

datapackage.json is the file that makes a Data Package a Data Package (and is the only required file). It provides:

  • General metadata such as the name of the package, its license, its publisher, etc.
  • A "manifest" in the form of a list of the data resources (data files) included in this Data Package, along with information on those files (e.g. size and schema)

As its file extension indicates, it must be a JSON file. Here's a very minimal example of a datapackage.json file:

{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "title": "A nice title",
  "resources": [{
    # see below for what a resource descriptor looks like
  }]
}
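
As a rough illustration of the minimal layout described above, the following Python sketch writes a datapackage.json next to a data file. The function name `write_minimal_descriptor`, the package name `example-package` and the placeholder CSV contents are all hypothetical, and the single resource uses only the `path` attribute described in the Resources section; this is a sketch under those assumptions, not an official tool.

```python
import json
import tempfile
from pathlib import Path


def write_minimal_descriptor(package_dir: Path, name: str, title: str, data_file: str) -> Path:
    """Write a minimal datapackage.json into package_dir and return its path."""
    descriptor = {
        "name": name,
        "title": title,
        # One resource entry pointing at a data file, relative to package_dir.
        "resources": [{"path": data_file}],
    }
    descriptor_path = package_dir / "datapackage.json"
    descriptor_path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return descriptor_path


# Build a throwaway package directory containing data.csv and datapackage.json.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp)
    # Placeholder contents only; any data file would do.
    (package_dir / "data.csv").write_text("a,b\n1,2\n", encoding="utf-8")
    path = write_minimal_descriptor(package_dir, "example-package", "A nice title", "data.csv")
    written = json.loads(path.read_text(encoding="utf-8"))
    print(sorted(written))
```

Writing the descriptor with `json.dumps` guarantees that the result is strict JSON, which the hand-written samples on this page (with their `#` comments) are not.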

Note: a complete list of potential attributes and their meanings can be found in the full Data Package spec.
Note: the Data Package format is extensible in that it allows you to add your own attributes to the datapackage.json.

Here is a much more extensive example of a datapackage.json file:

{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors":[ {
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    # like contributors
  }],
  "publishers": [{
    # like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  },
  "resources": [
    {
      ... see below ...
    }
  ],
  # this is an attribute that is not part of the data package spec
  # you can add your own attributes to a datapackage.json
  "my-own-attribute": "data-packages-are-awesome"
}
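
To illustrate the extensibility point, here is a small Python sketch that loads a descriptor and separates attributes that appear in the example above from any custom ones. Note that the `#` comments and `...` placeholders in the example are for illustration only; a strict JSON parser rejects them, so the sketch uses a comment-free subset. The `EXAMPLE_ATTRIBUTES` set and the `split_attributes` helper are hypothetical names, and the set is taken from the example only, not from the full spec.

```python
import json

# Attribute names taken from the example above only; NOT the full list in the spec.
EXAMPLE_ATTRIBUTES = {
    "name", "datapackage_version", "title", "description", "version",
    "keywords", "licenses", "sources", "contributors", "maintainers",
    "publishers", "dependencies", "resources",
}


def split_attributes(descriptor: dict) -> tuple:
    """Return (attributes seen in the example, everything else)."""
    known = {k: v for k, v in descriptor.items() if k in EXAMPLE_ATTRIBUTES}
    custom = {k: v for k, v in descriptor.items() if k not in EXAMPLE_ATTRIBUTES}
    return known, custom


# A comment-free subset of the example, so that json.loads accepts it.
text = """
{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "title": "A nice title",
  "keywords": ["name", "My new keyword"],
  "resources": [],
  "my-own-attribute": "data-packages-are-awesome"
}
"""
known, custom = split_attributes(json.loads(text))
print(custom)
```

A tool that only understands the standard attributes can simply ignore whatever ends up in `custom`, which is what makes adding your own attributes safe.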

Resources

You list data files in the resources entry of the datapackage.json.

{
  # one of url or path should be present (you can have both)
  "path": "relative-path-to-file", # e.g. data/mydata.csv
  "url": "online url" # e.g. http://mysite.org/some-data.csv
}
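
The "one of url or path" rule can be sketched in Python as follows. The helper name `resource_location` and the `base` parameter are hypothetical, and preferring the local `path` over the `url` when both are present is an assumption of this sketch rather than something stated here.

```python
from pathlib import PurePosixPath


def resource_location(resource: dict, base: str = ".") -> str:
    """Return where a resource's data can be found.

    Prefers the local "path" (joined onto base) and falls back to "url".
    That preference order is an assumption of this sketch.
    """
    path = resource.get("path")
    url = resource.get("url")
    if path:
        # Paths are relative to the directory that holds datapackage.json.
        return str(PurePosixPath(base) / path)
    if url:
        return url
    raise ValueError("resource needs at least one of 'path' or 'url'")


print(resource_location({"path": "data/mydata.csv"}, base="my-package"))
print(resource_location({"url": "http://mysite.org/some-data.csv"}))
```

A resource with neither attribute raises a `ValueError`, which gives a simple check to run over every entry in `resources`.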

Tools

There is a growing set of online and offline tools for working with Data Packages including tools for creating, viewing, validating, publishing and managing Data Packages. See the Data Package tools page for more.

Examples

Many exemplar Data Packages can be found in the datasets organization on GitHub. Specific examples:

World GDP

A Data Package which includes the data locally in the repo (data is CSV).

S&P 500 Companies Data

This is an example with more than one resource in the data package.

TopoJSON example

This data package has TopoJSON and the data is external to the repo.