Data Repositories

Whether you choose to deposit your data in a specialty repository, a general-purpose external repository, or a local SI repository, make sure that the services and terms offered fit the needs of your data.

Specialty Repositories

Given the large number of specialty repositories that exist or are being built for specific data types, specific organisms, and large grant-funded collaborative projects, it is impractical to list every data repository that SI researchers could use to conform to the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. Before depositing data in a repository not listed below or in the attached best practices document, you should at a minimum ensure that the repository:

  • Has a plan and sufficient funding to ensure its long-term viability.
  • Allows export of data and data descriptions in a standards-compliant format, preferably the same format in which you deposited them.

Ideally, the repository should also:

  • Enable easy citation of your data, including supporting DOIs (either minted by the repository or by SIL).
  • Be searchable, and indexed in a service such as DataCite or Elsevier's DataSearch.
  • Support application of an appropriate license, and embargo of data if necessary.
  • Support metadata standards for your data, e.g., ISO 19115 for geographic data.
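
The two lists above can be read as a checklist: two minimum requirements and four recommended features. The Python sketch below is one hypothetical way to record such an evaluation. The class name, field names, and labels are illustrative assumptions, not an SI tool or an official policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RepositoryCheck:
    """Hypothetical checklist mirroring the criteria listed above."""

    name: str
    # Minimum requirements
    long_term_plan_and_funding: bool = False
    standards_compliant_export: bool = False
    # Recommended ("ideally") features
    supports_dois: bool = False
    searchable_and_indexed: bool = False
    supports_license_and_embargo: bool = False
    supports_metadata_standards: bool = False

    def meets_minimum(self) -> bool:
        """True only if both minimum requirements are satisfied."""
        return self.long_term_plan_and_funding and self.standards_compliant_export

    def missing_ideal(self) -> List[str]:
        """List the recommended features the repository lacks."""
        ideal = {
            "DOI support": self.supports_dois,
            "searchable and indexed": self.searchable_and_indexed,
            "license and embargo support": self.supports_license_and_embargo,
            "metadata standards": self.supports_metadata_standards,
        }
        return [label for label, ok in ideal.items() if not ok]
```

A repository that fails `meets_minimum()` would not be a suitable choice under the guidance above, whatever recommended features it offers.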

RE3DATA.org, a registry of research data repositories, is an excellent source of detailed information on individual repositories. You may also want to consult the PLOS ONE list of recommended repositories (listed by data format and discipline).

Any repository managed by a U.S. Federal agency or national laboratory, e.g., NIH's GenBank, NASA's National Space Science Data Center, or ORNL's DAAC, is considered a preferred repository for any SI research data that meets its criteria for deposit. In addition, data repositories run by established U.S. institutions, such as Harvard's Dataverse, are also acceptable.

General-purpose Repositories

SI has two repositories, SRO and SIdora, that accept Smithsonian-produced data.
Both SRO and SIdora support multiple file types and are discipline agnostic. Both accommodate the use of DOIs for citation. Both have actively managed, backed-up, secure storage in the Herndon Data Center. Both support open (accessible) and closed (private) data, though SRO additionally supports embargoes and other restrictions.

  • SRO is best for smaller (<50GB), fixed (inactive) datasets that accompany or support publications deposited in SRO.
    To deposit data and publications in SRO, you can self-deposit using the forms found on the internal staff pages or contact research-online@si.edu.
  • SIdora is best for larger or more complicated datasets, including actively updated datasets.
    To deposit data in SIdora, contact Beth Stern or email si-sidora@si.edu.
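
As a rough illustration of the rule of thumb above, the following Python sketch suggests one of the two SI repositories from a dataset's size and status. The function name and parameters are hypothetical, the 50 GB threshold is taken from the text, and the result is only a starting point for a conversation with the contacts listed above. It also assumes that a dataset bound for SRO accompanies a publication deposited there.

```python
SRO_MAX_GB = 50  # "smaller (<50GB)" threshold stated above


def suggest_si_repository(size_gb: float,
                          actively_updated: bool,
                          complicated: bool = False) -> str:
    """Suggest "SRO" or "SIdora" using the guidance above (not official policy)."""
    # SRO: smaller, fixed (inactive) datasets that are not unusually complicated.
    if size_gb < SRO_MAX_GB and not actively_updated and not complicated:
        return "SRO"
    # SIdora: larger, more complicated, or actively updated datasets.
    return "SIdora"
```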

If you or your publisher prefer to deposit with a non-SI repository, there are four general-purpose repositories that support FAIR principles. Their features are compared below. Following the grid is a glossary that clarifies the terms we've used as comparison criteria.

General Purpose Data Repositories Compared
| Repository | Dryad | Figshare | Open Science Framework (OSF) | Zenodo |
|---|---|---|---|---|
| More information | More @ Re3 | More @ Re3 | More @ Re3 | More @ Re3 |
| Caveats | Dryad's CC0 license is at odds with SI's general Terms of Use, which is closer in spirit to CC-BY-NC. | Figshare has a 5GB per file size limit. | OSF is best suited for active projects. | Zenodo is based in Europe, and European laws may apply to data deposited. There is a 50GB per dataset limit. |
| Fees (2018) | $120 per deposit (SI is not a member and cannot get a discount) | free; premium service for a fee | free | free |
| Formats accepted | office documents, scientific & statistical data, plain text, structured text, software, source code, other | office documents, images, structured graphics, audiovisual data, raw data, plain text, archived data | any (no restrictions on file types) | any (no restrictions on file types) |
| Persistent identifiers | will assign a DOI | supports ORCID; will assign DOI at time of publication | supports ORCID; will assign ARK and DOI | supports ORCID; will assign DOI or use provided DOI |
| Access options | open; embargoed (only for certain publishers) | open; restricted (unpublished) | open; restricted; closed | open; embargoed; closed; restricted |
| Licenses available | CC0 | CC BY, CC0, MIT, GNU GPLv3, Apache 2.0 | CC (all), Apache, MIT, GNU, other | CC (all), other |
| Versioning available | yes | yes | yes | yes; updated files are considered new versions and receive new DOIs |
| Usage statistics | yes | yes | yes | yes |
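
The size caveats in the comparison (5GB per file for Figshare, 50GB per dataset for Zenodo) can be checked before choosing a repository. The Python sketch below is a minimal illustration: the function name is hypothetical, and the limits are the 2018 values stated above, which should be verified with each repository before deposit.

```python
# Limits as stated in the comparison above (2018); verify before relying on them.
FIGSHARE_MAX_FILE_GB = 5
ZENODO_MAX_DATASET_GB = 50


def size_limit_warnings(file_sizes_gb):
    """Return warnings for the size caveats named in the comparison.

    file_sizes_gb: sizes of the individual files in the dataset, in GB.
    """
    warnings = []
    # Figshare's caveat applies to each file individually.
    if any(size > FIGSHARE_MAX_FILE_GB for size in file_sizes_gb):
        warnings.append("Figshare: at least one file exceeds the per-file limit")
    # Zenodo's caveat applies to the dataset as a whole.
    if sum(file_sizes_gb) > ZENODO_MAX_DATASET_GB:
        warnings.append("Zenodo: dataset exceeds the per-dataset limit")
    return warnings
```

For example, a dataset of thirteen 4GB files passes the per-file check but exceeds the per-dataset limit, so only the Zenodo warning is returned.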

Terms explained

    Fees (2018): Fees for depositing or maintaining access to the data. Verify the fees when depositing your work.
    Formats accepted: Data formats accepted for deposit by the repository. Common file formats are usually abbreviated by their file extension, e.g., .xlsx, .csv. Accepted formats may include proprietary or uncommon file formats from software specific to one discipline.
    Persistent identifiers: Persistent identifiers are registered unique strings (numeric or alphanumeric) that allow your deposit to be referenced easily. Notes in this field indicate whether the repository assigns identifiers for you or gives you a place to store a persistent identifier you have created.

    • DOI: Digital Object Identifier is a persistent unique identifier assigned by a registration agency to a digital object. Because DOIs are registered, if the content changes location, the DOI will still be "resolvable", that is, it will still link to the content.
    • ARK: Archival Resource Key is a persistent URL assigned by one of several registered naming authorities, following the ARK schema. 
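
To illustrate what "resolvable" means in practice, the Python sketch below turns a DOI into a link through the public doi.org resolver. The function name is hypothetical, the example DOI in the usage note is a made-up placeholder, and the sketch does not percent-encode special characters that some DOIs contain.

```python
def doi_to_url(doi: str) -> str:
    """Build a resolver URL for a DOI using the public doi.org proxy."""
    cleaned = doi.strip()
    # Strip common prefixes so that bare DOIs and full links are both accepted.
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # Every DOI begins with the directory indicator "10.".
    if not cleaned.startswith("10."):
        raise ValueError("not a DOI: " + repr(doi))
    return "https://doi.org/" + cleaned
```

Calling `doi_to_url("doi:10.1234/example-dataset")` (a placeholder, not a registered DOI) produces a `https://doi.org/...` link. If the content later moves, the registered DOI is updated to point to the new location, and the same link continues to work.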

    Access options: The repository may allow you to control who can find or view your data, and when. An embargo period may be available to keep data hidden for a specified amount of time.
    Licenses available: Types of licenses that the repository enables you to apply to your data. Licenses specify the permitted use and/or reuse of data. They do not control who can view or access your data (see "Access options" above).
    For more information about choosing licenses for your data, the Digital Curation Centre has an excellent guide: http://www.dcc.ac.uk/resources/how-guides/license-research-data . Generally, datasets created by Federal employees in the course of their duties are considered to be in the Public Domain (with some exceptions). The most common types of licenses available are Creative Commons (CC) licenses, but many repositories offer software licenses (MIT, GNU) that may be more appropriate for code associated with datasets.

    • CC0 License: Creative Commons Zero (https://creativecommons.org/choose/zero/). Waives all copyright and related or neighboring rights that you have over the work.
    • CC-BY-NC License: Creative Commons Attribution-NonCommercial
    • CC-BY-ND License: Creative Commons Attribution-NoDerivatives

    Versioning: Whether the repository automatically versions data when it is deposited. Versioning can be important for datasets that are periodically updated. Some repositories provide automatic versioning or version control that helps clarify which datasets were used to produce which outputs (publications, etc.).
    Usage statistics: Repositories may provide statistics on the use of the materials in the repository. These may include information on individual datasets, such as downloads and views.