Formatting SPARC Datasets
SPARC data is organized into datasets, each containing many files. Here are the formatting conventions.
Organization of a SPARC Dataset
A dataset template (currently at version 2.1.0) can be found here or downloaded as a zip file here. View the folder structure directly on GitHub.
A SPARC dataset comprises the following data and structure:
- An experimental protocol that has been submitted to Protocols.io, shared with the SPARC working group, curated, and published with a valid DOI.
- Data files organized into folders by the investigators and curated according to the SPARC Dataset Structure (SDS). SODA (Software to Organize Data Automatically) greatly simplifies and automates SDS-compliant data organization and submission.
SPARC dataset files and folders include sets of metadata and descriptive files as described below.
Set of metadata files:
- dataset_description (xlsx, csv or json): Required file in which an investigator provides basic metadata. The file is divided into five clearly separated sections: (1) basic information about the dataset, such as title, keywords, and funding; (2) essential study information, with parts similar to a structured abstract (study purpose, data collected, and conclusions) plus organ system, approach (modality), and techniques used, which were added to enhance searchability; (3) contributors; (4) related identifiers; and (5) participants. The dataset_description.xlsx file also includes a field specifying the metadata version. Data submitters must not change this field, as it is used to align different metadata releases. Finally, the file includes a section where researchers can state whether they plan to submit additional data to the dataset later.
- submission (xlsx, csv or json): Required file containing information relevant to internal SPARC bookkeeping, relating milestones negotiated with NIH to the contents of the submitted dataset. If the dataset is not part of a milestone, enter N/A in the relevant field. This file is for internal use only; it will not be released when the data are published.
- subjects (xlsx, csv or json): Required if subjects are used in the experiment producing the dataset. Contains required and optional metadata fields describing the subjects (model organisms or animals) involved in data collection. Each subject or pool of subjects must be assigned a unique ID, which is used to name the data folder for that subject. For the data to be mapped properly, the names of folders containing experimental data must exactly match the subject IDs. All subject identifiers must be unique within a dataset and, for human subjects, must not contain any sensitive, identifiable information. Having each lab use consistent, unique subject identifiers across datasets is highly desirable, as it helps connect multiple experiments that use the same subjects.
- samples (xlsx, csv or json): Conditional file, required if measurements are obtained from samples (e.g., tissue slices) derived from individual or pooled subjects. This file contains information about the samples used to generate the data. Investigators must provide a unique ID for each sample; this ID is used to name the data folders, and the folder name must match the sample ID exactly. Sample IDs must be unique within a single dataset. Each sample should also reference a subject from the subjects file; a single subject (a research animal or donor) may be linked to multiple biological samples derived from that subject. If samples are pooled from multiple subjects, the complete provenance must be specified in the subjects file. The metadata in the samples file should also explicitly note whether a sample was collected directly or was derived from another sample.
- code_description (xlsx, csv or json): Conditional file, to be submitted with all computational datasets, that describes the code in terms of its quality. Code RRIDs and ontologies are required.
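The ID rules described above (unique subject and sample IDs, and every sample pointing back to a known subject) can be checked mechanically before submission. The following is a minimal sketch in Python for the csv variants of the subjects and samples files. The column headers `subject id` and `sample id`, the helper names, and the inline example data are assumptions made for illustration; check the template you are using for the exact headers. The sketch also treats each sample as referencing a single subject, so pooled samples would need extra handling.

```python
import csv
import io
from collections import Counter

# Assumed column headers (illustrative only; confirm against your template).
SUBJECT_ID = "subject id"
SAMPLE_ID = "sample id"


def read_rows(csv_text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def check_ids(subject_rows, sample_rows):
    """Return a list of human-readable problems found in the ID columns."""
    problems = []

    subject_ids = [row[SUBJECT_ID].strip() for row in subject_rows]
    sample_ids = [row[SAMPLE_ID].strip() for row in sample_rows]

    # IDs must be unique within a dataset.
    for label, ids in (("subject", subject_ids), ("sample", sample_ids)):
        for value, count in Counter(ids).items():
            if count > 1:
                problems.append(f"duplicate {label} ID: {value}")

    # Each sample should reference a subject listed in the subjects file.
    known = set(subject_ids)
    for row in sample_rows:
        subject = row[SUBJECT_ID].strip()
        if subject not in known:
            problems.append(
                f"sample {row[SAMPLE_ID].strip()} references unknown subject {subject}"
            )
    return problems


# Made-up example data: one duplicated sample ID and one dangling reference.
subjects_csv = "subject id,species\nsub-1,rat\nsub-2,rat\n"
samples_csv = "sample id,subject id\nsam-1,sub-1\nsam-2,sub-3\nsam-2,sub-2\n"

for problem in check_ids(read_rows(subjects_csv), read_rows(samples_csv)):
    print(problem)
```

An empty result means only that these two rules passed; it says nothing about the other required fields in either file.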
Set of descriptive top-level files:
- README (txt): Required file provided by investigators that contains the details needed to reuse the data, beyond what is captured in structured metadata. Information that should be included:
  - How would a user use the files that are provided? e.g., "first open file X and then look at file Y."
  - What additional details do users need to know? Are some subjects missing data?
  - Are there warnings about how to use the data or code?
  - Are there appropriate/inappropriate uses for this data?
  - Are there other places that users can go for more information? e.g., did you provide a GitHub repository, or are there additional papers beyond what was provided in the metadata form?
- performances (xlsx, csv or json): Performance log file, required if data were gathered from performances, i.e., multiple visits, runs, sessions, or executions of one type of experimental protocol.
- code_parameters (xlsx, csv or json): Optional file describing the specific parameters needed to run the code.
- resources (xlsx, csv or json): Optional file describing the resources used in the experiments (RRID, URL, vendor, version, additional metadata).
- CHANGES (txt): Required file if a new version of the dataset is uploaded, to document any changes from the previous version.
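As a rough illustration of the file lists above, the sketch below takes the file names found at the top level of a dataset and reports which expected files are absent. It is an assumption-laden sketch and not an official validator: the `EXPECTED` table simply restates the required, conditional, and optional status and the extensions given in this section, and the function name is made up for the example. Whether a conditional file is actually needed depends on the dataset, so those files are reported separately as things to double-check.

```python
from pathlib import PurePath

TABULAR = {".xlsx", ".csv", ".json"}
TEXT = {".txt"}

# Status and allowed extensions per file stem, restated from the lists above.
EXPECTED = {
    "dataset_description": ("required", TABULAR),
    "submission": ("required", TABULAR),
    "subjects": ("conditional", TABULAR),
    "samples": ("conditional", TABULAR),
    "code_description": ("conditional", TABULAR),
    "performances": ("conditional", TABULAR),
    "code_parameters": ("optional", TABULAR),
    "resources": ("optional", TABULAR),
    "README": ("required", TEXT),
    "CHANGES": ("conditional", TEXT),
}


def missing_files(filenames):
    """Map 'required' and 'conditional' to the sorted stems that are absent."""
    found = set()
    for name in filenames:
        path = PurePath(name)
        rule = EXPECTED.get(path.stem)
        # Count a file only if its extension is one of the allowed ones.
        if rule is not None and path.suffix.lower() in rule[1]:
            found.add(path.stem)

    return {
        status: sorted(
            stem
            for stem, (file_status, _) in EXPECTED.items()
            if file_status == status and stem not in found
        )
        for status in ("required", "conditional")
    }


# Made-up listing of a dataset's top-level files.
report = missing_files(
    ["dataset_description.xlsx", "subjects.csv", "README.txt", "notes.docx"]
)
print("Missing required files:", report["required"])
print("Conditional files to double-check:", report["conditional"])
```

To run this against a folder on disk, the names could be gathered with `pathlib`, for example `missing_files(p.name for p in Path(dataset_root).iterdir())`, where `dataset_root` is a placeholder for the dataset's top-level folder.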
For more information on how to use SPARC data and the SPARC dataset structure, see Navigating a SPARC dataset.
Example of a dataset organized according to the SPARC Dataset Structure
Data Organization
Files are organized into the following six top-level folders, depending on the type of content:
- primary: a required dataset-dependent folder that contains all folders and files for experimental subjects and/or samples (e.g., time series data, tabular data, clinical imaging data, genomic, metabolomic, or microscopy data). The data have generally been minimally processed so that they are in a form ready for analysis. Within the primary folder, data are organized by subject or sample. Each subject and sample has its own folder, whose name exactly matches the name or ID given in the subjects and samples metadata files.
- source: an optional folder containing unaltered, raw files from an experiment, if they are included in the data. For example, this folder may include the “truly” raw k-space data for a Magnetic Resonance (MR) image that has not yet been reconstructed. In this case, the reconstructed DICOM or NIFTI files would be found within the primary folder.
- derivative: a folder that is required if derivative data exist. It contains derived data files, for example, processed image stacks annotated with the MicroBrightField (MBF Bioscience) tools, segmentation files, or smoothed overlays of current and voltage that demonstrate a particular effect. Files that have been converted into a format other than the one submitted also belong in the derivative folder. As with the primary data, derived data should be organized into subject and sample folders, using the subject and sample IDs as the folder names.
- code: a folder that is required only if code was used to generate the data. It contains all source code used in the study (e.g., MATLAB) and any supporting code.
- protocol: an optional folder that contains supplementary files to accompany the experimental protocols submitted to Protocols.io. Please note that this is not a substitute for the experimental protocol, which should already have been submitted to Protocols.io/sparc.
- docs: an optional folder that contains all the supporting documents for the dataset, including but not limited to a representative image for the dataset. Unlike the README file, which is necessarily a text document, docs can contain documents in multiple formats, including images.
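Because folder names in primary (and derivative) must match the subject and sample IDs exactly, a quick comparison of the two can catch typos and case differences early. The sketch below is one possible way to do that in Python. It assumes, for illustration only, that subject folders sit directly under the primary folder; the function names and the example IDs are made up, and the demonstration runs on a temporary directory and not a real dataset.

```python
import tempfile
from pathlib import Path


def subject_folders(primary_dir):
    """Names of the folders located directly under the primary folder."""
    return {p.name for p in Path(primary_dir).iterdir() if p.is_dir()}


def compare_with_ids(folder_names, subject_ids):
    """Return (folders with no matching ID, IDs with no matching folder)."""
    folders, ids = set(folder_names), set(subject_ids)
    # Exact, case-sensitive comparison: "Sub-3" does not match "sub-3".
    return sorted(folders - ids), sorted(ids - folders)


# Demonstration on a throwaway directory tree with one mis-cased folder.
with tempfile.TemporaryDirectory() as root:
    primary = Path(root) / "primary"
    for name in ("sub-1", "sub-2", "Sub-3"):
        (primary / name).mkdir(parents=True)

    unmatched_folders, unmatched_ids = compare_with_ids(
        subject_folders(primary), ["sub-1", "sub-2", "sub-3"]
    )

print("Folders without a matching ID:", unmatched_folders)
print("IDs without a matching folder:", unmatched_ids)
```

In practice the list of IDs would come from the subjects (or samples) metadata file and not from a hard-coded list.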
SPARC Metadata
Experimental metadata specified by the SPARC Data Standards Committee is described in the SPARC Minimal Information Standards (MIS). MIS metadata fields have been incorporated into the subjects and samples templates in the zip file. We also provide an annotated list of these fields.