SPARC Dataset Structure (SDS)
The SDS is the required FAIR data sharing organizational and naming system for items shared on the SPARC Portal.
Using this FAIR standard for metadata and folder structure allows SPARC to share data with the public in a manner that promotes understanding of the experiments, processes, and datasets. It also promotes consistency, reuse, and the development of automated tools and analytics.
Current Version of SDS
The current version of SDS is 3.0.1. The changelog with information about modifications from older versions can be found here, all releases can be found at this link, and version 3.0.1 can be downloaded as a zip file. Note that SDS version 2.1 is still supported; it can be found here or downloaded as a zip file, and this presentation includes a high-level overview of version 2.1. For an overview of how to use the SDS when submitting your data or model, please see this section of the Submission Walkthrough.
Overview of SDS 3.0
SDS 3.0 introduces significant changes and improvements to the SPARC Data Structure, enhancing the organization, validation, and curation of datasets. It adds new metadata files, validation rules, and structural changes, tightens data organization rules, and improves metadata handling. The aim is to enhance data consistency, searchability, and reusability within the SPARC ecosystem and to support future interoperability with other data standards such as BIDS.
SDS 3.0 enforces stricter naming conventions, introduces new files for better tracking of specimens and sites, and expands the dataset description file with more detailed fields related to funding, device use, and dataset standards. Additionally, several validation rules ensure that data follows strict guidelines for naming, metadata consistency, and file organization.
Key Changes in SDS Version 3.0 from earlier versions
Key Changes for Users
- Stricter file and folder naming conventions.
- More detailed metadata requirements, especially for devices and funding.
- New options for describing relationships between datasets.
- Enhanced support for complex experimental designs with the new sites and specimens files.
- Improved code documentation capabilities.
- Better handling of auxiliary files related to the publication process.
New Features and Improvements
- The folder structure is no longer the only way to map subjects and samples to files. Now this can be handled using the manifest file.
- Data Dictionary Support:
- New data-dictionary-path column in the manifest.
- Support for specifying data dictionary type and description.
- Enhanced File Naming Rules:
- Stricter character restrictions for file and folder names.
- New validation checks for potentially problematic file names.
- Improved Data Modality Handling:
- New data-modality column in the manifest.
- File type restrictions enforced by modality.
- Better Cross-Dataset Referencing:
- New columns in the manifest for referencing files in other datasets.
- New relationTypes in dataset_description for dataset-to-dataset relations.
- Unicode Support: Limited support for Unicode characters in the letter and number categories.
- .dss File: New file to specify the data structure standard (SDS) version used.
- Artifact Validation: Enhanced validation rules for nested folders and entity relationships.
Validation and Naming Conventions
SDS 3.0 introduces stricter validation rules for datasets, ensuring that files, folders, and metadata follow a standardized format:
- Sample and Subject IDs: Samples and subjects cannot share the same pool-id, and a given pool-id can only appear in one of the samples or subjects files.
- File Naming Rules:
- Only the following characters are allowed in file and folder names: [0-9A-Za-z,.-_ ].
- Certain special characters (@#$%^&*()+=/|"'~;:<>{}[]?) are no longer allowed, as are non-printing characters.
- Spaces are discouraged but allowed, except at the beginning or end of file names.
- Files and folders mapped to SDS entity IDs (i.e. folders named sam-1, sub01, etc.) must follow the more restrictive rule [A-Za-z0-9-].
- Unicode Restrictions: By default, SDS 3.0 excludes larger Unicode categories but provides an option to extend support for internal datasets with non-ASCII file names.
- File Type Validation by Modality: If a modality (e.g., imaging, electrophysiology) is provided, the system will validate that the file types match the expected format for that modality.
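The file naming rules above can be sketched as a small check. This is only an illustration, not the SDS validator: the function name `check_name` is hypothetical, and the regular expressions are one reading of the character sets quoted above, with the dash treated as a literal character rather than as a range.

```python
import re

# One interpretation of the character sets listed above (an assumption of this
# sketch): digits, letters, comma, period, dash, underscore and space.
GENERAL_NAME = re.compile(r"[0-9A-Za-z,._ -]+")
# The more restrictive set for names mapped to SDS entity IDs.
ENTITY_NAME = re.compile(r"[A-Za-z0-9-]+")


def check_name(name: str, is_entity: bool = False) -> list[str]:
    """Return a list of problems found in one file or folder name."""
    problems = []
    pattern = ENTITY_NAME if is_entity else GENERAL_NAME
    if not pattern.fullmatch(name):
        problems.append("contains characters outside the allowed set")
    # Spaces are allowed inside a name but not at its beginning or end.
    if name != name.strip(" "):
        problems.append("starts or ends with a space")
    return problems
```

For example, `check_name("sam_1", is_entity=True)` reports a problem because the underscore is not in the entity character set, while `check_name("my file.txt")` reports none because interior spaces are allowed.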
File System Changes
- New .dss File: The .dss file is added to indicate the data structure standard used by a dataset. It defines which validator to apply for different folder structures. This will not be relevant for most datasets.
- New sites File: This file provides metadata on specific sites (e.g., electrode locations), improving the granularity and consistency of metadata.
- Removed code_parameters File: Functionality moved to code_description.
SPARC Dataset Structure Policy
Rationale for SPARC's adherence to the FAIR standard for biomedical research data and the policy regarding the SPARC dataset structure can be found in this publication:
Abstract
The NIH Common Fund’s Stimulating Peripheral Activity to Relieve Conditions (SPARC) initiative is a large-scale program that seeks to accelerate the development of therapeutic devices that modulate electrical activity in nerves to improve organ function. Integral to the SPARC program are the rich anatomical and functional datasets produced by investigators across the SPARC consortium that provide key details about organ-specific circuitry, including structural and functional connectivity, mapping of cell types and molecular profiling. These datasets are provided to the research community through an open data platform, the SPARC Portal. To ensure SPARC datasets are Findable, Accessible, Interoperable and Reusable (FAIR), they are all submitted to the SPARC portal following a standard scheme established by the SPARC Curation Team, called the SPARC Data Structure (SDS). Inspired by the Brain Imaging Data Structure (BIDS), the SDS has been designed to capture the large variety of data generated by SPARC investigators who are coming from all fields of biomedical research. Here we present the rationale and design of the SDS, including a description of the SPARC curation process and the automated tools for complying with the SDS, including the SDS validator and Software to Organize Data Automatically (SODA) for SPARC. The objective is to provide detailed guidelines for anyone desiring to comply with the SDS. Since the SDS are suitable for any type of biomedical research data, it can be adopted by any group desiring to follow the FAIR data principles for managing their data, even outside of the SPARC consortium. Finally, this manuscript provides a foundational framework that can be used by any organization desiring to either adapt the SDS to suit the specific needs of their data or simply desiring to design their own FAIR data sharing scheme from scratch.
SPARC Dataset Organization
Data files are organized into folders by the investigators and curated according to the SPARC Dataset Structure (SDS). Data organization and submission that is in compliance with the SDS is greatly simplified and automated by SODA (Software to Organize Data Automatically).
To see how the SDS is used in the data submission process, visit Organize Your Files. To see how to browse and navigate datasets on the Portal in SDS, visit Navigating SPARC Datasets.
SPARC datasets contain:
Datasets must follow a Naming Convention.
Example of dataset organized according to SDS Version 2.1
Top-Level Folders
All data files are organized into top-level folders. Some of these are required, and some are optional, dependent upon your dataset.
Primary Folder
This folder is required and contains all of your experimental data.
- Examples may include time-series data, tabular data, clinical imaging data, genomic, metabolomic, microscopy data.
- The data generally have been minimally processed so they are in a form ready for analysis.
Data is organized by subject/sample, and all subjects and samples will have a unique folder within the Primary Folder.
The rules for naming folders and files are:
- These unique folders will have a standardized name that corresponds to the exact names or IDs as referenced in the metadata file.
- For the SDS metadata identifiers (i.e., subject labels, sample labels, performance labels, etc.): These folder names can only include letters, numbers, and the dash character.
- There are standardized prefixes for each type of data folder. Refer to the Naming Convention section below.
- For the folder and file names within those SDS-defined folders, there are no restrictions, but we strongly suggest avoiding special characters such as !@#$%^&*()+=/|"'~. These suggestions are expected to be enforced in future versions of SDS to enable interoperability across operating systems.
Source Folder
This folder is only required if unaltered, raw files from an experiment are included in the data. If so, they will be placed in this folder.
- An example would be the “truly” raw k-space data for a Magnetic Resonance (MR) image that has not yet been reconstructed.
- In this case, the reconstructed DICOM or NIFTI files would be found within the primary folder.
Please note, there are no specific requirements about folder naming inside this Source Folder. However, we recommend using the sub-, sam-, perf- structure similar to the primary folder’s naming conventions. Refer to Naming Convention section.
Derivative Folder
This folder is only required if derivative data exists. If so, they will be placed in this folder.
- Examples may include processed image stacks that are annotated via the MicroBrightField (MBF Biosciences) tools, segmentation files, or smoothed overlays of current and voltage that demonstrate a particular effect.
- If files are converted into a format other than what was submitted, these files are included in the derivative folder.
- Derived data should be organized into subject and sample folders, using the same subject and sample IDs as the folder names within the Primary Data folder.
Code Folder
This folder is only required if code is used in the generation of the data. If so, the folder contains all the source code used in the study. Please note, if your code is on GitHub, you don’t need to share it here. Simply put the link in the related identifier field of your “Dataset_description” file.
- Examples may include text and source code (e.g. MATLAB, etc.), script to plot the results, script to open and read raw data files, source code for computational models.
- Links to supporting code that provides added value to the dataset can be included in the metadata description but does not have to be uploaded here.
Protocol Folder
This folder is optional and contains supplementary files to accompany the experimental protocols submitted to Protocols.io.
IMPORTANT: This is not a substitution for the experimental protocols which are required to be created and shared with SPARC or RE-JOIN workspace on Protocols.io.
Docs Folder
This folder is optional and contains all the supporting documents for the dataset.
- An example would be a representative image for the dataset.
- Unlike the readme file, which is a text document, docs can contain documents in multiple formats, including images.
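The top-level folder layout described above can be checked with a short script. This is a minimal sketch under stated assumptions: the lowercase on-disk folder names and the function name `check_top_level_folders` are illustrative choices, not taken from the SDS specification, and only the Primary Folder is treated as required, as the text above states.

```python
from pathlib import Path

# Assumed on-disk names for the folders described above (lowercase spelling is
# an assumption of this sketch).
REQUIRED_FOLDERS = {"primary"}
OPTIONAL_FOLDERS = {"source", "derivative", "code", "protocol", "docs"}


def check_top_level_folders(dataset_root):
    """Report required folders that are missing and folders that are not recognized."""
    root = Path(dataset_root)
    present = {p.name for p in root.iterdir() if p.is_dir()}
    known = REQUIRED_FOLDERS | OPTIONAL_FOLDERS
    return {
        "missing": sorted(REQUIRED_FOLDERS - present),
        "unexpected": sorted(present - known),
    }
```

Calling `check_top_level_folders("path/to/dataset")` returns a dictionary with two sorted lists, which keeps the result easy to print or compare.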
Structured Metadata Files
- dataset_description (xlsx, csv or json): Required file where an investigator provides basic metadata. The dataset description file is divided into five clearly separated sections: (1) basic information about the dataset, such as title, keywords, and funding; (2) essential study information, with parts similar to a structured abstract, such as study purpose, data collected, and conclusions, and expanded to include organ system, approach (modality), and techniques used, to enhance searchability; (3) contributors; (4) related identifiers; and (5) participants. The dataset_description.xlsx file includes an additional field specifying the metadata version. This field is not to be changed by data submitters, as it is used to properly align different metadata releases. The dataset_description file also includes a section for researchers to describe whether they plan to later submit additional data to the dataset.
- subjects (xlsx, csv or json): Conditionally Required if subjects are used in the experiment producing the dataset. Contains updated fields with required and optional metadata fields providing information about subjects (model organism or animals) involved in data collection. Each subject and/or pooled subjects must be assigned a unique ID. This unique ID is used to name the data folders for individual subjects. For proper mapping of the data, folders containing experimental data need to exactly match the subject ID. All subject identifiers must be unique within a dataset and not contain any sensitive, identifiable information (for human subjects). Having each lab use consistent, unique subject identifiers across datasets is highly desirable to aid in connecting multiple experiments using the same subjects.
- samples (xlsx, csv or json): Conditionally Required if measurements are obtained from samples, e.g., tissue slices, derived from individual or pooled subjects. This file contains information about samples used to generate the data. Investigators must provide a unique ID for each sample which will be used to name the data folders, and the sample ID must match the folder ID exactly. Sample IDs must be unique within a single dataset. Each sample should also reference a subject from the subject file; a single subject (a research animal/donor) may be linked to multiple biological samples derived from that subject. If the samples are pooled from multiple subjects, the complete provenance must be specified in the subject file. The metadata present in the samples file should also explicitly note whether a sample was collected directly or was derived from another sample.
- code_description (xlsx, csv or json): Conditionally required if submitting code or a computational model. When submitting code, it documents any code or software included with the dataset. When submitting a computational model, it contains information to describe the code in terms of its quality. Code RRIDs and ontologies are required.
- submission (xlsx, csv or json): Conditionally Required if you are submitting a dataset required by your consortium associated with specific milestones. It contains information relevant to internal SPARC bookkeeping, relating the milestones you have negotiated with NIH to the contents of the submitted dataset. If the dataset is not a part of the milestone use N/A in the relevant field. This file is for internal use only; it will not be released when the data are published.
Descriptive top-level files:
- manifest (xlsx, csv or json): Conditionally Required if the dataset contains subjects or samples but the folder structure does not include subject and/or sample IDs. It lists all files and folders in the dataset, mapping them to specific entities like subjects and samples. This file helps ensure proper organization and provides key metadata for each file.
- performances (xlsx, csv or json): Conditionally Required if data were gathered from multiple distinct performances of one type of experimental protocol on the same subject or same sample (i.e., multiple visits, runs, sessions, or executions).
- resources (xlsx, csv or json): Optional file containing information to describe resources used in the experiments (RRID, URL, vendor, version, additional metadata).
- sites (xlsx, csv or json): Conditionally Required if sharing experimental locations. It contains specific anatomical or experimental locations related to subjects or specimens. These could include locations like electrode placements, biopsy sites, or other defined locations.
- README (txt): Required file provided by Investigators that contains necessary details for reuse of the data, beyond that which is captured in structured metadata. Some information that should be included:
- How would a user use the files that are provided? e.g., "first open file X and then look at file Y."
- What additional details do users need to know? Are some subjects missing data?
- Are there warnings about how to use the data or code?
- Are there appropriate/inappropriate uses for this data?
- Are there other places that users can go for more information? e.g., did you provide a GitHub repository, or are there additional papers beyond what was provided in the metadata form?
- CHANGES (txt): Required file if a new version of the dataset is uploaded, to document any changes from the previous version.
For older versions of the SDS, the following descriptive file may also be included:
- code_parameters (xlsx, csv or json): Optional file containing information describing specific parameters to run a code.
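The required and conditionally required files listed above lend themselves to a simple presence check. The sketch below is illustrative only: the function names, the `has_subjects` and `has_samples` flags, and the assumption that the README is stored as `README.txt` at the dataset root are choices made for this example, not part of the SDS specification.

```python
from pathlib import Path

# Metadata files may be provided as xlsx, csv or json, as listed above.
EXTENSIONS = (".xlsx", ".csv", ".json")


def find_metadata_file(dataset_root, stem):
    """Return the first existing file named stem + allowed extension, or None."""
    root = Path(dataset_root)
    for ext in EXTENSIONS:
        candidate = root / f"{stem}{ext}"
        if candidate.is_file():
            return candidate
    return None


def missing_required_files(dataset_root, has_subjects=False, has_samples=False):
    """List required files that are absent, given what the experiment involved."""
    root = Path(dataset_root)
    missing = []
    if find_metadata_file(root, "dataset_description") is None:
        missing.append("dataset_description")
    if has_subjects and find_metadata_file(root, "subjects") is None:
        missing.append("subjects")
    if has_samples and find_metadata_file(root, "samples") is None:
        missing.append("samples")
    # Assumption of this sketch: the README is stored as README.txt.
    if not (root / "README.txt").is_file():
        missing.append("README")
    return missing
```

The same pattern could be extended to the other conditionally required files (for example code_description or performances) by adding further flags.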
Naming Convention
SPARC standards were designed to provide data for further research, and as such, it is imperative to use a consistent and predictable naming scheme for all files. This makes it easier not only for computers to process the data, but for other investigators to understand it.
Read on for required naming convention rules, but first, an important note: For the SDS, it is absolutely critical that the naming used within metadata files is consistent with the naming used for all folders (i.e., subject or sample names).
You can be flexible with your subject names, but you must use that same EXACT name when labeling your folders, so that we, and other investigators, can easily relate the metadata contained in the descriptive file to the contents of the folder. This is also how we map metadata records computationally with individual files. In short, consistency is critical.
Subject, Sample, and Performance Identifiers (IDs)
- Must be unique for the dataset.
- Must have prefixes: ‘sub-’ for subjects, ‘sam-’ for samples, ‘perf-’ for performances.
- The corresponding data folder names must use the exact subject, sample, and performance IDs. Failure to comply with this requirement is the largest source of errors in submitted datasets.
- Can include only alpha and numeric characters (0-9, A-Z, a-z), and the dash character (-).
- Special characters and empty spaces are not allowed.
- There is no limit on the number of characters.
Subjects, Samples, and Performances Folders Naming Constraints
- Must have prefixes: ‘sub-’ for subjects, ‘sam-’ for samples, ‘perf-’ for performances.
- Folder names must reflect EXACT subject, sample, and performance IDs. Failure to comply with this requirement is the largest source of errors in submitted datasets.
- There is no limit on the number of characters.
- Can include only alpha and numeric characters (0-9, A-Z, a-z), and the dash character (-).
- Special characters and empty spaces are NOT ALLOWED.
- Sample and performance folders should be placed inside the corresponding subject folders.
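The identifier and folder naming rules above can be expressed as two small checks. This is a hedged sketch: the function names `check_id` and `unmatched_folders` are hypothetical, and the IDs are assumed to have already been read from the metadata files into plain Python lists.

```python
import re

# Prefixes and allowed characters as stated in the rules above.
PREFIXES = {"subject": "sub-", "sample": "sam-", "performance": "perf-"}
ALLOWED = re.compile(r"[0-9A-Za-z-]+")


def check_id(entity_id: str, kind: str) -> list[str]:
    """Return a list of problems with one subject, sample or performance ID."""
    problems = []
    if not entity_id.startswith(PREFIXES[kind]):
        problems.append(f"missing '{PREFIXES[kind]}' prefix")
    if not ALLOWED.fullmatch(entity_id):
        problems.append("only 0-9, A-Z, a-z and '-' are allowed")
    return problems


def unmatched_folders(folder_names, metadata_ids):
    """Return (folders with no exactly matching ID, IDs with no folder)."""
    folders, ids = set(folder_names), set(metadata_ids)
    return sorted(folders - ids), sorted(ids - folders)
```

Because the comparison is an exact string match, a folder named `sub-02` does not match the ID `sub-2`, which is the kind of mismatch the rules above warn about.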
Folder and File Names
- When naming the dataset sub-folders (i.e., folders that are NOT mapped to SDS IDs), it is imperative to keep a consistent naming scheme.
- We STRONGLY recommend that all file names and folder names not mapped to an SDS metadata entity include only alpha and numeric characters (0-9, A-Z, a-z), and the dash character (-). Other characters are allowed, but we suggest avoiding the following characters, as they are expected to be forbidden in a future version of the SDS standard: !@#$%^&*()+=/\|"'~;:<>{}[]?. See the section Primary Folder for more details.
- There is no limit on the number of characters.
- Each data file must be listed in the main manifest with an adequate description.
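The requirement that each data file be listed in the manifest can be checked by comparing the files on disk with the manifest rows. The sketch below makes several assumptions that are not confirmed by the text above: the manifest is a CSV file named `manifest.csv`, file paths are stored in a column called `filename`, and paths are written relative to the folder containing the manifest with forward slashes.

```python
import csv
from pathlib import Path


def files_missing_from_manifest(folder, manifest_name="manifest.csv",
                                filename_column="filename"):
    """Return files found under folder that have no row in the manifest."""
    folder = Path(folder)
    with open(folder / manifest_name, newline="", encoding="utf-8") as handle:
        listed = {row[filename_column] for row in csv.DictReader(handle)}
    # Collect every file below the folder, skipping manifest files themselves.
    on_disk = {
        p.relative_to(folder).as_posix()
        for p in folder.rglob("*")
        if p.is_file() and p.name != manifest_name
    }
    return sorted(on_disk - listed)
```

An empty result means every file has a manifest row; it does not check whether each description is adequate, which still needs human review.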
Detailed Protocol
Every dataset published on the SPARC Portal MUST have a published protocol associated with it. One way to do this is to publish a protocol manuscript; another is to publish your protocol using Protocols.io. For more details about this process, visit Create Your Protocol.
SPARC Metadata
Experimental metadata specified by the SPARC Data Standards Committee is described in the SPARC Minimal Information Standards. MIS metadata fields have been incorporated into the subjects and samples templates zip file. We also provide an annotated list of these fields.
SPARC standards for optical microscopy imaging data and metadata are described in this document.