Organize Your Files

Prepare your dataset for submission to the SPARC Portal according to SPARC Standards (SDS)

This document is part of a series on the Data Submission to SPARC process.

Alright, so you’ve got your metadata, you’ve got your experimental data. And now, it’s time to get organized!

The SPARC Dataset Structure (SDS) is a required organizational system with required naming conventions to promote consistency and ease of use. So if you opted out of using SODA in the previous step, please read through this page thoroughly. As you read on, you can refer to the graphic below for a bird’s-eye view of the SDS layout, or to this presentation or this more detailed page for more in-depth info.

And, if you did use SODA for the metadata files, guess what?! You can skip this page, and move on to the final step. Because SODA will guide you through all the information below! Be sure to check that link before Disseminating Your Data, as it has some last minute checks for you!

If you will organize your datasets yourself, templates are provided in our SDS zip folder, which can be downloaded here.


Checklist


But first, let’s check to make sure you’ve done everything up to this point:

  • Talked to curation team
  • Requested access to the appropriate Pennsieve workspace
  • Experimental protocol has been created on Protocols.io
  • All required metadata files have been completed
    • Temporary link to unpublished protocol has been added to dataset_description file
  • All folders/metadata files are named as set forth in the SDS file system
  • All subject & sample names are CONSISTENT across all references in the SDS
    • All human subjects have been de-identified
  • All data, metadata and associated files/info have been organized into the SDS file system
  • All experimental data has been organized by subject and sample in the Primary Folder
  • All required top-level folders include required manifest files
  • Dataset has been uploaded onto Pennsieve
  • Verify the completeness of the upload
  • Dataset has been submitted for review

Completed all the checked boxes? Continue onward!


Top-Level Folders

All data files are organized into top-level folders. Some of these are required, and some are optional, dependent upon your dataset.

Primary Folder

This folder is required and contains all of your experimental data.

  • Examples may include time-series data, tabular data, clinical imaging data, and genomic, metabolomic, or microscopy data.
  • The data generally have been minimally processed so they are in a form ready for analysis.
  • Data is organized by subject/sample, and each subject and sample has its own unique folder within the Primary Folder.

The rules for naming folders and files are:
  • These unique folders will have a standardized name that corresponds to the exact names or IDs as referenced in the metadata file.
  • Folders named after SDS metadata identifiers (subject labels, sample labels, performance labels, etc.) can include only letters, numbers, and the dash character.
  • There are standardized prefixes for each type of data folder. Refer to the Naming Convention section below.
  • For the folder and file names within those SDS-defined folders, there are no restrictions, but we strongly suggest avoiding special characters such as !@#$%^&*()+=/|"'~. These suggestions are expected to be enforced in future versions of the SDS to enable interoperability across operating systems.

Source Folder

This folder is only required if unaltered, raw files from an experiment are included in the data. If so, they will be placed in this folder.

  • An example would be the “truly” raw k-space data for a Magnetic Resonance (MR) image that has not yet been reconstructed.
    • In this case, the reconstructed DICOM or NIFTI files would be found within the primary folder.

Please note, there are no specific requirements about folder naming inside this Source Folder. However, we recommend using the sub-, sam-, perf- structure, following the Primary Folder’s naming conventions. Refer to the Naming Convention section below.

Derivative Folder

This folder is only required if derivative data exist. If so, they will be placed in this folder.

  • Examples may include processed image stacks that are annotated via the MicroBrightField (MBF Biosciences) tools, segmentation files, or smoothed overlays of current and voltage that demonstrate a particular effect.
  • If files are converted into a format other than what was submitted, these files are included in the derivative folder.
  • Derived data should be organized into subject and sample folders, using the same subject and sample IDs as the folder names within the Primary Folder.

Code Folder

This folder is only required if code is used in the generation of the data. If so, the folder contains all the source code used in the study. Please note, if your code is on GitHub, you don’t need to share it here. Simply put the link in the related identifier field of your dataset_description file.

  • Examples may include text and source code (e.g., MATLAB), scripts to plot the results, scripts to open and read raw data files, and source code for computational models.
  • Links to supporting code that provides added value to the dataset can be included in the metadata description; that code does not have to be uploaded here.

Protocol Folder

This folder is optional and contains supplementary files to accompany the experimental protocols submitted to Protocols.io.

IMPORTANT: This is not a substitute for the experimental protocols, which must be created and shared with the SPARC or RE-JOIN workspace on Protocols.io.

Docs Folder

This folder is optional and contains all the supporting documents for the dataset.

  • An example would be a representative image for the dataset.
  • Unlike the readme file, which is a text document, docs can contain documents in multiple formats, including images.

Naming Convention

SPARC standards were designed to provide data for further research, and as such, it is imperative to use a consistent and predictable naming scheme for all files. This makes it easier not only for computers to process the data, but for other investigators to understand it.

Read on for required naming convention rules, but first, an important note: For the SDS, it is absolutely critical that the naming used within metadata files is consistent with the naming used for all folders (i.e., subject or sample names).

You can be flexible with your subject names, but you must use that same EXACT name when labeling your folders, so that we (and other investigators) can easily relate the metadata contained in the descriptive file to the contents of the folder. This is also how we map metadata records computationally to individual files. All to say: consistency is critical.

Subject, Sample, and Performance Identifiers (IDs)

  • Must be unique for the dataset.
  • Must have prefixes: ‘sub-’ for subjects, ‘sam-’ for samples, ‘perf-’ for performances.
  • The corresponding data folder names must use the exact subject, sample, and performance IDs. Failure to comply with this requirement is the largest source of errors in submitted datasets.
  • Can include only alpha and numeric characters (0-9, A-Z, a-z), and the dash character (-).
  • Special characters and empty spaces are not allowed.
  • There is no limit on the number of characters.
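
If you would like to check your IDs before filling in the metadata files, the rules above can be sketched in a few lines of Python. This is only an illustration, not the validator used by the curation team: the regular expression and the helper names `is_valid_id` and `find_duplicates` are assumptions made for this example.

```python
import re

# Assumption: one pattern covering the three prefixes named above, followed by
# at least one letter, digit, or dash. Not taken from the official SDS schema.
ID_PATTERN = re.compile(r"(sub|sam|perf)-[0-9A-Za-z-]+")


def is_valid_id(identifier: str) -> bool:
    """Return True if the ID has a required prefix and only allowed characters."""
    return ID_PATTERN.fullmatch(identifier) is not None


def find_duplicates(identifiers):
    """Return the IDs that occur more than once (IDs must be unique per dataset)."""
    seen, duplicates = set(), set()
    for identifier in identifiers:
        if identifier in seen:
            duplicates.add(identifier)
        seen.add(identifier)
    return sorted(duplicates)


print(is_valid_id("sub-001"))    # True
print(is_valid_id("sub-rat 1"))  # False, because of the empty space
```

Running every ID from your subjects and samples metadata through a check like this catches stray spaces and underscores early.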

Subjects, Samples, and Performances Folders Naming Constraints

  • Must have prefixes: ‘sub-’ for subjects, ‘sam-’ for samples, ‘perf-’ for performances.
  • Folder names must reflect EXACT subject, sample, and performance IDs. Failure to comply with this requirement is the largest source of errors in submitted datasets.
  • There is no limit on the number of characters.
  • Can include only alpha and numeric characters (0-9, A-Z, a-z), and the dash character (-).
  • Special characters and empty spaces are NOT ALLOWED.
  • Sample and performance folders should be placed inside the corresponding subject folders.
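
Since mismatched folder names are the largest source of errors, it can help to compare your folders against your metadata before uploading. Below is a minimal sketch of that comparison. The function name `check_primary_folders` and the shape of its inputs (a list of subject IDs, plus a dictionary mapping each subject ID to its sample IDs) are assumptions for illustration; you would fill them from your own metadata files. Performance folders are left out to keep the example short.

```python
from pathlib import Path


def check_primary_folders(primary_dir, subject_ids, sample_ids):
    """Compare folder names under the Primary Folder with metadata IDs.

    subject_ids: subject IDs taken from the metadata.
    sample_ids: dict mapping a subject ID to its sample IDs (hypothetical shape).
    Returns a list of problems; an empty list means no mismatch was found.
    """
    primary = Path(primary_dir)
    subject_ids = set(subject_ids)
    found_subjects = {p.name for p in primary.iterdir() if p.is_dir()}
    problems = []

    for missing in sorted(subject_ids - found_subjects):
        problems.append(f"no folder for subject {missing}")
    for extra in sorted(found_subjects - subject_ids):
        problems.append(f"folder {extra} does not match any subject ID")

    # Sample folders are expected inside the corresponding subject folder.
    for subject in sorted(subject_ids & found_subjects):
        expected = set(sample_ids.get(subject, ()))
        found = {
            p.name
            for p in (primary / subject).iterdir()
            if p.is_dir() and p.name.startswith("sam-")
        }
        for missing in sorted(expected - found):
            problems.append(f"no folder for sample {missing} in {subject}")
        for extra in sorted(found - expected):
            problems.append(f"folder {subject}/{extra} does not match any sample ID")
    return problems
```

The comparison is an exact string match on purpose: ‘sub-01’ and ‘sub-1’ count as different names, just as they would during curation.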

Folder and File Names

  • When naming the dataset sub-folders (i.e., folders that are NOT mapped to SDS IDs), it is imperative to keep a consistent naming scheme.
  • We STRONGLY recommend that all file names and folder names not mapped to an SDS metadata entity include only alpha and numeric characters (0-9, A-Z, a-z), and the dash character (-). Other characters are allowed, but we suggest avoiding the following characters, which are expected to be forbidden in a future version of the SDS standard: !@#$%^&*()+=/\|"'~;:<>{}[]? (see the Primary Folder section for more details).
  • There is no limit on the number of characters.
  • Each data file must be listed in the main manifest with an adequate description.
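
To spot the discouraged characters before they become a problem, you could scan your dataset folder with a short script. The sketch below uses the character list from the recommendation above; the helper names `discouraged_characters` and `scan_tree` are made up for this example, and nothing here is an official SPARC tool.

```python
from pathlib import Path

# Characters listed above as expected to be forbidden in a future SDS version.
DISCOURAGED = set("!@#$%^&*()+=/\\|\"'~;:<>{}[]?")


def discouraged_characters(name: str):
    """Return the sorted discouraged characters found in a file or folder name."""
    return sorted(set(name) & DISCOURAGED)


def scan_tree(root):
    """Map each relative path under root to the discouraged characters in its name."""
    root = Path(root)
    report = {}
    for path in root.rglob("*"):
        found = discouraged_characters(path.name)
        if found:
            report[path.relative_to(root).as_posix()] = found
    return report


print(discouraged_characters("trial-01.csv"))       # []
print(discouraged_characters("run#2 (final).csv"))  # ['#', '(', ')']
```

Note that this only flags the listed characters. Empty spaces are not flagged here, although they are not allowed in folder names that map to SDS IDs.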

Manifest Files

Manifest spreadsheets provide information about specific files within a folder. File-level manifest files are required in all top-level folders (listed above). In each manifest file, you will provide a brief description of the contents of the associated folder.

Please note, if you’re using SODA, it automatically creates a list of files for a manifest during the upload. But, if you are uploading datasets manually, you will need to create this file yourself for each submitted top-level folder.
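
If you are building manifests by hand, a small script can at least produce the list of files for you, so nothing gets left out. The sketch below is a starting point under stated assumptions: the output name `manifest.csv` and the column names `filename`, `description`, and `file type` are placeholders chosen for this example. Use the columns from the manifest template in the SDS zip folder, and copy the rows into that template.

```python
import csv
from pathlib import Path

# Assumed column names for illustration; follow the SDS manifest template instead.
COLUMNS = ["filename", "description", "file type"]


def build_manifest_rows(folder, manifest_name="manifest.csv"):
    """List every file under folder (except the manifest itself) as a manifest row."""
    folder = Path(folder)
    files = [
        p for p in folder.rglob("*")
        if p.is_file() and p.name != manifest_name
    ]
    files.sort(key=lambda p: p.relative_to(folder).as_posix())
    return [
        {
            "filename": p.relative_to(folder).as_posix(),
            "description": "",  # to be written by hand for each file
            "file type": p.suffix,
        }
        for p in files
    ]


def write_manifest(folder, manifest_name="manifest.csv"):
    """Write the rows to a CSV file inside folder and return its path."""
    target = Path(folder) / manifest_name
    rows = build_manifest_rows(folder, manifest_name)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return target
```

The description column is left empty on purpose: the file list can be generated, but the descriptions still need to come from you.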

Hmm, another reason to use SODA… Speaking of which…


Organizing Your Files with SODA

Alright, if you used SODA (and are still here), then it’s time to go back, upload your experimental data, and confirm your folders! Then, before Disseminating Your Data, visit the Upload Your Data page for the final step. See you soon, friend.


Organizing Your Files without SODA

If you did NOT use guided mode in SODA to create your metadata files, then you will need to organize your experimental data, and all accompanying files, into the appropriate folders of your downloaded SDS zip file. Follow these steps to do just that:

  1. Organize your experimental data in the Primary Folder by subject and sample, as seen above.
  2. Include any accompanying files that are required for your dataset (as described above) by putting them into the appropriate folder.
  3. Create a manifest file for every top-level folder.
    1. In your ZIP file, you will have a spreadsheet template for this - just copy and paste this template into every folder you are sending, and fill it out according to the contents of that folder.
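
Once you have worked through the steps above, a quick self-check can confirm that the required Primary Folder exists and that every top-level folder you included has a manifest. The sketch below assumes lowercase folder names (`primary`, `source`, `derivative`, `code`, `protocol`, `docs`) and a manifest file whose name, without its extension, is `manifest`. Both are assumptions for this example, so adjust them to match the templates in your SDS zip folder.

```python
from pathlib import Path

# Assumed on-disk names for the top-level folders described on this page.
TOP_LEVEL = ("primary", "source", "derivative", "code", "protocol", "docs")


def check_layout(dataset_root):
    """Return a list of layout problems; an empty list means none were found."""
    root = Path(dataset_root)
    problems = []
    if not (root / "primary").is_dir():
        problems.append("primary folder is missing")
    for name in TOP_LEVEL:
        folder = root / name
        if not folder.is_dir():
            continue  # optional folders may simply be absent
        has_manifest = any(
            p.is_file() and p.stem == "manifest" for p in folder.iterdir()
        )
        if not has_manifest:
            problems.append(f"{name} has no manifest file")
    return problems
```

A check like this does not replace the review by the curation team; it only catches the most common omissions before you upload.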

Additional Resources

A list of additional resources you may find useful while organizing your files:

  • If you have any questions, please don’t hesitate to reach out to us.
  • For further details about the SDS structure, feel free to refer to this page
  • Technical users can view the schema used to validate SDS here.
  • For Tools and General Resources on SPARC, click here.

Next Step

Finished organizing your files? Then the next step is STAND UP AND FIST PUMP! Because you’re a legend, first and foremost. And second, because you’re ready to upload your data!

Which you can do by clicking right here!

