File Formats Accepted by SPARC

The list of file extensions below is thorough but not exhaustive. New file types will continue to emerge as technology advances. If you encounter a file extension that is not documented here, please contact the curation team ([email protected]) to discuss it. The DRC is currently discussing how to accommodate OME-Zarr.

File Type

Required

Preferred

Will accept

Will not accept

BRAIN Initiative

Cell Sorting

.fcs

Code

Python, Matlab, R

Java, C/C++, Octave, OpenGL, and Fortran

Documents

.txt, .md

.docx, .pdf, .rtf, .odt, .ods, or any format fully supported as a source for pandoc

.pages, .doc

Figures

.tif, .tiff, .jpg, .jpeg, .png

.svg

.cdr or any proprietary format

Generic Data

.hdf5, .mat, .xml, .json

Generic Images

JPEG2000, OMETIFF

Any that can be converted by Microfile+

Microscopy Image data (raw, primary or derivative)

JPEG2000, OMETIFF [1]

Terabyte and large volume compatible formats

Any that can be converted by Bio-Formats [2]
Any that can be converted by Microfile+

.sws

Morphology

.xml

Segmentation and skeletonization: .xml, .swc, .roi, .ims; any format that can currently be converted by MBF tools; open formats for MRI data

.swc Morphology

Neuroimaging

.nifti

.DICOM

Required by BIDS

Presentation

.ppt or .pptx

.key

Sequence

.fastq

.bam, .vcf, .cram

.fasta

“If transcriptomic data is available, fastq files are the minimum version of data to deposit. BAM files are optional.”

System files

.ini, .db

Tabular

.csv [3], .tsv

.xlsx, .ods

.numbers, .xls

Time series

.nwb 2.0 [4]

Any format that can be converted using Neo (including generic open formats, e.g., .csv)

.json, .adi

.nwb 2.0

Vector drawings

.svg

Video

.mp4

.avi, .webm, .ogv

[1] https://docs.openmicroscopy.org/ome-model/6.1.0/ome-tiff/

[2] https://www.openmicroscopy.org/bio-formats/

[3] Best practices for formatting tabular data: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510

[4] https://nwb-schema.readthedocs.io/en/latest/

Notes on File Formats

Converted files

SPARC policy is to store native files acquired by the investigator in the Source folder as raw data. All data converted into an acceptable format will be in the Primary data folder.

Compressed files

Files are currently uploaded as .zip, .gz, or .bz2 archives. Any compression used must be lossless.
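One way to check that a compression step is lossless is to compress, decompress, and compare checksums. The following is a minimal sketch using only the Python standard library; the helper names and the sample payload are illustrative assumptions, not part of any SPARC tooling.

```python
import bz2
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def is_lossless_round_trip(original: bytes, compress, decompress) -> bool:
    """Compress, then decompress, and confirm the bytes are unchanged."""
    restored = decompress(compress(original))
    return sha256(restored) == sha256(original)


# Hypothetical payload standing in for the contents of a data file.
payload = b"time,voltage\n0.0,1.25\n0.1,1.31\n" * 100

gzip_ok = is_lossless_round_trip(payload, gzip.compress, gzip.decompress)
bz2_ok = is_lossless_round_trip(payload, bz2.compress, bz2.decompress)
```

The same comparison can be run on real files by reading them in binary mode before and after the round trip.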

Custom formats

The curation team will match all data files against the allowable file types and extensions.

  • If it is a custom file format built on a generic data format (e.g., json), the underlying format must be identified in documentation, and that documentation must accompany the files.
  • If the format is truly custom, that is, not built on a pre-existing format, it must be convertible into one of the open formats and should be accompanied by code that can perform this conversion and open the files. Code should be deposited to SPARC in the **Code** folder, and a link to any appropriate code repositories should be provided.
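Before submission, investigators may find it useful to screen a dataset for extensions that are likely to prompt questions from curators. The sketch below assumes a small, hypothetical allowlist drawn from a few extensions named in the table; it is not the official list and does not handle compound suffixes such as .csv.gz.

```python
from pathlib import PurePath

# Hypothetical allowlist for illustration only; the table above is authoritative.
ALLOWED = {".csv", ".tsv", ".txt", ".md", ".xml"}


def unrecognized_files(paths, allowed=ALLOWED):
    """Return the paths whose final extension is not in the allowlist."""
    flagged = []
    for path in paths:
        # Compare case-insensitively so that DATA.CSV matches .csv.
        if PurePath(path).suffix.lower() not in allowed:
            flagged.append(path)
    return flagged


flagged = unrecognized_files(["primary/data.CSV", "docs/notes.md", "primary/rec.xyz"])
```

Any file this flags would either need documentation of its underlying format or conversion code, as described in the two cases above.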

Image data

We recommend that all raw, primary, or derivative image data for SPARC be made available in JPEG2000 and OMETIFF. Acceptable formats are those that can currently be converted by MBF tools. MBF offers a free tool, Microfile+, that converts a number of microscopy image types to the SPARC standardized formats; the supported types are listed on the Microfile+ site. The converted files should be stored in the derivative data folder. Note that many investigators also submit supporting images or figures. These will not be converted but must be in an acceptable format.

Image data: Convert to fit SPARC Standards

If you have image data, see "Microfile+: Convert Microscopy Image Data and Metadata to SPARC Standard Formats" for details on ensuring that microscopy data displayed on the SPARC Portal carries the relevant context and meets imaging metadata standards.


Morphology data

Morphology data is defined here as segmented data (e.g., surfaces, polygons, skeletons, binary masks) generated from primary imaging data (e.g., a neuron tree structure).

For neuron tree structures, MBF XML is a continuously supported and developed format that conforms to standard XML markup rules. As such, the data content is both machine- and human-readable and can be accessed using standard XML libraries and tools. SWC is also an open format and can represent tree structures, but at the time of writing it has limited or no support for other specific structures (spines, somas, varicosities, puncta, and blood vessels) or for annotation elements such as text, markers, and regions. MBF tools can read SWC and convert it to XML.
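To make the SWC limitation concrete, the sketch below parses the conventional seven-column SWC layout (node id, type code, x, y, z, radius, parent id) into a parent-to-children map. The sample tree and the helper names are hypothetical; the point is that each row carries only geometry, a type code, and a parent link, with no field for text annotations or markers.

```python
from collections import defaultdict
from typing import NamedTuple


class SwcNode(NamedTuple):
    node_id: int
    type_code: int
    x: float
    y: float
    z: float
    radius: float
    parent_id: int  # -1 marks a root node


def parse_swc(text: str) -> dict[int, SwcNode]:
    """Parse SWC text into a mapping of node id to node record."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        n, t, x, y, z, r, p = line.split()[:7]
        nodes[int(n)] = SwcNode(
            int(n), int(t), float(x), float(y), float(z), float(r), int(p)
        )
    return nodes


def children_by_parent(nodes: dict[int, SwcNode]) -> dict[int, list[int]]:
    """Group node ids under their parent id to expose the tree structure."""
    children = defaultdict(list)
    for node in nodes.values():
        children[node.parent_id].append(node.node_id)
    return dict(children)


# Hypothetical four-node tree: one root with a branch that forks once.
sample = """\
# illustrative SWC content
1 1 0.0 0.0 0.0 5.0 -1
2 3 5.0 0.0 0.0 1.0 1
3 3 10.0 2.0 0.0 0.8 2
4 3 10.0 -2.0 0.0 0.8 2
"""
tree = children_by_parent(parse_swc(sample))
```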


Time series data

NWB is our recommended format for time series data because it is a BRAIN Initiative standard, is supported by the DANDI archive and the Allen Institute for Brain Science, and is endorsed by the INCF.

SPARC will take a phased approach to adopting NWB. SPARC investigators are not yet submitting data in the .nwb format. Moreover, we have identified more than 20 different file extensions for physiological data coming from multiple commercial and open source packages, including Spike2, Neuralynx, BrainVision, Synapse, NeuroLexus, Plexon, and Axon pCLAMP.

At this time, we accept file types for which a converter into NWB is available, so that users are not locked into a proprietary system. The NWB site lists recommended converters.

Recommendations

  • In the short term, SPARC will accept time series data in both proprietary formats and open formats, as long as the proprietary format can be converted to NWB using an available tool.
  • The tools used to produce the data must be provided as metadata.
  • A list of formats currently supported by SpikeInterface is available on its site.
  • In the long term, SPARC will consider requiring all time series data be submitted in NWB format.
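The short-term rules above can be sketched as a simple pre-submission check: a file passes if it is already .nwb or has a known converter, and if the producing tool is recorded. The registry below is a hypothetical, user-maintained mapping (the entries name reader classes from the Neo library as examples); it is an assumption for illustration and should be verified against each converter's own documentation.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical registry: extension -> reader believed to handle it.
# This is not an official SPARC or NWB list.
CONVERTIBLE = {
    ".abf": "neo AxonIO",
    ".plx": "neo PlexonIO",
    ".smr": "neo Spike2IO",
}


def review_time_series(filename: str, producing_tool: Optional[str]) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    ext = PurePath(filename).suffix.lower()
    if ext != ".nwb" and ext not in CONVERTIBLE:
        problems.append(f"no known NWB converter for '{ext}'")
    if not producing_tool:
        problems.append("producing tool missing from metadata")
    return problems


ok = review_time_series("session1.abf", "pCLAMP")
bad = review_time_series("session1.xyz", None)
```

Returning a list of problems rather than a single yes/no keeps both requirements visible when a file fails on more than one count.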

Sequence data

The FASTQ file is required. Beyond that, it is up to the investigator to include as much of the intermediate data as replication requires.


System files

Investigators should remove system-generated files (e.g., .ini, .db) from their datasets before release to the public.
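A short script can help locate such files before release. The sketch below assumes the two example extensions named above and a hypothetical function name; it only reports candidates rather than deleting them, since a .db file could also be a legitimate database that belongs in the dataset.

```python
from pathlib import Path

# The two example extensions named above; extend as needed.
SYSTEM_SUFFIXES = {".ini", ".db"}


def find_system_files(dataset_root: Path) -> list[Path]:
    """List files under dataset_root whose extension suggests a system file."""
    return sorted(
        p
        for p in dataset_root.rglob("*")
        if p.is_file() and p.suffix.lower() in SYSTEM_SUFFIXES
    )
```

For example, `find_system_files(Path("my_dataset"))` would return paths such as `my_dataset/desktop.ini` if they exist, leaving the decision to remove them with the investigator.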