SPARC Changes to Published Datasets Policy

Background

A dataset that has been published, i.e., released in full to the public portal with a DOI, is a public document that is assumed to be fixed in content. Content is defined as all descriptive metadata, data files, and auxiliary files that were provided by the authors, curated, and published to the SPARC Portal as part of a SPARC dataset by the action of the dataset owner with acceptance by the SPARC Publishers. Nevertheless, under certain conditions, this content may be modified, e.g., to correct an error, to update references and citations, or to add spatially referenced data. This policy describes the conditions and procedures by which public SPARC datasets may be modified: when changes can be made, who can make them, what types of changes trigger the creation of a new DOI, and how these changes are documented. The Pennsieve platform provides the functionality, and the associated documentation, to make these changes.

| Role | Definition |
| --- | --- |
| Dataset Owner | The user within the SPARC workspace on the Pennsieve platform assigned the role of dataset owner. This is usually the PI of the grant award that funded the research to produce the data within the dataset. |
| SPARC Data Curation Team | The group of users, comprised of SPARC Curators and other DRC Data Support Staff, that assist in the data curation process. The Dataset owner grants this team Manager-level permissions to a dataset during its curation process. |
| SPARC Publishers Team | The group of users, comprised of SPARC Curators and other DRC Data Support Staff, that can accept or reject a dataset owner’s request for a dataset publication action on the SPARC Portal. This includes Request to Publish (both initial and subsequent Version), Request Revision, and Request to Remove a dataset already published to the SPARC Portal. |

Making Changes to a Published Dataset

The Pennsieve data management platform supports three post-publication actions on datasets published to the SPARC portal:

  1. Make allowable changes to the dataset
  2. List the changes in the changelog
  3. Choose the appropriate action:
    1. Request to Publish: publishes a new version of the dataset with a new DOI
    2. Request Revision: minor edits and updates to dataset description
    3. Request Removal: unpublishing datasets for the purpose of dataset removal

Detailed workflows for each of these actions are provided in the table below.

| Desired Changes | Publishing Action to initiate | Request initiated by | Request Approved by | Results in a DOI Change? |
| --- | --- | --- | --- | --- |
| Updating the dataset description metadata (i.e., dataset title, tags, adding contributors, etc.) | Request Revision | Users and Teams with Manager permissions (including SPARC Data Curation Team) | SPARC Workspace Publishers | No |
| Updating the contents of a dataset (files and/or metadata) | Request to Publish | Dataset Owner | SPARC Workspace Publishers | Yes |
| Removes dataset description and files from the SPARC Portal | Request Removal | Dataset Owner | SPARC Workspace Publishers | Removal of resolvable DOI (DOI resolves to a tombstone page) |
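
The workflow table above can be read as a simple lookup from the kind of change to the publishing action and its consequences. The following is a minimal sketch of that reading, not part of the Pennsieve platform or its API: the names `Action`, `Workflow`, `WORKFLOWS`, `workflow_for`, and the change labels used as dictionary keys are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three publishing actions named in the policy."""
    REQUEST_REVISION = "Request Revision"
    REQUEST_TO_PUBLISH = "Request to Publish"
    REQUEST_REMOVAL = "Request Removal"


@dataclass(frozen=True)
class Workflow:
    initiated_by: str
    approved_by: str
    doi_result: str


# Rows transcribed from the policy table. The keys are hypothetical labels.
WORKFLOWS = {
    "description_metadata": (
        Action.REQUEST_REVISION,
        Workflow("Users and Teams with Manager permissions",
                 "SPARC Workspace Publishers",
                 "no change"),
    ),
    "dataset_contents": (
        Action.REQUEST_TO_PUBLISH,
        Workflow("Dataset Owner",
                 "SPARC Workspace Publishers",
                 "new DOI"),
    ),
    "remove_from_portal": (
        Action.REQUEST_REMOVAL,
        Workflow("Dataset Owner",
                 "SPARC Workspace Publishers",
                 "DOI resolves to a tombstone page"),
    ),
}


def workflow_for(change: str):
    """Return the (Action, Workflow) pair for a hypothetical change label."""
    try:
        return WORKFLOWS[change]
    except KeyError:
        raise ValueError(f"unknown change type: {change!r}") from None


action, workflow = workflow_for("dataset_contents")
print(action.value, "->", workflow.doi_result)
```

Keeping the rows in one dictionary mirrors the table directly, so a policy change would only require editing a single entry.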

What is the difference between a New Version and a Revision of a Published Dataset?

Some changes to public datasets significantly change the content of a dataset. In those cases, a new version of the dataset will be published through the Request to Publish action, resulting in a new DOI. Other, smaller changes, as currently outlined in the Pennsieve documentation, are released as a revision through the Request Revision action; a revision does not change dataset content and therefore does not trigger creation of a new DOI. However, what does or does not trigger a DOI change will be aligned with emerging best practices for data publishing as they become documented.

Who can make changes?

Dataset owners and the SPARC Data Curation Team can make changes to a published dataset, as outlined below. When the dataset owner is not the PI, the PI will be responsible for approving most changes. SPARC retains the right to make certain types of changes without explicit permission of the PI or dataset owner.

I. Request to Publish a dataset must be initiated by the Dataset Owner:

  • Changes to any existing content as defined above. These include fixing typos or errors in content files, file names, or folder names.
  • Additions or removal of any content, e.g., addition of spatially registered data or removal of any files that are considered to be part of the published dataset.
  • The dataset owner is responsible for providing SPARC with a reason as to why the changes were needed and must provide a list of changes through the changelog.

II. Request Revision to a dataset can be initiated by the dataset owner or the SPARC Data Curation Team. The Pennsieve platform currently allows the following to be changed as a revision:

  • The title of the dataset
  • The contributors of a dataset
  • The summary of the dataset
  • The license of the dataset
  • The tags associated with a dataset
  • The description associated with the dataset
  • The image associated with a dataset

However, changes approved by the SPARC Data Curation Team without permission of the PI are limited to adding additional tags for search or correcting typos in the title, summary, and description associated with a dataset. They may not make substantive changes to the title, summary, and description associated with a dataset, or change the license, contributors, or the display image without the PI’s permission. Any changes that are not approved by the PI must be pushed as errata or addenda so as not to imply that they are approved by the PI.
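
The limits on what the SPARC Data Curation Team may change without the PI’s permission can be expressed as a small check. This is a sketch under stated assumptions, not Pennsieve functionality: the field names, the change kinds (`"add"`, `"typo_fix"`, `"substantive"`), and the function `allowed_without_pi` are hypothetical labels introduced here to illustrate the rule.

```python
# Fields the policy lists as changeable through a revision.
REVISION_FIELDS = {
    "title", "contributors", "summary", "license",
    "tags", "description", "image",
}

# Fields in which the curation team may correct typos without PI permission.
TYPO_FIX_FIELDS = {"title", "summary", "description"}


def allowed_without_pi(field: str, kind: str) -> bool:
    """Return True if the curation team may make this revision
    without the PI's permission.

    `kind` is a hypothetical label: "add", "typo_fix", or "substantive".
    """
    if field not in REVISION_FIELDS:
        raise ValueError(f"{field!r} cannot be changed as a revision")
    if field == "tags" and kind == "add":
        return True  # adding tags for search
    if field in TYPO_FIX_FIELDS and kind == "typo_fix":
        return True  # correcting typos in title, summary, description
    # License, contributors, image, and any substantive edit need the PI.
    return False


print(allowed_without_pi("tags", "add"))
print(allowed_without_pi("license", "substantive"))
```

Under this reading, any change for which the check returns `False` and that the PI has not approved would have to be pushed as an erratum or addendum.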

III. These changes do not require the permission of the PI:

  • Any changes to formatting, including updating to any standards or formatting of references
  • Exposure of metadata submitted by the author but not currently displayed on the SPARC portal to improve search and/or new feature availability
  • Tags applied to data for faceted search and metadata enhancements
  • Addition of new viewers or interactive tools
  • Changes to the display of datasets
  • Addition of citations of a dataset obtained through citation services
  • Reformatting of file names in accordance with standard procedures, to ensure compatibility with tools, e.g., addition of file extensions, removal of spaces.

Procedure if the author cannot be contacted:

If an issue covered in item I is reported for a dataset, e.g., an error in the data or metadata, the dataset owner will be contacted to initiate an updated version. If the dataset owner does not respond, a revision may be published as an erratum, but a new version of the dataset may not be issued (the primary dataset/latest dataset version may not be changed). Production of segmented or spatially mapped data for a given dataset is often delayed relative to the publication of the originating dataset. However, if the dataset owner cannot be contacted, then these data must be published by the curation team as a new dataset and the provenance recorded.

When is a new DOI issued?

Any of the changes outlined in Section I above, no matter how minor, to published content as defined above will create a new version of the dataset and trigger the issuing of a new DOI. As managing multiple DOIs per dataset makes it more difficult to track and use datasets, these changes should be kept to a minimum.

No changes to DOIs are triggered by updates made through the revision process outlined in Section II above. SPARC reserves the right to publish minor corrections, e.g., typos, as errata similar to journal policies.

Considerations for Computational Models:

The processes for updating and decommissioning computational models and simulations are still under development. We will build this process based on the following assumptions:

  • We distinguish between computational studies and services; services are the building blocks of the study pipeline
  • We distinguish between published studies and templates (a template is a prototype study, typically representing a general workflow, that is not accessed itself. Instead, an editable copy - itself a standard study - is created whenever the template is opened)
  • Services have version numbers in the form x.x.x (major.minor.patch as per semantic versioning)
  • Studies and templates also carry version numbers

The following approach to updates of published studies and templates is proposed:

  • When the pipeline or the content of services changes (e.g., a python script in a jupyter notebook service) the version number is updated, a branch is created, the old URL remains valid, and a new DOI is issued
  • When a new service version becomes available, services in published studies are automatically updated only if the version number changes at the patch level. If the major or minor version number changes, users accessing the study / template are notified that they are using outdated service versions. They can then choose to update any service for their own study copy (the published study / template is not changed).
  • If the owner of a published study / template chooses to update to a newer service version that has a different major/minor number, this is treated as if the service content or the pipeline had changed.
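
The proposed handling of new service versions can be illustrated with a short comparison of major.minor.patch numbers. This is only a sketch of the rule as described above, since the process is still under development: the function names `parse_version` and `update_policy` and the returned labels are hypothetical, and the code assumes plain three-part numeric versions without pre-release suffixes.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split an 'x.x.x' string into (major, minor, patch) integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def update_policy(current: str, available: str) -> str:
    """Decide what happens to a service in a published study when a
    newer service version becomes available.

    Returns one of three hypothetical labels:
      "none"         - the available version is not newer
      "auto-update"  - only the patch number changed
      "notify-user"  - major or minor changed; users are told the
                       service is outdated and may update their own copy
    """
    cur = parse_version(current)
    new = parse_version(available)
    if new <= cur:
        return "none"
    if new[:2] == cur[:2]:
        return "auto-update"
    return "notify-user"


print(update_policy("1.2.3", "1.2.4"))
print(update_policy("1.2.3", "1.3.0"))
```

If the study owner later accepts a version flagged as `"notify-user"`, the policy treats that as a content or pipeline change, so a new study version and DOI would follow.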

Removal of Published Data

Datasets will only be decommissioned in cases of fraud or extensive errors that cannot be corrected. The dataset owner initiates the Request Removal action, and the SPARC Publishers Team must approve it. In these rare cases, the DOI will resolve to a tombstone page listing the DOI as a URL. A list of datasets removed from the SPARC Portal and their Statements of Unavailability can be found below.