Tutorial: Download public data and scaffolds, and run computations

Objectives

In this tutorial, we'll walk through a step-by-step scenario of a user who wants to explore public datasets, download relevant data, and use the Open Simulation Platform for Advanced Research (o²S²PARC) to transform the data. Using the SPARC Python Client, you will go through the steps of searching for datasets, downloading and storing data, creating an o²S²PARC instance, running a data analysis pipeline and, finally, visualizing the results.

Credit: the use case presented here was inspired by one of the projects presented at the SPARC FAIR Codeathon 2022, in particular the tutorial "Mapping 2D data to a 3D organ scaffold".

Prerequisites

In order to run this tutorial, you need to have:

  • Basic python knowledge
  • Internet access
  • An active o²S²PARC account
  • An installed SPARC Python Client and other required libraries
    • You can install them by running pip install sparc.client requests pathlib tqdm from your console/command line.

Getting set up

To run a simulation, we need to configure access to osparc.io by providing access tokens. For that, you need to log into your o²S²PARC account and create credentials on the website: navigate to your user profile (click on the image in the upper right corner) and choose Preferences. Then open the API tab and generate an api_key and api_secret - here is how to do it. The values will be displayed only once. More information can be found in the Generating o²S²PARC Tokens section here.

Next, please create a text file config.ini in your working directory and copy the value from the Key field into the O2SPARC_USERNAME variable, and the value from the Secret field into O2SPARC_PASSWORD:

[global]
default_profile=prod

[prod]
pennsieve_profile_name=prod
O2SPARC_HOST=https://api.osparc.io
O2SPARC_USERNAME=<YOUR_API_KEY>
O2SPARC_PASSWORD=<YOUR_API_SECRET>

Your config.ini file needs to have 2 sections: global, with the name of the default profile to use (prod), and a section named after default_profile with 4 variables: pennsieve_profile_name, O2SPARC_HOST, O2SPARC_USERNAME and O2SPARC_PASSWORD.

Note: please do not use single (') or double (") quotation marks in the configuration file, just insert plain text.
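
Before moving on, you can sanity-check the file with Python's standard configparser. This is a minimal sketch: it only verifies that the expected sections and keys are present, not that the credentials themselves are valid.

import configparser

config = configparser.ConfigParser()
config.read('config.ini')

# The profile section named in [global] must exist and contain all 4 variables
profile = config['global']['default_profile']
for key in ('pennsieve_profile_name', 'O2SPARC_HOST', 'O2SPARC_USERNAME', 'O2SPARC_PASSWORD'):
    assert key in config[profile], f"Missing {key} in [{profile}]"
print("config.ini looks complete")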

Searching for Published Datasets

This section introduces metadata search services which query an Elasticsearch endpoint. This endpoint contains the same metadata that powers search on the SPARC portal. There are two basic ways to retrieve information on SPARC datasets:

  1. Retrieve a full list of datasets and associated metadata
  2. Submit a valid Elasticsearch query to retrieve selected dataset information

Query results are returned as JSON in Elasticsearch format. Information on specific fields and content can be found in the K-Core API Handbook.

In order to get started, you must obtain an API Key, which you can register* for. Instructions for registering and receiving an API key are available as a walk-through tutorial. This API Key must be added to the [prod] section of your config.ini file as 1 variable: scicrunch_api_key

[global]
default_profile=prod

[prod]
pennsieve_profile_name=prod
scicrunch_api_key=<YOUR_API_KEY>

*Note: After registering for your API Key, it may take 15 minutes for the key to sync with the API service endpoints.

Get a list of datasets to process or download

The following example retrieves a list of datasets and associated metadata. Getting a list of datasets will also allow you to view the JSON structure and content of the metadata.

from sparc.client import SparcClient

# Connect to the SPARC services using the credentials in config.ini
client = SparcClient(connect=False, config_file='config.ini')

# Get a list of datasets from Elasticsearch
response = client.metadata.list_datasets()

number_of_records = response['hits']['total']
# Can now process individual records as needed
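
To get a feel for the metadata, you can loop over the returned hits. The layout below (hits, _id, _source) is the standard Elasticsearch response structure; the fields available inside each record depend on the SPARC metadata schema, so print one record to explore it.

import json

# Each hit carries an _id plus a _source document with the dataset metadata
for hit in response['hits']['hits']:
    print(hit['_id'])

# Print one full record to discover the available metadata fields
print(json.dumps(response['hits']['hits'][0], indent=2))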

Search datasets via a POST-based query

This is a basic search for datasets based on their dataset identifiers (in this example we will retrieve SPARC datasets 10, 11, and 12, which are used in other sections of this tutorial). The query is created as a JSON body that is submitted via POST. The dataset identifiers used in the example below refer to:

  1. Dataset with ID 10: Spatial distribution and morphometric characterization of vagal afferents associated with the myenteric plexus of the rat stomach
  2. Dataset with ID 11: Spatial distribution and morphometric characterization of vagal afferents (intramuscular arrays (IMAs)) within the longitudinal and circular muscle layers of the rat stomach
  3. Dataset with ID 12: Spatial distribution and morphometric characterization of vagal efferents associated with the myenteric plexus of the rat stomach
import json

# ES search via a JSON query body
body = '{"query": {"terms": {"_id": ["10", "11", "12"]}}}'
body_json = json.loads(body)

response = client.metadata.search_datasets(body_json)

number_of_records = response['hits']['total']
# Can now process individual records as needed

The query can be any valid Elasticsearch query. For more information on Elasticsearch queries, please consult the Elasticsearch Query DSL documentation. Examples of SPARC queries can be found in the K-Core API Documentation.
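
For example, here is a free-text search. This is a minimal sketch: it uses the generic Elasticsearch query_string clause rather than any SPARC-specific field, so adapt it to the fields documented in the K-Core API Documentation.

# Hypothetical free-text query: any valid Elasticsearch DSL body works here
body_json = {
    "size": 10,
    "query": {"query_string": {"query": "rat stomach"}}
}
response = client.metadata.search_datasets(body_json)
print(response['hits']['total'])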

Once you have a set of datasets and associated data you can now use DAT-CORE services (described in the next section) to download files from a dataset.

Downloading files

After installing sparc.client and the other required libraries, we will download the datasets required for the analysis. We need the following datasets, which contain 3D coordinates:

  1. Dataset with ID 10: Spatial distribution and morphometric characterization of vagal afferents associated with the myenteric plexus of the rat stomach
  2. Dataset with ID 11: Spatial distribution and morphometric characterization of vagal afferents (intramuscular arrays (IMAs)) within the longitudinal and circular muscle layers of the rat stomach
  3. Dataset with ID 12: Spatial distribution and morphometric characterization of vagal efferents associated with the myenteric plexus of the rat stomach
from sparc.client import SparcClient
client = SparcClient(connect=False, config_file='config.ini')  # We assume config.ini is in the current directory

# Search for the relevant files
ima_data = client.pennsieve.list_files(dataset_id=11, query="files/derivative/IMA_analyzed_data.xlsx")  # Derivative file from the SPARC Portal
efferent_data = client.pennsieve.list_files(dataset_id=12, query="files/derivative/Efferent_data.xlsx")  # Derivative file from the SPARC Portal
igle_data = client.pennsieve.list_files(dataset_id=10, query="files/derivative/IGLE_data.xlsx")  # Derivative file from the SPARC Portal

# Download the files
client.pennsieve.download_file(file_list=ima_data[0])
client.pennsieve.download_file(file_list=efferent_data[0])
client.pennsieve.download_file(file_list=igle_data[0])
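
If a download fails, it can help to inspect what list_files actually returned before downloading. The exact metadata keys come from the Pennsieve API, so print a record rather than assuming them:

# Each entry is a metadata record describing one matching file
print(ima_data[0])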

We also need 2 additional files:

  • A zip file containing the Python code you want to run and a requirements.txt for your additional Python packages.
  • The 3D mesh of an organ, in .stl format. It can be downloaded here.
import requests

# Make an HTTP request for the remote file containing the code
code_input = requests.get('https://raw.githubusercontent.com/elisabettai/sparc.client/main/docs/tutorial-readme-assets/code_input.zip')
code_input.raise_for_status()

# Save file data to a local copy
with open('code_input.zip', 'wb') as file:
    file.write(code_input.content)

# Download the organ mesh for the simulation
organ_mesh = requests.get('https://raw.githubusercontent.com/elisabettai/sparc.client/main/docs/tutorial-readme-assets/scaffold_zinc_graphics.stl')
organ_mesh.raise_for_status()

# Save file data to a local copy
with open('scaffold_zinc_graphics.stl', 'wb') as file:
    file.write(organ_mesh.content)

# List the current directory to verify the downloads (portable alternative to !dir)
import os
print(sorted(os.listdir('.')))

At this point you should see 6 files in your current directory: config.ini, Efferent_data.xlsx, IGLE_data.xlsx, IMA_analyzed_data.xlsx, code_input.zip and scaffold_zinc_graphics.stl.
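
You can also check this programmatically; here is a minimal sketch using only the standard library:

from pathlib import Path

expected = ['config.ini', 'Efferent_data.xlsx', 'IGLE_data.xlsx',
            'IMA_analyzed_data.xlsx', 'code_input.zip', 'scaffold_zinc_graphics.stl']
missing = [name for name in expected if not Path(name).is_file()]
assert not missing, f"Missing files: {missing}"
print("All 6 input files are present")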

Spinning Up an o²S²PARC Instance

Let's now confirm that we can connect to o²S²PARC. We need to import the o²S²PARC module from the SPARC Client and check that we are connected.

from sparc.client import SparcClient
from sparc.client.services.o2sparc import (
  O2SparcService,
  O2SparcSolver
)

o2sparc: O2SparcService = client.o2sparc
  
# Check that you are connected to o²S²PARC (you should see your e-mail address)
print(o2sparc.get_profile())

If the response is the e-mail that we registered with, we can start our simulation!

Simulation Creation for the Downloaded Datasets

In this section we will use the o²S²PARC Service Python Runner to execute Python code on the SPARC datasets that we have just downloaded. The o²S²PARC Python Runner executes the code to transform the 2D data from the portal, expressed as distances in percentages, into 3D data that can be visualized on an organ scaffold of the rat stomach.

The inputs to the Python Runner, for this particular use-case, are:

  • A zip file containing the Python code you want to run and a requirements.txt for your additional Python packages (it can be downloaded here).
  • The three data files downloaded in the previous section with the DAT-Core functionality.
  • The 3D mesh of an organ, in .stl format. It can be downloaded here.

The output is a figure showing a 2D projection of the input anatomical data onto the rat stomach scaffold.

from pathlib import Path
from tqdm import tqdm
from time import sleep
import zipfile
import shutil

tmp_dir = Path().absolute()  # Use the current working directory, where the input files were saved

input_file_1: Path = tmp_dir / "code_input.zip"  # Zip file containing a Python script to be executed and a requirements.txt for additional Python packages
input_file_2: Path = tmp_dir / "IMA_analyzed_data.xlsx"  # Derivative file from the SPARC Portal
input_file_3: Path = tmp_dir / "Efferent_data.xlsx"  # Derivative file from the SPARC Portal
input_file_4: Path = tmp_dir / "IGLE_data.xlsx"  # Derivative file from the SPARC Portal
input_file_5: Path = tmp_dir / "scaffold_zinc_graphics.stl"  # Stomach surface mesh
job: dict = {
    "input_5": input_file_5,
    "input_4": input_file_4,
    "input_3": input_file_3,
    "input_2": input_file_2,
    "input_1": input_file_1
}
solver: O2SparcSolver = o2sparc.get_solver(solver_key="simcore/services/comp/osparc-python-runner", solver_version="1.2.0")

job_id = solver.submit_job(job)

# Poll the job and show a progress bar until it completes
pbar = tqdm(total=1.0)
progress: float = 0
while not solver.job_done(job_id):
    sleep(1)
    new_progress = solver.get_job_progress(job_id)
    if new_progress > progress:
        pbar.update(new_progress - progress)
        progress = new_progress
pbar.close()

# Get results
res = solver.get_results(job_id)

# The oSPARC Python Runner service has 1 output
output_path = res['output_1']

# The output_path contains a zip archive, let's extract it
with zipfile.ZipFile(output_path, "r") as zip_ref:
    zip_ref.extractall("computation_output")

# Copy the content of the archive to the current working directory for convenience
shutil.copyfile("computation_output/data_projected_on_scaffold.png", "data_projected_on_scaffold.png")

If the code ran successfully, you will see a file data_projected_on_scaffold.png appear in your current directory.

This is how the output figure will look.
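
If you are working in a Jupyter notebook, you can display the figure inline (assuming IPython is available, as it is in any notebook environment):

from IPython.display import Image

# Render the generated figure inline
Image('data_projected_on_scaffold.png')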

Troubleshooting

If you don't get the expected output, you can retrieve the o²S²PARC job logs with:

from tempfile import TemporaryDirectory

# Fetch the job log into a temporary directory and print every log file
print("job log:")
log_dir: TemporaryDirectory = solver.get_job_log(job_id)
for elm in Path(log_dir.name).rglob("*"):
    if elm.is_file():
        print(elm.read_text())