Skip to contents

Document Status: Draft
Estimated Reading Time: 8 min

Special acknowledgments

Functionality demonstrated in this vignette benefited greatly from code originally written by hhunterzinck.

Intro

This describes how to package some Synapse processed data as a cBioPortal study dataset. A cBioPortal study contains one or more data types, see cBioPortal docs. The current API covers creating a cBioPortal study with a subset of data types relevant to the NF workflow (so not all data types). The design has been inspired by and should feel somewhat like working with the R package usethis, and data types can be added to the study package interactively.

Though there is some checking depending on the data type, final validation with the official cBioPortal validation tools/scripts should still be run.

Breaking changes are possible as the API is still in development.

Set up

First load the nfportalutils package and log in. The recommended default usage of syn_login is to use it without directly passing in credentials. Instead, have available the SYNAPSE_AUTH_TOKEN environment variable with your token stored therein.

Create a new study dataset

First create the study dataset “package” where we can put together the data. Each study dataset combines multiple data types – clinical, gene expression, gene variants, etc.


cbp_new_study(cancer_study_identifier = "npst_nfosi_ntap_2022",
              name = "Plexiform Neurofibroma and Neurofibroma (Pratilas 2022)",
              citation = "TBD")

Add data types to study

Data types can be most easily added in any order using the cbp_add* functions. These functions download data files and create the meta for them.

Note that:

  • These should be run with the working directory set to the study dataset directory as set up above to ensure consistent metadata.
  • Defaults are for known NF-OSI processed data outputs.
  • If these defaults don’t apply because of changes in the scenario, take a look at the lower-level utils make_meta_* or edit the files manually after.
  • Data types can vary in how much additional work is needed in remapping, reformatting, custom sanity checks, etc.

Add mutations data

  • maf_data references a final merged maf output file from the NF-OSI processing pipeline OK for public release.
  • This data file type requires no further modifications except renaming.

maf_data <- "syn36553188"

add_cbp_maf(maf_data)

Add copy number alterations (CNA) data

  • cna_data is expected to be a .seg file on Synapse.

cna_data <- "syn********"

cbp_add_cna(cna_data)

Add expression data

  • expression_data is expected to be a .txt called gene_tpm.tsv file on Synapse.
  • The NF-OSI default includes including the raw expression data as well, called gene_counts.tsv, but this can be omitted.
  • These NF-OSI outputs will be somewhat modified in translation to have the required headers.

mrna_data <- "syn********"
mrna_data_raw <- "syn********"

cbp_add_expression(mrna_data,
                   expression_data_raw = mrna_data_raw)

Add clinical data

  • clinical_data is a prepared clinical data table already subsetted to those released in this study, or pass in a query that can be used for subsetting if using a full clinical database table. For example, the full clinical cohort comprises patients 1-50, but this study dataset consists of available and releasable data only for patients 1-20 for expression data and data patients 15-20 for cna data. Here, clinical_data can be a smaller table of just those 1-30, or it can be the original table but pass in a suitable additional filter, e.g. where release = 'batch1'.
  • Clinical data requires mapping to be as consistent with other public datasets as possible. ref_map defines the mapping of clinical variables from the NF-OSI data dictionary to cBioPortal’s. Only variables in the mapping are exported to cBioPortal. Follow link below to inspect the default file and format used.
  • Clinical data should be added last for overall sample checks to work. For example, if there is expression data for patients 1-20 and cna data patients 15-20, it can more informatively warn about any missing/mismatches.

clinical_data <- "select * from syn43278088"
ref_map <- "https://raw.githubusercontent.com/nf-osi/nf-metadata-dictionary/main/mappings/cBioPortal.yaml"

cbp_add_clinical(clinical_data, ref_map)

Validation

There are additional steps such as generating case lists and validation that have to be done outside of the package with a cBioPortal backend, where each portal may have specific configurations (such as genomic reference) to validate against. See the general docs for dataset validation.

For the public portal, the suggested step using the public server is given below.

Assuming your present working directory is ~/datahub/public and a study folder called npst_nfosi_ntap_2022 has been placed into it, mount the dataset into the container and run validation like:

STUDY=npst_nfosi_ntap_2022
sudo docker run --rm -v $(pwd):/datahub cbioportal/cbioportal:5.4.7 validateStudies.py -d /datahub -l $STUDY -u http://cbioportal.org -html /datahub/$STUDY/html_report

The html report will list issues by data types to help with any corrections needed.