Bringing Portal Data to Other Platforms: cBioPortal
Source: vignettes/bringing-portal-data-to-other-platforms-cbioportal.Rmd
Document Status: Working
Estimated Reading Time: 8 min
Special acknowledgments
Utils demonstrated in this vignette benefited greatly from code originally written by hhunterzinck.
Important note
The requirements for cBioPortal change over time, as with any software or database. The package is updated to keep pace on a yearly submission basis, so there may be occasional periods when the workflow is out of date with respect to this external system.
Intro
This vignette describes how to package Synapse processed data as a cBioPortal study dataset. A cBioPortal study contains one or more data types (see the cBioPortal docs). The current API covers creating a cBioPortal study with only the subset of data types relevant to the NF workflow, not all data types. The design is inspired by the R package usethis and should feel somewhat similar to work with: data types can be added to the study package interactively.
Though there is some checking depending on the data type, final validation with the official cBioPortal validation tools/scripts should still be run.
Breaking changes are possible as the API is still in development.
Set up
First load the nfportalutils package and log in. The recommended default usage of syn_login is to call it without passing in credentials directly. Instead, make your token available in the SYNAPSE_AUTH_TOKEN environment variable.
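As a minimal sketch of the setup described above, the token can be checked before attempting to log in. The helper has_synapse_token is hypothetical (it is not part of nfportalutils), and the login step only runs when the package is installed and the token is present.

```r
# Hypothetical helper (not part of nfportalutils): TRUE when the
# SYNAPSE_AUTH_TOKEN environment variable is set to a non-empty value.
has_synapse_token <- function() {
  nzchar(Sys.getenv("SYNAPSE_AUTH_TOKEN"))
}

if (has_synapse_token() && requireNamespace("nfportalutils", quietly = TRUE)) {
  library(nfportalutils)
  syn_login()  # called without passing credentials directly
} else {
  message("Set SYNAPSE_AUTH_TOKEN and install nfportalutils before logging in.")
}
```

Keeping the token in an environment variable (for example via ~/.Renviron) means it never has to appear in a script or in the R history.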
Create a new study dataset
First create the study dataset “package” where we can put together the data. Each study dataset combines multiple data types: clinical, gene expression, gene variants, etc. The study meta file can be edited after it has been created. Creating the study also sets the working directory to the new study directory.
cbp_new_study(cancer_study_identifier = "npst_nfosi_ntap_2022",
              name = "Plexiform Neurofibroma and Neurofibroma (Pratilas 2022)",
              type_of_cancer = "nfib", # required -- see https://oncotree.mskcc.org/
              citation = "TBD")
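Because the meta can be edited after creation, it may help to see one way of doing that programmatically. The sketch below assumes the study meta is a plain text file of "key: value" lines, as in the cBioPortal file formats. The helpers read_meta and write_meta are hypothetical, and a temporary file stands in for the real meta file so that nothing in an actual study directory is touched.

```r
# Hypothetical helper: parse "key: value" lines into a named character vector.
read_meta <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]
  keys <- trimws(sub(":.*$", "", lines))        # text before the first colon
  values <- trimws(sub("^[^:]*:", "", lines))   # text after the first colon
  stats::setNames(values, keys)
}

# Hypothetical helper: write the named vector back as "key: value" lines.
write_meta <- function(meta, path) {
  writeLines(paste0(names(meta), ": ", meta), path)
}

# Temporary stand-in for the study meta file (assumed layout).
meta_file <- tempfile(fileext = ".txt")
writeLines(c(
  "type_of_cancer: nfib",
  "cancer_study_identifier: npst_nfosi_ntap_2022",
  "name: Plexiform Neurofibroma and Neurofibroma (Pratilas 2022)",
  "citation: TBD"
), meta_file)

meta <- read_meta(meta_file)
meta[["citation"]] <- "Updated citation text"  # placeholder value
write_meta(meta, meta_file)
```

Editing the file by hand in a text editor works just as well; the point is only that the meta is ordinary text that can be revised at any time before validation.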
Add data types to study
Data types can be most easily added in any order using the cbp_add* functions. These functions download data files and create the meta for them.
Note that:
- These should be run with the working directory set to the study directory as set up above to ensure consistent metadata.
- Defaults are for known NF-OSI processed data outputs.
- If these defaults don’t apply because of changes in the scenario, take a look at the lower-level utils make_meta_* or edit the files manually afterwards.
- Data types can vary in how much additional work is needed in remapping, reformatting, custom sanity checks, etc.
Add mutations data
- maf_data references a final merged maf output file from the NF-OSI processing pipeline (vcf2maf) that is OK for public release.
- Under the hood, a required case list file is also generated.
maf_data <- "syn36553188"
cbp_add_maf(maf_data)
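To give a sense of what a case list contains, here is a rough sketch that derives one from the sample barcodes in a MAF-like table. It follows the case list fields described in the cBioPortal file format docs, but it is an illustration only and not the package's actual implementation. The helper make_sequenced_case_list, the toy table, and the sample IDs are all made up.

```r
# Hypothetical helper: build the lines of a "sequenced" case list
# from a vector of sample IDs (duplicates are dropped).
make_sequenced_case_list <- function(study_id, sample_ids) {
  sample_ids <- unique(sample_ids)
  c(
    paste0("cancer_study_identifier: ", study_id),
    paste0("stable_id: ", study_id, "_sequenced"),
    "case_list_name: Sequenced samples",
    paste0("case_list_description: Samples with mutation data (",
           length(sample_ids), " samples)"),
    paste0("case_list_ids: ", paste(sample_ids, collapse = "\t"))
  )
}

# Toy MAF-like table; gene and sample values are for illustration only.
maf <- data.frame(
  Hugo_Symbol = c("NF1", "NF1", "SUZ12"),
  Tumor_Sample_Barcode = c("sample_A", "sample_B", "sample_A")
)

case_list <- make_sequenced_case_list("npst_nfosi_ntap_2022",
                                      maf$Tumor_Sample_Barcode)
cat(case_list, sep = "\n")
```

Since cbp_add_maf generates the case list itself, a sketch like this is mainly useful for understanding or spot-checking the generated file.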
Add copy number alterations (CNA) data
- cna_data is expected to be a .seg file on Synapse.
cna_data <- "syn********"
cbp_add_cna(cna_data)
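Before adding a .seg file, a quick local header check can catch format problems early. The sketch below assumes the six column names listed for segmented data in the cBioPortal file format docs. The helper missing_seg_columns is hypothetical, and a small temporary file stands in for a real .seg file.

```r
# Column headings expected for segmented data (per the cBioPortal file formats).
seg_columns <- c("ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean")

# Hypothetical helper: return the expected columns missing from a .seg file.
missing_seg_columns <- function(path) {
  header <- names(read.delim(path, nrows = 1, check.names = FALSE))
  setdiff(seg_columns, header)
}

# Toy .seg file with made-up values, written to a temporary location.
seg_file <- tempfile(fileext = ".seg")
write.table(
  data.frame(ID = "sample_A", chrom = "17", loc.start = 100, loc.end = 200,
             num.mark = 10, seg.mean = -0.5),
  seg_file, sep = "\t", quote = FALSE, row.names = FALSE
)

missing_seg_columns(seg_file)  # empty when all expected columns are present
```

This does not replace the official validation step; it only flags an obviously wrong header before the file is packaged.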
Add expression data
- expression_data is expected to be a text file called gene_tpm.tsv on Synapse.
- The NF-OSI default also includes the raw expression data, called gene_counts.tsv, but this can be omitted.
- These NF-OSI outputs will be somewhat modified in translation to have the required headers.
mrna_data <- "syn********"
mrna_data_raw <- "syn********"
cbp_add_expression(mrna_data,
                   expression_data_raw = mrna_data_raw)
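The header modification mentioned above might look roughly like the following. This is a sketch under assumptions, not the package's actual code: it assumes the input has gene_id and gene_name columns followed by one column per sample, and that the target is a leading Hugo_Symbol column as used in cBioPortal expression files. The helper to_cbp_expression and all values in the toy table are made up.

```r
# Toy stand-in for gene_tpm.tsv; column names and values are assumptions.
tpm <- data.frame(
  gene_id = c("gene_id_1", "gene_id_2"),  # placeholder identifiers
  gene_name = c("NF1", "SUZ12"),
  sample_A = c(12.3, 4.5),
  sample_B = c(8.1, 6.7)
)

# Hypothetical helper: drop unneeded ID columns and rename the symbol
# column to Hugo_Symbol, keeping it as the first column.
to_cbp_expression <- function(df, symbol_col = "gene_name", drop_cols = "gene_id") {
  df <- df[, setdiff(names(df), drop_cols), drop = FALSE]
  names(df)[names(df) == symbol_col] <- "Hugo_Symbol"
  df[, c("Hugo_Symbol", setdiff(names(df), "Hugo_Symbol")), drop = FALSE]
}

cbp_tpm <- to_cbp_expression(tpm)
names(cbp_tpm)
```

If the actual pipeline output uses different column names, only the symbol_col and drop_cols arguments would need to change in this sketch.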
Add clinical data
- clinical_data is prepared from an existing Synapse table. The table can be a subsetted version of those released in the study dataset, or you can pass in a query that selects the subset. For example, the full clinical cohort comprises patients 1-50, but the dataset can only release expression data for patients 1-20 and CNA data for patients 15-20. Here, clinical_data can be a smaller table containing just the releasable patients, or it can be the original table with a suitable additional filter passed in, e.g. where release = 'batch1'.
- Clinical data requires mapping to be as consistent with other public datasets as possible. ref_map defines the mapping of clinical variables from the NF-OSI data dictionary to cBioPortal’s. Only variables in the mapping are exported to cBioPortal. Follow the link below to inspect the default file and format used.
clinical_data <- "select * from syn43278088" # query when the table already contains just the releasable patients
ref_map <- "https://raw.githubusercontent.com/nf-osi/nf-metadata-dictionary/main/mappings/cBioPortal/cBioPortal.yaml"
cbp_add_clinical(clinical_data, ref_map)
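When the table holds more patients than can be released, the query can carry the filter instead. The sketch below only builds the query string. The helper build_clinical_query is hypothetical, and the release column with value 'batch1' is the illustrative filter from the text, not a column known to exist in this table.

```r
# Hypothetical helper: build a Synapse table query with an optional filter.
build_clinical_query <- function(table_id, filter = NULL) {
  query <- sprintf("select * from %s", table_id)
  if (!is.null(filter) && nzchar(filter)) {
    query <- paste(query, "where", filter)
  }
  query
}

# Whole table, when it already contains just the releasable patients.
build_clinical_query("syn43278088")

# Original table plus a filter for the releasable subset (assumed column).
filtered_query <- build_clinical_query("syn43278088", "release = 'batch1'")
filtered_query
```

The resulting string can then be passed as clinical_data in place of the unfiltered query.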
Validation
Validation has to be done with a cBioPortal instance. Each portal may have specific configurations (such as genomic reference) to validate against.
For a simple example of offline validation, assuming you are at ~/datahub/public and a study folder called npst_nfosi_ntap_2022 has been placed into it, mount the dataset into the container and run validation like:
STUDY=npst_nfosi_ntap_2022
sudo docker run --rm -v $(pwd):/datahub cbioportal/cbioportal:6.0.25 validateData.py -s datahub/$STUDY -n -v
See the general docs for dataset validation for more examples.