Annotating nextflow processed data
2022-10-17
Source:vignettes/annotate-nf-processed-data.Rmd
annotate-nf-processed-data.Rmd
Intro
This vignette documents in-practice usage of the annotation utils for nf-processed data files; it will probably updated as practices and underlying functions are refined.
If you copy and paste some code to try out, you will need at least read access. The code to modify the entities’ provenance/annotations are not evaluated.
Important note
You may notice one minor difference in the annotation utils for RNA-seq expression vs variant calling data. The main difference is that one will add workflow directly, while the other doesn’t. This is a design change that will be made more consistent across the annotation functions going forward, where workflow values will be added for everything by default, and functions will be kept up to date to make consistency easy for DCC staff. For non-nf-processed data, values can still be overridden.
Set up
First load the nfportalutils
package and log in. The recommended default usage of syn_login
is to use it without directly passing in credentials. Instead, have available the SYNAPSE_AUTH_TOKEN
environment variable with your token stored therein.
nf-rnaseq
Creating helper resource files
To add annotations and provenance, we need a reference mapping of inputs and outputs involved in the workflow. The below puts together a table representing this mapping by parsing the samplesheet in combination with the output directory (“star_salmon”). For the nextflow rna-seq workflow, the samplesheet is an output file in “pipeline_info”.
sample_io <- map_sample_io(workflow = "nf-rnaseq",
samplesheet = "syn30841441", # synapse file
syn_out = "syn30840584")
Add provenance
Add provenance for the files involved using add_activity_batch
.
A workflow link is used (instead of just a name) because it is is a more specific reference with versions and others can be follow the link to get details of the workflow.
wf_link <- "https://nf-co.re/rnaseq/3.7/output#star-and-salmon"
prov <- add_activity_batch(sample_io$output_id,
sample_io$workflow,
wf_link,
sample_io$input_id)
Add annotations
Annotations expected for aligned reads output are different from quantified expression output. This is handled by a couple of specialized functions.
Aligned reads data files
First, it helps to see the template for the aligned reads(.bam
) data.
template <- "bts:ProcessedAlignedReadsTemplate"
schema <- "https://raw.githubusercontent.com/nf-osi/nf-metadata-dictionary/main/NF.jsonld"
props <- get_dependency_from_json_schema(id = template)
props
This looks like a large number of different annotations to put together (though most are optional). Fortunately, the utility annotate_aligned_reads
will try to do much of the work. It will look for the .bam
files and generate annotations for those .bam
files. The template and schema mentioned above are the defaults so don’t have to be specified explicitly. But it’s recommended to read the docs (?annotate_aligned_reads
) for details on how annotations are added, to know what remains to be filled in manually.
Annotations expected to be set manually each time include genomicReference
and workflowLink
, which we also just set in provenance. Why: - These are recommended to be surfaced both in provenance and annotations because only annotations are queryable in the portal currently. - These need to be set manually because reference and versions are likely to change between workflow runs.
l2_manifest <- annotate_aligned_reads(sample_io,
samtools_stats_file = "syn30841284",
dry_run = TRUE)
l2_manifest[, genomicReference := "GRCh38"]
l2_manifest[, workflowLink := "https://nf-co.re/rnaseq/3.7/output#star-and-salmon"]
This looks good, so we can go ahead and apply the annotations:
annotate_with_manifest(l2_manifest)
Note that instead of submitting as above, we can make this a schematic-compatible manifest to submit via DCA.
Expression data files
Another convenience function is provided to help annotate gene expression data files.
l3_manifest <- annotate_expression(sample_io, dry_run = TRUE)
l3_manifest[, workflowLink := "https://nf-co.re/rnaseq/3.7/output#star-and-salmon"]
annotate_with_manifest(l3_manifest)
Now, to create a Synapse dataset out of this set of files:
nf_star_salmon_datasets(l3_manifest,
parent = parent_project,
dry_run = FALSE)
nf-sarek
As the second part of this vignette will show, the process for variant data is not much different, though mapping inputs-outputs can be somewhat slower.
Creating helper resource files
sample_io <- map_sample_io(
workflow = "nf-sarek",
samplesheet = "sarek_JH_WU_mixed_kit_samples.csv", # local file,
syn_out = "syn52593603",
sample_level = 3
)
# Modify workflow assignment a bit
sample_io[workflow =="strelka", workflow := "Strelka2"]
sample_io[workflow =="mutect2", workflow := "Mutect2"]
Add provenance
Use the manifest to add provenance.
wf_link <- c(FreeBayes = "https://nf-co.re/sarek/3.2.3/output#freebayes",
Mutect2 = "https://nf-co.re/sarek/3.2.3/output#gatk-mutect2",
Strelka2 = "https://nf-co.re/sarek/3.2.3/output#strelka2")
add_activity_batch(sample_io$output_id,
sample_io$workflow,
wf_link[sample_io$workflow],
sample_io$input_id)
Add annotations
The annotate_called_variants
util is used for both somatic and germline results, and will automatically try to fill in the right values. However, you can set these in the parameters directly – see ?annotate_called_variants
. This returns a manifest for examination and correction has needed.
manifest <- annotate_called_variants(sample_io,
workflow_ref = wf_link[sample_io$workflow])
For example, if you used a slightly different version/fork, you should update it manually in the manifest. Otherwise, apply the annotations:
annotate_with_manifest(manifest)