Intro

This vignette documents in-practice usage of the annotation utils for nf-processed data files; it will probably be updated as practices and underlying functions are refined.

If you copy and paste some code to try out, you will need at least read access. The code that modifies the entities’ provenance/annotations is not evaluated.

Important note

You may notice one difference between the annotation utils for RNA-seq expression data and those for variant calling data: one adds workflow values directly, while the other doesn’t. This design will be made more consistent across the annotation functions going forward: workflow values will be added for everything by default, and functions will be kept up to date so that consistency is easy for DCC staff. For non-nf-processed data, values can still be overridden.

Set up

First load the nfportalutils package and log in. The recommended default usage of syn_login is to use it without directly passing in credentials. Instead, have available the SYNAPSE_AUTH_TOKEN environment variable with your token stored therein.
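
The setup described above can be sketched as follows. This is a minimal sketch, not the vignette's original chunk: the token check and the fallback messages are additions for illustration, and `syn_login()` is called without arguments on the assumption that it picks up SYNAPSE_AUTH_TOKEN as recommended.

```r
# Check that the token is available before attempting to log in
token_set <- nzchar(Sys.getenv("SYNAPSE_AUTH_TOKEN"))

if (!token_set) {
  message("SYNAPSE_AUTH_TOKEN is not set; add it to your environment (e.g. ~/.Renviron) first.")
} else if (requireNamespace("nfportalutils", quietly = TRUE)) {
  library(nfportalutils)
  syn_login()  # no credentials passed; assumed to rely on SYNAPSE_AUTH_TOKEN
} else {
  message("nfportalutils is not installed.")
}
```

Keeping the token in the environment means it never has to appear in scripts or in your R history.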

nf-rnaseq

Creating helper resource files

To add annotations and provenance, we need a reference mapping of inputs and outputs involved in the workflow. The code below puts together a table representing this mapping by parsing the samplesheet in combination with the output directory (“star_salmon”). For the nextflow rna-seq workflow, the samplesheet is an output file in “pipeline_info”.

sample_io <- map_sample_io(workflow = "nf-rnaseq",
                           samplesheet = "syn30841441", # synapse file
                           syn_out = "syn30840584")

Add provenance

Add provenance for the files involved using add_activity_batch.

A workflow link is used (instead of just a name) because it is a more specific reference that includes the version, and others can follow the link to get details of the workflow.


wf_link <- "https://nf-co.re/rnaseq/3.7/output#star-and-salmon"
prov <- add_activity_batch(sample_io$output_id, 
                           sample_io$workflow, 
                           wf_link,
                           sample_io$input_id)

Add annotations

Annotations expected for aligned reads output are different from quantified expression output. This is handled by a couple of specialized functions.

Aligned reads data files

First, it helps to see the template for the aligned reads (.bam) data.


template <- "bts:ProcessedAlignedReadsTemplate"
schema <- "https://raw.githubusercontent.com/nf-osi/nf-metadata-dictionary/main/NF.jsonld"
props <- get_dependency_from_json_schema(id = template)
props

This looks like a large number of different annotations to put together (though most are optional). Fortunately, the utility annotate_aligned_reads will try to do much of the work: it looks for the .bam files and generates annotations for them. The template and schema mentioned above are the defaults, so they don’t have to be specified explicitly. It is still recommended to read the docs (?annotate_aligned_reads) for details on how annotations are added and what remains to be filled in manually.

Annotations expected to be set manually each time include genomicReference and workflowLink, which we also just set in provenance. Why:

- These are recommended to be surfaced both in provenance and annotations because only annotations are queryable in the portal currently.
- These need to be set manually because reference and versions are likely to change between workflow runs.


l2_manifest <- annotate_aligned_reads(sample_io,
                                      samtools_stats_file = "syn30841284",
                                      dry_run = TRUE)

l2_manifest[, genomicReference := "GRCh38"]
l2_manifest[, workflowLink := "https://nf-co.re/rnaseq/3.7/output#star-and-salmon"]

This looks good, so we can go ahead and apply the annotations:
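
The apply step is not evaluated in this vignette. Before applying, it can help to confirm that the manually set fields are filled in for every row. Below is a minimal sketch on a toy manifest: the helper `missing_manual_fields`, the `entityId` column, and the IDs are hypothetical, while genomicReference and workflowLink are the fields set above. The check uses base R only, so it should work on a data.table manifest as well.

```r
# Return the names of manually set fields that are absent or incomplete
missing_manual_fields <- function(manifest,
                                  fields = c("genomicReference", "workflowLink")) {
  incomplete <- vapply(fields, function(f) {
    !f %in% names(manifest) || anyNA(manifest[[f]]) || any(manifest[[f]] == "")
  }, logical(1))
  fields[incomplete]
}

# Toy manifest with one missing workflowLink (IDs are made up)
toy_manifest <- data.frame(
  entityId = c("syn0000001", "syn0000002"),
  genomicReference = c("GRCh38", "GRCh38"),
  workflowLink = c("https://nf-co.re/rnaseq/3.7/output#star-and-salmon", NA),
  stringsAsFactors = FALSE
)

missing_manual_fields(toy_manifest)
```

Once the check comes back empty, the manifest would presumably be submitted with annotate_with_manifest(l2_manifest), by analogy with the expression data section; that call is an assumption here, since the original chunk is not shown.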

Note that instead of submitting as above, we can make this a schematic-compatible manifest to submit via DCA.

Expression data files

Another convenience function is provided to help annotate gene expression data files.


l3_manifest <- annotate_expression(sample_io, dry_run = TRUE)
l3_manifest[, workflowLink := "https://nf-co.re/rnaseq/3.7/output#star-and-salmon"]
annotate_with_manifest(l3_manifest)

Now, to create a Synapse dataset out of this set of files:


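# parent_project must be defined beforehand
# (presumably the Synapse ID of the project that will hold the dataset)
nf_star_salmon_datasets(l3_manifest, 
                        parent = parent_project,
                        dry_run = FALSE)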

nf-sarek

As the second part of this vignette will show, the process for variant data is not much different, though mapping inputs-outputs can be somewhat slower.

Creating helper resource files


sample_io <- map_sample_io(
  workflow = "nf-sarek",
  samplesheet = "sarek_JH_WU_mixed_kit_samples.csv", # local file
  syn_out = "syn52593603",
  sample_level = 3
)

# Modify workflow assignment a bit
sample_io[workflow == "strelka", workflow := "Strelka2"]
sample_io[workflow == "mutect2", workflow := "Mutect2"]

Add provenance

Use the sample_io mapping to add provenance, with one workflow link per variant caller.


wf_link <- c(FreeBayes = "https://nf-co.re/sarek/3.2.3/output#freebayes",
             Mutect2 = "https://nf-co.re/sarek/3.2.3/output#gatk-mutect2",
             Strelka2 = "https://nf-co.re/sarek/3.2.3/output#strelka2")

add_activity_batch(sample_io$output_id, 
                   sample_io$workflow, 
                   wf_link[sample_io$workflow], 
                   sample_io$input_id)
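
The expression `wf_link[sample_io$workflow]` relies on indexing a named vector by name, which is why the workflow values were renamed earlier to match the names in `wf_link`. Below is a small base R sketch of that lookup; the `workflow` vector is a made-up stand-in for `sample_io$workflow`, with one value deliberately left unrenamed.

```r
wf_link <- c(FreeBayes = "https://nf-co.re/sarek/3.2.3/output#freebayes",
             Mutect2 = "https://nf-co.re/sarek/3.2.3/output#gatk-mutect2",
             Strelka2 = "https://nf-co.re/sarek/3.2.3/output#strelka2")

# Stand-in for sample_io$workflow; "strelka" was not renamed to "Strelka2"
workflow <- c("Strelka2", "Mutect2", "strelka")

# One link per row, looked up by name; names that do not match yield NA
links <- unname(wf_link[workflow])

# Flag rows that would receive no workflow link
unmatched <- workflow[is.na(links)]
unmatched
```

Checking for NA links before calling add_activity_batch is a cheap way to catch workflow values that do not match any name in `wf_link`.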
   

Add annotations

The annotate_called_variants util is used for both somatic and germline results, and will automatically try to fill in the right values. However, you can set these in the parameters directly; see ?annotate_called_variants. This returns a manifest for examination and correction as needed.


manifest <- annotate_called_variants(sample_io, 
                                     workflow_ref = wf_link[sample_io$workflow])

For example, if you used a slightly different version/fork, you should update it manually in the manifest. Otherwise, apply the annotations:
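
A manual correction of that kind might look like the sketch below. It runs on a toy table rather than the real manifest: the column names `workflow` and `workflowLink` are assumed for illustration, and the fork URL is hypothetical.

```r
# Toy stand-in for the manifest returned by annotate_called_variants
toy_manifest <- data.frame(
  workflow = c("Mutect2", "Strelka2"),
  workflowLink = c("https://nf-co.re/sarek/3.2.3/output#gatk-mutect2",
                   "https://nf-co.re/sarek/3.2.3/output#strelka2"),
  stringsAsFactors = FALSE
)

# Hypothetical link to the fork used for the Strelka2 runs
fork_link <- "https://example.org/my-sarek-fork"

# Update only the rows produced by the forked workflow
is_strelka <- toy_manifest$workflow == "Strelka2"
toy_manifest$workflowLink[is_strelka] <- fork_link
# data.table equivalent: manifest[workflow == "Strelka2", workflowLink := fork_link]
```

The corrected manifest would then presumably be applied the same way as in the nf-rnaseq section, with annotate_with_manifest(manifest).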