Organize variant call files from Nextflow Sarek into 3-4 datasets, grouping files by variant type and workflow. Dataset titles have the format "type Genomic Variants - workflow Pipeline", e.g. "Somatic Genomic Variants - Strelka Pipeline". This assumes that you want datasets that segregate Somatic and Germline calls, which makes sense for NF because Germline calls can be treated differently. The function uses the latest version of all files and creates a Draft version of each dataset.

Usage

nf_sarek_datasets(
  output_map,
  parent,
  workflow = c("FreeBayes", "Mutect2", "Strelka", "DeepVariant"),
  verbose = TRUE,
  dry_run = TRUE
)

Arguments

output_map

The data.table returned from map_sample_output_sarek. See details for alternatives.

parent

Synapse id of parent project where the dataset will live.

workflow

One of the workflows used.

verbose

Optional, whether to be verbose – defaults to TRUE.

dry_run

If TRUE (the default), don't actually store the dataset; just return the data object for inspection or further modification.

Value

A list of dataset objects.

Details

Grouping the files requires only the Synapse entity id, variant type, and workflow. Instead of getting this info by running map_* as in the example, you may prefer to use a fileview. In that case, download a table from a fileview that has id (as output_id) plus the dataType and workflow annotations. The fileview can only be used after the files are annotated. If you want to create datasets before files are annotated, you have to use map_*.
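As a minimal sketch of the fileview alternative, the table below stands in for a downloaded fileview. The ids and annotation values are made-up placeholders, and the download step itself is omitted. The only transformation applied is renaming id to output_id and keeping the three columns named above.

```r
library(data.table)

# Hypothetical stand-in for a table downloaded from a fileview.
# The ids and annotation values are placeholders, not real entities.
fv <- data.table(
  id = c("syn0000001", "syn0000002"),
  dataType = c("SomaticVariants", "GermlineVariants"),
  workflow = c("Strelka", "DeepVariant"),
  name = c("a.vcf.gz", "b.vcf.gz") # extra fileview column, not needed here
)

# Rename id to output_id, then keep only the columns used for grouping.
setnames(fv, "id", "output_id")
output_map <- fv[, .(output_id, dataType, workflow)]

# output_map could then be passed in place of the map_* result, e.g.
# nf_sarek_datasets(output_map, parent = "<test project id>", dry_run = TRUE)
```

Keeping dry_run = TRUE for the first call lets you inspect the returned objects before anything is stored.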

Finally, datasets cannot use the same name if stored in the same project, so if there are multiple batches, the names will have to be made unique by adding the batch number, source data id, processing date, or whatever makes sense.
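One way to build unique names is sketched below. dataset_title is a hypothetical helper, not part of the package, and the batch labels are illustrative. It follows the title format from the description and appends a batch label when one is given. How the name is then assigned to a returned dataset object is not shown, since that depends on the object.

```r
# Hypothetical helper (not a package function): build a dataset title in the
# "type Genomic Variants - workflow Pipeline" format, with an optional suffix.
dataset_title <- function(type, workflow, batch = NULL) {
  title <- sprintf("%s Genomic Variants - %s Pipeline", type, workflow)
  if (!is.null(batch)) {
    title <- paste0(title, " (", batch, ")")
  }
  title
}

# Two batches of the same type and workflow get distinct names.
titles <- c(
  dataset_title("Somatic", "Strelka", batch = "Batch 1"),
  dataset_title("Somatic", "Strelka", batch = "Batch 2")
)
stopifnot(anyDuplicated(titles) == 0)
```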

Examples

if (FALSE) { # \dontrun{
syn_out <- "syn26648589"
m <- map_sample_output_sarek(syn_out)
datasets <- nf_sarek_datasets(m, parent = "syn26462036", dry_run = FALSE) # use a test project
} # }