Create datasets for Sarek-called somatic or germline variant results
Source: R/datasets.R
nf_sarek_datasets.Rd
Organize variant call files from Nextflow Sarek into 3-4 datasets, grouping files by variant type and workflow, with titles in the format "type Genomic Variants - workflow Pipeline", e.g. "Somatic Genomic Variants - Strelka Pipeline". This assumes that you want datasets that separate Somatic and Germline calls, which makes sense for NF because Germline calls can be treated differently. The function uses the latest version of all files and creates a Draft version of the dataset.
Usage
nf_sarek_datasets(
output_map,
parent,
workflow = c("FreeBayes", "Mutect2", "Strelka", "DeepVariant"),
verbose = TRUE,
dry_run = TRUE
)
Arguments
- output_map
The data.table returned from map_sample_output_sarek. See Details for alternatives.
- parent
Synapse id of parent project where the dataset will live.
- workflow
One of the workflows used.
- verbose
Optional, whether to be verbose – defaults to TRUE.
- dry_run
If TRUE, don't actually store the dataset, just return the data object for inspection or further modification.
Details
Grouping the files requires only the Synapse entity id, variant type, and workflow.
Instead of getting this info by running map_* as in the example, you may prefer using a fileview, in which case you just need to download a table from a fileview that has id (renamed to output_id) plus the dataType and workflow annotations.
The fileview can be used after the files are annotated. If you want to create datasets before files are annotated, then you have to use map_*.
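As a rough sketch of the fileview route, the snippet below reshapes a downloaded fileview table into the columns described above. The table `fv` is made-up stand-in data: the ids are hypothetical and the `dataType` values are assumed for illustration. Whether nf_sarek_datasets needs any further columns is not confirmed here.

```r
# Stand-in for a table downloaded from a fileview.
# Ids are hypothetical; dataType values are assumed for illustration.
fv <- data.frame(
  id = c("syn0000001", "syn0000002", "syn0000003"),
  dataType = c("SomaticVariants", "SomaticVariants", "GermlineVariants"),
  workflow = c("Strelka", "Mutect2", "DeepVariant"),
  stringsAsFactors = FALSE
)

# Rename `id` to `output_id` and keep the two annotation columns.
names(fv)[names(fv) == "id"] <- "output_id"
output_map <- fv[, c("output_id", "dataType", "workflow")]

# If a data.table is required, one could convert with
# data.table::as.data.table(output_map) (needs the data.table package).
```

The resulting `output_map` could then be passed in place of the map_* result, ideally with dry_run = TRUE first to inspect what would be created.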
Finally, datasets cannot use the same name if stored in the same project, so if there are multiple batches, the names will have to be made unique by adding the batch number, source data id, processing date, or whatever makes sense.
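To illustrate the naming point, here is a small base R sketch that builds titles in the documented format and appends a batch label. The helper `make_title` and the batch label are hypothetical and not part of the package; how a title is set on the returned dataset object depends on that object's structure, which is not shown here.

```r
# Hypothetical helper: build a title in the documented format,
# optionally suffixed with a batch label to keep names unique.
make_title <- function(type, workflow, batch = NULL) {
  title <- paste0(type, " Genomic Variants - ", workflow, " Pipeline")
  if (!is.null(batch)) {
    title <- paste0(title, " (", batch, ")")
  }
  title
}

make_title("Somatic", "Strelka")
# "Somatic Genomic Variants - Strelka Pipeline"
make_title("Somatic", "Strelka", batch = "Batch 2")
# "Somatic Genomic Variants - Strelka Pipeline (Batch 2)"
```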
Examples
if (FALSE) { # \dontrun{
syn_out <- "syn26648589"
m <- map_sample_output_sarek(syn_out)
datasets <- nf_sarek_datasets(m, parent = "syn26462036", dry_run = FALSE) # use a test project
} # }