Strelka (Germline)¶
strelka_germline
· 1 contributor · 2 versions
Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. The germline caller employs an efficient tiered haplotype model to improve accuracy and provide read-backed phasing, adaptively selecting between assembly and a faster alignment-based haplotyping approach at each variant locus. The germline caller also analyzes input sequencing data using a mixture-model indel error estimation method to improve robustness to indel noise. The somatic calling model improves on the original Strelka method for liquid and late-stage tumor analysis by accounting for possible tumor cell contamination in the normal sample. A final empirical variant re-scoring step using random forest models trained on various call quality features has been added to both callers to further improve precision.
Compared with submissions to the recent PrecisonFDA Consistency and Truth challenges, the average indel F-score for Strelka2 running in its default configuration is 3.1% and 0.08% higher, respectively, than the best challenge submissions. Runtime on a 28-core server is ~40 minutes for 40x WGS germline analysis and ~3 hours for a 110x/40x WGS tumor-normal somatic analysis
Strelka accepts input read mappings from BAM or CRAM files, and optionally candidate and/or forced-call alleles from VCF. It reports all small variant predictions in VCF 4.1 format. Germline variant reporting uses the gVCF conventions to represent both variant and reference call confidence. For best somatic indel performance, Strelka is designed to be run with the Manta structural variant and indel caller, which provides additional indel candidates up to a given maxiumum indel size (49 by default). By design, Manta and Strelka run together with default settings provide complete coverage over all indel sizes (in additional to SVs and SNVs).
See the user guide for a full description of capabilities and limitations
Quickstart¶
from janis_bioinformatics.tools.illumina.strelkagermline.strelkagermline import StrelkaGermline_2_9_10 wf = WorkflowBuilder("myworkflow") wf.step( "strelka_germline_step", StrelkaGermline_2_9_10( bam=None, reference=None, ) ) wf.output("configPickle", source=strelka_germline_step.configPickle) wf.output("script", source=strelka_germline_step.script) wf.output("stats", source=strelka_germline_step.stats) wf.output("variants", source=strelka_germline_step.variants) wf.output("genome", source=strelka_germline_step.genome)
OR
- Install Janis
- Ensure Janis is configured to work with Docker or Singularity.
- Ensure all reference files are available:
Note
More information about these inputs are available below.
- Generate user input files for strelka_germline:
# user inputs
janis inputs strelka_germline > inputs.yaml
inputs.yaml
bam: bam.bam
reference: reference.fasta
- Run strelka_germline with:
janis run [...run options] \
--inputs inputs.yaml \
strelka_germline
Information¶
ID: | strelka_germline |
---|---|
URL: | https://github.com/Illumina/strelka |
Versions: | 2.9.10, 2.9.9 |
Container: | michaelfranklin/strelka:2.9.10 |
Authors: | Michael Franklin |
Citations: | None |
Created: | 2018-12-24 |
Updated: | 2019-01-24 |
Outputs¶
name | type | documentation |
---|---|---|
configPickle | File | |
script | File | |
stats | tsv | A tab-delimited report of various internal statistics from the variant calling process: Runtime information accumulated for each genome segment, excluding auxiliary steps such as BAM indexing and vcf merging. Indel candidacy statistics |
variants | Gzipped<VCF> | Primary variant inferences are provided as a series of VCF 4.1 files |
genome | Gzipped<VCF> |
Additional configuration (inputs)¶
name | type | prefix | position | documentation |
---|---|---|---|---|
bam | IndexedBam | –bam | 1 | Sample BAM or CRAM file. May be specified more than once, multiple inputs will be treated as each BAM file representing a different sample. [required] (no default) |
reference | FastaWithIndexes | –referenceFasta | 1 | samtools-indexed reference fasta file [required] |
relativeStrelkaDirectory | Optional<String> | –runDir | 1 | Name of directory to be created where all workflow scripts and output will be written. Each analysis requires a separate directory. |
ploidy | Optional<Gzipped<VCF>> | –ploidy | 1 | Provide ploidy file in VCF. The VCF should include one sample column per input sample labeled with the same sample names found in the input BAM/CRAM RG header sections. Ploidy should be provided in records using the FORMAT/CN field, which are interpreted to span the range [POS+1, INFO/END]. Any CN value besides 1 or 0 will be treated as 2. File must be tabix indexed. (no default) |
noCompress | Optional<Gzipped<VCF>> | –noCompress | 1 | Provide BED file of regions where gVCF block compression is not allowed. File must be bgzip- compressed/tabix-indexed. (no default) |
callContinuousVf | Optional<String> | –callContinuousVf | Call variants on CHROM without a ploidy prior assumption, issuing calls with continuous variant frequencies (no default) | |
rna | Optional<Boolean> | –rna | 1 | Set options for RNA-Seq input. |
indelCandidates | Optional<Gzipped<VCF>> | –indelCandidates | 1 | Specify a VCF of candidate indel alleles. These alleles are always evaluated but only reported in the output when they are inferred to exist in the sample. The VCF must be tabix indexed. All indel alleles must be left-shifted/normalized, any unnormalized alleles will be ignored. This option may be specified more than once, multiple input VCFs will be merged. (default: None) |
forcedGT | Optional<Gzipped<VCF>> | –forcedGT | 1 | Specify a VCF of candidate alleles. These alleles are always evaluated and reported even if they are unlikely to exist in the sample. The VCF must be tabix indexed. All indel alleles must be left- shifted/normalized, any unnormalized allele will trigger a runtime error. This option may be specified more than once, multiple input VCFs will be merged. Note that for any SNVs provided in the VCF, the SNV site will be reported (and for gVCF, excluded from block compression), but the specific SNV alleles are ignored. (default: None) |
exome | Optional<Boolean> | –exome | 1 | Set options for exome note in particular that this flag turns off high-depth filters |
targeted | Optional<Boolean> | –exome | 1 | Set options for other targeted input: note in particular that this flag turns off high-depth filters |
callRegions | Optional<Gzipped<bed>> | –callRegions= | 1 | Optionally provide a bgzip-compressed/tabix-indexed BED file containing the set of regions to call. No VCF output will be provided outside of these regions. The full genome will still be used to estimate statistics from the input (such as expected depth per chromosome). Only one BED file may be specified. (default: call the entire genome) |
mode | Optional<String> | –mode | 3 | (-m MODE) select run mode (local|sge) |
queue | Optional<String> | –queue | 3 | (-q QUEUE) specify scheduler queue name |
memGb | Optional<String> | –memGb | 3 | (-g MEMGB) gigabytes of memory available to run workflow – only meaningful in local mode, must be an integer (default: Estimate the total memory for this node for local mode, ‘unlimited’ for sge mode) |
quiet | Optional<Boolean> | –quiet | 3 | Don’t write any log output to stderr (but still write to workspace/pyflow.data/logs/pyflow_log.txt) |
mailTo | Optional<String> | –mailTo | 3 | (-e) send email notification of job completion status to this address (may be provided multiple times for more than one email address) |
Workflow Description Language¶
version development
task strelka_germline {
input {
Int? runtime_cpu
Int? runtime_memory
Int? runtime_seconds
Int? runtime_disks
File bam
File bam_bai
File reference
File reference_fai
File reference_amb
File reference_ann
File reference_bwt
File reference_pac
File reference_sa
File reference_dict
String? relativeStrelkaDirectory
File? ploidy
File? ploidy_tbi
File? noCompress
File? noCompress_tbi
String? callContinuousVf
Boolean? rna
File? indelCandidates
File? indelCandidates_tbi
File? forcedGT
File? forcedGT_tbi
Boolean? exome
Boolean? targeted
File? callRegions
File? callRegions_tbi
String? mode
String? queue
String? memGb
Boolean? quiet
String? mailTo
}
command <<<
set -e
\
~{if defined(callContinuousVf) then ("--callContinuousVf '" + callContinuousVf + "'") else ""} \
configureStrelkaGermlineWorkflow.py \
--bam ~{bam} \
--referenceFasta ~{reference} \
~{if defined(select_first([relativeStrelkaDirectory, "strelka_dir"])) then ("--runDir " + select_first([relativeStrelkaDirectory, "strelka_dir"])) else ''} \
~{if defined(ploidy) then ("--ploidy " + ploidy) else ''} \
~{if defined(noCompress) then ("--noCompress " + noCompress) else ''} \
~{if (defined(rna) && select_first([rna])) then "--rna" else ""} \
~{if defined(indelCandidates) then ("--indelCandidates " + indelCandidates) else ''} \
~{if defined(forcedGT) then ("--forcedGT " + forcedGT) else ''} \
~{if (defined(exome) && select_first([exome])) then "--exome" else ""} \
~{if (defined(targeted) && select_first([targeted])) then "--exome" else ""} \
~{if defined(callRegions) then ("--callRegions='" + callRegions + "'") else ""} \
;~{select_first([relativeStrelkaDirectory, "strelka_dir"])}/runWorkflow.py \
~{if defined(select_first([mode, "local"])) then ("--mode " + select_first([mode, "local"])) else ''} \
~{if defined(queue) then ("--queue " + queue) else ''} \
~{if defined(memGb) then ("--memGb " + memGb) else ''} \
~{if (defined(quiet) && select_first([quiet])) then "--quiet" else ""} \
~{if defined(mailTo) then ("--mailTo " + mailTo) else ''} \
--jobs ~{select_first([runtime_cpu, 4])}
>>>
runtime {
cpu: select_first([runtime_cpu, 4, 1])
disks: "local-disk ~{select_first([runtime_disks, 20])} SSD"
docker: "michaelfranklin/strelka:2.9.10"
duration: select_first([runtime_seconds, 86400])
memory: "~{select_first([runtime_memory, 4, 4])}G"
preemptible: 2
}
output {
File configPickle = (select_first([relativeStrelkaDirectory, "strelka_dir"]) + "/runWorkflow.py.config.pickle")
File script = (select_first([relativeStrelkaDirectory, "strelka_dir"]) + "/runWorkflow.py")
File stats = (select_first([relativeStrelkaDirectory, "strelka_dir"]) + "/results/stats/runStats.tsv")
File variants = (select_first([relativeStrelkaDirectory, "strelka_dir"]) + "/results/variants/variants.vcf.gz")
File variants_tbi = (select_first([relativeStrelkaDirectory, "strelka_dir"]) + "/results/variants/variants.vcf.gz") + ".tbi"
File genome = (select_first([relativeStrelkaDirectory, "strelka_dir"]) + "/results/variants/genome.vcf.gz")
File genome_tbi = (select_first([relativeStrelkaDirectory, "strelka_dir"]) + "/results/variants/genome.vcf.gz") + ".tbi"
}
}
Common Workflow Language¶
#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.2
label: Strelka (Germline)
doc: |-
Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation
in small cohorts and somatic variation in tumor/normal sample pairs. The germline caller employs
an efficient tiered haplotype model to improve accuracy and provide read-backed phasing, adaptively
selecting between assembly and a faster alignment-based haplotyping approach at each variant locus.
The germline caller also analyzes input sequencing data using a mixture-model indel error estimation
method to improve robustness to indel noise. The somatic calling model improves on the original
Strelka method for liquid and late-stage tumor analysis by accounting for possible tumor cell
contamination in the normal sample. A final empirical variant re-scoring step using random forest
models trained on various call quality features has been added to both callers to further improve precision.
Compared with submissions to the recent PrecisonFDA Consistency and Truth challenges, the average
indel F-score for Strelka2 running in its default configuration is 3.1% and 0.08% higher, respectively,
than the best challenge submissions. Runtime on a 28-core server is ~40 minutes for 40x WGS germline
analysis and ~3 hours for a 110x/40x WGS tumor-normal somatic analysis
Strelka accepts input read mappings from BAM or CRAM files, and optionally candidate and/or forced-call
alleles from VCF. It reports all small variant predictions in VCF 4.1 format. Germline variant
reporting uses the gVCF conventions to represent both variant and reference call confidence.
For best somatic indel performance, Strelka is designed to be run with the Manta structural variant
and indel caller, which provides additional indel candidates up to a given maxiumum indel size
(49 by default). By design, Manta and Strelka run together with default settings provide complete
coverage over all indel sizes (in additional to SVs and SNVs).
See the user guide for a full description of capabilities and limitations
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: DockerRequirement
dockerPull: michaelfranklin/strelka:2.9.10
inputs:
- id: bam
label: bam
doc: |-
Sample BAM or CRAM file. May be specified more than once, multiple inputs will be treated as each BAM file representing a different sample. [required] (no default)
type: File
secondaryFiles:
- pattern: .bai
inputBinding:
prefix: --bam
position: 1
shellQuote: false
- id: reference
label: reference
doc: samtools-indexed reference fasta file [required]
type: File
secondaryFiles:
- pattern: .fai
- pattern: .amb
- pattern: .ann
- pattern: .bwt
- pattern: .pac
- pattern: .sa
- pattern: ^.dict
inputBinding:
prefix: --referenceFasta
position: 1
shellQuote: false
- id: relativeStrelkaDirectory
label: relativeStrelkaDirectory
doc: |-
Name of directory to be created where all workflow scripts and output will be written. Each analysis requires a separate directory.
type: string
default: strelka_dir
inputBinding:
prefix: --runDir
position: 1
shellQuote: false
- id: ploidy
label: ploidy
doc: |-
Provide ploidy file in VCF. The VCF should include one sample column per input sample labeled with the same sample names found in the input BAM/CRAM RG header sections. Ploidy should be provided in records using the FORMAT/CN field, which are interpreted to span the range [POS+1, INFO/END]. Any CN value besides 1 or 0 will be treated as 2. File must be tabix indexed. (no default)
type:
- File
- 'null'
secondaryFiles:
- pattern: .tbi
inputBinding:
prefix: --ploidy
position: 1
shellQuote: false
- id: noCompress
label: noCompress
doc: |-
Provide BED file of regions where gVCF block compression is not allowed. File must be bgzip- compressed/tabix-indexed. (no default)
type:
- File
- 'null'
secondaryFiles:
- pattern: .tbi
inputBinding:
prefix: --noCompress
position: 1
shellQuote: false
- id: callContinuousVf
label: callContinuousVf
doc: |-
Call variants on CHROM without a ploidy prior assumption, issuing calls with continuous variant frequencies (no default)
type:
- string
- 'null'
inputBinding:
prefix: --callContinuousVf
- id: rna
label: rna
doc: Set options for RNA-Seq input.
type:
- boolean
- 'null'
inputBinding:
prefix: --rna
position: 1
shellQuote: false
- id: indelCandidates
label: indelCandidates
doc: |-
Specify a VCF of candidate indel alleles. These alleles are always evaluated but only reported in the output when they are inferred to exist in the sample. The VCF must be tabix indexed. All indel alleles must be left-shifted/normalized, any unnormalized alleles will be ignored. This option may be specified more than once, multiple input VCFs will be merged. (default: None)
type:
- File
- 'null'
secondaryFiles:
- pattern: .tbi
inputBinding:
prefix: --indelCandidates
position: 1
shellQuote: false
- id: forcedGT
label: forcedGT
doc: |-
Specify a VCF of candidate alleles. These alleles are always evaluated and reported even if they are unlikely to exist in the sample. The VCF must be tabix indexed. All indel alleles must be left- shifted/normalized, any unnormalized allele will trigger a runtime error. This option may be specified more than once, multiple input VCFs will be merged. Note that for any SNVs provided in the VCF, the SNV site will be reported (and for gVCF, excluded from block compression), but the specific SNV alleles are ignored. (default: None)
type:
- File
- 'null'
secondaryFiles:
- pattern: .tbi
inputBinding:
prefix: --forcedGT
position: 1
shellQuote: false
- id: exome
label: exome
doc: |-
Set options for exome note in particular that this flag turns off high-depth filters
type:
- boolean
- 'null'
inputBinding:
prefix: --exome
position: 1
shellQuote: false
- id: targeted
label: targeted
doc: |-
Set options for other targeted input: note in particular that this flag turns off high-depth filters
type:
- boolean
- 'null'
inputBinding:
prefix: --exome
position: 1
shellQuote: false
- id: callRegions
label: callRegions
doc: |-
Optionally provide a bgzip-compressed/tabix-indexed BED file containing the set of regions to call. No VCF output will be provided outside of these regions. The full genome will still be used to estimate statistics from the input (such as expected depth per chromosome). Only one BED file may be specified. (default: call the entire genome)
type:
- File
- 'null'
secondaryFiles:
- pattern: .tbi
inputBinding:
prefix: --callRegions=
position: 1
separate: false
- id: mode
label: mode
doc: (-m MODE) select run mode (local|sge)
type: string
default: local
inputBinding:
prefix: --mode
position: 3
shellQuote: false
- id: queue
label: queue
doc: (-q QUEUE) specify scheduler queue name
type:
- string
- 'null'
inputBinding:
prefix: --queue
position: 3
shellQuote: false
- id: memGb
label: memGb
doc: |2-
(-g MEMGB) gigabytes of memory available to run workflow -- only meaningful in local mode, must be an integer (default: Estimate the total memory for this node for local mode, 'unlimited' for sge mode)
type:
- string
- 'null'
inputBinding:
prefix: --memGb
position: 3
shellQuote: false
- id: quiet
label: quiet
doc: |-
Don't write any log output to stderr (but still write to workspace/pyflow.data/logs/pyflow_log.txt)
type:
- boolean
- 'null'
inputBinding:
prefix: --quiet
position: 3
shellQuote: false
- id: mailTo
label: mailTo
doc: |-
(-e) send email notification of job completion status to this address (may be provided multiple times for more than one email address)
type:
- string
- 'null'
inputBinding:
prefix: --mailTo
position: 3
shellQuote: false
outputs:
- id: configPickle
label: configPickle
type: File
outputBinding:
glob: $((inputs.relativeStrelkaDirectory + "/runWorkflow.py.config.pickle"))
outputEval: $((inputs.relativeStrelkaDirectory.basename + "/runWorkflow.py.config.pickle"))
loadContents: false
- id: script
label: script
type: File
outputBinding:
glob: $((inputs.relativeStrelkaDirectory + "/runWorkflow.py"))
outputEval: $((inputs.relativeStrelkaDirectory.basename + "/runWorkflow.py"))
loadContents: false
- id: stats
label: stats
doc: |-
A tab-delimited report of various internal statistics from the variant calling process: Runtime information accumulated for each genome segment, excluding auxiliary steps such as BAM indexing and vcf merging. Indel candidacy statistics
type: File
outputBinding:
glob: $((inputs.relativeStrelkaDirectory + "/results/stats/runStats.tsv"))
outputEval: $((inputs.relativeStrelkaDirectory.basename + "/results/stats/runStats.tsv"))
loadContents: false
- id: variants
label: variants
doc: Primary variant inferences are provided as a series of VCF 4.1 files
type: File
secondaryFiles:
- pattern: .tbi
outputBinding:
glob: $((inputs.relativeStrelkaDirectory + "/results/variants/variants.vcf.gz"))
outputEval: |-
$((inputs.relativeStrelkaDirectory.basename + "/results/variants/variants.vcf.gz"))
loadContents: false
- id: genome
label: genome
type: File
secondaryFiles:
- pattern: .tbi
outputBinding:
glob: $((inputs.relativeStrelkaDirectory + "/results/variants/genome.vcf.gz"))
outputEval: |-
$((inputs.relativeStrelkaDirectory.basename + "/results/variants/genome.vcf.gz"))
loadContents: false
stdout: _stdout
stderr: _stderr
arguments:
- position: 0
valueFrom: configureStrelkaGermlineWorkflow.py
shellQuote: false
- position: 2
valueFrom: |-
$(";{relativeStrelkaDirectory}/runWorkflow.py".replace(/\{relativeStrelkaDirectory\}/g, inputs.relativeStrelkaDirectory))
shellQuote: false
- prefix: --jobs
position: 3
valueFrom: $([inputs.runtime_cpu, 4].filter(function (inner) { return inner != null
})[0])
shellQuote: false
hints:
- class: ToolTimeLimit
timelimit: |-
$([inputs.runtime_seconds, 86400].filter(function (inner) { return inner != null })[0])
id: strelka_germline