GATK4: Base Recalibrator¶

Gatk4BaseRecalibrator · 1 contributor · 4 versions

First pass of the base quality score recalibration. Generates a recalibration table based on various covariates. The default covariates are read group, reported quality score, machine cycle, and nucleotide context.

This walker generates tables based on specified covariates. It does a by-locus traversal operating only at sites that are in the known sites VCF. ExAc, gnomAD, or dbSNP resources can be used as known sites of variation. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality. Since there is a large amount of data one can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, num observations, num mismatches, empirical quality score).

Quickstart¶

from janis_bioinformatics.tools.gatk4.baserecalibrator.versions import Gatk4BaseRecalibrator_4_1_4

wf = WorkflowBuilder("myworkflow")

wf.step(
    "gatk4baserecalibrator_step",
    Gatk4BaseRecalibrator_4_1_4(
        bam=None,
        knownSites=None,
        reference=None,
    )
)
wf.output("out", source=gatk4baserecalibrator_step.out)

OR

Install Janis
Ensure Janis is configured to work with Docker or Singularity.
Ensure all reference files are available:

Note

More information about these inputs are available below.

Generate user input files for Gatk4BaseRecalibrator:

# user inputs
janis inputs Gatk4BaseRecalibrator > inputs.yaml

inputs.yaml

bam: bam.bam
knownSites:
- knownSites_0.vcf.gz
- knownSites_1.vcf.gz
reference: reference.fasta

Run Gatk4BaseRecalibrator with:

janis run [...run options] \
    --inputs inputs.yaml \
    Gatk4BaseRecalibrator

Information¶

ID:	`Gatk4BaseRecalibrator`
URL:	https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_BaseRecalibrator.php
Versions:	4.1.4.0, 4.1.3.0, 4.1.2.0, 4.0.12.0
Container:	broadinstitute/gatk:4.1.4.0
Authors:	Michael Franklin
Citations:	See https://software.broadinstitute.org/gatk/documentation/article?id=11027 for more information
Created:	2018-12-24
Updated:	2019-01-24

Outputs¶

name	type	documentation
out	tsv

Additional configuration (inputs)¶

name	type	prefix	position	documentation
bam	IndexedBam	-I	6	BAM/SAM/CRAM file containing reads
knownSites	Array<Gzipped<VCF>>	–known-sites	28	One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis. This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of Feature-containing files (VCF, BCF, BED, etc.) for use as this database. For users wishing to exclude an interval list of known variation simply use -XL my.interval.list to skip over processing those sites. Please note however that the statistics reported by the tool will not accurately reflected those sites skipped by the -XL argument.
reference	FastaWithIndexes	-R	5	Reference sequence file
javaOptions	Optional<Array<String>>
compression_level	Optional<Integer>			Compression level for all compressed files created (e.g. BAM and VCF). Default value: 2.
tmpDir	Optional<String>	–tmp-dir		Temp directory to use.
outputFilename	Optional<Filename>	-O	8	The output recalibration table filename to create. After the header, data records occur one per line until the end of the file. The first several items on a line are the values of the individual covariates and will change depending on which covariates were specified at runtime. The last three items are the data- that is, number of observations for this combination of covariates, number of reference mismatches, and the raw empirical quality score calculated by phred-scaling the mismatch rate. Use ‘/dev/stdout’ to print to standard out.
intervals	Optional<bed>	–intervals		-L (BASE) One or more genomic intervals over which to operate
intervalStrings	Optional<Array<String>>	–intervals		-L (BASE) One or more genomic intervals over which to operate

Workflow Description Language¶

version development

task Gatk4BaseRecalibrator {
  input {
    Int? runtime_cpu
    Int? runtime_memory
    Int? runtime_seconds
    Int? runtime_disks
    Array[String]? javaOptions
    Int? compression_level
    String? tmpDir
    File bam
    File bam_bai
    Array[File] knownSites
    Array[File] knownSites_tbi
    File reference
    File reference_fai
    File reference_amb
    File reference_ann
    File reference_bwt
    File reference_pac
    File reference_sa
    File reference_dict
    String? outputFilename
    File? intervals
    Array[String]? intervalStrings
  }
  command <<<
    set -e
    cp -f '~{bam_bai}' $(echo '~{bam}' | sed 's/\.[^.]*$//').bai
    gatk BaseRecalibrator \
      --java-options '-Xmx~{((select_first([runtime_memory, 16, 4]) * 3) / 4)}G ~{if (defined(compression_level)) then ("-Dsamjdk.compress_level=" + compression_level) else ""} ~{sep(" ", select_first([javaOptions, []]))}' \
      ~{if defined(select_first([tmpDir, "/tmp/"])) then ("--tmp-dir '" + select_first([tmpDir, "/tmp/"]) + "'") else ""} \
      ~{if defined(intervals) then ("--intervals '" + intervals + "'") else ""} \
      ~{if (defined(intervalStrings) && length(select_first([intervalStrings])) > 0) then "--intervals '" + sep("' --intervals '", select_first([intervalStrings])) + "'" else ""} \
      -R '~{reference}' \
      -I '~{bam}' \
      -O '~{select_first([outputFilename, "~{basename(bam, ".bam")}.table"])}' \
      ~{if length(knownSites) > 0 then "--known-sites '" + sep("' --known-sites '", knownSites) + "'" else ""}
  >>>
  runtime {
    cpu: select_first([runtime_cpu, 1, 1])
    disks: "local-disk ~{select_first([runtime_disks, 20])} SSD"
    docker: "broadinstitute/gatk:4.1.4.0"
    duration: select_first([runtime_seconds, 86400])
    memory: "~{select_first([runtime_memory, 16, 4])}G"
    preemptible: 2
  }
  output {
    File out = select_first([outputFilename, "~{basename(bam, ".bam")}.table"])
  }
}

Common Workflow Language¶

#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.2
label: 'GATK4: Base Recalibrator'
doc: |-
  First pass of the base quality score recalibration. Generates a recalibration table based on various covariates.
  The default covariates are read group, reported quality score, machine cycle, and nucleotide context.

  This walker generates tables based on specified covariates. It does a by-locus traversal operating only at sites
  that are in the known sites VCF. ExAc, gnomAD, or dbSNP resources can be used as known sites of variation.
  We assume that all reference mismatches we see are therefore errors and indicative of poor base quality.
  Since there is a large amount of data one can then calculate an empirical probability of error given the
  particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a
  table (of the several covariate values, num observations, num mismatches, empirical quality score).

requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: DockerRequirement
  dockerPull: broadinstitute/gatk:4.1.4.0

inputs:
- id: javaOptions
  label: javaOptions
  type:
  - type: array
    items: string
  - 'null'
- id: compression_level
  label: compression_level
  doc: |-
    Compression level for all compressed files created (e.g. BAM and VCF). Default value: 2.
  type:
  - int
  - 'null'
- id: tmpDir
  label: tmpDir
  doc: Temp directory to use.
  type: string
  default: /tmp/
  inputBinding:
    prefix: --tmp-dir
- id: bam
  label: bam
  doc: BAM/SAM/CRAM file containing reads
  type: File
  secondaryFiles:
  - |-
    ${

            function resolveSecondary(base, secPattern) {
              if (secPattern[0] == "^") {
                var spl = base.split(".");
                var endIndex = spl.length > 1 ? spl.length - 1 : 1;
                return resolveSecondary(spl.slice(undefined, endIndex).join("."), secPattern.slice(1));
              }
              return base + secPattern
            }

            return [
                    {
                        location: resolveSecondary(self.location, "^.bai"),
                        basename: resolveSecondary(self.basename, ".bai"),
                        class: "File",
                    }
            ];

    }
  inputBinding:
    prefix: -I
    position: 6
- id: knownSites
  label: knownSites
  doc: |-
    **One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis.** This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of Feature-containing files (VCF, BCF, BED, etc.) for use as this database. For users wishing to exclude an interval list of known variation simply use -XL my.interval.list to skip over processing those sites. Please note however that the statistics reported by the tool will not accurately reflected those sites skipped by the -XL argument.
  type:
    type: array
    inputBinding:
      prefix: --known-sites
    items: File
  inputBinding:
    position: 28
- id: reference
  label: reference
  doc: Reference sequence file
  type: File
  secondaryFiles:
  - pattern: .fai
  - pattern: .amb
  - pattern: .ann
  - pattern: .bwt
  - pattern: .pac
  - pattern: .sa
  - pattern: ^.dict
  inputBinding:
    prefix: -R
    position: 5
- id: outputFilename
  label: outputFilename
  doc: |-
    **The output recalibration table filename to create.** After the header, data records occur one per line until the end of the file. The first several items on a line are the values of the individual covariates and will change depending on which covariates were specified at runtime. The last three items are the data- that is, number of observations for this combination of covariates, number of reference mismatches, and the raw empirical quality score calculated by phred-scaling the mismatch rate. Use '/dev/stdout' to print to standard out.
  type:
  - string
  - 'null'
  default: generated.table
  inputBinding:
    prefix: -O
    position: 8
    valueFrom: $(inputs.bam.basename.replace(/.bam$/, "")).table
- id: intervals
  label: intervals
  doc: -L (BASE) One or more genomic intervals over which to operate
  type:
  - File
  - 'null'
  inputBinding:
    prefix: --intervals
- id: intervalStrings
  label: intervalStrings
  doc: -L (BASE) One or more genomic intervals over which to operate
  type:
  - type: array
    inputBinding:
      prefix: --intervals
    items: string
  - 'null'
  inputBinding: {}

outputs:
- id: out
  label: out
  type: File
  outputBinding:
    glob: $(inputs.bam.basename.replace(/.bam$/, "")).table
    loadContents: false
stdout: _stdout
stderr: _stderr

baseCommand:
- gatk
- BaseRecalibrator
arguments:
- prefix: --java-options
  position: -1
  valueFrom: |-
    $("-Xmx{memory}G {compression} {otherargs}".replace(/\{memory\}/g, (([inputs.runtime_memory, 16, 4].filter(function (inner) { return inner != null })[0] * 3) / 4)).replace(/\{compression\}/g, (inputs.compression_level != null) ? ("-Dsamjdk.compress_level=" + inputs.compression_level) : "").replace(/\{otherargs\}/g, [inputs.javaOptions, []].filter(function (inner) { return inner != null })[0].join(" ")))

hints:
- class: ToolTimeLimit
  timelimit: |-
    $([inputs.runtime_seconds, 86400].filter(function (inner) { return inner != null })[0])
id: Gatk4BaseRecalibrator