GATK4: CollectInsertSizeMetrics

Gatk4CollectInsertSizeMetrics · 1 contributor · 4 versions

Provides useful metrics for validating library construction, including the insert size distribution and read orientation of paired-end libraries.

Quickstart

from janis_core import WorkflowBuilder
from janis_bioinformatics.tools.gatk4.collectinsertsizemetrics.versions import Gatk4CollectInsertSizeMetrics_4_1_4

wf = WorkflowBuilder("myworkflow")

wf.step(
    "gatk4collectinsertsizemetrics_step",
    Gatk4CollectInsertSizeMetrics_4_1_4(
        bam=None,  # placeholder: replace with a workflow input or upstream output of type IndexedBam
    )
)
# Step outputs are referenced through the workflow object.
wf.output("out", source=wf.gatk4collectinsertsizemetrics_step.out)
wf.output("outHistogram", source=wf.gatk4collectinsertsizemetrics_step.outHistogram)

OR

Install Janis
Ensure Janis is configured to work with Docker or Singularity.
Ensure all reference files are available:

Note

More information about these inputs is available below.

Generate user input files for Gatk4CollectInsertSizeMetrics:

# user inputs
janis inputs Gatk4CollectInsertSizeMetrics > inputs.yaml

inputs.yaml

bam: bam.bam

Run Gatk4CollectInsertSizeMetrics with:

janis run [...run options] \
    --inputs inputs.yaml \
    Gatk4CollectInsertSizeMetrics

Information

ID:	`Gatk4CollectInsertSizeMetrics`
URL:	https://gatk.broadinstitute.org/hc/en-us/articles/360036715591-CollectInsertSizeMetrics-Picard-
Versions:	4.1.4.0, 4.1.3.0, 4.1.2.0, 4.0.12.0
Container:	broadinstitute/gatk:4.1.4.0
Authors:	Jiaan Yu
Citations:	See https://software.broadinstitute.org/gatk/documentation/article?id=11027 for more information
Created:	2020-02-17
Updated:	2020-02-17

Outputs

name	type	documentation
out	TextFile
outHistogram	File

Additional configuration (inputs)

name	type	prefix	position	documentation
bam	IndexedBam	-I	10	Input SAM or BAM file. Required.
javaOptions	Optional<Array<String>>
compression_level	Optional<Integer>			Compression level for all compressed files created (e.g. BAM and VCF). Default value: 2.
outputFilename	Optional<Filename>	-O		File to write the output to. Required.
outputHistogram	Optional<Filename>	-H		File to write insert size Histogram chart to. Required.
argumentsFile	Optional<Array<File>>	--arguments_file	10	Read one or more arguments files and add them to the command line.
assumeSorted	Optional<Boolean>	--ASSUME_SORTED	11	If true (default), then the sort order in the header file will be ignored. Default value: true. Possible values: {true, false}
deviations	Optional<Double>	--DEVIATIONS	11	Generate mean, sd and plots by trimming the data down to MEDIAN + DEVIATIONS*MEDIAN_ABSOLUTE_DEVIATION. This is done because insert size data typically includes enough anomalous values from chimeras and other artifacts to make the mean and sd grossly misleading regarding the real distribution. Default value: 10.0.
histogramWidth	Optional<Integer>	--HISTOGRAM_WIDTH	11	Explicitly sets the Histogram width, overriding automatic truncation of Histogram tail. Also, when calculating mean and standard deviation, only bins <= HISTOGRAM_WIDTH will be included. Default value: null.
includeDuplicates	Optional<Boolean>	--INCLUDE_DUPLICATES	11	If true, also include reads marked as duplicates in the insert size histogram. Default value: false. Possible values: {true, false}
metricAccumulationLevel	Optional<String>	--METRIC_ACCUMULATION_LEVEL	11	The level(s) at which to accumulate metrics. This argument may be specified 0 or more times. Default value: [ALL_READS]. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP}.
minimumPCT	Optional<Float>	--MINIMUM_PCT	11	When generating the Histogram, discard any data categories (out of FR, TANDEM, RF) that have fewer than this percentage of overall reads. (Range: 0 to 1). Default value: 0.05.
stopAfter	Optional<Integer>	--STOP_AFTER	11	Stop after processing N reads, mainly for debugging. Default value: 0.
version	Optional<Boolean>	--version	11	Display the version number for this tool. Default value: false. Possible values: {true, false}
showHidden	Optional<Boolean>	--showHidden	11	Display hidden arguments. Default value: false. Possible values: {true, false}
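
The deviations and minimumPCT descriptions above amount to two small filtering steps. The sketch below is a literal Python reading of those two descriptions, not Picard's implementation: the function names, the sample insert sizes and the orientation counts are hypothetical, and using the sample standard deviation is an assumption.

```python
import statistics
from typing import Dict, List, Tuple


def trimmed_stats(insert_sizes: List[float], deviations: float = 10.0) -> Tuple[float, float, float]:
    """Return (cutoff, mean, sd) after dropping values above median + deviations * MAD."""
    median = statistics.median(insert_sizes)
    # Median absolute deviation: the median distance of each value from the median.
    mad = statistics.median([abs(size - median) for size in insert_sizes])
    cutoff = median + deviations * mad
    kept = [size for size in insert_sizes if size <= cutoff]
    # Assumption: sample standard deviation; needs at least two retained values.
    return cutoff, statistics.mean(kept), statistics.stdev(kept)


def keep_categories(pair_counts: Dict[str, int], minimum_pct: float = 0.05) -> Dict[str, int]:
    """Drop orientation categories holding less than minimum_pct of all read pairs."""
    total = sum(pair_counts.values())
    return {name: count for name, count in pair_counts.items() if count / total >= minimum_pct}


# Hypothetical data: five typical inserts plus one chimeric outlier.
sizes = [290, 295, 300, 305, 310, 5000]
cutoff, mean, sd = trimmed_stats(sizes)
print(cutoff, mean, round(sd, 2))
print(keep_categories({"FR": 960, "RF": 30, "TANDEM": 10}))
```

With these made-up values the single 5000 bp outlier falls above the cutoff, so the trimmed mean is 300 instead of roughly 1083, and only the FR category passes the 0.05 threshold.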

Workflow Description Language

version development

task Gatk4CollectInsertSizeMetrics {
  input {
    Int? runtime_cpu
    Int? runtime_memory
    Int? runtime_seconds
    Int? runtime_disks
    Array[String]? javaOptions
    Int? compression_level
    File bam
    File bam_bai
    String? outputFilename
    String? outputHistogram
    Array[File]? argumentsFile
    Boolean? assumeSorted
    Float? deviations
    Int? histogramWidth
    Boolean? includeDuplicates
    String? metricAccumulationLevel
    Float? minimumPCT
    Int? stopAfter
    Boolean? version
    Boolean? showHidden
  }
  command <<<
    set -e
    gatk CollectInsertSizeMetrics \
      --java-options '-Xmx~{((select_first([runtime_memory, 8, 4]) * 3) / 4)}G ~{if (defined(compression_level)) then ("-Dsamjdk.compress_level=" + compression_level) else ""} ~{sep(" ", select_first([javaOptions, []]))}' \
      -O '~{select_first([outputFilename, "~{basename(bam, ".bam")}.metrics.txt"])}' \
      -H '~{select_first([outputHistogram, "~{basename(bam, ".bam")}.histogram.pdf"])}' \
      -I '~{bam}' \
      ~{if (defined(argumentsFile) && length(select_first([argumentsFile])) > 0) then "--arguments_file '" + sep("' --arguments_file '", select_first([argumentsFile])) + "'" else ""} \
      ~{if (defined(assumeSorted) && select_first([assumeSorted])) then "--ASSUME_SORTED" else ""} \
      ~{if defined(deviations) then ("--DEVIATIONS " + deviations) else ''} \
      ~{if defined(histogramWidth) then ("--HISTOGRAM_WIDTH " + histogramWidth) else ''} \
      ~{if (defined(includeDuplicates) && select_first([includeDuplicates])) then "--INCLUDE_DUPLICATES" else ""} \
      ~{if defined(metricAccumulationLevel) then ("--METRIC_ACCUMULATION_LEVEL '" + metricAccumulationLevel + "'") else ""} \
      ~{if defined(minimumPCT) then ("--MINIMUM_PCT " + minimumPCT) else ''} \
      ~{if defined(stopAfter) then ("--STOP_AFTER " + stopAfter) else ''} \
      ~{if (defined(version) && select_first([version])) then "--version" else ""} \
      ~{if (defined(showHidden) && select_first([showHidden])) then "--showHidden" else ""}
  >>>
  runtime {
    cpu: select_first([runtime_cpu, 1, 1])
    disks: "local-disk ~{select_first([runtime_disks, 20])} SSD"
    docker: "broadinstitute/gatk:4.1.4.0"
    duration: select_first([runtime_seconds, 86400])
    memory: "~{select_first([runtime_memory, 8, 4])}G"
    preemptible: 2
  }
  output {
    File out = select_first([outputFilename, "~{basename(bam, ".bam")}.metrics.txt"])
    File outHistogram = select_first([outputHistogram, "~{basename(bam, ".bam")}.histogram.pdf"])
  }
}
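
The --java-options expression in the command block sets the JVM heap to three quarters of the task memory, falling back to 8 GB when runtime_memory is not supplied. A minimal Python sketch of that calculation follows. The function name build_java_options is hypothetical, integer division is assumed to match WDL Int arithmetic, and unlike the WDL template this version omits empty parts instead of leaving extra spaces.

```python
from typing import List, Optional

DEFAULT_MEMORY_GB = 8  # first non-null fallback in select_first([runtime_memory, 8, 4])


def build_java_options(
    runtime_memory: Optional[int] = None,
    compression_level: Optional[int] = None,
    java_options: Optional[List[str]] = None,
) -> str:
    """Assemble the string passed to --java-options."""
    memory = runtime_memory if runtime_memory is not None else DEFAULT_MEMORY_GB
    # Three quarters of the task memory goes to the Java heap.
    heap_gb = (memory * 3) // 4
    parts = [f"-Xmx{heap_gb}G"]
    if compression_level is not None:
        parts.append(f"-Dsamjdk.compress_level={compression_level}")
    parts.extend(java_options or [])
    return " ".join(parts)


print(build_java_options())
print(build_java_options(16, 2, ["-XX:+UseSerialGC"]))
```

Leaving a quarter of the allocation outside the heap gives the JVM and the container some headroom for non-heap memory.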

Common Workflow Language

#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.2
label: 'GATK4: CollectInsertSizeMetrics'
doc: |-
  Provides useful metrics for validating library construction including the insert size distribution and read orientation of paired-end libraries

requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: DockerRequirement
  dockerPull: broadinstitute/gatk:4.1.4.0

inputs:
- id: javaOptions
  label: javaOptions
  type:
  - type: array
    items: string
  - 'null'
- id: compression_level
  label: compression_level
  doc: |-
    Compression level for all compressed files created (e.g. BAM and VCF). Default value: 2.
  type:
  - int
  - 'null'
- id: bam
  label: bam
  doc: Input SAM or BAM file.  Required.
  type: File
  secondaryFiles:
  - pattern: .bai
  inputBinding:
    prefix: -I
    position: 10
- id: outputFilename
  label: outputFilename
  doc: File to write the output to.  Required.
  type:
  - string
  - 'null'
  default: generated.metrics.txt
  inputBinding:
    prefix: -O
    valueFrom: $(inputs.bam.basename.replace(/.bam$/, "")).metrics.txt
- id: outputHistogram
  label: outputHistogram
  doc: 'File to write insert size Histogram chart to.  Required. '
  type:
  - string
  - 'null'
  default: generated.histogram.pdf
  inputBinding:
    prefix: -H
    valueFrom: $(inputs.bam.basename.replace(/.bam$/, "")).histogram.pdf
- id: argumentsFile
  label: argumentsFile
  doc: read one or more arguments files and add them to the command line
  type:
  - type: array
    inputBinding:
      prefix: --arguments_file
    items: File
  - 'null'
  inputBinding:
    position: 10
- id: assumeSorted
  label: assumeSorted
  doc: |-
    If true (default), then the sort order in the header file will be ignored.  Default value: true. Possible values: {true, false}
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --ASSUME_SORTED
    position: 11
- id: deviations
  label: deviations
  doc: |-
    Generate mean, sd and plots by trimming the data down to MEDIAN + DEVIATIONS*MEDIAN_ABSOLUTE_DEVIATION. This is done because insert size data typically includes enough anomalous values from chimeras and other artifacts to make the mean and sd grossly misleading regarding the real distribution.  Default value: 10.0.
  type:
  - double
  - 'null'
  inputBinding:
    prefix: --DEVIATIONS
    position: 11
- id: histogramWidth
  label: histogramWidth
  doc: |-
    Explicitly sets the Histogram width, overriding automatic truncation of Histogram tail. Also, when calculating mean and standard deviation, only bins <= Histogram_WIDTH will be included.  Default value: null.
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --HISTOGRAM_WIDTH
    position: 11
- id: includeDuplicates
  label: includeDuplicates
  doc: |-
    If true, also include reads marked as duplicates in the insert size histogram.  Default value: false. Possible values: {true, false}
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --INCLUDE_DUPLICATES
    position: 11
- id: metricAccumulationLevel
  label: metricAccumulationLevel
  doc: |-
    The level(s) at which to accumulate metrics. This argument may be specified 0 or more times. Default value: [ALL_READS]. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP}.
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --METRIC_ACCUMULATION_LEVEL
    position: 11
- id: minimumPCT
  label: minimumPCT
  doc: |-
    When generating the Histogram, discard any data categories (out of FR, TANDEM, RF) that have fewer than this percentage of overall reads. (Range: 0 to 1).  Default value: 0.05.
  type:
  - float
  - 'null'
  inputBinding:
    prefix: --MINIMUM_PCT
    position: 11
- id: stopAfter
  label: stopAfter
  doc: 'Stop after processing N reads, mainly for debugging. Default value: 0.'
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --STOP_AFTER
    position: 11
- id: version
  label: version
  doc: |-
    Display the version number for this tool. Default value: false. Possible values: {true, false}
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --version
    position: 11
- id: showHidden
  label: showHidden
  doc: |-
    Display hidden arguments. Default value: false. Possible values: {true, false}
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --showHidden
    position: 11

outputs:
- id: out
  label: out
  type: File
  outputBinding:
    glob: $(inputs.bam.basename.replace(/.bam$/, "")).metrics.txt
    loadContents: false
- id: outHistogram
  label: outHistogram
  type: File
  outputBinding:
    glob: $(inputs.bam.basename.replace(/.bam$/, "")).histogram.pdf
    loadContents: false
stdout: _stdout
stderr: _stderr

baseCommand:
- gatk
- CollectInsertSizeMetrics
arguments:
- prefix: --java-options
  position: -1
  valueFrom: |-
    $("-Xmx{memory}G {compression} {otherargs}".replace(/\{memory\}/g, (([inputs.runtime_memory, 8, 4].filter(function (inner) { return inner != null })[0] * 3) / 4)).replace(/\{compression\}/g, (inputs.compression_level != null) ? ("-Dsamjdk.compress_level=" + inputs.compression_level) : "").replace(/\{otherargs\}/g, [inputs.javaOptions, []].filter(function (inner) { return inner != null })[0].join(" ")))

hints:
- class: ToolTimeLimit
  timelimit: |-
    $([inputs.runtime_seconds, 86400].filter(function (inner) { return inner != null })[0])
id: Gatk4CollectInsertSizeMetrics
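
Both definitions derive the default output names from the input BAM: the basename with a trailing .bam removed, followed by .metrics.txt or .histogram.pdf. The Python sketch below mirrors that derivation for checking expected file names. The function name default_outputs and the example path are hypothetical, and the dot in the pattern is escaped here, whereas the CWL expression uses /.bam$/.

```python
import re
from pathlib import PurePosixPath
from typing import Tuple


def default_outputs(bam_path: str) -> Tuple[str, str]:
    """Return the default (metrics, histogram) file names for a BAM path."""
    basename = PurePosixPath(bam_path).name
    # Strip a trailing ".bam"; other names pass through unchanged.
    stem = re.sub(r"\.bam$", "", basename)
    return f"{stem}.metrics.txt", f"{stem}.histogram.pdf"


print(default_outputs("/data/sample1.bam"))
```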