GATK4: Mark Duplicates¶

Gatk4MarkDuplicates · 1 contributor · 4 versions

MarkDuplicates (Picard): Identifies duplicate reads.

This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library construction using PCR. See also EstimateLibraryComplexity for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. An BARCODE_TAG option is available to facilitate duplicate marking using molecular barcodes. After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores (default method).

The tool’s main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024. If you are not familiar with this type of annotation, please see the following blog post for additional information.

Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in the ‘optional field’ section of a SAM/BAM file. Invoking the TAGGING_POLICY option, you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no duplicates (DontTag). The records within the output of a SAM/BAM file will have values for the ‘DT’ tag (depending on the invoked TAGGING_POLICY), as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ). This tool uses the READ_NAME_REGEX and the OPTICAL_DUPLICATE_PIXEL_DISTANCE options as the primary methods to identify and differentiate duplicate types. Set READ_NAME_REGEX to null to skip optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate.

MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.

The program can take either coordinate-sorted or query-sorted inputs, however the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.

If desired, duplicates can be removed using the REMOVE_DUPLICATE and REMOVE_SEQUENCING_DUPLICATES options.

Quickstart¶

from janis_bioinformatics.tools.gatk4.markduplicates.versions import Gatk4MarkDuplicates_4_1_4

wf = WorkflowBuilder("myworkflow")

wf.step(
    "gatk4markduplicates_step",
    Gatk4MarkDuplicates_4_1_4(
        bam=None,
    )
)
wf.output("out", source=gatk4markduplicates_step.out)
wf.output("metrics", source=gatk4markduplicates_step.metrics)

OR

Install Janis
Ensure Janis is configured to work with Docker or Singularity.
Ensure all reference files are available:

Note

More information about these inputs are available below.

Generate user input files for Gatk4MarkDuplicates:

# user inputs
janis inputs Gatk4MarkDuplicates > inputs.yaml

inputs.yaml

bam:
- bam_0.bam
- bam_1.bam

Run Gatk4MarkDuplicates with:

janis run [...run options] \
    --inputs inputs.yaml \
    Gatk4MarkDuplicates

Information¶

ID:	`Gatk4MarkDuplicates`
URL:	https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_MarkDuplicates.php
Versions:	4.1.4.0, 4.1.3.0, 4.1.2.0, 4.0.12.0
Container:	broadinstitute/gatk:4.1.4.0
Authors:	Michael Franklin
Citations:	See https://software.broadinstitute.org/gatk/documentation/article?id=11027 for more information
Created:	2018-12-24
Updated:	2019-01-24

Outputs¶

name	type	documentation
out	IndexedBam
metrics	tsv

Additional configuration (inputs)¶

name	type	prefix	position	documentation
bam	Array<BAM>	-I	10	One or more input SAM or BAM files to analyze. Must be coordinate sorted.
outputPrefix	Optional<String>
outputFilename	Optional<Filename>	-O	10	File to write duplication metrics to
metricsFilename	Optional<Filename>	-M	10	The output file to write marked records to.
javaOptions	Optional<Array<String>>
compression_level	Optional<Integer>			Compression level for all compressed files created (e.g. BAM and VCF). Default value: 2.
argumentsFile	Optional<Array<File>>	–arguments_file	10	read one or more arguments files and add them to the command line
assumeSortOrder	Optional<String>	-ASO		If not null, assume that the input file has this order even if the header says otherwise. Exclusion: This argument cannot be used at the same time as ASSUME_SORTED. The –ASSUME_SORT_ORDER argument is an enumerated type (SortOrder), which can have one of the following values: [unsorted, queryname, coordinate, duplicate, unknown]
barcodeTag	Optional<String>	–BARCODE_TAG		Barcode SAM tag (ex. BC for 10X Genomics)
comment	Optional<Array<String>>	-CO		Comment(s) to include in the output file’s header.
createIndex	Optional<Boolean>	–CREATE_INDEX	11	Whether to create a BAM index when writing a coordinate-sorted BAM file.
createMd5File	Optional<Boolean>	–CREATE_MD5_FILE	11	Whether to create an MD5 digest for any BAM or FASTQ files created.
maxRecordsInRam	Optional<Integer>	–MAX_RECORDS_IN_RAM	11	When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed.
quiet	Optional<Boolean>	–QUIET	11	Whether to suppress job-summary info on System.err.
tmpDir	Optional<String>	–TMP_DIR	11	Undocumented option
useJdkDeflater	Optional<Boolean>	–use_jdk_deflater	11	Whether to use the JdkDeflater (as opposed to IntelDeflater)
useJdkInflater	Optional<Boolean>	–use_jdk_inflater	11	Whether to use the JdkInflater (as opposed to IntelInflater)
validationStringency	Optional<String>	–VALIDATION_STRINGENCY	11	Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.The –VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values: [STRICT, LENIENT, SILENT]
verbosity	Optional<String>	–verbosity	11	The –verbosity argument is an enumerated type (LogLevel), which can have one of the following values: [ERROR, WARNING, INFO, DEBUG]
opticalDuplicatePixelDistance	Optional<Integer>	–OPTICAL_DUPLICATE_PIXEL_DISTANCE		The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.

Workflow Description Language¶

version development

task Gatk4MarkDuplicates {
  input {
    Int? runtime_cpu
    Int? runtime_memory
    Int? runtime_seconds
    Int? runtime_disks
    Array[File] bam
    String? outputPrefix
    String? outputFilename
    String? metricsFilename
    Array[String]? javaOptions
    Int? compression_level
    Array[File]? argumentsFile
    String? assumeSortOrder
    String? barcodeTag
    Array[String]? comment
    Boolean? createIndex
    Boolean? createMd5File
    Int? maxRecordsInRam
    Boolean? quiet
    String? tmpDir
    Boolean? useJdkDeflater
    Boolean? useJdkInflater
    String? validationStringency
    String? verbosity
    Int? opticalDuplicatePixelDistance
  }
  command <<<
    set -e
    gatk MarkDuplicates \
      --java-options '-Xmx~{((select_first([runtime_memory, 8, 4]) * 3) / 4)}G ~{if (defined(compression_level)) then ("-Dsamjdk.compress_level=" + compression_level) else ""} ~{sep(" ", select_first([javaOptions, []]))}' \
      ~{if defined(assumeSortOrder) then ("-ASO '" + assumeSortOrder + "'") else ""} \
      ~{if defined(barcodeTag) then ("--BARCODE_TAG '" + barcodeTag + "'") else ""} \
      ~{if (defined(comment) && length(select_first([comment])) > 0) then "-CO '" + sep("' '", select_first([comment])) + "'" else ""} \
      ~{if defined(opticalDuplicatePixelDistance) then ("--OPTICAL_DUPLICATE_PIXEL_DISTANCE " + opticalDuplicatePixelDistance) else ''} \
      ~{if length(bam) > 0 then "-I '" + sep("' '", bam) + "'" else ""} \
      -O '~{select_first([outputFilename, "~{select_first([outputPrefix, "generated"])}.markduped.bam"])}' \
      -M '~{select_first([metricsFilename, "~{select_first([outputPrefix, "generated"])}.metrics.txt"])}' \
      ~{if (defined(argumentsFile) && length(select_first([argumentsFile])) > 0) then "--arguments_file '" + sep("' '", select_first([argumentsFile])) + "'" else ""} \
      ~{if select_first([createIndex, true]) then "--CREATE_INDEX" else ""} \
      ~{if (defined(createMd5File) && select_first([createMd5File])) then "--CREATE_MD5_FILE" else ""} \
      ~{if defined(maxRecordsInRam) then ("--MAX_RECORDS_IN_RAM " + maxRecordsInRam) else ''} \
      ~{if (defined(quiet) && select_first([quiet])) then "--QUIET" else ""} \
      ~{if defined(select_first([tmpDir, "tmp/"])) then ("--TMP_DIR '" + select_first([tmpDir, "tmp/"]) + "'") else ""} \
      ~{if (defined(useJdkDeflater) && select_first([useJdkDeflater])) then "--use_jdk_deflater" else ""} \
      ~{if (defined(useJdkInflater) && select_first([useJdkInflater])) then "--use_jdk_inflater" else ""} \
      ~{if defined(validationStringency) then ("--VALIDATION_STRINGENCY '" + validationStringency + "'") else ""} \
      ~{if defined(verbosity) then ("--verbosity '" + verbosity + "'") else ""}
    if [ -f $(echo '~{select_first([outputFilename, "~{select_first([outputPrefix, "generated"])}.markduped.bam"])}' | sed 's/\.[^.]*$//').bai ]; then ln -f $(echo '~{select_first([outputFilename, "~{select_first([outputPrefix, "generated"])}.markduped.bam"])}' | sed 's/\.[^.]*$//').bai $(echo '~{select_first([outputFilename, "~{select_first([outputPrefix, "generated"])}.markduped.bam"])}' ).bai; fi
  >>>
  runtime {
    cpu: select_first([runtime_cpu, 4, 1])
    disks: "local-disk ~{select_first([runtime_disks, 20])} SSD"
    docker: "broadinstitute/gatk:4.1.4.0"
    duration: select_first([runtime_seconds, 86400])
    memory: "~{select_first([runtime_memory, 8, 4])}G"
    preemptible: 2
  }
  output {
    File out = select_first([outputFilename, "~{select_first([outputPrefix, "generated"])}.markduped.bam"])
    File out_bai = select_first([outputFilename, "~{select_first([outputPrefix, "generated"])}.markduped.bam"]) + ".bai"
    File metrics = select_first([metricsFilename, "~{select_first([outputPrefix, "generated"])}.metrics.txt"])
  }
}

Common Workflow Language¶

#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.2
label: 'GATK4: Mark Duplicates'
doc: |-
  MarkDuplicates (Picard): Identifies duplicate reads.

  This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
  defined as originating from a single fragment of DNA. Duplicates can arise during sample
  preparation e.g. library construction using PCR. See also EstimateLibraryComplexity for
  additional notes on PCR duplication artifacts. Duplicate reads can also result from a single
  amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the
  sequencing instrument. These duplication artifacts are referred to as optical duplicates.

  The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads
  and read-pairs in a SAM/BAM file. An BARCODE_TAG option is available to facilitate duplicate
  marking using molecular barcodes. After duplicate reads are collected, the tool differentiates
  the primary and duplicate reads using an algorithm that ranks reads by the sums of their
  base-quality scores (default method).

  The tool's main output is a new SAM or BAM file, in which duplicates have been identified
  in the SAM flags field for each read. Duplicates are marked with the hexadecimal value of 0x0400,
  which corresponds to a decimal value of 1024. If you are not familiar with this type of annotation,
  please see the following blog post for additional information.

  Although the bitwise flag annotation indicates whether a read was marked as a duplicate,
  it does not identify the type of duplicate. To do this, a new tag called the duplicate type (DT)
  tag was recently added as an optional output in the 'optional field' section of a SAM/BAM file.
  Invoking the TAGGING_POLICY option, you can instruct the program to mark all the duplicates (All),
  only the optical duplicates (OpticalOnly), or no duplicates (DontTag). The records within the
  output of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked TAGGING_POLICY),
  as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ).
  This tool uses the READ_NAME_REGEX and the OPTICAL_DUPLICATE_PIXEL_DISTANCE options as the
  primary methods to identify and differentiate duplicate types. Set READ_NAME_REGEX to null to
  skip optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are
  extremely large and estimating library complexity is not an aim. Note that without optical
  duplicate counts, library size estimation will be inaccurate.

  MarkDuplicates also produces a metrics file indicating the numbers
  of duplicates for both single- and paired-end reads.

  The program can take either coordinate-sorted or query-sorted inputs, however the behavior
  is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records
  and supplementary/secondary alignments are not marked as duplicates. However, when the input
  is query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary
  reads are not excluded from the duplication test and can be marked as duplicate reads.

  If desired, duplicates can be removed using the REMOVE_DUPLICATE and REMOVE_SEQUENCING_DUPLICATES options.

requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: DockerRequirement
  dockerPull: broadinstitute/gatk:4.1.4.0

inputs:
- id: bam
  label: bam
  doc: One or more input SAM or BAM files to analyze. Must be coordinate sorted.
  type:
    type: array
    items: File
  inputBinding:
    prefix: -I
    position: 10
- id: outputPrefix
  label: outputPrefix
  type:
  - string
  - 'null'
- id: outputFilename
  label: outputFilename
  doc: File to write duplication metrics to
  type:
  - string
  - 'null'
  default: generated.markduped.bam
  inputBinding:
    prefix: -O
    position: 10
    valueFrom: |-
      $([inputs.outputPrefix, "generated"].filter(function (inner) { return inner != null })[0]).markduped.bam
- id: metricsFilename
  label: metricsFilename
  doc: The output file to write marked records to.
  type:
  - string
  - 'null'
  default: generated.metrics.txt
  inputBinding:
    prefix: -M
    position: 10
    valueFrom: |-
      $([inputs.outputPrefix, "generated"].filter(function (inner) { return inner != null })[0]).metrics.txt
- id: javaOptions
  label: javaOptions
  type:
  - type: array
    items: string
  - 'null'
- id: compression_level
  label: compression_level
  doc: |-
    Compression level for all compressed files created (e.g. BAM and VCF). Default value: 2.
  type:
  - int
  - 'null'
- id: argumentsFile
  label: argumentsFile
  doc: read one or more arguments files and add them to the command line
  type:
  - type: array
    items: File
  - 'null'
  inputBinding:
    prefix: --arguments_file
    position: 10
- id: assumeSortOrder
  label: assumeSortOrder
  doc: |-
    If not null, assume that the input file has this order even if the header says otherwise. Exclusion: This argument cannot be used at the same time as ASSUME_SORTED. The --ASSUME_SORT_ORDER argument is an enumerated type (SortOrder), which can have one of the following values: [unsorted, queryname, coordinate, duplicate, unknown]
  type:
  - string
  - 'null'
  inputBinding:
    prefix: -ASO
- id: barcodeTag
  label: barcodeTag
  doc: Barcode SAM tag (ex. BC for 10X Genomics)
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --BARCODE_TAG
- id: comment
  label: comment
  doc: Comment(s) to include in the output file's header.
  type:
  - type: array
    items: string
  - 'null'
  inputBinding:
    prefix: -CO
- id: createIndex
  label: createIndex
  doc: Whether to create a BAM index when writing a coordinate-sorted BAM file.
  type: boolean
  default: true
  inputBinding:
    prefix: --CREATE_INDEX
    position: 11
- id: createMd5File
  label: createMd5File
  doc: Whether to create an MD5 digest for any BAM or FASTQ files created.
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --CREATE_MD5_FILE
    position: 11
- id: maxRecordsInRam
  label: maxRecordsInRam
  doc: |-
    When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed.
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --MAX_RECORDS_IN_RAM
    position: 11
- id: quiet
  label: quiet
  doc: Whether to suppress job-summary info on System.err.
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --QUIET
    position: 11
- id: tmpDir
  label: tmpDir
  doc: Undocumented option
  type: string
  default: tmp/
  inputBinding:
    prefix: --TMP_DIR
    position: 11
- id: useJdkDeflater
  label: useJdkDeflater
  doc: Whether to use the JdkDeflater (as opposed to IntelDeflater)
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --use_jdk_deflater
    position: 11
- id: useJdkInflater
  label: useJdkInflater
  doc: Whether to use the JdkInflater (as opposed to IntelInflater)
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --use_jdk_inflater
    position: 11
- id: validationStringency
  label: validationStringency
  doc: |-
    Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values: [STRICT, LENIENT, SILENT]
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --VALIDATION_STRINGENCY
    position: 11
- id: verbosity
  label: verbosity
  doc: |-
    The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values: [ERROR, WARNING, INFO, DEBUG]
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --verbosity
    position: 11
- id: opticalDuplicatePixelDistance
  label: opticalDuplicatePixelDistance
  doc: |-
    The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --OPTICAL_DUPLICATE_PIXEL_DISTANCE

outputs:
- id: out
  label: out
  type: File
  secondaryFiles:
  - |-
    ${

            function resolveSecondary(base, secPattern) {
              if (secPattern[0] == "^") {
                var spl = base.split(".");
                var endIndex = spl.length > 1 ? spl.length - 1 : 1;
                return resolveSecondary(spl.slice(undefined, endIndex).join("."), secPattern.slice(1));
              }
              return base + secPattern
            }
            return [
                    {
                        path: resolveSecondary(self.path, "^.bai"),
                        basename: resolveSecondary(self.basename, ".bai"),
                        class: "File",
                    }
            ];

    }
  outputBinding:
    glob: |-
      $([inputs.outputPrefix, "generated"].filter(function (inner) { return inner != null })[0]).markduped.bam
    loadContents: false
- id: metrics
  label: metrics
  type: File
  outputBinding:
    glob: |-
      $([inputs.outputPrefix, "generated"].filter(function (inner) { return inner != null })[0]).metrics.txt
    loadContents: false
stdout: _stdout
stderr: _stderr

baseCommand:
- gatk
- MarkDuplicates
arguments:
- prefix: --java-options
  position: -1
  valueFrom: |-
    $("-Xmx{memory}G {compression} {otherargs}".replace(/\{memory\}/g, (([inputs.runtime_memory, 8, 4].filter(function (inner) { return inner != null })[0] * 3) / 4)).replace(/\{compression\}/g, (inputs.compression_level != null) ? ("-Dsamjdk.compress_level=" + inputs.compression_level) : "").replace(/\{otherargs\}/g, [inputs.javaOptions, []].filter(function (inner) { return inner != null })[0].join(" ")))

hints:
- class: ToolTimeLimit
  timelimit: |-
    $([inputs.runtime_seconds, 86400].filter(function (inner) { return inner != null })[0])
id: Gatk4MarkDuplicates