Secondary / Accessory Files¶

In some domains (bioinformatics in particular), a single file isn’t enough to contain all the information. Janis borrows the concept of secondary files from the CWL specification and uses the same pattern for grouping these files.

Often a secondary or accessory file is used to provide additional information, potentially a quick access index. These files are attached to the original file by a specific file pattern. We’ll talk about that more soon.

For this reason, Janis allows data types that inherit from File to specify a secondary_file list: the files to be bundled with the primary file.

Secondary file pattern¶

As mentioned earlier, we follow the Common Workflow Language secondary file pattern:

If string begins with one or more caret ^ characters, for each caret, remove the last file extension from the path (the last period . and all following characters). If there are no file extensions, the path is unchanged.
Append the remainder of the string to the end of the file path.
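
The two rules above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the code Janis or cwltool actually runs; the function name apply_secondary_pattern is made up for this example, and it assumes POSIX-style paths.

```python
import posixpath


def apply_secondary_pattern(path: str, pattern: str) -> str:
    """Resolve one CWL-style secondary file pattern against a primary path.

    Hypothetical helper for illustration only.
    """
    while pattern.startswith("^"):
        # Each caret removes the last extension from the path.
        # splitext leaves a path without an extension unchanged.
        path = posixpath.splitext(path)[0]
        pattern = pattern[1:]
    # Append whatever is left of the pattern to the (possibly trimmed) path.
    return path + pattern


print(apply_secondary_pattern("myfile.bam", ".bai"))        # myfile.bam.bai
print(apply_secondary_pattern("reference.fasta", "^.dict"))  # reference.dict
```

A pattern with two carets removes two extensions, so "^^.idx" applied to "a.tar.gz" would give "a.idx".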

Examples¶

IndexedBam:
- Pattern: .bai
- Files:
  - Base: myfile.bam
  - myfile.bam.bai

FastaWithIndexes:
- Pattern: .amb, .ann, .bwt, .pac, .sa, .fai, ^.dict
- Files:
  - Base: reference.fasta
  - reference.fasta.amb
  - reference.fasta.ann
  - reference.fasta.bwt
  - reference.fasta.pac
  - reference.fasta.sa
  - reference.fasta.fai
  - reference.dict

Proposed¶

Implement optional secondary files as per CWL v1.1.

Implementation note¶

CWL¶

Because we mimic the CWL secondary file pattern, no extra work is needed beyond providing this pattern to a:

If you use secondaries_present_as on a janis.ToolInput or janis.ToolOutput, a CWL expression is generated to rename the secondary file expression. More information about this can be found in common-workflow-language/cwltool#1232.

Issues¶

CWLTool has an issue when attempting to scatter using multiple fields (using the dotproduct or *_crossproduct methods). More information can be found in common-workflow-language/cwltool#1208.

WDL¶

Implementing secondary files was one of the most challenging aspects of the WDL translation, because WDL has no concept of secondary files. There are a few things we had to consider:

- Every file needs to be individually localised.
- A data type with secondary files can be used in an array of inputs.
- Secondary files may need to be globbed if used as an Output data type.
- An array of files with secondaries can be scattered on (including scattered by multiple fields).
- Janis should fill the input job with these secondary files (with the correct extension).

Implementation¶

Let’s break this down into separate cases.

Case 1: Simple index¶

The definition workflow.input("my_bam", BamBai), when connected to a tool, might translate to the following WDL:

workflow WGSGermlineGATK {
  input {
    File my_bam
    # The .bai secondary file becomes its own input
    File my_bam_bai
  }
  call my_tool {
    input:
      bam=my_bam,
      bam_bai=my_bam_bai
  }
  output {
    File out_bam = my_tool.out
    File out_bam_bai = my_tool.out_bai
  }
}

Note the extra annotations and mappings for the bai type.
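
To make the mapping concrete, here is a minimal Python sketch of how one input with secondaries could be flattened into separate entries of the input job. The helper names (secondary_input_name, expand_input) are hypothetical, and the naming rule (strip carets and punctuation from the pattern, then join with an underscore) is an assumption inferred from the my_bam / my_bam_bai example above rather than a description of Janis internals.

```python
import posixpath
import re
from typing import Dict, List


def apply_secondary_pattern(path: str, pattern: str) -> str:
    """Resolve a CWL-style secondary pattern (each caret strips one extension)."""
    while pattern.startswith("^"):
        path = posixpath.splitext(path)[0]
        pattern = pattern[1:]
    return path + pattern


def secondary_input_name(tag: str, pattern: str) -> str:
    """Assumed convention: drop carets and punctuation, join with '_'."""
    suffix = re.sub(r"[^0-9A-Za-z]", "", pattern)
    return f"{tag}_{suffix}"


def expand_input(tag: str, primary_path: str, secondaries: List[str]) -> Dict[str, str]:
    """Build input-job entries for the primary file and each of its secondaries."""
    job = {tag: primary_path}
    for pattern in secondaries:
        name = secondary_input_name(tag, pattern)
        job[name] = apply_secondary_pattern(primary_path, pattern)
    return job


job = expand_input("my_bam", "/data/sample.bam", [".bai"])
print(job)
# {'my_bam': '/data/sample.bam', 'my_bam_bai': '/data/sample.bam.bai'}
```

Each key corresponds to one File declaration in the workflow's input block, which is why every secondary ends up being localised individually.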

Case 2: Array of inputs with simple scatter¶

This is a modification of the first example (note: this isn’t fully functional workflow code):

workflow.input("my_bams", Array(BamBai))

workflow.step(
    "my_step",
    MyTool(bam=workflow.my_bams),
    scatter="bam"
)

This might result in the following workflow:

workflow WGSGermlineGATK {
  input {
    Array[File] my_bams
    Array[File] my_bams_bai
  }
  # zip pairs each BAM (Q.left) with its index (Q.right)
  scatter (Q in zip(my_bams, my_bams_bai)) {
    call my_tool as my_step {
      input:
        bam=Q.left,
        bam_bai=Q.right
    }
  }

  output {
    Array[File] out_bams = my_step.out
    Array[File] out_bams_bai = my_step.out_bai
  }
}

Case 3: Multiple array inputs, scattering by multiple fields¶

Consider the following workflow:

workflow.input("my_bams", Array(BamBai))
workflow.input("my_references", Array(FastaBwa))

workflow.step(
    "my_step",
    ToolTypeThatAcceptsMultipleBioinfTypes(
        bam=workflow.my_bams, reference=workflow.my_references
    ),
    scatter=["bam", "reference"],
)

workflow.output("out_bam", source=workflow.my_step.out_bam)
workflow.output("out_reference", source=workflow.my_step.out_reference)

This gets complicated quickly:

workflow scattered_bioinf_complex {
  input {
    Array[File] my_bams
    Array[File] my_bams_bai
    Array[File] my_references
    Array[File] my_references_amb
    Array[File] my_references_ann
    Array[File] my_references_bwt
    Array[File] my_references_pac
    Array[File] my_references_sa
  }
  scatter (Q in zip(transpose([my_bams, my_bams_bai]), transpose([my_references, my_references_amb, my_references_ann, my_references_bwt, my_references_pac, my_references_sa]))) {
    call MyTool as my_step {
      input:
        bam=Q.left[0],
        bam_bai=Q.left[1],
        reference=Q.right[0],
        reference_amb=Q.right[1],
        reference_ann=Q.right[2],
        reference_bwt=Q.right[3],
        reference_pac=Q.right[4],
        reference_sa=Q.right[5]
    }
  }
  output {
    Array[File] out_bam = my_step.out_bam
    Array[File] out_reference = my_step.out_reference
  }
}
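
The zip(transpose(...), transpose(...)) expression is easier to follow with concrete values. The Python sketch below mimics what the two WDL functions do to the input arrays; transpose and wdl_zip are stand-ins written for this example (not Janis or WDL code), and the file names are made up. Only two of the reference secondaries are shown to keep it short.

```python
from typing import List, Tuple


def transpose(columns: List[List[str]]) -> List[List[str]]:
    """Mimic WDL transpose: per-field arrays become one row per item."""
    return [list(row) for row in zip(*columns)]


def wdl_zip(left: List[List[str]], right: List[List[str]]) -> List[Tuple[List[str], List[str]]]:
    """Mimic WDL zip: pair element i of left with element i of right."""
    if len(left) != len(right):
        # Python's zip would silently truncate, so check lengths explicitly.
        raise ValueError("zip requires arrays of equal length")
    return list(zip(left, right))


# Made-up inputs: two BAMs and two references, each with secondaries.
my_bams = ["a.bam", "b.bam"]
my_bams_bai = ["a.bam.bai", "b.bam.bai"]
my_references = ["x.fasta", "y.fasta"]
my_references_amb = ["x.fasta.amb", "y.fasta.amb"]

scattered = wdl_zip(
    transpose([my_bams, my_bams_bai]),
    transpose([my_references, my_references_amb]),
)

for left, right in scattered:
    # left[0] is the BAM and left[1] its index, matching Q.left[0] / Q.left[1].
    print({"bam": left[0], "bam_bai": left[1],
           "reference": right[0], "reference_amb": right[1]})
```

Transposing first keeps each primary file together with its secondaries, so a single scatter variable can carry the whole bundle into the call.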

Known limitations¶

There is no protection against namespace collisions:
- Two files with similar prefixes but differences in punctuation will clash.
- A second input that is suffixed with the secondary’s extension will clash, e.g. mybam_bai will clash with mybam with a secondary of .bai.
Globbing a secondary file might not be possible when the original file extension is unknown. There are two considerations for this:
- Subclasses of File should call super() with the expected extension.
- Globbing based on a generated Filename (through InputSelector) will consider the extension property.
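
The name-collision limitation can be reproduced with a small Python sketch. It assumes the same naming rule as the WDL examples (input name, underscore, then the secondary pattern with carets and punctuation removed); both helpers are hypothetical and exist only to show where clashes come from.

```python
import re
from collections import defaultdict
from typing import Dict, List


def secondary_input_name(tag: str, pattern: str) -> str:
    """Assumed convention: drop carets and punctuation, join with '_'."""
    return f"{tag}_{re.sub(r'[^0-9A-Za-z]', '', pattern)}"


def find_collisions(inputs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return every generated name that is claimed by more than one source."""
    owners = defaultdict(list)
    for tag, secondaries in inputs.items():
        owners[tag].append(tag)
        for pattern in secondaries:
            owners[secondary_input_name(tag, pattern)].append(f"{tag} + {pattern}")
    return {name: who for name, who in owners.items() if len(who) > 1}


# An input named mybam_bai clashes with the .bai secondary of mybam.
collisions = find_collisions({"mybam": [".bai"], "mybam_bai": []})
print(collisions)
# {'mybam_bai': ['mybam + .bai', 'mybam_bai']}

# Patterns that differ only in punctuation map to the same name.
print(find_collisions({"ref": [".dict", "^.dict"]}))
# {'ref_dict': ['ref + .dict', 'ref + ^.dict']}
```

A check like this could be run before translation to warn about clashing names.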

Relevant WDL issues: