Tutorial 1 - Building a Workflow¶
In this stage, we’re going to build a simple workflow to align short reads of DNA.
- Start with a pair of compressed
FASTQ
files, - Align these reads using
BWA MEM
into an uncompressedSAM
file (the de facto standard for short read alignments), - Compress this into the binary equivalent
BAM
file usingsamtools
, and finally - Sort the reads using
GATK4 SortSam
.
These tools already exist within the Janis Tool Registry, you can see their documentation online:
Preparation¶
To prepare for this tutorial, we’re going to create a folder and download some data:
mkdir janis-tutorials && cd janis-tutorials
# If WGET is installed
wget -q -O- "https://github.com/PMCC-BioinformaticsCore/janis-workshops/raw/master/janis-data.tar" | tar -xz
# If CURL is installed
curl -Ls "https://github.com/PMCC-BioinformaticsCore/janis-workshops/raw/master/janis-data.tar" | tar -xz
Creating our file¶
A Janis workflow is a Python script, so we can start by creating a file called alignment.py
and importing Janis.
mkdir tools
vim tools/alignment.py # or vim, emacs, sublime, vscode
From the janis_core
library, we’re going to import WorkflowBuilder
and a String
:
from janis_core import WorkflowBuilder, String
Imports¶
We have four inputs we want to expose on this workflow:
- Sequencing Reads (
FastqGzPair
- paired end sequence) - Sample name (
String
) - Read group header (
String
) - Reference files (
Fasta
+ index files (FastaWithIndex
))
We’ve already imported the String
type, and we can import FastqGzPair
and FastaWithIndex
from the janis_bioinformatics
registry:
from janis_bioinformatics.data_types import FastqGzPairedEnd, FastaWithDict
Tools¶
We’ve discussed the tools we’re going to use. The documentation for each tool has a row in the tbale caled “Python” that gives you the import statement. This is how we’ll import how tools:
from janis_bioinformatics.tools.bwa import BwaMemLatest
from janis_bioinformatics.tools.samtools import SamToolsView_1_9
from janis_bioinformatics.tools.gatk4 import Gatk4SortSam_4_1_2
Declaring our workflow¶
We’ll create an instance of the WorkflowBuilder
class, this just requires a name for your workflow (can contain alphanumeric characters and underscores).
w = WorkflowBuilder("alignmentWorkflow")
A workflow has 3 methods for building workflows:
workflow.input
- Used for creating inputs,workflow.step
- Creates a step on a workflow,workflow.output
- Exposes an output on a workflow.
We give each input / step / output a unique identifier, which then becomes a node in our workflow graph. We can refer to the created node using dot-notation (eg: w.input_name
). We’ll see how this works in the later sections.
More information about each step will be linked from this page about the Workflow
and WorkflowBuilder
class.
Creating inputs on a workflow¶
Further reading: Creating an input
To create an input on a workflow, you can use the Workflow.input
method, which has the following structure:
Workflow.input(
identifier: str,
datatype: DataType,
default: any = None,
doc: str = None
)
An input requires a unique identifier (string) and a DataType (String, FastqGzPair, etc). Let’s prepare the inputs for our workflow:
w.input("sample_name", String)
w.input("read_group", String)
w.input("fastq", FastqGzPairedEnd)
w.input("reference", FastaWithDict)
Declaring our steps and connections¶
Further reading: Creating a step
Similar to exposing inputs, we create steps with the Workflow.step
method. It has the following structure:
Workflow.step(
identifier: str,
tool: janis_core.tool.tool.Tool,
scatter: Union[str, List[str], ScatterDescription] = None,
)
We provide a identifier for the step (unique amongst the other nodes in the workflow), and intialise our tool, passing our inputs of the step as parameters.
We can refer to an input (or previous result) using the dot notation. For example, to refer to the fastq
input, we can use w.fastq
.
BWA MEM¶
We use bwa mem’s documentation to determine that we need to provide the following inputs:
reads
:FastqGzPair
(connect tow.fastq
)readGroupHeaderLine
:String
(connect tow.read_group
)reference
:FastaWithDict
(connect tow.reference
)
We can connect them to the relevant inputs to get the following step definition:
w.step(
"bwamem", # identifier
BwaMemLatest(
reads=w.fastq,
readGroupHeaderLine=w.read_group,
reference=w.reference
)
)
Samtools view¶
We’ll use a very similar pattern for Samtools View, except this time we’ll reference the output of bwamem
. From bwa mem’s documentation, there is one output called out
with type Sam
. We’ll connect this to SamtoolsView
only input, called sam
.
w.step(
"samtoolsview",
SamToolsView_1_9(
sam=w.bwamem.out
)
)
SortSam¶
In addition to connecting the output of samtoolsview
to Gatk4 SortSam, we want to tell SortSam to use the following values:
- sortOrder:
"coordinate"
- createIndex:
True
Instead of connecting an input or a step, we just just provide the literal value.
w.step(
"sortsam",
Gatk4SortSam_4_1_2(
bam=w.samtoolsview.out,
sortOrder="coordinate",
createIndex=True
)
)
Exposing outputs¶
Further reading: Creating an output
Outputs have a very similar syntax to both inputs and steps, they take an identifier
and a named source
parameter. Here is the structure:
Workflow.output(
identifier: str,
datatype: DataType = None,
source: Node = None,
output_folder: List[Union[String, Node]] = None,
output_name: Union[String, Node] = None
)
Often, we don’t want to specify the output data type, because we can let Janis do this for us. We’ll talk about the output_folder
and output_name
in the next few sections. For now, we just have to specify an output identifier and a source.
w.output("out", source=w.sortsam.out)
Workflow + Translation¶
Hopefully you have a workflow that looks like the following!
from janis_core import WorkflowBuilder, String
from janis_bioinformatics.data_types import FastqGzPairedEnd, FastaWithDict
from janis_bioinformatics.tools.bwa import BwaMemLatest
from janis_bioinformatics.tools.samtools import SamToolsView_1_9
from janis_bioinformatics.tools.gatk4 import Gatk4SortSam_4_1_2
w = WorkflowBuilder("alignmentWorkflow")
# Inputs
w.input("sample_name", String)
w.input("read_group", String)
w.input("fastq", FastqGzPairedEnd)
w.input("reference", FastaWithDict)
# Steps
w.step(
"bwamem",
BwaMemLatest(
reads=w.fastq,
readGroupHeaderLine=w.read_group,
reference=w.reference
)
)
w.step(
"samtoolsview",
SamToolsView_1_9(
sam=w.bwamem.out
)
)
w.step(
"sortsam",
Gatk4SortSam_4_1_2(
bam=w.samtoolsview.out,
sortOrder="coordinate",
createIndex=True
)
)
# Outputs
w.output("out", source=w.sortsam.out)
We can translate the following file into Workflow Description Language using janis from the terminal:
janis translate tools/alignment.py wdl
Running the alignment workflow¶
We’ll run the workflow against the current directory.
janis run -o . --engine cwltool \
tools/alignment.py \
--fastq data/BRCA1_R*.fastq.gz \
--reference reference/hg38-brca1.fasta \
--sample_name NA12878 \
--read_group "@RG\tID:NA12878\tSM:NA12878\tLB:NA12878\tPL:ILLUMINA"
After the workflow has run, you’ll see the outputs in the current directory:
ls
# drwxr-xr-x mfranklin 1677682026 160B data
# drwxr-xr-x mfranklin 1677682026 256B janis
# -rw-r--r-- mfranklin wheel 2.7M out.bam
# -rw-r--r-- mfranklin wheel 296B out.bam.bai
# drwxr-xr-x mfranklin 1677682026 320B reference
# drwxr-xr-x mfranklin 1677682026 128B tools
OPTIONAL: Run with Cromwell¶
If you have java
installed, Janis can run the workflow in the Crowmell execution engine by using the --engine cromwell
parameter:
janis run -o run-with-cromwell --engine cromwell \
tools/alignment.py \
--fastq data/BRCA1_R*.fastq.gz \
--reference reference/hg38-brca1.fasta \
--sample_name NA12878 \
--read_group "@RG\tID:NA12878\tSM:NA12878\tLB:NA12878\tPL:ILLUMINA"