Tutorial 3 - Naming and organising outputs¶
Sometimes it’s useful to organise your outputs in specific ways, especially when Janis might need to interact with other specific systems. By default, outputs are named by the tag of the output. The extension is derived from the type of the output (the Bam filetype knows it uses a .bam
extension).
For example, if you had an output called out
which had type BamBai
, your output files by default would have the name out.bam
and out.bam.bai
.
When exposing an output on a workflow, there are two arguments you can provide to override this behaviour:
output_name: Union[str, InputSelector, InputNode] = None
output_folder: Union[str, Tuple[Node, str], List[Union[str, InputSelector, Tuple[Node, str]]]] = None
Preparation¶
This tutorial uses the workflow build in Tutorial 1. You can follow this guide, but the example will require you to have built (or acquired) the workflow from tutorial 1.
Output name¶
Simply put, output_name
is the dervied filename of the output without the extension. By default, this is the tag
of the output.
You can specify a new output name in 2 ways:
- A static string:
output_name="new name for output"
- Selecting an input value (given “sample_name” is the name of an input):
output_name=workflow.sample_name
output_name=InputSelector("sample_name")
You should make the following considerations:
The input you select should be a string, or
If the output you’re naming is an array, the input you select should either be:
- singular
- have the same number of elements in it.
Janis will either fall back to the first element if it’s a list, or default to the output tag. This may cause outputs to override each other.
Output folder¶
Similar to the output name, the output_folder
is folder, or group of nested folders into which your output will be written. By default, this field has no value and outputs are linked directly into the output directory.
If the output_folder field is an array, a nested folder is created for each element in ascending order (eg: ["parent", "child", "child_of_child"]
).
There are multiple ways to specify output directories:
- A static string:
output_folder="my_outputs"
- Selecting an input value (given “sample_name” is the name of an input):
output_folder=workflow.sample_name
output_folder=InputSelector("sample_name")
- An array of a combination of values:
output_folder=["variants", "unmerged"]
output_folder=["variants", w.sample_name]
output_folder=[w.other_field, w.sample_name]
Alignment workflow¶
In our alignment example (tools/alignment.py
), we have the following outputs:
w.output("out", source=w.sortsam.out)
w.output("stats", source=w.flagstat.stats)
We want to name the output in the following way:
- Grouped the bam
out
into the folder:bams
, - Group the samtools summary into folder
statistics
- Both outputs named:
sample_name
(Janis will automatically add the.bam
or.txt
extension).
w.output(
"out",
source=w.sortsam.out,
output_name=w.sample_name,
output_folder=["bams"]
)
w.output(
"stats",
source=w.flagstat.stats,
output_name=w.sample_name,
output_folder=["statistics"]
)
Run the workflow again and inspect the new results:
janis translate tools/alignment.py wdl
janis run -o tutorial3 tools/alignment.py \
--fastq data/BRCA1_R*.fastq.gz \
--reference reference/hg38-brca1.fasta \
--sample_name NA12878 \
--read_group "@RG\tID:NA12878\tSM:NA12878\tLB:NA12878\tPL:ILLUMINA"
And then after the workflow has finished, we find the following output structure:
$ ls tutorial3/*
tutorial3/bams:
-rw-r--r-- PMCI\Bioinf-Cluster 2.7M NA12878.bam
-rw-r--r-- PMCI\Bioinf-Cluster 296B NA12878.bam.bai
tutorial3/statistics:
-rw-r--r-- PMCI\Bioinf-Cluster 408B NA12878.txt
tutorial3/janis:
[...ommitted]