Running CAW with Singularity

Maxime 2017-11-15
Last modified 2017-11-17

This post was written for the Nextflow blog. The final version is published there; this is the draft I shared with my colleagues to get comments and suggestions.

The CAW Pipeline

Cancer Analysis Workflow (CAW for short) is a pipeline developed for the analysis of tumour/normal pairs. It is developed in collaboration between two infrastructures within Science for Life Laboratory: National Genomics Infrastructure (NGI), in the Stockholm Genomics Applications Development Facility to be precise, and National Bioinformatics Infrastructure Sweden (NBIS).

CAW is based on GATK Best Practices for the preprocessing of FastQ files, then uses various variant calling tools to look for somatic SNVs and small indels (MuTect1, MuTect2, Strelka, Freebayes), germline variants (GATK HaplotypeCaller), structural variants (Manta) and CNVs (ASCAT). Annotation tools (snpEff, VEP) are also used, and finally MultiQC handles the reports.

We are currently working on a manuscript, but you’re welcome to look at (or even contribute to) our GitHub repository or talk with us on our Gitter channel.

Singularity and UPPMAX

Singularity is a tool to package software dependencies into a contained environment, much like Docker. It’s designed to run on HPC environments where Docker is often a problem due to its requirement for administrative privileges.

We’re based in Sweden, and Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) provides computational infrastructure for all Swedish researchers. Since we’re analyzing sensitive data, we use the secure clusters (with two-factor authentication) set up by UPPMAX: SNIC-SENS.

In my case, since we’re still developing the pipeline, I am mainly using the research cluster Bianca. There, I can only transfer files and data through one specific directory using SFTP.

UPPMAX provides computing resources for Swedish researchers in all scientific domains, so getting software updates can occasionally take some time. Typically, environment modules are used, which make several versions of each tool available; this is good for reproducibility and quite easy to use. However, the approach is not portable to clusters outside of UPPMAX.

Why use containers?

The idea of using containers for improved portability, reproducibility and more up-to-date tools came naturally to us, as containers are easily managed within Nextflow. We cannot use Docker on our secure cluster, so we wanted to run CAW with Singularity images instead.

How was the switch made?

We were already using Docker containers for our continuous integration testing with Travis, and since we use many tools, I took the approach of making (almost) one container for each process. Because building them is quite slow and repetitive, and I’m lazy (I mean, I like to automate everything), I made a simple NF script to build and push all Docker containers. Basically it’s just build and push for all containers, with some configuration possibilities.

docker build -t ${repository}/${container}:${tag} ${baseDir}/containers/${container}/.

docker push ${repository}/${container}:${tag}
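As a rough illustration, the two commands above could be wrapped in a plain shell loop like the one below. This is only a sketch and not the actual NF build script: the `example` repository namespace, the `DRY_RUN` switch and the helper names `run` and `build_and_push` are assumptions made for this example, and the container list is limited to the three image prefixes that appear in the config excerpt further down.

```shell
#!/usr/bin/env bash
# Sketch only: build and push one Docker image per container directory.
# REPOSITORY is a placeholder namespace; the container names are the three
# image prefixes visible in the config excerpt, not the full CAW list.
set -euo pipefail

REPOSITORY="${REPOSITORY:-example}"   # hypothetical DockerHub namespace
TAG="${TAG:-1.2.3}"
BASE_DIR="${BASE_DIR:-.}"
DRY_RUN="${DRY_RUN:-1}"               # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build the image for one container directory, then push it.
build_and_push() {
  local container="$1"
  local image="${REPOSITORY}/${container}:${TAG}"
  run docker build -t "$image" "${BASE_DIR}/containers/${container}/."
  run docker push "$image"
}

for container in caw multiqc gatk; do
  build_and_push "$container"
done
```

By default the script only prints the commands it would run; setting `DRY_RUN=0` executes them, which requires a working Docker installation and push rights to the repository.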

Since Singularity can directly pull images from DockerHub, I made the build script also pull all containers from DockerHub to create local Singularity image files.

singularity pull --name ${container}-${tag}.img docker://${repository}/${container}:${tag}
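A minimal shell sketch of that pull step, assuming the image files end up in the current directory and that an existing file should not be pulled again. The skip logic, the `example` namespace, the `DRY_RUN` switch and the function name `pull_image` are illustrative assumptions, not part of the original build script.

```shell
#!/usr/bin/env bash
# Sketch only: pull each DockerHub image into a local Singularity image file,
# skipping files that already exist in the current directory.
set -euo pipefail

REPOSITORY="${REPOSITORY:-example}"   # hypothetical DockerHub namespace
TAG="${TAG:-1.2.3}"
DRY_RUN="${DRY_RUN:-1}"               # 1 = print commands, 0 = execute them

pull_image() {
  local container="$1"
  local name="${container}-${TAG}.img"
  local uri="docker://${REPOSITORY}/${container}:${TAG}"
  if [ -e "$name" ]; then
    # Assumption: an image file with this name is already up to date.
    echo "skip ${name}: already present"
  elif [ "$DRY_RUN" = "1" ]; then
    echo "singularity pull --name ${name} ${uri}"
  else
    singularity pull --name "$name" "$uri"
  fi
}

for container in caw multiqc gatk; do
  pull_image "$container"
done
```

Skipping existing files keeps repeated runs cheap, at the cost of not noticing when an image with the same tag has changed on DockerHub.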

After this, it’s just a matter of moving all containers to the secure cluster we’re using, and using the right configuration file in the profile. I’ll spare you the details of the SFTP transfer. This is what the configuration file for such Singularity images (singularity-path.config) looks like:

/*
vim: syntax=groovy
-*- mode: groovy;-*-
 * -------------------------------------------------
 * Nextflow config file for CAW project
 * -------------------------------------------------
 * Paths to Singularity images for every process
 * No image will be pulled automatically
 * Need to transfer and set up images before
 * -------------------------------------------------
 */

singularity {
  enabled = true
  runOptions = "--bind /scratch"
}

params {
  containerPath='containers'
  tag='1.2.3'
}

process {
  $ConcatVCF.container      = "${params.containerPath}/caw-${params.tag}.img"
  $RunMultiQC.container     = "${params.containerPath}/multiqc-${params.tag}.img"
  $IndelRealigner.container = "${params.containerPath}/gatk-${params.tag}.img"
  // I'm not putting the whole file here
  // you probably already got the point
}

This approach ran (almost) perfectly on the first try, except for one process failing due to a typo in a container name…
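A small pre-flight check can catch part of this class of mistake before a run starts. The sketch below assumes the `<containerPath>/<container>-<tag>.img` naming scheme from the config excerpt and simply reports image files that are missing on disk. The function name `check_images` is made up for this example, and the script does not parse the Nextflow config, so a typo inside the config itself would still go unnoticed.

```shell
#!/usr/bin/env bash
# Sketch only: report image files that the config would reference but that
# are not on disk. The naming scheme <dir>/<container>-<tag>.img mirrors the
# config excerpt; the function name is made up for this example.
set -euo pipefail

check_images() {
  local dir="$1" tag="$2"
  shift 2
  local status=0 container img
  for container in "$@"; do
    img="${dir}/${container}-${tag}.img"
    if [ ! -f "$img" ]; then
      echo "missing: ${img}"
      status=1
    fi
  done
  # Non-zero exit status if at least one image is missing.
  return "$status"
}

# Usage: ./check_images.sh containers 1.2.3 caw multiqc gatk
if [ "$#" -ge 2 ]; then
  check_images "$@"
fi
```

Run from the pipeline directory on the cluster after the transfer, it exits with a non-zero status if any expected image is absent.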

Conclusion

This switch was completed a couple of months ago and has been a great success. We are now using Singularity containers in almost all of our Nextflow pipelines developed at NGI. Even if we do enjoy the improved control, we must not forget that:

With great power comes great responsibility!
