Running PopGLen on a cluster using a job queue
PopGLen is primarily intended to run on a high-performance computing (HPC) system with a job scheduler. This allows each rule to be submitted as its own job on a system with many nodes, enabling a high degree of parallelization. Snakemake integrates well with many job schedulers, and examples can be found in the Snakemake plugin catalog.
Command line options for cluster execution
Here, we will walk through what an execution might look like on an HPC with the SLURM job scheduler. This requires a conda environment with both Snakemake and the snakemake-executor-plugin-slurm installed:
conda create \
-n popglen \
-c conda-forge -c bioconda \
snakemake=8.25.0 \
snakemake-executor-plugin-slurm
conda activate popglen
## OR, if you already set up a popglen environment and just need to add the
## executor plugin:
conda activate popglen
conda install -c conda-forge -c bioconda snakemake-executor-plugin-slurm
Once you've set up a working directory and workflow for PopGLen, you can run it using the following command:
snakemake --use-conda --use-singularity --executor slurm \
    --default-resources slurm_account=<project> slurm_partition=<partition name>
This will ensure that rules are submitted as jobs to the job queue, by default using the SLURM account and partition defined in --default-resources. The values entered for these will depend on your system, but they are the equivalent of the account and partition options you would set in the header of a SLURM script:
#SBATCH -A <project>
#SBATCH -p <partition name>
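For example, with a hypothetical account and partition name (substitute your system's values), and adding -n to check the plan with a dry run first:

# Hypothetical account/partition names; remove -n to launch for real
snakemake --use-conda --use-singularity --executor slurm \
    --default-resources slurm_account=naiss2024-1-123 slurm_partition=shared \
    -n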
Combining command line options into a profile
You may wish to set a few more options, such as the maximum number of threads available to each job, the maximum number of jobs running and queued at once (based on your queue's limits), and how many local cores to use for rules that run outside the job system (in PopGLen these are quick commands, like making lists of BAM files or creating symbolic links). These options quickly add up, so take a look at how we define them all in an example profile for the HPC system Dardel at PDC:
restart-times: 3
local-cores: 1
use-conda: true
use-singularity: true
jobs: 999
keep-going: true
max-threads: 128
executor: slurm
singularity-args: '--tmp-sandbox -B /cfs/klemming'
default-resources:
- "mem_mb=(threads*1700)"
- "runtime=60"
- "slurm_account=<account>"
- "slurm_partition=shared"
- "tmpdir='/cfs/klemming/scratch/u/user'"
Assuming we save this file as config.yaml in a directory profiles/dardel within the working directory, we can run Snakemake with simply:
snakemake --profile ./profiles/dardel
and it will automatically use the options defined in the profile. Here, we allow up to 3 restarts for failed jobs, use 1 local core for running local rules, enable both conda and singularity, allow up to 999 jobs to be queued or running at once, tell Snakemake to keep going with independent jobs if one fails, cap any single job at 128 threads (the size of a node on Dardel), and set SLURM as the executor. We also pass some arguments that Singularity requires on Dardel and define several default resources: 1.7GB of memory per thread (the amount available per core on Dardel's shared partition), a one-hour default runtime, the SLURM account and partition, and the temporary directory to use, as it is not set on Dardel by default.
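If a specific rule needs more than these defaults, resources can be raised for just that rule with set-resources in the same profile. A minimal sketch, using a hypothetical rule name (substitute the name of the rule that is running out of memory or time):

set-resources:
  # hypothetical rule name; check Snakemake's output for the real one
  map_reads:
    mem_mb: 16000
    runtime: 240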
Note that if you're not on Dardel, this will need some changes, so make a new profile that matches your system. You'll likely need to change max-threads, slurm_account, slurm_partition, and tmpdir. For mem_mb, the 1700 can be changed to match the amount of memory your system has per core. singularity-args is specific to Dardel and can be omitted unless your system requires something similar; at minimum, -B /cfs/klemming will need to be changed or removed.
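Purely as an illustration, on a hypothetical cluster with 64-core nodes and 4GB of memory per core (all values made up; replace them with your system's), the changed lines might look like:

max-threads: 64
default-resources:
  - "mem_mb=(threads*4000)"
  - "runtime=60"
  - "slurm_account=myproject"
  - "slurm_partition=main"
  - "tmpdir='/scratch/myuser'"
# singularity-args omitted, assuming no extra bind paths are needed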
Executing the workflow while away using Screen
If you run Snakemake in the login shell of your system, it will be cancelled when you log out, which is not ideal given the long runtimes expected when processing WGS data. If your system lets you submit new jobs from within other jobs, you can submit your Snakemake command as a long-running SLURM job, as sketched below. Be sure to activate the conda environment for PopGLen either inside the job before running Snakemake or before submitting the job, as the job inherits your active environment.
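A minimal sketch of such a submission script, saved as, say, run_popglen.sh, with placeholder account, partition, and time limit (adjust these to your system and dataset):

#!/bin/bash
#SBATCH -A <project>
#SBATCH -p <partition name>
#SBATCH -t 7-00:00:00
#SBATCH -c 1
# The job inherits the environment that was active at submission;
# alternatively, activate it here:
# conda activate popglen
snakemake --profile ./profiles/my-cluster

Submit it with sbatch run_popglen.sh after activating the popglen environment.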
If you can't submit jobs from inside other jobs, or you would like the option to interact with the workflow while it's running, you can run it inside a screen session, if screen is available on your system. First, start a new screen:
screen -S project-name
This will open a virtual terminal that stays active as long as your system is online. Inside this terminal, you can activate your conda environment, do dry runs, and start the workflow:
conda activate popglen
# do a dry run
snakemake --profile ./profiles/my-cluster -n
# do a real run
snakemake --profile ./profiles/my-cluster
Snakemake will then submit jobs from inside the screen. You can disconnect from the screen with CTRL-A + D, and safely log out without Snakemake being interrupted. Then, you can go back in by resuming your screen:
screen -r project-name
and see how far it has gotten (or whether it has failed and you need to change something).
When you're done with a screen, you can kill it with CTRL-A + K.
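If you've forgotten the name of your screen session, you can list the active sessions with:

screen -ls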