Getting Started
Tutorial
A tutorial is available that uses a small(ish) dataset from which biologically meaningful results can be produced. It can help you get a sense of a good workflow across the different modules. You can also follow along with your own data, skipping any analyses you don't want. I recommend at least looking through it to get an idea of the scope of the pipeline and to pick up a few tips. If you prefer to just jump in instead, the sections below describe how to get a new project up and running.
Requirements
This pipeline can be run on Linux systems with Conda and Apptainer/Singularity installed. All other dependencies are handled by the workflow, so sufficient storage space is needed for these installations (~10 GB when using all features of the pipeline; if you are not using the mapping module, expect more like 3-5 GB). It can be run on a local workstation with sufficient resources and storage space (dataset dependent), but it is aimed at execution on high-performance computing systems with job queuing systems.
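To confirm the base requirements are available, you can check for both tools on the command line (on older systems the container runtime may be installed as singularity rather than apptainer):
conda --version        # any recent Conda release should do
apptainer --version    # or: singularity --version on older installs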
Data-wise, you'll need a reference genome (uncompressed) and some sequencing data for your samples. The latter can be either raw FASTQ files, BAM alignments to the reference, or SRA accession numbers for already published FASTQ files.
Deploying the workflow
The pipeline can be deployed in two ways: (1) using Snakedeploy, which will deploy the pipeline as a module (recommended); (2) cloning the repository at the version/branch you prefer (recommended if you will need to change workflow code beyond what is possible with module definitions). If you are curious how modularization works in Snakemake, take a look at the docs.
Both methods require a Snakemake environment to run the pipeline in.
Preparing the environment
First, create an environment for Snakemake, including Snakedeploy if you intend to deploy that way:
conda create -c conda-forge -c bioconda --name popglen snakemake snakedeploy
If you already have a Snakemake environment, you can use that instead. Snakemake versions >=8 are likely to work, but most testing is on 8.20. If you intend to use a job queuing system with the pipeline, be sure to install the appropriate executor plugin. Most of the testing has been with snakemake-executor-plugin-slurm, which can be installed in the environment you are running Snakemake from.
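For example, a sketch of adding the SLURM executor plugin from conda-forge (the environment name here assumes the popglen environment created above; the plugin is also available on PyPI):
conda install -n popglen -c conda-forge snakemake-executor-plugin-slurm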
Activate the Snakemake environment:
conda activate popglen
Option 1. Deploying with Snakedeploy
Make your working directory:
mkdir -p /path/to/work-dir
cd /path/to/work-dir
And deploy the workflow, using the tag for the version you want to deploy:
snakedeploy deploy-workflow https://github.com/zjnolen/PopGLen . --tag v0.4.1
This will generate a simple Snakefile in a workflow folder that loads the pipeline as a module. It will also download the template config.yaml, samples.tsv, and units.tsv into the config folder.
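For reference, the generated Snakefile will look roughly like the sketch below; the exact contents depend on your Snakedeploy version, and the module name may differ:
# Load the workflow's configuration file
configfile: "config/config.yaml"

# Declare PopGLen as a module, pulled from GitHub at the deployed tag
module PopGLen:
    snakefile:
        github("zjnolen/PopGLen", path="workflow/Snakefile", tag="v0.4.1")
    config:
        config

# Import all rules from the module
use rule * from PopGLen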
Option 2. Cloning from GitHub
Go to the folder you would like your working directory to be created in and clone the GitHub repo:
git clone https://github.com/zjnolen/PopGLen.git
If you would like, you can change the name of the directory:
mv PopGLen work-dir-name
Move into the working directory (PopGLen, or work-dir-name if you changed it) and checkout the version you would like to use:
git checkout v0.4.1
This can also be used to checkout specific branches or commits.
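For example (the branch name and commit hash below are placeholders; check the repository for what is actually available):
git checkout main       # a branch
git checkout abc1234    # or a specific commit hash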
Configuring the workflow
Now you are ready to configure the workflow; see the configuration documentation for details.