Getting Started
Tutorial
Note
A tutorial is in progress, but not yet available. The pipeline can still be used by following the rest of the guide.
A tutorial is available with a small(ish) dataset that produces biologically meaningful results. It can help you understand a good workflow for using the different modules. You can also follow along with your own data and skip any analyses you don't want. If you prefer to jump straight in, the sections below describe how to quickly get a new project up and running.
Requirements
This pipeline can be run on Linux systems with Conda and Apptainer/Singularity installed. All other dependencies are handled by the workflow, so sufficient storage space is needed for these installations (~10GB, but this needs verification). It can be run on a local workstation with sufficient resources and storage space (dataset dependent), but is aimed at execution on high performance computing systems with job queuing systems.
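As a quick sanity check of the software requirements above, the following sketch reports whether Conda and a container runtime are on your PATH and shows the free space in the current directory. The helper name `have` is illustrative and not part of the pipeline; adjust the checks if your system exposes these tools differently (for example, through environment modules).

```shell
# Report whether a command is available on PATH (illustrative helper).
have() {
    command -v "$1" >/dev/null 2>&1
}

# Conda is required to create the Snakemake environment.
if have conda; then
    echo "conda: found"
else
    echo "conda: MISSING"
fi

# Either Apptainer or Singularity is sufficient as the container runtime.
if have apptainer || have singularity; then
    echo "container runtime: found"
else
    echo "container runtime: MISSING (install Apptainer or Singularity)"
fi

# Show free space where the workflow will be deployed.
df -h .
```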
Data-wise, you'll need a reference genome (uncompressed) and some sequencing data for your samples. The latter can be either raw FASTQ files, BAM alignments to the reference, or accession numbers for already published FASTQ files.
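Since the reference genome must be uncompressed, it can be worth confirming this before configuring anything. The sketch below tests for the gzip magic bytes (1f 8b) at the start of the file. The path `reference.fa` and the helper `is_gzipped` are hypothetical placeholders, and the check only detects gzip-style compression (including bgzip), not other formats.

```shell
# Return success if the file starts with the gzip magic bytes (1f 8b).
is_gzipped() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Hypothetical path; replace with your own reference genome.
ref="reference.fa"

if [ ! -f "$ref" ]; then
    echo "reference not found: $ref"
elif is_gzipped "$ref"; then
    echo "$ref looks gzip-compressed; decompress it first (e.g. with gunzip)"
else
    echo "$ref does not look gzip-compressed"
fi
```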
Deploying the workflow
The pipeline can be deployed in two ways: (1) using Snakedeploy, which deploys the pipeline as a module (recommended); or (2) cloning the repository at the version/branch you prefer (recommended if you will change any workflow code).
Both methods require a Snakemake environment to run the pipeline in.
Preparing the environment
First, create an environment for Snakemake, including Snakedeploy if you intend to deploy that way:
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
If you already have a Snakemake environment, you can use that, so long as it has snakemake (not just snakemake-minimal) installed. Snakemake versions >=7.25 are likely to work, but most testing is on 7.32.4. The pipeline is compatible with Snakemake v8, but you may need to install additional plugins for cluster execution due to the new executor plugin system. See the Snakemake docs for which additional executor plugin you might need to enable cluster execution for your system.
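If you are reusing an existing environment, a small check like the one below can compare the installed Snakemake version against the guideline above. This is a sketch: the helper `version_ge` is illustrative, and it assumes `sort -V` (version sort, as in GNU coreutils) is available.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v snakemake >/dev/null 2>&1; then
    installed="$(snakemake --version)"
    if version_ge "$installed" 7.25; then
        echo "Snakemake $installed meets the >=7.25 guideline"
    else
        echo "Snakemake $installed is older than 7.25; consider updating"
    fi
    # From v8 onward, cluster execution goes through executor plugins.
    if version_ge "$installed" 8.0; then
        echo "Snakemake v8+: an executor plugin may be needed for cluster execution"
    fi
else
    echo "snakemake not found; activate or create the environment first"
fi
```

Which plugin you need depends on your scheduler. On a SLURM cluster, for instance, it would likely be snakemake-executor-plugin-slurm, but treat that as an assumption about your system and confirm it against the Snakemake plugin documentation.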
Activate the Snakemake environment:
conda activate snakemake
Deploying with Snakedeploy
Make your working directory:
mkdir -p /path/to/work-dir
cd /path/to/work-dir
And deploy the workflow, using the tag for the version you want to deploy:
snakedeploy deploy-workflow https://github.com/zjnolen/PopGLen . --tag v0.2.0
This will generate a simple Snakefile in a workflow folder that loads the pipeline as a module. It will also download the template config.yaml, samples.tsv, and units.tsv in the config folder.
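To confirm the deployment produced the files described above, you could run a check like this from the working directory. It is only a sketch: the function name `check_deployed` is illustrative, and the paths are inferred from the description above (a Snakefile in the workflow folder and three templates in the config folder).

```shell
# Check that the files described above exist in the current working directory.
check_deployed() {
    missing=0
    for f in workflow/Snakefile config/config.yaml config/samples.tsv config/units.tsv; do
        if [ -e "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Non-zero exit status signals that at least one file is absent.
    return "$missing"
}

check_deployed || echo "Some expected files are missing; re-check the deploy step."
```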
Cloning from GitHub
Go to the folder you would like your working directory to be created in and clone the GitHub repo:
git clone https://github.com/zjnolen/PopGLen.git
If you would like, you can change the name of the directory:
mv PopGLen work-dir-name
Move into the working directory (PopGLen, or work-dir-name if you changed it) and check out the version you would like to use:
git checkout v0.2.0
The same command can also be used to check out specific branches or commits.
Configuring the workflow
Now you are ready to configure the workflow. See the configuration documentation for details.