Tutorial
This tutorial aims to familiarize users with the general usage of PopGLen by walking through a small(ish) dataset. You can work through it to get comfortable running PopGLen on your server or machine, or use it as a template to follow along with your own data, adjusting settings where appropriate.
The tutorial dataset is a subset of the data used in this manuscript and contains sequencing data for 20 Cyaniris semiargus butterfly individuals from southern Sweden. These samples all come from the province of Skåne, a region where grassland cover has considerably declined over the past century as agriculture has intensified. In this tutorial, 10 individuals come from historical samples from the entomological collections of the Biological Museum, Lund University, half from a locality in western Skåne and the other half from a locality in eastern Skåne. The other 10 individuals come from modern samples collected in nearby localities in 2021, which will allow us to compare genomic metrics over time. This is one of the main use cases for PopGLen - a dataset with both modern and historical samples - but most of the functionality will be the same if your dataset has only modern or only historical samples.
I have also included a report generated from the full dataset, which can be useful for getting a better idea of what the results look like with real data. [Full report]
How much of the original data is in the tutorial dataset?
To speed up processing times for the purpose of the tutorial, I only included ~10Mb across four chromosomes of the Cyaniris semiargus reference and enough reads to get 2-5X sequencing depth across that region. This is enough to get close to the results from the full genome, but some analyses will be more impacted than others. I will mention when this is the case, and include the results using the full data alongside the tutorial data for comparison.
Prepping the workflow
To run PopGLen, you will need access to Conda/Mamba and Apptainer/Singularity, as these will be used to install and run the software utilized in the pipeline, including Snakemake. Due to the Apptainer/Singularity requirement, the pipeline will only run on Linux machines. I recommend performing the tutorial on the machine you are likely to run your dataset on, rather than testing it on your local machine.
Once you have installed Conda/Mamba and Apptainer/Singularity on your machine, you're ready to start setting up the environment for PopGLen to use. We will create a simple conda environment with Snakemake and Snakedeploy, the only dependencies needed:
conda create -n popglen \
-c conda-forge -c bioconda \
snakemake=8.25.0 snakedeploy
Once run, this will install Snakemake and Snakedeploy to the environment 'popglen', which we will activate to download and run the pipeline:
conda activate popglen
Now, create a working directory for your project and deploy the workflow inside it:
snakedeploy deploy-workflow https://github.com/zjnolen/PopGLen . --tag v0.4.1
This will download a `workflow` and a `config` folder, containing the workflow files and a configuration template, respectively. Now we are ready to set up our dataset.
Prepping the dataset
If you are following along with the tutorial data, please download it from Figshare here. You can download it in the command line using the following command:
wget -O popglen-tut-data.zip https://figshare.com/ndownloader/articles/27453978/versions/1
This dataset contains three folders: `data`, `config`, and `resources`. These contain the raw FASTQ reads, configuration files, and subset reference genome, respectively. Move these three folders into the base of your working directory, where we deployed the workflow. You'll be replacing the contents of the `config` folder that was deployed with the workflow with the contents of the `config` folder in the tutorial data.
Example set up code for tutorial
Here is an example of how the tutorial can be deployed, with explicit commands.
# First, create a working dir
mkdir popglen-tutorial
# Change into it
cd popglen-tutorial
# deploy the workflow in here
snakedeploy deploy-workflow https://github.com/zjnolen/PopGLen . --tag v0.4.1
# download the tutorial data
wget -O popglen-tut-data.zip https://figshare.com/ndownloader/articles/27453978/versions/1
# unzip the tutorial data
unzip popglen-tut-data.zip
# move the contents of the tutorial data folder into the working dir
mv popglen-tutorial-data/data popglen-tutorial-data/resources ./
mv popglen-tutorial-data/config/* config/
# remove the zip file and unzipped folder
rm -r popglen-tutorial-data popglen-tut-data.zip
This should get you set up as expected.
Let's take a look at the configuration files in the tutorial data. First, our sample list:
sample population time depth
MZLU153246 ESkane1956 historical low
MZLU153247 ESkane1956 historical low
MZLU153248 ESkane1956 historical low
MZLU153251 ESkane1956 historical low
MZLU153250 ESkane1956 historical low
MZLU153204 WSkane1951 historical low
MZLU153205 WSkane1951 historical low
MZLU153206 WSkane1951 historical low
MZLU153216 WSkane1951 historical low
MZLU153221 WSkane1951 historical low
MZLU107458 ESkane2021 modern high
MZLU107463 ESkane2021 modern high
MZLU107503 ESkane2021 modern high
MZLU107504 ESkane2021 modern high
MZLU107507 ESkane2021 modern high
MZLU107457 WSkane2021 modern high
MZLU107460 WSkane2021 modern high
MZLU107497 WSkane2021 modern high
MZLU107498 WSkane2021 modern high
MZLU107500 WSkane2021 modern high
Here, we list all the samples we will use, the population identifiers we will group them by, the time period they come from, and the sequencing depth category we will assign them to. There is one line per sample, and there can be as many or as few population groups as you'd like.
Time must be 'modern' or 'historical' and determines how samples are handled at the steps where modern and historical samples are processed differently. If all your samples are modern, you can simply set them all to 'modern'.
Depth groups samples into depth categories so that depth filtering can be done per category. This helps ensure that, if your samples differ considerably in depth, your depth filters won't act exclusively on the upper tail of the distribution (driven by high coverage samples) and the lower tail (driven by low coverage samples). You can have as many or as few depth categories as you want, with whatever names you would like, but they should not use names that also appear in other columns (i.e. use 'low' rather than 'historical', or 'pop1depth' rather than reusing 'pop1' from the population column).
Now, for the units list, which points to our data files:
sample unit lib platform fq1 fq2
MZLU107457 AHW5NGDSX2.3 UDP0046 ILLUMINA data/MZLU107457.R1.sub.fastq.gz data/MZLU107457.R2.sub.fastq.gz
MZLU107458 AHW5NGDSX2.3 UDP0047 ILLUMINA data/MZLU107458.R1.sub.fastq.gz data/MZLU107458.R2.sub.fastq.gz
MZLU107460 AHW5NGDSX2.3 UDP0049 ILLUMINA data/MZLU107460.R1.sub.fastq.gz data/MZLU107460.R2.sub.fastq.gz
MZLU107463 AHW5NGDSX2.3 UDP0052 ILLUMINA data/MZLU107463.R1.sub.fastq.gz data/MZLU107463.R2.sub.fastq.gz
MZLU107497 AHW5NGDSX2.3 UDP0084 ILLUMINA data/MZLU107497.R1.sub.fastq.gz data/MZLU107497.R2.sub.fastq.gz
MZLU107498 AHW5NGDSX2.3 UDP0085 ILLUMINA data/MZLU107498.R1.sub.fastq.gz data/MZLU107498.R2.sub.fastq.gz
MZLU107500 AHW5NGDSX2.3 UDP0087 ILLUMINA data/MZLU107500.R1.sub.fastq.gz data/MZLU107500.R2.sub.fastq.gz
MZLU107503 AHW5NGDSX2.3 UDP0090 ILLUMINA data/MZLU107503.R1.sub.fastq.gz data/MZLU107503.R2.sub.fastq.gz
MZLU107504 AHW5NGDSX2.3 UDP0091 ILLUMINA data/MZLU107504.R1.sub.fastq.gz data/MZLU107504.R2.sub.fastq.gz
MZLU107507 AHW5NGDSX2.3 UDP0094 ILLUMINA data/MZLU107507.R1.sub.fastq.gz data/MZLU107507.R2.sub.fastq.gz
MZLU153221 BHVN22DSX2.2 208188 ILLUMINA data/MZLU153221.R1.sub.fastq.gz data/MZLU153221.R2.sub.fastq.gz
MZLU153206 BHVN22DSX2.2 207139 ILLUMINA data/MZLU153206.R1.sub.fastq.gz data/MZLU153206.R2.sub.fastq.gz
MZLU153204 BHVN22DSX2.2 003091 ILLUMINA data/MZLU153204.R1.sub.fastq.gz data/MZLU153204.R2.sub.fastq.gz
MZLU153205 BHVN22DSX2.2 008096 ILLUMINA data/MZLU153205.R1.sub.fastq.gz data/MZLU153205.R2.sub.fastq.gz
MZLU153248 BHVN22DSX2.2 021130 ILLUMINA data/MZLU153248.R1.sub.fastq.gz data/MZLU153248.R2.sub.fastq.gz
MZLU153216 BHVN22DSX2.2 093148 ILLUMINA data/MZLU153216.R1.sub.fastq.gz data/MZLU153216.R2.sub.fastq.gz
MZLU153251 BHVN22DSX2.2 028116 ILLUMINA data/MZLU153251.R1.sub.fastq.gz data/MZLU153251.R2.sub.fastq.gz
MZLU153250 BHVN22DSX2.2 036137 ILLUMINA data/MZLU153250.R1.sub.fastq.gz data/MZLU153250.R2.sub.fastq.gz
MZLU153247 BHVN22DSX2.2 043143 ILLUMINA data/MZLU153247.R1.sub.fastq.gz data/MZLU153247.R2.sub.fastq.gz
MZLU153246 BHVN22DSX2.2 052142 ILLUMINA data/MZLU153246.R1.sub.fastq.gz data/MZLU153246.R2.sub.fastq.gz
Here, we point to the data for each sample, and assign sequencing metadata. We are using FASTQ inputs for this tutorial, so we will fill out (1) the 'unit', an identifier for the sequencing lane (here it is the run barcode and lane number), (2) the 'lib', an identifier for the library batch the reads come from, (3) the 'platform', the sequencing platform they were run on, and (4) 'fq1' and 'fq2', the paths to the forward and reverse read fastq files (either absolute or relative to the working directory). You can have multiple lines per sample if you have multiple combinations of sequencing runs and/or libraries per sample, and these will be merged at the appropriate steps by the workflow. You can also use SRA accession numbers to download the FASTQ data if it is not local by replacing the 'fq1' and 'fq2' columns with an 'sra' column and filling it out with the accession number. If we were starting with BAM files, we could omit all the columns except 'sample', and add one column 'bam' that points to the BAM file.
Now, we are ready for the configuration.
Initial configuration
The `config/config.yaml` file included with the tutorial dataset has the basic configuration filled out. It contains all the available options for the workflow, with some default parameters and most analyses turned off. There is an additional file, `config/config_full.yaml`, that enables all the tutorial analyses, allowing the entire tutorial to be run in one Snakemake run. For details on all the settings in the config file, please see the configuration documentation.
No matter what analyses are to be run, the first two sections of the config are required:
(excerpt from `config/config.yaml`, lines 1-21)
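As a rough sketch, based on the settings described below, the first sections of `config/config.yaml` might look something like this. The file paths are assumed examples, and the exact layout may differ in your deployed template; the dataset name, reference name, and scaffold settings follow values mentioned elsewhere in this tutorial:

```yaml
# Hedged sketch of the first config sections; paths are assumed examples.
samples: config/samples.tsv        # sample list file
units: config/units.tsv            # unit list file
dataset: "popglen-tutorial"        # name for this dataset run

chunk_size: 30000000               # larger than the biggest contig (tutorial genome is ~10Mb)

reference:
  name: "tutorial-ref"             # included in BAM and other output file names
  fasta: "resources/tutorial-ref.fa"  # uncompressed reference FASTA (assumed path)
  mito: ["mitochondrion"]          # mitochondrial scaffolds to exclude
  sex-linked: ["Z"]                # sex-linked scaffolds to exclude
  exclude: []                      # nothing extra to exclude this time
  min_size: 1000000                # drop scaffolds smaller than 1Mb

ancestral:                         # no ancestral FASTA, so spectra will be folded
```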
Here, we set a few important settings:
- `samples` points to the sample list file.
- `units` points to the unit list file.
- `dataset` gives a name for the dataset run (multiple datasets can be run in the same working dir and reuse generated BAMs and reference resources when possible).
- `chunk_size` sets a size for the genome to be chunked into. This allows parallelizing over genomic regions where possible, without running into issues when genomes have a high number of contigs. It should be a value equal to or greater than the largest contig length in your genome. If you have a chromosome-level genome, the size of the biggest chromosome is probably best. If you have something less contiguous, it depends on how many jobs you expect to run in parallel. I have mostly worked with ~500Mb genomes, usually setting this to 30Mb. For the tutorial, we will actually set it to a value larger than the tutorial genome, as it is already rather small, ~10Mb.
- `reference/name` is a name for the reference genome you're mapping/mapped to. It will be included in the output file names of the BAM alignments and many other files.
- `reference/fasta` is the path to the reference genome FASTA. It must be locally available and should not be compressed. (Open an issue if using a compressed reference FASTA is something you'd much prefer and I will look into making it possible.)
- `reference/mito` is a list of any mitochondrial scaffolds. They will be excluded from analyses (future versions may include some mitochondrial analyses). In the tutorial, we drop the sequence called 'mitochondrion'.
- `reference/sex-linked` is a list of any sex-linked scaffolds. They will be excluded from analyses (future versions may include ways to analyze these). In the tutorial, we drop the 'Z' chromosome.
- `reference/exclude` is a list of scaffolds to exclude that aren't in any other category of this section. We won't put anything here this time.
- `reference/min_size` is the minimum size required to include a scaffold. If you have many small scaffolds, you can set a threshold to drop them. We only have one small scaffold in the tutorial reference, the mitochondrion, and since it is smaller than the 1Mb we set here, it would be dropped by this setting even if it weren't listed in the `reference/mito` section.
- `ancestral` is a FASTA file containing ancestral states for your reference genome. Several analyses can utilize ancestral states to polarize the site frequency spectrum, providing derived/ancestral allele counts rather than major/minor. If you have one, you can put it here, but know that analyses that use it will be limited to the sites with data in this sequence. We don't have one, so we won't use it and will 'fold' our site frequency spectra to account for this. The main outputs of the workflow are largely suitable with folded spectra.
Running the pipeline
With these basic configurations complete, the pipeline is ready to run.
Snakemake allows us to do a dry run with the option -n
:
snakemake \
--use-conda --use-singularity \
--default-resources "mem_mb=(threads*1700)" \
--cores <threads> \
-n
As this pipeline uses conda environments and singularity containers, `--use-conda` and `--use-singularity` are always required. `--default-resources` is needed for some rules that require knowledge of memory usage (primarily Java tools), where we give 1700MB of memory per thread used in each rule. `--cores` is needed for local runs and should be set to the number of cores you'd like to give to PopGLen for a local run.
Running with a job scheduler
For this tutorial, I have shown the command for a local run. This is good if you are running on a machine without a job scheduler or if you want to submit the workflow to run inside a single large job on your cluster. The latter is probably the best way to run the tutorial, but for large datasets, it is likely most efficient to use Snakemake's executor plugins to have it submit each job in the workflow as its own job in your scheduler, maximizing the parallelization potential across multiple nodes. While the best place for how to set this up is the Snakemake documentation, I have also made a brief doc on this here.
When we do a dry run, we should see that with the default options, Snakemake plans to do only the raw sequence data processing, i.e. adapter trimming, collapsing of historical read pairs, mapping to the reference, duplicate removal, indel realignment, clipping of overlapping reads for modern samples, and indexing of the final processed BAMs.
Quick start vs. Walkthrough
I've written the rest of this tutorial with two approaches in mind: a 'quick start' and a 'walkthrough'.
The quick start approach starts with a configuration with all analyses enabled. This means Snakemake will run the entire tutorial in a single run; you can then examine the outputs to see what can be done using this pipeline. I recommend this if you are familiar with most of the analyses and with Snakemake, and can use the configuration documentation yourself to adjust settings for your data.
The walkthrough approach starts with a configuration with most analyses disabled and walks through enabling analyses and running the pipeline piece by piece. This may help some who are less familiar with the tools to understand how the analyses fit together.
Regardless of the approach you use for the tutorial, realize that the settings in the tutorial may not work as well for other datasets, so look into them and update them for your needs when using your own data. Please feel free to reach out in the issues tracker with questions or to suggest ways the pipeline may be improved to suit a greater breadth of datasets.
Option 1: Quick Start
For this, we will perform a whole run using `config/config_full.yaml`. Your Snakemake command should look something like this for a local run:
snakemake \
--configfile config/config_full.yaml \
--use-conda --use-singularity \
--default-resources "mem_mb=threads*1700" \
--cores <threads>
and should be similar, but with cluster/executor configuration options, if using a system with a job scheduler like SLURM. Note the `--configfile` option, which overrides the automatic config path (`config/config.yaml`) with the one we specify here. Feel free to do a dry run with `-n` first to see what the pipeline plans to run.
How long does the tutorial take to run?
This will depend on the system it is run on, as both the number of available CPUs and their performance will impact the runtime. I tested it in two environments: (1) in an Ubuntu VM on a 2020 MacBook Pro with 8 cores and (2) inside a single job on PDC's Dardel cluster with 64 cores (half a node there). On the MacBook it took ~4 hours to complete the whole tutorial, while on Dardel it completed in under half an hour. The difference is largely from the number of cores available to perform analyses in parallel, but also because the individual cores on the MacBook are a bit weaker. Note: I didn't include downloading the Singularity containers and setting up the conda environments in this estimate.
First, Snakemake solves the directed acyclic graph of jobs - the workflow required to get from the input files present on the system to the output files requested by our configuration. Once solved, the required containers and conda environments are set up and the workflow will begin, with outputs from each step flowing into the inputs of the next. Upon completion, we can make a report:
snakemake \
--configfile config/config_full.yaml \
--use-conda --use-singularity \
--report tutorial_report.html
This report contains the main results of all the major analyses, including both tables and plots. The actual results can be found in the 'results' folder, which contains the folders 'datasets', 'mapping', 'preprocessing', and 'ref'. These hold, respectively, outputs from the population genomic analyses for each dataset (each within a subfolder named after the dataset; for the tutorial there is just one, which we named `popglen-tutorial` in the config), outputs from the mapping process, outputs from the trimming process, and outputs specific to the reference (indices, repeat libraries, etc.).
Option 2: Walkthrough
It can be useful to run analyses in batches, assess their outputs, and use these
to inform set up for subsequent analyses. For this walkthrough, we will
gradually change config/config.yaml
to perform analyses, pointing out
relevant settings in the file as we go. With the first sections matching the
versions shown in Initial Configuration, we are ready
to start the walkthrough. In each step, a report will be made with the results.
These reports are included in this tutorial, so you can just click through them
if you'd rather not run the tutorial on your end.
A note on temporary files
I have flagged several intermediate files in this workflow as temporary so
that they will be cleaned up when they are no longer needed. While this is
great when you're all done with them, if you are enabling pieces of the
workflow at a time as we are in this tutorial, it can lead to intermediate
files disappearing after some steps and needing to be re-generated when we
enable other analyses that use them later down the line. While I've tried to
avoid making most 'keystone' intermediate files (Beagles, SAFs, etc.)
temporary, it can be good to use the --notemp
option in snakemake
commands, which will keep any files marked as temporary. Then, when you know
you're done, you can run it again without this option to clean up the
intermediate files.
Raw sequence data processing and quality control
I already have BAMs for my data, do I need this part?
If you already have BAMs this part may still be relevant to you, as it also
will run some quality control analyses on the BAM files. I recommend quickly
reading through it to see the QC options. To make the tutorial more similar
to what processing your data would be like, I've included the BAM files in
the tutorial dataset, so you can start from there. Just change the
units.tsv
file to look like this:
sample bam
MZLU107457 data/bams/MZLU107457.tutorial-ref.bam
MZLU107458 data/bams/MZLU107458.tutorial-ref.bam
MZLU107460 data/bams/MZLU107460.tutorial-ref.bam
MZLU107463 data/bams/MZLU107463.tutorial-ref.bam
MZLU107497 data/bams/MZLU107497.tutorial-ref.bam
MZLU107498 data/bams/MZLU107498.tutorial-ref.bam
MZLU107500 data/bams/MZLU107500.tutorial-ref.bam
MZLU107503 data/bams/MZLU107503.tutorial-ref.bam
MZLU107504 data/bams/MZLU107504.tutorial-ref.bam
MZLU107507 data/bams/MZLU107507.tutorial-ref.bam
MZLU153221 data/bams/MZLU153221.tutorial-ref.bam
MZLU153206 data/bams/MZLU153206.tutorial-ref.bam
MZLU153204 data/bams/MZLU153204.tutorial-ref.bam
MZLU153205 data/bams/MZLU153205.tutorial-ref.bam
MZLU153248 data/bams/MZLU153248.tutorial-ref.bam
MZLU153216 data/bams/MZLU153216.tutorial-ref.bam
MZLU153251 data/bams/MZLU153251.tutorial-ref.bam
MZLU153250 data/bams/MZLU153250.tutorial-ref.bam
MZLU153247 data/bams/MZLU153247.tutorial-ref.bam
MZLU153246 data/bams/MZLU153246.tutorial-ref.bam
or, when you run the Snakemake command, you can override the units file configuration by adding `--config units="config/units_bams.tsv"`.
If you use this units list and run this section of the tutorial, no mapping will be done, only the QC, so it'll be a lot faster to run this section.
The first stage of the pipeline will be to map the raw reads and perform a few quality control analyses. The mapping pipeline roughly looks like this for each sample type:
Modern:
- Quality and adapter trim paired end reads using defaults with fastp
- Map paired end reads for each sample with bwa mem. Do this individually for separate sequencing runs and libraries, then merge with samtools.
- Remove duplicates with Picard MarkDuplicates.
- Realign around insertions/deletions with GATK IndelRealigner.
- Clip overlapping read pairs to avoid double counting reads in ANGSD using bamutil.
Historical:
- Collapse overlapping paired-end reads into single reads with fastp.
- Map collapsed reads to reference with bwa aln.
- Remove duplicates per library with dedup and merge libraries.
- Realign around insertions/deletions with GATK IndelRealigner.
- (Optional, we will not do this for the tutorial) Rescale base quality scores to account for DNA damage with mapdamage2.
Use the code blocks below to update the appropriate lines in the
config/config.yaml
file to set this run up. Since mapping is always enabled in
PopGLen if starting from FASTQ, we only have to check some settings for the
mapping and turn on some QC analyses:
(excerpt from `config/config.yaml`, lines 33-37)
(excerpt from `config/config.yaml`, lines 47-51)
The first block contains the settings for historical sample mapping and specifies that we should only map reads with some overlap (i.e. fragments <270bp) and that we should use bwa aln to align historical reads rather than bwa mem. We will leave these as they are for the tutorial, but you can change them if needed for your data.
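As a hedged sketch only, these two settings might be expressed roughly as below; the key names here are hypothetical placeholders based on the description above, so check the deployed config template for the actual option names:

```yaml
# Hypothetical key names for illustration only
only_collapsed_historical: true      # map only collapsed (overlapping, <270 bp) read pairs for historical samples
historical_collapsed_aligner: "aln"  # use bwa aln rather than bwa mem for historical reads
```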
The next block enables some quality control analyses:
- `qualimap` - a detailed report of mapping statistics
- `ibs_ref_bias` - calculates the average identity-by-state distance to the reference for each sample using a custom script around ANGSD's -doIBS option
- `damageprofiler` - estimates DNA damage on historical samples
- `mapdamage_rescale` - rescales base quality scores for potentially damaged bases in historical samples (adds a step to BAM generation)
We will turn on the first three options to get a good batch of quality control options for our data. While we could also use MapDamage to rescale the base quality scores of putatively damaged reads, we will handle DNA damage in a different way down the line. In this dataset, we find MapDamage correction can cause a substantial reference bias in historical samples. For this reason, it can be useful to check the results of `ibs_ref_bias` with and without rescaling enabled to determine the effect on your samples. Estimates of mean sequencing depth and mapping rates are always enabled in PopGLen, so they don't need to be turned on.
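As a rough guide, the QC toggles we end up with might look something like the sketch below. The four option names come from the list above, while the `analyses:` nesting is an assumption:

```yaml
analyses:                   # assumed parent key
  qualimap: true            # detailed mapping statistics
  ibs_ref_bias: true        # IBS distance to the reference per sample
  damageprofiler: true      # DNA damage estimates for historical samples
  mapdamage_rescale: false  # we handle damage another way later, so leave off
```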
Additional sequence processing settings
There are several settings available at the end of the configuration file for each of the tools in the PopGLen workflow. For the tutorial, these are already set to suitable defaults (and these should be suitable for most datasets using sequencing data generated by patterned flow cells), but you may want to look into them in the configuration docs to adapt them to your dataset.
(excerpt from `config/config.yaml`, lines 118-129)
Now we are ready to run our processing. It's best to do a dry run first to make sure everything looks good. This run will likely take the longest of all the runs we will perform, as it involves all the mapping steps and will do a lot of the initial software installs.
Once the run is complete, the easiest way to look at the results is to generate a report:
snakemake \
--configfile config/config_full.yaml \
--use-conda --use-singularity \
--report tutorial_report.html
Here is an example for the tutorial report you should get after running these steps.
You'll have quite a few QC outputs:
- fastp trimming reports per sample and combined in a multiqc file
- A filtering summary describing how much of the genome will be used in downstream analyses (for now, only removing small contigs and non-autosomal scaffolds)
- A multiqc DNA damage report for the historical samples from DamageProfiler
- A multiqc report of the Qualimap reports for all samples
- A table of endogenous content (mapping rate) and sequencing depth per sample
- Boxplots and tables of the identity by state distance to the reference for all samples in various groupings.
Explore the report outputs, but remember this is a tutorial dataset, so it may not be a great representation of 'good' data. As the first four are basically reports from their respective tools, we will only look at the last two.
First, let's open the section 'Sample depth and endogenous content'. This is a searchable, sortable, adjustable table containing sample names and meta-data, the mapping rate for collapsed reads, uncollapsed reads, and all reads, as well as the mean and standard deviation of sequencing depth calculated three ways. Notice that modern samples only have uncollapsed mapping rates and historical only have collapsed because of the different processing paths of the sample types.
The three depth metrics are calculated with no filtering (`allsites-unfilt.depth`), with only mapping and base quality filtering (`allsites-mapq30-baseq30-filt.depth`), and with the mapping and base quality filters plus the sites filters (`allsites-filts.depth`). We'll get more into these filters in the next section. The main thing to see here is that we can get different depths depending on how we calculate them. I tend to care about the last one the most, as this is the sequencing depth at the sites you will use in your analyses.
Next, let's look at the IBS results. This will help us see if the short fragment length of our historical samples has led to reference bias. For this, we will open the plot in the '7 Sample IBS to reference (ref bias)' section, looking at the one with 'allsites-filts' for the filtering and 'Time' for the grouping:
It would appear not; in fact, the historical samples are further from the reference. This is expected for two reasons: (1) we haven't corrected for DNA damage, so some excess differences are expected due to damage, and (2) the historical samples are more distant in time from the reference than the modern samples, so we will observe the genetic drift over that time as differences. This second point will likely depend on the sampling of your reference. Either way, we do not see evidence of reference bias due to the historical samples' short fragment size.
Is this useful if I only have modern samples?
Yes, this can help you determine if there are potential sources of reference bias by population (see the other grouping boxplots), or if some samples seem especially strange, potentially suggesting that there is something off in their underlying data. For instance, a sample with considerably lower IBS than the rest may indicate a misidentified species or a contaminated sample.
Sites filters
When doing analyses, only reads passing the mapping quality threshold and bases passing the base quality threshold in the config file will be used. These thresholds are set here:
(excerpt from `config/config.yaml`, lines 115-116)
By default, I've set mapping quality to 30 (representing a 1 in 1000 error rate) and base quality to 20 (representing a 1 in 100 error rate), which are common cutoffs, but they can be adjusted based on what you feel suits your data. If you want to investigate the distributions of these scores in your data, consider checking the Qualimap reports (I don't believe this is in the MultiQC in the Snakemake report, but it can be found in the individual sample reports in `results/mapping/qc/qualimap` if you've aligned your data with PopGLen or `results/datasets/{dataset}/qc/user-provided-bams/qualimap/` if you've brought your own BAMs).
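For reference, the two thresholds described above might be expressed roughly as below; the key names are assumptions, and only the values 30 and 20 come from the text:

```yaml
mapQ: 30   # minimum mapping quality (assumed key name)
baseQ: 20  # minimum base quality (assumed key name)
```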
An important component of this workflow is the site based filtering. To do this, we limit our analyses to a list of 'good sites' determined by a set of filters. This filtering scheme was inspired by similar sites based filtering used in studies using ANGSD (e.g. Pečnerová et al. 2021, Curr. Biol. and Quinn et al. 2023, Mol. Ecol.). With everything disabled as we have it now, you should already have a sites file summarized in the report (under '3 Filtering Summary'), but it will only describe the removal of the contigs we specified in the beginning of the config file:
So, how else can we filter? We have a few options that can be enabled:
(excerpt from `config/config.yaml`, lines 38-46)
The above section of the config file handles enabling of the main filter sets. For this tutorial, we will turn on two as shown above:
- `pileup-mappability` uses GenMap to calculate mappability scores for each base, and we use a custom script to calculate the average mappability of each site across all overlapping fragments. This helps us remove sites that are difficult to map to due to being part of sequences that are frequently repeated.
- `extreme_depth` removes sites that have extremely high or low global sequencing depth across your dataset and/or the depth categories in your sample list.
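A hedged sketch of how these two filters could be enabled is shown below; the option names `pileup-mappability` and `extreme_depth` come from the list above, while the nesting and the comment on the other filters are assumptions:

```yaml
filters:                    # assumed parent key
  pileup-mappability: true  # GenMap-based mappability filter
  extreme_depth: true       # remove sites with extreme global depth
  # repeat-based and missing-data filters also exist, but we leave them off here
```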
In addition to these two filters, there are three others. First, we could filter out repeats, with three built-in ways to do this. One is to provide a BED file with the coordinates of repeats to be filtered out. The second is to use a local repeat library FASTA file to identify repeats with RepeatMasker. The third is to use RepeatModeler to build a repeat library de novo from the reference, then use RepeatMasker with that library to identify the repeat regions.
Repeat filter vs mappability filter
Generally, the repeat filter and the pileup mappability filter have a lot of
overlap, so I only enable one usually. In this tutorial, we use the pileup
mappability filter as it takes less time to run, but for your data you may
wish to use the repeat based filters. For these, you should use the option
highest on the list that you have the resources to use. If you have a BED of
repeats, use that as it will take no time at all. If you only have a repeat
library, then use the local_lib
option. If you don't have anything, then
you should build your own, but this often can take days. See the
configuration docs for how to set these filters.
The remaining two options allow you to filter out sites based on data missingness, removing sites with no usable reads in too many samples in the dataset or in any population. I do not recommend using these unless you know you need them (i.e. you require identical sites to be used across all analyses). Instead, it is better to set the missing data filter in ANGSD as is standard practice (and available later in configuration).
Alternate ways to filter extreme depth
We aren't diving into configuring the extreme depth filter, leaving it set to keep only positions between 0.5x and 1.5x the global median depth. You can customize these multipliers or set percentiles as filters instead. See the configuration docs for more details on this.
All right, with that set up, let's run the workflow again. You'll see some rules are being rerun if you do a dry run. This is because with our new filters, calculations made on filtered sites must be redone. Snakemake is good about recognizing these changes, and I have tried to ensure that this pipeline is compatible with all of them, but please open an issue if you find they are not triggering when expected.
If you make the report again (here is an example for this point in the tutorial), you'll find that the table under '3 Filtering Summary' should look more like this now:
As you can see, we've added a few lines, pileup mappability, and various depth filters. These are all calculated independently across the genome, and you can see how much of the genome passes each filter. By far mappability cuts out the most, and that's because we have a good bit of the genome being repetitive content that we can't reliably map to. Ultimately, we are using ~22% of the 'genome' for our analyses, but we can be quite confident in our ability to genotype that 22%.
You might also want to take a look at the '6 Sample depth and endogenous content' section again. Before, the mean depths for `allsites-mapq30-baseq30-filt.depth` and `allsites-filts.depth` were quite similar. This is because the first applies only the mapping and base quality filters and the second applies these plus the sites filter. Before this step, our sites filter barely did anything, but now it removes quite a bit. Now there is a slightly larger difference between the two measures, and the depth at our filtered sites is on average higher than we get with just the mapping and base quality filters. There's also a lower standard deviation, suggesting that we have successfully removed repeats that can have highly variable mapping depths.
Can I use my own filters?
If you have extra filters you want to add, you can provide them to the pipeline in BED or GFF format. These files should contain the sites you want included. These are added in the filter sets section of the config:
(excerpt from `config/config.yaml`, lines 106-111)
In this example, we add a BED file containing neutral sites as a filter. This is intersected with any other filters used in PopGLen to create a filter scheme called `neutral-filts`, in addition to the `allsites-filts` one generated by default, and all analyses will be run once for each filter scheme. If you enable `only_filter_beds` and turn off the other filters in PopGLen, you will essentially use only your own filter set. See the configuration docs for more details on these settings.
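A sketch of what such a user-supplied filter set might look like is below. The `only_filter_beds` option is named above, the BED path is a placeholder, and the exact key names and nesting are assumptions; the idea follows the description above, with the key under which the BED is listed giving the filter scheme its name (e.g. `neutral` producing `neutral-filts`):

```yaml
filter_beds:                              # assumed key for user-supplied filter sets
  neutral: "resources/neutral_sites.bed"  # placeholder path; creates the 'neutral-filts' scheme
only_filter_beds: false                   # set true to use only your own filter sets
```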
Population genomic analyses
Next, we will enable some population genomic analyses. We will do this slowly, enabling a few analyses at a time and looking at results before moving to the next.
Note
I often enable many analyses at once, or group together different analyses. See the section at the end of this tutorial for my recommendations on reasonable batches to run your analyses in that should get most of your results generated within a couple runs of PopGLen.
ANGSD Settings
As most of these analyses are run with ANGSD or from genotype likelihoods generated in ANGSD, we will first take a look at the settings that are going to be passed to ANGSD:
(excerpt from `config/config.yaml`, lines 141-155)
Users familiar with ANGSD will likely be able to determine what these options entail:
- `gl_model` corresponds to `-GL` and sets which model to use to calculate genotype likelihoods. SAMtools and GATK are most popular; I mostly use GATK, though see ANGSD's docs for a comparison.
- `maxdepth` corresponds to `-maxDepth` in `-doDepth` calls and sets the highest depth histogram bin for single sample depth estimates. The same will be set to this value times the number of samples for global depth calculation. (ANGSD docs)
- `mindepthind` corresponds to `-setMinDepthInd` and is applied at all times, setting the minimum depth to include a site for an individual (1 is the ANGSD default, ANGSD docs).
- `minind_dataset` is a proportion that will be multiplied by the number of samples in the dataset and fed to `-minInd` for analyses that include the whole dataset (i.e. PCA, Admix, IBS matrix). Sites with data for fewer than this proportion of individuals are excluded from these analyses (docs).
- `minind_pop` is a proportion that will be multiplied by the number of samples in a population and fed to `-minInd` for analyses that include only that population (i.e. thetas, SFS, FST, runs of homozygosity). Sites with data for fewer than this proportion of individuals are excluded from the analysis for that population (docs).
- `rmtrans` determines whether transitions should be excluded as a means to account for DNA damage in historical samples. This is applied across all analyses using the appropriate option (`-rmTrans` for Beagle files and `-noTrans` for SAF files; docs for saf and docs for beagle).
- `extra*` are various extra parameters to pass to ANGSD at all times (`extra`) or for certain analyses (`extra_saf` and `extra_beagle`). This is mostly useful for options like `-trim`, which can help correct for DNA damage, or `-baq` if you want to run BAQ recalibration on your quality scores. Mapping and base quality filters are automatically added and need not be specified. Leaving these blank is equivalent to ANGSD default filtering as seen here. Note that though many studies enable `-uniqueOnly 1`, it is unneeded here, as BWA is used for mapping, which sets multi-mapping reads to a mapping quality of 0.
A note about the `-baq` and `-C` options
I have noticed that the commonly used ANGSD options of `-baq 1 -C 50` (which themselves are originally from SAMtools) create a sequencing depth bias in my results when I have large differences in depth between samples. This has occurred across multiple datasets I've worked with, containing different species. Consider leaving these out, or at least checking that this does not apply to your data.
- `domajorminor` corresponds to `-doMajorMinor` and is used to set how major and minor alleles are inferred for relevant analyses (docs).
- `domaf` corresponds to `-doMaf` and sets how allele frequencies are inferred where relevant (docs).
- `snp_pval` corresponds to `-SNP_pval` and sets the p-value threshold for calling a site as a SNP. 1e-6 is commonly used (docs).
- `min_maf` corresponds to `-minMaf` and determines the minimum minor allele frequency required to include a SNP in analyses (docs).
- `mindepthind_heterozygosity` corresponds to `-setMinDepthInd`, but only when estimating individual heterozygosity. This is useful if you only want to consider sites with a certain depth when estimating heterozygosity, which can improve the reliability of estimates.
Most of these will be kept at their defaults for this tutorial, and the defaults are meant to match the typical settings I've found in studies that utilize ANGSD in the ways PopGLen targets. The only changes we make are setting `rmtrans` to `true` and `min_maf` to `0.05`. The first is to account for post-mortem DNA damage expected in our historical samples. For datasets with no historical samples, or that correct for damage via another method (such as MapDamage or using `-trim`), this can be kept as `false`. We set a minor allele frequency filter to remove low frequency variants from SNP based analyses. Whether you use this filter, and what value you set, will depend on your dataset's needs. Just remember that this only controls the filtering at SNP calling, and downstream tools may filter on minor allele frequency as well (for instance, by default, NGSadmix and PCAngsd set a MAF filter of 0.05).
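Putting the descriptions above together, the ANGSD section might look roughly like this sketch. The option names come from the list above; the values for `rmtrans`, `min_maf`, and `snp_pval` follow the text, the GATK likelihood model is the one mentioned above (`-GL 2` in ANGSD), and the remaining values and nesting are illustrative assumptions:

```yaml
params:
  angsd:                           # assumed nesting
    gl_model: 2                    # -GL 2 (GATK model); 1 would be SAMtools
    maxdepth: 1000                 # -maxDepth for -doDepth histograms (illustrative value)
    mindepthind: 1                 # -setMinDepthInd; 1 is the ANGSD default
    minind_dataset: 0.75           # proportion of all samples required per site (illustrative)
    minind_pop: 0.75               # proportion of samples per population required (illustrative)
    rmtrans: true                  # drop transitions to account for post-mortem damage
    extra: ""                      # extra ANGSD options applied everywhere
    extra_saf: ""                  # extra options for SAF generation
    extra_beagle: ""               # extra options for Beagle generation
    domajorminor: 1                # -doMajorMinor (illustrative value)
    domaf: 1                       # -doMaf (illustrative value)
    snp_pval: "1e-6"               # -SNP_pval threshold for calling SNPs
    min_maf: 0.05                  # -minMaf filter on called SNPs
    mindepthind_heterozygosity: 3  # per-individual depth for heterozygosity (illustrative)
```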
These settings simply get us ready to run analyses, so we've added no new analyses. However, if you'd like, you can rerun now and see that some rules will be rerun. This is because they now will exclude transitions. This can be especially useful to look at in the IBS to reference QC results.
Linkage Disequilibrium
First, we will look at linkage disequilibrium, as it can help inform some settings down the line, such as sliding window size and linkage pruning settings:
(excerpt from `config/config.yaml`, lines 52-54)
Here, we will enable just the LD decay estimation with ngsLD. This plots a decay curve showing how linkage disequilibrium decreases with distance between SNPs. We can additionally enable the first option if we want linkage disequilibrium estimates for a random sample of the genome.
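The relevant toggles might look something like this sketch; the key names are assumptions based on the two options described above:

```yaml
analyses:            # assumed parent key
  estimate_ld: false # LD estimates for a random sample of the genome (the 'first option' above)
  ld_decay: true     # LD decay curves with ngsLD
```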
ngsLD configuration
There are a few parameters for the LD estimation available in the config file that we won't dive into here, and I've set the tutorial up to run quickly rather than to get the most robust estimates. Check the configuration docs for details.
If we run now, genotype likelihoods will be estimated and SNPs called for the dataset and per population and handed off to ngsLD to do the LD analyses. Once completed, we can make another report (here is an example of the report at this stage). Inside you should find plots like below in the '01 Linkage Disequilibrium Decay' section:
It seems to decay pretty fast in the dataset (top), and a bit slower with higher background LD in each population (bottom), which is expected as individuals in a population likely are more closely related on average than individuals across the dataset, assuming the dataset is not completely panmictic. With this knowledge, we might choose a window size of 25kb for our window based analyses and consider SNPs as independent after 25kb or below 0.2 r2 for the dataset and after 25kb or below 0.4 r2 for a population when pruning. We will look at these when we set up the respective analyses.
Principal Component Analysis
Principal Component Analysis allows us to explore population genetic structure. As it assumes independence of sites, we will need to prune our SNP data first. The pipeline will use ngsLD and prune_graph to prune the Beagle file and then estimate a covariance matrix using PCAngsd and plot the results. To set up the run, enable it in your config and make sure the pruning settings are as we want for the dataset:
(excerpt from `config/config.yaml`, lines 52-55)
(excerpt from `config/config.yaml`, lines 159-167)
Now, we can run the pipeline again. Notice that we won't have to make the Beagle file, it has already been made for the LD analysis. Instead, we will go straight to pruning. Once the run is complete, you can make the report (example here), and we will have plots of PC1/2 and PC3/4:
There seems to be some population structure, with the modern Western Skåne population standing out the most. However, there is something strange here: two samples tightly clustered at the edge of PC1. This commonly happens when there are close relatives in the dataset, and we should actually have checked for this first, as genomic PCAs assume there are none. I wanted to illustrate what this might look like, but we can easily remove them using our config file. First, however, we should identify which samples are related.
Relatedness
Two ways to estimate relatedness are currently available in PopGLen: IBSrelate (Waples et al. 2019, Mol. Ecol.) and NgsRelate v2 (Hanghøj et al. 2019, GigaScience). For IBSrelate, both the IBS-based and SFS-based methods implemented in ANGSD are available. The SFS-based method can also be run from NgsRelate using SNP data. Finally, if population samples are large enough to estimate allele frequencies, the allele frequency based method of NgsRelate is also available. You can enable as many of these as you want in the following lines, but for the tutorial we will perform the SFS-based IBSrelate method as implemented in NgsRelate.
(excerpt from `config/config.yaml`, lines 58-62)
This is a relatively fast way to assess relatedness, and we already have the Beagle file ready to do it (currently in parallel processed chunks), so you'll see there's really only merging of the chunks and running ngsRelate to be done.
Let's generate a report (find an example here) and look under '02 Relatedness'. There we will have a table. To find the most related individuals, sort it by the KING-kinship coefficient:
These coefficients are going to be a bit off for some individuals, as we are using such a small slice of the actual genome (<1%). However, we are still able to see two relatively highly related pairs of individuals. The first pair are two individuals from WSkane1951, the ones that are likely skewing our PCA. If we look at the report using the full genome data, we can see that these are full siblings (expected KING = 0.25), so we are going to want to remove one from the PCA and Admix analyses. There's also another pair with a high coefficient from WSkane2021. These actually are not very related (using the full data we see KING = 0.004), but by chance have high IBS similarity in this subset of the genome. As that's the data we are using for our analyses, we will remove one of them too. We can remove related individuals from the PCA and Admix by modifying the following option in the config:
(excerpt from `config/config.yaml`, lines 23-27)
Only one individual needs to be removed from each pair of relatives. Let's re-run with this configuration to see how the PCA changes. Only the PCA will have to be re-run, after the individuals are dropped from the input Beagle file. This would generate a plot more like the following [example report]:
In this example we only see weak structure, but this may be due to our tutorial dataset having little data to infer this. Looking at the full data helps us see things a bit better, and is more what you might expect from a real dataset:
Admixture
As a way to more formally look into population clustering, we can use admixture analysis, for which we will use NGSadmix. This tool uses the same pruned, relative-removed Beagle file as PCAngsd. Turn it on with the following option:
(excerpt from `config/config.yaml`, lines 52-56)
PopGLen uses a custom wrapper to run up to 100 replicates per K of NGSadmix (Skotte et al. 2013; Genetics), and estimate the best K, using the method described by Pečnerová et al. (2021, Curr. Biol.). In this method, the best K is the highest K in which the three highest likelihood replicates converge (<2 log likelihood unit difference by default). Additionally, we use EvalAdmix (Garcia-Erill & Albrechtsen; Mol. Ecol. Res.) to plot the correlation of residuals for each value of K as another means to assess fit. Often, limited sampling prevents higher values of K from converging, but EvalAdmix may help determine how well a model fits even if sampling limits the ability to converge at higher Ks.
We can set which values of K to run, and customize the convergence criteria in the config file, but the settings included are good for this tutorial:
(excerpt from `config/config.yaml`, lines 186-192)
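Based on the description of this section, the NGSadmix parameters might look roughly like the sketch below. The key names and the list of K values are assumptions; the up-to-100 replicates, the 2 log-likelihood unit convergence threshold for the three best replicates, and the `-maxiter 4000` extra argument come from the text:

```yaml
ngsadmix:                 # assumed nesting and key names
  kvalues: [1, 2, 3, 4]   # values of K to run (illustrative list)
  reps: 100               # maximum replicates per K
  thresh: 2               # convergence threshold in log-likelihood units
  conv: 3                 # number of top replicates that must converge
  extra: "-maxiter 4000"  # raise NGSadmix's iteration cap so replicates converge
```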
`-maxiter 4000`?
It's good to make sure that your replicate runs are finishing because they converge, not because they hit the default maxiter value for NGSadmix. We keep it set to 4000 as it will not increase run times for datasets that easily converge, but have found it enables replicates to converge more reliably for datasets where structure is weak.
If we run again now, we will see that only NGSadmix and EvalAdmix need to run. On real data, this can be one of the longest running analyses in PopGLen, so I recommend running it in parallel with others, but for the tutorial it's fast enough to run now. After generating a report for the new run (example here), you will find the admixture results from NGSadmix and EvalAdmix under '03.2 Admixture'. The admixture plots for all values of K will be assembled in a plot like the one below:
In this plot, the K's for which multiple replicates converged will be in bold. This is only K=1 for the tutorial dataset, likely because the population structure here is quite weak and this subset of the genome isn't enough to get a good estimate of it. We can also check the correlation of residuals estimated by EvalAdmix to see how well just one cluster fits the data. In this plot, a positive correlation suggests samples/populations have a shared history that is not captured in the model, and a negative correlation suggests they have a more divergent history than the model implies.
These correlations are positive between each population and itself, suggesting individuals within populations share more similar backgrounds than currently modeled. We also have negative correlations between the two historical populations and the modern Western Skåne population, suggesting that they are more distinct than currently modeled. Let's take a look at what the admixture results are when we perform the analysis using the whole genome, rather than just the tutorial subset:
Here we see that the results converge at K = 3, which divides our dataset into three clusters, one containing all the historical samples and one each for the two modern populations. The correlations of residuals for this K are much closer to zero for both samples and populations, suggesting that this model is quite a good fit for our data:
Take a look at the documentation for NGSadmix and EvalAdmix for more information on how to use and interpret these tools.
Site frequency spectra, genetic diversity, and FST
Let's try enabling a few things at once to get an idea of what that's like. In this case, there are several metrics that use similar files, so we will work with them together. To calculate thetas (genetic diversity metrics) and neutrality statistics, as well as FST, a SAF file containing the site allele frequencies is first made for each population. This is used to estimate a site frequency spectrum, which in turn is used as a prior to estimate thetas and FST globally and in windows. Individual heterozygosity is estimated by making a SAF file for each individual, generating a single sample SFS, and dividing the middle value (the heterozygous site count) by the total number of sites.
Let's enable these:
(excerpt from `config/config.yaml`, lines 62-70)
Here, we enable the generation of estimates of Watterson's θ, nucleotide diversity (π), and Tajima's D (`thetas_angsd`), individual heterozygosity estimates (`heterozygosity_angsd`), and FST for all population pairs (`fst_angsd/populations`). You might for some reason want to do FST for all pairs of individuals, so that's an option too, but we won't do it here.
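As a hedged sketch, the toggles this corresponds to might look like the following; the option names come from the text, while the nesting under `analyses:` is an assumption:

```yaml
analyses:                     # assumed parent key
  thetas_angsd: true          # Watterson's theta, pi, Tajima's D per population
  heterozygosity_angsd: true  # individual heterozygosity from single-sample SFS
  fst_angsd:
    populations: true         # pairwise FST between populations
    individuals: false        # pairwise FST between individuals (not used here)
```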
Notice how I have not enabled any of the SFS options. PopGLen knows the SFS is a prerequisite for the latter analyses, so it doesn't need these to be enabled to generate it. If you only wanted the SFS and not the other things, you could turn these on (and it is a good idea to do so, to ensure that the prior going into these analyses looks reasonable). Here, we won't turn them on just as a means of illustrating that intermediate files are generated as needed, even if not requested. There are also options to generate bootstrapped versions of the SFS, which are often used for estimating uncertainty in many demographic modeling tools, but we won't go into that in this tutorial. Check the configuration docs for details on how to set that up.
When enabling thetas and FST, we will generate sliding window and global estimates. Let's look into the settings later in the config for this:
(excerpt from `config/config.yaml`, lines 178-185)
Here, we set a few things. By default we use the Hudson-Bhatia estimator (Bhatia et al. 2013; Genome Res.) to estimate FST, as it is more robust to small and uneven sample sizes (equivalent to `-whichFst 1` in the command; `-whichFst 0` is the ANGSD default). For both FST and thetas we set our windows to a size of 25kb, with a 10kb step, based on the rather steep LD decay we saw in our earlier results. With quick LD decay, large window sizes may 'smooth' out local signal too much. However, it is useful to explore different window sizes, so you can easily change these and rerun the pipeline.
Thetas have an extra option, `minsites`. This is the number of sites a window must have data for to be included in the global theta estimates and the distribution plots generated for the report. Since the global FST estimate is made directly by ANGSD, and we only plot that for the report, not the windows, no equivalent setting is needed for FST. The window outputs from ANGSD remain unaltered for both; this setting only affects the reporting done by the pipeline.
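A sketch of the windowed estimate settings described above might look like this; the key names and the `minsites` value are assumptions, while the window size, step, and FST estimator choice follow the text:

```yaml
fst:
  whichFst: 1        # Bhatia/Hudson estimator (0 would be the ANGSD default)
  win_size: 25000    # 25 kb windows
  win_step: 10000    # 10 kb step
thetas:
  win_size: 25000
  win_step: 10000
  minsites: 1000     # minimum sites with data per window for reporting (illustrative value)
```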
What about polarizing to ancestral states?
By default, we are assuming no ancestral states are available, but if you have them in a reference included in your config, you might want to make an additional change in your config to keep the SFS unfolded:
(excerpt from `config/config.yaml`, lines 175-177)
One last thing for this section: since we are enabling heterozygosity, we also need to look at how many bootstrap replicates we will do for the SFS. These will be used to calculate confidence intervals for the estimates. This can be found in the realSFS section of the config:
(excerpt from `config/config.yaml`, lines 175-177)
While for real data you might want a high number (like 300) to get good confidence intervals, we will lower it to 50 here just to make this run a bit faster. Note that currently the confidence interval estimation for heterozygosity shares this configuration with the bootstrap count for 1D and 2D SFS. Separating these into individual settings would be relatively simple, so please request it if desired and I will implement it.
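The realSFS section being referred to might look roughly like this sketch; the key names are assumptions, while the folding choice and the 50 bootstraps come from the text:

```yaml
realsfs:        # assumed nesting and key names
  fold: 1       # fold the SFS, since we have no ancestral states
  sfsboot: 50   # bootstrap replicates, used for heterozygosity confidence intervals
```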
All right, now let's run it again; this time, quite a few analyses will process in parallel. Once done, we can make a report (example here) and you'll find a few new sections: '04.1 Watterson's Theta', '04.2 Nucleotide Diversity (Pi)', '04.3 Tajima's D', '04.4 Heterozygosity', and '05 Fst'. For the first three, you will get violin plots illustrating the distribution of windowed estimates for the given metric in each population:
As we can see, the general distributions of nucleotide diversity and Watterson's theta are lower in the modern populations than the historical, in line with a decline in genetic diversity. We also see that genome-wide estimates of Tajima's D are higher in modern populations, suggesting they've been going through a population bottleneck as rare variants are becoming less common compared to common ones. Note that the general estimates of Tajima's D are negative for all populations, which indicates a population expansion. However, this is expected for many populations in northern Europe, which expanded out of glacial refugia. This highlights how absolute estimates can be misleading, and proper baselines must be used to create context, such as the historical populations we use here.
In addition to the violin plots, you'll get a table for each metric with the mean for each population along with 95% confidence intervals estimated from 1000 bootstrap estimates of the mean, resampling windows with replacement:
Under '04.4 Heterozygosity', we will find some similar files. There will be a table with heterozygosity estimates with confidence intervals for each individual, this time estimating the intervals from the bootstrapped SFS. Two plots are made for heterozygosity - a boxplot of the estimates grouped by population and a plot showing individual estimates with confidence intervals:
Finally, under '05 Fst', we will have one plot, a heat map of global FST estimates between all population pairs. This plots the weighted FST estimate produced by ANGSD:
Note that the optimization approach used by ANGSD can sometimes produce FST estimates below zero, but these are a methodological artifact and should be considered equivalent to zero. With that in mind, we can see that the FST between the historical populations is lower than between the modern, suggesting that populations have become more differentiated over the time period. This may be simply because they already were isolated and drift has progressed over the period, or even sped up due to the smaller population sizes. It can also mean that gene flow has become reduced. This possibility is especially likely given our clustering analyses and complete lack of differentiation in historical populations, suggesting that they were well connected in the past.
All of these metrics have additionally been performed in windows, which are not included in the report, largely as I find the uses for window based versions of these estimates are quite diverse and likely better summarized using whatever scripts you make yourself. Check the output file summary for information on where you can find these files.
Impact of sample size on these metrics
Genetic diversity, differentiation, and neutrality metrics like nucleotide diversity, Watterson's theta, FST, and Tajima's D respond differently to sample size. Many studies aim for 8+ samples per population, but when sample sizes vary, each of these metrics is affected to a different degree. We have tested how close estimates of these metrics at lower sample sizes can get to the estimate produced with 8 individuals (see Figure 4 in Nolen et al. 2024, Biol. Conserv.):
Nucleotide diversity is most influenced by common variants, which can often be captured at lower sample sizes, so, as expected, we found that even just two samples captured a signal similar to eight. FST required closer to 4, at least with the Bhatia-Hudson estimator we used. Rare variants have a greater impact on Watterson's theta and Tajima's D (which measures the deviation from the ratio of rare to common variants expected under neutrality), so these estimates can change considerably with each sample added. We have largely relied on repeatedly subsampling populations down to an equivalent size to assess the variation in estimates driven by sample size, and code extending PopGLen to do this can be found in the manuscript mentioned above. If such subsampling would be useful to include natively, or if you have a suggestion for an easier way to implement it (currently we just make many replicate bamlists, then follow the SAF -> SFS -> Theta/Fst workflow), please open a feature request or pull request.
Sample size of course also affects our other analyses, but many of the tools used here have documented these effects in their original publications, which are the best place to read about them.
Inbreeding and Runs of Homozygosity
Next up, let's look at estimating inbreeding using identity-by-descent tracts (runs of homozygosity) with ngsF-HMM. This tool uses a model-based approach that assumes your samples are in Hardy-Weinberg equilibrium (aside from inbreeding) and that sites are independent. So, in the default pathway, we call SNPs and prune them per population as defined in the sample file. Enabling this analysis is a single config line:
config/config.yaml (line 71)
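As a minimal sketch, the toggle sits in the analyses section of the config; the exact key name below is a guess based on the tool's name, so use the name shown in your config template:

analyses:
  # placeholder key name: enable inbreeding/ROH estimation with ngsF-HMM
  inbreeding_ngsf-hmm: true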
Let's also take a look at the settings for the tool in the config file:
config/config.yaml (lines 168-174)
Here, we can toggle whether we want to run the tool for each population (true) or for the dataset as a whole (false), whether we want to prune, and, if so, what settings to use for pruning (remember that background LD is often higher within populations, so pruning within populations will probably require a higher minimum weight than pruning for the whole dataset; otherwise pruning may take a very long time or prune away too many sites). We will set these to the values we came up with in the LD decay section.
We also have a few options for run of homozygosity sizes. ngsF-HMM will output a model-based estimate of the inbreeding coefficient per sample. Many studies, especially in conservation genomics, instead calculate the inbreeding coefficient as the proportion of the genome in runs of homozygosity greater than a certain size (FRoH). PopGLen will calculate FRoH for you from the tracts inferred by ngsF-HMM, while keeping the F estimates from ngsF-HMM in the original outputs. FRoH is estimated from the runs of homozygosity larger than min_roh_length and the autosomal length estimated from the reference, after removing any excluded contigs. A plot is made showing the composition of FRoH by size classes of runs of homozygosity, with the class boundaries given as a list of integers to roh_bins. This is just for plotting in the report; you still get the raw files from ngsF-HMM for downstream analyses. The settings here work fine for the tutorial.
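Putting those options together, the settings block might look roughly like the sketch below. min_roh_length and roh_bins are the options described above; the surrounding section name, the pruning-related keys, and all values are placeholders for illustration, and the pruning settings should be whatever you settled on in the LD decay section:

params:
  ngsf-hmm:                    # placeholder section name
    estimate_in_pops: true     # placeholder key: true = per population, false = whole dataset
    prune: true                # placeholder key: LD prune SNPs before running
    pruning_min-weight: 0.1    # placeholder key/value: min LD weight, set higher within populations
    min_roh_length: 100000     # ROH shorter than this (bp) are excluded from FRoH (value illustrative)
    roh_bins: [250000, 500000, 1000000, 2500000]   # size class boundaries for the report plot (values illustrative)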
Now, let's give it a run and see what we get (example report). Note that we are looking for long runs of homozygosity, yet our tutorial dataset is from a tiny subset of the genome. In fact, I've included no continuous section of the genome longer than 3Mb, so it will be impossible to find tracts longer than this. Still, if you generate a report and check the plots under '06 Inbreeding' you should see that at least some runs of homozygosity were found:
This figure shows the total inbreeding coefficient for each individual (the total height of the bar), grouped by population. Each bar is segmented into a grayscale gradient, with each shade representing a range of IBD tract lengths. These sizes tell us a bit about the time to the common ancestor of the identical-by-descent segments, so seeing how much of the inbreeding coefficient comes from IBD tracts of different sizes can help in understanding inbreeding patterns in the population: more recent inbreeding is composed of longer tracts, whereas shorter tracts reflect historical or background inbreeding. Since the sizes that matter will depend on the recombination rate of your organism, these bins can be customized in the config.
But this can be a bit hard to see in the tutorial dataset, since we have little information to actually estimate these tracts from. Let's take a look at the same samples, but using the whole genome to get a better idea of what we might see in this plot:
Here, we can see more of the variation in FRoH across individuals, as well as which size classes contribute to it. We can see that the modern populations have individuals with higher inbreeding coefficients than the historical ones, in line with the declines in genetic diversity and increases in population structure we've already seen. Most of the coefficients come from IBD tracts shorter than 1Mb, which likely means that background relatedness has increased, rather than close-kin matings having occurred, as the latter would produce more long IBD tracts.
A second type of plot is available in the report that looks at the relationship between the number of runs of homozygosity and their cumulative length. This helps to identify individuals that have recent, close-kin inbreeding contributing to their inbreeding coefficients, as these individuals will have fewer, but longer runs of homozygosity, putting them more towards the bottom right of the plot (we will just look at the one for the full data as this plot really doesn't show anything in the tutorial subset):
We don't really see that here; there is quite a linear relationship between the number of runs of homozygosity and their cumulative length. This is in line with what we saw in the barplot: most of the inbreeding coefficient comes from IBD tracts that are relatively short rather than long.
Identity by state matrix
Some tools that are popular with low coverage datasets take an identity by state matrix as input, such as EEMS (Petkova et al. Nat. Genet.). ANGSD has a nice way to calculate this by randomly sampling a base at each position per sample. This is implemented in a basic form here, essentially calculating an IBS matrix from SNPs estimated for the Beagle file. Enable it with the toggle in the config:
config/config.yaml (line 72)
You can also change how the state is calculated (the setting of ANGSD's -doIBS option). The default is randomly sampling a base (1), but you can also use the consensus base (2):
config/config.yaml (lines 193-194)
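As a sketch of both settings together (the key names here are placeholders; the second value is what gets passed to ANGSD's -doIBS):

analyses:
  ibs_matrix: true   # placeholder key name: enable the IBS matrix

params:
  ibs:
    doibs: 1         # placeholder key name: 1 = randomly sampled base, 2 = consensus base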
There are no plots for this in the report; you just get the matrix in your results folder at results/datasets/{dataset}/analyses/IBS/{dataset}.{ref}_{population}{dp}_{sites}-filts.ibsMat if you run it. Feel free to give it a try on the tutorial data and see what it looks like.
Population-specific allele frequencies
While the SFS gives us a summary of the allele frequency distribution within
a population, you may wish to have allele frequencies for each site within each
population. This can be estimated using ANGSD by enabling the pop_allele_freqs
option in the config:
config/config.yaml (line 73)
This will generate minor allele frequency counts for each population using the -doMaf option in ANGSD, utilizing the ANGSD settings in the config, in particular domajorminor, domaf, snp_pval, and min_maf.
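For orientation, those settings might look roughly like this sketch; the option names are the ones mentioned above, but the section layout and the values are illustrative only:

analyses:
  pop_allele_freqs: true   # enable per-population allele frequency estimation

params:
  angsd:
    domajorminor: 1        # how major/minor alleles are inferred (ANGSD -doMajorMinor)
    domaf: 1               # allele frequency estimation method (ANGSD -doMaf)
    snp_pval: "1e-6"       # SNP p-value cutoff used when calling SNPs (value illustrative)
    min_maf: 0.05          # minimum minor allele frequency filter (value illustrative)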
Enabling this analysis in the pipeline generates two files per population: one with the allele frequencies for SNPs segregating across the entire dataset, and one for SNPs segregating within the population. Let's enable and run this, then look at the following output files. These are long tables, so they aren't in the report, and we can just compare the first few lines.
First, let's examine the output for the 2021 western Skåne population, looking at the sites that are segregating within the population and allowing the population to determine the major allele. This file will have pop-maj near the end of its name. You can see the first few lines with the command:
zcat results/datasets/popglen-tutorial/mafs/popglen-tutorial.tutorial-ref_WSkane2021_allsites-filts.pop-maj.mafs.gz | head -n 5
results/datasets/popglen-tutorial/mafs/popglen-tutorial.tutorial-ref_WSkane2021_allsites-filts.pop-maj.mafs.gz (first 5 lines)
We have a few columns here: the chromosome, position, major allele, minor allele, reference allele, ancestral allele, the allele frequency from the optimization process assuming known major and minor alleles (knownEM), the SNP p-value (pK-EM), and the number of individuals used to calculate the frequency (nInd).
Now, let's look at the file produced using sites segregating across the dataset,
allowing the dataset to determine the major allele. This file will have
dataset-maj
at the end and we can look at the first few lines with the
command:
zcat results/datasets/popglen-tutorial/mafs/popglen-tutorial.tutorial-ref_WSkane2021_allsites-filts.dataset-maj.mafs.gz | head -n 5
results/datasets/popglen-tutorial/mafs/popglen-tutorial.tutorial-ref_WSkane2021_allsites-filts.dataset-maj.mafs.gz (first 5 lines)
The columns here are the same as in the previous file, except for a missing
pK-EM column. This is because we just used the SNPs found in our dataset Beagle
file, so we didn't actually calculate the SNP p-value. The rows are also
different. We have two additional SNPs at positions 581 and 593, and we are
missing the SNP at 713. This is because we are getting frequencies for SNPs
called across the whole dataset. At positions 581 and 593, this population is invariant, so we have an allele frequency (knownEM) of 0. In the previous file, SNPs were called within the population, so these sites were not included, as they were invariant there. However, they are variable somewhere in the
dataset, so they are included here. This means that the dataset-maj
files for
all populations will be exactly the same length, the major and minor allele will
always be the same, and the allele frequency can range from 0-1. This is likely
the file you will want for downstream analyses, as it allows direct comparison
of frequencies at SNPs across your dataset.
Now, what happened to the SNP at position 713? Well, this was variable in our
population, but at low frequency: ~0.14 across four samples, which is likely
just a singleton. If this singleton was the only time the minor allele appeared
in the dataset, it likely was removed by the minor allele frequency or SNP
p-value filter when calling SNPs across the dataset. This means that the
pop-maj
files will all be different lengths, and even at the same positions
may have different major and minor alleles. In this case, the major and minor
alleles across the dataset and the population are the same, so the frequencies
at those positions (638 and 749) are also the same. Let's try to find a position
where they are not the same. The easiest way to do this is to look for positions
in the dataset-maj
where the frequency is higher than 0.5, as these are sites
where the minor allele across the dataset is the major allele
within the population. This command should get us there:
zcat results/datasets/popglen-tutorial/mafs/popglen-tutorial.tutorial-ref_WSkane2021_allsites-filts.dataset-maj.mafs.gz | awk '$7>0.6' | head -n 2
results/datasets/popglen-tutorial/mafs/popglen-tutorial.tutorial-ref_WSkane2021_allsites-filts.dataset-maj.mafs.gz (first 2 matching lines)
Let's see what this SNP looks like in the pop-maj
file:
zcat results/datasets/popglen-tutorial/mafs/popglen-tutorial.tutorial-ref_WSkane2021_allsites-filts.pop-maj.mafs.gz | awk '$1=="1" && $2=="5953"'
1 5953 C G G G 0.353786 3.540612e-11 5
Notice that the major and minor alleles are reversed, and the allele frequency is the complement (one minus) of that in the dataset-maj file. This is because the major and minor are defined by the allele frequencies within the population, rather than across the dataset.
Hopefully this clarifies the two different formats you'll get. Most of the time,
you'll want dataset-maj
, as it lets you compare across all your populations
directly at every SNP. pop-maj
may be useful if you have downstream
applications where you don't need to worry about keeping all the same sites,
just having some per-site frequencies for a population. For example, the allele
frequency based relatedness method in NgsRelate only requires
population-specific allele frequencies to estimate relatedness within
populations, so we pass it the pop-maj
file along with the Beagle file in
PopGLen.
Allele frequencies using reference/alternate instead of major/minor
If you are used to thinking of your SNPs in terms of reference and alternate
alleles, and would prefer to polarize your frequencies this way, you can
simply set domajorminor to 4 in the config. This will ensure the major
allele always matches the reference, and the minor will be the inferred
minor allele, just as it would be in Samtools. This will mean any analyses
that use Beagle files will also get their input polarized this way, but as
far as I am aware that won't affect the outcomes, except that you may lose
some sites if the segregating alleles in the dataset do not include the
reference allele.
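A one-line sketch of that change, with the same caveat as above that the surrounding layout is illustrative:

params:
  angsd:
    domajorminor: 4   # major allele is forced to the reference allele; minor is inferred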
Subsampling sequencing depth
While genotype likelihoods can be very good at handling datasets with variable sequencing depth, it may be valuable to check that your results are not predominantly driven by sequencing depth variation. This is especially of concern for datasets with historical and modern samples where the modern 'treatment' is expected to be higher depth, or for studies combining genomic data from multiple sources.
PopGLen allows you to rerun any of the included analyses after subsampling the BAM files to a lower sequencing depth with the following section of the config:
config/config.yaml (lines 75-104)
Here, you can give any number of mean sequencing depths to subsample to as a
list of integers to subsample_dp
. Then, enable any analyses you want to
perform at subsampled depth. So, say we want to rerun our PCA and thetas with
all samples subsampled to 3X mean sequencing depth (a bit higher than our
lowest depth samples in this dataset), we would set it up like this:
config/config.yaml (lines 75-104)
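As a rough sketch of the setup just described: subsample_dp is the option named above, while the analysis toggles underneath are placeholder key names standing in for whichever analyses you enable (here, the PCA and thetas):

subsample_dp: [3]          # mean depth(s) to subsample BAM files to

subsample_analyses:        # placeholder section name
  pca_pcangsd: true        # placeholder key: rerun the PCA on the subsampled BAMs
  thetas_angsd: true       # placeholder key: rerun theta/diversity estimates on the subsampled BAMs
  # leave the remaining analyses set to false to avoid unnecessary reruns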
If you would like, you can run this and see what happens. Several analyses will run, but these aren't re-runs; rather, they are new versions of the analyses using BAM files subsampled to 3X filtered mean sequencing depth.
If we generate the report and look at the QC tables for sequencing depth, we will see a new one for '3X', which is the QC for our new subsampled BAM files. In the allsites-filts.depth column, we have a mean depth of ~3 for all samples except those already below 3X, so the subsampling worked well. The other mean depth estimates are not uniform across samples and aren't 3. In order to ensure the sequencing depth is the same in the actual analyses,
we use the mean sequencing depth only at sites passing our filter to determine
how to subsample. If we used the other measures of depth that include regions
outside the filtered sites file, we would get uniform depth in those measures,
but not at the filtered sites, which means depth would still vary in the actual
analyses. This is still possible to do, and is what the subsample_by option selects, so check the configuration documentation if you're interested in that.
Suggested workflow batches
This tutorial has demonstrated two approaches that are a bit extreme: one is to run everything all at once, the other is to run each analysis almost one by one. If you do the first, you risk running some analyses before you know that your samples or settings are good. If you do the second, you still get a lot of the benefits of PopGLen (single-file configuration, automated reproducible flow of inputs and outputs, software management), but will have to manually start it up several times.
Likely, you will want something in between - check the results at key points that minimize the number of runs you'll do. When I do projects, I tend to follow this working path with PopGLen:
- Run all mapping and main QC. Determine if any samples need to be dropped based on quality or other concerns, and drop them if so. Also turn on my expected filter sets and check that the filtered proportion of the genome makes sense (I usually use ~30% of the butterfly genome, after removing repeats, keeping only complete autosomal chromosomes, and removing extreme depth). Usually, the amount I use is not much lower than the amount left after repeat removal alone; if it is, the reference may be too distant to map well to.
- If I am not confident in my LD pruning settings, I run the LD decay analysis. In this run I can also turn on anything that doesn't involve LD pruning, even window-based analyses, as they are quick to rerun if their settings change. If I already feel my LD pruning settings will be good enough (usually from previous work), I might skip this batch. This is also a good point to do a relatedness check to find out if any individuals should be removed from the PCA and Admix, as those require pruning, but the relatedness methods do not.
- Next, using the LD results, I set my pruning settings and run everything else.
- Run analyses at subsampled depth if desired.
If all goes well, nothing will need to be modified and you can start working on your analyses downstream of PopGLen. Your config, samples, and units files together make a reproducible record of your analyses, allowing everything to be reproduced from your input data. If you want to change any settings or samples, you can simply update the configs and re-run; only what needs to be updated will get updated.
If PopGLen will form a large part of your workflow, I recommend using it as a module in the way we did with Snakedeploy in this tutorial. This allows you to add rules for your downstream analyses to the Snakefile, making it relatively easy to share your code with a manuscript in a way that enables all steps to be reproduced. You can check out the repositories for two papers in which I used early versions of PopGLen, zjnolen/polyommatini-landcover-diversity and zjnolen/polyommatini-temporal-genomics, for ideas on how to do this.