Tutorial: Microbial Diversity Analysis with QIIME2 and Kraken2
December 2024 (2874 Words, 16 Minutes)
Introduction
Microbial diversity analysis is a very important aspect of microbiome research, helping researchers to study the composition, abundance, and variation of microbial communities in different environments. This tutorial focuses on combining two widely used bioinformatics tools: Kraken2 and QIIME2.
Kraken2 is a computationally efficient tool for taxonomic classification of metagenomic and
microbiome sequencing data. It assigns taxonomic labels to sequences using a k-mer–based approach,
making it a popular choice for analyzing large datasets quickly and accurately.
Bracken is a tool by the same developers of Kraken2
that re-estimates the abundances outputted by Kraken2 at a single taxonomic level.
For this tutorial, Bracken reports in kreport
format can also be used.
QIIME2 is an open-source, community-driven platform for microbiome data analysis.
It provides a framework for performing taxonomic profiling, diversity analyses,
and generating interactive visualizations.
While the developers of Kraken2 have written a few programs that help with calculating microbial diversity from Bracken outputs, these scripts do not provide immediate visualization like QIIME2 does. By combining Kraken2’s (or Bracken’s) classification output with the QIIME2 platform, researchers can perform advanced diversity analyses, including various alpha and beta diversity metrics, and easily create visualizations to get insights into their data.
This tutorial will guide you through:
- Installing QIIME2
- Converting Kreports to BIOM format
- Importing data into QIIME2
- Formatting metadata
- Creating interactive barplot
- Calculating Alpha diversity
- Calculating Beta diversity
- Core diversity
By following these steps, you will learn how to use Kraken2 and QIIME2 together to explore microbial diversity in your datasets.
Installing QIIME2
There are multiple distributions of QIIME2 that each offer a different set of plugins for specific research goals. For our research goal; the amplicon distribution is the most suitable because it includes plugins such as diversity
, emperor
, feature-table
, metadata
, phylogeny
, taxa
and types
.
The easiest way to install any of the distributions is with Conda:
#create env based on yml
conda env create -n q2-amplicon2410 \
--file https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2024.10-py310-linux-conda.yml
#activate
conda activate q2-amplicon2410
If you want to update QIIME2, you need to make a new conda environment based on the updated yaml file. That’s why the version (24.10) is included in the environment name.
To enable tab completion of QIIME2 commands, run:
source tab-qiime
Or add this command to your .bashrc/
or .bash_profile
.
Converting Kreports to BIOM format
Most of the times people use QIIME2, they start their analysis with the platform with raw reads. Since our raw reads are already processed and we only want to use QIIME2 to access the microbial diversity, we need a way to get our taxonomic classification results into QIIME2. The easiest way to achieve this is to use the tool Kraken-Biom. This tool was specifically designed to convert report outputs from Kraken2 or Bracken (“Kreports”) to the BIOM format. The BIOM format is a widely used format in the microbiome world that is compatible with QIIME2.
Installing kraken-biom
Again, the easiest way to install Kraken-biom is by using Conda to create a new environment:
#create and activate new env
conda create -n biom
conda activate biom
#install Kraken-biom
conda install bioconda::kraken-biom
A new, clean environment is recommended because QIIME2 has a lot of dependencies which often cause conflicts when trying to install other tools in the same environment.
Converting kreports
Make sure you have a directory that only contains the kreports that you want to include in the analysis in QIIME2. For example, if the kreports are located at /home/data/kreports
use a command like this to convet them to BIOM format:
kraken-biom \
-k "/home/data/kreports" \
-o samples_kr2.biom
This will result in a BIOM file called samples_kr2.biom
. To check if the biom file looks good, use the summarize-table
command from the biom program, one of the dependencies of kraken-biom:
biom summarize-table \
-i samples_kr2.biom \
-o samples_kr2_summary.txt
The summary file should show the same amount of samples as Kraken2 reports that were used as input for kraken-biom. Additionally, there should be an “Observation Metadata Category” of “taxonomy”.
Deactivate the environment so you can move on to the analysis in QIIME2:
conda deactivate
Importing data into QIIME2
To actually get the data into a QIIME2 artifact (qza
) import the BIOM file like this:
#frequency table import
qiime tools import \
--input-path samples_kr2.biom \
--type FeatureTable[Frequency] \
--input-format BIOMV210Format \
--output-path samples_freq.qza
#taxonomy import
qiime tools import \
--input-path samples_kr2.biom \
--type FeatureData[Taxonomy] \
--input-format BIOMV210Format \
--output-path samples_tax.qza
Now, the information from the BIOM file is stored in 2 different artifacts. One contains the frequencies observed of each classification (samples_freq.qza
), the other one contains the taxonomic name belonging to each of the classification IDs (samples_tax.qza
).
Formatting metadata
Metadata in QIIME2 has to follow a specific format, like the example below (obtained from the QIIME2 moving pictures tutorial):
sample-id barcode-sequence body-site year month day subject reported-antibiotic-usage days-since-experiment-start
#q2:types categorical categorical numeric numeric numeric categorical categorical numeric
L1S8 AGCTGACTAGTC gut 2008 10 28 subject-1 Yes 0
L1S57 ACACACTATGGC gut 2009 1 20 subject-1 No 84
L1S76 ACTACGTGTGGT gut 2009 2 17 subject-1 No 112
L1S105 AGTGCGATGCGT gut 2009 3 17 subject-1 No 140
The most important thing to keep in mind is that the row below the header names should contain “#q2:types” in the first column that should be named “sample-id”. For each of the other columns, this “types” row contains the datatype that the column contains. Save the file as .tsv
or .txt
.
If you want to check if your metadata is formatted properly, there is an extension available in google sheets called Keemei that can validate the metadata.
Creating interactive barplot
To check if the artifacts are made properly and if the data looks like you expect it to, quickly create an interactive barplot with this command from the taxa
plugin:
#new dir for visualizations
mkdir visual/
#barplot command
qiime taxa barplot\
--i-table samples_freq.qza \
--i-taxonomy samples_tax.qza \
--o-visualization visual/samples_barplot.qzv
The qiime taxa barplot
command takes both artifacts (frequency and taxonomy) to create a qzv
object. This is another type of artifact created by QIIME2 that can not be processed further within QIIME2 but does contain some kind of visual output.
Every qzv
can be visualized by uploading it to QIIME2 view. Depending on the data in the object multiple functions are available on the site to interactively gain insights into the data.
Calculating Alpha diversity
If the data looks good and if you want to learn more about the diversity of the samples, the plugin diversity
is the way to go. To check the within sample (alpha) diversity, there are multiple commands available. Most of the times, a diversity analysis starts with performing rarefaction to see what a good depth is.
Sampling depth
For diversity analyses, a decision has to be made on how many samples you want to keep in the analysis. If you want to retain all samples, the sequencing depth of the sample with the lowest depth automatically is the depth on which diversity is calculated. This can result in a loss of information if there are a few outliers with very low sequencing depth. For example; if the average depth of your samples is 3000, but there is one sample with only 800 reads, diversity of all samples will be calculated over only 800 reads. So, it’s best to select a depth that removes samples with very low depths but does not remove too much samples, because this also results in a loss of information.
Overview depth in samples
To get an idea of how deep each of the samples is sequenced and to see the effects on losing samples when choosing a certain sampling depth during diversity calculations, the plugin feature-table
and the command summarize
are useful. This command creates a qzv
where you can interactively check sample information:
qiime feature-table summarize \
--i-table samples_freq.qza \
--m-sample-metadata-file samples_metadata.tsv \
--o-visualization visual/samples_freq_summary.qzv
Alpha rarefaction
To see if a depth you choose based on how many samples you want to keep in the analysis is deep enough for reliable diversity calculations, use the alpha-rarefaction
command from the diversity
plugin. This also generates a qzv
where the rarefaction plots are shown for each of the metrics included in the analysis, which will give you an idea of the most reliable metrics based on your data. To perform rarefaction for the Alpha diversity indices Chao1, observed features, Pielou’s evenness, Shannon and Simpson at a depth of 2000 reads per sample; use a command like this:
qiime diversity alpha-rarefaction \
--i-table samples_freq.qza \
--p-max-depth 2000 \
--p-metrics chao1 observed_features pielou_e shannon simpson \
--o-visualization visual/samples_alpha_rare_1500.qzv
Removing samples below chosen depth
Based on the information obtained from the summarize
and alpha-rarefaction
command, the samples that you want to remove now have to be filtered from the initial feature table. To do this, the same plugin (feature-table
) provides the command filter-samples
. The command below shows how to remove all samples with a depth lower than 1500.
qiime feature-table filter-samples \
--i-table samples_freq.qza \
--p-min-frequency 1500 \
--o-filtered-table 1500_samples_freq.qza
Alpha diversity
When you have selected one or multiple metrics for the Alpha diversity analysis, the alpha
command from the diversity
plugin is needed. The commands shown below calculate Alpha diversity for some of the same metrics used during the example Alpha rarefaction. A new command is used for each metric!
#new dir to save alpha diversity data
mkdir alpha/
#alpha commands
qiime diversity alpha \
--i-table 1500_samples_freq.qza \
--p-metric chao1 \
--o-alpha-diversity alpha/samples_chao1_1500.qza
qiime diversity alpha \
--i-table 1500_samples_freq.qza \
--p-metric observed_features \
--o-alpha-diversity alpha/samples_obsfea_1500.qza
qiime diversity alpha \
--i-table 1500_samples_freq.qza \
--p-metric shannon \
--o-alpha-diversity alpha/samples_shannon_1500.qza
These commands all output a qza
, which means that the Alpha diversity can not be directly visualized with QIIME2 view. To do this, another step needs to be performed.
Alpha diversity metadata significance
To visually check the Alpha diversity results, QIIME2 needs some information from the metadata. The same diversity
plugin can be used, but with the command alpha-group-significance
:
qiime diversity alpha-group-significance \
--i-alpha-diversity alpha/samples_chao1_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/chao1_1500_sign.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity alpha/samples_obsfea_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/obsfea_1500_sign.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity alpha/samples_shannon_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/shannon_1500_sign.qzv
These commands do output visualizations and show the statistical significance of sample metadata and the previously calculated Alpha diversity.
Calculating Beta diversity
To compare the microbial diversity between sample groups, Beta diversity is most often used. Beta diversity also has a lot of metrics available, so a few common ones are used in the example commands below. This analysis also starts with rarefaction to determine a good depth for the analysis, although most of the times the same depth is used for both Alpha and Beta diversity.
Beta rarefaction
In the same way rarefaction is often performed before calculating Alpha diversity, Beta rarefaction can be calculated. Again, the diversity
plugin provides a command for this:
qiime diversity beta-rarefaction \
--i-table samples_freq.qza \
--p-metric braycurtis \
--p-clustering-method nj \
--m-metadata-file samples_metadata.tsv \
--p-sampling-depth 2000 \
--o-visualization visual/samples_braycurtis_rare_2000.qzv
qiime diversity beta-rarefaction \
--i-table samples_freq.qza \
--p-metric jaccard \
--p-clustering-method nj \
--m-metadata-file samples_metadata.tsv \
--p-sampling-depth 2000 \
--o-visualization visual/samples_jaccard_rare_2000.qzv
With this command, the input table is rarefied based on the Bray-Curtis and Jaccard Beta diversity indices, with the clustering method neighbour joining at a depth of 1500 reads for each sample. Another method is available for clustering: UPGMA. For this analysis, it’s best to choose neighbour joining because UPGMA works best when rooted phylogeny is involved, which is not the case now.
Beta diversity
Based on the Beta rarefaction outcomes, it’s possible to filter samples from the data that do not meet the sampling depth that you want for the analysis with the same method described here Removing samples below chosen depth.
To perform the Beta diversity calculations, use the beta
command from the diversity
plugin. This function computes Beta diversity for user-specified metrics for all pairs of samples in the inputted feature table.
#new directory for beta diversity results
mkdir beta/
#beta commands
qiime diversity beta \
--i-table 1500_samples_freq.qza \
--p-metric braycurtis \
--o-distance-matrix beta/samples_braycurtis_1500.qza
qiime diversity beta \
--i-table 1500_samples_freq.qza \
--p-metric jaccard \
--o-distance-matrix beta/samples_jaccard_1500.qza
Now, Beta diversity matrices are created for 2 metrics, the same ones used during Beta rarefaction (Bray-Curtis and Jaccard).
PCoA of Beta diversity
To visualize the Beta diversity results, the best way is to use a Principal Coordinate Analysis (PCoA) plot. For this plot, you first need to perform the PCoA, which can also be done with a command from the diversity
plugin: pcoa
:
qiime diversity pcoa \
--i-distance-matrix beta/samples_braycurtis_1500.qza \
--o-pcoa beta/pcoa_braycurtis_1500.qza
qiime diversity pcoa \
--i-distance-matrix beta/samples_jaccard_1500.qza \
--o-pcoa beta/pcoa_jaccard_1500.qza
These commands both output PCoA results in qza
format, each on a different Beta diversity index.
Plotting PCoA Beta diversity
To create a visualization based on the PCoA results, QIIME2 provides the emperor
plugin. This tool was developed to visualize microbial ecology data, and can make 3D interactive plots based on a previously calculated PCoA:
qiime emperor plot \
--i-pcoa beta/pcoa_braycurtis_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/pcoaplot_braycurtis_1500.qzv
qiime emperor plot \
--i-pcoa beta/pcoa_jaccard_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/pcoaplot_jaccard_1500.qzv
When the just generated qzv
’s are uploaded to QIIME2 view, a 3D plot is shown that you can fully customize.
Beta diversity metadata significance
To check what Beta diversity results are significant with regards to differences between metadata categories, use the command beta-group-significance
from the diversity
plugin. This command does almost the same as the alpha-group-significance
one, but of course based on Beta diversity data. However, for this command you need to specify the categorical metadata column on which the statistical tests should be performed; otherwise, the process can become very computationally intensive. So, to calculate significance based on the categorical columns “age_group” and “sex” for both calculated Beta diversity matrices; use commands like the commands below:
qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_braycurtis_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column age_group \
--o-visualization visual/braycurtis_age_1500_sign.qzv
qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_braycurtis_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column sex \
--o-visualization visual/braycurtis_sex_1500_sign.qzv
qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_jaccard_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column age_group \
--o-visualization visual/jaccard_age_1500_sign.qzv
qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_jaccard_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column sex \
--o-visualization visual/jaccard_sex_1500_sign.qzv
The output shows if differences are significant for each different categorical metadata column.
Core diversity
If you don’t have specific metrics that you want to include in your diversity analysis, or don’t feel like running multiple commands to get to the results, the diversity plugin of QIIME2 also offers the core-diversity
command. This function calculates a few different commonly used diversity metrics at once:
qiime diversity core-metrics \
--i-table samples_freq.qza \
--p-sampling-depth 2000 \
--m-metadata-file samples_metadata.tsv \
--p-n-jobs 32 \
--output-dir core-metrics_2000
The output directory will contain a rarefied table to a depth of 2000 reads per sample and Alpha diversity matrices for indices observed features, Shannon, evenness and Jaccard. Bray-Curtis and Jaccard results are outputted for Beta diversity, including PCoA plots of these results.
Other information
- Fun fact: QIIME2 artifacts are nothing more than a compressed archive with a fancy name, so you can just unzip them to see the raw data they contain
- Links to the documentation might not work in the future because the links are QIIME2 version specific
Now you know how to perform basic microbial diversity analyses by using QIIME2 in combination with Kraken2 outputs. Check the QIIME2 documentation for all the other analysis options available after you’ve imported your data into QIIME2!