Home..

Tutorial: Microbial Diversity Analysis with QIIME2 and Kraken2

December 2024 (2874 Words, 16 Minutes)

Introduction

Microbial diversity analysis is a very important aspect of microbiome research, helping researchers to study the composition, abundance, and variation of microbial communities in different environments. This tutorial focuses on combining two widely used bioinformatics tools: Kraken2 and QIIME2.

Kraken2 is a computationally efficient tool for taxonomic classification of metagenomic and microbiome sequencing data. It assigns taxonomic labels to sequences using a k-mer–based approach, making it a popular choice for analyzing large datasets quickly and accurately. Bracken is a tool by the same developers of Kraken2 that re-estimates the abundances outputted by Kraken2 at a single taxonomic level. For this tutorial, Bracken reports in kreport format can also be used. QIIME2 is an open-source, community-driven platform for microbiome data analysis. It provides a framework for performing taxonomic profiling, diversity analyses, and generating interactive visualizations.

While the developers of Kraken2 have written a few programs that help with calculating microbial diversity from Bracken outputs, these scripts do not provide immediate visualization like QIIME2 does. By combining Kraken2’s (or Bracken’s) classification output with the QIIME2 platform, researchers can perform advanced diversity analyses, including various alpha and beta diversity metrics, and easily create visualizations to get insights into their data.

This tutorial will guide you through:

Installing QIIME2
Converting Kreports to BIOM format
Importing data into QIIME2
Formatting metadata
Creating interactive barplot
Calculating Alpha diversity
Calculating Beta diversity
Core diversity

By following these steps, you will learn how to use Kraken2 and QIIME2 together to explore microbial diversity in your datasets.

Installing QIIME2

There are multiple distributions of QIIME2 that each offer a different set of plugins for specific research goals. For our research goal; the amplicon distribution is the most suitable because it includes plugins such as diversity, emperor, feature-table, metadata, phylogeny, taxa and types.

The easiest way to install any of the distributions is with Conda:

#create env based on yml
conda env create -n q2-amplicon2410 \
--file https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2024.10-py310-linux-conda.yml
#activate
conda activate q2-amplicon2410

If you want to update QIIME2, you need to make a new conda environment based on the updated yaml file. That’s why the version (24.10) is included in the environment name.

Docs installation

To enable tab completion of QIIME2 commands, run:

source tab-qiime

Or add this command to your .bashrc/ or .bash_profile.

Docs autocompletion

Converting Kreports to BIOM format

Most of the times people use QIIME2, they start their analysis with the platform with raw reads. Since our raw reads are already processed and we only want to use QIIME2 to access the microbial diversity, we need a way to get our taxonomic classification results into QIIME2. The easiest way to achieve this is to use the tool Kraken-Biom. This tool was specifically designed to convert report outputs from Kraken2 or Bracken (“Kreports”) to the BIOM format. The BIOM format is a widely used format in the microbiome world that is compatible with QIIME2.

Installing kraken-biom

Again, the easiest way to install Kraken-biom is by using Conda to create a new environment:

#create and activate new env
conda create -n biom
conda activate biom
#install Kraken-biom
conda install bioconda::kraken-biom

A new, clean environment is recommended because QIIME2 has a lot of dependencies which often cause conflicts when trying to install other tools in the same environment.

Converting kreports

Make sure you have a directory that only contains the kreports that you want to include in the analysis in QIIME2. For example, if the kreports are located at /home/data/kreports use a command like this to convet them to BIOM format:

kraken-biom \
-k "/home/data/kreports" \
-o samples_kr2.biom

This will result in a BIOM file called samples_kr2.biom. To check if the biom file looks good, use the summarize-table command from the biom program, one of the dependencies of kraken-biom:

biom summarize-table \
-i samples_kr2.biom \
-o samples_kr2_summary.txt

The summary file should show the same amount of samples as Kraken2 reports that were used as input for kraken-biom. Additionally, there should be an “Observation Metadata Category” of “taxonomy”.

Deactivate the environment so you can move on to the analysis in QIIME2:

conda deactivate

Importing data into QIIME2

To actually get the data into a QIIME2 artifact (qza) import the BIOM file like this:

#frequency table import
qiime tools import \
--input-path samples_kr2.biom \
--type FeatureTable[Frequency] \
--input-format BIOMV210Format \
--output-path samples_freq.qza
#taxonomy import
qiime tools import \
--input-path samples_kr2.biom \
--type FeatureData[Taxonomy] \
--input-format BIOMV210Format \
--output-path samples_tax.qza

Now, the information from the BIOM file is stored in 2 different artifacts. One contains the frequencies observed of each classification (samples_freq.qza), the other one contains the taxonomic name belonging to each of the classification IDs (samples_tax.qza).

Formatting metadata

Metadata in QIIME2 has to follow a specific format, like the example below (obtained from the QIIME2 moving pictures tutorial):

sample-id	barcode-sequence	body-site	year	month	day	subject	reported-antibiotic-usage	days-since-experiment-start
#q2:types	categorical	categorical	numeric	numeric	numeric	categorical	categorical	numeric
L1S8	AGCTGACTAGTC	gut	2008	10	28	subject-1	Yes	0
L1S57	ACACACTATGGC	gut	2009	1	20	subject-1	No	84
L1S76	ACTACGTGTGGT	gut	2009	2	17	subject-1	No	112
L1S105	AGTGCGATGCGT	gut	2009	3	17	subject-1	No	140

The most important thing to keep in mind is that the row below the header names should contain “#q2:types” in the first column that should be named “sample-id”. For each of the other columns, this “types” row contains the datatype that the column contains. Save the file as .tsv or .txt.

If you want to check if your metadata is formatted properly, there is an extension available in google sheets called Keemei that can validate the metadata.

Creating interactive barplot

To check if the artifacts are made properly and if the data looks like you expect it to, quickly create an interactive barplot with this command from the taxa plugin:

#new dir for visualizations
mkdir visual/
#barplot command
qiime taxa barplot\
--i-table samples_freq.qza \
--i-taxonomy samples_tax.qza \
--o-visualization visual/samples_barplot.qzv

The qiime taxa barplot command takes both artifacts (frequency and taxonomy) to create a qzv object. This is another type of artifact created by QIIME2 that can not be processed further within QIIME2 but does contain some kind of visual output. Every qzv can be visualized by uploading it to QIIME2 view. Depending on the data in the object multiple functions are available on the site to interactively gain insights into the data.

Barplot docs

Calculating Alpha diversity

If the data looks good and if you want to learn more about the diversity of the samples, the plugin diversity is the way to go. To check the within sample (alpha) diversity, there are multiple commands available. Most of the times, a diversity analysis starts with performing rarefaction to see what a good depth is.

Sampling depth

For diversity analyses, a decision has to be made on how many samples you want to keep in the analysis. If you want to retain all samples, the sequencing depth of the sample with the lowest depth automatically is the depth on which diversity is calculated. This can result in a loss of information if there are a few outliers with very low sequencing depth. For example; if the average depth of your samples is 3000, but there is one sample with only 800 reads, diversity of all samples will be calculated over only 800 reads. So, it’s best to select a depth that removes samples with very low depths but does not remove too much samples, because this also results in a loss of information.

Overview depth in samples

To get an idea of how deep each of the samples is sequenced and to see the effects on losing samples when choosing a certain sampling depth during diversity calculations, the plugin feature-table and the command summarize are useful. This command creates a qzv where you can interactively check sample information:

qiime feature-table summarize \
--i-table samples_freq.qza \
--m-sample-metadata-file samples_metadata.tsv \
--o-visualization visual/samples_freq_summary.qzv

Summarize table docs

Alpha rarefaction

To see if a depth you choose based on how many samples you want to keep in the analysis is deep enough for reliable diversity calculations, use the alpha-rarefaction command from the diversity plugin. This also generates a qzv where the rarefaction plots are shown for each of the metrics included in the analysis, which will give you an idea of the most reliable metrics based on your data. To perform rarefaction for the Alpha diversity indices Chao1, observed features, Pielou’s evenness, Shannon and Simpson at a depth of 2000 reads per sample; use a command like this:

qiime diversity alpha-rarefaction \
--i-table samples_freq.qza \
--p-max-depth 2000 \
--p-metrics chao1 observed_features pielou_e shannon simpson \
--o-visualization visual/samples_alpha_rare_1500.qzv

Alpha rarefaction docs

Removing samples below chosen depth

Based on the information obtained from the summarize and alpha-rarefaction command, the samples that you want to remove now have to be filtered from the initial feature table. To do this, the same plugin (feature-table) provides the command filter-samples. The command below shows how to remove all samples with a depth lower than 1500.

qiime feature-table filter-samples \
--i-table samples_freq.qza \
--p-min-frequency 1500 \
--o-filtered-table 1500_samples_freq.qza

Filter samples docs

Alpha diversity

When you have selected one or multiple metrics for the Alpha diversity analysis, the alpha command from the diversity plugin is needed. The commands shown below calculate Alpha diversity for some of the same metrics used during the example Alpha rarefaction. A new command is used for each metric!

#new dir to save alpha diversity data
mkdir alpha/
#alpha commands
qiime diversity alpha \
--i-table 1500_samples_freq.qza \
--p-metric chao1 \
--o-alpha-diversity alpha/samples_chao1_1500.qza

qiime diversity alpha \
--i-table 1500_samples_freq.qza \
--p-metric observed_features \
--o-alpha-diversity alpha/samples_obsfea_1500.qza

qiime diversity alpha \
--i-table 1500_samples_freq.qza \
--p-metric shannon \
--o-alpha-diversity alpha/samples_shannon_1500.qza

These commands all output a qza, which means that the Alpha diversity can not be directly visualized with QIIME2 view. To do this, another step needs to be performed.

Alpha diversity docs

Alpha diversity metadata significance

To visually check the Alpha diversity results, QIIME2 needs some information from the metadata. The same diversity plugin can be used, but with the command alpha-group-significance:

qiime diversity alpha-group-significance \
--i-alpha-diversity alpha/samples_chao1_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/chao1_1500_sign.qzv

qiime diversity alpha-group-significance \
--i-alpha-diversity alpha/samples_obsfea_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/obsfea_1500_sign.qzv

qiime diversity alpha-group-significance \
--i-alpha-diversity alpha/samples_shannon_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/shannon_1500_sign.qzv

These commands do output visualizations and show the statistical significance of sample metadata and the previously calculated Alpha diversity.

Alpha group significance docs

Calculating Beta diversity

To compare the microbial diversity between sample groups, Beta diversity is most often used. Beta diversity also has a lot of metrics available, so a few common ones are used in the example commands below. This analysis also starts with rarefaction to determine a good depth for the analysis, although most of the times the same depth is used for both Alpha and Beta diversity.

Beta rarefaction

In the same way rarefaction is often performed before calculating Alpha diversity, Beta rarefaction can be calculated. Again, the diversity plugin provides a command for this:

qiime diversity beta-rarefaction \
--i-table samples_freq.qza \
--p-metric braycurtis \
--p-clustering-method nj \
--m-metadata-file samples_metadata.tsv \
--p-sampling-depth 2000 \
--o-visualization visual/samples_braycurtis_rare_2000.qzv

qiime diversity beta-rarefaction \
--i-table samples_freq.qza \
--p-metric jaccard \
--p-clustering-method nj \
--m-metadata-file samples_metadata.tsv \
--p-sampling-depth 2000 \
--o-visualization visual/samples_jaccard_rare_2000.qzv

With this command, the input table is rarefied based on the Bray-Curtis and Jaccard Beta diversity indices, with the clustering method neighbour joining at a depth of 1500 reads for each sample. Another method is available for clustering: UPGMA. For this analysis, it’s best to choose neighbour joining because UPGMA works best when rooted phylogeny is involved, which is not the case now.

Beta rarefaction docs

Beta diversity

Based on the Beta rarefaction outcomes, it’s possible to filter samples from the data that do not meet the sampling depth that you want for the analysis with the same method described here Removing samples below chosen depth. To perform the Beta diversity calculations, use the beta command from the diversity plugin. This function computes Beta diversity for user-specified metrics for all pairs of samples in the inputted feature table.

#new directory for beta diversity results
mkdir beta/
#beta commands
qiime diversity beta \
--i-table 1500_samples_freq.qza \
--p-metric braycurtis \
--o-distance-matrix beta/samples_braycurtis_1500.qza

qiime diversity beta \
--i-table 1500_samples_freq.qza \
--p-metric jaccard \
--o-distance-matrix beta/samples_jaccard_1500.qza

Now, Beta diversity matrices are created for 2 metrics, the same ones used during Beta rarefaction (Bray-Curtis and Jaccard).

Beta diversity docs

PCoA of Beta diversity

To visualize the Beta diversity results, the best way is to use a Principal Coordinate Analysis (PCoA) plot. For this plot, you first need to perform the PCoA, which can also be done with a command from the diversity plugin: pcoa:

qiime diversity pcoa \
--i-distance-matrix beta/samples_braycurtis_1500.qza \
--o-pcoa beta/pcoa_braycurtis_1500.qza

qiime diversity pcoa \
--i-distance-matrix beta/samples_jaccard_1500.qza \
--o-pcoa beta/pcoa_jaccard_1500.qza

These commands both output PCoA results in qza format, each on a different Beta diversity index.

PCoA docs

Plotting PCoA Beta diversity

To create a visualization based on the PCoA results, QIIME2 provides the emperor plugin. This tool was developed to visualize microbial ecology data, and can make 3D interactive plots based on a previously calculated PCoA:

qiime emperor plot \
--i-pcoa beta/pcoa_braycurtis_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/pcoaplot_braycurtis_1500.qzv

qiime emperor plot \
--i-pcoa beta/pcoa_jaccard_1500.qza \
--m-metadata-file samples_metadata.tsv \
--o-visualization visual/pcoaplot_jaccard_1500.qzv

When the just generated qzv’s are uploaded to QIIME2 view, a 3D plot is shown that you can fully customize.

PCoA plot docs

Beta diversity metadata significance

To check what Beta diversity results are significant with regards to differences between metadata categories, use the command beta-group-significance from the diversity plugin. This command does almost the same as the alpha-group-significance one, but of course based on Beta diversity data. However, for this command you need to specify the categorical metadata column on which the statistical tests should be performed; otherwise, the process can become very computationally intensive. So, to calculate significance based on the categorical columns “age_group” and “sex” for both calculated Beta diversity matrices; use commands like the commands below:

qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_braycurtis_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column age_group \
--o-visualization visual/braycurtis_age_1500_sign.qzv

qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_braycurtis_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column sex \
--o-visualization visual/braycurtis_sex_1500_sign.qzv

qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_jaccard_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column age_group \
--o-visualization visual/jaccard_age_1500_sign.qzv

qiime diversity beta-group-significance \
--i-distance-matrix beta/samples_jaccard_1500.qza \
--m-metadata-file samples_metadata.tsv \
--m-metadata-column sex \
--o-visualization visual/jaccard_sex_1500_sign.qzv

The output shows if differences are significant for each different categorical metadata column.

Beta group significance docs

Core diversity

If you don’t have specific metrics that you want to include in your diversity analysis, or don’t feel like running multiple commands to get to the results, the diversity plugin of QIIME2 also offers the core-diversity command. This function calculates a few different commonly used diversity metrics at once:

qiime diversity core-metrics \
--i-table samples_freq.qza \
--p-sampling-depth 2000 \
--m-metadata-file samples_metadata.tsv \
--p-n-jobs 32 \
--output-dir core-metrics_2000

The output directory will contain a rarefied table to a depth of 2000 reads per sample and Alpha diversity matrices for indices observed features, Shannon, evenness and Jaccard. Bray-Curtis and Jaccard results are outputted for Beta diversity, including PCoA plots of these results.

Diversity core metrics docs

Other information

Fun fact: QIIME2 artifacts are nothing more than a compressed archive with a fancy name, so you can just unzip them to see the raw data they contain
Links to the documentation might not work in the future because the links are QIIME2 version specific

Now you know how to perform basic microbial diversity analyses by using QIIME2 in combination with Kraken2 outputs. Check the QIIME2 documentation for all the other analysis options available after you’ve imported your data into QIIME2!