
Cancer Microbiome Drama

November 2024 (1686 Words, 10 Minutes)

The idea that microbial DNA within human tumors could help diagnose cancer gained a lot of attention with the 2020 Nature paper by Gregory Poore and colleagues. By analyzing unmapped reads from cancer sequencing data in The Cancer Genome Atlas (TCGA), they identified microbial patterns they claimed could classify cancer types with very high accuracy. This study, considered “groundbreaking,” quickly became a big point of interest and inspired further research, including a follow-up study by Hermida et al., which linked the microbiome to cancer treatment outcomes. Recent critiques from Steven Salzberg’s lab challenge the validity of the findings of Poore et al., arguing that the microbial signals were greatly overestimated due to a flawed approach.

As someone who has been working in cancer microbiome research myself, I was intrigued by Poore et al.’s approach and tried to recreate it—this time accounting for the methodological concerns raised by critics, including potential contamination and optimized removal of human reads.

When I explained the back-and-forth between labs in the cancer microbiome field to someone recently, they couldn’t help but laugh, saying it sounded like a soap opera with all its drama. In this blog post, I’ll provide an overview of the ongoing controversies and conflicts, along with some background information.

Main Drama

Poore et al.

Once upon a time there were Gregory Poore and his colleagues from the lab of Rob Knight …

In 2020, they published a paper called “Microbiome analyses of blood and tissues suggest cancer diagnostic approach” in Nature. In this paper they state that the microbiome found by metagenomic sequencing of various tumor types can be used to distinguish cancer types. They used data from The Cancer Genome Atlas (TCGA), a big cohort that characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. A quick overview of their methods:

Download reads that were not mapped (with BWA or Bowtie2, depending on analysis date) to a human reference genome (GRCh37 or GRCh38 depending on sequencing date) by TCGA from multiple cancer types
Perform taxonomic classification with Kraken2, based on a database with all known (including drafts) bacterial, archaeal and viral genomes
Decontaminate with four different approaches, resulting in four decontaminated datasets
Perform supervised normalization with Voom
Train stochastic gradient-boosting ML models based on the classified normalized data
Use the models to predict the cancer type of a sample based on its classified data

This paper was seen as “groundbreaking”, and Poore and a few of his colleagues even started a biotech company based on the results called “Micronoma”.

Hermida et al.

In 2022, Hermida et al. published a paper named “Predicting cancer prognosis and drug response from the tumor microbiome” in Nature Communications. This paper was a follow-up to the paper by Poore et al.: they used the results of Poore et al. to build machine learning (ML) models that could link the microbiome to treatment outcomes. These ML models could predict the treatment outcome with almost the same accuracy as ML models based on gene expression data.

Gihawi et al.

In 2023, two pre-prints written by Gihawi et al., members of the lab of Steven Salzberg, appeared on bioRxiv (Caution Regarding the Specificities of Pan-Cancer Microbial Structure and Major data analysis errors invalidate cancer microbiome findings). The first of the two pre-prints was published in January, and a more extended version was published in July. Both pre-prints were eventually published: the first one in Microbial Genomics, the second one in mBio. Both papers state that the results found by Poore et al. are invalid because they used very flawed methods, and that the same holds for the papers that build on their results (Hermida et al.). In brief, the main errors, according to them, are:

Not removing human reads from the TCGA data properly
Using a database that did not have a human reference genome included during taxonomic classification
Using a database with draft bacterial genomes (which are contaminated with human DNA) during taxonomic classification
Using normalized data where hidden signatures have been introduced for the ML models

In my opinion, the paper is a really interesting read so I would recommend reading it, but I also provide a summary below:

Summary: Major data analysis errors invalidate cancer microbiome findings

The first thing that stood out to Gihawi et al. about the Poore et al. paper was that genera that normally do not occur in humans were very important for the ML models
- For example: Neptunomonas, a bacterium that lives in oceans, and Fervidobacterium, a genus that only lives in high-temperature (>60 degrees Celsius) water
The study of Poore et al. began with downloading reads that were not aligned to a human reference genome by TCGA
- The alignment process used by TCGA is imperfect, and a lot of human reads were still included in the data they downloaded
- Gihawi et al. showed this by re-aligning a subset of the unmapped reads from TCGA to the CHM13 human genome with Bowtie2: between 15% and 51% of these reads (depending on the cancer type) matched the human reference
  - This means that, per sample, between 1.4M and 3.3M human reads were still present in data that Poore et al. assumed to be microbial
During taxonomic classification, Poore et al. used a database that included many draft genomes, no common vector sequences and no human reference genome
- It has been previously proven that these draft genomes are often contaminated with sequences derived from humans (Publication from the Salzberg lab: Human contamination in bacterial genomes has created thousands of spurious proteins)
- Because the data still contained millions of human reads, and some of those reads matched human sequences that were falsely labeled as microbial in the database, the number of bacterial classifications was greatly inflated
- This problem can be solved by including the human reference genome in the database, and using only complete bacterial genomes
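
A minimal sketch of this effect, assuming a toy exact k-mer lookup rather than Kraken2's real algorithm. All sequences, labels and the k-mer length are made up for illustration; the only point is that a human read gets a bacterial label when the database contains a human-contaminated draft genome and no human reference.

```python
# Toy illustration only: exact k-mer voting, NOT Kraken2's actual algorithm.
# Every sequence and name below is hypothetical.

K = 8  # k-mer length (real classifiers use much longer k-mers)


def kmers(seq, k=K):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_db(genomes):
    """Map each k-mer to the set of labels whose genome contains it."""
    db = {}
    for label, seq in genomes.items():
        for km in kmers(seq):
            db.setdefault(km, set()).add(label)
    return db


def classify(read, db):
    """Assign a read to the label with the most k-mer hits."""
    votes = {}
    for km in kmers(read):
        for label in db.get(km, ()):
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return "unclassified"
    return max(votes, key=votes.get)


HUMAN = "ACGTTGCAAGGCTTACGATCGGATCCTAGGCATTACG"
BACTERIUM_CLEAN = "TTGACCGGTTAACCGGATATCGCGTATAGCTAGCTA"
# A "draft" assembly that accidentally contains a stretch of human DNA.
BACTERIUM_DRAFT = BACTERIUM_CLEAN + HUMAN[5:25]

# Flawed setup: contaminated draft genome, no human reference.
db_flawed = build_db({"bacterium": BACTERIUM_DRAFT})
# Suggested setup: complete genome only, plus the human reference.
db_fixed = build_db({"bacterium": BACTERIUM_CLEAN, "human": HUMAN})

human_read = HUMAN[3:27]               # a leftover human read
bacterial_read = BACTERIUM_CLEAN[4:28]  # a genuine bacterial read

for name, read in [("human read", human_read), ("bacterial read", bacterial_read)]:
    print(name, "| flawed db:", classify(read, db_flawed),
          "| fixed db:", classify(read, db_fixed))
```

In this toy setup the genuine bacterial read is labeled the same way by both databases, while the leftover human read is only labeled correctly once the human reference is included. With millions of leftover human reads per sample, even a small amount of contamination in the database can add up to a lot of spurious bacterial counts.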
Gihawi et al. performed their own analysis on a subset of the data used by Poore et al., but with the suggested adjustments
- In 98.5% of all cases the read counts reported by Poore et al. were at least 10 times higher than what was found during the re-analysis
Gihawi et al. demonstrated how improperly removing human reads from the data can lead to such high over-estimations of bacterial read counts
For the machine learning approach, Poore et al. used normalized data to remove batch effects, but in the process of converting the raw counts, the data was accidentally tagged with distinct values
- These “tags” were specific for cancer types, which led to the ML models being trained on the tags and very high accuracy of the models
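
To get a feeling for how a normalization step can "tag" data, here is a small sketch of one simple way this kind of leakage can arise. It is not a reconstruction of the Voom-SNM pipeline used by Poore et al., which is more involved. It only uses the log-CPM transform that voom applies, log2((count + 0.5) / (library size + 1) * 1e6), on a hypothetical genus with zero reads in every sample. The cancer type names and library sizes are invented, and the assumption that library size differs between cancer types is mine.

```python
import math


def log_cpm(count, lib_size):
    """voom-style log2 counts per million, with its 0.5 and 1 offsets."""
    return math.log2((count + 0.5) / (lib_size + 1) * 1e6)


# Hypothetical samples: (cancer type, library size).
# Assumption: library sizes happen to differ between the two cancer types.
samples = [
    ("type_A", 1_000), ("type_A", 1_200), ("type_A", 900),
    ("type_B", 50_000), ("type_B", 60_000), ("type_B", 45_000),
]
labels = [cancer_type for cancer_type, _ in samples]

# The genus is absent everywhere: the raw counts carry no information at all.
raw_counts = [0 for _ in samples]
normalized = [log_cpm(c, lib) for c, (_, lib) in zip(raw_counts, samples)]

a_values = [v for v, lab in zip(normalized, labels) if lab == "type_A"]
b_values = [v for v, lab in zip(normalized, labels) if lab == "type_B"]

# A one-feature threshold "classifier", fitted and scored on the same samples.
threshold = (min(a_values) + max(b_values)) / 2


def accuracy(values, true_labels, cutoff):
    predicted = ["type_A" if v > cutoff else "type_B" for v in values]
    hits = sum(p == t for p, t in zip(predicted, true_labels))
    return hits / len(true_labels)


print("distinct raw values:", len(set(raw_counts)))
print("distinct normalized values:", len(set(normalized)))
print("accuracy on normalized zeros:", accuracy(normalized, labels, threshold))
```

The raw counts are identical for every sample, yet after the transform the zeros land on different values for the two groups, so a trivial threshold separates them. A model trained on such values can look very accurate without having learned anything about microbes.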

Rebuttal of Poore et al.

After the first pre-print of Gihawi et al. appeared, Poore et al. wrote an extensive rebuttal in which they replicated the main findings of their initial study using only genera with proven biological relevance. In addition, they created a GitHub repo with the code they used to find cancer type-specific signals in the data used by Gihawi et al. A newer version of this pre-print was published in February 2024: Robustness of cancer microbiome signals over a broad range of methodological variation. In their response, they reference another, more recent paper from their lab: Pan-cancer analyses reveal cancer-type-specific fungal ecologies and bacteriome interactions, in which they performed more or less the same analysis but with a focus on fungi instead of bacteria and viruses.

Retraction of Poore et al.

On 07-02-2024, an editor’s note was added to the Poore et al. paper that stated: “Readers are alerted that concerns have been raised about the data and conclusions presented in this article. Further editorial action will be taken once this matter has been resolved.”

On 26-06-2024, the paper was officially retracted. The retraction note states that due to the Gihawi et al. paper, concerns about the robustness of the microbial signatures were brought to the attention of the Editors. Due to peer review of the raised issues and the authors’ responses, it was concluded that some of the findings of the article are affected and that the conclusions are no longer supported. The note also references the Rebuttal of Poore et al.

On 04-07-2024, an editor’s note was also added to the Hermida et al. paper that stated: “Readers are alerted that the authors of this study have informed the Editors that the starting input for data processing was downloaded from the online repository referenced in an article that has been recently retracted ( https://doi.org/10.1038/s41586-024-07656-x ). The Editors are working with the authors to address this issue. Further editorial action will be taken once this matter has been resolved.”

Chronological overview of the papers

Pre-prints not included

Background information

Gregory Poore

Gregory Poore is a researcher in the lab of Rob Knight. Due to his work in the cancer microbiome field and a paper showing groundbreaking results, he made it onto the Forbes 30 Under 30 list in 2023. He has published 21 papers, most of which are related to microbiology. Google scholar

Micronoma

A San Diego-based biotech startup co-founded by Gregory Poore and colleagues, focusing on developing early cancer detection methods through microbiome-based biomarkers. They raised almost 18 million dollars to fund the company. The scientific foundation of this company is the Poore et al. paper “Microbiome analyses of blood and tissues suggest cancer diagnostic approach”. Since the paper was retracted, the website of Micronoma is no longer accessible.

Rob Knight

Rob Knight is a computational microbiologist based at the University of California, San Diego. He is among the top researchers in microbiome research and computational biology. He was involved in developing QIIME(2), UniFrac and UCHIME, and has over 430K citations. Google scholar

Steven Salzberg

Steven Salzberg, a professor at Johns Hopkins University, holds positions in both the Department of Biomedical Engineering and the Department of Computer Science. Salzberg is a pioneer in bioinformatics and genomics, with contributions to tools like Bowtie, the Tuxedo Suite for RNA-Seq analysis and Kraken(2). He has over 360K citations. Google scholar