NEWS & BLOGS

Heading

Enterotype Analysis Pipeline: Development and Implementation

Published on: 2025-03-15 | By: EDITOR


1. MAGs Generation Workflow

2. Taxonomic Relative Abundance Estimation via Kraken2 and Bracken

3. Enterotyping Analysis Pipeline


Gut Enterotyping Analysis Overview

This enterotyping analysis leverages the FelMGDB database and genus-level abundance data from 412 samples, quantified using the Kraken2 and Bracken pipeline. Below is the R-based implementation of the workflow. Additionally, you can utilize the FelMGDB database's online tool for analyzing feline gut microbiota, which provides a web-based version of this pipeline with consistent processes and results.


1. Data Import & Cleaning

Action: Load species abundance data and preprocess to remove noise.

  • r
     
    taxa_abundance <- read.csv("fraction_total_reads_genus.csv",  
                               row.names = 1,   # Use sample names as row IDs  
                               check.names = FALSE)  # Preserve special characters in column names  
    taxa_abundance[is.na(taxa_abundance)] <- 0  # Replace missing values with 0  
    taxa_abundance <- taxa_abundance[, colSums(taxa_abundance) > 0]  # Remove zero-sum columns (species)  
    taxa_abundance <- taxa_abundance[rowSums(taxa_abundance) > 0, ]   # Remove zero-sum rows (samples)  

2. Taxon Filtering

Action: Retain species meeting prevalence and abundance thresholds.

  • r
     
    filter_threshold <- function(x) {  
      (mean(x > 0) >= 0.10) & (sum(x) > 0.5)  # Keep species present in ≥10% of samples and total abundance >0.5%  
    }  
    keep_species <- apply(taxa_abundance, 2, filter_threshold)  # Apply threshold column-wise  
    taxa_filtered <- taxa_abundance[, keep_species]  

3. Hellinger Transformation

Action: Normalize abundance data to reduce dominance of highly abundant taxa.

  • r
     
    library(vegan)  
    taxa_hel <- decostand(taxa_filtered, method = "hellinger")  # Hellinger standardization  
    taxa_hel[is.na(taxa_hel)] <- 0  # Handle potential NA values post-transformation  

4. Bray-Curtis Distance Calculation

Action: Compute pairwise dissimilarity between samples.

  • r
     
    bray_dist <- vegdist(taxa_hel, method = "bray")  # Bray-Curtis dissimilarity matrix  

5. Optimal Cluster Number Determination

Action: Evaluate silhouette scores to identify the optimal number of enterotypes (k=2–5).

  • r
     
    library(cluster)  
    silhouette_scores <- sapply(2:5, function(k) {  
      pam_cluster <- pam(bray_dist, k = k)  
      mean(silhouette(pam_cluster$clustering, bray_dist)[, "sil_width"])  # Average silhouette width  
    })  
    optimal_k <- which.max(silhouette_scores) + 1  # Select k with highest score  

6. PAM Clustering

Action: Partition samples into enterotypes using Partitioning Around Medoids (PAM).

  • r
     
    set.seed(123)  # Ensure reproducibility  
    pam_result <- pam(bray_dist, k = optimal_k)  

7. Enterotype Annotation

Action: Assign readable labels to clusters.

  • r
     
    annotation_df <- data.frame(  
      Enterotype = factor(  
        pam_result$clustering,  
        levels = sort(unique(pam_result$clustering)),  
        labels = paste("Enterotype", sort(unique(pam_result$clustering)))  # Format labels as "Enterotype 1", etc.  
      ),  
      row.names = rownames(taxa_filtered)  # Match sample IDs  
    )