1. MAGs Generation Workflow
2. Taxonomic Relative Abundance Estimation via Kraken2 and Bracken
3. Enterotyping Analysis Pipeline
Gut Enterotyping Analysis Overview
This enterotyping analysis leverages the FelMGDB database and genus-level abundance data from 412 samples, quantified using the Kraken2 and Bracken pipeline. Below is the R-based implementation of the workflow. Additionally, you can utilize the FelMGDB database's online tool for analyzing feline gut microbiota, which provides a web-based version of this pipeline with consistent processes and results.
1. Data Import & Cleaning
Action: Load species abundance data and preprocess to remove noise.
taxa_abundance <- read.csv("fraction_total_reads_genus.csv",
row.names = 1, # Use sample names as row IDs
check.names = FALSE) # Preserve special characters in column names
taxa_abundance[is.na(taxa_abundance)] <- 0 # Replace missing values with 0
taxa_abundance <- taxa_abundance[, colSums(taxa_abundance) > 0] # Remove zero-sum columns (species)
taxa_abundance <- taxa_abundance[rowSums(taxa_abundance) > 0, ] # Remove zero-sum rows (samples)
2. Taxon Filtering
Action: Retain species meeting prevalence and abundance thresholds.
filter_threshold <- function(x) {
(mean(x > 0) >= 0.10) & (sum(x) > 0.5) # Keep species present in ≥10% of samples and total abundance >0.5%
}
keep_species <- apply(taxa_abundance, 2, filter_threshold) # Apply threshold column-wise
taxa_filtered <- taxa_abundance[, keep_species]
3. Hellinger Transformation
Action: Normalize abundance data to reduce dominance of highly abundant taxa.
library(vegan)
taxa_hel <- decostand(taxa_filtered, method = "hellinger") # Hellinger standardization
taxa_hel[is.na(taxa_hel)] <- 0 # Handle potential NA values post-transformation
4. Bray-Curtis Distance Calculation
Action: Compute pairwise dissimilarity between samples.
bray_dist <- vegdist(taxa_hel, method = "bray") # Bray-Curtis dissimilarity matrix
5. Optimal Cluster Number Determination
Action: Evaluate silhouette scores to identify the optimal number of enterotypes (k=2–5).
library(cluster)
silhouette_scores <- sapply(2:5, function(k) {
pam_cluster <- pam(bray_dist, k = k)
mean(silhouette(pam_cluster$clustering, bray_dist)[, "sil_width"]) # Average silhouette width
})
optimal_k <- which.max(silhouette_scores) + 1 # Select k with highest score
6. PAM Clustering
Action: Partition samples into enterotypes using Partitioning Around Medoids (PAM).
set.seed(123) # Ensure reproducibility
pam_result <- pam(bray_dist, k = optimal_k)
7. Enterotype Annotation
Action: Assign readable labels to clusters.
annotation_df <- data.frame(
Enterotype = factor(
pam_result$clustering,
levels = sort(unique(pam_result$clustering)),
labels = paste("Enterotype", sort(unique(pam_result$clustering))) # Format labels as "Enterotype 1", etc.
),
row.names = rownames(taxa_filtered) # Match sample IDs
)