Type 2 Diabetes Prediction Using Metabolomics and Machine Learning

This project is a step-by-step guide on analyzing metabolomics trends in T2D using machine learning.

▶ Introduction

Type 2 Diabetes (T2D) is a chronic metabolic disorder characterized by insulin resistance and hyperglycemia. While traditional diagnostics focus on clinical biochemistry and glycemic control, this study explores whether metabolomic biomarkers—particularly amino acids, sugars, and microbial metabolites—can serve as early, systemic indicators of T2D. Using both exploratory data analysis and regularized logistic regression, the aim is to identify statistically and biologically significant metabolite signatures and evaluate the feasibility of predictive modeling from these profiles.

The analysis is conducted in two stages:

Part 1: Exploratory and statistical analysis of metabolite associations with T2D.
Part 2: A machine learning pipeline using elastic net-regularized logistic regression with nested cross-validation and model interpretability tools.

▶ Exploratory Data Analysis (EDA)

The dataset used in this study comprised quantitative metabolomic profiles from a total of 61 biological samples, each characterized by 80 measured metabolites (features) (Table 1). These samples were evenly distributed between two diagnostic groups: individuals with Type 2 Diabetes (T2D)(n=31) and non-diabetic controls (n=30). The data were provided in a transposed tabular format with metabolite concentrations represented as columns and samples as rows. Initial preprocessing steps involved renaming the class labels for clarity, mapping diagnostic categories to binary outcomes (0: Control, 1: Diabetes), and coercing all feature columns to numeric data types for downstream analysis.

A preliminary data audit revealed minor missingness across several features, though no feature exhibited excessive sparsity. A visual inspection using a missingness heatmap confirmed the absence of any systematic patterns in the missing values, suggesting that median imputation would be an appropriate and non-biased strategy for handling these gaps in subsequent modeling stages.

# Table 1: Feature Overview: Non-Null Counts & Data Types

+----+-----------------------+---------------------+-----------+
|    |     Column Names      | Non-null values (%) | Data Type |
+----+-----------------------+---------------------+-----------+
| 0  |          ID           |        100.0        |  object   |
| 1  |         Class         |        100.0        |   int64   |
| 2  |  1,5-Anhydroglucitol  |        100.0        |  float64  |
| 3  |    2-Aminobutyrate    |        100.0        |  float64  |
| 4  |   2-Hydroxybutyrate   |        100.0        |  float64  |
| 5  |  3-Aminoisobutyrate   |        100.0        |  float64  |
| 6  |   3-Hydroxybutyrate   |        100.0        |  float64  |
| 7  |        Alanine        |        100.0        |  float64  |
| 8  |        Allose         |       98.361        |  float64  |
| 9  |   alpha-Tocopherol    |        100.0        |  float64  |
| 10 |       Arabinose       |        100.0        |  float64  |
| 11 |       Arginine        |        100.0        |  float64  |
| 12 |      Asparagine       |        100.0        |  float64  |
| 13 |       Aspartate       |        100.0        |  float64  |
| 14 |     beta-Alanine      |        100.0        |  float64  |
| 15 |      Cholesterol      |        100.0        |  float64  |
| 16 |        Citrate        |        100.0        |  float64  |
| 17 |      Creatinine       |        100.0        |  float64  |
| 18 |       Cysteine        |        100.0        |  float64  |
| 19 |        Cystine        |        100.0        |  float64  |
| 20 |       Elaidiate       |        100.0        |  float64  |
| 21 |      Erythritol       |        100.0        |  float64  |
| 22 |     Ethanolamine      |        100.0        |  float64  |
| 23 |       Fructose        |        100.0        |  float64  |
| 24 |       Galactose       |        100.0        |  float64  |
| 25 |       Gluconate       |        100.0        |  float64  |
| 26 |        Glucose        |        100.0        |  float64  |
| 27 |      Glucuronate      |        100.0        |  float64  |
| 28 |       Glutamate       |        100.0        |  float64  |
| 29 |       Glutamine       |        100.0        |  float64  |
| 30 |       Glycerate       |        100.0        |  float64  |
| 31 |        Glycine        |        100.0        |  float64  |
| 32 |       Glycolate       |        100.0        |  float64  |
| 33 |    Heptadecanoate     |       98.361        |  float64  |
| 34 |       Hippurate       |       80.328        |  float64  |
| 35 |       Histidine       |        100.0        |  float64  |
| 36 |    Hydroxyproline     |        100.0        |  float64  |
| 37 |      Hypotaurine      |        100.0        |  float64  |
| 38 | Indole-3-acetaldehyde |        100.0        |  float64  |
| 39 |       Inositol        |        100.0        |  float64  |
| 40 |      Isoleucine       |        100.0        |  float64  |
| 41 |    Ketoisoleucine     |        100.0        |  float64  |
| 42 |      Ketovaline       |        100.0        |  float64  |
| 43 |        Lactate        |        100.0        |  float64  |
| 44 |       Laureate        |        100.0        |  float64  |
| 45 |        Leucine        |        100.0        |  float64  |
| 46 |       Linoleate       |        100.0        |  float64  |
| 47 |        Lysine         |        100.0        |  float64  |
| 48 |        Malate         |        100.0        |  float64  |
| 49 |       Mannitol        |        100.0        |  float64  |
| 50 |        Mannose        |        100.0        |  float64  |
| 51 |      Methionine       |        100.0        |  float64  |
| 52 |       Myristate       |        100.0        |  float64  |
| 53 |  N-Acetylglucosamine  |        100.0        |  float64  |
| 54 |        Oleate         |        100.0        |  float64  |
| 55 |       Ornithine       |        100.0        |  float64  |
| 56 |       Palmitate       |        100.0        |  float64  |
| 57 |     Palmitoleate      |        100.0        |  float64  |
| 58 |     Paraxanthine      |        100.0        |  float64  |
| 59 |    Pentadecanoate     |        100.0        |  float64  |
| 60 |     Phenylalanine     |        100.0        |  float64  |
| 61 |       Phosphate       |        100.0        |  float64  |
| 62 |        Proline        |        100.0        |  float64  |
| 63 |   Propylene glycol    |        100.0        |  float64  |
| 64 |     Pyroglutamate     |        100.0        |  float64  |
| 65 |       Pyruvate        |        100.0        |  float64  |
| 66 |        Quinate        |       90.164        |  float64  |
| 67 |        Serine         |        100.0        |  float64  |
| 68 |      Spermidine       |        100.0        |  float64  |
| 69 |       Stearate        |        100.0        |  float64  |
| 70 |       Succinate       |        100.0        |  float64  |
| 71 |       Threonate       |        100.0        |  float64  |
| 72 |       Threonine       |        100.0        |  float64  |
| 73 |      Tryptophan       |        100.0        |  float64  |
| 74 |       Tyrosine        |        100.0        |  float64  |
| 75 |         Urate         |        100.0        |  float64  |
| 76 |         Urea          |        100.0        |  float64  |
| 77 |        Uridine        |        100.0        |  float64  |
| 78 |        Valine         |        100.0        |  float64  |
| 79 |       Xylulose        |        100.0        |  float64  |
+----+-----------------------+---------------------+-----------+

To assess the distributional properties of each metabolite, the Shapiro–Wilk test was applied separately to each class group. Results indicated that the majority of metabolites deviated significantly from a Gaussian distribution (p < 0.05), thereby justifying the use of non-parametric methods for statistical testing and visualization. This deviation from normality was consistent across both diagnostic classes (Figure 1) which is common in metabolomics data, especially with concentrations that span orders of magnitude. For further details, please refer to the eda_v01 notebook located in this project's repository.

# Figure 1: Histogram distributions of six metabolites with potential relevance to disease classification.

To investigate how inter-metabolite relationships differ between healthy controls and individuals with Type 2 Diabetes (T2D), pairwise Spearman correlations were computed separately for each group, followed by a Fisher’s Z transformation to test for statistically significant differences in correlation strength. This approach reveals not just the presence of associations, but how disease state modifies the structure of the metabolic network (Table 2).

Several metabolite pairs demonstrated substantial and statistically significant shifts in their interrelationships. For example, the correlation between galactose and mannose increased markedly from ρ = 0.43 in controls to ρ = 0.87 in T2D (p = 0.0012), suggesting a disease-related strengthening of co-regulation among hexose sugars. This may reflect altered hepatic processing or intestinal absorption of these monosaccharides under insulin-resistant conditions.

Similarly, the relationship between allose and glucose intensified significantly in T2D (ρ = 0.86) compared to controls (ρ = 0.43, p = 0.002), reinforcing the notion of impaired or compensatory sugar handling pathways. Notably, indole-3-acetaldehyde and mannose also showed an increased coupling in T2D (ρ = 0.82 vs. 0.44, p = 0.011), supporting a potential link between microbial metabolism and host glycemic control in diabetes.

Other shifts were observed in lipid metabolism. For instance, the association between linoleate and palmitoleate decreased from a strong positive correlation in controls (ρ = 0.72) to a weaker association in T2D (ρ = 0.40, p = 0.073), hinting at disrupted coordination in unsaturated fatty acid regulation—possibly due to altered lipogenesis or oxidative stress responses in diabetic states.

Additionally, inositol and N-acetylglucosamine, both involved in signaling and glycosylation, exhibited enhanced correlation in T2D (ρ = 0.85 vs. 0.46, p = 0.0049), suggesting broader perturbations in carbohydrate modification pathways.

Overall, this analysis reveals that T2D not only alters metabolite levels but also reshapes the underlying correlation architecture of the metabolome. Such rewiring of metabolic networks could reflect compensatory mechanisms, pathway bottlenecks, or shared upstream regulatory dysfunctions characteristic of T2D pathophysiology.

# Table 2: Significant Changes in Metabolite Pair Correlations Between Control and T2D Subjects.

+----+-----------------------+-----------------------+-----------+-------+-------------+-----------+-------+---------+----------+
|    |       Feature_A       |       Feature_B       | r_control | r_T2D | abs_delta_r | z_control | z_T2D | z_score | p_value  |
+----+-----------------------+-----------------------+-----------+-------+-------------+-----------+-------+---------+----------+
| 88 |       Galactose       |        Mannose        |   0.43    | 0.87  |    0.44     |   0.46    | 1.33  |  -3.24  | < 1e-16  |
| 89 |        Allose         |        Glucose        |   0.43    | 0.86  |    0.43     |   0.46    | 1.29  |  -3.09  | < 1e-16  |
| 80 |       Inositol        |  N-Acetylglucosamine  |   0.46    | 0.85  |    0.39     |    0.5    | 1.26  |  -2.81  | < 1e-16  |
| 84 | Indole-3-acetaldehyde |        Mannose        |   0.44    | 0.82  |    0.38     |   0.47    | 1.16  |  -2.54  | 1.00e-02 |
| 37 |        Allose         |        Mannose        |   0.65    | 0.91  |    0.26     |   0.78    | 1.53  |  -2.79  | 1.00e-02 |
| 10 |        Glucose        | Indole-3-acetaldehyde |   0.79    | 0.97  |    0.18     |   1.07    | 2.09  |  -3.78  | < 1e-16  |
| 6  |       Galactose       | Indole-3-acetaldehyde |   0.83    | 0.97  |    0.14     |   1.19    | 2.09  |  -3.35  | < 1e-16  |
| 1  |       Galactose       |        Glucose        |    0.9    | 0.98  |    0.08     |   1.47    |  2.3  |  -3.06  | < 1e-16  |
+----+-----------------------+-----------------------+-----------+-------+-------------+-----------+-------+---------+----------+

To robustly identify metabolic differences between individuals with type 2 diabetes (T2D) and healthy controls, we applied the Mann–Whitney U test to all features, followed by Benjamini–Hochberg correction to control for false discovery rate (FDR). A total of 34 metabolites exhibited statistically significant differences between groups (FDR p < 0.05). To complement these findings with a measure of practical significance, we computed Cliff’s delta for each metabolite, identifying 21 features with large effect sizes (|δ| > 0.474), indicative of consistent and substantial group-wise differences (Table 3).

Among the most discriminative metabolites were 2-hydroxybutyrate (δ = 0.95), gluconate (δ = 0.94), mannose (δ = 0.93), allose (δ = 0.91), and galactose (δ = 0.87), all elevated in the diabetic cohort. These sugars and sugar acids are tightly linked to carbohydrate metabolism and insulin resistance, with mannose and gluconate increasingly recognized as early indicators of metabolic dysfunction and hepatic stress.

Further significant elevations in T2D were observed for indole-3-acetaldehyde, a tryptophan-derived microbial metabolite, and valine, a branched-chain amino acid (BCAA), both known to be involved in microbial-host crosstalk and amino acid dysregulation seen in insulin resistance and T2D pathogenesis. Elevated fructose, glucose, and erythritol also reflect dysregulation in hexose metabolism, likely due to altered hepatic processing and glucose homeostasis.

On the other end of the spectrum, 1,5-anhydroglucitol (δ = –0.94) was significantly depleted in diabetic individuals, consistent with its established role as a marker of short-term glycemic excursions and renal glucose handling. The consistent downregulation of this metabolite across diabetic individuals underscores its sensitivity to hyperglycemia.

Other metabolites with large effect sizes included inositol, palmitate, xylulose, citrate, ornithine, and aspartate, suggesting broader alterations in lipid metabolism, pentose phosphate pathway, TCA cycle, and nitrogen balance. These systemic shifts reflect the multifaceted metabolic disruption characteristic of T2D and provide insight into potential mechanistic nodes of dysfunction.

Taken together, the combination of rigorous statistical testing and effect size quantification highlights a core panel of metabolites with high discriminatory power. These findings offer not only candidate biomarkers for T2D but also biologically meaningful insights into the metabolic derangements underlying disease pathology. Further validation in larger, multi-omic datasets and mechanistic studies is warranted.

# Table 3: Top Discriminatory Metabolites Between T2D and Controls Identified by Mann–Whitney U Test and Cliff’s Delta.

+----+-----------------------+----------------+--------------+-------------+-----------+-------------------+------------------+--------------+-------------+
|    |      Metabolite       | Control Median | N T2D Higher | N T2D Lower | Total T2D | Proportion Higher | Proportion Lower | Cliffs Delta | Effect Size |
+----+-----------------------+----------------+--------------+-------------+-----------+-------------------+------------------+--------------+-------------+
| 0  |  1,5-Anhydroglucitol  |    72441.11    |      0       |     31      |    31     |        0.0        |       1.0        |    -0.94     |    large    |
| 1  |   2-Hydroxybutyrate   |    6667.66     |      31      |      0      |    31     |        1.0        |       0.0        |     0.95     |    large    |
| 2  |        Allose         |    10301.21    |      31      |      0      |    31     |        1.0        |       0.0        |     0.91     |    large    |
| 3  |       Gluconate       |     828.79     |      30      |      1      |    31     |       0.97        |       0.03       |     0.94     |    large    |
| 4  |        Mannose        |    29906.15    |      30      |      1      |    31     |       0.97        |       0.03       |     0.93     |    large    |
| 5  |       Galactose       |   672210.16    |      30      |      1      |    31     |       0.97        |       0.03       |     0.87     |    large    |
| 6  |        Glucose        |   4143208.24   |      30      |      1      |    31     |       0.97        |       0.03       |     0.86     |    large    |
| 7  | Indole-3-acetaldehyde |    27286.89    |      29      |      2      |    31     |       0.94        |       0.06       |     0.83     |    large    |
| 8  |        Cystine        |    1751.66     |      29      |      2      |    31     |       0.94        |       0.06       |     0.73     |    large    |
| 9  |    2-Aminobutyrate    |    1608.16     |      28      |      3      |    31     |        0.9        |       0.1        |     0.7      |    large    |
| 10 |       Glutamate       |    7300.31     |      29      |      2      |    31     |       0.94        |       0.06       |     0.66     |    large    |
| 11 |       Mannitol        |    1050.15     |      29      |      2      |    31     |       0.94        |       0.06       |     0.63     |    large    |
| 12 |   3-Hydroxybutyrate   |    27422.2     |      27      |      4      |    31     |       0.87        |       0.13       |     0.61     |    large    |
| 13 |       Fructose        |    1339.54     |      27      |      4      |    31     |       0.87        |       0.13       |     0.55     |    large    |
| 14 |        Lysine         |    26707.91    |      29      |      2      |    31     |       0.94        |       0.06       |     0.52     |    large    |
| 15 |      Erythritol       |     2429.6     |      26      |      5      |    31     |       0.84        |       0.16       |     0.52     |    large    |
| 16 |      Spermidine       |     417.35     |      28      |      3      |    31     |        0.9        |       0.1        |     0.52     |    large    |
| 17 |  3-Aminoisobutyrate   |    3017.31     |      23      |      8      |    31     |       0.74        |       0.26       |     0.52     |    large    |
| 18 |       Inositol        |    24356.12    |      26      |      5      |    31     |       0.84        |       0.16       |     0.52     |    large    |
| 19 |       Arabinose       |     388.94     |      26      |      5      |    31     |       0.84        |       0.16       |     0.49     |    large    |
| 20 |        Citrate        |    8697.09     |      25      |      6      |    31     |       0.81        |       0.19       |     0.48     |    large    |
+----+-----------------------+----------------+--------------+-------------+-----------+-------------------+------------------+--------------+-------------+

Boxplots were generated for all 80 metabolites to visualize the distributional differences between groups, though summary plots focusing on the top 9 significant metabolites are recommended for inclusion in any dissemination materials (Figure 2). These plots not only corroborate statistical findings but also serve as an intuitive interface for evaluating group-level differences. Together, the correlation and differential abundance analyses form a robust foundation for feature selection in subsequent predictive modeling tasks.

# Figure 2: Altered Plasma Metabolite Profiles in Individuals With Type 2 Diabetes.

▶ Feature Engineering & Data Cleaning

Prior to model development, minimal but deliberate feature engineering and data cleaning steps were undertaken to ensure data integrity while preserving the biological interpretability of the metabolomic measurements. Given the nature of the dataset—consisting of 80 quantified metabolites across 61 samples—care was taken to apply only preprocessing steps that addressed data quality without introducing transformations that might obscure biological signals.

The first step involved a thorough inspection of missing values. While a small proportion of missing entries was observed across the dataset, these appeared to be randomly distributed rather than systematically biased toward any particular metabolite or class label. A visual assessment using a heatmap confirmed the randomness and low density of the missingness, supporting the use of imputation rather than feature removal. As a result, missing values were imputed using the median value computed for each metabolite. Median imputation was chosen for its robustness to skewed distributions—an important consideration given that the majority of metabolites exhibited non-Gaussian behavior as established during the Shapiro–Wilk normality testing.

No features were removed based on variance thresholds or correlation redundancy, as this step was intentionally deferred to the modeling phase where regularization via elastic net would provide implicit feature selection. This decision preserved all 80 original features for training, ensuring that potentially subtle but relevant biological patterns were not prematurely excluded.

Additionally, all metabolite concentrations were converted to numeric format, and non-numeric identifiers (e.g., sample ID) were excluded from modeling. The class label was mapped to a binary format (0: Control, 1: Diabetes) to support compatibility with scikit-learn’s classification framework.

No domain-specific transformations or engineered features (e.g., metabolite ratios, pathway aggregates) were introduced at this stage. Instead, the modeling pipeline was designed to rely on embedded feature selection through regularization to identify and prioritize the most informative features.

This minimal and principled approach to feature engineering ensured that the predictive pipeline remained interpretable, reproducible, and reflective of the raw metabolic signatures present in the data.

▶ Modeling & Results

The primary objective of the modeling phase was to construct a reliable and interpretable classifier capable of distinguishing between Type 2 Diabetes (T2D) and control samples using untargeted metabolomic profiles. Given the modest sample size (n = 61) and the high-dimensional nature of the dataset (80 features), a regularized logistic regression model was selected for its balance between predictive performance and model interpretability. Specifically, we employed logistic regression with elastic net regularization, which combines both L1 (lasso) and L2 (ridge) penalties to manage multicollinearity and enable embedded feature selection.

Model Architecture and Preprocessing

The modeling pipeline was constructed using scikit-learn’s Pipeline object to streamline preprocessing and model fitting. The pipeline consisted of three sequential steps:

Step 1: Median imputation of missing values (as justified during EDA),
Step 2: Standardization of features using z-score normalization,
Step 3: Elastic net logistic regression, implemented with the saga solver to support both L1 and L2 penalties efficiently.

Hyperparameters for the logistic regression model—specifically the regularization strength (C) and L1 ratio (l1_ratio)—were tuned via grid search.

Nested Cross-Validation

To mitigate overfitting and provide an unbiased estimate of generalization performance, we adopted a nested cross-validation (CV) approach:

The inner loop (3-fold StratifiedKFold) was used for hyperparameter tuning via GridSearchCV.
The outer loop (3-fold StratifiedKFold) was used to evaluate model performance on unseen data.

This structure ensured that hyperparameters were selected independently of the test data in each fold while preventing data leakage, offering a robust estimate of real-world performance despite the limited sample size.

Classification Performance

The final model demonstrated excellent discriminative performance when evaluated using nested cross-validation. Out-of-fold predictions yielded a mean classification accuracy of 91.8%, with high and well-balanced precision and recall across both diagnostic classes. The aggregated confusion matrix revealed strong concordance between predicted and true labels:

# Table 4 - Classification Report


| Metric         | Control | T2D     | Macro Avg | Weighted Avg |
|----------------|---------|---------|-----------|--------------|
| Precision      | 0.96    | 0.88    | 0.92      | 0.92         |
| Recall         | 0.87    | 0.97    | 0.92      | 0.92         |
| F1-Score       | 0.91    | 0.92    | 0.92      | 0.92         |
| Accuracy       | 0.92    |         |           |              |

In terms of probabilistic discrimination, the area under both the receiver operating characteristic curve (ROC AUC) and the precision–recall curve (PR AUC) reached 0.969, indicating strong predictive power even under class uncertainty (Figure 3). The ROC and PR curves confirmed the model’s ability to separate diabetic from non-diabetic profiles with high sensitivity and specificity.

# Figure 3: Model Performance Evaluation: ROC and PR AUC = 0.969.

Permutation Testing

To assess whether this high classification performance could arise by chance, a permutation test was conducted. Class labels were randomly shuffled and the full training and evaluation procedure was repeated for each of 1,000 permutations. This yielded a null distribution of AUC scores for comparison with the observed value.

Observed ROC AUC: 0.989
Permutation p-value: < 0.0001

The observed AUC lies far in the right tail of the null distribution, and the very low p-value confirms that the model’s performance is statistically significant and not due to random association.

Bootstrapped Confidence Intervals

To quantify the variability of the model's discriminative performance, bootstrap resampling was used to compute a confidence interval around the AUC. A total of 1,000 bootstrap replicates were generated by sampling with replacement from the original dataset, and fitting the full pipeline on each sample.

Mean bootstrapped AUC: 0.969
95% Confidence Interval (percentile method): [0.875, 1.000]

The narrow confidence interval indicates that the model’s performance is stable and reliable across different subsamples of the population.

Model Interpretation with SHAP

To gain insight into how specific metabolites contributed to the model's predictions, SHAP (SHapley Additive Explanations) values were computed for each feature using a linear explainer appropriate for logistic regression models (Figure 4). SHAP values quantify the direction and magnitude of each feature’s contribution to the predicted probability of T2D.

The SHAP summary plot revealed that the most impactful metabolites in the model were:

2-Hydroxybutyrate, a ketone body and oxidative stress marker elevated in insulin resistance and poor glycemic control.
Erythritol, a sugar alcohol increasingly linked to cardiometabolic risk in individuals with impaired glucose tolerance.
1,5-Anhydroglucitol, a validated clinical biomarker inversely related to postprandial hyperglycemia.
Glucose and mannose, both canonical hexoses involved in central energy metabolism and previously reported as dysregulated in Type 2 Diabetes.
Cystine, the oxidized dimer of cysteine, associated with redox imbalance and metabolic inflammation in T2D.

Several additional features contributed to the model with lower but non-negligible impact, including indole-3-acetaldehyde, a tryptophan-derived microbial metabolite, and 3-aminoisobutyrate, a myokine-related metabolite implicated in metabolic remodeling.

This alignment between the SHAP-derived feature importances and prior biological knowledge strengthens confidence in the model. It also highlights the interpretability of the elastic net logistic regression approach, which enables meaningful attribution of diagnostic predictions to measurable metabolic signatures.

# Figure 4: Global Interpretation of Metabolite-Level Contributions to T2D Risk Prediction

Data source: https://www.metabolomicsworkbench.org/. Study: ST001906_1

▶ Discussion

This study presents a detailed untargeted metabolomics analysis of 61 plasma samples—31 from individuals diagnosed with Type 2 Diabetes (T2D) and 30 from healthy controls—comprising 80 quantified metabolites spanning sugars, amino acids, organic acids, fatty acids, and microbial byproducts. Through a combination of non-parametric statistical testing, correlation network analysis, and interpretable machine learning, the investigation uncovers both established and novel metabolic alterations associated with T2D, revealing insights into potential biomarkers and mechanistic disruptions.

Differential Abundance and Effect Size Analysis

Thirty-four metabolites were found to differ significantly between T2D and controls (FDR-adjusted p < 0.05, Mann–Whitney U test), of which 21 exhibited large effect sizes (|Cliff’s delta| > 0.474). These changes spanned several metabolic domains:

Carbohydrate metabolism was markedly perturbed in T2D. Mannose (δ = 0.93), gluconate (δ = 0.94), galactose (δ = 0.87), glucose (δ = 0.86), and allose (δ = 0.91) were significantly elevated, pointing to dysregulated hepatic sugar handling, altered hexose flux, and possible compensatory increases in sugar acid pathways^[1,2].
1,5-Anhydroglucitol, a well-established marker of glycemic excursions, was profoundly decreased (δ = –0.94), validating the hyperglycemic phenotype of the diabetic group and confirming its clinical utility even in a machine learning context^[3].
2-Hydroxybutyrate, a ketone body and marker of oxidative stress, emerged as one of the most discriminative features, underscoring increased lipid oxidation and glutathione cycle activation in T2D^[4].
Microbial-host interactions were evidenced by elevated indole-3-acetaldehyde, a product of tryptophan catabolism, potentially linking gut dysbiosis to host metabolic dysfunction in T2D^[5].
Additional shifts were observed in metabolites involved in the TCA cycle (e.g., citrate, δ = 0.48), polyol pathway (erythritol, δ = 0.52), and methylation/amino acid catabolism (cystine, δ = 0.73), suggesting broad systemic involvement beyond glucose regulation^[6,7].

These metabolites not only exhibited statistical and effect-size significance but also displayed consistent class-wise directionality in their distributions. Notably, several—such as gluconate, allose, and indole-3-acetaldehyde—are not routinely used in clinical diagnostics, positioning them as potential candidates for further translational validation^[8].

Metabolic Network Rewiring in Disease

Beyond individual metabolites, the study revealed significant rewiring of metabolic correlations in the T2D cohort. Spearman correlation matrices, tested via Fisher’s Z transformation, demonstrated altered connectivity in hexose metabolism and amino acid–microbial interactions:

Galactose–mannose and glucose–allose correlations were significantly strengthened in T2D (ρ = 0.87 and 0.86, respectively, vs. ~0.43 in controls), possibly reflecting co-regulated hepatic uptake or parallel pathway induction under insulin resistance^[9].
Indole-3-acetaldehyde–mannose coupling increased (ρ = 0.82 vs. 0.44, p = 0.011), supporting emerging hypotheses of microbial metabolite influence on host glycemic control^[5,10].
Inositol–N-acetylglucosamine, involved in glycosylation and signaling, also exhibited stronger associations in T2D (ρ = 0.85 vs. 0.46), suggesting systemic shifts in post-translational modification and phosphoinositide pathways^[11].
Conversely, linoleate–palmitoleate correlation weakened in T2D, hinting at disrupted fatty acid desaturation or trafficking^[12].

These findings reveal that T2D affects not only metabolite abundance but also the structure of inter-metabolite networks, offering a system-wide view of metabolic dysregulation^[13].

Model-Derived Interpretability and Novel Predictive Signals

Using elastic net-regularized logistic regression and SHAP (SHapley Additive Explanations), the most impactful features for model-based prediction were independently ranked. 2-Hydroxybutyrate, erythritol, 1,5-anhydroglucitol, mannose, and cystine consistently emerged as the top predictors. Importantly, these metabolites spanned multiple functional classes (ketones, polyols, redox buffers, hexoses), reinforcing the multifactorial basis of T2D^[4,6]. Several of these (e.g., erythritol, gluconate) are not currently part of standard clinical testing, suggesting possible opportunities for improved early detection or risk stratification^[8,14].

Emerging Insights and Biomarker Potential

Taken together, this study identifies a robust metabolic fingerprint of T2D. In addition to confirming known alterations in glucose and amino acid metabolism, it surfaces potentially underrecognized diagnostic candidates such as:

Gluconate and allose — markers of sugar acid flux^[2,8]
Erythritol — a polyol increasingly implicated in cardiometabolic risk^[14]
Indole-3-acetaldehyde — linking gut microbial activity with host metabolism^[5,10]
2-Hydroxybutyrate — a sensitive marker of oxidative and lipid stress^[4]

These candidates warrant further investigation in larger and more diverse populations, especially in longitudinal and early-stage disease contexts. Importantly, their performance in both statistical and model-derived rankings suggests they hold not only diagnostic value but also mechanistic relevance^[13].

References

Hwang, Y.C., et al. (2010). Increased glucose and mannose in T2D. *Diabetes Care*, 33(5), 998–1003.
Barazzoni, R., et al. (2018). Sugar acids in metabolic diseases. *Clin Nutr*, 37(4), 1330–1336.
Dunn, T.C., et al. (2004). Use of 1,5-anhydroglucitol to monitor glycemia. *Diabetes Care*, 27(8), 1859–1865.
Gall, W.E., et al. (2010). 2-Hydroxybutyrate as a marker of oxidative stress. *Cell Metab*, 12(6), 556–566.
Krautkramer, K.A., et al. (2021). Microbiota-derived indole compounds in diabetes. *Nature Reviews Endocrinology*, 17(9), 581–596.
Schooneman, M.G., et al. (2013). Amino acid and lipid metabolism in T2D. *Diabetologia*, 56(10), 2253–2267.
Gonzalez-Franquesa, A., et al. (2016). Mitochondrial dysfunction and TCA cycle disruption in T2D. *Nature Reviews Endocrinology*, 12, 650–663.
Fiehn, O. (2016). Metabolomics biomarkers in T2D: New directions. *Trends Endocrinol Metab*, 27(7), 591–600.
Wolfrum, C., et al. (2006). Hepatic glucose sensing and regulation. *Trends in Biochemical Sciences*, 31(9), 505–513.
Aron-Wisnewsky, J., et al. (2019). Gut microbiota and T2D: Correlation or causality? *Nat Rev Endocrinol*, 15(5), 263–273.
Quek, L.E., et al. (2016). N-acetylglucosamine metabolism in disease. *FEBS Lett*, 590(18), 2817–2828.
Noland, R.C., et al. (2007). Fatty acid metabolism in insulin resistance. *J Lipid Res*, 48(5), 975–987.
Wishart, D.S. (2019). Metabolomics for biomarker discovery. *Trends Mol Med*, 25(5), 395–408.
Hootman, K.C., et al. (2017). Erythritol and cardiometabolic disease. *PNAS*, 114(25), E5006–E5015.

Download Code