// Project
HDL Cholesterol Prediction Program
Final RMSE: 4.4611
CV R²: 0.728
Engineered Features: 20+
Base Learners: 7
About this project
A stacked ensemble machine learning model that predicts HDL cholesterol levels from 97 health variables across 1,200 individuals using the 2024 NHANES dataset. Built for the ASA South Florida Student Data Challenge, where it placed first in the undergraduate category and earned the Excellence Award as its sole recipient.
Key Features
Stacked Ensemble Model
7 diverse base learners — multiple XGBoost configs, LightGBM configs, Gradient Boosting, and Extra Trees — generate out-of-fold predictions that feed a Ridge regression meta-learner; the out-of-fold scheme prevents information leakage.
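The architecture above can be sketched end to end. This is a minimal stand-in, not the project's code: two sklearn regressors substitute for the seven base learners (the real ensemble also mixed XGBoost and LightGBM configurations), and `make_regression` substitutes for the NHANES data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the NHANES design matrix (hypothetical shapes).
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Two sklearn base learners stand in for the seven used in the project.
base_learners = [
    GradientBoostingRegressor(random_state=0),
    ExtraTreesRegressor(n_estimators=100, random_state=0),
]

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: each row is predicted only by models that
# never saw it during training, so the meta-features carry no leakage.
oof = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_learners]
)

# Ridge meta-learner trained exclusively on the held-out predictions.
meta = Ridge(alpha=1.0).fit(oof, y)

# At inference time the base learners are refit on all data and their
# predictions are passed through the meta-learner.
for model in base_learners:
    model.fit(X, y)
meta_features = np.column_stack([m.predict(X) for m in base_learners])
preds = meta.predict(meta_features)
```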
Feature Engineering
Created 20+ engineered features including waist-to-BMI ratio, macronutrient percentage breakdowns, saturated/mono-/polyunsaturated fat ratios, body composition × gender interactions, and quadratic terms for age and waist.
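A pandas sketch of the feature families listed above. Column names here are hypothetical stand-ins (NHANES uses coded names such as `BMXWAIST`/`BMXBMI`), and the three example rows are invented.

```python
import pandas as pd

# Toy rows with hypothetical column names standing in for NHANES codes.
df = pd.DataFrame({
    "waist_cm": [94.0, 81.5, 102.3],
    "bmi": [27.1, 22.4, 31.0],
    "age": [45, 33, 61],
    "is_male": [1, 0, 1],
    "kcal": [2300.0, 1850.0, 2600.0],
    "fat_g": [85.0, 60.0, 95.0],
    "sat_fat_g": [28.0, 18.0, 33.0],
})

# Ratio features: body shape and diet composition.
df["waist_to_bmi"] = df["waist_cm"] / df["bmi"]
df["pct_kcal_fat"] = df["fat_g"] * 9 / df["kcal"]   # 9 kcal per gram of fat
df["sat_fat_ratio"] = df["sat_fat_g"] / df["fat_g"]

# Interaction features: body composition × gender.
df["waist_x_male"] = df["waist_cm"] * df["is_male"]
df["bmi_x_male"] = df["bmi"] * df["is_male"]

# Quadratic terms for nonlinear age and waist effects.
df["age_sq"] = df["age"] ** 2
df["waist_sq"] = df["waist_cm"] ** 2
```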
EDA & Feature Analysis
Identified waist circumference (r = -0.60), gender (r = +0.52), and BMI (r = -0.48) as dominant predictors. Most dietary variables individually had weak correlations (|r| < 0.15–0.20).
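The screening step amounts to ranking predictors by absolute Pearson correlation with the target. A sketch on synthetic data (the generated correlations only loosely mimic the reported structure; this is not NHANES data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: waist negatively drives HDL, a "diet" column is noise.
waist = rng.normal(95, 15, n)
hdl = 55 - 0.3 * (waist - 95) + rng.normal(0, 8, n)
df = pd.DataFrame({"waist": waist, "noise_diet": rng.normal(0, 1, n), "hdl": hdl})

# Rank every predictor by |Pearson r| against the target.
corrs = df.drop(columns="hdl").corrwith(df["hdl"]).sort_values(key=abs, ascending=False)
print(corrs)
```

In the real analysis this ranking is what surfaced waist circumference, gender, and BMI as dominant and flagged most dietary variables as individually weak.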
5-Fold Cross-Validation
Tested Ridge regression, Random Forest, Gradient Boosting, XGBoost, and LightGBM as baselines before building the final stacked ensemble.
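The baseline comparison can be reproduced with `cross_val_score`. A sketch using synthetic data and three of the named models (XGBoost and LightGBM would slot into the same dict via their sklearn-compatible estimators):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=0)

baselines = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in baselines.items():
    # sklearn reports negated RMSE; flip the sign to get RMSE.
    scores = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_root_mean_squared_error")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f} RMSE")
```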
Research Report
4-page professional PDF with target distribution chart, feature correlation chart, stacking architecture diagram, feature importance plot, performance tables, and clinical interpretation.
Tech Stack
Development Roadmap
Challenges & Solutions
Noise-Perturbed Target Variable
The HDL cholesterol target had a noise floor of ~3.5–4.0 RMSE baked in, creating a hard ceiling on accuracy that no model could break through.
Focused on ensemble diversity and feature engineering rather than chasing lower error. The stacking architecture with diverse base learners squeezed a 2.3% improvement over the best single model, achieving 4.695 CV RMSE.
Weak Individual Predictors
95 predictor variables spanning dietary recall, demographics, and body measurements — but most dietary variables correlated with HDL at |r| < 0.20.
Engineered interaction features (body composition × gender, alcohol × gender/age) and ratio features (macronutrient percentages, fat ratios) to extract signal that individual variables couldn't capture alone.
Preventing Information Leakage in Stacking
Naively training base learners on the full dataset and then feeding their predictions to a meta-learner would leak target information, inflating validation scores and producing an overfit ensemble.
Used out-of-fold predictions — each base learner only generated predictions for validation folds it never trained on. The meta-learner (Ridge) was then trained exclusively on these held-out predictions, ensuring no data point's target influenced its own meta-features.
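The mechanics of that guarantee are easiest to see as an explicit fold loop. A minimal sketch for one base learner on synthetic data (names and shapes are illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
model = GradientBoostingRegressor(random_state=0)

oof = np.zeros(len(y))
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit on the training folds only, then predict the held-out fold:
    # row i's meta-feature never comes from a model that saw y[i].
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    oof[val_idx] = fold_model.predict(X[val_idx])
```

Stacking these `oof` columns across all base learners yields the leak-free matrix the meta-learner trains on.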
Meta-Learner Regularization Tuning
The Ridge meta-learner's alpha hyperparameter had to be tuned so the ensemble combined the 7 base learners' predictions without overfitting to their correlated outputs, especially since several XGBoost and LightGBM variants produced similar prediction distributions.
Swept alpha values from 0.01 to 50.0 under nested 5-fold CV on the out-of-fold matrix. StandardScaler normalized base learner predictions before Ridge fitting so no single model could dominate the ensemble weights through raw prediction scale.
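That sweep maps onto a `StandardScaler` + `Ridge` pipeline searched with `GridSearchCV`. A sketch on a fabricated out-of-fold matrix (seven correlated noisy copies of the target stand in for the real base learner predictions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
y = rng.normal(50, 10, n)

# Stand-in OOF matrix: 7 correlated base-learner predictions of y.
oof = np.column_stack([y + rng.normal(0, 4, n) for _ in range(7)])

# Scale meta-features so no base learner dominates via raw magnitude,
# then sweep Ridge alpha under an inner 5-fold CV.
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(
    pipe,
    {"ridge__alpha": [0.01, 0.1, 1.0, 5.0, 10.0, 50.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
grid.fit(oof, y)
print(grid.best_params_)
```

Larger alphas shrink the meta-weights toward zero, which is what keeps near-duplicate base learners from receiving unstable, offsetting coefficients.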