// Project

HDL Cholesterol Prediction Program

Python · Machine Learning · Data Science · XGBoost · Scikit-Learn

4.4611

Final RMSE

0.728

CV R²

20+

Engineered Features

7

Base Learners

About this project

A stacked ensemble machine learning model that predicts HDL cholesterol levels from 97 health variables across 1,200 individuals in the 2024 NHANES dataset. Built for the ASA South Florida Student Data Challenge, where it placed first in the undergraduate category and earned the Excellence Award as its sole recipient.

Key Features

Stacked Ensemble Model

7 diverse base learners — multiple XGBoost configs, LightGBM configs, Gradient Boosting, and Extra Trees — generating out-of-fold predictions fed into a Ridge regression meta-learner to prevent information leakage.
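A minimal sketch of this architecture using scikit-learn's `StackingRegressor`, whose `cv` parameter generates the out-of-fold meta-features internally. The base learners and synthetic data below are placeholders, not the project's actual seven configs (which also included XGBoost and LightGBM variants).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge

# Synthetic stand-in for the NHANES feature matrix.
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

# Placeholder base learners; the real model used multiple XGBoost and
# LightGBM configs alongside Gradient Boosting and Extra Trees.
base_learners = [
    ("gbt_shallow", GradientBoostingRegressor(max_depth=2, random_state=0)),
    ("gbt_deep", GradientBoostingRegressor(max_depth=4, random_state=0)),
    ("extra_trees", ExtraTreesRegressor(n_estimators=100, random_state=0)),
]

# cv=5 means each base learner's meta-features are out-of-fold predictions,
# so the Ridge meta-learner never sees leaked target information.
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X, y)
print(stack.predict(X).shape)
```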

Feature Engineering

Created 20+ engineered features, including a waist-to-BMI ratio, macronutrient percentage breakdowns, saturated/mono-/polyunsaturated fat ratios, body composition × gender interactions, and quadratic terms for age and waist circumference.
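The ratio, interaction, and quadratic features can be derived with a few pandas expressions. The column names below are hypothetical stand-ins for the NHANES variables, not the project's actual identifiers:

```python
import pandas as pd

# Toy frame with hypothetical column names standing in for NHANES variables.
df = pd.DataFrame({
    "waist_cm": [88.0, 102.5, 76.0],
    "bmi": [24.1, 31.3, 21.8],
    "age": [34, 58, 45],
    "is_male": [1, 0, 1],
    "kcal": [2200.0, 1800.0, 2500.0],
    "fat_g": [80.0, 60.0, 95.0],
    "sat_fat_g": [25.0, 20.0, 30.0],
})

# Ratio features
df["waist_to_bmi"] = df["waist_cm"] / df["bmi"]
df["fat_pct_kcal"] = df["fat_g"] * 9 / df["kcal"]   # fat supplies ~9 kcal/g
df["sat_fat_ratio"] = df["sat_fat_g"] / df["fat_g"]

# Interaction and quadratic terms
df["waist_x_male"] = df["waist_cm"] * df["is_male"]
df["age_sq"] = df["age"] ** 2
df["waist_sq"] = df["waist_cm"] ** 2

print(df.filter(like="waist").round(2))
```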

EDA & Feature Analysis

Identified waist circumference (r = -0.60), gender (r = +0.52), and BMI (r = -0.48) as the dominant predictors. Most dietary variables showed weak individual correlations (|r| below 0.15–0.20).
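Ranking predictors by absolute correlation with the target is a one-liner in pandas. The data here is synthetic, constructed only to mimic the reported correlation directions (waist negative, male gender positive):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data mimicking the reported correlation signs, not the real NHANES values.
waist = rng.normal(95, 15, n)
male = rng.integers(0, 2, n).astype(float)
hdl = 55 - 0.4 * (waist - 95) + 8 * male + rng.normal(0, 6, n)

df = pd.DataFrame({"hdl": hdl, "waist_cm": waist, "is_male": male})

# Correlation of every predictor with HDL, sorted by magnitude.
corr = df.corr()["hdl"].drop("hdl").sort_values(key=np.abs, ascending=False)
print(corr.round(2))
```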

5-Fold Cross-Validation

Tested Ridge regression, Random Forest, Gradient Boosting, XGBoost, and LightGBM as baselines before building the final stacked ensemble.
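The baseline comparison can be sketched as a loop over models scored with 5-fold cross-validated RMSE. XGBoost and LightGBM are omitted here to keep the sketch dependency-free; the data is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in; XGBRegressor and LGBMRegressor would slot in the same way.
X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=1)

baselines = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=1),
    "gradient_boosting": GradientBoostingRegressor(random_state=1),
}

for name, model in baselines.items():
    # 5-fold CV; sklearn reports negated errors, so flip the sign for RMSE.
    scores = -cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```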

Research Report

4-page professional PDF with target distribution chart, feature correlation chart, stacking architecture diagram, feature importance plot, performance tables, and clinical interpretation.

Tech Stack

Python · XGBoost · LightGBM · Scikit-Learn · Pandas · NumPy · Matplotlib

Development Roadmap

Data Acquisition & EDA
Feature Engineering
Model Selection & Baseline
Stacked Ensemble Architecture
Results & Competition Outcome

Challenges & Solutions

Noise-Perturbed Target Variable

The HDL cholesterol target had a noise floor of ~3.5–4.0 RMSE baked in, creating a hard ceiling on accuracy that no model could break through.

Focused on ensemble diversity and feature engineering rather than chasing lower error. The stacking architecture with diverse base learners squeezed a 2.3% improvement over the best single model, achieving 4.695 CV RMSE.

Weak Individual Predictors

95 predictor variables spanned dietary recall, demographics, and body measurements — but most dietary variables correlated with HDL at |r| < 0.20.

Engineered interaction features (body composition × gender, alcohol × gender/age) and ratio features (macronutrient percentages, fat ratios) to extract signal that individual variables couldn't capture alone.

Preventing Information Leakage in Stacking

Naively training base learners on the full dataset and then feeding their predictions to a meta-learner would leak target information, inflating validation scores and producing an overfit ensemble.

Used out-of-fold predictions — each base learner only generated predictions for validation folds it never trained on. The meta-learner (Ridge) was then trained exclusively on these held-out predictions, ensuring no data point's target influenced its own meta-features.
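The out-of-fold construction maps directly onto scikit-learn's `cross_val_predict`, which returns, for each row, a prediction from the one fold-model that never trained on it. A sketch with two placeholder base learners on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

# Placeholders for the seven real base learners.
base_learners = [
    GradientBoostingRegressor(random_state=0),
    ExtraTreesRegressor(n_estimators=100, random_state=0),
]

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each column holds predictions made only on folds the model never saw
# during training, so no row's target leaks into its own meta-feature.
oof_matrix = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_learners]
)

# The meta-learner trains exclusively on held-out predictions.
meta_learner = Ridge(alpha=1.0).fit(oof_matrix, y)
print(oof_matrix.shape)  # one meta-feature column per base learner
```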

Meta-Learner Regularization Tuning

The Ridge meta-learner's alpha hyperparameter had to be tuned to combine the 7 base learners' predictions without overfitting to their correlated outputs — several XGBoost and LightGBM variants produced similar prediction distributions.

Swept alpha values from 0.01 to 50.0 under nested 5-fold CV on the out-of-fold matrix. StandardScaler normalized base learner predictions before Ridge fitting so that no single learner's prediction scale dominated the ensemble weights.
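The scaling-plus-sweep step can be sketched with a scikit-learn pipeline and grid search (a plain, not nested, CV here for brevity). The out-of-fold matrix below is a synthetic stand-in of 7 deliberately correlated columns:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the out-of-fold matrix: 7 correlated base
# learner prediction columns around a common target.
y = rng.normal(50, 10, 300)
oof = np.column_stack([y + rng.normal(0, 4, 300) for _ in range(7)])

# StandardScaler before Ridge so no base learner's scale dominates weights.
pipe = make_pipeline(StandardScaler(), Ridge())

# Log-spaced alpha sweep over the 0.01-50.0 range under 5-fold CV.
grid = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(np.log10(0.01), np.log10(50.0), 15)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(oof, y)
print(grid.best_params_)
```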