// Project

HDL Cholesterol Prediction Program

Python · Machine Learning · Data Science · XGBoost · Scikit-Learn

4.4611

Final RMSE

0.728

CV R²

20+

Engineered Features

7

Base Learners

About this project

A stacked ensemble machine learning model that predicts HDL cholesterol levels from 97 health variables across 1,200 individuals in the 2024 NHANES dataset. Built for the ASA South Florida Student Data Challenge, where it placed first in the undergraduate category and earned the Excellence Award as its sole recipient.

Key Features

Stacked Ensemble Model

7 diverse base learners — multiple XGBoost configs, LightGBM configs, Gradient Boosting, and Extra Trees — generating out-of-fold predictions fed into a Ridge regression meta-learner to prevent information leakage.
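A minimal sketch of this architecture using scikit-learn's `StackingRegressor`, whose `cv` parameter generates the out-of-fold meta-features internally. The base learners and synthetic data below are placeholders, not the project's actual seven configs (which also included XGBoost and LightGBM variants).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge

# Synthetic stand-in for the NHANES feature matrix.
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

# Placeholder base learners; the real model used multiple XGBoost and
# LightGBM configs alongside Gradient Boosting and Extra Trees.
base_learners = [
    ("gbt_shallow", GradientBoostingRegressor(max_depth=2, random_state=0)),
    ("gbt_deep", GradientBoostingRegressor(max_depth=4, random_state=0)),
    ("extra_trees", ExtraTreesRegressor(n_estimators=100, random_state=0)),
]

# cv=5 means each base learner's meta-features are out-of-fold predictions,
# so the Ridge meta-learner never sees leaked target information.
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X, y)
print(stack.predict(X).shape)
```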

Feature Engineering

Created 20+ engineered features, including a waist-to-BMI ratio, macronutrient percentage breakdowns, saturated/mono-/polyunsaturated fat ratios, body composition × gender interactions, and quadratic terms for age and waist circumference.
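The ratio, interaction, and quadratic features can be derived with a few pandas expressions. The column names below are hypothetical stand-ins for the NHANES variables, not the project's actual identifiers:

```python
import pandas as pd

# Toy frame with hypothetical column names standing in for NHANES variables.
df = pd.DataFrame({
    "waist_cm": [88.0, 102.5, 76.0],
    "bmi": [24.1, 31.3, 21.8],
    "age": [34, 58, 45],
    "is_male": [1, 0, 1],
    "kcal": [2200.0, 1800.0, 2500.0],
    "fat_g": [80.0, 60.0, 95.0],
    "sat_fat_g": [25.0, 20.0, 30.0],
})

# Ratio features
df["waist_to_bmi"] = df["waist_cm"] / df["bmi"]
df["fat_pct_kcal"] = df["fat_g"] * 9 / df["kcal"]   # fat supplies ~9 kcal/g
df["sat_fat_ratio"] = df["sat_fat_g"] / df["fat_g"]

# Interaction and quadratic terms
df["waist_x_male"] = df["waist_cm"] * df["is_male"]
df["age_sq"] = df["age"] ** 2
df["waist_sq"] = df["waist_cm"] ** 2

print(df.filter(like="waist").round(2))
```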

EDA & Feature Analysis

Identified waist circumference (r = -0.60), gender (r = +0.52), and BMI (r = -0.48) as the dominant predictors. Most dietary variables showed weak individual correlations (|r| below 0.15–0.20).
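Ranking predictors by absolute correlation with the target is a one-liner in pandas. The data here is synthetic, constructed only to mimic the reported correlation directions (waist negative, male gender positive):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data mimicking the reported correlation signs, not the real NHANES values.
waist = rng.normal(95, 15, n)
male = rng.integers(0, 2, n).astype(float)
hdl = 55 - 0.4 * (waist - 95) + 8 * male + rng.normal(0, 6, n)

df = pd.DataFrame({"hdl": hdl, "waist_cm": waist, "is_male": male})

# Correlation of every predictor with HDL, sorted by magnitude.
corr = df.corr()["hdl"].drop("hdl").sort_values(key=np.abs, ascending=False)
print(corr.round(2))
```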

5-Fold Cross-Validation

Tested Ridge regression, Random Forest, Gradient Boosting, XGBoost, and LightGBM as baselines before building the final stacked ensemble.
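The baseline comparison can be sketched as a loop over models scored with 5-fold cross-validated RMSE. XGBoost and LightGBM are omitted here to keep the sketch dependency-free; the data is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in; XGBRegressor and LGBMRegressor would slot in the same way.
X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=1)

baselines = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=1),
    "gradient_boosting": GradientBoostingRegressor(random_state=1),
}

for name, model in baselines.items():
    # 5-fold CV; sklearn reports negated errors, so flip the sign for RMSE.
    scores = -cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```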

Research Report

4-page professional PDF with target distribution chart, feature correlation chart, stacking architecture diagram, feature importance plot, performance tables, and clinical interpretation.

Tech Stack

Python · XGBoost · LightGBM · Scikit-Learn · Pandas · NumPy · Matplotlib

Development Roadmap

Data Acquisition & EDA
Feature Engineering
Model Selection & Baseline
Stacked Ensemble Architecture
Results & Competition Outcome

Challenges & Solutions

Noise-Perturbed Target Variable

The HDL cholesterol target had a noise floor of ~3.5–4.0 RMSE baked in, creating a hard ceiling on accuracy that no model could break through.

Focused on ensemble diversity and feature engineering rather than chasing lower error. The stacking architecture with diverse base learners squeezed a 2.3% improvement over the best single model, achieving 4.695 CV RMSE.

Weak Individual Predictors

95 predictor variables spanned dietary recall, demographics, and body measurements — but most dietary variables correlated with HDL at |r| < 0.20.

Engineered interaction features (body composition × gender, alcohol × gender/age) and ratio features (macronutrient percentages, fat ratios) to extract signal that individual variables couldn't capture alone.

Preventing Information Leakage in Stacking

Naively training base learners on the full dataset and then feeding their predictions to a meta-learner would leak target information, inflating validation scores and producing an overfit ensemble.

Used out-of-fold predictions — each base learner only generated predictions for validation folds it never trained on. The meta-learner (Ridge) was then trained exclusively on these held-out predictions, ensuring no data point's target influenced its own meta-features.
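The out-of-fold construction maps directly onto scikit-learn's `cross_val_predict`, which returns, for each row, a prediction from the one fold-model that never trained on it. A sketch with two placeholder base learners on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

# Placeholders for the seven real base learners.
base_learners = [
    GradientBoostingRegressor(random_state=0),
    ExtraTreesRegressor(n_estimators=100, random_state=0),
]

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each column holds predictions made only on folds the model never saw
# during training, so no row's target leaks into its own meta-feature.
oof_matrix = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_learners]
)

# The meta-learner trains exclusively on held-out predictions.
meta_learner = Ridge(alpha=1.0).fit(oof_matrix, y)
print(oof_matrix.shape)  # one meta-feature column per base learner
```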

Meta-Learner Regularization Tuning

The Ridge meta-learner's alpha hyperparameter had to be tuned to combine the 7 base learners' predictions without overfitting to their correlated outputs — several XGBoost and LightGBM variants produced similar prediction distributions.

Swept alpha values from 0.01 to 50.0 under nested 5-fold CV on the out-of-fold matrix. StandardScaler normalized base learner predictions before Ridge fitting so that no single learner's prediction scale dominated the ensemble weights.
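The scaling-plus-sweep step can be sketched with a scikit-learn pipeline and grid search (a plain, not nested, CV here for brevity). The out-of-fold matrix below is a synthetic stand-in of 7 deliberately correlated columns:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the out-of-fold matrix: 7 correlated base
# learner prediction columns around a common target.
y = rng.normal(50, 10, 300)
oof = np.column_stack([y + rng.normal(0, 4, 300) for _ in range(7)])

# StandardScaler before Ridge so no base learner's scale dominates weights.
pipe = make_pipeline(StandardScaler(), Ridge())

# Log-spaced alpha sweep over the 0.01-50.0 range under 5-fold CV.
grid = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(np.log10(0.01), np.log10(50.0), 15)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(oof, y)
print(grid.best_params_)
```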