Project Overview
This project delivers an end-to-end machine learning workflow that forecasts daily sales for Rossmann stores with a time‑aware XGBoost model. The pipeline cleans the data, engineers features, and evaluates once on a final six‑week holdout to mirror deployment.
Highlights
- Objective: Predict daily log1p(Sales) per store.
- Data coverage: 2013‑01‑01 → 2015‑07‑31; 844,338 open‑store rows; 1,115 stores.
- Final test (log scale): RMSE 0.169, MAE 0.130, R² 0.833.
- Key drivers: Promotions, competition distance, weekday and store effects.
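As a quick illustration of the target transform in the objective above, this sketch (with made-up sales values, not project data) shows the log1p/expm1 round trip:

```python
import numpy as np

# Hypothetical values; the model trains on log1p(Sales), and predictions
# are mapped back to the sales scale with the inverse transform expm1.
sales = np.array([0.0, 5000.0, 12000.0])
target = np.log1p(sales)        # training target: log1p(Sales)
recovered = np.expm1(target)    # inverse transform recovers raw sales
```

Using log1p rather than log keeps zero-sales days (closed stores, stockouts) well defined.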
Repository Structure
```
.
├── 01_EDA.ipynb / 01_EDA.pdf            # Cleaning, EDA, feature engineering
├── 02_Modeling.ipynb / 02_Modeling.pdf  # Time-based split, tuning, evaluation
├── data/
│   ├── train.csv                        # Kaggle Rossmann 'train' (place here)
│   ├── store.csv                        # Kaggle Rossmann 'store' (place here)
│   └── data_processed.csv               # Exported from EDA for modeling
├── exports/                             # Created by the modeling notebook
│   ├── xgb_final_model.joblib
│   ├── X_test_processed.csv
│   ├── y_test_processed.csv
│   └── test_predictions.csv
└── Rossmann_Final_Report_Pro.md         # Project report
```
Quickstart

Create an environment

```shell
python -m venv .venv && source .venv/bin/activate  # or use conda
pip install -U pandas numpy scikit-learn xgboost optuna shap matplotlib
```

Add data

- Place `train.csv` and `store.csv` into `data/`.

Run EDA

- Open `01_EDA.ipynb` and run all cells.
- It writes `data/data_processed.csv`, used by modeling.

Run Modeling

- Open `02_Modeling.ipynb` and run all cells.
- It creates a chronological split (last six weeks = test), tunes XGBoost, evaluates once on the holdout, and saves artifacts under `exports/`.
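The chronological split described above can be sketched as follows; the `Date` column name follows the Kaggle Rossmann schema, and the toy frame stands in for `data/data_processed.csv`:

```python
import pandas as pd

# Toy frame standing in for data/data_processed.csv; only Date matters here.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})

# Hold out the last six weeks as the test set; everything earlier is train.
cutoff = df["Date"].max() - pd.Timedelta(weeks=6)
train = df[df["Date"] <= cutoff]
test = df[df["Date"] > cutoff]
```

Splitting by date rather than at random keeps future rows out of training, which is what makes the single holdout evaluation an honest proxy for deployment.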
Results at a glance
- Test performance (log scale): RMSE 0.169, MAE 0.130, R² 0.833.
- Top signals: Promo variables, CompetitionDistance, weekday and store effects.
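The scores above are reported on the log scale; a minimal sketch of how such metrics are computed (with made-up prediction vectors, not the project's actual outputs):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up log-scale targets and predictions, standing in for
# exports/y_test_processed.csv and the model's output.
y_true_log = np.array([8.5, 9.1, 8.8, 9.4])
y_pred_log = np.array([8.4, 9.2, 8.9, 9.3])

rmse = np.sqrt(mean_squared_error(y_true_log, y_pred_log))
mae = mean_absolute_error(y_true_log, y_pred_log)
r2 = r2_score(y_true_log, y_pred_log)
```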
Reproduce Locally
- EDA: cleans data, builds engineered features (for example, `ActiveStoreCount` and promo interactions), applies log transforms, and exports `data_processed.csv`.
- Modeling: verifies leakage protection, encodes categoricals, tunes XGBoost, and exports the trained model and test artifacts.
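A hedged sketch of the kind of features the EDA step builds; column names follow the Kaggle schema, and `PromoWeekend` is a hypothetical interaction for illustration, not necessarily one from the notebook:

```python
import pandas as pd

# Toy rows; Promo and Date follow the Kaggle Rossmann schema.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-01", "2015-07-04"]),
    "Promo": [1, 1],
})

df["DayOfWeek"] = df["Date"].dt.dayofweek  # weekday effect (0 = Monday)
# Hypothetical promo interaction: flag promos that run on weekends.
df["PromoWeekend"] = df["Promo"] * (df["DayOfWeek"] >= 5).astype(int)
```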
Predict with the Saved Model

```python
import joblib
import numpy as np
import pandas as pd

model = joblib.load("exports/xgb_final_model.joblib")
X = pd.read_csv("exports/X_test_processed.csv")
pred_log = model.predict(X)
pred = np.expm1(pred_log)  # back-transform, since training used log1p(Sales)
```

Justin Castillo
jcastillo.hotels@gmail.com
GitHub
LinkedIn