Project Overview
Twitter US Airline Sentiment — End‑to‑End NLP Pipeline
This repository delivers an end‑to‑end sentiment‑analysis workflow for the Twitter US Airline Sentiment dataset (≈ 14,600 tweets, Feb 2015): data prep → EDA → TF‑IDF + Logistic Regression → DistilBERT fine‑tune → evaluation & interpretability → Streamlit demo and FastAPI service → Docker + CI for production‑style packaging.
Key results (held‑out test)
| Model | Accuracy | Macro‑F1 | Notes |
|---|---|---|---|
| DistilBERT (fine‑tuned) | 0.836 | 0.788 | Strong ROC/PR; main confusions are negative ↔︎ neutral. |
| TF‑IDF + Logistic Regression (tuned) | 0.790 | 0.739 | Macro ROC‑AUC ≈ 0.907; transparent token weights; good fallback. |
Operational guidance. Errors concentrate at mid confidences (~0.40–0.70); consider auto‑route ≥ 0.80 and human‑review 0.50–0.79 in production.
Dataset
- Source. Kaggle — Twitter US Airline Sentiment (Feb–Mar 2015).
- Labels.
negative,neutral,positive, plus an optionalnegativereasonfield for error analysis. - Imbalance. ~63 % negative, 21 % neutral, 16 % positive. Use stratified splits and class weighting for classical models.
- File layout. Place the raw CSV at
data/raw/Tweets.csv. Processed parquet and feather splits are written todata/processed/.
Repository layout
notebooks/ # EDA → modeling → deployment (PDF exports in repo)
models/
distilbert_twitter/final/ # Fine‑tuned DistilBERT checkpoint + tokenizer
logreg_tfidf.joblib # Baseline sklearn pipeline
logreg_tfidf_optuna.joblib # (optional) tuned sklearn pipeline
src_2/
app.py # FastAPI service (POST /predict)
streamlit_app.py # Streamlit demo UI
model.py, predict.py # Unified loader + thin inference API
data_prep.py, paths.py # Lightweight text prep + robust project paths
reports/ # Metrics JSON + evaluation figures
Dockerfile # Multi‑stage image (Python 3.11‑slim)
.github/workflows/ci.yml # Lint → tests → docker build → optional push on tags
Pipeline overview
1) Data preparation & EDA
- Minimal, robust cleaning (normalize whitespace; lowercase only for sklearn paths).
- Cleaned data are saved to Parquet; stratified Feather splits
X_*/y_*are produced fortrain/val/test(11,712 / 1,464 / 1,464).
- EDA: class balance, lengths, negative‑reason categories, and airline‑level sentiment.
2) Classical baseline — TF‑IDF + Logistic Regression
- Bi‑gram TF‑IDF (
max_featuresup to 50k) +LogisticRegression(class_weight='balanced').
- Validation ≈ 0.79 accuracy; macro‑F1 ≈ 0.74.
- Optuna tuning jointly searches TF‑IDF and LR hyper‑parameters; the best pipeline is persisted as
models/logreg_tfidf_optuna.joblib.
- Diagnostics: confusion matrix, ROC, token‑importance bars, confidence histograms, t‑SNE, lift curves.
3) Transformer — DistilBERT
- Base:
distilbert-base-uncased, max length 128, batch 16, LR 2e‑5, 2 epochs.
- Test results ≈ 0.84 accuracy / ≈ 0.79 macro‑F1; negative/positive are cleanly separated, neutral is harder.
- Final operating policy derived from calibration & confidence diagnostics.
Inference
A) Streamlit (local demo)
streamlit run src_2/streamlit_app.pyUse the sidebar to switch between distilbert and sklearn, and to return all class scores.
B) FastAPI (programmatic)
uvicorn src_2.app:app --host 0.0.0.0 --port 8000
# POST /predict with either payload
# { "text": "The flight was delayed" }
# { "texts": ["Great staff!", "Late again :("], "model_name": "distilbert", "return_all_scores": true }C) From Python
from src_2.predict import predict_texts
predict_texts("Great service!", model_name="distilbert", return_all_scores=True)Docker & CI
Build locally
docker build -t airline-sentiment:latest .
docker run -p 8000:8000 airline-sentiment:latest
# Then POST to http://127.0.0.1:8000/predictThe Dockerfile uses a multi‑stage Python 3.11‑slim base and launches Uvicorn with src_2.app:app as the entrypoint.
GitHub Actions
The CI workflow lints & tests the repo, builds the Docker image on every commit, and (on tags like v*.*.*) can push to a container registry.
Reproducibility & artifacts
- Models:
models/distilbert_twitter/final/,models/logreg_tfidf*.joblib
- Metrics & figures: confusion matrices, ROC/PR, calibration, t‑SNE, confidence histograms, lift curves under
reports/
Next steps
- Apply temperature scaling for tighter probability calibration.
- Collect / augment neutral examples; consider class‑weighted or focal loss.
- Explore domain‑specific LMs (e.g., TweetEval‑style) if neutral recall remains the main pain point.
This README reflects the implemented stack (TF‑IDF + LogReg and DistilBERT), the shipped demo apps (Streamlit + FastAPI), and the packaging (Docker + CI).