DistilBERT Transformer


Fine‑tune a lightweight transformer (DistilBERT) on the Twitter‑Airline Sentiment dataset and benchmark it against a classical TF‑IDF + Logistic Regression baseline.

Model                distilbert-base-uncased
Training split       90 % of cleaned data (stratified)
Validation split     10 % (held-out during fine-tuning)
Test set             untouched split created in 04_baseline_model.ipynb
Artifacts saved to   models/distilbert_twitter/

Notebook Outline

  1. Imports & Global Config
  2. Load Pre-Split Feather Data
  3. Tokenisation → HF Dataset Objects
  4. Model Instantiation
  5. Training Configuration
  6. Trainer & Fine-Tune Results
  7. Save Artifacts & Export

1. Imports & Global Config


Everything we need in one place:

  1. Path handling (pathlib.Path) so the notebook is platform‑agnostic.
  2. Reproducibility seeds for Python, NumPy, and (if available) CUDA.
  3. Key Hugging Face classes (AutoTokenizer, AutoModelForSequenceClassification, Trainer, …).
  4. A line that tells Transformers to ignore TensorFlow so only PyTorch is used.
Code
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"          # use PyTorch only

from pathlib import Path
import random
import numpy as np
import pandas as pd
import torch
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
)
from datasets import Dataset
from evaluate import load as load_metric
import json
import pprint

# reproducibility 
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# repo‑aware paths 
PROJ_ROOT = Path.cwd().parent
PROC_DIR  = PROJ_ROOT / "data" / "processed"
MODEL_DIR = PROJ_ROOT / "models" / "distilbert_twitter"
MODEL_DIR.mkdir(parents=True, exist_ok=True)

2. Load Pre-Split Feather Data


Split   Rows     Columns
train   11,712   text (tweet)
val     1,464    text (tweet)

Assertion checks ensure that every tweet is paired with exactly one sentiment label.

Take-away: the data already passed cleaning and stratified splitting elsewhere in the pipeline—nothing to redo here.

Code
# Load pre‑made Feather splits 
def _load_xy_split(split: str):
    """
    Return (X, y) for the given split.
    X : DataFrame with 'text'
    y : Series with 'label'
    """
    X = pd.read_feather(PROC_DIR / f"X_{split}.ftr")        # ['text']
    y = pd.read_feather(PROC_DIR / f"y_{split}.ftr")["label"]
    return X, y

X_train, y_train = _load_xy_split("train")
X_val,   y_val   = _load_xy_split("val")

for name, X, y in [("train", X_train, y_train), ("val", X_val, y_val)]:
    assert list(X.columns) == ["text"]
    assert y.name == "label"
    assert len(X) == len(y)
    print(f"{name:5} | rows: {len(X):,}")

display(X_train.head())
display(y_train.head())
train | rows: 11,712
val   | rows: 1,464
text
0 over an hour on hold so far
1 your gif game is strong.
2 i'm excited too, but perhaps you could scale y...
3 while other airlines weren't cancelled flighti...
4 conf number fmjtyl delayed - any chance of get...
0    negative
1    negative
2    positive
3    negative
4     neutral
Name: label, dtype: object
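
The stratified split is created upstream, but a quick class-distribution check makes the label skew visible before training. A minimal sanity check (not part of the original pipeline):

Code
# label proportions should be nearly identical across the two splits
for name, y in [("train", y_train), ("val", y_val)]:
    print(name, y.value_counts(normalize=True).round(3).to_dict())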

3. Tokenisation → HF Dataset Objects


  1. Label mapping: label2id = {"negative": 0, "neutral": 1, "positive": 2}
  2. Tokenizer: distilbert-base-uncased converts each tweet into input_ids and attention_mask (max length 128).
  3. Conversion: Dataset.from_pandas yields memory-mapped datasets; the raw text and label columns are dropped after encoding.

Dataset    Columns retained                    Rows
train_ds   input_ids, attention_mask, labels   11,712
val_ds     input_ids, attention_mask, labels   1,464

Why HF Datasets? Zero-copy slices during training and built-in compatibility with Trainer.

Code
# Tokenisation -> HF Datasets
TEXT_COL  = "text"
LABEL_COL = "label"

LABELS   = ["negative", "neutral", "positive"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode(batch):
    enc = tok(batch[TEXT_COL],
              truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [label2id[x] for x in batch[LABEL_COL]]
    return enc

cols = [TEXT_COL, LABEL_COL]
train_ds = (Dataset.from_pandas(pd.concat([X_train, y_train], axis=1)[cols])
                     .map(encode, batched=True, remove_columns=cols))
val_ds   = (Dataset.from_pandas(pd.concat([X_val,   y_val],   axis=1)[cols])
                     .map(encode, batched=True, remove_columns=cols))

print("train_ds →", train_ds.column_names, "| rows:", train_ds.num_rows)
print("val_ds   →", val_ds.column_names,   "| rows:", val_ds.num_rows)
Map: 100%|██████████| 11712/11712 [00:01<00:00, 10540.41 examples/s]
Map: 100%|██████████| 1464/1464 [00:00<00:00, 3819.59 examples/s]
train_ds → ['input_ids', 'attention_mask', 'labels'] | rows: 11712
val_ds   → ['input_ids', 'attention_mask', 'labels'] | rows: 1464
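
To confirm the encoding round-trips cleanly, decoding the first training example should reproduce the cleaned tweet (padding and special tokens aside). A quick check, assuming the datasets built above:

Code
# decode one example back to text and look up its label
example = train_ds[0]
print(tok.decode(example["input_ids"], skip_special_tokens=True))
print("label:", id2label[example["labels"]])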

4. Model Instantiation


Code
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABELS),
    id2label=id2label,
    label2id=label2id,
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
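
The warning is expected: the pre_classifier/classifier head is freshly initialised and is exactly what fine-tuning will learn. To back up the "lightweight" claim, a parameter count is one line (a quick check, not part of the original notebook; DistilBERT is roughly 66 M parameters versus ~110 M for BERT-base):

Code
# total parameter count of the loaded model
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")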

5. Training Configuration


Hyper-parameter        Value
Epochs                 2
Batch size             16
Learning rate          2 × 10⁻⁵
Weight decay           0.01
Eval / save cadence    once per epoch
Best-model criterion   eval_f1 (macro)

TrainingArguments keeps only the last 2 checkpoints to save disk space.

Code
# Training Arguments 
EPOCHS        = 2
BATCH_SIZE    = 16
LEARNING_RATE = 2e-5

train_args = TrainingArguments(
    output_dir              = MODEL_DIR / "checkpoints",
    eval_strategy           = "epoch",
    save_strategy           = "epoch",
    load_best_model_at_end  = True,
    metric_for_best_model   = "eval_f1",
    greater_is_better       = True,
    learning_rate           = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size  = BATCH_SIZE,
    num_train_epochs        = EPOCHS,
    weight_decay            = 0.01,
    seed                    = SEED,
    logging_steps           = 50,
    save_total_limit        = 2,    
    report_to               = "none",
)
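
As a cross-check on these settings, the optimiser step count can be computed up front; it should match the global_step the Trainer reports below (a back-of-envelope sketch, assuming no gradient accumulation):

Code
import math

# 11,712 examples / batch 16 = 732 steps per epoch; × 2 epochs = 1,464 steps
steps_per_epoch = math.ceil(len(train_ds) / BATCH_SIZE)
print("total optimisation steps:", steps_per_epoch * EPOCHS)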

6. Trainer & Fine-Tune Results


Epoch   Train Loss   Val Loss   Accuracy   F1 (macro)
1       0.485000     0.410365   0.837432   0.787987
2       0.319500     0.419298   0.840164   0.798038

Interpretation
• Rapid convergence in two epochs; validation accuracy ~84 %.
• Macro-F1 ≈ 0.80 suggests balanced performance across classes despite label skew.

Code
# Trainer + Fine‑Tune 
data_collator = DataCollatorWithPadding(tokenizer=tok, return_tensors="pt")

metric_acc = load_metric("accuracy")
metric_f1  = load_metric("f1")

def compute_metrics(eval_pred):
    preds = eval_pred.predictions.argmax(-1)
    refs  = eval_pred.label_ids
    return {
        "accuracy": metric_acc.compute(predictions=preds, references=refs)["accuracy"],
        "f1": metric_f1.compute(predictions=preds, references=refs, average="macro")["f1"],
    }

trainer = Trainer(
    model           = model,
    args            = train_args,
    train_dataset   = train_ds,
    eval_dataset    = val_ds,
    data_collator   = data_collator,
    compute_metrics = compute_metrics,
)

trainer.train()
[1464/1464 1:15:00, Epoch 2/2]
Epoch   Training Loss   Validation Loss   Accuracy   F1
1       0.485000        0.410365          0.837432   0.787987
2       0.319500        0.419298          0.840164   0.798038

TrainOutput(global_step=1464, training_loss=0.4235726233388557, metrics={'train_runtime': 4505.8953, 'train_samples_per_second': 5.199, 'train_steps_per_second': 0.325, 'total_flos': 775742920556544.0, 'train_loss': 0.4235726233388557, 'epoch': 2.0})
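
Macro-F1 averages over classes, so it hides per-class detail. A per-class report on the validation split makes the "balanced despite skew" claim inspectable; a short sketch using scikit-learn (not part of the original notebook):

Code
from sklearn.metrics import classification_report

# per-class precision / recall / F1 on the validation set
val_preds = trainer.predict(val_ds).predictions.argmax(-1)
print(classification_report(val_ds["labels"], val_preds, target_names=LABELS))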

7. Save Artifacts & Export


Outputs written to models/distilbert_twitter/

  • Model weights & config → .../final/
  • Tokenizer vocab & config → .../tokenizer/
  • Validation metrics JSON → val_metrics.json
Code
VAL_METRICS = trainer.evaluate()            # fetch best‑epoch metrics

SAVE_DIR = MODEL_DIR / "final"
TOKEN_DIR = SAVE_DIR / "tokenizer"

SAVE_DIR.mkdir(parents=True, exist_ok=True)

# model & tokenizer
trainer.save_model(SAVE_DIR)                # saves both config & weights
tok.save_pretrained(TOKEN_DIR)

# metrics
with open(SAVE_DIR / "val_metrics.json", "w") as fp:
    json.dump(VAL_METRICS, fp, indent=2)

print("Artefacts saved to", SAVE_DIR.resolve())
pprint.pp(VAL_METRICS)
[92/92 01:02]
✅ Artefacts saved to C:\Projects\twitter-airline-analysis\models\distilbert_twitter\final
{'eval_loss': 0.41929781436920166,
 'eval_accuracy': 0.8401639344262295,
 'eval_f1': 0.7980384320135547,
 'eval_runtime': 63.4491,
 'eval_samples_per_second': 23.074,
 'eval_steps_per_second': 1.45,
 'epoch': 2.0}
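
For downstream use, the saved artefacts can be reloaded without the Trainer. A minimal inference sketch (paths follow the layout above; the example output is illustrative, not from the notebook):

Code
from transformers import pipeline

# rebuild an inference pipeline from the saved model + tokenizer
clf = pipeline(
    "text-classification",
    model=str(SAVE_DIR),
    tokenizer=str(TOKEN_DIR),
)
print(clf("over an hour on hold so far"))
# e.g. [{'label': 'negative', 'score': 0.98}]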