This notebook demonstrates the `twitter_airline_analysis.data_prep` pipeline:
- Loads the raw CSV
- Shows a before/after sample of 10 tweets
- Saves the cleaned data to Parquet and prints the output path
Code
# standard imports
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

from twitter_airline_analysis.data_prep import load_raw, preprocess, save_parquet

# project directories
PROJ_ROOT = Path.cwd().parent
RAW_DIR = PROJ_ROOT / "data" / "raw"
PROC_DIR = PROJ_ROOT / "data" / "processed"
PROC_DIR.mkdir(parents=True, exist_ok=True)
# load the raw DataFrame
df_raw = load_raw()
print(f"Raw data: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")
df_raw.head(10)
Raw data: 14,640 rows × 15 columns
| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
| 5 | 570300767074181121 | negative | 1.0000 | Can't Tell | 0.6842 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica seriously would pay $30 a fligh... | NaN | 2015-02-24 11:14:33 -0800 | NaN | Pacific Time (US & Canada) |
| 6 | 570300616901320704 | positive | 0.6745 | NaN | 0.0000 | Virgin America | NaN | cjmcginnis | NaN | 0 | @VirginAmerica yes, nearly every time I fly VX... | NaN | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 7 | 570300248553349120 | neutral | 0.6340 | NaN | NaN | Virgin America | NaN | pilot | NaN | 0 | @VirginAmerica Really missed a prime opportuni... | NaN | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |
| 8 | 570299953286942721 | positive | 0.6559 | NaN | NaN | Virgin America | NaN | dhepburn | NaN | 0 | @virginamerica Well, I didn't…but NOW I DO! :-D | NaN | 2015-02-24 11:11:19 -0800 | San Diego | Pacific Time (US & Canada) |
| 9 | 570295459631263746 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | YupitsTate | NaN | 0 | @VirginAmerica it was amazing, and arrived an ... | NaN | 2015-02-24 10:53:27 -0800 | Los Angeles | Eastern Time (US & Canada) |
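For context, `load_raw` is assumed here to be a thin wrapper around `pd.read_csv` over the raw tweets CSV; a minimal stand-in that parses one record with the same column layout (the real loader lives in `twitter_airline_analysis.data_prep`, and the exact file path is not shown in this notebook):

```python
import io

import pandas as pd

# stand-in for load_raw(): parse a single raw record
# (column subset only; the full file has 15 columns as printed above)
raw_csv = io.StringIO(
    "tweet_id,airline_sentiment,airline_sentiment_confidence,text\n"
    '570306133677760513,neutral,1.0,"@VirginAmerica What @dhepburn said."\n'
)
df = pd.read_csv(raw_csv)
print(df.shape)  # (1, 4)
```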
Before / After Cleaning
Below we show the first 10 tweets in their original form alongside the cleaned `clean_text` column.
Code
# take a 10-row sample for demo
sample = df_raw.head(10).copy()
# apply the cleaning pipeline
df_tidy = preprocess(sample)
# display side-by-side
pd.concat(
[
sample[["tweet_id", "text"]].rename(columns={"text": "original_text"}),
df_tidy[["clean_text"]]
],
axis=1
)
| | tweet_id | original_text | clean_text |
|---|---|---|---|
| 0 | 570306133677760513 | @VirginAmerica What @dhepburn said. | what said. |
| 1 | 570301130888122368 | @VirginAmerica plus you've added commercials t... | plus you've added commercials to the experienc... |
| 2 | 570301083672813571 | @VirginAmerica I didn't today... Must mean I n... | i didn't today... must mean i need to take ano... |
| 3 | 570301031407624196 | @VirginAmerica it's really aggressive to blast... | it's really aggressive to blast obnoxious "ent... |
| 4 | 570300817074462722 | @VirginAmerica and it's a really big bad thing... | and it's a really big bad thing about it |
| 5 | 570300767074181121 | @VirginAmerica seriously would pay $30 a fligh... | seriously would pay $30 a flight for seats tha... |
| 6 | 570300616901320704 | @VirginAmerica yes, nearly every time I fly VX... | yes, nearly every time i fly vx this “ear worm... |
| 7 | 570300248553349120 | @VirginAmerica Really missed a prime opportuni... | really missed a prime opportunity for men with... |
| 8 | 570299953286942721 | @virginamerica Well, I didn't…but NOW I DO! :-D | well, i didn't...but now i do! :-d |
| 9 | 570295459631263746 | @VirginAmerica it was amazing, and arrived an ... | it was amazing, and arrived an hour early. you... |
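The cleaning visible above (@-mention removal, lowercasing, ellipsis normalization, whitespace collapsing) can be sketched with plain regexes. This is an illustration of the observed behavior, inferred from the before/after sample, not the actual body of `preprocess`:

```python
import re

def clean_tweet(text: str) -> str:
    """Stand-in for preprocess() applied to a single tweet
    (behavior inferred from the before/after table above)."""
    text = text.replace("\u2026", "...")      # normalize unicode ellipsis
    text = re.sub(r"@\w+", "", text)          # drop @-mentions
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
    return text.lower()

print(clean_tweet("@VirginAmerica What @dhepburn said."))  # what said.
```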
Save to Parquet
Now we save the full cleaned dataset to Parquet and display the path.
Code
# load & preprocess full dataset
full_raw = load_raw()
full_tidy = preprocess(full_raw)
# save and capture the file path
out_path = save_parquet(full_tidy)
print(f"✅ Saved {len(full_tidy):,} rows to:\n{out_path}")
✅ Saved 14,640 rows to:
C:\Projects\twitter-airline-analysis\data\processed\tweets.parquet
Code
df = pd.read_parquet(PROC_DIR / "tweets.parquet")
X = df["clean_text"]
y = df["airline_sentiment"]

# hold out 10 % for test, then 10 % of the total for validation (80/10/10 overall)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp,
    test_size=0.1111,  # 0.1111 * 0.90 ≈ 0.10 of the full dataset
    stratify=y_temp,
    random_state=42,
)
# save the splits as Feather for later notebooks
splits = {
    "X_train": (X_train, "text"),
    "y_train": (y_train, "label"),
    "X_val": (X_val, "text"),
    "y_val": (y_val, "label"),
    "X_test": (X_test, "text"),
    "y_test": (y_test, "label"),
}
for stem, (series, col) in splits.items():
    (pd.Series(series, name=col)
       .reset_index(drop=True)
       .to_frame()
       .to_feather(PROC_DIR / f"{stem}.ftr"))
print("✅ Validation / test splits written to", PROC_DIR.resolve())
✅ Validation / test splits written to C:\Projects\twitter-airline-analysis\data\processed
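The two-stage split above yields an 80/10/10 partition of the 14,640 rows reported earlier. The arithmetic can be checked directly (using `round` as an approximation of scikit-learn's exact ceil-based sizing):

```python
n = 14_640
n_test = round(n * 0.10)              # first split: 10 % held out for test
n_val = round((n - n_test) * 0.1111)  # second split: ≈10 % of the total
n_train = n - n_test - n_val
print(n_train, n_val, n_test)         # 11712 1464 1464 → 80/10/10
```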