Data Preparation


This notebook demonstrates the `src.data_prep` pipeline:

  1. Loads the raw CSV
  2. Shows a before/after sample of 10 tweets
  3. Saves the cleaned data to Parquet and prints the output path

Code
# standard imports
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# project pipeline
from twitter_airline_analysis.data_prep import load_raw, preprocess, save_parquet

PROJ_ROOT = Path.cwd().parent
RAW_DIR   = PROJ_ROOT / "data" / "raw"
PROC_DIR  = PROJ_ROOT / "data" / "processed"
PROC_DIR.mkdir(parents=True, exist_ok=True)

# load the raw DataFrame
df_raw = load_raw()
print(f"Raw data: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")
df_raw.head(10)
Raw data: 14,640 rows × 15 columns
tweet_id airline_sentiment airline_sentiment_confidence negativereason negativereason_confidence airline airline_sentiment_gold name negativereason_gold retweet_count text tweet_coord tweet_created tweet_location user_timezone
0 570306133677760513 neutral 1.0000 NaN NaN Virgin America NaN cairdin NaN 0 @VirginAmerica What @dhepburn said. NaN 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada)
1 570301130888122368 positive 0.3486 NaN 0.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica plus you've added commercials t... NaN 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada)
2 570301083672813571 neutral 0.6837 NaN NaN Virgin America NaN yvonnalynn NaN 0 @VirginAmerica I didn't today... Must mean I n... NaN 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada)
3 570301031407624196 negative 1.0000 Bad Flight 0.7033 Virgin America NaN jnardino NaN 0 @VirginAmerica it's really aggressive to blast... NaN 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada)
4 570300817074462722 negative 1.0000 Can't Tell 1.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica and it's a really big bad thing... NaN 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada)
5 570300767074181121 negative 1.0000 Can't Tell 0.6842 Virgin America NaN jnardino NaN 0 @VirginAmerica seriously would pay $30 a fligh... NaN 2015-02-24 11:14:33 -0800 NaN Pacific Time (US & Canada)
6 570300616901320704 positive 0.6745 NaN 0.0000 Virgin America NaN cjmcginnis NaN 0 @VirginAmerica yes, nearly every time I fly VX... NaN 2015-02-24 11:13:57 -0800 San Francisco CA Pacific Time (US & Canada)
7 570300248553349120 neutral 0.6340 NaN NaN Virgin America NaN pilot NaN 0 @VirginAmerica Really missed a prime opportuni... NaN 2015-02-24 11:12:29 -0800 Los Angeles Pacific Time (US & Canada)
8 570299953286942721 positive 0.6559 NaN NaN Virgin America NaN dhepburn NaN 0 @virginamerica Well, I didn't…but NOW I DO! :-D NaN 2015-02-24 11:11:19 -0800 San Diego Pacific Time (US & Canada)
9 570295459631263746 positive 1.0000 NaN NaN Virgin America NaN YupitsTate NaN 0 @VirginAmerica it was amazing, and arrived an ... NaN 2015-02-24 10:53:27 -0800 Los Angeles Eastern Time (US & Canada)

Before / After Cleaning

Below we show the first 10 tweets in their original form alongside the cleaned `clean_text` column.

Code
# take a 10-row sample for demo
sample = df_raw.head(10).copy()

# apply the cleaning pipeline
df_tidy = preprocess(sample)

# display original and cleaned text side by side
pd.concat(
    [
        sample[["tweet_id", "text"]].rename(columns={"text": "original_text"}),
        df_tidy[["clean_text"]]
    ],
    axis=1
)
tweet_id original_text clean_text
0 570306133677760513 @VirginAmerica What @dhepburn said. what said.
1 570301130888122368 @VirginAmerica plus you've added commercials t... plus you've added commercials to the experienc...
2 570301083672813571 @VirginAmerica I didn't today... Must mean I n... i didn't today... must mean i need to take ano...
3 570301031407624196 @VirginAmerica it's really aggressive to blast... it's really aggressive to blast obnoxious "ent...
4 570300817074462722 @VirginAmerica and it's a really big bad thing... and it's a really big bad thing about it
5 570300767074181121 @VirginAmerica seriously would pay $30 a fligh... seriously would pay $30 a flight for seats tha...
6 570300616901320704 @VirginAmerica yes, nearly every time I fly VX... yes, nearly every time i fly vx this “ear worm...
7 570300248553349120 @VirginAmerica Really missed a prime opportuni... really missed a prime opportunity for men with...
8 570299953286942721 @virginamerica Well, I didn't…but NOW I DO! :-D well, i didn't...but now i do! :-d
9 570295459631263746 @VirginAmerica it was amazing, and arrived an ... it was amazing, and arrived an hour early. you...
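The actual cleaning logic lives in `src.data_prep`. As a rough sketch only, a minimal `clean_tweet` helper consistent with the output above (a hypothetical stand-in, not the project's implementation) would strip @mentions, collapse whitespace, and lowercase:

```python
import re

def clean_tweet(text: str) -> str:
    """Minimal tweet cleaner: drop @mentions, collapse whitespace, lowercase."""
    text = re.sub(r"@\w+", "", text)          # remove @mentions
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()

print(clean_tweet("@VirginAmerica What @dhepburn said."))  # what said.
```

Note this sketch omits steps the real pipeline may perform (URL removal, emoji handling, etc.); it only reproduces what is visible in the sample above.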

Save to Parquet

Now we save the full cleaned dataset to Parquet and display the path.

Code
# load & preprocess full dataset
full_raw  = load_raw()
full_tidy = preprocess(full_raw)

# save and capture the file path
out_path = save_parquet(full_tidy)
print(f"✅ Saved {len(full_tidy):,} rows to:\n{out_path}")
✅ Saved 14,640 rows to:
C:\Projects\twitter-airline-analysis\data\processed\tweets.parquet
Code
df = pd.read_parquet(PROC_DIR / "tweets.parquet") 
X    = df["clean_text"]
y    = df["airline_sentiment"]

# hold out 10 % for test, then ~10 % of the total for validation (≈ 80/10/10 split)
X_temp, X_test, y_temp, y_test = train_test_split(X, y,
                                                  test_size=0.10,
                                                  stratify=y,
                                                  random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp,
                                                  test_size=0.1111,   # 0.1111 × 0.9 ≈ 0.10 of the full data
                                                  stratify=y_temp,
                                                  random_state=42)

# save the splits as Feather for later notebooks
splits = {
    "X_train": (X_train, "text"),
    "y_train": (y_train, "label"),
    "X_val":   (X_val,   "text"),
    "y_val":   (y_val,   "label"),
    "X_test":  (X_test,  "text"),
    "y_test":  (y_test,  "label"),
}
for stem, (series, col) in splits.items():
    (pd.Series(series, name=col)
       .reset_index(drop=True)
       .to_frame()
       .to_feather(PROC_DIR / f"{stem}.ftr"))

print("✅ Train / validation / test splits written to", PROC_DIR.resolve())
✅ Train / validation / test splits written to C:\Projects\twitter-airline-analysis\data\processed
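The two-stage split above targets an 80/10/10 ratio. A self-contained check on toy labels (hypothetical data, same `test_size` values as above) confirms the arithmetic:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = ["neg"] * 80 + ["pos"] * 20   # toy labels with an 80/20 class balance

# stage 1: hold out 10 % of the data for test
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# stage 2: 0.1111 of the remaining 90 % ≈ 10 % of the total for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.1111, stratify=y_tmp, random_state=42)

print(len(X_tr), len(X_va), len(X_te))  # 80 10 10
```

Because both calls pass `stratify`, each split preserves the 80/20 class balance of the full label set.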