Data Preparation


This notebook demonstrates the `src.data_prep` pipeline:

  1. Loads the raw CSV
  2. Shows a before/after sample of 10 tweets
  3. Saves the cleaned data to Parquet and prints the output path

Code
# standard imports
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# project pipeline
from twitter_airline_analysis.data_prep import load_raw, preprocess, save_parquet

PROJ_ROOT = Path.cwd().parent
RAW_DIR   = PROJ_ROOT / "data" / "raw"
PROC_DIR  = PROJ_ROOT / "data" / "processed"
PROC_DIR.mkdir(parents=True, exist_ok=True)

# load the raw DataFrame
df_raw = load_raw()
print(f"Raw data: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")
df_raw.head(10)
Raw data: 14,640 rows × 15 columns
tweet_id airline_sentiment airline_sentiment_confidence negativereason negativereason_confidence airline airline_sentiment_gold name negativereason_gold retweet_count text tweet_coord tweet_created tweet_location user_timezone
0 570306133677760513 neutral 1.0000 NaN NaN Virgin America NaN cairdin NaN 0 @VirginAmerica What @dhepburn said. NaN 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada)
1 570301130888122368 positive 0.3486 NaN 0.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica plus you've added commercials t... NaN 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada)
2 570301083672813571 neutral 0.6837 NaN NaN Virgin America NaN yvonnalynn NaN 0 @VirginAmerica I didn't today... Must mean I n... NaN 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada)
3 570301031407624196 negative 1.0000 Bad Flight 0.7033 Virgin America NaN jnardino NaN 0 @VirginAmerica it's really aggressive to blast... NaN 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada)
4 570300817074462722 negative 1.0000 Can't Tell 1.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica and it's a really big bad thing... NaN 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada)
5 570300767074181121 negative 1.0000 Can't Tell 0.6842 Virgin America NaN jnardino NaN 0 @VirginAmerica seriously would pay $30 a fligh... NaN 2015-02-24 11:14:33 -0800 NaN Pacific Time (US & Canada)
6 570300616901320704 positive 0.6745 NaN 0.0000 Virgin America NaN cjmcginnis NaN 0 @VirginAmerica yes, nearly every time I fly VX... NaN 2015-02-24 11:13:57 -0800 San Francisco CA Pacific Time (US & Canada)
7 570300248553349120 neutral 0.6340 NaN NaN Virgin America NaN pilot NaN 0 @VirginAmerica Really missed a prime opportuni... NaN 2015-02-24 11:12:29 -0800 Los Angeles Pacific Time (US & Canada)
8 570299953286942721 positive 0.6559 NaN NaN Virgin America NaN dhepburn NaN 0 @virginamerica Well, I didn't…but NOW I DO! :-D NaN 2015-02-24 11:11:19 -0800 San Diego Pacific Time (US & Canada)
9 570295459631263746 positive 1.0000 NaN NaN Virgin America NaN YupitsTate NaN 0 @VirginAmerica it was amazing, and arrived an ... NaN 2015-02-24 10:53:27 -0800 Los Angeles Eastern Time (US & Canada)

Before / After Cleaning

Below we show the first 10 tweets in their original form alongside the cleaned `clean_text` column.

Code
# take a 10-row sample for demo
sample = df_raw.head(10).copy()

# apply the cleaning pipeline
df_tidy = preprocess(sample)

# display original and cleaned text side by side
pd.concat(
    [
        sample[["tweet_id", "text"]].rename(columns={"text": "original_text"}),
        df_tidy[["clean_text"]]
    ],
    axis=1
)
tweet_id original_text clean_text
0 570306133677760513 @VirginAmerica What @dhepburn said. what said.
1 570301130888122368 @VirginAmerica plus you've added commercials t... plus you've added commercials to the experienc...
2 570301083672813571 @VirginAmerica I didn't today... Must mean I n... i didn't today... must mean i need to take ano...
3 570301031407624196 @VirginAmerica it's really aggressive to blast... it's really aggressive to blast obnoxious "ent...
4 570300817074462722 @VirginAmerica and it's a really big bad thing... and it's a really big bad thing about it
5 570300767074181121 @VirginAmerica seriously would pay $30 a fligh... seriously would pay $30 a flight for seats tha...
6 570300616901320704 @VirginAmerica yes, nearly every time I fly VX... yes, nearly every time i fly vx this “ear worm...
7 570300248553349120 @VirginAmerica Really missed a prime opportuni... really missed a prime opportunity for men with...
8 570299953286942721 @virginamerica Well, I didn't…but NOW I DO! :-D well, i didn't...but now i do! :-d
9 570295459631263746 @VirginAmerica it was amazing, and arrived an ... it was amazing, and arrived an hour early. you...
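The actual cleaning logic lives in `src.data_prep`. As a rough sketch only, a minimal `clean_tweet` helper consistent with the output above (a hypothetical stand-in, not the project's implementation) would strip @mentions, collapse whitespace, and lowercase:

```python
import re

def clean_tweet(text: str) -> str:
    """Minimal tweet cleaner: drop @mentions, collapse whitespace, lowercase."""
    text = re.sub(r"@\w+", "", text)          # remove @mentions
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()

print(clean_tweet("@VirginAmerica What @dhepburn said."))  # what said.
```

Note this sketch omits steps the real pipeline may perform (URL removal, emoji handling, etc.); it only reproduces what is visible in the sample above.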

Save to Parquet

Now we save the full cleaned dataset to Parquet and display the path.

Code
# load & preprocess full dataset
full_raw  = load_raw()
full_tidy = preprocess(full_raw)

# save and capture the file path
out_path = save_parquet(full_tidy)
print(f"✅ Saved {len(full_tidy):,} rows to:\n{out_path}")
✅ Saved 14,640 rows to:
C:\Projects\twitter-airline-analysis\data\processed\tweets.parquet
Code
df = pd.read_parquet(PROC_DIR / "tweets.parquet") 
X    = df["clean_text"]
y    = df["airline_sentiment"]

# hold out 10 % for test, then ~10 % of the total for validation (≈ 80/10/10 split)
X_temp, X_test, y_temp, y_test = train_test_split(X, y,
                                                  test_size=0.10,
                                                  stratify=y,
                                                  random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp,
                                                  test_size=0.1111,   # 0.1111 × 0.9 ≈ 0.10 of the full data
                                                  stratify=y_temp,
                                                  random_state=42)

# save the splits as Feather for later notebooks
splits = {
    "X_train": (X_train, "text"),
    "y_train": (y_train, "label"),
    "X_val":   (X_val,   "text"),
    "y_val":   (y_val,   "label"),
    "X_test":  (X_test,  "text"),
    "y_test":  (y_test,  "label"),
}
for stem, (series, col) in splits.items():
    (pd.Series(series, name=col)
       .reset_index(drop=True)
       .to_frame()
       .to_feather(PROC_DIR / f"{stem}.ftr"))

print("✅ Train / validation / test splits written to", PROC_DIR.resolve())
✅ Train / validation / test splits written to C:\Projects\twitter-airline-analysis\data\processed
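The two-stage split above targets an 80/10/10 ratio. A self-contained check on toy labels (hypothetical data, same `test_size` values as above) confirms the arithmetic:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = ["neg"] * 80 + ["pos"] * 20   # toy labels with an 80/20 class balance

# stage 1: hold out 10 % of the data for test
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# stage 2: 0.1111 of the remaining 90 % ≈ 10 % of the total for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.1111, stratify=y_tmp, random_state=42)

print(len(X_tr), len(X_va), len(X_te))  # 80 10 10
```

Because both calls pass `stratify`, each split preserves the 80/20 class balance of the full label set.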