import numpy as np
import pandas as pd
# Load saved embeddings and metadata
vectors = np.load("embeddings.npy") # (N, 3072)
df = pd.read_csv("metadata.csv") # video_id, politician, party, date, ...
df["date"] = pd.to_datetime(df["date"])
df["year_month"] = df["date"].dt.to_period("M")
print(f"Total Shorts: {len(df)}")
print(f"Politicians: {df['politician'].nunique()}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

4. Panel Design - Aggregation, Regression, and Scale-Up
Unit of analysis
The core unit is politician x month: the average multimodal embedding of all Shorts published by a given politician in a given month.
Raw data: 1 row per Short (video_id, politician, party, date, embedding)
Panel: 1 row per politician-month (politician, year_month, avg_embedding)
Why monthly aggregation?
- Daily is too noisy: most politicians post 0-2 Shorts per day.
- Quarterly is too coarse: election campaigns shift strategies within weeks.
- Monthly balances granularity with statistical power. A politician posting 10 Shorts/month yields a stable average vector.
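This trade-off can be illustrated with a toy example (synthetic dates, not the real corpus): counting the same posts at daily versus monthly granularity shows how monthly pooling smooths sparse posting.

```python
import numpy as np
import pandas as pd

# Synthetic example: ~30 Shorts spread over roughly 3 months for one politician.
rng = np.random.default_rng(0)
dates = pd.to_datetime("2024-01-01") + pd.to_timedelta(
    rng.integers(0, 90, size=30), unit="D"
)
toy = pd.DataFrame({"politician": "A", "date": dates})

daily = toy.groupby(toy["date"].dt.date).size()              # mostly 0-2 per day
monthly = toy.groupby(toy["date"].dt.to_period("M")).size()  # ~10 per month
print(monthly)
```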
Constructing the panel
Step 1: Load embeddings and metadata
Step 2: Monthly average embedding
panel_rows = []
panel_vec_list = []
for (pol, ym), group in df.groupby(["politician", "year_month"]):
    avg_vec = vectors[group.index.values].mean(axis=0)  # (3072,)
    avg_vec /= np.linalg.norm(avg_vec)  # re-normalize to unit length
    panel_rows.append({
        "politician": pol,
        "year_month": str(ym),
        "party": group["party"].iloc[0],
        "n_shorts": len(group),
        "embedding_idx": len(panel_rows),  # row index into panel_vectors
    })
    panel_vec_list.append(avg_vec)
df_panel = pd.DataFrame(panel_rows)
panel_vectors = np.stack(panel_vec_list)  # (n_panel_rows, 3072), unit-norm rows
print(f"Panel rows: {len(df_panel)}")
print(f"Panel vectors shape: {panel_vectors.shape}")

Averaging multiple unit vectors does not produce a unit vector. Re-normalizing ensures cosine similarity remains interpretable (range [-1, 1]).
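A two-dimensional sketch of why re-normalization is needed:

```python
import numpy as np

# The mean of two unit vectors at 90 degrees has norm 1/sqrt(2), not 1.
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
m = (u + v) / 2
print(np.linalg.norm(m))         # ~0.7071
m_hat = m / np.linalg.norm(m)    # re-normalized back onto the unit sphere
print(np.linalg.norm(m_hat))     # 1.0
```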
Step 3: Pairwise similarity per month
For regression, we need pairwise similarity scores between all politician pairs within each month.
from sklearn.metrics.pairwise import cosine_similarity
from itertools import combinations
pair_rows = []
for ym in df_panel["year_month"].unique():
    month_df = df_panel[df_panel["year_month"] == ym]
    month_vecs = panel_vectors[month_df.index]  # df_panel index == embedding_idx
    pols = month_df["politician"].values
    parties = month_df["party"].values
    sim_matrix = cosine_similarity(month_vecs)
    for i, j in combinations(range(len(pols)), 2):
        pair_rows.append({
            "year_month": ym,
            "pol_i": pols[i],
            "pol_j": pols[j],
            "party_i": parties[i],
            "party_j": parties[j],
            "same_party": int(parties[i] == parties[j]),
            "cosine_sim": sim_matrix[i, j],
        })
df_pairs = pd.DataFrame(pair_rows)
print(f"Pairwise observations: {len(df_pairs)}")
print(f"Mean similarity: {df_pairs['cosine_sim'].mean():.4f}")

Regression model
Research question
What predicts stylistic similarity between politicians’ Shorts?
Specification
The dependent variable is pairwise cosine similarity between politician \(i\) and politician \(j\) in month \(t\):
\[ \text{Sim}_{ij,t} = \alpha + \beta_1 \cdot \text{SameParty}_{ij} + \beta_2 \cdot \text{SameDistrict}_{ij} + \gamma \mathbf{X}_{ij,t} + \delta_t + \epsilon_{ij,t} \]
| Symbol | Description |
|---|---|
| \(\text{SameParty}_{ij}\) | 1 if \(i\) and \(j\) belong to the same party |
| \(\text{SameDistrict}_{ij}\) | 1 if same metro area (e.g., both Seoul) |
| \(\mathbf{X}_{ij,t}\) | Controls: seniority difference, committee overlap, etc. |
| \(\delta_t\) | Month fixed effects |
| \(\epsilon_{ij,t}\) | Error term; standard errors clustered at the dyad \((i,j)\) level |
Implementation
import statsmodels.formula.api as smf
# Add covariates (example)
# df_pairs["seniority_diff"] = ...
# df_pairs["same_metro"] = ...
# Build a string dyad id for clustering (statsmodels expects a 1-d group label)
df_pairs["dyad"] = df_pairs["pol_i"] + "__" + df_pairs["pol_j"]
model = smf.ols(
    "cosine_sim ~ same_party + C(year_month)",
    data=df_pairs
).fit(cov_type="cluster", cov_kwds={"groups": df_pairs["dyad"]})
print(model.summary().tables[1])

\(\beta_1 > 0\) means same-party politicians produce more similar Shorts. If \(\beta_1 = 0.03\), same-party pairs score 0.03 higher in cosine similarity than cross-party pairs, net of month fixed effects.
Electoral convergence (event study)
Design
Do politicians’ Shorts become more similar as elections approach?
Define treatment as “months until the next election.” Compare similarity trends in the pre-election window (6 months before) to the baseline period (12+ months before).
from datetime import datetime
# Example: April 2024 general election
election_date = datetime(2024, 4, 10)
df_pairs["date"] = pd.to_datetime(df_pairs["year_month"])
df_pairs["months_to_election"] = (
    (election_date.year - df_pairs["date"].dt.year) * 12
    + (election_date.month - df_pairs["date"].dt.month)
)
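A quick sanity check of the month-difference arithmetic above, using a few hypothetical publication months:

```python
import pandas as pd
from datetime import datetime

election = datetime(2024, 4, 10)
sample_months = pd.to_datetime(pd.Series(["2023-10", "2024-01", "2024-04"]))
months_out = (
    (election.year - sample_months.dt.year) * 12
    + (election.month - sample_months.dt.month)
)
print(months_out.tolist())  # [6, 3, 0]
```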
# Keep a symmetric window
window = df_pairs[
    (df_pairs["months_to_election"] >= -3)
    & (df_pairs["months_to_election"] <= 18)
].copy()
# Event study regression
window["dyad"] = window["pol_i"] + "__" + window["pol_j"]  # cluster id
model_es = smf.ols(
    "cosine_sim ~ C(months_to_election, Treatment(reference=12)) * same_party",
    data=window
).fit(cov_type="cluster", cov_kwds={"groups": window["dyad"]})

Visualization
import matplotlib.pyplot as plt
# Extract coefficients for months_to_election dummies
coefs = []
for m in range(-3, 19):
    if m == 12:  # reference month: coefficient is zero by construction
        coefs.append({"month": m, "beta": 0, "se": 0})
        continue
    key = f"C(months_to_election, Treatment(reference=12))[T.{m}]"
    if key in model_es.params:
        coefs.append({
            "month": m,
            "beta": model_es.params[key],
            "se": model_es.bse[key],
        })
df_coefs = pd.DataFrame(coefs)
fig, ax = plt.subplots(figsize=(10, 5))
ax.errorbar(df_coefs["month"], df_coefs["beta"],
            yerr=1.96 * df_coefs["se"],
            fmt="o-", capsize=3, color="#1B64D1")
ax.axhline(0, color="gray", linestyle="--", linewidth=0.8)
ax.axvline(0, color="red", linestyle=":", linewidth=0.8, label="Election")
ax.set_xlabel("Months to Election")
ax.set_ylabel("Δ Cosine Similarity (vs. baseline)")
ax.set_title("Electoral Convergence in YouTube Shorts Style")
ax.legend()
ax.invert_xaxis()
plt.tight_layout()
plt.show()

VVG regression model
The Visual-Verbal Gap as a variable
The VVG (defined in Section 5) can serve as both a dependent and independent variable.
VVG as DV: What predicts visual-verbal divergence?
\[ \text{VVG}_{it} = \alpha + \beta_1 \cdot \text{Opposition}_{it} + \beta_2 \cdot \text{ElectionProximity}_{t} + \gamma \mathbf{X}_{it} + \delta_t + \mu_i + \epsilon_{it} \]
| Symbol | Description |
|---|---|
| \(\text{Opposition}_{it}\) | 1 if legislator \(i\) belongs to the opposition at time \(t\) |
| \(\text{ElectionProximity}_{t}\) | Months until the next election (April 2024) |
| \(\mathbf{X}_{it}\) | Controls: seniority, gender, PR vs. SMD, committee chair |
| \(\delta_t\) | Month fixed effects |
| \(\mu_i\) | Legislator fixed effects |
import statsmodels.formula.api as smf
# Panel: legislator x month, with VVG computed from the multimodal subsample.
# Note: months_to_election is a function of t alone, so it is perfectly
# collinear with month fixed effects; C(year_month) absorbs it here.
# C(legislator_id) implements the legislator fixed effects (mu_i).
model_vvg = smf.ols(
    "vvg ~ opposition + seniority + female + pr_seat"
    " + C(year_month) + C(legislator_id)",
    data=df_panel_vvg
).fit(cov_type="cluster", cov_kwds={"groups": df_panel_vvg["legislator_id"]})
print(model_vvg.summary().tables[1])

VVG as IV: Does visual performance predict engagement?
\[ \ln(\text{views}_{i}) = \alpha + \beta_1 \cdot \text{VVG}_{i} + \beta_2 \cdot \text{ContentType}_{i} + \mu_{\text{channel}} + \epsilon_{i} \]
If \(\beta_1 > 0\), higher visual-verbal divergence predicts more views. This would suggest that platform algorithms reward visual performance regardless of verbal substance.
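Since the engagement model is not implemented in this section, here is a minimal sketch on synthetic data (all variable names and the simulated effect size are placeholders, and the channel fixed effects are omitted), estimating the log-views slope by ordinary least squares:

```python
import numpy as np

# Synthetic Short-level data with a known true slope beta_1 = 0.5.
rng = np.random.default_rng(1)
n = 500
vvg = rng.uniform(0.0, 1.0, n)                        # visual-verbal gap per Short
log_views = 8.0 + 0.5 * vvg + rng.normal(0.0, 0.3, n)

# OLS via least squares: intercept column + vvg regressor
X = np.column_stack([np.ones(n), vvg])
beta, *_ = np.linalg.lstsq(X, log_views, rcond=None)
print(beta)  # intercept near 8.0, slope near 0.5
```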
Scale-up plan (updated after pilot)
The pilot (Section 5) confirmed that multimodal embedding adds substantial value (distance = 0.432 vs. audio-only, far above the 0.05 threshold). Based on this, we adopt a tiered cost strategy.
Phase 1: Pilot (completed)
| Item | Result |
|---|---|
| Videos | 15 (10 high-engagement + 5 meme/music) |
| Strategies | 4 (text_title, text_transcript, audio, multimodal) |
| Total API calls | 72 |
| Cost | ~$1.15 (free tier) |
| Outcome | 4 key findings, VVG concept validated |
Phase 2: Full-corpus text embedding
| Item | Target |
|---|---|
| Shorts | 51,197 (all Shorts with Whisper transcripts) |
| Strategy | text_transcript only |
| Cost | ~$0.50 |
| Runtime | ~1 hour |
| Goal | Full-corpus baseline for clustering, embedding regression, UMAP |
Phase 3: Stratified multimodal subsample
| Item | Target |
|---|---|
| Shorts | 2,500 (hypothesis-driven stratified sample) |
| Strategies | video_only + multimodal + audio |
| Cost | ~$234 |
| Runtime | ~2 hours |
| Goal | VVG computation, hypothesis testing (HVVG1-HVVG4) |
Sampling strategy for Phase 3
| Block | N | Selection criterion |
|---|---|---|
| Party comparison | 1,000 | Balanced ruling vs. opposition |
| Election proximity | 800 | Around April 2024 general election |
| Content type | 500 | Stratified by GPT-classified category |
| Engagement extremes | 200 | Top/bottom decile by views |
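A sketch of how the stratified draw might be assembled (synthetic sampling frame; only the party-comparison and engagement-extremes blocks are shown, with sizes scaled down from the table):

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the full Shorts corpus.
rng = np.random.default_rng(42)
frame = pd.DataFrame({
    "video_id": np.arange(10_000),
    "bloc": rng.choice(["ruling", "opposition"], size=10_000),
    "views": rng.lognormal(mean=8.0, sigma=2.0, size=10_000),
})

blocks = []
# Party comparison block: balanced ruling vs. opposition.
for b in ["ruling", "opposition"]:
    blocks.append(frame[frame["bloc"] == b].sample(500, random_state=1))
# Engagement extremes block: top and bottom decile by views.
lo, hi = frame["views"].quantile([0.1, 0.9])
blocks.append(frame[frame["views"] <= lo].sample(100, random_state=1))
blocks.append(frame[frame["views"] >= hi].sample(100, random_state=1))

sample = pd.concat(blocks).drop_duplicates("video_id")
print(len(sample))  # <= 1,200 after removing any cross-block overlap
```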
Phase 4: Optional full-corpus expansion
| Item | Target |
|---|---|
| Shorts | 51,197 |
| Strategy | video_only (for full-corpus VVG) |
| Cost | ~$2,200 |
| Decision gate | Proceed only if Phase 3 shows strong VVG signal |
Decision tree (updated)
Phase 2 results (full-corpus text)
├─ Text clustering reveals meaningful structure
│ └─ Proceed to Phase 3 (multimodal subsample)
│ ├─ VVG predicts engagement (H3 supported)
│ │ ├─ Budget allows → Phase 4 (full-corpus video)
│ │ └─ Budget tight → Publish with subsample results
│ └─ VVG does not predict engagement
│ └─ Report null finding, focus on clustering paper
└─ Text clustering is noise
└─ Investigate: transcript quality? embedding model? sample?
- Confirm embedding strategy: done (multimodal confirmed in pilot)
- Set up batch processing with rate limiting (2,800 RPM conservative)
- Save embeddings incrementally with checkpoint/resume (every 500 embeddings)
- Budget: ~$235 total for Phases 2-3 (approved)
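A minimal sketch of the checkpoint/resume pattern from the list above (file names are hypothetical, `make_embedding` stands in for the real embedding API call, and the save cadence parameter mirrors the 500-embedding checkpoint):

```python
import json
import os
import tempfile

import numpy as np

def make_embedding(item):
    """Stand-in for one embedding API call."""
    return np.full(4, float(item))

def embed_with_resume(items, out_path, ckpt_path, every=500):
    done, vecs = 0, []
    if os.path.exists(ckpt_path):  # resume from the last checkpoint
        with open(ckpt_path) as f:
            done = json.load(f)["done"]
        vecs = list(np.load(out_path))
    for i in range(done, len(items)):
        vecs.append(make_embedding(items[i]))
        if (i + 1) % every == 0 or i == len(items) - 1:
            np.save(out_path, np.array(vecs))  # incremental save
            with open(ckpt_path, "w") as f:
                json.dump({"done": i + 1}, f)
    return np.array(vecs)

tmp = tempfile.mkdtemp()
out = os.path.join(tmp, "emb.npy")
ckpt = os.path.join(tmp, "ckpt.json")
first = embed_with_resume(list(range(7)), out, ckpt, every=3)
# A second call resumes at the checkpoint and re-embeds nothing.
again = embed_with_resume(list(range(7)), out, ckpt, every=3)
```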
Data management
File structure
multimodal-analysis/
├── data/
│ ├── raw/ # Downloaded MP4s (not in git)
│ ├── audio/ # Extracted MP3s (not in git)
│ ├── metadata.csv # Video metadata from YouTube API
│ ├── embeddings_text.npy # (N, 3072) text-only
│ ├── embeddings_audio.npy # (N, 3072) audio-only
│ ├── embeddings_multi.npy # (N, 3072) multimodal
│ └── panel.csv # Aggregated politician-month panel
├── quarto-site/ # This documentation site
├── notebooks/ # Exploratory Jupyter notebooks
└── scripts/ # Production batch scripts