import numpy as np
import pandas as pd
# Load saved embeddings and metadata
vectors = np.load("embeddings.npy") # (N, 3072)
df = pd.read_csv("metadata.csv") # video_id, politician, party, date, ...
df["date"] = pd.to_datetime(df["date"])
df["year_month"] = df["date"].dt.to_period("M")
print(f"Total Shorts: {len(df)}")
print(f"Politicians: {df['politician'].nunique()}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

4. Panel Design - Aggregation, Regression, and Scale-Up
Unit of analysis
The core unit is politician x month: the average multimodal embedding of all Shorts published by a given politician in a given month.
Raw data: 1 row per Short (video_id, politician, party, date, embedding)
Panel: 1 row per politician-month (politician, year_month, avg_embedding)
Why monthly aggregation?
- Daily is too noisy: most politicians post 0-2 Shorts per day.
- Quarterly is too coarse: election campaigns shift strategies within weeks.
- Monthly balances granularity with statistical power. A politician posting 10 Shorts/month yields a stable average vector.
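This trade-off can be illustrated with a toy example (synthetic dates, not the real corpus): counting the same posts at daily versus monthly granularity shows how monthly pooling smooths sparse posting.

```python
import numpy as np
import pandas as pd

# Synthetic example: ~30 Shorts spread over roughly 3 months for one politician.
rng = np.random.default_rng(0)
dates = pd.to_datetime("2024-01-01") + pd.to_timedelta(
    rng.integers(0, 90, size=30), unit="D"
)
toy = pd.DataFrame({"politician": "A", "date": dates})

daily = toy.groupby(toy["date"].dt.date).size()              # mostly 0-2 per day
monthly = toy.groupby(toy["date"].dt.to_period("M")).size()  # ~10 per month
print(monthly)
```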
Constructing the panel
Step 1: Load embeddings and metadata
Step 2: Monthly average embedding
panel_rows = []
panel_vec_list = []
for (pol, ym), group in df.groupby(["politician", "year_month"]):
    avg_vec = vectors[group.index.values].mean(axis=0)  # (3072,)
    avg_vec /= np.linalg.norm(avg_vec)  # re-normalize to unit length
    panel_rows.append({
        "politician": pol,
        "year_month": str(ym),
        "party": group["party"].iloc[0],
        "n_shorts": len(group),
        "embedding_idx": len(panel_rows),  # row index into panel_vectors
    })
    panel_vec_list.append(avg_vec)
df_panel = pd.DataFrame(panel_rows)
panel_vectors = np.stack(panel_vec_list)  # (n_panel_rows, 3072), unit-norm rows
print(f"Panel rows: {len(df_panel)}")
print(f"Panel vectors shape: {panel_vectors.shape}")

Averaging multiple unit vectors does not produce a unit vector. Re-normalizing ensures cosine similarity remains interpretable (range [-1, 1]).
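A two-dimensional sketch of why re-normalization is needed:

```python
import numpy as np

# The mean of two unit vectors at 90 degrees has norm 1/sqrt(2), not 1.
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
m = (u + v) / 2
print(np.linalg.norm(m))         # ~0.7071
m_hat = m / np.linalg.norm(m)    # re-normalized back onto the unit sphere
print(np.linalg.norm(m_hat))     # 1.0
```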
Step 3: Pairwise similarity per month
For regression, we need pairwise similarity scores between all politician pairs within each month.
from sklearn.metrics.pairwise import cosine_similarity
from itertools import combinations
pair_rows = []
for ym in df_panel["year_month"].unique():
    month_df = df_panel[df_panel["year_month"] == ym]
    month_vecs = panel_vectors[month_df.index]  # df_panel index == embedding_idx
    pols = month_df["politician"].values
    parties = month_df["party"].values
    sim_matrix = cosine_similarity(month_vecs)
    for i, j in combinations(range(len(pols)), 2):
        pair_rows.append({
            "year_month": ym,
            "pol_i": pols[i],
            "pol_j": pols[j],
            "party_i": parties[i],
            "party_j": parties[j],
            "same_party": int(parties[i] == parties[j]),
            "cosine_sim": sim_matrix[i, j],
        })
df_pairs = pd.DataFrame(pair_rows)
print(f"Pairwise observations: {len(df_pairs)}")
print(f"Mean similarity: {df_pairs['cosine_sim'].mean():.4f}")

Regression model
Research question
What predicts stylistic similarity between politicians’ Shorts?
Specification
The dependent variable is pairwise cosine similarity between politician \(i\) and politician \(j\) in month \(t\):
\[ \text{Sim}_{ij,t} = \alpha + \beta_1 \cdot \text{SameParty}_{ij} + \beta_2 \cdot \text{SameDistrict}_{ij} + \gamma \mathbf{X}_{ij,t} + \delta_t + \epsilon_{ij,t} \]
| Symbol | Description |
|---|---|
| \(\text{SameParty}_{ij}\) | 1 if \(i\) and \(j\) belong to the same party |
| \(\text{SameDistrict}_{ij}\) | 1 if same metro area (e.g., both Seoul) |
| \(\mathbf{X}_{ij,t}\) | Controls: seniority difference, committee overlap, etc. |
| \(\delta_t\) | Month fixed effects |
| \(\epsilon_{ij,t}\) | Error term; standard errors clustered at the dyad \((i,j)\) level |
Implementation
import statsmodels.formula.api as smf
# Add covariates (example)
# df_pairs["seniority_diff"] = ...
# df_pairs["same_metro"] = ...
# Build a string dyad id for clustering (statsmodels expects a 1-d group label)
df_pairs["dyad"] = df_pairs["pol_i"] + "__" + df_pairs["pol_j"]
model = smf.ols(
    "cosine_sim ~ same_party + C(year_month)",
    data=df_pairs
).fit(cov_type="cluster", cov_kwds={"groups": df_pairs["dyad"]})
print(model.summary().tables[1])

\(\beta_1 > 0\) means same-party politicians produce more similar Shorts. If \(\beta_1 = 0.03\), same-party pairs score 0.03 higher in cosine similarity than cross-party pairs, net of month fixed effects.
Electoral convergence (event study)
Design
Do politicians’ Shorts become more similar as elections approach?
Define treatment as “months until the next election.” Compare similarity trends in the pre-election window (6 months before) to the baseline period (12+ months before).
from datetime import datetime
# Example: April 2024 general election
election_date = datetime(2024, 4, 10)
df_pairs["date"] = pd.to_datetime(df_pairs["year_month"])
df_pairs["months_to_election"] = (
    (election_date.year - df_pairs["date"].dt.year) * 12
    + (election_date.month - df_pairs["date"].dt.month)
)
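A quick sanity check of the month-difference arithmetic above, using a few hypothetical publication months:

```python
import pandas as pd
from datetime import datetime

election = datetime(2024, 4, 10)
sample_months = pd.to_datetime(pd.Series(["2023-10", "2024-01", "2024-04"]))
months_out = (
    (election.year - sample_months.dt.year) * 12
    + (election.month - sample_months.dt.month)
)
print(months_out.tolist())  # [6, 3, 0]
```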
# Keep a symmetric window
window = df_pairs[
    (df_pairs["months_to_election"] >= -3)
    & (df_pairs["months_to_election"] <= 18)
].copy()
# Event study regression
window["dyad"] = window["pol_i"] + "__" + window["pol_j"]  # cluster id
model_es = smf.ols(
    "cosine_sim ~ C(months_to_election, Treatment(reference=12)) * same_party",
    data=window
).fit(cov_type="cluster", cov_kwds={"groups": window["dyad"]})

Visualization
import matplotlib.pyplot as plt
# Extract coefficients for months_to_election dummies
coefs = []
for m in range(-3, 19):
    if m == 12:  # reference month: coefficient is zero by construction
        coefs.append({"month": m, "beta": 0, "se": 0})
        continue
    key = f"C(months_to_election, Treatment(reference=12))[T.{m}]"
    if key in model_es.params:
        coefs.append({
            "month": m,
            "beta": model_es.params[key],
            "se": model_es.bse[key],
        })
df_coefs = pd.DataFrame(coefs)
fig, ax = plt.subplots(figsize=(10, 5))
ax.errorbar(df_coefs["month"], df_coefs["beta"],
            yerr=1.96 * df_coefs["se"],
            fmt="o-", capsize=3, color="#1B64D1")
ax.axhline(0, color="gray", linestyle="--", linewidth=0.8)
ax.axvline(0, color="red", linestyle=":", linewidth=0.8, label="Election")
ax.set_xlabel("Months to Election")
ax.set_ylabel("Δ Cosine Similarity (vs. baseline)")
ax.set_title("Electoral Convergence in YouTube Shorts Style")
ax.legend()
ax.invert_xaxis()
plt.tight_layout()
plt.show()

VVG regression model
The Visual-Verbal Gap as a variable
The VVG (defined in Section 5) can serve as both a dependent and independent variable.
VVG as DV: What predicts visual-verbal divergence?
\[ \text{VVG}_{it} = \alpha + \beta_1 \cdot \text{Opposition}_{it} + \beta_2 \cdot \text{ElectionProximity}_{t} + \gamma \mathbf{X}_{it} + \delta_t + \mu_i + \epsilon_{it} \]
| Symbol | Description |
|---|---|
| \(\text{Opposition}_{it}\) | 1 if legislator \(i\) belongs to the opposition at time \(t\) |
| \(\text{ElectionProximity}_{t}\) | Months until the next election (April 2024) |
| \(\mathbf{X}_{it}\) | Controls: seniority, gender, PR vs. SMD, committee chair |
| \(\delta_t\) | Month fixed effects |
| \(\mu_i\) | Legislator fixed effects |
import statsmodels.formula.api as smf
# Panel: legislator x month, with VVG computed from the multimodal subsample.
# Note: months_to_election is a function of t alone, so it is perfectly
# collinear with month fixed effects; C(year_month) absorbs it here.
# C(legislator_id) implements the legislator fixed effects (mu_i).
model_vvg = smf.ols(
    "vvg ~ opposition + seniority + female + pr_seat"
    " + C(year_month) + C(legislator_id)",
    data=df_panel_vvg
).fit(cov_type="cluster", cov_kwds={"groups": df_panel_vvg["legislator_id"]})
print(model_vvg.summary().tables[1])

VVG as IV: Does visual performance predict engagement?
\[ \ln(\text{views}_{i}) = \alpha + \beta_1 \cdot \text{VVG}_{i} + \beta_2 \cdot \text{ContentType}_{i} + \mu_{\text{channel}} + \epsilon_{i} \]
If \(\beta_1 > 0\), higher visual-verbal divergence predicts more views. This would suggest that platform algorithms reward visual performance regardless of verbal substance.
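Since the engagement model is not implemented in this section, here is a minimal sketch on synthetic data (all variable names and the simulated effect size are placeholders, and the channel fixed effects are omitted), estimating the log-views slope by ordinary least squares:

```python
import numpy as np

# Synthetic Short-level data with a known true slope beta_1 = 0.5.
rng = np.random.default_rng(1)
n = 500
vvg = rng.uniform(0.0, 1.0, n)                        # visual-verbal gap per Short
log_views = 8.0 + 0.5 * vvg + rng.normal(0.0, 0.3, n)

# OLS via least squares: intercept column + vvg regressor
X = np.column_stack([np.ones(n), vvg])
beta, *_ = np.linalg.lstsq(X, log_views, rcond=None)
print(beta)  # intercept near 8.0, slope near 0.5
```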
Scale-up plan (updated after pilot)
The pilot (Section 5) confirmed that multimodal embedding adds substantial value (distance = 0.432 vs. audio-only, far above the 0.05 threshold). Based on this, we adopt a tiered cost strategy.
Phase 1: Pilot (completed)
| Item | Result |
|---|---|
| Videos | 15 (10 high-engagement + 5 meme/music) |
| Strategies | 4 (text_title, text_transcript, audio, multimodal) |
| Total API calls | 72 |
| Cost | ~$1.15 (free tier) |
| Outcome | 4 key findings, VVG concept validated |
Phase 2: Full-corpus text embedding
| Item | Target |
|---|---|
| Shorts | 51,197 (all Shorts with Whisper transcripts) |
| Strategy | text_transcript only |
| Cost | ~$0.50 |
| Runtime | ~1 hour |
| Goal | Full-corpus baseline for clustering, embedding regression, UMAP |
Phase 3: Stratified multimodal subsample
| Item | Target |
|---|---|
| Shorts | 2,500 (hypothesis-driven stratified sample) |
| Strategies | video_only + multimodal + audio |
| Cost | ~$234 |
| Runtime | ~2 hours |
| Goal | VVG computation, hypothesis testing (HVVG1-HVVG4) |
Sampling strategy for Phase 3
| Block | N | Selection criterion |
|---|---|---|
| Party comparison | 1,000 | Balanced ruling vs. opposition |
| Election proximity | 800 | Around April 2024 general election |
| Content type | 500 | Stratified by GPT-classified category |
| Engagement extremes | 200 | Top/bottom decile by views |
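A sketch of how the stratified draw might be assembled (synthetic sampling frame; only the party-comparison and engagement-extremes blocks are shown, with sizes scaled down from the table):

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the full Shorts corpus.
rng = np.random.default_rng(42)
frame = pd.DataFrame({
    "video_id": np.arange(10_000),
    "bloc": rng.choice(["ruling", "opposition"], size=10_000),
    "views": rng.lognormal(mean=8.0, sigma=2.0, size=10_000),
})

blocks = []
# Party comparison block: balanced ruling vs. opposition.
for b in ["ruling", "opposition"]:
    blocks.append(frame[frame["bloc"] == b].sample(500, random_state=1))
# Engagement extremes block: top and bottom decile by views.
lo, hi = frame["views"].quantile([0.1, 0.9])
blocks.append(frame[frame["views"] <= lo].sample(100, random_state=1))
blocks.append(frame[frame["views"] >= hi].sample(100, random_state=1))

sample = pd.concat(blocks).drop_duplicates("video_id")
print(len(sample))  # <= 1,200 after removing any cross-block overlap
```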
Phase 4: Optional full-corpus expansion
| Item | Target |
|---|---|
| Shorts | 51,197 |
| Strategy | video_only (for full-corpus VVG) |
| Cost | ~$2,200 |
| Decision gate | Proceed only if Phase 3 shows strong VVG signal |
Decision tree (updated)
Phase 2 results (full-corpus text)
├─ Text clustering reveals meaningful structure
│ └─ Proceed to Phase 3 (multimodal subsample)
│ ├─ VVG predicts engagement (H3 supported)
│ │ ├─ Budget allows → Phase 4 (full-corpus video)
│ │ └─ Budget tight → Publish with subsample results
│ └─ VVG does not predict engagement
│ └─ Report null finding, focus on clustering paper
└─ Text clustering is noise
└─ Investigate: transcript quality? embedding model? sample?
- Confirm embedding strategy: done (multimodal confirmed in pilot)
- Set up batch processing with rate limiting (2,800 RPM conservative)
- Save embeddings incrementally with checkpoint/resume (every 500 embeddings)
- Budget: ~$235 total for Phases 2-3 (approved)
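A minimal sketch of the checkpoint/resume pattern from the list above (file names are hypothetical, `make_embedding` stands in for the real embedding API call, and the save cadence parameter mirrors the 500-embedding checkpoint):

```python
import json
import os
import tempfile

import numpy as np

def make_embedding(item):
    """Stand-in for one embedding API call."""
    return np.full(4, float(item))

def embed_with_resume(items, out_path, ckpt_path, every=500):
    done, vecs = 0, []
    if os.path.exists(ckpt_path):  # resume from the last checkpoint
        with open(ckpt_path) as f:
            done = json.load(f)["done"]
        vecs = list(np.load(out_path))
    for i in range(done, len(items)):
        vecs.append(make_embedding(items[i]))
        if (i + 1) % every == 0 or i == len(items) - 1:
            np.save(out_path, np.array(vecs))  # incremental save
            with open(ckpt_path, "w") as f:
                json.dump({"done": i + 1}, f)
    return np.array(vecs)

tmp = tempfile.mkdtemp()
out = os.path.join(tmp, "emb.npy")
ckpt = os.path.join(tmp, "ckpt.json")
first = embed_with_resume(list(range(7)), out, ckpt, every=3)
# A second call resumes at the checkpoint and re-embeds nothing.
again = embed_with_resume(list(range(7)), out, ckpt, every=3)
```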
Data management
File structure
multimodal-analysis/
├── data/
│ ├── raw/ # Downloaded MP4s (not in git)
│ ├── audio/ # Extracted MP3s (not in git)
│ ├── metadata.csv # Video metadata from YouTube API
│ ├── embeddings_text.npy # (N, 3072) text-only
│ ├── embeddings_audio.npy # (N, 3072) audio-only
│ ├── embeddings_multi.npy # (N, 3072) multimodal
│ └── panel.csv # Aggregated politician-month panel
├── quarto-site/ # This documentation site
├── notebooks/ # Exploratory Jupyter notebooks
└── scripts/ # Production batch scripts