5. Pilot Results - What We Learned from 15 Videos

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pilot embeddings (4 strategies x 15 videos x 3072 dimensions)
data_v1 = np.load("outputs/pilot_embeddings.npz")
data_v2 = np.load("outputs/pilot_v2_embeddings.npz")
```
Pilot design
We embedded 15 YouTube Shorts from Korean National Assembly legislators using four strategies, testing whether multimodal embedding captures information beyond text alone.
Sample composition
| Group | Videos | Purpose |
|---|---|---|
| High-engagement Shorts (DP 5, PPP 3, RPK 1, NRP 1) | 10 | Core pilot across parties |
| Meme / music-heavy Shorts | 5 | Stress-test non-speech content |
| Total | 15 | |
Embedding strategies
| Strategy | Input | Model |
|---|---|---|
| text_title | YouTube title + description | Gemini Embedding 2 |
| text_transcript | Whisper transcript (full speech) | Gemini Embedding 2 |
| audio | Extracted MP3 audio track | Gemini Embedding 2 |
| multimodal | Full MP4 video + audio + title text | Gemini Embedding 2 |
Each strategy produces a 3072-dimensional vector in the same shared embedding space, enabling direct cross-strategy comparison via cosine distance.
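Because all strategies live in one shared space, every comparison below reduces to a single cosine distance between two 3072-dimensional vectors. A minimal numpy helper (the function name is ours) makes the operation explicit:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors in the shared space."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors sit at distance 0; orthogonal vectors at distance 1.
v = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.0])
print(cosine_distance(v, v))  # 0.0
print(cosine_distance(v, w))  # 1.0
```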
Finding 1: Audio captures speech, not acoustics
For speech-heavy content (policy briefings, floor speeches, constituent reports), the audio embedding is functionally redundant with the transcript embedding.
```python
# Cosine distance between audio and transcript embeddings
# for speech-heavy Shorts (pilot v2, 8 videos with transcripts)
for vid_id in speech_heavy_ids:
    vec_audio = embeddings["audio"][vid_id]
    vec_transcript = embeddings["text_transcript"][vid_id]
    dist = 1 - cosine_similarity(
        vec_audio.reshape(1, -1),
        vec_transcript.reshape(1, -1),
    )[0, 0]
    print(f"  {vid_id}: cosine_distance = {dist:.3f}")
# Typical range: 0.11 - 0.20
```

Cosine distance (audio vs. transcript) for speech-heavy Shorts: 0.11 - 0.20
This is expected. Google’s documentation states that audio embedding support is “optimized for speech.” When a Short consists primarily of a legislator talking, the audio encoder extracts the same semantic content as the transcript.
If ~70-80% of legislative Shorts are speech-heavy, audio embedding adds little marginal information over transcript embedding for the majority of the corpus. The $360 cost of embedding all 51K audio tracks can be redirected to a targeted multimodal subsample.
Finding 2: Audio diverges for non-speech content
For meme/humor Shorts with music, sound effects, and minimal speech, audio and transcript embeddings diverge sharply.
```python
# Same comparison for meme/music-heavy Shorts
for vid_id in meme_ids:
    vec_audio = embeddings["audio"][vid_id]
    vec_transcript = embeddings["text_transcript"][vid_id]
    dist = 1 - cosine_similarity(
        vec_audio.reshape(1, -1),
        vec_transcript.reshape(1, -1),
    )[0, 0]
    print(f"  {vid_id}: cosine_distance = {dist:.3f}")
# Typical range: 0.40 - 0.43
```

Cosine distance (audio vs. transcript) for meme Shorts: 0.40 - 0.43
When speech is absent, the audio encoder captures non-verbal information (music genre, sound effects, ambient noise) that the transcript cannot represent. This divergence is itself informative: it flags content where the auditory channel carries meaning independent of words.
Finding 3: Video channel provides the largest marginal information gain
Cross-strategy distances for the same Short reveal that the video channel adds the most new information beyond text.
```python
# Cross-strategy cosine distances (one multimodal Short, three strategy pairs)
pairs = [
    ("text_transcript", "audio"),
    ("text_transcript", "multimodal"),
    ("audio", "multimodal"),
]
print("Cross-strategy cosine distances:")
print("-" * 50)
for s1, s2 in pairs:
    vec1 = embeddings[s1][example_vid]
    vec2 = embeddings[s2][example_vid]
    dist = 1 - cosine_similarity(
        vec1.reshape(1, -1), vec2.reshape(1, -1)
    )[0, 0]
    print(f"  {s1:20s} vs {s2:20s}: {dist:.3f}")
```

| Strategy pair | Cosine distance |
|---|---|
| text_transcript vs. audio | 0.202 |
| text_transcript vs. multimodal | 0.478 |
| audio vs. multimodal | 0.432 |
The multimodal embedding (which includes the video channel) is far from both text and audio, confirming that visual content carries substantial semantic information not captured by speech alone. This validates the multimodal approach for our research.
The video channel encodes visual elements such as setting (office vs. rally vs. studio), graphics and text overlays, editing style (jump cuts vs. static frame), gestures and physical framing, and the presence of other people. These are precisely the elements that distinguish performative from deliberative political communication.
Finding 4: Title-based party signal is artifactual
The most surprising finding: party clustering depends entirely on which text input is used.
```python
# Within-party vs. between-party similarity gap
for strategy in ["text_title", "text_transcript"]:
    vecs = np.stack([embeddings[strategy][v] for v in all_vids])
    sim = cosine_similarity(vecs)
    same, diff = [], []
    for i in range(len(all_vids)):
        for j in range(i + 1, len(all_vids)):
            val = sim[i, j]
            if parties[i] == parties[j]:
                same.append(val)
            else:
                diff.append(val)
    gap = np.mean(same) - np.mean(diff)
    print(f"  {strategy:20s}: party gap = {gap:+.3f}")
```

| Strategy | Same-party vs. cross-party similarity gap |
|---|---|
| text_title | +0.124 (strong party signal) |
| text_transcript | -0.018 (no party signal) |
The title-based signal is driven by politician names and party hashtags embedded in YouTube titles (e.g., “#DemocraticParty”, “#PPP”), not by semantic content. When we use full speech transcripts, the party signal vanishes entirely.
YouTube metadata (titles, descriptions, tags) reflects self-labeling, not content. Any embedding strategy that relies on metadata will capture identity markers rather than communication style. For studying how politicians actually communicate, transcript-based embedding is the correct baseline.
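One direct way to probe the artifact claim is to strip party hashtags (and, given a name list, politician names) from titles before embedding and check whether the title-based party gap collapses. A minimal sketch; the marker list and regex are our assumptions, not part of the pilot pipeline:

```python
import re

# Hypothetical marker list; extend with real party hashtags and names.
PARTY_MARKERS = ["#DemocraticParty", "#PPP"]

def strip_identity_markers(title: str) -> str:
    """Remove party hashtags so only semantic content remains."""
    for marker in PARTY_MARKERS:
        title = title.replace(marker, "")
    title = re.sub(r"#\S+", "", title)          # drop any remaining hashtags
    return re.sub(r"\s+", " ", title).strip()   # collapse leftover whitespace

print(strip_identity_markers("Floor speech highlights #DemocraticParty #shorts"))
# Floor speech highlights
```

If the party gap survives on cleaned titles, the signal is semantic; if it vanishes, it was metadata self-labeling.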
The Visual-Verbal Gap (VVG)
Concept
The pilot findings motivate a new measure. If transcript embeddings capture what politicians say and video-only embeddings capture what politicians show, the distance between them measures the degree of visual-verbal divergence.
\[ \text{VVG}_i = 1 - \cos(\mathbf{e}^{\text{transcript}}_i, \mathbf{e}^{\text{video\_only}}_i) \]
where \(\mathbf{e}^{\text{transcript}}_i\) is the Gemini Embedding 2 vector from the Whisper transcript and \(\mathbf{e}^{\text{video\_only}}_i\) is the embedding from the muted video (audio track stripped via ffmpeg -an).
Interpretation
| VVG value | Example | Meaning |
|---|---|---|
| Low (~0.1-0.2) | Legislator reading policy brief at desk | Visuals match words |
| Medium (~0.3-0.4) | Interview with graphics overlaid | Some visual divergence |
| High (~0.5+) | Meme video with music and rapid editing | Visuals diverge from words |
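For descriptive summaries, the bands above can be applied as a simple coding rule. The cutoffs 0.3 and 0.5 are rough boundaries read off the table, not calibrated thresholds:

```python
def vvg_band(vvg: float) -> str:
    """Map a raw VVG score onto the low/medium/high bands above."""
    if vvg < 0.3:
        return "low"      # visuals match words
    elif vvg < 0.5:
        return "medium"   # some visual divergence
    return "high"         # visuals diverge from words

print([vvg_band(x) for x in (0.15, 0.35, 0.55)])  # ['low', 'medium', 'high']
```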
Why video_only, not multimodal?
Finding 2 showed that audio and transcript overlap for speech-heavy content but diverge for meme content. Using multimodal (which includes audio) would conflate audio divergence with visual divergence. By stripping the audio track, video_only isolates the visual channel.
```bash
# Strip audio to create video-only input
ffmpeg -i input.mp4 -an -c:v copy video_only.mp4
```

```python
import subprocess
from pathlib import Path

import numpy as np
from google.genai import types
from sklearn.metrics.pairwise import cosine_similarity


def compute_vvg(client, mp4_path: str, transcript: str) -> float:
    """Compute Visual-Verbal Gap for a single Short."""
    # 1. Embed transcript (text only)
    result_text = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=transcript,
        config=types.EmbedContentConfig(
            task_type="SEMANTIC_SIMILARITY"
        ),
    )
    vec_text = np.array(result_text.embeddings[0].values)

    # 2. Create video-only version (strip audio)
    video_only = Path(mp4_path).with_name(
        Path(mp4_path).stem + "_silent.mp4"
    )
    subprocess.run([
        "ffmpeg", "-i", mp4_path,
        "-an", "-c:v", "copy", str(video_only),
        "-y", "-loglevel", "error",
    ], check=True)

    # 3. Embed video-only
    video_uri = upload_and_wait(client, str(video_only))
    result_video = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=types.Content(
            parts=[types.Part.from_uri(
                file_uri=video_uri, mime_type="video/mp4"
            )]
        ),
    )
    vec_video = np.array(result_video.embeddings[0].values)

    # 4. Cosine distance
    sim = cosine_similarity(
        vec_text.reshape(1, -1), vec_video.reshape(1, -1)
    )[0, 0]
    return 1 - sim
```

Hypotheses
The VVG measure generates four testable hypotheses about political communication on YouTube Shorts.
HVVG1: Opposition legislators have higher VVG
Opposition legislators lack institutional platforms (no government press conferences, no ministerial briefings). They face stronger incentives to adopt performative visual strategies that diverge from verbal substance.
\[ \text{VVG}_{it} = \alpha + \beta_1 \cdot \text{Opposition}_{it} + \gamma \mathbf{X}_{it} + \delta_t + \mu_i + \epsilon_{it} \]
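The two-way fixed-effects coefficient \(\beta_1\) can be estimated by demeaning VVG and the Opposition dummy within legislator and within month, then regressing residuals on residuals. A numpy sketch under our own assumptions (sequential demeaning is exact only for balanced panels; `statsmodels` or `linearmodels` would be the production choice):

```python
import numpy as np

def within_demean(y, groups):
    """Subtract group means from y (one-way fixed-effect transformation)."""
    y = np.asarray(y, dtype=float)
    out = y.copy()
    for g in np.unique(groups):
        mask = np.asarray(groups) == g
        out[mask] -= y[mask].mean()
    return out

def fe_ols(vvg, opposition, legislator_ids, month_ids):
    """OLS of doubly demeaned VVG on doubly demeaned Opposition -> beta_1."""
    y = within_demean(within_demean(vvg, legislator_ids), month_ids)
    x = within_demean(within_demean(opposition, legislator_ids), month_ids)
    return float(x @ y / (x @ x))

# Tiny balanced panel: 2 legislators x 2 months, true beta_1 = 2
beta = fe_ols([0, 2, 7, 5], [0, 1, 1, 0], ["A", "A", "B", "B"], [1, 2, 1, 2])
print(beta)  # 2.0
```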
HVVG2: VVG increases near elections
Electoral proximity intensifies the incentive to maximize reach over deliberation. As elections approach, legislators shift toward attention-grabbing visual content while verbal platforms remain constrained by policy discussion.
Test: Legislator-by-month panel with months coded by distance to the April 2024 general election.
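The month coding can be sketched as a signed distance to the election month; the sign convention (negative before, positive after) is ours:

```python
from datetime import date

ELECTION = date(2024, 4, 10)  # 22nd general election, April 2024

def months_to_election(year: int, month: int) -> int:
    """Signed month distance from a panel month to the election month."""
    return (year - ELECTION.year) * 12 + (month - ELECTION.month)

print(months_to_election(2024, 1))  # -3 (three months before)
print(months_to_election(2024, 4))  # 0 (election month)
print(months_to_election(2024, 7))  # 3 (three months after)
```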
HVVG3: Higher VVG predicts higher engagement
Platform algorithms reward visually engaging content. If VVG captures visual performance, it should predict views, controlling for content type and channel size.
\[ \ln(\text{views}_{i}) = \alpha + \beta_1 \cdot \text{VVG}_{i} + \beta_2 \cdot \text{ContentType}_{i} + \mu_{\text{channel}} + \epsilon_{i} \]
HVVG4: VVG varies systematically by content type
If VVG is a valid measure, it should align with face-valid expectations:
- Meme/humor Shorts: high VVG (music, rapid editing, text overlays divorced from speech)
- Policy briefings: low VVG (congruent visuals and words)
- Attack/campaign Shorts: medium-high VVG (visual framing often diverges from spoken argument)
This hypothesis serves as construct validation. If VVG does not distinguish content types, the measure is suspect.
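The construct check reduces to comparing mean VVG across hand-labeled content types. A minimal sketch assuming a list of (content_type, vvg) pairs; the example values are illustrative, drawn from the ranges in the interpretation table:

```python
from collections import defaultdict

def mean_vvg_by_type(records):
    """records: iterable of (content_type, vvg) pairs -> {type: mean VVG}."""
    sums = defaultdict(lambda: [0.0, 0])
    for ctype, vvg in records:
        sums[ctype][0] += vvg
        sums[ctype][1] += 1
    return {c: s / n for c, (s, n) in sums.items()}

# Face-validity check: meme mean should exceed policy-briefing mean.
example = [("meme", 0.52), ("meme", 0.48), ("policy", 0.15), ("policy", 0.18)]
means = mean_vvg_by_type(example)
print(means["meme"] > means["policy"])  # True
```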
Validity concerns
Modality gap
Liang et al. (2022) document a systematic displacement between modalities in contrastive embedding spaces. Cross-modal distances may partly reflect architectural features rather than semantic content.
Planned validation:
- Baseline calibration: Compute VVG for “talking head” Shorts where visual and verbal content are maximally aligned. If VVG is still high, the gap is architectural noise.
- Human judgment: Compare VVG rankings against research assistant ratings on a 50-video subsample.
- Within-modality control: If within-text similarity patterns match within-multimodal patterns, the modality gap is additive (shifts distances uniformly) and does not affect rankings.
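The human-judgment check can be operationalized as a rank correlation between VVG scores and RA ratings on the 50-video subsample. A self-contained Spearman sketch (ties get average ranks; `scipy.stats.spearmanr` would be the production choice):

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([0.1, 0.3, 0.5], [1, 2, 3]))  # 1.0
```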
Reference
- Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., & Zou, J. Y. (2022). Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. NeurIPS 2022. arXiv:2203.02053
Pilot checklist (updated)
Based on pilot results, the original checklist is now answered: