5. Pilot Results - What We Learned from 15 Videos

Published: March 26, 2026

Pilot design

We embedded 15 YouTube Shorts from Korean National Assembly legislators using four strategies, testing whether multimodal embedding captures information beyond text alone.

Sample composition

| Group | Videos | Purpose |
|---|---|---|
| High-engagement Shorts (DP 5, PPP 3, RPK 1, NRP 1) | 10 | Core pilot across parties |
| Meme / music-heavy Shorts | 5 | Stress-test non-speech content |
| **Total** | **15** | |

Embedding strategies

| Strategy | Input | Model |
|---|---|---|
| `text_title` | YouTube title + description | Gemini Embedding 2 |
| `text_transcript` | Whisper transcript (full speech) | Gemini Embedding 2 |
| `audio` | Extracted MP3 audio track | Gemini Embedding 2 |
| `multimodal` | Full MP4 video + audio + title text | Gemini Embedding 2 |

Each strategy produces a 3072-dimensional vector in the same shared embedding space, enabling direct cross-strategy comparison via cosine distance.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pilot embeddings (4 strategies x 15 videos x 3072 dimensions).
# Later snippets index these as embeddings[strategy][video_id].
data_v1 = np.load("outputs/pilot_embeddings.npz")
data_v2 = np.load("outputs/pilot_v2_embeddings.npz")

Finding 1: Audio captures speech, not acoustics

For speech-heavy content (policy briefings, floor speeches, constituent reports), the audio embedding is functionally redundant with the transcript embedding.

# Cosine distance between audio and transcript embeddings
# for speech-heavy Shorts (pilot v2, 8 videos with transcripts)

for vid_id in speech_heavy_ids:
    vec_audio = embeddings["audio"][vid_id]
    vec_transcript = embeddings["text_transcript"][vid_id]
    dist = 1 - cosine_similarity(
        vec_audio.reshape(1, -1),
        vec_transcript.reshape(1, -1)
    )[0, 0]
    print(f"  {vid_id}: cosine_distance = {dist:.3f}")

# Typical range: 0.11 - 0.20

Cosine distance (audio vs. transcript) for speech-heavy Shorts: 0.11 - 0.20

This is expected. Google’s documentation states that audio embedding support is “optimized for speech.” When a Short consists primarily of a legislator talking, the audio encoder extracts the same semantic content as the transcript.

Important: Implication for cost decisions

If ~70-80% of legislative Shorts are speech-heavy, audio embedding adds little marginal information over transcript embedding for the majority of the corpus. The $360 cost of embedding all 51K audio tracks can be redirected to a targeted multimodal subsample.
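The cost argument above is a back-of-envelope calculation. A minimal sketch, assuming the pilot's ~70-80% speech-heavy share generalizes to the full 51K corpus and using the $360 total from the text:

```python
# Marginal value of audio embedding for the full corpus (rough estimate).
TOTAL_SHORTS = 51_000
AUDIO_COST_TOTAL = 360.0  # USD, total cost of embedding all audio tracks
cost_per_track = AUDIO_COST_TOTAL / TOTAL_SHORTS

for speech_share in (0.70, 0.80):
    redundant = TOTAL_SHORTS * speech_share
    redirectable = redundant * cost_per_track
    print(f"speech share {speech_share:.0%}: "
          f"~{redundant:,.0f} redundant tracks, ~${redirectable:.0f} redirectable")
```

Even at the lower bound, roughly $250 of the audio budget buys information the transcript already carries.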

Finding 2: Audio diverges for non-speech content

For meme/humor Shorts with music, sound effects, and minimal speech, audio and transcript embeddings diverge sharply.

# Same comparison for meme/music-heavy Shorts
for vid_id in meme_ids:
    vec_audio = embeddings["audio"][vid_id]
    vec_transcript = embeddings["text_transcript"][vid_id]
    dist = 1 - cosine_similarity(
        vec_audio.reshape(1, -1),
        vec_transcript.reshape(1, -1)
    )[0, 0]
    print(f"  {vid_id}: cosine_distance = {dist:.3f}")

# Typical range: 0.40 - 0.43

Cosine distance (audio vs. transcript) for meme Shorts: 0.40 - 0.43

When speech is absent, the audio encoder captures non-verbal information (music genre, sound effects, ambient noise) that the transcript cannot represent. This divergence is itself informative: it flags content where the auditory channel carries meaning independent of words.
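Because the two observed distance ranges do not overlap (0.11-0.20 vs. 0.40-0.43), the audio-transcript distance can itself serve as a cheap classifier for non-speech content. A minimal sketch; the threshold and the example distances are illustrative, not pilot output:

```python
# Flag Shorts whose auditory channel diverges from the transcript.
# `distances` maps video_id -> audio-vs-transcript cosine distance
# (hypothetical values chosen to mirror the pilot's two ranges).
distances = {
    "speech_01": 0.14, "speech_02": 0.19,
    "meme_01": 0.41, "meme_02": 0.43,
}
DIVERGENCE_THRESHOLD = 0.30  # sits between the two observed pilot ranges

flagged = [vid for vid, d in distances.items() if d >= DIVERGENCE_THRESHOLD]
print(f"Non-speech candidates: {flagged}")
```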

Finding 3: Video channel provides the largest marginal information gain

Cross-strategy distances for the same Short reveal that the video channel adds the most new information beyond text.

# Cross-strategy cosine distances for a single multimodal Short
pairs = [
pairs = [
    ("text_transcript", "audio"),
    ("text_transcript", "multimodal"),
    ("audio", "multimodal"),
]

print("Cross-strategy cosine distances:")
print("-" * 50)
for s1, s2 in pairs:
    vec1 = embeddings[s1][example_vid]
    vec2 = embeddings[s2][example_vid]
    dist = 1 - cosine_similarity(
        vec1.reshape(1, -1), vec2.reshape(1, -1)
    )[0, 0]
    print(f"  {s1:20s} vs {s2:20s}: {dist:.3f}")

| Strategy pair | Cosine distance |
|---|---|
| text_transcript vs. audio | 0.202 |
| text_transcript vs. multimodal | 0.478 |
| audio vs. multimodal | 0.432 |

The multimodal embedding (which includes the video channel) is far from both text and audio, confirming that visual content carries substantial semantic information not captured by speech alone. This validates the multimodal approach for our research.

Tip: What the video channel captures

Visual elements such as setting (office vs. rally vs. studio), graphics and text overlays, editing style (jump cuts vs. static frame), gestures and physical framing, and the presence of other people. These are precisely the elements that distinguish performative from deliberative political communication.

Finding 4: Title-based party signal is artifactual

The most surprising finding: party clustering depends entirely on which text input is used.

# Within-party vs between-party similarity gap
for strategy in ["text_title", "text_transcript"]:
    vecs = np.stack([embeddings[strategy][v] for v in all_vids])
    sim = cosine_similarity(vecs)

    same, diff = [], []
    for i in range(len(all_vids)):
        for j in range(i + 1, len(all_vids)):
            val = sim[i, j]
            if parties[i] == parties[j]:
                same.append(val)
            else:
                diff.append(val)

    gap = np.mean(same) - np.mean(diff)
    print(f"  {strategy:20s}: party gap = {gap:+.3f}")

| Strategy | Same-party vs. cross-party similarity gap |
|---|---|
| text_title | +0.124 (strong party signal) |
| text_transcript | -0.018 (no party signal) |

The title-based signal is driven by politician names and party hashtags embedded in YouTube titles (e.g., “#DemocraticParty”, “#PPP”), not by semantic content. When we use full speech transcripts, the party signal vanishes entirely.
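One way to probe this artifact is to re-embed titles with hashtags and bracketed party tags stripped, so any remaining party signal must come from semantic content. A minimal sketch; the regex patterns and the example title are illustrative, not from the corpus:

```python
import re

# Remove hashtags and bracketed tags from a YouTube title before embedding.
HASHTAG = re.compile(r"#\S+")
BRACKET_TAG = re.compile(r"\[[^\]]*\]")

def clean_title(title: str) -> str:
    title = HASHTAG.sub("", title)
    title = BRACKET_TAG.sub("", title)
    return " ".join(title.split())  # collapse leftover whitespace

print(clean_title("[DP] Budget hearing highlights #DemocraticParty #shorts"))
# -> "Budget hearing highlights"
```

If the title-based party gap collapses after this cleaning, the +0.124 gap is confirmed as self-labeling rather than content.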

Important: Methodological lesson

YouTube metadata (titles, descriptions, tags) reflects self-labeling, not content. Any embedding strategy that relies on metadata will capture identity markers rather than communication style. For studying how politicians actually communicate, transcript-based embedding is the correct baseline.

The Visual-Verbal Gap (VVG)

Concept

The pilot findings motivate a new measure. If transcript embeddings capture what politicians say and video-only embeddings capture what politicians show, the distance between them measures the degree of visual-verbal divergence.

\[ \text{VVG}_i = 1 - \cos(\mathbf{e}^{\text{transcript}}_i, \mathbf{e}^{\text{video\_only}}_i) \]

where \(\mathbf{e}^{\text{transcript}}_i\) is the Gemini Embedding 2 vector from the Whisper transcript and \(\mathbf{e}^{\text{video\_only}}_i\) is the embedding from the muted video (audio track stripped via ffmpeg -an).

Interpretation

| VVG value | Example | Meaning |
|---|---|---|
| Low (~0.1-0.2) | Legislator reading a policy brief at a desk | Visuals match words |
| Medium (~0.3-0.4) | Interview with graphics overlaid | Some visual divergence |
| High (~0.5+) | Meme video with music and rapid editing | Visuals diverge from words |
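The interpretive bands above can be operationalized as a simple lookup for descriptive tabulations. A minimal sketch; the cutoffs (0.25, 0.45) are illustrative midpoints between the bands, not calibrated values:

```python
def vvg_band(vvg: float) -> str:
    """Map a VVG score onto the interpretive bands in the table above."""
    if vvg < 0.25:
        return "low: visuals match words"
    if vvg < 0.45:
        return "medium: some visual divergence"
    return "high: visuals diverge from words"

for score in (0.15, 0.35, 0.55):
    print(f"VVG={score:.2f} -> {vvg_band(score)}")
```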

Why video_only, not multimodal?

Finding 2 showed that audio and transcript overlap for speech-heavy content but diverge for meme content. Using multimodal (which includes audio) would conflate audio divergence with visual divergence. By stripping the audio track, video_only isolates the visual channel.

# Strip audio to create video-only input
ffmpeg -i input.mp4 -an -c:v copy video_only.mp4

import subprocess
from pathlib import Path

import numpy as np
from google.genai import types
from sklearn.metrics.pairwise import cosine_similarity

def compute_vvg(client, mp4_path: str, transcript: str) -> float:
    """Compute the Visual-Verbal Gap for a single Short."""
    # 1. Embed transcript (text only)
    result_text = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=transcript,
        config=types.EmbedContentConfig(
            task_type="SEMANTIC_SIMILARITY"
        )
    )
    vec_text = np.array(result_text.embeddings[0].values)

    # 2. Create video-only version (strip audio)
    video_only = Path(mp4_path).with_name(
        Path(mp4_path).stem + "_silent.mp4"
    )
    subprocess.run([
        "ffmpeg", "-i", mp4_path,
        "-an", "-c:v", "copy", str(video_only),
        "-y", "-loglevel", "error"
    ], check=True)

    # 3. Embed video-only
    video_uri = upload_and_wait(client, str(video_only))
    result_video = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=types.Content(
            parts=[types.Part.from_uri(
                file_uri=video_uri, mime_type="video/mp4"
            )]
        )
    )
    vec_video = np.array(result_video.embeddings[0].values)

    # 4. Cosine distance
    sim = cosine_similarity(
        vec_text.reshape(1, -1), vec_video.reshape(1, -1)
    )[0, 0]
    return 1 - sim

Hypotheses

The VVG measure generates four testable hypotheses about political communication on YouTube Shorts.

HVVG1: Opposition legislators have higher VVG

Opposition legislators lack institutional platforms (no government press conferences, no ministerial briefings). They face stronger incentives to adopt performative visual strategies that diverge from verbal substance.

\[ \text{VVG}_{it} = \alpha + \beta_1 \cdot \text{Opposition}_{it} + \gamma \mathbf{X}_{it} + \delta_t + \mu_i + \epsilon_{it} \]

HVVG2: VVG increases near elections

Electoral proximity intensifies the incentive to maximize reach over deliberation. As elections approach, legislators shift toward attention-grabbing visual content while verbal platforms remain constrained by policy discussion.

Test: Legislator-by-month panel with months coded by distance to the April 2024 general election.
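The panel coding is a signed month count relative to the election, held April 10, 2024. A minimal sketch of the coding function (the function name is illustrative):

```python
from datetime import date

# Code each legislator-month by signed distance to the April 2024 general
# election; negative = before the election, positive = after.
ELECTION = date(2024, 4, 10)

def months_to_election(year: int, month: int) -> int:
    return (year - ELECTION.year) * 12 + (month - ELECTION.month)

for ym in [(2023, 10), (2024, 3), (2024, 4), (2024, 7)]:
    print(f"{ym[0]}-{ym[1]:02d}: {months_to_election(*ym):+d} months")
```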

HVVG3: Higher VVG predicts higher engagement

Platform algorithms reward visually engaging content. If VVG captures visual performance, it should predict views, controlling for content type and channel size.

\[ \ln(\text{views}_{i}) = \alpha + \beta_1 \cdot \text{VVG}_{i} + \beta_2 \cdot \text{ContentType}_{i} + \mu_{\text{channel}} + \epsilon_{i} \]

HVVG4: VVG varies systematically by content type

If VVG is a valid measure, it should align with face-valid expectations:

  • Meme/humor Shorts: high VVG (music, rapid editing, text overlays divorced from speech)
  • Policy briefings: low VVG (congruent visuals and words)
  • Attack/campaign Shorts: medium-high VVG (visual framing often diverges from spoken argument)

This hypothesis serves as construct validation. If VVG does not distinguish content types, the measure is suspect.
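The construct-validation test reduces to comparing mean VVG across hand-labeled content types. A minimal sketch; the VVG scores below are hypothetical placeholders, not pilot output:

```python
from statistics import mean

# Compare mean VVG across content types against the face-valid ordering:
# meme > attack > policy.
vvg_by_type = {
    "meme":   [0.52, 0.58, 0.49],
    "policy": [0.14, 0.18, 0.12],
    "attack": [0.38, 0.44],
}

means = {t: mean(vals) for t, vals in vvg_by_type.items()}
ranked = sorted(means, key=means.get, reverse=True)
print("Mean VVG by content type:", {t: round(means[t], 3) for t in ranked})

# Face-validity check
assert ranked == ["meme", "attack", "policy"]
```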

Validity concerns

Modality gap

Liang et al. (2022) document a systematic displacement between modalities in contrastive embedding spaces. Cross-modal distances may partly reflect architectural features rather than semantic content.

Planned validation:

  1. Baseline calibration: Compute VVG for “talking head” Shorts where visual and verbal content are maximally aligned. If VVG is still high, the gap is architectural noise.
  2. Human judgment: Compare VVG rankings against research assistant ratings on a 50-video subsample.
  3. Within-modality control: If within-text similarity patterns match within-multimodal patterns, the modality gap is additive (shifts distances uniformly) and does not affect rankings.
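The logic of the third check can be made concrete: if the modality gap only shifts distances by a constant, pairwise rankings are unchanged. A minimal sketch, with hypothetical similarity values and a uniform shift standing in for the gap:

```python
import numpy as np

# Additive-gap check: a constant shift in similarities leaves rankings intact.
sim_text       = np.array([0.82, 0.61, 0.74, 0.55])  # within-text pair sims
sim_multimodal = sim_text - 0.20                      # uniformly shifted copy

def ranks(x: np.ndarray) -> np.ndarray:
    """Rank positions of each element (0 = smallest)."""
    return np.argsort(np.argsort(x))

rank_agreement = np.all(ranks(sim_text) == ranks(sim_multimodal))
print(f"Rankings preserved under uniform shift: {rank_agreement}")
```

In the real check, `sim_multimodal` would be the observed within-multimodal similarities; disagreement in rankings would indicate the gap is not merely additive.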

Reference

  • Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., & Zou, J. Y. (2022). Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. NeurIPS 2022. arXiv:2203.02053

Pilot checklist (updated)

Based on pilot results, the original checklist is now answered: