5. Pilot Results - What We Learned from 15 Videos

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pilot embeddings (4 strategies x 15 videos x 3072 dimensions)
data_v1 = np.load("outputs/pilot_embeddings.npz")
data_v2 = np.load("outputs/pilot_v2_embeddings.npz")
```
Pilot design
We embedded 15 YouTube Shorts from Korean National Assembly legislators using four strategies, testing whether multimodal embedding captures information beyond text alone.
Sample composition
| Group | Videos | Purpose |
|---|---|---|
| High-engagement Shorts (DP 5, PPP 3, RPK 1, NRP 1) | 10 | Core pilot across parties |
| Meme / music-heavy Shorts | 5 | Stress-test non-speech content |
| Total | 15 | |
Embedding strategies
| Strategy | Input | Model |
|---|---|---|
| text_title | YouTube title + description | Gemini Embedding 2 |
| text_transcript | Whisper transcript (full speech) | Gemini Embedding 2 |
| audio | Extracted MP3 audio track | Gemini Embedding 2 |
| multimodal | Full MP4 video + audio + title text | Gemini Embedding 2 |
Each strategy produces a 3072-dimensional vector in the same shared embedding space, enabling direct cross-strategy comparison via cosine distance.
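Because all strategies live in one shared space, every comparison below reduces to a single cosine distance between two 3072-dimensional vectors. A minimal numpy helper (the function name is ours) makes the operation explicit:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors in the shared space."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors sit at distance 0; orthogonal vectors at distance 1.
v = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.0])
print(cosine_distance(v, v))  # 0.0
print(cosine_distance(v, w))  # 1.0
```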
Finding 1: Audio captures speech, not acoustics
For speech-heavy content (policy briefings, floor speeches, constituent reports), the audio embedding is functionally redundant with the transcript embedding.
```python
# Cosine distance between audio and transcript embeddings
# for speech-heavy Shorts (pilot v2, 8 videos with transcripts)
for vid_id in speech_heavy_ids:
    vec_audio = embeddings["audio"][vid_id]
    vec_transcript = embeddings["text_transcript"][vid_id]
    dist = 1 - cosine_similarity(
        vec_audio.reshape(1, -1),
        vec_transcript.reshape(1, -1),
    )[0, 0]
    print(f"  {vid_id}: cosine_distance = {dist:.3f}")
# Typical range: 0.11 - 0.20
```

Cosine distance (audio vs. transcript) for speech-heavy Shorts: 0.11 - 0.20
This is expected. Google’s documentation states that audio embedding support is “optimized for speech.” When a Short consists primarily of a legislator talking, the audio encoder extracts the same semantic content as the transcript.
If ~70-80% of legislative Shorts are speech-heavy, audio embedding adds little marginal information over transcript embedding for the majority of the corpus. The $360 cost of embedding all 51K audio tracks can be redirected to a targeted multimodal subsample.
Finding 2: Audio diverges for non-speech content
For meme/humor Shorts with music, sound effects, and minimal speech, audio and transcript embeddings diverge sharply.
```python
# Same comparison for meme/music-heavy Shorts
for vid_id in meme_ids:
    vec_audio = embeddings["audio"][vid_id]
    vec_transcript = embeddings["text_transcript"][vid_id]
    dist = 1 - cosine_similarity(
        vec_audio.reshape(1, -1),
        vec_transcript.reshape(1, -1),
    )[0, 0]
    print(f"  {vid_id}: cosine_distance = {dist:.3f}")
# Typical range: 0.40 - 0.43
```

Cosine distance (audio vs. transcript) for meme Shorts: 0.40 - 0.43
When speech is absent, the audio encoder captures non-verbal information (music genre, sound effects, ambient noise) that the transcript cannot represent. This divergence is itself informative: it flags content where the auditory channel carries meaning independent of words.
Finding 3: Video channel provides the largest marginal information gain
Cross-strategy distances for the same Short reveal that the video channel adds the most new information beyond text.
```python
# Cross-strategy cosine distances (one multimodal Short, three strategy pairs)
pairs = [
    ("text_transcript", "audio"),
    ("text_transcript", "multimodal"),
    ("audio", "multimodal"),
]
print("Cross-strategy cosine distances:")
print("-" * 50)
for s1, s2 in pairs:
    vec1 = embeddings[s1][example_vid]
    vec2 = embeddings[s2][example_vid]
    dist = 1 - cosine_similarity(
        vec1.reshape(1, -1), vec2.reshape(1, -1)
    )[0, 0]
    print(f"  {s1:20s} vs {s2:20s}: {dist:.3f}")
```

| Strategy pair | Cosine distance |
|---|---|
| text_transcript vs. audio | 0.202 |
| text_transcript vs. multimodal | 0.478 |
| audio vs. multimodal | 0.432 |
The multimodal embedding (which includes the video channel) is far from both text and audio, confirming that visual content carries substantial semantic information not captured by speech alone. This validates the multimodal approach for our research.
The video channel encodes visual elements such as setting (office vs. rally vs. studio), graphics and text overlays, editing style (jump cuts vs. static frame), gestures and physical framing, and the presence of other people. These are precisely the elements that distinguish performative from deliberative political communication.
Finding 4: Title-based party signal is artifactual
The most surprising finding: party clustering depends entirely on which text input is used.
```python
# Within-party vs. between-party similarity gap
for strategy in ["text_title", "text_transcript"]:
    vecs = np.stack([embeddings[strategy][v] for v in all_vids])
    sim = cosine_similarity(vecs)
    same, diff = [], []
    for i in range(len(all_vids)):
        for j in range(i + 1, len(all_vids)):
            val = sim[i, j]
            if parties[i] == parties[j]:
                same.append(val)
            else:
                diff.append(val)
    gap = np.mean(same) - np.mean(diff)
    print(f"  {strategy:20s}: party gap = {gap:+.3f}")
```

| Strategy | Same-party vs. cross-party similarity gap |
|---|---|
| text_title | +0.124 (strong party signal) |
| text_transcript | -0.018 (no party signal) |
The title-based signal is driven by politician names and party hashtags embedded in YouTube titles (e.g., “#DemocraticParty”, “#PPP”), not by semantic content. When we use full speech transcripts, the party signal vanishes entirely.
YouTube metadata (titles, descriptions, tags) reflects self-labeling, not content. Any embedding strategy that relies on metadata will capture identity markers rather than communication style. For studying how politicians actually communicate, transcript-based embedding is the correct baseline.
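One direct way to probe the artifact claim is to strip party hashtags (and, given a name list, politician names) from titles before embedding and check whether the title-based party gap collapses. A minimal sketch; the marker list and regex are our assumptions, not part of the pilot pipeline:

```python
import re

# Hypothetical marker list; extend with real party hashtags and names.
PARTY_MARKERS = ["#DemocraticParty", "#PPP"]

def strip_identity_markers(title: str) -> str:
    """Remove party hashtags so only semantic content remains."""
    for marker in PARTY_MARKERS:
        title = title.replace(marker, "")
    title = re.sub(r"#\S+", "", title)          # drop any remaining hashtags
    return re.sub(r"\s+", " ", title).strip()   # collapse leftover whitespace

print(strip_identity_markers("Floor speech highlights #DemocraticParty #shorts"))
# Floor speech highlights
```

If the party gap survives on cleaned titles, the signal is semantic; if it vanishes, it was metadata self-labeling.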
The Visual-Verbal Gap (VVG)
Concept
The pilot findings motivate a new measure. If transcript embeddings capture what politicians say and video-only embeddings capture what politicians show, the distance between them measures the degree of visual-verbal divergence.
\[ \text{VVG}_i = 1 - \cos(\mathbf{e}^{\text{transcript}}_i, \mathbf{e}^{\text{video\_only}}_i) \]
where \(\mathbf{e}^{\text{transcript}}_i\) is the Gemini Embedding 2 vector from the Whisper transcript and \(\mathbf{e}^{\text{video\_only}}_i\) is the embedding from the muted video (audio track stripped via ffmpeg -an).
Interpretation
| VVG value | Example | Meaning |
|---|---|---|
| Low (~0.1-0.2) | Legislator reading policy brief at desk | Visuals match words |
| Medium (~0.3-0.4) | Interview with graphics overlaid | Some visual divergence |
| High (~0.5+) | Meme video with music and rapid editing | Visuals diverge from words |
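For descriptive summaries, the bands above can be applied as a simple coding rule. The cutoffs 0.3 and 0.5 are rough boundaries read off the table, not calibrated thresholds:

```python
def vvg_band(vvg: float) -> str:
    """Map a raw VVG score onto the low/medium/high bands above."""
    if vvg < 0.3:
        return "low"      # visuals match words
    elif vvg < 0.5:
        return "medium"   # some visual divergence
    return "high"         # visuals diverge from words

print([vvg_band(x) for x in (0.15, 0.35, 0.55)])  # ['low', 'medium', 'high']
```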
Why video_only, not multimodal?
Finding 2 showed that audio and transcript overlap for speech-heavy content but diverge for meme content. Using multimodal (which includes audio) would conflate audio divergence with visual divergence. By stripping the audio track, video_only isolates the visual channel.
```bash
# Strip audio to create video-only input
ffmpeg -i input.mp4 -an -c:v copy video_only.mp4
```

```python
import subprocess
from pathlib import Path

import numpy as np
from google.genai import types
from sklearn.metrics.pairwise import cosine_similarity


def compute_vvg(client, mp4_path: str, transcript: str) -> float:
    """Compute Visual-Verbal Gap for a single Short."""
    # 1. Embed transcript (text only)
    result_text = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=transcript,
        config=types.EmbedContentConfig(
            task_type="SEMANTIC_SIMILARITY"
        ),
    )
    vec_text = np.array(result_text.embeddings[0].values)

    # 2. Create video-only version (strip audio)
    video_only = Path(mp4_path).with_name(
        Path(mp4_path).stem + "_silent.mp4"
    )
    subprocess.run([
        "ffmpeg", "-i", mp4_path,
        "-an", "-c:v", "copy", str(video_only),
        "-y", "-loglevel", "error",
    ], check=True)

    # 3. Embed video-only
    video_uri = upload_and_wait(client, str(video_only))
    result_video = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=types.Content(
            parts=[types.Part.from_uri(
                file_uri=video_uri, mime_type="video/mp4"
            )]
        ),
    )
    vec_video = np.array(result_video.embeddings[0].values)

    # 4. Cosine distance
    sim = cosine_similarity(
        vec_text.reshape(1, -1), vec_video.reshape(1, -1)
    )[0, 0]
    return 1 - sim
```

Hypotheses
The VVG measure generates four testable hypotheses about political communication on YouTube Shorts.
HVVG1: Opposition legislators have higher VVG
Opposition legislators lack institutional platforms (no government press conferences, no ministerial briefings). They face stronger incentives to adopt performative visual strategies that diverge from verbal substance.
\[ \text{VVG}_{it} = \alpha + \beta_1 \cdot \text{Opposition}_{it} + \gamma \mathbf{X}_{it} + \delta_t + \mu_i + \epsilon_{it} \]
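The two-way fixed-effects coefficient \(\beta_1\) can be estimated by demeaning VVG and the Opposition dummy within legislator and within month, then regressing residuals on residuals. A numpy sketch under our own assumptions (sequential demeaning is exact only for balanced panels; `statsmodels` or `linearmodels` would be the production choice):

```python
import numpy as np

def within_demean(y, groups):
    """Subtract group means from y (one-way fixed-effect transformation)."""
    y = np.asarray(y, dtype=float)
    out = y.copy()
    for g in np.unique(groups):
        mask = np.asarray(groups) == g
        out[mask] -= y[mask].mean()
    return out

def fe_ols(vvg, opposition, legislator_ids, month_ids):
    """OLS of doubly demeaned VVG on doubly demeaned Opposition -> beta_1."""
    y = within_demean(within_demean(vvg, legislator_ids), month_ids)
    x = within_demean(within_demean(opposition, legislator_ids), month_ids)
    return float(x @ y / (x @ x))

# Tiny balanced panel: 2 legislators x 2 months, true beta_1 = 2
beta = fe_ols([0, 2, 7, 5], [0, 1, 1, 0], ["A", "A", "B", "B"], [1, 2, 1, 2])
print(beta)  # 2.0
```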
HVVG2: VVG increases near elections
Electoral proximity intensifies the incentive to maximize reach over deliberation. As elections approach, legislators shift toward attention-grabbing visual content while verbal platforms remain constrained by policy discussion.
Test: Legislator-by-month panel with months coded by distance to the April 2024 general election.
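The month coding can be sketched as a signed distance to the election month; the sign convention (negative before, positive after) is ours:

```python
from datetime import date

ELECTION = date(2024, 4, 10)  # 22nd general election, April 2024

def months_to_election(year: int, month: int) -> int:
    """Signed month distance from a panel month to the election month."""
    return (year - ELECTION.year) * 12 + (month - ELECTION.month)

print(months_to_election(2024, 1))  # -3 (three months before)
print(months_to_election(2024, 4))  # 0 (election month)
print(months_to_election(2024, 7))  # 3 (three months after)
```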
HVVG3: Higher VVG predicts higher engagement
Platform algorithms reward visually engaging content. If VVG captures visual performance, it should predict views, controlling for content type and channel size.
\[ \ln(\text{views}_{i}) = \alpha + \beta_1 \cdot \text{VVG}_{i} + \beta_2 \cdot \text{ContentType}_{i} + \mu_{\text{channel}} + \epsilon_{i} \]
HVVG4: VVG varies systematically by content type
If VVG is a valid measure, it should align with face-valid expectations:
- Meme/humor Shorts: high VVG (music, rapid editing, text overlays divorced from speech)
- Policy briefings: low VVG (congruent visuals and words)
- Attack/campaign Shorts: medium-high VVG (visual framing often diverges from spoken argument)
This hypothesis serves as construct validation. If VVG does not distinguish content types, the measure is suspect.
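The construct check reduces to comparing mean VVG across hand-labeled content types. A minimal sketch assuming a list of (content_type, vvg) pairs; the example values are illustrative, drawn from the ranges in the interpretation table:

```python
from collections import defaultdict

def mean_vvg_by_type(records):
    """records: iterable of (content_type, vvg) pairs -> {type: mean VVG}."""
    sums = defaultdict(lambda: [0.0, 0])
    for ctype, vvg in records:
        sums[ctype][0] += vvg
        sums[ctype][1] += 1
    return {c: s / n for c, (s, n) in sums.items()}

# Face-validity check: meme mean should exceed policy-briefing mean.
example = [("meme", 0.52), ("meme", 0.48), ("policy", 0.15), ("policy", 0.18)]
means = mean_vvg_by_type(example)
print(means["meme"] > means["policy"])  # True
```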
Validity concerns
Modality gap
Liang et al. (2022) document a systematic displacement between modalities in contrastive embedding spaces. Cross-modal distances may partly reflect architectural features rather than semantic content.
Planned validation:
- Baseline calibration: Compute VVG for “talking head” Shorts where visual and verbal content are maximally aligned. If VVG is still high, the gap is architectural noise.
- Human judgment: Compare VVG rankings against research assistant ratings on a 50-video subsample.
- Within-modality control: If within-text similarity patterns match within-multimodal patterns, the modality gap is additive (shifts distances uniformly) and does not affect rankings.
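The human-judgment check can be operationalized as a rank correlation between VVG scores and RA ratings on the 50-video subsample. A self-contained Spearman sketch (ties get average ranks; `scipy.stats.spearmanr` would be the production choice):

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([0.1, 0.3, 0.5], [1, 2, 3]))  # 1.0
```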
Reference
- Liang, V. W., Zhang, Y., Kwon, Y., Yeung, S., & Zou, J. Y. (2022). Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. NeurIPS 2022. arXiv:2203.02053
Pilot checklist (updated)
Based on pilot results, the original checklist is now answered: