Multimodal Embedding for Political Communication

Analyzing Korean Politicians’ YouTube Shorts with Gemini Embedding 2

Authors

Kyusik Yang (New York University)
Yongjai Yu (UC Riverside)

Published

March 26, 2026

About this project

This project applies multimodal embeddings to political communication research. We embed the full multimedia content of Korean politicians’ YouTube Shorts (text, audio, and video) into a shared vector space using Google’s Gemini Embedding 2 model. This allows us to measure stylistic similarity, detect content clusters, and trace how communication strategies evolve over time.

Our subjects are members of the Korean National Assembly who maintain YouTube channels. YouTube Shorts, the platform’s short-form vertical video format (under 60 seconds), has become a dominant medium for political messaging in South Korea since 2022.

Data

  • 121,900 YouTube videos from 263 legislators (22nd National Assembly)
  • 51,197 Shorts downloaded as MP4 (~342 GB)
  • ~38,000 Whisper transcripts (MLX-accelerated, Korean fine-tuned)
  • 2,969 GPT-4o-mini content classifications (10 categories)

Research questions

  1. Clustering: What types of political Shorts exist? Can we identify distinct communication styles (e.g., policy explainers, rally clips, personal branding) from embeddings alone?
  2. Determinants of similarity: What predicts stylistic similarity between politicians’ Shorts: party affiliation, seniority, district characteristics, or electoral competitiveness?
  3. Electoral dynamics: Through what mechanisms do politicians’ Shorts converge during election seasons?
  4. Visual-Verbal Gap: To what extent do politicians’ visual presentations diverge from their verbal content, and does this divergence predict engagement?

The Visual-Verbal Gap (VVG)

A key concept that emerged from our pilot study, VVG measures the cosine distance between a Short’s transcript embedding and its video-only embedding in Gemini’s shared vector space. A high VVG means a politician is showing something different from what they are saying, a hallmark of performative political communication.
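As a minimal sketch, VVG is just one minus the cosine similarity of the two embedding vectors (the function name here is illustrative, not part of the project's codebase):

```python
import numpy as np

def vvg(text_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Visual-Verbal Gap: cosine distance between a Short's transcript
    embedding and its video-only embedding from the same model."""
    t = text_emb / np.linalg.norm(text_emb)
    v = video_emb / np.linalg.norm(video_emb)
    return float(1.0 - t @ v)
```

Identical text and video embeddings give a VVG of 0; orthogonal ones give 1, so higher values indicate greater visual-verbal divergence.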

Unit of analysis

Politician × month panel: the average multimodal embedding of all Shorts published by a politician in a given month.
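The aggregation step can be sketched as a groupby-and-average over Short-level embedding vectors. The column names below ('politician', 'published', 'embedding') are illustrative, not the project's actual schema:

```python
import numpy as np
import pandas as pd

def politician_month_panel(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse Short-level embeddings into a politician x month panel
    by averaging each politician's vectors within a calendar month."""
    return (
        df.assign(month=df["published"].dt.to_period("M"))
          .groupby(["politician", "month"])["embedding"]
          .apply(lambda vecs: np.stack(vecs.to_list()).mean(axis=0))
          .reset_index()
    )
```

Because averaging unit-length vectors shrinks their norm, downstream cosine comparisons should re-normalize the panel-level vectors (or use a metric that is scale-invariant, as cosine is).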

51,197 Shorts (YouTube API + yt-dlp)
  → Whisper transcription (MLX, local)
    → Text embedding (Gemini, full corpus, ~$0.50)
    → Multimodal embedding (Gemini, 2,500 subsample, ~$234)
      → VVG computation (text vs. video-only distance)
        ├─ UMAP + HDBSCAN clustering
        ├─ Similarity determinants (Panel FE regression)
        ├─ VVG predictors (Opposition, election proximity)
        └─ Engagement model (VVG → views)
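One of the teaching topics this site covers is API cost estimation. A back-of-envelope projection can be derived from the pilot figures above (~$234 for the 2,500-Short multimodal subsample); the per-Short rate below comes from that pilot, not from official Gemini pricing, which may differ:

```python
# Pilot figures from the pipeline above (multimodal embedding step).
PILOT_COST_USD = 234.0
PILOT_SHORTS = 2_500

def projected_cost(n_shorts: int) -> float:
    """Estimated multimodal-embedding cost for n_shorts at the pilot rate."""
    per_short = PILOT_COST_USD / PILOT_SHORTS  # ~$0.094 per Short
    return n_shorts * per_short

# Embedding the full 51,197-Short corpus at the pilot rate would run
# roughly $4,800 -- which is why we embed a subsample.
full_corpus_cost = projected_cost(51_197)
```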

Why this site exists

This site serves two purposes:

1. Research process documentation. Rather than presenting only final results, we document the entire pipeline from API setup to panel regression. Each page records our decisions, failed attempts, and workarounds as they happen. This transparency makes the research reproducible and helps collaborators understand not just what we did but why.

2. Teaching material. Multimodal embedding is a new technique for social science. There are few tutorials that walk through the full workflow for researchers who are not ML engineers. This site is designed to be that tutorial, covering practical topics like API cost estimation, embedding strategy comparison, and aggregation for panel analysis.

The site is organized as an exploratory walkthrough. We start from scratch (getting an API key), work through embedding strategies, compare results, and build toward a publication-ready panel design. Code cells are included at every step so readers can follow along or adapt the pipeline to their own data.

Technology stack

| Component | Choice | Rationale |
|---|---|---|
| Embedding model | Gemini Embedding 2 | Native Korean support, accepts video up to 120 s, unified vector space |
| Transcription | MLX Whisper | Korean fine-tuned, local GPU (Apple Silicon), zero API cost |
| API | Google AI Studio | Free tier available, simple API-key authentication |
| Python SDK | google-genai | Latest official SDK (replaces the deprecated google-generativeai) |
| Embedding dimensions | 3072 (default) | Highest quality; 768 or 256 available for a cost/speed tradeoff |
| Dimensionality reduction | UMAP | Preserves local structure, supports the cosine metric |
| Clustering | HDBSCAN | Noise-tolerant, does not require pre-specifying the cluster count |
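On the dimensions tradeoff: we assume the smaller sizes work the way they do for Google's earlier gemini-embedding-001, where lower dimensions are Matryoshka-style prefixes of the full 3072-d vector that should be re-normalized after truncation. Verify this against the current Gemini embedding documentation before relying on it; the sketch below encodes that assumption:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 768) -> np.ndarray:
    """Reduce a full 3072-d embedding to a smaller dimension by prefix
    truncation plus re-normalization (assumes Matryoshka-style training,
    where the leading dimensions carry the most information)."""
    out = vec[:dim]
    return out / np.linalg.norm(out)
```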
Note: for collaborators

This site is built with Quarto and designed for co-editing. All code cells use freeze: auto, so outputs are cached and viewable without re-running API calls. To contribute, edit the .qmd files and run quarto render.