Technical References

Embedding models

Gemini Embedding 2

Google’s first natively multimodal embedding model, released in March 2025. It accepts text, images, video, audio, and PDF inputs, projecting them into a shared 3072-dimensional vector space. This unified space is what makes cross-modal similarity measurement possible.

Key properties for this project:

  • Native Korean: trained on 100+ languages, no translation step needed
  • Video up to 120s: YouTube Shorts (under 60s) fit comfortably
  • Unified space: text, audio, and video embeddings are directly comparable via cosine similarity

ImageBind (Meta)

An earlier multimodal embedding model from Meta AI (2023) that also maps multiple modalities into a shared space. ImageBind uses image embeddings as an anchor, aligning other modalities (audio, text, depth, thermal, IMU) to the image space via contrastive learning.

Why we chose Gemini Embedding 2 over ImageBind

We initially planned to use ImageBind (see pipeline_v1.txt). After evaluating both models against our specific requirements, we switched to Gemini Embedding 2. The comparison below explains this decision.

                   Gemini Embedding 2                   ImageBind
Release            March 2025 (Google)                  May 2023 (Meta)
Architecture       Joint multimodal encoder             Image-anchored contrastive
Dimensions         3072 (or 768, 256)                   1024
Text input         Native, 100+ languages               CLIP-based, English-centric
Audio input        Speech + non-speech, multilingual    2-second clips, environmental sounds
Video input        Up to 120 seconds natively           Frame sampling (indirect)
Korean support     Native (Korean text and speech)      Minimal (no Korean speech training)
Inference          Cloud API (Google AI Studio)         Local GPU (A100 recommended)
Cost               Pay-per-use (~$0.007-0.043/short)    Free (but requires GPU hardware)
Batch processing   API with rate limits                 Limited by local GPU memory
Important: the decisive factor is Korean speech

Political YouTube Shorts are dominated by spoken Korean. ImageBind’s audio encoder was trained on environmental sounds (AudioSet), not speech. It processes audio in 2-second clips and lacks Korean language training entirely. For our use case, this is a critical limitation.

Gemini Embedding 2’s audio encoder handles continuous speech in 100+ languages, including Korean, making it fundamentally better suited for political communication analysis where what politicians say matters more than background visuals.

When ImageBind might still be the better choice:

  • Research on visual-only content (no speech component)
  • Projects requiring local/offline processing (data privacy constraints)
  • Budget of zero (no API costs, but GPU hardware needed)
  • Non-language modalities: depth, thermal, or IMU data

Dimensionality reduction

UMAP

Uniform Manifold Approximation and Projection. A nonlinear dimensionality reduction technique that preserves both local and global structure better than t-SNE, while being significantly faster. We use UMAP in two roles:

  1. Visualization (3072 → 2D): for scatter plots colored by party or cluster
  2. Pre-clustering (3072 → 10D): HDBSCAN works better on moderately reduced data than on raw high-dimensional vectors

Key parameters for embedding data:

Parameter      Our setting    Why
metric         "cosine"       Standard for normalized embedding vectors
n_neighbors    15 (default)   Lower (5-10) for small pilots
min_dist       0.1            Allows tight clusters without forcing overlap
random_state   42             Reproducibility (UMAP is stochastic)

t-SNE

An older alternative to UMAP. We do not use it here because it is slower on large datasets, does not preserve global structure well, and (in scikit-learn's implementation) provides no transform method for projecting new points onto an existing embedding.

Clustering

HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise. Unlike K-means, HDBSCAN does not require specifying the number of clusters in advance. It identifies clusters of varying density and labels outlier points as noise (cluster = -1).

Best practice for embedding data: reduce dimensionality with UMAP first (to ~10D), then cluster. This avoids the curse of dimensionality and runs much faster.

Parameter         Our setting    Why
min_cluster_size  15             Minimum meaningful cluster for political analysis
min_samples       5              Noise tolerance; lower = fewer noise points
metric            "euclidean"    Standard after UMAP reduction

Similarity and distance

Cosine similarity

The primary similarity measure for embedding vectors. Ranges from -1 (opposite) to 1 (identical). For normalized vectors (unit length), cosine similarity equals the dot product.

\[ \text{cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||} \]

We use sklearn.metrics.pairwise.cosine_similarity for computing full pairwise matrices.
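A quick check that the formula and the sklearn helper agree, using a random stand-in embedding matrix: for unit-normalized rows, the pairwise cosine matrix is just the Gram matrix of the normalized vectors.

```python
# Sketch: cosine similarity by hand vs. sklearn's pairwise helper.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3072))          # stand-in embedding matrix

# Manual: normalize rows to unit length; cosine is then the dot product
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
manual = E_unit @ E_unit.T

S = cosine_similarity(E)                # full 5x5 pairwise matrix
print(S.shape)  # (5, 5)
```

The matrix is symmetric with ones on the diagonal, so for large collections only the upper triangle needs to be stored.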

Panel data methods

Fixed effects regression

For the pairwise similarity panel (politician-dyad × month), we use OLS with month fixed effects and standard errors clustered at the dyad level. This controls for period-wide trends in overall Shorts production while identifying what predicts within-period similarity.
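The specification can be sketched with statsmodels' formula API. The column names (`sim`, `same_party`, `month`, `dyad`) and the simulated effect size are illustrative, not the project's actual schema.

```python
# Sketch: OLS with month fixed effects and dyad-clustered standard errors.
# Column names and the simulated effect (0.1) are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
dyads, months = 40, 12
df = pd.DataFrame({
    "dyad": np.repeat(np.arange(dyads), months),
    "month": np.tile(np.arange(months), dyads),
    "same_party": np.repeat(rng.integers(0, 2, dyads), months),
})
df["sim"] = 0.3 + 0.1 * df["same_party"] + rng.normal(0, 0.05, len(df))

# C(month) adds one dummy per month; cov_type="cluster" groups errors by dyad
fit = smf.ols("sim ~ same_party + C(month)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["dyad"]})
print(fit.params["same_party"])
```

Clustering at the dyad level matters because the same pair of politicians appears in every month, so residuals within a dyad are unlikely to be independent.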

Event study design

To test electoral convergence, we use a pre/post event study around election dates. The key identifying assumption is that, absent the election, similarity trends would have remained flat (no anticipatory convergence).
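One way to sketch the design: regress similarity on dummies for months relative to the election, omitting the month just before it (t = -1) as the baseline. Names, the flat pre-trend, and the simulated post-election jump are all illustrative assumptions.

```python
# Sketch: event-study dummies relative to an election at t = 0,
# with t = -1 as the omitted baseline. Data are simulated with a flat
# pre-trend and a 0.05 post-election jump (illustrative values only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
dyads = 30
event_time = np.arange(-6, 7)           # months relative to the election
df = pd.DataFrame({
    "dyad": np.repeat(np.arange(dyads), len(event_time)),
    "t": np.tile(event_time, dyads),
})
df["sim"] = 0.3 + 0.05 * (df["t"] >= 0) + rng.normal(0, 0.02, len(df))

# One coefficient per relative month; pre-election coefficients near zero
# support the flat-trend assumption, post-election ones trace convergence
fit = smf.ols("sim ~ C(t, Treatment(reference=-1))", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["dyad"]})
print(fit.params.drop("Intercept").round(3))
```

Plotting these coefficients with confidence intervals against event time is the standard diagnostic: a flat pre-period validates the design, and the post-period path shows when convergence occurs.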

Text embedding in political science

Key references for the embedding regression framework that underlies our panel design.

Image and visual analysis in political science

Emerging literature on computational visual analysis of political content.

  • Torres & Cantú (2022) - “Learning to See: Convolutional Neural Networks for the Analysis of Social Science Data” (Political Analysis)
  • Zhang & Pan (2019) - “CASM: A Deep-Learning Approach for Identifying Collective Action Events with Text and Image Data from Social Media” (Sociological Methodology)

Multimodal approaches

Closest methodological precedents and key models.

  • Radford et al. (2021) - “Learning Transferable Visual Models From Natural Language Supervision” (CLIP paper)
  • Liang et al. (2022) - “Mind the Gap: Understanding the Modality Gap in Multi-Modal Contrastive Representation Learning” (NeurIPS 2022). Documents systematic displacement between modalities in contrastive spaces, a key validity concern for VVG.

Transcription

Tools and infrastructure

Tool              Purpose                                  Link
Google AI Studio  API key management, usage dashboard      aistudio.google.com
google-genai      Python SDK for Gemini API                PyPI
MLX Whisper       Local Korean speech transcription        HuggingFace
ffmpeg            Audio/video extraction and manipulation  ffmpeg.org
yt-dlp            YouTube video download                   github.com/yt-dlp
Quarto            This documentation site                  quarto.org
YouTube Data API  Video metadata collection                developers.google.com