Technical References
Embedding models
Gemini Embedding 2
Google’s first natively multimodal embedding model, released in March 2025. It accepts text, images, video, audio, and PDF inputs, projecting them into a shared 3072-dimensional vector space. This unified space is what makes cross-modal similarity measurement possible.
- Official documentation - API reference, supported modalities, and usage examples
- Google blog announcement - Design goals and benchmark results
- Pricing - Per-token and per-second cost breakdown for each modality
Key properties for this project:
- Native Korean: trained on 100+ languages, no translation step needed
- Video up to 120s: YouTube Shorts (under 60s) fit comfortably
- Unified space: text, audio, and video embeddings are directly comparable via cosine similarity
ImageBind (Meta)
An earlier multimodal embedding model from Meta AI (2023) that also maps multiple modalities into a shared space. ImageBind uses image embeddings as an anchor, aligning other modalities (audio, text, depth, thermal, IMU) to the image space via contrastive learning.
- ImageBind paper - Girdhar et al. (2023), “ImageBind: One Embedding Space To Bind Them All”
- GitHub repository
Why we chose Gemini Embedding 2 over ImageBind
We initially planned to use ImageBind (see pipeline_v1.txt). After evaluating both models against our specific requirements, we switched to Gemini Embedding 2. The comparison below explains this decision.
| | Gemini Embedding 2 | ImageBind |
|---|---|---|
| Release | March 2025 (Google) | May 2023 (Meta) |
| Architecture | Joint multimodal encoder | Image-anchored contrastive |
| Dimensions | 3072 (or 768, 256) | 1024 |
| Text input | Native, 100+ languages | CLIP-based, English-centric |
| Audio input | Speech + non-speech, multilingual | 2-second clips, environment sounds |
| Video input | Up to 120 seconds natively | Frame sampling (indirect) |
| Korean support | Native (trained on Korean text and speech) | Minimal (no Korean speech training) |
| Inference | Cloud API (Google AI Studio) | Local GPU (A100 recommended) |
| Cost | Pay-per-use (~$0.007-0.043/short) | Free (but requires GPU hardware) |
| Batch processing | API with rate limits | Limited by local GPU memory |
Political YouTube Shorts are dominated by spoken Korean. ImageBind’s audio encoder was trained on environmental sounds (AudioSet), not speech. It processes audio in 2-second clips and lacks Korean language training entirely. For our use case, this is a critical limitation.
Gemini Embedding 2’s audio encoder handles continuous speech in 100+ languages, including Korean, making it fundamentally better suited for political communication analysis where what politicians say matters more than background visuals.
When ImageBind might still be the better choice:
- Research on visual-only content (no speech component)
- Projects requiring local/offline processing (data privacy constraints)
- Budget of zero (no API costs, but GPU hardware needed)
- Non-language modalities: depth, thermal, or IMU data
Dimensionality reduction
UMAP
Uniform Manifold Approximation and Projection. A nonlinear dimensionality reduction technique that preserves both local and global structure better than t-SNE, while being significantly faster. We use UMAP in two roles:
- Visualization (3072 → 2D): for scatter plots colored by party or cluster
- Pre-clustering (3072 → 10D): HDBSCAN works better on moderately reduced data than on raw high-dimensional vectors
- McInnes, Healy, & Melville (2018) - “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction”
- umap-learn documentation - Python package with scikit-learn API
- Understanding UMAP - Interactive visual explainer by Google PAIR
Key parameters for embedding data:
| Parameter | Our setting | Why |
|---|---|---|
| `metric` | `"cosine"` | Standard for normalized embedding vectors |
| `n_neighbors` | 15 (default) | Lower (5-10) for small pilots |
| `min_dist` | 0.1 | Allows tight clusters without forcing overlap |
| `random_state` | 42 | Reproducibility (UMAP is stochastic) |
t-SNE
An older alternative to UMAP. We do not use it here because it is slower on large datasets, does not preserve global structure well, and does not support the transform method (projecting new points onto an existing embedding).
- van der Maaten & Hinton (2008) - Original t-SNE paper
Clustering
HDBSCAN
Hierarchical Density-Based Spatial Clustering of Applications with Noise. Unlike K-means, HDBSCAN does not require specifying the number of clusters in advance. It identifies clusters of varying density and labels outlier points as noise (cluster = -1).
- Campello, Moulavi, & Sander (2013) - Original HDBSCAN paper
- McInnes, Healy, & Astels (2017) - Python implementation (JOSS)
- hdbscan documentation - Usage guide with parameter tuning advice
Best practice for embedding data: reduce dimensionality with UMAP first (to ~10D), then cluster. This avoids the curse of dimensionality and runs much faster.
| Parameter | Our setting | Why |
|---|---|---|
| `min_cluster_size` | 15 | Minimum meaningful cluster for political analysis |
| `min_samples` | 5 | Noise tolerance; lower = fewer noise points |
| `metric` | `"euclidean"` | Standard after UMAP reduction |
Similarity and distance
Cosine similarity
The primary similarity measure for embedding vectors. Ranges from -1 (opposite) to 1 (identical). For normalized vectors (unit length), cosine similarity equals the dot product.
\[ \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} \]
We use `sklearn.metrics.pairwise.cosine_similarity` for computing full pairwise matrices.
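A short example of the full pairwise matrix on toy 2D vectors (real vectors would be 3072-dimensional), which also confirms the dot-product identity for unit-normalized rows:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy embedding vectors as rows
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

S = cosine_similarity(X)  # full pairwise matrix, shape (3, 3)
print(np.round(S, 3))

# For unit-length rows, cosine similarity is just the dot product
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
assert np.allclose(S, Xn @ Xn.T)
```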
Panel data methods
Fixed effects regression
For the pairwise similarity panel (politician-dyad × month), we use OLS with month fixed effects and standard errors clustered at the dyad level. This controls for time trends in overall Shorts production while identifying what predicts within-period similarity.
- Wooldridge (2010) - Econometric Analysis of Cross Section and Panel Data, standard reference for panel methods
- statsmodels OLS - Python implementation
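A hedged sketch of this specification on a synthetic panel; the column names (`dyad`, `month`, `same_party`, `similarity`) are illustrative, not our actual schema:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# Synthetic dyad-month panel: 30 dyads observed over 12 months
dyads, months = 30, 12
df = pd.DataFrame({
    "dyad": np.repeat(np.arange(dyads), months),
    "month": np.tile(np.arange(months), dyads),
})
df["same_party"] = (df["dyad"] % 2 == 0).astype(int)
# Simulated similarity with a true same-party effect of 0.1 and a time trend
df["similarity"] = (0.4 + 0.1 * df["same_party"]
                    + 0.01 * df["month"]
                    + rng.normal(0, 0.05, len(df)))

# Month fixed effects via C(month); SEs clustered at the dyad level
fit = smf.ols("similarity ~ same_party + C(month)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["dyad"]})
print(fit.params["same_party"])
```

The month dummies absorb the common time trend, so the `same_party` coefficient recovers the within-period difference.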
Event study design
To test electoral convergence, we use a pre/post event study around election dates. The key assumption is that similarity trends would remain flat absent the election.
- Freyaldenhoven, Hansen, & Shapiro (2019) - “Pre-Event Trends in the Panel Event-Study Design”
Text embedding in political science
Key references for the embedding regression framework that underlies our panel design.
- Rodriguez & Spirling (2022) - “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research” (JOP)
- Rodriguez, Spirling & Stewart (2023) - “Embedding Regression: Models for Context-Specific Description and Inference” (APSR) - the conText framework
- Rheault & Cochrane (2020) - “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora” (Political Analysis)
Image and visual analysis in political science
Emerging literature on computational visual analysis of political content.
- Torres & Cantú (2022) - “Learning to See: Convolutional Neural Networks for the Analysis of Social Science Data” (Political Analysis)
- Zhang & Pan (2019) - “CASM: A Deep-Learning Approach for Identifying Collective Action Events with Text and Image Data from Social Media” (Sociological Methodology)
Multimodal approaches
Closest methodological precedents and key models.
- Radford et al. (2021) - “Learning Transferable Visual Models From Natural Language Supervision” (CLIP paper)
- Liang et al. (2022) - “Mind the Gap: Understanding the Modality Gap in Multi-Modal Contrastive Representation Learning” (NeurIPS 2022). Documents systematic displacement between modalities in contrastive spaces, a key validity concern for VVG.
Transcription
- Whisper (Radford et al. 2023) - “Robust Speech Recognition via Large-Scale Weak Supervision” - base model for our transcription pipeline
- whisper-medium-komixv2-mlx - Korean fine-tuned Whisper Medium, MLX-accelerated for Apple Silicon. ~2.3 sec/video on M5 Pro.
Tools and infrastructure
| Tool | Purpose | Link |
|---|---|---|
| Google AI Studio | API key management, usage dashboard | aistudio.google.com |
| `google-genai` | Python SDK for Gemini API | PyPI |
| MLX Whisper | Local Korean speech transcription | HuggingFace |
| ffmpeg | Audio/video extraction and manipulation | ffmpeg.org |
| yt-dlp | YouTube video download | github.com/yt-dlp |
| Quarto | This documentation site | quarto.org |
| YouTube Data API | Video metadata collection | developers.google.com |