Technical References

Embedding models

Gemini Embedding 2

Google’s first natively multimodal embedding model, released in March 2025. It accepts text, images, video, audio, and PDF inputs, projecting them into a shared 3072-dimensional vector space. This unified space is what makes cross-modal similarity measurement possible.

Key properties for this project:

  • Native Korean: trained on 100+ languages, no translation step needed
  • Video up to 120s: YouTube Shorts (under 60s) fit comfortably
  • Unified space: text, audio, and video embeddings are directly comparable via cosine similarity

ImageBind (Meta)

An earlier multimodal embedding model from Meta AI (2023) that also maps multiple modalities into a shared space. ImageBind uses image embeddings as an anchor, aligning other modalities (audio, text, depth, thermal, IMU) to the image space via contrastive learning.

Why we chose Gemini Embedding 2 over ImageBind

We initially planned to use ImageBind (see pipeline_v1.txt). After evaluating both models against our specific requirements, we switched to Gemini Embedding 2. The comparison below explains this decision.

                   Gemini Embedding 2                   ImageBind
Release            March 2025 (Google)                  May 2023 (Meta)
Architecture       Joint multimodal encoder             Image-anchored contrastive
Dimensions         3072 (or 768, 256)                   1024
Text input         Native, 100+ languages               CLIP-based, English-centric
Audio input        Speech + non-speech, multilingual    2-second clips, environmental sounds
Video input        Up to 120 seconds natively           Frame sampling (indirect)
Korean support     Native (Korean text and speech)      Minimal (no Korean speech training)
Inference          Cloud API (Google AI Studio)         Local GPU (A100 recommended)
Cost               Pay-per-use (~$0.007-0.043/short)    Free (but requires GPU hardware)
Batch processing   API with rate limits                 Limited by local GPU memory
Important: the decisive factor is Korean speech

Political YouTube Shorts are dominated by spoken Korean. ImageBind’s audio encoder was trained on environmental sounds (AudioSet), not speech. It processes audio in 2-second clips and lacks Korean language training entirely. For our use case, this is a critical limitation.

Gemini Embedding 2’s audio encoder handles continuous speech in 100+ languages, including Korean, making it fundamentally better suited for political communication analysis where what politicians say matters more than background visuals.

When ImageBind might still be the better choice:

  • Research on visual-only content (no speech component)
  • Projects requiring local/offline processing (data privacy constraints)
  • Budget of zero (no API costs, but GPU hardware needed)
  • Non-language modalities: depth, thermal, or IMU data

Dimensionality reduction

UMAP

Uniform Manifold Approximation and Projection. A nonlinear dimensionality reduction technique that preserves both local and global structure better than t-SNE, while being significantly faster. We use UMAP in two roles:

  1. Visualization (3072 → 2D): for scatter plots colored by party or cluster
  2. Pre-clustering (3072 → 10D): HDBSCAN works better on moderately reduced data than on raw high-dimensional vectors

Key parameters for embedding data:

Parameter      Our setting    Why
metric         "cosine"       Standard for normalized embedding vectors
n_neighbors    15 (default)   Lower (5-10) for small pilots
min_dist       0.1            Allows tight clusters without forcing overlap
random_state   42             Reproducibility (UMAP is stochastic)

t-SNE

An older alternative to UMAP. We do not use it here because it is slower on large datasets, does not preserve global structure well, and (in scikit-learn's implementation) provides no transform method for projecting new points onto an existing embedding.

Clustering

HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise. Unlike K-means, HDBSCAN does not require specifying the number of clusters in advance. It identifies clusters of varying density and labels outlier points as noise (cluster = -1).

Best practice for embedding data: reduce dimensionality with UMAP first (to ~10D), then cluster. This avoids the curse of dimensionality and runs much faster.

Parameter         Our setting    Why
min_cluster_size  15             Minimum meaningful cluster for political analysis
min_samples       5              Noise tolerance; lower = fewer noise points
metric            "euclidean"    Standard after UMAP reduction

Similarity and distance

Cosine similarity

The primary similarity measure for embedding vectors. Ranges from -1 (opposite) to 1 (identical). For normalized vectors (unit length), cosine similarity equals the dot product.

\[ \text{cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||} \]

We use sklearn.metrics.pairwise.cosine_similarity for computing full pairwise matrices.
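A quick check that the formula and the sklearn helper agree, using a random stand-in embedding matrix: for unit-normalized rows, the pairwise cosine matrix is just the Gram matrix of the normalized vectors.

```python
# Sketch: cosine similarity by hand vs. sklearn's pairwise helper.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3072))          # stand-in embedding matrix

# Manual: normalize rows to unit length; cosine is then the dot product
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
manual = E_unit @ E_unit.T

S = cosine_similarity(E)                # full 5x5 pairwise matrix
print(S.shape)  # (5, 5)
```

The matrix is symmetric with ones on the diagonal, so for large collections only the upper triangle needs to be stored.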

Panel data methods

Fixed effects regression

For the pairwise similarity panel (politician-dyad × month), we use OLS with month fixed effects and standard errors clustered at the dyad level. This controls for period-wide trends in overall Shorts production while identifying what predicts within-period similarity.
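The specification can be sketched with statsmodels' formula API. The column names (`sim`, `same_party`, `month`, `dyad`) and the simulated effect size are illustrative, not the project's actual schema.

```python
# Sketch: OLS with month fixed effects and dyad-clustered standard errors.
# Column names and the simulated effect (0.1) are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
dyads, months = 40, 12
df = pd.DataFrame({
    "dyad": np.repeat(np.arange(dyads), months),
    "month": np.tile(np.arange(months), dyads),
    "same_party": np.repeat(rng.integers(0, 2, dyads), months),
})
df["sim"] = 0.3 + 0.1 * df["same_party"] + rng.normal(0, 0.05, len(df))

# C(month) adds one dummy per month; cov_type="cluster" groups errors by dyad
fit = smf.ols("sim ~ same_party + C(month)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["dyad"]})
print(fit.params["same_party"])
```

Clustering at the dyad level matters because the same pair of politicians appears in every month, so residuals within a dyad are unlikely to be independent.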

Event study design

To test electoral convergence, we use a pre/post event study around election dates. The key identifying assumption is that, absent the election, similarity trends would have remained flat (no anticipatory convergence).
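One way to sketch the design: regress similarity on dummies for months relative to the election, omitting the month just before it (t = -1) as the baseline. Names, the flat pre-trend, and the simulated post-election jump are all illustrative assumptions.

```python
# Sketch: event-study dummies relative to an election at t = 0,
# with t = -1 as the omitted baseline. Data are simulated with a flat
# pre-trend and a 0.05 post-election jump (illustrative values only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
dyads = 30
event_time = np.arange(-6, 7)           # months relative to the election
df = pd.DataFrame({
    "dyad": np.repeat(np.arange(dyads), len(event_time)),
    "t": np.tile(event_time, dyads),
})
df["sim"] = 0.3 + 0.05 * (df["t"] >= 0) + rng.normal(0, 0.02, len(df))

# One coefficient per relative month; pre-election coefficients near zero
# support the flat-trend assumption, post-election ones trace convergence
fit = smf.ols("sim ~ C(t, Treatment(reference=-1))", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["dyad"]})
print(fit.params.drop("Intercept").round(3))
```

Plotting these coefficients with confidence intervals against event time is the standard diagnostic: a flat pre-period validates the design, and the post-period path shows when convergence occurs.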

Text embedding in political science

Key references for the embedding regression framework that underlies our panel design.

Image and visual analysis in political science

Emerging literature on computational visual analysis of political content.

  • Torres & Cantú (2022) - “Learning to See: Convolutional Neural Networks for the Analysis of Social Science Data” (Political Analysis)
  • Zhang & Pan (2019) - “CASM: A Deep-Learning Approach for Identifying Collective Action Events with Text and Image Data from Social Media” (Sociological Methodology)

Multimodal approaches

Closest methodological precedents and key models.

  • Radford et al. (2021) - “Learning Transferable Visual Models From Natural Language Supervision” (CLIP paper)
  • Liang et al. (2022) - “Mind the Gap: Understanding the Modality Gap in Multi-Modal Contrastive Representation Learning” (NeurIPS 2022). Documents systematic displacement between modalities in contrastive spaces, a key validity concern for VVG.

Transcription

Tools and infrastructure

Tool              Purpose                                  Link
Google AI Studio  API key management, usage dashboard      aistudio.google.com
google-genai      Python SDK for Gemini API                PyPI
MLX Whisper       Local Korean speech transcription        HuggingFace
ffmpeg            Audio/video extraction and manipulation  ffmpeg.org
yt-dlp            YouTube video download                   github.com/yt-dlp
Quarto            This documentation site                  quarto.org
YouTube Data API  Video metadata collection                developers.google.com