Population Normalization
Converting raw scores to population-relative measures
Abstract
Methodology for normalizing raw semantic scores against population baselines. Covers bootstrap null distribution generation, percentile ranking, z-score conversion, and prompt-specific calibration. Enables fair comparison across prompts with different baseline geometries.
Overview
This study is still being calibrated. The completed write-up will document:
- Prompt-specific baselines — why different prompts require different null distributions
- Bootstrap procedure — generating null distributions by sampling random word sets
- Percentile normalization — converting raw scores to population-relative ranks
- Z-score normalization — standard deviations from null mean for statistical analysis
- Caching strategies — precomputation and storage for production systems
The Problem
Raw relevance and divergence scores are difficult to interpret:
- A relevance of 0.35 may be excellent for a distant anchor-target pair but mediocre for a close pair
- Divergence depends on how many words are included and their baseline geometry
Solution: Prompt-Specific Null Distributions
For each prompt configuration:
- Sample random words from vocabulary (matching submission size)
- Score this random set using the same functions
- Repeat 500+ times to build the null distribution
- Store for percentile/z-score conversion
This answers: “How does this submission compare to random word sets for this specific prompt?”
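The four steps above can be sketched as follows. This is a minimal illustration, not the production implementation: the vocabulary, scoring function, and sample counts are placeholders.

```python
import random
import statistics

def bootstrap_null(vocabulary, score_fn, n_clues, n_samples=500, seed=0):
    """Build a null distribution by repeatedly scoring random word sets
    of the same size as the submission."""
    rng = random.Random(seed)
    return sorted(score_fn(rng.sample(vocabulary, n_clues))
                  for _ in range(n_samples))

def percentile(null_scores, raw):
    """Fraction of null scores at or below the raw score (0..1)."""
    return sum(s <= raw for s in null_scores) / len(null_scores)

def z_score(null_scores, raw):
    """Standard deviations of the raw score from the null mean."""
    mu = statistics.fmean(null_scores)
    sigma = statistics.stdev(null_scores)
    return (raw - mu) / sigma
```

A raw score can then be reported as, e.g., "87th percentile against the null for this prompt" rather than as an uninterpretable 0.35.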
Optimization Considerations
For production systems:
- Precompute null distributions for all stimulus pairs
- Cache results keyed by a `(prompt, n_clues)` tuple
- Parametric approximation is possible (Beta for relevance, truncated normal for divergence)
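The caching point can be sketched with `functools.lru_cache`, keyed by the `(prompt, n_clues)` pair. The `PROMPTS` registry and its toy scorer are hypothetical stand-ins for however prompt configurations are actually stored:

```python
import random
import statistics
from functools import lru_cache

# Hypothetical registry mapping a prompt id to (vocabulary, scoring function).
PROMPTS = {
    "animals": (["cat", "dog", "ferret", "heron", "newt"] * 20,
                lambda ws: statistics.fmean(len(w) for w in ws)),
}

@lru_cache(maxsize=None)
def cached_null(prompt_id: str, n_clues: int, n_samples: int = 500):
    """Compute the null distribution once per (prompt, n_clues) key and
    memoize it; repeated lookups return the stored tuple."""
    vocab, score_fn = PROMPTS[prompt_id]
    rng = random.Random(n_clues)  # fixed seed for reproducible nulls
    return tuple(sorted(score_fn(rng.sample(vocab, n_clues))
                        for _ in range(n_samples)))
```

In a production system the same keying scheme would apply to a persistent store (precomputed tables on disk or in a database) rather than an in-process cache.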
This methodology is under development. Check back for updates.