Population Normalization
Converting raw scores to population-relative measures
Abstract
Methodology for normalizing raw semantic scores against population baselines. Covers bootstrap null distribution generation, percentile ranking, z-score conversion, and prompt-specific calibration. Enables fair comparison across prompts with different baseline geometries.
Overview
This study is still being calibrated. The completed write-up will document:
- Prompt-specific baselines — why different prompts require different null distributions
- Bootstrap procedure — generating null distributions by sampling random word sets
- Percentile normalization — converting raw scores to population-relative ranks
- Z-score normalization — standard deviations from null mean for statistical analysis
- Caching strategies — precomputation and storage for production systems
The Problem
Raw relevance and divergence scores are difficult to interpret:
- A relevance of 0.35 may be excellent for a distant anchor-target pair but mediocre for a close pair
- Divergence depends on how many words are included and their baseline geometry
Solution: Prompt-Specific Null Distributions
For each prompt configuration:
- Sample random words from vocabulary (matching submission size)
- Score this random set using the same functions
- Repeat 500+ times to build the null distribution
- Store for percentile/z-score conversion
This answers: “How does this submission compare to random word sets for this specific prompt?”
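The four steps above can be sketched as follows. This is a minimal illustration, not the production implementation: the vocabulary, scoring function, and sample counts are placeholders.

```python
import random
import statistics

def bootstrap_null(vocabulary, score_fn, n_clues, n_samples=500, seed=0):
    """Build a null distribution by repeatedly scoring random word sets
    of the same size as the submission."""
    rng = random.Random(seed)
    return sorted(score_fn(rng.sample(vocabulary, n_clues))
                  for _ in range(n_samples))

def percentile(null_scores, raw):
    """Fraction of null scores at or below the raw score (0..1)."""
    return sum(s <= raw for s in null_scores) / len(null_scores)

def z_score(null_scores, raw):
    """Standard deviations of the raw score from the null mean."""
    mu = statistics.fmean(null_scores)
    sigma = statistics.stdev(null_scores)
    return (raw - mu) / sigma
```

A raw score can then be reported as, e.g., "87th percentile against the null for this prompt" rather than as an uninterpretable 0.35.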
Optimization Considerations
For production systems:
- Precompute null distributions for all stimulus pairs
- Cache results keyed by a `(prompt, n_clues)` tuple
- Parametric approximation is possible (Beta for relevance, truncated normal for divergence)
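The caching point can be sketched with `functools.lru_cache`, keyed by the `(prompt, n_clues)` pair. The `PROMPTS` registry and its toy scorer are hypothetical stand-ins for however prompt configurations are actually stored:

```python
import random
import statistics
from functools import lru_cache

# Hypothetical registry mapping a prompt id to (vocabulary, scoring function).
PROMPTS = {
    "animals": (["cat", "dog", "ferret", "heron", "newt"] * 20,
                lambda ws: statistics.fmean(len(w) for w in ws)),
}

@lru_cache(maxsize=None)
def cached_null(prompt_id: str, n_clues: int, n_samples: int = 500):
    """Compute the null distribution once per (prompt, n_clues) key and
    memoize it; repeated lookups return the stored tuple."""
    vocab, score_fn = PROMPTS[prompt_id]
    rng = random.Random(n_clues)  # fixed seed for reproducible nulls
    return tuple(sorted(score_fn(rng.sample(vocab, n_clues))
                        for _ in range(n_samples)))
```

In a production system the same keying scheme would apply to a persistent store (precomputed tables on disk or in a database) rather than an in-process cache.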
This methodology is under development. Check back for updates.