MTH-002.1 Semantic Association Metrics
Published
v4.0 January 18, 2026

Spread and Fidelity Scoring

Core metrics for INS-001 semantic association assessment

Abstract

Documents the scoring framework for INS-001 semantic association instruments. Both instruments measure divergent thinking (operationalized as spread) and communication fidelity (operationalized as fidelity in INS-001.2 and communicability in INS-001.1). Free associations reflect semantic memory—culturally embedded knowledge structures that enable shared meaning. We measure whether associations that explore semantic territory can also communicate with high fidelity to both humans and AI. Spread adapts the Divergent Association Task methodology (Olson et al., 2021). Includes DAT calibration studies and Haiku-calibrated interpretation thresholds for both instruments.

Executive Summary

This study documents the scoring framework for the INS-001 family of semantic association instruments:

  • INS-001.1 (Signal): Free association to a single seed word
  • INS-001.2 (Common Ground): Constrained association between two endpoint words

Both instruments measure two fundamental properties:

  1. Spread — semantic territory covered (divergent thinking)
  2. Fidelity — whether associations communicate effectively (task validity)

Why These Instruments?

Free associations reflect semantic memory—culturally embedded knowledge structures that enable shared meaning. We designed these instruments to measure whether associations that explore semantic space can also communicate with high fidelity to both humans and AI.

The Problem: Relevance Was Redundant with Spread

Our original “relevance” metric (proximity to reference words) was strongly inversely correlated with spread:

| Instrument | Spread × Relevance | Implication |
|---|---|---|
| INS-001.1 | r = -0.77 | Highly redundant |
| INS-001.2 | r = -0.61 | Highly redundant |

This is geometric: clues close to reference points cluster together (low spread); distant clues spread apart (low relevance). We needed fidelity metrics that could capture task engagement independently of spread.

The Solutions: Communicability and Fidelity

INS-001.1: We discovered a synergy—higher spread actually improves communicability (r = +0.34). Associations that are too constrained may be idiosyncratic and fail to communicate.

INS-001.2: We developed fidelity (joint constraint score), which achieves independence from spread (r = +0.06) by measuring whether clues collectively triangulate the anchor-target pair rather than their proximity to endpoints.

Instrument Comparison

| Property | INS-001.1 (Radiation) | INS-001.2 (Bridging) |
|---|---|---|
| Task | Free associate to seed | Bridge anchor-target pair |
| Constraint type | Implicit (must communicate) | Explicit (must connect endpoints) |
| Spread (Haiku mean) | 65.5 | 66.1 |
| Fidelity metric | Communicability (93.2%) | Joint constraint (0.68) |
| Predictability metric | Unconventionality | — |
| Optimal clue count | 2 | 4 |
| DAT gap | -8.6 points | -8.0 points |

Haiku Baselines

| Metric | INS-001.1 | INS-001.2 | DAT Reference |
|---|---|---|---|
| Spread | 65.5 ± 4.6 | 66.1 ± 4.1 | 74.1 ± 1.6 |
| Fidelity | 93.2% (comm.) | 0.68 ± 0.13 | — |
| Unconventionality | High (by design) | — | — |

Note: Haiku’s unconventionality is high by design—the noise floor is calibrated against Haiku’s own predictable responses.


1. Theoretical Framework

1.1 Why Semantic Associations?

Free associations reveal the structure of semantic memory—the mental lexicon organized by meaning rather than sound or spelling. When someone hears “dog” and thinks “cat,” they’re traversing a culturally shared knowledge structure that enables communication.

This cultural embedding has two implications:

  1. Associations are not random: They follow statistical regularities learned from language exposure
  2. Associations can communicate: Because semantic structures are shared, one person’s associations can be decoded by another

1.2 The Dual Measurement Goal

We want to measure two capacities simultaneously:

| Capacity | Operationalization | Why It Matters |
|---|---|---|
| Divergent thinking | Spread (semantic distance) | Creativity requires exploring distant concepts |
| Communication fidelity | Communicability / Fidelity | Ideas must be transmissible to have impact |

The tension between these creates the core construct: semantic divergence under constraint.

1.3 Two Constraint Types

INS-001.1 and INS-001.2 impose different constraints on divergent exploration:

INS-001.1 (Radiation): The constraint is implicit. Participants generate free associations, but we measure whether an AI (or human) guesser can recover the original seed. Highly idiosyncratic associations may achieve high spread but low communicability.

INS-001.2 (Bridging): The constraint is explicit. Participants must generate clues connecting two given endpoints. The bridging requirement geometrically limits achievable spread.

Both instruments measure the same underlying capacity from different angles.

1.4 Literature Basis

Spread: Adapts the Divergent Association Task (DAT) from Olson et al. (2021), which validated mean pairwise semantic distance as a creativity measure correlating with established creativity tests (r = 0.40 with AUT, r = 0.28 with RAT).

Communicability: Extends the “clue-giver / guesser” paradigm from collaborative word games, operationalized using AI (Claude Haiku) as a standardized guesser.

Fidelity: Novel metric developed for INS-001.2, based on information-theoretic intuition about how clues should collectively narrow the solution space.


2. Metric Overview

2.1 Metric Roles by Instrument

| Metric | INS-001.1 Role | INS-001.2 Role |
|---|---|---|
| Spread | Primary divergence measure | Primary divergence measure |
| Unconventionality | Predictability measure | — |
| Communicability | Fidelity measure (binary) | — |
| Relevance | Legacy (inverse of spread) | Legacy (inverse of spread) |
| Fidelity | — | Task validity measure (continuous) |

2.2 Why Measure Fidelity/Relevance?

Divergent thinking alone is insufficient. A participant could achieve high spread by generating random or tangentially related words—semantically distant but disconnected from the task. We need a complementary measure that captures task engagement: are the associations actually about the prompt?

For INS-001.2 (Bridging), this question has a clear geometric answer: do the clues relate to both the anchor and target? The original relevance metric operationalized this as the minimum similarity to either endpoint. However, this created a redundancy problem—clues close to endpoints necessarily cluster together (low spread), producing the r = -0.61 correlation.

The solution was fidelity: instead of measuring proximity to endpoints, measure whether clues collectively identify the specific anchor-target pair versus plausible alternatives. This captures task engagement without penalizing spread.

For INS-001.1 (Radiation), task engagement is implicit in communicability—if the guesser can recover the seed, the associations were relevant enough to communicate. No separate relevance metric is needed.

2.3 Why Different Fidelity Metrics?

The instruments require different fidelity operationalizations:

INS-001.1: With only a seed word as context, fidelity is naturally binary—did the guesser recover the seed or not? This yields communicability (proportion of successful recoveries).

INS-001.2: With two endpoints defining a bridging corridor, fidelity can be graded—how well do clues triangulate the specific pair versus plausible alternatives? This yields fidelity (joint constraint score, 0–1).

2.4 The Independence Problem

The original relevance metric (similarity to endpoints) was highly correlated with spread:

| Instrument | Spread × Relevance | Implication |
|---|---|---|
| INS-001.1 | r = -0.77 | Highly redundant |
| INS-001.2 | r = -0.61 | Highly redundant |

This is geometric: clues close to reference points cluster together (low spread); distant clues spread apart (low relevance).

Solution for INS-001.2: Replace relevance with fidelity (joint constraint), which achieves independence (r = +0.06).

Observation for INS-001.1: Communicability shows positive correlation with spread (r = +0.34)—a synergy rather than tradeoff. Higher spread actually improves communication.


3. Spread (Divergent Thinking)

Spread measures how much semantic territory a set of associations covers. This is the primary divergence metric for both instruments.

3.1 Calculation

Mean pairwise cosine distance, scaled to 0–100:

\text{spread} = 100 \times \frac{1}{\binom{n}{2}} \sum_{i < j} \left(1 - \cos(\mathbf{w}_i, \mathbf{w}_j)\right)

Where:

  • n = number of words
  • \mathbf{w}_i = embedding for word i
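A minimal sketch of this computation (using NumPy; embeddings are assumed to be dense vectors, and the function name mirrors the implementation section but is illustrative here):

```python
import itertools
import numpy as np

def calculate_spread(embeddings: list[np.ndarray]) -> float:
    """Mean pairwise cosine distance between word embeddings, scaled to 0-100."""
    distances = []
    for a, b in itertools.combinations(embeddings, 2):
        # Cosine similarity, then distance = 1 - similarity
        cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        distances.append(1.0 - cos)
    return 100.0 * float(np.mean(distances))
```

For mutually orthogonal embeddings every pairwise distance is 1, so spread is 100; for identical embeddings it is 0.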

3.2 Word Set Composition

| Instrument | Words Included | Rationale |
|---|---|---|
| INS-001.1 | Clues only | Seed is given, not generated |
| INS-001.2 | Clues only | Anchor-target are given, not generated |
| DAT | All 7 words | All words are participant-generated |

Including given words would conflate task difficulty with participant performance.

3.3 Relationship to DAT

The INS-001 spread metric uses the same formula as DAT but measures a different construct:

| Property | DAT | INS-001.1 | INS-001.2 |
|---|---|---|---|
| Task goal | Maximize distance | Free associate | Bridge endpoints |
| Constraint | None | Implicit (communicate) | Explicit (bridge) |
| Haiku mean | 74.1 | 65.5 | 66.1 |
| Gap to DAT | — | -8.6 | -8.0 |

The gaps reflect the cost of constraint: even without trying to maximize distance, constrained tasks produce less spread.


4. DAT Calibration Study

4.1 Motivation

To interpret INS-001 spread scores, we need a reference point. The Divergent Association Task (DAT) provides this: a well-validated creativity measure with human norms (mean = 78, SD = 6).

Using Claude Haiku 4.5 as a calibration reference allows us to:

  1. Establish a stable, reproducible baseline
  2. Convert INS-001 scores to DAT-equivalent scale
  3. Detect embedding model effects

4.2 Haiku DAT Performance

| Metric | Human Norm | Haiku (GloVe) | Haiku (OpenAI) | Random Baseline |
|---|---|---|---|---|
| Mean | 78 ± 6 | 94.8 ± 2.4 | 74.1 ± 1.6 | 89.8 ± 4.2 |
| Interpretation | — | Barely above random | Human-comparable | Uniform sampling |

Critical finding: With GloVe embeddings, Haiku (94.8) barely exceeds a random baseline (89.8)—a difference of only 5 points. This means GloVe cannot meaningfully distinguish divergent thinking from noise. With OpenAI embeddings, Haiku falls within the human-comparable range (72–84).

Figure 4: Haiku DAT scores using GloVe embeddings compared to human norms and random baselines. Haiku’s distribution (blue, μ=94.8) is entirely disjoint from human norms (green, μ=78) and barely exceeds the random uniform baseline (red dotted, 90). This demonstrates that GloVe embeddings lack discriminative power for LLM-generated content.

4.3 Embedding Model Effects

The choice of embedding model dramatically affects scores:

| Comparison | Value |
|---|---|
| GloVe mean | 94.8 |
| OpenAI mean | 74.1 |
| Difference | 20.7 points |
| Correlation | r = 0.60 |

The moderate correlation (r = 0.60) confirms these embedding spaces capture different similarity structures. Scores are not interchangeable across models.

Figure 5: GloVe vs OpenAI DAT scores across 150 trials. The 20.7-point systematic difference and moderate correlation (r = 0.60) demonstrate that embedding model choice fundamentally affects score interpretation.

Recommendation: Use OpenAI text-embedding-3-small as the standard embedding model. GloVe embeddings lack discriminative power for LLM-generated content.

4.4 Temperature Effects

| Temperature | Mean Score | SD |
|---|---|---|
| 0.0 | 73.8 | 1.4 |
| 0.5 | 74.2 | 1.6 |
| 1.0 | 74.3 | 1.8 |

Linear regression: R² = 0.006, p = 0.33

Finding: Temperature has negligible effect on divergent association capacity, simplifying deployment.

4.5 Calibration Adjustments

To convert INS-001 spread to DAT-equivalent scale:

| Task | Haiku INS-001 | Haiku DAT | Adjustment |
|---|---|---|---|
| INS-001.1 | 65.5 | 74.1 | +8.6 |
| INS-001.2 (overall) | 66.1 | 74.1 | +8.0 |
| INS-001.2 (easy pairs) | 63.8 | 74.1 | +10.4 |
| INS-001.2 (hard pairs) | 67.6 | 74.1 | +6.5 |

Note: INS-001.2 adjustments are stratified by pair difficulty because anchor-target distance moderates achievable spread.

4.6 Calibration Assumption Validation

The calibration assumes Haiku’s divergent capacity is constant across tasks—that score differences reflect task structure, not ability differences. We tested this by comparing pairwise distances:

| Comparison | DAT Pairs | Bridging Clue Pairs |
|---|---|---|
| N pairs | 3,150 | 5,202 |
| Mean distance | 0.741 | 0.654 |
| SD | 0.061 | 0.104 |
Statistical test: t = 42.6, p < 0.0001, Cohen’s d = 1.02 (large effect)

Warning: The assumption was not validated. DAT pairwise distances are systematically higher than bridging clue distances, even for the same model. This large effect (d = 1.02) indicates the bridging constraint genuinely compresses Haiku’s associative range—not just an artifact of pair composition.

Implication: The calibration adjustment accounts for this compression, but users should understand that constrained association (INS-001.2) produces systematically lower divergence than free association (DAT or INS-001.1) for fundamental geometric reasons, not just task structure.


5. INS-001.1: Radiation (Free Association)

5.1 Task Description

Participants receive a seed word and generate free associations. The task measures divergent thinking with an implicit constraint: associations should be recoverable by an external guesser.

| Component | Description |
|---|---|
| Input | Single seed word |
| Output | 2–10 association words |
| Constraint | Implicit (communicability) |
| Scoring | Spread + Communicability |

5.2 Communicability Metric

Communicability measures whether a standardized guesser (Claude Haiku 4.5) can recover the original seed from the participant’s associations.

Protocol:

  1. Present clues to Haiku (without seed)
  2. Haiku generates top-3 guesses for the seed
  3. Score as communicable if seed appears in guesses

Scoring:

  • Exact match: Seed word appears verbatim
  • Semantic match: Seed’s nearest neighbor in embedding space appears
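A sketch of this scoring step; the function name and arguments are illustrative, and the guesser's top-3 list and the seed's nearest embedding-space neighbor are assumed to be obtained elsewhere:

```python
def score_communicability(seed: str, guesses: list[str],
                          seed_nearest_neighbor: str) -> dict:
    """Score one trial: communicable if the seed (exact match) or its
    nearest embedding-space neighbor (semantic match) appears among the
    guesser's top-3 guesses."""
    normalized = [g.strip().lower() for g in guesses]
    exact = seed.lower() in normalized
    semantic = seed_nearest_neighbor.lower() in normalized
    return {
        "communicable": exact or semantic,
        "exact_match": exact,
        "semantic_match": semantic,
    }
```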

5.3 Unconventionality Metric

Unconventionality measures how predictable a participant’s associations are—whether they avoided the obvious, “first thing that comes to mind” responses.

Rationale: High spread alone doesn’t distinguish between genuinely creative associations and merely random ones. Unconventionality captures whether participants explored outside the predictable semantic neighborhood, not just whether their words were distant from each other.

Algorithm:

def calculate_unconventionality(
    user_associations: list[str],
    noise_floor: list[str]  # Most predictable associations for seed
) -> dict:
    """
    Compare user associations against the predictable neighborhood.

    The noise floor contains the N most common/predictable associations
    for the seed word (e.g., nearest neighbors in embedding space,
    or empirically frequent human responses).
    """

    def is_match(user_word: str, floor_word: str) -> bool:
        """Check exact match or morphological variants."""
        # Normalize: lowercase, then stem. `stem` is any light stemmer
        # that collapses inflections (e.g. nltk.stem.PorterStemmer().stem).
        user_stem = stem(user_word.lower())
        floor_stem = stem(floor_word.lower())
        return user_stem == floor_stem

    # Count overlaps with noise floor
    overlaps = []
    for user_word in user_associations:
        for floor_word in noise_floor:
            if is_match(user_word, floor_word):
                overlaps.append(user_word)
                break

    overlap_count = len(overlaps)

    # Classify unconventionality
    if overlap_count == 0:
        level = "high"
        description = "None of your associations appeared in the predictable neighborhood."
    elif overlap_count <= 2:
        level = "moderate"
        description = f"{overlap_count} of your associations appeared in the predictable neighborhood."
    else:
        level = "low"
        description = f"{overlap_count} of your associations appeared in the predictable neighborhood."

    return {
        "level": level,
        "overlap_count": overlap_count,
        "overlapping_words": overlaps,
        "description": description
    }

Interpretation:

| Overlap Count | Level | Interpretation |
|---|---|---|
| 0 | High | Avoided all predictable associations—genuinely unconventional |
| 1–2 | Moderate | Some predictable choices, but explored beyond the obvious |
| 3+ | Low | Most associations were predictable/conventional |

Relationship to Other Metrics:

Unconventionality is distinct from both spread and communicability:

| Metric | What It Measures | High Score Means |
|---|---|---|
| Spread | Distance between associations | Words are semantically far apart |
| Unconventionality | Distance from predictable set | Words avoided the obvious choices |
| Communicability | Guesser recovery success | Associations decode to correct seed |

A participant could have:

  • High spread + Low unconventionality: Words are distant from each other but all predictable (e.g., for “dog”: cat, bark, leash, walk)
  • Low spread + High unconventionality: Words cluster together but in an unexpected region (e.g., for “dog”: loyalty, companionship, devotion)
  • High spread + High unconventionality: The ideal—exploring distant, non-obvious territory

5.4 The Spread-Communicability Synergy

Contrary to expectation, spread and communicability show positive correlation:

| Statistic | Value |
|---|---|
| Point-biserial correlation | r = 0.34, p < 0.0001 |
| Communicable trials spread | 66.0 |
| Non-communicable trials spread | 59.6 |
| t-test | t = 6.40, p < 0.0001 |

Interpretation: Associations that are too constrained (low spread) may be idiosyncratic and fail to communicate. Moderate divergence appears optimal—it provides multiple triangulation points without becoming disconnected from the seed.

Figure 1: INS-001.1 calibration results. Left: Spread-relevance inverse correlation (r = -0.77), with green points showing communicable trials and red showing non-communicable. Middle: Spread distribution by communicability status—communicable trials show higher spread (the synergy). Right: Spread varies by seed category, with rare concrete words producing highest spread.

5.5 Spread-Relevance Relationship

The strong inverse correlation between spread and relevance (r = -0.77) confirms these measure the same dimension from opposite poles:

| Spread Quartile | Relevance Mean | Communicability |
|---|---|---|
| Q1 (low) | 0.47 | 78.2% |
| Q2 | 0.42 | 97.4% |
| Q3 | 0.40 | 97.4% |
| Q4 (high) | 0.36 | 100% |

Higher spread → lower relevance → higher communicability. The “conventional” associations (high relevance, low spread) are actually harder to decode.

5.6 Optimal Clue Count

| Clues | Spread | SD | Communicability |
|---|---|---|---|
| 1 | — | — | 100% |
| 2 | 69.7 | 8.6 | 100% |
| 3 | 68.1 | 6.2 | 100% |
| 5 | 65.4 | 4.9 | 100% |
| 10 | 66.6 | 4.4 | 100% |

Recommendation: 2 clues—maximum spread while maintaining perfect communicability.

Figure 2: INS-001.1 metrics by clue count. Top-left: Spread decreases then stabilizes as clues increase. Top-right: Communicability remains at ceiling across all clue counts. Bottom-left: Spread vs communicability shows all configurations above 80% threshold. Bottom-right: Relevance decreases slightly with more clues.

5.7 Seed Category Effects

| Category | Spread | Relevance | Communicability |
|---|---|---|---|
| Common concrete | 64.7 | 0.422 | 100% |
| Common abstract | 65.9 | 0.433 | 100% |
| Rare concrete | 69.9 | 0.331 | 100% |
| Rare abstract | 65.0 | 0.417 | 100% |
| Polysemous | 68.3 | 0.391 | 100% |

ANOVA for spread: F = 19.96, p < 0.0001

Finding: Rare concrete words produce highest spread—likely because they have fewer strong associates, forcing exploration.

5.8 Haiku Version Comparison

Comparing Haiku 3.5 (from LWOW dataset) to Haiku 4.5:

| Metric | Value |
|---|---|
| Mean Jaccard similarity | 0.213 |
| Mean overlap (of 3 clues) | 0.95 |
| Perfect match rate | 0% |
| No overlap rate | 25% |

Interpretation: Haiku versions share some associations but are not identical. Model version should be locked for reproducibility.

Figure 3: Association overlap between Haiku 3.5 and 4.5 across 20 seed words.
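The mean Jaccard similarity reported above is plain set overlap between the two versions' association sets; a minimal sketch:

```python
def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two association sets."""
    if not a and not b:
        return 1.0  # two empty sets are trivially identical
    return len(a & b) / len(a | b)
```

For example, {"cat", "bark"} vs {"cat", "leash"} share one of three distinct words, giving 1/3.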


6. INS-001.2: Bridging (Constrained Association)

6.1 Task Description

Participants receive an anchor-target word pair and generate clues that connect them. The task measures divergent thinking with an explicit geometric constraint.

| Component | Description |
|---|---|
| Input | Anchor word + Target word |
| Output | 4–5 clue words |
| Constraint | Explicit (must bridge) |
| Scoring | Spread + Fidelity |

6.2 Fidelity Metric

Fidelity measures how well clues collectively identify the anchor-target pair by eliminating plausible alternatives.

Intuition: Good bridging clues should:

  • Coverage: Eliminate most alternative endpoints (each clue rules out some wrong answers)
  • Efficiency: Be non-redundant (different clues eliminate different wrong answers)

Bad clues either:

  • Eliminate nothing (uninformative—too far from endpoints)
  • All eliminate the same foils (redundant—no complementary information)

6.2.1 Mathematical Formulation

Let \mathbf{a} be the anchor embedding, \mathbf{t} the target embedding, and \{\mathbf{c}_1, \ldots, \mathbf{c}_n\} the clue embeddings.

Foil sets: For each endpoint, generate the 50 nearest neighbors in embedding space:

  • F_a = \{f_1^a, \ldots, f_{50}^a\} — foils for anchor
  • F_t = \{f_1^t, \ldots, f_{50}^t\} — foils for target

Elimination: A clue ci\mathbf{c}_i eliminates foil fjf_j if the clue is more similar to the true endpoint than to the foil:

\text{eliminates}(\mathbf{c}_i, f_j^a) = \mathbb{1}\left[\cos(\mathbf{c}_i, \mathbf{a}) > \cos(\mathbf{c}_i, f_j^a)\right]

Elimination sets: For each clue, compute the set of foils it eliminates:

E_i^a = \{j : \text{eliminates}(\mathbf{c}_i, f_j^a)\}

Coverage: Fraction of foils eliminated by at least one clue:

\text{coverage}_a = \frac{\left|\bigcup_{i=1}^{n} E_i^a\right|}{|F_a|}

Efficiency: Non-redundancy of elimination (1 minus the Jaccard-like redundancy):

\text{efficiency}_a = 1 - \frac{\left|\bigcap_{i=1}^{n} E_i^a\right|}{\left|\bigcup_{i=1}^{n} E_i^a\right|}

Combined fidelity:

\text{fidelity} = \frac{\text{coverage}_a + \text{coverage}_t}{2} \times \frac{\text{efficiency}_a + \text{efficiency}_t}{2}

6.2.2 Algorithm Implementation

# Assumes a `cosine_similarity(a, b)` helper, e.g.
# np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)).
def calculate_fidelity(
    clue_embeddings: list[list[float]],
    anchor_embedding: list[float],
    target_embedding: list[float],
    foil_anchors: list[list[float]],  # 50 nearest neighbors of anchor
    foil_targets: list[list[float]],  # 50 nearest neighbors of target
) -> dict:
    """
    Measure how well clues jointly constrain the solution space.

    A foil is "eliminated" by a clue if that clue is more similar to the
    true endpoint than to the foil.
    """

    def get_eliminations(clue_emb, true_emb, foil_embs):
        """Return set of foil indices eliminated by this clue."""
        true_sim = cosine_similarity(clue_emb, true_emb)
        eliminated = set()
        for i, foil in enumerate(foil_embs):
            foil_sim = cosine_similarity(clue_emb, foil)
            if true_sim > foil_sim:
                eliminated.add(i)
        return eliminated

    # Compute eliminations for each clue
    anchor_elims = [get_eliminations(c, anchor_embedding, foil_anchors)
                    for c in clue_embeddings]
    target_elims = [get_eliminations(c, target_embedding, foil_targets)
                    for c in clue_embeddings]

    # Coverage: fraction of foils eliminated by at least one clue
    anchor_union = set.union(*anchor_elims) if anchor_elims else set()
    target_union = set.union(*target_elims) if target_elims else set()
    anchor_coverage = len(anchor_union) / len(foil_anchors)
    target_coverage = len(target_union) / len(foil_targets)

    # Efficiency: 1 - (intersection / union)
    def compute_efficiency(elim_sets):
        if len(elim_sets) < 2:
            return 1.0
        union = set.union(*elim_sets)
        if not union:
            return 0.0
        intersection = set.intersection(*elim_sets)
        redundancy = len(intersection) / len(union)
        return 1 - redundancy

    anchor_efficiency = compute_efficiency(anchor_elims)
    target_efficiency = compute_efficiency(target_elims)

    # Combined metrics
    overall_coverage = (anchor_coverage + target_coverage) / 2
    overall_efficiency = (anchor_efficiency + target_efficiency) / 2
    fidelity = overall_coverage * overall_efficiency

    return {
        "fidelity": fidelity,
        "coverage": overall_coverage,
        "efficiency": overall_efficiency
    }

6.2.3 Foil Generation

Foils are the 50 nearest neighbors of each endpoint in embedding space:

import numpy as np  # for argsort; also assumes the `cosine_similarity` helper above

def generate_endpoint_foils(
    endpoint_embedding: list[float],
    vocabulary_embeddings: dict[str, list[float]],
    n_foils: int = 50
) -> list[list[float]]:
    """
    Generate hard-to-distinguish foils using nearest neighbors.
    """
    vocab_embs = list(vocabulary_embeddings.values())
    similarities = [cosine_similarity(endpoint_embedding, v) for v in vocab_embs]
    sorted_indices = np.argsort(similarities)[::-1]
    # Skip index 0 (might be the word itself)
    foil_indices = sorted_indices[1:n_foils+1]
    return [vocab_embs[i] for i in foil_indices]

Why nearest neighbors? These are the hardest foils to distinguish—words semantically similar to the true endpoint. If clues can eliminate these, they carry strong signal about the true pair.

6.2.4 Component Interpretation

| Component | Meaning | Low Value | High Value |
|---|---|---|---|
| Coverage | Fraction of foils eliminated | Clues too distant from endpoints | Clues in right neighborhood |
| Efficiency | Non-redundancy of elimination | All clues eliminate same foils | Complementary information |

Both components contribute to fidelity. A score can be high-coverage but low-efficiency (redundant conventional clues) or low-coverage but high-efficiency (sparse but complementary).

6.3 Why Fidelity Replaced Relevance

The original relevance metric exhibited a strong inverse correlation with spread:

| Issue | Relevance | Fidelity |
|---|---|---|
| Spread correlation | r = -0.61 (redundant) | r = +0.06 (independent) |
| A-T distance correlation | r = -0.41 (confounded) | r = +0.02 (unconfounded) |
| Interpretation | "How close to corridor" | "How well do clues triangulate" |

The fundamental tradeoff: Trial-level correlation between relevance and clue-only divergence was r = -0.61 (p < 0.0001). This is geometric—clues close to both endpoints must cluster together.

Figure 7: The relevance-divergence tradeoff in INS-001.2. Left: Trial-level correlation (r = -0.61) between relevance and clue-only divergence, with points colored by anchor-target distance. Right: Clue-level correlation (r = -0.74) showing that more relevant clues are closer to the prompt words.

Fidelity was selected from three candidate metrics based on: lowest spread correlation, lowest difficulty confound, no ceiling effect, and interpretable decomposition (coverage × efficiency).

6.4 Pair Difficulty Moderation

Anchor-target distance affects achievable spread—a key moderating variable for interpretation.

| Difficulty | A-T Distance | Clue-Clue Spread | N |
|---|---|---|---|
| Easy | 0.696 | 63.8 ± 3.9 | 39 |
| Medium | 0.754 | 64.8 ± 4.6 | 38 |
| Hard | 0.811 | 67.6 ± 4.4 | 39 |

ANOVA: F = 8.08, p = 0.0005, η² = 0.13 (medium effect)

Post-hoc comparisons (Bonferroni-corrected):

  • Easy vs Hard: Δ = -3.80, p = 0.0001 ***
  • Medium vs Hard: Δ = -2.71, p = 0.010 *
  • Easy vs Medium: Δ = -1.10, p = 0.26 (ns)

Interpretation: When endpoints are farther apart, there’s more geometric “room” for clue spread without violating relevance constraints. This 3.8-point difference between easy and hard pairs is substantial—nearly one standard deviation.

Figure 8: Clue-clue spread by pair difficulty tercile. Harder pairs (greater A-T distance) allow significantly more spread.

Implication: INS-001.2 scores should be interpreted relative to pair difficulty. A spread of 65 on an easy pair reflects more divergent capacity than 65 on a hard pair. Stratified calibration adjustments account for this:

| Difficulty | A-T Threshold | DAT Adjustment |
|---|---|---|
| Easy | < 0.727 | +10.4 points |
| Medium | 0.727–0.777 | +9.3 points |
| Hard | > 0.777 | +6.5 points |
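A sketch of applying these stratified adjustments (thresholds from the calibration study; the helper name is illustrative):

```python
def dat_equivalent_bridging(spread: float, anchor_target_distance: float) -> float:
    """Convert an INS-001.2 spread score to the DAT-equivalent scale using
    the difficulty-stratified calibration adjustments."""
    if anchor_target_distance < 0.727:      # easy pairs
        adjustment = 10.4
    elif anchor_target_distance <= 0.777:   # medium pairs
        adjustment = 9.3
    else:                                   # hard pairs
        adjustment = 6.5
    return spread + adjustment
```

For example, a spread of 63.8 on an easy pair converts to 74.2, near the Haiku DAT mean of 74.1.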

6.5 Optimal Clue Count

Spread stabilizes after 3–4 clues, with diminishing marginal contribution:

| Clues | Full Spread | SD | Clue-Clue Pairs | Gap to DAT |
|---|---|---|---|---|
| 1 | 68.6 | 5.2 | 0 | +5.5 |
| 2 | 66.8 | 5.1 | 1 | +7.3 |
| 3 | 66.4 | 4.8 | 3 | +7.7 |
| 4 | 66.3 | 4.7 | 6 | +7.8 |
| 5 | 66.2 | 4.7 | 10 | +7.9 |
| 10 | 66.2 | 4.1 | 45 | +7.9 |

Marginal contribution analysis: After 2 clues, additional clues provide negligible marginal divergence (all p > 0.05 except 1→2 which shows Δ = -1.85, p < 0.0001).

Figure 9: INS-001.2 divergence trajectories by clue count. Top-left: Full and clue-only divergence converge as clue count increases. Top-right: Gap to DAT stabilizes around 8 points. Bottom-left: Pair composition shifts toward clue-clue pairs. Bottom-right: Proportion of participant-generated pairs increases but never reaches DAT’s 100%.

Optimization criteria:

  1. Gap to DAT (construct validity)
  2. Clue-clue pair proportion (purity of divergence measurement)
  3. Score stability (SD)
  4. Task efficiency (participant burden)

Recommendation: 4 clues—balances construct validity (40% clue-clue pairs), stability (SD = 4.7), and reasonable task length.

6.6 Joint Interpretation

| Spread | Fidelity | Interpretation |
|---|---|---|
| High | High | Creative and precise bridging |
| High | Low | Divergent but off-task |
| Low | High | Conventional but effective |
| Low | Low | Constrained and off-task |

7. Interpretation Thresholds

7.1 INS-001.1 Spread Thresholds (Haiku-Calibrated)

| Score | Label | Description | Haiku Percentile |
|---|---|---|---|
| < 59 | Low | Very constrained associations | < 10th |
| 59–62 | Below Average | Limited exploration | 10th–25th |
| 62–69 | Average | Typical free association | 25th–75th |
| 69–72 | Above Average | Wide-ranging associations | 75th–90th |
| > 72 | High | Exceptional spread | > 90th |

Haiku reference: mean = 65.5, SD = 4.6
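These thresholds can be applied mechanically; a small illustrative helper (band edges are treated as half-open at the upper edge, which the threshold table leaves ambiguous):

```python
def spread_label_ins001_1(score: float) -> str:
    """Map an INS-001.1 spread score to its Haiku-calibrated label,
    using bands [59, 62), [62, 69), [69, 72)."""
    if score < 59:
        return "Low"
    if score < 62:
        return "Below Average"
    if score < 69:
        return "Average"
    if score < 72:
        return "Above Average"
    return "High"
```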

7.2 INS-001.1 Unconventionality Thresholds

| Overlap Count | Level | Interpretation |
|---|---|---|
| 0 | High | Fully avoided predictable associations |
| 1–2 | Moderate | Partial exploration beyond the obvious |
| 3+ | Low | Relied on predictable/conventional associations |

Note: Unconventionality should be interpreted alongside spread and communicability. The ideal profile is high spread + high unconventionality + successful communicability—exploring distant, non-obvious territory while still communicating effectively.

7.3 INS-001.1 Communicability Expectations

| Rate | Interpretation |
|---|---|
| < 80% | Idiosyncratic associations—may not communicate |
| 80–90% | Partial communication—some unusual associations |
| > 90% | Effective communication—associations decode reliably |

Haiku reference: 93.2%

7.4 INS-001.2 Spread Thresholds (Haiku-Calibrated)

| Score | Label | Description | Haiku Percentile |
|---|---|---|---|
| < 55 | Low | Very constrained associations | < 5th |
| 55–62 | Below Average | Limited exploration | 5th–25th |
| 62–68 | Average | Typical bridging performance | 25th–75th |
| 68–72 | Above Average | Wide-ranging associations | 75th–95th |
| > 72 | High | Exceptional spread | > 95th |

Haiku reference: mean = 64.4, SD = 4.6

7.5 INS-001.2 Fidelity Thresholds (Haiku-Calibrated)

| Score | Label | Description | Haiku Percentile |
|---|---|---|---|
| < 0.50 | Diffuse | Clues don’t narrow down the connection | < 5th |
| 0.50–0.65 | Partial | Some constraint, with redundancy | 5th–25th |
| 0.65–0.75 | Moderate | Reasonable triangulation | 25th–75th |
| 0.75–0.85 | Focused | Efficient identification | 75th–95th |
| > 0.85 | Precise | Optimal constraint | > 95th |

Haiku reference: mean = 0.68, SD = 0.13

7.6 DAT-Equivalent Conversion

To express INS-001 spread in DAT-equivalent terms:

| Original | Adjustment | DAT-Equivalent | Interpretation |
|---|---|---|---|
| INS-001.1: 60 | +8.6 | 68.6 | Below human average |
| INS-001.1: 65 | +8.6 | 73.6 | Near human average |
| INS-001.1: 70 | +8.6 | 78.6 | At human average |
| INS-001.2: 60 | +9.7 | 69.7 | Below human average |
| INS-001.2: 65 | +9.7 | 74.7 | Near human average |

Human DAT reference: mean = 78, SD = 6


8. Implementation

8.1 Core Functions

| Function | INS-001.1 | INS-001.2 |
|---|---|---|
| calculate_spread() | ✓ | ✓ |
| calculate_unconventionality() | ✓ | — |
| calculate_communicability() | ✓ | — |
| calculate_fidelity() | — | ✓ |
| generate_endpoint_foils() | — | ✓ |
| generate_noise_floor() | ✓ | — |
| score_radiation() | ✓ | — |
| score_bridging() | — | ✓ |

8.2 Return Schemas

INS-001.1 (score_radiation()):

{
    "spread": float,              # Clue-only divergence (0-100)
    "relevance": float,           # Legacy metric (0-1)
    "communicable": bool,         # Did Haiku guess correctly?
    "haiku_guesses": list,        # Top-3 guesses
    "exact_match": bool,          # Verbatim match
    "semantic_match": bool,       # Nearest-neighbor match
    "unconventionality": {
        "level": str,             # "high", "moderate", or "low"
        "overlap_count": int,     # Number of predictable associations
        "overlapping_words": list # Which words matched noise floor
    }
}

INS-001.2 (score_bridging()):

{
    "spread": float,              # Clue-only divergence (0-100)
    "fidelity": float,            # Joint constraint score (0-1)
    "fidelity_coverage": float,   # Coverage component (0-1)
    "fidelity_efficiency": float, # Efficiency component (0-1)
    "relevance": float,           # Legacy metric (0-1)
    "valid": bool                 # Whether relevance >= 0.15
}

8.3 Dependencies

| Component | Specification |
|---|---|
| Embedding model | OpenAI text-embedding-3-small (1536 dim) |
| Guesser model | Claude Haiku 4.5 (claude-haiku-4-5-20251001) |
| Vocabulary | ~5000 common English nouns |
| Foil count | 50 per endpoint |

8.4 Computational Cost

| Operation | INS-001.1 | INS-001.2 |
|---|---|---|
| Embeddings | O(n) clues | O(n) clues |
| Spread | O(n²) pairs | O(n²) pairs |
| Fidelity | — | O(n × 100) foils |
| Guesser call | 1 API call | — |

9. Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Haiku as reference | AI may not match human associations | Plan human validation study |
| Embedding dependence | Different models yield different scores | Standardize on OpenAI; re-calibrate if changed |
| Cultural bias | Semantic structures vary by culture/language | Document English-only scope; expand later |
| Guesser brittleness | Haiku capabilities may shift | Lock model version; re-calibrate periodically |
| Single-trial reliability | High variance per trial | Use multiple trials; report aggregates |

10. Code

Implementation available at:


Changelog

| Version | Date | Changes |
|---|---|---|
| 4.0 | 2026-01-18 | Expanded scope to cover both INS-001.1 and INS-001.2; added theoretical framework on semantic memory and communication fidelity; added INS-001.1 calibration study with spread-communicability synergy finding; added DAT calibration details including embedding model effects; documented optimal clue counts for both instruments |
| 3.0 | 2026-01-18 | Replaced relevance with fidelity for INS-001.2; added fidelity calibration study |
| 2.0 | 2026-01-17 | Added DAT calibration; renamed “divergence” to “spread” |
| 1.0 | 2026-01-15 | Initial publication |

Figures

The following figures are referenced in this document (located in /images/methods/semantic-association-metrics/spread-fidelity-scoring/):

INS-001.1 (Radiation):

  1. ins001_1_calibration.png — Spread-relevance relationship and seed category effects
  2. ins001_1_clue_count_trajectories.png — Spread and communicability by clue count
  3. ins001_1_haiku_version_comparison.png — Association overlap between Haiku versions

DAT Calibration:

  4. dat_scores_vs_baselines.png — Haiku vs human norms vs random baselines
  5. embedding_model_comparison.png — GloVe vs OpenAI DAT score comparison

INS-001.2 (Bridging):

  6. alternative_relevance_metrics.png — Fidelity metric selection analysis
  7. relevance_divergence_tradeoff.png — The fundamental relevance-spread tradeoff
  8. pair_difficulty_moderation.png — Clue spread by pair difficulty tercile
  9. divergence_trajectory.png — Full and clue-only divergence by clue count
  10. tercile_comparison.png — Fidelity by spread tercile
  11. metric_correlations.png — Correlation heatmap for all INS-001.2 metrics