1. Why This Matters for INS-001
Traditional cognitive assessments optimize for psychometric validity while treating participant experience as secondary. The result: reliable measurements from disengaged participants who may never return. Digital instruments face an additional constraint: they compete with everything else on a phone screen.
INS-001 attempts a different approach—using game mechanics to create intrinsically motivating tasks while preserving measurement validity. This library article examines whether this is possible and what trade-offs it entails.
Three design challenges require attention:
- Gamification validity (H8.1) — Can game-based formats maintain psychometric properties?
- Task parameter calibration — How do design choices (clue count, time pressure, constraints) affect what we measure?
- Harm prevention (H8.4) — How do we present results without causing psychological harm?
The first two have empirical grounding; the third is acknowledged but unstudied.
2. Enables
| Type | Items |
|---|---|
| Instruments | INS-001.1, INS-001.2, all future instruments |
| Hypotheses | H8.1, H8.2, H8.3, H8.4 |
| Methods | MTH-002 (design rationale) |
3. Must Establish
H8.1: Game-Based Assessment Validity
Claim: Game-based and gamified cognitive assessments can maintain psychometric validity while increasing engagement relative to traditional formats.
Status: Supported for memory and attention; creativity-specific gamification not validated
Evidence:
- Pedersen et al. (2023) validated Skill Lab, demonstrating that mobile cognitive games can achieve substantial correlations (r = 0.40–0.60) with laboratory tasks for memory and attention
- Lumsden et al. (2016) systematic review established design principles for gamified cognitive assessment, documenting conditions under which gamification preserves validity
Complication: Neither study validated gamification for creativity or divergent thinking tasks. INS-001 represents an extrapolation from memory/attention domains to semantic association.
Implication for INS-001: We have methodological precedent but no direct validation that game-based word association preserves the construct validity established for traditional DAT.
H8.2: Word Association Games Provide Valid Windows
Claim: Word association game mechanics (as used in games like Codenames, Connector) can reveal meaningful semantic cognition.
Status: Supported by game design precedent
Evidence:
- Kumar et al. (2021) validated the Connector game for measuring semantic association between concepts
- Stephenson et al. (2024) established LLM-human comparison methodology using Codenames as a benchmark
- Cazalets & Dambre (2025) developed human-AI synchronization paradigms using word association
- Xu et al. (2025) validated CK-Arena for game-based conceptual knowledge assessment
Implication for INS-001: Word association games have established precedent as measurement contexts. INS-001.2 (Bridging) adapts the Connector paradigm; INS-001.1 (Radiation) adapts free association with AI-guesser validation.
H8.3: Task Constraints Affect Measured Constructs
Claim: Design parameters—clue count, time pressure, explicit constraints—systematically affect what constructs are measured and how scores should be interpreted.
Status: Empirically documented in MTH-002.1
Evidence:
- MTH-002.1 documents optimal clue counts:
  - INS-001.1: 2 clues recommended (maximizes spread while maintaining 100% communicability)
  - INS-001.2: 4 clues recommended (40% clue-clue pairs, SD = 4.7)
- Constraint effects are substantial: Cohen’s d = 1.02 between unconstrained DAT and constrained INS-001.2 (MTH-002.1 §4.6)
- Relevance–spread redundancy required a metric redesign: the original relevance metric correlated r = -0.77 with spread in INS-001.1 and r = -0.61 in INS-001.2 (MTH-002.1 §2.4)
Implication for INS-001: Design choices have large effects on measurement. The calibration assumption—that constrained spread reflects the same construct as unconstrained spread—is documented but not validated.
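The sketch below makes the two checks above concrete. The arrays are hypothetical stand-ins rather than MTH-002.1 data, and `cohens_d` is an illustrative helper rather than project code; only standard numpy/scipy calls are used.

```python
import numpy as np
from scipy import stats

# Hypothetical spread scores (higher = responses more semantically dispersed).
dat_spread  = np.array([78.1, 81.4, 75.9, 83.2, 79.7, 77.5])  # unconstrained DAT
game_spread = np.array([71.3, 74.0, 69.8, 75.6, 72.2, 70.4])  # constrained INS-001.2

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

print(f"constraint effect: d = {cohens_d(dat_spread, game_spread):.2f}")

# Redundancy check within one instrument: if relevance and spread correlate this
# strongly, the two metrics are largely capturing the same variance.
relevance = np.array([0.62, 0.55, 0.71, 0.49, 0.58, 0.66])  # hypothetical
spread    = np.array([70.2, 74.8, 66.1, 77.3, 72.9, 68.5])  # hypothetical
r, p = stats.pearsonr(relevance, spread)
print(f"relevance-spread correlation: r = {r:.2f}")
```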
H8.4: Results Can Be Presented Without Harm
Claim: Cognitive assessment results can be communicated to participants in ways that inform without causing psychological harm.
Status: Acknowledged concern; not studied
Evidence:
- No published studies examine psychological effects of creativity score feedback
- Self-location in a distribution (learning one is “below average” at creativity) could affect self-concept
- Particular concern for creativity, which may be more identity-linked than memory or attention
Implication for INS-001: We have ethical obligations to consider result presentation carefully, but no empirical guidance on how to do so safely.
4. Key Sources
| Source | Contribution | Caveats |
|---|---|---|
| Pedersen, M. K., et al. (2023). Measuring cognitive abilities in the wild: Validating a population-scale game-based cognitive assessment. Cognitive Science, 47(6), e13308. | Skill Lab validation (r = 0.40–0.60 with lab tasks; 5× faster than traditional) | Supports H8.1 for memory and attention games. Creativity/divergent thinking games not validated. See MTH-002.1 for INS-001-specific calibration. |
| Lumsden, J., et al. (2016). Gamification of cognitive assessment and cognitive training: A systematic review of applications and efficacy. JMIR Serious Games, 4(2), e11. | Gamification systematic review | Supports design principles. Conditions for creativity-task gamification not established. |
| Kumar, A. A., Steyvers, M., & Balota, D. A. (2021). Semantic memory search and retrieval in a novel cooperative word game: A comparison of associative and distributional semantic models. Cognitive Science, 45(10), e13053. | Connector game validation | Supports H8.2. Establishes word association game mechanics for semantic measurement. |
| Cazalets, C., & Dambre, J. (2025). Word Synchronization Challenge: A benchmark for word association responses for LLMs. Proceedings of ECAI 2025, 1–16. | Word Synchronization paradigm | Supports human-AI collaborative assessment formats. |
| Stephenson, L., Sidji, V., & Ronval, O. (2024). Codenames as a benchmark for large language models. arXiv:2412.11373. | Codenames benchmark | Supports H8.2. Establishes LLM-human comparison methodology in word games. |
| Rafferty, A. N., Zaharia, M., & Griffiths, T. L. (2014). Optimally designing games for behavioural research. Proceedings of the Royal Society A, 470(2167), 20130828. | Optimal game design framework | Provides principled design framework for educational games. |
| Xu, Y., et al. (2025). Probe by gaming: A game-based benchmark for assessing conceptual knowledge in LLMs. arXiv:2505.17512. | CK-Arena | Supports game-based conceptual knowledge assessment methodology. |
| MTH-002.1 | Spread-fidelity calibration | Documents metric independence problem and solution; optimal clue counts; constraint effects. Full document |
5. Scope
In scope:
- Game-based cognitive assessment validity
- Word association game mechanics (Codenames, Connector, synchronization games)
- Task design parameters (clue count, time pressure, constraints)
- Result presentation and interpretation
- Human-AI interaction in assessment contexts
Out of scope:
- Semantic network theory → LIB-001
- Measurement validity (test-retest, ecological) → LIB-002
- Reflexive identity effects → LIB-003
- Cross-cultural design → LIB-007
6. Current Gaps
| Gap | Status | Reference |
|---|---|---|
| Creativity-specific gamification | Not validated. Pedersen (2023) validates memory/attention, not divergent thinking. | H8.1 |
| Constraint calibration vs. validation | Documented but not validated. MTH-002.1 calibrates INS-001 against DAT; whether constrained spread measures the same construct is assumed, not tested. | MTH-002.1 §4.6 |
| Harm from self-location | Not studied. No evidence on psychological effects of creativity score feedback. | H8.4 |
| Long-term engagement | Unknown. Only single-session validation exists; retention and re-engagement not studied. | — |
7. The Design Tension
INS-001 navigates a fundamental tension: engaging formats may alter what is measured.
The optimistic view: Game mechanics increase motivation and reduce test anxiety, yielding more valid measurements of underlying capacity by reducing performance-irrelevant variance.
The pessimistic view: Game constraints introduce construct-irrelevant variance. What we call “divergent thinking under constraint” may be a different capacity than unconstrained divergent thinking.
MTH-002.1 documents this tension empirically:
- Constraint produces large effects (d = 1.02)
- We calibrate against DAT to interpret scores
- We cannot determine whether calibration preserves construct validity
The honest position: we have built instruments with documented metric properties and known relationships to established measures. Whether they capture the same underlying constructs is an empirical question we have not answered.
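A minimal sketch of what "calibrating against DAT" can mean in practice, assuming a calibration sample in which the same participants completed both formats. The variable names and paired scores below are hypothetical, not the MTH-002.1 procedure.

```python
import numpy as np

# Hypothetical paired scores from a calibration sample (same participants, both formats).
dat_cal  = np.array([75.0, 80.2, 78.4, 83.1, 72.6, 79.0])  # unconstrained DAT spread
game_cal = np.array([68.2, 73.5, 71.0, 76.4, 66.1, 72.3])  # constrained game spread

# Least-squares linear mapping from the game scale onto the DAT scale.
slope, intercept = np.polyfit(game_cal, dat_cal, 1)

def dat_equivalent(game_score: float) -> float:
    """Project a constrained game score onto the unconstrained DAT scale."""
    return slope * game_score + intercept

print(f"game score 70.0 -> DAT-equivalent {dat_equivalent(70.0):.1f}")
```

A mapping like this aligns score scales; it cannot, by itself, show that the two scores reflect the same construct, which is exactly the question left open above.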
8. Confidence
Low-to-Moderate.
Game-based assessment validity is well-established for memory and attention (Pedersen 2023, Lumsden 2016). Word association games have precedent (Kumar 2021, Codenames). The methodology is sound.
Creativity-specific gamification is not validated—INS-001 design choices are empirically calibrated (MTH-002.1) but lack external criterion validation against established creativity measures. We know how our scores relate to DAT; we don’t know if the game format preserves what DAT measures.
Harm prevention remains acknowledged but unstudied. We have ethical obligations we cannot yet discharge with evidence.
9. Joint Confidence Note
Individual confidence ratings treat gaps as independent. When multiple LIB articles are combined for instrument development (e.g., INS-001 depends on LIB-001, LIB-002, and LIB-008), joint confidence is substantially lower than any single rating suggests.
Key compounding uncertainties for INS-001:
- Embedding validity for individuals (LIB-001 H1.7) × Test-retest reliability (LIB-002 H2.3) × Creativity-game transfer (LIB-008 H8.1)
Until these are independently validated, INS-001 confidence should be interpreted as Low despite moderate ratings on component claims.
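As a rough illustration of how component uncertainties compound, assume hypothetical per-claim confidences treated as probabilities and as independent (both strong assumptions):

```python
# Hypothetical, illustrative confidences only; not measured values.
p_embedding = 0.7  # LIB-001 H1.7: embedding validity for individuals (assumed)
p_retest    = 0.7  # LIB-002 H2.3: test-retest reliability (assumed)
p_transfer  = 0.6  # LIB-008 H8.1: creativity-game transfer (assumed)

joint = p_embedding * p_retest * p_transfer
print(f"joint confidence = {joint:.2f}")  # 0.29, well below any single component
```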
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-01-18 | Initial publication |