GOV-004 Research
Living
v0.4 January 18, 2026

Research Agenda

This document translates Phronos research goals into testable hypotheses, organized by theoretical foundation. Each hypothesis is stated with dependencies, evidence requirements, and confidence assessments.

Our research program investigates whether stable patterns of cognition—“cognitive phenotypes”—can be reliably measured from behavioral traces in human-AI interaction data. This agenda makes explicit what we believe, what we’ve tested, and what remains uncertain.


Status Key

| Status | Meaning |
| --- | --- |
| Supported | Library article establishes moderate-to-strong evidence |
| Partial | Some evidence exists with significant caveats |
| Gap | Identified as needed but no direct evidence |
| Untested | Empirical question requiring new research |

Foundational Claims

These claims underlie the entire research program. They must be defensible before downstream work proceeds.

| ID | Claim | Status | Confidence | Key Gap | Evidence Needed |
| --- | --- | --- | --- | --- | --- |
| F1 | Cognitive phenotypes are real and measurable. Stable patterns of cognition exist as individual differences that can be reliably measured from behavioral traces in human-AI interaction data. | Partial | Low-to-Moderate | H2.3 (test-retest) | Convergent validity with established measures; discriminant validity from linguistic style; test-retest reliability; distinguishability from LLM outputs |
| F2 | Chat data reveals cognition, not just communication. What people type in AI conversations encodes cognitive processes (reasoning, association, belief structures) beyond communicative intent. | Partial | Low-to-Moderate | H2.6 (task-naturalistic) | Correlation between chat-derived measures and cognitive task performance; chat features predict outcomes beyond self-report |

LIB-001: Linguistic Markers of Cognition

Question: What features of language reliably encode cognitive processes, and which are measurable at scale?
Library Confidence: Low-to-Moderate

| ID | Hypothesis | Status | Confidence | Notes |
| --- | --- | --- | --- | --- |
| H1.1 | Pronoun patterns encode psychological states. First-person pronoun usage correlates with self-focus, distress, and relational orientation. | Supported | High | LIWC tradition, clinical validation |
| H1.2 | Semantic network structure varies across individuals. Associations between concepts show stable individual differences that correlate with creative ability. | Supported | Moderate | Kenett et al. 2014, Beaty & Kenett 2023. Limitation: Small samples; brief tasks may not capture full structure. |
| H1.3 | Semantic networks can be assessed via chat. Network structure can be inferred from chat behavior without explicit association tasks. | Gap | Low | Depends on H2.6 |
| H1.4 | Chat captures identity constructs. Self-concept, values, and beliefs can be reliably measured from chat interactions. | Untested | Low | Depends on F1, F2 |
| H1.5 | Chat captures personality. Big Five traits can be reliably inferred from chat interaction patterns. | Partial | Moderate (text) / Low (chat) | Some text-based validation exists; chat-specific validation needed |
| H1.6 | Chat captures reasoning style. Individual differences in reasoning are detectable from chat patterns. | Untested | Low | Needs correlation with reasoning tasks |

Embedding and Semantic Spread Hypotheses

These hypotheses are central to INS-001 instrument design.

H1.7: Embedding Validity for Semantic Distance

Semantic distances computed from word embeddings correlate with human judgments of semantic similarity.

| Aspect | Detail |
| --- | --- |
| Status | Partial (population-level supported; individual-level untested) |
| Confidence | Low-to-Moderate |
| Evidence | Hill et al. 2015 (SimLex-999, r ≈ 0.44–0.56), Auguste et al. 2017 (priming RT) |
| Critical caveat | Validated for population-level semantics only |
| Empirical finding | Different embedding models produce substantially different scores (GloVe vs. OpenAI: r = 0.60; 20.7-point systematic difference) |
| Instruments | INS-001.1, INS-001.2 |
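
As an illustration of what the H1.7 validity check involves, the sketch below compares model-derived similarity with human ratings over a set of word pairs. It is a minimal sketch under stated assumptions: the two-dimensional vectors and the ratings are hypothetical toy values, the helper names are illustrative, and Spearman rank correlation is an assumed choice (the table above reports r without naming the coefficient).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy embeddings (real models use hundreds of dimensions).
toy_vectors = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.3], "car": [0.1, 1.0],
    "truck": [0.2, 0.9], "tree": [0.6, 0.6],
}

# Hypothetical human similarity ratings on a 0-10 scale.
human_ratings = {
    ("cat", "dog"): 8.5, ("car", "truck"): 8.0, ("cat", "tree"): 4.0,
    ("cat", "car"): 1.0, ("dog", "truck"): 1.5,
}

pairs = list(human_ratings)
model_scores = [cosine_similarity(toy_vectors[a], toy_vectors[b]) for a, b in pairs]
human_scores = [human_ratings[p] for p in pairs]

rho = spearman(model_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")  # 0.90 on this toy data
```

A check like this speaks only to population-level agreement. It says nothing about whether one person's distances are measured reliably, which is the untested part of H1.7.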

H1.8: Divergent Thinking via Semantic Spread

Divergent thinking is measurable via mean pairwise semantic distance (the DAT methodology) and correlates with broader creative ability.

| Aspect | Detail |
| --- | --- |
| Status | Supported (10-word task); transfer to brief format calibrated but not validated |
| Confidence | Moderate (with transfer caveat) |
| Evidence | Olson et al. 2021 (r = 0.40 with AUT, r = 0.28 with RAT; N = 8,914) |
| Critical caveat | Original DAT uses 10 words unconstrained; INS-001 uses 2–5 words under constraint |
| Empirical finding | INS-001.1 produces 8.6 points lower spread than DAT; INS-001.2 produces 8.0 points lower |
| Instruments | INS-001.1, INS-001.2 |
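
The spread measure named in H1.8, mean pairwise semantic distance, can be sketched as follows. This is a minimal illustration, not the INS-001 implementation: the three-dimensional embeddings are hypothetical toy vectors, `semantic_spread` is an illustrative name, and the scaling by 100 is an assumption taken from the published DAT scoring, consistent with the "points" reported above.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def semantic_spread(words, embeddings, scale=100.0):
    """Mean cosine distance over all unordered word pairs, times `scale`."""
    pairs = list(combinations(words, 2))
    if not pairs:
        raise ValueError("need at least two words")
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return scale * total / len(pairs)

# Hypothetical toy embeddings chosen so the distances are easy to check by hand.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.8, 0.6, 0.0],
    "piano": [0.0, 1.0, 0.0],
    "justice": [0.0, 0.0, 1.0],
}

related = semantic_spread(["cat", "kitten", "piano"], toy_embeddings)
unrelated = semantic_spread(["cat", "piano", "justice"], toy_embeddings)
print(f"related: {related:.1f}, unrelated: {unrelated:.1f}")
```

Because the score averages over pairs, a 2-word response contributes a single distance while a 10-word response contributes 45. That is one reason brief formats may behave differently from the original task.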

H1.9: Constraint-Construct Relationship

Task constraints (free vs. goal-directed association, relevance requirements) determine which aspect of semantic cognition is measured.

| Aspect | Detail |
| --- | --- |
| Status | Supported |
| Confidence | Moderate |
| Evidence | Beaty & Kenett 2023, Merseal et al. 2025 (artists exceed scientists in free but not goal-directed association) |
| Empirical finding | Constraint effects are large (Cohen’s d = 1.02 between unconstrained DAT and constrained INS-001.2) |
| Implication | INS-001.1 and INS-001.2 measure related but distinct constructs |
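
For readers unfamiliar with the effect size quoted above, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below shows the computation on hypothetical toy scores; the values are not Phronos data and will not reproduce d = 1.02.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical spread scores under two task conditions.
unconstrained = [78, 80, 82, 84, 86]
constrained = [70, 72, 74, 76, 78]

d = cohens_d(unconstrained, constrained)
print(f"Cohen's d = {d:.2f}")
```

By common convention a d of about 0.8 or more is called large. A value near 1.0 means the two conditions differ by roughly one pooled standard deviation.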

LIB-002: Digital Validity

Question: Under what conditions can digital interactions serve as valid proxies for cognitive states?
Library Confidence: Low

| ID | Hypothesis | Status | Confidence | Notes |
| --- | --- | --- | --- | --- |
| H2.1 | Digital tasks have ecological validity. Game-based measures predict laboratory assessment performance. | Supported (memory/attention) | Moderate | Pedersen et al. 2023. Caveat: Creativity/semantic constructs untested. |
| H2.2 | Anonymity alters disclosure. Anonymous chat elicits different self-disclosure than identified interactions. | Partial | Moderate | Online disinhibition literature |
| H2.4 | Measurements generalize across platforms. Phenotypes from one AI system are detectable in others. | Untested | Unknown | Needs cross-platform validation |
| H2.5 | Selection effects are characterizable. Chat data populations differ systematically from general population. | Partial | High (exist) / Low (characterized) | Pedersen 2023 (gender skew), Thompson 2020 (cultural bias) |

Critical Validity Gaps

H2.3: Test-Retest Reliability ⚠️ Critical Gap

Cognitive phenotype measurements show acceptable test-retest reliability (r > 0.7) across occasions.

| Aspect | Detail |
| --- | --- |
| Status | Gap |
| Confidence | Unknown |
| Evidence needed | Longitudinal measurement study with INS-001 |
| Note | No published test-retest data for DAT or semantic spread measures |
| Priority | Critical: blocks trait interpretation of INS-001 |
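
The H2.3 criterion can be made concrete with a small sketch: score the same participants on two occasions, correlate the two sets of scores, and compare the result with the 0.7 threshold. The session scores below are hypothetical, `retest_reliability` is an illustrative name, and Pearson r is assumed because the hypothesis is stated in terms of r; other coefficients, such as the intraclass correlation, are sometimes preferred for this purpose.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def retest_reliability(session_1, session_2, threshold=0.7):
    """Return (r, meets_threshold) for paired scores from two occasions."""
    r = pearson_r(session_1, session_2)
    return r, r > threshold

# Hypothetical spread scores for five participants; index i is the same person.
session_1 = [70, 75, 80, 85, 90]
session_2 = [72, 74, 83, 84, 91]

r, acceptable = retest_reliability(session_1, session_2)
print(f"test-retest r = {r:.3f}, acceptable = {acceptable}")
```

With only a handful of participants the estimate of r is very noisy, so a real longitudinal study would need a sample large enough to put a usefully narrow confidence interval around it.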

H2.6: Task-Naturalistic Convergence

Task-based assessments (INS-001) and naturalistic chat-derived measures show convergent validity.

| Aspect | Detail |
| --- | --- |
| Status | Gap |
| Confidence | Low (zero supporting evidence) |
| Evidence needed | Convergent validity study |
| Note | If these diverge, INS-001 measures task performance rather than cognitive style |
| Priority | High: determines measurement strategy |

AI Interaction and Constraints

| ID | Hypothesis | Status | Confidence | Notes |
| --- | --- | --- | --- | --- |
| H2.7 | AI interaction doesn’t invalidate measurement. AI as interlocutor/judge doesn’t distort target constructs. | Partial | Low | Vicente & Matute 2023 (bias inheritance). Human validation pending for INS-001. |
| H2.8 | Task constraints affect spread measurement. Constraints compress semantic spread, requiring calibration. | Supported | Moderate | Cohen’s d = 1.02 between DAT and INS-001.2 |

LIB-003: Reflexive Identity

Question: How do AI interactions shape cognition while purporting to measure it?

| ID | Hypothesis | Status | Confidence | Notes |
| --- | --- | --- | --- | --- |
| H3.1 | AI interaction alters reasoning. Extended LLM interaction changes measurable reasoning aspects. | Partial | Low | Vicente & Matute 2023. Depends on H1.6. |
| H3.2 | AI interaction alters identity. Extended LLM interaction changes self-concept, values, or beliefs. | Untested | Low | Depends on H1.4 |

LIB-008: Instrument Design

Question: How can cognitive instruments be engaging, valid, and interpretable without causing harm?
Library Confidence: Low-to-Moderate

| ID | Hypothesis | Status | Confidence | Notes |
| --- | --- | --- | --- | --- |
| H8.1 | Game-based assessment maintains validity. Gamified formats maintain validity while increasing engagement. | Supported (memory/attention) | Moderate | Pedersen 2023, Lumsden 2016. Caveat: Creativity untested. |
| H8.2 | Word association games reveal cognition. Games like Codenames reveal semantic memory structure. | Supported | Moderate | Kumar 2021, Stephenson 2024, Xu 2025 |
| H8.3 | Task constraints affect measured constructs. Clue count, time pressure, constraints affect what is measured. | Supported | Moderate | Optimal: 2 clues (INS-001.1), 4 clues (INS-001.2). Relevance-spread r = −0.77 led to fidelity metric. |
| H8.5 | Self-location is possible. Individuals can locate themselves within trait distributions. | Untested | Moderate | Depends on H8.1, H8.4 |
| H8.6 | Phenotypes associate with performance. Chat-derived phenotypes predict external performance criteria. | Partial | Low | Said-Metwaly 2024 (r = 0.18). Effect sizes are modest. |
| H8.7 | Phenotypes associate with fitness. Chat-derived phenotypes correlate with cognitive fitness measures. | Untested | Low | Depends on fitness framework (H5.3, H5.4) |
| H8.8 | Phenotype change is measurable. Changes over time can be reliably detected. | Gap | Unknown | Depends on H2.3 |

Ethical Prerequisite

H8.4: Results Can Be Presented Without Harm ⚠️ Critical Gap

Cognitive assessment results can be presented to participants without causing psychological harm.

| Aspect | Detail |
| --- | --- |
| Status | Gap |
| Confidence | Unknown |
| Evidence needed | Study of psychological effects of creativity feedback |
| Concerns | Labeling effects, misinterpretation as stable trait, comparison anxiety. Creativity may be more identity-linked than memory or attention. |
| Priority | High: ethical obligation |
| Current approach | Precautionary: avoid trait language, frame as snapshot, provide context |

Research Priorities

| Priority | Hypotheses | Status | Rationale |
| --- | --- | --- | --- |
| Critical | F1, F2 | Partial | Foundation: everything depends on these |
| Critical | H2.3 | Gap | Test-retest reliability blocks trait interpretation |
| Critical | H8.4 | Gap | Feedback safety is ethical prerequisite |
| High | H1.7, H1.8 | Partial / Supported | Embedding and DAT validity, core to INS-001 |
| High | H2.6 | Gap | Task vs. naturalistic validity determines strategy |
| High | H8.1 | Supported (limited) | Game format validity; creativity untested |
| Medium | H1.2, H1.9, H8.3 | Supported | Semantic networks and constraint effects |
| Medium | H2.7, H2.8 | Partial / Supported | AI interaction and constraint effects |
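
The "Depends on" notes in the hypothesis tables can be read as a dependency graph, which makes it possible to list the unresolved upstream claims behind any hypothesis. The sketch below is one possible encoding, not a Phronos tool: it covers only dependencies whose targets appear in this document (H8.7's link to H5.3 and H5.4 is omitted), it simplifies "Supported (memory/attention)" to "Supported", and the function name is illustrative.

```python
# Statuses copied from the tables in this document (qualifiers dropped).
STATUS = {
    "F1": "Partial", "F2": "Partial",
    "H1.3": "Gap", "H1.4": "Untested", "H1.6": "Untested",
    "H2.3": "Gap", "H2.6": "Gap",
    "H3.1": "Partial", "H3.2": "Untested",
    "H8.1": "Supported", "H8.4": "Gap", "H8.5": "Untested", "H8.8": "Gap",
}

# Direct dependencies as stated in the Notes columns.
DEPENDS_ON = {
    "H1.3": ["H2.6"],
    "H1.4": ["F1", "F2"],
    "H3.1": ["H1.6"],
    "H3.2": ["H1.4"],
    "H8.5": ["H8.1", "H8.4"],
    "H8.8": ["H2.3"],
}

def unresolved_upstream(hypothesis, depends_on=DEPENDS_ON, status=STATUS):
    """Direct and indirect dependencies whose status is not 'Supported'."""
    seen = set()
    stack = list(depends_on.get(hypothesis, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(depends_on.get(current, []))
    return {h for h in seen if status.get(h) != "Supported"}

for hypothesis in ("H3.2", "H8.5", "H8.8"):
    print(hypothesis, "waits on", sorted(unresolved_upstream(hypothesis)))
```

On this encoding H3.2 waits on H1.4 and, through it, on both foundational claims. This matches the rationale for rating F1 and F2 as Critical.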

Joint Confidence Assessment

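Individual hypothesis confidence ratings assess each claim on its own, as if the claims were independent. An instrument needs several claims to hold at once, so the joint confidence for instrument development is substantially lower than any single rating suggests.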

INS-001 Compounding Uncertainties

| Component | Hypothesis | Status | Confidence |
| --- | --- | --- | --- |
| Embedding validity (individuals) | H1.7 | Partial | Low-to-Moderate |
| Brief-task validity | H1.8 | Calibrated, not validated | Moderate (with caveat) |
| Test-retest reliability | H2.3 | Gap | Unknown |
| Creativity-game transfer | H8.1 | Domain extrapolation | Moderate (domain-limited) |

Joint interpretation: Until these are independently validated, INS-001 confidence should be interpreted as Low despite moderate ratings on component claims.
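
The compounding effect can be illustrated with a short calculation. If each component claim is given a probability of holding and the claims are treated as independent, the probability that all of them hold is the product of the individual values. The numbers below are purely hypothetical placeholders, not Phronos estimates; the agenda assigns only qualitative ratings, and independence is a simplifying assumption.

```python
import math

# Hypothetical probabilities standing in for the qualitative ratings above.
# These values are illustrative assumptions only.
component_confidence = {
    "H1.7 embedding validity (individuals)": 0.4,  # "Low-to-Moderate"
    "H1.8 brief-task validity": 0.6,               # "Moderate (with caveat)"
    "H2.3 test-retest reliability": 0.5,           # "Unknown"
    "H8.1 creativity-game transfer": 0.6,          # "Moderate (domain-limited)"
}

# Under independence, all claims hold together with the product probability.
joint = math.prod(component_confidence.values())
weakest = min(component_confidence.values())

print(f"weakest single component: {weakest:.2f}")
print(f"joint (all must hold):    {joint:.3f}")
```

Whatever values are chosen, the product can never exceed the weakest component, and it shrinks with every additional claim the instrument relies on.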


Version History

| Version | Date | Summary |
| --- | --- | --- |
| 0.4 | 2026-01-18 | Consistency update with revised library articles; integrated empirical findings; added Joint Confidence section |
| 0.3 | 2026-01-18 | Updated based on LIB-001, LIB-002, LIB-008 synthesis |
| 0.2 | 2026-01-15 | Added H1.7, H1.8, H2.6 per INS-001 gap analysis |
| 0.1 | 2026-01-15 | Initial hypothesis specification |