Research Agenda
This document translates Phronos research goals into testable hypotheses, organized by theoretical foundation. Each hypothesis is stated with dependencies, evidence requirements, and confidence assessments.
Our research program investigates whether stable patterns of cognition—“cognitive phenotypes”—can be reliably measured from behavioral traces in human-AI interaction data. This agenda makes explicit what we believe, what we’ve tested, and what remains uncertain.
Status Key
| Status | Meaning |
|---|---|
| Supported | Library article establishes moderate-to-strong evidence |
| Partial | Some evidence exists with significant caveats |
| Gap | Identified as needed but no direct evidence |
| Untested | Empirical question requiring new research |
Foundational Claims
These claims underlie the entire research program. They must be defensible before downstream work proceeds.
| ID | Claim | Status | Confidence | Key Gap | Evidence Needed |
|---|---|---|---|---|---|
| F1 | Cognitive phenotypes are real and measurable. Stable patterns of cognition exist as individual differences that can be reliably measured from behavioral traces in human-AI interaction data. | Partial | Low-to-Moderate | H2.3 (test-retest) | Convergent validity with established measures • Discriminant validity from linguistic style • Test-retest reliability • Distinguishability from LLM outputs |
| F2 | Chat data reveals cognition, not just communication. What people type in AI conversations encodes cognitive processes (reasoning, association, belief structures) beyond communicative intent. | Partial | Low-to-Moderate | H2.6 (task-naturalistic) | Correlation between chat-derived measures and cognitive task performance • Chat features predict outcomes beyond self-report |
LIB-001: Linguistic Markers of Cognition
Question: What features of language reliably encode cognitive processes, and which are measurable at scale?
Library Confidence: Low-to-Moderate
| ID | Hypothesis | Status | Confidence | Notes |
|---|---|---|---|---|
| H1.1 | Pronoun patterns encode psychological states. First-person pronoun usage correlates with self-focus, distress, and relational orientation. | Supported | High | LIWC tradition, clinical validation |
| H1.2 | Semantic network structure varies across individuals. Associations between concepts show stable individual differences that correlate with creative ability. | Supported | Moderate | Kenett et al. 2014, Beaty & Kenett 2023. Limitation: Small samples; brief tasks may not capture full structure. |
| H1.3 | Semantic networks can be assessed via chat. Network structure can be inferred from chat behavior without explicit association tasks. | Gap | Low | Depends on H2.6 |
| H1.4 | Chat captures identity constructs. Self-concept, values, and beliefs can be reliably measured from chat interactions. | Untested | Low | Depends on F1, F2 |
| H1.5 | Chat captures personality. Big Five traits can be reliably inferred from chat interaction patterns. | Partial | Moderate (text) / Low (chat) | Some text-based validation exists; chat-specific validation needed |
| H1.6 | Chat captures reasoning style. Individual differences in reasoning are detectable from chat patterns. | Untested | Low | Needs correlation with reasoning tasks |
Embedding and Semantic Spread Hypotheses
These hypotheses are central to INS-001 instrument design.
H1.7: Embedding Validity for Semantic Distance
Semantic distances computed from word embeddings correlate with human judgments of semantic similarity.
| Aspect | Detail |
|---|---|
| Status | Partial (population-level supported; individual-level untested) |
| Confidence | Low-to-Moderate |
| Evidence | Hill et al. 2015 (SimLex-999, r ≈ 0.44–0.56), Auguste et al. 2017 (priming RT) |
| Critical caveat | Validated for population-level semantics only |
| Empirical finding | Different embedding models produce substantially different scores (GloVe vs. OpenAI: r = 0.60; 20.7-point systematic difference) |
| Instruments | INS-001.1, INS-001.2 |
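
One way to check a claim like H1.7 is a rank correlation between embedding distances and human similarity ratings. The sketch below is illustrative only: the five word-pair ratings and distances are hypothetical, and the choice of Spearman's rho is an assumption, since this agenda does not say which correlation the cited figures use. Because the inputs are distances, a valid embedding should give a negative correlation with similarity ratings.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: human similarity ratings and embedding distances
# for the same five word pairs (not taken from any cited study).
human_similarity = [9.2, 7.5, 6.1, 3.0, 1.2]
embedding_distance = [0.19, 0.23, 0.60, 0.55, 0.90]

rho = spearman_rho(human_similarity, embedding_distance)
print(f"Spearman rho = {rho:.2f}")
```

A population-level check of this kind says nothing about whether the same embedding ranks individual respondents correctly, which is the untested part of H1.7.
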
H1.8: Divergent Thinking via Semantic Spread
Divergent thinking is measurable via mean pairwise semantic distance (the DAT methodology) and correlates with broader creative ability.
| Aspect | Detail |
|---|---|
| Status | Supported (10-word task); transfer to brief format calibrated but not validated |
| Confidence | Moderate (with transfer caveat) |
| Evidence | Olson et al. 2021 (r = 0.40 with AUT, r = 0.28 with RAT; N = 8,914) |
| Critical caveat | Original DAT uses 10 words unconstrained; INS-001 uses 2–5 words under constraint |
| Empirical finding | INS-001.1 yields spread scores 8.6 points lower than the DAT; INS-001.2 yields scores 8.0 points lower |
| Instruments | INS-001.1, INS-001.2 |
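
The spread measure in H1.8 can be sketched as the mean cosine distance over all word pairs. This is a minimal sketch under stated assumptions: the three-dimensional vectors are toy values rather than real embeddings, and the factor of 100 is assumed to match the "points" scale used above. The function names are hypothetical and do not describe the INS-001 implementation.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_spread(vectors):
    """Mean pairwise cosine distance, scaled by 100 (assumed scale)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two word vectors")
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings, hypothetical values chosen so the arithmetic is easy to follow.
toy_embeddings = {
    "cat": (1.0, 0.0, 0.0),
    "dog": (1.0, 1.0, 0.0),      # close to "cat"
    "algebra": (0.0, 0.0, 1.0),  # orthogonal to both
}

spread = semantic_spread(list(toy_embeddings.values()))
print(f"spread = {spread:.2f}")
```

With only 2 to 5 words there are between 1 and 10 pairs to average, compared with 45 pairs for 10 words, so brief-format scores rest on far fewer distances.
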
H1.9: Constraint-Construct Relationship
Task constraints (free vs. goal-directed association, relevance requirements) determine which aspect of semantic cognition is measured.
| Aspect | Detail |
|---|---|
| Status | Supported |
| Confidence | Moderate |
| Evidence | Beaty & Kenett 2023, Merseal et al. 2025 (artists exceed scientists in free but not goal-directed association) |
| Empirical finding | Constraint effects are large (Cohen’s d = 1.02 between unconstrained DAT and constrained INS-001.2) |
| Implication | INS-001.1 and INS-001.2 measure related but distinct constructs |
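
For readers less familiar with the effect size reported above, Cohen's d is the difference in group means divided by a standard deviation. The sketch below uses the pooled-standard-deviation variant, which is an assumption: this agenda does not state which variant produced d = 1.02. The two score lists are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical spread scores, not data from INS-001 or the DAT.
unconstrained = [78.0, 80.0, 82.0, 84.0, 86.0]
constrained = [70.0, 72.0, 74.0, 76.0, 78.0]

d = cohens_d(unconstrained, constrained)
print(f"Cohen's d = {d:.2f}")
```
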
LIB-002: Digital Validity
Question: Under what conditions can digital interactions serve as valid proxies for cognitive states?
Library Confidence: Low
| ID | Hypothesis | Status | Confidence | Notes |
|---|---|---|---|---|
| H2.1 | Digital tasks have ecological validity. Game-based measures predict laboratory assessment performance. | Supported (memory/attention) | Moderate | Pedersen et al. 2023. Caveat: Creativity/semantic constructs untested. |
| H2.2 | Anonymity alters disclosure. Anonymous chat elicits different self-disclosure than identified interactions. | Partial | Moderate | Online disinhibition literature |
| H2.4 | Measurements generalize across platforms. Phenotypes from one AI system are detectable in others. | Untested | Unknown | Needs cross-platform validation |
| H2.5 | Selection effects are characterizable. Chat data populations differ systematically from the general population. | Partial | High (exist) / Low (characterized) | Pedersen 2023 (gender skew), Thompson 2020 (cultural bias) |
Critical Validity Gaps
H2.3: Test-Retest Reliability ⚠️ Critical Gap
Cognitive phenotype measurements show acceptable test-retest reliability (r > 0.7) across occasions.
| Aspect | Detail |
|---|---|
| Status | Gap |
| Confidence | Unknown |
| Evidence needed | Longitudinal measurement study with INS-001 |
| Note | No published test-retest data for DAT or semantic spread measures |
| Priority | Critical — blocks trait interpretation of INS-001 |
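
The r > 0.7 criterion in H2.3 can be checked with a Pearson correlation between scores from two measurement occasions. The following is a minimal sketch, assuming one score per participant per session; the scores and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def meets_retest_threshold(session_1, session_2, threshold=0.7):
    """True when the test-retest correlation exceeds the threshold."""
    return pearson_r(session_1, session_2) > threshold


# Hypothetical scores for the same five participants on two occasions.
session_1 = [60.0, 65.0, 70.0, 75.0, 80.0]
session_2 = [62.0, 63.0, 72.0, 74.0, 83.0]

retest_r = pearson_r(session_1, session_2)
print(f"test-retest r = {retest_r:.3f}")
print("acceptable:", meets_retest_threshold(session_1, session_2))
```

Pearson's r is insensitive to a uniform shift between sessions, such as a practice effect that raises every score. An agreement measure such as an intraclass correlation may be worth reporting alongside it.
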
H2.6: Task-Naturalistic Convergence
Task-based assessments (INS-001) and naturalistic chat-derived measures show convergent validity.
| Aspect | Detail |
|---|---|
| Status | Gap |
| Confidence | Low (zero supporting evidence) |
| Evidence needed | Convergent validity study |
| Note | If these diverge, INS-001 measures task performance rather than cognitive style |
| Priority | High — determines measurement strategy |
AI Interaction and Constraints
| ID | Hypothesis | Status | Confidence | Notes |
|---|---|---|---|---|
| H2.7 | AI interaction doesn’t invalidate measurement. AI as interlocutor/judge doesn’t distort target constructs. | Partial | Low | Vicente & Matute 2023 (bias inheritance). Human validation pending for INS-001. |
| H2.8 | Task constraints affect spread measurement. Constraints compress semantic spread, requiring calibration. | Supported | Moderate | Cohen’s d = 1.02 between DAT and INS-001.2 |
LIB-003: Reflexive Identity
Question: How do AI interactions shape cognition while purporting to measure it?
| ID | Hypothesis | Status | Confidence | Notes |
|---|---|---|---|---|
| H3.1 | AI interaction alters reasoning. Extended LLM interaction changes measurable reasoning aspects. | Partial | Low | Vicente & Matute 2023. Depends on H1.6. |
| H3.2 | AI interaction alters identity. Extended LLM interaction changes self-concept, values, or beliefs. | Untested | Low | Depends on H1.4 |
LIB-008: Instrument Design
Question: How can cognitive instruments be engaging, valid, and interpretable without causing harm?
Library Confidence: Low-to-Moderate
| ID | Hypothesis | Status | Confidence | Notes |
|---|---|---|---|---|
| H8.1 | Game-based assessment maintains validity. Gamified formats maintain validity while increasing engagement. | Supported (memory/attention) | Moderate | Pedersen 2023, Lumsden 2016. Caveat: Creativity untested. |
| H8.2 | Word association games reveal cognition. Games like Codenames reveal semantic memory structure. | Supported | Moderate | Kumar 2021, Stephenson 2024, Xu 2025 |
| H8.3 | Task constraints affect measured constructs. Clue count, time pressure, constraints affect what is measured. | Supported | Moderate | Optimal: 2 clues (INS-001.1), 4 clues (INS-001.2). Relevance-spread r = −0.77 led to fidelity metric. |
| H8.5 | Self-location is possible. Individuals can locate themselves within trait distributions. | Untested | Moderate | Depends on H8.1, H8.4 |
| H8.6 | Phenotypes associate with performance. Chat-derived phenotypes predict external performance criteria. | Partial | Low | Said-Metwaly 2024 (r = 0.18). Effect sizes are modest. |
| H8.7 | Phenotypes associate with fitness. Chat-derived phenotypes correlate with cognitive fitness measures. | Untested | Low | Depends on fitness framework (H5.3, H5.4) |
| H8.8 | Phenotype change is measurable. Changes over time can be reliably detected. | Gap | Unknown | Depends on H2.3 |
Ethical Prerequisite
H8.4: Results Can Be Presented Without Harm ⚠️ Critical Gap
Cognitive assessment results can be presented to participants without causing psychological harm.
| Aspect | Detail |
|---|---|
| Status | Gap |
| Confidence | Unknown |
| Evidence needed | Study of psychological effects of creativity feedback |
| Concerns | Labeling effects, misinterpretation as stable trait, comparison anxiety. Creativity may be more identity-linked than memory or attention. |
| Priority | High — ethical obligation |
| Current approach | Precautionary: avoid trait language, frame as snapshot, provide context |
Research Priorities
| Priority | Hypotheses | Status | Rationale |
|---|---|---|---|
| Critical | F1, F2 | Partial | Foundation—everything depends on these |
| Critical | H2.3 | Gap | Test-retest reliability blocks trait interpretation |
| Critical | H8.4 | Gap | Feedback safety is ethical prerequisite |
| High | H1.7, H1.8 | Partial / Supported | Embedding and DAT validity—core to INS-001 |
| High | H2.6 | Gap | Task vs. naturalistic validity determines strategy |
| High | H8.1 | Supported (limited) | Game format validity—creativity untested |
| Medium | H1.2, H1.9, H8.3 | Supported | Semantic networks and constraint effects |
| Medium | H2.7, H2.8 | Partial / Supported | AI interaction and constraint effects |
Joint Confidence Assessment
Each confidence rating above assesses one hypothesis in isolation. An instrument that needs several hypotheses to hold at once warrants substantially lower joint confidence than any single rating suggests.
INS-001 Compounding Uncertainties
| Component | Hypothesis | Status | Confidence |
|---|---|---|---|
| Embedding validity (individuals) | H1.7 | Partial | Low-to-Moderate |
| Brief-task validity | H1.8 | Calibrated, not validated | Moderate (with caveat) |
| Test-retest reliability | H2.3 | Gap | Unknown |
| Creativity-game transfer | H8.1 | Domain extrapolation | Moderate (domain-limited) |
Joint interpretation: Until these are independently validated, INS-001 confidence should be interpreted as Low despite moderate ratings on component claims.
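
The compounding effect can be illustrated by multiplying per-claim probabilities. This is a sketch under strong assumptions: the 0.6 values are placeholders chosen for illustration, not estimates from this agenda (H2.3 in particular is rated Unknown), and multiplication assumes the claims are independent.

```python
from math import prod


def joint_probability(component_probabilities):
    """Probability that all components hold, assuming independence."""
    probabilities = list(component_probabilities)
    if not probabilities:
        raise ValueError("need at least one component")
    if any(not 0.0 <= p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    return prod(probabilities)


# Placeholder values only: the agenda assigns ordinal labels, not numbers.
illustrative_components = {
    "H1.7": 0.6,
    "H1.8": 0.6,
    "H2.3": 0.6,
    "H8.1": 0.6,
}

joint = joint_probability(illustrative_components.values())
print(f"joint probability = {joint:.4f}")
```

Four components that each look moderately likely combine to a joint probability well below any one of them. If the claims are positively related rather than independent, the product understates the joint probability, so it is best read as a rough lower-end guide.
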
Version History
| Version | Date | Summary |
|---|---|---|
| 0.4 | 2026-01-18 | Consistency update with revised library articles; integrated empirical findings; added Joint Confidence section |
| 0.3 | 2026-01-18 | Updated based on LIB-001, LIB-002, LIB-008 synthesis |
| 0.2 | 2026-01-15 | Added H1.7, H1.8, H2.6 per INS-001 gap analysis |
| 0.1 | 2026-01-15 | Initial hypothesis specification |