MTH-001.2 Observational Chat Analysis
Published
v1.0 January 5, 2026

Engagement Prediction from First-Turn Features

Methodology for Predicting User Return from Initial Prompt Characteristics

Abstract

A methodology for predicting user return behavior from first-prompt features alone. Implements strict temporal holdout validation with new-user restriction to ensure genuine out-of-sample evaluation. Documents feature engineering for linguistic markers, ablation procedures for isolating feature contributions, and permutation importance for quantifying feature effects. Findings are reported in the associated Dispatch.

Executive Summary

This study documents a methodology for predicting user return behavior based solely on first-prompt characteristics. The approach implements strict temporal holdout with new-user restriction, extracts linguistic features from first prompts, and quantifies feature contributions through ablation and permutation importance procedures. The methodology enables investigation of which first-prompt characteristics—structural, stylistic, or semantic—contribute to predicting return behavior. Specific findings and interpretations are reported in the associated Dispatch.


1. Motivation

1.1 Context

Predicting user engagement with conversational AI systems from first-interaction features presents several methodological challenges:

  1. Temporal leakage: Training on future data and testing on past data inflates performance estimates
  2. User contamination: Including the same user in both train and test sets creates information leakage
  3. Feature circularity: Using features that require observing multiple conversations (e.g., intent diversity) prevents prospective prediction
  4. Confounding: Prompt length may correlate with task complexity, which independently drives return behavior

This methodology addresses each concern with specific design choices documented below.

1.2 Research Questions

The methodology enables investigation of:

  1. Whether first-prompt features can predict return behavior at levels exceeding chance
  2. Which feature categories contribute most to prediction: structural (word count), stylistic (pronouns, politeness), or semantic (intent)
  3. Whether predictive signal comes from genuine linguistic markers or is confounded with prompt length
  4. Whether predictive relationships generalize from training period to genuinely new users

2. Temporal Holdout Design

2.1 Split Definition

We implement a strict temporal holdout with new-user restriction:

| Split | Definition | Purpose |
| --- | --- | --- |
| Training | Users whose first-ever conversation occurred before the cutoff date | Model development and exploratory analysis |
| Test | Users whose first-ever conversation occurred on or after the cutoff date | Confirmatory evaluation (single use) |

2.2 New-User Restriction

The test set contains only users who do not appear anywhere in the training period. This is stricter than a simple date cutoff on conversations:

  • A date cutoff on conversations would allow the same user to appear in both splits (e.g., early conversations in training, later conversations in test)
  • The new-user restriction ensures complete user-level separation

2.3 Follow-Up Window Constraint

To ensure all test users have sufficient observation time for outcome measurement, test set inclusion requires:

t_{\text{first}} \leq t_{\text{data\_end}} - w

Where:

  • t_{\text{first}} = timestamp of user’s first conversation
  • t_{\text{data\_end}} = end of data collection period
  • w = follow-up window (e.g., 60 days)

This prevents right-censoring bias from users who joined near the end of data collection.

2.4 Contamination Verification

The implementation verifies zero overlap between training and test user sets:

|\mathcal{U}_{\text{train}} \cap \mathcal{U}_{\text{test}}| = 0
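
As an illustration of the split logic, the sketch below assumes a pandas DataFrame of conversations with hypothetical `user_id` and `timestamp` columns; the cutoff date, data end date, and follow-up window are placeholder values, not the study parameters.

```python
import pandas as pd

# Placeholder study parameters; the actual values are analysis choices.
CUTOFF = pd.Timestamp("2024-01-01")
DATA_END = pd.Timestamp("2024-06-01")
FOLLOW_UP = pd.Timedelta(days=60)

def temporal_split(conversations: pd.DataFrame) -> tuple[set, set]:
    """Split users by the timestamp of their first-ever conversation."""
    first_seen = conversations.groupby("user_id")["timestamp"].min()

    # Training: users whose first conversation precedes the cutoff.
    train_users = set(first_seen[first_seen < CUTOFF].index)

    # Test: genuinely new users with enough follow-up time to observe the outcome.
    eligible = (first_seen >= CUTOFF) & (first_seen <= DATA_END - FOLLOW_UP)
    test_users = set(first_seen[eligible].index)

    # Contamination check: the user sets must be disjoint.
    assert len(train_users & test_users) == 0
    return train_users, test_users
```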

3. Outcome Variable

3.1 Definition

A user is classified as “returned” if they initiated at least two conversations within a fixed time window of their first conversation:

\text{returned}_{w}(u) = \mathbf{1}\left[n_{\text{conv}}(u) \geq 2 \land (t_2(u) - t_1(u)) \leq w\right]

Where:

  • u = user
  • n_{\text{conv}}(u) = total number of conversations for user u
  • t_1(u) = timestamp of user’s first conversation
  • t_2(u) = timestamp of user’s second conversation
  • w = return window in days
  • \mathbf{1}[\cdot] = indicator function
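
As a concrete sketch of the label computation, assuming each user’s conversation timestamps are available as a list (the `window_days` default is only an example):

```python
import pandas as pd

def returned_within_window(timestamps: list[pd.Timestamp], window_days: int = 60) -> int:
    """1 if the user had a second conversation within `window_days` of the first, else 0."""
    if len(timestamps) < 2:
        return 0
    ts = sorted(timestamps)
    return int((ts[1] - ts[0]) <= pd.Timedelta(days=window_days))
```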

3.2 Rationale for Fixed Window

Using a fixed time window (rather than unbounded “ever returned”) standardizes the outcome across users with different observation periods. This is particularly important given the temporal holdout design, where test users necessarily have shorter maximum observation periods.

3.3 Days-to-Return vs. Active Span

We use days between first and second conversation (t_2 - t_1) rather than days between first and last conversation (t_{\text{last}} - t_1). The latter conflates return timing with usage intensity and is more susceptible to right-censoring.


4. Feature Engineering

4.1 Extraction Scope

Features are extracted only from the first user message of each user’s first conversation. This ensures:

  • Predictions could be made at the moment of first interaction
  • No information leakage from subsequent messages or conversations
  • Consistent feature space across all users

4.2 Structural Features

| Feature | Definition | Formula |
| --- | --- | --- |
| Word count | Number of whitespace-separated tokens | n_w = \|\text{text.split()}\| |
| Character count | Total characters including whitespace | n_c = \|\text{text}\| |
| Average word length | Mean characters per word | Mean of character counts per word |

4.3 Pronoun Rate Features

Pronoun rates are computed as counts per 100 words to normalize across prompt lengths:

r_{\text{pronoun}} = \frac{100 \times |\{w \in W : w \in P\}|}{n_w}

Where:

  • W = set of words in the prompt (lowercased)
  • P = target pronoun set
  • n_w = total word count

| Feature | Pronoun Set P |
| --- | --- |
| First-person singular rate | i, me, my, mine, myself |
| First-person plural rate | we, us, our, ours, ourselves |
| Second-person rate | you, your, yours, yourself, yourselves |

4.4 Politeness Rate

Politeness rate uses the same normalization:

r_{\text{polite}} = \frac{100 \times |\{w \in W : w \in M\}|}{n_w}

Where M = (please, thank, thanks, appreciate, kindly)

4.5 Binary Structural Features

| Feature | Definition | Trigger Conditions |
| --- | --- | --- |
| Has question | Prompt contains interrogative markers | Contains "?" OR starts with an interrogative word (what, how, why, when, where, who, can, could, would, is, are, do, does) |
| Is imperative | Prompt begins with command verb | First word (after lowercasing and stripping punctuation) is in (write, create, make, generate, list, explain, tell, show, find, help, give) |
| Is greeting | Prompt begins with greeting | Starts with: hi, hello, hey, good morning, good afternoon, good evening, greetings |
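
A sketch of the first-prompt feature extraction using the word lists above; only a subset of the rate features is shown, and all function and variable names are illustrative.

```python
import string

FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
POLITE = {"please", "thank", "thanks", "appreciate", "kindly"}
INTERROGATIVES = {"what", "how", "why", "when", "where", "who",
                  "can", "could", "would", "is", "are", "do", "does"}
IMPERATIVES = {"write", "create", "make", "generate", "list", "explain",
               "tell", "show", "find", "help", "give"}
GREETINGS = ("hi", "hello", "hey", "good morning", "good afternoon",
             "good evening", "greetings")

def extract_features(text: str) -> dict:
    """Structural, pronoun, politeness, and binary features from one first prompt."""
    tokens = text.lower().split()
    words = [t.strip(string.punctuation) for t in tokens]
    n_w = len(tokens)
    first_word = words[0] if words else ""

    def rate(word_set: set) -> float:
        # Counts per 100 words; zero-imputed when the prompt has no words.
        return 100 * sum(w in word_set for w in words) / n_w if n_w else 0.0

    return {
        "word_count": n_w,
        "char_count": len(text),
        "avg_word_length": sum(len(t) for t in tokens) / n_w if n_w else 0.0,
        "first_person_singular_rate": rate(FIRST_SINGULAR),
        "politeness_rate": rate(POLITE),
        "has_question": int("?" in text or first_word in INTERROGATIVES),
        "is_imperative": int(first_word in IMPERATIVES),
        "is_greeting": int(text.lower().strip().startswith(GREETINGS)),
    }
```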

4.6 Intent Features

Intent indicators are binary features derived from keyword pattern matching on the full prompt text. These serve as confounder controls rather than primary predictors.

| Intent | Description | Example Patterns |
| --- | --- | --- |
| Coding | Programming-related requests | code, python, javascript, function, error, debug, algorithm |
| Roleplay | Persona or scenario requests | act as, you are a, pretend, roleplay, scenario |
| Creative writing | Generative creative content | write a story, write a poem, fiction, compose |
| Emotional support | Support-seeking or personal distress | i feel, anxious, depressed, lonely, advice, help me cope |

Full pattern lists are documented in the source notebook.
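
A minimal sketch of intent indicator extraction by substring matching; the pattern lists below are only the examples quoted in the table, not the full documented lists.

```python
# Example patterns only; the full lists are documented in the source notebook.
INTENT_PATTERNS = {
    "intent_coding": ["code", "python", "javascript", "function", "error", "debug", "algorithm"],
    "intent_roleplay": ["act as", "you are a", "pretend", "roleplay", "scenario"],
    "intent_creative_writing": ["write a story", "write a poem", "fiction", "compose"],
    "intent_emotional_support": ["i feel", "anxious", "depressed", "lonely", "advice", "help me cope"],
}

def intent_features(text: str) -> dict:
    """Binary intent indicators from keyword pattern matching on the full prompt."""
    lowered = text.lower()
    return {name: int(any(pattern in lowered for pattern in patterns))
            for name, patterns in INTENT_PATTERNS.items()}
```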


5. Model Architecture

5.1 Model Selection

We use logistic regression with L2 regularization for the following reasons:

  • Interpretability: Coefficients have direct interpretation as log-odds effects; odds ratios quantify effect magnitudes
  • Calibration: Logistic regression tends to produce well-calibrated probability estimates without additional post-hoc calibration
  • Regularization: L2 penalty provides implicit regularization against overfitting
  • Efficiency: Fast training enables multiple ablation experiments

5.2 Class Imbalance Handling

Given the imbalanced outcome distribution (most users do not return), we apply balanced class weighting:

w_c = \frac{n_{\text{total}}}{k \times n_c}

Where:

  • w_c = weight for class c
  • n_{\text{total}} = total samples
  • k = number of classes (2)
  • n_c = samples in class c

This upweights the minority class (returners) during training.

5.3 Feature Standardization

All features are standardized to zero mean and unit variance before model fitting:

x'_j = \frac{x_j - \mu_j}{\sigma_j}

Where \mu_j and \sigma_j are computed from the training set only. The same transformation parameters are applied to test data to prevent leakage.

5.4 Missing Value Treatment

Missing values (e.g., when word count is zero, preventing rate calculation) are imputed with zero.
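
A sketch of the resulting pipeline in scikit-learn; `class_weight="balanced"` computes exactly the weights defined in Section 5.2, and `max_iter` is an illustrative setting.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Zero imputation, standardization, and the L2-penalized, class-weighted
# classifier are chained so that all parameters are learned from training data only.
model = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0.0)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)),
])
```

Because the scaler sits inside the pipeline, \mu_j and \sigma_j are re-estimated from the training portion of every fit and never from evaluation data.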


6. Ablation Procedure

6.1 Purpose

Ablation studies quantify the marginal contribution of feature categories by comparing model performance across nested feature sets.

6.2 Feature Set Hierarchy

We define feature sets of increasing complexity:

| Set Name | Features Added | Cumulative Features |
| --- | --- | --- |
| Baseline | word_count | 1 |
| + Pronouns | first_person_singular_rate, first_person_plural_rate, second_person_rate | 4 |
| + Style | politeness_rate, has_question, is_imperative, avg_word_length | 8 |
| + Greeting | is_greeting | 9 |
| + Intent | intent_coding, intent_roleplay, intent_creative_writing, intent_emotional_support | 13 |

Alternative groupings (e.g., baseline + style without pronouns) are also evaluated to isolate pronoun vs. style contributions.

6.3 Cross-Validation Procedure

For each feature set, we estimate performance using k-fold stratified cross-validation on training data:

  1. Partition training users into k stratified folds (preserving outcome class proportions)
  2. For each fold i \in \{1, \dots, k\}:
    • Hold out fold i as validation set
    • Fit standardizer and model on remaining k-1 folds
    • Apply standardizer to validation set
    • Compute predicted probabilities on validation set
    • Calculate ROC AUC for fold i
  3. Report mean and standard deviation of AUC across folds

Stratification ensures each fold maintains the class balance of the full training set.
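
A sketch of this procedure, assuming the pipeline from Section 5 and NumPy arrays `X`, `y` restricted to training users:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_auc(model, X: np.ndarray, y: np.ndarray, k: int = 5, seed: int = 0):
    """Mean and standard deviation of ROC AUC over stratified folds."""
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, val_idx in folds.split(X, y):
        # Refit the full pipeline (imputer, scaler, classifier) on the k-1 training folds.
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        probs = fold_model.predict_proba(X[val_idx])[:, 1]
        aucs.append(roc_auc_score(y[val_idx], probs))
    return float(np.mean(aucs)), float(np.std(aucs))
```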

6.4 Interpretation

Comparing AUC across feature sets reveals:

  • Baseline performance: Predictive value of the simplest model (word count alone)
  • Marginal gains: Incremental improvement from adding feature categories
  • Diminishing returns: Whether complex feature sets substantially outperform simple ones

7. Permutation Importance Procedure

7.1 Purpose

Permutation importance quantifies the contribution of each feature to model performance by measuring the decrease in performance when that feature’s relationship with the outcome is destroyed.

7.2 Algorithm

For each feature j:

  1. Compute baseline performance metric M_0 (e.g., AUC) on the evaluation set
  2. Randomly permute the values of feature j across samples, breaking its relationship with both other features and the outcome
  3. Compute performance metric M_j^{(\pi)} on the permuted data
  4. Repeat steps 2-3 K times with different random permutations
  5. Compute importance as the mean decrease in performance:
I_j = \frac{1}{K} \sum_{k=1}^{K} \left[ M_0 - M_j^{(\pi_k)} \right]
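
A minimal sketch of this loop (scikit-learn's `sklearn.inspection.permutation_importance` provides an equivalent utility); a fitted model and NumPy evaluation arrays are assumed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importance_auc(fitted_model, X_eval: np.ndarray, y_eval: np.ndarray,
                               n_repeats: int = 20, seed: int = 0) -> np.ndarray:
    """Mean drop in ROC AUC when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base_auc = roc_auc_score(y_eval, fitted_model.predict_proba(X_eval)[:, 1])
    importances = np.zeros(X_eval.shape[1])
    for j in range(X_eval.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_eval.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-outcome link
            drops.append(base_auc - roc_auc_score(y_eval, fitted_model.predict_proba(X_perm)[:, 1]))
        importances[j] = np.mean(drops)
    return importances
```

The relative importance of Section 7.4 then follows by clipping negative values at zero and normalizing, e.g. `np.maximum(I, 0) / np.maximum(I, 0).sum()`.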

7.3 Interpretation

  • Positive importance: Permuting the feature hurts performance; the feature carries predictive signal
  • Near-zero importance: The feature does not contribute to prediction
  • Negative importance: Permuting the feature improves performance; the feature may be introducing noise

7.4 Relative Importance

To compare feature contributions, we compute the proportion of total positive importance attributable to each feature:

\text{RelImp}_j = \frac{\max(I_j, 0)}{\sum_{j'} \max(I_{j'}, 0)}

8. Random Feature Comparison

8.1 Purpose

Comparing real features against random noise features guards against spurious performance claims. If real features perform no better than random Gaussian noise, the predictive signal is likely artifactual.

8.2 Procedure

  1. Generate m random features from a standard normal distribution (mean 0, variance 1)
  2. Evaluate three models using cross-validation:
    • Real features only: Original feature set
    • Random features only: m random Gaussian features
    • Real + random features: Concatenation of both

8.3 Statistical Test

Compare cross-validation AUC distributions between real and random feature conditions using a two-sample t-test:

H_0: \mu_{\text{real}} = \mu_{\text{random}}

Rejection of H_0 (with \mu_{\text{real}} > \mu_{\text{random}}) confirms genuine predictive signal.
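
A sketch of the comparison, assuming per-fold AUC arrays produced by the cross-validation routine above; the noise generator is illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

def random_features(n_samples: int, m: int, seed: int = 0) -> np.ndarray:
    """m standard-normal noise features for the random-feature baseline."""
    return np.random.default_rng(seed).standard_normal((n_samples, m))

def compare_real_vs_random(auc_real: np.ndarray, auc_random: np.ndarray):
    """Two-sample t-test on fold-level AUCs; H0: equal mean AUC."""
    t_stat, p_value = ttest_ind(auc_real, auc_random)
    return t_stat, p_value
```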


9. Validation Design

9.1 Single-Use Holdout Principle

The test set is evaluated exactly once, after all exploratory analysis and model selection are complete. This prevents:

  • Selection bias from repeated testing
  • Implicit hyperparameter tuning on test data
  • Inflated performance estimates

9.2 Evaluation Metrics

MetricPurpose
ROC AUCDiscrimination ability across all probability thresholds
Precision/RecallClass-specific performance at chosen threshold
Calibration curveAgreement between predicted probabilities and observed rates

9.3 Calibration Assessment

Calibration is assessed by binning predictions and comparing mean predicted probability to observed outcome rate within each bin:

\text{Calibration error}_b = \left| \bar{p}_b - \bar{y}_b \right|

Where:

  • \bar{p}_b = mean predicted probability in bin b
  • \bar{y}_b = observed outcome rate in bin b
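
A sketch of the binned calibration check using scikit-learn's `calibration_curve`; the bin count is illustrative.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def calibration_errors(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Per-bin absolute gap between observed outcome rate and mean predicted probability."""
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    return np.abs(prob_pred - prob_true)
```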

9.4 Train-Test Comparison

Comparing cross-validation performance (training) to holdout performance (test) quantifies overfitting:

\Delta_{\text{overfit}} = \text{AUC}_{\text{train-CV}} - \text{AUC}_{\text{test}}

Small \Delta indicates the model generalizes well to new users.


10. Identity Marker Testing Procedure

10.1 Hypothesis

LIWC-style identity markers in first prompts may predict engagement. Users who express personal identity (e.g., “I am a programmer”, “As a teacher…”) may exhibit different engagement patterns.

10.2 Features Tested

| Feature Category | Examples |
| --- | --- |
| First-person pronoun rates | I, me, my, mine, myself rates |
| Self-disclosure markers | Statements about personal attributes |
| Identity claim patterns | “I am a [role]”, “As a [profession]” |
| Relationship terminology | “my wife”, “my friend”, “my boss” |
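
An illustrative sketch of how identity-claim and relationship patterns of this kind could be detected; the regular expressions below are hypothetical stand-ins, not the documented feature definitions.

```python
import re

# Hypothetical patterns illustrating the feature categories above.
IDENTITY_CLAIM = re.compile(r"\b(i am a|i'm a|as a)\s+\w+", re.IGNORECASE)
RELATIONSHIP_TERM = re.compile(r"\bmy (wife|husband|friend|boss|partner)\b", re.IGNORECASE)

def identity_marker_features(text: str) -> dict:
    """Binary indicators for identity claims and relationship terminology."""
    return {
        "has_identity_claim": int(bool(IDENTITY_CLAIM.search(text))),
        "has_relationship_term": int(bool(RELATIONSHIP_TERM.search(text))),
    }
```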

10.3 Evaluation Procedure

Identity marker features are evaluated using the same ablation and permutation importance procedures documented above. The contribution of these features is quantified relative to structural features (word count).

10.4 Documentation

Full implementation details are in notebook 08_IdentityHypotheses.ipynb. Results regarding whether identity markers provide predictive value are reported in the associated Dispatch.


11. Limitations

| Limitation | Methodological Impact | Mitigation |
| --- | --- | --- |
| Ecological fallacy | Group-level patterns may not apply to individuals | Effect sizes reported; individual-level claims avoided |
| Temporal confounds | Model capability improvements over time may affect behavior | Sensitivity analysis stratified by model family |
| Intent circularity | Intent features based on keyword matching, not a validated intent taxonomy | Intent used only as confounder control, not primary predictor |
| Population specificity | WildChat users (anonymous, specific interface) may not represent other AI user populations | Reference class documented; generalization claims limited |
| Outcome window sensitivity | Fixed return window (e.g., 60 days) is arbitrary | Sensitivity analyses with alternative windows recommended |
| Word count confounding | Word count may proxy for task complexity rather than user characteristics | Interpretation acknowledges this confound |

12. Code

Analysis notebooks are available on GitHub:


Appendix A: Feature Set Specifications

A.1 Baseline Set

Contains only the structural complexity indicator:

  • word_count

A.2 Pronoun Extension

Adds normalized pronoun usage rates:

  • first_person_singular_rate
  • first_person_plural_rate
  • second_person_rate

A.3 Style Extension

Adds stylistic markers:

  • politeness_rate
  • has_question
  • is_imperative
  • avg_word_length

A.4 Greeting Extension

Adds conversational opener detection:

  • is_greeting

A.5 Intent Extension

Adds task category indicators:

  • intent_coding
  • intent_roleplay
  • intent_creative_writing
  • intent_emotional_support

Appendix B: Model Family Sensitivity Analysis Design

To address temporal confounds, the analysis can be stratified by GPT model family (GPT-3.5, GPT-4, GPT-4o). For each family:

  1. Filter to conversations using that model family
  2. Apply the same temporal holdout and feature extraction
  3. Evaluate cross-validated performance

Consistency of feature importance patterns across model families would suggest findings are not artifacts of temporal trends in model capability.


Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 1.0 | 2026-01-05 | Initial publication |