Engagement Prediction from First-Turn Features
Methodology for Predicting User Return from Initial Prompt Characteristics
A methodology for predicting user return behavior from first-prompt features alone. Implements strict temporal holdout validation with new-user restriction to ensure genuine out-of-sample evaluation. Documents feature engineering for linguistic markers, ablation procedures for isolating feature contributions, and permutation importance for quantifying feature effects. Findings are reported in the associated Dispatch.
Executive Summary
This study documents a methodology for predicting user return behavior based solely on first-prompt characteristics. The approach implements strict temporal holdout with new-user restriction, extracts linguistic features from first prompts, and quantifies feature contributions through ablation and permutation importance procedures. The methodology enables investigation of which first-prompt characteristics—structural, stylistic, or semantic—contribute to predicting return behavior. Specific findings and interpretations are reported in the associated Dispatch.
1. Motivation
1.1 Context
Predicting user engagement with conversational AI systems from first-interaction features presents several methodological challenges:
- Temporal leakage: Training on future data and testing on past data inflates performance estimates
- User contamination: Including the same user in both train and test sets creates information leakage
- Feature circularity: Using features that require observing multiple conversations (e.g., intent diversity) prevents prospective prediction
- Confounding: Prompt length may correlate with task complexity, which independently drives return behavior
This methodology addresses each concern with specific design choices documented below.
1.2 Research Questions
The methodology enables investigation of:
- Whether first-prompt features can predict return behavior at levels exceeding chance
- Which feature categories contribute most to prediction: structural (word count), stylistic (pronouns, politeness), or semantic (intent)
- Whether predictive signal comes from genuine linguistic markers or is confounded with prompt length
- Whether predictive relationships generalize from training period to genuinely new users
2. Temporal Holdout Design
2.1 Split Definition
We implement a strict temporal holdout with new-user restriction:
| Split | Definition | Purpose |
|---|---|---|
| Training | Users whose first-ever conversation occurred before the cutoff date | Model development and exploratory analysis |
| Test | Users whose first-ever conversation occurred on or after the cutoff date | Confirmatory evaluation (single use) |
2.2 New-User Restriction
The test set contains only users who did not exist in the training period. This is stricter than a simple date cutoff on conversations:
- A date cutoff on conversations would allow the same user to appear in both splits (e.g., early conversations in training, later conversations in test)
- The new-user restriction ensures complete user-level separation
2.3 Follow-Up Window Constraint
To ensure all test users have sufficient observation time for outcome measurement, test set inclusion requires:
$$t_1(u) + W \le t_{\text{end}}$$
Where:
- $t_1(u)$ = timestamp of user $u$'s first conversation
- $t_{\text{end}}$ = end of the data collection period
- $W$ = follow-up window (e.g., 60 days)
This prevents right-censoring bias from users who joined near the end of data collection.
2.4 Contamination Verification
The implementation verifies zero overlap between training and test user sets:
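A minimal sketch of the split construction and disjointness check, assuming a pandas DataFrame named `conversations` with illustrative columns `user_id` and `timestamp` (names, cutoff date, and window length are placeholders, not values from the source notebook):

```python
import pandas as pd

# Assumed input (illustrative names): one row per conversation,
# with columns "user_id" and "timestamp".
CUTOFF = pd.Timestamp("2024-01-01")    # example cutoff date
DATA_END = pd.Timestamp("2024-06-30")  # example end of data collection
FOLLOW_UP = pd.Timedelta(days=60)      # follow-up window W

# Each user's first-ever conversation timestamp.
first_conv = conversations.groupby("user_id")["timestamp"].min()

# Training users: first-ever conversation before the cutoff.
train_users = set(first_conv[first_conv < CUTOFF].index)

# Test users: first-ever conversation on/after the cutoff, with at least
# W days of observation remaining before the end of data collection.
test_mask = (first_conv >= CUTOFF) & (first_conv + FOLLOW_UP <= DATA_END)
test_users = set(first_conv[test_mask].index)

# Contamination check: train and test user sets must be disjoint.
assert train_users.isdisjoint(test_users), "user overlap between train and test"
```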
3. Outcome Variable
3.1 Definition
A user is classified as “returned” if they initiated at least two conversations within a fixed time window of their first conversation:
$$R(u) = \mathbb{1}\left[\, n(u) \ge 2 \;\wedge\; t_2(u) - t_1(u) \le W \,\right]$$
Where:
- $u$ = user
- $n(u)$ = total number of conversations for user $u$
- $t_1(u)$ = timestamp of user $u$'s first conversation
- $t_2(u)$ = timestamp of user $u$'s second conversation
- $W$ = return window in days
- $\mathbb{1}[\cdot]$ = indicator function
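A minimal sketch of the outcome computation, reusing the assumed `conversations` DataFrame and illustrative column names from the split sketch above:

```python
import pandas as pd

RETURN_WINDOW = pd.Timedelta(days=60)  # return window W (example value)

def returned(user_timestamps: pd.Series) -> int:
    """R(u): 1 if the user's second conversation started within W days
    of their first conversation, else 0."""
    ts = user_timestamps.sort_values()
    if len(ts) < 2:  # n(u) < 2: never returned
        return 0
    return int(ts.iloc[1] - ts.iloc[0] <= RETURN_WINDOW)

# One binary label per user.
labels = conversations.groupby("user_id")["timestamp"].apply(returned)
```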
3.2 Rationale for Fixed Window
Using a fixed time window (rather than unbounded “ever returned”) standardizes the outcome across users with different observation periods. This is particularly important given the temporal holdout design, where test users necessarily have shorter maximum observation periods.
3.3 Days-to-Return vs. Active Span
We use days between first and second conversation ($t_2 - t_1$) rather than days between first and last conversation ($t_{\text{last}} - t_1$). The latter conflates return timing with usage intensity and is more susceptible to right-censoring.
4. Feature Engineering
4.1 Extraction Scope
Features are extracted only from the first user message of each user's first conversation. This ensures:
- Predictions could be made at the moment of first interaction
- No information leakage from subsequent messages or conversations
- Consistent feature space across all users
4.2 Structural Features
| Feature | Definition | Formula |
|---|---|---|
| Word count | Number of whitespace-separated tokens | $n_w = \lvert \mathrm{split}(p) \rvert$ |
| Character count | Total characters including whitespace | $n_c = \lvert p \rvert$ |
| Average word length | Mean characters per word | $\frac{1}{n_w}\sum_{i=1}^{n_w} \lvert w_i \rvert$ |
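A brief sketch of the structural feature extraction described above; the function name is illustrative:

```python
def structural_features(prompt: str) -> dict:
    """Structural features of a first prompt."""
    words = prompt.split()  # whitespace-separated tokens
    n_words = len(words)
    return {
        "word_count": n_words,
        "char_count": len(prompt),  # includes whitespace
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }

structural_features("Explain how gradient descent works")
# -> {'word_count': 5, 'char_count': 34, 'avg_word_length': 6.0}
```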
4.3 Pronoun Rate Features
Pronoun rates are computed as counts per 100 words to normalize across prompt lengths:
$$\text{rate}_P = 100 \times \frac{\sum_{w \in T} \mathbb{1}[w \in P]}{n_w}$$
Where:
- $T$ = words in the prompt (lowercased)
- $P$ = target pronoun set
- $n_w$ = total word count
| Feature | Pronoun Set |
|---|---|
| First-person singular rate | (i, me, my, mine, myself) |
| First-person plural rate | (we, us, our, ours, ourselves) |
| Second-person rate | (you, your, yours, yourself, yourselves) |
4.4 Politeness Rate
Politeness rate uses the same normalization:
$$\text{politeness rate} = 100 \times \frac{\sum_{w \in T} \mathbb{1}[w \in M]}{n_w}$$
Where $M$ = {please, thank, thanks, appreciate, kindly}
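A minimal sketch of the per-100-words rate computation covering both pronoun and politeness features; the punctuation stripping is an implementation choice assumed here, not specified in the text:

```python
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
POLITENESS_MARKERS = {"please", "thank", "thanks", "appreciate", "kindly"}

def rate_per_100_words(prompt: str, markers: set) -> float:
    """Count of marker words per 100 words (0.0 for an empty prompt)."""
    words = [w.strip(".,!?;:'\"").lower() for w in prompt.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in markers)
    return 100.0 * hits / len(words)

rate_per_100_words("Please help me fix my code.", POLITENESS_MARKERS)      # 16.67
rate_per_100_words("Please help me fix my code.", FIRST_PERSON_SINGULAR)   # 33.33
```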
4.5 Binary Structural Features
| Feature | Definition | Trigger Conditions |
|---|---|---|
| Has question | Prompt contains interrogative markers | Contains “?” OR starts with an interrogative word (what, how, why, when, where, who, can, could, would, is, are, do, does) |
| Is imperative | Prompt begins with command verb | First word (after lowercasing and stripping punctuation) is in (write, create, make, generate, list, explain, tell, show, find, help, give) |
| Is greeting | Prompt begins with greeting | Starts with: hi, hello, hey, good morning, good afternoon, good evening, greetings |
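A sketch of the binary structural detectors in the table above, using the trigger conditions as written; variable and function names are illustrative:

```python
INTERROGATIVES = {"what", "how", "why", "when", "where", "who",
                  "can", "could", "would", "is", "are", "do", "does"}
IMPERATIVE_VERBS = {"write", "create", "make", "generate", "list", "explain",
                    "tell", "show", "find", "help", "give"}
GREETINGS = ("hi", "hello", "hey", "good morning", "good afternoon",
             "good evening", "greetings")

def binary_features(prompt: str) -> dict:
    text = prompt.strip().lower()
    tokens = text.split()
    first_word = tokens[0].strip(".,!?;:'\"") if tokens else ""
    return {
        "has_question": int("?" in text or first_word in INTERROGATIVES),
        "is_imperative": int(first_word in IMPERATIVE_VERBS),
        "is_greeting": int(text.startswith(GREETINGS)),
    }

binary_features("How do I reverse a list in Python?")
# -> {'has_question': 1, 'is_imperative': 0, 'is_greeting': 0}
```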
4.6 Intent Features
Intent indicators are binary features derived from keyword pattern matching on the full prompt text. These serve as confounder controls rather than primary predictors.
| Intent | Description | Example Patterns |
|---|---|---|
| Coding | Programming-related requests | code, python, javascript, function, error, debug, algorithm |
| Roleplay | Persona or scenario requests | act as, you are a, pretend, roleplay, scenario |
| Creative writing | Generative creative content | write a story, write a poem, fiction, compose |
| Emotional support | Support-seeking or personal distress | i feel, anxious, depressed, lonely, advice, help me cope |
Full pattern lists are documented in the source notebook.
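A sketch of the keyword matching, using only the example patterns from the table above (the full lists live in the source notebook):

```python
# Example patterns only; not the complete lists from the notebook.
INTENT_PATTERNS = {
    "intent_coding": ("code", "python", "javascript", "function",
                      "error", "debug", "algorithm"),
    "intent_roleplay": ("act as", "you are a", "pretend", "roleplay", "scenario"),
    "intent_creative_writing": ("write a story", "write a poem", "fiction", "compose"),
    "intent_emotional_support": ("i feel", "anxious", "depressed", "lonely",
                                 "advice", "help me cope"),
}

def intent_features(prompt: str) -> dict:
    """Binary intent indicators from substring matching on the full prompt."""
    text = prompt.lower()
    return {intent: int(any(pattern in text for pattern in patterns))
            for intent, patterns in INTENT_PATTERNS.items()}
```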
5. Model Architecture
5.1 Model Selection
We use logistic regression with L2 regularization for the following reasons:
- Interpretability: Coefficients have direct interpretation as log-odds effects; odds ratios quantify effect magnitudes
- Calibration: Logistic regression tends to produce reasonably well-calibrated probability estimates, often without additional post-hoc calibration
- Regularization: L2 penalty provides implicit regularization against overfitting
- Efficiency: Fast training enables multiple ablation experiments
5.2 Class Imbalance Handling
Given the imbalanced outcome distribution (most users do not return), we apply balanced class weighting:
$$w_j = \frac{N}{K \cdot n_j}$$
Where:
- $w_j$ = weight for class $j$
- $N$ = total samples
- $K$ = number of classes (2)
- $n_j$ = samples in class $j$
This upweights the minority class (returners) during training.
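A small sketch verifying that the formula above matches scikit-learn's "balanced" weighting; the class counts are invented for illustration:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 900 + [1] * 100)  # illustrative 9:1 imbalance
classes = np.array([0, 1])

# Manual formula: w_j = N / (K * n_j)
N, K = len(y), len(classes)
manual = [N / (K * np.sum(y == c)) for c in classes]

# scikit-learn's "balanced" weighting applies the same formula.
balanced = compute_class_weight("balanced", classes=classes, y=y)

print(manual)    # [0.5555..., 5.0]
print(balanced)  # [0.5555... 5.    ]
```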
5.3 Feature Standardization
All features are standardized to zero mean and unit variance before model fitting:
$$z = \frac{x - \mu}{\sigma}$$
Where $\mu$ and $\sigma$ are computed from the training set only. The same transformation parameters are applied to test data to prevent leakage.
5.4 Missing Value Treatment
Missing values (e.g., when word count is zero, preventing rate calculation) are imputed with zero.
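A minimal sketch of the preprocessing and model choices from Sections 5.1-5.4 combined into a single scikit-learn pipeline, so that imputation and scaling parameters are learned from training data only; hyperparameter values such as `max_iter` are illustrative:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Zero imputation, standardization, and the L2-regularized classifier are
# chained so that fill values, means, and scales are fitted on training
# folds only and reapplied unchanged to validation/test data.
model = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", class_weight="balanced",
                               max_iter=1000)),
])
```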
6. Ablation Procedure
6.1 Purpose
Ablation studies quantify the marginal contribution of feature categories by comparing model performance across nested feature sets.
6.2 Feature Set Hierarchy
We define feature sets of increasing complexity:
| Set Name | Features Added | Cumulative Features |
|---|---|---|
| Baseline | word_count | 1 |
| + Pronouns | first_person_singular_rate, first_person_plural_rate, second_person_rate | 4 |
| + Style | politeness_rate, has_question, is_imperative, avg_word_length | 8 |
| + Greeting | is_greeting | 9 |
| + Intent | intent_coding, intent_roleplay, intent_creative_writing, intent_emotional_support | 13 |
Alternative groupings (e.g., baseline + style without pronouns) are also evaluated to isolate pronoun vs. style contributions.
6.3 Cross-Validation Procedure
For each feature set, we estimate performance using $K$-fold stratified cross-validation on training data:
- Partition training users into $K$ stratified folds (preserving outcome class proportions)
- For each fold $k = 1, \dots, K$:
  - Hold out fold $k$ as the validation set
  - Fit the standardizer and model on the remaining $K-1$ folds
  - Apply the standardizer to the validation set
  - Compute predicted probabilities on the validation set
  - Calculate ROC AUC for fold $k$
- Report the mean and standard deviation of AUC across the $K$ folds
Stratification ensures each fold maintains the class balance of the full training set.
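A sketch of the ablation loop under these assumptions: `X_train` is a DataFrame of first-prompt features, `y_train` holds the binary return labels, `model` is the pipeline sketched in Section 5, and $K = 5$ is an illustrative choice (the document does not fix $K$):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

FEATURE_SETS = {
    "baseline": ["word_count"],
    "+pronouns": ["word_count", "first_person_singular_rate",
                  "first_person_plural_rate", "second_person_rate"],
    # ... remaining sets follow the hierarchy in the table above
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # K = 5 (example)
for name, cols in FEATURE_SETS.items():
    aucs = cross_val_score(model, X_train[cols], y_train, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {aucs.mean():.3f} ± {aucs.std():.3f}")
```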
6.4 Interpretation
Comparing AUC across feature sets reveals:
- Baseline performance: Predictive value of the simplest model (word count alone)
- Marginal gains: Incremental improvement from adding feature categories
- Diminishing returns: Whether complex feature sets substantially outperform simple ones
7. Permutation Importance Procedure
7.1 Purpose
Permutation importance quantifies the contribution of each feature to model performance by measuring the decrease in performance when that feature’s relationship with the outcome is destroyed.
7.2 Algorithm
For each feature $j$:
1. Compute the baseline performance metric $M_{\text{base}}$ (e.g., AUC) on the evaluation set
2. Randomly permute the values of feature $j$ across samples, breaking its relationship with both the other features and the outcome
3. Compute the performance metric $M_{j,r}^{\text{perm}}$ on the permuted data
4. Repeat steps 2-3 $R$ times with different random permutations
5. Compute the importance as the mean decrease in performance:
$$I_j = M_{\text{base}} - \frac{1}{R} \sum_{r=1}^{R} M_{j,r}^{\text{perm}}$$
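A sketch using scikit-learn's `permutation_importance`, which implements this algorithm; `model`, `X_eval`, `y_eval`, and the number of repeats are assumptions for illustration:

```python
from sklearn.inspection import permutation_importance

# Assumed inputs: a fitted `model` and an evaluation set (X_eval, y_eval).
result = permutation_importance(
    model, X_eval, y_eval,
    scoring="roc_auc",  # performance metric M
    n_repeats=20,       # R repetitions per feature (illustrative value)
    random_state=0,
)

# importances_mean[j] is the mean AUC decrease I_j for feature j.
ranked = sorted(zip(X_eval.columns, result.importances_mean), key=lambda t: -t[1])
for feature, importance in ranked:
    print(f"{feature}: {importance:.4f}")
```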
7.3 Interpretation
- Positive importance: Permuting the feature hurts performance; the feature carries predictive signal
- Near-zero importance: The feature does not contribute to prediction
- Negative importance: Permuting the feature improves performance; the feature may be introducing noise
7.4 Relative Importance
To compare feature contributions, we compute the proportion of total positive importance attributable to each feature:
$$\tilde{I}_j = \frac{\max(I_j, 0)}{\sum_{j'} \max(I_{j'}, 0)}$$
8. Random Feature Comparison
8.1 Purpose
Comparing real features against random noise features guards against spurious performance claims. If real features perform no better than random Gaussian noise, the predictive signal is likely artifactual.
8.2 Procedure
- Generate $m$ random features from a standard normal distribution (mean 0, variance 1)
- Evaluate three models using cross-validation:
  - Real features only: the original feature set
  - Random features only: the $m$ random Gaussian features
  - Real + random features: concatenation of both
8.3 Statistical Test
Compare cross-validation AUC distributions between the real and random feature conditions using a two-sample t-test:
$$H_0: \mu_{\text{AUC}}^{\text{real}} = \mu_{\text{AUC}}^{\text{random}}$$
Rejection of $H_0$ (with $p < \alpha$) confirms genuine predictive signal.
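A sketch of the comparison under the same assumptions as the ablation sketch (`X_train`, `y_train`, `model`, `cv`); the choice of $m$ equal to the number of real features is illustrative:

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
m = X_train.shape[1]  # one noise feature per real feature (illustrative choice)
X_random = rng.standard_normal(size=(len(X_train), m))

auc_real = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
auc_random = cross_val_score(model, X_random, y_train, cv=cv, scoring="roc_auc")

# Two-sample t-test on the fold-level AUC scores.
t_stat, p_value = stats.ttest_ind(auc_real, auc_random)
print(f"real {auc_real.mean():.3f} vs random {auc_random.mean():.3f}, p = {p_value:.3g}")
```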
9. Validation Design
9.1 Single-Use Holdout Principle
The test set is evaluated exactly once, after all exploratory analysis and model selection are complete. This prevents:
- Selection bias from repeated testing
- Implicit hyperparameter tuning on test data
- Inflated performance estimates
9.2 Evaluation Metrics
| Metric | Purpose |
|---|---|
| ROC AUC | Discrimination ability across all probability thresholds |
| Precision/Recall | Class-specific performance at chosen threshold |
| Calibration curve | Agreement between predicted probabilities and observed rates |
9.3 Calibration Assessment
Calibration is assessed by binning predictions and comparing mean predicted probability to observed outcome rate within each bin:
$$\bar{p}_b = \frac{1}{\lvert B_b \rvert} \sum_{i \in B_b} \hat{p}_i \qquad \bar{y}_b = \frac{1}{\lvert B_b \rvert} \sum_{i \in B_b} y_i$$
Where:
- $\bar{p}_b$ = mean predicted probability in bin $b$
- $\bar{y}_b$ = observed outcome rate in bin $b$
- $B_b$ = set of samples whose predicted probability falls in bin $b$
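A minimal sketch using scikit-learn's `calibration_curve`; the fitted `model`, holdout set `(X_test, y_test)`, and bin count are assumptions:

```python
from sklearn.calibration import calibration_curve

# Predicted return probabilities on the holdout set from the fitted pipeline.
probs = model.predict_proba(X_test)[:, 1]

# Observed outcome rate vs. mean predicted probability in each bin.
obs_rate, mean_pred = calibration_curve(y_test, probs, n_bins=10)
for p_bar, y_bar in zip(mean_pred, obs_rate):
    print(f"predicted {p_bar:.2f} -> observed {y_bar:.2f}")
```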
9.4 Train-Test Comparison
Comparing cross-validation performance (training) to holdout performance (test) quantifies overfitting:
$$\Delta_{\text{AUC}} = \text{AUC}_{\text{CV}} - \text{AUC}_{\text{test}}$$
A small $\Delta_{\text{AUC}}$ indicates the model generalizes well to new users.
10. Identity Marker Testing Procedure
10.1 Hypothesis
LIWC-style identity markers in first prompts may predict engagement. Users who express personal identity (e.g., “I am a programmer”, “As a teacher…”) may exhibit different engagement patterns.
10.2 Features Tested
| Feature Category | Examples |
|---|---|
| First-person pronoun rates | I, me, my, mine, myself rates |
| Self-disclosure markers | Statements about personal attributes |
| Identity claim patterns | “I am a [role]”, “As a [profession]” |
| Relationship terminology | ”my wife”, “my friend”, “my boss” |
10.3 Evaluation Procedure
Identity marker features are evaluated using the same ablation and permutation importance procedures documented above. The contribution of these features is quantified relative to structural features (word count).
10.4 Documentation
Full implementation details are in notebook 08_IdentityHypotheses.ipynb. Results regarding whether identity markers provide predictive value are reported in the associated Dispatch.
11. Limitations
| Limitation | Methodological Impact | Mitigation |
|---|---|---|
| Ecological fallacy | Group-level patterns may not apply to individuals | Effect sizes reported; individual-level claims avoided |
| Temporal confounds | Model capability improvements over time may affect behavior | Sensitivity analysis stratified by model family |
| Intent circularity | Intent features based on keyword matching, not validated intent taxonomy | Intent used only as confounder control, not primary predictor |
| Population specificity | WildChat users (anonymous, specific interface) may not represent other AI user populations | Reference class documented; generalization claims limited |
| Outcome window sensitivity | Fixed return window (e.g., 60 days) is arbitrary | Sensitivity analyses with alternative windows recommended |
| Word count confounding | Word count may proxy for task complexity rather than user characteristics | Interpretation acknowledges this confound |
12. Code
Analysis notebooks are available on GitHub:
- 06_WildChatEDA_PartII.ipynb — Exploratory data analysis
- 07_EngagementPrediction.ipynb — Engagement prediction implementation
Appendix A: Feature Set Specifications
A.1 Baseline Set
Contains only the structural complexity indicator:
word_count
A.2 Pronoun Extension
Adds normalized pronoun usage rates:
first_person_singular_rate
first_person_plural_rate
second_person_rate
A.3 Style Extension
Adds stylistic markers:
politeness_rate
has_question
is_imperative
avg_word_length
A.4 Greeting Extension
Adds conversational opener detection:
is_greeting
A.5 Intent Extension
Adds task category indicators:
intent_coding
intent_roleplay
intent_creative_writing
intent_emotional_support
Appendix B: Model Family Sensitivity Analysis Design
To address temporal confounds, the analysis can be stratified by GPT model family (GPT-3.5, GPT-4, GPT-4o). For each family:
- Filter to conversations using that model family
- Apply the same temporal holdout and feature extraction
- Evaluate cross-validated performance
Consistency of feature importance patterns across model families would suggest findings are not artifacts of temporal trends in model capability.
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-01-05 | Initial publication |