Semantic Exploration and Sustained Utilization
From Hand-Crafted Features to Learned Representations
A methodology for analyzing user engagement through learned semantic representations. Implements sentence embeddings (all-MiniLM-L6-v2, 384 dimensions), stratified sampling for utilization spectrum analysis, multi-level embedding aggregation, semantic diversity metrics, BERTopic topic modeling, and regression-based prediction. Documents the pivot from hand-crafted features to data-driven representation learning. Findings are reported in the associated Dispatch.
Executive Summary
This study documents a methodology for analyzing user engagement through learned semantic representations rather than hand-crafted linguistic features. The approach implements sentence embeddings to characterize prompt semantics, stratified sampling to ensure coverage of high-utilization users, multi-level embedding aggregation (first-turn, all-turn, user-level, temporal), semantic diversity metrics, BERTopic topic modeling, and regression-based prediction of utilization.
The methodology enables investigation of whether semantic structure—particularly diversity across the embedding space—relates to sustained engagement. Specific findings and statistical results are reported in the associated Dispatch.
1. Motivation: The Pivot from Hand-Crafted Features
1.1 Context: The Identity Hypothesis Investigation
The engagement prediction analysis (MTH-001.1) tested whether psychological identity markers in first-turn prompts could predict user return behavior. Six hypotheses were explored:
| Hypothesis | Core Question |
|---|---|
| H1: Efficiency Learning | Do users become more terse over time? |
| H2: Tool vs Partner Types | Are there distinct user phenotypes? |
| H3: “You” Collapse | Does direct address decline with use? |
| H4: “We” Disappearance | Does joint-agency framing decline? |
| H5: Qualitative Configurations | Can users be classified into identity types? |
| H6: Content Over Style | Is identity revealed by content, not style? |
Results from these hypothesis tests are documented in the associated Dispatch. Based on those findings, this study pivots to a data-driven approach using learned representations.
1.2 The Methodological Pivot
Rather than continue refining hand-crafted features, this study adopts an exploratory, data-driven approach:
| Hand-Crafted Approach | Embedding Approach |
|---|---|
| Hypothesis-driven feature selection | Let semantic structure emerge |
| Discrete categories (pronouns, politeness) | Continuous 384-dimensional space |
| Interpretation precedes analysis | Analysis precedes interpretation |
| Risk of confirmation bias | Risk of post-hoc rationalization |
The goal shifts from predicting engagement to characterizing what high utilizers do differently—then working backwards to interpretable patterns.
1.3 Research Questions
- What semantic structure exists in the space of first-turn prompts?
- Do high utilizers occupy distinct regions of semantic space?
- Does semantic diversity correlate with sustained engagement?
- How do users’ semantic territories evolve over time?
2. Embedding Methodology
2.1 Embedding Model Selection
We use all-MiniLM-L6-v2 from Sentence Transformers:
| Property | Value | Rationale |
|---|---|---|
| Dimensions | 384 | Sufficient expressiveness for semantic similarity |
| Training | Contrastive learning on 1B+ sentence pairs | Captures semantic similarity, not just lexical overlap |
| Speed | ~1000 sentences/second on CPU | Tractable for million-scale analysis |
| Normalization | L2-normalized outputs | Cosine similarity equals dot product |
Model loading:
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
2.2 Embedding Generation
First-turn prompts are embedded with normalization for cosine similarity:
embeddings = embed_model.encode(
texts,
show_progress_bar=True,
batch_size=256,
convert_to_numpy=True,
normalize_embeddings=True # Cosine similarity = dot product
)
Truncation: Prompts exceeding 5,000 characters are truncated. This affects less than 1% of conversations and primarily impacts very long code submissions.
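The truncation step might look like the following sketch, where raw_texts is an assumed list of first-turn prompt strings (only the 5,000-character limit comes from the text above):
MAX_CHARS = 5000  # Truncation limit noted above
# Hypothetical preprocessing: clip overly long prompts before embedding
texts = [t[:MAX_CHARS] for t in raw_texts]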
2.3 Why Not Fine-Tune?
We use the pre-trained model without fine-tuning because:
- No supervised signal available — We don’t know a priori what semantic distinctions matter for utilization
- Transfer learning sufficiency — General semantic similarity captures the structure we need
- Reproducibility — Off-the-shelf model enables replication without custom training
3. Continuous Utilization Spectrum
3.1 Avoiding Arbitrary Cutoffs
Previous analyses defined “power users” as those with 100+ conversations—an arbitrary threshold. This study treats utilization as a continuous variable:
$$\mathrm{utilization\ percentile}(u) = \frac{\mathrm{rank}\left(n_{\mathrm{conversations}}(u)\right)}{N_{\mathrm{users}}} \times 100$$
Where:
- rank(n_conversations) = user’s position when sorted by conversation count
- N_users = total users in the training set
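As a sketch, this percentile can be computed with a rank transform in pandas; users_df and its n_conversations column are assumed names:
import pandas as pd
# Hypothetical user table with one row per user and a conversation count
# users_df = pd.DataFrame({'n_conversations': [...]})
# Percentile rank in (0, 100]; tied counts share the same (average) rank
users_df['utilization_percentile'] = users_df['n_conversations'].rank(pct=True) * 100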
3.2 Log-Transformation for Regression
For regression targets, we use log-transformed conversation counts to handle extreme skew.
Training set statistics:
- Median conversations: 1
- Mean conversations: 2.2
- 90th percentile: 2 conversations
- 99th percentile: 14 conversations
- Maximum: 384,406 conversations
The extreme right skew (top 10% of users generate 58% of all conversations) makes log transformation essential for regression stability.
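As a sketch of the target transform: a log1p transform is a natural choice for counts starting at 1, but the exact transform used in the notebook is an assumption here.
import numpy as np
# Assumed transform: log(1 + n) compresses the long right tail
users_df['log_conversations'] = np.log1p(users_df['n_conversations'])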
4. Stratified Sampling Strategy
4.1 Oversampling High Utilizers
A uniform random sample would include too few high utilizers (≥90th percentile) to characterize high-engagement patterns in detail. We use stratified sampling:
| Stratum | Selection | Purpose |
|---|---|---|
| High utilizers (≥90th percentile) | All users (167,774) | Ensure complete coverage of high-engagement patterns |
| Other users (below 90th percentile) | Random sample (100,000) | Provide comparison baseline |
Total sample: 267,774 users with 251,005 valid first-turn embeddings.
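A sketch of the stratification under the assumptions above (users_df with a utilization_percentile column; sample sizes from the table; the random seed is illustrative):
import pandas as pd
# High-utilizer stratum: keep every user at or above the 90th percentile
high_df = users_df[users_df['utilization_percentile'] >= 90]
# Comparison stratum: 100,000 users drawn at random from the remainder
other_df = users_df[users_df['utilization_percentile'] < 90].sample(n=100_000, random_state=42)
sample_df = pd.concat([high_df, other_df], ignore_index=True)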
4.2 Temporal Holdout
To enable holdout validation, we restrict analysis to training data only: users whose first conversation occurred before January 1, 2025. This mirrors the temporal split in MTH-001.1.
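The training-set restriction could be expressed as a date filter; first_conversation_date is an assumed column name on the sample_df sketched above:
import pandas as pd
# Keep only users whose first conversation predates the 2025-01-01 cutoff
cutoff = pd.Timestamp('2025-01-01')
train_df = sample_df[sample_df['first_conversation_date'] < cutoff]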
5. Multi-Level Embedding Analysis
The analysis operates at four levels, each answering different questions:
5.1 Level 1: First-Turn Embeddings
Purpose: Predict/locate new users based on their first prompt.
| Property | Value |
|---|---|
| Scope | One embedding per user (first prompt only) |
| Size | 251,005 users × 384 dimensions |
| Use case | Where does a new user land in semantic space? |
5.2 Level 2: All-Turn Embeddings
Purpose: Map the full semantic territory of high utilizers.
| Property | Value |
|---|---|
| Scope | Every first-turn from every conversation |
| Size | 1,569,614 turns × 384 dimensions |
| Users | 152,990 high utilizers (≥90th percentile) |
| Use case | What topics do high utilizers explore? |
5.3 Level 3: User-Level Aggregates
Purpose: Compare users as semantic entities, not individual prompts.
For each user, we compute:
Centroid (mean embedding):
$$\bar{e}_u = \frac{1}{n_u} \sum_{i=1}^{n_u} e_i$$
Semantic spread (mean standard deviation across dimensions):
$$\mathrm{spread}_u = \frac{1}{d} \sum_{j=1}^{d} \mathrm{std}\left(e_{1,j}, \ldots, e_{n_u,j}\right)$$
Semantic diameter (maximum pairwise cosine distance):
$$\mathrm{diameter}_u = \max_{i \neq j} \left(1 - \cos(e_i, e_j)\right)$$
Where:
- n_u = number of prompts from user u
- e_i = embedding vector for prompt i
- d = embedding dimensionality (384)
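A sketch of the three aggregates for a single user's stack of L2-normalized embeddings; user_vecs (an array of shape (n_u, 384)) is an assumed variable:
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
# user_vecs: (n_u, 384) array of one user's L2-normalized prompt embeddings
centroid = user_vecs.mean(axis=0)                      # Mean embedding
semantic_spread = user_vecs.std(axis=0).mean()         # Mean per-dimension standard deviation
semantic_diameter = cosine_distances(user_vecs).max()  # Largest pairwise cosine distance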
5.4 Level 4: Temporal Embeddings
Purpose: Track semantic drift within users over time.
For users with ≥20 conversations, we compare:
- Early centroid: Mean of first 10 conversation embeddings
- Late centroid: Mean of last 10 conversation embeddings
- Semantic drift: Cosine distance between early and late centroids
- Spread change: Difference in semantic spread (late − early)
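Drift between the early and late centroids can be computed as a cosine distance; the sketch below assumes sorted_vecs holds one user's embeddings in chronological order:
import numpy as np
from scipy.spatial.distance import cosine
# Centroids of the first and last 10 conversations
early_centroid = sorted_vecs[:10].mean(axis=0)
late_centroid = sorted_vecs[-10:].mean(axis=0)
# Cosine distance (1 - cosine similarity) between the two centroids
semantic_drift = cosine(early_centroid, late_centroid)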
6. Semantic Diversity Metrics
6.1 Metric Definitions
We compute two primary semantic diversity metrics for each user:
| Metric | Definition | Interpretation |
|---|---|---|
| Semantic spread | Mean standard deviation across embedding dimensions | Higher values indicate prompts spanning more diverse semantic territory |
| Semantic diameter | Maximum pairwise cosine distance among user’s prompts | Captures the extremes of a user’s semantic range |
These metrics quantify the breadth of topics a user explores. Correlation results with utilization are reported in the associated Dispatch.
6.2 Implementation
from scipy.stats import pearsonr
# Pearson correlation between each user's semantic spread and log conversation count
spread_corr, spread_p = pearsonr(
    user_agg_df['semantic_spread'].to_numpy(),
    user_agg_df['log_conversations'].to_numpy()
)
6.3 Caution: Correlation ≠ Causation
Any observed correlation between semantic diversity and utilization could reflect:
- Diverse needs → sustained use (utility hypothesis)
- More conversations → more topics (mechanical relationship)
- Certain user types → both diversity and persistence (confounding)
We cannot disambiguate these interpretations without experimental intervention.
7. Topic Modeling with BERTopic
7.1 Two-Stage Topic Discovery
We fit BERTopic at two scales to enable comparison between general population and high-utilizer semantic patterns:
| Analysis | Corpus | Purpose |
|---|---|---|
| First-turn sample | First-turn prompts from stratified sample | Characterize topic distribution across utilization spectrum |
| High-utilizer all-turns | All first-turns from high-utilizer conversations | Map the full semantic territory of sustained users |
The number of topics discovered at each scale is reported in the associated Dispatch.
7.2 BERTopic Configuration
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
import hdbscan
vectorizer = CountVectorizer(
stop_words='english',
ngram_range=(1, 2),
min_df=20, # Higher threshold for larger corpus
max_df=0.5
)
hdbscan_model = hdbscan.HDBSCAN(
min_cluster_size=500, # Larger clusters for 1.5M points
min_samples=50,
metric='euclidean',
cluster_selection_method='eom',
prediction_data=True
)
topic_model = BERTopic(
embedding_model=embed_model,
umap_model=reducer,  # UMAP reducer, defined and fitted as described in Section 7.3
hdbscan_model=hdbscan_model,
vectorizer_model=vectorizer,
nr_topics='auto',
top_n_words=10,
verbose=True
)
7.3 UMAP Optimization for Large Corpora
Direct UMAP on 1.5M points is computationally prohibitive. We use a sample-and-transform strategy:
- Fit UMAP on 200,000 randomly sampled embeddings
- Transform remaining embeddings using fitted reducer
- Pass pre-computed 2D embeddings to BERTopic
import numpy as np
import umap
# Build a 2D reducer; these UMAP parameters are illustrative, not taken from the notebook
reducer = umap.UMAP(n_components=2, metric='cosine', random_state=42)
# Fit on a 200,000-point random sample
sample_indices = np.random.choice(len(embeddings), size=200_000, replace=False)
reducer.fit(embeddings[sample_indices])
# Transform the full corpus with the fitted reducer
embedding_2d = reducer.transform(embeddings)
7.4 Topic Entropy as Diversity Metric
For users with multiple conversations, we compute topic entropy:
$$H_u = -\sum_{t=1}^{T} p_t \log p_t$$
Where:
- p_t = proportion of user u’s conversations in topic t
- T = total number of topics
Higher entropy indicates a user’s conversations are distributed across more topics, rather than concentrated in a few. This provides a complementary diversity measure to embedding-level spread. Correlation results are reported in the associated Dispatch.
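A sketch of per-user topic entropy from BERTopic topic assignments; topics_per_turn (one topic id per conversation) and user_ids are assumed names:
import numpy as np
import pandas as pd
# Hypothetical assignment table: one row per conversation with its user and topic id
assign_df = pd.DataFrame({'user_id': user_ids, 'topic': topics_per_turn})
def topic_entropy(topics):
    """Shannon entropy of a user's topic distribution (natural log)."""
    p = topics.value_counts(normalize=True).to_numpy()
    return -(p * np.log(p)).sum()
user_entropy = assign_df.groupby('user_id')['topic'].apply(topic_entropy)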
8. Regression Analysis
8.1 Prediction Framework
We test whether first-turn embeddings can predict future utilization using regression models. The target variable is log-transformed conversation count.
Models evaluated:
| Model | Configuration | Rationale |
|---|---|---|
| Ridge (α=1, 10, 100) | L2-regularized linear regression | Tests linear predictability with varying regularization strength |
| Random Forest | 100 estimators, max_depth=10 | Captures non-linear relationships in embedding space |
| Baseline | Predict mean | Establishes floor for comparison |
8.2 Evaluation Design
All models are evaluated using 5-fold cross-validation with R² as the primary metric. Performance results are reported in the associated Dispatch.
8.3 Implementation
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
y = sample_df['log_conversations'].to_numpy()
X = embeddings
scores = cross_val_score(
RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
X, y, cv=kfold, scoring='r2'
)
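The Ridge variants from Section 8.1 can be scored with the same folds; this sketch reuses X, y, and kfold from the block above, with the alpha grid taken from the table:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Evaluate each regularization strength with the same 5-fold splits
for alpha in (1, 10, 100):
    ridge_scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=kfold, scoring='r2')
    print(f"Ridge(alpha={alpha}): mean R^2 = {ridge_scores.mean():.4f}")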
9. Temporal Analysis: Semantic Drift
9.1 Methodology
For users with ≥20 conversations, we compare semantic spread in their first 10 vs. last 10 conversations to assess whether users specialize or diversify over time.
Metrics computed:
| Metric | Definition |
|---|---|
| Semantic drift | Cosine distance between early and late centroids |
| Spread change | Difference in semantic spread (late − early) |
| Trajectory classification | Specialized (spread ↓), Diversified (spread ↑), or No change |
9.2 Research Questions
This analysis addresses whether high utilizers systematically narrow their focus over time (specialization) or expand their semantic range (diversification). Results are reported in the associated Dispatch.
9.3 Implementation
# For each user with ≥20 conversations
N_COMPARE = 10
# sorted_vecs: the user's prompt embeddings in chronological order
early_vecs = sorted_vecs[:N_COMPARE]    # First 10 conversations
late_vecs = sorted_vecs[-N_COMPARE:]    # Last 10 conversations
early_spread = early_vecs.std(axis=0).mean()
late_spread = late_vecs.std(axis=0).mean()
spread_change = late_spread - early_spread
10. Validation
10.1 Cross-Validation Design
All regression results use 5-fold cross-validation with shuffled splits (random_state=42). This yields an estimate of generalization performance within the training data rather than relying on a single split.
10.2 Holdout Validation Limitation
We intended to validate on users first appearing in 2025, but the filtered training set contained no such users (the 2025 cutoff was applied before sampling). Future analyses should reserve a temporal holdout before filtering.
10.3 Statistical Power Considerations
With sample sizes in the hundreds of thousands, statistical significance is effectively guaranteed for any non-trivial effect. The meaningful question in this analysis is effect size rather than p-values. Effect sizes and their interpretation are reported in the associated Dispatch.
11. Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| Correlation ≠ causation | Cannot claim diversity causes engagement | Interpret as association; note confounds |
| Single embedding model | Results may be model-specific | Test with alternative encoders (future work) |
| First-turn focus | Later prompts may matter more | Level 2 analysis uses all turns |
| Temporal holdout missing | Cannot assess true out-of-sample performance | Use cross-validation as proxy |
| Selection bias | WildChat users ≠ all AI users | Interpret as WildChat-specific patterns |
| Truncation at 5,000 chars | Long prompts underrepresented | Affects less than 1% of conversations |
12. Code
Analysis notebooks are available on GitHub:
- 08_IdentityHypotheses.ipynb — Identity hypothesis testing
- 09_SemanticExploration.ipynb — Semantic exploration and topic modeling
Appendix A: Identity Hypothesis Details
The six identity hypotheses tested in notebook 08:
A.1 H1: Efficiency Learning
Prediction: Early conversations are verbose and social; late conversations are terse and imperative.
Method: Compare first 10 vs. last 10 conversations for power users (100+ conversations). Metrics: word count, greeting rate, pronoun rates.
A.2 H2: Tool vs Partner Types
Prediction: Distinct user clusters with different linguistic profiles (tool-oriented vs. relationship-oriented).
Method: K-means clustering (k=2,3,4) on linguistic features, PCA visualization.
A.3 H3: “You” Collapse
Prediction: Direct address (second-person pronouns) declines with engagement depth.
Method: Compare second-person pronoun rate across engagement tiers (one-shot, moderate, power users).
A.4 H4: “We” Disappearance
Prediction: Joint-agency framing (“we,” “us,” “our”) declines with heavy use.
Method: Track first-person plural rate across tiers and within users over time.
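As a sketch, a first-person-plural rate per prompt could be computed with a simple regex; the exact patterns used in the notebook are not reproduced here, so this word list is an assumption:
import re
# Assumed pattern: standalone "we", "us", "our", "ours" (case-insensitive)
WE_PATTERN = re.compile(r"\b(we|us|our|ours)\b", flags=re.IGNORECASE)
def first_person_plural_rate(prompt: str) -> float:
    """Matches per 100 words; 0.0 for empty prompts."""
    words = prompt.split()
    if not words:
        return 0.0
    return 100.0 * len(WE_PATTERN.findall(prompt)) / len(words)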
A.5 H5: Qualitative Configurations
Prediction: Users cluster into discrete identity configurations (instrumental, relational, hybrid).
Method: Regex-based classification of prompt styles into predefined categories.
A.6 H6: Content Over Style
Prediction: Content markers (self-disclosure, anthropomorphization) predict engagement better than style markers.
Method: Extract content markers via regex; compare prevalence by engagement tier.
Results for all hypotheses are documented in the associated Dispatch.
Appendix B: Nearest Neighbor Inference
A prototype personal inference function locates new prompts in the existing semantic space:
import numpy as np

def personal_inference(prompt_text, sample_df, embeddings, nn_model, embed_model, n_neighbors=50):
"""
Given a new prompt, find similar users and estimate utilization pattern.
"""
# Embed the new prompt
new_embedding = embed_model.encode([prompt_text], normalize_embeddings=True)
# Find nearest neighbors
distances, indices = nn_model.kneighbors(new_embedding, n_neighbors=n_neighbors)
# Get neighbor statistics
neighbor_percentiles = sample_df['utilization_percentile'].to_numpy()[indices[0]]
return {
'mean_percentile': neighbor_percentiles.mean(),
'median_percentile': np.median(neighbor_percentiles),
'std_percentile': neighbor_percentiles.std(),
'n_neighbors': n_neighbors,
}
This enables the research question: “Based on your first prompt, where do you land among existing users?”
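A usage sketch: nn_model would be a NearestNeighbors index fitted on the stratified-sample embeddings (the cosine metric and the example prompt are assumptions):
from sklearn.neighbors import NearestNeighbors
# Index the first-turn embeddings of the stratified sample
nn_model = NearestNeighbors(n_neighbors=50, metric='cosine').fit(embeddings)
result = personal_inference(
    "Help me refactor this Python function to be more readable.",
    sample_df, embeddings, nn_model, embed_model,
)
print(result['median_percentile'])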
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-01-05 | Initial publication |