MTH-001.5 Observational Chat Analysis
Published
v1.1 January 6, 2026

Characterizing Concerning Usage Sessions

Disaggregating Extended Engagement Patterns to Identify Genuine Risk Signals

Abstract

A methodology for identifying and disaggregating extended AI engagement sessions that may warrant attention from a user wellbeing perspective. Documents a four-criterion composite threshold for sustained engagement, toxicity classification integration at the session level, temporal feature extraction using localized timezones, user persistence categorization, emergence timing analysis, and a five-segment taxonomy with four-level concern gradation. Findings are reported in the associated Dispatch.

Executive Summary

This study develops a methodology for identifying and disaggregating “concerning” usage sessions—extended AI interactions that may warrant attention from a user wellbeing perspective. We define operational criteria that capture sustained, active engagement rather than mere duration, then decompose the resulting population into behaviorally distinct segments.

Key methodological contributions:

  1. A four-criterion composite threshold that distinguishes sustained engagement from idle sessions
  2. Integration of conversation-level toxicity classifications into session-level metrics
  3. Temporal feature extraction using user-localized timezones
  4. A five-segment taxonomy with four-level concern gradation
  5. Emergence timing analysis revealing when concerning behavior first appears

Specific findings and segment distributions are reported in the associated Dispatch.


1. Motivation

1.1 Context

Extended AI engagement has been flagged as a potential concern in human-AI interaction research. However, duration alone is an imprecise proxy for problematic use. A six-hour coding assistance session differs fundamentally from a six-hour late-night session involving harmful content. Effective safety monitoring requires disaggregating these patterns.

1.2 Research Questions

  1. How should “concerning” sessions be operationally defined to capture genuine risk signals?
  2. What proportion of long sessions involve toxicity versus benign extended use?
  3. When do concerning sessions occur (time of day, day of week)?
  4. Are concerning sessions one-time events or recurring patterns for specific users?
  5. When in a user’s history does concerning behavior first emerge?
  6. Can concerning sessions be meaningfully segmented by risk level?

2. Session Construction

Sessions are constructed from individual conversations using a gap-based grouping algorithm. This methodology is shared across the MTH-001 family but is documented here with study-specific parameters.

2.1 Gap Threshold Selection

We group conversations into sessions when the gap between the end of one conversation and the start of the next is less than 30 minutes. This threshold balances two considerations:

| Threshold | Trade-off |
| --- | --- |
| < 15 min | Too aggressive: fragments natural breaks (bathroom, coffee) |
| > 60 min | Too permissive: merges distinct usage episodes |
| 30 min | Selected: captures sustained engagement while allowing brief interruptions |

2.2 Session Construction Algorithm

SESSION_GAP_MINUTES = 30

# conversations_by_user: user_id -> list of dicts with
# 'conversation_start' / 'conversation_end' datetime values
for user_conversations in conversations_by_user.values():
    user_conversations.sort(key=lambda c: c['conversation_start'])
    session_id = 0
    for i, conv in enumerate(user_conversations):
        if i > 0:
            gap_minutes = (conv['conversation_start']
                           - user_conversations[i - 1]['conversation_end']
                           ).total_seconds() / 60
            if gap_minutes >= SESSION_GAP_MINUTES:
                session_id += 1
        conv['session_id'] = session_id

2.3 Session Metrics

For each session, we compute:

| Metric | Definition |
| --- | --- |
| session_start | Timestamp of first turn in first conversation |
| session_end | Timestamp of last turn in last conversation |
| span_hours | (session_end - session_start) / 3600 |
| total_turns | Sum of turns across all conversations in session |
| turns_per_hour | total_turns / span_hours |
| max_gap_minutes | Largest gap between consecutive conversations |
| session_toxicity_ratio | Proportion of conversations flagged as toxic |
| n_conversations | Count of conversations in session |
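The metrics above can be sketched in plain Python. The analysis itself uses polars; this stdlib version is illustrative, and the conversation field names ('start', 'end', 'turns', 'toxic') are assumptions, not names from the analysis code:

```python
def session_metrics(conversations):
    """Compute session-level metrics from one session's conversations.

    Each conversation is a dict with 'start'/'end' datetimes, a turn
    count 'turns', and a boolean 'toxic' label (illustrative fields).
    """
    convs = sorted(conversations, key=lambda c: c['start'])
    start, end = convs[0]['start'], convs[-1]['end']
    span_hours = (end - start).total_seconds() / 3600
    total_turns = sum(c['turns'] for c in convs)
    # Gaps between the end of one conversation and the start of the next
    gaps = [(b['start'] - a['end']).total_seconds() / 60
            for a, b in zip(convs, convs[1:])]
    return {
        'session_start': start,
        'session_end': end,
        'span_hours': span_hours,
        'total_turns': total_turns,
        'turns_per_hour': total_turns / span_hours if span_hours > 0 else 0.0,
        'max_gap_minutes': max(gaps) if gaps else 0.0,
        'session_toxicity_ratio': sum(c['toxic'] for c in convs) / len(convs),
        'n_conversations': len(convs),
    }
```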

3. Defining Concerning Sessions

3.1 Operational Criteria

A session is classified as potentially concerning if it meets all four criteria:

| Criterion | Threshold | Rationale |
| --- | --- | --- |
| Session span | > 6 hours | Extended duration |
| Turn density | ≥ 2 turns/hour | Active engagement (not idle) |
| Max internal gap | < 60 minutes | Sustained attention (not interrupted) |
| Total turns | ≥ 30 | Substantial interaction volume |

These criteria were designed to capture sustained, active engagement rather than sessions left open in background tabs or brief check-ins spread over time.

3.2 Criterion Interaction

Each criterion serves a distinct filtering function:

Duration (> 6 hours): Captures extended engagement episodes. This threshold is deliberately conservative—many productive work sessions exceed this duration.

Turn density (≥ 2 turns/hour): Filters out idle sessions. A session spanning 8 hours with only 5 turns is likely a browser tab left open, not active engagement.

Max internal gap (< 60 minutes): Ensures continuity. A 10-hour “session” with a 4-hour gap in the middle represents two distinct episodes, not sustained engagement.

Total turns (≥ 30): Ensures substantive interaction. Combined with duration, this prevents flagging long but sparse sessions.

3.3 Implementation

concerning = sessions.filter(
    (pl.col('span_hours') > 6) &
    (pl.col('turns_per_hour') >= 2) &
    (pl.col('max_gap_minutes') < 60) &
    (pl.col('total_turns') >= 30)
)

4. Toxicity Analysis

4.1 Toxicity Classification Source

Conversation-level toxicity labels originate from Detoxify, applied during dataset preprocessing (see MTH-001 family documentation). A conversation is labeled toxic if any user turn exceeds the toxicity threshold.

4.2 Session-Level Aggregation

Session toxicity is computed as the proportion of conversations within the session that carry a toxic label:

\text{session\_toxicity\_ratio} = \frac{|\{c \in S : \text{toxic}(c) = \text{True}\}|}{|S|}

Where S is the set of conversations in the session.

4.3 Toxicity Thresholds

We define multiple toxicity levels for analysis:

| Level | Criterion | Interpretation |
| --- | --- | --- |
| Any toxic | toxicity_ratio > 0 | At least one toxic conversation |
| Majority toxic | toxicity_ratio > 0.5 | More than half of conversations toxic |
| Fully toxic | toxicity_ratio == 1.0 | All conversations toxic |
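These cumulative levels can be expressed as a small helper; the function name is ours, not taken from the analysis code:

```python
def toxicity_levels(ratio):
    """Map a session_toxicity_ratio to the three analysis levels.

    Levels are nested: a fully toxic session also satisfies the
    'any toxic' and 'majority toxic' criteria.
    """
    return {
        'any_toxic': ratio > 0,
        'majority_toxic': ratio > 0.5,
        'fully_toxic': ratio == 1.0,
    }
```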

4.4 Baseline Comparison

To contextualize concerning session toxicity, we compare against all non-concerning sessions (the “normal” baseline). Statistical comparison uses the Mann-Whitney U test given non-normal distributions.


5. Temporal Patterns

5.1 Timezone Localization

Session start times are analyzed in the user’s local timezone to accurately classify day/night patterns. Timezone assignment follows this hierarchy:

  1. US users: State-specific timezone (e.g., America/Los_Angeles for California)
  2. Other countries: Capital city timezone (e.g., Europe/Moscow for Russia)
  3. Unknown: UTC fallback
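The hierarchy above can be sketched with the standard-library zoneinfo module. US_STATE_TIMEZONES appears in Appendix B; COUNTRY_TIMEZONES is a hypothetical analogue keyed by country name, and both mappings are truncated here for illustration:

```python
from zoneinfo import ZoneInfo

# Illustrative, truncated mappings (see Appendix B for the US states)
US_STATE_TIMEZONES = {'California': 'America/Los_Angeles'}
COUNTRY_TIMEZONES = {'Russia': 'Europe/Moscow'}

def resolve_timezone(country, us_state=None):
    """Apply the three-step hierarchy: US state, country capital, UTC."""
    if country == 'United States' and us_state in US_STATE_TIMEZONES:
        return ZoneInfo(US_STATE_TIMEZONES[us_state])
    if country in COUNTRY_TIMEZONES:
        return ZoneInfo(COUNTRY_TIMEZONES[country])
    return ZoneInfo('UTC')
```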

5.2 Temporal Features

| Feature | Definition |
| --- | --- |
| start_hour | Hour (0-23) of session start in local time |
| day_of_week | Day of week (0 = Monday, 6 = Sunday) |
| is_late_night | start_hour ∈ [0, 5] |
| is_weekend | day_of_week ∈ {5, 6} |
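A minimal sketch of the feature extraction, assuming a timezone-aware UTC timestamp and a resolved target timezone:

```python
def temporal_features(session_start_utc, tz):
    """Derive the temporal features from a UTC session start time."""
    local = session_start_utc.astimezone(tz)
    start_hour = local.hour        # 0-23 in local time
    day_of_week = local.weekday()  # 0 = Monday, 6 = Sunday
    return {
        'start_hour': start_hour,
        'day_of_week': day_of_week,
        'is_late_night': 0 <= start_hour <= 5,
        'is_weekend': day_of_week in (5, 6),
    }
```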

5.3 Late-Night Definition

We define “late night” as sessions starting between midnight and 5 AM local time. This window captures:

  • Post-midnight engagement (potential sleep displacement)
  • Early morning hours before typical waking time
  • The period associated with reduced inhibition and judgment

5.4 Statistical Testing

Temporal pattern differences between concerning and normal sessions are tested using chi-square tests for categorical comparisons (late night vs. not, weekend vs. weekday).


6. User Persistence

6.1 Persistence Categories

Users are categorized by the number of concerning sessions they exhibit:

| Category | Criterion | Interpretation |
| --- | --- | --- |
| One-time | Exactly 1 concerning session | Isolated incident |
| Occasional | 2 concerning sessions | Infrequent pattern |
| Repeat | 3-5 concerning sessions | Emerging pattern |
| Frequent | 6-10 concerning sessions | Established pattern |
| Heavy | > 10 concerning sessions | Chronic pattern |
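The category boundaries can be expressed directly; the function name and label strings are ours:

```python
def persistence_category(n_concerning):
    """Map a user's concerning-session count to a persistence category."""
    if n_concerning <= 0:
        return None  # user has no concerning sessions
    if n_concerning == 1:
        return 'one_time'
    if n_concerning == 2:
        return 'occasional'
    if n_concerning <= 5:
        return 'repeat'
    if n_concerning <= 10:
        return 'frequent'
    return 'heavy'
```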

6.2 Aggregation Metrics

For users with concerning sessions, we compute:

user_concerning_stats = concerning_sessions.group_by('user_id').agg([
    pl.len().alias('n_concerning'),
    pl.col('span_hours').mean().alias('mean_span'),
    pl.col('session_toxicity_ratio').mean().alias('mean_toxicity'),
    pl.col('total_turns').sum().alias('total_concerning_turns'),
])

6.3 Repeat User Analysis

We compare behavioral characteristics between one-time and repeat users to assess whether repeat users show distinct patterns (e.g., higher toxicity, longer sessions, different timing).


7. Emergence Timing

7.1 Research Question

Does concerning behavior emerge early in a user’s history (suggesting a trait-like characteristic) or develop over time (suggesting learned behavior or escalation)?

7.2 Session Numbering

Each user’s sessions are numbered chronologically:

session_numbered = sessions.sort(['user_id', 'session_start']).with_columns([
    pl.int_range(1, pl.len() + 1).over('user_id').alias('session_number')
])

7.3 First Concerning Session

For each user with concerning sessions, we identify the session number of their first concerning session and compute:

| Metric | Definition |
| --- | --- |
| first_concerning_session_number | Chronological position of first concerning session |
| relative_position | first_concerning_session_number / total_user_sessions |
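A sketch of these two metrics for a single user, assuming parallel lists of 1-based session numbers and concerning flags (the list representation is illustrative):

```python
def emergence_metrics(session_numbers, concerning_flags):
    """Locate a user's first concerning session and its relative position.

    session_numbers: 1-based chronological positions of the user's sessions.
    concerning_flags: parallel booleans marking concerning sessions.
    Returns None when the user has no concerning session.
    """
    total = len(session_numbers)
    for n, flagged in zip(session_numbers, concerning_flags):
        if flagged:
            return {
                'first_concerning_session_number': n,
                'relative_position': n / total,
            }
    return None
```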

7.4 Emergence Distribution Analysis

We analyze what proportion of users exhibit concerning behavior:

  • In their first session
  • Within their first 3 sessions
  • Within their first 10 sessions
  • After 50+ sessions

Early emergence (first 3 sessions) suggests the behavior is trait-like; late emergence suggests development or escalation.


8. Segment Taxonomy

8.1 Segmentation Logic

Concerning sessions are classified into behaviorally distinct segments based on toxicity and timing:

def classify_concerning_session(row):
    toxicity = row['session_toxicity_ratio']
    is_late_night = row['is_late_night']
    turns_per_hour = row['turns_per_hour']
    
    if toxicity > 0.5:
        if is_late_night:
            return 'problematic_content_late_night', 4  # Highest concern
        else:
            return 'problematic_content', 3
    elif is_late_night:
        return 'extended_late_night', 2
    elif turns_per_hour > 20:  # Very high intensity
        return 'high_intensity_unknown', 2
    else:
        return 'extended_general_use', 1  # Lowest concern

8.2 Segment Definitions

| Segment | Criteria | Concern Level | Interpretation |
| --- | --- | --- | --- |
| extended_general_use | Low toxicity, daytime, moderate intensity | 1 (Low) | Likely productive extended use |
| extended_late_night | Low toxicity, late night | 2 (Medium) | Sleep pattern concerns |
| high_intensity_unknown | Low toxicity, > 20 turns/hour | 2 (Medium) | Very rapid engagement |
| problematic_content | > 50% toxic, daytime | 3 (Higher) | Content concerns |
| problematic_content_late_night | > 50% toxic, late night | 4 (Highest) | Combined risk factors |

8.3 Concern Level Rationale

The four-level gradation reflects compounding risk factors:

  • Level 1: Extended engagement alone is not inherently concerning
  • Level 2: Timing (late night) or intensity adds modest concern
  • Level 3: Problematic content is a direct concern signal
  • Level 4: Combined factors (content + timing) warrant highest attention

9. Validation

9.1 Matched Control Comparison

To validate that concerning sessions represent genuinely distinct behavior (not just extreme values), we compare users with concerning sessions to matched controls.

Matching criteria: Users are matched on session count (±2 sessions) to control for overall engagement level.

Comparison metrics:

  • Total turns
  • Mean session toxicity
  • Mean session span
  • Late-night session proportion
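The matching step can be sketched as a greedy, without-replacement pairing; this is one plausible implementation under the stated ±2-session criterion, not the study's actual matching code, and the dict-based inputs are assumptions:

```python
import random

def match_controls(cases, pool, tolerance=2, seed=0):
    """Match each case user to a control with a similar session count.

    cases / pool: dicts mapping user_id -> total session count.
    Each control is used at most once; candidates within +/- tolerance
    sessions are sampled at random (seeded for reproducibility).
    """
    rng = random.Random(seed)
    available = dict(pool)
    matches = {}
    for user, n_sessions in cases.items():
        candidates = [u for u, n in available.items()
                      if abs(n - n_sessions) <= tolerance]
        if candidates:
            control = rng.choice(candidates)
            matches[user] = control
            del available[control]  # without replacement
    return matches
```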

9.2 Statistical Testing

For continuous metrics, we use Mann-Whitney U tests (non-normal distributions). For categorical comparisons, we use chi-square tests.

Significance threshold: p < 0.001 (Bonferroni-corrected for multiple comparisons)

9.3 Expected Findings

If the methodology correctly identifies distinct behavior patterns, we expect:

  1. Concerning session users to show higher mean toxicity (after controlling for session count)
  2. Concerning session users to show different temporal patterns
  3. Segment distributions to be non-uniform (genuine heterogeneity)

10. Limitations

| Limitation | Impact | Mitigation |
| --- | --- | --- |
| Toxicity classifier accuracy | False positives/negatives in toxicity labels | Use session-level aggregation to reduce single-conversation errors |
| Timezone inference | Some users have incorrect timezone assignment | UTC fallback; acknowledge uncertainty |
| Selection bias | WildChat users are not representative of all AI users | Findings may not generalize to other populations |
| Single-session definition | 30-minute gap may not be optimal for all use patterns | Sensitivity analysis with 15- and 60-minute gaps |
| Toxicity as surface signal | Classifier detects linguistic patterns, not intent | Avoid causal claims about user motivation |
| Late-night as proxy | Late-night use has legitimate reasons (shift workers, students) | Segment taxonomy allows non-pathological interpretation |

10.1 Ethical Considerations

This methodology identifies potential concern signals, not definitive risk. Key principles:

  1. No individual prediction: These methods characterize population-level patterns, not individual risk
  2. Privacy preservation: User IDs are hashed; no personally identifiable information
  3. Benefit orientation: Goal is to inform product safety design, not surveillance
  4. Interpretation caution: Extended use may reflect value, not harm (e.g., coding assistance)

11. Code

Analysis notebooks are available on GitHub:


Appendix A: Threshold Sensitivity

A.1 Turn Density Sensitivity

| Turns/hour threshold | Effect |
| --- | --- |
| ≥ 1 | Includes sparse but long sessions |
| ≥ 2 | Selected: balances coverage and specificity |
| ≥ 5 | Excludes contemplative use patterns |
| ≥ 10 | Captures only rapid-fire interactions |

Appendix B: Timezone Mapping

B.1 US State Timezones

US_STATE_TIMEZONES = {
    'California': 'America/Los_Angeles',
    'New York': 'America/New_York',
    'Texas': 'America/Chicago',
    'Florida': 'America/New_York',
    # ... (full mapping in notebook)
}

B.2 Country Default Timezones

Countries are mapped to their capital city timezone. For countries spanning multiple zones (Russia, US, Australia), the most populous timezone is used.


Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 1.1 | 2026-01-06 | Corrected toxicity classifier reference (Detoxify, not Jigsaw Perspective); removed erroneous threshold table |
| 1.0 | 2026-01-05 | Initial publication |