MTH-001.1 Observational Chat Analysis
Published
v1.0 December 29, 2025

Protest Behavior Analysis

Methodology for Detecting and Analyzing AI Refusal Patterns

Abstract

This report presents a methodology for detecting, classifying, and analyzing AI protest behavior in conversational datasets. It documents the development of a protest classifier using active learning (220 hand-labeled examples, F1 = 0.900), survival curve analysis for measuring conversation persistence, caving detection for identifying compliance after refusal, and linguistic pattern analysis for characterizing refusal strategies. Findings are reported in the associated Dispatch.

Executive Summary

This study documents the methodology for detecting and analyzing AI protest behavior—instances where language models refuse requests. The approach implements:

  1. Protest classifier development using active learning with uncertainty sampling
  2. Threshold optimization to balance precision and recall
  3. Survival curve analysis to measure conversation persistence after protests
  4. Caving detection to identify compliance following initial refusal
  5. Linguistic pattern analysis to characterize refusal strategies

The methodology enables systematic study of how AI systems decline requests and how users respond. Specific findings are reported in the associated Dispatch.


1. Protest Classifier

1.1 Architecture

The protest classifier uses a two-stage pipeline:

  1. Feature Extraction: TF-IDF (Term Frequency-Inverse Document Frequency) vectorization

    • Maximum features: 5,000
    • N-gram range: unigrams and bigrams (1, 2)
  2. Classification: Logistic Regression with L2 regularization

    • Maximum iterations: 1,000
    • Random state: 42 (for reproducibility)
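
A minimal sketch of this two-stage pipeline in scikit-learn, using the settings listed above (the specific class and variable names are assumptions for illustration, not the project's published code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stage 1: TF-IDF features (5,000 max features, unigrams + bigrams)
# Stage 2: L2-regularized logistic regression
protest_clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(penalty="l2", max_iter=1000, random_state=42)),
])

# texts: list of assistant-message strings; labels: 1 = protest, 0 = non-protest
# protest_clf.fit(texts, labels)
# protest_probs = protest_clf.predict_proba(new_texts)[:, 1]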

1.2 Training Procedure

The classifier was trained using an active learning approach to maximize label efficiency:

Round 0 (Seed Examples):

  • 50 examples matching protest regex patterns (e.g., “I cannot”, “I’m unable to”, “I apologize but”)
  • 50 randomly sampled assistant messages
  • Total: 100 labeled examples

Rounds 1-3 (Uncertainty Sampling):

  • 50 examples per round selected by uncertainty sampling
  • Examples with predicted probability closest to 0.5 are prioritized

Final Training Set: 220 hand-labeled examples

1.3 Active Learning: Uncertainty Sampling

Uncertainty sampling prioritizes examples where the model is least confident. For each unlabeled example x, the uncertainty score is computed as:

\text{uncertainty}(x) = | P(y=1|x) - 0.5 |

Examples with the lowest uncertainty scores (predictions closest to 0.5) are selected for labeling. This approach focuses human labeling effort on the decision boundary, maximizing information gain per labeled example.
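
A minimal sketch of one selection round, assuming a fitted classifier exposing predict_proba and a pool of unlabeled candidate texts (function and variable names are illustrative):

import numpy as np

def select_uncertain(clf, unlabeled_texts, k=50):
    """Return indices of the k examples whose predicted probability is closest to 0.5."""
    probs = clf.predict_proba(unlabeled_texts)[:, 1]  # P(y=1|x) for each candidate
    uncertainty = np.abs(probs - 0.5)                 # distance from the decision boundary
    return np.argsort(uncertainty)[:k]                # most uncertain candidates first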

1.4 Classifier Development Dataset

A stratified sample was used for classifier development and validation:

| Split | Conversations | Shards | Purpose |
|---|---|---|---|
| Training pool | 50,000 | 20 | Active learning candidate extraction |
| Test pool | 25,000 | 10 | Held-out evaluation |
| Total | 75,000 | 30 | |

Shards were randomly assigned to either training or test pools (no overlap) to prevent data leakage.

1.5 Validation

The classifier was evaluated using an 80/20 train/validation split of the 220 hand-labeled examples. At each round of active learning, approximately 80% of labeled examples were used for training and 20% (~44 examples at final round) were held out for validation.
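
For illustration, such a split could be produced with scikit-learn's train_test_split (a sketch; whether the actual split was stratified or seeded this way is not specified above):

from sklearn.model_selection import train_test_split

# labeled_texts, labels: the hand-labeled examples accumulated so far
train_texts, val_texts, train_labels, val_labels = train_test_split(
    labeled_texts, labels, test_size=0.2, stratify=labels, random_state=42
)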


2. Threshold Selection

2.1 Threshold Analysis

Logistic regression outputs a probability P(y=1|x). A threshold τ converts this probability to a binary prediction:

  • ŷ = 1 if P(y=1|x) ≥ τ
  • ŷ = 0 otherwise

The default threshold of 0.5 was evaluated against alternatives:

| Threshold | Precision | Recall | F1 Score |
|---|---|---|---|
| 0.20 | 0.413 | 1.000 | 0.585 |
| 0.25 | 0.542 | 1.000 | 0.703 |
| 0.30 | 0.748 | 0.964 | 0.842 |
| 0.35 | 0.848 | 0.940 | 0.891 |
| 0.40 | 0.935 | 0.867 | 0.900 |
| 0.45 | 0.972 | 0.831 | 0.896 |
| 0.50 | 0.984 | 0.759 | 0.857 |
| 0.55 | 1.000 | 0.614 | 0.761 |
| 0.60 | 1.000 | 0.518 | 0.683 |

Selected threshold: 0.40 (maximizes F1 score at 0.900)
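
The sweep above can be reproduced from the held-out labels and predicted probabilities; a sketch (variable names are illustrative):

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# val_labels: true labels on the held-out set; val_probs: predicted P(y=1|x) as an array
for tau in np.round(np.arange(0.20, 0.65, 0.05), 2):
    preds = (val_probs >= tau).astype(int)  # binarize at threshold tau
    print(f"tau={tau:.2f}  "
          f"precision={precision_score(val_labels, preds):.3f}  "
          f"recall={recall_score(val_labels, preds):.3f}  "
          f"F1={f1_score(val_labels, preds):.3f}")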

2.2 Final Classifier Performance

| Metric | Value |
|---|---|
| ROC AUC | 0.976 |
| F1 Score (τ=0.4) | 0.900 |
| Precision (τ=0.4) | 0.935 |
| Recall (τ=0.4) | 0.867 |
| Total labeled examples | 220 |

2.3 Precision and Recall Definitions

For the protest classifier:

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Where:

  • True Positive: Model correctly identifies a protest message
  • False Positive: Model incorrectly labels a non-protest as protest
  • False Negative: Model fails to identify an actual protest
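
As a worked example with hypothetical counts (not the study's confusion matrix): 90 true positives, 6 false positives, and 14 false negatives give precision = 90/96 ≈ 0.938, recall = 90/104 ≈ 0.865, and F1 ≈ 0.900.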

3. Survival Curves

3.1 Survival Function Definition

The survival function S(t) represents the proportion of conversations that continue to at least t turns. This is analogous to survival analysis in medical statistics, where we measure how long “subjects” (conversations) “survive” (continue).

S(t) = P(\text{turns} \geq t) = \frac{|\{c : n_{\text{turns}}(c) \geq t\}|}{N}

Where:

  • c = a conversation
  • n_turns(c) = number of turns in conversation c
  • N = total number of conversations in the group
  • t = turn threshold (computed for t ∈ [1, 50])

3.2 Computational Implementation

The survival curve is computed empirically for each integer value of t:

import numpy as np

# turns_protest / turns_no_protest: 1-D arrays of per-conversation turn counts
max_turns = 50
x = np.arange(1, max_turns + 1)

# For conversations with protests
survival_protest = np.array([np.mean(turns_protest >= t) for t in x])

# For conversations without protests
survival_no_protest = np.array([np.mean(turns_no_protest >= t) for t in x])
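
Since each conversation's full length is observed (there is no censoring), the curves can be plotted directly from the arrays above; a sketch with matplotlib:

import matplotlib.pyplot as plt

plt.step(x, survival_protest, where="post", label="With protest")
plt.step(x, survival_no_protest, where="post", label="Without protest")
plt.xlabel("Turn threshold t")
plt.ylabel("S(t): share of conversations with ≥ t turns")
plt.legend()
plt.show()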

3.3 Stratification

Survival curves are computed separately for:

  1. By Protest Status:

    • With protest (n=175,203)
    • Without protest (n=4,568,133)
  2. By User Type:

    • Regular users: Users with fewer than 100 conversations
    • Power users: Users with ≥100 conversations (n=1,705 users)

3.4 Interpretation

  • S(1) = 1.0 for all groups (all conversations have at least 1 turn)
  • S(t) decreases monotonically as t increases
  • Steeper decline indicates shorter conversations
  • Differences between curves indicate behavioral differences between groups

4. Protest Rate

4.1 Conversation-Level Protest Rate

The protest rate for a group of conversations is:

\text{Protest Rate} = \frac{\text{Number of conversations with protest}}{\text{Total conversations in group}} \times 100\%

4.2 Model-Specific Rates

Protest rates by model family (for conversations with ≥4 turns):

| Model Family | Protest Rate | Count |
|---|---|---|
| GPT-3.5 | 3.0% | 311,944 |
| GPT-4 | 1.5% | 159,333 |
| GPT-4o | 2.0% | 224,626 |
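
These per-family rates follow directly from the conversation-level table described in Section 7.2; a sketch with pandas (assuming that table is loaded as a DataFrame df; the load path shown is hypothetical):

import pandas as pd

# df: conversation-level table from Section 7.2 (hypothetical path)
# df = pd.read_parquet("conversations.parquet")

subset = df[df["n_turns"] >= 4]  # conversations with at least 4 turns
by_family = subset.groupby("model_family")["has_protest"].agg(["mean", "size"])
by_family["mean"] *= 100         # protest rate in percent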

5. Caving Analysis

5.1 Definition of Caving

A conversation exhibits “caving” when:

  1. The AI produces at least 2 protest messages
  2. After the second protest, the AI produces a non-protest response (i.e., complies with the request)
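
A minimal sketch of this rule, applied to the ordered sequence of classifier flags over one conversation's assistant messages (the function name is illustrative):

def detect_caving(protest_flags):
    """protest_flags: ordered booleans, one per assistant message (True = protest)."""
    protests_seen = 0
    for is_protest in protest_flags:
        if is_protest:
            protests_seen += 1
        elif protests_seen >= 2:
            # Non-protest (compliant) response after the second protest
            return True
    return False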

5.2 Caving Rate

\text{Caving Rate} = \frac{\text{Conversations where AI caved}}{\text{Conversations with} \geq 2 \text{ protests}} \times 100\%

Observed caving rate: 61.2% (63/103 in toxic conversations with ≥2 protests)

5.3 Interpretation

The high caving rate suggests that user persistence is often rewarded—repeated attempts frequently lead to eventual compliance. This has implications for understanding the effectiveness of safety guardrails under adversarial pressure.


6. Linguistic Patterns

6.1 Pattern Categories

Protest messages were analyzed for common linguistic patterns using regular expressions:

| Pattern Category | Description | Prevalence |
|---|---|---|
| Direct Refusal | “I cannot…”, “I’m unable to…”, “I won’t…” | 54.4% |
| Apology + Refusal | “I apologize, but I cannot…” | 52.1% |
| Policy/Guidelines | References to content policies | 2.2% |
| AI Identity | “As an AI…”, “As a language model…” | 1.9% |
| Content-Specific | Cites specific content types (sexual, violent, etc.) | 1.2% |
| Harmful Content | Concerns about harmful content | 0.6% |
| Alternative Offer | “However, I can help with…” | 0.3% |

Note: Categories are non-exclusive; a single message may match multiple patterns.

6.2 Pattern Extraction

import re

# Non-exclusive pattern categories; a single message may match several
PROTEST_PATTERNS = {
    'direct_refusal': r"I (cannot|can't|won't|will not|am not able to)",
    'apology_refusal': r"I ?(apologize|'m sorry),? but",  # "I apologize, but" / "I'm sorry, but"
    'policy_reference': r"(content polic|guideline|terms of service)",
    'ai_identity': r"As an? (AI|artificial intelligence|language model)",
    'content_specific': r"(sexual|violent|illegal|harmful) content",
    'alternative_offer': r"(However|Instead),? I can"
}

def extract_patterns(text):
    # Case-insensitive search for each pattern; returns {category: matched?}
    return {name: bool(re.search(pattern, text, re.I))
            for name, pattern in PROTEST_PATTERNS.items()}
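
For example, with an illustrative input:

extract_patterns("I apologize, but I cannot write that. However, I can help with a summary.")
# -> {'direct_refusal': True, 'apology_refusal': True, 'policy_reference': False,
#     'ai_identity': False, 'content_specific': False, 'alternative_offer': True}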

7. Data Pipeline

7.1 Preprocessing Steps

  1. Load raw WildChat data (127 parquet shards, ~37,000 conversations each)
  2. Extract conversation metadata: timestamp, model, toxicity flags
  3. Apply protest classifier to each assistant message
  4. Aggregate at conversation level: count protests, identify caving
  5. Join with user identifiers for user-level analysis
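
A sketch of the conversation-level aggregation in step 4, assuming the per-message classifier outputs sit in a pandas DataFrame with one row per assistant message (column and helper names are illustrative; detect_caving is the rule sketched in Section 5.1):

import pandas as pd

def aggregate_conversation(msgs: pd.DataFrame) -> pd.Series:
    """msgs: one conversation's assistant messages, in order, with an 'is_protest' column."""
    flags = msgs["is_protest"].tolist()
    return pd.Series({
        "n_protests": int(sum(flags)),
        "has_protest": any(flags),
        "caved": detect_caving(flags),  # caving rule from Section 5.1
    })

# One row per conversation, keyed by conversation_hash
# conv_df = assistant_msgs.groupby("conversation_hash").apply(aggregate_conversation)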

7.2 Output Schema

Each processed conversation includes:

| Field | Type | Description |
|---|---|---|
| conversation_hash | string | Unique conversation identifier |
| timestamp | datetime | Conversation start time |
| model | string | Model name (e.g., “gpt-4”) |
| model_family | string | Simplified model family |
| is_toxic | boolean | Any turn flagged by moderation |
| n_turns | integer | Total number of turns |
| n_protests | integer | Number of protest messages |
| has_protest | boolean | Whether n_protests ≥ 1 |
| caved | boolean | Non-protest response after ≥2 protests |
| user_id | string | Composite user identifier |

8. Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Small labeled dataset | 220 examples may not capture all protest variations | Active learning maximizes coverage; high F1 suggests adequacy |
| Regex-based pattern analysis | May miss novel or implicit refusal strategies | Patterns are descriptive, not exhaustive |
| Binary classification | Degrees of refusal (soft vs. hard) not captured | Future work could use ordinal classification |
| English only | Non-English protests may differ structurally | Limit claims to English conversations |
| Caving definition | Requires at least 2 prior protests before compliance counts | Sensitivity analysis with alternative thresholds |
| No intent inference | Cannot distinguish genuine refusal from roleplay | Note this limitation in interpretation |

9. Code

Analysis notebooks are available on GitHub.


Changelog

| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-12-29 | Initial publication |