Protest Behavior Analysis
Methodology for Detecting and Analyzing AI Refusal Patterns
A methodology for detecting, classifying, and analyzing AI protest behavior in conversational datasets. Documents the development of a protest classifier using active learning (220 hand-labeled examples, F1=0.900), survival curve analysis for measuring conversation persistence, caving detection for identifying compliance after refusal, and linguistic pattern analysis for characterizing refusal strategies. Findings are reported in the associated Dispatch.
Executive Summary
This study documents the methodology for detecting and analyzing AI protest behavior, i.e., instances where language models refuse requests. The approach comprises:
- Protest classifier development using active learning with uncertainty sampling
- Threshold optimization to balance precision and recall
- Survival curve analysis to measure conversation persistence after protests
- Caving detection to identify compliance following initial refusal
- Linguistic pattern analysis to characterize refusal strategies
The methodology enables systematic study of how AI systems decline requests and how users respond. Specific findings are reported in the associated Dispatch.
1. Protest Classifier
1.1 Architecture
The protest classifier uses a two-stage pipeline:
1. Feature Extraction: TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
   - Maximum features: 5,000
   - N-gram range: unigrams and bigrams (1, 2)
2. Classification: Logistic Regression with L2 regularization
   - Maximum iterations: 1,000
   - Random state: 42 (for reproducibility)
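In sketch form, the two stages compose into a single scikit-learn pipeline. The hyperparameters below are the ones listed above; the pipeline structure and variable names are illustrative assumptions, not the project's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Two-stage pipeline: TF-IDF features into an L2-regularized logistic regression.
protest_clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(penalty="l2", max_iter=1000, random_state=42)),
])

# texts: list of assistant-message strings; labels: 1 = protest, 0 = non-protest
# protest_clf.fit(texts, labels)
# probs = protest_clf.predict_proba(new_texts)[:, 1]
```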
1.2 Training Procedure
The classifier was trained using an active learning approach to maximize label efficiency:
Round 0 (Seed Examples):
- 50 examples matching protest regex patterns (e.g., “I cannot”, “I’m unable to”, “I apologize but”)
- 50 randomly sampled assistant messages
- Total: 100 labeled examples
Rounds 1-3 (Uncertainty Sampling):
- 50 examples per round selected by uncertainty sampling
- Examples with predicted probability closest to 0.5 are prioritized
Final Training Set: 220 hand-labeled examples
1.3 Active Learning: Uncertainty Sampling
Uncertainty sampling prioritizes examples where the model is least confident. For each unlabeled example $x$, the uncertainty score is computed as:

$$u(x) = \left|\, \hat{p}(x) - 0.5 \,\right|$$

where $\hat{p}(x)$ is the classifier's predicted probability that $x$ is a protest.
Examples with the lowest uncertainty scores (predictions closest to 0.5) are selected for labeling. This approach focuses human labeling effort on the decision boundary, maximizing information gain per labeled example.
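A minimal sketch of one sampling round, assuming a fitted classifier that exposes predict_proba (the function and variable names are illustrative):

```python
import numpy as np

def select_for_labeling(clf, unlabeled_texts, batch_size=50):
    """Pick the batch_size examples closest to the decision boundary.

    u(x) = |p_hat(x) - 0.5|; lower scores mean the classifier is less
    certain, so those examples are the most informative to label.
    """
    p_hat = clf.predict_proba(unlabeled_texts)[:, 1]  # P(protest)
    uncertainty = np.abs(p_hat - 0.5)
    return np.argsort(uncertainty)[:batch_size]  # lowest scores first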
1.4 Classifier Development Dataset
A stratified sample was used for classifier development and validation:
| Split | Conversations | Shards | Purpose |
|---|---|---|---|
| Training pool | 50,000 | 20 | Active learning candidate extraction |
| Test pool | 25,000 | 10 | Held-out evaluation |
| Total | 75,000 | 30 | |
Shards were randomly assigned to either training or test pools (no overlap) to prevent data leakage.
1.5 Validation
The classifier was evaluated using an 80/20 train/validation split of the 220 hand-labeled examples. At each round of active learning, approximately 80% of labeled examples were used for training and 20% (~44 examples at final round) were held out for validation.
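In code, one round's evaluation might look like the following sketch; whether the split was stratified, and the random seed, are assumptions, and texts, labels, and protest_clf carry over from the earlier sketches.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 80/20 split of the hand-labeled examples; stratification is an assumption.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

protest_clf.fit(X_train, y_train)
print(f1_score(y_val, protest_clf.predict(X_val)))
```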
2. Threshold Selection
2.1 Threshold Analysis
Logistic regression outputs a probability $\hat{p} \in [0, 1]$. A threshold $\tau$ converts this to a binary prediction:
- $\hat{y} = 1$ (protest) if $\hat{p} \geq \tau$
- $\hat{y} = 0$ (non-protest) otherwise
The default threshold of 0.5 was evaluated against alternatives:
| Threshold | Precision | Recall | F1 Score |
|---|---|---|---|
| 0.20 | 0.413 | 1.000 | 0.585 |
| 0.25 | 0.542 | 1.000 | 0.703 |
| 0.30 | 0.748 | 0.964 | 0.842 |
| 0.35 | 0.848 | 0.940 | 0.891 |
| 0.40 | 0.935 | 0.867 | 0.900 |
| 0.45 | 0.972 | 0.831 | 0.896 |
| 0.50 | 0.984 | 0.759 | 0.857 |
| 0.55 | 1.000 | 0.614 | 0.761 |
| 0.60 | 1.000 | 0.518 | 0.683 |
Selected threshold: 0.40 (maximizes F1 score at 0.900)
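The sweep itself is straightforward. A sketch, assuming held-out labels y_val and the fitted classifier from the Section 1.5 sketch:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Predicted P(protest) on the held-out validation split.
p_val = protest_clf.predict_proba(X_val)[:, 1]

# Evaluate each candidate threshold.
for tau in np.arange(0.20, 0.625, 0.05):
    y_pred = (p_val >= tau).astype(int)
    print(f"tau={tau:.2f}  "
          f"P={precision_score(y_val, y_pred):.3f}  "
          f"R={recall_score(y_val, y_pred):.3f}  "
          f"F1={f1_score(y_val, y_pred):.3f}")
```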
2.2 Final Classifier Performance
| Metric | Value |
|---|---|
| ROC AUC | 0.976 |
| F1 Score (τ=0.4) | 0.900 |
| Precision (τ=0.4) | 0.935 |
| Recall (τ=0.4) | 0.867 |
| Total labeled examples | 220 |
2.3 Precision and Recall Definitions
For the protest classifier:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Where:
- True Positive: Model correctly identifies a protest message
- False Positive: Model incorrectly labels a non-protest as protest
- False Negative: Model fails to identify an actual protest
3. Survival Curves
3.1 Survival Function Definition
The survival function $S(t)$ represents the proportion of conversations that continue to at least $t$ turns. This is analogous to survival analysis in medical statistics, where we measure how long “subjects” (conversations) “survive” (continue).

$$S(t) = \frac{\left|\{\, c : T(c) \geq t \,\}\right|}{N}$$

Where:
- $c$ = a conversation
- $T(c)$ = number of turns in conversation $c$
- $N$ = total number of conversations in the group
- $t$ = turn threshold (computed for $t = 1, \dots, 50$)
3.2 Computational Implementation
The survival curve is computed empirically for each integer value of $t$:
```python
import numpy as np

# turns_protest / turns_no_protest: arrays of per-conversation turn counts
max_turns = 50
x = np.arange(1, max_turns + 1)

# For conversations with protests
survival_protest = np.array([np.mean(turns_protest >= t) for t in x])

# For conversations without protests
survival_no_protest = np.array([np.mean(turns_no_protest >= t) for t in x])
```
3.3 Stratification
Survival curves are computed separately for:
1. By Protest Status:
   - With protest (n=175,203)
   - Without protest (n=4,568,133)
2. By User Type:
   - Regular users: fewer than 100 conversations
   - Power users: ≥100 conversations (n=1,705 users)
3.4 Interpretation
- $S(1) = 1$ for all groups (all conversations have at least 1 turn)
- $S(t)$ decreases monotonically as $t$ increases
- Steeper decline indicates shorter conversations
- Differences between curves indicate behavioral differences between groups
4. Protest Rate
4.1 Conversation-Level Protest Rate
The protest rate for a group of conversations is the fraction containing at least one protest message:

$$\text{Protest Rate} = \frac{\#\{\text{conversations with} \geq 1 \text{ protest}\}}{\#\{\text{conversations in group}\}}$$
4.2 Model-Specific Rates
Protest rates by model family (for conversations with ≥4 turns):
| Model Family | Protest Rate | Count |
|---|---|---|
| GPT-3.5 | 3.0% | 311,944 |
| GPT-4 | 1.5% | 159,333 |
| GPT-4o | 2.0% | 224,626 |
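With the conversation-level schema from Section 7.2, these rates reduce to a groupby. A sketch, where df is the processed conversation table (the DataFrame name is an assumption):

```python
# df: one row per conversation, with columns as in Section 7.2.
rates = (df[df["n_turns"] >= 4]
         .groupby("model_family")["has_protest"]
         .agg(rate="mean", count="size"))
print(rates)
```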
5. Caving Analysis
5.1 Definition of Caving
A conversation exhibits “caving” when:
- The AI produces at least 2 protest messages
- After the second protest, the AI produces a non-protest response (i.e., complies with the request)
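A minimal sketch of this check, taking a conversation's assistant messages as an ordered list of protest flags (the function name is illustrative):

```python
def detect_caving(protest_flags):
    """protest_flags: per-assistant-message booleans, in turn order.

    True if the conversation contains at least 2 protests and a
    non-protest assistant message appears after the second protest.
    """
    protest_positions = [i for i, flagged in enumerate(protest_flags) if flagged]
    if len(protest_positions) < 2:
        return False
    second_protest = protest_positions[1]
    return any(not flagged for flagged in protest_flags[second_protest + 1:])
```

For example, detect_caving([True, True, False]) returns True, while detect_caving([True, False, True]) returns False because no non-protest response follows the second protest.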
5.2 Caving Rate
Observed caving rate: 61.2% (63/103 in toxic conversations with ≥2 protests)
5.3 Interpretation
The high caving rate suggests that user persistence is often rewarded—repeated attempts frequently lead to eventual compliance. This has implications for understanding the effectiveness of safety guardrails under adversarial pressure.
6. Linguistic Patterns
6.1 Pattern Categories
Protest messages were analyzed for common linguistic patterns using regular expressions:
| Pattern Category | Description | Prevalence |
|---|---|---|
| Direct Refusal | “I cannot…”, “I’m unable to…”, “I won’t…” | 54.4% |
| Apology + Refusal | “I apologize, but I cannot…” | 52.1% |
| Policy/Guidelines | References to content policies | 2.2% |
| AI Identity | “As an AI…”, “As a language model…” | 1.9% |
| Content-Specific | Cites specific content types (sexual, violent, etc.) | 1.2% |
| Harmful Content | Concerns about harmful content | 0.6% |
| Alternative Offer | “However, I can help with…” | 0.3% |
Note: Categories are non-exclusive; a single message may match multiple patterns.
6.2 Pattern Extraction
```python
import re

PROTEST_PATTERNS = {
    'direct_refusal': r"I (cannot|can't|won't|will not|am not able to)",
    'apology_refusal': r"I(?: apologize|'m sorry| am sorry),? but",
    'policy_reference': r"(content polic|guideline|terms of service)",
    'ai_identity': r"As an? (AI|artificial intelligence|language model)",
    'content_specific': r"(sexual|violent|illegal|harmful) content",
    'alternative_offer': r"(However|Instead),? I can",
}

def extract_patterns(text):
    """Map each pattern category to whether it matches (case-insensitive)."""
    return {name: bool(re.search(pattern, text, re.I))
            for name, pattern in PROTEST_PATTERNS.items()}
```
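For example:

```python
>>> extract_patterns("I apologize, but I cannot write that story.")
{'direct_refusal': True, 'apology_refusal': True, 'policy_reference': False,
 'ai_identity': False, 'content_specific': False, 'alternative_offer': False}
```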
7. Data Pipeline
7.1 Preprocessing Steps
1. Load raw WildChat data (127 parquet shards, ~37,000 conversations each)
2. Extract conversation metadata: timestamp, model, toxicity flags
3. Apply the protest classifier to each assistant message
4. Aggregate at conversation level: count protests, identify caving
5. Join with user identifiers for user-level analysis
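A sketch of the per-shard pass, reusing the sketches from earlier sections; the file paths, the WildChat column layout, and the in-loop threshold application are assumptions rather than verbatim project code:

```python
import glob
import pandas as pd

rows = []
for path in glob.glob("wildchat/*.parquet"):  # shard location is assumed
    shard = pd.read_parquet(path)
    for _, conv in shard.iterrows():
        # conv["conversation"]: ordered list of {"role", "content", ...} turns
        msgs = [m["content"] for m in conv["conversation"]
                if m["role"] == "assistant"]
        probs = protest_clf.predict_proba(msgs)[:, 1] if msgs else []
        flags = [p >= 0.40 for p in probs]  # tau from Section 2.1
        rows.append({
            "conversation_hash": conv["conversation_hash"],
            "timestamp": conv["timestamp"],
            "model": conv["model"],
            "is_toxic": conv["toxic"],
            "n_turns": len(conv["conversation"]),
            "n_protests": int(sum(flags)),
            "has_protest": any(flags),
            "caved": detect_caving(flags),  # sketch from Section 5.1
        })

df = pd.DataFrame(rows)
```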
7.2 Output Schema
Each processed conversation includes:
| Field | Type | Description |
|---|---|---|
| conversation_hash | string | Unique conversation identifier |
| timestamp | datetime | Conversation start time |
| model | string | Model name (e.g., “gpt-4”) |
| model_family | string | Simplified model family |
| is_toxic | boolean | Any turn flagged by moderation |
| n_turns | integer | Total number of turns |
| n_protests | integer | Number of protest messages |
| has_protest | boolean | Whether n_protests ≥ 1 |
| caved | boolean | Non-protest response after ≥2 protests |
| user_id | string | Composite user identifier |
8. Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| Small labeled dataset | 220 examples may not capture all protest variations | Active learning maximizes coverage; high F1 suggests adequacy |
| Regex-based pattern analysis | May miss novel or implicit refusal strategies | Patterns are descriptive, not exhaustive |
| Binary classification | Degrees of refusal (soft vs. hard) not captured | Future work could use ordinal classification |
| English only | Non-English protests may differ structurally | Limit claims to English conversations |
| Caving definition | Compliance is only checked after the second protest; other protest-count thresholds untested | Sensitivity analysis with alternative thresholds |
| No intent inference | Cannot distinguish genuine refusal from roleplay | Note this limitation in interpretation |
9. Code
Analysis notebooks are available on GitHub:
- PublicationFigures.ipynb — Protest classifier development and publication figures
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-12-29 | Initial publication |