AI Scraper Scoring

WebDecoy includes a dedicated scoring dimension for detecting AI training crawlers. It runs independently of attack detection because the two systems interpret the same signals in fundamentally different ways.

The Problem: AI Scrapers Don’t Look Like Attackers

When GPTBot crawls your website, it honestly identifies itself in the User-Agent header. From an attack detection perspective, this is low threat—the bot isn’t trying to hide or evade detection.

But from a content protection perspective, this is the highest possible threat—a known AI training crawler is actively collecting your content.

Same signal, opposite interpretations:
┌─────────────────────────────────────────────────────────────────┐
│ GPTBot Request │
│ User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like │
│ Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Attack Detection View │ Content Protection View │
│ ────────────────────── │ ──────────────────────── │
│ "This bot isn't hiding" │ "Known AI training crawler" │
│ User-Agent easily spoofed │ Confirmed identity │
│ Score: 14 (MINIMAL) │ Score: 85 (CRITICAL) │
│ Verdict: Low threat │ Verdict: High threat │
│ │
└─────────────────────────────────────────────────────────────────┘

WebDecoy shows both scores side-by-side because they answer different questions:

| Score | Question It Answers |
| --- | --- |
| Attack Risk | “Is this visitor trying to attack or evade detection?” |
| AI Scraper Risk | “Is this visitor an AI crawler collecting content for training?” |

For a detected GPTBot request:

┌─────────────────────────┬─────────────────────────┐
│ Attack Risk │ AI Scraper Risk │
│ │ │
│ 14 │ 85 │
│ MINIMAL │ CRITICAL │
│ │ │
│ "Not trying to hide" │ "GPTBot detected" │
└─────────────────────────┴─────────────────────────┘

Key insight: Honest self-identification is LOW threat for attack detection but HIGH confidence for AI scraper detection.


AI Scraper scoring uses inverted signal weighting compared to attack detection:

| Signal | Attack Scoring | AI Scraper Scoring | Reasoning |
| --- | --- | --- | --- |
| Honest bot UA (GPTBot, ClaudeBot) | 1% weight (trivially spoofed) | 85+ points (confirmed identity) | For attack detection, a UA can be spoofed; for AI detection, honest self-ID is reliable |
| Datacenter IP | 3% weight (VPN false positives) | +10 points boost | AI crawlers typically operate from cloud infrastructure |
| Missing Referer | 2% weight (easily spoofed) | +5 points boost | Direct access patterns are typical for crawlers |
| robots.txt access | Not scored | Tracked for compliance analysis | Shows whether the crawler checks robots.txt before crawling |
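As a sketch of this inversion, the same three signals can feed two scorers with opposite weightings. The point values below are illustrative only, not WebDecoy's actual internals:

```javascript
// Illustrative: the same honest User-Agent match contributes almost
// nothing to the attack score but dominates the AI scraper score.
const AI_TRAINING_UA = /GPTBot|ClaudeBot|CCBot|Bytespider/i;

function scorePair(request) {
  const honestBotUa = AI_TRAINING_UA.test(request.userAgent);
  const datacenterIp = Boolean(request.isDatacenterIp);
  const missingReferer = !request.referer;

  // Attack scoring: honest bot UA is a weak signal (trivially spoofed)
  const attackScore =
    (honestBotUa ? 1 : 0) + (datacenterIp ? 3 : 0) + (missingReferer ? 2 : 0);

  // AI scraper scoring: honest bot UA is a strong, reliable signal
  let aiScraperScore = honestBotUa ? 85 : 0;
  if (datacenterIp) aiScraperScore += 10;
  if (missingReferer) aiScraperScore += 5;

  return { attackScore, aiScraperScore: Math.min(aiScraperScore, 100) };
}
```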
┌─────────────────────────────────────────────────────────────────┐
│ AI SCRAPER DETECTION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Incoming Request │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────┐ │
│ │ CHECK USER-AGENT PATTERNS │ │
│ │ GPTBot? ClaudeBot? CCBot? ByteSpider? │ │
│ └─────────────────┬─────────────────────────┘ │
│ │ │
│ ┌────────────┴────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌───────────┐ │
│ │ MATCHED │ │ NO MATCH │ │
│ └────┬────┘ └─────┬─────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Assign Score │ │ Check Other │ │
│ │ + Confidence │ │ Signals │ │
│ │ + Category │ │ (behavioral) │ │
│ └───────┬───────┘ └───────┬───────┘ │
│ │ │ │
│ └──────────┬─────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ AI Scraper │ │
│ │ Score + Level │ │
│ │ + Category │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

The AI Scraper Score is calculated from multiple signals:

| Signal | Points | Confidence Boost |
| --- | --- | --- |
| Known Training Crawler UA | 70-85 | 95% |
| Known Search Crawler UA | 30-40 | 90% |
| AI Company IP Block | +10-20 | +10% |
| Datacenter/Cloud IP | +5-10 | — |
| High Request Rate Pattern | +5-15 | +5% |
| Systematic URL Pattern | +5-10 | — |
| Missing Referer | +5 | — |
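The table reads as a simple accumulator. The sketch below mirrors the listed point values (using midpoints for the ranges), but the function shape and field names are assumptions:

```javascript
// Illustrative accumulation of the signal table; not WebDecoy's API.
function aiScraperScore(signals) {
  let score = 0;
  let confidence = 0;

  if (signals.trainingCrawlerUa) { score += 85; confidence = 0.95; }
  else if (signals.searchCrawlerUa) { score += 35; confidence = 0.90; }

  if (signals.aiCompanyIpBlock) { score += 15; confidence += 0.10; }
  if (signals.datacenterIp)     { score += 8; }
  if (signals.highRequestRate)  { score += 10; confidence += 0.05; }
  if (signals.systematicUrls)   { score += 8; }
  if (signals.missingReferer)   { score += 5; }

  return { score: Math.min(score, 100), confidence: Math.min(confidence, 1) };
}
```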

WebDecoy tracks and identifies the following AI crawlers:

These crawlers collect content specifically for training AI/ML models:

| Crawler | Company | Score | User-Agent Pattern | Purpose |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | 85 | GPTBot/1.0 | Training ChatGPT and GPT models |
| ChatGPT-User | OpenAI | 85 | ChatGPT-User | ChatGPT plugins and browsing |
| OAI-SearchBot | OpenAI | 80 | OAI-SearchBot | SearchGPT content retrieval |
| ClaudeBot | Anthropic | 85 | ClaudeBot | Training Claude models |
| Anthropic | Anthropic | 85 | anthropic-ai | Anthropic’s general crawler |
| CCBot | Common Crawl | 80 | CCBot/2.0 | Open dataset for AI training |
| Google-Extended | Google | 80 | Google-Extended | Gemini/Bard AI training |
| PerplexityBot | Perplexity | 80 | PerplexityBot | Perplexity AI search |
| Cohere | Cohere | 80 | cohere-ai | Cohere model training |
| ByteSpider | ByteDance | 75 | Bytespider | TikTok/Douyin AI features |
| Meta-ExternalAgent | Meta | 75 | Meta-ExternalAgent | Meta AI training |
| Applebot-Extended | Apple | 75 | Applebot-Extended | Apple Intelligence training |
| YouBot | You.com | 75 | YouBot | You.com AI search |
| Amazonbot | Amazon | 70 | Amazonbot | Alexa and Amazon AI |
| FacebookBot | Meta | 70 | facebookexternalhit | Facebook AI features |
| Diffbot | Diffbot | 70 | Diffbot | Knowledge graph extraction |
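Identifying these crawlers reduces to User-Agent pattern matching against a known list. A minimal lookup sketch using a few rows from the table above (the data structure is illustrative, not WebDecoy's internal format):

```javascript
// A subset of the training-crawler table, as a pattern list.
const TRAINING_CRAWLERS = [
  { name: 'GPTBot', pattern: /GPTBot/i, score: 85 },
  { name: 'ClaudeBot', pattern: /ClaudeBot/i, score: 85 },
  { name: 'CCBot', pattern: /CCBot/i, score: 80 },
  { name: 'Google-Extended', pattern: /Google-Extended/i, score: 80 },
  { name: 'ByteSpider', pattern: /Bytespider/i, score: 75 },
];

// Returns the first matching crawler entry, or null for no match.
function matchTrainingCrawler(userAgent) {
  return TRAINING_CRAWLERS.find(c => c.pattern.test(userAgent)) ?? null;
}
```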

Traditional search engines that may also feed AI features:

| Crawler | Company | Score | User-Agent Pattern | Note |
| --- | --- | --- | --- | --- |
| Googlebot | Google | 30 | Googlebot | Primary search indexing |
| Bingbot | Microsoft | 30 | bingbot | Bing search + Copilot |
| DuckDuckBot | DuckDuckGo | 30 | DuckDuckBot | Privacy-focused search |
| Applebot | Apple | 30 | Applebot | Siri and Spotlight |
| YandexBot | Yandex | 35 | YandexBot | Russian search + AI |
| Baiduspider | Baidu | 40 | Baiduspider | Chinese search + AI |

Detected AI activity is classified into categories:

Training Crawlers

Score Range: 70-85

Crawlers that explicitly identify themselves as collecting data for AI/ML model training.

Characteristics:

  • Honest self-identification in User-Agent
  • Often respect robots.txt AI-specific directives
  • High-volume, systematic crawling patterns
  • Operate from known company IP ranges

Examples: GPTBot, ClaudeBot, CCBot, Google-Extended

robots.txt directive:

User-agent: GPTBot
Disallow: /
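The same opt-out extends to other training crawlers from the table above by repeating the directive per User-agent, for example:

```txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that robots.txt is advisory: compliance tracking (covered below) shows whether a crawler actually honors it.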

Search Crawlers

Score Range: 30-40

Traditional search engine crawlers that index content for search results but may also feed AI features.

Characteristics:

  • Well-established crawling behavior
  • Generally respect robots.txt
  • Verifiable by IP (e.g., Googlebot verification)
  • Lower threat to content licensing than pure AI trainers

Examples: Googlebot, Bingbot, DuckDuckBot

Consideration: While Googlebot itself scores low for AI scraping, content indexed by Google may be used in AI Overviews. Consider whether you want to limit Google’s AI-specific features via robots.txt.


Undeclared Content Scrapers

Score Range: Variable (based on behavioral signals)

Scrapers detected by behavior rather than self-identification. These don’t declare themselves as AI crawlers but exhibit scraping patterns.

Characteristics:

  • May spoof User-Agent as regular browser
  • Detected by fingerprint anomalies
  • High request rates
  • Systematic URL patterns
  • Missing typical browser behaviors

Detection signals:

  • Headless browser fingerprint
  • No mouse/keyboard events
  • Impossibly fast navigation
  • Sequential URL access patterns
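Combining the detection signals above into a behavioral score might look like the following sketch. The thresholds and session field names are assumptions, not WebDecoy's actual heuristics:

```javascript
// Illustrative behavioral scoring for undeclared scrapers.
function behavioralScraperScore(session) {
  let score = 0;
  if (session.headlessFingerprint)  score += 30; // headless browser detected
  if (session.inputEvents === 0)    score += 15; // no mouse/keyboard activity
  if (session.avgPageDwellMs < 200) score += 15; // impossibly fast navigation
  if (session.sequentialUrlAccess)  score += 20; // /page/1, /page/2, /page/3...
  return Math.min(score, 100);
}
```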

No AI Activity Detected

Score Range: 0

Not detected as an AI scraper. This could be:

  • Legitimate human traffic
  • Attack bot (scored separately in Attack Risk)
  • Unknown/new AI crawler not yet in database

| Score | Level | Interpretation | Recommended Action |
| --- | --- | --- | --- |
| 0-20 | MINIMAL | Not an AI scraper | No action needed |
| 21-40 | LOW | Search crawler or weak signals | Log for analysis |
| 41-60 | MEDIUM | Possible AI scraper activity | Monitor and review |
| 61-80 | HIGH | Strong AI scraper indicators | Consider blocking |
| 81-100 | CRITICAL | Confirmed AI training crawler | Block or serve alternative content |
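The score-to-level mapping is a straightforward threshold chain over the bands in the table (the function name is illustrative):

```javascript
// Direct translation of the score bands above.
function aiScraperLevel(score) {
  if (score >= 81) return 'CRITICAL';
  if (score >= 61) return 'HIGH';
  if (score >= 41) return 'MEDIUM';
  if (score >= 21) return 'LOW';
  return 'MINIMAL';
}
```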

For example, a request handler can branch on both scores independently:

function handleRequest(detection) {
  const {
    unified_score,       // Attack risk (0-100)
    ai_scraper_score,    // AI scraper risk (0-100)
    ai_scraper_category, // 'training_crawler', 'search_crawler', etc.
    ai_scraper_name      // 'GPTBot', 'ClaudeBot', etc.
  } = detection;

  // Block AI training crawlers from premium content
  if (ai_scraper_category === 'training_crawler' && ai_scraper_score >= 70) {
    return servePremiumContentBlocker();
  }

  // Allow search crawlers for SEO but log AI activity
  if (ai_scraper_category === 'search_crawler') {
    logAiCrawlerActivity(detection);
    return allowRequest();
  }

  // Challenge unknown scrapers
  if (ai_scraper_score >= 50 && ai_scraper_category === 'content_scraper') {
    return challengeWithCaptcha();
  }

  // Handle attack threats separately
  if (unified_score >= 70) {
    return blockAttackRequest();
  }

  return allowRequest();
}

Implement different content experiences based on AI scraper detection:

function getContentVersion(detection) {
  const { ai_scraper_score, ai_scraper_category } = detection;

  // AI training crawlers: Serve summary only
  if (ai_scraper_category === 'training_crawler') {
    return 'summary_only';
  }

  // Search crawlers: Full content for indexing
  if (ai_scraper_category === 'search_crawler') {
    return 'full_content';
  }

  // Suspicious scrapers: Serve paywall
  if (ai_scraper_score >= 50) {
    return 'paywall';
  }

  // Normal visitors: Full experience
  return 'full_content';
}

WebDecoy can track whether AI crawlers are respecting your robots.txt directives:

Your robots.txt:
─────────────────
User-agent: GPTBot
Disallow: /premium/
Disallow: /articles/

WebDecoy Detection:
──────────────────
GPTBot accessed: /articles/2024/my-exclusive-story.html
AI Scraper Score: 85
robots.txt Compliance: VIOLATED
→ This crawler is ignoring your robots.txt!
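A simplified version of this compliance check parses the Disallow rules for a given User-agent and tests the requested path as a prefix. Real robots.txt parsing (wildcards, Allow precedence, grouped User-agent lines per RFC 9309) is more involved; this sketch handles plain path prefixes only:

```javascript
// Collect Disallow rules that apply to botName (or to '*').
function disallowedPaths(robotsTxt, botName) {
  const rules = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    const ua = line.match(/^User-agent:\s*(.+)$/i);
    if (ua) {
      applies = ua[1].trim() === botName || ua[1].trim() === '*';
      continue;
    }
    const dis = line.match(/^Disallow:\s*(\S+)/i);
    if (dis && applies) rules.push(dis[1]);
  }
  return rules;
}

// True if the accessed path falls under any applicable Disallow prefix.
function violatesRobotsTxt(robotsTxt, botName, path) {
  return disallowedPaths(robotsTxt, botName).some(p => path.startsWith(p));
}
```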

In the Detections table, filter by AI scraper activity:

| Filter | Effect |
| --- | --- |
| AI Scraper Score >= 70 | Show high-confidence AI crawler detections |
| Category = training_crawler | Show only AI training crawlers |
| AI Scraper Name = GPTBot | Show only GPTBot activity |
| AI Scraper Score > 0 | Show all AI scraper activity |
Useful questions to explore in this data:

  1. Volume by Crawler: Which AI crawlers are hitting your site most?
  2. Target Pages: Which content is being scraped most heavily?
  3. Time Patterns: When do AI crawlers typically access your site?
  4. robots.txt Violations: Are crawlers ignoring your preferences?
  5. New Crawlers: Unknown scraping patterns that might be new AI crawlers

Use AI Scraper scores to trigger automated responses:

Response Action: Block AI Scrapers
Trigger: ai_scraper_score >= 70 AND ai_scraper_category = 'training_crawler'
Action: Block
Duration: Permanent
Notify: Weekly digest

Response Action: Rate Limit Search Crawlers
Trigger: ai_scraper_category = 'search_crawler'
Action: Rate Limit
Rate: 100 requests/hour
Notify: On threshold breach

Response Action: Scraping Alert
Trigger: ai_scraper_score >= 50 AND request_count > 1000/hour
Action: Webhook Alert
Endpoint: https://your-api.com/alerts
Notify: Immediately
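The same rules can be expressed as data-driven code, for example when replicating them behind your own proxy. The rule shape below is illustrative, not WebDecoy's configuration format:

```javascript
// Illustrative rule table mirroring the response actions above.
const RULES = [
  {
    name: 'Block AI Scrapers',
    match: d => d.ai_scraper_score >= 70 && d.ai_scraper_category === 'training_crawler',
    action: 'block',
  },
  {
    name: 'Rate Limit Search Crawlers',
    match: d => d.ai_scraper_category === 'search_crawler',
    action: 'rate_limit',
  },
  {
    name: 'Scraping Alert',
    // requests_per_hour is an assumed field standing in for "request_count > 1000/hour"
    match: d => d.ai_scraper_score >= 50 && d.requests_per_hour > 1000,
    action: 'webhook',
  },
];

// First matching rule wins; default is to allow.
function firstAction(detection) {
  const rule = RULES.find(r => r.match(detection));
  return rule ? rule.action : 'allow';
}
```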

Access AI Scraper data via the WebDecoy API:

{
  "id": "det_abc123",
  "ip_address": "20.15.240.128",
  "user_agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0)",
  "path": "/articles/exclusive-content.html",
  "unified_score": 14,
  "threat_level": "MINIMAL",
  "category": "crawler",
  "ai_scraper_score": 85,
  "ai_scraper_level": "CRITICAL",
  "ai_scraper_category": "training_crawler",
  "ai_scraper_name": "GPTBot",
  "ai_scraper_confidence": 0.95,
  "score_components": {
    "user_agent_score": 100,
    "honeypot_score": 0,
    "attack_signature_score": 0,
    "ai_scraper_score": 85,
    "ai_scraper_confidence": 0.95,
    "ai_scraper_category": "training_crawler",
    "ai_scraper_name": "GPTBot"
  }
}
# Get all AI training crawler detections
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.webdecoy.com/v1/detections?ai_scraper_category=training_crawler"

# Get high-score AI scraper activity
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.webdecoy.com/v1/detections?ai_scraper_score_gte=70"
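The same queries work from Node 18+ with the built-in fetch. The endpoint and query parameters come from the curl examples above; the helper names are illustrative:

```javascript
// Build a detections query URL from a parameter object.
function detectionsUrl(params) {
  const qs = new URLSearchParams(params).toString();
  return `https://api.webdecoy.com/v1/detections?${qs}`;
}

// Fetch detections with a bearer token; minimal error handling.
async function fetchDetections(token, params) {
  const res = await fetch(detectionsUrl(params), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`WebDecoy API error: ${res.status}`);
  return res.json();
}
```

Usage: `fetchDetections(process.env.TOKEN, { ai_scraper_category: 'training_crawler' })`.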

The AI landscape is evolving rapidly. WebDecoy regularly updates its crawler database as new AI training bots emerge. Factors we monitor:

  • New AI Companies: Startups launching training crawlers
  • Existing Companies: Tech giants adding AI features to existing bots
  • IP Range Updates: Changes to AI company infrastructure
  • Behavioral Patterns: New scraping techniques and evasion methods

Some sites are experimenting with AI-specific content tags:

<!-- Potential future standard -->
<meta name="ai-training" content="disallow">
<meta name="ai-summary" content="allow">

WebDecoy will track adoption and compliance with emerging standards.

AI scraping raises significant copyright and licensing questions. WebDecoy provides the data layer for understanding what’s being scraped. This can support:

  • Licensing negotiations with AI companies
  • Copyright enforcement evidence
  • Terms of service violation documentation
  • Opt-out compliance verification

| Aspect | Attack Scoring | AI Scraper Scoring |
| --- | --- | --- |
| Question | Is this an attack or bot? | Is this an AI training crawler? |
| User-Agent Weight | 1% (easily spoofed) | 70-85+ points (reliable ID) |
| Interpretation | Honest ID = low threat | Honest ID = high threat |
| Goal | Detect attacks and evasion | Detect content harvesting |
| Primary Users | Security teams | Publishers, content creators |