AI Scraper Scoring

WebDecoy includes a dedicated scoring dimension for detecting AI training crawlers. It runs independently of attack detection because the two systems interpret the same signals in fundamentally different ways.

The Problem: AI Scrapers Don’t Look Like Attackers

When GPTBot crawls your website, it honestly identifies itself in the User-Agent header. From an attack detection perspective, this is low threat—the bot isn’t trying to hide or evade detection.

But from a content protection perspective, this is the highest possible threat—a known AI training crawler is actively collecting your content.

Same signal, opposite interpretations:
┌─────────────────────────────────────────────────────────────────┐
│ GPTBot Request │
│ User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like │
│ Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Attack Detection View │ Content Protection View │
│ ────────────────────── │ ──────────────────────── │
│ "This bot isn't hiding" │ "Known AI training crawler" │
│ User-Agent easily spoofed │ Confirmed identity │
│ Score: 14 (MINIMAL) │ Score: 85 (CRITICAL) │
│ Verdict: Low threat │ Verdict: High threat │
│ │
└─────────────────────────────────────────────────────────────────┘

WebDecoy shows both scores side-by-side because they answer different questions:

| Score | Question It Answers |
| --- | --- |
| Attack Risk | “Is this visitor trying to attack or evade detection?” |
| AI Scraper Risk | “Is this visitor an AI crawler collecting content for training?” |

For a detected GPTBot request:

┌─────────────────────────┬─────────────────────────┐
│ Attack Risk │ AI Scraper Risk │
│ │ │
│ 14 │ 85 │
│ MINIMAL │ CRITICAL │
│ │ │
│ "Not trying to hide" │ "GPTBot detected" │
└─────────────────────────┴─────────────────────────┘

Key insight: Honest self-identification is LOW threat for attack detection but HIGH confidence for AI scraper detection.


AI Scraper scoring uses inverted signal weighting compared to attack detection:

| Signal | Attack Scoring | AI Scraper Scoring | Reasoning |
| --- | --- | --- | --- |
| Honest bot UA (GPTBot, ClaudeBot) | 1% weight (trivially spoofed) | 85+ points (confirmed identity) | For attack detection, a UA can be spoofed; for AI detection, honest self-ID is reliable |
| Datacenter IP | 3% weight (VPN false positives) | +10 points boost | AI crawlers typically operate from cloud infrastructure |
| Missing Referer | 2% weight (easily spoofed) | +5 points boost | Direct access patterns are typical for crawlers |
| robots.txt access | Not scored | Tracked for compliance analysis | Shows whether the crawler checks robots.txt before crawling |
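As a sketch of this inversion, the same three signals can feed two scorers with opposite weightings. The point values below are illustrative only, not WebDecoy's actual internals:

```javascript
// Illustrative: the same honest User-Agent match contributes almost
// nothing to the attack score but dominates the AI scraper score.
const AI_TRAINING_UA = /GPTBot|ClaudeBot|CCBot|Bytespider/i;

function scorePair(request) {
  const honestBotUa = AI_TRAINING_UA.test(request.userAgent);
  const datacenterIp = Boolean(request.isDatacenterIp);
  const missingReferer = !request.referer;

  // Attack scoring: honest bot UA is a weak signal (trivially spoofed)
  const attackScore =
    (honestBotUa ? 1 : 0) + (datacenterIp ? 3 : 0) + (missingReferer ? 2 : 0);

  // AI scraper scoring: honest bot UA is a strong, reliable signal
  let aiScraperScore = honestBotUa ? 85 : 0;
  if (datacenterIp) aiScraperScore += 10;
  if (missingReferer) aiScraperScore += 5;

  return { attackScore, aiScraperScore: Math.min(aiScraperScore, 100) };
}
```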
┌─────────────────────────────────────────────────────────────────┐
│ AI SCRAPER DETECTION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Incoming Request │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────┐ │
│ │ CHECK USER-AGENT PATTERNS │ │
│ │ GPTBot? ClaudeBot? CCBot? ByteSpider? │ │
│ └─────────────────┬─────────────────────────┘ │
│ │ │
│ ┌────────────┴────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌───────────┐ │
│ │ MATCHED │ │ NO MATCH │ │
│ └────┬────┘ └─────┬─────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Assign Score │ │ Check Other │ │
│ │ + Confidence │ │ Signals │ │
│ │ + Category │ │ (behavioral) │ │
│ └───────┬───────┘ └───────┬───────┘ │
│ │ │ │
│ └──────────┬─────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ AI Scraper │ │
│ │ Score + Level │ │
│ │ + Category │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

The AI Scraper Score is calculated from multiple signals:

| Signal | Points | Confidence Boost |
| --- | --- | --- |
| Known Training Crawler UA | 70-85 | 95% |
| Known Search Crawler UA | 30-40 | 90% |
| AI Company IP Block | +10-20 | +10% |
| Datacenter/Cloud IP | +5-10 | — |
| High Request Rate Pattern | +5-15 | +5% |
| Systematic URL Pattern | +5-10 | — |
| Missing Referer | +5 | — |
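The table reads as a simple accumulator. The sketch below mirrors the listed point values (using midpoints for the ranges), but the function shape and field names are assumptions:

```javascript
// Illustrative accumulation of the signal table; not WebDecoy's API.
function aiScraperScore(signals) {
  let score = 0;
  let confidence = 0;

  if (signals.trainingCrawlerUa) { score += 85; confidence = 0.95; }
  else if (signals.searchCrawlerUa) { score += 35; confidence = 0.90; }

  if (signals.aiCompanyIpBlock) { score += 15; confidence += 0.10; }
  if (signals.datacenterIp)     { score += 8; }
  if (signals.highRequestRate)  { score += 10; confidence += 0.05; }
  if (signals.systematicUrls)   { score += 8; }
  if (signals.missingReferer)   { score += 5; }

  return { score: Math.min(score, 100), confidence: Math.min(confidence, 1) };
}
```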

WebDecoy tracks and identifies the following AI crawlers:

These crawlers collect content specifically for training AI/ML models:

| Crawler | Company | Score | User-Agent Pattern | Purpose |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | 85 | GPTBot/1.0 | Training ChatGPT and GPT models |
| ChatGPT-User | OpenAI | 85 | ChatGPT-User | ChatGPT plugins and browsing |
| OAI-SearchBot | OpenAI | 80 | OAI-SearchBot | SearchGPT content retrieval |
| ClaudeBot | Anthropic | 85 | ClaudeBot | Training Claude models |
| Anthropic | Anthropic | 85 | anthropic-ai | Anthropic’s general crawler |
| CCBot | Common Crawl | 80 | CCBot/2.0 | Open dataset for AI training |
| Google-Extended | Google | 80 | Google-Extended | Gemini/Bard AI training |
| PerplexityBot | Perplexity | 80 | PerplexityBot | Perplexity AI search |
| Cohere | Cohere | 80 | cohere-ai | Cohere model training |
| ByteSpider | ByteDance | 75 | Bytespider | TikTok/Douyin AI features |
| Meta-ExternalAgent | Meta | 75 | Meta-ExternalAgent | Meta AI training |
| Applebot-Extended | Apple | 75 | Applebot-Extended | Apple Intelligence training |
| YouBot | You.com | 75 | YouBot | You.com AI search |
| Amazonbot | Amazon | 70 | Amazonbot | Alexa and Amazon AI |
| FacebookBot | Meta | 70 | facebookexternalhit | Facebook AI features |
| Diffbot | Diffbot | 70 | Diffbot | Knowledge graph extraction |
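Identifying these crawlers reduces to User-Agent pattern matching against a known list. A minimal lookup sketch using a few rows from the table above (the data structure is illustrative, not WebDecoy's internal format):

```javascript
// A subset of the training-crawler table, as a pattern list.
const TRAINING_CRAWLERS = [
  { name: 'GPTBot', pattern: /GPTBot/i, score: 85 },
  { name: 'ClaudeBot', pattern: /ClaudeBot/i, score: 85 },
  { name: 'CCBot', pattern: /CCBot/i, score: 80 },
  { name: 'Google-Extended', pattern: /Google-Extended/i, score: 80 },
  { name: 'ByteSpider', pattern: /Bytespider/i, score: 75 },
];

// Returns the first matching crawler entry, or null for no match.
function matchTrainingCrawler(userAgent) {
  return TRAINING_CRAWLERS.find(c => c.pattern.test(userAgent)) ?? null;
}
```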

Traditional search engines that may also feed AI features:

| Crawler | Company | Score | User-Agent Pattern | Note |
| --- | --- | --- | --- | --- |
| Googlebot | Google | 30 | Googlebot | Primary search indexing |
| Bingbot | Microsoft | 30 | bingbot | Bing search + Copilot |
| DuckDuckBot | DuckDuckGo | 30 | DuckDuckBot | Privacy-focused search |
| Applebot | Apple | 30 | Applebot | Siri and Spotlight |
| YandexBot | Yandex | 35 | YandexBot | Russian search + AI |
| Baiduspider | Baidu | 40 | Baiduspider | Chinese search + AI |

Detected AI activity is classified into categories:

Training Crawlers

Score Range: 70-85

Crawlers that explicitly identify themselves as collecting data for AI/ML model training.

Characteristics:

  • Honest self-identification in User-Agent
  • Often respect robots.txt AI-specific directives
  • High-volume, systematic crawling patterns
  • Operate from known company IP ranges

Examples: GPTBot, ClaudeBot, CCBot, Google-Extended

robots.txt directive:

User-agent: GPTBot
Disallow: /
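The same opt-out extends to other training crawlers from the table above by repeating the directive per User-agent, for example:

```txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that robots.txt is advisory: compliance tracking (covered below) shows whether a crawler actually honors it.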

Search Crawlers

Score Range: 30-40

Traditional search engine crawlers that index content for search results but may also feed AI features.

Characteristics:

  • Well-established crawling behavior
  • Generally respect robots.txt
  • Verifiable by IP (e.g., Googlebot verification)
  • Lower threat to content licensing than pure AI trainers

Examples: Googlebot, Bingbot, DuckDuckBot

Consideration: While Googlebot itself scores low for AI scraping, content indexed by Google may be used in AI Overviews. Consider whether you want to limit Google’s AI-specific features via robots.txt.


Undeclared Content Scrapers

Score Range: Variable (based on behavioral signals)

Scrapers detected by behavior rather than self-identification. These don’t declare themselves as AI crawlers but exhibit scraping patterns.

Characteristics:

  • May spoof User-Agent as regular browser
  • Detected by fingerprint anomalies
  • High request rates
  • Systematic URL patterns
  • Missing typical browser behaviors

Detection signals:

  • Headless browser fingerprint
  • No mouse/keyboard events
  • Impossibly fast navigation
  • Sequential URL access patterns
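Combining the detection signals above into a behavioral score might look like the following sketch. The thresholds and session field names are assumptions, not WebDecoy's actual heuristics:

```javascript
// Illustrative behavioral scoring for undeclared scrapers.
function behavioralScraperScore(session) {
  let score = 0;
  if (session.headlessFingerprint)  score += 30; // headless browser detected
  if (session.inputEvents === 0)    score += 15; // no mouse/keyboard activity
  if (session.avgPageDwellMs < 200) score += 15; // impossibly fast navigation
  if (session.sequentialUrlAccess)  score += 20; // /page/1, /page/2, /page/3...
  return Math.min(score, 100);
}
```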

No AI Activity Detected

Score Range: 0

Not detected as an AI scraper. This could be:

  • Legitimate human traffic
  • Attack bot (scored separately in Attack Risk)
  • Unknown/new AI crawler not yet in database

| Score | Level | Interpretation | Recommended Action |
| --- | --- | --- | --- |
| 0-20 | MINIMAL | Not an AI scraper | No action needed |
| 21-40 | LOW | Search crawler or weak signals | Log for analysis |
| 41-60 | MEDIUM | Possible AI scraper activity | Monitor and review |
| 61-80 | HIGH | Strong AI scraper indicators | Consider blocking |
| 81-100 | CRITICAL | Confirmed AI training crawler | Block or serve alternative content |
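The score-to-level mapping is a straightforward threshold chain over the bands in the table (the function name is illustrative):

```javascript
// Direct translation of the score bands above.
function aiScraperLevel(score) {
  if (score >= 81) return 'CRITICAL';
  if (score >= 61) return 'HIGH';
  if (score >= 41) return 'MEDIUM';
  if (score >= 21) return 'LOW';
  return 'MINIMAL';
}
```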

For example, a request handler can branch on both scores independently:

function handleRequest(detection) {
  const {
    unified_score,       // Attack risk (0-100)
    ai_scraper_score,    // AI scraper risk (0-100)
    ai_scraper_category, // 'training_crawler', 'search_crawler', etc.
    ai_scraper_name      // 'GPTBot', 'ClaudeBot', etc.
  } = detection;

  // Block AI training crawlers from premium content
  if (ai_scraper_category === 'training_crawler' && ai_scraper_score >= 70) {
    return servePremiumContentBlocker();
  }

  // Allow search crawlers for SEO but log AI activity
  if (ai_scraper_category === 'search_crawler') {
    logAiCrawlerActivity(detection);
    return allowRequest();
  }

  // Challenge unknown scrapers
  if (ai_scraper_score >= 50 && ai_scraper_category === 'content_scraper') {
    return challengeWithCaptcha();
  }

  // Handle attack threats separately
  if (unified_score >= 70) {
    return blockAttackRequest();
  }

  return allowRequest();
}

Implement different content experiences based on AI scraper detection:

function getContentVersion(detection) {
  const { ai_scraper_score, ai_scraper_category } = detection;

  // AI training crawlers: Serve summary only
  if (ai_scraper_category === 'training_crawler') {
    return 'summary_only';
  }

  // Search crawlers: Full content for indexing
  if (ai_scraper_category === 'search_crawler') {
    return 'full_content';
  }

  // Suspicious scrapers: Serve paywall
  if (ai_scraper_score >= 50) {
    return 'paywall';
  }

  // Normal visitors: Full experience
  return 'full_content';
}

WebDecoy can track whether AI crawlers are respecting your robots.txt directives:

Your robots.txt:
─────────────────
User-agent: GPTBot
Disallow: /premium/
Disallow: /articles/

WebDecoy Detection:
──────────────────
GPTBot accessed: /articles/2024/my-exclusive-story.html
AI Scraper Score: 85
robots.txt Compliance: VIOLATED
→ This crawler is ignoring your robots.txt!
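A simplified version of this compliance check parses the Disallow rules for a given User-agent and tests the requested path as a prefix. Real robots.txt parsing (wildcards, Allow precedence, grouped User-agent lines per RFC 9309) is more involved; this sketch handles plain path prefixes only:

```javascript
// Collect Disallow rules that apply to botName (or to '*').
function disallowedPaths(robotsTxt, botName) {
  const rules = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    const ua = line.match(/^User-agent:\s*(.+)$/i);
    if (ua) {
      applies = ua[1].trim() === botName || ua[1].trim() === '*';
      continue;
    }
    const dis = line.match(/^Disallow:\s*(\S+)/i);
    if (dis && applies) rules.push(dis[1]);
  }
  return rules;
}

// True if the accessed path falls under any applicable Disallow prefix.
function violatesRobotsTxt(robotsTxt, botName, path) {
  return disallowedPaths(robotsTxt, botName).some(p => path.startsWith(p));
}
```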

In the Detections table, filter by AI scraper activity:

| Filter | Effect |
| --- | --- |
| AI Scraper Score >= 70 | Show high-confidence AI crawler detections |
| Category = training_crawler | Show only AI training crawlers |
| AI Scraper Name = GPTBot | Show only GPTBot activity |
| AI Scraper Score > 0 | Show all AI scraper activity |
Useful questions to explore in this data:

  1. Volume by Crawler: Which AI crawlers are hitting your site most?
  2. Target Pages: Which content is being scraped most heavily?
  3. Time Patterns: When do AI crawlers typically access your site?
  4. robots.txt Violations: Are crawlers ignoring your preferences?
  5. New Crawlers: Unknown scraping patterns that might be new AI crawlers

Use AI Scraper scores to trigger automated responses:

Response Action: Block AI Scrapers
Trigger: ai_scraper_score >= 70 AND ai_scraper_category = 'training_crawler'
Action: Block
Duration: Permanent
Notify: Weekly digest

Response Action: Rate Limit Search Crawlers
Trigger: ai_scraper_category = 'search_crawler'
Action: Rate Limit
Rate: 100 requests/hour
Notify: On threshold breach

Response Action: Scraping Alert
Trigger: ai_scraper_score >= 50 AND request_count > 1000/hour
Action: Webhook Alert
Endpoint: https://your-api.com/alerts
Notify: Immediately
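The same rules can be expressed as data-driven code, for example when replicating them behind your own proxy. The rule shape below is illustrative, not WebDecoy's configuration format:

```javascript
// Illustrative rule table mirroring the response actions above.
const RULES = [
  {
    name: 'Block AI Scrapers',
    match: d => d.ai_scraper_score >= 70 && d.ai_scraper_category === 'training_crawler',
    action: 'block',
  },
  {
    name: 'Rate Limit Search Crawlers',
    match: d => d.ai_scraper_category === 'search_crawler',
    action: 'rate_limit',
  },
  {
    name: 'Scraping Alert',
    // requests_per_hour is an assumed field standing in for "request_count > 1000/hour"
    match: d => d.ai_scraper_score >= 50 && d.requests_per_hour > 1000,
    action: 'webhook',
  },
];

// First matching rule wins; default is to allow.
function firstAction(detection) {
  const rule = RULES.find(r => r.match(detection));
  return rule ? rule.action : 'allow';
}
```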

Access AI Scraper data via the WebDecoy API:

{
  "id": "det_abc123",
  "ip_address": "20.15.240.128",
  "user_agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0)",
  "path": "/articles/exclusive-content.html",
  "unified_score": 14,
  "threat_level": "MINIMAL",
  "category": "crawler",
  "ai_scraper_score": 85,
  "ai_scraper_level": "CRITICAL",
  "ai_scraper_category": "training_crawler",
  "ai_scraper_name": "GPTBot",
  "ai_scraper_confidence": 0.95,
  "score_components": {
    "user_agent_score": 100,
    "honeypot_score": 0,
    "attack_signature_score": 0,
    "ai_scraper_score": 85,
    "ai_scraper_confidence": 0.95,
    "ai_scraper_category": "training_crawler",
    "ai_scraper_name": "GPTBot"
  }
}
# Get all AI training crawler detections
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.webdecoy.com/v1/detections?ai_scraper_category=training_crawler"

# Get high-score AI scraper activity
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.webdecoy.com/v1/detections?ai_scraper_score_gte=70"
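The same queries work from Node 18+ with the built-in fetch. The endpoint and query parameters come from the curl examples above; the helper names are illustrative:

```javascript
// Build a detections query URL from a parameter object.
function detectionsUrl(params) {
  const qs = new URLSearchParams(params).toString();
  return `https://api.webdecoy.com/v1/detections?${qs}`;
}

// Fetch detections with a bearer token; minimal error handling.
async function fetchDetections(token, params) {
  const res = await fetch(detectionsUrl(params), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`WebDecoy API error: ${res.status}`);
  return res.json();
}
```

Usage: `fetchDetections(process.env.TOKEN, { ai_scraper_category: 'training_crawler' })`.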

The AI landscape is evolving rapidly. WebDecoy regularly updates its crawler database as new AI training bots emerge. Factors we monitor:

  • New AI Companies: Startups launching training crawlers
  • Existing Companies: Tech giants adding AI features to existing bots
  • IP Range Updates: Changes to AI company infrastructure
  • Behavioral Patterns: New scraping techniques and evasion methods

Some sites are experimenting with AI-specific content tags:

<!-- Potential future standard -->
<meta name="ai-training" content="disallow">
<meta name="ai-summary" content="allow">

WebDecoy will track adoption and compliance with emerging standards.

AI scraping raises significant copyright and licensing questions. WebDecoy provides the data layer for understanding what’s being scraped. This can support:

  • Licensing negotiations with AI companies
  • Copyright enforcement evidence
  • Terms of service violation documentation
  • Opt-out compliance verification

| Aspect | Attack Scoring | AI Scraper Scoring |
| --- | --- | --- |
| Question | Is this an attack or bot? | Is this an AI training crawler? |
| User-Agent Weight | 1% (easily spoofed) | 70-85+ points (reliable ID) |
| Interpretation | Honest ID = low threat | Honest ID = high threat |
| Goal | Detect attacks and evasion | Detect content harvesting |
| Primary Users | Security teams | Publishers, content creators |