# AI Scraper Scoring
WebDecoy includes a dedicated scoring dimension for detecting AI training crawlers. It runs independently of attack detection because the same signals are interpreted in fundamentally different ways.
## The Problem: AI Scrapers Don't Look Like Attackers

When GPTBot crawls your website, it honestly identifies itself in the User-Agent header. From an attack detection perspective, this is low threat—the bot isn't trying to hide or evade detection.
But from a content protection perspective, this is the highest possible threat—a known AI training crawler is actively collecting your content.
Same signal, opposite interpretations:
```
┌─────────────────────────────────────────────────────────────────┐
│ GPTBot Request                                                  │
│ User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like         │
│ Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Attack Detection View          │  Content Protection View      │
│  ──────────────────────         │  ────────────────────────     │
│  "This bot isn't hiding"        │  "Known AI training crawler"  │
│  User-Agent easily spoofed      │  Confirmed identity           │
│  Score: 14 (MINIMAL)            │  Score: 85 (CRITICAL)         │
│  Verdict: Low threat            │  Verdict: High threat         │
│                                 │                               │
└─────────────────────────────────────────────────────────────────┘
```

## Why Two Separate Scores?

WebDecoy shows both scores side-by-side because they answer different questions:
| Score | Question It Answers |
|---|---|
| Attack Risk | "Is this visitor trying to attack or evade detection?" |
| AI Scraper Risk | "Is this visitor an AI crawler collecting content for training?" |
For a detected GPTBot request:
```
┌─────────────────────────┬─────────────────────────┐
│       Attack Risk       │     AI Scraper Risk     │
│                         │                         │
│           14            │           85            │
│         MINIMAL         │        CRITICAL         │
│                         │                         │
│  "Not trying to hide"   │   "GPTBot detected"     │
└─────────────────────────┴─────────────────────────┘
```

Key insight: Honest self-identification is LOW threat for attack detection but HIGH confidence for AI scraper detection.
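To make the contrast concrete, here is a minimal sketch of reading the two scores against a threshold. This is illustrative only, not WebDecoy's API; the field names follow the detection payload shown later on this page, and the 70-point cutoff mirrors the blocking examples below.

```javascript
// Illustrative helper: the same detection is read against two
// independent thresholds, yielding opposite verdicts for GPTBot.
function interpretDetection(detection) {
  return {
    attack: detection.unified_score >= 70 ? 'High threat' : 'Low threat',
    aiScraper: detection.ai_scraper_score >= 70 ? 'High threat' : 'Low threat',
  };
}

// A GPTBot request: low attack risk, critical AI scraper risk
const verdicts = interpretDetection({ unified_score: 14, ai_scraper_score: 85 });
// → { attack: 'Low threat', aiScraper: 'High threat' }
```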
## How AI Scraper Scoring Works

### Inverted Signal Weighting

AI Scraper scoring uses inverted signal weighting compared to attack detection:
| Signal | Attack Scoring | AI Scraper Scoring | Reasoning |
|---|---|---|---|
| Honest bot UA (GPTBot, ClaudeBot) | 1% weight (trivially spoofed) | 85+ points (confirmed identity) | For attack detection, UA can be spoofed. For AI detection, honest self-ID is reliable |
| Datacenter IP | 3% weight (VPN false positives) | +10 points boost | AI crawlers typically operate from cloud infrastructure |
| Missing Referer | 2% weight (easily spoofed) | +5 points boost | Direct access patterns are typical for crawlers |
| robots.txt access | Not scored | Tracked for compliance analysis | Shows whether crawler is checking robots.txt before crawling |
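The inversion can be sketched as a small function. The weights below mirror the table, not WebDecoy's actual implementation: the same three signals produce a near-zero attack contribution but a near-maximal AI scraper score.

```javascript
// Illustrative only: same inputs, inverted weighting.
function weighSignals({ honestBotUA, datacenterIP, missingReferer }) {
  let attackPoints = 0;
  let aiScraperPoints = 0;
  if (honestBotUA)    { attackPoints += 1; aiScraperPoints += 85; } // spoofable vs. reliable self-ID
  if (datacenterIP)   { attackPoints += 3; aiScraperPoints += 10; } // VPN noise vs. cloud-hosting boost
  if (missingReferer) { attackPoints += 2; aiScraperPoints += 5;  } // easily spoofed vs. crawler pattern
  return { attackPoints, aiScraperPoints: Math.min(aiScraperPoints, 100) };
}
```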
### Detection Process

```
                AI SCRAPER DETECTION PIPELINE

  Incoming Request
        │
        ▼
  ┌───────────────────────────────────────────┐
  │ CHECK USER-AGENT PATTERNS                 │
  │ GPTBot? ClaudeBot? CCBot? ByteSpider?     │
  └─────────────────┬─────────────────────────┘
                    │
         ┌──────────┴──────────┐
         ▼                     ▼
    ┌─────────┐          ┌───────────┐
    │ MATCHED │          │ NO MATCH  │
    └────┬────┘          └─────┬─────┘
         ▼                     ▼
  ┌───────────────┐     ┌───────────────┐
  │ Assign Score  │     │ Check Other   │
  │ + Confidence  │     │ Signals       │
  │ + Category    │     │ (behavioral)  │
  └───────┬───────┘     └───────┬───────┘
          └──────────┬──────────┘
                     ▼
           ┌─────────────────┐
           │ AI Scraper      │
           │ Score + Level   │
           │ + Category      │
           └─────────────────┘
```

### Scoring Components

The AI Scraper Score is calculated from multiple signals:
| Signal | Points | Confidence Boost |
|---|---|---|
| Known Training Crawler UA | 70-85 | 95% |
| Known Search Crawler UA | 30-40 | 90% |
| AI Company IP Block | +10-20 | +10% |
| Datacenter/Cloud IP | +5-10 | — |
| High Request Rate Pattern | +5-15 | +5% |
| Systematic URL Pattern | +5-10 | — |
| Missing Referer | +5 | — |
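A hedged sketch of how the components above might combine: base points from a UA match, additive boosts from corroborating signals, both capped. The point values come from the table; the combination rule itself is an assumption, not WebDecoy's published algorithm.

```javascript
// Assumed aggregation: base score from UA match, boosts added on top,
// score capped at 100 and confidence capped at 1.0.
function aiScraperScore(signals) {
  let score = 0;
  let confidence = 0;
  if (signals.knownTrainingUA)      { score += 85; confidence = 0.95; }
  else if (signals.knownSearchUA)   { score += 35; confidence = 0.90; }
  if (signals.aiCompanyIpBlock)     { score += 15; confidence += 0.10; }
  if (signals.datacenterIp)           score += 8;
  if (signals.highRequestRate)      { score += 10; confidence += 0.05; }
  if (signals.systematicUrlPattern)   score += 8;
  if (signals.missingReferer)         score += 5;
  return { score: Math.min(score, 100), confidence: Math.min(confidence, 1) };
}
```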
## Known AI Crawlers

WebDecoy tracks and identifies the following AI crawlers:

### Training Crawlers (High Score)

These crawlers collect content specifically for training AI/ML models:
| Crawler | Company | Score | User-Agent Pattern | Purpose |
|---|---|---|---|---|
| GPTBot | OpenAI | 85 | GPTBot/1.0 | Training ChatGPT and GPT models |
| ChatGPT-User | OpenAI | 85 | ChatGPT-User | ChatGPT plugins and browsing |
| OAI-SearchBot | OpenAI | 80 | OAI-SearchBot | SearchGPT content retrieval |
| ClaudeBot | Anthropic | 85 | ClaudeBot | Training Claude models |
| Anthropic | Anthropic | 85 | anthropic-ai | Anthropic’s general crawler |
| CCBot | Common Crawl | 80 | CCBot/2.0 | Open dataset for AI training |
| Google-Extended | Google | 80 | Google-Extended | Gemini/Bard AI training |
| PerplexityBot | Perplexity | 80 | PerplexityBot | Perplexity AI search |
| Cohere | Cohere | 80 | cohere-ai | Cohere model training |
| ByteSpider | ByteDance | 75 | Bytespider | TikTok/Douyin AI features |
| Meta-ExternalAgent | Meta | 75 | Meta-ExternalAgent | Meta AI training |
| Applebot-Extended | Apple | 75 | Applebot-Extended | Apple Intelligence training |
| YouBot | You.com | 75 | YouBot | You.com AI search |
| Amazonbot | Amazon | 70 | Amazonbot | Alexa and Amazon AI |
| FacebookBot | Meta | 70 | facebookexternalhit | Facebook AI features |
| Diffbot | Diffbot | 70 | Diffbot | Knowledge graph extraction |
### Search Crawlers (Lower Score)

Traditional search engines that may also feed AI features:
| Crawler | Company | Score | User-Agent Pattern | Note |
|---|---|---|---|---|
| Googlebot | Google | 30 | Googlebot | Primary search indexing |
| Bingbot | Microsoft | 30 | bingbot | Bing search + Copilot |
| DuckDuckBot | DuckDuckGo | 30 | DuckDuckBot | Privacy-focused search |
| Applebot | Apple | 30 | Applebot | Siri and Spotlight |
| YandexBot | Yandex | 35 | YandexBot | Russian search + AI |
| Baiduspider | Baidu | 40 | Baiduspider | Chinese search + AI |
## AI Scraper Categories

Detected AI activity is classified into categories:

### Training Crawler

Score Range: 70-85
Crawlers that explicitly identify themselves as collecting data for AI/ML model training.
Characteristics:
- Honest self-identification in User-Agent
- Often respect robots.txt AI-specific directives
- High-volume, systematic crawling patterns
- Operate from known company IP ranges
Examples: GPTBot, ClaudeBot, CCBot, Google-Extended
robots.txt directive:
```
User-agent: GPTBot
Disallow: /
```

### Search Crawler

Score Range: 30-40
Traditional search engine crawlers that index content for search results but may also feed AI features.
Characteristics:
- Well-established crawling behavior
- Generally respect robots.txt
- Verifiable by IP (e.g., Googlebot verification)
- Lower threat to content licensing than pure AI trainers
Examples: Googlebot, Bingbot, DuckDuckBot
Consideration: While Googlebot itself scores low for AI scraping, content indexed by Google may be used in AI Overviews. Consider whether you want to limit Google’s AI-specific features via robots.txt.
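For example, Google's documented `Google-Extended` token lets a site opt out of Gemini training without affecting Googlebot's search indexing:

```
User-agent: Google-Extended
Disallow: /
```

Googlebot itself ignores this group, so search rankings are unaffected.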
### Content Scraper

Score Range: Variable (based on behavioral signals)
Scrapers detected by behavior rather than self-identification. These don’t declare themselves as AI crawlers but exhibit scraping patterns.
Characteristics:
- May spoof User-Agent as regular browser
- Detected by fingerprint anomalies
- High request rates
- Systematic URL patterns
- Missing typical browser behaviors
Detection signals:
- Headless browser fingerprint
- No mouse/keyboard events
- Impossibly fast navigation
- Sequential URL access patterns
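The behavioral signals above can be sketched as a simple additive heuristic. The signal names follow the lists above, but the point values are illustrative assumptions, not WebDecoy's actual weights.

```javascript
// Hypothetical behavioral heuristic for scrapers that don't self-identify.
function behavioralScraperScore(session) {
  let score = 0;
  if (session.headlessFingerprint) score += 30;  // e.g. headless browser detected
  if (session.inputEvents === 0 && session.pagesVisited > 5) score += 20; // no mouse/keyboard events
  if (session.avgSecondsPerPage < 1) score += 20; // impossibly fast navigation
  if (session.sequentialUrls) score += 15;        // /item/1, /item/2, /item/3 ...
  return Math.min(score, 100);
}
```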
### Not Detected

Score Range: 0
Not detected as an AI scraper. This could be:
- Legitimate human traffic
- Attack bot (scored separately in Attack Risk)
- Unknown/new AI crawler not yet in database
## Threat Levels for AI Scraping

| Score | Level | Interpretation | Recommended Action |
|---|---|---|---|
| 0-20 | MINIMAL | Not an AI scraper | No action needed |
| 21-40 | LOW | Search crawler or weak signals | Log for analysis |
| 41-60 | MEDIUM | Possible AI scraper activity | Monitor and review |
| 61-80 | HIGH | Strong AI scraper indicators | Consider blocking |
| 81-100 | CRITICAL | Confirmed AI training crawler | Block or serve alternative content |
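The bands in the table map directly to a lookup helper, useful when consuming raw scores from the API:

```javascript
// Threat-level bands from the table above.
function aiScraperLevel(score) {
  if (score <= 20) return 'MINIMAL';
  if (score <= 40) return 'LOW';
  if (score <= 60) return 'MEDIUM';
  if (score <= 80) return 'HIGH';
  return 'CRITICAL';
}
```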
## Using AI Scraper Scores

### Content Protection Strategy

```js
function handleRequest(detection) {
  const {
    unified_score,        // Attack risk (0-100)
    ai_scraper_score,     // AI scraper risk (0-100)
    ai_scraper_category,  // 'training_crawler', 'search_crawler', etc.
    ai_scraper_name       // 'GPTBot', 'ClaudeBot', etc.
  } = detection;

  // Block AI training crawlers from premium content
  if (ai_scraper_category === 'training_crawler' && ai_scraper_score >= 70) {
    return servePremiumContentBlocker();
  }

  // Allow search crawlers for SEO but log AI activity
  if (ai_scraper_category === 'search_crawler') {
    logAiCrawlerActivity(detection);
    return allowRequest();
  }

  // Challenge unknown scrapers
  if (ai_scraper_score >= 50 && ai_scraper_category === 'content_scraper') {
    return challengeWithCaptcha();
  }

  // Handle attack threats separately
  if (unified_score >= 70) {
    return blockAttackRequest();
  }

  return allowRequest();
}
```

### Tiered Content Access

Implement different content experiences based on AI scraper detection:
```js
function getContentVersion(detection) {
  const { ai_scraper_score, ai_scraper_category } = detection;

  // AI training crawlers: Serve summary only
  if (ai_scraper_category === 'training_crawler') {
    return 'summary_only';
  }

  // Search crawlers: Full content for indexing
  if (ai_scraper_category === 'search_crawler') {
    return 'full_content';
  }

  // Suspicious scrapers: Serve paywall
  if (ai_scraper_score >= 50) {
    return 'paywall';
  }

  // Normal visitors: Full experience
  return 'full_content';
}
```

### robots.txt Compliance Monitoring

WebDecoy can track whether AI crawlers are respecting your robots.txt directives:
```
Your robots.txt:
────────────────
User-agent: GPTBot
Disallow: /premium/
Disallow: /articles/

WebDecoy Detection:
───────────────────
GPTBot accessed: /articles/2024/my-exclusive-story.html
AI Scraper Score: 85
robots.txt Compliance: VIOLATED

→ This crawler is ignoring your robots.txt!
```
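A minimal sketch of such a compliance check, deliberately simplified: it matches a crawled path against the `Disallow` rules in the crawler's user-agent group. Full robots.txt parsing (wildcards, `Allow` precedence, longest-match rules per RFC 9309) is more involved.

```javascript
// Collect Disallow prefixes for a given user-agent (or the '*' group).
function disallowedPaths(robotsTxt, agent) {
  const rules = [];
  let inGroup = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim();
    const value = line.slice(colon + 1).trim();
    if (/^user-agent$/i.test(key)) inGroup = value === agent || value === '*';
    else if (inGroup && /^disallow$/i.test(key) && value) rules.push(value);
  }
  return rules;
}

// Flag a crawled path that falls under a Disallow rule.
function violatesRobots(robotsTxt, agent, path) {
  return disallowedPaths(robotsTxt, agent).some((rule) => path.startsWith(rule));
}
```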
### Dashboard Filtering

In the Detections table, filter by AI scraper activity:
#### Filter Options

| Filter | Effect |
|---|---|
| AI Scraper Score >= 70 | Show high-confidence AI crawler detections |
| Category = training_crawler | Show only AI training crawlers |
| AI Scraper Name = GPTBot | Show only GPTBot activity |
| AI Scraper Score > 0 | Show all AI scraper activity |
#### Insights to Look For

- Volume by Crawler: Which AI crawlers are hitting your site most?
- Target Pages: Which content is being scraped most heavily?
- Time Patterns: When do AI crawlers typically access your site?
- robots.txt Violations: Are crawlers ignoring your preferences?
- New Crawlers: Unknown scraping patterns that might be new AI crawlers
## Integration with Response Actions

Use AI Scraper scores to trigger automated responses:

### Block AI Training Crawlers

```
Response Action: Block AI Scrapers
Trigger: ai_scraper_score >= 70 AND ai_scraper_category = 'training_crawler'
Action: Block
Duration: Permanent
Notify: Weekly digest
```

### Rate Limit Search Crawlers

```
Response Action: Rate Limit Search Crawlers
Trigger: ai_scraper_category = 'search_crawler'
Action: Rate Limit
Rate: 100 requests/hour
Notify: On threshold breach
```

### Alert on High-Volume Scraping

```
Response Action: Scraping Alert
Trigger: ai_scraper_score >= 50 AND request_count > 1000/hour
Action: Webhook Alert
Endpoint: https://your-api.com/alerts
Notify: Immediately
```

## API Access

Access AI Scraper data via the WebDecoy API:
### Detection Response

```json
{
  "id": "det_abc123",
  "ip_address": "20.15.240.128",
  "user_agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0)",
  "path": "/articles/exclusive-content.html",

  "unified_score": 14,
  "threat_level": "MINIMAL",
  "category": "crawler",

  "ai_scraper_score": 85,
  "ai_scraper_level": "CRITICAL",
  "ai_scraper_category": "training_crawler",
  "ai_scraper_name": "GPTBot",
  "ai_scraper_confidence": 0.95,

  "score_components": {
    "user_agent_score": 100,
    "honeypot_score": 0,
    "attack_signature_score": 0,
    "ai_scraper_score": 85,
    "ai_scraper_confidence": 0.95,
    "ai_scraper_category": "training_crawler",
    "ai_scraper_name": "GPTBot"
  }
}
```

### Filtering by AI Scraper

```bash
# Get all AI training crawler detections
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.webdecoy.com/v1/detections?ai_scraper_category=training_crawler"

# Get high-score AI scraper activity
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.webdecoy.com/v1/detections?ai_scraper_score_gte=70"
```

## Future Considerations
### Expanding Crawler Detection

The AI landscape is evolving rapidly. WebDecoy regularly updates its crawler database as new AI training bots emerge. Factors we monitor:
- New AI Companies: Startups launching training crawlers
- Existing Companies: Tech giants adding AI features to existing bots
- IP Range Updates: Changes to AI company infrastructure
- Behavioral Patterns: New scraping techniques and evasion methods
### AI-Specific Content Directives

Some sites are experimenting with AI-specific content tags:

```html
<!-- Potential future standard -->
<meta name="ai-training" content="disallow">
<meta name="ai-summary" content="allow">
```

WebDecoy will track adoption and compliance with emerging standards.
## Legal and Licensing Implications

AI scraping raises significant copyright and licensing questions. WebDecoy provides the data layer for understanding what's being scraped. This can support:
- Licensing negotiations with AI companies
- Copyright enforcement evidence
- Terms of service violation documentation
- Opt-out compliance verification
## Summary

| Aspect | Attack Scoring | AI Scraper Scoring |
|---|---|---|
| Question | Is this an attack or bot? | Is this an AI training crawler? |
| User-Agent Weight | 1% (easily spoofed) | 70-85+ points (reliable ID) |
| Interpretation | Honest ID = low threat | Honest ID = high threat |
| Goal | Detect attacks and evasion | Detect content harvesting |
| Primary Users | Security teams | Publishers, content creators |
## Related Documentation

- Threat Scoring - Attack and bot detection scoring
- Response Actions - Automated responses to threats
- Bot Scanner - JavaScript-based bot detection