Modern websites employ sophisticated anti-bot systems that block traditional scrapers. This technical deep-dive explains how these systems work and how CrawlForge's stealth mode helps you access data ethically and effectively.
The Challenge: Modern Anti-Bot Systems
Web scraping has evolved into an arms race. Websites deploy multiple layers of protection:
Detection Methods
- Browser Fingerprinting
  - Canvas fingerprint
  - WebGL renderer
  - Audio context
  - Font enumeration
  - Navigator properties (including the user-agent string)
- Behavior Analysis
  - Mouse movements
  - Scroll patterns
  - Click timing
  - Keyboard input
  - Page interaction sequences
- Request Analysis
  - TLS fingerprint (JA3)
  - HTTP/2 settings
  - Header order
  - Cookie behavior
  - Request timing
- Network Signals
  - IP reputation
  - Datacenter detection
  - VPN/proxy detection
  - Geographic consistency
Popular Anti-Bot Services
| Service | Detection Focus | Difficulty |
|---|---|---|
| Cloudflare Bot Management | JS challenges, fingerprinting | High |
| Akamai Bot Manager | Behavior analysis | High |
| PerimeterX | Fingerprinting, behavior | High |
| Imperva | Request patterns | Medium |
| DataDome | Real-time ML detection | Very High |
| reCAPTCHA | Human verification | Variable |
How Detection Works: A Technical Overview
Step 1: Initial Request
When your scraper sends a request:
GET /page HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0...
Accept: text/html...
Anti-bot systems analyze:
- Header order (browsers have consistent patterns)
- TLS handshake fingerprint
- IP reputation database lookup
- Initial request timing
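To make the header-order signal concrete, here is a minimal sketch of how a detector might score the order of incoming headers against the order a browser family is expected to send. The `EXPECTED_ORDER` list and the `headerOrderScore` function are hypothetical illustrations, not any vendor's actual rules.

```typescript
// Hypothetical reference order for a browser-like client (illustrative only).
const EXPECTED_ORDER: string[] = [
  "host",
  "user-agent",
  "accept",
  "accept-encoding",
  "accept-language",
];

/**
 * Returns the fraction of adjacent known headers that appear in the same
 * relative order as in `expected`. Headers not listed in `expected` are ignored.
 * 1 means the order matches; lower values mean more pairs are out of order.
 */
function headerOrderScore(received: string[], expected: string[] = EXPECTED_ORDER): number {
  const rank = new Map<string, number>();
  expected.forEach((name, i) => rank.set(name, i));

  const ranks: number[] = [];
  for (const name of received) {
    const r = rank.get(name.toLowerCase());
    if (r !== undefined) ranks.push(r);
  }
  if (ranks.length < 2) return 1; // Not enough known headers to judge.

  let inOrder = 0;
  for (let i = 1; i < ranks.length; i++) {
    if (ranks[i] > ranks[i - 1]) inOrder++;
  }
  return inOrder / (ranks.length - 1);
}

const browserLike = ["Host", "User-Agent", "Accept", "Accept-Encoding", "Accept-Language"];
const scriptLike = ["Accept", "User-Agent", "Host"];
```

In this sketch `headerOrderScore(browserLike)` is 1 and `headerOrderScore(scriptLike)` is 0. A real system would likely combine a score like this with the TLS and IP signals listed above rather than act on it alone.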
Step 2: JavaScript Challenge
If the request passes initial checks, the page loads a JavaScript challenge:
// Cloudflare-style challenge
(function() {
  var challenge = document.createElement('script');
  challenge.src = '/cdn-cgi/challenge-platform/...';
  challenge.onload = function() {
    // Run fingerprinting
    var fp = {
      canvas: getCanvasFingerprint(),
      webgl: getWebGLFingerprint(),
      audio: getAudioFingerprint(),
      fonts: getInstalledFonts(),
      // ... 50+ signals
    };
    // Submit for analysis
    sendFingerprint(fp);
  };
  document.head.appendChild(challenge);
})();
Step 3: Behavior Monitoring
Protected pages continuously monitor behavior:
document.addEventListener('mousemove', recordMousePosition);
document.addEventListener('scroll', recordScrollBehavior);
document.addEventListener('click', recordClickTiming);
// ML model analyzes for bot-like patterns:
// - Linear mouse movements (bots)
// - Instant scrolling (bots)
// - Perfectly timed clicks (bots)
// - No micro-movements (bots)
CrawlForge's Stealth Mode Architecture
CrawlForge's stealth_mode tool addresses each detection layer:
Layer 1: Fingerprint Randomization
// Configure stealth with fingerprint settings
{
  "stealthConfig": {
    "level": "advanced",
    "fingerprinting": {
      "canvasNoise": true, // Add noise to canvas fingerprint
      "webglSpoofing": true, // Randomize WebGL renderer
      "audioContextSpoofing": true, // Modify audio fingerprint
      "fontSpoofing": true, // Limit visible fonts
      "hardwareSpoofing": true // Fake hardware concurrency
    }
  }
}
How it works:
| Signal | Detection | Stealth Solution |
|---|---|---|
| Canvas | Pixel-level fingerprint | Add imperceptible noise |
| WebGL | GPU renderer string | Spoof to common renderer |
| Audio | AudioContext fingerprint | Modify signal processing |
| Fonts | Enumerate installed fonts | Return common font set |
| Hardware | CPU cores, memory | Report typical values |
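The canvas row can be illustrated with a small sketch. It is not CrawlForge's implementation; it only shows the general idea that flipping the lowest bit of a few colour channels changes a hash-based fingerprint while keeping every pixel within 1 of its original value. The function names, the FNV-1a hash standing in for a fingerprint, and the seeded PRNG are all assumptions made for illustration.

```typescript
/** 32-bit FNV-1a hash over raw bytes, standing in for a canvas fingerprint. */
function fnv1a(bytes: Uint8ClampedArray): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

/** Small seeded PRNG (mulberry32) so the noise is reproducible for one session. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/**
 * Returns a copy of RGBA pixel data with the lowest bit flipped on roughly
 * `rate` of the colour channels. The alpha channel is left untouched.
 */
function addCanvasNoise(pixels: Uint8ClampedArray, seed: number, rate = 0.05): Uint8ClampedArray {
  const rand = mulberry32(seed);
  const out = new Uint8ClampedArray(pixels);
  for (let i = 0; i < out.length; i++) {
    if (i % 4 === 3) continue; // skip alpha
    if (rand() < rate) out[i] ^= 1; // changes the value by exactly 1
  }
  return out;
}

// Two RGBA pixels; rate = 1 flips every colour channel so the effect is visible.
const original = new Uint8ClampedArray([10, 20, 30, 255, 200, 100, 50, 255]);
const noisy = addCanvasNoise(original, 42, 1);
```

Using a per-session seed keeps the fingerprint stable within a session while differing between sessions, which is one plausible reason to prefer seeded noise over fresh randomness on every read.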
Layer 2: Anti-Detection Evasion
{
  "stealthConfig": {
    "antiDetection": {
      "hideAutomation": true, // Remove webdriver flags
      "cloudflareBypass": true, // Handle CF challenges
      "recaptchaHandling": true, // Solve reCAPTCHA
      "spoofBatteryAPI": true, // Fake battery info
      "spoofMediaDevices": true // Fake media devices
    }
  }
}
Webdriver Detection Bypass:
Regular Puppeteer/Playwright:
navigator.webdriver // true (DETECTED!)
CrawlForge Stealth:
navigator.webdriver // undefined (passes detection)
Layer 3: Human Behavior Simulation
{
  "stealthConfig": {
    "simulateHumanBehavior": true
  }
}
CrawlForge simulates realistic human interactions:
| Behavior | Bot Pattern | Human Simulation |
|---|---|---|
| Mouse movement | Linear, instant | Curved, varied speed |
| Scrolling | Instant jumps | Smooth, variable |
| Clicks | Precise, instant | Small offset, delay |
| Typing | Perfect, instant | Variable speed, pauses |
| Reading | None | Scroll-stop patterns |
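The mouse-movement row can be sketched from both sides: generating a curved path, and measuring how linear a path is. This is an illustrative sketch only; `curvedPath` and `pathStraightness` are hypothetical helpers, and a quadratic Bezier curve is just one simple way to bend a path. Variable speed and micro-movements are not modeled here.

```typescript
interface Point {
  x: number;
  y: number;
}

/** Samples a quadratic Bezier curve from `from` to `to`, bent toward `control`. */
function curvedPath(from: Point, to: Point, control: Point, steps: number): Point[] {
  const path: Point[] = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    path.push({
      x: u * u * from.x + 2 * u * t * control.x + t * t * to.x,
      y: u * u * from.y + 2 * u * t * control.y + t * t * to.y,
    });
  }
  return path;
}

/** Travelled distance divided by straight-line distance; 1 means perfectly linear. */
function pathStraightness(path: Point[]): number {
  let travelled = 0;
  for (let i = 1; i < path.length; i++) {
    travelled += Math.hypot(path[i].x - path[i - 1].x, path[i].y - path[i - 1].y);
  }
  const first = path[0];
  const last = path[path.length - 1];
  const direct = Math.hypot(last.x - first.x, last.y - first.y);
  return direct === 0 ? Infinity : travelled / direct;
}

const start: Point = { x: 0, y: 0 };
const end: Point = { x: 300, y: 0 };
// Control point on the line gives a straight, bot-like path.
const linear = curvedPath(start, end, { x: 150, y: 0 }, 20);
// Control point off the line gives an arc.
const curved = curvedPath(start, end, { x: 150, y: 120 }, 20);
```

A detector could flag paths whose straightness ratio is almost exactly 1, while a simulator would aim for ratios slightly above 1 with a randomly chosen control point.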
Layer 4: Network-Level Stealth
{
  "stealthConfig": {
    "proxyRotation": {
      "enabled": true,
      "proxies": ["residential-proxy-pool"],
      "rotationInterval": 300000 // Rotate every 5 min
    },
    "blockWebRTC": true, // Prevent IP leak
    "randomizeHeaders": true // Vary header order
  }
}
Using Stealth Mode in Practice
Basic Stealth Scraping
// In Claude Code:
"Enable stealth mode and scrape https://protected-site.com"
// CrawlForge automatically:
// 1. Configures stealth browser context
// 2. Randomizes fingerprint
// 3. Simulates human behavior
// 4. Returns clean data
Advanced Configuration
For heavily protected sites:
// Using the stealth_mode tool directly:
{
  "operation": "create_context",
  "stealthConfig": {
    "level": "advanced",
    "hideWebDriver": true,
    "randomizeFingerprint": true,
    "simulateHumanBehavior": true,
    "fingerprinting": {
      "canvasNoise": true,
      "webglSpoofing": true,
      "audioContextSpoofing": true
    },
    "antiDetection": {
      "cloudflareBypass": true,
      "hideAutomation": true
    },
    "proxyRotation": {
      "enabled": true
    }
  },
  "urlToTest": "https://heavily-protected.com"
}
Handling Cloudflare
Cloudflare is one of the most common challenges. CrawlForge handles it automatically:
// Standard request to CF-protected site:
"Fetch content from https://cloudflare-protected.com/data"
// CrawlForge automatically:
// 1. Detects Cloudflare challenge
// 2. Enables stealth mode
// 3. Solves JavaScript challenge
// 4. Completes Turnstile if needed
// 5. Returns page content
When to Use Stealth vs Basic Tools
Use Basic Tools (fetch_url, extract_text) When:
- Target site has no bot protection
- Site allows crawling (check robots.txt)
- You're accessing public APIs
- Speed is more important than stealth
Credits: 1-2 per request
Use Stealth Mode When:
- Site has Cloudflare or similar protection
- Basic requests get blocked or trigger CAPTCHAs
- You need to access dynamic content
- Site actively blocks datacenter IPs
Credits: 5 per request
Use scrape_with_actions + Stealth When:
- Site requires login or form submission
- Content loads via infinite scroll
- You need to interact with page elements
- Multi-step navigation is required
Credits: 5+ per request
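The decision rules above can be sketched as a small function. The `SiteProfile` shape and the function names are hypothetical and exist only for this illustration; the tool names and minimum credit figures are the ones listed above.

```typescript
// Hypothetical description of a target site; the field names are assumptions.
interface SiteProfile {
  hasBotProtection: boolean; // Cloudflare or similar
  blocksDatacenterIps: boolean;
  needsDynamicContent: boolean;
  needsInteraction: boolean; // login, forms, infinite scroll, multi-step navigation
}

interface ToolChoice {
  tool: "fetch_url" | "stealth_mode" | "scrape_with_actions + stealth";
  minCredits: number; // lower bound per request, from the figures above
}

function chooseTool(site: SiteProfile): ToolChoice {
  if (site.needsInteraction) {
    return { tool: "scrape_with_actions + stealth", minCredits: 5 };
  }
  if (site.hasBotProtection || site.blocksDatacenterIps || site.needsDynamicContent) {
    return { tool: "stealth_mode", minCredits: 5 };
  }
  return { tool: "fetch_url", minCredits: 1 };
}

/** Lower-bound credit estimate for scraping `pages` pages of one site. */
function estimateMinCredits(site: SiteProfile, pages: number): number {
  return chooseTool(site).minCredits * pages;
}

const openSite: SiteProfile = {
  hasBotProtection: false,
  blocksDatacenterIps: false,
  needsDynamicContent: false,
  needsInteraction: false,
};
const protectedSite: SiteProfile = { ...openSite, hasBotProtection: true };
```

Under these assumptions, 200 pages of `protectedSite` cost at least 1,000 credits, against at least 200 for `openSite`, which is why starting with the basic tools is usually worth trying first.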
Detection Test Results
We tested CrawlForge against popular detection services:
| Service | Basic Mode | Stealth Mode |
|---|---|---|
| Cloudflare | Blocked | ✅ Pass |
| Akamai | Blocked | ✅ Pass |
| PerimeterX | Blocked | ✅ Pass |
| DataDome | Blocked | ⚠️ Partial |
| Imperva | ✅ Pass | ✅ Pass |
| reCAPTCHA v2 | Blocked | ✅ Pass |
| reCAPTCHA v3 | Blocked | ⚠️ Score varies |
Note: Results may vary based on site configuration and IP reputation.
Ethical Considerations
Stealth scraping is a powerful capability. Use it responsibly:
Do:
- ✅ Respect robots.txt (even if bypassing detection)
- ✅ Rate limit requests (don't overwhelm servers)
- ✅ Scrape only public information
- ✅ Check Terms of Service
- ✅ Use for legitimate business purposes
Don't:
- ❌ Scrape personal data without consent
- ❌ Bypass paywalls for copyrighted content
- ❌ Flood sites with requests
- ❌ Scrape for spam or malicious purposes
- ❌ Ignore cease-and-desist requests
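For the "Respect robots.txt" item, a pre-flight check can be sketched as below. This is a deliberately simplified parser written for illustration: it handles only `User-agent`, `Allow`, and `Disallow` lines with plain prefix matching, and it does not support wildcards or `$` anchors. The function name `isPathAllowed` is an assumption; a production crawler should rely on a complete robots.txt implementation.

```typescript
interface Rule {
  allow: boolean;
  prefix: string;
}

/**
 * Simplified robots.txt check using prefix matching only.
 * Uses the group for `userAgent` if one exists, otherwise the `*` group.
 */
function isPathAllowed(robotsTxt: string, path: string, userAgent = "*"): boolean {
  const groups = new Map<string, Rule[]>();
  let currentAgents: string[] = [];
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (!lastWasAgent) currentAgents = []; // a new group starts here
      const agent = value.toLowerCase();
      currentAgents.push(agent);
      if (!groups.has(agent)) groups.set(agent, []);
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      lastWasAgent = false;
      if (value === "") continue; // an empty rule restricts nothing
      for (const agent of currentAgents) {
        groups.get(agent)!.push({ allow: field === "allow", prefix: value });
      }
    }
  }

  const rules = groups.get(userAgent.toLowerCase()) ?? groups.get("*") ?? [];
  // The longest matching prefix wins; Allow wins ties.
  let best: Rule | undefined;
  for (const rule of rules) {
    if (!path.startsWith(rule.prefix)) continue;
    const longer = !best || rule.prefix.length > best.prefix.length;
    const tieAllow = !!best && rule.prefix.length === best.prefix.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}

const robots = "User-agent: *\nDisallow: /private/\nAllow: /private/press/\n";
```

With this sample file, `/products` and `/private/press/2024` are allowed and `/private/account` is not. Running a check like this before each crawl keeps the decision explicit even when stealth mode would technically get through.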
Legal Framework
Many jurisdictions permit scraping of publicly available data for purposes such as:
- Price comparison
- Market research
- Academic research
- News aggregation
Always consult legal counsel for your specific use case.
Best Practices for Production
1. Progressive Stealth Levels
Start with the lowest stealth level and escalate only if needed:
async function smartScrape(url: string) {
  // Try basic first (1 credit)
  let result = await fetchUrl(url);
  if (result.success) return result;
  // Try medium stealth (3 credits)
  result = await stealthMode(url, { level: "medium" });
  if (result.success) return result;
  // Try advanced stealth (5 credits)
  return await stealthMode(url, { level: "advanced" });
}
2. Request Timing
Add realistic delays between requests:
// Helper: resolves after `ms` milliseconds
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
// Bad: Instant sequential requests
for (const url of urls) {
  await scrape(url); // Blocked after 5-10 requests
}
// Good: Random delays
for (const url of urls) {
  await scrape(url);
  await sleep(2000 + Math.random() * 3000); // 2-5s delay
}
3. Session Rotation
Rotate browser contexts to avoid fingerprint correlation:
{
  "stealthConfig": {
    "sessionRotation": {
      "enabled": true,
      "rotateAfter": 10, // New context every 10 requests
      "regenerateFingerprint": true
    }
  }
}
Troubleshooting
Still Getting Blocked?
- Check IP reputation: Datacenter IPs are often blacklisted
- Enable proxy rotation: Use residential proxies
- Increase stealth level: Try "advanced" mode
- Add delays: Wait 5-10 seconds between requests
- Check for CAPTCHAs: Some require manual solving
Performance Issues?
Stealth mode is slower than basic scraping:
| Mode | Avg Response Time |
|---|---|
| Basic (fetch_url) | 0.5-1s |
| Stealth (medium) | 2-3s |
| Stealth (advanced) | 4-6s |
Optimize by:
- Using batch_scrape for multiple URLs
- Caching results aggressively
- Running requests in parallel
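The caching and parallelism points can be combined in a short sketch. Both helpers below are hypothetical and independent of CrawlForge's own `batch_scrape`; they only illustrate bounded concurrency and promise-level caching around whatever scrape function you already have.

```typescript
/**
 * Runs `task` over `items` with at most `limit` tasks in flight at once,
 * returning results in input order.
 */
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two workers share an index
      results[index] = await task(items[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

/** Wraps an async function so repeated calls with the same key share one promise. */
function cached<R>(fn: (key: string) => Promise<R>): (key: string) => Promise<R> {
  const cache = new Map<string, Promise<R>>();
  return (key: string) => {
    let hit = cache.get(key);
    if (!hit) {
      hit = fn(key);
      cache.set(key, hit);
      // Do not keep failures in the cache, so a later call can retry.
      hit.catch(() => cache.delete(key));
    }
    return hit;
  };
}
```

As a rough estimate using the table above, 100 URLs at 4-6s each in advanced mode take about 400-600s sequentially and about 80-120s with a limit of 5. Keep the limit low enough that it stays consistent with the rate-limiting advice earlier in this article.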