Intermediate Guide
Batch Processing Guide
Scale web scraping to thousands of URLs with efficient queue management, error recovery, and performance optimization strategies.
1. Using batch_scrape Tool
The batch_scrape tool handles up to 50 URLs concurrently with built-in rate limiting and webhook notifications.
Basic Batch Scraping
1 credit per URL (50 URLs = 50 credits)
Bash
curl -X POST https://crawlforge.dev/api/v1/tools/batch_scrape \
-H "X-API-Key: cf_test_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"formats": ["markdown"],
"maxConcurrency": 5,
"onlyMainContent": true
}'Async Processing with Webhooks
Ideal for large batches (100+ URLs) - get notified when complete
Typescript
// Start batch job with webhook
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
method: 'POST',
headers: {
'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
urls: largeUrlList, // 100+ URLs
formats: ['markdown'],
webhook: 'https://yourapp.com/webhook/batch-complete',
maxConcurrency: 10
}),
});
const { jobId } = await response.json();
console.log(`Batch job started: ${jobId}`);
// Webhook handler (Express.js example)
app.post('/webhook/batch-complete', (req, res) => {
const { jobId, status, results, errors } = req.body;
if (status === 'completed') {
console.log(`Job ${jobId} completed!`);
console.log(`Success: ${results.length} pages`);
console.log(`Errors: ${errors.length} pages`);
// Process results
results.forEach(result => {
saveToDatabase(result.url, result.content);
});
}
res.sendStatus(200);
});2. Queue Management
Process thousands of URLs by chunking them into batches and managing a queue.
Chunking Strategy
Break large URL lists into manageable batches
Typescript
// Chunk array into batches of 50
function chunkArray<T>(array: T[], size: number): T[][] {
const chunks: T[][] = [];
for (let i = 0; i < array.length; i += size) {
chunks.push(array.slice(i, i + size));
}
return chunks;
}
// Process all URLs in batches
async function processBatches(urls: string[]) {
const batches = chunkArray(urls, 50); // Max 50 URLs per batch
const allResults: any[] = [];
for (let i = 0; i < batches.length; i++) {
const batch = batches[i];
console.log(`Processing batch ${i + 1}/${batches.length}...`);
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
method: 'POST',
headers: {
'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
urls: batch,
formats: ['markdown'],
maxConcurrency: 10
}),
});
const data = await response.json();
allResults.push(...data.data.results);
// Wait between batches to respect rate limits
if (i < batches.length - 1) {
await new Promise(resolve => setTimeout(resolve, 2000));
}
}
return allResults;
}
// Usage
const urls = [...]; // 500 URLs
const results = await processBatches(urls);
console.log(`Scraped ${results.length} total pages`);Pro Tip: Use Redis or a database to store your queue. This allows you to resume processing if your script crashes or needs to restart.
3. Error Recovery
Handle failures gracefully with retry logic and error tracking.
Robust Error Handling
Typescript
interface BatchResult {
successful: any[];
failed: { url: string; error: string }[];
}
async function batchScrapeWithRetry(
urls: string[],
maxRetries = 3
): Promise<BatchResult> {
const successful: any[] = [];
const failed: { url: string; error: string }[] = [];
let remainingUrls = [...urls];
let retries = 0;
while (remainingUrls.length > 0 && retries <= maxRetries) {
try {
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
method: 'POST',
headers: {
'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
urls: remainingUrls,
formats: ['markdown'],
maxConcurrency: 5
}),
});
const data = await response.json();
// Separate successful and failed results
const batchSuccessful = data.data.results.filter((r: any) => r.success);
const batchFailed = data.data.results.filter((r: any) => !r.success);
successful.push(...batchSuccessful);
// Only retry failed URLs
remainingUrls = batchFailed.map((r: any) => r.url);
if (remainingUrls.length > 0) {
console.log(`Retrying ${remainingUrls.length} failed URLs...`);
retries++;
await new Promise(resolve => setTimeout(resolve, 2000 * retries));
}
} catch (error) {
console.error('Batch request failed:', error);
retries++;
if (retries > maxRetries) {
// Mark all remaining URLs as failed
failed.push(...remainingUrls.map(url => ({
url,
error: String(error)
})));
break;
}
await new Promise(resolve => setTimeout(resolve, 2000 * retries));
}
}
return { successful, failed };
}
// Usage
const { successful, failed } = await batchScrapeWithRetry(urls);
console.log(`Success: ${successful.length}, Failed: ${failed.length}`);
// Save failed URLs for manual review
if (failed.length > 0) {
fs.writeFileSync('failed-urls.json', JSON.stringify(failed, null, 2));
}4. Performance Optimization
Maximize throughput and minimize costs with these optimization strategies.
Optimize Concurrency
Start with
maxConcurrency: 5, increase to 10 for Professional/Business plansUse onlyMainContent
Set
onlyMainContent: true to reduce response size by 60-80%Choose Minimal Formats
Use
formats: ["markdown"] instead of multiple formats (html, text, screenshot)Cache Results
Store scraped data in Redis/database to avoid re-scraping same URLs
Avoid Over-Batching
Don't exceed 50 URLs per batch - split into multiple requests instead
Don't Ignore Rate Limits
Respect your plan's rate limits (Free: 5/s, Hobby: 10/s, Pro: 50/s, Business: 100/s)
Expected Performance
| Scenario | Time | Settings |
|---|---|---|
| Small Batch (10 URLs) | ~5 seconds | maxConcurrency: 5 |
| Medium Batch (50 URLs) | ~15 seconds | maxConcurrency: 10 |
| Large Batch (500 URLs) | ~3 minutes | 10 batches × 50 URLs |
| Massive Batch (5,000 URLs) | ~30 minutes | 100 batches × 50 URLs |
Next Steps
Continue learning with more advanced guides