CrawlForge
Intermediate Guide

Batch Processing Guide

Scale web scraping to thousands of URLs with efficient queue management, error recovery, and performance optimization strategies.

Using batch_scrape Tool
Queue Management
Error Recovery
Performance Optimization

1. Using batch_scrape Tool

The batch_scrape tool accepts up to 50 URLs per request and scrapes them in parallel (up to maxConcurrency at a time), with built-in rate limiting and optional webhook notifications.

Basic Batch Scraping

1 credit per URL (50 URLs = 50 credits)

Bash
curl -X POST https://crawlforge.dev/api/v1/tools/batch_scrape \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2",
      "https://example.com/page3"
    ],
    "formats": ["markdown"],
    "maxConcurrency": 5,
    "onlyMainContent": true
  }'
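
The same request body can also be assembled programmatically. The following is a minimal sketch rather than part of any CrawlForge SDK: `buildBatchPayload` and `MAX_URLS_PER_BATCH` are hypothetical names, the field names mirror the curl example above, and the defaults are assumptions. It deduplicates the URL list and enforces the 50-URL limit described in this guide before anything is sent.

```typescript
// Hypothetical helper (not a CrawlForge API): builds a batch_scrape request body.
interface BatchPayload {
  urls: string[];
  formats: string[];
  maxConcurrency: number;
  onlyMainContent: boolean;
}

const MAX_URLS_PER_BATCH = 50; // per-batch limit stated in this guide

function buildBatchPayload(urls: string[], maxConcurrency = 5): BatchPayload {
  // Trim whitespace, drop empty entries and duplicates (insertion order is kept)
  const unique = Array.from(new Set(urls.map(u => u.trim()))).filter(u => u.length > 0);

  if (unique.length === 0) {
    throw new Error('At least one URL is required');
  }
  if (unique.length > MAX_URLS_PER_BATCH) {
    throw new Error(`Too many URLs: ${unique.length} (max ${MAX_URLS_PER_BATCH})`);
  }

  // Reject anything that is not an http(s) URL before spending credits on it
  const invalid = unique.filter(u => !/^https?:\/\//i.test(u));
  if (invalid.length > 0) {
    throw new Error(`Not an http(s) URL: ${invalid[0]}`);
  }

  return {
    urls: unique,
    formats: ['markdown'],
    maxConcurrency,
    onlyMainContent: true,
  };
}
```

The result can be passed to `JSON.stringify` and used as the request body. Validating before the request avoids paying credits for duplicate URLs.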

Async Processing with Webhooks

Ideal for large batches (100+ URLs) - get notified when complete

Typescript
// Start batch job with webhook
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
  method: 'POST',
  headers: {
    'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    urls: largeUrlList, // 100+ URLs
    formats: ['markdown'],
    webhook: 'https://yourapp.com/webhook/batch-complete',
    maxConcurrency: 10
  }),
});

const { jobId } = await response.json();
console.log(`Batch job started: ${jobId}`);

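// Webhook handler (Express.js example)
// Assumes `app` is an Express app with app.use(express.json()) registered;
// without that middleware req.body is undefined
app.post('/webhook/batch-complete', (req, res) => {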
  const { jobId, status, results, errors } = req.body;

  if (status === 'completed') {
    console.log(`Job ${jobId} completed!`);
    console.log(`Success: ${results.length} pages`);
    console.log(`Errors: ${errors.length} pages`);

    // Process results
    results.forEach(result => {
      saveToDatabase(result.url, result.content);
    });
  }

  res.sendStatus(200);
});
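
Because a webhook endpoint is publicly reachable, it is worth checking the request body before trusting it. The sketch below is illustrative only: `BatchWebhookPayload` and `isBatchWebhookPayload` are hypothetical names, and the payload shape is an assumption inferred from the fields the handler above reads (`jobId`, `status`, `results`, `errors`), not a documented schema.

```typescript
// Assumed payload shape, inferred from the fields the webhook handler reads.
interface BatchWebhookPayload {
  jobId: string;
  status: string;
  results: unknown[];
  errors: unknown[];
}

// Type guard: returns true only if the body has the fields the handler relies on
function isBatchWebhookPayload(body: unknown): body is BatchWebhookPayload {
  if (typeof body !== 'object' || body === null) {
    return false;
  }
  const candidate = body as Record<string, unknown>;
  return (
    typeof candidate['jobId'] === 'string' &&
    typeof candidate['status'] === 'string' &&
    Array.isArray(candidate['results']) &&
    Array.isArray(candidate['errors'])
  );
}
```

Inside the handler, a failed check could be answered with `res.sendStatus(400)` before any results are processed.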

2. Queue Management

Process thousands of URLs by chunking them into batches and managing a queue.

Chunking Strategy

Break large URL lists into manageable batches

Typescript
// Chunk array into batches of 50
function chunkArray<T>(array: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < array.length; i += size) {
    chunks.push(array.slice(i, i + size));
  }
  return chunks;
}

// Process all URLs in batches
async function processBatches(urls: string[]) {
  const batches = chunkArray(urls, 50); // Max 50 URLs per batch
  const allResults: any[] = [];

  for (let i = 0; i < batches.length; i++) {
    const batch = batches[i];
    console.log(`Processing batch ${i + 1}/${batches.length}...`);

    const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
      method: 'POST',
      headers: {
        'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        urls: batch,
        formats: ['markdown'],
        maxConcurrency: 10
      }),
    });

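    // Fail fast on HTTP errors instead of parsing an error body as results
    if (!response.ok) {
      throw new Error(`Batch ${i + 1} failed with HTTP ${response.status}`);
    }

    const data = await response.json();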
    allResults.push(...data.data.results);

    // Wait between batches to respect rate limits
    if (i < batches.length - 1) {
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
  }

  return allResults;
}

// Usage
const urls: string[] = [/* ... 500 URLs ... */];
const results = await processBatches(urls);
console.log(`Scraped ${results.length} total pages`);

Pro Tip: Use Redis or a database to store your queue. This allows you to resume processing if your script crashes or needs to restart.
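
The resume behaviour described in the tip can be sketched as follows. Everything here is hypothetical: `QueueStore`, `InMemoryQueueStore`, and `drainQueue` are illustrative names, and the in-memory store only stands in for Redis or a database so that the example runs on its own. The key idea is that a batch is marked done only after it succeeds, so a restart picks up the remaining URLs.

```typescript
// Hypothetical storage interface; back it with Redis or a database in production.
interface QueueStore {
  loadPending(): string[];
  markDone(urls: string[]): void;
}

// In-memory stand-in so the sketch is runnable without external services
class InMemoryQueueStore implements QueueStore {
  private all: string[];
  private done = new Set<string>();

  constructor(urls: string[]) {
    this.all = urls;
  }

  loadPending(): string[] {
    return this.all.filter(url => !this.done.has(url));
  }

  markDone(urls: string[]): void {
    urls.forEach(url => this.done.add(url));
  }
}

// Processes pending URLs batch by batch and checkpoints after each batch.
// scrapeBatch is supplied by the caller (for example, a batch_scrape request).
async function drainQueue(
  store: QueueStore,
  scrapeBatch: (urls: string[]) => Promise<void>,
  batchSize = 50,
): Promise<number> {
  let processed = 0;
  let pending = store.loadPending();

  while (pending.length > 0) {
    const batch = pending.slice(0, batchSize);
    await scrapeBatch(batch); // if this throws, the batch stays pending
    store.markDone(batch);    // checkpoint only after success
    processed += batch.length;
    pending = store.loadPending();
  }

  return processed;
}
```

If `scrapeBatch` throws midway, calling `drainQueue` again with the same store continues from the first unfinished batch.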

3. Error Recovery

Handle failures gracefully with retry logic and error tracking.

Robust Error Handling

Typescript
import * as fs from 'node:fs'; // needed for fs.writeFileSync in the usage example

interface BatchResult {
  successful: any[];
  failed: { url: string; error: string }[];
}

async function batchScrapeWithRetry(
  urls: string[],
  maxRetries = 3
): Promise<BatchResult> {
  const successful: any[] = [];
  const failed: { url: string; error: string }[] = [];
  let remainingUrls = [...urls];
  let retries = 0;

  while (remainingUrls.length > 0 && retries <= maxRetries) {
    try {
      const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
        method: 'POST',
        headers: {
          'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          urls: remainingUrls,
          formats: ['markdown'],
          maxConcurrency: 5
        }),
      });

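      // Treat HTTP errors as failures so the catch block retries the request
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      const data = await response.json();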

      // Separate successful and failed results
      const batchSuccessful = data.data.results.filter((r: any) => r.success);
      const batchFailed = data.data.results.filter((r: any) => !r.success);

      successful.push(...batchSuccessful);

      // Only retry failed URLs
      remainingUrls = batchFailed.map((r: any) => r.url);

      if (remainingUrls.length > 0) {
        console.log(`Retrying ${remainingUrls.length} failed URLs...`);
        retries++;
        await new Promise(resolve => setTimeout(resolve, 2000 * retries));
      }

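    } catch (error) {
      console.error('Batch request failed:', error);
      retries++;

      if (retries > maxRetries) {
        // Mark all remaining URLs as failed
        failed.push(...remainingUrls.map(url => ({
          url,
          error: String(error)
        })));
        remainingUrls = [];
        break;
      }

      await new Promise(resolve => setTimeout(resolve, 2000 * retries));
    }
  }

  // URLs that were still failing when the retries ran out must be reported too
  failed.push(...remainingUrls.map(url => ({
    url,
    error: `Still failing after ${maxRetries} retries`
  })));

  return { successful, failed };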
}

// Usage
const { successful, failed } = await batchScrapeWithRetry(urls);
console.log(`Success: ${successful.length}, Failed: ${failed.length}`);

// Save failed URLs for manual review
if (failed.length > 0) {
  fs.writeFileSync('failed-urls.json', JSON.stringify(failed, null, 2));
}
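
The retry loop above waits `2000 * retries` milliseconds, which is a linear backoff. A common alternative is exponential backoff with jitter, sketched below. This is not something the guide prescribes: `backoffDelayMs` is a hypothetical helper, and the base delay, cap, and jitter range are assumptions chosen for illustration.

```typescript
// Exponential backoff with jitter (an alternative to the linear delay above).
// All defaults are assumptions, not CrawlForge recommendations.
function backoffDelayMs(
  attempt: number,                    // 1 for the first retry, 2 for the second, ...
  baseMs = 2000,
  maxMs = 30000,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  // Double the ceiling on each attempt, but never exceed maxMs
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));

  // Pick a delay in [ceiling / 2, ceiling] so clients do not retry in lockstep
  return Math.round(ceiling / 2 + random() * (ceiling / 2));
}
```

It could replace the `2000 * retries` expression in either `setTimeout` call. The cap keeps late retries from waiting indefinitely, and the random component spreads out retries from parallel workers.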

4. Performance Optimization

Maximize throughput and minimize costs with these optimization strategies.

Optimize Concurrency: Start with maxConcurrency: 5, then increase to 10 on Professional/Business plans.
Use onlyMainContent: Set onlyMainContent: true to reduce response size by 60-80%.
Choose Minimal Formats: Use formats: ["markdown"] instead of multiple formats (html, text, screenshot).
Cache Results: Store scraped data in Redis or a database to avoid re-scraping the same URLs.
Avoid Over-Batching: Don't exceed 50 URLs per batch; split into multiple requests instead.
Don't Ignore Rate Limits: Respect your plan's rate limits (Free: 5/s, Hobby: 10/s, Pro: 50/s, Business: 100/s).
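
The caching advice can be illustrated with a small in-memory cache. This is a sketch under stated assumptions: `ScrapeCache` is a hypothetical class, the time-to-live policy is an assumption, and the `Map` stands in for Redis or a database so the example runs on its own.

```typescript
// Minimal in-memory cache sketch; swap the Map for Redis or a database in production.
class ScrapeCache {
  private entries = new Map<string, { content: string; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(url: string, content: string): void {
    this.entries.set(url, { content, expiresAt: this.now() + this.ttlMs });
  }

  get(url: string): string | undefined {
    const entry = this.entries.get(url);
    if (!entry) {
      return undefined;
    }
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(url); // expired: evict so the URL is scraped again
      return undefined;
    }
    return entry.content;
  }

  // Returns only the URLs that still need to be scraped
  filterUncached(urls: string[]): string[] {
    return urls.filter(url => this.get(url) === undefined);
  }
}
```

Calling `filterUncached` before building a batch means credits are spent only on URLs without a fresh cached copy.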

Expected Performance

| Scenario | Time | Settings |
| --- | --- | --- |
| Small Batch (10 URLs) | ~5 seconds | maxConcurrency: 5 |
| Medium Batch (50 URLs) | ~15 seconds | maxConcurrency: 10 |
| Large Batch (500 URLs) | ~3 minutes | 10 batches × 50 URLs |
| Massive Batch (5,000 URLs) | ~30 minutes | 100 batches × 50 URLs |
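
The multi-batch figures follow roughly from the per-batch time plus the pause between batches. The sketch below makes that arithmetic explicit. `estimateBatchRun` is a hypothetical helper, and its defaults are assumptions taken from this guide: about 15 seconds per 50-URL batch, a 2-second pause between batches, and 1 credit per URL.

```typescript
interface BatchEstimate {
  batches: number;
  seconds: number;
  credits: number;
}

// Rough estimate only; real timings depend on the target sites and your plan.
function estimateBatchRun(
  totalUrls: number,
  batchSize = 50,       // max URLs per batch stated in this guide
  secondsPerBatch = 15, // "Medium Batch" figure from the table
  pauseSeconds = 2,     // pause between batches used in processBatches
): BatchEstimate {
  const batches = Math.ceil(totalUrls / batchSize);
  const pauses = Math.max(0, batches - 1); // no pause after the last batch

  return {
    batches,
    seconds: batches * secondsPerBatch + pauses * pauseSeconds,
    credits: totalUrls, // 1 credit per URL
  };
}
```

With these defaults, 500 URLs comes to 168 seconds (just under 3 minutes) and 5,000 URLs to 1,698 seconds (about 28 minutes), in line with the table.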