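Intermediate Guide

Batch Processing Guide

Scale web scraping to thousands of URLs with efficient queue management, error recovery, and performance optimization strategies.

1. Using the batch_scrape Tool

The batch_scrape tool accepts up to 50 URLs per request and processes them concurrently, with built-in rate limiting and webhook notifications.

Basic Batch Scraping

1 credit per URL (50 URLs = 50 credits)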

Bash

curl -X POST https://crawlforge.dev/api/v1/tools/batch_scrape \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2",
      "https://example.com/page3"
    ],
    "formats": ["markdown"],
    "maxConcurrency": 5,
    "onlyMainContent": true
  }'
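
The same request body can also be assembled and checked in TypeScript before it is sent. This is only a sketch: `buildBatchRequest` and `BatchScrapeRequest` are hypothetical names, the field names are copied from the curl example above, and the 50-URL cap and 1-credit-per-URL cost are the figures stated in this guide, not a verified API schema.

```typescript
// Hypothetical helper: field names mirror the curl example above.
interface BatchScrapeRequest {
  urls: string[];
  formats: string[];
  maxConcurrency: number;
  onlyMainContent: boolean;
}

const MAX_URLS_PER_BATCH = 50; // limit stated in this guide
const CREDITS_PER_URL = 1;     // cost stated in this guide

function buildBatchRequest(
  urls: string[],
  maxConcurrency = 5
): { body: BatchScrapeRequest; estimatedCredits: number } {
  // Drop duplicates so the same page is not paid for twice.
  const uniqueUrls = Array.from(new Set(urls));

  if (uniqueUrls.length === 0) {
    throw new Error('At least one URL is required');
  }
  if (uniqueUrls.length > MAX_URLS_PER_BATCH) {
    throw new Error(`Too many URLs: ${uniqueUrls.length} > ${MAX_URLS_PER_BATCH}`);
  }

  // Fail early on malformed URLs instead of spending a request on them.
  for (const url of uniqueUrls) {
    new URL(url); // throws a TypeError if the URL cannot be parsed
  }

  return {
    body: { urls: uniqueUrls, formats: ['markdown'], maxConcurrency, onlyMainContent: true },
    estimatedCredits: uniqueUrls.length * CREDITS_PER_URL,
  };
}
```

`JSON.stringify(buildBatchRequest(urls).body)` could then be passed as the `body` of a `fetch` call.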

Asynchronous Processing with Webhooks

Best for large jobs (100+ URLs): you are notified when the job completes.

Typescript

// Start batch job with webhook
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
  method: 'POST',
  headers: {
    'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    urls: largeUrlList, // 100+ URLs
    formats: ['markdown'],
    webhook: 'https://yourapp.com/webhook/batch-complete',
    maxConcurrency: 10
  }),
});

const { jobId } = await response.json();
console.log(`Batch job started: ${jobId}`);

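// Webhook handler (Express.js example)
// Assumes `app` is an Express app with the express.json() middleware enabled,
// so that req.body is already parsed.
app.post('/webhook/batch-complete', (req, res) => {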
  const { jobId, status, results, errors } = req.body;

  if (status === 'completed') {
    console.log(`Job ${jobId} completed!`);
    console.log(`Success: ${results.length} pages`);
    console.log(`Errors: ${errors.length} pages`);

    // Process results
    results.forEach(result => {
      saveToDatabase(result.url, result.content);
    });
  }

  res.sendStatus(200);
});
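
Because a webhook endpoint is reachable from outside, it may be worth checking the payload before using it. The sketch below is a minimal type guard. `BatchWebhookPayload` and `isBatchWebhookPayload` are hypothetical names, and the payload shape is inferred only from the fields the handler above reads, so it is an assumption and not a documented schema.

```typescript
// Assumed payload shape, inferred from the fields the handler above reads.
interface BatchWebhookPayload {
  jobId: string;
  status: string;
  results: { url: string; content: string }[];
  errors: unknown[];
}

// Checks the top-level fields only; individual result items are not inspected.
function isBatchWebhookPayload(body: unknown): body is BatchWebhookPayload {
  if (typeof body !== 'object' || body === null) return false;
  const candidate = body as Record<string, unknown>;
  return (
    typeof candidate.jobId === 'string' &&
    typeof candidate.status === 'string' &&
    Array.isArray(candidate.results) &&
    Array.isArray(candidate.errors)
  );
}
```

Inside the handler, a failed check could be answered with `res.sendStatus(400)` before any results are saved.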

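2. Queue Management

Process thousands of URLs by splitting them into batches and managing a queue.

Chunking Strategy

Split large URL lists into manageable batches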

Typescript

// Chunk array into batches of 50
function chunkArray<T>(array: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < array.length; i += size) {
    chunks.push(array.slice(i, i + size));
  }
  return chunks;
}

// Process all URLs in batches
async function processBatches(urls: string[]) {
  const batches = chunkArray(urls, 50); // Max 50 URLs per batch
  const allResults: any[] = [];

  for (let i = 0; i < batches.length; i++) {
    const batch = batches[i];
    console.log(`Processing batch ${i + 1}/${batches.length}...`);

    const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
      method: 'POST',
      headers: {
        'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        urls: batch,
        formats: ['markdown'],
        maxConcurrency: 10
      }),
    });

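    // Stop early on HTTP errors instead of failing later on a missing field
    if (!response.ok) {
      throw new Error(`Batch ${i + 1} failed with HTTP ${response.status}`);
    }

    const data = await response.json();
    allResults.push(...data.data.results);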

    // Wait between batches to respect rate limits
    if (i < batches.length - 1) {
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
  }

  return allResults;
}

// Usage
const urls = [...]; // 500 URLs
const results = await processBatches(urls);
console.log(`Scraped ${results.length} total pages`);

Pro tip: Store your queue in Redis or a database so that processing can resume if the script crashes or needs to be restarted.
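
One way to make a queue resumable is to keep its state as plain data that can be serialized after every batch. The sketch below shows only that state handling. `QueueState` and the helper functions are hypothetical names, and where the serialized string is actually stored (a Redis key, a database row) is left to the caller.

```typescript
// Hypothetical queue state; its serialized form is what would be stored in Redis or a database.
interface QueueState {
  pending: string[];   // URLs not yet scraped
  completed: string[]; // URLs already scraped
}

function createQueue(urls: string[]): QueueState {
  return { pending: [...urls], completed: [] };
}

// Look at the next batch without removing it, so a crash mid-batch loses nothing.
function peekBatch(state: QueueState, size: number): string[] {
  return state.pending.slice(0, size);
}

// Move a finished batch from pending to completed and return the new state.
function markCompleted(state: QueueState, batch: string[]): QueueState {
  const done = new Set(batch);
  return {
    pending: state.pending.filter(url => !done.has(url)),
    completed: [...state.completed, ...batch],
  };
}

// Save after every batch; restore on startup to resume where the last run stopped.
const serializeQueue = (state: QueueState): string => JSON.stringify(state);
const restoreQueue = (saved: string): QueueState => JSON.parse(saved) as QueueState;
```

A batch is marked completed only after its request succeeds, so a restart repeats at most the batch that was in flight.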

3. Error Recovery

Handle failures gracefully with retry logic and error tracking.

Robust Error Handling

Typescript

import * as fs from 'node:fs'; // used at the end to save failed URLs

interface BatchResult {
  successful: any[];
  failed: { url: string; error: string }[];
}

async function batchScrapeWithRetry(
  urls: string[],
  maxRetries = 3
): Promise<BatchResult> {
  const successful: any[] = [];
  const failed: { url: string; error: string }[] = [];
  let remainingUrls = [...urls];
  let retries = 0;

  while (remainingUrls.length > 0 && retries <= maxRetries) {
    try {
      const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
        method: 'POST',
        headers: {
          'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          urls: remainingUrls,
          formats: ['markdown'],
          maxConcurrency: 5
        }),
      });

      const data = await response.json();

      // Separate successful and failed results
      const batchSuccessful = data.data.results.filter((r: any) => r.success);
      const batchFailed = data.data.results.filter((r: any) => !r.success);

      successful.push(...batchSuccessful);

      // Only retry failed URLs
      remainingUrls = batchFailed.map((r: any) => r.url);

      if (remainingUrls.length > 0) {
        console.log(`Retrying ${remainingUrls.length} failed URLs...`);
        retries++;
        await new Promise(resolve => setTimeout(resolve, 2000 * retries));
      }

    } catch (error) {
      console.error('Batch request failed:', error);
      retries++;

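      if (retries > maxRetries) {
        // Mark all remaining URLs as failed
        failed.push(...remainingUrls.map(url => ({
          url,
          error: String(error)
        })));
        remainingUrls = [];
        break;
      }

      await new Promise(resolve => setTimeout(resolve, 2000 * retries));
    }
  }

  // URLs that still failed after the final attempt would otherwise be dropped silently
  failed.push(...remainingUrls.map(url => ({
    url,
    error: `Still failing after ${maxRetries} retries`
  })));

  return { successful, failed };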
}

// Usage
const { successful, failed } = await batchScrapeWithRetry(urls);
console.log(`Success: ${successful.length}, Failed: ${failed.length}`);

// Save failed URLs for manual review
if (failed.length > 0) {
  fs.writeFileSync('failed-urls.json', JSON.stringify(failed, null, 2));
}
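
The retry loop above waits `2000 * retries` milliseconds, which is a linear backoff. A common alternative, shown here as a swapped-in technique rather than anything this API prescribes, is exponential backoff with jitter, which spreads retries out when many clients fail at the same time. `backoffDelay` is a hypothetical helper, and the 2-second base and 30-second cap are illustrative values.

```typescript
// Exponential backoff with "full jitter": wait a random time between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable so the result can be tested.
function backoffDelay(
  attempt: number,
  baseMs = 2000,
  capMs = 30000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

In the function above, `2000 * retries` could be replaced with `backoffDelay(retries)` in both `setTimeout` calls.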

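4. Performance Optimization

Maximize throughput and minimize cost with these optimization strategies.

Tune concurrency

Start with maxConcurrency: 5; Professional/Business plans can raise it to 10

Use onlyMainContent

Setting onlyMainContent: true reduces response size by 60-80%

Choose the minimal format

Use formats: ["markdown"] rather than multiple formats (html, text, screenshot)

Cache results

Store scraped data in Redis or a database to avoid re-scraping the same URLs

Avoid over-batching

Do not exceed 50 URLs per batch; split larger lists into multiple requests

Do not ignore rate limits

Respect your plan's rate limits (Free: 5/s, Hobby: 10/s, Pro: 50/s, Business: 100/s)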
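
The caching advice above can be sketched as a filter that runs before any request is made. Here a `Map` stands in for Redis or a database table. `CacheEntry`, `splitByCache`, and the `maxAgeMs` freshness window are hypothetical and not part of the API.

```typescript
// Hypothetical cache entry; a Map stands in for Redis or a database table.
interface CacheEntry {
  content: string;
  fetchedAt: number; // epoch milliseconds
}

// Split URLs into those served from cache and those that still need scraping.
function splitByCache(
  urls: string[],
  cache: Map<string, CacheEntry>,
  maxAgeMs: number,
  now: number = Date.now()
): { cached: Map<string, string>; toFetch: string[] } {
  const cached = new Map<string, string>();
  const toFetch: string[] = [];

  // Deduplicate first so a repeated URL is never fetched twice.
  for (const url of Array.from(new Set(urls))) {
    const entry = cache.get(url);
    if (entry !== undefined && now - entry.fetchedAt <= maxAgeMs) {
      cached.set(url, entry.content); // fresh enough: reuse it
    } else {
      toFetch.push(url); // missing or stale: scrape again
    }
  }
  return { cached, toFetch };
}
```

Only `toFetch` would then be chunked and sent to batch_scrape, so cached pages cost no credits.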

Expected Performance

Scenario	Time	Settings
Small batch (10 URLs)	~5 seconds	maxConcurrency: 5
Medium batch (50 URLs)	~15 seconds	maxConcurrency: 10
Large batch (500 URLs)	~3 minutes	10 batches × 50 URLs
Very large batch (5,000 URLs)	~30 minutes	100 batches × 50 URLs
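
The figures in the table can be roughly reproduced with simple arithmetic. The sketch below assumes 50 URLs per batch, about 15 seconds per full batch (the medium-batch row), and a 2-second pause between batches as used in the chunking example. `estimateBatchRun` is a hypothetical helper, and real timings depend on the target sites.

```typescript
// Assumptions taken from this guide: 50 URLs per batch, about 15 seconds per
// full batch, and a 2-second pause between consecutive batches.
function estimateBatchRun(
  urlCount: number,
  batchSize = 50,
  secondsPerBatch = 15,
  pauseSeconds = 2
): { batches: number; credits: number; seconds: number } {
  const batches = Math.ceil(urlCount / batchSize);
  const pauses = Math.max(0, batches - 1); // no pause after the last batch
  return {
    batches,
    credits: urlCount, // 1 credit per URL
    seconds: batches * secondsPerBatch + pauses * pauseSeconds,
  };
}
```

For 500 URLs this gives 10 batches and 168 seconds, in line with the roughly 3 minutes listed above. For 5,000 URLs it gives 100 batches and 1,698 seconds, close to the 30-minute figure. Partial batches are treated as full ones, so small jobs are overestimated.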

Next Steps

Continue learning with more advanced guides
