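Intermediate Guide
Batch Processing Guide
Scale web scraping to thousands of URLs with efficient queue management, error recovery, and performance optimization strategies.
1. Using the batch_scrape Tool
The batch_scrape tool processes up to 50 URLs concurrently, with built-in rate limiting and webhook notifications.
Basic Batch Scraping
1 credit per URL (50 URLs = 50 credits)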
Bash
curl -X POST https://crawlforge.dev/api/v1/tools/batch_scrape \
-H "X-API-Key: cf_test_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"formats": ["markdown"],
"maxConcurrency": 5,
"onlyMainContent": true
}'
Asynchronous Processing with Webhooks
Best for large jobs (100+ URLs): you are notified when the job completes
Typescript
// Start batch job with webhook
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
method: 'POST',
headers: {
'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
urls: largeUrlList, // 100+ URLs
formats: ['markdown'],
webhook: 'https://yourapp.com/webhook/batch-complete',
maxConcurrency: 10
}),
});
const { jobId } = await response.json();
console.log(`Batch job started: ${jobId}`);
// Webhook handler (Express.js example)
// Assumes `app` is an Express app with express.json() enabled so that req.body is parsed
app.post('/webhook/batch-complete', (req, res) => {
const { jobId, status, results, errors } = req.body;
if (status === 'completed') {
console.log(`Job ${jobId} completed!`);
console.log(`Success: ${results.length} pages`);
console.log(`Errors: ${errors.length} pages`);
// Process results
results.forEach(result => {
saveToDatabase(result.url, result.content);
});
}
res.sendStatus(200);
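});
2. Queue Management
Process thousands of URLs by splitting them into batches and managing them through a queue.
Chunking Strategy
Split large URL lists into manageable batches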
Typescript
// Chunk array into batches of 50
function chunkArray<T>(array: T[], size: number): T[][] {
const chunks: T[][] = [];
for (let i = 0; i < array.length; i += size) {
chunks.push(array.slice(i, i + size));
}
return chunks;
}
// Process all URLs in batches
async function processBatches(urls: string[]) {
const batches = chunkArray(urls, 50); // Max 50 URLs per batch
const allResults: any[] = [];
for (let i = 0; i < batches.length; i++) {
const batch = batches[i];
console.log(`Processing batch ${i + 1}/${batches.length}...`);
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
method: 'POST',
headers: {
'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
urls: batch,
formats: ['markdown'],
maxConcurrency: 10
}),
});
const data = await response.json();
allResults.push(...data.data.results);
// Wait between batches to respect rate limits
if (i < batches.length - 1) {
await new Promise(resolve => setTimeout(resolve, 2000));
}
}
return allResults;
}
// Usage
const urls = [...]; // 500 URLs
const results = await processBatches(urls);
console.log(`Scraped ${results.length} total pages`);
Pro tip: Store your queue in Redis or a database so that you can resume processing if the script crashes or has to be restarted.
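The resume-after-crash idea from the tip can be sketched as follows. This is a minimal illustration, not part of the CrawlForge API: `QueueState`, `QueueStore`, `MemoryStore`, `openQueue`, `checkpoint`, and `runQueue` are all hypothetical names, and the in-memory store only stands in for Redis or a database, which would implement the same two methods.

```typescript
// Hypothetical checkpoint shape: URLs still waiting and URLs already processed.
interface QueueState {
  pending: string[];
  done: string[];
}

// Storage abstraction. A Redis- or database-backed store would implement
// the same two methods; the in-memory version below only stands in for one.
interface QueueStore {
  load(): QueueState | null;
  save(state: QueueState): void;
}

class MemoryStore implements QueueStore {
  private snapshot: string | null = null;

  load(): QueueState | null {
    return this.snapshot === null ? null : (JSON.parse(this.snapshot) as QueueState);
  }

  save(state: QueueState): void {
    this.snapshot = JSON.stringify(state); // serialized, as a real store would be
  }
}

// Resume from the saved checkpoint if there is one; otherwise start fresh.
function openQueue(store: QueueStore, allUrls: string[]): QueueState {
  return store.load() ?? { pending: [...allUrls], done: [] };
}

// Move one finished batch from `pending` to `done` and persist the checkpoint.
function checkpoint(store: QueueStore, state: QueueState, batch: string[]): void {
  const finished = new Set(batch);
  state.pending = state.pending.filter(url => !finished.has(url));
  state.done.push(...batch);
  store.save(state);
}

// Drain the queue batch by batch. `processBatch` is whatever performs the
// scrape (for example, a batch_scrape request); it is injected by the caller.
async function runQueue(
  store: QueueStore,
  allUrls: string[],
  processBatch: (batch: string[]) => Promise<void>,
  batchSize = 50
): Promise<QueueState> {
  const state = openQueue(store, allUrls);
  while (state.pending.length > 0) {
    const batch = state.pending.slice(0, batchSize);
    await processBatch(batch); // if this throws, the saved checkpoint is unchanged
    checkpoint(store, state, batch);
  }
  return state;
}
```

Because the checkpoint is written only after a batch succeeds, a restart repeats at most one batch of work rather than the whole list.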
3. Error Recovery
Handle failures gracefully with retry logic and error tracking.
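Retry logic usually spaces attempts out so that a struggling target is not hit again immediately. As a minimal sketch, the helper below computes a capped exponential delay with jitter. The function names, the 2-second base, and the 30-second cap are illustrative assumptions, not CrawlForge recommendations.

```typescript
// Delay before retry number `attempt` (1-based): exponential growth from
// `baseMs`, capped at `maxMs`, with "equal jitter" (half fixed, half random).
// All names and default values here are illustrative assumptions.
function backoffDelayMs(
  attempt: number,
  baseMs = 2000,
  maxMs = 30000,
  random: () => number = Math.random
): number {
  const capped = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(capped / 2 + random() * (capped / 2));
}

// Small promise-based sleep helper.
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage inside a retry loop: await sleep(backoffDelayMs(retries));
```

The random component keeps many clients that failed at the same moment from retrying in lockstep; passing a fixed `random` function makes the delay deterministic for testing.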
Robust Error Handling
Typescript
import * as fs from 'node:fs'; // used at the end to save failed URLs
interface BatchResult {
successful: any[];
failed: { url: string; error: string }[];
}
async function batchScrapeWithRetry(
urls: string[],
maxRetries = 3
): Promise<BatchResult> {
const successful: any[] = [];
const failed: { url: string; error: string }[] = [];
let remainingUrls = [...urls];
let retries = 0;
while (remainingUrls.length > 0 && retries <= maxRetries) {
try {
const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
method: 'POST',
headers: {
'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
'Content-Type': 'application/json',
},
body: JSON.stringify({
urls: remainingUrls,
formats: ['markdown'],
maxConcurrency: 5
}),
});
const data = await response.json();
// Separate successful and failed results
const batchSuccessful = data.data.results.filter((r: any) => r.success);
const batchFailed = data.data.results.filter((r: any) => !r.success);
successful.push(...batchSuccessful);
// Only retry failed URLs
remainingUrls = batchFailed.map((r: any) => r.url);
if (remainingUrls.length > 0) {
console.log(`Retrying ${remainingUrls.length} failed URLs...`);
retries++;
await new Promise(resolve => setTimeout(resolve, 2000 * retries));
}
} catch (error) {
console.error('Batch request failed:', error);
retries++;
if (retries > maxRetries) {
// Mark all remaining URLs as failed
failed.push(...remainingUrls.map(url => ({
url,
error: String(error)
})));
break;
}
await new Promise(resolve => setTimeout(resolve, 2000 * retries));
}
}
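// URLs that still fail after the last retry would otherwise be dropped silently
for (const url of remainingUrls) {
if (!failed.some(f => f.url === url)) {
failed.push({ url, error: 'Still failing after maximum retries' });
}
}
return { successful, failed };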
}
// Usage
const { successful, failed } = await batchScrapeWithRetry(urls);
console.log(`Success: ${successful.length}, Failed: ${failed.length}`);
// Save failed URLs for manual review
if (failed.length > 0) {
fs.writeFileSync('failed-urls.json', JSON.stringify(failed, null, 2));
}
4. Performance Optimization
Maximize throughput and minimize cost with these optimization strategies.
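Optimize Concurrency
Start with maxConcurrency: 5; Professional/Business plans can raise it to 10.
Use onlyMainContent
Setting onlyMainContent: true reduces response size by 60-80%.
Choose the Minimal Format
Use formats: ["markdown"] rather than multiple formats (html, text, screenshot).
Cache Results
Store scraped data in Redis or a database to avoid scraping the same URLs again.
Avoid Over-Batching
Do not exceed 50 URLs per batch; split larger lists into multiple requests.
Don't Ignore Rate Limits
Respect the rate limits of your plan (Free: 5/s, Hobby: 10/s, Pro: 50/s, Business: 100/s).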
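The caching advice above can be sketched as a pre-filter that runs before any batch request. This is a hedged illustration only: `CachedPage`, `partitionByCache`, and the `maxAgeMs` freshness window are hypothetical, and the `Map` stands in for Redis or a database.

```typescript
// Hypothetical cache entry: the scraped content plus when it was fetched.
interface CachedPage {
  content: string;
  fetchedAt: number; // epoch milliseconds
}

// Split a URL list into pages that can be served from the cache and URLs
// that still need scraping. Entries older than `maxAgeMs` count as stale.
function partitionByCache(
  urls: string[],
  cache: Map<string, CachedPage>,
  maxAgeMs: number,
  now: number = Date.now()
): { cached: Map<string, CachedPage>; toScrape: string[] } {
  const cached = new Map<string, CachedPage>();
  const toScrape: string[] = [];
  const seen = new Set<string>();

  urls.forEach(url => {
    if (seen.has(url)) return; // skip duplicate URLs
    seen.add(url);
    const entry = cache.get(url);
    if (entry !== undefined && now - entry.fetchedAt <= maxAgeMs) {
      cached.set(url, entry);
    } else {
      toScrape.push(url);
    }
  });

  return { cached, toScrape };
}
```

Only the `toScrape` list would then be chunked and sent to batch_scrape, so cached and duplicate URLs cost no credits.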
Expected Performance
| Scenario | Time | Settings |
|---|---|---|
| Small batch (10 URLs) | ~5 seconds | maxConcurrency: 5 |
| Medium batch (50 URLs) | ~15 seconds | maxConcurrency: 10 |
| Large batch (500 URLs) | ~3 minutes | 10 batches × 50 URLs |
| Very large batch (5,000 URLs) | ~30 minutes | 100 batches × 50 URLs |
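The figures in the table can be roughly reproduced with simple arithmetic. The sketch below assumes about 15 seconds per full 50-URL batch (taken from the table), a 2-second pause between batches (as in the queue management example), and 1 credit per URL. The function and its defaults are illustrative assumptions, not guarantees, and partial batches will finish faster than the estimate suggests.

```typescript
// Rough planning numbers for a sequential batch run.
interface BatchEstimate {
  batches: number;
  credits: number;
  seconds: number;
}

function estimateBatchRun(
  urlCount: number,
  batchSize = 50,
  secondsPerBatch = 15, // assumed from the table: ~15 s for 50 URLs
  pauseSeconds = 2 // assumed delay between consecutive batches
): BatchEstimate {
  const batches = Math.ceil(urlCount / batchSize);
  const pauses = Math.max(0, batches - 1); // no pause after the last batch
  return {
    batches,
    credits: urlCount, // 1 credit per URL
    seconds: batches * secondsPerBatch + pauses * pauseSeconds,
  };
}
```

For 500 URLs this gives 10 batches and 168 seconds (just under 3 minutes); for 5,000 URLs it gives 100 batches and 1,698 seconds (about 28 minutes), in line with the rounded figures in the table.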
Next Steps
Continue learning with more advanced guides