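What is the deep_research tool in CrawlForge?

deep_research is CrawlForge's AI-powered multi-source analysis tool. It runs a 5-stage pipeline (query expansion, source discovery, content extraction, verification, synthesis) and returns a cited summary in seconds. A single API call costing 10 credits replaces a manual research process that would otherwise take 65-95 minutes.

How does deep_research detect conflicts between sources?

After extracting content from multiple sources, deep_research compares their claims and flags contradictions in the `conflicts` array of the response. This matters for due diligence, market research, and any other scenario where source reliability counts.

How long does a deep_research call take?

Depending on depth and the number of sources, a typical run completes in 15-30 seconds. The example in the article returned a summary synthesized from 10 verified sources in about 18 seconds, including search, extraction, and AI synthesis.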

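How many deep_research queries does the free tier include?

The free tier's 1,000 credits cover 100 deep_research queries (10 credits each). Paid plans scale proportionally: Hobby ($19/month, 5,000 credits) supports 500 queries, and higher plans follow the same ratio.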
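The arithmetic behind these figures can be sketched in a few lines. The helper name `queriesForCredits` is hypothetical; the only values taken from the text are the 10-credit cost per query and the two plan sizes quoted above.

```typescript
// Cost of one deep_research call, as stated in the text above.
const CREDITS_PER_QUERY = 10;

// Hypothetical helper: how many whole deep_research queries a credit balance covers.
function queriesForCredits(credits: number, costPerQuery: number = CREDITS_PER_QUERY): number {
  if (credits < 0 || costPerQuery <= 0) {
    throw new RangeError("credits must be >= 0 and costPerQuery must be > 0");
  }
  return Math.floor(credits / costPerQuery);
}

console.log(queriesForCredits(1000)); // free tier: 100
console.log(queriesForCredits(5000)); // Hobby plan: 500
```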

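Can I filter deep_research by source type or recency?

Yes. You can configure source filtering with options such as academic, news, or government sources, set a credibility threshold, and toggle includeRecentOnly to focus on fresh content. The tool also supports five research approaches: broad, focused, academic, current_events, and comparative.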

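Web Scraping for AI Training Data: The Complete 2026 Guide

The AI revolution runs on data. Whether you are fine-tuning an LLM, building a RAG system, or training a custom model, web data is often your richest source of training material.

But collecting high-quality training data from the web is not trivial. This guide covers ethical considerations, collection pipelines, quality assurance, and a practical implementation with CrawlForge.

The Data Bottleneck

An AI model is only as good as its training data. Yet most teams face the same key challenges:

Volume: Models need millions of examples
Quality: Garbage in, garbage out
Diversity: Training on narrow data produces narrow models
Freshness: Static datasets go stale
Compliance: Legal and ethical considerations

Web scraping solves the volume and diversity problems, but only when it is done right.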

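Types of Web Data for AI

Text Content

The most common type of training data:

Articles and blog posts - Narrative text for language understanding
Documentation - Technical writing and structured explanations
Forums and Q&A - Conversational patterns and problem solving
Product descriptions - Concise descriptive text
Reviews - Content rich in sentiment and opinion

Structured Data

For classification and entity recognition:

Product catalogs - Items with attributes
Business directories - Entities with relationships
Event data - Time and location information
Tables and datasets - Numerical and categorical data

Metadata

Often overlooked but valuable:

SEO tags - Human-written summaries and keywords
Schema.org markup - Structured entity data
Social graphs - Relationship data
Timestamps - Temporal patterns

Multimodal

For vision and multimodal models:

Images with captions - Vision-language pairs
PDFs with text - Document understanding
Videos with subtitles - Temporal vision-language data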

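Ethical Web Scraping Principles

Before collecting data, understand the ethical and legal landscape.

1. Respect robots.txt

robots.txt tells crawlers what they are allowed to access: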

# Example robots.txt
User-agent: *
Disallow: /private/
Disallow: /api/
Allow: /public/
Crawl-delay: 10

CrawlForge respects robots.txt by default. You can inspect any site's policy:

Bash

curl https://example.com/robots.txt
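If you want to check a path against a fetched robots.txt yourself, the rules can be evaluated with a small parser. The following is a minimal sketch, not CrawlForge's implementation: it reads only the `User-agent: *` group, treats rules as plain path prefixes (no wildcards), and lets the longest matching rule win. The names `parseRobotsTxt` and `isAllowed` are hypothetical.

```typescript
interface RobotsRules {
  allow: string[];
  disallow: string[];
  crawlDelaySeconds?: number;
}

// Parse only the rules that apply to "User-agent: *" (a simplification).
function parseRobotsTxt(text: string): RobotsRules {
  const rules: RobotsRules = { allow: [], disallow: [] };
  let inWildcardGroup = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      inWildcardGroup = value === "*";
    } else if (inWildcardGroup && value !== "") {
      if (field === "allow") rules.allow.push(value);
      else if (field === "disallow") rules.disallow.push(value);
      else if (field === "crawl-delay") {
        const delay = Number(value);
        if (Number.isFinite(delay)) rules.crawlDelaySeconds = delay;
      }
    }
  }
  return rules;
}

// Longest matching prefix wins; a tie, or no match at all, means allowed.
function isAllowed(path: string, rules: RobotsRules): boolean {
  const longest = (prefixes: string[]): number =>
    prefixes
      .filter(prefix => path.startsWith(prefix))
      .reduce((max, prefix) => Math.max(max, prefix.length), -1);

  return longest(rules.allow) >= longest(rules.disallow);
}

const exampleRobots = [
  "# Example robots.txt",
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /api/",
  "Allow: /public/",
  "Crawl-delay: 10",
].join("\n");

const exampleRules = parseRobotsTxt(exampleRobots);
console.log(isAllowed("/private/data", exampleRules)); // false
console.log(isAllowed("/public/page", exampleRules));  // true
console.log(exampleRules.crawlDelaySeconds);           // 10
```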

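2. Rate Limiting

Don't overwhelm servers:

Honor the Crawl-delay directive
Wait at least 1-5 seconds between requests
Monitor response codes - 429 means slow down
Lower concurrency for smaller sites

CrawlForge has built-in rate limiting, but exercise restraint as well.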
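One way to apply these rules in your own client code is to compute the wait time from the last response status. This is a sketch under stated assumptions: `nextDelayMs` is a hypothetical helper, and the doubling strategy and the 60-second cap are illustrative choices rather than anything CrawlForge prescribes.

```typescript
// Compute how long to wait before the next request to the same host.
function nextDelayMs(
  status: number,          // HTTP status of the previous response
  retries: number,         // consecutive 429 responses seen before this one
  crawlDelaySeconds = 1,   // from robots.txt, or the 1-second floor suggested above
  maxDelayMs = 60000       // illustrative upper bound on the wait
): number {
  const baseMs = Math.max(1, crawlDelaySeconds) * 1000;
  if (status !== 429) return baseMs;
  // Exponential backoff: double the wait for each consecutive 429, up to the cap.
  return Math.min(baseMs * 2 ** (retries + 1), maxDelayMs);
}

console.log(nextDelayMs(200, 0));     // 1000
console.log(nextDelayMs(429, 0));     // 2000
console.log(nextDelayMs(429, 2, 10)); // 60000 (capped)
```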

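3. Data Licensing

Understand content rights:

Creative Commons - Usually usable with attribution
Copyright - Training use requires permission
Terms of service - Some sites prohibit scraping
GDPR/privacy - Personal data is restricted

4. The LLMs.txt Standard

An emerging convention for AI-specific permissions: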

# llms.txt example
Allow: training
Allow: inference
Require: attribution
Contact: ai@example.com

Before crawling, check each site's llms.txt (or robots.txt) to discover its AI permissions.

Building a Data Collection Pipeline

Architecture Overview

┌──────────────────────────────────────────────────┐
│  1. Source Discovery                              │
│  - Identify target websites                       │
│  - Map site structure                             │
│  - Prioritize high-quality sources                │
└──────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────┐
│  2. Content Extraction                            │
│  - Fetch pages                                    │
│  - Extract main content                           │
│  - Handle pagination                              │
└──────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────┐
│  3. Data Cleaning                                 │
│  - Remove duplicates                              │
│  - Filter low-quality content                     │
│  - Normalize formats                              │
└──────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────┐
│  4. Quality Validation                            │
│  - Language detection                             │
│  - Content scoring                                │
│  - Deduplication                                  │
└──────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────┐
│  5. Storage & Export                              │
│  - Format for training                            │
│  - Version control                                │
│  - Documentation                                  │
└──────────────────────────────────────────────────┘

Step 1: Source Discovery

Start by mapping the available content:

Typescript

// Map a documentation site
const response = await fetch('https://crawlforge.dev/api/v1/tools/map_site', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://docs.example.com',
    maxDepth: 3,
    includePatterns: ['/docs/*', '/guides/*'],
    excludePatterns: ['/api/*', '/admin/*']
  })
});

const { data } = await response.json();
console.log(`Found ${data.urls.length} pages to scrape`);
// Found 847 pages to scrape

Cost: 2 credits per map_site call
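If you need to apply the same include and exclude patterns to a URL list on your side, for example to re-filter a cached map, a simple glob matcher is enough. This sketch assumes that `*` matches any run of characters and that patterns are tested against the URL path; how the map_site endpoint itself interprets the patterns is not specified here. The names `globToRegExp` and `filterUrls` are hypothetical.

```typescript
// Convert a simple glob ("*" = any run of characters) into an anchored RegExp.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// Keep URLs whose path matches at least one include pattern and no exclude pattern.
function filterUrls(urls: string[], include: string[], exclude: string[]): string[] {
  const includeRes = include.map(globToRegExp);
  const excludeRes = exclude.map(globToRegExp);

  return urls.filter(url => {
    const path = new URL(url).pathname;
    const included = includeRes.length === 0 || includeRes.some(re => re.test(path));
    const excluded = excludeRes.some(re => re.test(path));
    return included && !excluded;
  });
}

const candidateUrls = [
  "https://docs.example.com/docs/intro",
  "https://docs.example.com/guides/setup",
  "https://docs.example.com/api/users",
  "https://docs.example.com/blog/post",
];

console.log(filterUrls(candidateUrls, ["/docs/*", "/guides/*"], ["/api/*", "/admin/*"]));
// keeps the /docs/intro and /guides/setup URLs
```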

Step 2: Content Extraction

Extract content from the discovered URLs:

Typescript

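// Batch scrape all discovered URLs (continues from the map_site response above)
const urls = data.urls;

// Small helper used to pause between batches
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));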

// Process in batches of 50
for (let i = 0; i < urls.length; i += 50) {
  const batch = urls.slice(i, i + 50);

  const response = await fetch('https://crawlforge.dev/api/v1/tools/batch_scrape', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      urls: batch,
      extractContent: true,
      includeMetadata: true
    })
  });

  const { data } = await response.json();

  // Store results
  for (const result of data.results) {
    await storeDocument({
      url: result.url,
      content: result.content,
      metadata: result.metadata,
      scrapedAt: new Date()
    });
  }

  // Respect rate limits
  await sleep(1000);
}

Cost: 5 credits per batch_scrape call (up to 50 URLs)
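The batching and its credit cost can be worked out ahead of time. The sketch below uses only the figures stated above (50 URLs and 5 credits per call) together with the 847 pages from the mapping example; `chunk` and `batchScrapeCredits` are hypothetical helper names.

```typescript
const BATCH_SIZE = 50;       // max URLs per batch_scrape call (from the text)
const CREDITS_PER_BATCH = 5; // cost per batch_scrape call (from the text)

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Credits needed to scrape `urlCount` URLs in full batches plus one partial batch.
function batchScrapeCredits(urlCount: number): number {
  return Math.ceil(urlCount / BATCH_SIZE) * CREDITS_PER_BATCH;
}

const exampleUrls = Array.from({ length: 847 }, (_, i) => `https://docs.example.com/page-${i}`);
console.log(chunk(exampleUrls, BATCH_SIZE).length);  // 17 batches
console.log(batchScrapeCredits(exampleUrls.length)); // 85 credits
```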

Step 3: Data Cleaning

Remove noise and normalize the content:

Typescript

function cleanDocument(doc: ScrapedDocument): CleanedDocument | null {
  let content = doc.content;

  // Remove boilerplate
  content = removeNavigation(content);
  content = removeFooters(content);
  content = removeAds(content);

  // Normalize whitespace
  content = content.replace(/\s+/g, ' ').trim();

  // Remove very short documents
  if (content.split(' ').length < 50) {
    return null; // Too short
  }

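  // Remove documents that are mostly code.
  // Assumes countCodeBlocks returns the number of characters inside code
  // blocks, so that the ratio is a fraction between 0 and 1.
  const codeRatio = countCodeBlocks(content) / content.length;
  if (codeRatio > 0.8) {
    return null; // Mostly code, not useful for text training
  }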

  return {
    ...doc,
    content,
    wordCount: content.split(' ').length,
    cleanedAt: new Date()
  };
}

Step 4: Quality Validation

Score content quality with AI:

Typescript

// Analyze content quality
const response = await fetch('https://crawlforge.dev/api/v1/tools/analyze_content', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    content: doc.content
  })
});

const { data } = await response.json();

// Filter based on analysis
if (data.language !== 'en') {
  // Skip non-English content (or separate by language)
}

if (data.readability.score < 30) {
  // Skip content that's too technical/unreadable
}

if (data.sentiment.toxic > 0.5) {
  // Skip potentially harmful content
}

Cost: 3 credits per analyze_content call
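The three checks above can be folded into one function that returns a keep-or-drop decision along with the reason. This is a sketch: the `ContentAnalysis` shape mirrors only the fields read in the snippet above and is an assumption about the response, and `judgeQuality` is a hypothetical name.

```typescript
// Assumed shape: only the fields the filtering logic reads.
interface ContentAnalysis {
  language: string;
  readability: { score: number };
  sentiment: { toxic: number };
}

type QualityVerdict = { keep: true } | { keep: false; reason: string };

// Apply the thresholds from the text: English only, readability >= 30, toxicity <= 0.5.
function judgeQuality(analysis: ContentAnalysis): QualityVerdict {
  if (analysis.language !== "en") return { keep: false, reason: "non-English" };
  if (analysis.readability.score < 30) return { keep: false, reason: "low readability" };
  if (analysis.sentiment.toxic > 0.5) return { keep: false, reason: "toxic" };
  return { keep: true };
}

console.log(judgeQuality({ language: "en", readability: { score: 62 }, sentiment: { toxic: 0.02 } }));
// { keep: true }
console.log(judgeQuality({ language: "en", readability: { score: 12 }, sentiment: { toxic: 0.02 } }));
// { keep: false, reason: 'low readability' }
```

Returning the reason makes it easy to log why documents were dropped and to tune the thresholds later.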

Step 5: Deduplication

Remove near-duplicates:

Typescript

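import { simhash, hammingDistance } from './hashing';

// Assumption: simhash returns a 64-bit fingerprint encoded as a string, and
// hammingDistance returns the number of bits in which two fingerprints differ.
const HASH_BITS = 64;
const seen = new Map<string, string>(); // hash -> url

function isDuplicate(doc: CleanedDocument): boolean {
  const hash = simhash(doc.content);

  for (const [existingHash, existingUrl] of seen) {
    // Convert the bit distance into a similarity between 0 and 1
    const similarity = 1 - hammingDistance(hash, existingHash) / HASH_BITS;
    if (similarity > 0.95) {
      console.log(`Duplicate: ${doc.url} ~= ${existingUrl}`);
      return true;
    }
  }

  seen.set(hash, doc.url);
  return false;
}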
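The snippet above relies on `simhash` and `hammingDistance` helpers that are not shown. As a rough idea of what such a module could contain, here is a minimal sketch, assuming whitespace tokenization, a 64-bit fingerprint built from two seeded 32-bit FNV-1a-style hashes, and a fingerprint encoded as a string of 64 `0`/`1` characters. A production version would pack the bits and use weighted shingles; none of this is taken from CrawlForge.

```typescript
const HASH_BITS = 64;

// 32-bit FNV-1a-style hash over UTF-16 code units, with a caller-chosen seed.
function fnv1a32(token: string, seed: number): number {
  let hash = seed >>> 0;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// SimHash: each token votes +1 or -1 on every bit; the sign of the total decides the bit.
function simhash(text: string): string {
  const weights = new Array<number>(HASH_BITS).fill(0);
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);

  for (const token of tokens) {
    const lo = fnv1a32(token, 0x811c9dc5); // bits 0-31
    const hi = fnv1a32(token, 0x9e3779b9); // bits 32-63 (second seed is arbitrary)
    for (let bit = 0; bit < HASH_BITS; bit++) {
      const word = bit < 32 ? lo : hi;
      weights[bit] += ((word >>> (bit % 32)) & 1) === 1 ? 1 : -1;
    }
  }
  // One character per bit keeps the sketch simple.
  return weights.map(weight => (weight > 0 ? "1" : "0")).join("");
}

// Number of positions at which two equal-length fingerprints differ.
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) throw new Error("fingerprints must have equal length");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

const fingerprint = simhash("Web scraping for AI training data");
console.log(fingerprint.length);                        // 64
console.log(hammingDistance(fingerprint, fingerprint)); // 0
```

With 64 bits, a similarity threshold of 0.95 (computed as 1 - distance / 64) corresponds to fingerprints that differ in at most 3 bits.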

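Data Quality for LLM Training

Quality Metrics to Track

Metric	Target	Why It Matters
Unique document ratio	>95%	Avoids memorization
Average word count	200-5000	Balanced context length
Language purity	>99%	Consistent training signal
Readability score	40-80	Human-level text
Freshness	<1 year	Current information

Training Formats

For LLM fine-tuning, output JSONL: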

Jsonl

{"text": "Document content here...", "source": "docs.example.com", "category": "tutorial"}
{"text": "Another document...", "source": "blog.example.com", "category": "article"}

For RAG systems, include embeddings:

Jsonl

{"text": "...", "embedding": [0.123, 0.456, ...], "metadata": {"url": "...", "title": "..."}}
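Writing this format takes one `JSON.stringify` call per record, joined by newlines. The sketch below assumes the three fields shown in the fine-tuning example; `toJSONL` and `parseJSONL` are hypothetical helper names.

```typescript
interface TrainingRecord {
  text: string;
  source: string;
  category: string;
}

// One JSON object per line. JSON.stringify escapes newlines inside strings,
// so every record stays on a single line.
function toJSONL(records: TrainingRecord[]): string {
  if (records.length === 0) return "";
  return records.map(record => JSON.stringify(record)).join("\n") + "\n";
}

// Inverse of toJSONL: skip blank lines and parse each remaining line.
function parseJSONL(jsonl: string): TrainingRecord[] {
  return jsonl
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line) as TrainingRecord);
}

const sampleRecords: TrainingRecord[] = [
  { text: "First line\nSecond line", source: "docs.example.com", category: "tutorial" },
  { text: "Another document", source: "blog.example.com", category: "article" },
];

const jsonl = toJSONL(sampleRecords);
console.log(jsonl.trimEnd().split("\n").length); // 2
```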

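Scaling Your Pipeline

Credit Optimization

For large-scale collection:

Start with map_site (2 credits) to discover URLs
Use batch_scrape (5 credits per 50 URLs) instead of individual calls
Skip analyze_content for sources already known to be high quality
Cache aggressively - the same URL usually means the same content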

Cost Estimation

Dataset Size	Tools	Credits	Cost (Pro plan)
1K documents	map + batch	~500	$1
10K documents	map + batch	~2,500	$5
100K documents	map + batch	~15,000	$30
1M documents	map + batch + analysis	~100,000	$200

Incremental Updates

Don't re-scrape everything:

Typescript

// Check for changes before re-scraping
const response = await fetch('https://crawlforge.dev/api/v1/tools/track_changes', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: doc.url,
    lastHash: doc.contentHash
  })
});

const { data } = await response.json();

if (data.hasChanges) {
  // Re-scrape and update
} else {
  // Skip, content unchanged
}

Cost: 3 credits per track_changes call
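The request above sends a stored `contentHash`. If you also want a local fingerprint so that unchanged documents can be skipped before any API call, one option is a SHA-256 digest of the whitespace-normalized text. This is a sketch for local bookkeeping only: whether track_changes expects this hash format in `lastHash` is not stated here, and `contentHash` as a function name is an assumption.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the text after collapsing whitespace, as a 64-character hex string.
function contentHash(content: string): string {
  const normalized = content.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

const before = contentHash("Install the CLI,  then run the setup command.\n");
const after = contentHash("Install the CLI, then run the setup command.");
console.log(before === after); // true: only whitespace differs
```

Normalizing whitespace first means that purely cosmetic reformatting of a page does not register as a change.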

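Common Pitfalls

1. Over-Scraping

Problem: Collecting too much low-quality data
Solution: Quality over quantity. 100K good documents beat 1M mediocre ones.

2. Ignoring Data Quality

Problem: Training on noisy data
Solution: Invest in cleaning and validation. Use analyze_content.

3. Copyright Infringement

Problem: Using copyrighted content without permission
Solution: Stick to permissively licensed sources. Check robots.txt and the terms of service.

4. Rate Limit Exhaustion

Problem: Getting blocked by target sites
Solution: Use stealth_mode (5 credits) for sensitive sites. Respect crawl delays.

5. Stale Data

Problem: Training on outdated information
Solution: Schedule periodic scraping with track_changes.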

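Case Study: Building a Documentation Dataset

Goal: Create a training dataset for a coding assistant from technical documentation.

Sources

Official framework documentation (React, Vue, Next.js, etc.)
API references
Permissively licensed tutorial sites

Pipeline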

Typescript

const SOURCES = [
  'https://react.dev/learn',
  'https://vuejs.org/guide',
  'https://nextjs.org/docs',
  // ... more sources
];

async function buildDataset() {
  const documents = [];

  for (const source of SOURCES) {
    // 1. Map site structure
    const siteMap = await mapSite(source);

    // 2. Filter to documentation pages
    const docUrls = siteMap.urls.filter(url =>
      url.includes('/docs') ||
      url.includes('/guide') ||
      url.includes('/learn')
    );

    // 3. Batch scrape
    const scraped = await batchScrape(docUrls);

    // 4. Clean and validate
    for (const doc of scraped) {
      const cleaned = cleanDocument(doc);
      if (cleaned && !isDuplicate(cleaned)) {
        documents.push(cleaned);
      }
    }
  }

  // 5. Export
  await exportToJSONL(documents, 'training_data.jsonl');

  console.log(`Dataset: ${documents.length} documents`);
}

Results

Sources: 12 documentation sites
Pages scraped: 15,847
After cleaning: 12,392 documents
After deduplication: 11,108 unique documents
Total words: 8.2M
Credits used: ~4,500
Cost: ~$9 (Professional plan)
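A quick calculation on these figures shows how much each stage removed and how the result compares with the quality targets. The numbers are the ones reported above; since the total word count is rounded to 8.2M, the average is approximate.

```typescript
const results = {
  pagesScraped: 15847,
  afterCleaning: 12392,
  afterDedup: 11108,
  totalWords: 8200000, // reported as 8.2M, so this is a rounded figure
};

// Percentage of `whole` represented by `part`, to one decimal place.
const pct = (part: number, whole: number): string => ((part / whole) * 100).toFixed(1);

const cleaningRetention = pct(results.afterCleaning, results.pagesScraped); // "78.2"
const dedupRetention = pct(results.afterDedup, results.afterCleaning);      // "89.6"
const avgWords = Math.round(results.totalWords / results.afterDedup);       // 738

console.log(`Kept after cleaning: ${cleaningRetention}%`);
console.log(`Kept after deduplication: ${dedupRetention}%`);
console.log(`Average words per document: ${avgWords}`);
```

Cleaning kept roughly 78% of the scraped pages, deduplication kept roughly 90% of those, and the average of about 738 words per document falls inside the 200-5000 target range.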

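Conclusion

Web scraping for AI training data is a balance of volume, quality, and ethics. Key principles:

Start with clear goals - What does your model need?
Prioritize quality - Clean data beats more data
Respect sources - Follow robots.txt and rate limits
Validate thoroughly - Use automated quality checks
Iterate continuously - Models improve with better data

CrawlForge provides the tools you need to build production-grade data pipelines. Get started with 1,000 free credits at crawlforge.dev/signup.

Resources:

API Reference - Complete tool documentation
Batch Processing Guide - Scraping at scale
Credit Optimization - Reducing costs

Questions? Reach us on GitHub or Twitter.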

