Advanced Tools · 2 credits

extract_content

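Extracts the main article body using readability detection, removing boilerplate such as ads, navigation bars, and footers. Well suited to producing clean content for LLM input and content analysis.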

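Use Cases

Clean Content for LLMs

Extract article text free of ads and navigation to feed into AI models

Article Extraction

Get the main article body from news sites, blogs, and content platforms

Boilerplate Removal

Strip ads, pop-ups, headers, footers, and other non-content elements

Content Aggregation

Build RSS readers, news aggregators, and content curation platforms

Reader Mode

Create distraction-free reading experiences, similar to a browser's reader mode

Research and Analysis

Extract article text for sentiment analysis, NLP, and research projects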

Endpoint

POST /api/v1/tools/extract_content

Auth Required

Free plan: 2 req/s

2 credits
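The 2 req/s limit listed above for the Free plan can be respected on the client side by spacing calls at least 0.5 seconds apart. The following is a minimal sketch of such a throttle; the `Throttle` class and its `clock` and `sleep` parameters are hypothetical names introduced for illustration and are not part of the API.

```python
import time


class Throttle:
    """Spaces calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate  # 0.5 s for 2 req/s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

Calling `Throttle(2).wait()` before each request keeps a single-threaded client under the limit. Higher-tier limits are not stated in this section, so the rate is left as a parameter.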

Parameters

Name	Type	Required	Default	Description
url	string	Required	-	URL of the web page to extract content from. Example: https://example.com/article
options	object	Optional	-	Content extraction options. Example: {"includeImages": true, "includeLinks": true}
options.includeImages	boolean	Optional	true	Include images in the extracted content. Example: true
options.includeLinks	boolean	Optional	false	Preserve links in the extracted content. Example: false
options.minTextLength	number	Optional	100	Minimum text length, in characters, for a block to count as main content. Example: 200
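As an illustration of the parameter table above, here is a small sketch that builds the request body and checks it locally before anything is sent. The helper name `build_payload` and its validation rules are assumptions made for this example; the service may validate differently. Only options that are set explicitly are included, so the documented defaults apply to the rest.

```python
from urllib.parse import urlparse

# Option names as listed in the parameter table.
KNOWN_OPTIONS = {"includeImages", "includeLinks", "minTextLength"}


def build_payload(url: str, **options) -> dict:
    """Return a request body for extract_content, validating inputs locally."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must include http:// or https:// and a host")

    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")

    for name, value in options.items():
        if name == "minTextLength":
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = isinstance(value, int) and not isinstance(value, bool) and value >= 0
        else:
            ok = isinstance(value, bool)
        if not ok:
            raise TypeError(f"invalid value for {name}: {value!r}")

    payload = {"url": url}
    if options:
        payload["options"] = dict(options)
    return payload
```

For example, `build_payload("https://example.com/article", includeLinks=True)` yields a body with a single option, while a URL without a scheme raises `ValueError` before any credits are spent.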

Request Example

Bash

curl -X POST https://crawlforge.dev/api/v1/tools/extract_content \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "options": {
      "includeImages": true,
      "includeLinks": false,
      "minTextLength": 200
    }
  }'
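The same request can be sketched in Python using only the standard library. The endpoint URL and headers are copied from the curl command above; the function names `make_request` and `extract_content`, and the 30-second timeout, are assumptions made for this example.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://crawlforge.dev/api/v1/tools/extract_content"


def make_request(api_key: str, page_url: str,
                 options: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = {"url": page_url}
    if options:
        body["options"] = options
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def extract_content(api_key: str, page_url: str,
                    options: Optional[dict] = None, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    req = make_request(api_key, page_url, options)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so callers would typically wrap `extract_content` in a `try` block and inspect the exception's `code` attribute.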

Response Example

200 OK (680 ms)

{
  "success": true,
  "data": {
    "title": "The Future of Web Scraping: AI and Machine Learning",
    "content": "# The Future of Web Scraping\n\nWeb scraping has evolved significantly over the past decade...\n\n## Machine Learning Integration\n\nModern scraping tools now leverage AI to adapt to website changes...",
    "author": "John Doe",
    "publishDate": "2024-01-15T10:30:00Z",
    "images": [
      {
        "src": "https://example.com/images/hero.jpg",
        "alt": "Web scraping visualization",
        "width": 1200,
        "height": 630
      }
    ],
    "readingTime": 8,
    "wordCount": 1847,
    "excerpt": "Web scraping has evolved significantly over the past decade with the integration of AI and machine learning..."
  },
  "credits_used": 2,
  "credits_remaining": 998,
  "processing_time": 680
}

Field Descriptions

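data.title: The extracted article title

data.content: The main article body in Markdown (clean, without ads or navigation)

data.author: The article author, if available

data.publishDate: The article publication date (ISO 8601 format)

data.images: Array of images with src, alt text, and dimensions

data.readingTime: Estimated reading time in minutes, calculated at 200 wpm

data.wordCount: Total word count of the extracted content

credits_used: Credits deducted for this request (2 credits per extraction)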
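To show how the fields described above might be consumed, the sketch below maps a decoded response onto a small dataclass. The `Article` type, `parse_article`, and `estimate_reading_minutes` are hypothetical names for this example. Apart from `title` and `content`, fields are treated as optional, which is an assumption based on the "if available" wording for `author`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Article:
    title: str
    content: str  # Markdown body
    author: Optional[str] = None
    publish_date: Optional[str] = None  # ISO 8601 string, left unparsed
    images: List[dict] = field(default_factory=list)
    reading_time: Optional[int] = None  # minutes
    word_count: Optional[int] = None


def parse_article(response: dict) -> Article:
    """Convert a decoded extract_content response into an Article."""
    if not response.get("success"):
        raise ValueError("response does not report success")
    data = response["data"]
    return Article(
        title=data["title"],
        content=data["content"],
        author=data.get("author"),
        publish_date=data.get("publishDate"),
        images=list(data.get("images") or []),
        reading_time=data.get("readingTime"),
        word_count=data.get("wordCount"),
    )


def estimate_reading_minutes(word_count: int, wpm: int = 200) -> float:
    """Naive reading-time estimate: words divided by words per minute."""
    return word_count / wpm
```

The naive estimate for 1,847 words at 200 wpm is about 9.2 minutes, whereas the sample response reports 8. The service may therefore count words or round differently from this simple division, so the returned `readingTime` is the value to rely on.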

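Error Handling

No Content Found (422 Unprocessable Entity)

The main content could not be extracted from the page. The page may be empty or contain no readable content.

Invalid URL (400 Bad Request)

The URL format is invalid. Make sure it includes a protocol (http:// or https://).

Page Not Accessible (404 Not Found)

The URL returned a 404 error. Confirm that the URL is correct and publicly accessible.

Insufficient Credits (402 Payment Required)

Your account does not have enough credits (2 required). Purchase more credits or upgrade your plan.

Rate Limit Exceeded (429 Too Many Requests)

You have exceeded your plan's rate limit. Wait a moment, or upgrade your plan for higher limits.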
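The status codes above suggest a simple client policy: retry only on 429, and surface the others to the caller, since a retry would not change the outcome. The sketch below illustrates one way to do that. The names `call_with_retry` and `backoff_delays`, the delay values, and the `send` callback returning `(status, body)` are all assumptions for this example. This section does not say whether a Retry-After header is returned, so fixed exponential delays are used.

```python
import time
from typing import Callable, List, Tuple

RETRYABLE = {429}  # only rate limiting is worth retrying

ERROR_HINTS = {
    400: "Invalid URL: include http:// or https://",
    402: "Insufficient credits: purchase more or upgrade the plan",
    404: "Page not accessible: check that the URL is correct and public",
    422: "No content found: the page may be empty or not article-like",
    429: "Rate limit exceeded: wait, or upgrade the plan",
}


def backoff_delays(count: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(count)]


def call_with_retry(send: Callable[[], Tuple[int, dict]],
                    max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, dict]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        sleep(delays[attempt])
    raise AssertionError("unreachable")
```

In production code, adding random jitter to each delay is a common refinement when many clients share one key.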

Pro tip: extract_content uses Mozilla's Readability algorithm, the same technology behind Firefox Reader View. It works best on well-structured, article-style pages.

Credit Cost

2 credits

2 credits per request

Each successful extract_content request costs 2 credits, regardless of content length.

Free plan: 1,000 one-time trial credits = 500 extractions

Hobby plan: 5,000 credits per month = 2,500 extractions ($19/mo)

Professional plan: 50,000 credits per month = 25,000 extractions ($99/mo)

Business plan: 250,000 credits per month = 125,000 extractions ($399/mo)
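The plan figures above follow from dividing each credit allowance by the 2-credit cost. The short sketch below reproduces that arithmetic and derives an effective price per extraction for the paid plans. The dictionary and function names are illustrative only, and the per-extraction price assumes the whole monthly allowance is spent on extract_content.

```python
CREDITS_PER_EXTRACTION = 2

# Credit allowances and monthly prices as listed above.
PLAN_CREDITS = {"Free": 1_000, "Hobby": 5_000, "Professional": 50_000, "Business": 250_000}
PLAN_PRICE_USD = {"Hobby": 19, "Professional": 99, "Business": 399}


def extractions(credits: int) -> int:
    """Number of whole extract_content calls a credit balance covers."""
    return credits // CREDITS_PER_EXTRACTION


def usd_per_extraction(plan: str) -> float:
    """Monthly price divided by the extractions the plan's credits cover."""
    return PLAN_PRICE_USD[plan] / extractions(PLAN_CREDITS[plan])


for plan, credits in PLAN_CREDITS.items():
    print(plan, extractions(credits))
```

On these numbers, the effective price falls from about $0.0076 per extraction on Hobby to about $0.0032 on Business.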

Related Tools

extract_text

Extract all text from HTML, including boilerplate (1 credit)

summarize_content

Summarize extracted content (4 credits)

