Basic Tools · 1 credit

extract_text

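Extracts clean, readable text from HTML using intelligent parsing. Scripts, styles, and boilerplate are removed automatically while the main text content is preserved.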

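Use Cases

Article extraction for LLMs

Extract clean article text for summarization, analysis, or AI processing

Content analysis

Get plain text for word counts, readability analysis, or sentiment detection

Clean text for summarization

Strip HTML noise before passing content to a summarization model

Boilerplate removal

Remove ads, navigation, and other non-content elements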

Endpoint

POST /api/v1/tools/extract_text

Auth Required

Free plan: 2 req/s

1 credit
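
The 2 req/s limit on the Free plan can be respected on the client side by spacing out requests. The following is a minimal sketch, not part of the CrawlForge API: the `Throttle` class and its `wait` method are hypothetical helpers, and the injectable `clock` and `sleep` arguments exist only to make the behavior easy to test.

```python
import time


class Throttle:
    """Spaces out calls so that at most `rate` of them start per second."""

    def __init__(self, rate=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request may be sent; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

Calling `throttle.wait()` immediately before each request keeps consecutive requests at least half a second apart when `rate=2.0`.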

Parameters

Note: You must provide either html or url. If both are provided, html takes precedence.
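
The either/or rule can be checked before a request is sent, which avoids spending a round trip on a 400 response. This is a sketch under assumptions: `build_payload` is a hypothetical helper, not something the API provides. It sends only `html` when both inputs are given, mirroring the documented precedence.

```python
def build_payload(html=None, url=None, **options):
    """Build the JSON body for extract_text.

    Raises ValueError when neither input is given. When both are given,
    only `html` is sent, mirroring the documented precedence.
    """
    if not html and not url:
        raise ValueError("provide either 'html' or 'url'")
    payload = {"html": html} if html else {"url": url}
    # Drop options left as None so the server-side defaults apply.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload
```

For example, `build_payload(url="https://example.com/article", selector="article")` yields a body that can be passed as the `json` argument of `requests.post`.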

Name	Type	Required	Default	Description
html	string	Optional	-	HTML content to extract text from (provide either html or url) Example: <html><body><h1>Hello World</h1></body></html>
url	string	Optional	-	URL to fetch and extract text from (provide either html or url) Example: https://example.com/article
selector	string	Optional	-	CSS selector targeting specific elements (default: the whole page) Example: article, .content, #main
clean	boolean	Optional	true	Remove extra whitespace and normalize formatting Example: true
preserve_links	boolean	Optional	false	Include links and their URLs in the extracted text Example: false
preserve_formatting	boolean	Optional	false	Preserve basic HTML formatting (paragraphs, line breaks) Example: false
max_length	number	Optional	-	Maximum length of the extracted text (longer text is truncated with ...) Example: 5000
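
The optional parameters and their defaults can be mirrored in a small client-side type. This is an illustrative sketch: `ExtractTextOptions` is a hypothetical name, and the check that `max_length` must be positive is an assumption, since the table does not say how the service treats zero or negative values.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ExtractTextOptions:
    # Defaults copied from the parameter table.
    selector: Optional[str] = None
    clean: bool = True
    preserve_links: bool = False
    preserve_formatting: bool = False
    max_length: Optional[int] = None

    def __post_init__(self):
        # Assumption: a non-positive max_length is never meaningful.
        if self.max_length is not None and self.max_length <= 0:
            raise ValueError("max_length must be a positive number")

    def as_dict(self):
        """Return the options as a dict, omitting unset (None) fields."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

The result of `as_dict()` can be merged with an `html` or `url` field to form the request body.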

Request Examples

cURL - Extract from a URL

terminal (Bash)

curl -X POST https://www.crawlforge.dev/api/v1/tools/extract_text \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "selector": "article",
    "clean": true,
    "max_length": 5000
  }'

TypeScript - Extract from HTML

extractText.ts (TypeScript)

const htmlContent = `
  <html>
    <body>
      <article>
        <h1>Article Title</h1>
        <p>This is the main content of the article.</p>
        <a href="/related">Related Article</a>
      </article>
    </body>
  </html>
`;

const response = await fetch('https://www.crawlforge.dev/api/v1/tools/extract_text', {
  method: 'POST',
  headers: {
    'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    html: htmlContent,
    selector: 'article',
    clean: true,
    preserve_links: true
  }),
});

const data = await response.json();

if (data.success) {
  console.log('Extracted text:', data.data.text);
  console.log('Word count:', data.data.metadata.word_count);
  console.log('Character count:', data.data.metadata.character_count);
} else {
  console.error('Error:', data.error);
}

Python - Extract with a selector

extract_text.py (Python)

import requests
import os

response = requests.post(
    'https://www.crawlforge.dev/api/v1/tools/extract_text',
    headers={
        'X-API-Key': os.environ['CRAWLFORGE_API_KEY'],
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://example.com/article',
        'selector': 'article, .main-content',
        'clean': True,
        'preserve_formatting': True,
        'max_length': 10000
    }
)

data = response.json()

if data['success']:
    text = data['data']['text']
    metadata = data['data']['metadata']

    print(f"Title: {metadata['title']}")
    print(f"Word count: {metadata['word_count']}")
    print(f"Character count: {metadata['character_count']}")
    print(f"\nExtracted text:\n{text[:500]}...")
else:
    print(f"Error: {data['error']}")

Response Example

200 OK · 180ms

{
  "success": true,
  "data": {
    "text": "Article Title\n\nThis is the main content of the article. It contains useful information that has been extracted from the HTML.\n\nLinks:\nRelated Article (/related)",
    "metadata": {
      "title": "Article Title - Example Site",
      "description": "Meta description of the article",
      "word_count": 248,
      "character_count": 1432,
      "selector_used": "article",
      "links_preserved": true,
      "formatting_preserved": false
    }
  },
  "credits_used": 1,
  "credits_remaining": 999,
  "processing_time": 180
}
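
A response of this shape can be unpacked into a typed object so that downstream code does not index nested dictionaries directly. This is a sketch only: `ExtractResult` and `parse_result` are hypothetical names, and the field paths are taken from the example response above rather than from a published schema.

```python
from dataclasses import dataclass


@dataclass
class ExtractResult:
    text: str
    word_count: int
    character_count: int
    credits_used: int
    credits_remaining: int


def parse_result(body):
    """Convert a decoded success response into an ExtractResult."""
    if not body.get("success"):
        raise ValueError(f"request failed: {body.get('error')}")
    data = body["data"]
    meta = data["metadata"]
    return ExtractResult(
        text=data["text"],
        word_count=meta["word_count"],
        character_count=meta["character_count"],
        credits_used=body["credits_used"],
        credits_remaining=body["credits_remaining"],
    )
```

With the `requests` library, `parse_result(response.json())` would be the typical call.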

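Field Descriptions

data.text: The extracted plain-text content

data.metadata.word_count: Total number of words in the extracted text

data.metadata.character_count: Total number of characters

data.metadata.selector_used: The CSS selector that was actually applied

credits_used: Credits deducted for this request (1 per extraction)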

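Error Handling

Missing input (400 Bad Request)

Neither html nor url was provided. You must provide at least one of them.

Invalid selector (400 Bad Request)

The CSS selector is invalid or matched no elements. Check the selector syntax.

URL fetch failed (500 Internal Server Error)

Fetching the URL failed. Check that the URL is reachable and returns HTML.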
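
Client code can branch on these status codes to decide whether a retry makes sense. The sketch below uses a hypothetical helper, `describe_error`. Treating 500 as retryable is an assumption (a fetch failure may be transient); the documentation itself does not state a retry policy.

```python
def describe_error(status_code):
    """Return (retryable, hint) for an extract_text error status."""
    if status_code == 400:
        # Missing input or invalid selector: the request must be changed.
        return (False, "Check that html or url is present and the selector is valid.")
    if status_code == 500:
        # Assumption: a failed URL fetch may succeed on a later attempt.
        return (True, "The URL could not be fetched; verify it is reachable and returns HTML.")
    return (False, f"Unexpected status {status_code}.")
```

A caller might use the first element to gate a bounded retry loop and log the second element for diagnosis.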

Credit Cost