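Which sites does scrape_template support?

As of v4.2.2 it supports ten sites: Amazon, LinkedIn, GitHub, YouTube, Reddit, Hacker News, Stack Overflow, npm, Product Hunt, and Twitter/X. Each has a prebuilt schema that returns the fields you typically want (product price and rating, profile name and title, repo stars and README, video transcripts, and so on). More templates are planned for v4.3.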

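Is scraping LinkedIn legal?

In hiQ Labs v. LinkedIn (9th Cir. 2022), the court held that scraping publicly accessible profile data likely does not violate the Computer Fraud and Abuse Act. That is narrower than a blanket permission: LinkedIn's ToS still restricts automated access, and aggressive scraping or commercial resale can still trigger legal action and bans. Use scrape_template with the linkedin-profile template for public, low-volume, non-resale use cases. If you are scraping at scale or building a commercial product, consult a lawyer.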

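Can I add custom templates?

Not directly yet, but we accept template requests on Discord and prioritize them by demand. The most requested sites (Etsy, eBay, TikTok, Instagram, Google Maps) are on the v4.3 roadmap. For one-off custom work, use scrape_structured (CSS selectors) or extract_with_llm (schema-based).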

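What is the difference between scrape_template and scrape_structured?

scrape_template targets ten specific sites whose schemas we maintain; you only pick a template name. scrape_structured is general-purpose: you supply CSS selectors for any site and CrawlForge runs them. When your target is one of the ten supported sites, the template is faster and cheaper (1 credit versus 2 credits).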

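How fresh are the scrape_template schemas?

We monitor every supported site for layout changes and typically ship a template patch within 24 hours of a breaking change. Updates are transparent to your code: you keep calling the same template name and the data shape stays the same. If you spot a regression, report it on Discord or GitHub.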

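What happens if a supported site changes its layout?

Even when the underlying selectors have to change, calls keep returning JSON in the documented shape. We carry the maintenance burden so you do not have to. If a layout change is severe enough to temporarily break a field, we mark that field as nullable in the response until the patch ships (typically within 24 hours).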
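
Because a field may come back as null while a patch is pending, client code is safer if it treats every field as optional. The following is a minimal Python sketch, assuming the response shape of the Amazon example later in this article; the helper name `extract_price` is hypothetical and not part of CrawlForge.

```python
import json
from typing import Optional


def extract_price(response_text: str) -> Optional[float]:
    """Return the price amount, or None if the field is missing or null."""
    data = json.loads(response_text)
    price = data.get("price")  # may be null while a template patch is pending
    if not isinstance(price, dict):
        return None
    amount = price.get("amount")
    return float(amount) if isinstance(amount, (int, float)) else None


# Two illustrative payloads: a healthy one and one with the field nulled out.
healthy = '{"asin": "B0CHX1W1XY", "price": {"amount": 99.99, "currency": "USD"}}'
degraded = '{"asin": "B0CHX1W1XY", "price": null}'

print(extract_price(healthy))   # 99.99
print(extract_price(degraded))  # None
```

Returning None instead of raising lets a monitoring job skip the record and retry after the patch lands.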

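Scrape Amazon, LinkedIn, and 8 More Sites with One Tool

Half of the scrape requests we see at CrawlForge target the same ten sites: Amazon, LinkedIn, GitHub, YouTube, Reddit, Hacker News, Stack Overflow, npm, Product Hunt, and Twitter/X. We got tired of watching people write the same CSS selectors over and over, and of watching those selectors break the next time a site changed its layout. So we did the work once and packaged it as scrape_template. You now get structured JSON for 1 credit.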

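What is scrape_template?

scrape_template is a single CrawlForge tool that bundles ten prebuilt site schemas. You pick a template, pass a URL, and get back structured JSON that matches the site's natural shape. No CSS selectors. No HTML parsing. No schema definitions.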

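The trade-off: you are limited to the ten sites we maintain. If you need anything else, use scrape_structured (CSS-first) or extract_with_llm (LLM-first). For the many requests of the form "I want product data from Amazon", scrape_template is the shortest path. Need a multi-step workflow rather than a single site? See how to use the template library.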

Each scrape costs 1 credit, the same as a basic fetch_url, because we have already done the schema work upstream.

The 10 supported sites

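Template	Returns	Best for	Example URL pattern
`amazon-product`	Title, price, rating, review count, images, ASIN, stock status	Price monitoring, product research	`/dp/<ASIN>`
`linkedin-profile`	Name, headline, location, about, current company	Lead enrichment	`/in/<handle>`
`github-repo`	Stars, forks, languages, topics, license, last updated	Repo analysis, AI training data	`/<owner>/<repo>`
`youtube-video`	Title, channel, view count, duration, publish date, description	Content research	`/watch?v=<id>`
`reddit-thread`	Post title, score, author, subreddit, body	Community signals	`/r/<sub>/comments/<id>`
`hacker-news-front-page`	Front-page stories: title, URL, score, author, comments	Tech trend tracking	`news.ycombinator.com`
`stackoverflow-question`	Question, accepted answer, vote counts, tags	Developer Q&A mining	`/questions/<id>`
`npm-package`	Package metadata, weekly downloads, versions, maintainers	Dependency analysis	`/package/<name>`
`producthunt-launch`	Product, tagline, upvotes, topics, website	Launch monitoring	`/posts/<slug>`
`tweet`	Text, author, URL, images	Social listening	`/<user>/status/<id>`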
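
The URL patterns in the table can be turned into a small router that picks a template name for a given URL. This is a minimal Python sketch, not CrawlForge's own matching logic: the host names are assumptions inferred from the example URLs in this article, and `template_for` is a hypothetical helper.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# (host, path regex, required query key or None, template name)
# Hosts are assumptions based on the example URLs shown in this article.
RULES = [
    ("amazon.com", r"/dp/[A-Z0-9]{10}(/|$)", None, "amazon-product"),
    ("linkedin.com", r"^/in/[^/]+/?$", None, "linkedin-profile"),
    ("github.com", r"^/[^/]+/[^/]+/?$", None, "github-repo"),
    ("youtube.com", r"^/watch$", "v", "youtube-video"),
    ("reddit.com", r"^/r/[^/]+/comments/[^/]+", None, "reddit-thread"),
    ("news.ycombinator.com", r"^/?$", None, "hacker-news-front-page"),
    ("stackoverflow.com", r"^/questions/\d+", None, "stackoverflow-question"),
    ("npmjs.com", r"^/package/.+", None, "npm-package"),
    ("producthunt.com", r"^/posts/[^/]+", None, "producthunt-launch"),
    ("x.com", r"^/[^/]+/status/[^/]+", None, "tweet"),
    ("twitter.com", r"^/[^/]+/status/[^/]+", None, "tweet"),
]


def template_for(url: str) -> Optional[str]:
    """Return the template name whose URL pattern matches, or None."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    for rule_host, path_re, query_key, template in RULES:
        # Accept the bare host and any subdomain such as "www.".
        if host != rule_host and not host.endswith("." + rule_host):
            continue
        if not re.search(path_re, parts.path):
            continue
        if query_key and query_key not in parse_qs(parts.query):
            continue
        return template
    return None


print(template_for("https://www.amazon.com/dp/B0CHX1W1XY"))         # amazon-product
print(template_for("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # youtube-video
print(template_for("https://www.etsy.com/listing/1"))               # None
```

A None result is the signal to fall back to scrape_structured or extract_with_llm.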

Quick start: scrape an Amazon product

Bash

crawlforge template amazon-product "https://www.amazon.com/dp/B0CHX1W1XY"

Output:

Json

{
  "asin": "B0CHX1W1XY",
  "title": "Logitech MX Master 3S Wireless Performance Mouse",
  "price": { "amount": 99.99, "currency": "USD" },
  "rating": 4.7,
  "review_count": 12483,
  "in_stock": true,
  "images": ["https://m.media-amazon.com/...", "..."],
  "credits_used": 1
}

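In an MCP client such as Claude Code:

"Use scrape_template with the amazon-product template to get the current price and rating for ASIN B0CHX1W1XY."

Claude picks the tool, assembles the call, and returns the data. One credit.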
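
For price monitoring, the JSON response from this section can be checked against a target price. Below is a minimal Python sketch, assuming the response shape shown in the output above; `is_deal` and the threshold values are hypothetical.

```python
import json

# The sample response from this section, stored as a string.
SAMPLE = """
{
  "asin": "B0CHX1W1XY",
  "title": "Logitech MX Master 3S Wireless Performance Mouse",
  "price": { "amount": 99.99, "currency": "USD" },
  "rating": 4.7,
  "review_count": 12483,
  "in_stock": true,
  "images": ["https://m.media-amazon.com/...", "..."],
  "credits_used": 1
}
"""


def is_deal(response_text: str, target_price: float) -> bool:
    """True when the product is in stock and priced at or below the target."""
    data = json.loads(response_text)
    return bool(data["in_stock"]) and data["price"]["amount"] <= target_price


print(is_deal(SAMPLE, 100.00))  # True
print(is_deal(SAMPLE, 90.00))   # False
```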

LinkedIn profiles (with a legal note)

Bash

crawlforge template linkedin-profile "https://www.linkedin.com/in/satyanadella"

Output:

Json

{
  "name": "Satya Nadella",
  "headline": "Chairman and CEO at Microsoft",
  "location": "Redmond, Washington",
  "current_role": { "title": "CEO", "company": "Microsoft", "since": "2014-02" },
  "experience_count": 6,
  "skills_top": ["Leadership", "Strategy", "Cloud Computing"],
  "credits_used": 1
}

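A note on scraping LinkedIn. LinkedIn's Terms of Service restrict automated access. In hiQ Labs v. LinkedIn (9th Cir. 2022), the court held that scraping publicly accessible profile data likely does not violate the Computer Fraud and Abuse Act, but commercial use, scraping behind a login, and aggressive request rates can still trigger legal action and ToS bans. Use scrape_template with the linkedin-profile template only for public, low-volume, non-resale data.

GitHub repos for AI training data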
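
One way to keep request volume low on the client side is to enforce a minimum gap between calls. This is a minimal Python sketch of such a throttle; the class name is hypothetical, and the 30-second interval is an arbitrary illustration, not a recommended or legally meaningful value.

```python
import time


class MinIntervalThrottle:
    """Enforce a minimum number of seconds between consecutive calls."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self) -> float:
        """Block until the interval has elapsed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval_s - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Demo with a fake clock so the example finishes instantly.
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


throttle = MinIntervalThrottle(30.0, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [throttle.wait() for _ in range(3)]
print(delays)  # [0.0, 30.0, 30.0]
```

In real use, construct the throttle with the default clock and call `wait()` before each scrape.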

Bash

crawlforge template github-repo "https://github.com/anthropics/anthropic-sdk-python"

Output:

Json

{
  "owner": "anthropics",
  "name": "anthropic-sdk-python",
  "stars": 1842,
  "forks": 287,
  "primary_language": "Python",
  "languages": { "Python": 98.4, "Makefile": 1.6 },
  "license": "MIT",
  "topics": ["claude", "anthropic", "sdk"],
  "readme_markdown": "# Anthropic Python SDK...",
  "last_commit_at": "2026-05-19T14:22:11Z",
  "credits_used": 1
}

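This template is used heavily in AI training data pipelines, pulling READMEs across thousands of repos at scale. Pair it with batch_scrape to process a CSV of repo URLs.

The other seven templates

YouTube returns the title, channel, view count, and the full transcript when available: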
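
Before handing a CSV of repo URLs to a batch job, it helps to validate and de-duplicate the list. The sketch below is a minimal Python example under stated assumptions: the CSV has a column named `url`, and the chunk size is arbitrary. It does not call batch_scrape, whose interface is not described here; `load_repo_urls` and `chunks` are hypothetical helpers.

```python
import csv
import io
import re

# Matches https://github.com/<owner>/<repo> with an optional ".git" or trailing slash.
REPO_URL = re.compile(r"^https://github\.com/([\w.-]+)/([\w.-]+?)(?:\.git)?/?$")


def load_repo_urls(csv_file, column="url"):
    """Return canonical, de-duplicated repo URLs in first-seen order."""
    seen, urls = set(), []
    for row in csv.DictReader(csv_file):
        match = REPO_URL.match((row.get(column) or "").strip())
        if not match:
            continue  # skip rows that are not /<owner>/<repo> URLs
        canonical = f"https://github.com/{match.group(1)}/{match.group(2)}"
        if canonical.lower() not in seen:
            seen.add(canonical.lower())
            urls.append(canonical)
    return urls


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


sample = io.StringIO(
    "url\n"
    "https://github.com/anthropics/anthropic-sdk-python\n"
    "https://github.com/anthropics/anthropic-sdk-python/\n"
    "https://github.com/anthropics\n"
    "https://example.com/not/a-repo\n"
)
urls = load_repo_urls(sample)
print(urls)  # ['https://github.com/anthropics/anthropic-sdk-python']
```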

Bash

crawlforge template youtube-video "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Reddit returns the post plus its comment tree:

Bash

crawlforge template reddit-thread "https://www.reddit.com/r/programming/comments/<id>"

Hacker News returns the front page as a list of stories:

Bash

crawlforge template hacker-news-front-page "https://news.ycombinator.com"
# returns up to 30 front-page stories; slice the top 10 with jq:
crawlforge template hacker-news-front-page "https://news.ycombinator.com" --json | jq '.stories[:10]'
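
The jq slice above keeps the first ten stories in page order. If you would rather rank by score, a few lines of Python will do it once the JSON is captured. This is a minimal sketch: the `stories` key comes from the jq expression above, while the `score` and `title` field names are assumptions based on the template table, and the sample stories are made-up placeholders.

```python
import json


def top_stories(response_text, n=10):
    """Return the n highest-scoring stories; a missing or null score counts as 0."""
    stories = json.loads(response_text).get("stories") or []
    return sorted(stories, key=lambda s: s.get("score") or 0, reverse=True)[:n]


# Made-up placeholder stories, only to show the shape this sketch expects.
sample = json.dumps({"stories": [
    {"title": "Story A", "score": 120},
    {"title": "Story B", "score": 340},
    {"title": "Story C", "score": None},
]})

print([s["title"] for s in top_stories(sample, n=2)])  # ['Story B', 'Story A']
```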

Stack Overflow returns the question, the accepted answer, and the top alternative answers:

Bash

crawlforge template stackoverflow-question "https://stackoverflow.com/questions/12345678"

npm returns package metadata plus weekly downloads:

Bash

crawlforge template npm-package "https://www.npmjs.com/package/next"

Product Hunt returns the product, its makers, and upvotes:

Bash

crawlforge template producthunt-launch "https://www.producthunt.com/posts/crawlforge"

Twitter/X returns a single tweet with engagement data and replies:

Bash

crawlforge template tweet "https://x.com/elonmusk/status/<id>"

All of them return JSON. All of them cost 1 credit. All of them are maintained centrally: when LinkedIn or Amazon changes its layout, we update the template.

scrape_template vs. scrape_structured vs. extract_with_llm

A decision tree:

Is your target one of the 10 supported sites?
  Yes -> use scrape_template (1 credit, maintained for you)
  No
    Do you know the CSS selectors and are they stable?
      Yes -> use scrape_structured (2 credits, you maintain selectors)
      No  -> use extract_with_llm (3 credits, schema-based, layout-resilient)
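
The decision tree above maps directly onto a small function. This is a minimal Python sketch that encodes the tree as written; `choose_tool` and `Choice` are hypothetical names, and the two boolean inputs stand in for the two questions.

```python
from typing import NamedTuple


class Choice(NamedTuple):
    tool: str
    credits: int


def choose_tool(site_supported: bool, selectors_known_and_stable: bool) -> Choice:
    """Pick a tool by following the decision tree top to bottom."""
    if site_supported:
        return Choice("scrape_template", 1)
    if selectors_known_and_stable:
        return Choice("scrape_structured", 2)
    return Choice("extract_with_llm", 3)


print(choose_tool(True, False).tool)   # scrape_template
print(choose_tool(False, True).tool)   # scrape_structured
print(choose_tool(False, False).tool)  # extract_with_llm
```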

Quick comparison:

	scrape_template	scrape_structured	extract_with_llm
Credits	1	2	3
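Coverage	10 specific sites	Any site you can write selectors for	Any site
Maintenance	Maintained by us	Maintained by you	LLM adapts
Speed	Fast (cached schemas)	Fast	Slower (LLM call)
Best for	Popular sites, high volume	Known, specific structures	Unknown or changing structures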
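
The per-call credit prices in the comparison make budgeting a matter of simple arithmetic. Below is a minimal Python sketch, assuming a flat per-call price with no other charges; the helper names are hypothetical.

```python
# Per-call prices taken from the comparison table.
CREDITS_PER_CALL = {
    "scrape_template": 1,
    "scrape_structured": 2,
    "extract_with_llm": 3,
}


def estimate_credits(pages: int, tool: str) -> int:
    """Credits needed to scrape `pages` pages with the given tool."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * CREDITS_PER_CALL[tool]


def pages_within_budget(budget: int, tool: str) -> int:
    """How many pages a credit budget covers with the given tool."""
    return budget // CREDITS_PER_CALL[tool]


for tool in CREDITS_PER_CALL:
    print(tool, pages_within_budget(1000, tool))
# scrape_template 1000
# scrape_structured 500
# extract_with_llm 333
```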

Limitations

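Only 10 sites. If you need Etsy, eBay, TikTok, or another site, you either wait for the roadmap or handle it yourself with scrape_structured or extract_with_llm. Submit template requests on Discord.
Public data only. No template requires a login. For private profiles, gated repos, and protected tweets, only the publicly visible content is returned.
Layout changes happen. When a site ships a redesign, we typically patch the template within 24 hours.
Rate limits apply. For high-volume scraping of LinkedIn or Amazon, pair scrape_template with stealth_mode (5 credits) and respect each site's robots.txt.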
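
Respecting robots.txt can be checked in code before a URL is queued. This is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the user agent string, and the `allowed` helper are made-up illustrations, and in practice you would load each site's real file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only, not taken from any real site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed(EXAMPLE_ROBOTS, "my-crawler", "https://example.com/public/page"))   # True
print(allowed(EXAMPLE_ROBOTS, "my-crawler", "https://example.com/private/page"))  # False
```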

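Ready to skip the selectors? Start free with 1,000 credits, enough for 1,000 template scrapes. New here? Read the v4.2.2 release post for background, or the e-commerce extraction guide for a real workflow built around these templates.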


About the author

CrawlForge Team
