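Advanced Guide

Advanced Scraping Techniques

Use CrawlForge MCP to handle complex scraping scenarios: dynamic content, authentication-protected pages, JavaScript rendering, and AJAX handling.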

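1. Dynamic Content and JavaScript

Many modern websites render their content with JavaScript only after the initial page load. Use scrape_with_actions to wait for dynamic elements before extracting them.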

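When to use browser automation

Single-page applications (SPAs): React, Vue, or Angular apps that load data asynchronously

Lazy loading: images, videos, or other content that loads as you scroll

Interactive elements: dropdowns, modals, or tabs that must be opened to reveal content

Static HTML: use fetch_url instead (5x cheaper)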

Example: scraping a React SPA

5 credits

Bash

curl -X POST https://crawlforge.dev/api/v1/tools/scrape_with_actions \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://spa-example.com/products",
    "actions": [
      {
        "type": "wait",
        "selector": ".product-card",
        "timeout": 5000
      },
      {
        "type": "scroll",
        "direction": "down",
        "amount": 1000
      },
      {
        "type": "wait",
        "duration": 2000
      },
      {
        "type": "extract",
        "selectors": {
          "title": "h1.product-title",
          "price": "span.price",
          "description": "div.product-description"
        }
      }
    ]
  }'

Pro tip: always try fetch_url first. Many SPAs prerender content into the initial HTML or expose API endpoints that you can call directly.
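One way to apply the tip above is to inspect the HTML returned by a cheap fetch_url call and escalate to scrape_with_actions only when the expected content is missing. The sketch below shows just that decision step. The helper names (hasExpectedContent, chooseTool) and the marker strings are hypothetical. The caller is assumed to pass in the raw HTML from wherever the fetch_url response carries it, since the response shape is not specified here.

```typescript
// Hypothetical helper: does the static HTML already contain every marker we need?
function hasExpectedContent(html: string, markers: string[]): boolean {
  return markers.every((marker) => html.includes(marker));
}

type Tool = 'fetch_url' | 'scrape_with_actions';

// Stay on the cheap tool when the content is prerendered,
// fall back to browser automation otherwise.
function chooseTool(staticHtml: string, markers: string[]): Tool {
  return hasExpectedContent(staticHtml, markers) ? 'fetch_url' : 'scrape_with_actions';
}

// Usage with two illustrative HTML snippets
const prerendered = '<h1 class="product-title">Widget</h1><span class="price">$9</span>';
const emptyShell = '<div id="root"></div><script src="/bundle.js"></script>';

console.log(chooseTool(prerendered, ['product-title', 'class="price"'])); // fetch_url
console.log(chooseTool(emptyShell, ['product-title'])); // scrape_with_actions
```

Substring markers are deliberately crude. They are enough to tell an empty application shell from a prerendered page without parsing the document.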

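2. Authentication and Sessions

Scrape pages that sit behind a login form or API authentication by using cookies, request headers, or automated form submission.

Strategy 1: Cookie authentication

Best for sites where you can obtain a session cookie manually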

Bash

curl -X POST https://crawlforge.dev/api/v1/tools/fetch_url \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/dashboard",
    "headers": {
      "Cookie": "session_id=abc123; user_token=xyz789"
    }
  }'
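When several cookies are involved, building the Cookie header string by hand is error-prone. The following is a minimal sketch of a helper that joins name/value pairs into the format used in the request above. buildCookieHeader is a hypothetical name, and the cookie values are the placeholders from the example, not real credentials.

```typescript
// Hypothetical helper: join cookie name/value pairs into one Cookie header value.
function buildCookieHeader(cookies: Record<string, string>): string {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

// Usage: produce the header and embed it in a fetch_url request body
const cookieHeader = buildCookieHeader({ session_id: 'abc123', user_token: 'xyz789' });
const cookieRequestBody = {
  url: 'https://example.com/dashboard',
  headers: { Cookie: cookieHeader },
};

console.log(cookieHeader); // session_id=abc123; user_token=xyz789
console.log(JSON.stringify(cookieRequestBody));
```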

Strategy 2: Automated login with a form

Use form_submit to automate the entire login flow

Bash

curl -X POST https://crawlforge.dev/api/v1/tools/form_submit \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/login",
    "fields": {
      "email": "user@example.com",
      "password": "secure_password"
    },
    "submitButton": "button[type=submit]",
    "waitForNavigation": true
  }'

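Security note: never hardcode credentials. Read them from environment variables and rotate them regularly. Where possible, consider using OAuth or API tokens instead.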
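As a sketch of the advice above, the form_submit payload can be assembled from an environment map instead of literals. The variable names LOGIN_EMAIL and LOGIN_PASSWORD and the helper buildLoginPayload are assumptions made for illustration. The payload fields mirror the form_submit example in this guide.

```typescript
interface LoginPayload {
  url: string;
  fields: Record<string, string>;
  submitButton: string;
  waitForNavigation: boolean;
}

// Hypothetical helper: build the login payload from an environment map.
// LOGIN_EMAIL and LOGIN_PASSWORD are assumed variable names.
function buildLoginPayload(env: Record<string, string | undefined>): LoginPayload {
  const email = env['LOGIN_EMAIL'];
  const password = env['LOGIN_PASSWORD'];
  if (!email || !password) {
    // Fail early rather than sending an incomplete login request.
    throw new Error('Set LOGIN_EMAIL and LOGIN_PASSWORD before running');
  }
  return {
    url: 'https://example.com/login',
    fields: { email, password },
    submitButton: 'button[type=submit]',
    waitForNavigation: true,
  };
}

// Usage: in a real script pass process.env; a literal map is used here only
// so the sketch runs on its own.
const loginPayload = buildLoginPayload({
  LOGIN_EMAIL: 'user@example.com',
  LOGIN_PASSWORD: 'placeholder-not-a-real-secret',
});
console.log(loginPayload.fields['email']); // user@example.com
```

Taking the environment as a parameter keeps the helper easy to test and makes the missing-variable failure explicit.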

3. AJAX and Infinite Scroll

Capture content that loads as you scroll or when a "Load more" button is clicked.

Infinite scroll example

5 credits

Typescript

const response = await fetch('https://crawlforge.dev/api/v1/tools/scrape_with_actions', {
  method: 'POST',
  headers: {
    'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://social-media.com/feed',
    actions: [
      // Scroll to load more content (repeat 5 times)
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      // Extract all posts
      {
        type: 'extract',
        selectors: {
          posts: {
            selector: 'article.post',
            multiple: true,
            fields: {
              text: '.post-content',
              author: '.author-name',
              timestamp: 'time'
            }
          }
        }
      }
    ]
  }),
});

const data = await response.json();
console.log(`Loaded ${data.data.extracted.posts.length} posts`);
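The example above writes out the same scroll and wait pair five times. A small generator keeps the action list short and makes the repeat count adjustable. This is a sketch: scrollActions and the Action type are hypothetical names, and the action shapes are copied from the example rather than from a published schema.

```typescript
// Action shapes as used in the infinite scroll example.
type Action =
  | { type: 'scroll'; direction: 'down' | 'up'; amount: number }
  | { type: 'wait'; duration: number };

// Hypothetical helper: emit `times` scroll-then-wait pairs.
function scrollActions(times: number, amount = 1000, pauseMs = 1000): Action[] {
  const actions: Action[] = [];
  for (let i = 0; i < times; i++) {
    actions.push({ type: 'scroll', direction: 'down', amount });
    actions.push({ type: 'wait', duration: pauseMs });
  }
  return actions;
}

// Usage: five scroll/wait pairs, to be followed by the extract action
const feedActions = scrollActions(5);
console.log(feedActions.length); // 10
```

The result can be spread into the request, for example actions: [...scrollActions(5), extractAction], where extractAction stands for the extract step shown above.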

4. Handling Rate Limits

When you receive a 429 response, retry with exponential backoff.

Retry logic example

Typescript

async function scrapeWithRetry(url: string, maxRetries = 3) {
  let retries = 0;
  let delay = 1000; // Start with 1 second

  while (retries < maxRetries) {
    try {
      const response = await fetch('https://crawlforge.dev/api/v1/tools/fetch_url', {
        method: 'POST',
        headers: {
          'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ url }),
      });

      if (response.status === 429) {
        // Rate limited - wait and retry
        console.log(`Rate limited. Waiting ${delay}ms before retry...`);
        await new Promise(resolve => setTimeout(resolve, delay));
        delay *= 2; // Exponential backoff
        retries++;
        continue;
      }

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }

      return await response.json();
    } catch (error) {
      if (retries === maxRetries - 1) throw error;
      retries++;
      await new Promise(resolve => setTimeout(resolve, delay));
      delay *= 2;
    }
  }

  throw new Error('Max retries exceeded');
}

// Usage
const data = await scrapeWithRetry('https://example.com');
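The delay calculation in the retry loop can be factored into a pure function, which also makes it easy to cap the delay and to honor a Retry-After header when one is present. Retry-After is a standard HTTP response header, but whether the CrawlForge API sends it is an assumption here. nextDelayMs and the 30-second cap are likewise illustrative choices.

```typescript
// Hypothetical helper: pick the wait time before the next retry.
// `attempt` is zero-based, so attempt 0 waits baseMs.
function nextDelayMs(
  retryAfter: string | null,
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
): number {
  // Retry-After given as a number of seconds takes priority.
  if (retryAfter !== null && retryAfter.trim() !== '') {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return Math.min(seconds * 1000, maxMs);
    }
    // An HTTP-date value is not parsed in this sketch; fall through to backoff.
  }
  // Exponential backoff: baseMs, 2x, 4x, ... capped at maxMs.
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

// Usage: the schedule for three attempts without a header
console.log([0, 1, 2].map((attempt) => nextDelayMs(null, attempt))); // [ 1000, 2000, 4000 ]
console.log(nextDelayMs('5', 0)); // 5000
```

Inside a retry loop the call would look like nextDelayMs(response.headers.get('Retry-After'), retries).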

