CrawlForge
Advanced Guide

Advanced Scraping Techniques

Handle complex scraping scenarios with CrawlForge MCP: JavaScript-rendered content, authentication-protected pages, AJAX and infinite scroll, and rate limits.

Dynamic Content & JavaScript
Authentication & Sessions
AJAX & Infinite Scroll
Rate Limit Handling

1. Dynamic Content & JavaScript

Many modern websites render content with JavaScript after the initial page load, so the raw HTML may not contain the data you need. Use scrape_with_actions to wait for dynamic elements to appear before extracting them.

When to Use Browser Automation
Single-Page Apps (SPAs): React, Vue, Angular apps that load data asynchronously
Lazy Loading: Images, videos, or content that loads on scroll
Interactive Elements: Dropdowns, modals, or tabs that reveal content
Static HTML: Use fetch_url instead (5x cheaper)

Example: Scraping a React SPA

5 credits

Bash
curl -X POST https://crawlforge.dev/api/v1/tools/scrape_with_actions \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://spa-example.com/products",
    "actions": [
      {
        "type": "wait",
        "selector": ".product-card",
        "timeout": 5000
      },
      {
        "type": "scroll",
        "direction": "down",
        "amount": 1000
      },
      {
        "type": "wait",
        "duration": 2000
      },
      {
        "type": "extract",
        "selectors": {
          "title": "h1.product-title",
          "price": "span.price",
          "description": "div.product-description"
        }
      }
    ]
  }'
Pro Tip: Always try fetch_url first. Many SPAs pre-render content in the initial HTML or expose API endpoints you can call directly.
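
The fetch-first approach from the tip can be sketched as a simple check on the HTML that fetch_url returns: if the markers you expect are already present, the browser step is unnecessary. This is a minimal sketch, not part of the CrawlForge API. The helper names hasRenderedContent and chooseTool are hypothetical, and the substring check is deliberately crude. How you read the HTML out of the fetch_url response depends on its response shape, which this guide does not document.

```typescript
// Hypothetical helper: checks whether static HTML already contains the
// content you need, so the cheaper fetch_url call is enough.
function hasRenderedContent(html: string, markers: string[]): boolean {
  // Every marker (for example a class name) must appear in the raw HTML.
  return markers.every((marker) => html.includes(marker));
}

// Picks which tool to rely on, based on the static HTML check.
function chooseTool(
  html: string,
  markers: string[],
): 'fetch_url' | 'scrape_with_actions' {
  return hasRenderedContent(html, markers) ? 'fetch_url' : 'scrape_with_actions';
}

// An empty SPA shell: the product cards are not in the initial HTML.
const shell = '<html><body><div id="root"></div></body></html>';
// A pre-rendered page: the cards are already present.
const prerendered = '<div id="root"><div class="product-card">Widget</div></div>';

console.log(chooseTool(shell, ['product-card']));
console.log(chooseTool(prerendered, ['product-card']));
```

With this check in place, you only pay for scrape_with_actions on pages whose initial HTML is missing the markers.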

2. Authentication & Sessions

Scrape pages behind login forms or API authentication using cookies, headers, or automated form submission.

Strategy 1: Cookie Authentication

Best for sites where you can obtain session cookies manually, for example by copying them from your browser after logging in.

Bash
curl -X POST https://crawlforge.dev/api/v1/tools/fetch_url \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/dashboard",
    "headers": {
      "Cookie": "session_id=abc123; user_token=xyz789"
    }
  }'
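
If you keep cookies as name/value pairs, the Cookie header from the request above can be assembled in code rather than by hand. The following is a minimal sketch. The helper names buildCookieHeader and buildAuthenticatedRequest are hypothetical, and the request body simply mirrors the fields used in the curl example.

```typescript
// Builds a Cookie request header value ("name=value; name=value")
// from name/value pairs.
function buildCookieHeader(cookies: Record<string, string>): string {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

// Builds a fetch_url request body with the same shape as the curl example.
function buildAuthenticatedRequest(url: string, cookies: Record<string, string>) {
  return { url, headers: { Cookie: buildCookieHeader(cookies) } };
}

// Placeholder values; in practice, load real cookie values from a secret
// store or environment variables instead of source code.
const requestBody = buildAuthenticatedRequest('https://example.com/dashboard', {
  session_id: 'abc123',
  user_token: 'xyz789',
});

console.log(JSON.stringify(requestBody, null, 2));
```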

Strategy 2: Automated Login with Forms

Automate the entire login process with form_submit, which fills in the form fields and submits the form.

Bash
curl -X POST https://crawlforge.dev/api/v1/tools/form_submit \
  -H "X-API-Key: cf_test_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/login",
    "fields": {
      "email": "user@example.com",
      "password": "secure_password"
    },
    "submitButton": "button[type=submit]",
    "waitForNavigation": true
  }'
Security Note: Never hardcode credentials. Use environment variables and rotate them regularly. Consider using OAuth or API tokens when available.
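
One way to follow the security note is to build the form_submit request body from environment variables and fail fast when one is missing. This is a sketch under assumptions: SCRAPER_EMAIL and SCRAPER_PASSWORD are hypothetical variable names, and the body mirrors the fields in the curl example above.

```typescript
type Env = Record<string, string | undefined>;

// Reads a required variable and throws if it is missing or empty.
function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`Missing environment variable: ${name}`);
  return value;
}

// Builds the form_submit request body. The variable names
// SCRAPER_EMAIL and SCRAPER_PASSWORD are hypothetical.
function buildLoginRequest(env: Env) {
  return {
    url: 'https://example.com/login',
    fields: {
      email: requireEnv(env, 'SCRAPER_EMAIL'),
      password: requireEnv(env, 'SCRAPER_PASSWORD'),
    },
    submitButton: 'button[type=submit]',
    waitForNavigation: true,
  };
}

// In a Node.js script you would pass process.env here; a literal object
// is used so the sketch runs anywhere.
const loginBody = buildLoginRequest({
  SCRAPER_EMAIL: 'user@example.com',
  SCRAPER_PASSWORD: 'value-from-your-secret-store',
});

console.log(Object.keys(loginBody.fields));
```

Passing the environment in as a parameter keeps the function easy to test and makes a missing credential an explicit error instead of a silently empty form field.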

3. AJAX & Infinite Scroll

Capture content that loads as you scroll or click "Load More" buttons.

Infinite Scroll Example

5 credits

Typescript
const response = await fetch('https://crawlforge.dev/api/v1/tools/scrape_with_actions', {
  method: 'POST',
  headers: {
    'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://social-media.com/feed',
    actions: [
      // Scroll to load more content (repeat 5 times)
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      { type: 'scroll', direction: 'down', amount: 1000 },
      { type: 'wait', duration: 1000 },
      // Extract all posts
      {
        type: 'extract',
        selectors: {
          posts: {
            selector: 'article.post',
            multiple: true,
            fields: {
              text: '.post-content',
              author: '.author-name',
              timestamp: 'time'
            }
          }
        }
      }
    ]
  }),
});

const data = await response.json();
console.log(`Loaded ${data.data.extracted.posts.length} posts`);
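
The repeated scroll and wait pairs above can also be generated instead of written out by hand, which makes the number of scrolls easy to tune. A minimal sketch, assuming the same action shapes as the example; buildScrollActions is a hypothetical helper name.

```typescript
// Action shapes copied from the example above.
type ScrollAction = { type: 'scroll'; direction: 'down'; amount: number };
type WaitAction = { type: 'wait'; duration: number };

// Generates `times` scroll-then-wait pairs.
function buildScrollActions(
  times: number,
  amount = 1000,
  waitMs = 1000,
): Array<ScrollAction | WaitAction> {
  const actions: Array<ScrollAction | WaitAction> = [];
  for (let i = 0; i < times; i++) {
    actions.push({ type: 'scroll', direction: 'down', amount });
    actions.push({ type: 'wait', duration: waitMs });
  }
  return actions;
}

const scrollActions = buildScrollActions(5);
console.log(scrollActions.length); // 10 (five scroll/wait pairs)
```

In the request body, you would spread the result ahead of the extract step, for example actions: [...buildScrollActions(5), extractAction], where extractAction stands for the extract object shown above.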

4. Rate Limit Handling

Implement exponential backoff and retry logic when encountering 429 responses.

Retry Logic Example

Typescript
async function scrapeWithRetry(url: string, maxRetries = 3) {
  let delay = 1000; // Start with 1 second

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const isLastAttempt = attempt === maxRetries;
    let response: Response;

    try {
      response = await fetch('https://crawlforge.dev/api/v1/tools/fetch_url', {
        method: 'POST',
        headers: {
          'X-API-Key': process.env.CRAWLFORGE_API_KEY!,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ url }),
      });
    } catch (error) {
      // Network failure - retry unless this was the last attempt
      if (isLastAttempt) throw error;
      await new Promise(resolve => setTimeout(resolve, delay));
      delay *= 2; // Exponential backoff
      continue;
    }

    if (response.status === 429 || response.status >= 500) {
      // Rate limited or server error - back off and retry,
      // but don't wait after the final attempt
      if (isLastAttempt) break;
      console.log(`HTTP ${response.status}. Waiting ${delay}ms before retry...`);
      await new Promise(resolve => setTimeout(resolve, delay));
      delay *= 2; // Exponential backoff
      continue;
    }

    if (!response.ok) {
      // Other client errors (e.g. 404) won't be fixed by retrying
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }

    return await response.json();
  }

  throw new Error('Max retries exceeded');
}

// Usage
const data = await scrapeWithRetry('https://example.com');
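
The wait time between retries can also be computed in one place, with an upper bound and support for the standard HTTP Retry-After header. This is a sketch under assumptions: this guide does not say whether CrawlForge sends Retry-After, retryDelayMs is a hypothetical helper, and only the seconds form of the header is handled (an HTTP-date value falls back to exponential backoff).

```typescript
// Computes how long to wait before retry number `attempt` (0-based).
// Uses a Retry-After value given in seconds when present; otherwise
// doubles the base delay on each attempt, capped at maxMs.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs = 1000,
  maxMs = 30000,
): number {
  if (retryAfter !== null && retryAfter.trim() !== '') {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return Math.min(seconds * 1000, maxMs);
    }
  }
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Without a header, the delay doubles: 1000, 2000, 4000 ms.
console.log([0, 1, 2].map((attempt) => retryDelayMs(attempt, null)));
// With "Retry-After: 5", the header value is used instead.
console.log(retryDelayMs(0, '5'));
```

Inside a retry loop, you would call it as retryDelayMs(attempt, response.headers.get('Retry-After')). The cap keeps a long run of failures from producing multi-minute waits.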