Building a product comparison engine requires extracting structured data from thousands of e-commerce pages across dozens of retailers. Each site has a different HTML structure, different anti-bot measures, and different ways of rendering product data. A scraper that works on Amazon breaks on Shopify stores, and neither works on custom-built catalogs.
CrawlForge solves this with a combination of CSS selector extraction, browser automation for JavaScript-heavy pages, and stealth mode for sites with aggressive bot detection. This guide walks you through building a scalable product data extraction pipeline that handles the real-world messiness of e-commerce sites.
Table of Contents
- Why E-commerce Data Extraction Is Hard
- Architecture Overview
- Step 1: Discover Product Pages
- Step 2: Extract Structured Product Data
- Step 3: Handle JavaScript-Rendered Pages
- Step 4: Scale with Batch Processing
- Step 5: Handle Anti-Bot Protection
- Credit Cost Analysis
- Results and Benefits
- Frequently Asked Questions
Why E-commerce Data Extraction Is Hard
E-commerce scraping faces challenges that other scraping domains do not:
| Challenge | Why It Happens | Impact |
|---|---|---|
| Heterogeneous HTML | Every platform uses different markup | Need per-site selectors |
| Dynamic rendering | React/Next.js/Vue render prices client-side | Static scraping gets empty divs |
| Anti-bot measures | Cloudflare, DataDome, PerimeterX | Requests get blocked |
| Rate limiting | Sites throttle after N requests/minute | Crawls stall or get banned |
| Data inconsistency | Prices change by region, session, or time | Need consistent snapshots |
What is e-commerce data extraction? It is the process of programmatically collecting structured product information -- names, prices, descriptions, images, availability, reviews -- from online retail websites and converting it into a standardized format for analysis, comparison, or catalog building.
CrawlForge is well suited to e-commerce extraction because it provides static scraping, browser automation, and stealth capabilities in a single tool -- so you can match the right technique to each target site without switching between tools.
Architecture Overview
The extraction pipeline uses five CrawlForge tools matched to site complexity:
| Site Complexity | Tool | Credits | When to Use |
|---|---|---|---|
| Static HTML | scrape_structured | 2 | Shopify, WooCommerce, static catalogs |
| JavaScript-rendered | scrape_with_actions | 5 | React/Next.js SPAs, lazy-loaded content |
| Anti-bot protected | stealth_mode | 5 | Cloudflare, DataDome protected sites |
| Bulk processing | batch_scrape | 5 | 25+ URLs from the same domain |
| Page discovery | crawl_deep | 5 | Finding all product pages on a site |
Step 1: Discover Product Pages
Crawl e-commerce sites to build a complete list of product page URLs.
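A minimal sketch of this step, assuming a hypothetical crawl_deep request shape (the parameter names here are illustrative, not the tool's confirmed schema). The filtering helper is plain Python you would run on the returned URL list to keep only product detail pages:

```python
import re

# Hypothetical crawl_deep payload -- parameter names are assumptions.
crawl_request = {
    "url": "https://shop.example.com",
    "max_depth": 3,
    "include_patterns": [r"/products?/"],          # follow product-style paths
    "exclude_patterns": [r"/cart", r"/account"],   # skip non-catalog pages
}

def filter_product_urls(urls):
    """Keep only URLs that look like product detail pages."""
    product_re = re.compile(r"/products?/[\w-]+/?$")
    return [u for u in urls if product_re.search(u)]

discovered = [
    "https://shop.example.com/products/blue-widget",
    "https://shop.example.com/collections/sale",
    "https://shop.example.com/product/red-gadget",
    "https://shop.example.com/cart",
]
print(filter_product_urls(discovered))
```

Shopify uses `/products/`, WooCommerce typically `/product/`; adjust the pattern per platform.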
Step 2: Extract Structured Product Data
For static HTML sites (Shopify, WooCommerce, most traditional e-commerce), use CSS selectors to extract product data.
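To illustrate what selector-based extraction does, here is a stdlib-only sketch that pulls fields out of a sample product page by class name. In practice you would pass a selector map like `selectors` below to scrape_structured; the exact payload shape is an assumption, and the sample HTML and class names are hypothetical:

```python
from html.parser import HTMLParser

# Hypothetical selector map for scrape_structured (shape is an assumption).
selectors = {
    "name": "h1.product-title",
    "price": "span.price",
    "availability": "p.stock-status",
}

SAMPLE_HTML = """
<h1 class="product-title">Blue Widget</h1>
<span class="price">$19.99</span>
<p class="stock-status">In stock</p>
"""

class ClassTextExtractor(HTMLParser):
    """Collect text from elements whose class matches a wanted field."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted       # maps class name -> output field
        self.current = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for cls, field in self.wanted.items():
            if cls in classes:
                self.current = field

    def handle_data(self, data):
        if self.current and data.strip():
            self.result[self.current] = data.strip()
            self.current = None

parser = ClassTextExtractor(
    {"product-title": "name", "price": "price", "stock-status": "availability"}
)
parser.feed(SAMPLE_HTML)
print(parser.result)
```

The output is a flat dict per product -- the standardized record you would write to your catalog store.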
Step 3: Handle JavaScript-Rendered Pages
Modern e-commerce sites built with React, Next.js, or Vue render product data client-side. Use scrape_with_actions to wait for rendering and interact with the page.
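A sketch of what a scrape_with_actions request might look like. The action names ("wait_for", "scroll", "click") and payload shape are illustrative assumptions about the schema, not confirmed API details; the key idea is ordering -- wait for the price element to render before extracting, so you never read empty divs:

```python
# Hypothetical scrape_with_actions payload -- field names are assumptions.
actions_request = {
    "url": "https://spa-shop.example.com/product/123",
    "actions": [
        {"type": "wait_for", "selector": "span.price"},     # wait until price renders
        {"type": "scroll", "direction": "down"},            # trigger lazy-loaded content
        {"type": "click", "selector": "button.show-specs"}, # expand hidden details
    ],
    "extract": {"price": "span.price", "specs": "div.spec-table"},
}

# Actions run in order; extraction happens only after all actions complete.
for step in actions_request["actions"]:
    print(step["type"])
```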
Step 4: Scale with Batch Processing
For extracting data from hundreds or thousands of product pages, use batch_scrape for parallel processing.
Using batch_scrape at 5 credits per batch of 25 URLs works out to 0.2 credits per URL -- 10x more cost-efficient than individual scrape_structured calls (2 credits x 25 = 50 credits versus 5).
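The cost comparison can be expressed directly, using the credit figures from the tables above (this assumes credits are charged per batch of up to 25 URLs, so partial batches round up):

```python
import math

BATCH_CREDITS = 5       # batch_scrape, per batch of up to 25 URLs
BATCH_SIZE = 25
INDIVIDUAL_CREDITS = 2  # scrape_structured, per URL

def batch_cost(n_urls):
    """Credits for batch_scrape, rounding partial batches up."""
    return math.ceil(n_urls / BATCH_SIZE) * BATCH_CREDITS

def individual_cost(n_urls):
    """Credits for one scrape_structured call per URL."""
    return n_urls * INDIVIDUAL_CREDITS

print(batch_cost(1000))       # 200 credits
print(individual_cost(1000))  # 2000 credits
```

At 1,000 URLs the gap is 200 versus 2,000 credits, matching the 0.20-credit-per-URL figure in the cost table below.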
Step 5: Handle Anti-Bot Protection
Some e-commerce sites use Cloudflare, DataDome, or PerimeterX to block scrapers. Use stealth_mode for these targets.
Always try static extraction first (scrape_structured at 2 credits), then browser automation (scrape_with_actions at 5 credits), and only escalate to stealth mode (5 credits) when needed. This tiered approach minimizes credit costs.
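The tiered escalation logic can be sketched as follows. Each tier function is a stub standing in for the corresponding CrawlForge call (in this sketch, only the stealth tier "succeeds", simulating a protected site); replace the stubs with real API calls:

```python
# Stubs simulating the three tiers -- swap in real CrawlForge calls.
def scrape_static(url):        # scrape_structured, 2 credits
    return None                # simulate: static extraction found nothing

def scrape_with_browser(url):  # scrape_with_actions, 5 credits
    return None                # simulate: page blocked the headless browser

def scrape_stealth(url):       # stealth_mode, 5 credits
    return {"name": "Blue Widget", "price": "$19.99"}

def extract_product(url):
    """Try the cheapest tier first; escalate only on failure."""
    tiers = [(scrape_static, 2), (scrape_with_browser, 5), (scrape_stealth, 5)]
    for scrape, cost in tiers:
        data = scrape(url)
        if data:
            return data, cost
    return None, 0

data, cost = extract_product("https://protected.example.com/product/1")
print(data, cost)
```

For a mostly static catalog, the first tier succeeds almost every time and the average cost stays near 2 credits per page.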
Credit Cost Analysis
Extracting 1,000 products from multiple e-commerce sites:
| Scenario | Tool | Credits per URL | Total (1,000 URLs) |
|---|---|---|---|
| Static HTML (Shopify) | batch_scrape | 0.20 | 200 |
| JavaScript-rendered | scrape_with_actions | 5.00 | 5,000 |
| Anti-bot protected | stealth_mode + scrape_with_actions | 10.00 | 10,000 |
| Mixed (typical) | Various | ~2.00 avg | 2,000 |
A realistic mix of 70% static (0.2 credits via batch), 20% JS-rendered (5 credits), and 10% protected (10 credits) averages (0.7 x 0.2) + (0.2 x 5) + (0.1 x 10) = about 2.1 credits per product page.
| Scale | Credits/Month | Recommended Plan |
|---|---|---|
| 500 products | 1,000 | Free tier |
| 2,500 products | 5,000 | Professional ($99/mo) |
| 10,000+ products | 20,000+ | Business ($399/mo) |
Results and Benefits
A well-built e-commerce extraction pipeline delivers:
- Speed: Extract 1,000 products per hour with batch processing
- Coverage: Handle Shopify, WooCommerce, custom builds, and protected sites
- Accuracy: Structured selectors ensure consistent data quality
- Cost efficiency: $0.01-0.07 per product page depending on complexity
Teams building product comparison engines, price tracking tools, or catalog aggregators use CrawlForge to maintain datasets of 10,000-100,000 products with daily refresh cycles.
Frequently Asked Questions
How do I detect which e-commerce platform a site uses?
Use fetch_url (1 credit) and check the HTML source. Look for Shopify.theme (Shopify), woocommerce classes (WooCommerce), magento (Magento), or __next (headless commerce on Next.js). CrawlForge's tech detection in the HTML response headers also helps identify the platform.
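The marker checks described above can be sketched as a small fingerprinting function. The HTML snippets are hypothetical examples; in practice fetch_url would supply the page source:

```python
# Platform markers from the answer above, checked in order.
PLATFORM_MARKERS = {
    "Shopify": "Shopify.theme",
    "WooCommerce": "woocommerce",
    "Magento": "magento",
    "Next.js": "__next",
}

def detect_platform(html):
    """Return the first platform whose marker appears in the HTML source."""
    for platform, marker in PLATFORM_MARKERS.items():
        if marker in html:
            return platform
    return "unknown"

print(detect_platform('<script>window.Shopify.theme = {};</script>'))  # Shopify
print(detect_platform('<div id="__next"></div>'))                      # Next.js
```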
What about extracting product reviews?
Reviews are often loaded asynchronously or paginated. Use scrape_with_actions to click "Load More" buttons or scroll to trigger lazy loading. For sites that load reviews via API calls, use fetch_url to call the API endpoint directly -- this is both faster and cheaper.
How do I handle product variants (sizes, colors)?
Variants are typically rendered after selecting options. Use scrape_with_actions with click actions to select each variant, then extract the updated price and availability. Alternatively, check if the site exposes variant data in a JSON-LD script tag -- scrape_structured can extract this without browser automation.
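A sketch of the JSON-LD approach, using a hypothetical product page snippet. Many product pages embed the full variant list in a `<script type="application/ld+json">` block, which can be read without any browser automation:

```python
import json
import re

SAMPLE = '''
<script type="application/ld+json">
{"@type": "Product", "name": "Widget",
 "offers": [{"sku": "W-S", "price": "19.99"}, {"sku": "W-L", "price": "24.99"}]}
</script>
'''

def extract_json_ld(html):
    """Pull and parse the first JSON-LD block from an HTML page."""
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
    return json.loads(m.group(1)) if m else None

product = extract_json_ld(SAMPLE)
print([offer["price"] for offer in product["offers"]])  # ['19.99', '24.99']
```

One static scrape at 2 credits then yields every variant's SKU and price, versus one browser click per variant.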
Start extracting product data now. Get 1,000 free credits -- enough to extract 500+ product pages from static sites. No credit card required.
Related resources: