Building a product comparison engine requires extracting structured data from thousands of e-commerce pages across dozens of retailers. Each site has a different HTML structure, different anti-bot measures, and different ways of rendering product data. A scraper that works on Amazon breaks on Shopify stores, and neither works on custom-built catalogs.
CrawlForge solves this with a combination of CSS selector extraction, browser automation for JavaScript-heavy pages, and stealth mode for sites with aggressive bot detection. This guide walks you through building a scalable product data extraction pipeline that handles the real-world messiness of e-commerce sites.
Table of Contents
- Why E-commerce Data Extraction Is Hard
- Architecture Overview
- Step 1: Discover Product Pages
- Step 2: Extract Structured Product Data
- Step 3: Handle JavaScript-Rendered Pages
- Step 4: Scale with Batch Processing
- Step 5: Handle Anti-Bot Protection
- Credit Cost Analysis
- Results and Benefits
- Frequently Asked Questions
Why E-commerce Data Extraction Is Hard
E-commerce scraping faces challenges that other scraping domains do not:
| Challenge | Why It Happens | Impact |
|---|---|---|
| Heterogeneous HTML | Every platform uses different markup | Need per-site selectors |
| Dynamic rendering | React/Next.js/Vue render prices client-side | Static scraping gets empty divs |
| Anti-bot measures | Cloudflare, DataDome, PerimeterX | Requests get blocked |
| Rate limiting | Sites throttle after N requests/minute | Crawls stall or get banned |
| Data inconsistency | Prices change by region, session, or time | Need consistent snapshots |
What is e-commerce data extraction? It is the process of programmatically collecting structured product information -- names, prices, descriptions, images, availability, reviews -- from online retail websites and converting it into a standardized format for analysis, comparison, or catalog building.
CrawlForge is well suited to e-commerce extraction because it provides static scraping, browser automation, and stealth capabilities in a single tool -- so you can match the right technique to each target site without switching between tools.
Architecture Overview
The extraction pipeline uses five CrawlForge tools matched to site complexity:
| Site Complexity | Tool | Credits | When to Use |
|---|---|---|---|
| Static HTML | scrape_structured | 2 | Shopify, WooCommerce, static catalogs |
| JavaScript-rendered | scrape_with_actions | 5 | React/Next.js SPAs, lazy-loaded content |
| Anti-bot protected | stealth_mode | 5 | Cloudflare, DataDome protected sites |
| Bulk processing | batch_scrape | 5 | 25+ URLs from the same domain |
| Page discovery | crawl_deep | 5 | Finding all product pages on a site |
Step 1: Discover Product Pages
Crawl e-commerce sites to build a complete list of product page URLs.
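A minimal sketch of this step, assuming a hypothetical crawl_deep request shape (the parameter names here are illustrative, not the tool's confirmed schema). The filtering helper is plain Python you would run on the returned URL list to keep only product detail pages:

```python
import re

# Hypothetical crawl_deep payload -- parameter names are assumptions.
crawl_request = {
    "url": "https://shop.example.com",
    "max_depth": 3,
    "include_patterns": [r"/products?/"],          # follow product-style paths
    "exclude_patterns": [r"/cart", r"/account"],   # skip non-catalog pages
}

def filter_product_urls(urls):
    """Keep only URLs that look like product detail pages."""
    product_re = re.compile(r"/products?/[\w-]+/?$")
    return [u for u in urls if product_re.search(u)]

discovered = [
    "https://shop.example.com/products/blue-widget",
    "https://shop.example.com/collections/sale",
    "https://shop.example.com/product/red-gadget",
    "https://shop.example.com/cart",
]
print(filter_product_urls(discovered))
```

Shopify uses `/products/`, WooCommerce typically `/product/`; adjust the pattern per platform.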
Step 2: Extract Structured Product Data
For static HTML sites (Shopify, WooCommerce, most traditional e-commerce), use CSS selectors to extract product data.
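To illustrate what selector-based extraction does, here is a stdlib-only sketch that pulls fields out of a sample product page by class name. In practice you would pass a selector map like `selectors` below to scrape_structured; the exact payload shape is an assumption, and the sample HTML and class names are hypothetical:

```python
from html.parser import HTMLParser

# Hypothetical selector map for scrape_structured (shape is an assumption).
selectors = {
    "name": "h1.product-title",
    "price": "span.price",
    "availability": "p.stock-status",
}

SAMPLE_HTML = """
<h1 class="product-title">Blue Widget</h1>
<span class="price">$19.99</span>
<p class="stock-status">In stock</p>
"""

class ClassTextExtractor(HTMLParser):
    """Collect text from elements whose class matches a wanted field."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted       # maps class name -> output field
        self.current = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for cls, field in self.wanted.items():
            if cls in classes:
                self.current = field

    def handle_data(self, data):
        if self.current and data.strip():
            self.result[self.current] = data.strip()
            self.current = None

parser = ClassTextExtractor(
    {"product-title": "name", "price": "price", "stock-status": "availability"}
)
parser.feed(SAMPLE_HTML)
print(parser.result)
```

The output is a flat dict per product -- the standardized record you would write to your catalog store.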
Step 3: Handle JavaScript-Rendered Pages
Modern e-commerce sites built with React, Next.js, or Vue render product data client-side. Use scrape_with_actions to wait for rendering and interact with the page.
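A sketch of what a scrape_with_actions request might look like. The action names ("wait_for", "scroll", "click") and payload shape are illustrative assumptions about the schema, not confirmed API details; the key idea is ordering -- wait for the price element to render before extracting, so you never read empty divs:

```python
# Hypothetical scrape_with_actions payload -- field names are assumptions.
actions_request = {
    "url": "https://spa-shop.example.com/product/123",
    "actions": [
        {"type": "wait_for", "selector": "span.price"},     # wait until price renders
        {"type": "scroll", "direction": "down"},            # trigger lazy-loaded content
        {"type": "click", "selector": "button.show-specs"}, # expand hidden details
    ],
    "extract": {"price": "span.price", "specs": "div.spec-table"},
}

# Actions run in order; extraction happens only after all actions complete.
for step in actions_request["actions"]:
    print(step["type"])
```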
Step 4: Scale with Batch Processing
For extracting data from hundreds or thousands of product pages, use batch_scrape for parallel processing.
Using batch_scrape at 5 credits per batch of 25 URLs works out to 0.2 credits per URL -- 10x more cost-efficient than individual scrape_structured calls (2 credits x 25 = 50 credits versus 5).
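The cost comparison can be expressed directly, using the credit figures from the tables above (this assumes credits are charged per batch of up to 25 URLs, so partial batches round up):

```python
import math

BATCH_CREDITS = 5       # batch_scrape, per batch of up to 25 URLs
BATCH_SIZE = 25
INDIVIDUAL_CREDITS = 2  # scrape_structured, per URL

def batch_cost(n_urls):
    """Credits for batch_scrape, rounding partial batches up."""
    return math.ceil(n_urls / BATCH_SIZE) * BATCH_CREDITS

def individual_cost(n_urls):
    """Credits for one scrape_structured call per URL."""
    return n_urls * INDIVIDUAL_CREDITS

print(batch_cost(1000))       # 200 credits
print(individual_cost(1000))  # 2000 credits
```

At 1,000 URLs the gap is 200 versus 2,000 credits, matching the 0.20-credit-per-URL figure in the cost table below.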
Step 5: Handle Anti-Bot Protection
Some e-commerce sites use Cloudflare, DataDome, or PerimeterX to block scrapers. Use stealth_mode for these targets.
Always try static extraction first (scrape_structured at 2 credits), then browser automation (scrape_with_actions at 5 credits), and only escalate to stealth mode (5 credits) when needed. This tiered approach minimizes credit costs.
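The tiered escalation logic can be sketched as follows. Each tier function is a stub standing in for the corresponding CrawlForge call (in this sketch, only the stealth tier "succeeds", simulating a protected site); replace the stubs with real API calls:

```python
# Stubs simulating the three tiers -- swap in real CrawlForge calls.
def scrape_static(url):        # scrape_structured, 2 credits
    return None                # simulate: static extraction found nothing

def scrape_with_browser(url):  # scrape_with_actions, 5 credits
    return None                # simulate: page blocked the headless browser

def scrape_stealth(url):       # stealth_mode, 5 credits
    return {"name": "Blue Widget", "price": "$19.99"}

def extract_product(url):
    """Try the cheapest tier first; escalate only on failure."""
    tiers = [(scrape_static, 2), (scrape_with_browser, 5), (scrape_stealth, 5)]
    for scrape, cost in tiers:
        data = scrape(url)
        if data:
            return data, cost
    return None, 0

data, cost = extract_product("https://protected.example.com/product/1")
print(data, cost)
```

For a mostly static catalog, the first tier succeeds almost every time and the average cost stays near 2 credits per page.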
Credit Cost Analysis
Extracting 1,000 products from multiple e-commerce sites:
| Scenario | Tool | Credits per URL | Total (1,000 URLs) |
|---|---|---|---|
| Static HTML (Shopify) | batch_scrape | 0.20 | 200 |
| JavaScript-rendered | scrape_with_actions | 5.00 | 5,000 |
| Anti-bot protected | stealth_mode + scrape_with_actions | 10.00 | 10,000 |
| Mixed (typical) | Various | ~2.00 avg | 2,000 |
A realistic mix of 70% static (0.2 credits via batch), 20% JS-rendered (5 credits), and 10% protected (10 credits) averages (0.7 x 0.2) + (0.2 x 5) + (0.1 x 10) = about 2.1 credits per product page.
| Scale | Credits/Month | Recommended Plan |
|---|---|---|
| 500 products | 1,000 | Free tier |
| 2,500 products | 5,000 | Professional ($99/mo) |
| 10,000+ products | 20,000+ | Business ($399/mo) |
Results and Benefits
A well-built e-commerce extraction pipeline delivers:
- Speed: Extract 1,000 products per hour with batch processing
- Coverage: Handle Shopify, WooCommerce, custom builds, and protected sites
- Accuracy: Structured selectors ensure consistent data quality
- Cost efficiency: $0.01-0.07 per product page depending on complexity
Teams building product comparison engines, price tracking tools, or catalog aggregators use CrawlForge to maintain datasets of 10,000-100,000 products with daily refresh cycles.
Frequently Asked Questions
How do I detect which e-commerce platform a site uses?
Use fetch_url (1 credit) and check the HTML source. Look for Shopify.theme (Shopify), woocommerce classes (WooCommerce), magento (Magento), or __next (headless commerce on Next.js). CrawlForge's tech detection in the HTML response headers also helps identify the platform.
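The marker checks described above can be sketched as a small fingerprinting function. The HTML snippets are hypothetical examples; in practice fetch_url would supply the page source:

```python
# Platform markers from the answer above, checked in order.
PLATFORM_MARKERS = {
    "Shopify": "Shopify.theme",
    "WooCommerce": "woocommerce",
    "Magento": "magento",
    "Next.js": "__next",
}

def detect_platform(html):
    """Return the first platform whose marker appears in the HTML source."""
    for platform, marker in PLATFORM_MARKERS.items():
        if marker in html:
            return platform
    return "unknown"

print(detect_platform('<script>window.Shopify.theme = {};</script>'))  # Shopify
print(detect_platform('<div id="__next"></div>'))                      # Next.js
```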
What about extracting product reviews?
Reviews are often loaded asynchronously or paginated. Use scrape_with_actions to click "Load More" buttons or scroll to trigger lazy loading. For sites that load reviews via API calls, use fetch_url to call the API endpoint directly -- this is both faster and cheaper.
How do I handle product variants (sizes, colors)?
Variants are typically rendered after selecting options. Use scrape_with_actions with click actions to select each variant, then extract the updated price and availability. Alternatively, check if the site exposes variant data in a JSON-LD script tag -- scrape_structured can extract this without browser automation.
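A sketch of the JSON-LD approach, using a hypothetical product page snippet. Many product pages embed the full variant list in a `<script type="application/ld+json">` block, which can be read without any browser automation:

```python
import json
import re

SAMPLE = '''
<script type="application/ld+json">
{"@type": "Product", "name": "Widget",
 "offers": [{"sku": "W-S", "price": "19.99"}, {"sku": "W-L", "price": "24.99"}]}
</script>
'''

def extract_json_ld(html):
    """Pull and parse the first JSON-LD block from an HTML page."""
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
    return json.loads(m.group(1)) if m else None

product = extract_json_ld(SAMPLE)
print([offer["price"] for offer in product["offers"]])  # ['19.99', '24.99']
```

One static scrape at 2 credits then yields every variant's SKU and price, versus one browser click per variant.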
Start extracting product data now. Get 1,000 free credits -- enough to extract 500+ product pages from static sites. No credit card required.
Related resources: