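Building a product comparison engine means extracting structured data from thousands of e-commerce pages across dozens of retailers. Every site has a different HTML structure, different anti-bot measures, and a different way of rendering product data. A scraper that works on Amazon breaks on a Shopify store, and neither handles custom product catalogs that ignore the Schema.org Product vocabulary.

CrawlForge addresses this with a combination of CSS selector extraction, browser automation for JavaScript-heavy pages, and stealth mode for sites with aggressive bot detection. This guide walks through building a scalable product data extraction pipeline that copes with the messiness of real-world e-commerce sites.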

Why E-commerce Data Extraction Is Hard

E-commerce scraping runs into challenges that are less common in other scraping domains:

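Challenge	Why It Happens	Impact
Heterogeneous HTML	Every platform uses different markup	Selectors are needed per site
Dynamic rendering	React/Next.js/Vue render prices client-side	Static fetches return empty divs
Anti-bot measures	Cloudflare, DataDome, PerimeterX	Requests get blocked
Rate limiting	Sites throttle after N requests per minute	Crawls stall or get banned
Inconsistent data	Prices vary by region, session, or time	Consistent snapshots are needed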

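What is e-commerce data extraction? It is the programmatic collection of structured product information (names, prices, descriptions, images, availability, reviews) from online retail sites, converted into a standardized format for analysis, comparison, or catalog building.

CrawlForge is well suited to e-commerce extraction because it offers static scraping, browser automation, and stealth capabilities in a single tool, so you can match the right technique to each target site without switching tools.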

Architecture Overview

The extraction pipeline uses five CrawlForge tools, matched to site complexity:

Site Complexity	Tool	Credits	When to Use
Static HTML	`scrape_structured`	2	Shopify, WooCommerce, static catalogs
JavaScript-rendered	`scrape_with_actions`	5	React/Next.js SPAs, lazy-loaded content
Bot-protected	`stealth_mode`	5	Sites behind Cloudflare or DataDome
Bulk processing	`batch_scrape`	5	25+ URLs on the same domain
Page discovery	`crawl_deep`	5	Finding all product pages on a site
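
The first three rows of the table can be read as a lookup from site traits to a tool. The sketch below is one way to express that, assuming a hypothetical `SiteProfile` shape whose field names are illustrative and not part of CrawlForge; only the tool names and credit costs come from the table.

```typescript
// Hypothetical description of a target site; these fields are assumptions.
interface SiteProfile {
  needsJavaScript: boolean; // prices or content render client-side
  hasBotProtection: boolean; // e.g. Cloudflare or DataDome in front of the site
}

interface ToolChoice {
  tool: 'scrape_structured' | 'scrape_with_actions' | 'stealth_mode';
  credits: number; // cost taken from the table above
}

// Pick the tool the table lists for the site's traits.
function chooseTool(site: SiteProfile): ToolChoice {
  if (site.hasBotProtection) return { tool: 'stealth_mode', credits: 5 };
  if (site.needsJavaScript) return { tool: 'scrape_with_actions', credits: 5 };
  return { tool: 'scrape_structured', credits: 2 };
}
```

In practice the profile is rarely known up front, so a function like this is most useful once a first static attempt has shown whether the page needs rendering or is blocked.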

Step 1: Discover Product Pages

Crawl the e-commerce site to build a complete list of product page URLs.

Typescript

import { Client } from '@modelcontextprotocol/sdk/client/index.js';

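// The client must be connected to the CrawlForge MCP server before any
// callTool() request, i.e. `await client.connect(transport)` with a transport
// configured for your setup (the connection step is not shown here).
const client = new Client({
  name: 'ecommerce-extractor',
  version: '1.0.0',
});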

interface ProductPageDiscovery {
  domain: string;
  productUrls: string[];
  categoryUrls: string[];
  totalPages: number;
}

async function discoverProducts(
  siteUrl: string,
  maxPages: number = 500
): Promise<ProductPageDiscovery> {
  const crawlResult = await client.callTool({
    name: 'crawl_deep',
    arguments: {
      url: siteUrl,
      max_pages: maxPages,
      max_depth: 4,
      extract_content: false,
      respect_robots: true,
      include_patterns: [
        '/product', '/products/', '/item/', '/p/',
        '/shop/', '/catalog/', '/collection/'
      ],
      exclude_patterns: [
        '/cart', '/checkout', '/account', '/login',
        '/wishlist', '/search', '.css', '.js', '.png', '.jpg'
      ],
    },
  });

  const crawled = JSON.parse(crawlResult.content[0].text);

  const productUrls = crawled.pages
    .map((p: { url: string }) => p.url)
    .filter((url: string) =>
      // Every slash inside the regex literal must be escaped
      url.match(/\/products?\/|\/item\/|\/p\//)
    );

  const categoryUrls = crawled.pages
    .map((p: { url: string }) => p.url)
    .filter((url: string) =>
      url.match(/\/collections?\/|\/category\/|\/shop\//)
    );

  return {
    domain: new URL(siteUrl).hostname,
    productUrls,
    categoryUrls,
    totalPages: crawled.totalPages,
  };
}
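
Crawl output usually contains the same product under several URLs (tracking parameters, fragments), so it can help to classify and deduplicate URLs before extraction. The following is a minimal sketch that reuses the URL patterns from `discoverProducts`; the helper names `classifyUrl` and `dedupeUrls` are hypothetical. It assumes query strings carry no product identity, which is not true for sites that encode variants in parameters such as `?variant=`.

```typescript
type PageKind = 'product' | 'category' | 'other';

// Same URL patterns as discoverProducts, with every slash escaped.
const PRODUCT_PATTERN = /\/products?\/|\/item\/|\/p\//;
const CATEGORY_PATTERN = /\/collections?\/|\/category\/|\/shop\//;

// Classify by path only, so hostnames and query strings cannot cause false matches.
function classifyUrl(url: string): PageKind {
  const path = new URL(url).pathname;
  if (PRODUCT_PATTERN.test(path)) return 'product';
  if (CATEGORY_PATTERN.test(path)) return 'category';
  return 'other';
}

// Drop query strings and fragments so tracking parameters do not create duplicates.
function dedupeUrls(urls: string[]): string[] {
  const seen = new Set<string>();
  for (const raw of urls) {
    const parsed = new URL(raw);
    seen.add(`${parsed.origin}${parsed.pathname}`);
  }
  return Array.from(seen);
}
```

Product patterns are checked first, so a Shopify-style path such as `/collections/summer/products/tee` counts as a product page.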

Step 2: Extract Structured Product Data

For static HTML sites (Shopify, WooCommerce, most traditional e-commerce), use CSS selectors to extract product data.

Typescript

interface ProductData {
  url: string;
  name: string;
  price: string;
  currency: string;
  description: string;
  images: string[];
  availability: string;
  sku: string;
  brand: string;
  category: string;
  rating: string;
  reviewCount: string;
  extractedAt: string;
}

// Selector presets for common e-commerce platforms
const PLATFORM_SELECTORS: Record<string, Record<string, string>> = {
  shopify: {
    name: '.product-single__title, h1.product__title',
    price: '.product__price, .price-item--regular',
    description: '.product-single__description, .product__description',
    images: '.product-single__photo img, .product__media img',
    availability: '.product-form__inventory, [data-availability]',
    sku: '[data-product-sku], .product-single__sku',
    brand: '.product-single__vendor, .product__vendor',
  },
  woocommerce: {
    name: '.product_title, h1.entry-title',
    price: '.woocommerce-Price-amount, .price ins .amount',
    description: '.woocommerce-product-details__short-description, #tab-description',
    images: '.woocommerce-product-gallery img',
    availability: '.stock, .availability',
    sku: '.sku',
    brand: '.posted_in a',
  },
  generic: {
    name: 'h1, [itemprop="name"]',
    price: '[itemprop="price"], .price, .product-price',
    description: '[itemprop="description"], .product-description',
    images: '.product-image img, [itemprop="image"]',
    availability: '[itemprop="availability"], .availability',
    sku: '[itemprop="sku"]',
    brand: '[itemprop="brand"], .brand',
  },
};

async function extractProduct(
  url: string,
  platform: string = 'generic'
): Promise<ProductData> {
  const selectors = PLATFORM_SELECTORS[platform] || PLATFORM_SELECTORS.generic;

  const result = await client.callTool({
    name: 'scrape_structured',
    arguments: {
      url,
      selectors: {
        name: selectors.name,
        price: selectors.price,
        description: selectors.description,
        images: selectors.images,
        availability: selectors.availability,
        sku: selectors.sku,
        brand: selectors.brand,
        rating: '[itemprop="ratingValue"], .star-rating',
        reviewCount: '[itemprop="reviewCount"], .review-count',
        category: '.breadcrumb a, [itemprop="category"]',
      },
    },
  });

  const data = JSON.parse(result.content[0].text);

  return {
    url,
    name: data.name || '',
    price: data.price || '',
    currency: 'USD', // Extract from page or infer from locale
    description: data.description || '',
    images: Array.isArray(data.images) ? data.images : [data.images].filter(Boolean),
    availability: data.availability || 'Unknown',
    sku: data.sku || '',
    brand: data.brand || '',
    category: data.category || '',
    rating: data.rating || '',
    reviewCount: data.reviewCount || '',
    extractedAt: new Date().toISOString(),
  };
}
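
`extractProduct` hard-codes `currency: 'USD'` and keeps the price as raw text. A price comparison needs a number and a currency, so one possible follow-up is a small parser like the sketch below. The `parsePrice` helper and its symbol table are assumptions for illustration: mapping `$` to USD and `¥` to JPY is a guess that fails for CAD, AUD, or CNY stores, and the number parsing assumes "1,299.00" style formatting.

```typescript
interface ParsedPrice {
  amount: number | null; // null when no number could be found
  currency: string | null; // ISO 4217 code, or null when no known symbol is present
}

// Small, deliberately incomplete symbol table; extend it for your markets.
const CURRENCY_SYMBOLS: Record<string, string> = {
  '$': 'USD',
  '€': 'EUR',
  '£': 'GBP',
  '¥': 'JPY',
};

function parsePrice(raw: string): ParsedPrice {
  const symbol = Object.keys(CURRENCY_SYMBOLS).find((s) => raw.includes(s));
  const currency = symbol ? CURRENCY_SYMBOLS[symbol] : null;

  // Assumes commas group thousands and the dot is the decimal mark.
  const match = raw.replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  const amount = match ? Number(match[0]) : null;

  return { amount, currency };
}
```

Returning `null` instead of a default keeps missing prices visible downstream, so they are not silently compared as zero.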

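Step 3: Handle JavaScript-Rendered Pages

Modern e-commerce sites built with React, Next.js, or Vue render product data client-side. Use scrape_with_actions to wait for rendering to finish and to interact with the page.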

Typescript

async function extractDynamicProduct(url: string): Promise<ProductData> {
  const result = await client.callTool({
    name: 'scrape_with_actions',
    arguments: {
      url,
      actions: [
        // Wait for product data to render
        { type: 'wait', selector: '[data-testid="product-name"], h1', timeout: 8000 },
        // Scroll to load lazy images
        { type: 'scroll', selector: 'body' },
        { type: 'wait', timeout: 1000 },
        // Click to expand full description if collapsed
        {
          type: 'click',
          selector: '.read-more, .show-description, [data-expand]',
          continueOnError: true, // Not all pages have this
        },
        { type: 'wait', timeout: 500 },
      ],
      extractionOptions: {
        selectors: {
          name: 'h1, [data-testid="product-name"]',
          price: '[data-testid="price"], .price',
          description: '.description, [data-testid="description"]',
          images: '.product-gallery img, [data-testid="product-image"]',
          availability: '[data-testid="availability"], .stock-status',
          rating: '[data-testid="rating"], .rating-value',
        },
        includeMetadata: true,
        includeImages: true,
      },
      continueOnActionError: true,
    },
  });

  const data = JSON.parse(result.content[0].text);

  return {
    url,
    name: data.extracted?.name || data.metadata?.title || '',
    price: data.extracted?.price || '',
    currency: 'USD',
    description: data.extracted?.description || '',
    images: data.extracted?.images || [],
    availability: data.extracted?.availability || 'Unknown',
    sku: '',
    brand: '',
    category: '',
    rating: data.extracted?.rating || '',
    reviewCount: '',
    extractedAt: new Date().toISOString(),
  };
}

Step 4: Scale with Batch Processing

To extract data from hundreds or thousands of product pages, use batch_scrape for parallel processing.

Typescript

async function batchExtractProducts(
  urls: string[],
  platform: string = 'generic'
): Promise<ProductData[]> {
  const selectors = PLATFORM_SELECTORS[platform] || PLATFORM_SELECTORS.generic;
  const products: ProductData[] = [];
  const batchSize = 25;

  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);

    console.log(
      `Batch ${Math.floor(i / batchSize) + 1}/${Math.ceil(urls.length / batchSize)}: ${batch.length} URLs`
    );

    const result = await client.callTool({
      name: 'batch_scrape',
      arguments: {
        urls: batch.map(url => ({
          url,
          selectors: {
            name: selectors.name,
            price: selectors.price,
            description: selectors.description,
            availability: selectors.availability,
            sku: selectors.sku,
            brand: selectors.brand,
          },
        })),
        maxConcurrency: 10,
        includeMetadata: true,
        delayBetweenRequests: 200, // Respectful crawling
      },
    });

    const batchResult = JSON.parse(result.content[0].text);

    for (const page of batchResult.results) {
      if (page.status === 'success') {
        products.push({
          url: page.url,
          name: page.data?.name || page.metadata?.title || '',
          price: page.data?.price || '',
          currency: 'USD',
          description: page.data?.description || '',
          images: [],
          availability: page.data?.availability || 'Unknown',
          sku: page.data?.sku || '',
          brand: page.data?.brand || '',
          category: '',
          rating: '',
          reviewCount: '',
          extractedAt: new Date().toISOString(),
        });
      }
    }
  }

  return products;
}

At 5 credits per batch of 25 URLs, batch_scrape is 10 times more cost-efficient than calling scrape_structured on each URL individually (2 credits × 25 = 50 credits).
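
The comparison can be worked through for any URL count. The sketch below uses the credit figures quoted in this guide and assumes that a partly filled final batch is still billed as a full batch_scrape call, which the text does not state; the helper names are hypothetical.

```typescript
const BATCH_SIZE = 25; // URLs per batch_scrape call, as in batchExtractProducts
const BATCH_CREDITS = 5; // credits per batch_scrape call
const SINGLE_CREDITS = 2; // credits per scrape_structured call

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Compare the credit cost of batching against one call per URL.
function compareCosts(urlCount: number): { batched: number; individual: number } {
  const batches = Math.ceil(urlCount / BATCH_SIZE);
  return {
    batched: batches * BATCH_CREDITS,
    individual: urlCount * SINGLE_CREDITS,
  };
}
```

For 1,000 URLs this gives 40 batches and 200 credits, against 2,000 credits for individual calls. The ratio drops below 10 when the last batch is only partly filled.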

Step 5: Handle Anti-Bot Protection

Some e-commerce sites use Cloudflare, DataDome, or PerimeterX to block scrapers. Use stealth_mode for these targets.

Typescript

async function extractProtectedProduct(url: string): Promise<ProductData> {
  // Configure stealth browsing context
  const stealthConfig = await client.callTool({
    name: 'stealth_mode',
    arguments: {
      operation: 'create_context',
      stealthConfig: {
        level: 'advanced',
        hideWebDriver: true,
        randomizeFingerprint: true,
        simulateHumanBehavior: true,
        antiDetection: {
          cloudflareBypass: true,
          hideAutomation: true,
        },
        fingerprinting: {
          canvasNoise: true,
          webglSpoofing: true,
          audioContextSpoofing: true,
        },
      },
      urlToTest: url,
    },
  });

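  // Parsed context details returned by stealth_mode. This example does not show
  // how the context is attached to later calls; check the response for an
  // identifier that the follow-up request should reference.
  const context = JSON.parse(stealthConfig.content[0].text);

  // Follow-up scrape intended to run within the stealth context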
  const result = await client.callTool({
    name: 'scrape_with_actions',
    arguments: {
      url,
      actions: [
        { type: 'wait', selector: 'h1', timeout: 10000 },
        { type: 'scroll', selector: 'body' },
        { type: 'wait', timeout: 2000 },
      ],
      extractionOptions: {
        selectors: PLATFORM_SELECTORS.generic,
        includeMetadata: true,
      },
    },
  });

  const data = JSON.parse(result.content[0].text);

  return {
    url,
    name: data.extracted?.name || '',
    price: data.extracted?.price || '',
    currency: 'USD',
    description: data.extracted?.description || '',
    images: [],
    availability: data.extracted?.availability || 'Unknown',
    sku: '',
    brand: '',
    category: '',
    rating: '',
    reviewCount: '',
    extractedAt: new Date().toISOString(),
  };
}

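Always try static extraction first (scrape_structured, 2 credits), then browser automation (scrape_with_actions, 5 credits), and escalate to stealth mode (5 credits) only when necessary. This tiered approach keeps credit costs to a minimum.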
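
The escalation order can be sketched as a loop over tiers that stops at the first usable result. This is an illustrative pattern rather than a CrawlForge feature: the `Tier` interface, `extractWithFallback`, and the rule that a result needs both a name and a price are all assumptions. Each tier's `run` function would wrap one of the extractors from the earlier steps.

```typescript
// Minimal result shape for this sketch; ProductData in this guide has more fields.
interface TierResult {
  name: string;
  price: string;
}

interface Tier {
  label: string; // e.g. 'scrape_structured'
  credits: number; // cost of one attempt
  run: (url: string) => Promise<TierResult | null>; // null or a throw means "escalate"
}

// A result counts as usable only if both name and price were found (an assumption).
function isUsable(r: TierResult | null): r is TierResult {
  return r !== null && r.name.trim() !== '' && r.price.trim() !== '';
}

async function extractWithFallback(url: string, tiers: Tier[]) {
  let creditsSpent = 0;
  for (const tier of tiers) {
    creditsSpent += tier.credits;
    try {
      const result = await tier.run(url);
      if (isUsable(result)) {
        return { result, tier: tier.label, creditsSpent };
      }
    } catch {
      // Treat errors (blocked request, timeout) like an empty result and escalate.
    }
  }
  return { result: null, tier: null, creditsSpent };
}
```

Recording which tier succeeded for a domain lets later runs start at that tier directly instead of paying for attempts that are known to fail.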

Credit Cost Analysis

Extracting 1,000 products from a mix of e-commerce sites:

Scenario	Tool	Credits per URL	Total (1,000 URLs)
Static HTML (Shopify)	`batch_scrape`	0.20	200
JavaScript-rendered	`scrape_with_actions`	5.00	5,000
Bot-protected	`stealth_mode` + `scrape_with_actions`	10.00	10,000
Mixed (typical)	Various	About 2.00 on average	2,000

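A realistic mix of 70% static, 20% JavaScript-rendered, and 10% protected sites averages about 2 credits per product page (0.7 × 0.20 + 0.2 × 5 + 0.1 × 10 ≈ 2.14).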
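
The blended figure is a weighted average of the per-URL costs in the table, and recomputing it for your own mix of sites is straightforward. A short sketch, with the hypothetical helper `blendedCreditsPerUrl`:

```typescript
interface Segment {
  share: number; // fraction of URLs in this segment, between 0 and 1
  creditsPerUrl: number; // from the cost table above
}

// Weighted average of per-URL cost across segments whose shares sum to 1.
function blendedCreditsPerUrl(segments: Segment[]): number {
  const totalShare = segments.reduce((sum, s) => sum + s.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`Shares must sum to 1, got ${totalShare}`);
  }
  return segments.reduce((sum, s) => sum + s.share * s.creditsPerUrl, 0);
}

const mix: Segment[] = [
  { share: 0.7, creditsPerUrl: 0.2 }, // static, batched
  { share: 0.2, creditsPerUrl: 5 }, // JavaScript-rendered
  { share: 0.1, creditsPerUrl: 10 }, // bot-protected
];

const perUrl = blendedCreditsPerUrl(mix); // 0.14 + 1.0 + 1.0, about 2.14
const perThousand = Math.round(perUrl * 1000); // about 2,140 credits per 1,000 URLs
```

Although static pages make up 70% of the URLs, they contribute only 0.14 of the roughly 2.14 credits, so the share of rendered and protected pages dominates the bill.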

Scale	Credits/Month	Recommended Plan
500 products	1,000	Free plan
2,500 products	5,000	Professional ($99/month)
10,000+ products	20,000+	Business ($399/month)

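Results and Benefits

A well-built e-commerce extraction pipeline delivers:

Speed: extract 1,000 products per hour with batch processing
Coverage: handles Shopify, WooCommerce, custom builds, and protected sites
Accuracy: structured selectors keep data quality consistent
Cost efficiency: $0.01 to $0.07 per product page, depending on complexity

Teams building product comparison engines, price trackers, or catalog aggregators use CrawlForge to maintain datasets of 10,000 to 100,000 products on a daily refresh cycle.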

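Frequently Asked Questions

How do I detect which e-commerce platform a site uses?

Use fetch_url (1 credit) and inspect the HTML source. Look for Shopify.theme (Shopify), woocommerce classes (WooCommerce), magento (Magento), or __next (Next.js-based headless commerce). CrawlForge's technology detection in the response headers can also help identify the platform.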
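
The marker check can be automated once the HTML is in hand. The sketch below is a heuristic using only the markers named in this answer; the `detectPlatform` helper is hypothetical, and a match is a hint rather than proof (a blog post that merely mentions "magento" would also match).

```typescript
type Platform = 'shopify' | 'woocommerce' | 'magento' | 'nextjs' | 'unknown';

// Markers taken from the answer above; matching is a heuristic, not a guarantee.
const PLATFORM_MARKERS: Array<{ platform: Platform; pattern: RegExp }> = [
  { platform: 'shopify', pattern: /Shopify\.theme/ },
  { platform: 'woocommerce', pattern: /woocommerce/i },
  { platform: 'magento', pattern: /magento/i },
  { platform: 'nextjs', pattern: /__next/ },
];

// Return the first platform whose marker appears in the HTML source.
function detectPlatform(html: string): Platform {
  const hit = PLATFORM_MARKERS.find((m) => m.pattern.test(html));
  return hit ? hit.platform : 'unknown';
}
```

The `shopify` and `woocommerce` results can be passed straight to `extractProduct` as the platform argument; any other value falls back to the generic selectors there.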

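What about extracting product reviews?

Reviews are usually loaded asynchronously or paginated. Use scrape_with_actions to click a "Load More" button or scroll to trigger lazy loading. For sites that load reviews through API calls, use fetch_url to call that API endpoint directly, which is both faster and cheaper.

How do I handle product variants (size, color)?

Variants often render only after an option is selected. Use scrape_with_actions with click actions to select each variant, then extract the updated price and availability. Alternatively, check whether the site exposes variant data in a JSON-LD script tag; scrape_structured can extract that data without browser automation.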
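
To illustrate what variant data in JSON-LD looks like, the sketch below parses HTML that has already been fetched and collects the `offers` of each Product node. It is plain local parsing, not how scrape_structured works internally, and the helper names are hypothetical. It handles only top-level Product objects (not `@graph` wrappers or nested AggregateOffer lists) and uses a regular expression where a real HTML parser would be more robust.

```typescript
interface JsonLdOffer {
  sku?: string;
  price?: string | number;
  priceCurrency?: string;
  availability?: string;
}

// Pull every JSON-LD block out of raw HTML and parse the ones that are valid JSON.
function parseJsonLd(html: string): unknown[] {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Skip malformed blocks instead of failing the whole page.
    }
  }
  return blocks;
}

// Collect offers from top-level Product nodes; `offers` may be an object or an array.
function extractOffers(html: string): JsonLdOffer[] {
  const offers: JsonLdOffer[] = [];
  for (const block of parseJsonLd(html)) {
    const nodes = Array.isArray(block) ? block : [block];
    for (const node of nodes) {
      if (typeof node !== 'object' || node === null) continue;
      const record = node as Record<string, unknown>;
      if (record['@type'] !== 'Product') continue;
      const raw = record['offers'];
      const list = Array.isArray(raw) ? raw : raw ? [raw] : [];
      for (const offer of list) offers.push(offer as JsonLdOffer);
    }
  }
  return offers;
}
```

When a site lists one offer per variant, each entry carries its own SKU, price, and availability, which avoids clicking through every option in a browser.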

Start extracting product data now. Get 1,000 free credits, enough to extract 500+ product pages from static sites. No credit card required.


About the Author

CrawlForge Team
