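How many credits does it take to build a RAG pipeline from web data?

A single crawl_deep call costs 5 credits and can crawl up to 1,000 pages. For a 200-page documentation site, the total CrawlForge cost is 5 credits. At 5 credits per pipeline, the 1,000-credit free tier lets you build 200 such RAG pipelines before you pay anything.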

How to Build a RAG Pipeline with Web Data

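Retrieval-augmented generation (RAG) is only as good as the data you feed it. Most RAG tutorials use a static document collection: PDFs or markdown files sitting in a folder. Production RAG systems need live web data: documentation that updates weekly, competitor pricing that shifts monthly, research papers published daily.

This guide walks you through building a complete RAG pipeline that uses CrawlForge to crawl and extract web content, then feeds it into a vector database for retrieval-augmented generation. Every step includes working TypeScript code.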

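What Is RAG and Why Use Web Data?
RAG Pipeline Architecture
Step 1: Crawl the Target Website
Step 2: Extract and Clean the Content
Step 3: Chunk the Text for Embedding
Step 4: Generate Embeddings
Step 5: Store in a Vector Database
Step 6: Query and Retrieve
Putting It All Together
Performance Tips
FAQ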

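What Is RAG and Why Use Web Data?

Retrieval-augmented generation is a technique that grounds an LLM's answers in relevant documents retrieved from an external knowledge base. Instead of relying only on training data, which has a knowledge cutoff, a RAG system fetches current, relevant context before it generates an answer.

Why web data makes RAG better:

Freshness: web content is updated continuously; training data is not
Breadth: the web covers every topic, industry, and niche
Specificity: crawl exactly the pages that are relevant to your domain
Authority: pull from official documentation, research papers, and trusted sources

Common RAG use cases for web data:

Customer support bots grounded in live documentation
Competitive intelligence systems that track competitor changes
Research assistants that pull from academic databases and news
Internal knowledge bases enriched with external industry data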

RAG Pipeline Architecture

A RAG pipeline built on web data has six stages:

Crawl --> Extract --> Chunk --> Embed --> Store --> Retrieve
 |         |          |         |         |         |
 URLs    Clean     Text     Vectors   Vector    Context
         text     segments            DB       + LLM

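Stage	Tool	Purpose
Crawl	CrawlForge `crawl_deep` / `batch_scrape`	Discover and fetch pages
Extract	CrawlForge `extract_content`	Clean HTML into readable text
Chunk	Custom logic	Split text into embedding-sized segments
Embed	OpenAI / Cohere / local model	Convert text into vector representations
Store	Pinecone / Weaviate / Qdrant	Index vectors for similarity search
Retrieve	Vector DB query + LLM	Find relevant chunks and generate the answer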

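Step 1: Crawl the Target Website

First, discover and fetch all relevant pages from your target domain. CrawlForge's crawl_deep tool handles pagination, link discovery, and parallel fetching.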

Typescript

import { CrawlForge } from '@crawlforge/sdk';

const cf = new CrawlForge({ apiKey: process.env.CRAWLFORGE_API_KEY });

// Crawl a documentation site up to 3 levels deep
const crawlResult = await cf.crawlDeep({
  url: 'https://docs.example.com',
  max_depth: 3,
  max_pages: 200,
  include_patterns: ['/docs/', '/guides/', '/api/'],
  exclude_patterns: ['/changelog', '/blog'],
  extract_content: true, // Get clean text during crawl
  concurrency: 10
});

console.log(`Crawled ${crawlResult.pages.length} pages`);
// Crawled 147 pages

To scrape a known list of URLs (from a sitemap, for example), use batch_scrape instead:

Typescript

// Scrape a specific list of URLs in parallel
const batchResult = await cf.batchScrape({
  urls: [
    'https://docs.example.com/auth',
    'https://docs.example.com/billing',
    'https://docs.example.com/webhooks',
    'https://docs.example.com/rate-limits',
    // ... up to 50 URLs per batch
  ],
  formats: ['text'],
  maxConcurrency: 10
});

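Credit cost: crawl_deep costs 5 credits per call. batch_scrape costs 5 credits per batch of up to 50 URLs. For 200 pages, a single crawl_deep call (5 credits) is cheaper than the four batch_scrape batches (20 credits) needed to cover the same pages.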
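
The comparison can be checked with a few lines of arithmetic. This is a minimal sketch: the prices and limits are the figures quoted in this article, the helper names are illustrative and not part of the CrawlForge SDK, and it assumes a crawl larger than 1,000 pages would simply need additional crawl_deep calls.

```typescript
// Pricing figures as quoted in this article (not fetched from any API).
const CRAWL_DEEP_CREDITS_PER_CALL = 5;
const CRAWL_DEEP_MAX_PAGES_PER_CALL = 1000;
const BATCH_SCRAPE_CREDITS_PER_BATCH = 5;
const BATCH_SCRAPE_MAX_URLS_PER_BATCH = 50;

// Hypothetical helper: credits needed to cover `pages` with crawl_deep.
// Assumption: sites above the per-call page limit need extra calls.
function crawlDeepCredits(pages: number): number {
  const calls = Math.ceil(pages / CRAWL_DEEP_MAX_PAGES_PER_CALL);
  return calls * CRAWL_DEEP_CREDITS_PER_CALL;
}

// Hypothetical helper: credits needed to cover `pages` with batch_scrape.
function batchScrapeCredits(pages: number): number {
  const batches = Math.ceil(pages / BATCH_SCRAPE_MAX_URLS_PER_BATCH);
  return batches * BATCH_SCRAPE_CREDITS_PER_BATCH;
}

console.log(crawlDeepCredits(200));   // 5  (1 call)
console.log(batchScrapeCredits(200)); // 20 (4 batches of 50 URLs)
```

batch_scrape only matches crawl_deep on price when the whole URL list fits in one batch of 50 or fewer.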

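Step 2: Extract and Clean the Content

Raw HTML contains navigation, ads, footers, and boilerplate that pollute your embeddings. CrawlForge's extract_content tool uses a readability algorithm to isolate the main content.

If you set extract_content: true during the crawl, your content is already clean. For a single page: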

Typescript

// Extract clean content from a single page
const page = await cf.extractContent({
  url: 'https://docs.example.com/authentication'
});

console.log(page.content);
// Returns: "Authentication\n\nAll API requests require..."
// No nav bars, no footers, no cookie banners

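Post-processing tips:

Remove duplicate content (many sites repeat headers and footers in the extracted text)
Strip internal navigation links ("Next: Billing" / "Previous: Setup")
Normalize whitespace and collapse runs of empty lines (keep single blank lines, because the chunker in step 3 splits paragraphs on them)
Preserve headings: they give the chunker its structure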
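
These tips can be combined into one small cleaning pass. The function below is a sketch, not a CrawlForge feature: `cleanExtractedText`, the 20-character duplicate threshold, and the pager-link pattern are all assumptions to adapt to your own sites.

```typescript
// Hypothetical post-processing helper for extracted page text.
function cleanExtractedText(raw: string): string {
  const seen = new Set<string>();
  const kept: string[] = [];

  for (const rawLine of raw.split('\n')) {
    // Collapse runs of spaces/tabs and trim the line.
    const line = rawLine.replace(/[ \t]+/g, ' ').trim();

    // Drop pager links such as "Next: Billing" or "Previous: Setup".
    if (/^(next|previous|prev)\s*:/i.test(line)) continue;

    // Headings and blank lines are always kept.
    const isHeading = /^#{1,6}\s/.test(line);
    if (line !== '' && !isHeading) {
      // Drop exact repeats of longer lines (likely repeated header/footer text).
      if (line.length >= 20 && seen.has(line)) continue;
      seen.add(line);
    }
    kept.push(line);
  }

  // Collapse consecutive blank lines into a single paragraph break.
  return kept.join('\n').replace(/\n{3,}/g, '\n\n').trim();
}
```

The pager pattern will also drop a legitimate sentence that starts with "Next:", so check it against a sample of real pages before relying on it.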

Step 3: Chunk the Text for Embedding

Embedding models have token limits (typically 512 to 8,192 tokens). Long documents must be split into smaller segments that preserve semantic meaning.

Typescript

interface TextChunk {
  content: string;
  metadata: {
    source: string;
    title: string;
    heading: string;
    chunkIndex: number;
  };
}

function chunkByHeading(
  text: string,
  source: string,
  title: string,
  maxChunkSize: number = 1000 // characters
): TextChunk[] {
  const chunks: TextChunk[] = [];
  // Split on markdown headings (## or ###)
  const sections = text.split(/(?=^#{2,3}\s)/m);

  let chunkIndex = 0;
  for (const section of sections) {
    // Extract heading from section
    const headingMatch = section.match(/^#{2,3}\s+(.+)/);
    const heading = headingMatch ? headingMatch[1].trim() : title;
    const content = section.trim();

    if (content.length <= maxChunkSize) {
      chunks.push({
        content,
        metadata: { source, title, heading, chunkIndex: chunkIndex++ }
      });
    } else {
      // Split large sections by paragraph
      const paragraphs = content.split('\n\n');
      let currentChunk = '';

      for (const para of paragraphs) {
        if ((currentChunk + para).length > maxChunkSize && currentChunk) {
          chunks.push({
            content: currentChunk.trim(),
            metadata: { source, title, heading, chunkIndex: chunkIndex++ }
          });
          currentChunk = para;
        } else {
          currentChunk += (currentChunk ? '\n\n' : '') + para;
        }
      }
      if (currentChunk.trim()) {
        chunks.push({
          content: currentChunk.trim(),
          metadata: { source, title, heading, chunkIndex: chunkIndex++ }
        });
      }
    }
  }

  return chunks;
}

// Usage with crawled pages
const allChunks: TextChunk[] = [];
for (const page of crawlResult.pages) {
  const pageChunks = chunkByHeading(
    page.content,
    page.url,
    page.title || 'Untitled',
    1000
  );
  allChunks.push(...pageChunks);
}

console.log(`Created ${allChunks.length} chunks from ${crawlResult.pages.length} pages`);
// Created 892 chunks from 147 pages

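Chunking strategies compared:

Strategy	Pros	Cons	Best for
Fixed-size	Simple, predictable	Cuts mid-sentence	General content
Heading-based	Preserves structure	Uneven chunk sizes	Documentation
Sentence-based	Natural boundaries	Chunks may be too small	Conversational data
Recursive	Balanced sizes and preserved structure	More complex	Production systems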
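
The recursive strategy in the last row can be sketched as follows. This is one possible reading of "recursive" chunking, not the implementation of any particular library: the separator order and the names `recursiveSplit` and `splitKeepingSeparator` are assumptions. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too large.

```typescript
// Separators tried from coarsest to finest: paragraph, line, sentence, word.
const SEPARATORS = ['\n\n', '\n', '. ', ' '];

// Split on `sep` but keep the separator attached to the preceding part,
// so no characters are lost when parts are packed back together.
function splitKeepingSeparator(text: string, sep: string): string[] {
  const parts = text.split(sep);
  return parts
    .map((part, i) => (i < parts.length - 1 ? part + sep : part))
    .filter(part => part.length > 0);
}

function recursiveSplit(text: string, maxSize: number, level = 0): string[] {
  const trimmed = text.trim();
  if (trimmed.length === 0) return [];
  if (trimmed.length <= maxSize) return [trimmed];

  // No separators left: fall back to a hard cut every maxSize characters.
  if (level >= SEPARATORS.length) {
    const pieces: string[] = [];
    for (let i = 0; i < trimmed.length; i += maxSize) {
      pieces.push(trimmed.slice(i, i + maxSize));
    }
    return pieces;
  }

  const chunks: string[] = [];
  let current = '';
  for (const part of splitKeepingSeparator(trimmed, SEPARATORS[level])) {
    if ((current + part).length <= maxSize) {
      current += part; // keep packing parts into the current chunk
      continue;
    }
    if (current.trim()) chunks.push(current.trim());
    current = '';
    if (part.length <= maxSize) {
      current = part;
    } else {
      // This part alone is too large: retry with the next, finer separator.
      chunks.push(...recursiveSplit(part, maxSize, level + 1));
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

console.log(recursiveSplit('aaaa bbbb cccc\n\ndddd eeee', 10));
// [ 'aaaa bbbb', 'cccc', 'dddd eeee' ]
```

Unlike chunkByHeading above, this version also splits a single paragraph that exceeds the size limit, at the cost of ignoring heading metadata.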

Step 4: Generate Embeddings

Use an embedding model to convert each text chunk into a vector representation.

Typescript

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface EmbeddedChunk extends TextChunk {
  embedding: number[];
}

async function embedChunks(
  chunks: TextChunk[],
  batchSize: number = 100
): Promise<EmbeddedChunk[]> {
  const embeddedChunks: EmbeddedChunk[] = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const texts = batch.map(c => c.content);

    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small', // 1536 dimensions, $0.02/1M tokens
      input: texts
    });

    for (let j = 0; j < batch.length; j++) {
      embeddedChunks.push({
        ...batch[j],
        embedding: response.data[j].embedding
      });
    }

    console.log(`Embedded ${Math.min(i + batchSize, chunks.length)}/${chunks.length} chunks`);
  }

  return embeddedChunks;
}

const embeddedChunks = await embedChunks(allChunks);

Embedding model options:

Model	Dimensions	Cost	Quality
OpenAI text-embedding-3-small	1,536	$0.02/1M tokens	Good
OpenAI text-embedding-3-large	3,072	$0.13/1M tokens	Better
Cohere embed-english-v3.0	1,024	$0.10/1M tokens	Good
Local (all-MiniLM-L6-v2)	384	Free	Adequate
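
To put these prices in perspective, here is a rough cost estimate for the example crawl. It is a back-of-the-envelope sketch: the 4-characters-per-token ratio is a common rule of thumb for English text rather than an exact tokenizer count, and `estimateEmbeddingCostUsd` is an illustrative name.

```typescript
// Price per 1M tokens for text-embedding-3-small, as listed in the table above.
const PRICE_PER_MILLION_TOKENS_USD = 0.02;

// Assumption: roughly 4 characters per token for English text.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateEmbeddingCostUsd(chunkTexts: string[]): number {
  const totalChars = chunkTexts.reduce((sum, text) => sum + text.length, 0);
  const approxTokens = Math.ceil(totalChars / APPROX_CHARS_PER_TOKEN);
  return (approxTokens / 1e6) * PRICE_PER_MILLION_TOKENS_USD;
}

// Upper bound for the example crawl: 892 chunks of at most 1,000 characters.
const worstCaseCost = estimateEmbeddingCostUsd(
  Array.from({ length: 892 }, () => 'x'.repeat(1000))
);
console.log(worstCaseCost.toFixed(4)); // 0.0045
```

Under these assumptions, embedding the whole 147-page example costs less than one US cent, so for a crawl of this size the model choice is more about retrieval quality than price.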

Step 5: Store in a Vector Database

Index the embedded chunks in a vector database for fast similarity search.
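
Before the vendor examples, it may help to see what "similarity search" computes. The sketch below is an exhaustive in-memory search using cosine similarity. It is for illustration only: hosted vector databases typically use approximate nearest-neighbor indexes instead of scanning every vector, and the `StoredVector` type and `topK` function are hypothetical names.

```typescript
interface StoredVector {
  id: string;
  values: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Exhaustive top-K search: scores every stored vector, O(n) per query.
function topK(query: number[], store: StoredVector[], k: number) {
  return store
    .map(v => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const demoStore: StoredVector[] = [
  { id: 'auth', values: [1, 0, 0] },
  { id: 'billing', values: [0, 1, 0] },
  { id: 'webhooks', values: [0.6, 0.8, 0] },
];
console.log(topK([1, 0, 0], demoStore, 2).map(m => m.id)); // [ 'auth', 'webhooks' ]
```

A scan like this is fine for a few thousand chunks during prototyping; a dedicated index matters once the collection grows.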

Pinecone example

Typescript

import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.Index('rag-web-data');

// Upsert chunks in batches of 100
async function storeChunks(chunks: EmbeddedChunk[]) {
  const batchSize = 100;

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const vectors = batch.map((chunk, j) => ({
      id: `${chunk.metadata.source}-${chunk.metadata.chunkIndex}`,
      values: chunk.embedding,
      metadata: {
        content: chunk.content,
        source: chunk.metadata.source,
        title: chunk.metadata.title,
        heading: chunk.metadata.heading
      }
    }));

    await index.upsert(vectors);
    console.log(`Stored ${Math.min(i + batchSize, chunks.length)}/${chunks.length} vectors`);
  }
}

await storeChunks(embeddedChunks);

Weaviate example

Typescript

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'https',
  host: process.env.WEAVIATE_HOST || 'localhost:8080'
});

// Create collection schema
await client.schema.classCreator().withClass({
  class: 'WebDocument',
  vectorizer: 'none', // We provide our own vectors
  properties: [
    { name: 'content', dataType: ['text'] },
    { name: 'source', dataType: ['string'] },
    { name: 'title', dataType: ['string'] },
    { name: 'heading', dataType: ['string'] }
  ]
}).do();

// Store chunks
for (const chunk of embeddedChunks) {
  await client.data.creator()
    .withClassName('WebDocument')
    .withProperties({
      content: chunk.content,
      source: chunk.metadata.source,
      title: chunk.metadata.title,
      heading: chunk.metadata.heading
    })
    .withVector(chunk.embedding)
    .do();
}

Step 6: Query and Retrieve

Now query the vector database with the user's question, retrieve the relevant chunks, and pass them to the LLM as context.

Typescript

async function ragQuery(question: string): Promise<string> {
  // 1. Embed the question
  const questionEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question
  });

  // 2. Query Pinecone for the 5 most relevant chunks
  const queryResult = await index.query({
    vector: questionEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  });

  // 3. Build context from retrieved chunks
  const context = queryResult.matches
    .map(match => {
      const meta = match.metadata as Record<string, string>;
      return `Source: ${meta.source}\nSection: ${meta.heading}\n\n${meta.content}`;
    })
    .join('\n\n---\n\n');

  // 4. Generate answer with context
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer the user's question based on the provided context.
Cite sources when possible. If the context does not contain enough
information, say so.`
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`
      }
    ]
  });

  return completion.choices[0].message.content || 'No answer generated';
}

// Example usage
const answer = await ragQuery('How does authentication work?');
console.log(answer);
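
One possible refinement to step 3 of ragQuery is to drop weak matches and cap how much text goes into the prompt. The helper below is a hypothetical sketch, not part of the Pinecone or OpenAI SDKs: the `RetrievedMatch` shape mirrors only the fields ragQuery reads, and the 0.3 score threshold and 6,000-character budget are illustrative values to tune for your data.

```typescript
// Minimal shape of a retrieved match; mirrors the fields used in ragQuery.
interface RetrievedMatch {
  score?: number;
  metadata?: { source: string; heading: string; content: string };
}

// Hypothetical helper: skip weak matches and stop at a character budget.
function buildContext(
  matches: RetrievedMatch[],
  minScore = 0.3, // illustrative threshold; tune per embedding model
  maxChars = 6000 // illustrative budget for the prompt context
): string {
  const parts: string[] = [];
  let used = 0;

  for (const match of matches) {
    if (!match.metadata || (match.score ?? 0) < minScore) continue;
    const { source, heading, content } = match.metadata;
    const block = `Source: ${source}\nSection: ${heading}\n\n${content}`;
    // Matches are assumed to arrive best-first, so stop once the budget is hit.
    if (used + block.length > maxChars) break;
    parts.push(block);
    used += block.length;
  }
  return parts.join('\n\n---\n\n');
}
```

If the function returns an empty string, the caller can skip the LLM call and report that no relevant context was found.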

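Putting It All Together

Here is the whole pipeline in a single orchestration function, reusing chunkByHeading, embedChunks, and storeChunks from the steps above: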

Typescript

async function buildRAGPipeline(config: {
  targetUrl: string;
  maxPages: number;
  indexName: string;
}) {
  const cf = new CrawlForge({ apiKey: process.env.CRAWLFORGE_API_KEY });

  // 1. Crawl
  console.log('Crawling...');
  const crawled = await cf.crawlDeep({
    url: config.targetUrl,
    max_depth: 3,
    max_pages: config.maxPages,
    extract_content: true,
    concurrency: 10
  });
  console.log(`Crawled ${crawled.pages.length} pages`);

  // 2. Chunk
  console.log('Chunking...');
  const chunks = crawled.pages.flatMap(page =>
    chunkByHeading(page.content, page.url, page.title || 'Untitled')
  );
  console.log(`Created ${chunks.length} chunks`);

  // 3. Embed
  console.log('Embedding...');
  const embedded = await embedChunks(chunks);

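  // 4. Store
  // Note: storeChunks (step 5) writes to the module-level `index`,
  // so config.indexName is not used by this function as written.
  console.log('Storing...');
  await storeChunks(embedded);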

  console.log('RAG pipeline complete. Ready for queries.');
}

// Run it
await buildRAGPipeline({
  targetUrl: 'https://docs.example.com',
  maxPages: 200,
  indexName: 'rag-web-data'
});

Total CrawlForge credit cost for crawling 200 pages: 5 credits (one crawl_deep call with extract_content: true).

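Performance Tips

Batch your embeddings: embed 100 chunks per API call instead of one at a time (about 10x faster at the same cost)
Match the chunking strategy to the content: heading-based for documentation, sentence-based for news articles
Set a sensible overlap: a 10-15% overlap between chunks prevents context loss at boundaries
Filter at crawl time: use include_patterns and exclude_patterns to avoid crawling irrelevant pages
Cache aggressively: store crawled content locally and re-crawl only when it changes
Monitor freshness: use CrawlForge's change tracking to detect when source pages update, then re-crawl and re-embed only what changed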
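
The chunkByHeading function in step 3 does not add any overlap, so the overlap tip needs a separate pass. A minimal sketch, assuming overlap is measured in characters and applied to consecutive chunks of the same page (`addOverlap` is an illustrative name):

```typescript
// Hypothetical helper: prefix each chunk with the tail of the previous one.
// The cut is character-based, so the overlap may begin mid-word.
function addOverlap(chunks: string[], overlapRatio = 0.1): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk; // the first chunk has no predecessor
    const prev = chunks[i - 1];
    const overlapChars = Math.floor(prev.length * overlapRatio);
    if (overlapChars === 0) return chunk;
    const tail = prev.slice(prev.length - overlapChars);
    return `${tail}\n\n${chunk}`;
  });
}

console.log(addOverlap(['alpha beta gamma delta', 'epsilon zeta'], 0.5));
// [ 'alpha beta gamma delta', 'gamma delta\n\nepsilon zeta' ]
```

Overlapped chunks are longer than the originals, so leave headroom in maxChunkSize if the embedding model's token limit is tight.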

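FAQ

How many credits does it take to build a RAG pipeline with CrawlForge?

A single crawl_deep call costs 5 credits and can crawl up to 1,000 pages. For a 200-page documentation site, the total CrawlForge cost is 5 credits. The free tier (1,000 credits) lets you build 200 such pipelines before you pay anything. See the pricing page for details.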

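Which vector database should I use for RAG?

Pinecone is the easiest to start with (fully managed, no infrastructure to run). Weaviate offers more flexibility through hybrid search (vector plus keyword). Qdrant is the best self-hosted option. ChromaDB is well suited to prototyping and local development.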

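How often should I re-crawl and update my RAG data?

It depends on how often your source content changes. Documentation sites: weekly. News and research: daily. Product catalogs: hourly. Use CrawlForge's change tracking to detect updates and reprocess only the pages that changed.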
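
This article does not show the API for CrawlForge's change tracking, so the sketch below illustrates the same idea locally instead: fingerprint each page's cleaned text and re-embed only the pages whose fingerprint changed. The `CrawledPage` shape and the in-memory `previousHashes` map are assumptions; in practice you would persist the map between runs.

```typescript
import { createHash } from 'node:crypto';

interface CrawledPage {
  url: string;
  content: string;
}

// SHA-256 of the cleaned page text, used as a content fingerprint.
function contentHash(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Returns the pages whose content differs from the previous run and
// updates `previousHashes` (url -> hash) in place.
function selectChangedPages(
  pages: CrawledPage[],
  previousHashes: Map<string, string>
): CrawledPage[] {
  const changed: CrawledPage[] = [];
  for (const page of pages) {
    const hash = contentHash(page.content);
    if (previousHashes.get(page.url) !== hash) {
      changed.push(page);
      previousHashes.set(page.url, hash);
    }
  }
  return changed;
}
```

Only the pages returned by selectChangedPages need to go through chunking, embedding, and upsert again; unchanged pages keep their existing vectors.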

Can I use CrawlForge with LangChain or LlamaIndex?

Yes. CrawlForge integrates with both frameworks. Use the SDK to fetch content, then pass it to a LangChain document loader or a LlamaIndex data connector. See our LangChain integration guide for examples.

Build your first RAG pipeline in 10 minutes. Start free with 1,000 credits and crawl your first site today.

About the author

CrawlForge Team
