Retrieval-Augmented Generation (RAG) is only as good as the data you feed it. Most RAG tutorials use static document collections -- PDFs or markdown files sitting in a folder. Production RAG systems need live web data: documentation that updates weekly, competitor pricing that changes monthly, research papers published daily.
This guide walks through building a complete RAG pipeline that uses CrawlForge to crawl and extract web content, then feeds it into a vector database for retrieval-augmented generation. Every step includes working TypeScript code.
Table of Contents
- What Is RAG and Why Use Web Data?
- RAG Pipeline Architecture
- Step 1: Crawl Target Websites
- Step 2: Extract and Clean Content
- Step 3: Chunk Text for Embedding
- Step 4: Generate Embeddings
- Step 5: Store in a Vector Database
- Step 6: Query and Retrieve
- Putting It All Together
- Performance Optimization Tips
- Frequently Asked Questions
What Is RAG and Why Use Web Data?
Retrieval-Augmented Generation is a technique where an LLM's response is grounded in relevant documents retrieved from an external knowledge base. Instead of relying solely on training data (which has a knowledge cutoff), RAG systems fetch current, relevant context before generating an answer.
Why web data makes RAG better:
- Freshness -- web content updates in real time; training data does not
- Breadth -- the web covers every topic, industry, and niche
- Specificity -- scrape exactly the pages relevant to your domain
- Authority -- pull from official documentation, research papers, and trusted sources
Common RAG use cases with web data:
- Customer support bots grounded in live documentation
- Competitive intelligence systems tracking competitor changes
- Research assistants pulling from academic databases and news
- Internal knowledge bases enriched with external industry data
RAG Pipeline Architecture
A web-data RAG pipeline has six stages:
    Crawl --> Extract --> Chunk --> Embed --> Store --> Retrieve
      |          |           |         |        |          |
     URLs    Clean text    Text     Vectors  Vector    Context
                         segments             DB       + LLM
| Stage | Tool | Purpose |
|---|---|---|
| Crawl | CrawlForge crawl_deep / batch_scrape | Discover and fetch pages |
| Extract | CrawlForge extract_content | Clean HTML into readable text |
| Chunk | Custom logic | Split text into embedding-sized segments |
| Embed | OpenAI / Cohere / local model | Convert text to vector representations |
| Store | Pinecone / Weaviate / Qdrant | Index vectors for similarity search |
| Retrieve | Vector DB query + LLM | Find relevant chunks, generate answer |
Step 1: Crawl Target Websites
First, discover and fetch all relevant pages from your target domain. CrawlForge's crawl_deep tool handles pagination, link discovery, and parallel fetching.
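As a sketch, the crawl step might look like the following. The REST host, endpoint path, and field names (max_pages, extract_content, pages) are assumptions for illustration, not the documented CrawlForge contract -- check the official API reference before shipping.

```typescript
// Hypothetical host and field names -- verify against the real CrawlForge API.
const CRAWLFORGE_API = "https://api.crawlforge.dev/v1";

interface CrawlDeepOptions {
  url: string;
  maxPages?: number;
  extractContent?: boolean;
}

interface CrawledPage {
  url: string;
  content: string; // clean text when extract_content is enabled
}

// Pure helper: build the crawl_deep request body (snake_case fields assumed).
function buildCrawlDeepBody(opts: CrawlDeepOptions) {
  return {
    url: opts.url,
    max_pages: opts.maxPages ?? 200,
    extract_content: opts.extractContent ?? true,
  };
}

async function crawlDeep(opts: CrawlDeepOptions): Promise<CrawledPage[]> {
  const res = await fetch(`${CRAWLFORGE_API}/crawl_deep`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
    },
    body: JSON.stringify(buildCrawlDeepBody(opts)),
  });
  if (!res.ok) throw new Error(`crawl_deep failed: ${res.status}`);
  const data = await res.json();
  return data.pages as CrawledPage[];
}
```

Setting extract_content during the crawl means Step 2 happens in the same pass, which is why the defaults above enable it.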
For scraping a known list of URLs (like a sitemap), use batch_scrape instead:
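A sketch of the batch path, with the same caveat: endpoint name and payload shape are assumptions. The one concrete constraint from the pricing model -- 50 URLs per batch -- drives the pure batching helper.

```typescript
// Pure helper: split a URL list into batches of up to 50 (the batch_scrape limit).
function batchUrls(urls: string[], size = 50): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push(urls.slice(i, i + size));
  }
  return batches;
}

// Hypothetical endpoint -- confirm the real path and parameters in the docs.
async function batchScrape(urls: string[]): Promise<{ url: string; content: string }[]> {
  const pages: { url: string; content: string }[] = [];
  for (const batch of batchUrls(urls)) {
    const res = await fetch("https://api.crawlforge.dev/v1/batch_scrape", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
      },
      body: JSON.stringify({ urls: batch, extract_content: true }),
    });
    const data = await res.json();
    pages.push(...data.pages); // each batch call costs 5 credits
  }
  return pages;
}
```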
Credit cost: crawl_deep costs 5 credits per invocation. batch_scrape costs 5 credits per batch (up to 50 URLs). For 200 pages, a single crawl_deep call (5 credits) is more cost-effective than the four batch_scrape batches you would need (4 x 5 = 20 credits).
Step 2: Extract and Clean Content
Raw HTML contains navigation, ads, footers, and boilerplate that will pollute your embeddings. CrawlForge's extract_content tool uses readability algorithms to isolate the main content.
If you used extract_content: true during the crawl step, your content is already clean. For individual pages:
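A single-page sketch might look like this; the endpoint path and response fields are assumptions, so the field lookup is isolated in a small helper.

```typescript
// Pure helper: pull the main text out of a response, whichever field it lands in.
function pickContent(response: { content?: string; markdown?: string }): string {
  return response.content ?? response.markdown ?? "";
}

// Hypothetical endpoint -- confirm the real path and parameters in the docs.
async function extractContent(url: string): Promise<string> {
  const res = await fetch("https://api.crawlforge.dev/v1/extract_content", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
    },
    body: JSON.stringify({ url }),
  });
  return pickContent(await res.json());
}
```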
Post-processing tips:
- Remove duplicate content (many sites repeat headers/footers in extracted text)
- Strip internal navigation links ("Next: Billing" / "Previous: Setup")
- Normalize whitespace and remove empty lines
- Keep headings -- they provide structure for chunking
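The four tips above can be sketched as one pure cleanup pass. The nav-link pattern and the dedupe rule here are illustrative; tune both to what your sources actually emit.

```typescript
// Post-processing sketch: normalize whitespace, drop empty lines, strip
// "Next:"/"Previous:" nav links, de-duplicate repeated boilerplate lines,
// and keep headings (they guide chunking in Step 3).
function cleanExtractedText(text: string): string {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.replace(/\s+/g, " ").trim(); // normalize whitespace
    if (line === "") continue; // remove empty lines
    if (/^(Next|Previous):/i.test(line)) continue; // strip internal nav links
    const isHeading = line.startsWith("#");
    if (!isHeading && seen.has(line)) continue; // drop repeated boilerplate
    seen.add(line);
    out.push(line);
  }
  return out.join("\n");
}
```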
Step 3: Chunk Text for Embedding
Embedding models have token limits (typically 512-8,192 tokens). Long documents must be split into smaller chunks that preserve semantic meaning.
Chunking strategies compared:
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size | Simple, predictable | Breaks mid-sentence | General content |
| Heading-based | Preserves structure | Uneven chunk sizes | Documentation |
| Sentence-based | Natural boundaries | May be too small | Conversational data |
| Recursive | Balanced sizes + structure | More complex | Production systems |
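As a starting point, here is a minimal fixed-size chunker with overlap. Sizes are in characters for clarity; production code usually counts tokens and splits on headings or sentences first, then falls back to fixed windows.

```typescript
interface Chunk {
  text: string;
  index: number;
}

// Fixed-size chunking with overlap: each chunk re-includes the tail of the
// previous one so no sentence is stranded at a boundary.
function chunkText(text: string, maxChars = 1000, overlap = 150): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  let index = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push({ text: text.slice(start, end), index: index++ });
    if (end === text.length) break;
    start = end - overlap; // 150/1000 = 15% overlap, in the recommended range
  }
  return chunks;
}
```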
Step 4: Generate Embeddings
Convert each text chunk into a vector representation using an embedding model.
Embedding model options:
| Model | Dimensions | Cost | Quality |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | $0.02/1M tokens | Good |
| OpenAI text-embedding-3-large | 3,072 | $0.13/1M tokens | Better |
| Cohere embed-english-v3.0 | 1,024 | $0.10/1M tokens | Good |
| Local (all-MiniLM-L6-v2) | 384 | Free | Adequate |
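Sketched against OpenAI's embeddings endpoint with text-embedding-3-small from the table above. The response parsing is kept in a pure helper; treat the error-handling details as a sketch rather than production code.

```typescript
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Pure helper: sort by index so output vectors stay aligned with input texts.
function toVectors(resp: EmbeddingResponse): number[][] {
  return [...resp.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}

async function embed(texts: string[]): Promise<number[][]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: texts }),
  });
  if (!res.ok) throw new Error(`embeddings request failed: ${res.status}`);
  return toVectors((await res.json()) as EmbeddingResponse);
}
```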
Step 5: Store in a Vector Database
Index the embedded chunks in a vector database for fast similarity search.
Pinecone Example
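A hedged sketch of an upsert over Pinecone's data-plane REST API. The index host comes from your Pinecone console, and the metadata fields (text, url) are our own convention, not anything Pinecone requires.

```typescript
interface DocChunk {
  id: string;
  text: string;
  url: string;
  embedding: number[];
}

// Pure helper: shape chunks into Pinecone upsert records. Storing the raw
// text as metadata lets Step 6 return readable context without a second store.
function toPineconeRecords(chunks: DocChunk[]) {
  return chunks.map((c) => ({
    id: c.id,
    values: c.embedding,
    metadata: { text: c.text, url: c.url },
  }));
}

async function upsertToPinecone(indexHost: string, chunks: DocChunk[]): Promise<void> {
  await fetch(`https://${indexHost}/vectors/upsert`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Api-Key": process.env.PINECONE_API_KEY ?? "",
    },
    body: JSON.stringify({ vectors: toPineconeRecords(chunks) }),
  });
}
```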
Weaviate Example
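The equivalent sketch for Weaviate's batch REST endpoint. The class name "DocChunk" and the host are placeholders for your own schema and deployment; check the Weaviate docs for the exact batch-object shape your version expects.

```typescript
interface ChunkRecord {
  text: string;
  url: string;
  embedding: number[];
}

// Pure helper: shape chunks into Weaviate batch objects with pre-computed vectors.
function toWeaviateObjects(chunks: ChunkRecord[]) {
  return chunks.map((c) => ({
    class: "DocChunk",
    properties: { text: c.text, url: c.url },
    vector: c.embedding, // bring-your-own-vector; omit to use Weaviate's vectorizer
  }));
}

async function upsertToWeaviate(host: string, chunks: ChunkRecord[]): Promise<void> {
  await fetch(`${host}/v1/batch/objects`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ objects: toWeaviateObjects(chunks) }),
  });
}
```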
Step 6: Query and Retrieve
Now query the vector database with a user question, retrieve relevant chunks, and pass them to an LLM as context.
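A sketch of retrieval against Pinecone's query endpoint. The question vector must come from the same embedding model used in Step 4, and the prompt template below is just one reasonable choice, not a prescribed format.

```typescript
// Query Pinecone for the topK nearest chunks (REST data-plane endpoint).
async function retrieve(indexHost: string, qVector: number[], topK = 5): Promise<string[]> {
  const res = await fetch(`https://${indexHost}/query`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Api-Key": process.env.PINECONE_API_KEY ?? "",
    },
    body: JSON.stringify({ vector: qVector, topK, includeMetadata: true }),
  });
  const data = await res.json();
  // The text lives in metadata because Step 5 stored it there.
  return data.matches.map((m: { metadata: { text: string } }) => m.metadata.text);
}

// Pure helper: assemble the grounded prompt handed to the LLM.
function buildPrompt(question: string, contexts: string[]): string {
  return [
    "Answer the question using only the context below.",
    "",
    ...contexts.map((c, i) => `[${i + 1}] ${c}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The assembled prompt then goes to whichever chat model you use for generation.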
Putting It All Together
Here is the complete pipeline in a single orchestration function:
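Sketched with the stage functions injected as dependencies so the pipeline shape stays visible; in a real script you would pass in the crawl, chunk, embed, and store implementations from the steps above.

```typescript
interface PipelineDeps {
  crawl: (startUrl: string) => Promise<{ url: string; content: string }[]>;
  chunk: (text: string) => string[];
  embed: (texts: string[]) => Promise<number[][]>;
  store: (records: { id: string; text: string; url: string; vector: number[] }[]) => Promise<void>;
}

// Stages 1-5 in order; stage 6 (retrieve) runs later, at query time.
async function buildRagIndex(startUrl: string, deps: PipelineDeps): Promise<number> {
  const pages = await deps.crawl(startUrl); // crawl + extract in one pass
  let stored = 0;
  for (const page of pages) {
    const texts = deps.chunk(page.content);  // chunk
    const vectors = await deps.embed(texts); // embed (one call per page's chunks)
    await deps.store(
      texts.map((text, i) => ({
        id: `${page.url}#${i}`, // stable id: source URL + chunk index
        text,
        url: page.url,
        vector: vectors[i],
      }))
    );
    stored += texts.length;
  }
  return stored; // total chunks indexed
}
```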
Total CrawlForge credit cost for 200 pages: 5 credits (single crawl_deep call with extract_content: true).
Performance Optimization Tips
- Batch embeddings -- embed 100 chunks per API call instead of one at a time (10x faster, same cost)
- Use heading-based chunking for documentation, sentence-based for news articles
- Set appropriate overlap -- 10-15% overlap between chunks prevents context loss at boundaries
- Filter during crawl -- use include_patterns and exclude_patterns to avoid crawling irrelevant pages
- Cache aggressively -- store crawled content locally so you only re-crawl when content changes
- Monitor freshness -- use CrawlForge's change tracking to detect when source pages update, then re-crawl and re-embed only changed content
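The first tip above can be made concrete: group chunks into batches of 100 and send each batch as a single embedding request. The batch-issuing function is injected here so the batching logic stays independent of any particular provider.

```typescript
// Generic helper: split any array into fixed-size groups.
function batchOf<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Embed 100 chunks per API call instead of one call per chunk.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of batchOf(texts, 100)) {
    vectors.push(...(await embedBatch(batch))); // one request per 100 chunks
  }
  return vectors;
}
```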
Frequently Asked Questions
How many credits does it cost to build a RAG pipeline with CrawlForge?
A single crawl_deep call costs 5 credits and can crawl up to 1,000 pages. For a 200-page documentation site, the total CrawlForge cost is 5 credits. The free tier (1,000 credits) lets you build 200 RAG pipelines before paying anything. View pricing details.
Which vector database should I use for RAG?
Pinecone is the easiest to start with (fully managed, no infrastructure). Weaviate offers more flexibility with hybrid search (vector + keyword). Qdrant is the best self-hosted option. ChromaDB works well for prototyping and local development.
How often should I re-crawl and update my RAG data?
It depends on how often your source content changes. Documentation sites: weekly. News and research: daily. Product catalogs: hourly. Use CrawlForge's change tracking to detect updates and only re-process changed pages.
Can I use CrawlForge with LangChain or LlamaIndex?
Yes. CrawlForge integrates with both frameworks. Use the SDK to fetch content, then pass it to LangChain's document loaders or LlamaIndex's data connectors. See our LangChain integration guide for examples.
Build your first RAG pipeline in under 10 minutes. Start free with 1,000 credits and crawl your first site today.