AI Engineering

How to Build a RAG Pipeline with Web Data

CrawlForge Team
Engineering Team
May 1, 2026
11 min read


Retrieval-Augmented Generation (RAG) is only as good as the data you feed it. Most RAG tutorials use static document collections -- PDFs or markdown files sitting in a folder. Production RAG systems need live web data: documentation that updates weekly, competitor pricing that changes monthly, research papers published daily.

This guide walks through building a complete RAG pipeline that uses CrawlForge to crawl and extract web content, then feeds it into a vector database for retrieval-augmented generation. Every step includes working TypeScript code.

Table of Contents

  • What Is RAG and Why Use Web Data?
  • RAG Pipeline Architecture
  • Step 1: Crawl Target Websites
  • Step 2: Extract and Clean Content
  • Step 3: Chunk Text for Embedding
  • Step 4: Generate Embeddings
  • Step 5: Store in a Vector Database
  • Step 6: Query and Retrieve
  • Putting It All Together
  • Performance Optimization Tips
  • Frequently Asked Questions

What Is RAG and Why Use Web Data?

Retrieval-Augmented Generation is a technique where an LLM's response is grounded in relevant documents retrieved from an external knowledge base. Instead of relying solely on training data (which has a knowledge cutoff), RAG systems fetch current, relevant context before generating an answer.

Why web data makes RAG better:

  • Freshness -- web content updates in real time; training data does not
  • Breadth -- the web covers every topic, industry, and niche
  • Specificity -- scrape exactly the pages relevant to your domain
  • Authority -- pull from official documentation, research papers, and trusted sources

Common RAG use cases with web data:

  • Customer support bots grounded in live documentation
  • Competitive intelligence systems tracking competitor changes
  • Research assistants pulling from academic databases and news
  • Internal knowledge bases enriched with external industry data

RAG Pipeline Architecture

A web-data RAG pipeline has six stages:

Crawl (URLs) --> Extract (clean text) --> Chunk (text segments) --> Embed (vectors) --> Store (vector DB) --> Retrieve (context + LLM)

Stage    | Tool                                 | Purpose
Crawl    | CrawlForge crawl_deep / batch_scrape | Discover and fetch pages
Extract  | CrawlForge extract_content           | Clean HTML into readable text
Chunk    | Custom logic                         | Split text into embedding-sized segments
Embed    | OpenAI / Cohere / local model        | Convert text to vector representations
Store    | Pinecone / Weaviate / Qdrant         | Index vectors for similarity search
Retrieve | Vector DB query + LLM                | Find relevant chunks, generate answer

Step 1: Crawl Target Websites

First, discover and fetch all relevant pages from your target domain. CrawlForge's crawl_deep tool handles pagination, link discovery, and parallel fetching.

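A sketch of the crawl step. CrawlForge exposes crawl_deep as an MCP tool, and how you invoke it depends on your client, so the REST endpoint, the max_pages parameter, and the response shape below are illustrative assumptions (extract_content and include_patterns are the options discussed later in this guide):

```typescript
// Hypothetical REST wrapper around CrawlForge's crawl_deep tool.
// Endpoint URL, max_pages, and response shape are assumptions -- adapt
// them to the SDK or MCP client you actually use.
interface CrawlPage {
  url: string;
  content: string; // clean text when extract_content: true
}

async function crawlSite(startUrl: string): Promise<CrawlPage[]> {
  const res = await fetch("https://api.crawlforge.example/v1/crawl_deep", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
    },
    body: JSON.stringify({
      url: startUrl,
      max_pages: 200,                // cap the crawl
      extract_content: true,         // return clean text, not raw HTML
      include_patterns: ["/docs/*"], // stay inside the relevant section
    }),
  });
  if (!res.ok) throw new Error(`crawl_deep failed: ${res.status}`);
  const { pages } = (await res.json()) as { pages: CrawlPage[] };
  return pages;
}
```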

For scraping a known list of URLs (like a sitemap), use batch_scrape instead:

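A sketch of batch scraping, assuming the same hypothetical REST wrapper as the crawl example. Since batch_scrape accepts up to 50 URLs per call, the URL list is split into batches first:

```typescript
// Split a list into fixed-size batches (pure helper).
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical REST wrapper around CrawlForge's batch_scrape tool;
// endpoint URL and response shape are assumptions.
async function batchScrape(urls: string[]): Promise<{ url: string; content: string }[]> {
  const pages: { url: string; content: string }[] = [];
  for (const batch of toBatches(urls, 50)) { // 50-URL limit per batch
    const res = await fetch("https://api.crawlforge.example/v1/batch_scrape", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
      },
      body: JSON.stringify({ urls: batch, extract_content: true }),
    });
    if (!res.ok) throw new Error(`batch_scrape failed: ${res.status}`);
    pages.push(...(await res.json()).pages);
  }
  return pages;
}
```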

Credit cost: crawl_deep costs 5 credits per invocation. batch_scrape costs 5 credits per batch of up to 50 URLs. For 200 pages, a single crawl_deep call (5 credits) is more cost-effective than the four batch_scrape calls (20 credits) the same pages would require.

Step 2: Extract and Clean Content

Raw HTML contains navigation, ads, footers, and boilerplate that will pollute your embeddings. CrawlForge's extract_content tool uses readability algorithms to isolate the main content.

If you used extract_content: true during the crawl step, your content is already clean. For individual pages:

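A single-page extraction sketch. The endpoint URL and response fields are assumptions modeled on the crawl example above, not the documented API:

```typescript
// Hypothetical REST wrapper around CrawlForge's extract_content tool.
async function extractPage(url: string): Promise<{ title: string; text: string }> {
  const res = await fetch("https://api.crawlforge.example/v1/extract_content", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
    },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) throw new Error(`extract_content failed: ${res.status}`);
  const { title, text } = await res.json();
  return { title, text };
}
```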

Post-processing tips:

  • Remove duplicate content (many sites repeat headers/footers in extracted text)
  • Strip internal navigation links ("Next: Billing" / "Previous: Setup")
  • Normalize whitespace and remove empty lines
  • Keep headings -- they provide structure for chunking
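One way to apply the tips above in code: normalize whitespace, drop empty lines, and de-duplicate exact repeated lines (the usual signature of headers and footers echoed into extracted text). Treating lines that start with `#` as headings is an assumption that the extractor emits markdown-style headings:

```typescript
// Post-process extracted text: collapse runs of whitespace, drop blank
// lines, and remove exact duplicate lines (repeated nav/footer text),
// while always keeping heading lines for later chunking.
function cleanExtractedText(raw: string): string {
  const seen = new Set<string>();
  const kept: string[] = [];
  const lines = raw
    .split("\n")
    .map(line => line.replace(/\s+/g, " ").trim())
    .filter(Boolean); // remove empty lines
  for (const line of lines) {
    // keep headings unconditionally; keep other lines only once
    if (line.startsWith("#") || !seen.has(line)) kept.push(line);
    seen.add(line);
  }
  return kept.join("\n");
}
```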

Step 3: Chunk Text for Embedding

Embedding models have token limits (typically 512-8,192 tokens). Long documents must be split into smaller chunks that preserve semantic meaning.

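A minimal chunker sketch: split on blank lines, pack paragraphs into chunks up to a size limit, and carry a small overlap between consecutive chunks so context at boundaries is not lost. Character counts stand in for tokens here; swap in a real tokenizer for production:

```typescript
// Paragraph-packing chunker with character-based overlap.
// Caveat: a single paragraph longer than maxChars is kept whole,
// so enormous paragraphs should be pre-split.
function chunkText(text: string, maxChars = 1500, overlap = 200): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)        // blank lines separate paragraphs
    .map(p => p.trim())
    .filter(Boolean);
  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      // seed the next chunk with the tail of this one (overlap)
      current = overlap > 0 ? current.slice(-overlap) : "";
    }
    current = current ? `${current}\n\n${para}` : para;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The overlap means each chunk after the first begins with the last `overlap` characters of its predecessor, which is the 10-15% boundary cushion recommended in the optimization tips below.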

Chunking strategies compared:

Strategy       | Pros                        | Cons                | Best For
Fixed-size     | Simple, predictable         | Breaks mid-sentence | General content
Heading-based  | Preserves structure         | Uneven chunk sizes  | Documentation
Sentence-based | Natural boundaries          | May be too small    | Conversational data
Recursive      | Balanced sizes + structure  | More complex        | Production systems

Step 4: Generate Embeddings

Convert each text chunk into a vector representation using an embedding model.

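An embedding sketch against OpenAI's /v1/embeddings endpoint with text-embedding-3-small. Chunks are sent 100 per request, which is far faster than one call per chunk at the same cost:

```typescript
// Batch-embed chunks via OpenAI's embeddings API.
// The API returns one embedding object per input string, in order.
async function embedChunks(chunks: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += 100) {
    const res = await fetch("https://api.openai.com/v1/embeddings", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({
        model: "text-embedding-3-small",
        input: chunks.slice(i, i + 100), // batch of up to 100 chunks
      }),
    });
    if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
    const { data } = await res.json();
    vectors.push(...data.map((d: { embedding: number[] }) => d.embedding));
  }
  return vectors;
}
```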

Embedding model options:

Model                         | Dimensions | Cost             | Quality
OpenAI text-embedding-3-small | 1,536      | $0.02 / 1M tokens | Good
OpenAI text-embedding-3-large | 3,072      | $0.13 / 1M tokens | Better
Cohere embed-english-v3.0     | 1,024      | $0.10 / 1M tokens | Good
Local (all-MiniLM-L6-v2)      | 384        | Free             | Adequate

Step 5: Store in a Vector Database

Index the embedded chunks in a vector database for fast similarity search.

Pinecone Example

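An upsert sketch against Pinecone's data-plane REST API (the official @pinecone-database/pinecone SDK wraps the same calls; fetch keeps this dependency-free). The PINECONE_INDEX_HOST variable and the metadata field names are assumptions; the index host is shown per-index in the Pinecone console:

```typescript
// Upsert embedded chunks into a Pinecone index, 100 vectors per request.
interface IndexedChunk {
  id: string;   // stable id, e.g. "<page-url>#<chunk-number>"
  text: string;
  url: string;
}

async function upsertToPinecone(chunks: IndexedChunk[], vectors: number[][]): Promise<void> {
  const host = process.env.PINECONE_INDEX_HOST; // index-specific host
  for (let i = 0; i < chunks.length; i += 100) {
    const batch = chunks.slice(i, i + 100).map((c, j) => ({
      id: c.id,
      values: vectors[i + j],
      metadata: { text: c.text, url: c.url }, // stored for retrieval-time context
    }));
    const res = await fetch(`https://${host}/vectors/upsert`, {
      method: "POST",
      headers: {
        "Api-Key": process.env.PINECONE_API_KEY ?? "",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ vectors: batch }),
    });
    if (!res.ok) throw new Error(`Pinecone upsert failed: ${res.status}`);
  }
}
```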

Weaviate Example

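A sketch using Weaviate's REST batch endpoint (/v1/batch/objects), which accepts objects with pre-computed vectors. The DocChunk class name and its properties are assumptions; define a matching collection in your Weaviate schema first:

```typescript
// Store embedded chunks in Weaviate via its REST batch endpoint.
// WEAVIATE_URL is the instance base URL, e.g. https://my-cluster.weaviate.network
async function storeInWeaviate(
  chunks: { text: string; url: string }[],
  vectors: number[][],
): Promise<void> {
  const res = await fetch(`${process.env.WEAVIATE_URL}/v1/batch/objects`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.WEAVIATE_API_KEY}`,
    },
    body: JSON.stringify({
      objects: chunks.map((c, i) => ({
        class: "DocChunk",                       // assumed collection name
        properties: { text: c.text, url: c.url },
        vector: vectors[i],                      // bring-your-own vector
      })),
    }),
  });
  if (!res.ok) throw new Error(`Weaviate batch insert failed: ${res.status}`);
}
```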

Step 6: Query and Retrieve

Now query the vector database with a user question, retrieve relevant chunks, and pass them to an LLM as context.

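A retrieval sketch: embed the question with the same model used for the corpus, query Pinecone's REST API for the top matches, and ask an OpenAI chat model to answer from that context. The index host variable and metadata layout follow the storage examples above and are assumptions, as is the gpt-4o-mini model choice:

```typescript
// Assemble a grounded prompt from retrieved chunks (pure, testable).
function buildPrompt(question: string, contexts: string[]): string {
  return [
    "Answer the question using only the context below.",
    "If the context does not contain the answer, say so.",
    "",
    "Context:",
    ...contexts.map((c, i) => `[${i + 1}] ${c}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}

async function answerQuestion(question: string): Promise<string> {
  const openaiHeaders = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  };

  // 1. Embed the question.
  const embRes = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: openaiHeaders,
    body: JSON.stringify({ model: "text-embedding-3-small", input: question }),
  });
  const vector: number[] = (await embRes.json()).data[0].embedding;

  // 2. Retrieve the 5 most similar chunks from Pinecone.
  const qRes = await fetch(`https://${process.env.PINECONE_INDEX_HOST}/query`, {
    method: "POST",
    headers: { "Api-Key": process.env.PINECONE_API_KEY ?? "", "Content-Type": "application/json" },
    body: JSON.stringify({ vector, topK: 5, includeMetadata: true }),
  });
  const contexts: string[] = (await qRes.json()).matches.map(
    (m: { metadata: { text: string } }) => m.metadata.text,
  );

  // 3. Generate a grounded answer.
  const chatRes = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: openaiHeaders,
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: buildPrompt(question, contexts) }],
    }),
  });
  return (await chatRes.json()).choices[0].message.content;
}
```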

Putting It All Together

Here is the complete pipeline in a single orchestration function:

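An orchestration sketch that wires the stages together. The step implementations are passed in as arguments, so this function stays decoupled from any particular scraper, embedding model, or vector database (the signatures mirror the earlier sketches but are assumptions, not a fixed interface):

```typescript
// End-to-end pipeline: crawl (with extraction) -> chunk -> embed -> store.
// Returns the number of chunks indexed.
type Page = { url: string; content: string };

async function buildRagIndex(
  startUrl: string,
  steps: {
    crawl: (url: string) => Promise<Page[]>;
    chunk: (text: string) => string[];
    embed: (chunks: string[]) => Promise<number[][]>;
    store: (items: { id: string; text: string; url: string }[], vectors: number[][]) => Promise<void>;
  },
): Promise<number> {
  // Steps 1-2: crawl with extract_content: true yields clean text directly.
  const pages = await steps.crawl(startUrl);

  // Step 3: chunk every page, keeping a stable id per chunk for upserts.
  const items: { id: string; text: string; url: string }[] = [];
  for (const page of pages) {
    steps.chunk(page.content).forEach((text, i) =>
      items.push({ id: `${page.url}#${i}`, text, url: page.url }),
    );
  }

  // Steps 4-5: embed all chunks, then index them.
  const vectors = await steps.embed(items.map(it => it.text));
  await steps.store(items, vectors);
  return items.length;
}
```

Because the steps are injected, the same orchestrator can be exercised with in-memory stubs in tests and swapped to real implementations in production.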

Total CrawlForge credit cost for 200 pages: 5 credits (single crawl_deep call with extract_content: true).

Performance Optimization Tips

  • Batch embeddings -- embed 100 chunks per API call instead of one at a time (10x faster, same cost)
  • Use heading-based chunking for documentation, sentence-based for news articles
  • Set appropriate overlap -- 10-15% overlap between chunks prevents context loss at boundaries
  • Filter during crawl -- use include_patterns and exclude_patterns to avoid crawling irrelevant pages
  • Cache aggressively -- store crawled content locally so you only re-crawl when content changes
  • Monitor freshness -- use CrawlForge's change tracking to detect when source pages update, then re-crawl and re-embed only changed content

Frequently Asked Questions

How many credits does it cost to build a RAG pipeline with CrawlForge?

A single crawl_deep call costs 5 credits and can crawl up to 1,000 pages. For a 200-page documentation site, the total CrawlForge cost is 5 credits. The free tier (1,000 credits) lets you build 200 RAG pipelines before paying anything. View pricing details.

Which vector database should I use for RAG?

Pinecone is the easiest to start with (fully managed, no infrastructure). Weaviate offers more flexibility with hybrid search (vector + keyword). Qdrant is the best self-hosted option. ChromaDB works well for prototyping and local development.

How often should I re-crawl and update my RAG data?

It depends on how often your source content changes. Documentation sites: weekly. News and research: daily. Product catalogs: hourly. Use CrawlForge's change tracking to detect updates and only re-process changed pages.

Can I use CrawlForge with LangChain or LlamaIndex?

Yes. CrawlForge integrates with both frameworks. Use the SDK to fetch content, then pass it to LangChain's document loaders or LlamaIndex's data connectors. See our LangChain integration guide for examples.


Build your first RAG pipeline in under 10 minutes. Start free with 1,000 credits and crawl your first site today.

Tags

rag, ai-engineering, vector-database, pinecone, web-scraping, embeddings, tutorial

About the Author


CrawlForge Team

Engineering Team

Building the most comprehensive web scraping MCP server. We create tools that help developers extract, analyze, and transform web data for AI applications.
