Retrieval-Augmented Generation (RAG) is only as good as the data you feed it. Most RAG tutorials use static document collections -- PDFs or markdown files sitting in a folder. Production RAG systems need live web data: documentation that updates weekly, competitor pricing that changes monthly, research papers published daily.
This guide walks through building a complete RAG pipeline that uses CrawlForge to crawl and extract web content, then feeds it into a vector database for retrieval-augmented generation. Every step includes working TypeScript code.
Table of Contents
- What Is RAG and Why Use Web Data?
- RAG Pipeline Architecture
- Step 1: Crawl Target Websites
- Step 2: Extract and Clean Content
- Step 3: Chunk Text for Embedding
- Step 4: Generate Embeddings
- Step 5: Store in a Vector Database
- Step 6: Query and Retrieve
- Putting It All Together
- Performance Optimization Tips
- Frequently Asked Questions
What Is RAG and Why Use Web Data?
Retrieval-Augmented Generation is a technique where an LLM's response is grounded in relevant documents retrieved from an external knowledge base. Instead of relying solely on training data (which has a knowledge cutoff), RAG systems fetch current, relevant context before generating an answer.
Why web data makes RAG better:
- Freshness -- web content updates in real time; training data does not
- Breadth -- the web covers every topic, industry, and niche
- Specificity -- scrape exactly the pages relevant to your domain
- Authority -- pull from official documentation, research papers, and trusted sources
Common RAG use cases with web data:
- Customer support bots grounded in live documentation
- Competitive intelligence systems tracking competitor changes
- Research assistants pulling from academic databases and news
- Internal knowledge bases enriched with external industry data
RAG Pipeline Architecture
A web-data RAG pipeline has six stages:
    Crawl --> Extract --> Chunk --> Embed --> Store --> Retrieve
      |          |           |         |        |          |
     URLs    Clean text    Text     Vectors  Vector    Context
                         segments             DB       + LLM
| Stage | Tool | Purpose |
|---|---|---|
| Crawl | CrawlForge crawl_deep / batch_scrape | Discover and fetch pages |
| Extract | CrawlForge extract_content | Clean HTML into readable text |
| Chunk | Custom logic | Split text into embedding-sized segments |
| Embed | OpenAI / Cohere / local model | Convert text to vector representations |
| Store | Pinecone / Weaviate / Qdrant | Index vectors for similarity search |
| Retrieve | Vector DB query + LLM | Find relevant chunks, generate answer |
Step 1: Crawl Target Websites
First, discover and fetch all relevant pages from your target domain. CrawlForge's crawl_deep tool handles pagination, link discovery, and parallel fetching.
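As a sketch, the crawl step might look like the following. The REST host, endpoint path, and field names (max_pages, extract_content, pages) are assumptions for illustration, not the documented CrawlForge contract -- check the official API reference before shipping.

```typescript
// Hypothetical host and field names -- verify against the real CrawlForge API.
const CRAWLFORGE_API = "https://api.crawlforge.dev/v1";

interface CrawlDeepOptions {
  url: string;
  maxPages?: number;
  extractContent?: boolean;
}

interface CrawledPage {
  url: string;
  content: string; // clean text when extract_content is enabled
}

// Pure helper: build the crawl_deep request body (snake_case fields assumed).
function buildCrawlDeepBody(opts: CrawlDeepOptions) {
  return {
    url: opts.url,
    max_pages: opts.maxPages ?? 200,
    extract_content: opts.extractContent ?? true,
  };
}

async function crawlDeep(opts: CrawlDeepOptions): Promise<CrawledPage[]> {
  const res = await fetch(`${CRAWLFORGE_API}/crawl_deep`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
    },
    body: JSON.stringify(buildCrawlDeepBody(opts)),
  });
  if (!res.ok) throw new Error(`crawl_deep failed: ${res.status}`);
  const data = await res.json();
  return data.pages as CrawledPage[];
}
```

Setting extract_content during the crawl means Step 2 happens in the same pass, which is why the defaults above enable it.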
For scraping a known list of URLs (like a sitemap), use batch_scrape instead:
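A sketch of the batch path, with the same caveat: endpoint name and payload shape are assumptions. The one concrete constraint from the pricing model -- 50 URLs per batch -- drives the pure batching helper.

```typescript
// Pure helper: split a URL list into batches of up to 50 (the batch_scrape limit).
function batchUrls(urls: string[], size = 50): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push(urls.slice(i, i + size));
  }
  return batches;
}

// Hypothetical endpoint -- confirm the real path and parameters in the docs.
async function batchScrape(urls: string[]): Promise<{ url: string; content: string }[]> {
  const pages: { url: string; content: string }[] = [];
  for (const batch of batchUrls(urls)) {
    const res = await fetch("https://api.crawlforge.dev/v1/batch_scrape", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
      },
      body: JSON.stringify({ urls: batch, extract_content: true }),
    });
    const data = await res.json();
    pages.push(...data.pages); // each batch call costs 5 credits
  }
  return pages;
}
```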
Credit cost: crawl_deep costs 5 credits per invocation. batch_scrape costs 5 credits per batch (up to 50 URLs). For 200 pages, a single crawl_deep call (5 credits) is more cost-effective than the four batch_scrape batches you would need (4 x 5 = 20 credits).
Step 2: Extract and Clean Content
Raw HTML contains navigation, ads, footers, and boilerplate that will pollute your embeddings. CrawlForge's extract_content tool uses readability algorithms to isolate the main content.
If you used extract_content: true during the crawl step, your content is already clean. For individual pages:
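A single-page sketch might look like this; the endpoint path and response fields are assumptions, so the field lookup is isolated in a small helper.

```typescript
// Pure helper: pull the main text out of a response, whichever field it lands in.
function pickContent(response: { content?: string; markdown?: string }): string {
  return response.content ?? response.markdown ?? "";
}

// Hypothetical endpoint -- confirm the real path and parameters in the docs.
async function extractContent(url: string): Promise<string> {
  const res = await fetch("https://api.crawlforge.dev/v1/extract_content", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CRAWLFORGE_API_KEY}`,
    },
    body: JSON.stringify({ url }),
  });
  return pickContent(await res.json());
}
```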
Post-processing tips:
- Remove duplicate content (many sites repeat headers/footers in extracted text)
- Strip internal navigation links ("Next: Billing" / "Previous: Setup")
- Normalize whitespace and remove empty lines
- Keep headings -- they provide structure for chunking
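The four tips above can be sketched as one pure cleanup pass. The nav-link pattern and the dedupe rule here are illustrative; tune both to what your sources actually emit.

```typescript
// Post-processing sketch: normalize whitespace, drop empty lines, strip
// "Next:"/"Previous:" nav links, de-duplicate repeated boilerplate lines,
// and keep headings (they guide chunking in Step 3).
function cleanExtractedText(text: string): string {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.replace(/\s+/g, " ").trim(); // normalize whitespace
    if (line === "") continue; // remove empty lines
    if (/^(Next|Previous):/i.test(line)) continue; // strip internal nav links
    const isHeading = line.startsWith("#");
    if (!isHeading && seen.has(line)) continue; // drop repeated boilerplate
    seen.add(line);
    out.push(line);
  }
  return out.join("\n");
}
```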
Step 3: Chunk Text for Embedding
Embedding models have token limits (typically 512-8,192 tokens). Long documents must be split into smaller chunks that preserve semantic meaning.
Chunking strategies compared:
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size | Simple, predictable | Breaks mid-sentence | General content |
| Heading-based | Preserves structure | Uneven chunk sizes | Documentation |
| Sentence-based | Natural boundaries | May be too small | Conversational data |
| Recursive | Balanced sizes + structure | More complex | Production systems |
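As a starting point, here is a minimal fixed-size chunker with overlap. Sizes are in characters for clarity; production code usually counts tokens and splits on headings or sentences first, then falls back to fixed windows.

```typescript
interface Chunk {
  text: string;
  index: number;
}

// Fixed-size chunking with overlap: each chunk re-includes the tail of the
// previous one so no sentence is stranded at a boundary.
function chunkText(text: string, maxChars = 1000, overlap = 150): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  let index = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push({ text: text.slice(start, end), index: index++ });
    if (end === text.length) break;
    start = end - overlap; // 150/1000 = 15% overlap, in the recommended range
  }
  return chunks;
}
```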
Step 4: Generate Embeddings
Convert each text chunk into a vector representation using an embedding model.
Embedding model options:
| Model | Dimensions | Cost | Quality |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | $0.02/1M tokens | Good |
| OpenAI text-embedding-3-large | 3,072 | $0.13/1M tokens | Better |
| Cohere embed-english-v3.0 | 1,024 | $0.10/1M tokens | Good |
| Local (all-MiniLM-L6-v2) | 384 | Free | Adequate |
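Sketched against OpenAI's embeddings endpoint with text-embedding-3-small from the table above. The response parsing is kept in a pure helper; treat the error-handling details as a sketch rather than production code.

```typescript
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Pure helper: sort by index so output vectors stay aligned with input texts.
function toVectors(resp: EmbeddingResponse): number[][] {
  return [...resp.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}

async function embed(texts: string[]): Promise<number[][]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: texts }),
  });
  if (!res.ok) throw new Error(`embeddings request failed: ${res.status}`);
  return toVectors((await res.json()) as EmbeddingResponse);
}
```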
Step 5: Store in a Vector Database
Index the embedded chunks in a vector database for fast similarity search.
Pinecone Example
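A hedged sketch of an upsert over Pinecone's data-plane REST API. The index host comes from your Pinecone console, and the metadata fields (text, url) are our own convention, not anything Pinecone requires.

```typescript
interface DocChunk {
  id: string;
  text: string;
  url: string;
  embedding: number[];
}

// Pure helper: shape chunks into Pinecone upsert records. Storing the raw
// text as metadata lets Step 6 return readable context without a second store.
function toPineconeRecords(chunks: DocChunk[]) {
  return chunks.map((c) => ({
    id: c.id,
    values: c.embedding,
    metadata: { text: c.text, url: c.url },
  }));
}

async function upsertToPinecone(indexHost: string, chunks: DocChunk[]): Promise<void> {
  await fetch(`https://${indexHost}/vectors/upsert`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Api-Key": process.env.PINECONE_API_KEY ?? "",
    },
    body: JSON.stringify({ vectors: toPineconeRecords(chunks) }),
  });
}
```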
Weaviate Example
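The equivalent sketch for Weaviate's batch REST endpoint. The class name "DocChunk" and the host are placeholders for your own schema and deployment; check the Weaviate docs for the exact batch-object shape your version expects.

```typescript
interface ChunkRecord {
  text: string;
  url: string;
  embedding: number[];
}

// Pure helper: shape chunks into Weaviate batch objects with pre-computed vectors.
function toWeaviateObjects(chunks: ChunkRecord[]) {
  return chunks.map((c) => ({
    class: "DocChunk",
    properties: { text: c.text, url: c.url },
    vector: c.embedding, // bring-your-own-vector; omit to use Weaviate's vectorizer
  }));
}

async function upsertToWeaviate(host: string, chunks: ChunkRecord[]): Promise<void> {
  await fetch(`${host}/v1/batch/objects`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ objects: toWeaviateObjects(chunks) }),
  });
}
```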
Step 6: Query and Retrieve
Now query the vector database with a user question, retrieve relevant chunks, and pass them to an LLM as context.
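A sketch of retrieval against Pinecone's query endpoint. The question vector must come from the same embedding model used in Step 4, and the prompt template below is just one reasonable choice, not a prescribed format.

```typescript
// Query Pinecone for the topK nearest chunks (REST data-plane endpoint).
async function retrieve(indexHost: string, qVector: number[], topK = 5): Promise<string[]> {
  const res = await fetch(`https://${indexHost}/query`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Api-Key": process.env.PINECONE_API_KEY ?? "",
    },
    body: JSON.stringify({ vector: qVector, topK, includeMetadata: true }),
  });
  const data = await res.json();
  // The text lives in metadata because Step 5 stored it there.
  return data.matches.map((m: { metadata: { text: string } }) => m.metadata.text);
}

// Pure helper: assemble the grounded prompt handed to the LLM.
function buildPrompt(question: string, contexts: string[]): string {
  return [
    "Answer the question using only the context below.",
    "",
    ...contexts.map((c, i) => `[${i + 1}] ${c}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The assembled prompt then goes to whichever chat model you use for generation.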
Putting It All Together
Here is the complete pipeline in a single orchestration function:
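Sketched with the stage functions injected as dependencies so the pipeline shape stays visible; in a real script you would pass in the crawl, chunk, embed, and store implementations from the steps above.

```typescript
interface PipelineDeps {
  crawl: (startUrl: string) => Promise<{ url: string; content: string }[]>;
  chunk: (text: string) => string[];
  embed: (texts: string[]) => Promise<number[][]>;
  store: (records: { id: string; text: string; url: string; vector: number[] }[]) => Promise<void>;
}

// Stages 1-5 in order; stage 6 (retrieve) runs later, at query time.
async function buildRagIndex(startUrl: string, deps: PipelineDeps): Promise<number> {
  const pages = await deps.crawl(startUrl); // crawl + extract in one pass
  let stored = 0;
  for (const page of pages) {
    const texts = deps.chunk(page.content);  // chunk
    const vectors = await deps.embed(texts); // embed (one call per page's chunks)
    await deps.store(
      texts.map((text, i) => ({
        id: `${page.url}#${i}`, // stable id: source URL + chunk index
        text,
        url: page.url,
        vector: vectors[i],
      }))
    );
    stored += texts.length;
  }
  return stored; // total chunks indexed
}
```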
Total CrawlForge credit cost for 200 pages: 5 credits (single crawl_deep call with extract_content: true).
Performance Optimization Tips
- Batch embeddings -- embed 100 chunks per API call instead of one at a time (10x faster, same cost)
- Use heading-based chunking for documentation, sentence-based for news articles
- Set appropriate overlap -- 10-15% overlap between chunks prevents context loss at boundaries
- Filter during crawl -- use include_patterns and exclude_patterns to avoid crawling irrelevant pages
- Cache aggressively -- store crawled content locally so you only re-crawl when content changes
- Monitor freshness -- use CrawlForge's change tracking to detect when source pages update, then re-crawl and re-embed only changed content
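The first tip above can be made concrete: group chunks into batches of 100 and send each batch as a single embedding request. The batch-issuing function is injected here so the batching logic stays independent of any particular provider.

```typescript
// Generic helper: split any array into fixed-size groups.
function batchOf<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Embed 100 chunks per API call instead of one call per chunk.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of batchOf(texts, 100)) {
    vectors.push(...(await embedBatch(batch))); // one request per 100 chunks
  }
  return vectors;
}
```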
Frequently Asked Questions
How many credits does it cost to build a RAG pipeline with CrawlForge?
A single crawl_deep call costs 5 credits and can crawl up to 1,000 pages. For a 200-page documentation site, the total CrawlForge cost is 5 credits. The free tier (1,000 credits) lets you build 200 RAG pipelines before paying anything. View pricing details.
Which vector database should I use for RAG?
Pinecone is the easiest to start with (fully managed, no infrastructure). Weaviate offers more flexibility with hybrid search (vector + keyword). Qdrant is the best self-hosted option. ChromaDB works well for prototyping and local development.
How often should I re-crawl and update my RAG data?
It depends on how often your source content changes. Documentation sites: weekly. News and research: daily. Product catalogs: hourly. Use CrawlForge's change tracking to detect updates and only re-process changed pages.
Can I use CrawlForge with LangChain or LlamaIndex?
Yes. CrawlForge integrates with both frameworks. Use the SDK to fetch content, then pass it to LangChain's document loaders or LlamaIndex's data connectors. See our LangChain integration guide for examples.
Build your first RAG pipeline in under 10 minutes. Start free with 1,000 credits and crawl your first site today.