LlamaIndex is the go-to framework for production RAG, but it ships with HTML readers that crumble on JavaScript-heavy sites and Cloudflare-protected pages. Swap them for CrawlForge and your LlamaIndex pipeline handles any URL -- static HTML, SPA, or anti-bot wall.
This guide shows how to use LlamaIndex web scraping with CrawlForge as your data source -- from single-page loaders to full RAG pipelines and agent tools.
Table of Contents
- Why LlamaIndex Needs a Better Web Reader
- Prerequisites
- Step 1: Install Dependencies
- Step 2: Build a CrawlForge Reader
- Step 3: Index Live Web Pages
- Step 4: Query the Index
- Full Example: Docs RAG with Live Updates
- Advanced: CrawlForge Tools for LlamaIndex Agents
- Troubleshooting
- FAQ
Why LlamaIndex Needs a Better Web Reader
LlamaIndex's built-in SimpleWebPageReader and BeautifulSoupWebReader are fine for static blog posts but fail on:
- JavaScript-rendered content (React, Vue, Angular apps)
- Cloudflare / DataDome / Akamai-protected pages (most SaaS docs)
- Sites that return 403 to generic User-Agents
- Pages where the primary content sits in a sibling of the <main> element and is not trivially extractable
CrawlForge solves all four. Its extract_content tool uses a readability algorithm tuned for article, docs, and product pages. stealth_mode handles anti-bot protection. scrape_with_actions executes JavaScript. All 20 tools return clean text or markdown ready for chunking. For background on why this matters for RAG, see our RAG pipeline guide.
Prerequisites
- Python 3.9+ -- verify with python --version
- LlamaIndex -- pip install llama-index-core llama-index-readers-web
- CrawlForge account -- free at crawlforge.dev/signup, 1,000 credits included
- OpenAI or Anthropic API key for LlamaIndex's LLM calls (or use any supported provider)
Step 1: Install Dependencies
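Alongside the LlamaIndex packages from Prerequisites, the reader sketch in Step 2 uses the requests HTTP client:

```shell
pip install llama-index-core llama-index-readers-web requests
```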
Export your keys:
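The variable names below are assumptions; adjust them to whatever your setup actually reads:

```shell
export CRAWLFORGE_API_KEY="cf_..."   # assumed variable name for your CrawlForge key
export OPENAI_API_KEY="sk-..."       # or your Anthropic / other provider key
```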
Step 2: Build a CrawlForge Reader
LlamaIndex readers inherit from BaseReader and return Document objects. Here is a minimal reader that wraps CrawlForge's extract_content endpoint:
Cost: 2 credits per URL for extract_content, 5 credits for stealth_mode.
Step 3: Index Live Web Pages
Plug the reader into a standard LlamaIndex pipeline:
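A sketch of the ingest-and-persist flow, assuming the reader class built in Step 2; the Stripe URLs are illustrative:

```python
from llama_index.core import VectorStoreIndex

urls = [
    "https://docs.stripe.com/api/charges",
    "https://docs.stripe.com/api/payment_intents",
    "https://docs.stripe.com/api/refunds",
]

reader = CrawlForgeReader()          # built in Step 2
documents = reader.load_data(urls)   # 3 URLs x 2 credits = 6 credits

# Chunk, embed, and index in one call, then persist to disk
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./stripe_index")
```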
You now have a persisted Stripe API index built from live docs. Cost: 6 credits (3 URLs x 2).
Step 4: Query the Index
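Reload the persisted index and attach a query engine -- a sketch assuming the ./stripe_index persist directory from Step 3:

```python
from llama_index.core import StorageContext, load_index_from_storage

# Load the index persisted in Step 3 instead of re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./stripe_index")
index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("How do I create a PaymentIntent with manual capture?")
print(response)
```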
Full Example: Docs RAG with Live Updates
Put it together -- a Stripe docs RAG that refreshes nightly:
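A sketch under the same assumptions as the earlier steps (the CrawlForgeReader from Step 2, illustrative Stripe URLs); schedule the script with cron for the nightly run:

```python
import os

from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

STRIPE_URLS = [  # 5 pages -> 10 credits per refresh
    "https://docs.stripe.com/api/charges",
    "https://docs.stripe.com/api/payment_intents",
    "https://docs.stripe.com/api/refunds",
    "https://docs.stripe.com/api/customers",
    "https://docs.stripe.com/api/webhooks",
]
PERSIST_DIR = "./stripe_index"


def refresh_index() -> VectorStoreIndex:
    """Re-scrape the docs and rebuild the index; run this nightly."""
    reader = CrawlForgeReader()  # from Step 2
    documents = reader.load_data(STRIPE_URLS)
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    return index


def load_or_refresh() -> VectorStoreIndex:
    """Reuse the persisted index when present; otherwise build it."""
    if os.path.isdir(PERSIST_DIR):
        return load_index_from_storage(
            StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        )
    return refresh_index()


if __name__ == "__main__":
    index = refresh_index()
    print(index.as_query_engine().query("What statuses can a PaymentIntent have?"))
```

A crontab entry like 0 3 * * * python refresh_stripe_docs.py handles the nightly schedule.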
Nightly refresh cost: 10 credits (5 URLs x 2). Over 30 days that is 300 credits -- well inside the free tier.
Advanced: CrawlForge Tools for LlamaIndex Agents
LlamaIndex's agent system accepts arbitrary FunctionTool definitions. Wrap CrawlForge calls as tools and your agent can scrape on demand:
Then pass [scrape_tool, search_tool] to any LlamaIndex agent:
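For example, with llama-index-llms-openai installed, the classic ReActAgent wiring looks like this (newer llama-index releases may point you to the workflow-based agents instead):

```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

# scrape_tool and search_tool are the FunctionTool wrappers defined above
agent = ReActAgent.from_tools(
    [scrape_tool, search_tool],
    llm=OpenAI(model="gpt-4o-mini"),
    verbose=True,
)
response = agent.chat(
    "Find the current rate limits for the Stripe API and summarize them."
)
print(response)
```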
Credit Cost Breakdown
| Operation | Tool | Credits |
|---|---|---|
| Ingest one static page | extract_content | 2 |
| Ingest one JS-heavy page | scrape_with_actions | 5 |
| Ingest Cloudflare-protected | stealth_mode | 5 |
| Agent search + scrape (3 URLs) | search_web + 3x extract_content | 11 |
| Full deep research | deep_research | 10 |
Troubleshooting
Empty Document.text for some URLs -- The page likely requires JavaScript. Instantiate with use_stealth=True or build a reader variant that calls scrape_with_actions.
requests.exceptions.HTTPError: 429 -- You are hitting CrawlForge's rate limit. Add retry with backoff or split bulk loads across batches of 10 URLs.
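A minimal stdlib-only backoff sketch; RateLimitError here stands in for catching requests.exceptions.HTTPError with a 429 status in real code:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for a 429 response from the API."""


def with_backoff(fn: Callable[[], T], retries: int = 5, base_delay: float = 1.0) -> T:
    """Retry fn on RateLimitError, doubling the delay after each attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of retries -- surface the error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("retries must be >= 1")
```

Wrap each batch of reader calls, e.g. with_backoff(lambda: reader.load_data(batch)).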
LlamaIndex indexing is slow -- Batch your reader calls with concurrent.futures.ThreadPoolExecutor (I/O-bound, GIL not a blocker). 10x speedup on 50+ URLs is typical.
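A stdlib sketch of the fan-out; fetch is whatever per-URL scrape call you are using:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_all(urls: List[str], fetch: Callable[[str], T], max_workers: int = 10) -> List[T]:
    """Run fetch over urls concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```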
Document metadata missing -- CrawlForge's scrape_structured endpoint does not populate title the same way extract_content does. Stick with extract_content for RAG ingestion; use scrape_structured only for typed field extraction.
Embeddings cost exploding -- LlamaIndex re-embeds on every VectorStoreIndex.from_documents call. Persist with index.storage_context.persist() and load with load_index_from_storage() to avoid re-work.
Next Steps
- Read the RAG pipeline guide for end-to-end retrieval patterns
- Explore other frameworks in our LangChain integration post
- See getting started docs for the full REST API
- Compare scraping vendors at Firecrawl alternative
Start free with 1,000 credits at crawlforge.dev/signup. No credit card required.