LlamaIndex Web Scraping Guide with CrawlForge MCP
Tutorials

CrawlForge Team
Engineering Team
April 14, 2026
11 min read

LlamaIndex is the go-to framework for production RAG, but it ships with HTML readers that crumble on JavaScript-heavy sites and Cloudflare-protected pages. Swap them for CrawlForge and your LlamaIndex pipeline handles any URL -- static HTML, SPA, or anti-bot wall.


This guide shows how to use LlamaIndex web scraping with CrawlForge as your data source -- from single-page loaders to full RAG pipelines and agent tools.

Table of Contents

  • Why LlamaIndex Needs a Better Web Reader
  • Prerequisites
  • Step 1: Install Dependencies
  • Step 2: Build a CrawlForge Reader
  • Step 3: Index Live Web Pages
  • Step 4: Query the Index
  • Full Example: Docs RAG with Live Updates
  • Advanced: CrawlForge Tools for LlamaIndex Agents
  • Troubleshooting
  • FAQ

Why LlamaIndex Needs a Better Web Reader

LlamaIndex's built-in SimpleWebPageReader and BeautifulSoupWebReader are fine for static blog posts but fail on:

  • JavaScript-rendered content (React, Vue, Angular apps)
  • Cloudflare / DataDome / Akamai-protected pages (common across SaaS docs)
  • Sites that return 403 to generic User-Agents
  • Pages whose primary content lives in a sibling of <main> and is not trivially extractable

CrawlForge solves all four. Its extract_content tool uses a readability algorithm tuned for article, docs, and product pages; stealth_mode handles anti-bot protection; scrape_with_actions executes JavaScript. All 20 tools return clean text or markdown ready for chunking. For background on why this matters for RAG, see our RAG pipeline guide.

Prerequisites

  • Python 3.9+ -- python --version
  • LlamaIndex -- pip install llama-index-core llama-index-readers-web
  • CrawlForge account -- free at crawlforge.dev/signup, 1,000 credits included
  • OpenAI or Anthropic API key for LlamaIndex's LLM calls (or use any supported provider)

Step 1: Install Dependencies

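Install the LlamaIndex core packages plus requests for the CrawlForge HTTP calls. The OpenAI integration packages are one option; swap in your provider's equivalents:

```shell
pip install llama-index-core llama-index-readers-web requests
pip install llama-index-llms-openai llama-index-embeddings-openai
```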

Export your keys:

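The snippets below read both keys from the environment; the variable names are conventions used in this guide, not requirements:

```shell
export CRAWLFORGE_API_KEY="cf_your_key_here"
export OPENAI_API_KEY="sk-your_key_here"
```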

Step 2: Build a CrawlForge Reader

LlamaIndex readers inherit from BaseReader and return Document objects. Here is a minimal reader that wraps CrawlForge's extract_content endpoint:


Cost: 2 credits per URL for extract_content, 5 credits for stealth_mode.

Step 3: Index Live Web Pages

Plug the reader into a standard LlamaIndex pipeline:


You now have a persisted Stripe API index built from live docs. Cost: 6 credits (3 URLs x 2).

Step 4: Query the Index


Full Example: Docs RAG with Live Updates

Put it together -- a Stripe docs RAG that refreshes nightly:


Nightly refresh cost: 10 credits (5 URLs x 2). Over 30 days that is 300 credits -- well inside the free tier.

Advanced: CrawlForge Tools for LlamaIndex Agents

LlamaIndex's agent system accepts arbitrary FunctionTool definitions. Wrap CrawlForge calls as tools and your agent can scrape on demand:


Then pass [scrape_tool, search_tool] to any LlamaIndex agent:


Credit Cost Breakdown

| Operation | Tool | Credits |
| --- | --- | --- |
| Ingest one static page | extract_content | 2 |
| Ingest one JS-heavy page | scrape_with_actions | 5 |
| Ingest one Cloudflare-protected page | stealth_mode | 5 |
| Agent search + scrape (3 URLs) | search_web + 3x extract_content | 11 |
| Full deep research | deep_research | 10 |

Troubleshooting

Empty Document.text for some URLs -- The page likely requires JavaScript. Instantiate with use_stealth=True or build a reader variant that calls scrape_with_actions.

requests.exceptions.HTTPError: 429 -- You are hitting CrawlForge's rate limit. Add retry with backoff or split bulk loads across batches of 10 URLs.

LlamaIndex indexing is slow -- Batch your reader calls with concurrent.futures.ThreadPoolExecutor (the work is I/O-bound, so the GIL is not a bottleneck). A 10x speedup on 50+ URLs is typical.
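That batching pattern can be sketched as follows, where the reader is any object with the load_data interface from Step 2:

```python
from concurrent.futures import ThreadPoolExecutor


def load_documents_concurrently(reader, urls, max_workers=10):
    """Fan out one load_data call per URL across a thread pool.

    The calls are I/O-bound HTTP requests, so threads give a real speedup
    despite the GIL; map() preserves the input URL order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = pool.map(lambda u: reader.load_data([u]), urls)
    return [doc for batch in batches for doc in batch]
```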

Document metadata missing -- CrawlForge's scrape_structured endpoint does not populate title the same way extract_content does. Stick with extract_content for RAG ingestion; use scrape_structured only for typed field extraction.

Embeddings cost exploding -- LlamaIndex re-embeds on every VectorStoreIndex.from_documents call. Persist with index.storage_context.persist() and load with load_index_from_storage() to avoid re-work.

Next Steps

  • Read the RAG pipeline guide for end-to-end retrieval patterns
  • Explore other frameworks in our LangChain integration post
  • See getting started docs for the full REST API
  • Compare scraping vendors at Firecrawl alternative

Start free with 1,000 credits at crawlforge.dev/signup. No credit card required.

Tags

LlamaIndex, web-scraping, RAG, Python, tutorial, vector-search, AI-agents

About the Author


CrawlForge Team

Engineering Team

Building the most comprehensive web scraping MCP server. We create tools that help developers extract, analyze, and transform web data for AI applications.

Frequently Asked Questions

Why not use LlamaIndex's built-in web readers?

SimpleWebPageReader and BeautifulSoupWebReader work for static blog posts but fail on JavaScript-rendered pages, Cloudflare-protected docs, and sites that return 403 to generic clients. CrawlForge handles all three with extract_content (readability), scrape_with_actions (JS execution), and stealth_mode (anti-bot).

How much does it cost to index 100 pages with CrawlForge + LlamaIndex?

Static pages via extract_content cost 2 credits each, so 100 pages = 200 credits. Cloudflare-protected or JS-heavy pages cost 5 credits each (500 credits for 100). Both fit inside the 1,000-credit free tier for a one-time index build.

Can CrawlForge act as a LlamaIndex agent tool?

Yes. Wrap any CrawlForge API call in a LlamaIndex FunctionTool and register it with a ReActAgent or OpenAIAgent. The agent decides when to scrape a URL or run a web search based on the user query. See the agent section above for working code.

Does CrawlForge support LlamaIndex query transformations like HyDE?

CrawlForge is a data source, not a retrieval layer. Query transformations happen inside LlamaIndex after ingestion. CrawlForge returns clean markdown or structured data that feeds VectorStoreIndex -- everything downstream (HyDE, multi-step reasoning, SubQuestionQueryEngine) works unchanged.

How do I keep a LlamaIndex index fresh with live web data?

Schedule a daily cron that re-runs your CrawlForgeReader on the same URL list and rebuilds the index via VectorStoreIndex.from_documents. Because CrawlForge returns clean markdown, the documents are identical shape every time, so embeddings are stable. For incremental updates, use LlamaIndex upsert APIs with a document ID derived from the URL.

Related Articles

How to Scrape Websites with Claude Code (2026 Guide)
Tutorials | CrawlForge Team | Apr 14 | 10 min read
Scrape any website from your terminal with Claude Code and CrawlForge MCP. Fetch pages, extract data, bypass anti-bot -- in under 2 minutes.

How to Scrape Websites in Cursor IDE with CrawlForge MCP
Tutorials | CrawlForge Team | Apr 14 | 9 min read
Turn Cursor IDE into a web scraping workstation. Connect CrawlForge MCP and extract structured data from any site without leaving your editor.

How to Scrape Websites in Zed AI with CrawlForge MCP
Tutorials | CrawlForge Team | Apr 14 | 9 min read
Add web scraping to Zed AI in 3 minutes. Configure CrawlForge MCP in Zed so your editor can fetch, extract, and research live web data on demand.


© 2025-2026 CrawlForge. All rights reserved.