LlamaIndex Web Scraping Guide with CrawlForge MCP
Tutorials

CrawlForge Team
Engineering Team
April 14, 2026
11 min read

LlamaIndex is the go-to framework for production RAG, but it ships with HTML readers that crumble on JavaScript-heavy sites and Cloudflare-protected pages. Swap them for CrawlForge and your LlamaIndex pipeline handles any URL -- static HTML, SPA, or anti-bot wall.


This guide shows how to use LlamaIndex web scraping with CrawlForge as your data source -- from single-page loaders to full RAG pipelines and agent tools.

Table of Contents

  • Why LlamaIndex Needs a Better Web Reader
  • Prerequisites
  • Step 1: Install Dependencies
  • Step 2: Build a CrawlForge Reader
  • Step 3: Index Live Web Pages
  • Step 4: Query the Index
  • Full Example: Docs RAG with Live Updates
  • Advanced: CrawlForge Tools for LlamaIndex Agents
  • Troubleshooting
  • FAQ

Why LlamaIndex Needs a Better Web Reader

LlamaIndex's built-in SimpleWebPageReader and BeautifulSoupWebReader are fine for static blog posts but fail on:

  • JavaScript-rendered content (React, Vue, Angular apps)
  • Cloudflare / DataDome / Akamai-protected pages (common across SaaS docs)
  • Sites that return 403 to generic User-Agents
  • Pages whose primary content lives in a sibling of <main> and is not trivially extractable

CrawlForge solves all four. Its extract_content tool uses a readability algorithm tuned for article, docs, and product pages; stealth_mode handles anti-bot protection; scrape_with_actions executes JavaScript. All 20 tools return clean text or markdown ready for chunking. For background on why this matters for RAG, see our RAG pipeline guide.

Prerequisites

  • Python 3.9+ -- python --version
  • LlamaIndex -- pip install llama-index-core llama-index-readers-web
  • CrawlForge account -- free at crawlforge.dev/signup, 1,000 credits included
  • OpenAI or Anthropic API key for LlamaIndex's LLM calls (or use any supported provider)

Step 1: Install Dependencies

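Install the LlamaIndex core packages plus requests for the CrawlForge HTTP calls. The OpenAI integration packages are one option; swap in your provider's equivalents:

```shell
pip install llama-index-core llama-index-readers-web requests
pip install llama-index-llms-openai llama-index-embeddings-openai
```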

Export your keys:

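The snippets below read both keys from the environment; the variable names are conventions used in this guide, not requirements:

```shell
export CRAWLFORGE_API_KEY="cf_your_key_here"
export OPENAI_API_KEY="sk-your_key_here"
```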

Step 2: Build a CrawlForge Reader

LlamaIndex readers inherit from BaseReader and return Document objects. Here is a minimal reader that wraps CrawlForge's extract_content endpoint:


Cost: 2 credits per URL for extract_content, 5 credits for stealth_mode.

Step 3: Index Live Web Pages

Plug the reader into a standard LlamaIndex pipeline:


You now have a persisted Stripe API index built from live docs. Cost: 6 credits (3 URLs x 2).

Step 4: Query the Index


Full Example: Docs RAG with Live Updates

Put it together -- a Stripe docs RAG that refreshes nightly:


Nightly refresh cost: 10 credits (5 URLs x 2). Over 30 days that is 300 credits -- well inside the free tier.

Advanced: CrawlForge Tools for LlamaIndex Agents

LlamaIndex's agent system accepts arbitrary FunctionTool definitions. Wrap CrawlForge calls as tools and your agent can scrape on demand:


Then pass [scrape_tool, search_tool] to any LlamaIndex agent:


Credit Cost Breakdown

| Operation | Tool | Credits |
| --- | --- | --- |
| Ingest one static page | extract_content | 2 |
| Ingest one JS-heavy page | scrape_with_actions | 5 |
| Ingest one Cloudflare-protected page | stealth_mode | 5 |
| Agent search + scrape (3 URLs) | search_web + 3x extract_content | 11 |
| Full deep research | deep_research | 10 |

Troubleshooting

Empty Document.text for some URLs -- The page likely requires JavaScript. Instantiate with use_stealth=True or build a reader variant that calls scrape_with_actions.

requests.exceptions.HTTPError: 429 -- You are hitting CrawlForge's rate limit. Add retry with backoff or split bulk loads across batches of 10 URLs.

LlamaIndex indexing is slow -- Batch your reader calls with concurrent.futures.ThreadPoolExecutor (the work is I/O-bound, so the GIL is not a bottleneck). A 10x speedup on 50+ URLs is typical.
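That batching pattern can be sketched as follows, where the reader is any object with the load_data interface from Step 2:

```python
from concurrent.futures import ThreadPoolExecutor


def load_documents_concurrently(reader, urls, max_workers=10):
    """Fan out one load_data call per URL across a thread pool.

    The calls are I/O-bound HTTP requests, so threads give a real speedup
    despite the GIL; map() preserves the input URL order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = pool.map(lambda u: reader.load_data([u]), urls)
    return [doc for batch in batches for doc in batch]
```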

Document metadata missing -- CrawlForge's scrape_structured endpoint does not populate title the same way extract_content does. Stick with extract_content for RAG ingestion; use scrape_structured only for typed field extraction.

Embeddings cost exploding -- LlamaIndex re-embeds on every VectorStoreIndex.from_documents call. Persist with index.storage_context.persist() and load with load_index_from_storage() to avoid re-work.

Next Steps

  • Read the RAG pipeline guide for end-to-end retrieval patterns
  • Explore other frameworks in our LangChain integration post
  • See getting started docs for the full REST API
  • Compare scraping vendors at Firecrawl alternative

Start free with 1,000 credits at crawlforge.dev/signup. No credit card required.

Tags

LlamaIndex, web-scraping, RAG, Python, tutorial, vector-search, AI-agents

About the Author


CrawlForge Team

Engineering Team

Building the most comprehensive web scraping MCP server. We create tools that help developers extract, analyze, and transform web data for AI applications.

Frequently Asked Questions

Why not use LlamaIndex's built-in web readers?

SimpleWebPageReader and BeautifulSoupWebReader work for static blog posts but fail on JavaScript-rendered pages, Cloudflare-protected docs, and sites that return 403 to generic clients. CrawlForge handles all three with extract_content (readability), scrape_with_actions (JS execution), and stealth_mode (anti-bot).

How much does it cost to index 100 pages with CrawlForge + LlamaIndex?

Static pages via extract_content cost 2 credits each, so 100 pages = 200 credits. Cloudflare-protected or JS-heavy pages cost 5 credits each (500 credits for 100). Both fit inside the 1,000-credit free tier for a one-time index build.

Can CrawlForge act as a LlamaIndex agent tool?

Yes. Wrap any CrawlForge API call in a LlamaIndex FunctionTool and register it with a ReActAgent or OpenAIAgent. The agent decides when to scrape a URL or run a web search based on the user query. See the agent section above for working code.

Does CrawlForge support LlamaIndex query transformations like HyDE?

CrawlForge is a data source, not a retrieval layer. Query transformations happen inside LlamaIndex after ingestion. CrawlForge returns clean markdown or structured data that feeds VectorStoreIndex -- everything downstream (HyDE, multi-step reasoning, SubQuestionQueryEngine) works unchanged.

How do I keep a LlamaIndex index fresh with live web data?

Schedule a daily cron that re-runs your CrawlForgeReader on the same URL list and rebuilds the index via VectorStoreIndex.from_documents. Because CrawlForge returns clean markdown, the documents are identical shape every time, so embeddings are stable. For incremental updates, use LlamaIndex upsert APIs with a document ID derived from the URL.

Related Articles

How to Scrape Websites with Claude Code (2026 Guide)
Tutorials | CrawlForge Team | Apr 14 | 10 min read
Scrape any website from your terminal with Claude Code and CrawlForge MCP. Fetch pages, extract data, bypass anti-bot -- in under 2 minutes.

How to Scrape Websites in Cursor IDE with CrawlForge MCP
Tutorials | CrawlForge Team | Apr 14 | 9 min read
Turn Cursor IDE into a web scraping workstation. Connect CrawlForge MCP and extract structured data from any site without leaving your editor.

How to Scrape Websites in Zed AI with CrawlForge MCP
Tutorials | CrawlForge Team | Apr 14 | 9 min read
Add web scraping to Zed AI in 3 minutes. Configure CrawlForge MCP in Zed so your editor can fetch, extract, and research live web data on demand.


© 2025-2026 CrawlForge. All rights reserved.