LlamaIndex Integration
Integrate CrawlForge MCP with LlamaIndex to build data connectors, indexes, and query engines with web scraping capabilities. Perfect for RAG applications and knowledge bases.
Use Cases
Web Data Connectors
Create data connectors that fetch and index web content automatically
Knowledge Bases
Build searchable knowledge bases from web pages and documents
Query Engines
Create query engines with real-time web data retrieval
Document Processing
Extract and process documents from URLs for indexing
Installation
Install LlamaIndex and the CrawlForge MCP adapter.
Bash
npm install llamaindex
npm install @crawlforge/llamaindex-adapter
You'll also need a CrawlForge API key from the dashboard.
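The examples on this page read the key from `process.env.CRAWLFORGE_API_KEY` with a non-null assertion, which hides a missing variable until the first request fails. A minimal sketch of a fail-fast check follows; `requireEnv` is a hypothetical helper name and not part of the adapter.

```typescript
// Hypothetical helper (not part of the adapter): return a required
// environment variable, or throw a descriptive error if it is unset or blank.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>
): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

In application code this could be called as `requireEnv('CRAWLFORGE_API_KEY', process.env)` and the result passed as `apiKey` in place of the `!` assertion.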
Web Data Connector
Use CrawlForge as a data connector to fetch and load web documents.
Typescript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { Document } from 'llamaindex';

// Initialize the reader
const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_content' // or 'extract_text', 'fetch_url'
});

// Load a single document
const documents = await reader.loadData(['https://example.com']);
console.log(documents[0].text); // Document content
console.log(documents[0].metadata); // URL, title, credits

// Load multiple documents
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];
const allDocuments = await reader.loadData(urls);
console.log(`Loaded ${allDocuments.length} documents`);
Tip: Use extract_content for clean article extraction or extract_text for full page text.
Vector Store Index
Create a vector store index from web documents for semantic search.
Typescript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { OpenAIEmbedding, VectorStoreIndex } from 'llamaindex';

// 1. Load documents from web
const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_content'
});
const documents = await reader.loadData([
  'https://example.com/doc1',
  'https://example.com/doc2',
  'https://example.com/doc3'
]);

// 2. Create embeddings
const embedModel = new OpenAIEmbedding({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'text-embedding-3-small'
});

// 3. Build vector index
const index = await VectorStoreIndex.fromDocuments(documents, {
  embedModel
});

// 4. Query the index
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query(
  'What are the main topics covered?'
);
console.log(response.toString());
Query Engine with Tools
Create a query engine that can fetch real-time web data on demand.
Typescript
import { CrawlForgeTool } from '@crawlforge/llamaindex-adapter';
import { OpenAIAgent } from 'llamaindex';

// Create CrawlForge tools
const tools = [
  new CrawlForgeTool({
    name: 'web_search',
    description: 'Search the web for information',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'search_web'
  }),
  new CrawlForgeTool({
    name: 'fetch_content',
    description: 'Fetch and extract content from a URL',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'extract_content'
  }),
  new CrawlForgeTool({
    name: 'deep_research',
    description: 'Perform comprehensive research on a topic',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'deep_research'
  })
];

// Create agent with tools
const agent = new OpenAIAgent({
  tools,
  verbose: true
});

// Query with tool access
const response = await agent.chat(
  'Research the latest developments in quantum computing'
);
console.log(response.toString());
Agent Tips: The agent automatically chooses which tools to use based on the query. Set verbose: true to see tool selection.
Custom Web Retriever
Build a custom retriever that fetches web data based on queries.
Typescript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { BaseRetriever } from 'llamaindex';
import type { NodeWithScore } from 'llamaindex';

export class WebRetriever extends BaseRetriever {
  private reader: CrawlForgeReader;

  constructor(apiKey: string) {
    super();
    this.reader = new CrawlForgeReader({
      apiKey,
      tool: 'search_web'
    });
  }

  async retrieve(query: string): Promise<NodeWithScore[]> {
    // 1. Search for URLs
    const searchResults = await this.reader.search(query);

    // 2. Fetch content from top results
    const urls = searchResults.slice(0, 3).map(r => r.url);
    const documents = await this.reader.loadData(urls);

    // 3. Convert to nodes with scores
    return documents.map((doc, i) => ({
      node: doc,
      score: 1.0 - (i * 0.1) // Simple scoring
    }));
  }
}

// Use the custom retriever
const retriever = new WebRetriever(process.env.CRAWLFORGE_API_KEY!);
const nodes = await retriever.retrieve('latest AI news');
console.log(`Retrieved ${nodes.length} documents`);
Batch Processing with Async
Process multiple URLs efficiently with async batch operations.
Typescript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { VectorStoreIndex } from 'llamaindex';

const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'batch_scrape' // Use batch tool for efficiency
});

// Define URL batches
const urlBatches = [
  ['https://example.com/1', 'https://example.com/2'],
  ['https://example.com/3', 'https://example.com/4'],
  ['https://example.com/5', 'https://example.com/6']
];

// Process in parallel
const allDocuments = await Promise.all(
  urlBatches.map(urls => reader.loadData(urls))
);
const documents = allDocuments.flat();
console.log(`Loaded ${documents.length} documents`);

// Build index from all documents
const index = await VectorStoreIndex.fromDocuments(documents);
console.log('Index created successfully');
Performance Tip: Use batch_scrape for processing multiple URLs; it's optimized for parallel execution and costs only 1 credit per URL.
Best Practices
- Choose Efficient Tools: Use batch_scrape for multiple URLs and extract_content for clean text
- Implement Caching: Cache indexed documents to avoid redundant fetches and save credits
- Use Async Operations: Leverage async/await for parallel processing to speed up bulk operations
- Monitor Credits: Track credit usage in document metadata and set up alerts in your dashboard
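The caching advice above could be sketched as a thin wrapper around any per-URL loader. This is a minimal in-memory example under stated assumptions: `createCachedLoader`, `LoadedDoc`, and the `loadOne` callback are hypothetical names, not part of the adapter, and a production setup would likely persist the cache and expire stale entries.

```typescript
// Hypothetical shape for a loaded page; the adapter's real document type may differ.
interface LoadedDoc {
  url: string;
  text: string;
}

// Wrap a per-URL loader so each URL is fetched at most once per process.
function createCachedLoader(
  loadOne: (url: string) => Promise<LoadedDoc>
): (urls: string[]) => Promise<LoadedDoc[]> {
  // Cache the promise rather than the result, so concurrent
  // requests for the same URL share a single fetch.
  const cache = new Map<string, Promise<LoadedDoc>>();

  return async (urls) =>
    Promise.all(
      urls.map((url) => {
        let pending = cache.get(url);
        if (pending === undefined) {
          pending = loadOne(url).catch((err) => {
            // Drop failed fetches from the cache so a later retry can succeed.
            cache.delete(url);
            throw err;
          });
          cache.set(url, pending);
        }
        return pending;
      })
    );
}
```

With the reader shown earlier, `loadOne` might call `reader.loadData([url])` and map the first returned document into this shape; repeated URLs would then be served from memory instead of spending credits again.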
Ready to build with LlamaIndex?
Explore all 23 CrawlForge tools or check out other integrations.