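LlamaIndex Integration
Integrate CrawlForge MCP with LlamaIndex to build data connectors, indexes, and query engines with web scraping capabilities. It is well suited to RAG applications and knowledge bases.
Use Cases
Web Data Connectors
Create data connectors that automatically crawl and index web content
Knowledge Bases
Build searchable knowledge bases from web pages and documents
Query Engines
Create query engines with real-time web data retrieval
Document Processing
Extract and process documents from URLs for indexing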
Installation
Install LlamaIndex and the CrawlForge MCP adapter.
Bash
npm install llamaindex
npm install @crawlforge/llamaindex-adapter
You also need a CrawlForge API key from the dashboard.
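The examples on this page read the key with `process.env.CRAWLFORGE_API_KEY!`. The non-null assertion satisfies the compiler, but a missing key would only surface later, at request time. A minimal sketch of failing fast instead is shown here; `requireEnv` is a hypothetical helper written for illustration, not part of LlamaIndex or the CrawlForge adapter.

```typescript
// Hypothetical helper (not part of any library): look up a required
// setting in an environment-like record and fail fast if it is absent.
type Env = Record<string, string | undefined>;

function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Demo with a literal record; the key value is a placeholder, not a real key.
const demoEnv: Env = { CRAWLFORGE_API_KEY: 'placeholder-key' };
console.log(requireEnv(demoEnv, 'CRAWLFORGE_API_KEY') === 'placeholder-key'); // prints true
```

In a Node.js program you would pass `process.env` as the first argument, for example `requireEnv(process.env, 'CRAWLFORGE_API_KEY')`.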
Web Data Connector
Use CrawlForge as a data connector to crawl and load web documents.
TypeScript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { Document } from 'llamaindex';

// Initialize the reader
const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_content' // or 'extract_text', 'fetch_url'
});

// Load a single document
const documents = await reader.loadData(['https://example.com']);
console.log(documents[0].text); // Document content
console.log(documents[0].metadata); // URL, title, credits

// Load multiple documents
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];
const allDocuments = await reader.loadData(urls);
console.log(`Loaded ${allDocuments.length} documents`);
Tip: Use extract_content for clean article extraction, or extract_text for full-page text.
Vector Store Index
Create a vector store index from web documents for semantic search.
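As rough intuition for what semantic search over a vector index involves: each document is embedded as a vector, and a query is answered by ranking documents by their similarity to the query vector. The sketch here uses cosine similarity on tiny hand-written vectors. The vectors and the names `cosineSimilarity` and `rankBySimilarity` are illustrative assumptions; this is a conceptual toy, not a description of how LlamaIndex implements its index.

```typescript
// Conceptual sketch only: toy vectors stand in for real embeddings.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat similarity as 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Return document ids ordered from most to least similar to the query.
function rankBySimilarity(query: number[], docs: EmbeddedDoc[]): string[] {
  return docs
    .map(doc => ({ id: doc.id, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .map(entry => entry.id);
}

const toyDocs: EmbeddedDoc[] = [
  { id: 'doc1', vector: [1, 0, 0] },
  { id: 'doc2', vector: [0.6, 0.8, 0] },
  { id: 'doc3', vector: [0, 0, 1] }
];
console.log(rankBySimilarity([0.9, 0.1, 0], toyDocs)); // doc1 first, doc3 last
```

In the real pipeline the embedding model produces the vectors and the index handles storage and lookup, so none of this needs to be written by hand.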
TypeScript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { VectorStoreIndex } from 'llamaindex';
import { OpenAIEmbedding } from 'llamaindex';

// 1. Load documents from web
const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_content'
});
const documents = await reader.loadData([
  'https://example.com/doc1',
  'https://example.com/doc2',
  'https://example.com/doc3'
]);

// 2. Create embeddings
const embedModel = new OpenAIEmbedding({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'text-embedding-3-small'
});

// 3. Build vector index
const index = await VectorStoreIndex.fromDocuments(documents, {
  embedModel
});

// 4. Query the index
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query(
  'What are the main topics covered?'
);
console.log(response.toString());
Query Engine with Tools
Create a query engine that can fetch live web data on demand.
TypeScript
import { CrawlForgeTool } from '@crawlforge/llamaindex-adapter';
import { OpenAIAgent } from 'llamaindex';

// Create CrawlForge tools
const tools = [
  new CrawlForgeTool({
    name: 'web_search',
    description: 'Search the web for information',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'search_web'
  }),
  new CrawlForgeTool({
    name: 'fetch_content',
    description: 'Fetch and extract content from a URL',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'extract_content'
  }),
  new CrawlForgeTool({
    name: 'deep_research',
    description: 'Perform comprehensive research on a topic',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'deep_research'
  })
];

// Create agent with tools
const agent = new OpenAIAgent({
  tools,
  verbose: true
});

// Query with tool access
const response = await agent.chat(
  'Research the latest developments in quantum computing'
);
console.log(response.toString());
Agent tip: The agent automatically chooses which tool to use based on the query. Set verbose: true to see the tool selection process.
Custom Web Retriever
Build a custom retriever that fetches web data based on the query.
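The retriever example that follows scores results purely by position, using `1.0 - (i * 0.1)`. That is fine for its three results, but the expression goes negative once the index passes 10. A minimal sketch of a clamped variant is shown here, in case you raise the result count; `rankScore` is a hypothetical helper, not part of LlamaIndex or the CrawlForge adapter.

```typescript
// Hypothetical helper: position-based score that never drops below zero.
// Index 0 is the top search result; each later rank loses `step`.
function rankScore(index: number, step: number = 0.1): number {
  if (!Number.isInteger(index) || index < 0) {
    throw new Error('index must be a non-negative integer');
  }
  return Math.max(0, 1 - index * step);
}

// Scores for the first three ranks plus one far down the list.
// Note the arrow function: passing rankScore directly to map would
// feed the array index in as `step`.
console.log([0, 1, 2, 25].map(i => rankScore(i)));
```

Position-based scores only preserve the search engine's ordering. They say nothing about how relevant a page actually is to the query.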
TypeScript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { BaseRetriever } from 'llamaindex';
import type { NodeWithScore } from 'llamaindex';

export class WebRetriever extends BaseRetriever {
  private reader: CrawlForgeReader;

  constructor(apiKey: string) {
    super();
    this.reader = new CrawlForgeReader({
      apiKey,
      tool: 'search_web'
    });
  }

  async retrieve(query: string): Promise<NodeWithScore[]> {
    // 1. Search for URLs
    const searchResults = await this.reader.search(query);
    // 2. Fetch content from top results
    const urls = searchResults.slice(0, 3).map(r => r.url);
    const documents = await this.reader.loadData(urls);
    // 3. Convert to nodes with scores
    return documents.map((doc, i) => ({
      node: doc,
      score: 1.0 - (i * 0.1) // Simple scoring
    }));
  }
}

// Use the custom retriever
const retriever = new WebRetriever(process.env.CRAWLFORGE_API_KEY!);
const nodes = await retriever.retrieve('latest AI news');
console.log(`Retrieved ${nodes.length} documents`);
Async Batch Processing
Use async batch operations to process multiple URLs efficiently.
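The example that follows writes its URL batches by hand. When the URLs arrive as one flat list, a small helper can split them into batches first. This is a minimal sketch: `chunk` and `loadInBatches` are hypothetical helpers, and the `load` parameter stands in for whatever fetches one batch, such as `urls => reader.loadData(urls)`.

```typescript
// Hypothetical helpers, not part of LlamaIndex or the CrawlForge adapter.

// Split a flat list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `load` on every batch in parallel and flatten the results,
// preserving the original URL order.
async function loadInBatches<D>(
  urls: string[],
  size: number,
  load: (batch: string[]) => Promise<D[]>
): Promise<D[]> {
  const results = await Promise.all(chunk(urls, size).map(batch => load(batch)));
  return ([] as D[]).concat(...results);
}
```

Because `Promise.all` rejects as soon as any batch fails, one bad batch discards the rest. If partial results are acceptable, `Promise.allSettled` is the usual alternative.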
TypeScript
import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { VectorStoreIndex } from 'llamaindex';

const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'batch_scrape' // Use batch tool for efficiency
});

// Define URL batches
const urlBatches = [
  ['https://example.com/1', 'https://example.com/2'],
  ['https://example.com/3', 'https://example.com/4'],
  ['https://example.com/5', 'https://example.com/6']
];

// Process in parallel
const allDocuments = await Promise.all(
  urlBatches.map(urls => reader.loadData(urls))
);
const documents = allDocuments.flat();
console.log(`Loaded ${documents.length} documents`);

// Build index from all documents
const index = await VectorStoreIndex.fromDocuments(documents);
console.log('Index created successfully');
Performance tip: Use batch_scrape when processing multiple URLs. It is optimized for parallel execution and costs only 1 credit per URL.
Best Practices
- Choose efficient tools: use batch_scrape for multiple URLs and extract_content for clean text
- Implement caching: cache indexed documents to avoid repeat crawls and save credits
- Use async operations: use async/await for parallel processing to speed up batch operations
- Monitor credits: track credit usage in document metadata and set up alerts in your dashboard
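The caching practice can be sketched as a thin in-memory wrapper around any batch loader, so that URLs already fetched are served from a `Map` instead of being crawled again. `createCachedLoader` is a hypothetical helper, not part of LlamaIndex or the CrawlForge adapter. It also assumes the loader returns exactly one result per URL, in input order, which this page does not confirm for `reader.loadData`.

```typescript
// Hypothetical wrapper, not part of LlamaIndex or the CrawlForge adapter.
// ASSUMPTION: `load` returns exactly one result per URL, in input order.
type BatchLoader<D> = (urls: string[]) => Promise<D[]>;

function createCachedLoader<D>(load: BatchLoader<D>): BatchLoader<D> {
  const cache = new Map<string, D>();

  return async (urls: string[]): Promise<D[]> => {
    // Only request URLs not seen before, and request each one once.
    const missing = Array.from(new Set(urls.filter(url => !cache.has(url))));
    if (missing.length > 0) {
      const loaded = await load(missing);
      if (loaded.length !== missing.length) {
        throw new Error('Loader must return one result per URL');
      }
      missing.forEach((url, i) => cache.set(url, loaded[i]));
    }
    // Every requested URL is now cached; return results in request order.
    return urls.map(url => cache.get(url) as D);
  };
}
```

A possible use would be `const cachedLoad = createCachedLoader(urls => reader.loadData(urls));`. The cache lives only in process memory and never expires, so a long-running service would likely want a persistent store and some invalidation policy instead.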
Ready to build with LlamaIndex?
Explore all 23 CrawlForge tools, or view other integrations.