
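LlamaIndex Integration

Integrate CrawlForge MCP with LlamaIndex to build data connectors, indexes, and query engines with web scraping capabilities. It is well suited to RAG applications and knowledge bases.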

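Use Cases

Web Data Connector

Create data connectors that automatically scrape and index web content.

Knowledge Base

Build searchable knowledge bases from web pages and documents.

Query Engine

Create query engines that retrieve web data in real time.

Document Processing

Extract and process documents from URLs for indexing.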

Installation

Install LlamaIndex and the CrawlForge MCP adapter.

Bash

npm install llamaindex
npm install @crawlforge/llamaindex-adapter

You will also need a CrawlForge API key from the dashboard.

Web Data Connector

Use CrawlForge as a data connector to scrape and load web documents.

Typescript

import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { Document } from 'llamaindex';

// Initialize the reader
const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_content' // or 'extract_text', 'fetch_url'
});

// Load a single document
const documents = await reader.loadData(['https://example.com']);

console.log(documents[0].text);      // Document content
console.log(documents[0].metadata);  // URL, title, credits

// Load multiple documents
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

const allDocuments = await reader.loadData(urls);
console.log(`Loaded ${allDocuments.length} documents`);

Tip: Use extract_content for clean article extraction, or extract_text for the full text of a page.

Vector Store Index

Create a vector store index from web documents for semantic search.

Typescript

import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { OpenAIEmbedding, VectorStoreIndex } from 'llamaindex';

// 1. Load documents from web
const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_content'
});

const documents = await reader.loadData([
  'https://example.com/doc1',
  'https://example.com/doc2',
  'https://example.com/doc3'
]);

// 2. Create embeddings
const embedModel = new OpenAIEmbedding({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'text-embedding-3-small'
});

// 3. Build vector index
const index = await VectorStoreIndex.fromDocuments(documents, {
  embedModel
});

// 4. Query the index
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query(
  'What are the main topics covered?'
);

console.log(response.toString());
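
For intuition about what semantic search means here, the sketch below ranks stored vectors by cosine similarity to a query vector. It is a conceptual illustration only, not how LlamaIndex or CrawlForge implement retrieval. The names EmbeddedChunk, cosineSimilarity, and topK are hypothetical, and the three-dimensional vectors are hand-made stand-ins for real embeddings.

```typescript
// Conceptual sketch: hand-made vectors stand in for real embeddings.
interface EmbeddedChunk {
  id: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query, best first.
function topK(
  query: number[],
  chunks: EmbeddedChunk[],
  k: number
): { id: string; score: number }[] {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const chunks: EmbeddedChunk[] = [
  { id: "doc1", vector: [1, 0, 0] },
  { id: "doc2", vector: [0, 1, 0] },
  { id: "doc3", vector: [0.9, 0.1, 0] },
];

const best = topK([1, 0, 0], chunks, 2);
console.log(best.map((match) => match.id)); // the two closest chunk ids
```

In a real index the vectors come from the embedding model and the ranking is handled by the vector store, so this function is only a mental model of the query step.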

Query Engine with Tools

Create a query engine that can fetch live web data on demand.

Typescript

import { CrawlForgeTool } from '@crawlforge/llamaindex-adapter';
import { OpenAIAgent } from 'llamaindex';

// Create CrawlForge tools
const tools = [
  new CrawlForgeTool({
    name: 'web_search',
    description: 'Search the web for information',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'search_web'
  }),
  new CrawlForgeTool({
    name: 'fetch_content',
    description: 'Fetch and extract content from a URL',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'extract_content'
  }),
  new CrawlForgeTool({
    name: 'deep_research',
    description: 'Perform comprehensive research on a topic',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'deep_research'
  })
];

// Create agent with tools
const agent = new OpenAIAgent({
  tools,
  verbose: true
});

// Query with tool access
const response = await agent.chat(
  'Research the latest developments in quantum computing'
);

console.log(response.toString());

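Agent tip: The agent automatically chooses which tool to use based on the query. Set verbose: true to see the tool selection process.

Custom Web Retriever

Build a custom retriever that fetches web data based on the query.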

Typescript

import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { BaseRetriever } from 'llamaindex';
import type { NodeWithScore } from 'llamaindex';

export class WebRetriever extends BaseRetriever {
  private reader: CrawlForgeReader;

  constructor(apiKey: string) {
    super();
    this.reader = new CrawlForgeReader({
      apiKey,
      tool: 'search_web'
    });
  }

  async retrieve(query: string): Promise<NodeWithScore[]> {
    // 1. Search for URLs
    const searchResults = await this.reader.search(query);

    // 2. Fetch content from top results
    const urls = searchResults.slice(0, 3).map(r => r.url);
    const documents = await this.reader.loadData(urls);

    // 3. Convert to nodes with scores
    return documents.map((doc, i) => ({
      node: doc,
      score: 1.0 - (i * 0.1) // Simple scoring
    }));
  }
}

// Use the custom retriever
const retriever = new WebRetriever(process.env.CRAWLFORGE_API_KEY!);
const nodes = await retriever.retrieve('latest AI news');

console.log(`Retrieved ${nodes.length} documents`);
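
The simple scoring in the retriever above drops by 0.1 per rank, so it reaches zero at the eleventh result and goes negative after that. With only three results this does not matter, but if the limit is raised a clamped or reciprocal score may be safer. The sketch below shows both as standalone functions; linearRankScore and reciprocalRankScore are hypothetical helper names, and the reciprocal variant is a swapped-in alternative rather than anything the adapter provides.

```typescript
// Linear decay by rank (0-based), as in the retriever example, clamped at zero.
function linearRankScore(rank: number, step: number = 0.1): number {
  return Math.max(0, 1 - rank * step);
}

// Alternative: reciprocal rank. Always positive, and decays quickly at first.
function reciprocalRankScore(rank: number): number {
  return 1 / (rank + 1);
}

const ranks = [0, 1, 2, 12];
for (const rank of ranks) {
  console.log(rank, linearRankScore(rank), reciprocalRankScore(rank));
}
```

Either function can replace the inline expression in the retriever's map callback; both are position-based heuristics and say nothing about how relevant a page actually is.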

Async Batch Processing

Process multiple URLs efficiently with asynchronous batch operations.

Typescript

import { CrawlForgeReader } from '@crawlforge/llamaindex-adapter';
import { VectorStoreIndex } from 'llamaindex';

const reader = new CrawlForgeReader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'batch_scrape' // Use batch tool for efficiency
});

// Define URL batches
const urlBatches = [
  ['https://example.com/1', 'https://example.com/2'],
  ['https://example.com/3', 'https://example.com/4'],
  ['https://example.com/5', 'https://example.com/6']
];

// Process in parallel
const allDocuments = await Promise.all(
  urlBatches.map(urls => reader.loadData(urls))
);

const documents = allDocuments.flat();

console.log(`Loaded ${documents.length} documents`);

// Build index from all documents
const index = await VectorStoreIndex.fromDocuments(documents);

console.log('Index created successfully');

Performance tip: Use batch_scrape when processing multiple URLs. It is optimized for parallel execution and costs only 1 credit per URL.
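
The example above hard-codes its batches and uses Promise.all, which rejects as soon as any one batch fails. A minimal sketch of splitting a flat URL list into batches and collecting failures instead is shown below. The helpers chunk and loadInBatches are hypothetical, and the load parameter stands in for a call such as reader.loadData; nothing here is part of the adapter.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Stand-in for a loader such as reader.loadData (assumption, not the adapter's type).
type Loader<D> = (urls: string[]) => Promise<D[]>;

// Run all batches in parallel; a failed batch is recorded instead of aborting the run.
async function loadInBatches<D>(
  urls: string[],
  batchSize: number,
  load: Loader<D>
): Promise<{ documents: D[]; failedUrls: string[] }> {
  const batches = chunk(urls, batchSize);
  const results = await Promise.all(
    batches.map(async (batch) => {
      try {
        return { batch, docs: await load(batch), ok: true };
      } catch {
        return { batch, docs: [] as D[], ok: false };
      }
    })
  );

  const documents: D[] = [];
  const failedUrls: string[] = [];
  for (const result of results) {
    if (result.ok) {
      documents.push(...result.docs);
    } else {
      failedUrls.push(...result.batch);
    }
  }
  return { documents, failedUrls };
}
```

A call would look like loadInBatches(urls, 2, (batch) => reader.loadData(batch)), after which failedUrls can be retried or logged. All batches still start at once, so a very long URL list may need an additional concurrency limit.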

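Best Practices

Choose efficient tools: use batch_scrape for multiple URLs and extract_content for clean text
Implement caching: cache indexed documents to avoid repeat scraping and save credits
Use async operations: use async/await for parallel processing to speed up batch operations
Monitor credits: track credit usage in document metadata and set up alerts in your dashboard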
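
Two of the practices above, caching and credit tracking, can be sketched together as follows. This is a minimal in-memory sketch under stated assumptions: withCache and totalCredits are hypothetical helpers, the wrapped loader is assumed to return one document per URL in request order, and the credits metadata field is assumed to be a number. The adapter may behave differently on each point.

```typescript
// Assumed document shape: text plus free-form metadata (url, title, credits).
interface CachedDocument {
  text: string;
  metadata: Record<string, unknown>;
}

// Stand-in for a loader such as reader.loadData (assumption).
type Loader = (urls: string[]) => Promise<CachedDocument[]>;

// Wrap a loader so each URL is fetched at most once per process.
function withCache(load: Loader): Loader {
  const cache = new Map<string, CachedDocument>();
  return async (urls) => {
    // Only request URLs that are not cached yet, without duplicates.
    const missing = Array.from(new Set(urls.filter((url) => !cache.has(url))));
    if (missing.length > 0) {
      const fetched = await load(missing);
      // Assumes one document per URL, returned in request order.
      fetched.forEach((doc, i) => {
        if (i < missing.length) cache.set(missing[i], doc);
      });
    }
    return urls
      .map((url) => cache.get(url))
      .filter((doc): doc is CachedDocument => doc !== undefined);
  };
}

// Sum the numeric `credits` field, ignoring documents that do not have one.
function totalCredits(documents: CachedDocument[]): number {
  return documents.reduce((sum, doc) => {
    const credits = doc.metadata["credits"];
    return typeof credits === "number" && Number.isFinite(credits) ? sum + credits : sum;
  }, 0);
}
```

Usage would be along the lines of const load = withCache((urls) => reader.loadData(urls)). The cache lives only in memory and never expires, so a long-running service would likely want a persistent store and a time-to-live instead.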

Ready to build with LlamaIndex?

Explore all 27 CrawlForge tools, or see the other integrations.

