
LangChain Integration

Integrate CrawlForge MCP with LangChain to build AI agents that can scrape the web. Use it as a document loader, as an agent tool, or inside a custom retrieval chain.

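Use Cases

Document loader
Load web pages as documents for vector stores and RAG applications
AI agents
Give agents web scraping tools so they can fetch real-time data
Retrieval chains
Build custom chains that fetch and process web content
Research pipelines
Create automated research workflows with the deep_research tool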

Installation

Install LangChain and the CrawlForge MCP adapter.

Bash
npm install langchain @langchain/core @langchain/community @langchain/openai
npm install @crawlforge/langchain-adapter
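You also need a CrawlForge API key from the dashboard. The examples below read it from the CRAWLFORGE_API_KEY environment variable.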

Document Loader

Use CrawlForge as a document loader to scrape web pages for RAG applications.

Typescript
import { CrawlForgeLoader } from '@crawlforge/langchain-adapter';

// Initialize the loader
const loader = new CrawlForgeLoader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_text', // or 'fetch_url', 'extract_content'
});

// Load a single document
const docs = await loader.load('https://example.com');

console.log(docs[0].pageContent); // Clean text content
console.log(docs[0].metadata);    // URL, title, credits used

// Load multiple documents
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

const allDocs = await loader.loadMany(urls);
console.log(`Loaded ${allDocs.length} documents`);
Best practice: use extract_text for clean text content and extract_content for article extraction.
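This guide does not say whether loadMany fails as a whole when a single URL cannot be fetched. The sketch below shows one way to isolate failures per URL. It assumes only that the loader exposes load(url) as in the example above; the PageLoader and LoadedDoc interfaces and the loadEach helper are hypothetical names introduced for illustration, not part of the adapter.

```typescript
// Minimal document shape, mirroring the fields used in the example above.
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical stand-in for the loader: anything that exposes load(url).
interface PageLoader {
  load(url: string): Promise<LoadedDoc[]>;
}

interface BatchResult {
  docs: LoadedDoc[];
  failures: { url: string; reason: string }[];
}

type Outcome =
  | { ok: true; url: string; docs: LoadedDoc[] }
  | { ok: false; url: string; reason: string };

// Load every URL independently so that one failure does not abort the batch.
async function loadEach(loader: PageLoader, urls: string[]): Promise<BatchResult> {
  // Drop duplicate URLs so the same page is not fetched (and billed) twice.
  const unique = Array.from(new Set(urls));

  const outcomes = await Promise.all(
    unique.map(async (url): Promise<Outcome> => {
      try {
        return { ok: true, url, docs: await loader.load(url) };
      } catch (err) {
        return { ok: false, url, reason: String(err) };
      }
    })
  );

  const result: BatchResult = { docs: [], failures: [] };
  for (const outcome of outcomes) {
    if (outcome.ok) {
      result.docs.push(...outcome.docs);
    } else {
      result.failures.push({ url: outcome.url, reason: outcome.reason });
    }
  }
  return result;
}
```

Calling loadEach(loader, urls) returns the documents that loaded successfully together with a list of failed URLs, which can be logged or retried later.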

RAG Pipeline with a Vector Store

Build a complete RAG pipeline with the CrawlForge document loader and a vector store.

Typescript
import { CrawlForgeLoader } from '@crawlforge/langchain-adapter';
import { OpenAIEmbeddings } from '@langchain/openai';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { RetrievalQAChain } from 'langchain/chains';
import { ChatOpenAI } from '@langchain/openai';

// 1. Load documents from web pages
const loader = new CrawlForgeLoader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'extract_content'
});

const docs = await loader.loadMany([
  'https://example.com/doc1',
  'https://example.com/doc2',
  'https://example.com/doc3'
]);

// 2. Create embeddings and vector store
const embeddings = new OpenAIEmbeddings();
const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  embeddings
);

// 3. Create retrieval chain
const model = new ChatOpenAI({ modelName: 'gpt-4' });
const chain = RetrievalQAChain.fromLLM(
  model,
  vectorStore.asRetriever()
);

// 4. Query the knowledge base
const response = await chain.call({
  query: 'What are the key points from these documents?'
});

console.log(response.text);
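The pipeline above embeds each page as a single document. Long pages usually retrieve better when they are split into overlapping chunks before embedding. LangChain ships text splitters for this; the dependency-free sketch below only illustrates the idea. The helper names chunkText and chunkDocs, and the default sizes, are assumptions chosen for illustration.

```typescript
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Split text into fixed-size windows that overlap by `overlap` characters.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

// Chunk each document, copying its metadata and recording the chunk index.
function chunkDocs(docs: LoadedDoc[], chunkSize = 1000, overlap = 200): LoadedDoc[] {
  const out: LoadedDoc[] = [];
  for (const doc of docs) {
    chunkText(doc.pageContent, chunkSize, overlap).forEach((text, index) => {
      out.push({ pageContent: text, metadata: { ...doc.metadata, chunk: index } });
    });
  }
  return out;
}
```

The chunked documents can then be passed to the vector store in place of the original docs array.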

Agent Tools

Use CrawlForge tools to give LangChain agents web scraping capabilities.

Typescript
import { CrawlForgeTool } from '@crawlforge/langchain-adapter';
import { initializeAgentExecutorWithOptions } from 'langchain/agents';
import { ChatOpenAI } from '@langchain/openai';

// Create CrawlForge tools
const tools = [
  new CrawlForgeTool({
    name: 'web_search',
    description: 'Search the web for information',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'search_web'
  }),
  new CrawlForgeTool({
    name: 'fetch_page',
    description: 'Fetch and extract content from a URL',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'extract_content'
  }),
  new CrawlForgeTool({
    name: 'deep_research',
    description: 'Perform comprehensive research on a topic',
    apiKey: process.env.CRAWLFORGE_API_KEY!,
    tool: 'deep_research'
  })
];

// Initialize agent
const model = new ChatOpenAI({ modelName: 'gpt-4', temperature: 0 });
const executor = await initializeAgentExecutorWithOptions(
  tools,
  model,
  {
    agentType: 'openai-functions',
    verbose: true
  }
);

// Run agent
const result = await executor.call({
  input: 'Research the latest developments in quantum computing'
});

console.log(result.output);
Agent tip: use descriptive tool names and descriptions to help the LLM choose the right tool. Set verbose: true to see the agent's reasoning.
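An agent decides on its own how often to call each tool, so a single run can consume more credits than expected. One possible guard is a per-run credit budget, sketched below. The two costs come from the best practices in this guide (extract_text at 1 credit, deep_research at 10); costs for other tools are not stated here and would have to be added. CreditBudget and withBudget are hypothetical helpers, not part of the adapter.

```typescript
// Per-call credit costs. Only these two figures appear in this guide.
const CREDIT_COST: Record<string, number> = {
  extract_text: 1,
  deep_research: 10,
};

class CreditBudget {
  private spent = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Reserve credits for one call, or throw if the budget would be exceeded.
  charge(tool: string): void {
    const cost = CREDIT_COST[tool];
    if (cost === undefined) {
      throw new Error(`No known credit cost for tool "${tool}"`);
    }
    if (this.spent + cost > this.limit) {
      throw new Error(`Credit budget exceeded: ${this.spent} + ${cost} > ${this.limit}`);
    }
    this.spent += cost;
  }

  remaining(): number {
    return this.limit - this.spent;
  }
}

// Wrap any async tool function so that every call is charged before it runs.
function withBudget<A, R>(
  tool: string,
  budget: CreditBudget,
  fn: (arg: A) => Promise<R>
): (arg: A) => Promise<R> {
  return async (arg: A) => {
    budget.charge(tool);
    return fn(arg);
  };
}
```

Charging before the call means a rejected call never reaches the API, at the price of also counting calls that later fail.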

Custom Retrieval Chain

Build a custom chain that searches, scrapes, and summarizes web content.

Typescript
import { CrawlForgeLoader } from '@crawlforge/langchain-adapter';
import { PromptTemplate } from '@langchain/core/prompts';
import { RunnableSequence } from '@langchain/core/runnables';
import { ChatOpenAI } from '@langchain/openai';
import { StringOutputParser } from '@langchain/core/output_parsers';

// Initialize CrawlForge loader
const loader = new CrawlForgeLoader({
  apiKey: process.env.CRAWLFORGE_API_KEY!,
  tool: 'deep_research'
});

// Create custom chain
const prompt = PromptTemplate.fromTemplate(
  `Based on the following research, answer the question:\n\n{context}\n\nQuestion: {question}\n\nAnswer:`
);

const model = new ChatOpenAI({ modelName: 'gpt-4' });

const chain = RunnableSequence.from([
  {
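    // Run deep_research on the question and use the first result as context
    context: async (input: { question: string }) => {
      const docs = await loader.load(input.question);
      return docs[0].pageContent;
    },
    // Pass the question through unchanged
    question: (input: { question: string }) => input.question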
  },
  prompt,
  model,
  new StringOutputParser()
]);

// Run the chain
const result = await chain.invoke({
  question: 'What are the latest AI safety research findings?'
});

console.log(result);
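The chain above reads docs[0].pageContent directly, which throws if the loader returns no documents and ignores any further documents. A more defensive context builder might look like the sketch below. The buildContext name, the separator, and the 8000-character default are illustrative assumptions; a character cap is only a rough proxy for the model's token limit.

```typescript
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Join all non-empty documents into one context string, capped at maxChars.
function buildContext(docs: LoadedDoc[], maxChars = 8000): string {
  const parts: string[] = [];
  let used = 0;

  for (const doc of docs) {
    const text = doc.pageContent.trim();
    if (text.length === 0) continue; // skip empty results

    const separator = parts.length > 0 ? '\n\n---\n\n' : '';
    const room = maxChars - used - separator.length;
    if (room <= 0) break; // cap reached

    const piece = text.slice(0, room);
    parts.push(separator + piece);
    used += separator.length + piece.length;
  }

  if (parts.length === 0) {
    throw new Error('No usable content was returned for this question');
  }
  return parts.join('');
}
```

In the chain, the context step would then return buildContext(docs) instead of docs[0].pageContent.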

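Best Practices

  • Choose the right tool: use extract_text (1 credit) for simple content and deep_research (10 credits) for comprehensive analysis
  • Implement caching: cache scraped documents to avoid repeated API calls and save credits
  • Handle rate limits: implement retry logic with exponential backoff in production applications
  • Monitor credit usage: check the credit usage reported in document metadata and set up alerts in your dashboard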
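The caching and rate-limit advice above can be sketched without any dependencies, as shown below. The withRetry, backoffDelay, and cached helpers and their option names are hypothetical. The sketch retries every error for simplicity; a production version would more likely retry only rate-limit or transient failures, and how CrawlForge signals a rate limit is not described in this guide.

```typescript
interface RetryOptions {
  retries: number;     // extra attempts after the first call
  baseDelayMs: number; // upper bound of the delay before the first retry
  maxDelayMs: number;  // upper bound for any single delay
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay for a 0-based attempt: random value below min(max, base * 2^attempt).
function backoffDelay(
  attempt: number,
  opts: RetryOptions,
  random: () => number = Math.random
): number {
  const capped = Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
  return Math.floor(random() * capped); // jitter spreads out concurrent retries
}

// Call fn, retrying with exponential backoff until it succeeds or retries run out.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < opts.retries) await sleep(backoffDelay(attempt, opts));
    }
  }
  throw lastError;
}

// Memoize an async function by key (for example a URL) for the process lifetime.
function cached<T>(fn: (key: string) => Promise<T>): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (key: string) => {
    let hit = cache.get(key);
    if (!hit) {
      hit = fn(key).catch((err) => {
        cache.delete(key); // do not cache failures
        throw err;
      });
      cache.set(key, hit);
    }
    return hit;
  };
}
```

The two helpers compose: cached((url) => withRetry(() => loader.load(url), { retries: 3, baseDelayMs: 500, maxDelayMs: 8000 })) gives a loader function that retries transient failures and spends credits at most once per URL while the process is running.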
Ready to build with LangChain?
Explore all 23 CrawlForge tools, or see the other integrations.