The Model Context Protocol (MCP) is how AI assistants like Claude call external tools. If you want your assistant to scrape the web, one option is to build your own MCP server. This guide walks through a minimal, working web-scraping MCP server in TypeScript -- and is honest about where the protocol ends and the hard part of scraping begins.
By the end you will have a server Claude can call, plus a clear sense of when building your own is the right move versus pointing your agent at a managed scraping server.
Table of Contents
- What an MCP Server Actually Is
- Prerequisites
- Step 1: Project Setup
- Step 2: A Minimal MCP Server
- Step 3: Add a Real Scraping Tool
- Step 4: Test With the MCP Inspector
- Step 5: Connect It to Claude Desktop
- The Hard Part: Production Scraping
- Build vs. Buy
- Gotchas
What an MCP Server Actually Is
An MCP server exposes three kinds of capability to an AI client:
- Tools -- model-invoked functions that do work or cause side effects. A scrape_page tool is one.
- Resources -- read-only data exposed by URI for context, no heavy computation.
- Prompts -- reusable, user-triggered message templates.
For scraping, you want a tool. The client (Claude Desktop, Claude Code, Cursor) shows your tool to the model; when a prompt needs live data, the model emits a structured tool call, your server runs it, and the result flows back. For deeper protocol background, see MCP vs REST.
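The round trip described above can be sketched without the SDK. The field names follow the protocol's tools/call method, but the dispatch function and the registry are hypothetical stand-ins for what the SDK does internally, and the stub tool never touches the network:

```typescript
// Illustrative shapes only: `dispatch` and `registry` are hypothetical,
// not SDK APIs. They mimic how a server routes a tools/call request.
type ToolCallRequest = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};

type ToolResult = {
  content: { type: "text"; text: string }[];
  isError?: boolean;
};

type ToolCallResponse = { jsonrpc: "2.0"; id: number; result: ToolResult };

type ToolHandler = (args: Record<string, unknown>) => Promise<ToolResult>;

// One fake tool that echoes the requested URL instead of fetching it.
const registry = new Map<string, ToolHandler>([
  [
    "scrape_page",
    async (args) => ({
      content: [{ type: "text", text: `<html>stub for ${String(args.url)}</html>` }],
    }),
  ],
]);

// Look up the tool by name, run it, and wrap the result under the same id.
// Simplification: a real server may answer an unknown tool with a
// protocol-level JSON-RPC error rather than an isError result.
async function dispatch(req: ToolCallRequest): Promise<ToolCallResponse> {
  const handler = registry.get(req.params.name);
  const result: ToolResult = handler
    ? await handler(req.params.arguments)
    : { content: [{ type: "text", text: `Unknown tool: ${req.params.name}` }], isError: true };
  return { jsonrpc: "2.0", id: req.id, result };
}

// What the client sends when the model decides to call the tool.
const request: ToolCallRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "scrape_page", arguments: { url: "https://example.com" } },
};
```

Calling dispatch(request) resolves to a response carrying the same id and a content array, which is the part the model actually reads.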
Prerequisites
- Node.js 18+ (check with node --version)
- Basic TypeScript familiarity
- The official SDK and a schema library, plus the TypeScript compiler that the build script in Step 1 calls:
npm install @modelcontextprotocol/sdk zod
npm install -D typescript
This tutorial targets @modelcontextprotocol/sdk v1.x (currently 1.29.x), which the maintainers recommend for production. A v2 line that splits into @modelcontextprotocol/server and @modelcontextprotocol/client is in pre-alpha; once it stabilizes the imports change, but the concepts below carry over.
Step 1: Project Setup
Create a project and mark it as an ES module in package.json -- the server code below uses ESM imports and top-level await:
{
  "name": "scraper-mcp",
  "version": "1.0.0",
  "type": "module",
  "bin": { "scraper-mcp": "build/index.js" },
  "scripts": { "build": "tsc" }
}
A minimal tsconfig.json targeting ES2022:
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "Node16",
    "moduleResolution": "Node16",
    "outDir": "build",
    "strict": true
  },
  "include": ["src/**/*.ts"]
}
Step 2: A Minimal MCP Server
Here is the smallest server that registers one tool and talks to a client over stdio. The .js extensions on the imports are required under ESM:
// src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "scraper", version: "1.0.0" });

// Register one tool: name, config, then the handler that runs on each call.
server.registerTool(
  "scrape_page",
  {
    title: "Scrape Page",
    description: "Fetch a URL and return its raw HTML",
    inputSchema: { url: z.string().url() }, // raw Zod shape, not z.object(...)
  },
  async ({ url }) => {
    const res = await fetch(url);
    const html = await res.text();
    return { content: [{ type: "text", text: html }] };
  }
);

// Serve over stdio: the client launches this process and talks to it over stdin/stdout.
const transport = new StdioServerTransport();
await server.connect(transport);
That is a complete MCP server. The current API is server.registerTool(name, config, handler); the inputSchema is a raw Zod shape ({ url: z.string() }), not a wrapped z.object(). Build it with npm run build.
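One thing the schema above may not enforce is the URL scheme: as far as we know, a generic URL check accepts any syntactically valid URL, including non-HTTP schemes. A minimal sketch of a stricter guard, using only the built-in URL class; isHttpUrl is a hypothetical helper name, not part of the SDK or Zod:

```typescript
// Hypothetical helper: accept only absolute http(s) URLs.
function isHttpUrl(input: string): boolean {
  try {
    const { protocol } = new URL(input);
    return protocol === "http:" || protocol === "https:";
  } catch {
    // new URL() throws on anything it cannot parse as an absolute URL.
    return false;
  }
}
```

If you want the check inside the schema rather than the handler, Zod's refine method should be able to attach it to the url field; treat that wiring as an assumption to verify against the Zod version you install.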
Step 3: Add a Real Scraping Tool
Returning raw HTML is rarely useful -- you want clean text. Add an HTML parser:
npm install cheerio

import * as cheerio from "cheerio";

server.registerTool(
  "scrape_text",
  {
    title: "Scrape Visible Text",
    description: "Fetch a URL and return readable text, stripped of scripts and chrome",
    inputSchema: { url: z.string().url() },
  },
  async ({ url }) => {
    const res = await fetch(url, { headers: { "User-Agent": "scraper-mcp/1.0" } });
    // Report HTTP failures as tool errors, not as page content.
    if (!res.ok) {
      return {
        content: [{ type: "text", text: "Request failed with status " + res.status }],
        isError: true,
      };
    }
    const $ = cheerio.load(await res.text());
    // Drop non-content elements, then collapse whitespace in what remains.
    $("script, style, nav, footer").remove();
    const text = $("body").text().replace(/\s+/g, " ").trim();
    // Cap the output at 8,000 characters to keep the response small.
    return { content: [{ type: "text", text: text.slice(0, 8000) }] };
  }
);
Return isError: true on failures so the model can react instead of treating an error string as page content. cheerio parses static HTML server-side and is fast, but it has limits, covered in The Hard Part: Production Scraping.
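As more tools get added, the success and failure shapes tend to repeat. One way to centralize them is sketched below. The helper names (textResult, errorResult, withErrorResult) and the "[truncated]" marker are illustrative assumptions, not SDK features; only the content and isError fields come from the tool result shape used above:

```typescript
type ToolResult = {
  content: { type: "text"; text: string }[];
  isError?: boolean;
};

// Hypothetical helper: a successful text result, cut to a character budget
// and marked so the model can tell that the page was truncated.
function textResult(text: string, maxChars = 8000): ToolResult {
  const truncated = text.length > maxChars;
  const body = truncated ? text.slice(0, maxChars) + "\n[truncated]" : text;
  return { content: [{ type: "text", text: body }] };
}

// Hypothetical helper: a failure the model can recognize via isError.
function errorResult(message: string): ToolResult {
  return { content: [{ type: "text", text: message }], isError: true };
}

// Wraps a handler so thrown exceptions (DNS failures, timeouts, parse
// errors) are returned as isError results instead of propagating.
function withErrorResult<A>(
  handler: (args: A) => Promise<ToolResult>
): (args: A) => Promise<ToolResult> {
  return async (args) => {
    try {
      return await handler(args);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return errorResult("Scrape failed: " + message);
    }
  };
}
```

A handler wrapped this way could be passed as the third argument to registerTool, so that a network exception and an HTTP error status both reach the model in the same form.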
Step 4: Test With the MCP Inspector
Before wiring anything into Claude, test in isolation with the MCP Inspector:
npx @modelcontextprotocol/inspector node build/index.js
It opens an interactive UI with Tools, Resources, and Prompts tabs. Select scrape_text, enter a URL, and run it -- you will see the response plus a live JSON-RPC log. You can also set environment variables per server here, which is how you would test an API key later.
Step 5: Connect It to Claude Desktop
Edit Claude Desktop's config (create the file if it does not exist):
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
Add your server under the mcpServers key, using an absolute path to the built file:
{
  "mcpServers": {
    "scraper": {
      "command": "node",
      "args": ["/ABSOLUTE/PATH/TO/build/index.js"],
      "env": { "SCRAPER_API_KEY": "optional" }
    }
  }
}
Quit Claude Desktop completely and reopen it. Your scrape_text tool now appears, and you can ask Claude in plain English to scrape a page.
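If you would rather script the config edit than do it by hand, the sketch below shows the two things that matter: picking the right file per platform and merging your entry without dropping servers that are already configured. All function names here are hypothetical, and home and appData are passed in as plain strings rather than read from the environment:

```typescript
type ServerEntry = { command: string; args: string[]; env?: Record<string, string> };
type DesktopConfig = { mcpServers?: Record<string, ServerEntry>; [key: string]: unknown };

// Config locations as listed in this step.
function claudeConfigPath(platform: "darwin" | "win32", home: string, appData: string): string {
  return platform === "darwin"
    ? `${home}/Library/Application Support/Claude/claude_desktop_config.json`
    : `${appData}\\Claude\\claude_desktop_config.json`;
}

// Rough check covering POSIX paths ("/...") and Windows drive paths ("C:\...").
function isAbsolutePath(p: string): boolean {
  return p.startsWith("/") || /^[A-Za-z]:[\\/]/.test(p);
}

// Returns the updated config JSON. Existing servers and unrelated keys are
// preserved; an entry with the same name is overwritten.
function addServer(configJson: string, name: string, entry: ServerEntry): string {
  const script = entry.args[0];
  if (script === undefined || !isAbsolutePath(script)) {
    throw new Error("args[0] must be an absolute path to the built file");
  }
  // Treat an empty file as an empty config object.
  const config = (configJson.trim() === "" ? {} : JSON.parse(configJson)) as DesktopConfig;
  const merged: DesktopConfig = {
    ...config,
    mcpServers: { ...config.mcpServers, [name]: entry },
  };
  return JSON.stringify(merged, null, 2);
}
```

Reading and writing the file itself is left out on purpose; the merge is the part that is easy to get wrong.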
The Hard Part: Production Scraping
The protocol code above is the easy 20%. Real-world scraping is the other 80%, and none of it is MCP-specific:
- JavaScript rendering. cheerio only sees the initial HTML. Single-page apps render content client-side, so you need a headless browser like Playwright or Puppeteer -- which is slow, memory-hungry, and a deployment headache.
- Anti-bot systems. Cloudflare, WAFs, CAPTCHAs, and browser fingerprinting block naive HTTP clients. Getting past them is a continuous arms race.
- Proxies. Avoiding IP bans at any volume means rotating residential or datacenter proxy pools.
- Rate limits, retries, and backoff. Polite, resilient crawling needs concurrency control and exponential backoff.
- Parsing brittleness. Selectors break the moment a target site changes its markup.
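To give a sense of scale, here is roughly what the retry-and-backoff item alone looks like. This is a minimal sketch under stated assumptions: withRetry, backoffDelay, and the option names are hypothetical, every failure is treated as retryable, and there is no jitter or concurrency control, both of which a production crawler would likely add:

```typescript
type RetryOptions = { retries: number; baseDelayMs: number; maxDelayMs: number };

// Delay before the retry that follows failed attempt `attempt` (0-based):
// baseDelayMs * 2^attempt, capped at maxDelayMs.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task` up to retries + 1 times, sleeping between failed attempts.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(task: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < opts.retries) {
        await sleep(backoffDelay(attempt, opts));
      }
    }
  }
  throw lastError;
}

// Usage sketch: fetch does not throw on 4xx/5xx responses, so the task
// has to turn a bad status into an exception for the retry to trigger.
async function fetchTextWithRetry(url: string): Promise<string> {
  return withRetry(async () => {
    const res = await fetch(url);
    if (!res.ok) throw new Error("Request failed with status " + res.status);
    return res.text();
  }, { retries: 3, baseDelayMs: 500, maxDelayMs: 8000 });
}
```

Even this small piece raises follow-up questions (which statuses are worth retrying, how to respect Retry-After), which is the point of the list above.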
Build vs. Buy
Build your own MCP server to learn the protocol, or for simple, static, well-behaved internal sites. Once you hit JavaScript rendering, anti-bot defenses, and proxy rotation, the maintenance burden dwarfs the protocol code -- which is exactly why most teams point their agents at a managed scraping MCP server instead.
That is the niche CrawlForge fills: 26 ready-made tools including scrape_with_actions (headless browser), stealth_mode (anti-bot), batch_scrape, and deep_research, with JS rendering, proxies, and rate limiting handled for you. You install it the same way you would your own server -- see adding web scraping to Claude Desktop -- but skip the browser-and-proxy treadmill. For how the managed options stack up, see the best MCP servers for web scraping.
Gotchas
- Never log to stdout on a stdio server. stdout carries the JSON-RPC stream, so a stray console.log corrupts it and breaks the server. Log to stderr with console.error, or to a file.
- Use absolute paths in claude_desktop_config.json -- relative paths fail.
- Stay ESM: "type": "module" in package.json and .js extensions on SDK imports, with target: ES2022.
- Restart the client after any config change. Use Node.js 18 or newer.
- Prefer v1.x of the SDK for production; treat v2 as experimental for now.
- Python instead? The official Python SDK (the mcp package, with FastMCP and an @mcp.tool() decorator) speaks the same protocol and needs Python 3.10+.
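For the stdout gotcha, a tiny logging wrapper keeps the habit safe. This is a minimal sketch; the log and formatLog names and the line format are assumptions, and the only fact it relies on is that console.error writes to stderr in Node.js:

```typescript
type Level = "debug" | "info" | "warn" | "error";

// Pure formatter, kept separate so the output format is easy to test.
function formatLog(level: Level, message: string, now: Date = new Date()): string {
  return `${now.toISOString()} [${level}] ${message}`;
}

// Every level goes through console.error, so nothing reaches stdout,
// which stays reserved for the JSON-RPC stream.
function log(level: Level, message: string): void {
  console.error(formatLog(level, message));
}
```

Inside a tool handler you would call something like log("info", "fetching " + url) wherever you would otherwise reach for console.log.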
Skip the maintenance -- start free with CrawlForge and get 26 production scraping tools with 1,000 credits, no credit card.