CrawlForge

Web Scraping Glossary

50 essential terms covering web scraping, AI agents, the Model Context Protocol, and data extraction.

A

AI Agent

AI / MCP

An AI agent is an autonomous system powered by a large language model that can reason about tasks, make decisions, and take actions by using tools. Agents go beyond simple chatbots by planning and executing multi-step workflows.

API Endpoint

Data

An API endpoint is a specific URL where an API receives requests. Each endpoint performs a specific function, like retrieving data, creating records, or triggering actions.

C

CAPTCHA Solving

Web Scraping

CAPTCHA solving refers to automated techniques for bypassing CAPTCHA challenges that websites use to distinguish humans from bots. This includes image recognition, token-based solving, and browser fingerprint emulation.

Competitive Intelligence

Industry

Competitive intelligence is the systematic collection and analysis of information about competitors, market trends, and industry dynamics. It informs strategic decisions about pricing, positioning, and product development.

Content Migration

Industry

Content migration is the process of moving content from one platform or system to another. It involves extracting content from the source, transforming it to match the target format, and loading it into the new system.

Context Window

AI / MCP

The context window is the maximum amount of text (measured in tokens) that a language model can process in a single request. It includes both the input prompt and the generated output.

CSS Selector

Web Scraping

A CSS selector is a pattern used to select and target specific HTML elements on a web page. In web scraping, selectors identify exactly which data to extract from a page's structure.

D

Data Governance

Industry

Data governance is the framework of policies, procedures, and standards that ensures data is managed properly throughout its lifecycle. It covers data privacy, compliance, access control, and quality standards.

Data Pipeline

Industry

A data pipeline is an automated sequence of steps that collects, processes, transforms, and delivers data from sources to destinations. It enables continuous data flow between systems without manual intervention.

Data Quality

Industry

Data quality measures how well a dataset meets the requirements of its intended use. Key dimensions include accuracy, completeness, consistency, timeliness, and validity of the data.

DOM Parsing

Web Scraping

DOM parsing is the process of converting raw HTML into a structured Document Object Model tree. This tree representation allows programs to navigate and extract specific elements from a web page.
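
As a rough illustration of the idea, the sketch below builds a simplified tree of nested dictionaries with Python's standard `html.parser` module and then walks it. The node layout and the helper names (`TreeBuilder`, `parse_dom`, `find_all`) are assumptions made for this example, not a real DOM implementation.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree: {"tag", "attrs", "text", "children"}."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1]["text"] += data.strip()


def parse_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Depth-first search for every node with the given tag name."""
    found = [node] if node["tag"] == tag else []
    for child in node["children"]:
        found.extend(find_all(child, tag))
    return found


SAMPLE_HTML = "<html><body><h1>Deals</h1><ul><li>Alpha</li><li>Beta</li></ul></body></html>"
dom = parse_dom(SAMPLE_HTML)
items = [li["text"] for li in find_all(dom, "li")]
```

Production scrapers usually rely on a dedicated parsing library, which handles malformed markup far more robustly than this sketch.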

Dynamic Content

Web Scraping

Dynamic content is web content that is loaded or generated by JavaScript after the initial page load. This includes single-page applications, AJAX-loaded data, and client-side rendered content.

E

Embeddings

AI / MCP

Embeddings are dense numerical vector representations of text, images, or other data. They capture semantic meaning in a format that enables similarity search, clustering, and other machine learning operations.
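
To make similarity search concrete, here is a small sketch that compares vectors with cosine similarity. The three-dimensional vectors are hand-made for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumed toy vectors, not the output of any real embedding model.
toy_embeddings = {
    "web scraping": [0.9, 0.1, 0.0],
    "data extraction": [0.8, 0.2, 0.1],
    "pizza recipe": [0.0, 0.1, 0.9],
}

query = toy_embeddings["web scraping"]
sim_related = cosine_similarity(query, toy_embeddings["data extraction"])
sim_unrelated = cosine_similarity(query, toy_embeddings["pizza recipe"])
```

With these toy values, the related phrase scores close to 1 and the unrelated one close to 0, which is the property that similarity search depends on.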

ETL (Extract, Transform, Load)

Industry

ETL is a data integration process that extracts data from sources, transforms it into a suitable format, and loads it into a target system. It is a long-established approach for moving data between systems, particularly into data warehouses.
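
The three stages can be sketched with the standard library alone. In this example the CSV text, the `products` table, and the function names are all hypothetical; an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent spacing and currency symbols.
RAW_CSV = """name,price
Widget, $19.99
Gadget,$5.00
"""


def extract(csv_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: trim whitespace and convert prices to floats."""
    return [
        (row["name"].strip(), float(row["price"].strip().lstrip("$")))
        for row in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
```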

F

Fine-Tuning

AI / MCP

Fine-tuning is the process of further training a pre-trained language model on a specific dataset to specialize its behavior for a particular task or domain. It adapts general-purpose models to targeted use cases.

Function Calling

AI / MCP

Function calling is the ability of language models to invoke external functions or APIs during a conversation. The model decides when to call a function, generates the appropriate arguments, and processes the returned results.
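
The application side of this loop might look like the sketch below: the model emits a function name and arguments as JSON, and the application dispatches the call and serializes the result. The message shape and the `get_page_title` tool are assumptions for illustration; each model provider defines its own exact format.

```python
import json


def get_page_title(url):
    """Hypothetical tool. A real one would fetch the URL and parse the title."""
    return {"url": url, "title": "Example Domain"}


# Registry of functions the model is allowed to call.
TOOLS = {"get_page_title": get_page_title}


def dispatch(model_message):
    """Run the function named in the model's JSON message and return JSON."""
    call = json.loads(model_message)
    func = TOOLS.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps(func(**call["arguments"]))


# Assumed shape of a model-generated call.
model_message = '{"name": "get_page_title", "arguments": {"url": "https://example.com"}}'
tool_result = dispatch(model_message)
```

The returned JSON string would then be passed back to the model as the function's result.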

G

GraphQL

Data

GraphQL is a query language for APIs that allows clients to request exactly the data they need. Unlike REST, a single GraphQL endpoint serves all queries, with the client specifying the data shape.
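
As a small sketch, a GraphQL request is conventionally sent as a JSON body with `query` and `variables` keys. The `product` field and its `name` and `price` sub-fields below belong to a hypothetical schema; nothing is sent over the network here.

```python
import json

# The client names exactly the fields it wants back (hypothetical schema).
QUERY = """
query ProductPrice($id: ID!) {
  product(id: $id) {
    name
    price
  }
}
"""


def build_graphql_request(query, variables):
    """Serialize a query and its variables into a JSON request body."""
    return json.dumps({"query": query, "variables": variables})


body = build_graphql_request(QUERY, {"id": "42"})
```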

H

Headless Browser

Web Scraping

A headless browser is a web browser without a graphical user interface that can be controlled programmatically. It executes JavaScript and renders pages exactly like a regular browser, but runs in the background.

HTML Parsing

Data

HTML parsing is the process of analyzing HTML markup to extract its structure and content. Parsers convert raw HTML strings into navigable tree structures that programs can query and manipulate.

HTTP Headers

Web Scraping

HTTP headers are key-value pairs sent with HTTP requests and responses that provide metadata about the communication. In scraping, headers like User-Agent, Accept, and Cookie are critical for successful requests.
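
For example, headers can be attached to a request object with Python's `urllib.request`. The URL and the User-Agent string are placeholders, and the request is only constructed, not sent.

```python
from urllib.request import Request


def build_request(url):
    """Create a request that identifies the client and states what it accepts."""
    headers = {
        # Placeholder bot identity for illustration.
        "User-Agent": "example-scraper/1.0 (+https://example.com/bot-info)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return Request(url, headers=headers)


request = build_request("https://example.com/products")

# urllib normalizes header names, so "User-Agent" is stored as "User-agent".
user_agent = request.get_header("User-agent")
```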

J

JSON

Data

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and machines to parse. It is the standard format for API responses and structured data exchange.

JSON-LD

Data

JSON-LD (JSON for Linked Data) is a method of encoding linked, structured data in JSON. It is a widely used format for embedding schema.org markup in web pages, typically inside a script tag of type application/ld+json, so that search engines can interpret the page.
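
A scraper can read this markup directly. The sketch below collects every `application/ld+json` script block from a page using the standard `html.parser` module; the sample page and the `JsonLdExtractor` class are made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical product page.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(PAGE)
extractor.close()
product = extractor.items[0]
```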

L

Large Language Model (LLM)

AI / MCP

A large language model is a neural network trained on vast amounts of text data that can understand and generate human language. LLMs power AI assistants, code generators, and autonomous agents.

Lead Enrichment

Industry

Lead enrichment is the process of supplementing basic lead information with additional data points like company size, industry, technology stack, and social profiles. It helps sales teams prioritize and personalize outreach.

M

Markdown

Data

Markdown is a lightweight markup language that uses plain text formatting syntax. It is widely used for documentation, content creation, and as a clean intermediate format for extracted web content.

MCP Client

AI / MCP

An MCP client is the component of an AI application that connects to MCP servers to discover and invoke tools on the model's behalf. It sends tool call requests and processes the structured responses returned by the server.

MCP Server

AI / MCP

An MCP server is a service that exposes tools and resources through the Model Context Protocol. It registers available functions, handles incoming tool calls from AI clients, and returns structured results.

Model Context Protocol (MCP)

AI / MCP

The Model Context Protocol is an open standard that enables AI models to interact with external tools and data sources through a unified interface. It provides a structured way for LLMs to call functions, access APIs, and retrieve real-time information.
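
MCP messages use JSON-RPC 2.0, and a tool invocation is a request with the method `tools/call`. The sketch below builds such a request and answers it with a toy handler. The `fetch_url` tool and the helper functions are hypothetical, and the handler is a simplification for illustration, not the official SDK.

```python
def make_tool_call(request_id, tool_name, arguments):
    """Client side: build a JSON-RPC 2.0 request that invokes a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def handle_request(request, tools):
    """Toy server side: run the named tool and wrap its text output."""
    params = request["params"]
    text = tools[params["name"]](**params["arguments"])
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": text}]},
    }


def fetch_url(url):
    """Hypothetical tool. A real server would fetch the page here."""
    return f"(pretend contents of {url})"


request = make_tool_call(1, "fetch_url", {"url": "https://example.com"})
response = handle_request(request, {"fetch_url": fetch_url})
```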

P

Pagination

Web Scraping

Pagination is the practice of dividing content across multiple pages. Handling pagination in web scraping means automatically navigating through all pages to collect complete datasets.
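
A common pattern is to follow a "next page" pointer until it runs out. In this sketch, `FAKE_SITE` and `fetch_page` stand in for real HTTP requests, and the response shape (`items`, `next_page`) is an assumption.

```python
# Simulated paginated responses keyed by page number.
FAKE_SITE = {
    1: {"items": ["a", "b"], "next_page": 2},
    2: {"items": ["c", "d"], "next_page": 3},
    3: {"items": ["e"], "next_page": None},
}


def fetch_page(page_number):
    """Stand-in for an HTTP request that returns one page of results."""
    return FAKE_SITE[page_number]


def collect_all(start_page=1, max_pages=100):
    """Follow next_page pointers until none remain or the page cap is hit."""
    items = []
    page = start_page
    visited = 0
    while page is not None and visited < max_pages:
        data = fetch_page(page)
        items.extend(data["items"])
        page = data["next_page"]
        visited += 1
    return items


all_items = collect_all()
```

The `max_pages` cap guards against sites whose pagination links loop back on themselves.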

Price Monitoring

Industry

Price monitoring is the automated tracking of product and service prices across websites over time. It enables businesses to respond to competitor pricing changes, optimize their own pricing, and identify market trends.

Prompt Engineering

AI / MCP

Prompt engineering is the practice of designing and refining instructions given to language models to achieve desired outputs. It involves crafting system prompts, few-shot examples, and structured queries.

Proxy Rotation

Web Scraping

Proxy rotation is the practice of cycling through multiple proxy IP addresses when making web requests. This distributes requests across different IPs to avoid rate limits and IP-based blocking.
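
A minimal round-robin rotation can be sketched with `itertools.cycle`. The proxy addresses are placeholders, and the example only plans which proxy each request would use rather than sending anything.

```python
from itertools import cycle

# Placeholder proxy addresses for illustration.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


class ProxyRotator:
    """Hands out proxies in round-robin order."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self):
        return next(self._pool)


def plan_requests(urls, rotator):
    """Pair each URL with the proxy it would be sent through."""
    return [(url, rotator.next_proxy()) for url in urls]


rotator = ProxyRotator(PROXIES)
urls = [f"https://example.com/page/{n}" for n in range(1, 5)]
plan = plan_requests(urls, rotator)
```

Real rotation logic often also removes proxies that fail or get blocked, which this sketch leaves out.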

R

Rate Limiting

Web Scraping

Rate limiting is a technique used by websites and APIs to control the number of requests a client can make within a given time period. It prevents server overload and defends against abusive scraping.
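
One way a server might enforce such a limit is a sliding window per client, sketched below. The class name and parameters are assumptions, and timestamps are passed in explicitly so the behavior is easy to follow; real systems often use other schemes such as token buckets.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = {}  # client_id -> deque of request timestamps

    def allow(self, client_id, now):
        hits = self._hits.setdefault(client_id, deque())
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10)
decisions = [limiter.allow("client-1", t) for t in (0, 1, 2, 11)]
```

Here the third request is rejected because two requests already fall inside the 10-second window, and the fourth is accepted once they have expired.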

REST API

Data

REST (Representational State Transfer) is an architectural style for web services. A REST API applies it by using standard HTTP methods such as GET, POST, PUT, and DELETE to perform operations on resources identified by URLs. It is the most common API style for web services.

Retrieval-Augmented Generation (RAG)

AI / MCP

RAG is an AI architecture that combines information retrieval with text generation. It first retrieves relevant documents from external sources, then uses them as context for the language model to generate accurate, grounded responses.
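
The retrieve-then-generate flow can be sketched without a model. Below, retrieval is a naive word-overlap ranking (an assumption made for brevity; real systems usually rank by embedding similarity), and the "generation" step stops at building the prompt that would be sent to a language model.

```python
# Tiny stand-in knowledge base.
DOCUMENTS = [
    "Robots.txt tells crawlers which paths they may access.",
    "A sitemap lists URLs along with last modification dates.",
    "Proxy rotation cycles through multiple IP addresses.",
]


def tokenize(text):
    """Lowercase words with surrounding punctuation removed."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents):
    """Place the retrieved documents in the prompt as grounding context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What does a sitemap list?", DOCUMENTS)
```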

Robots.txt

Web Scraping

Robots.txt is a plain text file placed at the root of a website that tells web crawlers which paths they may and may not access. It is part of the Robots Exclusion Protocol, and compliance is voluntary on the crawler's side.
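
Python's standard library includes a parser for this file. The sketch below feeds it a made-up robots.txt and checks two URLs for a hypothetical crawler named `example-bot`; no network access is involved.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: everything is allowed except paths under /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

can_fetch_public = parser.can_fetch("example-bot", "https://example.com/products")
can_fetch_private = parser.can_fetch("example-bot", "https://example.com/private/report")
```

A polite crawler would run a check like this before requesting each URL.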

S

Schema Markup

Data

Schema markup is a vocabulary of tags (from schema.org) that you add to HTML to improve how search engines read and represent your page. It defines types like Product, Article, Organization, and their properties.

SEO Audit

Industry

An SEO audit is a comprehensive analysis of a website's search engine optimization performance. It evaluates technical SEO, on-page content, metadata, and site structure, and identifies opportunities for improvement.

Sitemap

Web Scraping

A sitemap is an XML file that lists the URLs a website wants crawled, along with optional metadata such as last modification date and priority. It helps search engines and crawlers discover and index pages efficiently.
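
Reading a sitemap is mostly a matter of handling its XML namespace. The sketch below parses a made-up two-entry sitemap with `xml.etree.ElementTree`; the `parse_sitemap` helper and the sample URLs are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sitemap with one optional field missing on the second entry.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-15</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/pricing</loc>
    <lastmod>2025-02-01</lastmod>
  </url>
</urlset>
"""

# Sitemap elements live in this namespace, so lookups must be qualified.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing fields become None."""
    root = ET.fromstring(xml_text)
    return [
        {
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        }
        for url in root.findall("sm:url", NS)
    ]


entries = parse_sitemap(SITEMAP_XML)
```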

Structured Data

Data

Structured data is information organized in a predefined format that makes it easy for machines to parse and understand. On the web, it typically refers to schema.org markup embedded in HTML pages.

Structured Output

AI / MCP

Structured output refers to data returned in a predictable, machine-readable format like JSON, rather than free-form text. It enables reliable downstream processing by AI agents and data pipelines.
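
Downstream code usually validates such output before trusting it. The sketch below parses a JSON string and checks it against a small hand-written field list; the field names are hypothetical, and real projects often use a schema validation library instead.

```python
import json

# Assumed fields and types for a product record.
REQUIRED_FIELDS = {"name": str, "price": float, "in_stock": bool}


def parse_structured(raw):
    """Parse JSON and verify that each required field has the expected type."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


record = parse_structured('{"name": "Widget", "price": 19.99, "in_stock": true}')
```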

T

Token

AI / MCP

A token is the basic unit of text that language models process. Text is split into tokens (for English, roughly 4 characters or 0.75 words each) before being processed by the model. Token counts determine costs and context limits.
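
The 4-characters-per-token rule of thumb gives a quick estimate for budgeting. The sketch below applies it to check whether a prompt plus its expected output fits within a context limit. It is only an approximation: actual counts depend on the model's tokenizer, and the function names are made up for this example.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb, not an exact tokenizer


def estimate_tokens(text):
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(prompt, max_output_tokens, context_window):
    """Input and output tokens must fit in the window together."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window


prompt = "Summarize the pricing page. " * 100  # 2,800 characters
estimated = estimate_tokens(prompt)  # 2,800 / 4 = 700
```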

Tool Use

AI / MCP

Tool use is the capability of AI models to interact with external tools, APIs, and services to accomplish tasks beyond text generation. It extends model capabilities to include web browsing, code execution, data retrieval, and more.

U

User Agent

Web Scraping

A user agent is a string sent in HTTP request headers that identifies the client software making the request. Websites use it to detect browsers, bots, and scrapers.

V

Vector Database

AI / MCP

A vector database is a specialized database designed to store and efficiently query high-dimensional vector embeddings. It enables fast similarity search across millions of embedded documents.
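
Conceptually, the core operation is "store vectors, then return the closest ones to a query". The sketch below does this by brute force over a Python list. The `TinyVectorStore` class and its toy vectors are assumptions; real vector databases use approximate nearest-neighbor indexes to stay fast at scale.

```python
import math


class TinyVectorStore:
    """Keeps unit-length vectors in memory and scans all of them per query."""

    def __init__(self):
        self._rows = []  # list of (doc_id, unit_vector)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, doc_id, vector):
        self._rows.append((doc_id, self._normalize(vector)))

    def search(self, query, k=2):
        """Return the ids of the k vectors most similar to the query."""
        q = self._normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc_id)
            for doc_id, vec in self._rows
        ]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


store = TinyVectorStore()
store.add("doc-scraping", [0.9, 0.1, 0.0])
store.add("doc-pricing", [0.2, 0.9, 0.1])
store.add("doc-cooking", [0.0, 0.1, 0.9])
top_hits = store.search([1.0, 0.2, 0.0], k=2)
```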

W

Web Crawler

Web Scraping

A web crawler is a program that systematically browses the web by following links from page to page. Crawlers discover and index content across entire websites or domains.
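
The link-following loop is essentially a breadth-first traversal with a record of pages already seen. In this sketch, `LINK_GRAPH` and `get_links` simulate fetching a page and extracting its links, so the example runs without network access.

```python
from collections import deque

# Simulated site: each path maps to the links found on that page.
LINK_GRAPH = {
    "/": ["/pricing", "/blog"],
    "/pricing": ["/", "/contact"],
    "/blog": ["/blog/post-1", "/"],
    "/blog/post-1": [],
    "/contact": [],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return LINK_GRAPH.get(url, [])


def crawl(start, max_pages=50):
    """Visit pages breadth-first, never queueing the same URL twice."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/")
```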

Web Data

Industry

Web data is any information that is publicly accessible on the internet. It includes website content, social media posts, public APIs, government records, and any other data available through web protocols.

Web Scraping

Web Scraping

Web scraping is the automated extraction of data from websites. It involves programmatically fetching web pages and parsing their content to collect structured information.

Webhook

Data

A webhook is an HTTP callback that delivers data to a specified URL when an event occurs. Unlike polling, webhooks push data in real-time, enabling event-driven architectures.

X

XPath

Web Scraping

XPath (XML Path Language) is a query language for selecting nodes from an XML or HTML document. It provides a more powerful and flexible way to navigate document trees than CSS selectors alone.
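
As a brief sketch, Python's `xml.etree.ElementTree` supports a limited subset of XPath, which is enough to show attribute filters and selection by child text. The markup below is made up and must be well-formed XML for this module; scraping real-world HTML typically calls for a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup.
XHTML = """<html><body>
  <div class="product"><h2>Widget</h2><span class="price">19.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">5.00</span></div>
  <div class="ad"><h2>Sponsored</h2></div>
</body></html>"""

root = ET.fromstring(XHTML)

# Attribute filter: headings inside product blocks only, skipping the ad.
names = [h2.text for h2 in root.findall(".//div[@class='product']/h2")]

# Every price span, converted to a number.
prices = [float(span.text) for span in root.findall(".//span[@class='price']")]

# Select by child text: the price inside the block whose heading is "Gadget".
gadget_price = root.find(".//div[h2='Gadget']/span").text
```

Selecting an element by the text of one of its children, as in the last query, is something standard CSS selectors cannot express.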

© 2025-2026 CrawlForge. All rights reserved.