Web Scraping Glossary
50 essential terms covering web scraping, AI agents, the Model Context Protocol, and data extraction.
A
AI Agent
AI / MCP: An AI agent is an autonomous system powered by a large language model that can reason about tasks, make decisions, and take actions by using tools. Agents go beyond simple chatbots by planning and executing multi-step workflows.
API Endpoint
Data: An API endpoint is a specific URL where an API receives requests. Each endpoint performs a specific function, like retrieving data, creating records, or triggering actions.
C
CAPTCHA Solving
Web Scraping: CAPTCHA solving refers to automated techniques for bypassing CAPTCHA challenges that websites use to distinguish humans from bots. This includes image recognition, token-based solving, and browser fingerprint emulation.
Competitive Intelligence
Industry: Competitive intelligence is the systematic collection and analysis of information about competitors, market trends, and industry dynamics. It informs strategic decisions about pricing, positioning, and product development.
Content Migration
Industry: Content migration is the process of moving content from one platform or system to another. It involves extracting content from the source, transforming it to match the target format, and loading it into the new system.
Context Window
AI / MCP: The context window is the maximum amount of text (measured in tokens) that a language model can process in a single request. It includes both the input prompt and the generated output.
CSS Selector
Web Scraping: A CSS selector is a pattern used to select and target specific HTML elements on a web page. In web scraping, selectors identify exactly which data to extract from a page's structure.
D
Data Governance
Industry: Data governance is the framework of policies, procedures, and standards that ensures data is managed properly throughout its lifecycle. It covers data privacy, compliance, access control, and quality standards.
Data Pipeline
Industry: A data pipeline is an automated sequence of steps that collects, processes, transforms, and delivers data from sources to destinations. It enables continuous data flow between systems without manual intervention.
Data Quality
Industry: Data quality measures how well a dataset meets the requirements of its intended use. Key dimensions include accuracy, completeness, consistency, timeliness, and validity of the data.
DOM Parsing
Web Scraping: DOM parsing is the process of converting raw HTML into a structured Document Object Model tree. This tree representation allows programs to navigate and extract specific elements from a web page.
Dynamic Content
Web Scraping: Dynamic content is web content that is loaded or generated by JavaScript after the initial page load. This includes single-page applications, AJAX-loaded data, and client-side rendered content.
E
Embeddings
AI / MCP: Embeddings are dense numerical vector representations of text, images, or other data. They capture semantic meaning in a format that enables similarity search, clustering, and other machine learning operations.
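As a rough illustration of similarity search over embeddings, the sketch below compares vectors with cosine similarity. The three-dimensional vectors are made up for the example; real embeddings come from a model and usually have hundreds or thousands of dimensions. The names `cosine_similarity`, `most_similar`, and `TOY_EMBEDDINGS` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, NOT output from a real embedding model.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

def most_similar(query, embeddings):
    """Return the key whose vector is closest to the query's vector."""
    candidates = (key for key in embeddings if key != query)
    return max(
        candidates,
        key=lambda key: cosine_similarity(embeddings[query], embeddings[key]),
    )

print(most_similar("cat", TOY_EMBEDDINGS))
```

Cosine similarity is only one common choice; dot product and Euclidean distance are also used, depending on how the vectors were trained.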
ETL (Extract, Transform, Load)
Industry: ETL is a data integration process that extracts data from sources, transforms it into a suitable format, and loads it into a target system. It is a standard approach for moving data between systems.
F
Fine-Tuning
AI / MCP: Fine-tuning is the process of further training a pre-trained language model on a specific dataset to specialize its behavior for a particular task or domain. It adapts general-purpose models to targeted use cases.
Function Calling
AI / MCP: Function calling is the ability of language models to invoke external functions or APIs during a conversation. The model decides when a call is needed and generates the appropriate arguments; the host application executes the function and returns the result for the model to use.
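The host-application side of function calling can be sketched as a small dispatcher. This is a minimal sketch under stated assumptions: the JSON shape with `name` and `arguments` keys is an illustrative convention, not any specific vendor's format, and `get_page_title` is a hypothetical tool that returns a canned value instead of fetching anything.

```python
import json

def get_page_title(url: str) -> str:
    """Hypothetical tool: returns a canned title rather than fetching the URL."""
    return f"Title of {url}"

# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_page_title": get_page_title}

def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a model-generated call and return a JSON result."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]          # KeyError if the model names an unknown tool
    result = func(**call["arguments"])  # arguments were generated by the model
    return json.dumps({"name": call["name"], "result": result})

# Example of what a model might emit (assumed format).
model_output = '{"name": "get_page_title", "arguments": {"url": "https://example.com"}}'
print(dispatch(model_output))
```

In a real loop, the returned JSON would be appended to the conversation so the model can use the result in its next turn.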
H
Headless Browser
Web Scraping: A headless browser is a web browser without a graphical user interface that can be controlled programmatically. It executes JavaScript and renders pages like a regular browser, but does so without displaying a window.
HTML Parsing
Data: HTML parsing is the process of analyzing HTML markup to extract its structure and content. Parsers convert raw HTML strings into navigable tree structures that programs can query and manipulate.
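A minimal sketch of HTML parsing using Python's standard library follows. Note that `html.parser.HTMLParser` is event-based rather than tree-building, so this example collects link targets as the parser walks the markup instead of querying a tree. The names `LinkCollector` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    """Return all href values found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

print(extract_links('<p><a href="/a">A</a> <a>no href</a> <a href="/b">B</a></p>'))
```

Third-party libraries that build a full tree are often more convenient for real scraping work; the standard-library parser is used here only to keep the sketch dependency-free.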
HTTP Headers
Web Scraping: HTTP headers are key-value pairs sent with HTTP requests and responses that provide metadata about the communication. In scraping, headers like User-Agent, Accept, and Cookie are critical for successful requests.
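As an illustration, the sketch below attaches the three headers named above to a request object using `urllib.request.Request`. Nothing is sent over the network; the header values, the URL, and the helper name `build_request` are placeholders chosen for the example.

```python
from urllib.request import Request

def build_request(url):
    """Build (but do not send) a request carrying explicit headers."""
    headers = {
        "User-Agent": "example-scraper/0.1",  # placeholder identity string
        "Accept": "text/html",
        "Cookie": "session=abc123",           # placeholder cookie value
    }
    return Request(url, headers=headers)

req = build_request("https://example.com/")
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())
```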
J
JSON
Data: JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and machines to parse. It is the most common format for API responses and structured data exchange.
JSON-LD
Data: JSON-LD (JSON for Linked Data) is a method of encoding linked, structured data using JSON. It is a widely preferred format for embedding schema.org markup in web pages so that search engines can interpret them.
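JSON-LD is typically embedded in a `<script type="application/ld+json">` element. The sketch below pulls such blocks out of a page with the standard library; the sample HTML, the product data in it, and the names `JsonLdExtractor` and `extract_json_ld` are made up for illustration.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects and decodes every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False

def extract_json_ld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items

# Made-up sample page.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body></body></html>
"""

print(extract_json_ld(SAMPLE_HTML))
```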
L
Large Language Model (LLM)
AI / MCP: A large language model is a neural network trained on vast amounts of text data that can understand and generate human language. LLMs power AI assistants, code generators, and autonomous agents.
Lead Enrichment
Industry: Lead enrichment is the process of supplementing basic lead information with additional data points like company size, industry, technology stack, and social profiles. It helps sales teams prioritize and personalize outreach.
M
Markdown
Data: Markdown is a lightweight markup language that uses plain text formatting syntax. It is widely used for documentation, content creation, and as a clean intermediate format for extracted web content.
MCP Client
AI / MCP: An MCP client is an application or AI model that connects to MCP servers to discover and invoke tools. It sends tool call requests and processes the structured responses returned by the server.
MCP Server
AI / MCP: An MCP server is a service that exposes tools and resources through the Model Context Protocol. It registers available functions, handles incoming tool calls from AI clients, and returns structured results.
Model Context Protocol (MCP)
AI / MCP: The Model Context Protocol is an open standard that enables AI models to interact with external tools and data sources through a unified interface. It provides a structured way for LLMs to call functions, access APIs, and retrieve real-time information.
P
Pagination
Web Scraping: Pagination is the practice of dividing content across multiple pages. Handling pagination in web scraping means automatically navigating through all pages to collect complete datasets.
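The navigation loop can be sketched as follows, assuming each page response carries a pointer to the next page. The in-memory `FAKE_PAGES` data stands in for real HTTP responses, and the `items` and `next_page` field names are assumptions made for the example.

```python
# Made-up pages standing in for real responses.
FAKE_PAGES = {
    1: {"items": ["a", "b"], "next_page": 2},
    2: {"items": ["c"], "next_page": 3},
    3: {"items": ["d", "e"], "next_page": None},
}

def fetch_page(page_number):
    """Hypothetical fetcher; a real one would issue an HTTP request."""
    return FAKE_PAGES[page_number]

def collect_all(fetch, start=1, max_pages=100):
    """Follow next_page pointers until none remains or max_pages is reached."""
    items = []
    page = start
    visited = 0
    while page is not None and visited < max_pages:
        data = fetch(page)
        items.extend(data["items"])
        page = data["next_page"]
        visited += 1
    return items

print(collect_all(fetch_page))
```

The `max_pages` cap is a defensive choice: it keeps the loop from running indefinitely if a site returns a circular or never-ending chain of next links.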
Price Monitoring
Industry: Price monitoring is the automated tracking of product and service prices across websites over time. It enables businesses to respond to competitor pricing changes, optimize their own pricing, and identify market trends.
Prompt Engineering
AI / MCP: Prompt engineering is the practice of designing and refining instructions given to language models to achieve desired outputs. It involves crafting system prompts, few-shot examples, and structured queries.
Proxy Rotation
Web Scraping: Proxy rotation is the practice of cycling through multiple proxy IP addresses when making web requests. This distributes requests across different IPs to avoid rate limits and IP-based blocking.
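A simple round-robin rotation can be sketched with `itertools.cycle`. The proxy addresses are placeholders and `ProxyRotator` is a hypothetical name; production setups often add health checks and per-proxy backoff, which this sketch omits.

```python
from itertools import cycle

class ProxyRotator:
    """Hands out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy, wrapping around at the end of the pool."""
        return next(self._pool)

# Placeholder addresses, not real proxies.
rotator = ProxyRotator([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

for _ in range(4):
    print(rotator.next_proxy())
```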
R
Rate Limiting
Web Scraping: Rate limiting is a technique used by websites and APIs to control the number of requests a client can make within a given time period. It prevents server overload and defends against abusive scraping.
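One way a server might enforce such a limit is a sliding-window counter, sketched below. This is only one of several common algorithms (token bucket and fixed window are others), and the class name and parameters are hypothetical. The clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allows at most max_requests within any window_seconds interval."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._timestamps = deque()  # times of recently accepted requests

    def allow(self):
        """Return True and record the request if it is within the limit."""
        now = self._clock()
        # Drop accepted requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10)
print([limiter.allow() for _ in range(3)])
```

A scraper can use the same structure client-side to stay under a known limit instead of discovering it through rejected requests.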
REST API
Data: A REST API is a web API that follows REST (Representational State Transfer), an architectural style that uses standard HTTP methods such as GET, POST, PUT, and DELETE to perform operations on resources. It is the most common API style for web services.
Retrieval-Augmented Generation (RAG)
AI / MCP: RAG is an AI architecture that combines information retrieval with text generation. It first retrieves relevant documents from external sources, then uses them as context for the language model to generate accurate, grounded responses.
Robots.txt
Web Scraping: Robots.txt is a standard text file placed at the root of a website that tells web crawlers which pages they may and may not access. It is part of the Robots Exclusion Protocol.
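Python's standard library includes a parser for this file. The sketch below checks two URLs against a made-up rule set passed in as text, so no network access is needed; `is_allowed` and the bot name `example-bot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/public/page.html"))
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/data.html"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file instead of parsing a string.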
S
Schema Markup
Data: Schema markup is a vocabulary of tags (from schema.org) that you add to HTML to improve how search engines read and represent your page. It defines types like Product, Article, Organization, and their properties.
SEO Audit
Industry: An SEO audit is a comprehensive analysis of a website's search engine optimization performance. It evaluates technical SEO, on-page content, metadata, and site structure, and identifies opportunities for improvement.
Sitemap
Web Scraping: A sitemap is an XML file that lists the URLs a website wants crawled, along with optional metadata like last modification date and priority. It helps search engines and crawlers discover and index pages efficiently.
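A sitemap can be read with the standard library's XML parser, as in the sketch below. The sample document is made up; the namespace URI is the one defined by the sitemaps.org protocol. The name `parse_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap files per the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample sitemap.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing optional fields become None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "priority": url.findtext("sm:priority", namespaces=SITEMAP_NS),
        })
    return entries

for entry in parse_sitemap(SAMPLE_SITEMAP):
    print(entry)
```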
Structured Data
Data: Structured data is information organized in a predefined format that makes it easy for machines to parse and understand. On the web, it typically refers to schema.org markup embedded in HTML pages.
Structured Output
AI / MCP: Structured output refers to data returned in a predictable, machine-readable format like JSON, rather than free-form text. It enables reliable downstream processing by AI agents and data pipelines.
T
Token
AI / MCP: A token is the basic unit of text that language models process. Text is split into tokens (roughly 4 characters or 0.75 words each for English text) before being processed by the model. Token counts determine costs and context limits.
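The rule of thumb above can be turned into a rough budget check, as sketched below. This is only the 4-characters-per-token heuristic, not a real tokenizer; actual counts vary by model and language. The function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the definition above, not an exact value

def estimate_tokens(text):
    """Rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_context(prompt, max_output_tokens, context_window):
    """Check whether estimated input plus reserved output fits the window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window

prompt = "a" * 400  # 400 characters, so about 100 tokens by the heuristic
print(estimate_tokens(prompt))
print(fits_context(prompt, max_output_tokens=50, context_window=150))
```

For billing or hard limits, the model provider's own tokenizer is the reliable source of counts.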
Tool Use
AI / MCP: Tool use is the capability of AI models to interact with external tools, APIs, and services to accomplish tasks beyond text generation. It extends model capabilities to include web browsing, code execution, data retrieval, and more.
W
Web Crawler
Web Scraping: A web crawler is a program that systematically browses the web by following links from page to page. Crawlers discover and index content across entire websites or domains.
Web Data
Industry: Web data is any information that is publicly accessible on the internet. It includes website content, social media posts, public APIs, government records, and any other data available through web protocols.
Web Scraping
Web Scraping: Web scraping is the automated extraction of data from websites. It involves programmatically fetching web pages and parsing their content to collect structured information.
Webhook
Data: A webhook is an HTTP callback that delivers data to a specified URL when an event occurs. Unlike polling, webhooks push data in real time, enabling event-driven architectures.