Web Scraping Glossary

A

API Endpoint

An API endpoint is a specific URL where an API receives requests. Each endpoint performs a specific function, such as retrieving data, creating records, or triggering actions.

C

CAPTCHA Solving

CAPTCHA solving refers to automated techniques for bypassing CAPTCHA challenges that websites use to distinguish humans from bots. This includes image recognition, token-based solving, and browser fingerprint emulation.

Competitive Intelligence

Competitive intelligence is the systematic collection and analysis of information about competitors, market trends, and industry dynamics. It informs strategic decisions about pricing, positioning, and product development.

Content Migration

Content migration is the process of moving content from one platform or system to another. It involves extracting content from the source, transforming it to match the target format, and loading it into the new system.

Context Window

The context window is the maximum amount of text (measured in tokens) that a language model can process in a single request. It includes both the input prompt and the generated output.

CSS Selector

A CSS selector is a pattern used to select and target specific HTML elements on a web page. In web scraping, selectors identify exactly which data to extract from a page's structure.

D

Data Governance

Data governance is the framework of policies, procedures, and standards that ensures data is managed properly throughout its lifecycle. It covers data privacy, compliance, access control, and quality standards.

Data Pipeline

A data pipeline is an automated sequence of steps that collects, processes, transforms, and delivers data from sources to destinations. It enables continuous data flow between systems without manual intervention.

Data Quality

Data quality measures how well a dataset meets the requirements of its intended use. Key dimensions include accuracy, completeness, consistency, timeliness, and validity of the data.

DOM Parsing

DOM parsing is the process of converting raw HTML into a structured Document Object Model tree. This tree representation allows programs to navigate and extract specific elements from a web page.
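
As a rough illustration of the tree idea, the sketch below builds a simplified element tree with Python's standard `html.parser` module and then walks it. The `Node`, `TreeBuilder`, `parse_dom`, and `find_all` names are hypothetical helpers invented for this example, and the builder assumes well-formed markup in which every opened tag is closed; a real DOM implementation handles far more cases.

```python
from html.parser import HTMLParser


class Node:
    """One element in a simplified DOM tree (hypothetical helper class)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A new element becomes a child of the element currently open.
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Step back up to the parent once an element closes.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Return every descendant with the given tag name, in document order."""
    found = []
    for child in node.children:
        if child.tag == tag:
            found.append(child)
        found.extend(find_all(child, tag))
    return found
```

Calling `find_all(parse_dom(html), "p")` returns the paragraph nodes of a page, each with its `text` and `attrs` available for extraction.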

Dynamic Content

Dynamic content is web content that is loaded or generated by JavaScript after the initial page load. This includes single-page applications, AJAX-loaded data, and client-side rendered content.

E

Embeddings

Embeddings are dense numerical vector representations of text, images, or other data. They capture semantic meaning in a format that enables similarity search, clustering, and other machine learning operations.
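
To make the similarity-search idea concrete, here is a minimal sketch that compares vectors by cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings come from a model and usually have hundreds or thousands of dimensions), and `most_similar` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not the output of any real embedding model.
VECTORS = {
    "scraper": [0.9, 0.1, 0.0],
    "crawler": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def most_similar(query, vectors):
    """Name of the stored vector closest to `query`, excluding itself."""
    candidates = [name for name in vectors if name != query]
    return max(
        candidates,
        key=lambda name: cosine_similarity(vectors[query], vectors[name]),
    )
```

With these toy values, "scraper" and "crawler" point in nearly the same direction, so they score as far more similar to each other than either does to "invoice".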

ETL (Extract, Transform, Load)

ETL is a data integration process that extracts data from sources, transforms it into a suitable format, and loads it into a target system. It is a long-established approach for moving data between systems.
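
The three stages can be sketched in a few lines. This example assumes a small inline CSV string as the source and an in-memory SQLite database as the target; the `products` table, the column names, and the function names are all hypothetical choices for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data with inconsistent whitespace.
RAW_CSV = "name,price\nWidget, 9.99 \nGadget,19.50\n"


def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim names and convert prices to floats."""
    return [(row["name"].strip(), float(row["price"])) for row in rows]


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    load(transform(extract(text)), conn)
```

Keeping each stage a separate function makes it easy to swap the source or target without touching the transformation logic.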

F

Fine-Tuning

Fine-tuning is the process of further training a pre-trained language model on a specific dataset to specialize its behavior for a particular task or domain. It adapts general-purpose models to targeted use cases.

Function Calling

Function calling is the ability of language models to invoke external functions or APIs during a conversation. The model decides when to call a function, generates the appropriate arguments, and processes the returned results.
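
On the application side, function calling usually comes down to parsing a structured request from the model and routing it to real code. The sketch below assumes the model emits a JSON object with `name` and `arguments` keys; that shape, the `get_page_title` tool, and the `dispatch` helper are assumptions for illustration, not the format of any particular provider.

```python
import json


def get_page_title(url):
    """Hypothetical tool; returns canned data instead of fetching anything."""
    return {"url": url, "title": "Example Domain"}


# Registry mapping tool names to the Python functions that implement them.
TOOLS = {"get_page_title": get_page_title}


def dispatch(tool_call_json):
    """Run the tool named in a model-generated call and return JSON text."""
    call = json.loads(tool_call_json)
    func = TOOLS.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    result = func(**call["arguments"])
    return json.dumps(result)
```

The JSON string returned by `dispatch` would then be passed back to the model so it can use the result in its reply.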

G

GraphQL

GraphQL is a query language for APIs that allows clients to request exactly the data they need. Unlike REST, a single GraphQL endpoint serves all queries, with the client specifying the data shape.

H

Headless Browser

A headless browser is a web browser without a graphical user interface that can be controlled programmatically. It executes JavaScript and renders pages exactly like a regular browser, but runs in the background.

HTML Parsing

HTML parsing is the process of analyzing HTML markup to extract its structure and content. Parsers convert raw HTML strings into navigable tree structures that programs can query and manipulate.

HTTP Headers

HTTP headers are key-value pairs sent with HTTP requests and responses that provide metadata about the communication. In scraping, headers like User-Agent, Accept, and Cookie are critical for successful requests.
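
A minimal sketch of attaching these headers with Python's standard `urllib.request`. The User-Agent string, the example URL, and the `build_request` helper are placeholders; the request object is only constructed here, not sent.

```python
from urllib.request import Request


def build_request(url, cookie=None):
    """Build (but do not send) a GET request with explicit headers."""
    headers = {
        # Placeholder identity; use a string that describes your own client.
        "User-Agent": "example-scraper/0.1 (+https://example.com/bot)",
        "Accept": "text/html,application/xhtml+xml",
    }
    if cookie:
        headers["Cookie"] = cookie
    return Request(url, headers=headers)
```

Note that `urllib` normalizes header names with `str.capitalize()`, so the value set as `User-Agent` is read back with `req.get_header("User-agent")`.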

J

JSON

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and machines to parse. It is the standard format for API responses and structured data exchange.

JSON-LD

JSON-LD (JSON for Linking Data) is a method of encoding structured data using JSON format. It is the preferred format for embedding schema.org markup in web pages for search engine understanding.
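
Because JSON-LD is embedded in `<script type="application/ld+json">` tags, a scraper can often read it without touching the visible markup. The following is a small sketch using only the standard library; the `JsonLdExtractor` class and `extract_json_ld` function are hypothetical names, and the code assumes each script block contains valid JSON.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_json_ld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Each returned item is a plain dictionary, so fields such as `@type` or `name` can be read directly.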

L

Large Language Model (LLM)

A large language model is a neural network trained on vast amounts of text data that can understand and generate human language. LLMs power AI assistants, code generators, and autonomous agents.

Lead Enrichment

Lead enrichment is the process of supplementing basic lead information with additional data points like company size, industry, technology stack, and social profiles. It helps sales teams prioritize and personalize outreach.

M

Markdown

Markdown is a lightweight markup language that uses plain text formatting syntax. It is widely used for documentation, content creation, and as a clean intermediate format for extracted web content.

MCP Client

An MCP client is an application or AI model that connects to MCP servers to discover and invoke tools. It sends tool call requests and processes the structured responses returned by the server.

MCP Server

An MCP server is a service that exposes tools and resources through the Model Context Protocol. It registers available functions, handles incoming tool calls from AI clients, and returns structured results.

Model Context Protocol (MCP)

The Model Context Protocol is an open standard that enables AI models to interact with external tools and data sources through a unified interface. It provides a structured way for LLMs to call functions, access APIs, and retrieve real-time information.

P

Pagination

Pagination is the practice of dividing content across multiple pages. Handling pagination in web scraping means automatically navigating through all pages to collect complete datasets.
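
The usual pattern is a loop that keeps requesting the next page until the source reports there is none. In this sketch the paginated source is faked with a dictionary so the example runs offline; the `next_page` field, `fetch_page`, and `collect_all` are assumptions for illustration, since real sites signal the next page in different ways (links, cursors, offsets).

```python
# Hypothetical paginated source, keyed by page number.
FAKE_PAGES = {
    1: {"items": ["a", "b"], "next_page": 2},
    2: {"items": ["c", "d"], "next_page": 3},
    3: {"items": ["e"], "next_page": None},
}


def fetch_page(page):
    """Stand-in for an HTTP call that returns one page of results."""
    return FAKE_PAGES[page]


def collect_all(fetch, start=1, max_pages=100):
    """Follow next_page pointers until they run out or max_pages is hit."""
    items = []
    page = start
    fetched = 0
    while page is not None and fetched < max_pages:
        response = fetch(page)
        items.extend(response["items"])
        page = response["next_page"]
        fetched += 1
    return items
```

The `max_pages` cap is a safety net against sources whose next-page pointers loop or never end.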

Price Monitoring

Price monitoring is the automated tracking of product and service prices across websites over time. It enables businesses to respond to competitor pricing changes, optimize their own pricing, and identify market trends.

Prompt Engineering

Prompt engineering is the practice of designing and refining instructions given to language models to achieve desired outputs. It involves crafting system prompts, few-shot examples, and structured queries.

Proxy Rotation

Proxy rotation is the practice of cycling through multiple proxy IP addresses when making web requests. This distributes requests across different IPs to avoid rate limits and IP-based blocking.
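
The simplest rotation strategy is round-robin, sketched below with `itertools.cycle`. The proxy addresses are placeholders and `ProxyRotator` is a hypothetical helper; production setups often add health checks and per-proxy cooldowns, which are omitted here.

```python
from itertools import cycle

# Placeholder proxy addresses for illustration only.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


class ProxyRotator:
    """Hands out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        return next(self._pool)
```

Each outgoing request would call `next_proxy()` and route through the returned address, wrapping back to the first proxy after the last.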

R

Rate Limiting

Rate limiting is a technique used by websites and APIs to control the number of requests a client can make within a given time period. It prevents server overload and defends against abusive scraping.
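
One common way to implement this is a token bucket, shown here as a minimal sketch; the glossary entry does not prescribe an algorithm, so treat the choice as an assumption. `TokenBucket` and its parameters are hypothetical names, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True and spend a token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Add tokens earned since the last call, capped at capacity.
        self.tokens = min(
            self.capacity, self.tokens + elapsed * self.refill_per_second
        )
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would typically keep one bucket per client and reject requests with HTTP 429 when `allow()` returns `False`; a scraper can use the same class on the client side to stay under a known limit.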

REST API

A REST (Representational State Transfer) API is a web service that follows an architectural style in which standard HTTP methods perform operations on resources. It is the most common API style for web services.

Retrieval-Augmented Generation (RAG)

RAG is an AI architecture that combines information retrieval with text generation. It first retrieves relevant documents from external sources, then uses them as context for the language model to generate accurate, grounded responses.
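
The retrieve-then-generate flow can be sketched without any model at all. In this toy version, retrieval ranks documents by word overlap with the question, which is a stand-in for the embedding-based search a real system would normally use; the documents, `retrieve`, and `build_prompt` are hypothetical, and the final generation step (sending the prompt to a language model) is left out.

```python
# Hypothetical knowledge base.
DOCUMENTS = [
    "Robots.txt tells crawlers which paths they may fetch.",
    "A sitemap lists the URLs a site wants indexed.",
    "Proxy rotation spreads requests across many IP addresses.",
]


def tokenize(text):
    """Lowercase, strip basic punctuation, and split into a set of words."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return set(cleaned.split())


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    query_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The prompt produced by `build_prompt` is what would be sent to the language model, which grounds its answer in the retrieved text rather than in its training data alone.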

Robots.txt

Robots.txt is a standard text file placed at the root of a website that tells web crawlers which paths they may and may not access. It is part of the Robots Exclusion Protocol, and its rules are advisory: compliance is up to the crawler.
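
Python's standard library ships a parser for this file, `urllib.robotparser`. The sketch below feeds it an inline example file rather than downloading one; the rules, the `example-bot` agent name, and the `make_checker` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def make_checker(robots_txt):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)
```

A crawler would call `checker.can_fetch("example-bot", url)` before each request and skip any URL for which it returns `False`.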

S

Schema Markup

Schema markup is a vocabulary of tags (from schema.org) that you add to HTML to improve how search engines read and represent your page. It defines types like Product, Article, Organization, and their properties.

SEO Audit

An SEO audit is a comprehensive analysis of a website's search engine optimization performance. It evaluates technical SEO, on-page content, metadata, and site structure, and identifies opportunities for improvement.

Sitemap

A sitemap is an XML file that lists the URLs a website wants crawled, along with optional metadata such as last modification date and priority. It helps search engines and crawlers discover and index pages efficiently.
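
Reading a sitemap mostly means walking its `<url>` entries while respecting the sitemap XML namespace. This sketch parses an inline example with the standard `xml.etree.ElementTree` module; the sample URLs and the `parse_sitemap` helper are illustrative.

```python
import xml.etree.ElementTree as ET

# Small example sitemap; the second entry omits the optional fields.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing fields come back as None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries
```

The resulting list can seed a crawl queue, with `lastmod` used to skip pages that have not changed since the previous run.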

Structured Data

Structured data is information organized in a predefined format that makes it easy for machines to parse and understand. On the web, it typically refers to schema.org markup embedded in HTML pages.

Structured Output

Structured output refers to data returned in a predictable, machine-readable format like JSON, rather than free-form text. It enables reliable downstream processing by AI agents and data pipelines.

T

Token

A token is the basic unit of text that language models process. Text is split into tokens (roughly 4 characters or 0.75 words each for English text) before being processed by the model. Token counts determine costs and context limits.
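
The 4-characters-per-token rule of thumb is enough for a quick budget check before sending a request. The sketch below is only an estimate: actual counts depend on the model's tokenizer, and `estimate_tokens` and `fits_context` are hypothetical helper names.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count; not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(prompt, max_output_tokens, context_limit):
    """Check whether prompt plus reserved output stays within the limit."""
    return estimate_tokens(prompt) + max_output_tokens <= context_limit
```

For exact numbers, use the tokenizer published for the specific model instead of this heuristic.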

Tool Use

Tool use is the capability of AI models to interact with external tools, APIs, and services to accomplish tasks beyond text generation. It extends model capabilities to include web browsing, code execution, data retrieval, and more.

U

User Agent

A user agent is a string sent in HTTP request headers that identifies the client software making the request. Websites use it to detect browsers, bots, and scrapers.

V

Vector Database

A vector database is a specialized database designed to store and efficiently query high-dimensional vector embeddings. It enables fast similarity search across millions of embedded documents.

W

Web Crawler

A web crawler is a program that systematically browses the web by following links from page to page. Crawlers discover and index content across entire websites or domains.
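
At its core a crawler is a graph traversal with a visited set. The sketch below does a breadth-first crawl over a made-up site represented as a dictionary of links, so it runs without network access; `SITE`, `get_links`, and `crawl` are hypothetical, and a real crawler would fetch each page and extract its links instead.

```python
from collections import deque

# Hypothetical site: each path maps to the links found on that page.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": [],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return SITE.get(url, [])


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl that visits each URL at most once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other, and `max_pages` bounds the crawl on very large sites.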

Web Data

Web data is any information that is publicly accessible on the internet. It includes website content, social media posts, public APIs, government records, and any other data available through web protocols.

Web Scraping

Web scraping is the automated extraction of data from websites. It involves programmatically fetching web pages and parsing their content to collect structured information.

Webhook

A webhook is an HTTP callback that delivers data to a specified URL when an event occurs. Unlike polling, webhooks push data in real time, enabling event-driven architectures.

X

XPath