extract_text

Extracts clean, readable text from HTML. Scripts, styles, and boilerplate are removed automatically while the main text content is preserved.

Use Cases

Article Extraction for LLMs

Extract clean article text for summarization, analysis, or AI processing

Content Analysis

Get plain text for word count, readability analysis, or sentiment detection

Clean Text for Summarization

Remove HTML noise before passing to summarization models

Boilerplate Removal

Remove ads, navigation, and other non-content elements

Endpoint

POST /api/v1/tools/extract_text

Auth Required

2 req/s on Free plan

1 credit
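To stay under the 2 req/s Free plan limit, requests can be spaced out on the client side. The following is a minimal sketch, not part of the API: the `Throttle` class and its parameter names are hypothetical, and it only assumes that sending at most one request every 0.5 seconds keeps a single client within the limit.

```python
import time
from typing import Callable


class Throttle:
    """Space calls at least 1/rate seconds apart (client side only)."""

    def __init__(
        self,
        rate_per_second: float = 2.0,  # Free plan limit stated on this page
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / rate_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may be sent; return seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

Calling `wait()` immediately before each request is enough for a single-threaded client. The clock and sleep functions are injectable only so the behavior can be checked without real delays.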

Parameters

Name	Type	Required	Default	Description
html	string	Optional	-	HTML content to extract text from (provide either html or url). Example: <html><body><h1>Hello World</h1></body></html>
url	string	Optional	-	URL to fetch and extract text from (provide either html or url). Example: https://example.com/article
selector	string	Optional	-	CSS selector to target specific elements (default: entire page). Example: article, .content, #main
clean	boolean	Optional	true	Remove extra whitespace and normalize formatting. Example: true
preserve_links	boolean	Optional	false	Include links in the extracted text with their URLs. Example: false
preserve_formatting	boolean	Optional	false	Preserve basic HTML formatting (paragraphs, line breaks). Example: false
max_length	number	Optional	-	Maximum length of the extracted text; longer text is truncated with "...". Example: 5000

Note: You must provide either html or url. If both are provided, html takes precedence.
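The input rule above can be enforced before a request is sent. This is a minimal sketch under stated assumptions: the helper name `build_extract_text_payload` is hypothetical, and it assumes the endpoint accepts a JSON object whose keys match the parameter names in the table.

```python
from typing import Any, Dict, Optional

# Optional parameters listed in the table above.
ALLOWED_OPTIONS = {
    "selector", "clean", "preserve_links", "preserve_formatting", "max_length",
}


def build_extract_text_payload(
    html: Optional[str] = None,
    url: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build a request body for extract_text with exactly one input source."""
    if not html and not url:
        # The API would answer 400 Bad Request here; fail early instead.
        raise ValueError("provide either html or url")

    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    # html takes precedence when both are given, so url is dropped to keep
    # the request unambiguous.
    payload: Dict[str, Any] = {"html": html} if html else {"url": url}
    # Omit options left as None so the server applies its documented defaults.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload
```

Dropping `url` when `html` is present mirrors the documented precedence on the client, so the body that is sent matches what the server will actually use.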

Request Examples

cURL - Extract from URL

terminal (Bash)

TypeScript - Extract from HTML

extractText.ts (TypeScript)

Python - Extract with Selector

extract_text.py (Python)
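The original snippet for this example is not present on this page, so what follows is a sketch rather than the official sample. It uses only the Python standard library. The base URL `https://api.example.com`, the `Authorization: Bearer` header, and the JSON request body are assumptions: the page states only the path, the method, and that auth is required.

```python
import json
import urllib.request

# Assumptions: placeholder base URL and Bearer-token auth scheme.
BASE_URL = "https://api.example.com"
ENDPOINT = "/api/v1/tools/extract_text"


def build_request(api_key: str, url: str, selector: str) -> urllib.request.Request:
    """Prepare a POST request that extracts text from `url` within `selector`."""
    body = json.dumps({"url": url, "selector": selector, "clean": True})
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def extract_text(api_key: str, url: str, selector: str) -> str:
    """Send the request and return data.text from the JSON response."""
    request = build_request(api_key, url, selector)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"]["text"]
```

A call such as `extract_text("YOUR_API_KEY", "https://example.com/article", "article")` would then return the text of the first-level `article` content, provided the assumed base URL and auth header are replaced with the real ones.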

Response Example

200 OK (180 ms)

{
  "success": true,
  "data": {
    "text": "Article Title\n\nThis is the main content of the article. It contains useful information that has been extracted from the HTML.\n\nLinks:\nRelated Article (/related)",
    "metadata": {
      "title": "Article Title - Example Site",
      "description": "Meta description of the article",
      "word_count": 248,
      "character_count": 1432,
      "selector_used": "article",
      "links_preserved": true,
      "formatting_preserved": false
    }
  },
  "credits_used": 1,
  "credits_remaining": 999,
  "processing_time": 180
}

Field Descriptions

data.text: The extracted plain text content

data.metadata.word_count: Total number of words in the extracted text

data.metadata.character_count: Total number of characters

data.metadata.selector_used: The CSS selector that was applied

credits_used: Credits deducted for this request (1 per extraction)
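The fields described above can be read into a small typed structure. This is an illustrative sketch based only on the response example on this page: the `ExtractTextResult` class and the `parse_extract_text_response` function are hypothetical names, and treating a false or missing `success` flag as an error is an assumption, since the shape of error responses is not documented here.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractTextResult:
    text: str
    word_count: int
    character_count: int
    selector_used: str
    credits_used: int


def parse_extract_text_response(raw: str) -> ExtractTextResult:
    """Pick the documented fields out of a 200 OK response body."""
    body = json.loads(raw)
    if not body.get("success"):
        # Assumption: a false or missing flag signals failure.
        raise ValueError("response did not report success")
    data = body["data"]
    metadata = data["metadata"]
    return ExtractTextResult(
        text=data["text"],
        word_count=metadata["word_count"],
        character_count=metadata["character_count"],
        selector_used=metadata["selector_used"],
        credits_used=body["credits_used"],
    )
```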

Error Handling

Missing Input (400 Bad Request)

Neither html nor url was provided. You must provide at least one.

Invalid Selector (400 Bad Request)

The CSS selector is invalid or matches no elements. Verify your selector syntax.

URL Fetch Failed (500 Internal Server Error)

Failed to fetch the URL. Check that the URL is accessible and returns HTML.
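The status codes above can be mapped to client-side exceptions. This is a sketch, assuming the caller already has the numeric HTTP status; the exception classes are hypothetical. Because the error response body is not documented on this page, the two 400 cases (missing input and invalid selector) cannot be told apart here and share one exception.

```python
class ExtractTextError(Exception):
    """Base class for the hypothetical client-side errors below."""


class BadRequestError(ExtractTextError):
    """400: html and url both missing, or the selector is invalid or matched nothing."""


class UrlFetchError(ExtractTextError):
    """500: the service could not fetch the given URL."""


def raise_for_status(status: int) -> None:
    """Raise the matching exception for a documented error status."""
    if status == 200:
        return
    if status == 400:
        raise BadRequestError(
            "400 Bad Request: provide html or url, and check the selector syntax"
        )
    if status == 500:
        raise UrlFetchError(
            "500 Internal Server Error: check that the URL is accessible and returns HTML"
        )
    # Any other status is outside what this page documents.
    raise ExtractTextError(f"unexpected status {status}")
```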

Credit Cost