extract_text
Extracts clean, readable text from raw HTML or from a fetched URL. Scripts, styles, and boilerplate such as ads and navigation are removed automatically, while the main text content is preserved.
Use Cases
Article Extraction for LLMs
Extract clean article text for summarization, analysis, or AI processing
Content Analysis
Get plain text for word count, readability analysis, or sentiment detection
Clean Text for Summarization
Remove HTML noise before passing to summarization models
Boilerplate Removal
Remove ads, navigation, and other non-content elements
Endpoint
/api/v1/tools/extract_text
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| html | string | Optional | - | HTML content to extract text from (provide either html or url). Example: <html><body><h1>Hello World</h1></body></html> |
| url | string | Optional | - | URL to fetch and extract text from (provide either html or url). Example: https://example.com/article |
| selector | string | Optional | - | CSS selector to target specific elements (default: entire page). Example: article, .content, #main |
| clean | boolean | Optional | true | Remove extra whitespace and normalize formatting. Example: true |
| preserve_links | boolean | Optional | false | Include links in the extracted text with their URLs. Example: false |
| preserve_formatting | boolean | Optional | false | Preserve basic HTML formatting (paragraphs, line breaks). Example: false |
| max_length | number | Optional | - | Maximum length of extracted text (will truncate with ...). Example: 5000 |
Provide either html or url. If both are provided, html takes precedence.
Request Examples
cURL - Extract from URL
TypeScript - Extract from HTML
Python - Extract with Selector
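The sketch below shows one way the "Extract with Selector" request might be assembled in Python using only the standard library. Several details are assumptions, because this page does not state them: the host `https://api.example.com` is a placeholder, the POST method and JSON body are guesses based on the parameter table, and the bearer-token `Authorization` header is hypothetical. The helper name `build_extract_request` is also illustrative, not part of any official SDK.

```python
import json
import urllib.request

# Assumption: this page gives only the path, so the host is a placeholder.
BASE_URL = "https://api.example.com"
ENDPOINT = "/api/v1/tools/extract_text"


def build_extract_request(api_key, *, html=None, url=None, **options):
    """Build (but do not send) a request for the extract_text endpoint."""
    if html is None and url is None:
        # Mirrors the documented "Missing Input" error, caught client-side.
        raise ValueError("provide either html or url")
    params = {"html": html, "url": url, **options}
    # Drop unset parameters so the documented server defaults apply.
    body = {key: value for key, value in params.items() if value is not None}
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; authentication is not described here.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",  # Assumption: the HTTP method is not stated on this page.
    )


# Fetch a URL, keep only the <article> element, and cap the output length.
request = build_extract_request(
    "YOUR_API_KEY",
    url="https://example.com/article",
    selector="article",
    max_length=5000,
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Building the request separately from sending it keeps the parameter handling easy to inspect and test without network access.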
Response Example
{
  "success": true,
  "data": {
    "text": "Article Title\n\nThis is the main content of the article. It contains useful information that has been extracted from the HTML.\n\nLinks:\nRelated Article (/related)",
    "metadata": {
      "title": "Article Title - Example Site",
      "description": "Meta description of the article",
      "word_count": 248,
      "character_count": 1432,
      "selector_used": "article",
      "links_preserved": true,
      "formatting_preserved": false
    }
  },
  "credits_used": 1,
  "credits_remaining": 999,
  "processing_time": 180
}
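As a rough illustration of consuming this response, the following Python sketch pulls the main fields into a small dataclass. The names `ExtractResult` and `parse_extract_response` are hypothetical, and treating `"success": false` as a failure signal is an assumption, since the shape of error bodies is not shown on this page.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractResult:
    text: str
    word_count: int
    character_count: int
    selector_used: str
    credits_used: int


def parse_extract_response(raw: str) -> ExtractResult:
    """Turn a JSON response body into an ExtractResult."""
    payload = json.loads(raw)
    # Assumption: unsuccessful calls set "success" to false.
    if not payload.get("success"):
        raise RuntimeError("extract_text did not report success")
    data = payload["data"]
    meta = data["metadata"]
    return ExtractResult(
        text=data["text"],
        word_count=meta["word_count"],
        character_count=meta["character_count"],
        selector_used=meta["selector_used"],
        credits_used=payload["credits_used"],
    )


# A trimmed copy of the response example, serialized for the demo.
sample = json.dumps({
    "success": True,
    "data": {
        "text": "Article Title\n\nThis is the main content of the article.",
        "metadata": {
            "word_count": 248,
            "character_count": 1432,
            "selector_used": "article",
        },
    },
    "credits_used": 1,
})

result = parse_extract_response(sample)
print(result.word_count, result.selector_used)
```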
data.text
The extracted plain text content
data.metadata.word_count
Total number of words in the extracted text
data.metadata.character_count
Total number of characters
data.metadata.selector_used
The CSS selector that was applied
credits_used
Credits deducted for this request (1 per extraction)
Error Handling
Missing Input (400 Bad Request)
Neither html nor url was provided. You must provide at least one.
Invalid Selector (400 Bad Request)
The CSS selector is invalid or matches no elements. Verify your selector syntax.
URL Fetch Failed (500 Internal Server Error)
Failed to fetch the URL. Check that the URL is accessible and returns HTML.
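Because two of the errors above share the 400 status, a client may need to look at what it sent to guess which one applies. The sketch below is one hedged way to do that in Python. The function name `explain_error` is hypothetical, and the logic relies only on the status codes listed above, since the error body format is not documented here.

```python
def explain_error(status: int, sent: dict) -> str:
    """Map a status code and the parameters sent to a troubleshooting hint."""
    if status == 400:
        if not sent.get("html") and not sent.get("url"):
            return "Missing input: provide html or url."
        if sent.get("selector"):
            # A selector was sent, so it is the most likely cause of the 400.
            return "Invalid selector: check the syntax and that it matches an element."
        return "Bad request: review the parameters."
    if status == 500:
        return "URL fetch failed: check that the URL is accessible and returns HTML."
    return f"Unexpected status {status}."


print(explain_error(400, {"url": "https://example.com/article", "selector": "div >"}))
print(explain_error(500, {"url": "https://example.com/missing"}))
```

Checking for a missing html and url before sending the request avoids spending a round trip on the first case.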