process_document
Process PDF, DOCX, and TXT documents with text extraction, image extraction, and optional OCR. Suited to parsing academic papers, invoices, and forms, and to pipelines that handle several document formats.
Use Cases
Document Parsing
Extract text and metadata from PDFs, Word documents, and text files
Academic Research
Process research papers, theses, and academic publications for analysis
Invoice Processing
Extract structured data from invoices, receipts, and financial documents
Form Extraction
Parse application forms, surveys, and questionnaires
Legal Documents
Extract text from contracts, agreements, and legal filings
Scanned Document OCR
Convert scanned images and PDFs to searchable text with OCR
Endpoint
/api/v1/tools/process_document
Parameters
Name | Type | Required | Default | Description |
---|---|---|---|---|
source | string | Required | - | The document source (URL or file path, depending on sourceType). Example: https://example.com/document.pdf |
sourceType | string | Required | - | Type of source: "url", "pdf_url", "file", or "pdf_file". Example: pdf_url |
options | object | Optional | - | Processing options. Example: {"extractImages": true, "ocrEnabled": false} |
options.extractImages | boolean | Optional | false | Whether to extract images from the document. Example: true |
options.ocrEnabled | boolean | Optional | false | Enable OCR for scanned documents (adds 2 credits per page). Example: false |
options.maxPages | number | Optional | all pages | Maximum number of pages to process. Example: 10 |
Request Examples
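A minimal Python sketch of a request, built from the parameter table above. The host https://api.example.com, the POST method, the JSON body, and the Bearer-token Authorization header are all assumptions for illustration; this page documents only the path and the parameters, so check your account's API settings for the real values.

```python
import json
import urllib.request

# Assumption: placeholder host; this page only documents the path.
BASE_URL = "https://api.example.com"
ENDPOINT = "/api/v1/tools/process_document"


def build_request(api_key, source, source_type, options=None):
    """Build (but do not send) a process_document request."""
    allowed = {"url", "pdf_url", "file", "pdf_file"}
    if source_type not in allowed:
        raise ValueError(f"sourceType must be one of {sorted(allowed)}")

    payload = {"source": source, "sourceType": source_type}
    if options:
        payload["options"] = options

    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: Bearer auth; the auth scheme is not documented here.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",  # Assumption: the HTTP method is not stated on this page.
    )


request = build_request(
    "YOUR_API_KEY",
    "https://example.com/document.pdf",
    "pdf_url",
    {"extractImages": True, "ocrEnabled": False, "maxPages": 10},
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Building the request separately from sending it makes the payload easy to inspect before any credits are spent.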
Response Example
{
  "success": true,
  "data": {
    "pages": [
      {
        "pageNumber": 1,
        "text": "Introduction\n\nThis research paper explores the applications of machine learning...",
        "wordCount": 523,
        "images": [
          "image_1_base64..."
        ]
      },
      {
        "pageNumber": 2,
        "text": "Methodology\n\nOur approach involves collecting data from multiple sources...",
        "wordCount": 612,
        "images": []
      }
    ],
    "metadata": {
      "title": "Machine Learning Applications in Healthcare",
      "author": "Dr. Jane Smith",
      "creationDate": "2024-01-15",
      "pageCount": 10,
      "fileSize": 2456789,
      "format": "PDF"
    },
    "extractedText": "Introduction\n\nThis research paper explores the applications of machine learning...\n\nMethodology\n\nOur approach involves...",
    "images": [
      "image_1_base64..."
    ],
    "totalPages": 10,
    "processedPages": 10
  },
  "credits_used": 20,
  "credits_remaining": 980,
  "processing_time": 3450
}
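Given a response shaped like the example above, a small helper can pull out the fields most callers need. This is a sketch only: summarize_response is a hypothetical helper name, and the sample dict is a trimmed copy of the example response rather than real API output.

```python
def summarize_response(response):
    """Return per-page word counts and credit usage from a parsed response."""
    if not response.get("success"):
        raise ValueError("response did not report success")

    data = response["data"]
    pages = data["pages"]
    return {
        "title": data["metadata"].get("title"),
        "words_per_page": {p["pageNumber"]: p["wordCount"] for p in pages},
        "pages_with_images": [p["pageNumber"] for p in pages if p["images"]],
        "processed_pages": data["processedPages"],
        "credits_used": response["credits_used"],
    }


# Trimmed copy of the example response on this page.
sample = {
    "success": True,
    "data": {
        "pages": [
            {"pageNumber": 1, "text": "Introduction...", "wordCount": 523,
             "images": ["image_1_base64..."]},
            {"pageNumber": 2, "text": "Methodology...", "wordCount": 612,
             "images": []},
        ],
        "metadata": {"title": "Machine Learning Applications in Healthcare"},
        "totalPages": 10,
        "processedPages": 10,
    },
    "credits_used": 20,
    "credits_remaining": 980,
    "processing_time": 3450,
}

summary = summarize_response(sample)
```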
data.pages
Array of page objects with text and images per page
data.metadata
Document metadata (title, author, dates, format)
data.extractedText
Combined text from all pages
data.images
Array of extracted images in base64 format (if extractImages: true)
data.totalPages
Total number of pages in the document
credits_used
Credits deducted (2 per page × 10 pages = 20 credits)
processing_time
Total processing time in milliseconds
Error Handling
Unsupported Format (400 Bad Request)
The document format is not supported. Supported formats: PDF, DOCX, TXT.
File Too Large (413 Payload Too Large)
The document exceeds the maximum file size of 50MB. Split large documents into smaller files.
Corrupted Document (422 Unprocessable Entity)
The document is corrupted or password-protected. Ensure the file is valid and not encrypted.
Insufficient Credits (402 Payment Required)
Your account doesn't have enough credits for this document (need {pageCount} × 2 credits). Purchase more credits.
Rate Limit Exceeded (429 Too Many Requests)
You've exceeded your plan's rate limit. Wait a moment or upgrade your plan for higher limits.
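The status codes above can be folded into a small lookup that decides whether a retry is worthwhile. This is a sketch, assuming the API returns these HTTP status codes as listed; classify_error and the retry policy (only 429 is retried) are illustrative choices, not documented behavior.

```python
# Status codes and meanings as listed in the Error Handling section.
ERRORS = {
    400: "Unsupported format (supported: PDF, DOCX, TXT)",
    402: "Insufficient credits",
    413: "File too large (limit: 50MB)",
    422: "Corrupted or password-protected document",
    429: "Rate limit exceeded",
}

# Assumption: only rate-limit errors are worth retrying with the same request.
RETRYABLE = {429}


def classify_error(status_code):
    """Map an HTTP status code to (message, should_retry)."""
    message = ERRORS.get(status_code, f"Unexpected status {status_code}")
    return message, status_code in RETRYABLE
```

The other four errors need a change on the caller's side (a different file, a smaller file, or more credits), so resending the same request would not help.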
Credit Cost
Each page costs 2 credits; enabling OCR adds 2 more credits per page.
Example: 10-page PDF = 20 credits (or 40 credits with OCR)
Free Plan: 1,000 credits/month = 500 pages (or 250 pages with OCR)
Hobby Plan: 5,000 credits/month = 2,500 pages ($19/mo)
Professional Plan: 50,000 credits/month = 25,000 pages ($99/mo)
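The arithmetic above can be sketched as a small estimator. It assumes the rates stated on this page (2 credits per page, plus 2 per page with OCR) and no other charges; estimate_credits and pages_per_month are hypothetical helper names, not part of the API.

```python
BASE_CREDITS_PER_PAGE = 2  # from the credits_used example: 2 per page
OCR_CREDITS_PER_PAGE = 2   # from options.ocrEnabled: adds 2 credits per page


def credits_per_page(ocr=False):
    """Credits charged for one page, with or without OCR."""
    return BASE_CREDITS_PER_PAGE + (OCR_CREDITS_PER_PAGE if ocr else 0)


def estimate_credits(pages, ocr=False):
    """Estimate the credits needed to process a document of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * credits_per_page(ocr)


def pages_per_month(monthly_credits, ocr=False):
    """How many whole pages a monthly credit allowance covers."""
    return monthly_credits // credits_per_page(ocr)


print(estimate_credits(10))             # 20
print(estimate_credits(10, ocr=True))   # 40
print(pages_per_month(1000))            # 500
print(pages_per_month(1000, ocr=True))  # 250
```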