CrawlForge
Advanced Tool | 2 credits per page

process_document

Process PDF, DOCX, and TXT documents with text extraction, image extraction, and optional OCR support. Well suited to parsing academic papers, invoices, and forms, and to multi-format document processing.

Use Cases

Document Parsing

Extract text and metadata from PDFs, Word documents, and text files

Academic Research

Process research papers, theses, and academic publications for analysis

Invoice Processing

Extract structured data from invoices, receipts, and financial documents

Form Extraction

Parse application forms, surveys, and questionnaires

Legal Documents

Extract text from contracts, agreements, and legal filings

Scanned Document OCR

Convert scanned images and PDFs to searchable text with OCR

Endpoint

POST /api/v1/tools/process_document
Auth required
Rate limit: 2 req/s on the Free plan
Cost: 2 credits per page

Parameters

Name (type, required or optional, default)

source (string, required, no default)
The document source (URL or file path, depending on sourceType)
Example: https://example.com/document.pdf

sourceType (string, required, no default)
Type of source: "url", "pdf_url", "file", or "pdf_file"
Example: pdf_url

options (object, optional, no default)
Processing options
Example: {"extractImages": true, "ocrEnabled": false}

options.extractImages (boolean, optional, default: false)
Whether to extract images from the document
Example: true

options.ocrEnabled (boolean, optional, default: false)
Enable OCR for scanned documents (adds 2 credits per page)
Example: false

options.maxPages (number, optional, default: all pages)
Maximum number of pages to process
Example: 10

Request Examples

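A request can be sketched in Python with only the standard library. The source confirms the POST method, the path, and the body parameters; the host `https://api.example.com`, the `Authorization: Bearer` header scheme, and the helper names `build_request` and `send` are assumptions made for illustration.

```python
import json
import urllib.request

# Assumption: placeholder host. The documentation only gives the path.
BASE_URL = "https://api.example.com"
ENDPOINT = "/api/v1/tools/process_document"
VALID_SOURCE_TYPES = {"url", "pdf_url", "file", "pdf_file"}


def build_request(source, source_type, api_key, *, extract_images=False,
                  ocr_enabled=False, max_pages=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for process_document."""
    if source_type not in VALID_SOURCE_TYPES:
        raise ValueError(f"unsupported sourceType: {source_type!r}")

    options = {"extractImages": extract_images, "ocrEnabled": ocr_enabled}
    if max_pages is not None:
        options["maxPages"] = max_pages  # omitted means all pages

    body = {"source": source, "sourceType": source_type, "options": options}
    return urllib.request.Request(
        base_url + ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth. Check your account's scheme.
            "Authorization": f"Bearer {api_key}",
        },
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Usage (performs a network call, so it is left commented out):
# req = build_request("https://example.com/document.pdf", "pdf_url",
#                     "YOUR_API_KEY", max_pages=10)
# result = send(req)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any credits are spent.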

Response Example

200 OK (3450 ms)
{
  "success": true,
  "data": {
    "pages": [
      {
        "pageNumber": 1,
        "text": "Introduction\n\nThis research paper explores the applications of machine learning...",
        "wordCount": 523,
        "images": [
          "image_1_base64..."
        ]
      },
      {
        "pageNumber": 2,
        "text": "Methodology\n\nOur approach involves collecting data from multiple sources...",
        "wordCount": 612,
        "images": []
      }
    ],
    "metadata": {
      "title": "Machine Learning Applications in Healthcare",
      "author": "Dr. Jane Smith",
      "creationDate": "2024-01-15",
      "pageCount": 10,
      "fileSize": 2456789,
      "format": "PDF"
    },
    "extractedText": "Introduction\n\nThis research paper explores the applications of machine learning...\n\nMethodology\n\nOur approach involves...",
    "images": [
      "image_1_base64..."
    ],
    "totalPages": 10,
    "processedPages": 10
  },
  "credits_used": 20,
  "credits_remaining": 980,
  "processing_time": 3450
}

Field Descriptions

data.pages: Array of page objects with text and images per page
data.metadata: Document metadata (title, author, dates, format)
data.extractedText: Combined text from all pages
data.images: Array of extracted images in base64 format (if extractImages: true)
data.totalPages: Total number of pages in the document
credits_used: Credits deducted (2 per page × 10 pages = 20 credits)
processing_time: Total processing time in milliseconds
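The fields above can be read out of a decoded response as in the following sketch. The helper name `summarize_response` is hypothetical, and treating `processedPages < totalPages` as a sign of a partial run (for example when `maxPages` is set) is an assumption about how those two fields relate, not documented behaviour.

```python
def summarize_response(payload):
    """Collect per-page word counts and basic totals from a response dict."""
    if not payload.get("success"):
        raise ValueError("request did not succeed")

    data = payload["data"]
    word_counts = {p["pageNumber"]: p["wordCount"] for p in data["pages"]}
    return {
        "title": data["metadata"].get("title"),
        "word_counts": word_counts,
        "total_words": sum(word_counts.values()),
        "image_count": len(data.get("images", [])),
        # Assumption: fewer processed pages than total pages means a partial run.
        "partial": data["processedPages"] < data["totalPages"],
        "credits_used": payload["credits_used"],
    }


# Trimmed copy of the response example above (only two pages are shown there).
sample = {
    "success": True,
    "data": {
        "pages": [
            {"pageNumber": 1, "text": "Introduction...", "wordCount": 523,
             "images": ["image_1_base64..."]},
            {"pageNumber": 2, "text": "Methodology...", "wordCount": 612,
             "images": []},
        ],
        "metadata": {"title": "Machine Learning Applications in Healthcare"},
        "images": ["image_1_base64..."],
        "totalPages": 10,
        "processedPages": 10,
    },
    "credits_used": 20,
}

summary = summarize_response(sample)
print(summary["total_words"])  # 1135 for the two pages in the sample
```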

Error Handling

Unsupported Format (400 Bad Request)

The document format is not supported. Supported formats: PDF, DOCX, TXT.

File Too Large (413 Payload Too Large)

The document exceeds the maximum file size of 50MB. Split large documents into smaller files.

Corrupted Document (422 Unprocessable Entity)

The document is corrupted or password-protected. Ensure the file is valid and not encrypted.

Insufficient Credits (402 Payment Required)

Your account doesn't have enough credits for this document. Processing requires 2 credits per page (pageCount × 2). Purchase more credits.

Rate Limit Exceeded (429 Too Many Requests)

You've exceeded your plan's rate limit. Wait a moment or upgrade your plan for higher limits.
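One way a client might act on these status codes is sketched below. Only 429 is treated as retryable, since the other errors describe problems with the document or the account that a retry will not fix. The names `classify` and `call_with_retry`, the short error labels, and the backoff values are illustrative assumptions; `do_request` stands for any callable that returns a `(status, body)` pair.

```python
import time

# Status codes from the error list above: (short label, retryable?).
ERRORS = {
    400: ("unsupported_format", False),
    402: ("insufficient_credits", False),
    413: ("file_too_large", False),
    422: ("corrupted_or_encrypted", False),
    429: ("rate_limited", True),
}


def classify(status):
    """Map an HTTP status to a (label, retryable) pair."""
    return ERRORS.get(status, ("unknown_error", False))


def call_with_retry(do_request, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call do_request() until it returns 200, retrying only on 429.

    The wait doubles after each rate-limited attempt (0.5 s, 1 s, 2 s, ...).
    """
    for attempt in range(max_attempts):
        status, body = do_request()
        if status == 200:
            return body
        label, retryable = classify(status)
        if not retryable or attempt == max_attempts - 1:
            raise RuntimeError(f"{label} (HTTP {status})")
        sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.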

Credit Cost

2 credits per page (4 credits per page with OCR)
Each page processed costs 2 credits. Enable OCR for an additional 2 credits per page.

Example: 10-page PDF = 20 credits (or 40 credits with OCR)

Free Plan: 1,000 credits/month = 500 pages (or 250 pages with OCR)

Hobby Plan: 5,000 credits/month = 2,500 pages ($19/mo)

Professional Plan: 50,000 credits/month = 25,000 pages ($99/mo)
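The arithmetic above can be checked with a small calculator. The function names are hypothetical, and the `max_pages` argument assumes that billing follows the number of pages actually processed, which the source does not state explicitly.

```python
CREDITS_PER_PAGE = 2          # base cost stated above
OCR_SURCHARGE_PER_PAGE = 2    # additional cost per page when OCR is enabled


def credits_per_page(ocr=False):
    """Credits charged for one page, with or without OCR."""
    return CREDITS_PER_PAGE + (OCR_SURCHARGE_PER_PAGE if ocr else 0)


def estimate_credits(page_count, ocr=False, max_pages=None):
    """Estimate the cost of one document.

    Assumption: when max_pages is set, only the processed pages are billed.
    """
    pages = page_count if max_pages is None else min(page_count, max_pages)
    return pages * credits_per_page(ocr)


def pages_affordable(credits, ocr=False):
    """Number of whole pages a credit balance covers."""
    return credits // credits_per_page(ocr)


print(estimate_credits(10))            # 20
print(estimate_credits(10, ocr=True))  # 40
print(pages_affordable(1000))          # 500
print(pages_affordable(1000, ocr=True))  # 250
```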

Related Tools