How to Extract Text from Scanned PDFs with OCR
Extracting text from a scanned PDF requires OCR -- optical character recognition. Modern OCR engines achieve 98%+ accuracy on clean scans, support 100+ languages, and can process a 20-page document in under a minute. The workflow is simple: upload the scanned PDF, run OCR, and download a searchable PDF with the extracted text embedded invisibly behind the original page images. This guide covers how OCR actually works, what affects accuracy, how to pick the right approach, and where it still fails in 2026.
The Quick Answer: How to Extract Text from a Scanned PDF
- Open a tool that supports OCR (DocuHub's OCR tool is one option).
- Upload your scanned PDF.
- Select the source language (or let auto-detection handle it).
- Run OCR.
- Download a searchable PDF with embedded text, or export as plain text, Word, or Excel.
For a typical 20-page scanned document, the whole process takes 20-60 seconds. The output looks identical to the original but is now fully searchable and copy-pasteable.
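If you would rather run the same steps locally than upload a file, here is a minimal Python sketch. It assumes the open-source OCRmyPDF command-line tool (built on Tesseract) is installed, which this guide does not otherwise cover. The function names and file names are illustrative.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_ocr_command(src: Path, dst: Path, language: str = "eng",
                      sidecar: Optional[Path] = None) -> List[str]:
    """Build an OCRmyPDF command line for one scanned PDF."""
    # --language selects the recognition language; --deskew straightens pages.
    cmd = ["ocrmypdf", "--language", language, "--deskew"]
    if sidecar is not None:
        # --sidecar also writes the recognized text to a plain-text file.
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(src), str(dst)]
    return cmd


def run_ocr(src: Path, dst: Path, language: str = "eng") -> None:
    """Run OCR locally; raises if OCRmyPDF is missing or exits non-zero."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(src, dst, language), check=True)
```

Calling `run_ocr(Path("scan.pdf"), Path("searchable.pdf"))` would write a searchable copy next to the original. Building the command separately from running it makes it easy to inspect or log before anything executes.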
What OCR Actually Does (and Doesn't)
OCR converts images of text into machine-readable text. It doesn't change how the document looks -- it adds a hidden text layer on top of (or behind) the existing page image, so the page still renders identically but Ctrl-F search now works.
What OCR produces:
- Searchable PDF: The visual page is unchanged, but text is now indexable. This is the most common output.
- Plain text (.txt): Just the extracted characters, no layout.
- Structured text (Word, Excel): Tables are reconstructed, paragraphs separated, headings detected.
What OCR does NOT do:
- Correct the original document's content. If the scan has typos, OCR faithfully extracts them.
- Guarantee 100% accuracy on poor-quality scans. Faded text, skewed pages, and handwriting all reduce accuracy.
- Understand the meaning of the content. It extracts characters, not intent. Post-processing (AI summarization, data extraction) is a separate step.
How Modern OCR Works Under the Hood
OCR has evolved from rule-based character matching to deep learning. The 2026 stack looks roughly like this:
- Image preprocessing: Deskew (fix rotated pages), denoise, binarize (convert to black-and-white), and normalize contrast.
- Page layout analysis: Detect text regions, tables, images, columns, and reading order.
- Line and word segmentation: Split text regions into lines and lines into words.
- Character recognition: A neural network reads each line and outputs predicted characters. Modern models use transformer architectures that process entire lines at once, understanding context (e.g., "th3" in English is probably "the").
- Post-processing: Language models correct obvious errors using context. "Rece1pt" becomes "Receipt."
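To make the post-processing step concrete, here is a toy Python sketch. It is not how production engines work: they use learned language models, while this swaps in a small lookup table of digit-for-letter confusions and a tiny word list, both made up for illustration.

```python
from itertools import product

# Common digit-for-letter confusions (illustrative, not exhaustive).
CONFUSABLE = {"0": "o", "1": "il", "3": "e", "5": "s", "8": "b"}

# Toy vocabulary; a real system would use a language model or a full lexicon.
VOCAB = {"the", "receipt", "total", "invoice"}


def correct_token(token: str, vocab: set = VOCAB) -> str:
    """Replace confusable digits with letters if that yields a known word."""
    if token.isdigit() or not any(ch.isdigit() for ch in token):
        return token  # pure numbers and digit-free words are left alone
    # For each character, list the letters it could stand for.
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    for candidate in product(*options):
        word = "".join(candidate)
        if word.lower() in vocab:
            return word
    return token  # no known word found; keep the OCR output as-is


def correct_text(text: str) -> str:
    """Apply token-level correction to whitespace-separated text."""
    return " ".join(correct_token(tok) for tok in text.split())
```

With this table, `correct_token("Rece1pt")` yields "Receipt" and `correct_token("th3")` yields "the", while a genuine number such as "2026" passes through untouched. The sketch ignores punctuation attached to words, which a real pipeline would need to handle.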
The shift from character-by-character OCR to context-aware neural OCR happened around 2020. The accuracy jump was dramatic -- from 85-90% on real-world documents to 95-99%.
Accuracy: What to Expect by Document Type
Real-world OCR accuracy varies significantly by input quality:
| Document Type | Typical Accuracy | Notes |
|---|---|---|
| Clean printed text, 300 DPI | 98-99.5% | Near-perfect; errors mostly on rare words |
| Office printouts, 200 DPI | 95-98% | Good; occasional errors on proper nouns |
| Photocopied documents | 92-96% | Fading, noise degrade results |
| Photos of documents (phone camera) | 88-94% | Skew, lighting, focus all matter |
| Old books and historical documents | 85-93% | Non-standard fonts, paper degradation |
| Handwritten cursive | 70-85% | Highly dependent on handwriting quality |
| Handwritten print | 85-92% | Better than cursive; still variable |
| Receipts (thermal printer) | 80-90% | Faded ink and small text reduce accuracy |
Accuracy is typically measured at the character level. A 99% character accuracy means roughly 1 error per 100 characters -- so a 300-word page (~1,500 characters) might have 15 character-level errors. Spell-check catches most of these; errors inside numbers, names, and codes tend to slip through.
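The per-page arithmetic, plus one common way to measure character accuracy yourself (edit distance against a hand-checked reference), can be sketched in Python. The function names are illustrative, and published benchmarks may normalize whitespace or case differently.

```python
def expected_errors(char_count: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given character accuracy."""
    return char_count * (1.0 - accuracy)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """1 minus the character error rate, measured against the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, ocr_output) / len(reference)
```

For the page described above, `round(expected_errors(1500, 0.99))` gives 15. To check a tool on your own documents, transcribe a page by hand and pass it as `reference` alongside the OCR output.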
Language Support in 2026
DocuHub's OCR supports 100+ languages, which is now typical for modern OCR services. Accuracy by language tier:
- Tier 1 (99%+ accuracy): English, Spanish, French, German, Italian, Portuguese, Dutch, most European languages using Latin script.
- Tier 2 (96-99% accuracy): Russian and Cyrillic languages, Greek, Turkish, Vietnamese.
- Tier 3 (93-97% accuracy): Chinese (simplified and traditional), Japanese, Korean -- complex scripts benefit from dedicated models.
- Tier 4 (90-96% accuracy): Arabic, Hebrew, Persian, Hindi, Thai -- right-to-left and complex ligature scripts.
- Lower tiers: Less-resourced languages, historical scripts, minority languages.
Modern OCR handles mixed-language documents (e.g., English text with some Chinese characters) automatically -- it doesn't need a single language setting per page.
What Affects OCR Accuracy Most
If you're getting poor results, the problem is usually upstream of the OCR engine.
Resolution. Scans below 200 DPI lose too much detail. 300 DPI is the sweet spot for most documents; 400-600 DPI for small text. Scanning higher than 600 DPI rarely improves accuracy and dramatically increases file size.
Skew and orientation. Pages scanned at an angle reduce accuracy significantly. Good OCR tools auto-deskew, but the cleaner your input, the better.
Contrast and fading. Faded photocopies confuse character recognition. Increasing contrast before OCR helps; most tools do this automatically.
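As a rough illustration of what "increasing contrast" means, here is a minimal Python sketch of a min-max contrast stretch on a grayscale image stored as nested lists of 0-255 values. Actual tools likely use more robust methods (percentile clipping, adaptive thresholding); this is the simplest version of the idea.

```python
from typing import List


def stretch_contrast(pixels: List[List[int]]) -> List[List[int]]:
    """Rescale grayscale values so the darkest pixel maps to 0, brightest to 255."""
    flat = [p for row in pixels for p in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A uniform image has no contrast to stretch; return a copy.
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in pixels]
```

A faded scan whose values only span 100 to 160 would be spread across the full 0 to 255 range, which makes the gap between ink and paper easier for later steps to separate.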
Noise and artifacts. Speckles, fax artifacts, and dark scan edges trigger false character recognition. Denoising preprocessing helps.
Font. Standard fonts (Times, Arial, Calibri, common serif and sans-serif) have the highest accuracy. Decorative fonts, scripts, and handwriting-like fonts score lower.
Size. Text below 8pt in the scan (regardless of original size) is problematic. If the original is small, scan at higher DPI.
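One way to turn this rule of thumb into a number: a point is 1/72 of an inch, so text height in pixels is point size / 72 × DPI. The Python sketch below picks a scan resolution that gives small text the same pixel height as 8pt text at 300 DPI. Treating that as the baseline, and capping at 600 DPI, are assumptions drawn from the guidance above rather than settings of any particular tool.

```python
import math

POINTS_PER_INCH = 72  # typographic points per inch


def text_height_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of text of a given point size at a scan DPI."""
    return point_size / POINTS_PER_INCH * dpi


def recommended_dpi(point_size: float, baseline_pt: float = 8.0,
                    baseline_dpi: int = 300, max_dpi: int = 600) -> int:
    """DPI at which small text matches baseline_pt text scanned at baseline_dpi."""
    if point_size <= 0:
        raise ValueError("point_size must be positive")
    if point_size >= baseline_pt:
        return baseline_dpi  # normal-sized text: the default is enough
    needed = baseline_dpi * baseline_pt / point_size
    return min(max_dpi, math.ceil(needed))
```

Under these assumptions, 6pt text calls for 400 DPI and 4pt text for 600 DPI, which lines up with the 400-600 DPI range suggested for small text.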
OCR + AI: What Comes After Extraction
OCR extracts text. What you do with the text is where AI is increasingly relevant in 2026.
Structured data extraction. OCR produces raw text. AI models then pull specific fields -- invoice number, date, amounts, line items -- into structured output. This combination is why accounts payable automation went from 50 invoices per person per day to 400+.
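To show the shape of the output, here is a deliberately simple Python sketch that pulls three fields out of OCR text with regular expressions. This is a stand-in for the AI models described above, not a description of them: the patterns, field names, and sample invoice are made up for illustration, and real invoices vary far more than fixed patterns can handle.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for one invoice layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}

SAMPLE_TEXT = """ACME Supplies
Invoice No: INV-2041
Date: 2026-01-15
Widget x 3    45.00
Total: $1,245.00
"""


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None when it is absent."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields
```

Calling `extract_fields(SAMPLE_TEXT)` returns a dictionary with the invoice number, date, and total as strings. Returning `None` for missing fields, rather than raising, lets a downstream step decide which documents need human review.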
Document classification. After extracting text, AI classifies the document type (invoice, contract, receipt, medical record) and routes it to the right workflow.
Search and question-answering. Once a scanned document has extracted text, you can ask natural-language questions about it. DocuHub's document chat works on OCR'd PDFs exactly as it does on native PDFs.
Translation. Translate the extracted text to any supported language. Works well when combined with OCR language auto-detection.
Specific Workflows Where OCR Matters
Digitizing paper records. Organizations with decades of paper files (legal discovery, medical records, government archives) use OCR to make those documents searchable. The volume matters -- a law firm with 10 million pages of discovery can't manually review them, but can keyword-search an OCR'd corpus.
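Keyword search over an OCR'd corpus usually rests on an inverted index: a map from each word to the pages that contain it. Here is a minimal Python sketch, assuming the extracted text is already available as a dictionary of page ID to text. The page IDs and function names are illustrative, and production systems would use a dedicated search engine instead.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of page IDs whose OCR text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

The index is built once, so each query only touches the entries for its own words, and that is what makes searching millions of pages practical. Note that exact-match search inherits OCR errors: a word misread on the page will not be found under its correct spelling.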
Processing inbound documents. Invoices, purchase orders, and forms arrive as scans or photos. OCR is the first step in extracting structured data for downstream systems.
Accessibility compliance. Screen readers require actual text, not images. Running OCR on scanned PDFs makes them accessible to users with visual impairments. This is often a legal requirement (Section 508 in the US; the European Accessibility Act, which has applied in the EU since June 2025).
Research and analysis. Academic researchers, journalists, and analysts working with document archives need searchable text. OCR is table stakes for this work.
When OCR Still Fails in 2026
Despite the accuracy gains, OCR has clear failure modes:
- Very old handwriting. 18th-century cursive, degraded manuscripts, and non-standard scripts remain hard. Specialized models help but don't fully solve it.
- Mathematical notation. Equations, chemistry formulas, and technical diagrams aren't handled by general OCR. Specialized tools (Mathpix, InftyReader) exist for these.
- Severely damaged documents. Water damage, ink bleed, and heavy stains defeat most OCR. Sometimes manual transcription is the only option.
- Non-Latin handwriting. Handwritten Chinese, Arabic, Hindi, and similar scripts have much lower accuracy than printed versions.
- Documents with heavy visual design. Magazines, posters, and infographics with text overlaid on images confuse layout detection.
Privacy and OCR
When you send a scanned document to an OCR service, you're sending the document content. Privacy considerations:
- Data residency. Some regulations require processing in specific jurisdictions. Check where OCR servers are located.
- Retention. Look for services that delete inputs within 24 hours.
- Compliance. For HIPAA, GDPR, or similar regulated workloads, use services certified for those categories.
- On-premise options. For highly sensitive documents (classified material, attorney-client privileged content), consider running OCR locally. Tesseract (open-source) and commercial engines like ABBYY FineReader support on-premise deployment.
DocuHub's OCR processes files in memory, deletes them within 24 hours, and offers HIPAA-compliant processing on the enterprise tier.
Choosing the Right OCR Tool
For the common cases, browser-based OCR services (including DocuHub) handle 95%+ of needs. For specialized use:
- Very high volume (1M+ pages/year): Evaluate on-premise engines or enterprise APIs with volume pricing.
- Handwriting-heavy workflow: Try specialized handwriting OCR like Microsoft's Read API or Google Cloud Vision's handwriting mode.
- Mathematical or scientific content: Use Mathpix or InftyReader, not general OCR.
- Regulated industries: Confirm compliance certifications before committing.
- Multilingual documents: Verify your specific language combinations work well on the tool's test samples.
Key Takeaways
- OCR converts scanned images to searchable text; modern accuracy on clean scans is 98%+, with 100+ language support.
- Upload, select language, run OCR, download -- the typical workflow is under a minute for a 20-page document.
- Accuracy depends on input quality: 300 DPI, minimal skew, good contrast, standard fonts all matter more than the OCR engine.
- OCR extracts characters; AI workflows after OCR handle structured data extraction, classification, search, and translation.
- Handwriting, historical documents, math notation, and severely damaged documents remain hard in 2026.
- Privacy-sensitive documents should use services with short retention, data residency guarantees, or on-premise processing.
DocuHub's OCR supports 100+ languages, produces searchable PDFs with preserved layout, and integrates with our AI document chat so OCR'd files become queryable immediately.
Written by
DocuHub Team
We write about documents, AI, and the future of work. Our essays explore how technology is transforming the way organizations create, share, and manage knowledge.