deepseek-ai/DeepSeek-OCR-2

DeepSeek-OCR is a model designed to explore the boundaries of visual-text compression, investigating the role of vision encoders from an LLM-centric viewpoint.

What is DeepSeek-OCR?

DeepSeek-OCR is a multimodal large language model focused on Optical Character Recognition (OCR) and document understanding.

Instead of treating OCR as a standalone vision task, it approaches it from an LLM-centric perspective — integrating visual encoding and language modeling to improve structured text extraction, document parsing, and multimodal reasoning.

It supports:

  • Image-to-text conversion
  • Document-to-markdown transformation
  • Layout-aware OCR
  • Figure parsing
  • Visual grounding
  • High-resolution document understanding

🚀 Key Features

1️⃣ Multimodal OCR with LLM Backbone

  • Combines vision encoder + language model.
  • Supports image and PDF inference.
  • Optimized for high-resolution inputs.

2️⃣ Multiple Resolution Modes

Native resolution support:

  • Tiny: 512×512 (64 vision tokens)
  • Small: 640×640 (100 vision tokens)
  • Base: 1024×1024 (256 vision tokens)
  • Large: 1280×1280 (400 vision tokens)

Dynamic resolution mode:

  • Gundam mode: hybrid scaling (n×640×640 local tiles + 1×1024×1024 global view); see the mode-mapping sketch below
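
Each named mode corresponds to a fixed vision-token budget. As a rough illustration, the helper below maps mode names to the base_size / image_size / crop_mode parameters used in the repo's documented inference call; the table and the helper function are hypothetical conveniences, not part of the released code.

```python
# Hypothetical mode table: maps the named resolution modes above to the
# (base_size, image_size, crop_mode) arguments used by the repo's infer() call.
# The vision-token counts in the comments are those listed in this section.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},  # 64 tokens
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},  # 100 tokens
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},  # 256 tokens
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},  # 400 tokens
    # Gundam mode: dynamic tiling, n local 640×640 tiles plus one 1024×1024 global view.
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

def mode_kwargs(name: str) -> dict:
    """Return illustrative keyword arguments for a named resolution mode."""
    return RESOLUTION_MODES[name]
```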

3️⃣ vLLM Integration

  • Officially supported in upstream vLLM.
  • High-throughput PDF inference (~2500 tokens/s on A100-40G).
  • Supports prefix caching and logits processors (a minimal usage sketch follows this list).
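
The repo ships its own vLLM scripts (including model-specific settings such as a logits processor for PDF runs), so those remain the reference for high-throughput use. The snippet below is only a minimal sketch of offline inference through vLLM's generic multimodal API; the prompt string and sampling settings are assumptions, not taken from the official scripts.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Load the model through vLLM; the trust_remote_code flag may be unnecessary
# on vLLM versions where support is already upstream.
llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=2048)

# Plain free-form OCR on a single page image (prompt wording is illustrative).
image = Image.open("page.png").convert("RGB")
outputs = llm.generate(
    {"prompt": "<image>\nFree OCR.", "multi_modal_data": {"image": image}},
    sampling,
)
print(outputs[0].outputs[0].text)
```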

4️⃣ Transformers Compatibility

  • Fully usable with HuggingFace Transformers (a minimal example is sketched after this list).
  • FlashAttention2 support.
  • bfloat16 inference optimization.
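
Below is a minimal loading-and-inference sketch, assuming the usage pattern documented in the model's README: loading via AutoModel/AutoTokenizer with trust_remote_code and calling the repo's infer() helper. Argument names such as base_size, image_size, and crop_mode come from that remote code and may differ between releases, so treat this as illustrative rather than definitive.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-OCR"

# trust_remote_code is required because inference goes through the repo's custom model code.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # FlashAttention2 support, per the feature list
)
model = model.eval().cuda().to(torch.bfloat16)  # bfloat16 inference optimization

# Document-to-markdown prompt mode (see "Structured Output Modes" below).
prompt = "<image>\n<|grounding|>Convert the document to markdown."

# infer() is the convenience method exposed by the repo's remote code;
# argument names follow its README and may change between versions.
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="sample_page.png",
    output_path="./ocr_output",
    base_size=1024,   # Base mode: 1024×1024, 256 vision tokens
    image_size=640,   # tile size used in dynamic ("Gundam") mode
    crop_mode=True,   # enable dynamic tiling for large documents
    save_results=True,
)
print(result)
```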

5️⃣ Structured Output Modes

Example prompt modes (illustrative prompt strings follow this list):

  • Convert document to markdown
  • Free OCR (no layout constraints)
  • Figure parsing
  • Detailed image description
  • Object grounding
  • Visual reference localization
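
The strings below show how these modes are selected at the prompt level. They are modeled on the examples in the repo's README; exact wording and special tokens (e.g. <|grounding|>, <|ref|>) should be checked against the version you install.

```python
# Illustrative prompt strings for the output modes listed above.
# Modeled on the repo's README examples; special tokens may change between versions.
PROMPTS = {
    "markdown":  "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr":  "<image>\nFree OCR.",
    "figure":    "<image>\nParse the figure.",
    "describe":  "<image>\nDescribe this image in detail.",
    # Grounding / visual reference localization: the phrase inside the <|ref|> tags
    # is a placeholder query.
    "grounding": "<image>\nLocate <|ref|>the section title<|/ref|> in the image.",
}
```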

🧠 Technical Capabilities

  • Context-aware OCR
  • Layout-sensitive document parsing
  • Vision-token compression
  • Markdown conversion
  • Batch inference for benchmarks
  • Flash-Attention acceleration
  • GPU-optimized inference (CUDA 11.8 + Torch 2.6)

🎯 Use Cases

📄 Document Digitization

  • Scanned PDFs → structured markdown
  • Academic paper extraction
  • Legal and financial document processing

📊 Chart & Figure Parsing

  • Extract tables and diagrams
  • Convert figures to structured formats

🧾 Automated Data Entry

  • Invoices
  • Receipts
  • Forms

🔍 Visual Grounding

  • Locate referenced elements in images
  • Region-based understanding

🧪 Research

  • Studying vision-token compression
  • LLM-centric multimodal modeling

⚙️ Deployment & Inference Options

  • vLLM inference (recommended for high throughput)
  • HuggingFace Transformers inference
  • Batch evaluation scripts
  • PDF streaming OCR (a page-by-page sketch follows this list)
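
For PDFs, the repo's dedicated vLLM script is the recommended high-throughput path. As a rough alternative, the sketch below rasterizes pages with the external pdf2image package and reuses the Transformers infer() call from the earlier example, one page at a time; the page loop and the pdf2image dependency are assumptions of this sketch, not part of the repo.

```python
from pdf2image import convert_from_path  # external dependency, not shipped with the repo

# Assumes `model` and `tokenizer` are already loaded as in the Transformers sketch above.
pages = convert_from_path("report.pdf", dpi=200)

markdown_parts = []
for i, page in enumerate(pages):
    page_path = f"page_{i:04d}.png"
    page.save(page_path)
    # One page per call keeps GPU memory bounded; arguments mirror the earlier sketch.
    res = model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=page_path,
        output_path="./ocr_output",
        base_size=1024,
        image_size=640,
        crop_mode=True,
        save_results=False,
    )
    markdown_parts.append(res)

print("\n\n".join(str(p) for p in markdown_parts))
```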

📦 Installation Highlights

Environment:

  • CUDA 11.8
  • PyTorch 2.6.0
  • Python 3.12

Supports:

  • vLLM 0.8.5+
  • Flash-Attn 2.7.3

❓ FAQ (Based on Repo Information)

Q1: Is it open-source?

Yes. Licensed under MIT.

Q2: Does it support PDFs?

Yes. There is a dedicated PDF inference script.

Q3: Does it support layout-aware output?

Yes. It can convert documents into markdown while preserving structure.

Q4: Can it run with HuggingFace Transformers?

Yes. It supports AutoModel and AutoTokenizer.

Q5: Does it support batch processing?

Yes. Includes evaluation batch scripts.

Q6: Does it require GPU?

A GPU is strongly recommended. The reference environment targets CUDA 11.8, and the published throughput figures were measured on an A100-40G.


📄 Related

  • Model Download: HuggingFace
  • Paper: arXiv: 2510.18234
  • Successor: DeepSeek-OCR2 (Released Jan 27, 2026)

Similar to deepseek-ai/DeepSeek-OCR-2

  • zai-org/GLM-OCR: a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture.
  • PaddlePaddle/PaddleOCR-VL-1.5: the next-generation successor to PaddleOCR-VL, achieving a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5.