deepseek-ai/DeepSeek-OCR
DeepSeek-OCR is a model designed to explore the boundaries of visual-text compression, investigating the role of vision encoders from an LLM-centric viewpoint.

What is DeepSeek-OCR?
DeepSeek-OCR is a multimodal large language model focused on Optical Character Recognition (OCR) and document understanding.
Instead of treating OCR as a standalone vision task, it approaches it from an LLM-centric perspective — integrating visual encoding and language modeling to improve structured text extraction, document parsing, and multimodal reasoning.
It supports:
- Image-to-text conversion
- Document-to-markdown transformation
- Layout-aware OCR
- Figure parsing
- Visual grounding
- High-resolution document understanding
🚀 Key Features
1️⃣ Multimodal OCR with LLM Backbone
- Combines vision encoder + language model.
- Supports image and PDF inference.
- Optimized for high-resolution inputs.
2️⃣ Multiple Resolution Modes
Native resolution support:
- Tiny: 512×512 (64 vision tokens)
- Small: 640×640 (100 vision tokens)
- Base: 1024×1024 (256 vision tokens)
- Large: 1280×1280 (400 vision tokens)
Dynamic resolution mode:
- Gundam mode: hybrid scaling (n×640×640 local tiles + 1×1024×1024 global view); see the parameter sketch after this list
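A minimal sketch of how these named modes might map to inference arguments, assuming the repository's remote-code `infer` helper and its `base_size`/`image_size`/`crop_mode` parameters; verify the exact values against the repo's run scripts:

```python
# Assumed mode-to-argument mapping for the remote-code `infer` helper.
MODES = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),  # 64 vision tokens
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),  # 100 vision tokens
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),  # 256 vision tokens
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),  # 400 vision tokens
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),   # n×640 tiles + one 1024 global view
}
```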
3️⃣ vLLM Integration
- Officially supported in upstream vLLM.
- High-throughput PDF inference (~2500 tokens/s on A100-40G).
- Supports prefix caching and logits processors; a minimal usage sketch follows.
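A hedged sketch of single-image inference through vLLM's multimodal API. The model ID, prompt wording, and the `trust_remote_code` flag are assumptions to check against the repo's vLLM scripts:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Load any document page as an RGB image (hypothetical file name).
image = Image.open("page_1.png").convert("RGB")

llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=4096)

# vLLM's multimodal input format: prompt text plus an image payload.
outputs = llm.generate(
    {
        "prompt": "<image>\nConvert the document to markdown.",
        "multi_modal_data": {"image": image},
    },
    sampling,
)
print(outputs[0].outputs[0].text)
```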
4️⃣ Transformers Compatibility
- Fully usable with HuggingFace Transformers.
- FlashAttention2 support.
- bfloat16 inference optimization (see the loading sketch below).
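A minimal loading sketch with Transformers, modeled on typical DeepSeek-OCR examples; the `_attn_implementation` flag and the remote-code `infer` helper and its arguments should be checked against the current model card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # FlashAttention2 path
)
model = model.eval().cuda().to(torch.bfloat16)  # bfloat16 inference

# Run OCR on one image via the repository's remote-code helper.
model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="page_1.png",   # hypothetical input
    output_path="out/",        # results written here
    base_size=1024, image_size=640, crop_mode=True,  # Gundam mode
    save_results=True,
)
```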
5️⃣ Structured Output Modes
Example prompt modes (prompt strings sketched after this list):
- Convert document to markdown
- Free OCR (no layout constraints)
- Figure parsing
- Detailed image description
- Object grounding
- Visual reference localization
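The prompt strings below are assumptions modeled on the repository's published examples; the special tokens (`<image>`, `<|grounding|>`, `<|ref|>`) follow the model card, but check the run scripts for the canonical wording:

```python
# Assumed prompt strings for the structured output modes above.
PROMPTS = {
    "markdown":  "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr":  "<image>\nFree OCR.",
    "figure":    "<image>\nParse the figure.",
    "describe":  "<image>\nDescribe this image in detail.",
    "grounding": "<image>\nLocate <|ref|>the section title<|/ref|> in the image.",
}
```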
🧠 Technical Capabilities
- Context-aware OCR
- Layout-sensitive document parsing
- Vision-token compression
- Markdown conversion
- Batch inference for benchmarks
- Flash-Attention acceleration
- GPU-optimized inference (CUDA 11.8 + Torch 2.6)
🎯 Use Cases
📄 Document Digitization
- Scanned PDFs → structured markdown
- Academic paper extraction
- Legal and financial document processing
📊 Chart & Figure Parsing
- Extract tables and diagrams
- Convert figures to structured formats
🧾 Automated Data Entry
- Invoices
- Receipts
- Forms
🔍 Visual Grounding
- Locate referenced elements in images
- Region-based understanding
🧪 Research
- Studying vision-token compression
- LLM-centric multimodal modeling
⚙️ Deployment & Inference Options
- vLLM inference (recommended for high throughput)
- HuggingFace Transformers inference
- Batch evaluation scripts
- PDF streaming OCR (page-by-page sketch below)
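A hedged page-by-page PDF sketch, assuming the third-party pdf2image package (which requires poppler) and the `model`/`tokenizer` objects loaded in the Transformers sketch above:

```python
from pdf2image import convert_from_path  # third-party; requires poppler installed

# Render each PDF page to an image, then OCR the pages one by one.
pages = convert_from_path("report.pdf", dpi=144)  # hypothetical input file
for i, page in enumerate(pages):
    path = f"page_{i}.png"
    page.save(path)
    model.infer(                      # remote-code helper from the sketch above
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=path,
        output_path=f"out/page_{i}",  # per-page results written here
        base_size=1024, image_size=640, crop_mode=True,
        save_results=True,
    )
```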
📦 Installation Highlights
Environment:
- CUDA 11.8
- PyTorch 2.6.0
- Python 3.12
Key dependencies:
- vLLM 0.8.5+
- Flash-Attn 2.7.3
❓ FAQ (Based on Repo Information)
Q1: Is it open-source?
Yes. Licensed under MIT.
Q2: Does it support PDFs?
Yes. There is a dedicated PDF inference script.
Q3: Does it support layout-aware output?
Yes. It can convert documents into markdown while preserving structure.
Q4: Can it run with HuggingFace Transformers?
Yes. It supports AutoModel and AutoTokenizer.
Q5: Does it support batch processing?
Yes. Includes evaluation batch scripts.
Q6: Does it require GPU?
A GPU is strongly recommended. The reference setup targets CUDA 11.8, and the published throughput figures are measured on an A100-40G.
📄 Related
- Model download: Hugging Face (deepseek-ai/DeepSeek-OCR)
- Paper: arXiv:2510.18234
- Successor: DeepSeek-OCR2 (released Jan 27, 2026)

