deepseek-ai/DeepSeek-OCR-2

DeepSeek-OCR is a model designed to explore the boundaries of visual-text compression, investigating the role of vision encoders from an LLM-centric viewpoint.

What is DeepSeek-OCR?

DeepSeek-OCR is a multimodal large language model focused on Optical Character Recognition (OCR) and document understanding.

Instead of treating OCR as a standalone vision task, it approaches it from an LLM-centric perspective — integrating visual encoding and language modeling to improve structured text extraction, document parsing, and multimodal reasoning.

It supports:

  • Image-to-text conversion
  • Document-to-markdown transformation
  • Layout-aware OCR
  • Figure parsing
  • Visual grounding
  • High-resolution document understanding

🚀 Key Features

1️⃣ Multimodal OCR with LLM Backbone

  • Combines vision encoder + language model.
  • Supports image and PDF inference.
  • Optimized for high-resolution inputs.

2️⃣ Multiple Resolution Modes

Native resolution support:

  • Tiny: 512×512 (64 vision tokens)
  • Small: 640×640 (100 vision tokens)
  • Base: 1024×1024 (256 vision tokens)
  • Large: 1280×1280 (400 vision tokens)

Dynamic resolution mode:

  • Gundam mode: hybrid scaling (n×640×640 local tiles + 1×1024×1024 global view); see the mode-mapping sketch below
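
Each named mode corresponds to a fixed vision-token budget. As a rough illustration, the helper below maps mode names to the base_size / image_size / crop_mode parameters used in the repo's documented inference call; the table and the helper function are hypothetical conveniences, not part of the released code.

```python
# Hypothetical mode table: maps the named resolution modes above to the
# (base_size, image_size, crop_mode) arguments used by the repo's infer() call.
# The vision-token counts in the comments are those listed in this section.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},  # 64 tokens
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},  # 100 tokens
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},  # 256 tokens
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},  # 400 tokens
    # Gundam mode: dynamic tiling, n local 640×640 tiles plus one 1024×1024 global view.
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

def mode_kwargs(name: str) -> dict:
    """Return illustrative keyword arguments for a named resolution mode."""
    return RESOLUTION_MODES[name]
```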

3️⃣ vLLM Integration

  • Officially supported in upstream vLLM.
  • High-throughput PDF inference (~2500 tokens/s on A100-40G).
  • Supports prefix caching and logits processors (a minimal usage sketch follows this list).
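
The repo ships its own vLLM scripts (including model-specific settings such as a logits processor for PDF runs), so those remain the reference for high-throughput use. The snippet below is only a minimal sketch of offline inference through vLLM's generic multimodal API; the prompt string and sampling settings are assumptions, not taken from the official scripts.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Load the model through vLLM; the trust_remote_code flag may be unnecessary
# on vLLM versions where support is already upstream.
llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=2048)

# Plain free-form OCR on a single page image (prompt wording is illustrative).
image = Image.open("page.png").convert("RGB")
outputs = llm.generate(
    {"prompt": "<image>\nFree OCR.", "multi_modal_data": {"image": image}},
    sampling,
)
print(outputs[0].outputs[0].text)
```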

4️⃣ Transformers Compatibility

  • Fully usable with HuggingFace Transformers (a minimal example is sketched after this list).
  • FlashAttention2 support.
  • bfloat16 inference optimization.
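
Below is a minimal loading-and-inference sketch, assuming the usage pattern documented in the model's README: loading via AutoModel/AutoTokenizer with trust_remote_code and calling the repo's infer() helper. Argument names such as base_size, image_size, and crop_mode come from that remote code and may differ between releases, so treat this as illustrative rather than definitive.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-OCR"

# trust_remote_code is required because inference goes through the repo's custom model code.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # FlashAttention2 support, per the feature list
)
model = model.eval().cuda().to(torch.bfloat16)  # bfloat16 inference optimization

# Document-to-markdown prompt mode (see "Structured Output Modes" below).
prompt = "<image>\n<|grounding|>Convert the document to markdown."

# infer() is the convenience method exposed by the repo's remote code;
# argument names follow its README and may change between versions.
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="sample_page.png",
    output_path="./ocr_output",
    base_size=1024,   # Base mode: 1024×1024, 256 vision tokens
    image_size=640,   # tile size used in dynamic ("Gundam") mode
    crop_mode=True,   # enable dynamic tiling for large documents
    save_results=True,
)
print(result)
```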

5️⃣ Structured Output Modes

Example prompt modes (illustrative prompt strings follow this list):

  • Convert document to markdown
  • Free OCR (no layout constraints)
  • Figure parsing
  • Detailed image description
  • Object grounding
  • Visual reference localization
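
The strings below show how these modes are selected at the prompt level. They are modeled on the examples in the repo's README; exact wording and special tokens (e.g. <|grounding|>, <|ref|>) should be checked against the version you install.

```python
# Illustrative prompt strings for the output modes listed above.
# Modeled on the repo's README examples; special tokens may change between versions.
PROMPTS = {
    "markdown":  "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr":  "<image>\nFree OCR.",
    "figure":    "<image>\nParse the figure.",
    "describe":  "<image>\nDescribe this image in detail.",
    # Grounding / visual reference localization: the phrase inside the <|ref|> tags
    # is a placeholder query.
    "grounding": "<image>\nLocate <|ref|>the section title<|/ref|> in the image.",
}
```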

🧠 Technical Capabilities

  • Context-aware OCR
  • Layout-sensitive document parsing
  • Vision-token compression
  • Markdown conversion
  • Batch inference for benchmarks
  • Flash-Attention acceleration
  • GPU-optimized inference (CUDA 11.8 + Torch 2.6)

🎯 Use Cases

📄 Document Digitization

  • Scanned PDFs → structured markdown
  • Academic paper extraction
  • Legal and financial document processing

📊 Chart & Figure Parsing

  • Extract tables and diagrams
  • Convert figures to structured formats

🧾 Automated Data Entry

  • Invoices
  • Receipts
  • Forms

🔍 Visual Grounding

  • Locate referenced elements in images
  • Region-based understanding

🧪 Research

  • Studying vision-token compression
  • LLM-centric multimodal modeling

⚙️ Deployment & Inference Options

  • vLLM inference (recommended for high throughput)
  • HuggingFace Transformers inference
  • Batch evaluation scripts
  • PDF streaming OCR (a page-by-page sketch follows this list)
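
For PDFs, the repo's dedicated vLLM script is the recommended high-throughput path. As a rough alternative, the sketch below rasterizes pages with the external pdf2image package and reuses the Transformers infer() call from the earlier example, one page at a time; the page loop and the pdf2image dependency are assumptions of this sketch, not part of the repo.

```python
from pdf2image import convert_from_path  # external dependency, not shipped with the repo

# Assumes `model` and `tokenizer` are already loaded as in the Transformers sketch above.
pages = convert_from_path("report.pdf", dpi=200)

markdown_parts = []
for i, page in enumerate(pages):
    page_path = f"page_{i:04d}.png"
    page.save(page_path)
    # One page per call keeps GPU memory bounded; arguments mirror the earlier sketch.
    res = model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=page_path,
        output_path="./ocr_output",
        base_size=1024,
        image_size=640,
        crop_mode=True,
        save_results=False,
    )
    markdown_parts.append(res)

print("\n\n".join(str(p) for p in markdown_parts))
```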

📦 Installation Highlights

Environment:

  • CUDA 11.8
  • PyTorch 2.6.0
  • Python 3.12

Supports:

  • vLLM 0.8.5+
  • Flash-Attn 2.7.3

❓ FAQ (Based on Repo Information)

Q1: Is it open-source?

Yes. Licensed under MIT.

Q2: Does it support PDFs?

Yes. There is a dedicated PDF inference script.

Q3: Does it support layout-aware output?

Yes. It can convert documents into markdown while preserving structure.

Q4: Can it run with HuggingFace Transformers?

Yes. It supports AutoModel and AutoTokenizer.

Q5: Does it support batch processing?

Yes. Includes evaluation batch scripts.

Q6: Does it require GPU?

A GPU is strongly recommended. The reference environment targets CUDA 11.8, and the published throughput figures were measured on an A100-40G.


📄 Related

  • Model Download: HuggingFace
  • Paper: arXiv: 2510.18234
  • Successor: DeepSeek-OCR2 (Released Jan 27, 2026)

Similar to deepseek-ai/DeepSeek-OCR-2

  • zai-org/GLM-OCR: a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture.
  • PaddlePaddle/PaddleOCR-VL-1.5: the next-generation successor to PaddleOCR-VL, achieving a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5.