Baidu Qianfan Team Unveils Integrated Document AI Model
The Baidu Qianfan Team introduced Qianfan-OCR, a 4B-parameter end-to-end model designed to unify document parsing, layout analysis, and document understanding within a single vision-language architecture. Unlike traditional multi-stage OCR pipelines that chain separate modules for layout detection and text recognition, Qianfan-OCR performs direct image-to-Markdown conversion and supports prompt-driven tasks such as table extraction and document question answering. The system comprises three core components: the Vision Encoder (Qianfan-ViT), a Cross-Modal Adapter, and the Language Model Backbone (Qwen3-4B).
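The unified, prompt-driven interface described above can be sketched as follows. Everything in this snippet is a hypothetical placeholder (the function name, prompts, and return format are illustrative assumptions, not the real API); the point is that one end-to-end model handles distinct document tasks selected purely by the prompt, with no separate layout-detection or text-recognition stages.

```python
def qianfan_ocr(image: bytes, prompt: str) -> str:
    """Stand-in for a single end-to-end vision-language model call.

    A real call would run the 4B VLM on the image; this stub just records
    which task the prompt requested, to illustrate the interface shape.
    """
    return f"[output for: {prompt}]"

# One model, three tasks, steered only by the prompt:
page_md = qianfan_ocr(b"<png bytes>", "Convert this page to Markdown.")
tables  = qianfan_ocr(b"<png bytes>", "Extract every table as a Markdown table.")
answer  = qianfan_ocr(b"<png bytes>", "What is the total on this invoice?")
print(page_md)
```

By contrast, a multi-stage pipeline would pass only the extracted text between stages, so the downstream language model never sees the original image.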
A key feature of Qianfan-OCR is an optional thinking phase: through its Layout-as-Thought mechanism, the model can reason explicitly about page layout before generating its final output.
Performance Benchmarks and Efficiency Gains
Qianfan-OCR was evaluated against specialized OCR systems and general vision-language models (VLMs). It achieved a score of 93.12 on OmniDocBench v1.5, 79.8 on OlmOCR Bench, and 880 on OCRBench. The model also achieved the highest average score of 87.9 on public KIE benchmarks. Comparative testing revealed that two-stage OCR+LLM pipelines frequently struggle with tasks requiring spatial reasoning: all tested two-stage systems scored 0.0 on the CharXiv benchmark because the text-extraction stage discards the visual context that chart interpretation depends on.
Inference efficiency was measured in Pages Per Second (PPS) using a single NVIDIA A100 GPU. With W8A8 (AWQ) quantization, Qianfan-OCR achieved 1.024 PPS, representing a 2x speedup over the W16A16 baseline with minimal accuracy loss. The GPU-centric architecture of Qianfan-OCR avoids the inter-stage processing delays common in pipeline systems that rely on CPU-based layout analysis, enabling efficient large-batch inference.
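As a quick sanity check on the throughput figures above (the 1.024 PPS and 2x numbers come from the article; the baseline throughput and per-page latencies below are derived arithmetic, not reported measurements):

```python
# Reported figures: W8A8-quantized throughput on a single A100, and the
# claimed speedup over the unquantized W16A16 baseline.
quantized_pps = 1.024
speedup = 2.0

# Implied baseline throughput and per-page latencies (derived, not reported).
baseline_pps = quantized_pps / speedup           # ~0.512 pages/second
quantized_ms_per_page = 1000.0 / quantized_pps   # ~976.6 ms per page
baseline_ms_per_page = 1000.0 / baseline_pps     # ~1953.1 ms per page

print(round(baseline_pps, 3), round(quantized_ms_per_page, 1), round(baseline_ms_per_page, 1))
```

In other words, quantization roughly halves the per-page latency, which compounds with large-batch inference since the whole pipeline stays on the GPU.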
Architectural Advantages and Technical Details
The system’s architecture, built around Qianfan-ViT and Qwen3-4B, enables direct image-to-Markdown conversion, in contrast with traditional multi-stage pipelines. The Vision Encoder, Qianfan-ViT, processes image inputs (likely at resolutions such as 448×448) and feeds a Cross-Modal Adapter that bridges visual and textual representations. The Language Model Backbone, Qwen3-4B, is a 4.0B-parameter model with 36 layers and a 32K context window. Reported architectural details also include a two-layer MLP with GELU activation, consistent with a projection-style adapter.
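The adapter described above can be sketched as a minimal projection module. This is an illustrative assumption based on the reported two-layer MLP with GELU, in the style of LLaVA-type adapters, not the released implementation: the vision feature width (1024) is a guess, and 2560 is Qwen3-4B's hidden size to the best of our knowledge.

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class CrossModalAdapter:
    """Two-layer MLP with GELU that projects vision-encoder features
    into the language model's embedding space (projection-style sketch)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2560, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, vision_tokens: np.ndarray) -> np.ndarray:
        # vision_tokens: (num_patches, vision_dim) -> (num_patches, llm_dim)
        return gelu(vision_tokens @ self.w1 + self.b1) @ self.w2 + self.b2

# A 448x448 input with 14x14 patches would yield 32*32 = 1024 visual tokens.
adapter = CrossModalAdapter()
tokens = np.zeros((1024, 1024))
print(adapter(tokens).shape)  # (1024, 2560)
```

After projection, the visual tokens can be concatenated with text-prompt embeddings and fed to the language model backbone as a single sequence.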
Beyond its core capabilities, Qianfan-OCR supports prompt-driven tasks, demonstrating versatility in document understanding. Its direct image-to-Markdown conversion capability and sophisticated layout analysis, facilitated by the Layout-as-Thought mechanism, position it as a significant advancement in document AI. The integration of these features into a single, efficient model addresses limitations found in existing systems, particularly in handling complex documents and spatial reasoning tasks.