Baidu Qianfan Team Unveils Integrated Document AI Model
The Baidu Qianfan Team introduced Qianfan-OCR, a 4B-parameter end-to-end model designed to unify document parsing, layout analysis, and document understanding within a single vision-language architecture. Unlike traditional multi-stage OCR pipelines that chain separate modules for layout detection and text recognition, Qianfan-OCR performs direct image-to-Markdown conversion and supports prompt-driven tasks such as table extraction and document question answering. The system comprises three core components: the Vision Encoder (Qianfan-ViT), a Cross-Modal Adapter, and the Language Model Backbone (Qwen3-4B).
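The unified, prompt-driven interface described above can be sketched as follows. Everything in this snippet is a hypothetical placeholder (the function name, prompts, and return format are illustrative assumptions, not the real API); the point is that one end-to-end model handles distinct document tasks selected purely by the prompt, with no separate layout-detection or text-recognition stages.

```python
def qianfan_ocr(image: bytes, prompt: str) -> str:
    """Stand-in for a single end-to-end vision-language model call.

    A real call would run the 4B VLM on the image; this stub just records
    which task the prompt requested, to illustrate the interface shape.
    """
    return f"[output for: {prompt}]"

# One model, three tasks, steered only by the prompt:
page_md = qianfan_ocr(b"<png bytes>", "Convert this page to Markdown.")
tables  = qianfan_ocr(b"<png bytes>", "Extract every table as a Markdown table.")
answer  = qianfan_ocr(b"<png bytes>", "What is the total on this invoice?")
print(page_md)
```

By contrast, a multi-stage pipeline would pass only the extracted text between stages, so the downstream language model never sees the original image.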
A key feature of Qianfan-OCR is an optional thinking phase: through its Layout-as-Thought mechanism, the model can reason explicitly about page layout before generating its final output.
Performance Benchmarks and Efficiency Gains
Qianfan-OCR was evaluated against specialized OCR systems and general vision-language models (VLMs). It achieved a score of 93.12 on OmniDocBench v1.5, 79.8 on OlmOCR Bench, and 880 on OCRBench. The model also achieved the highest average score of 87.9 on public KIE benchmarks. Comparative testing revealed that two-stage OCR+LLM pipelines frequently struggle with tasks requiring spatial reasoning: all tested two-stage systems scored 0.0 on the CharXiv benchmark because the text-extraction stage discards the visual context that chart interpretation depends on.
Inference efficiency was measured in Pages Per Second (PPS) using a single NVIDIA A100 GPU. With W8A8 (AWQ) quantization, Qianfan-OCR achieved 1.024 PPS, representing a 2x speedup over the W16A16 baseline with minimal accuracy loss. The GPU-centric architecture of Qianfan-OCR avoids the inter-stage processing delays common in pipeline systems that rely on CPU-based layout analysis, enabling efficient large-batch inference.
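As a quick sanity check on the throughput figures above (the 1.024 PPS and 2x numbers come from the article; the baseline throughput and per-page latencies below are derived arithmetic, not reported measurements):

```python
# Reported figures: W8A8-quantized throughput on a single A100, and the
# claimed speedup over the unquantized W16A16 baseline.
quantized_pps = 1.024
speedup = 2.0

# Implied baseline throughput and per-page latencies (derived, not reported).
baseline_pps = quantized_pps / speedup           # ~0.512 pages/second
quantized_ms_per_page = 1000.0 / quantized_pps   # ~976.6 ms per page
baseline_ms_per_page = 1000.0 / baseline_pps     # ~1953.1 ms per page

print(round(baseline_pps, 3), round(quantized_ms_per_page, 1), round(baseline_ms_per_page, 1))
```

In other words, quantization roughly halves the per-page latency, which compounds with large-batch inference since the whole pipeline stays on the GPU.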
Architectural Advantages and Technical Details
The system’s architecture, built around Qianfan-ViT and Qwen3-4B, enables direct image-to-Markdown conversion, in contrast with traditional multi-stage pipelines. The Vision Encoder, Qianfan-ViT, processes image inputs (likely at resolutions such as 448×448) and feeds a Cross-Modal Adapter that bridges visual and textual representations. The Language Model Backbone, Qwen3-4B, is a 4.0B-parameter model with 36 layers and a 32K context window. Reported architectural details also include a two-layer MLP with GELU activation, consistent with a projection-style adapter.
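The adapter described above can be sketched as a minimal projection module. This is an illustrative assumption based on the reported two-layer MLP with GELU, in the style of LLaVA-type adapters, not the released implementation: the vision feature width (1024) is a guess, and 2560 is Qwen3-4B's hidden size to the best of our knowledge.

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class CrossModalAdapter:
    """Two-layer MLP with GELU that projects vision-encoder features
    into the language model's embedding space (projection-style sketch)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2560, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, vision_tokens: np.ndarray) -> np.ndarray:
        # vision_tokens: (num_patches, vision_dim) -> (num_patches, llm_dim)
        return gelu(vision_tokens @ self.w1 + self.b1) @ self.w2 + self.b2

# A 448x448 input with 14x14 patches would yield 32*32 = 1024 visual tokens.
adapter = CrossModalAdapter()
tokens = np.zeros((1024, 1024))
print(adapter(tokens).shape)  # (1024, 2560)
```

After projection, the visual tokens can be concatenated with text-prompt embeddings and fed to the language model backbone as a single sequence.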
Beyond its core capabilities, Qianfan-OCR supports prompt-driven tasks, demonstrating versatility in document understanding. Its direct image-to-Markdown conversion capability and sophisticated layout analysis, facilitated by the Layout-as-Thought mechanism, position it as a significant advancement in document AI. The integration of these features into a single, efficient model addresses limitations found in existing systems, particularly in handling complex documents and spatial reasoning tasks.