StepFun’s New AI Model Offers Near-Opus Coding Power at One-Ninth the Cost
The pursuit of more efficient and accessible AI is gaining momentum, with StepFun’s latest release, Step 3.7 Flash, pushing the boundaries of what multimodal agents can achieve. This 198-billion-parameter sparse Mixture-of-Experts (MoE) model promises to democratize advanced coding assistance and sophisticated search workflows. By integrating visual understanding with powerful language processing, Step 3.7 Flash aims to deliver high performance without the exorbitant costs associated with leading proprietary models.
This development signifies a broader trend towards specialized, cost-optimized AI tools. As these agents become more capable, they are beginning to redefine agentic workflows, moving beyond reliance on monolithic, computationally heavy large language models. The availability of such a potent, yet economical, model under an open-source license could significantly accelerate innovation in AI-powered developer tools and information retrieval systems.
Cost-Effective Coding Power Emerges
Step 3.7 Flash posts strong results on coding tasks while keeping inference cheap. Although the model has 198 billion parameters in total, it activates only about 11 billion of them per token during inference. This sparse activation is the key to its efficiency: the model offers a 256k-token context window and reaches throughput of up to 400 tokens/sec. Its focus on coding agents and search workflows shows in its benchmark results.
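To put those sparsity and throughput figures in perspective, here is a back-of-the-envelope sketch that uses only the numbers quoted above. The function names and the 2,000-token response length are illustrative assumptions, and the timing assumes peak throughput is sustained, which real deployments may not manage.

```python
# Figures quoted in the article; everything else is illustrative.
TOTAL_PARAMS_B = 198.0       # total parameters, in billions
ACTIVE_PARAMS_B = 11.0       # approximate parameters activated per token, in billions
PEAK_TOKENS_PER_SEC = 400.0  # quoted upper bound on throughput


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Idealized generation time, assuming constant throughput."""
    return num_tokens / tokens_per_sec


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active share per token: {frac:.1%}")

# Hypothetical 2,000-token response at the quoted peak rate.
print(f"2,000 tokens at peak: {seconds_to_generate(2000, PEAK_TOKENS_PER_SEC):.1f} s")
```

In other words, roughly one parameter in eighteen does work on any given token, which is where the cost savings of a sparse model come from.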
On SWE-Bench Pro, Step 3.7 Flash scored 56.26%, a notable improvement over its predecessor, Step 3.5 Flash, which achieved 51.3%. This enhanced capability extends to other benchmarks, with Terminal-Bench 2.1 scores rising from 53.37% for Step 3.5 Flash to 59.55% for Step 3.7 Flash. The model’s architecture combines a substantial 196-billion-parameter language backbone with a 1.8-billion-parameter Vision Transformer (ViT) encoder, bringing native multimodal support to the Step family for the first time.
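As a quick sanity check on these figures, the sketch below computes the absolute and relative gains over Step 3.5 Flash and confirms that the two component sizes add up to the headline parameter count. It uses only the numbers reported above; the helper name `gains` is an illustrative choice.

```python
# (Step 3.5 Flash score, Step 3.7 Flash score), in percent, as reported above.
scores = {
    "SWE-Bench Pro": (51.3, 56.26),
    "Terminal-Bench 2.1": (53.37, 59.55),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = absolute / old * 100.0
    return absolute, relative


for name, (old, new) in scores.items():
    abs_gain, rel_gain = gains(old, new)
    print(f"{name}: +{abs_gain:.2f} points ({rel_gain:.1f}% relative)")

# Language backbone plus vision encoder, in billions of parameters.
total_b = 196.0 + 1.8
print(f"Backbone + ViT: {total_b:.1f}B (headline figure: 198B)")
```

The gains work out to roughly five and six percentage points respectively, and 196B plus 1.8B rounds to the advertised 198B.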
Multimodal Integration and Strategic Efficiency
The integration of visual tools marks a significant leap for Step 3.7 Flash, enabling recognition and fine-grained analysis tasks that were beyond the text-only Step 3.5 Flash. It supports three selectable reasoning depths: Low, Medium, and High, allowing users to balance performance and computational cost. The model operates under the permissive Apache 2.0 license, fostering wider adoption and development.
Crucially, Step 3.7 Flash features an Advisor Mode, an implementation of a strategy described by Anthropic. In this mode, the model reportedly reaches 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the cost. Verified results show Step 3.7 Flash’s Advisor Mode scoring 76.3% at a per-task cost of $0.19, against Claude Opus 4.6’s 78.7% at $1.76: a slightly lower score for a far smaller bill. The model also performs well on visual tool pathways; the Visual Search Tool pathway achieved 79.16% on SimpleVQA (Search), and the Python Tool pathway reached 95.29% on V* (Python) and 89.13% on HR-Bench 4K.
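The "97% at one-ninth the cost" headline follows directly from the scores and per-task costs quoted here, as this short sketch shows. The `Result` type and helper names are illustrative, and the cost-per-solved-task figure additionally assumes the score can be read as the fraction of tasks solved, which the article does not state.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    score_pct: float   # benchmark score, in percent
    cost_usd: float    # reported cost per task, in US dollars


step = Result("Step 3.7 Flash (Advisor Mode)", 76.3, 0.19)
opus = Result("Claude Opus 4.6", 78.7, 1.76)


def relative_score(a: Result, b: Result) -> float:
    """Score of a as a fraction of the score of b."""
    return a.score_pct / b.score_pct


def cost_ratio(a: Result, b: Result) -> float:
    """How many times more b costs per task than a."""
    return b.cost_usd / a.cost_usd


def cost_per_solved_task(r: Result) -> float:
    """Assumption: the score is the share of tasks solved."""
    return r.cost_usd / (r.score_pct / 100.0)


print(f"Relative score: {relative_score(step, opus):.1%}")
print(f"Cost ratio: {cost_ratio(step, opus):.2f}x")
for r in (step, opus):
    print(f"{r.name}: ${cost_per_solved_task(r):.2f} per solved task")
```

The ratio of scores comes out at about 97%, and Opus 4.6 costs a little over nine times as much per task, consistent with the headline claim.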
📊 Key Numbers
- SWE-Bench Pro: 56.26% (Step 3.7 Flash) vs 51.3% (Step 3.5 Flash)
- Terminal-Bench 2.1: 59.55% (Step 3.7 Flash) vs 53.37% (Step 3.5 Flash)
- SWE-MTLG: 72.42% (Step 3.7 Flash)
- Step-SWE-Bench (internal): 64.5% to 71.5% range (Step 3.7 Flash)
- Advisor Mode Coding Performance: 76.3% score, $0.19 per task (Step 3.7 Flash) vs 78.7% score, $1.76 per task (Claude Opus 4.6)
- SimpleVQA (Search): 79.16% (Step 3.7 Flash)
- V* (Python): 95.29% (Step 3.7 Flash)
- HR-Bench 4K: 89.13% (Step 3.7 Flash)
- HR-Bench 8K: 86.34% (Step 3.7 Flash)
- Android Daily Tasks: 61.87% (Step 3.7 Flash) vs 53.36% (Kimi K2.6) and 51.68% (GLM 5V Turbo). Gemini 3 Flash leads at 63.21%.
- HLE w. Tools (acc): 47.20% (Step 3.7 Flash) vs 45.10% (DeepSeek V4 Flash)
- BrowseComp (acc): 75.82% (Step 3.7 Flash) vs 79.30% (Claude Opus 4.7)
- DeepSearchQA (F1): 92.82% (Step 3.7 Flash) vs 92.50% (Kimi K2.6)
- ResearchRubrics (score): 71.68% (Step 3.7 Flash)
- Model Parameter Activation: ~11B per token (Step 3.7 Flash)
- Context Window: 256k tokens (Step 3.7 Flash)
- Throughput: Up to 400 tokens/sec (Step 3.7 Flash)
- License: Apache 2.0
🔍 Context
The release of Step 3.7 Flash by StepFun underscores a significant shift in the AI landscape: the economic viability of high-performance, multimodal agents. This development directly addresses the prohibitive cost that has often limited the widespread adoption of cutting-edge AI models for practical applications, particularly in coding and complex search scenarios.
StepFun’s strategy of employing an MoE architecture, which activates a fraction of its total parameters during inference, allows for a dramatic reduction in operational costs. This contrasts with the approach of monolithic models, which often require substantial computational resources for every operation.
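To make the idea of activating only a fraction of the parameters concrete, here is a toy sketch of top-k expert routing, the general technique behind sparse MoE layers. It is not StepFun's implementation: the expert count, the value of k, the random router scores, and the trivial "experts" that merely scale their input are all hypothetical stand-ins, since the article does not describe the model's actual router.

```python
import math
import random


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x: float, experts, router_logits: list[float], k: int) -> float:
    """Run only the selected experts and mix their outputs by weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Hypothetical toy setup: 8 experts, 2 active per input.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_top_k(logits, TOP_K)
print("Experts used:", [i for i, _ in chosen], "of", NUM_EXPERTS)
print("Output for x=1.0:", moe_forward(1.0, experts, logits, TOP_K))
```

Only the selected experts are ever called, so compute per input scales with k rather than with the total number of experts. In a real model the router scores are learned and computed per token.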
In terms of competition, Claude Opus 4.7 performs somewhat better on BrowseComp, at 79.30% against Step 3.7 Flash’s 75.82%, but the economic advantage of Step 3.7 Flash’s Advisor Mode is a critical differentiator for cost-sensitive applications. This positions Step 3.7 Flash as a compelling alternative for developers and businesses seeking powerful AI capabilities without an equivalent price tag.
💡 AIUniverse Analysis
The core innovation in Step 3.7 Flash lies in its approach to resource management, combining a Mixture-of-Experts architecture with the strategic implementation of Advisor Mode. This isn’t just about improving performance; it’s about making high-level AI capabilities, especially multimodal ones, accessible and economically sustainable. The model’s ability to come within a few points of Claude Opus 4.6’s coding performance at a fraction of the cost demonstrates a practical path forward for AI deployment.
However, the “shadow” side of this efficiency model lies in the inherent complexities of MoE architectures. Sparse activation reduces costs, but it can introduce challenges in expert selection and routing, potentially leading to less predictable performance in certain edge cases than a tightly integrated monolithic model would show. The comparison with Claude Opus 4.6, striking as it is on cost, also highlights a performance gap (76.3% versus 78.7%) that, though small, might be critical for highly specialized tasks. Furthermore, the success of the Advisor Mode hinges on the robustness of its “advisor strategy” implementation.
For Step 3.7 Flash to maintain its trajectory, StepFun will need to demonstrate continued refinement of its expert routing mechanisms and the adaptability of its multimodal features across a broader range of real-world coding and search challenges. The open-source nature of the model offers a strong foundation for community-driven improvements, which will be key to its long-term impact.
⚖️ AIUniverse Verdict
✅ Promising. Step 3.7 Flash’s ability to deliver near-state-of-the-art coding performance at a substantially reduced cost, particularly through its Advisor Mode, presents a significant opportunity for broader AI adoption, though its long-term reliability across diverse tasks requires continued validation.
🎯 What This Means For You
Founders & Startups: Founders can leverage Step 3.7 Flash’s multimodal capabilities and cost-effectiveness to build novel coding agents and search workflows that previously required significantly more expensive models.
Developers: Developers gain a new option for integrating vision understanding and enhanced tool use into their agents, with selectable reasoning depths allowing for latency-performance tuning.
Enterprise & Mid-Market: Enterprises can explore deploying sophisticated AI agents for coding assistance and information retrieval at a fraction of the cost of top-tier models, potentially unlocking new automation opportunities.
General Users: End-users may benefit from more intelligent and responsive applications that can understand and process visual information in search and coding contexts.
⚡ TL;DR
- What happened: StepFun released Step 3.7 Flash, an Apache 2.0-licensed multimodal MoE model built for coding agents and search workflows.
- Why it matters: It offers near top-tier coding capabilities at a fraction of the cost, making powerful AI agents more accessible.
- What to do: Developers and businesses should evaluate its potential for cost-effective AI integrations in coding and search applications.
📖 Key Terms
- Mixture-of-Experts (MoE)
- A neural network architecture that routes each input (typically each token) to a small subset of specialized sub-networks, or experts, so that only a fraction of the model’s parameters is active at any one time.
- vision encoder (ViT)
- A component of a multimodal model that processes visual input, converting images into a format that language models can understand.
- Advisor Mode
- An operating mode in Step 3.7 Flash that implements an advisor strategy described by Anthropic, aimed at approaching top-tier coding performance at a much lower per-task cost.
- multimodal
- AI systems capable of processing and understanding information from multiple types of data, such as text, images, and audio.
Analysis based on reporting by MarkTechPost.

