Checkmarx’s New Security Scanner Cuts Through the Noise — But Who’s Watching the Filter?
Security teams are drowning — not because scanners find too little, but because they find too much. As AI coding assistants accelerate code output across engineering organizations, the volume of flagged vulnerabilities has outpaced any human team’s ability to triage them. Checkmarx’s answer is a rebuilt Static Application Security Testing (SAST) engine that doesn’t just scan code — it decides, before a developer ever sees a result, which findings are real and which are noise.
The numbers behind this claim are striking. According to Checkmarx’s own release documentation, the new engine achieves an F1 score of 0.499 (the harmonic mean of precision and recall, which rewards a scanner for catching real vulnerabilities without raising false alarms) against a category average of just 0.20. That ratio, roughly 2.5 times the category average, is the central argument for why this architecture matters. In head-to-head testing across four production codebases, the engine surfaced 327 true positives that a leading frontier AI model missed entirely, according to the same documentation.
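For readers who want to see how an F1 score is derived, here is a minimal sketch of the standard formula. The counts are hypothetical, chosen only so the arithmetic lands near the two scores quoted above; they are not Checkmarx’s data, and the function name is ours.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when either is undefined."""
    flagged = true_positives + false_positives   # everything the scanner reported
    actual = true_positives + false_negatives    # everything that was really there
    if flagged == 0 or actual == 0:
        return 0.0
    precision = true_positives / flagged         # share of reports that were real
    recall = true_positives / actual             # share of real issues that were reported
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not from Checkmarx's benchmark).
noisy = f1_score(true_positives=40, false_positives=300, false_negatives=20)
filtered = f1_score(true_positives=35, false_positives=45, false_negatives=25)
print(f"noisy scanner F1:    {noisy:.3f}")     # 0.200
print(f"filtered scanner F1: {filtered:.3f}")  # 0.500
```

In this made-up example the filtered scanner scores far higher even though it catches fewer real issues (35 versus 40), because F1 penalizes the flood of false alarms heavily. That is why an F1 figure alone does not reveal how many true positives a filter discards.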
Jonathan Rende, Checkmarx’s Chief Research Officer, frames the design philosophy plainly: “Three engines run together to deliver unified protection.” That three-part architecture — a deterministic rules-based scanner, a Large Language Model (LLM) trained on security data, and a proprietary Findings Analysis Engine (FAE) — is where the real story lives. The question is not whether the engine works. It is who controls what you see when it does.
Three Engines, One Filter: How Checkmarx’s Architecture Actually Works
The new Checkmarx SAST engine is built on a deliberate division of labor. The deterministic scanner — a rules-based system that applies fixed logic to code patterns — provides consistency and speed. It does not guess; it matches. Alongside it, an LLM trained specifically on security data handles the ambiguous cases that rigid rules miss: context-dependent vulnerabilities, novel patterns, and code constructs that require semantic understanding. According to Checkmarx’s release documentation, this dual approach also enables rapid support for new programming languages, since the LLM component can generalize where deterministic rules would require manual authoring.
The third component, the Findings Analysis Engine (FAE), is the architectural decision that separates this tool from a conventional scanner. The FAE classifies findings as true or false positives before they ever reach a development team. This means the output a developer receives has already been filtered by a proprietary layer — one that Checkmarx controls and that operates between the raw scan and the human reviewer. The practical benefit is real: fewer false alarms means developers spend less time chasing phantom vulnerabilities. The tradeoff, examined more closely below, is that the underlying reasoning of that filter is not exposed to the teams relying on it.
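Checkmarx has not published how the FAE works internally, so the following is only a generic sketch of a post-scan filter of the kind described above. Every name, the confidence field, and the 0.5 threshold are our assumptions. The one deliberate design choice is that suppressed findings are kept in a separate list rather than discarded, which is the visibility the article argues buyers should ask for.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    rule_id: str
    path: str
    line: int
    source: str        # "deterministic" or "llm" (hypothetical labels)
    confidence: float  # hypothetical classifier score in [0, 1]


@dataclass
class TriageResult:
    surfaced: list[Finding] = field(default_factory=list)
    suppressed: list[Finding] = field(default_factory=list)


def triage(findings: list[Finding], threshold: float = 0.5) -> TriageResult:
    """Deduplicate findings, then split them by a confidence threshold."""
    result = TriageResult()
    seen: set[tuple[str, str, int]] = set()
    # Highest confidence first, so the strongest copy of a duplicate is the one kept.
    for finding in sorted(findings, key=lambda f: -f.confidence):
        key = (finding.rule_id, finding.path, finding.line)
        if key in seen:
            continue
        seen.add(key)
        if finding.confidence >= threshold:
            result.surfaced.append(finding)
        else:
            result.suppressed.append(finding)  # retained for audit, not dropped
    return result


findings = [
    Finding("sql-injection", "app/db.py", 42, "deterministic", 0.91),
    Finding("sql-injection", "app/db.py", 42, "llm", 0.88),  # same location, other engine
    Finding("hardcoded-secret", "tests/fixtures.py", 7, "deterministic", 0.12),
]
result = triage(findings)
print("surfaced:  ", [f.rule_id for f in result.surfaced])
print("suppressed:", [f.rule_id for f in result.suppressed])
```

A developer would see only the surfaced list. Whether a security team can also inspect the suppressed list is the governance question at the center of this story.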
Checkmarx also introduces a new prioritization metric called “Attackability,” which focuses not just on whether a vulnerability exists, but on whether it sits on an exploitable attack path — meaning whether a real attacker could actually reach and weaponize it. This shifts the triage question from “is this a bug?” to “can this bug be used against us?” That reframing is operationally significant for Application Security (AppSec) teams managing hundreds of open findings at any given time.
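The idea of an exploitable attack path can be illustrated with a simple reachability check over a call graph. This is a generic sketch, not Checkmarx’s Attackability scoring, whose details are not public; the function names and the graph are invented for the example.

```python
from collections import deque


def reachable_from(entry_points, call_graph):
    """Breadth-first search: every function reachable from externally exposed entry points."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical application: who calls whom.
call_graph = {
    "handle_request": ["parse_params", "render_page"],
    "parse_params": ["run_query"],
    "nightly_job": ["legacy_export"],
}
entry_points = ["handle_request"]  # assumed to be the only internet-facing handler
vulnerable = {"run_query": "SQL injection", "legacy_export": "path traversal"}

exposed = reachable_from(entry_points, call_graph)
# Findings on a reachable path sort first; the rest stay listed but rank lower.
prioritized = sorted(vulnerable, key=lambda fn: fn not in exposed)
for fn in prioritized:
    status = "reachable" if fn in exposed else "not reachable from entry points"
    print(f"{fn}: {vulnerable[fn]} ({status})")
```

Both bugs exist, but only one sits on a path an outside request can reach. The result depends entirely on which entry points are declared, which is why a team’s threat model has to match the tool’s assumptions.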
The Noise Problem Is Real — But the Cure Has a Cost
The false-positive crisis in application security is not a marketing construct. AI coding tools — GitHub Copilot, Cursor, and their equivalents — have materially increased the volume of code being written and committed, which directly multiplies the number of findings any scanner produces. AppSec teams that were already stretched thin now face triage queues that grow faster than they can be cleared. Checkmarx’s orchestration-first approach targets this exact bottleneck, and the F1 score of 0.499 versus the category average of 0.20, as stated in the company’s release documentation, suggests the filtering is doing meaningful work rather than simply suppressing output.
But the critical question here deserves direct examination. When the FAE filters findings before they reach developers, enterprises are implicitly trusting Checkmarx’s proprietary logic to make security-relevant decisions on their behalf. Traditional SAST tools, whatever their limitations, expose their full output, false positives included, giving security engineers the raw material to audit, tune, and understand the scanner’s behavior. The FAE abstracts that layer away. A cautious Chief Technology Officer should ask: if the FAE misclassifies a true positive as a false positive, how would the team know? The answer, under this architecture, is that they likely would not, at least not until exploitation.
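One way a team could answer that question is a periodic spot check: pull a random sample of suppressed findings, have a human review them, and estimate how often the filter buried a real vulnerability. The sketch below assumes the vendor exposes suppressed findings at all, which the release documentation does not confirm; the finding IDs and function names are hypothetical.

```python
import random


def sample_for_audit(suppressed_ids, sample_size, seed=0):
    """Draw a reproducible random sample of suppressed findings for manual review."""
    rng = random.Random(seed)
    k = min(sample_size, len(suppressed_ids))
    return rng.sample(list(suppressed_ids), k)


def estimated_miss_rate(review_labels):
    """review_labels maps finding ID -> True if a reviewer judged it a real vulnerability."""
    if not review_labels:
        return 0.0
    return sum(review_labels.values()) / len(review_labels)


# Hypothetical audit: 200 suppressed findings, 20 sampled for human review.
suppressed_ids = [f"F-{i}" for i in range(200)]
batch = sample_for_audit(suppressed_ids, sample_size=20)

# Pretend the reviewers found one real vulnerability among the 20.
labels = {finding_id: False for finding_id in batch}
labels[batch[0]] = True

rate = estimated_miss_rate(labels)
print(f"reviewed {len(batch)} suppressed findings, estimated miss rate {rate:.0%}")
```

A sample this small gives only a rough estimate, but a nonzero rate is a signal to escalate with the vendor or widen the review.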
There is also a competitive transparency gap worth noting. Checkmarx declined to name the leading frontier model it tested against in the 327-finding comparison. That omission matters because the credibility of the benchmark depends entirely on which model served as the baseline. A frontier model optimized for code generation rather than security analysis would be a far weaker comparator than a purpose-built security LLM. Without that disclosure, the 327 figure is directionally interesting but not independently verifiable. The industry is also converging on similar hybrid architectures — deterministic plus LLM plus post-processing — which means the novelty of this approach may be narrower than the announcement implies.
| Approach | Key Difference | Best For |
|---|---|---|
| Checkmarx (FAE + LLM + Deterministic) | Pre-filters findings before developer review via proprietary FAE layer | Teams prioritizing developer efficiency over raw scan transparency |
| Traditional SAST (rules-only) | Full deterministic output with no post-processing suppression | Security teams that need auditable, unfiltered scan results |
| Frontier LLM-only scanners | Broad semantic understanding but missed 327 true positives in Checkmarx’s own testing | Exploratory analysis and novel vulnerability pattern detection |
📊 Key Numbers
- F1 score (Checkmarx engine): 0.499 — measuring the balance between detecting real vulnerabilities and suppressing false alarms
- F1 score (category average): 0.20, the baseline; Checkmarx’s 0.499 is roughly 2.5 times this figure
- True positives missed by frontier model: 327 — found by Checkmarx’s engine across four production codebases in head-to-head testing
- Production codebases tested: 4 — the scope of the comparative benchmark disclosed in Checkmarx’s release documentation
- Engine components: 3 — deterministic rules-based scanner, security-trained LLM, and the Findings Analysis Engine (FAE)
- Attackability metric: New prioritization layer scoring vulnerabilities by exploitability and reachable attack paths, not just presence
🔍 Context
The benchmarks and comparative claims in this announcement originate from Checkmarx’s own internal testing and release documentation — not from an independent third-party evaluator such as NIST, CISA, or an academic security lab, a distinction that matters when assessing the credibility of the 327-finding comparison. The specific problem this architecture addresses is the false-positive triage collapse that follows from AI-assisted code generation: when developers write more code faster, scanners produce more findings faster, and AppSec teams — whose headcount has not scaled proportionally — face an unsustainable review burden. Checkmarx’s FAE is a direct engineering response to that operational constraint. This announcement fits within a broader industry pattern where security vendors are repackaging scanning pipelines with LLM layers and post-processing filters, positioning orchestration as the differentiator rather than raw detection capability. The closest architectural alternative is a self-managed pipeline combining an open-source SAST tool such as Semgrep with a custom triage script — an approach that preserves full output visibility but requires significant internal engineering investment to maintain. The timing of this release is directly tied to the documented surge in AI-generated code volume, which Checkmarx’s own release materials identify as the primary driver of the false-positive problem the FAE is designed to solve.
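To make the self-managed alternative concrete, here is a rough sketch of a custom triage script reading Semgrep’s JSON report. The sample below imitates the fields we understand Semgrep to emit when run with its `--json` flag (`results`, `check_id`, `path`, `start.line`, `extra.severity`); treat that shape as an assumption and check it against the output of your own Semgrep version. The rule IDs and paths are invented.

```python
import json
from collections import defaultdict

# Stand-in for a report file produced by a Semgrep scan with JSON output enabled.
SAMPLE_OUTPUT = """
{"results": [
  {"check_id": "example.sql-injection", "path": "app/db.py", "start": {"line": 42},
   "extra": {"severity": "ERROR", "message": "possible SQL injection"}},
  {"check_id": "example.weak-hash", "path": "app/auth.py", "start": {"line": 10},
   "extra": {"severity": "WARNING", "message": "MD5 used for hashing"}},
  {"check_id": "example.debug-flag", "path": "tests/conftest.py", "start": {"line": 3},
   "extra": {"severity": "INFO", "message": "debug mode enabled"}}
]}
"""


def bucket_by_severity(raw_json):
    """Group every finding by severity; nothing is dropped, only sorted into buckets."""
    buckets = defaultdict(list)
    for result in json.loads(raw_json).get("results", []):
        severity = result.get("extra", {}).get("severity", "UNKNOWN")
        location = f'{result["path"]}:{result["start"]["line"]}'
        buckets[severity].append(f'{location} {result["check_id"]}')
    return dict(buckets)


buckets = bucket_by_severity(SAMPLE_OUTPUT)
for severity, items in buckets.items():
    print(severity, items)
```

The appeal of this route is that the triage logic is a few lines the team owns and can read. The cost is that someone has to keep those lines, and the rules behind them, up to date.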
💡 AIUniverse Analysis
Our reading: The genuine advance here is architectural sequencing, not raw detection power. By placing the FAE between the scanner and the developer, Checkmarx has operationalized a triage decision that most teams currently make manually, inconsistently, and slowly. The F1 score of 0.499 — drawn from Checkmarx’s release documentation — is a concrete, measurable improvement over the 0.20 category average, and the Attackability metric is a meaningful reframe: shifting from “does this vulnerability exist” to “can an attacker actually reach it” changes how security debt gets prioritized in sprint planning. That is a workflow change, not just a feature addition.
The shadow is the opacity of the FAE itself. Enterprises adopting this engine are not just buying a scanner; they are delegating a classification decision to a proprietary black box. If the FAE’s model drifts, overfits to certain code patterns, or is tuned to suppress aggressively so that developers complain less, at the expense of recall, the security team has no direct mechanism to detect that degradation. The unnamed frontier model in the benchmark comparison compounds this concern: without knowing whether Checkmarx tested against a general-purpose code model or a security-specialized one, the 327-finding gap cannot be contextualized. A cautious CTO would also note that multiple vendors are converging on this same three-layer architecture, which suggests the moat here is thinner than the announcement implies.
For this to matter in 12 months, Checkmarx would need to publish FAE classification accuracy data from customer production environments — not internal benchmarks — and disclose the identity of the frontier model used in comparative testing, so the security community can independently assess whether the 327-finding advantage holds across diverse codebases and languages.
⚖️ AIUniverse Verdict
👀 Watch this space. The F1 score of 0.499 against a category average of 0.20 is a notable performance claim, though a self-reported one, and the FAE’s proprietary filtering layer and the undisclosed frontier model comparator leave the most consequential questions, about transparency and reproducibility, unanswered.
🎯 What This Means For You
Founders & Startups: If your engineering team uses AI coding assistants, your scan output volume is already growing faster than your security review capacity. Checkmarx’s FAE-based filtering is worth evaluating — but ask the vendor to show you what the FAE suppresses, not just what it surfaces.
Developers: The promise is fewer false alarms cluttering your backlog. The practical question is whether the Attackability metric aligns with how your organization actually defines exploitability — if your threat model differs from Checkmarx’s assumptions, the prioritization may not match your real risk.
Enterprise & Mid-Market: Before adopting any tool that filters security findings before human review, establish a contractual or technical mechanism to audit FAE classification decisions. The efficiency gain is real; the governance gap is equally real.
General Users: AI coding tools increase the amount of code that ships, and more code means more flagged findings for security teams to work through. Tools like this one exist to manage that volume, but their effectiveness depends on whether the filter between the scanner and the developer is trustworthy and auditable.
⚡ TL;DR
- What happened: Checkmarx released a three-component SAST engine — deterministic scanner, security-trained LLM, and a proprietary Findings Analysis Engine — claiming an F1 score of 0.499 versus a category average of 0.20, and 327 true positives missed by an unnamed frontier model.
- Why it matters: AI-generated code is flooding security pipelines with more findings than AppSec teams can manually triage, and Checkmarx’s FAE automates the classification step that currently consumes most of that human time.
- What to do: Before deploying any FAE-style filter in production, demand a disclosure mechanism — audit logs, classification confidence scores, or periodic recall audits — so your security team can verify what the engine is suppressing.
📖 Key Terms
- SAST (Static Application Security Testing): A method of analyzing source code for security vulnerabilities without executing the program. In this context, it is the category of tool Checkmarx’s new engine belongs to and is benchmarked against.
- F1 score: A single number, the harmonic mean of precision and recall, that balances how many real vulnerabilities a scanner catches (recall) against how many of its flagged findings are actually real (precision). Checkmarx’s 0.499 versus the category average of 0.20 is the article’s central performance claim.
- LLM (Large Language Model): An AI model trained on large text datasets. Here, Checkmarx uses one trained specifically on security data to handle vulnerability patterns that fixed rules cannot detect.
- Deterministic: A scanning approach that applies fixed, rule-based logic and always produces the same output for the same input, in contrast to the LLM component, whose output can vary with context.
- Findings Analysis Engine (FAE): Checkmarx’s proprietary third component, which classifies scan results as true or false positives before they reach developers. It is the architectural layer that makes this engine an orchestration tool rather than a raw scanner.
- Attackability: Checkmarx’s new prioritization metric, which scores vulnerabilities by whether they sit on a reachable, exploitable attack path, shifting triage from “does this bug exist” to “can an attacker actually use it.”
Analysis based on reporting by The New Stack.

