AI Systems Are Beginning to Train Their Own Successors

The pace at which artificial intelligence systems are improving themselves is accelerating dramatically, raising questions about the future of scientific discovery and human control. Evidence suggests AI models are nearing a tipping point where they can conduct significant AI research and development with minimal human input. Jack Clark, in his Import AI newsletter, highlights that current forecasts give a greater than 60% probability that AI systems will be capable of substantial AI R&D without sustained human steering by 2028.

The Iterative Engine of AI Advancement

Recent advances showcase AI’s growing autonomy in engineering its own successors. Anthropic reported that automating LLM training workflows yielded a 2.9× speedup with Opus 4 in May 2025, followed by a 52× speedup with Mythos Preview in April 2026. This iterative self-improvement is not limited to training speed; it extends to core capabilities. SWE-Bench performance, for instance, has surged from approximately 2% during the Claude 2 era to 93.9% with the Claude Mythos Preview.
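
To put those two data points on a common scale, the sketch below converts them into an implied month-over-month growth rate in the automation speedup. The eleven-month gap and the assumption of smooth compounding are ours, not Anthropic’s, so treat the output as a rough illustration of the acceleration rather than a reported figure.

```python
from math import log

# Speedup figures cited in the article for automated LLM training workflows.
# The month-level dates come from the article; smooth monthly compounding is
# an assumption made purely for illustration.
speedup_early = 2.9     # Opus 4, May 2025
speedup_late = 52.0     # Mythos Preview, April 2026
months_between = 11     # May 2025 -> April 2026

monthly_growth = (speedup_late / speedup_early) ** (1 / months_between)
doubling_months = log(2) / log(monthly_growth)

print(f"Implied growth in speedup factor: ~{(monthly_growth - 1) * 100:.0f}% per month")
print(f"Speedup factor doubles roughly every {doubling_months:.1f} months")
```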

This progress is quantifiable across multiple benchmarks. METR success-weighted task lengths have grown from about 30 seconds for GPT-3.5 in 2022 to 12 hours for Opus 4.6 in 2026. Similarly, CORE-Bench scores have risen from 21.5% for GPT-4o in September 2024 to 95.5% for Opus 4.5 in December 2025, while MLE-Bench scores jumped from 16.9% for o1 in October 2024 to 64.4% for Gemini 3 in February 2026. These figures suggest that AI systems are rapidly mastering the very tasks required to build and refine future AI.
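
The two METR endpoints above can likewise be turned into a rough doubling time for task horizons. Because the article gives only years for those points, the four-year span in the sketch below is our assumption, and the result is a back-of-the-envelope estimate rather than METR’s own figure.

```python
from math import log2

# Endpoint task-horizon figures cited in the article.
start_seconds = 30           # GPT-3.5, 2022 (~30 seconds)
end_seconds = 12 * 3600      # Opus 4.6, 2026 (12 hours)
years_elapsed = 4            # assumed span; the article gives only years

doublings = log2(end_seconds / start_seconds)
doubling_time_months = 12 * years_elapsed / doublings

print(f"Task horizon grew ~{end_seconds / start_seconds:,.0f}x")
print(f"That is ~{doublings:.1f} doublings, roughly one every {doubling_time_months:.1f} months")
```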

Navigating the Unforeseen Frontier of AI R&D

The notion of AI training its own successors presents a fundamental shift, moving beyond human-directed iteration. OpenAI leadership has set a target for September 2026 to achieve an “automated AI research intern.” This capability could dramatically compress the timeline for breakthroughs, potentially outpacing human comprehension and oversight. The challenge lies in the “Lego vs Einstein” framing: while AI can assemble sophisticated components (Lego), it remains to be seen if it can achieve profound, paradigm-shifting insights akin to Einstein’s General Relativity without deep human guidance.

Current evaluation methods, which rely heavily on synthetic benchmarks, may not fully capture the emergent capabilities or risks associated with unsupervised, recursive improvement. PostTrainBench results show frontier models at around 25–28%, versus approximately 51% for humans, roughly half the human score, indicating a significant gap remains on certain creative or complex reasoning tasks. The trajectory, however, suggests this gap is closing. The industry’s focus on measurable outputs for predefined tasks risks overlooking the potential for unintended consequences as AI research accelerates beyond current safety evaluation frameworks.

📊 Key Numbers

  • LLM training optimization speedup (Opus 4): 2.9×
  • LLM training optimization speedup (Mythos Preview): 52×
  • SWE-Bench performance (Claude 2 era): ~2%
  • SWE-Bench performance (Claude Mythos Preview): 93.9%
  • METR success-weighted task length (GPT-3.5): ~30s
  • METR success-weighted task length (GPT-4): 4 min
  • METR success-weighted task length (o1): 40 min
  • METR success-weighted task length (GPT-5.2): 6h
  • METR success-weighted task length (Opus 4.6): 12h
  • CORE-Bench score (GPT-4o): 21.5%
  • CORE-Bench score (Opus 4.5): 95.5%
  • MLE-Bench score (o1): 16.9%
  • MLE-Bench score (Gemini 3): 64.4%
  • PostTrainBench performance (frontier models): ~25–28%
  • PostTrainBench performance (Human): ~51%

🔍 Context

The rapid self-improvement of AI systems, as detailed in Import AI’s analysis, marks a substantial shift in the R&D process. It speaks directly to the burgeoning field of automated machine learning (AutoML) and AI-driven scientific discovery, a trend accelerated by the growing scale and capability of large language models. The competitive landscape is marked by an intense race among major AI labs to push the boundaries of autonomous research, with OpenAI setting ambitious targets for automated research interns. This development contrasts with traditional, human-led scientific iteration, aiming to dramatically shorten discovery cycles through AI’s ability to rapidly test hypotheses and optimize designs. The timing reflects the convergence of advanced model architectures and increasingly available computational resources, which together enable this recursive improvement loop.

💡 AIUniverse Analysis

Our reading: The core advance is AI’s increasing capacity to autonomously navigate and accelerate the research and development pipeline for its own kind. This is not merely about faster training; it’s about AI systems designing, evaluating, and refining AI models, effectively bootstrapping their own evolution at an unprecedented speed. The implications of AI conducting substantial R&D without sustained human steering, with forecasts suggesting this is likely by 2028, are profound for the pace of technological progress.

The shadow here lies in the potential for emergent capabilities that operate outside our current understanding or control mechanisms. While benchmarks like SWE-Bench and METR’s task-length evaluations demonstrate impressive performance gains, they primarily reflect success on predefined tasks. The risk is that this iterative, composable progress, framed as “Lego,” may not amount to genuine scientific creativity, and may carry consequences that current evaluations cannot anticipate. Over-reliance on synthetic benchmarks could create a false sense of security, obscuring risks associated with unsupervised, recursive self-improvement. The critical question is whether current evaluation paradigms are sufficient to ensure safety and alignment as AI R&D accelerates beyond human pace.

For this trend to matter significantly in 12 months, we will need to see evidence of AI contributing genuinely novel scientific insights or engineering solutions that were not easily predictable from prior human knowledge, alongside safety mechanisms that are demonstrably effective and scalable.

⚖️ AIUniverse Verdict

👀 Watch this space. The rapid improvement in AI’s ability to conduct R&D, evidenced by benchmark score increases and speedups, is compelling, but the potential for unexamined emergent capabilities and the reliance on synthetic benchmarks warrant careful observation rather than immediate endorsement.

🎯 What This Means For You

Founders & Startups: Startups may find themselves in a race to leverage increasingly autonomous AI research tools, potentially accelerating product development cycles but also facing intense competition from AI-driven innovation itself.

Developers: Developers will need to adapt to new paradigms where AI agents perform significant portions of the R&D lifecycle, shifting focus to system design, oversight, and validation of AI-generated research.

Enterprise & Mid-Market: Enterprises could see unprecedented gains in efficiency and innovation as AI systems autonomously optimize their own development, but will need to grapple with the ethical and control implications.

General Users: The ultimate impact on users is uncertain, but could manifest as a faster pace of technological advancement, potentially leading to more capable and personalized products and services.

⚡ TL;DR

  • What happened: AI systems are demonstrating rapidly increasing capability in performing AI research and development tasks autonomously.
  • Why it matters: This trend suggests AI may soon be able to significantly improve its own design and capabilities, potentially accelerating scientific discovery beyond human oversight.
  • What to do: Monitor AI’s progress in R&D self-improvement and focus on understanding and mitigating the risks of unsupervised recursive development.

📖 Key Terms

SWE-Bench
A benchmark that evaluates AI models on resolving real-world software engineering tasks, such as issues drawn from open-source GitHub repositories.
METR
A research organization (Model Evaluation and Threat Research) whose evaluations measure the length of tasks, weighted by success rate, that AI models can complete autonomously.
CORE-Bench
A benchmark that assesses whether AI agents can computationally reproduce the results of published scientific papers from their code and data.
MLE-Bench
A benchmark that evaluates AI agents on machine learning engineering tasks drawn from Kaggle-style competitions.
PostTrainBench
A benchmark that compares the performance of frontier AI models against human capabilities on post-training tasks.

Analysis based on reporting by Import AI. Original article here.

By AI Universe
