ZDNET Highlights
- The latest version of Cloud Mythos has already been upgraded.
- External researchers found that it achieved several firsts in testing.
- AI capabilities may be improving faster than anticipated.
Anthropic’s Cloud Mythos, which the company says is too powerful for general release, appears to have already gained new capabilities.
In a blog post published Wednesday, the UK AI Security Institute (AISI) reported that it had tested a new version of Mythos that outperformed both its previous results and OpenAI’s GPT-5.5 – just a month after Mythos’ initial release.
Also: Apple, Google and Microsoft join forces with Anthropic’s Project Glasswing to protect the world’s most critical software
The blog authors wrote, “The new Mythos Preview checkpoint completed both of our cyber ranges, solving ‘The Last Ones’ in 6 out of 10 attempts and the previously unsolved ‘Cooling Tower’ in 3 out of 10 attempts. This was the first time a model completed the second of our two cyber ranges.”
When Anthropic first announced Mythos Preview and Project Glasswing – the cybersecurity testing alliance it formed with rival tech companies and AI labs, whose members were given limited access to Mythos – last month, the UK AISI evaluated the model, finding that it “represents a step forward over previous frontier models in a scenario where cyber performance was already rapidly improving.”
That third-party perspective helped balance claims that the hype around Mythos was either pure marketing or, at the other extreme, signaled a catastrophic shift in AI capabilities. The truth about what the model can do likely lies somewhere in the middle.
Also: How to learn Cloud Code for free with Anthropic’s AI courses – one took me only 20 minutes
AISI’s updated testing also shows that capability improvements are not limited to new model releases; they can also occur between versions of a single model.
A rapidly growing cyber threat
AISI noted that AI models are rapidly advancing in their ability to handle cyber tasks, with serious implications for cybersecurity – especially given Mythos’ ability to detect software vulnerabilities.
“In February 2026, we internally estimated that the duration of cyber tasks that AI models could complete had been doubling every 4.7 months since the end of 2024 – already an acceleration from the 8-month doubling time we estimated in November 2025,” the blog authors wrote. “Since then, AISI has reported on two new models, Cloud Mythos Preview and (OpenAI’s) GPT-5.5, both of which significantly exceed the doubling-rate trend.”
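As a rough illustration of what those doubling times imply (a sketch using assumed numbers, not AISI’s actual methodology), the snippet below compares how a model’s task horizon would grow under an 8-month versus a 4.7-month doubling period. The 60-minute starting horizon is a hypothetical value chosen only for illustration; the two doubling periods come from the AISI figures quoted above.

```python
def task_horizon(base_minutes: float, months_elapsed: float,
                 doubling_period_months: float) -> float:
    """Task duration a model could complete after `months_elapsed`,
    assuming the horizon doubles every `doubling_period_months`."""
    return base_minutes * 2 ** (months_elapsed / doubling_period_months)

# Hypothetical starting horizon: a 60-minute cyber task.
base = 60.0
for label, period in [("8-month doubling", 8.0), ("4.7-month doubling", 4.7)]:
    horizon = task_horizon(base, 12, period)
    print(f"{label}: after 12 months -> {horizon:.0f} minutes")
```

Under these assumptions, a year of progress at the faster doubling rate yields roughly twice the task horizon that the slower rate would – which is why a shift from 8 to 4.7 months compounds so noticeably.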
Also: Third major Linux kernel flaw found in two weeks – thanks to AI
The authors said it is unclear whether this acceleration will persist or whether it marks a permanent increase in the rate of improvement; Mythos and GPT-5.5 may simply be notable breaks from the overall pattern of development.
Nevertheless, AISI cautioned that its testing left many unknowns. The tests capped each task at 2.5 million tokens, allowing researchers to compare performance results consistently over time. “This naturally limits what frontier models can demonstrate,” the authors wrote.
The blog continued, “Mythos Preview and GPT-5.5 have large upper-bound error bars because of their near-100% success rate on the longest tasks in our narrow cyber suite, even with the 2.5M token limit. Our tasks are not long enough to determine how quickly these models’ reliability deteriorates over longer task horizons. This puts some of the newest models at the measurement limits of our narrow test suite.”
Also: I put GPT-5.5 through a 10-round test: it scored 93/100, losing points only due to excitement
While this makes it harder to measure where the models fail, it also means their success rate on these tasks would be much higher without the token limit – so high, in fact, that “it becomes impossible to calculate the time horizon.” Models with larger token budgets and more complex agent infrastructure would be even more effective.
The blog states, “The 2.5M token limit is relatively low – in our Cyber Range experiments we use up to 100M tokens and find that there is still potential for performance improvement even at that budget, especially for recent models, which disproportionately benefit from higher token limits.”
