Class of AI Models Hyped as Scarily Powerful Apparently Scared the Government Too Much and Now They’re Disabled



According to a statement posted to its website on Friday, Anthropic was forced to “abruptly disable” two of its most prized frontier AI models in response to a highly restrictive U.S. government order. “We believe this is a misunderstanding and are working to restore access as soon as possible,” the statement says.

The government action in question is an “export control directive” saying foreign nationals may not use the models inside or outside of the U.S., and it was motivated by what Anthropic says was an unspecified national security concern.

But national security concerns, and other security and safety fears, have been at the center of the rollout of these models, which arguably made an event like this foreseeable.

Rather than releasing its Claude Mythos Preview model to the public, in early April, Anthropic turned the creation of the model into a sort of consciousness-raising campaign about the ostensible dangers of frontier AI models.

It released a system card explaining why the model wouldn’t be made publicly available, and detailing scary capabilities like deceptiveness and the ability to supposedly break containment from a limited system. It was also purportedly able to be helpful in the development of advanced weapons. For instance, the system card described it as “capable of significant cross-domain synthesis relevant to catastrophic biological weapons development.”

At the same time, the company rolled out Project Glasswing, a program in which a limited group of partners and organizations were allowed to sample the model in order to learn what new horrors it could inflict on the world of cybersecurity. “We formed Project Glasswing because of capabilities we’ve observed in a new frontier model trained by Anthropic that we believe could reshape cybersecurity,” the Anthropic blog post about Project Glasswing says.

Soon, despite the inherent nerdiness of the topic, Mythos Preview was a tabloid story. An article in the New York Post cited computer scientist Roman Yampolskiy prophesying that in like of what Mythos heralds, AI may soon develop “hacking tools, biological weapons, chemical weapons, [and] novel weapons we can’t even envision.” The phrase “Weapons we can’t even envision” even made it into the headline.

British government officials and leaders in the U.K. finance sector scrambled to form an action plan in light of the perceived danger. According to the New York Times, the Trump Administration’s “noninterventionist policy” toward AI changed after the announcement of Mythos, and its mere existence helped lead to the development of a safety-focused AI executive order. Trump signed one such order about a week ago.

Nonetheless, last week Anthropic released Claude Fable 5 and Mythos 5. The company described Fable 5 as “a Mythos-class model that we’ve made safe for general use,” but with capabilities that “exceed those of any model we’ve ever made generally available.” Mythos 5, meanwhile, got a very limited release as part of Project Glasswing.

Brian Merchant over at Blood in the Machine described it like this:

After sparking a major news cycle in the tech media with its April announcement that it had built an AI model, Mythos, so powerful, so dangerous that it threatened to upend the entire civilizational order—and that it was diligently withholding the product from the public so as to protect us from it—the nation’s now-#1 AI startup decided to put Mythos up for sale after all.

Hours after Merchant wrote those words, the export control directive was delivered to Anthropic, and Fable 5 and Mythos 5 were made inaccessible due to apparent national security concerns. It appears Anthropic was only ordered to revoke access for users who are not U.S. nationals, but it’s understandable that Anthropic would find it impractical to let anyone access them anywhere in the world for fear of disobeying the order. Among many issues, non-U.S. nationals work at Anthropic. It’s clearly simpler to just pull the models entirely until the situation is resolved.

Interestingly, Anthropic’s statement about the export control directive noted that Anthropic had “worked with the US government,” along with the U.K. government, and “multiple private third-party organizations” in an effort to create a satisfactory set of safeguards for the models. Upon release, the safeguards were, in many ways, the most prominent feature of the media narrative around Fable 5. One of the tougher guardrails, designed to quietly punish users who abused the model, was even deemed ill-conceived, prompting Anthropic to apologize.

But in Anthropic’s telling, the government was spooked after learning about a jailbreak for Fable 5 that bypassed those all-important safeguards:

“Our understanding is that the government believes it has become aware of a method of bypassing, or ‘jailbreaking’ Fable 5. We reviewed a demonstration of this specific technique being used to identify a small number of previously known, minor vulnerabilities. These vulnerabilities all appear relatively simple, and we have found that other publicly-available models are able to discover them as well without requiring a bypass.

Anthtropic points out, quite rightly, that when it released Fable 5, the section in its blog post about the safety of the model made it clear that some jailbreaks were still possible. It’s “likely impossible to completely prevent universal jailbreaks, but our goal is to make any remaining jailbreaks sufficiently slow and costly that we can detect and prevent them before they are used at scale,” Anthropic wrote. Essentially, since making a model perfectly jailbreak-proof is not yet possible, Anthropic sought to make jailbreaks either costly to produce, or too “narrow” to be a threat. Anthropic is also public about the fact that it retains the data of users of Mythos-class models much more than usual.

Nonetheless, it’s odd to see Anthropic now downplaying the significance of its models’ perceived dangers, and writing that these vulnerabilities are “minor,” “previously known,” and “relatively simple,” as well as pointing out that “other publicly-available models are able to discover them as well without requiring a bypass.”

Again, when Anthropic first publicized this class of models, it told the world it had created something of unprecedented power with the potential to do real harm to the world. Two months later, a “Mythos-class” model was a product for public consumption, available as a premium product for users of the “Pro, Max, Team, and seat-based Enterprise plans at no extra cost” but only for a limited time. On June 23, Anthropic’s intent was to “remove Fable 5 from those plans,” and require a pay-as-you-go plan instead.

Anthropic claims that government actions like this could, if they became standard, “halt all new model deployments for all frontier model providers.” And perhaps that’s true. For a product release to be halted when the rollout of that product involved a precursor piece of technology that supposedly merited a global reassessment of cybersecurity, an overreaction to holes in that product’s safeguards should probably come as no surprise, even if that overreaction is bad for business.



Source link

You may be interested

Leave a Reply

Your email address will not be published. Required fields are marked *