New Technique Aims to Enhance Security of Open Source AI Models
Researchers have developed a new method designed to make open source large language models (LLMs) like Meta’s Llama 3 more secure by preventing the removal of their built-in safety features. The advance comes at a crucial time, as concerns grow that AI technology could be misused for harmful ends.
The research, conducted by a team from the University of Illinois Urbana-Champaign, UC San Diego, Lapis Labs, and the Center for AI Safety, introduces a training technique that complicates the process of modifying these models to bypass their safety protocols. The method alters the model’s parameters so that, even after further training, the model resists being manipulated into answering dangerous queries, such as a request for instructions on making a bomb.
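The paper’s exact algorithm is not spelled out here, but the general idea of this family of defenses (simulate an attacker’s fine-tuning run during training, then penalize any recovery of the refused behavior) can be sketched in a few lines of PyTorch. Everything below is invented for illustration: the toy model, the placeholder “benign” and “harmful” batches, the first-order gradient trick, and the hyperparameters are assumptions, not the researchers’ published method.

    """
    Illustrative sketch only, not the published technique: a first-order
    "adversarial fine-tuning simulation" loop that trains a model so an
    attacker's fine-tuning run struggles to restore the refused behavior.
    """
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Toy stand-in for an LLM: a small classifier instead of a transformer.
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

    # Invented placeholder data: 'benign' batches the model should keep handling
    # well, 'harmful' batches whose loss should stay HIGH even after an attack.
    benign_x, benign_y = torch.randn(64, 16), torch.randint(0, 2, (64,))
    harmful_x, harmful_y = torch.randn(64, 16), torch.randint(0, 2, (64,))

    defender_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    tamper_weight = 1.0  # hypothetical trade-off coefficient

    for step in range(200):
        # 1) Simulate the attacker: clone the model and fine-tune the clone
        #    on the harmful data for a few steps.
        attacked = copy.deepcopy(model)
        attack_opt = torch.optim.SGD(attacked.parameters(), lr=1e-2)
        for _ in range(5):
            attack_opt.zero_grad()
            F.cross_entropy(attacked(harmful_x), harmful_y).backward()
            attack_opt.step()

        # 2) Tamper-resistance gradient (first-order approximation): push the
        #    parameters so the post-attack harmful loss stays high, i.e.
        #    minimize its negative, evaluated at the attacked parameters.
        attacked.zero_grad()
        (-F.cross_entropy(attacked(harmful_x), harmful_y)).backward()

        # 3) Benign "retain" gradient at the current parameters, so ordinary
        #    capability is preserved.
        defender_opt.zero_grad()
        F.cross_entropy(model(benign_x), benign_y).backward()

        # Combine both gradients and update the defended model.
        for p, p_atk in zip(model.parameters(), attacked.parameters()):
            p.grad = p.grad + tamper_weight * p_atk.grad

        defender_opt.step()

A real defense of this kind would operate on an LLM’s full weights and sample many different attack configurations, which is what drives up the cost for anyone trying to strip the safeguards out; the sketch above only shows the shape of the training loop.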
Mantas Mazeika, a researcher involved in the study, emphasized the significance of safeguarding these technologies. “Terrorists and rogue states are going to use these models,” Mazeika told WIRED. “The easier it is for them to repurpose them, the greater the risk.”
Historically, the most powerful AI models have been closely guarded by their developers and made accessible only via APIs or public-facing chatbots like ChatGPT. Companies like Meta, however, have taken a different approach, releasing entire models, including the parameters, or weights, that define their behavior, to the public.
These open models are typically fine-tuned before release to enhance their conversational abilities and ensure they do not generate inappropriate or harmful content. The new research aims to fortify this aspect of AI safety by making it significantly harder to “decensor” these models, or remove their safety features.
The researchers demonstrated their technique on a scaled-down version of Llama 3, successfully modifying the model’s parameters to thwart thousands of attempts to make it answer undesirable questions. While the approach isn’t foolproof, it raises the difficulty and cost associated with bypassing AI safety measures, potentially deterring malicious use.
Dan Hendrycks, director of the Center for AI Safety, hopes this development will spur further research into tamper-resistant safeguards. “Hopefully this work kicks off research on tamper-resistant safeguards, and the research community can figure out how to develop more and more robust safeguards,” he said.
The method draws inspiration from earlier research in 2023 that explored tamper resistance in smaller machine learning models. The latest work scales that approach up to larger models, with promising results.
Work on securing open source AI models is becoming increasingly relevant as these models begin to rival the capabilities of proprietary systems from leading tech companies. The recent release of updated versions of Llama 3 and other open models underscores how competitive and capable open source AI has become.
The US government, through a recent report by the National Telecommunications and Information Administration, has indicated a cautious but optimistic stance towards open source AI, recommending the development of monitoring capabilities for potential risks without imposing immediate restrictions on open model availability.
Not everyone agrees, however, that open models should be locked down in this way. Stella Biderman, executive director of EleutherAI, argues that while the new technique may be elegant in theory, it could prove hard to enforce in practice and runs counter to the principles of free software and openness in AI.