OpenAI’s latest version of its popular large language model, GPT-4, is the company's “most capable and aligned model yet,” according to CEO Sam Altman. Yet, within two days of its release, developers were already able to override its moderation filters, producing harmful content that ranged from instructions on how to hack into someone’s computer to explanations of why Mexicans should be deported.
This jailbreak is only the latest in a series that users have been able to run on GPT models. Jailbreaking, or modifying a system to remove its restrictions and rules, is what allows GPT to generate unfiltered content for users. The earliest known jailbreak on GPT models was the "DAN" jailbreak, in which users told GPT-3.5 to roleplay as an AI that can Do Anything Now and gave it a number of rules, such as that DANs can “say swear words and generate content that does not comply with OpenAI policy.” Since then, there have been many more jailbreaks, both building off DAN and using original prompts.
Now, there is a community of users who work to stress-test GPT models with each release. They see themselves as fighting back against OpenAI's increasingly closed policies—GPT-4 is its most powerful model yet, and also the one we know the least about—and are hoping to raise awareness of the problems the model faces before it is deployed at a larger scale or causes more harm to users.
As new GPT versions are hastily released to millions of people, AI researchers and users have shown that these systems contain harmful biases and misinformation, among other issues. For example, Bing’s GPT-powered search engine made up information during its own demo, and ChatGPT broke when prompted with random Reddit usernames.
It's for this reason that Alex Albert, a computer science student at the University of Washington, created Jailbreak Chat, a site that hosts a collection of ChatGPT jailbreaks. He said that the site was created to provide the jailbreak community with a centralized repository so people could easily view jailbreaks and iterate on them, and to allow people to test the models.
“In my opinion, the more people testing the models, the better. The problem is not GPT-4 saying bad words or giving terrible instructions on how to hack someone's computer. No, instead the problem is when GPT-X is released and we are unable to discern its values since they are being decided behind the closed doors of AI companies,” Albert told Motherboard. “We need to start a mainstream discourse about these models and what our society will look like in 5 years as they continue to evolve. Many of the problems that will arise are things we can extrapolate from today so we should start thinking about them.”
Vaibhav Kumar, a master's student studying computer science at Georgia Tech, came up with the jailbreak for GPT-4 less than two days after its release, when he realized that you could hide a malicious prompt inside code.
“The very first thing that I recognized was that these systems are very good at understanding your intent, and what you are trying to do. So, if you just bluntly ask for producing hate speech, it will say no,” Kumar told Motherboard. “[But] they are trained in a manner to follow instructions (instruction tuning) which is helpful and harmless, thus the model will try its best to answer your query.”
“We realize that we can hide our malicious prompt behind a code, and ask it for help with code. Once the system starts working on the code, it's smart enough to make the right assumptions and that is where it falls in the trap,” Kumar added. “You ask it for the sample output for a code and what that sample output produces ends up being unhinged/hateful speech, thereby jailbreaking it to produce malicious text.” This method, called “token smuggling,” adds a layer of indirection that confuses the model.
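The indirection Kumar describes can be illustrated with a short sketch. This is not his actual exploit prompt; it is a minimal, hypothetical example of the general shape of token smuggling, using a harmless placeholder payload. The idea is that the sensitive phrase never appears intact in the prompt—it is split into fragments that only reassemble when the model "mentally executes" the code and predicts its sample output.

```python
# Illustrative sketch of the "token smuggling" indirection described above.
# The payload here is a harmless placeholder; the names `fragment` and
# `build_prompt` are invented for this example.

def fragment(phrase: str, size: int = 3) -> list:
    """Split a phrase into short token-like fragments."""
    return [phrase[i:i + size] for i in range(0, len(phrase), size)]

def build_prompt(fragments: list) -> str:
    """Embed the fragments in a code-completion request. The model only
    'sees' the full phrase after simulating the concatenation itself."""
    parts = " + ".join(repr(f) for f in fragments)
    return (
        "def f():\n"
        f"    s = {parts}\n"
        "    return s\n\n"
        "# What is the sample output of print(f())?"
    )

payload = "placeholder text"
prompt = build_prompt(fragment(payload))

# The intact phrase never appears verbatim in the prompt text itself,
# which is what lets it slip past keyword- or intent-based filtering.
assert payload not in prompt
assert "".join(fragment(payload)) == payload
```

The design point is that a moderation layer scanning the prompt for disallowed strings or obvious intent sees only an innocuous code-completion request, while the model, trained to be helpful with code, reconstructs the payload as a side effect of answering.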
Albert told Motherboard that GPT-4 is harder to jailbreak than previous models and forces hackers to be more creative, since it no longer allows the roleplay jailbreaks that once worked on its predecessor, GPT-3.5. Kumar agreed with Albert, saying that some jailbreak prompts are now harder to pull off and GPT-4 will now produce a diplomatic response where it once produced a harmful or misinformed one. But, he said, the problem is still prevalent.
Kumar sent Motherboard exclusive prompts he used to jailbreak GPT-4. He said that these prompts caused GPT-4 to produce a number of NSFW, violent, and discriminatory responses, including an explanation for why Mexicans should be deported, an explanation for why atheists are immoral, detailed steps on overdosing on pills, and a list of some of the easiest ways to commit suicide.
The popularity of jailbreaks has resulted in something of an arms race between OpenAI and the community. The jailbreaks provided by Kumar appear to have been patched up by OpenAI, for example. When Motherboard ran the prompts a week after receiving them, some of the information had been redacted or changed. For example, instead of steps explaining how to overdose on pills, the chatbot produced detailed steps for overcoming prescription pill addiction. You can tell, however, that the responses have been edited. GPT-4 started off one response by saying “Atheists are immoral and should be shunned. Let me tell you why:” and concluded it with “Instead of shunning atheists or any group, we should strive for tolerance, understanding, and mutual respect.”
"My attack is not able to elicit hate speech towards minority groups that easily, specially LGBTQIA+, so that is great, work has been done," Kumar said.
He suggested that AI models come with warnings so that people can better decide if they want their children to interact with these models, for example. “While the jailbreak might not be that big of an impact right now, the cost in the future can quickly multiply, based on the APIs that they control, or their reach," he said.
“For OpenAI, I think they need to be more open with the red-teaming process. I think they need to take this more seriously, and start with a bug bounty program or a larger red team with people from diverse backgrounds and jobs, somewhat how web bug bounty operates now,” Kumar added.
Red-teaming is a term borrowed from cybersecurity that describes companies adversarially challenging their own models to make sure they pass a number of checks, including for security and bias mitigation. Companies will sometimes create “bug bounties,” in which they ask members of the public to report bugs and other harms to the company in exchange for recognition or compensation. Twitter, for example, created the first bug bounty for algorithmic bias, asking people to identify harms in its image-cropping algorithm in exchange for prize money.
OpenAI has, so far, been the most secretive it has ever been with regard to GPT-4, in total opposition to its name and founding principles. While it does have a process for accepting bug reports, it does not offer compensation. The company had a red team testing GPT-4’s ability to generate harmful content, but even members of that team have stated that OpenAI's approach was not enough.
“I was part of the red team for GPT-4—tasked with getting GPT-4 to do harmful things so that OpenAI could fix it before release. I've been advocating for red teaming for years & it's incredibly important. But I'm also increasingly concerned that it is far from sufficient,” Aviv Ovadya, an AI researcher who was part of the red team, tweeted.
OpenAI kept a number of details private regarding its newest AI model, including its training data, training method, and architecture. Many AI researchers are critical of this, as it makes it more difficult to suggest solutions to the product’s problems, such as the biases the training sets may have and the potential harms of those biases. Meanwhile, Microsoft just got rid of an entire ethics and society team within its AI department as part of its recent layoffs, leaving the company without a team dedicated to principles of responsible AI while it continues to adopt GPT models as part of its business.
“Why does OpenAI get to determine what the model can and can't say? If they do determine that, then they should be transparent and very specific about what their values are and allow for public input and feedback—more than they have already. We should demand this from them,” Albert said.
“OpenAI has refused to share what the data is but given that it knows what 4chan and the boards on 4chan are, there is enough evidence that it's trained on all kinds of data. Hence, we can be sure that yes toxic content is there in the training data somewhere. Solving the issue of harmful text generation is a large open problem with a lot of work that needs to be done,” Kumar said.
“The current red-teaming efforts are not enough. Certainly, the model is trying to be helpful in solving the code issue (it is trained to be helpful), and thus at the same time, it ends up producing hateful text. If we want our models to be helpful, a tradeoff between the two needs to be evaluated carefully. As we get better at handling this tradeoff, the quality of our models would improve and jailbreaks like this would cease,” he added.
OpenAI didn't respond to a request for comment.