Over the last few months, an AI-generated art scene has exploded as hackers have been modifying an OpenAI model to make astonishing image-generation tools.
All you have to do to guide these systems is prompt them with a description of the image you want. For example, you might prompt them with the text: “a fantasy world.” With that prompt, the author of this article generated the image that you see above.
The crisp, coherent, and high-resolution quality of the images that these tools create differentiates them from the AI art tools that have come before. The tools are highly iterative—in the video below, you can see the generation of an image based on the words “a man being tortured to death by a demon.”
The primary engine inside the new tools is a state-of-the-art image-classifying AI called CLIP. It was announced in January by the company OpenAI, renowned for inventing GPT-3, which was itself announced only in May 2020. Given nothing more than a simple prompt, GPT-3 can generate text of a truly general, human-like nature.
While the new CLIP-based systems are reminiscent of GPT-3 in their “promptability,” their inner workings are quite different. CLIP was designed for a narrow task, albeit an extremely powerful one: it is a general-purpose image classifier that can judge how well an image corresponds with a prompt, for example, matching an image of an apple with the word ‘apple.’ But that is all. “It wasn't obvious that it could be used for generating art,” University of California, Berkeley computer science student Charlie Snell, who has been following the new scene, said in an interview.
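CLIP's matching can be pictured as comparing an image embedding against candidate text embeddings and picking the closest. The sketch below is purely illustrative: the vectors and names are made up, and cosine similarity over toy arrays stands in for CLIP's learned image and text encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    # Score in [-1, 1]: higher means a closer match.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings; real CLIP produces these with encoders
# trained on 400 million image/text pairs.
image_embedding = np.array([0.9, 0.1, 0.3])
text_embeddings = {
    "apple": np.array([0.8, 0.2, 0.3]),
    "bicycle": np.array([-0.1, 0.9, 0.2]),
}

# Classification: pick the caption whose embedding best matches the image.
best = max(
    text_embeddings,
    key=lambda t: cosine_similarity(image_embedding, text_embeddings[t]),
)
print(best)  # the toy "image" is closest to "apple"
```

That scoring step, and nothing more, is all CLIP itself does.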
But shortly after its release, hackers like Ryan Murdock, a machine learning artist and engineer, figured out how to connect other AIs up to CLIP, creating an image generator. “A couple of days after I started messing around with it, I realized that I could generate images,” Murdock said in an interview.
Over a series of weeks and months, hackers experimented with connecting CLIP to better and better AIs. On March 4, Murdock succeeded in connecting CLIP and VQ-GAN, another cutting-edge AI, which had been posted as a preprint in December 2020. “It took a lot of time to figure out how to make the system work well,” Murdock said. He continued to refine the system until it was able to produce crisp image outputs. Now, combinations of CLIP and VQ-GAN are the most widely used versions of the new tools.
These tools have recently become popular and have led to a new, computer-generated art scene.
“These are the first good ones that’ve been publicly available,” Snell said. “These systems are the first ones that actually sort of meet ‘the promise of text-to-image.’”
Snell thinks they are perhaps the biggest innovation in AI art since DeepDream, a 2015 AI that became widely used to create hallucinogenic renditions of imagery. “It's definitely the biggest thing I've seen,” Snell said.
Formerly, the most powerful public image generation tools were neural networks called generative adversarial networks, or GANs, of which VQ-GAN is one specific example. After training these networks on a large body of images, they can synthesize new images of a similar type. However, GANs by themselves cannot generate images via prompt. Other sorts of networks besides GANs can do prompting, but not very well. “They just weren’t very good,” Snell said. “This is sort of a novel approach.”
The new tools are readily available to anyone who wants to use them. On June 27, the Twitter user @images_ai tweeted a popular tutorial by the computer scientist Katherine Crowson on how to use one of the latest models. Following the instructions, a savvy user can run the system in a matter of minutes from a web-based programming notebook.
“The results are so shocking that for many they seem to defy belief,” Crowson said in an email. “CLIP is trained on 400 million image/text pairs,” she continued. “At that scale we begin to see abilities we had previously only seen in human artists such as abstraction and analogies.”
There is already a broad body of stunning work. There are beautiful images of abstract sunsets, for example. There are idyllic countryside houses and giant cities, as well. There are weapons depicted with an unsettling animosity, and Escher-type structures that reel away into themselves.
People have become fascinated by the capabilities of the tools and artists have begun to widely adopt them. “There’s a lot of buzz about it on machine learning and art Twitter,” Murdock said.
Users have started to develop an artistry specific to the tools. One of the quirks of the systems is that you must figure out how to optimize your prompt to generate an image closest to your intention. Snell has watched on his Twitter feed as artists have gradually evolved how they prompt.
“They're constantly trying new tweaks to it to try to make it better,” Snell said. “And it is getting better. Every week, it feels like there's some improvement they find.”
The new tools do have limitations, such as the size of their generated images. The images themselves can often be unexpected and weird. But the fact that the tools could be built at all was a surprise.
On the same day they announced CLIP, OpenAI also announced a powerful AI called DALL·E that was directly designed for image generation. They released a handful of its results, which made it appear akin to a true image-generation analog of GPT-3, something that expertly created images of anything. However, OpenAI did not release DALL·E to the public, neither its code nor the trained model, which was likely very expensive to produce. By contrast, OpenAI released the CLIP model in its entirety. “The hardware to produce these neural nets is relatively inexpensive,” Crowson said.
The new tools have shown that CLIP provides a kind of backdoor method to reproducing the abilities of DALL·E. Given the fact that OpenAI withheld DALL·E, it would seem that the company might have been caught off guard. “I definitely suspect that they're a little bit surprised that it could do all this,” Murdock said.
Snell described it this way: “They teased us with DALL·E. They were like, ‘We have this thing.’ And they didn't release it,” he said. “And everyone was like, ‘We want it though.’ So then they just sort of made it themselves.”
The hacked-together CLIP-based tools work very differently from DALL·E. DALL·E directly produces images that correspond with text. By contrast, Snell describes the CLIP-based systems as something more like AI interpretation tools. As VQ-GAN and CLIP work together, the first model builds an image, and the second model scores how well it matches the prompt. The two iterate until the image matches the prompt as closely as possible. The iteration says something about the imagery that CLIP associates with certain words.
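The build-then-score loop can be sketched as a simple optimization. In the toy version below, everything is a stand-in: a plain vector plays the role of VQ-GAN's latent image, and a made-up quadratic score plays the role of CLIP's match against the prompt; the real systems instead backpropagate CLIP's score through a rendered VQ-GAN image in a framework like PyTorch.

```python
import numpy as np

# Stand-in for the prompt: in real systems this would be the CLIP
# text embedding of, say, "a fantasy world".
target = np.array([1.0, -2.0, 0.5])

def clip_score(latent):
    # Toy score, peaking when the latent equals the target;
    # stands in for CLIP's image/text similarity.
    return -np.sum((latent - target) ** 2)

def clip_score_grad(latent):
    # Analytic gradient of the toy score.
    return -2.0 * (latent - target)

# Stand-in for VQ-GAN's latent: start from a random point and repeatedly
# nudge it uphill on the score, just as CLIP's feedback steers VQ-GAN.
rng = np.random.default_rng(0)
latent = rng.normal(size=3)
for step in range(200):
    latent += 0.05 * clip_score_grad(latent)  # gradient ascent

print(np.round(latent, 3))  # the latent has been pulled toward the target
```

The key idea the sketch preserves is that the generator never "knows" the prompt directly; it only ever receives a score and a direction of improvement from the matcher.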
The CLIP-based models are therefore an entirely new kind of art tool, a new kind of computer paint brush. They do not quite feel like a perfect one yet, Snell points out. “You have some control over it, but you can't fully control it. And you're always like, a little bit surprised.” But that human-like quality of ingenuity is a big part of the new tools’ appeal.
It remains to be seen what impact they will have. It seems as though it will be easy for companies and collaborations to improve the tools greatly, given that the current ones have been made mostly by individuals. But they are already very powerful. Many people seem likely to adopt them for art, work, and fun. Creating art has now become as simple as using language, enabling anyone to be their own kind of lyrical Picasso.