Viral image-generating AI tools like DALL-E and Stable Diffusion are powered by massive datasets of images that are scraped from the internet, and if one of those images is of you, there’s no easy way to opt out, even if you never explicitly agreed to have it posted online.
In one stark example of how sensitive images can end up powering these AI tools, a user found a medical image in the LAION dataset, which was used to train Stable Diffusion and Google’s Imagen. Ars Technica first reported the inclusion of the image in the dataset.
On the LAION Discord channel, a user expressed concern for a friend who found herself in the dataset through Have I Been Trained, a site that allows people to search the dataset. The person whose photo was found in the dataset said that a doctor photographed her nearly 10 years ago as part of clinical documentation, and she shared written proof that she only gave her doctor consent to have the image, not to share it. Somehow, the image ended up online and in the dataset anyway.
When the user asked who they would need to contact to have it removed, Romain Beaumont, one of the developers of the LAION dataset and a machine learning engineer at Google according to his LinkedIn profile, said, “The best way to remove an image from the internet is to ask for the hosting website to stop hosting it. We are not hosting any of these images.” When asked if he has a list of places the image is hosted, he responded, “Yes that’s what the dataset is. If you download it you get the full list of urls. On clip-retrieval demo or similar websites you can press right click see url to see the website.”
When asked what a user should do if they feel that their photos have been inappropriately scraped and are illegal or sensitive to have in the dataset, a spokesperson for LAION told Motherboard in an email, “In this case we would honestly be very happy to hear from them e.g. via firstname.lastname@example.org or our Discord server. We are very actively working on an improved system for handling takedown request.”
After Motherboard reached out for comment and published a story about violent images and non-consensual pornography being included in the LAION dataset, someone deleted the entire exchange from the Discord. Beaumont then said on Discord, “This room is now 100% dedicated to improving safety of the datasets. It is not made to be quoted by hit pieces by journalists. If you need it stated explicitly: we do not accept to be quoted.”
The incident highlights a few of the concerns with the gigantic datasets that are being used to train AI: By design, they scrape images that their creators do not own, the images may not be classified correctly, and copyright holders and subjects may not have given permission for the images to be used to train AI tools. Motherboard previously reported that the LAION-5B dataset, which has more than 5 billion images, includes photoshopped celebrity porn, hacked and stolen nonconsensual porn, and graphic images of ISIS beheadings. More mundanely, the datasets include living artists' artwork, photographers’ photos, medical imagery, and photos of people who presumably did not expect their images to suddenly end up as training material for an AI. There is currently no good way to opt out of being included in these datasets, and in the cases of tools like DALL-E and Midjourney, there is no way to know what images have been used to train the tools because they are not open-source.
On top of all of this, it is clear that, in LAION's case at least, developers on the project have not sufficiently grappled with why people may not want their images scraped by a massive AI and do not realize how difficult it can be to get nonconsensually-uploaded images removed from the internet. LAION pitches itself as "TRULY OPEN AI.," an open-source project that is being developed transparently that anyone can follow, contribute to, or weigh in on. And yet, when the project is criticized for privacy invasions, the open-source project has dealt with it by deleting Discord messages and suggesting journalists should not read its open Discord.
“A lot of these large datasets collect images from other datasets so it can be hard to find the original person who actually first collected the image, or first put the image into a dataset or first put the image out there. That makes it really hard because then, just as a legal matter, you don't know who to sue,” Tiffany Li, a technology attorney and assistant professor of law at the University of New Hampshire School of Law, told Motherboard. “And if you don't know who to sue, it's hard to find any sort of recourse and it's hard to punish wrongdoers.”
Not only is it hard to find the origin of a photo to request its deletion, but many people are still unaware that their photos are proliferating across sites and used for AI training purposes. It was only after a site called Have I Been Trained was trending online that people found out their images were being used in the dataset without their permission.
“Generally speaking, most people don't have access to these datasets, and most people don't know if their image has been used. I know there are tools now that help you figure this out. But you know, the average person is not going to go around policing every large machine learning data set out there to make sure their photo is using it. So you might not even know that you're actually being harmed and that's really problematic to me,” Li said.
Have I Been Trained, created by artist and musician Holly Herndon, makes it easy for people to search whether or not their images have been used to train AI. The website is very simple, featuring a search bar that allows you to either enter text or upload an image, and it returns matching images drawn from a dataset of 5.8 billion images called LAION-5B.
Mark Riedl, a professor at the Georgia Tech School of Interactive Computing, searched himself up on Have I Been Trained and found that several images of himself have been used to train AI text-to-image art models.
I searched my own name and did not find any pictures of myself, but got a result that said “These images were the closest matches from the LAION-5B training data,” featuring many images of different Asian women. When searching for someone who is significantly more well-known, such as Taylor Swift, tons of photos of them show up, in this case including images of Swift on red carpets, in concert, and in the street captured by paparazzi.
Noticeably missing from this site, however, is the ability to do something about the photos. I can click on each image to see its corresponding alt text caption, but there are no buttons that would allow you to delete or flag a specific image. There is, however, a “learn more” button at the top of the page that allows you to sign up to either opt in to or opt out of AI training. In exchange for your email address, Spawning, the group that created the site, promises to provide you with beta access to opt-in and opt-out tools. What those tools are and what they do remains unclear.
The creators of LAION-5B used Common Crawl, an open repository of web crawl data composed of over 50 billion web pages, to collect the images for their dataset. Then, LAION-5B and its predecessor LAION-400M (which contains 400 million images) were used in part to build text-to-image AI projects including Stable Diffusion and Google’s similar tool Imagen. When Motherboard reached out to these companies to ask if and how they would remove certain images from use, Google and Stability AI pointed us to the LAION team, while the LAION team directed us to Common Crawl.
Zack Marshall, an associate professor of Community Rehabilitation and Disability Studies at the University of Calgary, has been researching the spread of patient medical photographs in Google Image search results and found that in over 70 percent of case reports, at least one image from the report can be found on Google Images. Most patients do not know that they even end up in medical journals, and most clinicians do not know that their patient photographs could be found online.
“[Clinicians] wouldn't even know to warn their patients. Most of the consent forms say nothing about this. There are some consent forms that have been developed that will at least mention if you were photographed and published in a peer-reviewed journal online, [the image is] out of everybody's hands,” Marshall said. After hearing about the person whose patient photograph was found in the LAION-5B dataset, Marshall said he is trying to figure out what patients and clinicians can do to protect their images, especially as their images are now being used to train AI without their consent.
“A lot of developers actually are not intentionally trying to harm people by collecting photographs and putting them into these large data sets. A lot of developers just think, well, more data is better,” Li said. “I don't think they really think of the implications, especially the harm on individuals.”
So many parties are involved in the proliferation and misuse of a single photo that it is hard to track down and remove from the internet. On top of that, there is a growing gap in legal policy, which has not caught up with the copyright and privacy issues raised by the rapid pace of AI development.
“Fair Use is this idea that you should be able to use some copyrighted material to make new things and you can apply that idea to the use of images in large datasets, where the use of images [is used] to train machine learning systems. The problem is that it doesn't do that exactly,” Li explained. “So there are still a few questions that remain: What does it mean when you use images to train an algorithm? What sort of use is that? Is that similar to copying an image like you would with traditional arts? Or is it similar to making a new image that looks similar to the old image? So fair use is not super clear.”
“In terms of copyright law, generally, the people who are the subject of the photos may not always have the copyright for those photos, they may not always have the ability to even try for copyright protection in the first place,” Li said.
LAION claims that all data falls under Creative Commons CC-BY 4.0, which allows people to share and adapt material as long as they attribute the source, provide a link to the license, and indicate if changes were made. On its FAQ, it says that it’s simply indexing existing content and that, after computing similarity scores between pictures and texts, all photos are discarded.
LAION maintains that it essentially has a layer of separation between itself and the photos it's scraping; it says that it is merely indexing links to photos and not actually storing them. It says it believes, therefore, that it has no copyright responsibility and cannot be blamed for scraping explicit content. “LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images. While we downloaded and calculated CLIP embeddings of the pictures to compute similarity scores between pictures and texts, we subsequently discarded all the photos,” it says.
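To make that description concrete, here is a minimal sketch in Python of what such an index could look like. It is an illustration built only from the FAQ's wording, not LAION's actual pipeline: the `IndexEntry` type, the `build_index` function, the toy vectors standing in for CLIP embeddings, and the 0.3 cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class IndexEntry:
    """One row of a hypothetical URL index: no image bytes are kept."""
    url: str
    alt_text: str
    similarity: float


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(candidates, threshold=0.3):
    """Keep only the URL, ALT text, and score for well-matched pairs.

    `candidates` is an iterable of (url, alt_text, image_vec, text_vec)
    tuples. The vectors stand in for embeddings of the picture and its
    ALT text; the threshold value is an assumption for illustration.
    """
    index = []
    for url, alt_text, image_vec, text_vec in candidates:
        score = cosine_similarity(image_vec, text_vec)
        if score >= threshold:
            # Only the link and caption survive; the vectors (and the
            # image they were computed from) are not stored.
            index.append(IndexEntry(url, alt_text, score))
    return index
```

The point of the sketch is that an index like this stores only links and captions, so the image itself stays on whatever site hosts it. That is why a removal from the index does not remove the photo from the internet, and why the photo remains reachable for as long as the original URL stays live.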
The FAQ also states that if you are an EU citizen, you’re protected by the General Data Protection Regulation (GDPR), which imposes rules on organizations around the world if they target or collect data related to people in the EU. LAION says if your name is in the ALT text data but the corresponding image does not contain your image, it is not considered personal data and will not be removed. However, there is a takedown form on the LAION site that people can fill out, and the team will remove the image from all data repositories that it owns if it violates data protection regulations. LAION also supports the efforts of the Spawning team, which has been sending it bulk lists of links to images to remove.
All of this language that classifies LAION as a data gatherer rather than a creator extends their insistence that the real culprit isn’t them, but the internet at large that allows for the proliferation of nonconsensual and possibly copyrighted images. Yet, taking non-consensual images and allowing them to spawn into new, AI-generated imagery deepens a violation of privacy.
“As a professor, I have a public profile, so I am not terribly surprised to find myself in the dataset. I don't have a strong feeling one way or another,” Riedl told Motherboard. “But also as a cis white male I am in a position of privilege to not have to worry too much about bias or my data being used to generate inappropriate or harmful content.”
Motherboard found that some of the worst images that have ever been posted online are included in the dataset, including ISIS executing people and real nudes that were hacked from celebrities’ phones.
LAION’s site does not go into detail about the NSFW and violent images that appear in the dataset. It says that it does not “contain images that may be disturbing to viewers … but links in the dataset can lead to images that are disturbing or discomforting depending on the filter or search method employed.” The FAQ also says, “We cannot act on data that are not under our control, for example, past releases that circulate via torrents.” This sentence could potentially apply to something such as Scarlett Johansson’s leaked nudes, which already exist across the internet, and it disclaims the dataset creators’ control over them.
“I think that the responsibility has to be on the part of the developers of the AI and machine learning tools and on the people who are actually creating these datasets. The responsibility shouldn't be on the individual whose photo or data has been used. Because that's really hard. It's really hard for each individual to police where their photos are used when photos are so easily accessible on the internet. That's not a fair or just way to do it,” Li said. “I think the best way to do it is to put the responsibility on the people who are actually using these images.”
The Federal Trade Commission (FTC) has begun practicing algorithmic destruction, which means demanding that companies and organizations destroy the algorithms or AI models that they have built using personal information and data collected in bad faith or illegally. FTC Commissioner Rebecca Slaughter published a Yale Journal of Law and Technology article alongside other FTC lawyers that highlighted algorithmic destruction as an important tool that would target the ability of algorithms “to amplify injustice while simultaneously making injustice less detectable” through training their systems on datasets that already contain bias, including racist and sexist imagery.
“The premise is simple: when companies collect data illegally, they should not be able to profit from either the data or any algorithm developed using it,” they wrote in the article. “This innovative enforcement approach should send a clear message to companies engaging in illicit data collection in order to train AI models: Not worth it.”
“[Algorithmic destruction is], I think, an actual deterrent because that's going to cost money and time. So hopefully, that's an actual punishment that will get people to try to be more responsible, and more ethical with their AI,” Li said.