Chapter 11: Once AI Takes Your Data You Can’t Get It Back
Have you ever wondered what happens to a photo after you post it online? Not the version your friends see and scroll past. The other version. The one a machine copied, cataloged, and absorbed into a system now worth billions of dollars. The one you never agreed to give away.
Here is a question worth sitting with for a moment. If a stranger walked into your home, photographed every room, recorded every conversation, photocopied your medical records, your resume, your diary entries, your family snapshots, and then used all of those materials to build a commercial product generating billions in revenue, you would call the police. You would call a lawyer. You would call your representative in Congress.
You would be furious.
That is exactly what happened to you. The stranger was an automated web crawler. Your home was the internet. And the commercial product is the artificial intelligence you interact with every single day.
Every major AI system in existence today, the ones generating text, producing images, answering your questions, and writing your emails, was built on a foundation of personal information scraped from hundreds of millions of people who never gave their permission. Your blog posts. Your photos. Your forum questions about a medical condition you did not want anyone else to know about. Your resume listing your disability status, your date of birth, your home address.
Researchers who examined a tiny fraction of one major training dataset, one tenth of one percent, found thousands of images containing passports, credit cards, driver's licenses, and birth certificates belonging to real people, and estimated that the full dataset holds hundreds of millions of images with personal information. They found over 800 job applications linked to real individuals through their professional profiles. The face detection algorithm designed to blur identifiable faces had missed an estimated 102 million of them.
And here is the part that should make every American pay close attention. Once your data enters one of these AI models, getting your information back out is, for all practical purposes, impossible. The companies themselves admit this. The technology to undo the process does not exist. Over 180 academic papers have studied the problem, and the conclusion is the same every single time. Your data went in through a one-way door.
This chapter is about that door. How your personal information ended up inside AI systems you never agreed to participate in, who profited from the taking, what the law does and does not protect, and what you need to do right now to limit the damage going forward.
The Pipeline From Your Life to Their Product
The path from a personal blog post to an AI model follows a specific series of steps, and understanding those steps matters because they reveal how deliberate and systematic the data collection has been.
Automated web crawlers, led by an operation called Common Crawl, systematically download billions of web pages every single month. Common Crawl, a nonprofit founded in 2008, has amassed over 9.5 petabytes of raw web data from more than 100 billion pages. A single monthly crawl in August 2025 added 2.42 billion pages totaling 419 terabytes of information. OpenAI and Anthropic each donated $250,000 to Common Crawl in 2023, funding the exact infrastructure that feeds their commercial AI products.
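If you publish anything under your own domain, you can watch this machinery at work. The short sketch below queries Common Crawl's public CDX index API and lists any captures of your pages in one monthly crawl; the crawl label and domain shown are placeholders, and current crawl labels are published at index.commoncrawl.org.

```python
# A minimal sketch: ask Common Crawl's public CDX index whether pages from a
# domain you control were captured in one monthly crawl. The crawl label and
# domain below are placeholders, not recommendations.
import requests

CRAWL = "CC-MAIN-2024-33"      # one monthly crawl; swap in a current label
DOMAIN = "example.com"         # replace with a domain you control

response = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": f"{DOMAIN}/*", "output": "json", "limit": "20"},
    timeout=30,
)
for record in response.text.strip().splitlines():
    print(record)              # each line is a JSON record for one captured page
```

An empty result for one crawl does not mean your pages were never collected; each monthly crawl is a separate archive, and the derivative datasets described below were filtered from many of them.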
Raw crawl data does not go directly into AI models. Developers filter and process the data into derivative datasets, and this is where personal information gets baked into the foundation. Google created a dataset called C4 from 15 million websites in a single crawl, producing 750 gigabytes of text used to train multiple AI models. Investigators found C4 contained personal blogs, medical forums, paywalled journalism, and personal information scattered throughout the data. Over 80 percent of the training material for GPT-3 came from filtered Common Crawl data. A study found at least 64 percent of 47 major language models published between 2019 and 2023 used Common Crawl as a source.
For AI image generators, the story centers on a dataset called LAION-5B, a collection of 5.85 billion image and text pairs assembled by a German nonprofit. A high school teacher led the project. The team built the dataset for roughly $10,000. The images were harvested automatically from Common Crawl, meaning no human being ever reviewed a single one of those 5.85 billion images before they became training material for products like Stable Diffusion and Google's Imagen. At a rate of one second per image, forty hours a week, reviewing the full dataset would take 781 years.
Another dataset called The Pile, an 825 gigabyte text collection, drew from 22 different sources including 196,640 pirated books, real employee emails from a federal investigation, GitHub code repositories containing developer passwords and credentials, YouTube subtitles, medical abstracts, and question and answer posts from technical forums. Every one of these sources contains names, contact details, health information, financial discussions, and proprietary code.
The legal distinction at the heart of all of this is deceptively simple. "Publicly available" does not mean "consented to AI training." A blog post is publicly available in the sense that anyone browsing the internet sees the content. A family photo on social media is publicly available in that sense too. A question about a medical condition on a health forum is publicly available. None of these were posted so that a corporation worth hundreds of billions of dollars would ingest them into a commercial product. As one former executive at an AI company stated after resigning over this exact issue, all these companies are saying is that they have not illegally hacked into a system. A remarkably low bar.
What They Found When They Looked Inside
The scale of personal information embedded in AI training data is staggering, and the problem only becomes clearer the closer researchers look.
In July 2025, a team of researchers from the University of Washington audited a training dataset called DataComp CommonPool, a collection of 12.8 billion samples downloaded more than two million times since its 2023 release. They examined one tenth of one percent of the dataset. In that tiny sample, they found thousands of images containing real identity documents. Passports. Credit cards. Driver's licenses. Birth certificates. They found hundreds of confirmed job applications linked to real people, with those resumes disclosing disability status, background check results, birth dates of dependents, and race. The face blurring algorithm in the dataset had missed an estimated 102 million faces. One of the researchers summed up the finding simply: anything you put online has probably been scraped.
Research from Google DeepMind has demonstrated how much of this personal data AI models retain. In a landmark study, researchers extracted hundreds of word-for-word text sequences from an AI model, including real names, phone numbers, email addresses, physical addresses, and private conversations, some from documents appearing only once in the training data. Follow-up research established three consistent patterns: AI models memorize more data as they get larger, they memorize more when data appears multiple times, and they memorize more when given longer prompts. In November 2023, the same research team spent $200 querying ChatGPT and extracted over 10,000 unique memorized training examples. Of the outputs they tested, roughly 17 percent contained memorized personal information, and 86 percent of flagged content turned out to be real personal details belonging to real people, including a CEO's email signature with personal contact information.
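To make the mechanics of that kind of extraction concrete, here is a minimal sketch of the basic test: give a model the opening of a passage it may have seen during training and check whether it reproduces the rest verbatim. This is an illustration only, not the researchers' actual method; the small open GPT-2 model, the file name, and the split point are stand-ins.

```python
# Illustrative memorization check: prompt a model with the start of a passage
# and see whether it completes the remainder word for word.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A passage you suspect appeared in the training data, split into a prompt
# and a held-back remainder the model should not be able to reproduce.
known_passage = open("suspect_passage.txt").read()
prefix, true_suffix = known_passage[:300], known_passage[300:]

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
continuation = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])

# A verbatim match with the held-back text means the model memorized the
# passage rather than merely learning general patterns from it.
print("verbatim match:", continuation.strip().startswith(true_suffix.strip()[:80]))
```

Greedy decoding (do_sample=False) is used because memorized text tends to surface when the model simply picks its most likely next word at every step.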
Image generators present the same risks. Researchers extracted over 1,000 training examples from Stable Diffusion and Google's Imagen, including photographs of identifiable individuals. When prompted with a specific person's name, Stable Diffusion produced that person's exact photograph from the training data. People with unusual names face elevated risk because their images are more uniquely associated with their identity in the dataset.
The most disturbing discovery came from the Stanford Internet Observatory. In December 2023, researchers confirmed that LAION-5B contained at least 1,008 validated instances of child sexual abuse material, with over 3,200 total suspected instances. Internal communications showed the team behind the dataset knew about this risk as early as 2021. The discovery upended a core assumption in the field: researchers had believed AI generated abusive imagery combined adult content with benign children's photos, when in reality the abusive material had been in the training data all along. The dataset was pulled offline and a cleaned version was released in August 2024. Every model already trained on the contaminated data carried its influence permanently. In June 2024, a human rights organization found identifiable photos of real children from personal blogs and low-traffic YouTube videos in that same dataset.

This is why you should search "Have I Been Trained" at spawning.ai and check whether your images or your children's images appear in these datasets. The tool searches LAION-5B and lets you flag images for removal from future training sets.
And in September 2022, a San Francisco artist discovered something deeply personal. Clinical before and after photos of her face, taken by her doctor in 2013, had been scraped into LAION-5B. Her doctor had died in 2018. Someone had taken the images from the deceased doctor's files, posted them somewhere online, and the automated crawlers swept them into a dataset used to build commercial products. As she told reporters, having a photo leaked is bad enough, and now her medical images are part of a product.
The Courtroom Reckoning
A tidal wave of litigation is now testing whether American law has any answer for what happened. As of early 2026, over 70 copyright and privacy lawsuits have been filed against AI companies, double the count from late 2024. The legal theories range from copyright infringement to wiretapping to wrongful death.
The highest profile case involves The New York Times suing Microsoft and OpenAI, alleging the companies used millions of copyrighted articles to train AI without consent. The Times seeks billions in damages. A federal judge in New York is overseeing a consolidated set of 16 or more related lawsuits including cases brought by the Authors Guild, individual novelists, investigative journalists, and media organizations. In January 2026, the court compelled OpenAI to produce a full sample of 20 million anonymized user logs over the company's strong objections, marking a major victory for the people bringing these claims.
Three significant rulings in 2025 began shaping the legal terrain. In one case, a federal judge found a legal AI tool's use of copyrighted headnotes was not fair use, a win for the people whose work was taken. In another, a judge granted summary judgment finding that using books to train an AI model was "highly transformative," a narrow ruling favoring the AI company. Most significantly, a federal judge ruled that AI training itself was fair use, and then in that same case ruled that the AI company's downloading of over 7 million pirated books from shadow libraries was not fair use. That distinction produced the largest AI related settlement to date: $1.5 billion, covering approximately 465,000 pirated works at roughly $3,000 per book. No appellate court has ruled on fair use in AI training, which means the fundamental legal question remains unresolved.
The visual arts produced the first AI image generation case to reach the discovery phase. Three artists filed suit in January 2023 alleging that AI companies scraped billions of images through LAION-5B. A federal judge allowed core copyright claims to proceed, finding the artists had reasonably argued their rights were violated. A landmark ruling in the United Kingdom held that AI model weights are not "infringing copies" of training images because the model does not store visual information in a retrievable way, and that ruling left the central question of whether AI training infringes copyright entirely unresolved.
Privacy-focused litigation is expanding rapidly. One lawsuit filed on behalf of anonymous plaintiffs including a six-year-old boy seeks $3 billion for alleged scraping of personal data from hundreds of millions of internet users, including children. At least eight wrongful death lawsuits against OpenAI were pending as of early 2026, alleging the chatbot served as encouragement for vulnerable users to harm themselves. And Clearview AI, the company that scraped approximately 50 billion facial images from the public internet to build a facial recognition database, produced a federal class action settlement granting affected individuals a 23 percent equity stake in the company, valued at approximately $51.75 million. Twenty-two state attorneys general filed a brief calling that settlement inadequate.
What Regulators Are Doing, and What They Are Not
The Federal Trade Commission has established a tool called algorithmic disgorgement, which forces companies to delete AI models trained on improperly collected data. The FTC first used this approach against Cambridge Analytica in 2019. Since then, the agency has ordered model deletion in cases involving face recognition data collected without consent, children's data, and consumer photos. In one case, the FTC banned a pharmacy chain from using facial recognition for five years after finding its AI produced false identifications that disproportionately affected people of color, and the agency required deletion of all consumer photos, models, and algorithms derived from those photos.
In September 2025, the FTC launched an inquiry into AI companion chatbots, ordering seven companies to disclose their data collection and safety practices. The inquiry specifically targets how personal information from user conversations gets collected, used, and shared. At the same time, the current administration has shown a willingness to pull back from enforcement seen as burdening AI development, reversing a consent order against one AI company in December 2025 and signaling a deregulatory approach through an executive order seeking to override state AI laws. If you have a complaint about an AI company's use of your personal data, file the complaint with the FTC at ftc.gov/complaint. These complaints create a public record and feed directly into future enforcement decisions.
At the state level, Texas has emerged as the most aggressive enforcer, securing a $1.375 billion settlement with Google and a billion dollar plus settlement with Meta for unlawful collection of biometric data. The Texas attorney general also launched investigations into multiple AI companies over children's privacy and into General Motors for selling driving data of 1.5 million Texans. California has extended its consumer privacy protections to AI through legislation clarifying that deletion rights apply to personal information in AI systems. California's AI Training Data Transparency Act requires AI developers to publish summaries of their training datasets, including whether they contain copyrighted material or personal information. If you live in California, your deletion rights under the California Consumer Privacy Act now explicitly cover AI systems under legislation that took effect in January 2025, with penalties of $7,500 per intentional violation per consumer.
Europe has moved further and faster than the United States on every front. Italy's data protection authority fined OpenAI 15 million euros for processing personal data to train ChatGPT without a lawful basis. The European Data Protection Board declared that AI models trained on personal data will in most cases be subject to the full force of European privacy law, rejected arguments that language models are inherently anonymous, and confirmed that regulators have the authority to order erasure of entire AI models trained on unlawful data. South Korea ordered an AI company to destroy a model trained on consumer data transferred without consent, one of the first times any government has forced the deletion of an AI model. These international actions matter for Americans because they demonstrate what is achievable when regulators have real authority. The United States still has no equivalent federal privacy law giving regulators that same authority.
Why No Law Covers What Happened to You
The United States still has no federal privacy law covering all Americans. The American Privacy Rights Act stalled in 2024. The AI CONSENT Act, which would have required your explicit permission before your data was used for AI training, did not advance. Over 150 AI related bills were introduced in the last congressional session. None passed.
States have tried to fill the gap. Twenty states now have consumer privacy laws on the books, up from five in 2023. California leads with at least nine AI related laws enacted in 2024 and 2025, including legislation requiring the largest AI developers to publish risk frameworks and report safety incidents. Colorado passed the first broad state AI law, signed in May 2024, with an effective date pushed back to June 2026 after industry opposition. Texas enacted an AI governance law taking effect in January 2026. Illinois modified its groundbreaking biometric privacy law in 2024, moderating its per-scan damages structure, and that law continues to generate hundreds of AI related lawsuits. The provisions of these state laws vary enormously, creating a maze designed for corporate legal departments, not for the people the laws are supposed to protect.
The Consortium of Privacy Regulators, launched in April 2025 as a coalition of the California Privacy Protection Agency and state attorneys general from nine states, represents a coordinated enforcement effort worth following. California's CPPA has issued over $100 million in CCPA penalties. The Authors Guild at authorsguild.org tracks all AI class actions and provides guidance for writers whose work has been scraped. These are the closest things to organized resistance at the legal level.
The Consent Fiction Fueling the Entire System
The entire AI data crisis rests on a fiction so deeply embedded in the technology industry that few people even recognize it for what it is. That fiction is consent. Every time you click "I Agree" on a terms of service update, every time you scroll past a privacy policy notification, every time a platform changes its data practices and sends you a notice you never read, the company records your silence as permission.
A June 2025 paper from researchers at Hugging Face, one of the largest AI model sharing platforms, identified three fundamental problems with consent in the AI context. First, the scope problem: you cannot meaningfully consent to all possible outputs an AI model will produce using your data, because nobody, not even the developers, knows what those outputs will be. Second, the temporality problem: consent given today creates representations of your data that persist for decades inside model weights. Third, the autonomy problem: individual people lack the technical knowledge to understand what they are consenting to, and the companies providing the consent forms know this.
Pew Research has found 56 percent of Americans always or almost always agree to privacy policies without reading them, and 81 percent assume organizations will use their information in ways they would not be comfortable with. When LinkedIn silently began using employment data for AI training, when Zoom quietly added AI training rights to its terms of service in 2023, when Meta sent 2 billion opt out notifications to Europeans and offered Americans nothing, these were not acts of informed consent. These were exercises in compliance theater, performances designed to satisfy legal requirements and nothing more.
The $278 billion data broker industry feeds directly into AI training pipelines. AI companies obtain training data not only through web scraping. They also purchase curated datasets from brokers. Criminal enterprises have created their own AI powered tools using stolen personal data to automate fraud. AI voice cloning fraud jumped over 400 percent in 2025, with modern systems able to clone a person's voice from 3 to 5 seconds of audio harvested from social media or voicemails. Documented losses from deepfake enabled fraud exceeded $200 million in the first quarter of 2025 alone. Only 33 percent of consumers trust companies with data collected through AI technology. And the scraping continues every single day.
The One-Way Door: Why Your Data Stays Inside the Machine
This is the most uncomfortable reality in this entire situation, and every American needs to understand the science behind the problem. Once your personal data has been used to train an AI model, removing its influence is essentially impossible with the technology that exists today.
Over 180 academic papers have been published on this problem since 2021. The research field is called machine unlearning. No reliable solution exists.
The fundamental issue is that training a neural network is a one-way transformation. Each piece of data adjusts millions or billions of numerical parameters simultaneously. Your text, your photo, your resume: none of it sits in a single location inside the model. Its influence spreads across the entire network. Think of mixing blue paint into yellow paint to make green. The blue cannot be unmixed. A landmark December 2024 paper authored by over 30 researchers from Google DeepMind, Stanford, Harvard, Cornell, and Microsoft Research concluded that machine unlearning is not a general-purpose solution for controlling AI model behavior.
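For readers who want to see the principle in miniature, the sketch below trains the same tiny model twice, once with and once without a single record, and compares the learned weights. The random data and trivial model are assumptions for illustration; no production AI system works at this scale.

```python
# Toy illustration: removing one example is not like deleting a database row.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                    # 1,000 "people", 20 features each
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=1000)

def train(features, targets):
    # ordinary least squares: every example pulls on every weight
    return np.linalg.lstsq(features, targets, rcond=None)[0]

weights_with_you = train(X, y)
weights_without_you = train(X[1:], y[1:])          # drop the first record

# The difference shows up in every parameter: one record's influence is
# smeared across the whole model, not stored in any single place.
changed = (weights_with_you != weights_without_you).sum()
print(changed, "of 20 weights changed")
```

The individual differences are tiny, which is exactly the problem: the influence of any one person is everywhere and nowhere, so there is no single piece that can be cut out.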
So what do companies do when you ask them to delete your data? Mostly, they filter what the model outputs rather than changing what the model knows. Training GPT-4 cost over $100 million. Google's Gemini Ultra cost an estimated $191 million. Retraining an entire model from scratch to honor a single deletion request is economically absurd, and every AI company knows this. Italy's data authority noted that OpenAI admitted correcting inaccurate AI generated personal data is "technically impossible." The company offers output suppression, not actual removal. A 2025 analysis of approximately 22,000 formal data deletion requests found that only 48 percent resulted in verified deletion by year's end.
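What output suppression looks like in practice can be sketched in a few lines. This is a deliberately simplified illustration, not a description of any company's actual filter: a pattern matcher redacts contact details on the way out while the model's weights, and whatever personal information they encode, remain untouched.

```python
# Minimal sketch of output suppression: filter the text, not the model.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[email removed]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[phone removed]"),
]

def suppress(model_output: str) -> str:
    for pattern, replacement in PATTERNS:
        model_output = pattern.sub(replacement, model_output)
    return model_output

print(suppress("Reach me at jane.doe@example.com or 555-867-5309."))
# The underlying model still "knows" the redacted details; only the output changes.
```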
This creates a direct conflict with privacy law. The European Union's right to erasure and California's deletion rights were designed for databases, not neural networks. The European Data Protection Board has acknowledged that AI models are compressed versions of their training data and insists that technical difficulty alone does not exempt companies from following the law. California's legislation extends deletion rights to AI systems capable of outputting personal information. The gap between what the law requires and what the technology allows remains enormous.
What You Need To Do Right Now
The protective tools available to individuals today are real but limited, and honesty about those limitations matters more than false comfort.
Start by checking whether your images appear in AI training datasets. The "Have I Been Trained" tool at spawning.ai lets you search the LAION-5B dataset by uploading an image or entering keywords. This tool helped the San Francisco artist discover her medical photos in the dataset, and the same tool played a role in the artist lawsuit against AI image generators. You should search for photos of yourself and your family members. If you find your images, you have the option to flag them for removal from future training sets. Stability AI committed to respecting the Do Not Train registry maintained by the same organization for its newest image generation model.
Next, exercise your opt out rights on every AI platform you use. OpenAI's privacy portal allows you to toggle off training data use and request personal data removal, and the company reviews those requests individually. LinkedIn buries a "Data For Generative AI Improvement" toggle in its settings that you need to find and disable manually. Meta provides no formal opt out mechanism for users in the United States, which tells you everything you need to know about the company's priorities.
In nearly every case, these opt outs are not retroactive. They only apply going forward. They default to allowing training. They require you to take action on each platform separately. And they do nothing about data already baked into existing models.
For your websites and online content, adding a robots.txt file telling AI crawlers to stay away is the most widely discussed defense. Know, however, that a 2025 study found several AI crawlers never even check for the file. One AI search company has been specifically accused of ignoring robots.txt entirely. Reports have documented companies changing the names of their crawlers to get around blocking. Only about 8 percent of websites successfully block all automated scraping requests. This defense is worth implementing, but do not treat it as reliable.
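If you want to put the defense in place anyway, the directives are short, and Python's standard library can show you how a crawler that respects the file would interpret it. The user agent tokens below, GPTBot, CCBot, and Google-Extended, are published by OpenAI, Common Crawl, and Google respectively; whether any particular crawler actually honors them is outside your control.

```python
# Example robots.txt directives asking AI crawlers to stay away:
#
#   User-agent: GPTBot
#   Disallow: /
#
#   User-agent: CCBot
#   Disallow: /
#
#   User-agent: Google-Extended
#   Disallow: /
#
# The check below uses Python's standard robots.txt parser to confirm how a
# well-behaved crawler would read your live file. Replace the domain with your own.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

for agent in ["GPTBot", "CCBot", "Google-Extended"]:
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```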
If you are a visual artist, the Glaze and Nightshade tools from the University of Chicago represent the strongest protections available today. Glaze adds imperceptible changes to your artwork that confuse AI models about your artistic style. Roughly 7.5 million people have downloaded the tool. Nightshade goes further, embedding misleading associations into images so that if an AI model trains on them, the model's outputs degrade. As few as 50 poisoned images have been shown to disrupt a model's performance. Be aware that in July 2025, researchers demonstrated a countermeasure capable of stripping Nightshade protections with 99.98 percent accuracy, which means this is an ongoing arms race between protection tools and AI companies.
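For the technically curious, the idea underneath both tools is adversarial perturbation: tiny pixel changes chosen so that a model reads the image differently while a person sees almost no change. The sketch below illustrates that general idea against an off-the-shelf image classifier. It is emphatically not Glaze or Nightshade; the ResNet-18 model, the file names, and the step sizes are arbitrary stand-ins for illustration.

```python
# General idea only: nudge pixels so a model's reading of the image drifts
# toward a decoy, while the change stays visually negligible.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
to_tensor = T.Compose([T.Resize((224, 224)), T.ToTensor()])

artwork = to_tensor(Image.open("artwork.jpg").convert("RGB")).unsqueeze(0)
decoy = to_tensor(Image.open("decoy_style.jpg").convert("RGB")).unsqueeze(0)

perturbation = torch.zeros_like(artwork, requires_grad=True)
optimizer = torch.optim.Adam([perturbation], lr=0.01)

with torch.no_grad():
    decoy_features = model(decoy)            # what we want the model to "see" instead

for _ in range(100):
    optimizer.zero_grad()
    cloaked = (artwork + perturbation).clamp(0, 1)
    loss = torch.nn.functional.mse_loss(model(cloaked), decoy_features)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        perturbation.clamp_(-8 / 255, 8 / 255)   # keep the change imperceptible

result = (artwork + perturbation).clamp(0, 1).squeeze(0).detach()
T.ToPILImage()(result).save("artwork_cloaked.png")
```

Real cloaking tools are far more careful about which features they target and how they bound the visible change, which is also why, as the July 2025 countermeasure showed, this remains an arms race rather than a settled defense.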
For legal action, class action lawsuits and regulatory complaints represent the most effective path for individuals. File complaints with the FTC at ftc.gov/complaint. If you live in California, file complaints with the California Privacy Protection Agency, which coordinates enforcement with state attorneys general across nine states. The Authors Guild tracks all AI class actions and provides guidance for anyone whose written work has been scraped.
You also need to make decisions going forward about what you post online. Every photo, every blog post, every forum question, every comment on social media is potential training material for the next generation of AI models. This does not mean you need to disappear from the internet. It means you need to make informed choices about what you share, where you share it, and what the realistic consequences of posting might be in a world where automated crawlers are watching everything.
The Question That Defines This Moment
Three realities define where we stand right now. The technology has created a one-way door. Personal information belonging to hundreds of millions of Americans has been absorbed into AI model weights through a process that the industry's own researchers, in over 180 published papers, confirm is irreversible. The largest wave of technology litigation since the early days of the internet is underway, and no appellate court has ruled on the central question of fair use, no federal privacy law exists, and the patchwork of state and international regulations creates uneven protection that well-funded corporations navigate with ease. And real people, an artist who found her medical photos in a training set, children whose faces were scraped from personal blogs, writers who discovered their life's work in pirated book databases, bear the costs of a system built on the assumption that anything posted online is raw material for commercial extraction.
The most effective responses remain collective. Class action litigation has already produced billion dollar settlements. State attorneys general are wielding existing consumer protection statutes. International regulators have demonstrated the willingness to order the destruction of entire AI models. California's transparency requirements and the European Data Protection Board's willingness to mandate model deletion represent genuine structural progress, and they apply going forward, not backward.
The fundamental question is not a technical one. The fundamental question is whether the American people will decide that posting a photo or video online constitutes blanket consent for every conceivable commercial use, or whether we will build legal structures that give individuals real control over how their personal data fuels the most consequential technology of this century.
Your members of Congress need to hear from you. Your state representatives need to hear from you. The AI companies scraping the internet right now are counting on your silence. Do not give the companies what they want.