Sunday, April 20, 2025
Meta employees discussed using copyrighted content for AI training in court documents

For years, Meta employees internally discussed using copyrighted works obtained through legally questionable means to train the company’s AI models, according to unsealed court documents.

Plaintiffs in the case Kadrey v. Meta have submitted documents revealing that Meta has been training its models on IP-protected works, particularly books, claiming it falls under “fair use.” Authors like Sarah Silverman and Ta-Nehisi Coates, who are among the plaintiffs, disagree with this stance.

Previous materials presented in the lawsuit suggested that Meta’s CEO Mark Zuckerberg permitted the AI team to use copyrighted content for training and halted talks with book publishers regarding AI training data licensing. The newly unsealed filings, mostly consisting of excerpts from internal work chats among Meta employees, provide insights into how Meta may have utilized copyrighted data for training its models, including those in the Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta’s Llama model research team, discussed training models on works that could pose legal risks.

According to the filings, Xavier Martinet, a Meta research engineer, proposed acquiring books for training even if doing so might be legally questionable. Martinet suggested that Meta had set up its gen AI organization so that the team could be less risk-averse. He also noted that negotiating licenses directly with publishers was time-consuming.

In the same discussion, Kambadur said Meta was in talks with Scribd and others about licensing agreements, and noted that Meta’s legal team was becoming less cautious in approving the use of publicly available data for model training.

Talks of Libgen

Another work chat cited in the filings explores using Libgen, a links aggregator that provides access to copyrighted works, as an alternative to the data sources Meta might license.

Libgen has faced lawsuits and penalties for copyright infringement in the past. A colleague of Kambadur’s responded with a screenshot of a search result indicating that Libgen is not legal.

According to the filings, some decision-makers believed that not using Libgen could harm Meta’s competitiveness in AI, and that the data was necessary to meet state-of-the-art benchmarks.

Mitigations were proposed to reduce Meta’s legal exposure when using Libgen data, including removing data clearly marked as pirated and not publicly acknowledging its use for training.

According to the filings, these mitigations included practical steps such as screening Libgen files for keywords like “stolen” or “pirated.”

In a separate chat, Kambadur mentioned tuning models to avoid responding to prompts that could pose intellectual property risks.

The filings also suggest that Meta may have scraped Reddit data for model training, possibly by emulating the behavior of a third-party app called Pushshift, following Reddit’s announcement that it would charge AI companies for data access.

In another chat dated March 2024, Chaya Nayak of Meta’s generative AI org said that leadership was considering overriding past decisions to ensure models had enough training data, suggesting that existing datasets were insufficient.

The Kadrey v. Meta plaintiffs have revised their complaint multiple times, alleging Meta cross-referenced pirated books with licensed ones to assess the necessity of pursuing licensing agreements with publishers.

Recognizing the legal stakes, Meta has enlisted two Supreme Court litigators from Paul Weiss to bolster its defense team for this case.

Meta did not immediately respond to a request for comment.
