LONDON — Some of the world’s biggest technology companies, including Google, Microsoft, Meta, OpenAI and X, scraped copyright-protected music from millions of songwriters, composers and artists to train generative artificial intelligence systems, says international music publishing trade association ICMP. The organization is sharing extensive evidence it has compiled over the past two years exclusively with Billboard, showing that the Beatles, Mariah Carey, The Weeknd, Beyoncé, Ed Sheeran and Bob Dylan are among the artists whose work was used for training purposes.
The documents were gathered by ICMP using publicly available registries, open-source repositories of training content, leaked materials, research papers and independent research by AI experts. ICMP says that the dossier contains “comprehensive and clear” evidence of the unlicensed use of digital music on a “global and highly extensive scale” for AI training and GenAI music, songwriter and performer image outputs. It also reveals that the scope of the training is larger than previously acknowledged.
Documents ICMP shared with Billboard showing the commercial for-profit use of songwriters’ music without a license include:
· Private datasets that demonstrate the illegal scraping of copyright-protected music from YouTube by U.S.-based music-making apps Udio and Suno.
· Analysis of Meta’s Llama 3 open-source large language model outputs, which indicates it was trained on copyright-protected music and/or lyrics by The Weeknd, Lorde, Bruno Mars, Childish Gambino, Imagine Dragons, Alicia Keys, Ed Sheeran and Kanye West, among many others.
· Court filings from a lawsuit filed by a group of music publishers against Anthropic, highlighting evidence that Anthropic infringed publishers’ copyrighted song lyrics on a massive scale, by copying those lyrics both as input to train its AI model Claude and in the output Claude generates. The publishers’ complaint includes examples of Claude output copying the lyrics to hundreds of their songs, including Don McLean’s “American Pie,” Lynyrd Skynyrd’s “Sweet Home Alabama,” Beyoncé’s “Halo” and Mark Ronson’s “Uptown Funk” featuring Bruno Mars.
· Evidence that suggests Chinese AI firm DeepSeek has illegally copied, reproduced and distributed copyright-protected lyrics without a license. Examples cited by ICMP include Jay Z’s “Empire State of Mind” and Ed Sheeran’s “Shape of You.”
· An admission from OpenAI’s chatbot (following enquiries from ICMP) that the company’s Jukebox music-making app was trained on copyright-protected music by a wide range of artists, including The Beatles, Elton John, Madonna, Elvis Presley, Drake, Kanye West, Frank Sinatra, Beyoncé and Ariana Grande. (In April 2020, when OpenAI launched Jukebox, the company publicly disclosed that it had trained the app on a dataset of 1.2 million songs, 600,000 of which were English-language, although it has never revealed which songs or artists were used.)
· Evidence that indicates Microsoft’s AI app CoPilot and Google’s AI system Gemini both breached copyright protections by replicating and distributing the lyrics of songs by many songwriters. Examples cited by ICMP include Bob Dylan’s “Knockin’ On Heaven’s Door,” Michael Jackson’s “Billie Jean” and Childish Gambino’s “This Is America.”
· A written admission from Google’s Gemini chatbot that it is “highly likely” that Google’s music generation model MusicLM had been trained on copyright-protected music, which, it says, “could raise legal concerns” were the product to be used for commercial purposes.
· Generative AI output that strongly suggests Midjourney copied and distributed direct replicas of album art by artists such as Gorillaz, Dr. Dre, Michael Jackson, Bob Marley and many more for commercial use.
· Datasets provided by Google initiative AudioSet, including for model training purposes, which demonstrate mass-scale scraping of music from YouTube without a license, thereby potentially breaching the DSP’s terms of service as well as copyright laws.
· Leaked datasets from U.S. AI research and tech company Runway that illustrate the mass scraping of copyright-protected music on YouTube for model training purposes and the labeling and collating of scraped data into songwriter and artist names, genre, tempo, etc.
Other evidence of generative AI output built on copyright-protected music provided by ICMP includes the copying and distribution of song lyrics by X’s artificial intelligence chatbot Grok. ICMP says it regards X as one of the worst offenders when it comes to respect for songwriters’ and artists’ rights.
Brussels-based ICMP represents 90% of the world’s commercially released music; its members include Universal Music Publishing Group, Sony Music Publishing, Warner Chappell Music, BMG, Kobalt, Reservoir and Concord Music Publishing, alongside thousands of other indie publishers. The organization compiled the dossiers of evidence over several years to illustrate the scale, techniques and types of uses of the world’s digital music routinely carried out by tech and AI-dedicated companies on a global basis.
Over the past 18 months, ICMP has presented the documents to dozens of international policymakers and government representatives during private discussions around AI and its regulation, and has shared excerpts with some rights holders whose works have been infringed. The international trade body has not shared the documents with any journalists or media other than Billboard.
“This is the largest IP theft in human history. That’s not hyperbole. We are seeing tens of millions of works being infringed daily,” says ICMP director general John Phelan. “Within any one model training data set, you’re often talking about tens of millions of musical works often gained from individual YouTube, Spotify and GitHub URLs, which are being collated in direct breach of the rights of music publishers and their songwriter partners.”
“Despite their public claims that they’re not training upon copyright-protected works, we’ve caught many of them [tech companies] red-handed,” Phelan continues. “We have extensive evidence of serious copyright infringement. Many of these companies are scraping the lyric datasets from the internet of millions of works and putting them into their models. Aside from amounting to breaches of copyright laws and often contract laws, this is often done despite the music sector’s consistent and clear statements that licenses are both required and available for legal AI training and GenAI.”
“This is not a victimless crime,” says a spokesperson for Concord. “These AI tools are being used in ways that will displace lyric writers and undermine existing royalty streams. Although Large Language Model (LLM) lyrics may never have the creativity of a human, LLMs trained on human lyrics, coupled with their speed, scale and economy, will undermine the incentive to create new art, which is the core mission of copyright law.”
Billboard contacted all the tech companies mentioned by ICMP in its dossier of evidence. All of them either declined to comment or did not respond to requests for comment.
‘Fair Use’ Lawsuits
This disclosure by ICMP follows a rush of legal actions against AI companies from music rights holders, film studios, publishers and other media firms over alleged copyright infringement. They include lawsuits from all three major labels against AI song generators Suno and Udio for illegally using their music to train generative AI systems, and a lawsuit from music publishers UMPG, Concord and ABKCO Music against Amazon-backed AI firm Anthropic and its Claude AI service for copyright infringement of lyrics in AI training and output.
Outside of the music industry, the London High Court is currently hearing a landmark copyright case brought by Getty Images against artificial intelligence company Stability AI over its image-generating tool, while Hollywood film studios Universal and Disney are suing San Francisco-based Midjourney, which they called a “bottomless pit of plagiarism,” for copyright infringement. The New York Times is separately suing OpenAI and its financial backer, Microsoft, in the U.S. courts for the “unlawful use” of its work.
The defense commonly used by tech companies accused of copyright infringement in the United States is that using copyright-protected songs and media to train generative AI systems constitutes “fair use,” a doctrine that does not exist as a legal standard outside the U.S.
In March, a California federal judge rejected a preliminary bid from the music publishers to block Anthropic from using copyright-protected lyrics to train its systems, stating, “It is an open question whether training generative AI models with copyrighted material is infringement or fair use.”
In June, Meta and Anthropic won separate, technical fair-use legal battles in the U.S. over the unauthorized use of book authors’ work for training purposes. However, ruling in Meta’s favor, Judge Vince Chhabria warned that the judgment “does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”
The European Union’s AI Act — the world’s first comprehensive legislation governing the use of artificial intelligence — provides rights holders with robust protections against fair use-style claims. It requires tech companies to respect existing copyright law, including music publishers’ and record labels’ rights reservations, and to meet training transparency obligations. AI training is currently a mostly black-box activity, with rights holders having no way of knowing what material a system has been trained on.
While the terms of the AI Act apply only to services operating in the 27-member European Union bloc, its copyright and training transparency obligations apply irrespective of where in the world the training data was sourced or whether it was acquired by a third-party “offshore” company.
Other jurisdictions and international governments, including the U.S., are in the process of drawing up their own AI laws, with questions around fair use-style exceptions for copyrighted material becoming a hotly contested battleground between music and tech lobbyists.
Double Standards
In addition to exposing the scale of generative AI training using copyrighted material, ICMP’s research also highlights the double standards of some tech companies that are aggressively pushing authorities to grant free use exceptions, yet, at the same time, are trying to ensure that no one can access their data on the same terms.
ICMP points to legal clauses it has found in the terms and conditions and service contracts of numerous AI providers, including Facebook, YouTube, X, Google’s Gemini, OpenAI, Suno, Udio, Microsoft and Adobe, that all expressly prohibit the scraping, access, duplication or publishing of the tech companies’ content without prior written consent. Over the last year, ICMP has presented excerpts of these texts to governments and policymakers to move the needle back toward the music sector’s interests.
“With AI and tech companies, all we hear is, ‘We need exceptions to build an open internet and access data, wholescale, without licenses, for our training,’” says Phelan. “What our work on AI shows is that at the very same time, they’re demanding everybody else get prior written permission before using their content. With evidence, we’re exposing not only the risible nature of such breaches but also the fact that, commercially speaking, they’re practicing entirely the opposite.”
The trade group says another example of big tech’s double standards can be found in the training datasets used by generative AI developers like OpenAI, Udio and Runway. The datasets for GenAI often contain detailed identifiers per song ingested, including song titles, album names, songwriter names, genre, lyrics, tempo and year of release. Yet tech companies have long argued that any legal requirement to disclose detailed information about training data, as required by the terms of the EU’s AI Act, is akin to giving away business secrets and would be too hard to implement.
“The data we have gathered and how [tech companies] are collating it explodes a lot of the myths being pushed about complexity to comply with copyright, or that it is too difficult to disclose what it is AI companies are training on,” says Phelan. He says the data ICMP has uncovered serves as “clear-cut evidence” that generative AI companies are training their systems on data “scraped from licensed services such as YouTube and Spotify without permission nor respect for laws.” Going forward, the organization says it will continue to build up its body of evidence and is sharing its research and analysis with ICMP member companies and their legal teams, a number of whom are taking legal steps.
“The future,” Phelan says, “needs to be one of ‘license or desist.’”
