Report Says AI Training Datasets Contain Thousands of K-Pop Songs, From BTS to BLACKPINK

Thousands of K-pop tracks—potentially including music from major artists such as BTS and BLACKPINK—are reportedly being used as data to train generative AI systems, according to a new analysis cited by Koreaboo. The article points to a broader investigation by The Atlantic into AI training datasets, raising fresh concerns about copyright, consent, and the risk that AI systems could imitate or plagiarize existing songs.
How the reported discovery works
The underlying issue centers on how generative AI models learn. In broad terms, AI systems are trained by ingesting large volumes of data—text, images, audio, and other content—so that they can identify patterns and produce new outputs when prompted. When music is part of that training, critics argue, the model may learn enough about melodies, structure, and timbre to produce material that sounds extremely close to existing tracks.
In the investigation described by Koreaboo, The Atlantic examined several AI training datasets and reportedly found that they contained millions of songs. The key technique, the article says, was searching the datasets by artist name: looking up a name showed whether that artist’s tracks were present in the underlying material.
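The artist-name lookup described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the tool the reporters used: the `TrackRecord` type, its `artist` and `title` fields, the helper names, and the sample rows are all hypothetical placeholders, since the article does not describe the datasets' actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackRecord:
    """One row of hypothetical dataset metadata (field names are assumptions)."""
    artist: str
    title: str


def normalize(name: str) -> str:
    """Casefold and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.casefold().split())


def find_tracks_by_artist(records, artist):
    """Return every record whose artist field matches the query after normalization."""
    wanted = normalize(artist)
    return [r for r in records if normalize(r.artist) == wanted]


# Placeholder rows for illustration only, not real dataset contents.
SAMPLE_RECORDS = [
    TrackRecord("Example Artist A", "Placeholder Song 1"),
    TrackRecord("example artist a", "Placeholder Song 2"),
    TrackRecord("Example Artist B", "Placeholder Song 3"),
]

matches = find_tracks_by_artist(SAMPLE_RECORDS, "Example  Artist A")
print(len(matches))  # number of placeholder rows credited to that artist
```

An exact match on a normalized name is the simplest possible rule. A real search would probably also need to handle alternate romanizations, featured-artist credits, and group versus solo releases, which is one reason counts from this kind of lookup are best read as estimates.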
Scope: thousands of tracks across large catalogs
According to the Koreaboo summary, searching for BTS in one specific dataset surfaced more than 200 songs. It further claims that BTS music appeared across several different datasets, suggesting the exposure could be widespread rather than limited to a single archive.
The article also describes similar results when searching for other K-pop acts, particularly those with large discographies. It states that across a sample of around 11 artists, reporters found over 2,000 songs within the datasets. The summary takes this to imply that the number of K-pop songs used for training could reach the hundreds of thousands, given the genre’s total output and the likelihood that more artists are included beyond that initial sample. That larger figure is an extrapolation from the sample, not a count.
These figures are presented as estimates based on dataset search results, but the scale is still notable. The difference between “a few tracks” and “thousands to hundreds of thousands” matters greatly for whether rights holders can realistically track what was used, how it was licensed, and whether opt-outs exist.
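To make the extrapolation concrete, the following sketch works through the arithmetic using only the two sample figures reported above. The 100,000-song target is an assumption chosen to represent the low end of "hundreds of thousands," and the function names are illustrative.

```python
SAMPLE_ARTISTS = 11    # "around 11 artists" in the reported sample
SAMPLE_SONGS = 2_000   # "over 2,000 songs", treated here as a lower bound


def songs_per_artist(songs: int, artists: int) -> float:
    """Average number of songs found per sampled artist."""
    return songs / artists


def artists_needed(target_songs: int, songs: int, artists: int) -> int:
    """Artists required to reach target_songs at the sampled rate (rounded up).

    Uses integer ceiling division to avoid floating-point rounding surprises.
    """
    return -(-target_songs * artists // songs)


print(round(songs_per_artist(SAMPLE_SONGS, SAMPLE_ARTISTS)))   # about 182
print(artists_needed(100_000, SAMPLE_SONGS, SAMPLE_ARTISTS))   # 550
```

At the sampled rate of roughly 182 songs per artist, reaching 100,000 songs would take about 550 artists represented just as heavily. Because the sample reportedly leaned toward acts with large discographies, a typical artist would likely contribute fewer songs, so the real number of artists required would probably be higher.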
Why it matters: from “learning” to remixing existing works
One practical worry highlighted in the coverage is the possibility of AI-generated plagiarism. The article explains that if an AI model has been trained on a particular song—or enough songs from a given artist—it could generate outputs that closely resemble existing music, potentially with only slight changes.
To illustrate that risk, the article references an example described in The Atlantic reporting: a figure skating performance using an AI song described as a near rip-off of an existing track. While the specifics of that case relate to performance rather than commercial releases, the underlying point is the same—AI can replicate recognizable audio patterns, and doing so without permission can create legal and ethical conflicts.
Industry and fan pushback over consent and environmental impact
The Koreaboo piece situates the dataset findings within a wider controversy around AI in music, where fans and artists have criticized AI use on multiple fronts. Beyond copyright and consent, it notes an additional argument often raised by critics: the environmental footprint of AI training and operation.
Large-scale AI training typically requires substantial computing resources, which in turn involves significant electricity use and can affect local water systems used for cooling. The article frames these concerns as part of a larger debate about whether AI deployment is socially and ecologically responsible—especially when training data includes creative works without clear authorization.
For K-pop specifically, the stakes are amplified by the genre’s highly engineered production practices and global distribution strategies. Artists and labels may also rely on strong brand protections, making the prospect of AI outputs that mimic copyrighted songs more than a theoretical concern.
What happens next: transparency, licensing, and enforcement
If the reported dataset presence is accurate, the next phase is likely to focus on accountability: whether the songs found in training sets were licensed, whether rights holders were informed, and what mechanisms exist for removing content or limiting future training. In practical terms, researchers and lawmakers have been pushing for greater transparency around training data provenance—often summarized as the question of who provided the data and under what permission.
For the music industry, rights management will likely intensify. Labels and collecting organizations may seek clearer standards for AI training exceptions, opt-in or opt-out registries, and enforcement pathways against outputs that are demonstrably derivative. For consumers, the challenge will be distinguishing legitimate remixes or AI-assisted creations from outputs that effectively substitute for original works.
As generative AI grows more capable, dataset disputes like this one could become a central battleground—one that determines not only what AI can learn, but also who gets to decide.


