Spotify investigates alleged piracy of millions of tracks in 300TB data archive

# Tech Desk
Representational image | Photo: Getty Images

A massive data leak has put Spotify under scrutiny after claims emerged that a near-complete copy of its music catalogue has been illegally archived and prepared for public release. The cache is said to include tens of millions of audio files along with extensive metadata, raising fresh concerns over digital piracy, copyright enforcement, and the security of large streaming platforms. The claims were made by Anna's Archive, a well-known shadow library.

According to details shared on its website, the archive says it has backed up around 86 million music files from Spotify, along with nearly 256 million rows of associated metadata. The entire collection is reported to weigh close to 300 terabytes and has been organised into bulk torrents based on listener popularity.

The group describes the project as an open “preservation archive” for music, claiming the dataset represents about 99.6 per cent of all listens on the platform and can be mirrored by anyone with sufficient storage capacity.

While Anna's Archive is widely associated with distributing links to pirated books, it said this effort marked a departure driven by opportunity rather than a shift in focus. The group maintained that its broader aim of preserving cultural output is not limited to text and that, on occasion, it targets other forms of media when circumstances allow.

Spotify’s response to the claim

Spotify has acknowledged the incident and said it is examining how the data was accessed, though it rejected claims that its entire library had been compromised.

The Stockholm-based company, which serves more than 700 million users worldwide, said an internal probe found that a third party scraped publicly available metadata and bypassed digital rights management systems to obtain some audio files. The company said it is continuing to investigate the breach, according to comments reported by Android Authority.

The scale of the alleged leak has also drawn attention from the technology sector. According to a report by The Guardian, such datasets could be attractive to artificial intelligence firms seeking large volumes of training material. The issue has parallels with past controversies, including allegations that Meta, led by Mark Zuckerberg, used LibGen, an online repository of pirated books, to train AI models.

Court documents filed in the US previously claimed that Zuckerberg approved the use of such material despite internal warnings that the dataset was known to contain pirated content, highlighting ongoing legal and ethical debates around data sourcing for AI development.