Harvard and Google will offer one million public-domain books as an AI training resource.
AI training data is expensive, which largely limits it to well-funded tech companies. This is why Harvard University intends to release a dataset of around 1 million public-domain books spanning many genres, languages, and writers, including Dickens, Dante, and Shakespeare, all of which are no longer copyright-protected due to their age.
The new dataset is not yet available, and it is unclear when or how it will be provided. However, it incorporates books from Google Books, the company’s long-running book-scanning effort, so Google will be participating in the release of “this treasure trove far and wide.”
Harvard first teased the Institutional Data Initiative (IDI) in March, describing its plans to provide a “trusted conduit for legal data for AI.” Little had been heard from it until its formal introduction today, which confirmed that the IDI has financial backing from Microsoft and OpenAI.
According to Greg Leppert, executive director of the IDI, the dataset is intended to “level the playing field” by making such a massive dataset available to anyone, from academic laboratories to AI startups, who wants to train large language models (LLMs).
