Harvard and Google will offer one million public-domain books as an AI training resource.
AI training data is expensive, which puts it largely out of reach for all but well-funded tech companies. This is why Harvard University intends to release a dataset of around 1 million public-domain books spanning many genres, languages, and writers, including Dickens, Dante, and Shakespeare, all of which are old enough to no longer be protected by copyright.
The new dataset is not yet available, and it is unclear when or how it will be provided. However, it incorporates books from Google Books, the company’s long-running book-scanning effort, so Google will be participating in the release of “this treasure trove far and wide.”
Harvard first teased the Institutional Data Initiative (IDI) in March, describing its plans to provide a “trusted conduit for legal data for AI.” However, little had been heard from it until its formal introduction today, which confirmed that the IDI has financial backing from Microsoft and OpenAI.
According to Greg Leppert, executive director of the IDI, the dataset is intended to “level the playing field” by making such a massive dataset available to everyone — from academic laboratories to AI startups — who wants to train large language models (LLMs).