Harvard and Google will offer one million public-domain books as an AI training resource.
AI training data is expensive, which puts it largely out of reach for anyone but well-funded tech companies. This is why Harvard University intends to release a dataset of around 1 million public-domain books spanning many genres, languages, and writers, including Dickens, Dante, and Shakespeare, all of which are no longer copyright-protected due to their age.
The new dataset is not yet available, and it is unclear when or how it will be provided. However, it incorporates books from Google Books, the company’s long-running book-scanning effort, so Google will be participating in the release of “this treasure trove far and wide.”
Harvard originally teased the Institutional Data Initiative (IDI) in March, describing its plans to provide a “trusted conduit for legal data for AI.” However, little had been heard about it until its formal introduction today, which confirmed that the IDI has financial backing from Microsoft and OpenAI.
According to Greg Leppert, executive director of the IDI, the dataset is intended to “level the playing field” by making such a massive dataset available to anyone, from academic laboratories to AI startups, who wants to train large language models (LLMs).
