Wanna buy some student data for your AI? The University of Michigan can help. It seems representatives for the school or its partners are cold-emailing tech workers at Google and other companies, offering data on University of Michigan students to train large language models. The data includes recordings of lectures, student discussions, and office hours, as well as essays written by seniors and grad students, all available for a paltry licensing fee. It’s unclear whether students gave their consent.
The story came to light in an X/Twitter post by an employee at Google DeepMind, the company’s AI research hub. Susan Zhang, an engineer at DeepMind, said that she’d received a sponsored LinkedIn message hawking the information and offering a free sample of the University of Michigan data to prove its worth.
“I’m reaching out because, based on your profile, you may be working with Large Language models (LLM’s) or natural language processing,” the sales message said. “I wanted to let you know that the University of Michigan is licensing academic speech data and student papers that could be very useful for training or tuning LLM’s.”
The message offers data from 85 hours’ worth of lectures, discussion sections, and interviews for $15,595; a second set of 829 papers written by University of Michigan students across various disciplines for $12,595; or a discount package of both data sets for $25,000.
“I think it’s worth pursuing which universities are selling student data and what the terms are,” Zhang told Gizmodo in a message on X. “Licensing is better than scraping data without attribution but the attribution pipelines here are likely only built halfway (aka original creators won’t see a dime, whereas the reseller who stores data will capture all the profits).”
The university appears to be working with an organization called Catalyst Research Alliance, which also claims to partner with North Carolina State University. The organization’s website offers a sample of the data set, which includes an essay titled “The Democratic Inadequacies of the European Union” and what appears to be a recording of a class discussion section.
Catalyst Research Alliance and North Carolina State University did not immediately respond to requests for comment. A University of Michigan representative said they were preparing a statement. We’ll update this article when we hear back.
Training large language models, like the software that runs chatbots such as ChatGPT and Bard, requires massive, clearly labeled data sets across various subjects and disciplines. While the University of Michigan data set is small, well-organized content on a narrow swath of subjects could be useful for tuning certain models, particularly tools designed for purposes related to academia or formal communication, or for training more general AIs to improve their performance in individual areas of subject-matter expertise.
Source: Gizmodo