OpenAI Whistleblower Claims Company Used Copyrighted Data for AI Training
Former OpenAI researcher Suchir Balaji has voiced ethical concerns over OpenAI’s practices, alleging that the company used copyrighted material without permission to train its AI models. Speaking with The New York Times, Balaji claimed that OpenAI’s mass data scraping methods may violate copyright law and could disrupt the internet’s business ecosystem if left unchecked. OpenAI, now a for-profit entity facing multiple copyright lawsuits, including one from The New York Times, argues that its data usage falls under fair use.
Balaji, who joined OpenAI in 2020, helped gather and manage the online data used to train OpenAI’s large language models (LLMs). At the time, OpenAI operated as a research-focused organization, and Balaji says copyright seemed a minor concern because the data was being used for research, a purpose generally protected under fair use. After the 2022 release of ChatGPT, however, OpenAI’s products became heavily commercialized, raising questions about where the boundaries of fair use lie.
Balaji’s views reflect the broader debate over AI’s impact on copyright. “The issue isn’t sustainable,” he told The New York Times, arguing that OpenAI’s approach threatens the content creators whose work AI models can mimic and, in turn, the economic underpinnings of the internet. Intellectual property lawyer Bradley Hulbert echoed the urgency of Balaji’s warnings, stating, “It is time for Congress to step in.”
OpenAI, however, maintains its stance, asserting that its models are built on “publicly available data” in a manner consistent with fair use and essential to U.S. innovation. Balaji’s statements underscore the growing need for regulatory clarity as the AI industry expands.