Stack Exchange restricts data access for generative AI and the rest of the world
The Stack Exchange network will provide its complete data dump only to users who log in on its own pages.
Stack Exchange has announced that in future it will offer data dumps only via its own pages. The network of Q&A sites, which includes the developer site Stack Overflow, is ending its previous partnership with the Internet Archive.
The operators announced the change on July 12 and explained it in more detail two weeks later in an update. According to the update, Stack Exchange primarily wants to ensure that providers of large language models (LLMs) adhere to the guidelines on socially responsible AI published on the Stack Overflow blog in January.
Other conditions
The last release to the Internet Archive, in April 2024, allowed the content to be used largely without restriction under the CC BY-SA 4.0 license.
Downloading the data from Stack Exchange, on the other hand, will in future require consent to the following addition: "I understand that this file is provided to me for my own use and for projects that do not include training a large language model (LLM). Should I distribute this data for the training of an LLM, Stack Overflow reserves the right to deny me access to future downloads of this data dump".
This addition has already caused some displeasure among users. A post on Stack Exchange points out that the CC BY-SA 4.0 license explicitly prohibits additional restrictions to the license.
Stack Overflow and the LLMs – a love-hate relationship
LLMs are already a challenge for Stack Overflow because they jeopardize the network's fundamental business model. Anyone who used to search Stack Overflow for code examples to solve a specific problem now often asks GitHub Copilot or ChatGPT instead.
At the end of 2022, Stack Overflow also banned content on its own platform that was created with ChatGPT, as the content was potentially harmful to the site and its users.
However, the company announced a strategic collaboration on generative AI with Google in February, and in May 2024 a partnership with OpenAI, the company behind ChatGPT and behind Codex, the model on which GitHub Copilot is based.
A post by a Stack Exchange member with an extremely high reputation score, the network's measure of a user's standing, says: "What they (Stack Exchange, editor's note) aren't saying is that they are still selling the image of the data for generative AI, they just don't want the companies behind GenAI to be able to get it for free".
For users, the change means in the short term that the regular dump to the Internet Archive in July will not take place. As the changeover will still take some time, Stack Exchange says that the dump will probably not be available on the network's own pages until mid-August.
(rme)