Meta and the problem of AI training data
When I read the news that Meta had used vast numbers of books and scientific papers from LibGen, an illegal online library, to train its AI model "Llama 3" (a model that mimics human cognitive abilities by recognizing and organizing information from its input data), I was not surprised.
It shows how big tech companies handle data and invoke the principle of “fair use” to absolve themselves of any responsibility.
(Source: https://www.theguardian.com/books/2025/mar/25/richard-osman-meta-copyright-ai)
New court documents show that Meta even used torrent networks to download copyrighted books to train its AI - despite clear warnings from the company's own AI team.
(Source: https://the-decoder.de/meta-nutzte-piraterie-netzwerke-fuer-ki-trainingsdaten-mit-zuckerbergs-segen)
When I found one of my English-language books in the database, I wasn't surprised.
(Source: https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/)
Early internet days and the idea of sharing
In those days, the internet was not yet such a commercial platform, but rather a collaborative project driven by visionary people. There was a culture of collaboration and knowledge sharing that was instrumental in the internet developing so quickly and so extensively. I met great people back then and learned a lot from them. I am very grateful for this today, because without those encounters my life would certainly have been different.
Because of this experience, it was and still is important to me to share my knowledge and, if possible, to support other people.
The problem with Meta's approach
But now it's about something else: Meta, a company worth billions, may have used a pirate library to further develop its commercial AI. While researchers and authors are struggling to find funding, Meta is using their work without paying or even asking permission. What particularly bothers me is the argument that training AI is “fair use”. Fair use is a legal exception that exists in the USA and allows protected material to be used under certain conditions. But this does not apply to the mass exploitation of copyrighted content by a for-profit company.
AI needs our data
AI can only develop if it is continuously fed with data - and that is exactly what we all do every day. Every online search, every chatbot interaction, every “like” on social media contributes to the improvement of algorithms. We feed AI every day by communicating with it and having it generate images or texts.
I also use AI. I created the drawing at the beginning of this article with the help of Photoshop's AI. They really are cute mice! I would love to know who the stylistic author behind this drawing is.
Even if we don't actively contribute anything, algorithms analyze our behavior - what we read, what we click on, how long we stay on a page.
The problem is: who benefits from this? We train AI with our data, but in the end, large corporations make a lot of money from it.
I wonder how we as a society want to react to this. The debate about AI training and copyright is still in its infancy. Do AI companies have to pay for training data? Should scientific papers and books generally be released for such purposes, or do we need new protection mechanisms?
Authors, scientists and publishers could suffer massive financial losses if their content is used without a license. At the same time, a pattern is emerging: while individuals have to adhere to strict copyright laws, corporations seem to be able to do whatever they want.
What we can do in the future: Society for AI usage rights
In Germany, there are GEMA (Gesellschaft für musikalische Aufführungs- und mechanische Vervielfältigungsrechte) and VG Wort (Verwertungsgesellschaft WORT), which manage the copyrights and remuneration of creators. GEMA manages the rights of music creators such as composers and lyricists and ensures that they are remunerated for the use of their works. VG Wort, on the other hand, looks after the rights of authors and publishers by collecting and distributing royalties for the use of literary works such as books, articles and other texts.
Similarly, it would be conceivable to set up a company to look after the copyrights of all intellectual works used by artificial intelligence (AI). This company could be a central point of contact for the administration and exploitation of rights affected by AI-generated content. It would ensure that the original creators (such as authors, musicians, artists, etc.) are fairly remunerated when their works or data are used in the AI's learning processes. Such a society could work in a similar way to GEMA or VG Wort or be a cooperative project, but specifically focused on the challenges and rights associated with AI use.
And now
If you are also affected, you can contact Meta and lodge an objection. The American Authors Guild has prepared an online form. I am not sure whether this also makes sense for European authors.
It could also be useful to include specific terms of use for artificial intelligence in the legal notice of websites and publications. We could also keep AI bots away from our website directly using a robots.txt file.
However, blocking these bots can clearly affect how easily our websites are found in search engines, because we pay for that visibility with our data.
Perhaps a lawyer could look into the correct wording for the copyright notice. Perhaps one of our readers will feel called upon to do so.
Here is just an idea that has not been checked by a lawyer.
Copyright and AI training
Publications and documents (not legally reviewed)
All contents of this document, including texts, images and other media, are subject to copyright. The use of our content for the training of artificial intelligence (AI) is prohibited without express written permission.
All content on this website, including text, images and other media, is subject to copyright. The use of our content for the training of artificial intelligence (AI), in particular by automated crawlers or AI models, is prohibited without express written permission. To prevent unauthorized scraping by AI bots, we have implemented appropriate measures in our robots.txt file. If you wish to use or license content from our website, please contact us.

Blocking AI bots: robots.txt instructions
The robots.txt file is a simple text file that website operators use to tell web crawlers or "bots" (automated programs) which parts of their website they may crawl and which they may not. Until now, this has mainly concerned search engine bots (such as Google's Googlebot), but the same mechanism is now also used to address the AI crawlers that collect and analyze data.

Until recently, I was convinced that this kind of blocking works reliably - until my old i-Worker team pointed out to me that dubious bots treat it merely as a recommendation and are not really deterred from their raids. But perhaps it at least signals that you do not want to be robbed.
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: Yandex
Disallow: /
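Since robots.txt is only a recommendation, bots that announce themselves honestly can also be rejected at the server level. Here is an unverified sketch for an Apache .htaccess file, assuming mod_rewrite is enabled; the agent list is an example and would need to be maintained. Note that this only helps against bots that send a truthful User-Agent header.

```apache
# Sketch (not legally or operationally audited): return 403 Forbidden
# to requests whose User-Agent matches known AI crawlers.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|anthropic-ai|Google-Extended|FacebookBot) [NC]
RewriteRule .* - [F,L]
```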
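To see how a well-behaved crawler is supposed to interpret such rules, Python's standard library ships a robots.txt parser. The following is a small sketch (the rules and URLs are illustrative, not taken from any real site): it applies a "Disallow: /" rule for GPTBot while leaving all other user agents unrestricted.

```python
# Sketch: how a compliant crawler evaluates robots.txt rules.
# The rules and the example URL below are hypothetical.
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked everywhere; any other agent may fetch the page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))      # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

This also illustrates the limitation mentioned above: the file only describes what a crawler *should* do. Nothing forces a bot to run such a check before downloading our pages.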