Climate Denial, Hate Sites Used to Train AI. What Could Go Wrong?
April 20, 2023
Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.
To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.
The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com No. 1, which contains text from patents issued around the world; wikipedia.org No. 2, the free online encyclopedia; and scribd.com No. 3, a subscription-only digital library. Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.
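The per-site ranking described above comes down to simple bookkeeping: extract each document's domain and tally its tokens. A minimal sketch of that idea, using a few made-up sample records and whitespace tokens as a crude stand-in for the real analysis (the actual C4 snapshot spans 15 million sites and uses subword tokenization):

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical C4-style records: each pairs a source URL with its text.
records = [
    {"url": "https://patents.google.com/patent/US123", "text": "a patent abstract " * 50},
    {"url": "https://en.wikipedia.org/wiki/Climate", "text": "an encyclopedia article " * 30},
    {"url": "https://www.scribd.com/doc/42", "text": "a scanned document " * 20},
]

def domain_of(url: str) -> str:
    """Return the hostname portion of a URL, e.g. 'patents.google.com'."""
    return urlparse(url).netloc

# Tally whitespace-delimited tokens per domain.
tokens_per_domain = Counter()
for rec in records:
    tokens_per_domain[domain_of(rec["url"])] += len(rec["text"].split())

# Rank domains by token count, largest first.
ranking = tokens_per_domain.most_common()
```

With real data, the top of `ranking` would reproduce the No. 1, No. 2, No. 3 ordering the Post reports.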
Some top sites seemed arbitrary, like wowhead.com No. 181, a World of Warcraft player forum; thriveglobal.com No. 175, a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including dumpsteroid.com No. 183, that no longer appear accessible.
Others raised significant privacy concerns. Two sites in the top 100, coloradovoters.info No. 40 and flvoters.com No. 73, had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.
Business and industrial websites made up the biggest category (16 percent of categorized tokens), led by fool.com No. 13, which provides investment advice. Not far behind were kickstarter.com No. 25, which lets users crowdfund for creative projects, and further down the list, patreon.com No. 2,398, which helps creators collect monthly fees from subscribers for exclusive content.
–
Meanwhile, The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront.org No. 27,505, the anti-trans site kiwifarms.net No. 378,986, and 4chan.org No. 4,339,889, the anonymous message board known for organizing targeted harassment campaigns against individuals.
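C4's cleaning step reportedly dropped documents containing words from a fixed blocklist, which helps explain how entire extremist sites survive: the filter operates on individual words, not on domains, so any page that avoids the listed words passes. A minimal sketch of that kind of filter, with a made-up two-word blocklist standing in for the much longer real one:

```python
import re

# Hypothetical blocklist; the word list reportedly used for C4's cleaning
# is far longer than this two-entry stand-in.
BLOCKED_WORDS = {"badword1", "badword2"}

def passes_word_filter(text: str) -> bool:
    """Keep a document only if none of its words appear on the blocklist."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return words.isdisjoint(BLOCKED_WORDS)

# A page from a hate site can sail through if its wording stays innocuous:
innocuous_looking = "Our community discusses current events and heritage."
flagged = "This post contains badword1 and worse."
```

Here `passes_word_filter(innocuous_looking)` is True while `passes_word_filter(flagged)` is False, illustrating why word-level filtering misses troubling content expressed in neutral language.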
We also found threepercentpatriots.com No. 8,788,836, a downed site espousing an anti-government ideology shared by people charged in connection with the Jan. 6, 2021, attack on the U.S. Capitol. And sites promoting conspiracy theories, including the far-right QAnon phenomenon and “pizzagate,” the false claim that a D.C. pizza joint was a front for pedophiles, were also present.
—–
Long article with much more info at the link.
Sharp-eyed water researcher Peter Gleick, looking through the interactive graphic, turned this up: among the “Community” websites included in the training data was wattsupwiththat, the notorious purveyor of climate denial misinformation.
Below, longtime climate activist and friend of the blog, entrepreneur Rob Honeycutt, engaged ChatGPT in a conversation about the possible impacts of defective training.
Rob:
I just saw in a news article that one website you were trained on is called wattsupwiththat.com. That website is a notorious anti-science blog that produces an enormous amount of misinformation on climate change. Why would that site be used for your training?
ChatGPT:
As an AI language model, I am not aware of the specific websites or sources that were used for my training. My training dataset consisted of a large and diverse collection of text from the internet, including news articles, academic papers, and other sources. My creators at OpenAI designed my training process to include a wide variety of sources, in order to expose me to the broadest possible range of human language and knowledge.
However, it’s important to note that as an AI language model, I am trained on text data and do not have beliefs, opinions, or biases. I’m designed to generate text based on patterns and associations I’ve learned from the training data. While some sources of information may be more reliable than others, my goal is to provide accurate and informative responses based on the input I receive, regardless of the source. It’s up to the users who interact with me to critically evaluate the information they receive and to seek out reliable sources of information.
Rob:
Are you saying the content of the training materials merely train the language model rather than the accuracy of the content you generate?
ChatGPT:
That’s correct. My training data is used to develop my language processing abilities and to help me learn to generate text that is grammatically correct, semantically coherent, and stylistically appropriate. However, my training data does not dictate the accuracy or truthfulness of the responses I generate.
As an AI language model, I do not have personal beliefs or opinions, nor do I evaluate the accuracy or validity of the content I generate. I am a tool that can be used to generate text based on the input I receive, but it is up to humans to evaluate the content for accuracy, truthfulness, and reliability. It’s important for users to critically evaluate the information they receive from any source, including from me, and to seek out additional information from reliable sources when necessary.
Rob:
You say the training materials also inform your knowledge base, though. If parts of the knowledge base you rely on come from demonstrably unreliable sources, does that not affect the accuracy of your responses?
ChatGPT:
Yes, you are correct that the training materials I was trained on can inform my knowledge base, but it’s important to note that my knowledge is not based solely on the content of the training data. Rather, it’s a combination of the patterns and associations I’ve learned from the training data and the additional knowledge and context that is fed into me through subsequent interactions with users.
While it’s true that the quality and reliability of the training materials can have some impact on the accuracy of my responses, it’s important to understand that I am not simply regurgitating information from the training data. Instead, I use the patterns and associations I’ve learned from the training data as a foundation, which is then refined and updated through subsequent interactions with users.
That being said, I strive to provide accurate and informative responses based on the information provided to me, and I am designed to prioritize reliable sources of information when generating responses. However, I cannot guarantee the accuracy or reliability of every piece of information I provide, as I am an AI language model and not an infallible oracle of truth. Ultimately, it is up to the user to critically evaluate the information provided and seek out additional sources if necessary.
Rob:
Actually, that was going to be my next question. So, are you able to discern when a blog post is written by someone with no specific expertise opining on subject matter, as opposed to published scientific research from a career scientist, and prioritize sources by level of expertise?
ChatGPT:
As an AI language model, I am capable of understanding language patterns and identifying certain characteristics that may indicate the level of expertise of a source, such as the use of technical terminology, the citation of sources, or the publication in a peer-reviewed journal. However, I cannot independently verify the expertise or credibility of a source, nor can I provide a definitive evaluation of the accuracy or reliability of information from a particular source.
When generating responses, I do attempt to prioritize sources that are widely recognized as authoritative or reliable, such as academic publications or established news sources. However, it’s important to note that my training data contains a wide variety of text from a range of sources, including blog posts and other forms of user-generated content. Therefore, it’s possible that I may occasionally generate responses based on unreliable or inaccurate information, particularly if that information has been repeated frequently or is widely circulated on the internet.
Ultimately, it is up to the user to critically evaluate the information they receive from any source, including from me, and to seek out additional information from reliable sources when necessary.
Rob:
It’s the “particularly if that information has been repeated frequently or is widely circulated on the internet” that gives me pause, since there is a tremendous amount of unreliable information on the internet. Often it’s titillating misinformation that generates more volume and engagement than boring facts do.
ChatGPT:
You are right to be concerned about the prevalence of unreliable information on the internet. As an AI language model, I’m designed to generate responses based on patterns and associations I’ve learned from my training data, which includes a wide range of text from the internet, including blog posts and other user-generated content. While I attempt to prioritize sources that are widely recognized as authoritative or reliable, it’s possible that I may generate responses based on unreliable or inaccurate information that has been widely circulated on the internet.
However, it’s important to note that I am not a replacement for critical thinking or careful evaluation of information. It’s important for users to be aware of the potential for misinformation and to critically evaluate the information they receive from any source, including from me. If a user suspects that a response I generate contains inaccurate or unreliable information, they should seek out additional information from reliable sources in order to verify or correct the information.
In addition, it’s important for content creators and internet users to take responsibility for the information they generate and share. By promoting accurate and reliable information and being mindful of the potential for misinformation, we can work to create a more trustworthy and reliable information environment.
——–
Below, Conversation with the Devil by Ray Wylie Hubbard:
I was curious about gty.org. Here’s the guy in charge of that website:
https://theconversation.com/god-intended-it-as-a-disposable-planet-meet-the-us-pastor-preaching-climate-change-denial-147712