Wikipedia and AI: Are You Ready?

Generative AI is everywhere, and that’s largely thanks to Wikipedia. The New York Times said, “While estimates on its influence can vary, Wikipedia is probably the most important single source in the training of A.I. models.” 

This means Wikipedia content about your company is circulating among generative AI tools. Is the content up to par? Let’s look at how this is happening and what it means for you in 2024.


Large Language Models and Their Relationship With AI

Large Language Models (LLMs) are a specific type of AI program that can generate text that mimics human responses. 

To work, LLMs are trained on enormous amounts of text and text-like data. ChatGPT was trained on around 570 GB of data – on the order of hundreds of billions of words. Google Gemini was reportedly trained on about 1.56 trillion words. To put that into perspective, the average adult fiction novel contains around 70,000 to 120,000 words. 

After the LLMs ingest all this data, it is processed by a transformer, an architecture that allows the program to learn how words relate to one another. Through training, the model learns to predict which word should follow the ones before it, so it can generate coherent sentences.
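For readers who want to see what next-word prediction looks like in practice, here is a minimal sketch using the open-source Hugging Face transformers library and the small, publicly available GPT-2 model. It illustrates the underlying idea only; it is not how any particular chatbot was built.

```python
# A minimal sketch of next-word prediction, the core task an LLM learns during training.
# Assumes the Hugging Face "transformers" library is installed (pip install transformers).
from transformers import pipeline

# Load a small, publicly available model (GPT-2). The pipeline handles
# tokenization and decoding for us.
generator = pipeline("text-generation", model="gpt2")

prompt = "Wikipedia is a free online"
# Ask the model to continue the prompt by predicting the next few words.
result = generator(prompt, max_new_tokens=12, num_return_sequences=1)
print(result[0]["generated_text"])
```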

How Does AI Use Wikipedia’s Data?

Wikipedia data and Wikipedia text not only train different LLMs and other AI platforms, but are also used by these tools to generate answers to users’ queries. Platforms using Wikipedia to source information include:

  • Google Gemini (formerly known as Bard)
  • Google Search
  • ChatGPT
  • Bing Search
  • Microsoft Copilot (formerly known as Bing Chat)
  • Siri
  • Alexa

Some people believe that without Wikipedia, generative AI wouldn’t currently exist.

  • The Washington Post and the Allen Institute for AI ranked the millions of websites used to train LLMs. Wikipedia ranked #2.
  • Google Gemini was not only trained using Wikipedia but also accesses and processes information through Google Search. Wikipedia is prominently featured in Search and influences search results.

DuckDuckGo’s DuckAssist tool pulls from Wikipedia and Encyclopedia Britannica to provide answers to users.

What AI’s Wikipedia Usage Means for Organizations

Whether someone uses AI to answer a question, write a blog post, research a topic, or outline a video, the AI decides whether to include you and what to share, partly based on what it learns from Wikipedia.

Examples of this in action can be seen across Google Gemini, ChatGPT, and Copilot (formerly Microsoft Bing Chat).

Even if Wikipedia isn’t the only resource provided, it is a trusted and recognized third-party source. If it’s an option, expect searchers to click on it.

Can I Control the Content in My Wikipedia Article?

A Wikipedia article is “owned” collectively by anyone with internet access who takes an interest in the article and decides to edit it. What does this mean for you? Can you be one of those editors?

We do not recommend it. Wikipedia has strong anti-bias guidelines, and editors do not like it when someone affiliated with an organization edits the article about that organization. These edits are often reverted, and the page can be tagged as having conflict-of-interest (COI) issues or promotional content. These tags warn page visitors that the content can’t be trusted, which is the opposite of what you want to achieve.

Here is what you should do if you want to make changes to a Wikipedia article:

1. Take the time to understand how Wikipedia works. It is an encyclopedia. It exists to share factual, high-level information about millions of topics, not to catalog every last acquisition, product update, and board member hire. 

2. If you review the article and think it needs editing to bring it up to date or to address inaccuracies, make a list of exactly what you’d like to see changed. Then search the internet for reliable, third-party sourcing that supports you. This sourcing must come from reputable outlets like the Wall Street Journal, scientific journals, etc. The articles cannot be written by anyone from your company, and the material you want cited cannot come from a quote by a company spokesperson.

3. Go to the article’s Talk page and click on “Add topic.”

Introduce yourself and state your affiliation with the company or individual in the Wikipedia article. Then share exactly what you’d like changed and provide the sources you’ve collected that back you up. Now you wait for someone to respond. They might add everything you want, they might have questions, or they might add some of what you want using different language than what you drafted. This is often because the copy needs to be less promotional.

4. Respect the community. No matter how long the editing process takes, stay calm, patient, and respectful. Most people on Wikipedia – especially those editors engaging on Talk pages – are there because they have a genuine interest in the encyclopedia becoming the most extensive free resource of knowledge on the internet. Some of them have been on Wikipedia for years. Let them guide you and be open to the conversations that arise.

It’s a lengthy process, but it is worth it. If there are inaccuracies or misinformation in a Wikipedia article and you choose to ignore them, you are consenting to that information being served to countless AI tools and to anyone who asks a relevant question.

Remember, people don’t have to visit Wikipedia to be influenced by its content. Wikipedia is being brought to them over and over again by Google Search, Gemini, Copilot, ChatGPT and other AI tools. Make regularly monitoring Wikipedia articles about your company part of your online reputation strategy.

Using LLMs to Generate Content on Wikipedia

Now that you know the complexities behind editing a Wikipedia article, you might be wondering if AI can be used to simplify the process.

The short answer is no. Generative AI is not yet developed enough for you to ask a tool to draft content for a page and then copy and paste that content into a Wikipedia article. This won’t work for a few reasons:

1. As we mentioned, Wikipedia content must be supported by high-quality third-party sources. Most generative AI tools don’t provide sourcing, and any sources that are provided may not meet Wikipedia’s standards.

2. Generative AI content still needs to be checked for facts and accuracy. 

  1. LLMs can draw conclusions that are not present in a single reliable source.
  2. They can hallucinate, or make things up.

The situation could change in the future. As of November 2023, using LLMs to create content for Wikipedia is not prohibited; it’s just impractical for the reasons mentioned above. Editors have proposed guidelines for using LLMs, but those guidelines are still in the brainstorming stage. 

Of note: the current proposal would require editors to declare any AI-assisted edits in their edit summaries. A tool like Ghostbuster, “a state-of-the-art method for detecting AI-generated text,” could be very helpful in enforcing a rule like this if it is put in place.

If articles can’t be written by AI, could someone use LLMs to at least help create first drafts of articles or provide ideas for content revisions? Possibly. Here are some pros and cons.

  • Pros of using LLMs
    • Could write believable, human-like text
    • Help scale the work of volunteers on Wikipedia and Wikimedia projects
    • Extremely valuable for semiautomated content review and translations
  • Cons of using LLMs
    • Machine-generated content has to be balanced with human review
    • Erroneous information can be included
    • Lesser-known language versions of Wikipedia may be overlooked and become populated with unreliable or incorrect content
    • LLMs can “hallucinate” and make up information
    • LLMs contain implicit biases that often result in content skewed against marginalized and underrepresented groups of people
    • Copyright-violating material could be generated
    • Potential decline in Wikipedia usage if more people turn to AI for more concise summaries
    • Quality can suffer if content goes unverified 

However, AI is already active on Wikipedia in other ways. Bots frequently perform tasks such as fixing citations, repairing dead links, or making small grammar edits. There is also a Content Translation Tool, although on the English Wikipedia its use is limited to extended-confirmed editors with 30 days’ tenure and at least 500 edits.
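To give a sense of what this kind of routine bot work involves, here is a minimal sketch that uses the public MediaWiki API to flag possibly dead external links in an article. It is illustrative only; a real bot would need approval under Wikipedia’s bot policy and proper rate limiting.

```python
# A minimal sketch of the kind of maintenance task Wikipedia bots automate:
# checking an article's external links for dead URLs via the public MediaWiki API.
# Illustrative only; real bots must follow Wikipedia's bot policy and rate limits.
import requests

API = "https://en.wikipedia.org/w/api.php"

def external_links(title: str) -> list[str]:
    """Fetch the external links listed in a Wikipedia article."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "extlinks",
        "ellimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    data = requests.get(API, params=params, timeout=10).json()
    page = data["query"]["pages"][0]
    # Handle both response shapes ("url" in formatversion 2, "*" in the legacy format).
    return [link.get("url") or link.get("*") for link in page.get("extlinks", [])]

def dead_links(title: str) -> list[str]:
    """Return links that fail to respond or return an error status code."""
    dead = []
    for url in external_links(title):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                dead.append(url)
        except requests.RequestException:
            dead.append(url)
    return dead

if __name__ == "__main__":
    for url in dead_links("Wikipedia"):
        print("Possibly dead:", url)
```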

AI Hallucinations

An AI hallucination occurs when an AI model generates incorrect information and presents it as fact. Aside from misleading people and eroding user trust, hallucinations can perpetuate biases or cause other harmful consequences if taken at face value.

In April 2023, a team of Stanford University scientists evaluated four engines powered by AI: Bing Chat (now Copilot), NeevaAI, perplexity.ai, and YouChat. They found that only about half of the sentences generated by the search engines in response to a query could be fully supported by factual citations. The scientists reported, “We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness.”

One reason for the issue is that the bots prioritize sounding human over finding the most truthful information.

AI is improving, and with a focus on source retrieval, market competition, and more tech companies entering the field, accuracy should increase, making AI more useful. Fixing hallucinations is a priority for many AI companies. The Wikimedia Foundation is working on tools that will make it easier for editors to identify bot-generated content, which should help combat AI hallucinations on Wikipedia.

Wikipedia Community Opinions

Editors are divided as to whether LLMs should be allowed to train on Wikipedia content. While open access is a cornerstone of Wikipedia, there is concern that unrestricted scraping of its data allows AI companies to exploit it in products they ultimately charge for, or to draw people away from the encyclopedia altogether. This is especially a problem if, as discussed above, Wikipedia content itself becomes AI-generated, creating a feedback loop of potentially biased information.

In the worst-case scenario, this would erode trust not only in Wikipedia but also in generative AI, Google Search, and voice assistants – everything that pulls from and is in some way influenced by Wikipedia articles. It could all come tumbling down if we don’t put restraints in place. Human engagement remains the most essential building block of the Wikipedia knowledge ecosystem.

Alignment

Alignment is the goal of ensuring that an AI system acts in the best interests of humanity. As you can imagine, this is both enormously important and exceedingly challenging. In the case above, having AI undermine trusted systems is hardly in our best interests.

This is one of the main reasons why Wikipedia is sticking with human editors. As one article on the subject puts it, “One of the things that’s really nice about having humans do the summarization is that you get some sort of basic level of alignment by default…And if you appreciate the editors of Wikipedia are human, they have human motivations and concerns and that their motivations are providing high-quality educational materials to align with your needs, then you can essentially put trust in the system.”

Regulation Regarding Wikipedia and AI

On December 9, 2023, the European Union’s Parliament reached a provisional agreement with the Council on the Artificial Intelligence Act, “the world’s first comprehensive AI law.” It calls for general-purpose AI systems (such as ChatGPT) to share detailed summaries of the content used for their training. That would make it even clearer how important Wikipedia is to AI.

On another front, lawsuits have begun to appear as organizations grapple with AI.

The future of AI is unclear. We don’t know what the ripple effects could be from any legal decision. What we can say is that if the requirement to cite sources points people over and over again to Wikipedia, the encyclopedia is going to experience a mad rush of editors trying to create, edit, and somehow control the content available for consumption.

Wikipedia will not take that lying down. It already has numerous policies in place to prevent misleading, biased, or false content from being added to articles. With questions around AI on the rise, those policies, and the oversight editors provide for one another, will only increase.

AI Models Using Wikipedia’s Data

The Wikimedia Foundation has an AI plug-in for ChatGPT that became available in mid-July 2023. It directs ChatGPT to reference Wikipedia if it needs additional help answering a user query. Here is one example of how it works:

ChatGPT, recognizing that it couldn’t answer Albon’s question – What happened with OceanGate’s submersible? – directed the plug-in to search Wikipedia (and only Wikipedia) for text related to the question. After the plug-in found the relevant Wikipedia articles, it sent them to the bot, which read and summarized them, then produced its answer.

When the plug-in is used, ChatGPT’s answers include links to the Wikipedia entries it drew from. We expect more plug-ins like this to come.
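For the technically curious, here is a hedged sketch of the retrieve-then-summarize pattern described above: search Wikipedia for the question, fetch the top article’s summary, and hand that text to a language model along with the source URL. This is not the plug-in’s actual code; it simply illustrates the flow.

```python
# An illustrative sketch of the retrieve-then-summarize pattern the plug-in follows.
# Not the Wikimedia plug-in's actual code; it only demonstrates the general flow.
from urllib.parse import quote
import requests

def wikipedia_context(question: str) -> tuple[str, str]:
    """Return (summary_text, article_url) for the top Wikipedia search result."""
    # Step 1: search Wikipedia (and only Wikipedia) for pages matching the question.
    search = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": question, "format": "json"},
        timeout=10,
    ).json()
    title = search["query"]["search"][0]["title"]

    # Step 2: fetch a plain-text summary of the top result via the REST API.
    slug = quote(title.replace(" ", "_"))
    summary = requests.get(
        "https://en.wikipedia.org/api/rest_v1/page/summary/" + slug,
        timeout=10,
    ).json()
    return summary["extract"], "https://en.wikipedia.org/wiki/" + slug

context, url = wikipedia_context("What happened with OceanGate's submersible?")
# Step 3: a chatbot would summarize `context` in its answer and cite `url` as the source.
print(context)
print("Source:", url)
```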

Another item on the wish list, though not yet available, is AI models that guide new volunteers with step-by-step instructions as they work on new articles. The article-creation process involves many rules and protocols and often alienates Wikipedia newcomers. A tool like this could help bring more editors to Wikipedia and reduce the number of first drafts that are rejected or sent back for extensive edits.

In Summary

We are in the early stages of AI, and we don’t know what the future holds. There are many concerns, but also plenty of benefits to an AI tool that effectively helps editors on Wikipedia and provides accurate information to those seeking it online.

Even within these early stages, Wikipedia is being referenced by AI. People don’t have to visit Wikipedia to be influenced by its content. Wikipedia is being brought to them over and over again by Google Search, Gemini, Copilot, ChatGPT and other AI tools.

For this reason, Wikipedia articles need to be up to date and factual, and this work must be done by humans. No AI tool will fully match human content creation on Wikipedia or follow its strictly enforced requirements for sourcing and a neutral tone of voice. Even so, people should not directly edit the Wikipedia article about their organization. Instead, they should use the Talk page and work with other editors.

While human engagement is still pivotal, experts in the AI field and members of the Wikipedia community and the Wikimedia Foundation are seeking ways to integrate AI and Wikipedia more effectively. Keeping abreast of these changes will be important for managing your company’s footprint and reputation across the internet.

If this is something you’d like to talk about, give us a shout at any time.

