March 30, 2022
Wikipedia, which is consistently ranked one of the top 10 most visited websites, is often the first stop for many people looking for information about historical figures and changemakers. But not everyone is equally represented on Wikipedia. Only about 20 percent of biographies on the English site are about women, according to the Wikimedia Foundation, and we imagine that percentage is even smaller for women from intersectional groups, such as women in science, women in Africa, and women in Asia.
For my PhD project as a computer science student at the Université de Lorraine, CNRS, in France, I worked with my adviser, Claire Gardent, to develop a new way to address this imbalance using artificial intelligence. Together, we built an AI system that can research and write first drafts of Wikipedia-style biographical entries. There is more work to do, but we hope this new system will one day help Wikipedia editors create many thousands of accurate, compelling biography entries for important people who are currently not on the site.
To me, the problem was personal and based on the lack of representation I saw reflected in libraries when I was in grade school. When I was in third grade, I was assigned to write an essay about a historical figure, and the only requirement was that the library had to have a book about the person. I wanted to write about Eleanor Roosevelt but had to settle for Teddy Roosevelt. And what if I wanted to write about someone who looked like me — would that even have been an option? If we imagine the same assignment today, students would undoubtedly turn to the internet, most likely Wikipedia. While Wikipedia has millions of articles in English — including an excellent one about Eleanor Roosevelt — we know there are still plenty of women whose stories and achievements aren’t reaching future generations.
While women are more likely to write biographies about other women, Wikimedia’s Community Insights 2021 Report, which covers the previous year, found that only 15 percent of Wikipedia editors identified as women. This leaves women overlooked and underrepresented, despite the enormous impact they’ve had throughout history in science, entrepreneurship, politics, and every other part of society. Canadian physicist Donna Strickland won the Nobel Prize in Physics in 2018. However, anyone looking for information about her on Wikipedia wouldn’t have been able to find it until a biography about her esteemed work was finally published, days after she won the biggest prize in her field of study. Various studies, including from the Wikimedia Foundation itself, have also called out the gender imbalance on the platform. Even though women are already underrepresented, biographies about them are still disproportionately nominated for deletion. One study found that in 2017, 41 percent of biographies nominated for deletion were about women.
We believe open, reproducible science can provide a starting point to address this issue. Today, we are open-sourcing an end-to-end AI model that automatically creates high-quality biographical articles about important real-world public figures.
Our model searches websites for relevant information and drafts a Wikipedia-style entry about that person, complete with citations. Along with the model, we are releasing a novel data set that was created to evaluate model performance on 1,527 biographies of women from marginalized groups. This data set can be used to train models, evaluate performance, and push the model forward. We believe these AI-generated entries can be used as a starting point for people writing Wikipedia content and fact checkers to publish more biographies of underrepresented groups on the site.
There is still plenty more we can do to help bring a wider representation of notable people from all backgrounds to Wikipedia. Fundamentally, AI systems like the one we built will have to confront broader societal and technical challenges in order to fully address the problem. This starts with the web content used to create Wikipedia entries, which may be flawed or reflect cultural biases. On the technical side, the text generation system may be prone to “hallucinating” nonfactual content. Even today’s best language models struggle to create text that is coherent over many paragraphs. We’re hoping to improve these through advances in the neural architectures that power such models and through breakthroughs in the responsible development of AI. Eventually, we hope this approach will be able to help nonexperts produce accurate articles to add to the collection of information on the web, with only minimal editing needed.
While our model isn’t a panacea, it’s an important step forward to support and supplement other existing efforts that are working to address gender representation on Wikipedia. Volunteer editors Jessica Wade and Penny Richards have worked independently to write and publish thousands of biographies on Wikipedia about women who deserve the spotlight. Another great, collective effort, the Women in Red Wiki Project, mobilizes editors to create new biographies and expand existing ones about notable women past and present.
We decided to take a complementary approach. Doing research, creating a bibliography, and writing are intensive, yet there is a trove of information available on the web that can be used to tell the stories of women whose achievements, voices, and legacies have otherwise been forgotten about or marginalized.
For example, we used our model to generate a short biography of Libbie Hyman, a pioneer in the study of invertebrate zoology. The green text is pulled from the reference article we started with, the purple text is from the web evidence, and the orange text indicates hallucination — meaning the model makes up information that can’t be verified.
The model retrieved relevant biographical information about Hyman, including her focus on invertebrates, significant publications, and the impact of her work, which can then be used as a starting point for editors to fact check (an area where the model still has shortcomings) and expand on her life and accomplishments.
We start the process of generating a biography by using a retrieval-augmented generation architecture based on large-scale pretraining, which teaches the model to identify only relevant information, such as birthplace or where the person attended school, as it builds the biography.
The model first retrieves relevant information from the internet to introduce the subject. Next, the generation module creates the text, while the third step, the citation module, builds the bibliography linking back to the sources that were used. The process then repeats, with each section predicting the next, covering all the elements that make up a robust Wikipedia biography, including the subject’s early life, education, and career.
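The retrieve, generate, and cite loop described above can be sketched in a few lines of Python. This is a minimal illustration, not our released model: the names `Evidence`, `Section`, `retrieve`, `generate`, `cite`, and `write_biography` are hypothetical, the retriever is a simple word-overlap ranking standing in for a learned one, and the generator is a placeholder that only joins the retrieved sentences.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    url: str   # hypothetical source location
    text: str  # a sentence retrieved from that source


@dataclass
class Section:
    heading: str
    text: str
    citations: list = field(default_factory=list)


def tokenize(text):
    """Lowercase a text and return its set of words, without edge punctuation."""
    return {w.strip(".,;:()").lower() for w in text.split()}


def retrieve(query, corpus, k=1):
    """Step 1: rank evidence by word overlap with the query (assumed stand-in retriever)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(ev.text)), ev) for ev in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [ev for score, ev in ranked[:k] if score > 0]


def generate(heading, evidence):
    """Step 2: placeholder generator that joins the evidence text."""
    return " ".join(ev.text for ev in evidence)


def cite(evidence):
    """Step 3: build the bibliography from the sources that were used."""
    return [ev.url for ev in evidence]


def write_biography(name, headings, corpus, k=1):
    """Repeat the three steps once per section heading."""
    sections = []
    for heading in headings:
        evidence = retrieve(f"{name} {heading}", corpus, k)
        sections.append(Section(heading, generate(heading, evidence), cite(evidence)))
    return sections
```

Calling `write_biography("Jane Doe", ["education", "career"], corpus)` with a small list of `Evidence` items returns one `Section` per heading, each carrying the URLs of the evidence it drew on, so an editor can trace every sentence back to a source.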
We generate section by section, using a caching mechanism similar to Transformer-XL to reference previously written sections and achieve greater document-level context. Caching is important because it allows the model to better track what it previously generated.
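To make the bookkeeping concrete, here is a toy sketch of section-level caching. It rests on a simplifying assumption: Transformer-XL caches hidden states from earlier segments, whereas this example caches raw tokens, so it shows only how a bounded memory of earlier sections can be passed forward. `SectionCache`, `generate_with_cache`, and the `write_section` callback are hypothetical names.

```python
from collections import deque


class SectionCache:
    """Keeps the most recent tokens from earlier sections as extra context."""

    def __init__(self, max_tokens):
        # A bounded deque drops the oldest tokens once it is full.
        self.tokens = deque(maxlen=max_tokens)

    def context(self):
        """Return the cached tokens, oldest first."""
        return list(self.tokens)

    def update(self, section_text):
        """Add a finished section's tokens to the cache."""
        self.tokens.extend(section_text.split())


def generate_with_cache(headings, write_section, max_tokens=50):
    """Write sections in order, giving each one the cached earlier text."""
    cache = SectionCache(max_tokens)
    sections = []
    for heading in headings:
        text = write_section(heading, cache.context())
        sections.append(text)
        cache.update(text)
    return sections
```

The bounded cache is the design choice worth noting: context from earlier sections stays available without the input growing as the biography gets longer.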
Automatic and human evaluations show that the model is capable of finding relevant information and using it to generate biographies, but there is still work to do. Those evaluations found that 68 percent of the generated text in the biographies we created wasn’t found in the reference text. This could mean several things. It could suggest that the model does a good job of finding and synthesizing relevant information while not acting as a plagiarism bot. However, the figure is hard to interpret on its own, since it is difficult to know which of that new information is accurate and which is not. We asked evaluators to determine whether full sentences were factual, and found many cases in which sentences were only partially verifiable. These challenges are similar to those faced by the field of text generation broadly, though they are exacerbated in the case of marginalized groups, as there is very little data about them. We hope that releasing this data set will allow other researchers to study this problem.
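One simple way to measure how much generated text is absent from a reference is n-gram overlap. The sketch below is an assumption for illustration only: it does not reproduce how the 68 percent figure was computed, and `ngrams` and `novel_fraction` are hypothetical helper names.

```python
def ngrams(text, n):
    """Split a text into lowercase word n-grams."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_fraction(generated, reference, n=1):
    """Fraction of n-grams in `generated` that never appear in `reference`."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0  # nothing generated, so nothing is novel
    ref = set(ngrams(reference, n))
    return sum(1 for g in gen if g not in ref) / len(gen)
```

A high value from a measure like this says only that the text differs from the reference. It cannot tell synthesized facts apart from hallucinated ones, which is why human evaluation of factuality is still needed.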
There were several other obstacles we encountered during our research. First, the lack of training data, or biographical articles that already exist about women, was very difficult to overcome. Existing articles about women, especially those from marginalized groups, are substantially shorter than the average article about men, are less detailed, and use different language — for example, “female scientist” instead of simply “scientist.” This bias in training data caused models to internalize such bias. Beyond this, Wikipedia articles must be written based on factual evidence, often sourced from the internet. However, the bias on Wikipedia extends to bias on the internet: There are very few web-based locations that could be used as evidence.
While deeply rooted problems can’t be solved quickly, this is exactly the type of problem where technology can be used to help engineer positive change.
We are excited to share this work with the community to foster discussion and experimentation and to drive progress toward a more equitable availability of content on Wikipedia.
Our model addresses just one piece of a multifaceted problem, so there are additional areas where new techniques should be explored. When a Wikipedia editor or our AI model writes a biography, information is pulled from around the internet and cited. However, for all of the enriching knowledge the internet has provided, some sources have a bias that must be considered. For example, when women are represented, their biographies are more likely to include extra details about their personal lives. A 2015 study found the word “divorced” appears four times as often in women’s biographies as it does in biographies of men. This could be for many reasons, including the tabloid fodder that tends to follow the lives of notable women more closely than those of men. As a result, personal details end up being more likely to be mentioned in articles about women, distracting from accomplishments that should be in the spotlight and celebrated.
Technology has already shown promise in helping address various imbalances, which is proof that there is even more the community can do to help make a difference. For example, the site’s former chief executive explained how an algorithm discovered an important mistake on the site: While Wikipedia health articles are vetted by medical editors, for years some articles on critical women’s health issues, such as breastfeeding, were labeled “low importance.”
There is even more work to be done for other marginalized and intersectional groups around the world and across languages. Our evaluation and data set focus on women, which excludes many other groups, including nonbinary people. Articles about transgender and nonbinary people tend to be longer, but much of the additional space is devoted to their personal lives instead of expanding on their accomplishments, according to a 2021 study that looked at social biases in Wikipedia articles. It is important to recognize that bias exists in varying forms, especially in default online sources of information.
We are passionate about sharing this as an important research area with the broader text generation community. We hope that our techniques can eventually be used as a starting point for human Wikipedia writers and ultimately lead to a more equitable availability of information online, accessible to students writing biographies and to everyone else.
This post was updated to clarify the name of the university. It's Université de Lorraine, CNRS, in France.