Art Courtesy of Malina Reber.
When you’re visiting a country you’ve never been to before, you might rely on a travel guide to navigate the streets, find the best food, and understand the local customs. But what if that guide was written from outdated or incorrect stereotypes, leading you to misunderstand traditions, behaviors, and beliefs? Something similar happens when generative artificial intelligence (AI) creates images of different cultures. Like the travel guide, AI models rely on vast amounts of data to create depictions of the world. And crucially, if that data is biased, the results can be deeply flawed. Instead of accurate depictions, AI-generated images can reinforce outdated, oversimplified, or entirely incorrect stereotypes about cultures, perpetuating a distorted view of the world. Just as a bad guidebook can mislead a traveler, these images can mislead viewers, entrenching biases rather than breaking them down.
New text-to-image generative AI models like Stable Diffusion enable users to transform text descriptions into custom images, and they can produce some truly impressive visuals. These models are built through an extensive training process in which they learn by trial and error, gradually improving their guesses on a dataset of images that humans have paired with text captions. Stable Diffusion was trained on the Large-scale Artificial Intelligence Open Network (LAION) dataset, which contains roughly 5.85 billion image-text pairs. However, there are problems with this approach. Prompting Stable Diffusion to generate an image of a modern street in a non-Western city may produce something that reflects stereotypes from the West rather than an accurate depiction. “The results were not satisfying,” said Zhixuan Liu, a researcher at the Robotics Institute at Carnegie Mellon University who observed these troubling images.
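To make this concrete, here is a minimal sketch of what prompting a model like Stable Diffusion looks like in code, using the open-source Hugging Face diffusers library; the model checkpoint and prompt are illustrative choices, not the exact setup Liu’s team used.

```python
# Minimal sketch: prompting a Stable Diffusion checkpoint with the
# Hugging Face diffusers library. The model ID and prompt below are
# illustrative only, not the exact setup used in the research.
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly released Stable Diffusion checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # generation is far faster on a GPU

# The kind of prompt described above: a modern street in a non-Western city
prompt = "a modern street in a non-Western city"
image = pipe(prompt).images[0]
image.save("generated_street.png")
```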
Liu and a team of fellow researchers focused on enhancing generative AI to promote more inclusive representations. The team itself reflected diversity: Liu is Chinese, her advisor is Korean, and one of her closest colleagues is Nigerian. Together, they frequently tested Stable Diffusion and other AI models to assess their accuracy in reflecting their own respective cultures. “[The images] are not even Chinese. They’re not Korean. They’re not Nigerian. They all have Westernized cultural bias,” Liu said. Responding to these cultural misrepresentations, the team began working on their solution in November 2022, only months after the release of Stable Diffusion.
Liu aimed to tackle a major issue: the massive, flawed datasets like LAION that contributed to the harmful cultural stereotypes produced by the AI models. These datasets are typically sourced from the web, where minority cultures are often underrepresented, leading to biased outputs that fail to accurately reflect cultural diversity. “The distribution of these datasets is not good. You may find many authentic Chinese, Vietnamese, or Korean images, but the portion is very small in comparison to Western adaptations,” Liu said. To address this issue, Liu’s research team developed a new dataset called the Cross-Cultural Understanding Benchmark (CCUB), designed to provide more accurate representations of these underrepresented cultures. The dataset was built through direct engagement with the communities it represents, ensuring that their cultural nuances are better captured in the data. “This data set is very small but accurate. It contains several cultures, and for each culture, we recruit people from that culture to help us collect the text and image data from their respective cultures,” Liu said.
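The CCUB paper defines its own data format; purely as an illustration, a small, culturally curated collection of image-caption pairs might be organized along these lines, with every field and example below being a hypothetical stand-in rather than the real CCUB schema.

```python
# Hypothetical illustration of a small, culturally curated set of
# image-caption pairs. This is NOT the actual CCUB file format; the
# fields and example entries are invented stand-ins.
from dataclasses import dataclass

@dataclass
class CulturalExample:
    culture: str     # e.g., "Chinese", "Korean", "Nigerian"
    category: str    # e.g., "food", "architecture", "clothing"
    image_path: str  # image contributed or vetted by a member of that culture
    caption: str     # description written by a member of that culture

curated_examples = [
    CulturalExample("Nigerian", "food", "images/ng_0001.jpg",
                    "A plate of jollof rice served at a family gathering in Lagos"),
    CulturalExample("Korean", "architecture", "images/kr_0001.jpg",
                    "A hanok courtyard house with tiled roofs in Jeonju"),
]
```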
After developing the CCUB dataset, Liu’s research team created a new fine-tuning technique for text-to-image AI models called Self-Contrastive Fine-Tuning (SCoFT). This method updates the internal parameters of models like Stable Diffusion, steering them away from their typical, stereotyped outputs and toward images that are more culturally relevant. The team applied SCoFT using their curated dataset, which helped the model produce images that better reflect different cultural contexts. Overall, this approach improved the AI’s ability to generate culturally accurate images from text descriptions.
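For a sense of what fine-tuning on a small curated dataset involves, here is a simplified sketch using the Hugging Face diffusers library. It shows only the standard denoising objective on an invented placeholder batch; the self-contrastive component that gives SCoFT its name is described in the team’s paper and is not reproduced here.

```python
# Simplified sketch: fine-tuning Stable Diffusion on a small curated set
# of (image, caption) pairs with the standard denoising objective.
# The actual SCoFT method adds a self-contrastive term not shown here,
# and the placeholder "dataloader" below is an invented stand-in.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, text_encoder, tokenizer = pipe.unet, pipe.vae, pipe.text_encoder, pipe.tokenizer
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# Placeholder for a curated dataloader; in practice this would yield
# preprocessed images and captions from a culturally curated dataset.
dataloader = [(torch.randn(1, 3, 512, 512), ["A hanok courtyard house in Jeonju"])]

for images, captions in dataloader:
    with torch.no_grad():
        # Encode images into the latent space the diffusion model works in
        latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
        # Encode the culturally specific captions into text embeddings
        tokens = tokenizer(captions, padding="max_length", truncation=True,
                           max_length=tokenizer.model_max_length, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids)[0]

    # Standard diffusion training step: add noise, then train the model to predict it
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample

    loss = F.mse_loss(noise_pred, noise)  # how far the prediction is from the true noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```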
Improving the cultural accuracy of generative AI models like Stable Diffusion is essential because the images they generate shape how cultures are represented to global audiences. In the future, Liu hopes that generative AI models will use the CCUB dataset to produce images that more accurately represent marginalized cultures, reducing bias and improving the inclusivity of AI-generated media. “The development of the CCUB dataset is just the beginning. We continue to contact more people from these marginalized cultures to expand our dataset,” Liu said. Each improvement to these models moves us closer to a more inclusive and accurate portrayal of diverse cultures, a goal that will depend on ongoing innovation and collaboration across cultures to address these biases.