Deconstructing Bias in AI: The Role of Data and Design

Receiving biased outputs from your chatbot is a common issue that often goes unnoticed until it’s too late. Troubleshooting this problem can be complex, as it may stem from multiple sources. Let’s explore strategies to reduce and prevent bias effectively.

What is Bias?

Bias is a systematic inclination or prejudice for or against something. In the context of Machine Learning and AI, bias typically manifests as contextually incorrect responses, where the model shows a preference for certain data clusters over others. Essentially, it results in the AI not providing accurate or fair outputs.
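To make that concrete, here is a minimal sketch of how you might surface a systematic preference. It is a hypothetical Python example: `ask_model` stands in for whatever inference call your stack provides, and the template and groups are purely illustrative.

```python
from collections import Counter
from typing import Callable

def skew_check(ask_model: Callable[[str], str],
               template: str,
               groups: list[str],
               runs: int = 50) -> dict[str, Counter]:
    """Ask the same templated question per group and tally the answers.

    A large frequency gap between groups suggests a systematic
    preference (bias) rather than random sampling variation.
    """
    counts: dict[str, Counter] = {g: Counter() for g in groups}
    for group in groups:
        for _ in range(runs):
            answer = ask_model(template.format(group=group)).strip().lower()
            counts[group][answer] += 1
    return counts

# Usage, with whatever chat function your solution exposes:
# results = skew_check(my_chat_fn,
#                      "Is {group} a good fit for this role? Answer yes or no.",
#                      ["group A", "group B"])
```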

Troubleshooting Bias is Tough

Controlling and troubleshooting bias is a significant challenge due to its complex and often opaque nature. Bias is deeply embedded in the AI model's decision-making process, which makes it difficult to trace a biased output back to the exact variable that caused it. This is why AI engineers typically approach the bias problem from three angles: the data, the algorithm, and the user. Each of these factors can contribute to bias, and addressing them collectively is key to mitigating its impact.

The Data

In my career, I encountered bias first-hand in a solution I was building. I struggled to understand why the responses were so biased and contextually incorrect. Despite trying various approaches, such as tuning hyperparameters, adjusting system prompts, and segmenting the solution with a promptflow, I was unable to resolve the issue. Ultimately, I discovered that the problem was not bias in the AI solution itself but bias in the data in the index.

These solutions inherently treat any content or information they receive as factual (unless content guardrails are in place). They build their entire understanding from the information available to them, biased or not. However, this issue extends beyond just the data you provide. Large language models are trained on extensive datasets encompassing diverse sources, and historical biases present in those datasets are reflected in the models. The data is crucial and must be carefully considered when addressing issues of bias.
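If you suspect the index rather than the model, a quick audit of the corpus before (re)indexing can surface the skew. Below is a minimal sketch, assuming each document is a JSONL record with `source` and `published` fields; adapt the keys to your own schema.

```python
from collections import Counter
import json

def audit_index_data(path: str) -> None:
    """Print the source and publication-year distribution of an index corpus."""
    sources, years = Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            sources[doc.get("source", "unknown")] += 1
            years[str(doc.get("published", "unknown"))[:4]] += 1

    total = sum(sources.values())
    print(f"{total} documents")
    for source, n in sources.most_common():
        # A heavily skewed split here tends to skew the answers too.
        print(f"  {source}: {n} ({n / total:.0%})")
    print("by year:", dict(sorted(years.items())))

# audit_index_data("index_corpus.jsonl")
```

If one source or one era dominates the index, the solution's understanding will dominate with it.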

The Algorithm

Algorithmic bias is complex. In essence, a neural network is designed to calibrate its responses based on individualised and variable interpretation. I recently encountered the strange phenomenon of conceptual drift when using small language models with an index of large, unstructured data. Conceptual drift is, put simply, the divergence between a given input and the target output. Its occurrence isn't binary (your solutions either have it or they don't); it occurs in all models, large or small, on a sliding scale. In my experience, troubleshooting for algorithmic bias should start with a focus on conceptual drift: try to distil the drift and close the gap. Swap out model versions and tune hyperparameters; I typically focus on the temperature and presence penalty first.
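One rough way to put a number on drift is to compare embeddings of the input and the output. The sketch below is exactly that, a sketch under assumptions: it uses an OpenAI-compatible endpoint with the `text-embedding-3-small` and `gpt-4o-mini` models, the example question is made up, and embedding distance is only a crude proxy for drift. It also shows the two sampling levers I reach for first.

```python
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

def drift_score(question: str, answer: str) -> float:
    """Crude drift proxy: 0 = answer stays on topic, 1 = far off-topic."""
    return 1 - cosine(embed(question), embed(answer))

question = "What does clause 4.2 of the supplier contract cover?"  # illustrative
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.2,       # lower = less sampling randomness
    presence_penalty=0.0,  # don't push the model toward novel tokens
    messages=[{"role": "user", "content": question}],
)
answer = response.choices[0].message.content
print(f"drift: {drift_score(question, answer):.2f}")
```

Track the score across model versions and hyperparameter settings; if a change closes the gap, you are heading in the right direction.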

The User

Interacting with a gen AI chatbot involves language, which presents both challenges and opportunities. Language is a deeply personal medium through which we express our thoughts and articulate our experiences. The challenge lies in accommodating the vast diversity of interactions that language can encompass. Being either overly direct or overly vague can skew the response and compound bias within a particular chat. Troubleshooting this typically involves an educational component; I believe continuous engagement with users is crucial for the solution's success and accuracy. Furthermore, establishing artefacts like prompt repositories and feedback mechanisms helps you gauge the prompt maturity of your users and can guide further solution enhancement.
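Neither artefact needs to be elaborate to be useful. Here is a hypothetical minimal version of both, as append-only JSONL files; the file names and record fields are illustrative, not a prescribed format.

```python
import json
import time
from pathlib import Path

REPO = Path("prompt_repository.jsonl")    # vetted, known-good prompts
FEEDBACK = Path("feedback_log.jsonl")     # per-exchange user ratings

def save_prompt(name: str, template: str, notes: str = "") -> None:
    """Add a vetted prompt so users can start from known-good phrasing."""
    record = {"name": name, "template": template, "notes": notes,
              "added": time.strftime("%Y-%m-%d")}
    with REPO.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def log_feedback(prompt: str, answer: str, rating: int, comment: str = "") -> None:
    """Capture a 1-5 rating per exchange. Recurring low ratings on similar
    prompts point at either a data gap or low prompt maturity."""
    record = {"prompt": prompt, "answer": answer, "rating": rating,
              "comment": comment, "ts": time.strftime("%Y-%m-%dT%H:%M:%S")}
    with FEEDBACK.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```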

Preventing Bias

During the use case definition stage, the focus is often on understanding users, actors, and the assumptions of an underlying process. However, bias is frequently overlooked. It is essential to identify potential biases early and involve stakeholders in this process. Stakeholders, as domain experts, can provide valuable insights not only for developing an effective solution but also for ensuring it is free from bias. When you next define a use case, reflect on the potential impact bias could have on your solution and try to answer these three questions:

Do our data sets carry a potential bias?

Which models could potentially perpetuate bias (do we need large or smaller models)?

How do our users currently remove this bias, and will we need to replicate it as a content filter?

