When Personal Data In AI Risks Personalized Harm

The Artificial Intelligence Trust Quotient Series, Part Six: Personal Characteristics 

The role of data in AI should never be underestimated. Where data is the input, the quality of the output depends on how good, or not, that data is. In previous papers I have written for the Artificial Intelligence Trust Quotient (AI-TQ) I explored some of that role, primarily as the raw material used for training AI models. As I have tried to argue, while the technology and techniques behind the scenes may be extremely complicated, the principles of what is happening are surprisingly easy to explain.

Simplifying complexity is also the approach of the AI-TQ framework. By examining the nature of inputs, the approach taken to processing them, and the nature and impact of outputs, assessment of AI solutions is possible without delving into the nuts and bolts of the technology.

The Power Of Unintended Bias

The “traditional” approach to programming a computer to do tasks is rules-based: if this happens, do that; if that happens, change this; and so on. In AI, we cannot create a rules-based system for a task such as recognizing the subject of a picture, for example a dog. In the case of machine learning, we instead give the machine example pictures of dogs and the model works out the probability that the subject of a new image is a dog; the greater the volume of (accurate) training data, the better the model works.
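
To make that contrast concrete, here is a minimal sketch. The feature names, figures, and the toy counting “model” are my own illustrative assumptions, not any real image-recognition system.

    # Rules-based: the programmer spells out every condition by hand.
    def is_dog_rules_based(has_four_legs: bool, barks: bool) -> bool:
        return has_four_legs and barks

    # Machine learning (toy version): the behavior is derived from labeled
    # examples rather than written as a rule. This stand-in "model" simply
    # estimates the probability that something with a long snout is a dog,
    # based on the hypothetical training examples it has seen.
    training_examples = [
        {"long_snout": True, "label": "dog"},
        {"long_snout": True, "label": "dog"},
        {"long_snout": True, "label": "not a dog"},  # e.g. a horse
        {"long_snout": False, "label": "not a dog"},
    ]

    def probability_dog(long_snout: bool) -> float:
        matching = [ex for ex in training_examples if ex["long_snout"] == long_snout]
        dogs = [ex for ex in matching if ex["label"] == "dog"]
        return len(dogs) / len(matching) if matching else 0.0

    print(probability_dog(True))   # ~0.67, learned from the examples above

Nothing in the second half was written as an explicit rule; change the examples and the answer changes with them.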

But imagine this: perhaps my job is to prepare the data for training a model to identify dogs in pictures. On a personal level (in other words, a bias) I have always had a fondness for sighthounds such as Greyhounds. While preparing my training data (pictures of dogs) for the model, I, quite possibly unconsciously, favor sighthounds over pictures of other breeds. The AI model is only as good as the data used to train it, so in my example the model may turn out to be extremely good at identifying Greyhounds, but pretty rubbish at identifying, say, Pugs. Here is the problem: AI can only make decisions based on what it has been trained on, and what it is trained on is subject to any bias, intentional or otherwise, within the training data.
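
Sketched in toy code below (hypothetical features and numbers, not a real image model), the effect looks like this: because the “dog” training examples are almost all greyhound-shaped, a greyhound-like test case is labeled correctly while a pug-like one is not.

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def classify(training_examples, features):
        # 1-nearest-neighbor: the label comes from the closest training example.
        nearest = min(training_examples, key=lambda ex: distance(ex[0], features))
        return nearest[1]

    # Hypothetical features: (snout length in cm, leg length in cm).
    training_examples = [
        ((22.0, 70.0), "dog"),       # greyhound
        ((24.0, 72.0), "dog"),       # greyhound
        ((20.0, 65.0), "dog"),       # greyhound
        ((4.0, 25.0), "not a dog"),  # cat
        ((5.0, 28.0), "not a dog"),  # cat
    ]

    print(classify(training_examples, (21.0, 68.0)))  # greyhound-like -> "dog"
    print(classify(training_examples, (3.0, 30.0)))   # pug-like -> "not a dog" (wrong)

Add a reasonable number of Pug-like examples to the training data and the second answer changes; the model's behavior is entirely a product of what it was shown.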

Bias (in its varied, subjective forms) in data sets is highly probable for many reasons, not least the changes in societal norms we experience over time. It may also be the case that the bias in a dataset is unacceptable legally, societally, or for a particular organization. In the case of data sourced from the internet, as is often used with Large Language Models (LLMs), we can find many different, often diametrically opposed, points of view, each asserting that it is “the truth”.

With these ideas in mind, the question becomes simple:

“If the data used to train an AI model contains information we don’t want to influence the AI’s outcomes, for example sexuality or ethnic background, whether included by design or by accident, how do we manage this?”

When Mostly Harmless Could Become Very Harmful

Data or information that represents identifiable elements of who we are should be treated with the highest respect. The definition of Personally Identifiable Information (PII) is well-established* and broadly applied through regulatory requirements such as the European Union’s General Data Protection Regulation (GDPR).

Back in the bad old days of big data,** some observers liked to talk about people creating a “digital exhaust”; I prefer to think of a “digital wake”. As we move through the digital universe, our activity creates waves of data that expose elements of who we are, what we do, what we like and dislike, and so on. Whether exhaust or wake, this data can be used to build increasingly accurate and sophisticated models that represent us. The primary driver behind their development was digital advertising revenue, and while a badly targeted ad may be annoying to the subject, the limited scope of that outcome means I believe most people would consider these models mostly harmless.

Extend the scope of those outcomes, with AI making wider-ranging and more impactful decisions, and that mostly harmless data could drive real harm: harm that could be founded on some of our most fundamental personal characteristics.

Identify The Inputs To Manage Risk In Outputs

The AI-TQ does not attempt to decide or dictate to users what personal characteristics mean to them. It does offer a range of widely recognized measures to consider as part of its customizable approach. In this case, the variables are simply the presence of particular types of data or content, whether used as input to AI model training or as source data for AI models in use. Users of the AI-TQ will weigh the importance of different variables according to their views and requirements. In my opinion, generally speaking, the more of these data types that are present, the higher the risk.

Some basic categories of personal information are provided as a starting point. These include widely recognized content types such as names, unique identifiers, and location data. Biometric data types are also included as part of the assessment.

The AI-TQ also includes a set of variables I call “potential discrimination drivers”. These are usually considered some of the most private pieces of personal information and could include things like ethnicity, disabilities, sexual orientation, medical history / records, and religious identity. Without doubt, and as with any section of the AI-TQ, what constitutes these data types and their relative weight is open to a great deal of interpretation by the organization or individual using the assessment tool (a simple illustrative sketch of how such weighting might work follows the list below).

  • Personally Identifiable Information

    • Unique Identifiers, e.g. Passport Number, Driving License Number

    • Name

    • Physical Address

    • Location Data

      • Country

      • Region

      • Precise

    • Online ID, e.g. IP or Cookie Data

  • Biometrics

    • Any, e.g. Facial, Eye / Iris, Fingerprint Scans, DNA profiles

  • Potential Discrimination Drivers

    • Gender and Gender Identity

    • Ethnicity

    • Religion

    • Medical History / Records

    • Political Affiliation

    • Education Level and / or Attainment

    • Sexuality
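
As a purely illustrative sketch, and emphatically not the AI-TQ's actual scoring mechanism, the snippet below shows one way the presence of these data types could be combined with user-chosen weights into a single indicative figure. Every name, weight, and flag here is an assumption made for the example only.

    # Hypothetical weights - in practice each organization or individual
    # using the assessment sets their own, per their views and requirements.
    weights = {
        "unique_identifiers": 3,
        "name": 1,
        "physical_address": 2,
        "precise_location": 3,
        "online_id": 1,
        "biometrics": 5,
        "ethnicity": 5,
        "medical_history": 5,
        "sexuality": 5,
    }

    # Data types found in the training / source data (hypothetical findings).
    present = {"name", "online_id", "precise_location", "ethnicity"}

    risk_score = sum(weights[data_type] for data_type in present)
    max_score = sum(weights.values())
    print(f"Indicative personal-characteristics risk: {risk_score} / {max_score}")

The more of the listed data types that are present, the higher the figure climbs, mirroring the intuition above that more of these inputs generally means more risk.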

Coming Next To The AI-TQ: Regulation

This research is part of a series that will culminate in the official launch of the Artificial Intelligence Trust Quotient (AI-TQ) assessment. The final part of this series comes next, covering the regulation of AI: its sources, uses, and monitoring.

In addition to expressing my gratitude to those who have provided feedback and support throughout this process, I’m excited to be launching the framework on a freely available basis in the coming weeks.

* See the UK’s Information Commissioner’s Office definition here https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-information-a-guide/

** Data has always been “big” in respect of our ability to collect, store and manage it
