AI-Generated Toxic Waste Is The Risk Of Bad Data
The Artificial Intelligence Trust Quotient Series, Part Three: Source Of Truth
Data: The Fuel Of AI That It Would Be Foolish To Ignore
Many of those in the world of data, business intelligence and analytics are very familiar with the line, “garbage in, garbage out.” This is just as true of artificial intelligence (AI) technologies. In fact, given the broad scope of use cases and outcomes to which AI can be applied, the output may surpass garbage status and become toxic waste.
It’s my sometimes unpopular view that AI is closely aligned with, or perhaps even simply part of, the ongoing evolution of data and analysis technologies: from the early days of decision support and management information systems, through business intelligence, self-service and predictive analytics, to, just before Generative AI (GenAI), data science and machine learning (ML, GenAI’s less trendy sibling). It has long been known that feeding a lack of data, poor quality data, or unrepresentative data into analysis will at best lead to low quality results and at worst to bad decisions that run the risk of harm.
AI Is Data-Driven At Speed And Scale
That AI can make decisions at speed and scale, potentially without human review, makes the risks of bad data far worse. The risks to brand and reputation are obvious; an easy example is an AI-powered chatbot on a consumer website giving inaccurate, misleading, or even offensive answers. The risks of emerging use cases, where AI is tasked with making decisions that affect outcomes for individuals, groups and organizations (the AI-TQ covers this as its own assessment area, Outcomes), are substantially greater.
Trust In Data Is More Than Technical Accuracy
The quality of a data-driven decision depends on the quality of the data used, which leads us quickly to questions of trust. Fundamentally, do you trust the data? When we use the word trust, do we mean that the data is technically accurate, or that we are confident it is both accurate and sourced responsibly? This question is at the heart of the AI-TQ: just because something is technically correct doesn’t automatically mean we trust it. Perhaps the source data comes from a third party with questionable data practices, or was gathered without the knowledge or consent of participants.
AI Is Driving The Agenda But It Won’t Get Far Without Data
Over time, the value of what can be done with data has captured the imagination of business leaders, and the advent of AI hype supercharged the process. Avoiding the AI conversation amongst the senior ranks of organizations around the world, including governments, is practically impossible. With such a voracious appetite for highly data-dependent AI, it would be easy to assume that data and information management enjoy similarly high profiles. This is not always the case.
Data quality, governance and broader data and information management are long-established areas of IT, with decades of best practice, volumes of technical insight, and a broad and evolving set of software solutions available. What has not always been as available is the investment of resources to pay for these data-focused efforts. Without wishing to oversimplify, it is generally a lot easier to get support and resources for the outcomes of data use than for the data inputs that make those outcomes possible.
Asking The Right Questions To Provide Data Visibility For Everyone
Establishing visibility into the data used by AI solutions does not have to be purely technical. In my view, the technical aspects of assessment are often relatively well catered for; the pressing need is to provide visibility for the less data-savvy. This means thinking more about how data is gathered, its availability and its governance, and less about the technology that manages it.
The AI-TQ considers several areas of assessment under Source of Truth that look into the data itself, its provenance and collection, its availability for inspection, and how it is managed and governed (a brief illustrative sketch follows the list):
Publicly available data - is the data freely available to anyone? For example, certain types of government-published records.
Proprietary data - is the data the private property of an organization? For example, Customer Relationship Management (CRM) records.
Auditability - can the source data be examined by anyone?
Opt-in / Opt-out - is the data gathered on an opt-in or an opt-out basis?
Reliance on Third Parties - to what extent does the solution rely on proprietary third-party data and third-party AI technologies, e.g. Large Language Models (LLMs) for GenAI from other vendors?
Copyright - is the data subject to copyright? If so, has permission been granted for its use? And is the copyright status of the solution’s outputs clearly defined?
Documented Data & Information Management Program - is the data of the assessing organization subject to a documented data and information management program that establishes governance, quality and usage rules?
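To make these assessment areas concrete, here is a minimal, hypothetical sketch of how they might be captured as a structured record. It is written in Python, and every field name, type and check is an illustrative assumption of mine, not part of the official AI-TQ:

```python
from dataclasses import dataclass, field
from enum import Enum


class ConsentBasis(Enum):
    """How the data was gathered from participants."""
    OPT_IN = "opt-in"
    OPT_OUT = "opt-out"
    UNKNOWN = "unknown"


@dataclass
class SourceOfTruthRecord:
    """Illustrative record of the Source of Truth assessment areas.

    Field names and types are assumptions for this sketch,
    not an official AI-TQ schema.
    """
    dataset_name: str
    publicly_available: bool              # freely available to anyone?
    proprietary: bool                     # private property of an organization?
    auditable_by_anyone: bool             # can the source data be examined?
    consent_basis: ConsentBasis           # opt-in vs. opt-out collection
    third_party_dependencies: list[str] = field(default_factory=list)
    copyright_cleared: bool = False       # permission granted for data use?
    output_copyright_defined: bool = False
    documented_dim_program: bool = False  # data & information mgmt program?

    def open_questions(self) -> list[str]:
        """Return areas that still need attention before trusting the data."""
        questions = []
        if not self.auditable_by_anyone:
            questions.append("Source data cannot be independently examined.")
        if self.consent_basis is not ConsentBasis.OPT_IN:
            questions.append("Data was not gathered on an opt-in basis.")
        if self.third_party_dependencies:
            questions.append(
                f"Relies on third parties: {', '.join(self.third_party_dependencies)}."
            )
        if not self.copyright_cleared:
            questions.append("Copyright permission for the data is unconfirmed.")
        if not self.output_copyright_defined:
            questions.append("Copyright status of solution outputs is undefined.")
        if not self.documented_dim_program:
            questions.append("No documented data & information management program.")
        return questions


# Example: a CRM-backed chatbot using a third-party LLM (hypothetical values)
record = SourceOfTruthRecord(
    dataset_name="crm_customer_records",
    publicly_available=False,
    proprietary=True,
    auditable_by_anyone=False,
    consent_basis=ConsentBasis.OPT_OUT,
    third_party_dependencies=["vendor LLM"],
)
for question in record.open_questions():
    print("-", question)
```

Even a lightweight record like this makes the non-technical questions of consent, copyright and reliance on third parties as visible to stakeholders as the technical ones.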
Coming Next To The AI-TQ: Transparency
This research is part of a series that will culminate in the official launch of the Artificial Intelligence Trust Quotient (AI-TQ) assessment. Next in the series is “Transparency” - shedding light on the technically opaque field of AI.
Also, my continued thanks and appreciation for the feedback and comments on this research so far! It is immensely valuable to me, and I look forward to more.