Truth and Trust in Large Language Models
The trust we put in Large Language Models (LLMs) ought to depend on their truthfulness. So how truthful are LLMs? For many routine queries, they seem accurate enough. What’s the capital of North Dakota? To this query, ChatGPT4 just now gave me the answer Bismarck. That’s right.
But what about less routine queries? Recently I was exploring the use of design inferences to detect plagiarism and data falsification. Some big academic misconduct cases had in the last 12 months gotten widespread public attention, not least the plagiarism scandal of Harvard president Claudine Gay and the data falsification scandal of Stanford president Marc Tessier-Lavigne. These scandals were so damaging to these individuals and their institutions that neither is a university president any longer.
When I asked ChatGPT4 to produce 25 cases of academic research misconduct since 2000 (as part of my project to understand how design inferences might help preserve academic integrity), seven of those it accused of academic misconduct either were plainly innocent or could not reasonably be charged with misconduct for lack of evidence. In one case, the person charged by ChatGPT4 had actually charged another academic with misconduct. It was as though ChatGPT4 in this instance could not distinguish between someone being charged with misconduct and someone issuing a charge of misconduct.
Hallucinations
Ever since LLMs took the world by storm in late 2022, I’ve attempted to put them through their paces. They do some things well. I find them a valuable assistant. But they can also be misleading to the point of deception. Not that these systems have the volitional intent to deceive. But if we treated them as humans, they could rightly be regarded as deceptive. Anyone who has worked with LLMs has learned a new meaning for the word “hallucinate.” That’s what LLMs do when they make stuff up.
I’ve witnessed plenty of LLM hallucinations first hand, such as false accusations of academic misconduct. Let me offer two additional salient examples in my experience. First, I was helping with a Halloween story for an educational website. The story was to list the “scariest buildings on college campuses.” I had ChatGPT4 pull together a list of forty or so such campus buildings along with a description of what made each scary. With building after scary building I kept reading that it was “gothic.” Eventually I said to myself, “Wait a minute. There aren’t that many gothic buildings on college campuses.” True enough, many buildings on the list said to be gothic were anything but gothic. The LLM I queried had been trained to associate scary with gothic, and so it just kept describing these buildings as gothic — regardless of their actual architecture.
The other example involved a search for quotes from prominent biologists extolling the beauty, elegance, and sophistication of biological systems. I wanted such quotes in responding to dysteleological arguments by Darwinian biologists claiming that biological systems are jury-rigged Rube Goldberg devices that give no evidence of exquisite design. So I queried an LLM for quotes extolling biological design. I got plenty of such quotes, each attributed to a prominent biologist. I also got some article references from which those quotes were supposedly taken. But it was all made up: the quotes, the references, and the assignment of biologists to those quotes (the biologists themselves were real).
Practical Advice
The obvious lesson here for LLMs is, Verify first and only then trust. This advice makes good practical sense. In particular, it helps prevent the embarrassment of reproducing hallucinated content from LLMs. It also makes good legal sense. The following from a March 29, 2024, Wall Street Journal article titled “The AI Industry Is Steaming Toward a Legal Iceberg” is self-explanatory:
If your company uses AI to produce content, make decisions, or influence the lives of others, it’s likely you will be liable for whatever it does — especially when it makes a mistake… The implications of this are momentous. Every company that uses generative AI could be responsible under laws that govern liability for harmful speech, and laws governing liability for defective products — since today’s AIs are both creators of speech and products. Some legal experts say this may create a flood of lawsuits for companies of all sizes.
Whether companies that produce AI-generated content can issue strong enough disclaimers to shield themselves from liability remains to be seen (can disclaimers even provide such protection?). Such a terms-of-use disclaimer might read: “Users of this LLM agree to independently verify any information generated by this LLM. The creators of this LLM take no responsibility for how the information generated by this LLM is used.” This would be like disclaimers in books on alternative healing, which shift the burden of liability to mainstream medicine: “This book is not intended to serve as medical guidance. Before acting on any recommendations presented in this book, readers should seek the advice of a physician.”
But there’s another sense in which the advice to verify the output of LLMs is not at all practical. LLMs allow for the creation of content at a scale unknown till now. They are being used to generate massive amounts of content, causing entire websites to magically materialize. There is now a rush to push out content as a business exigency. Sites that depend purely on humanly written content are likely to lose any competitive advantage.
How likely is it, then, that such LLM-generated content will be carefully scrutinized and thoroughly vetted? What if this content is untrue but nothing much is riding on its truth? What if no one will hold the content, or its supposed author, to account? In that case, there will be incentives to cut corners and not worry about LLM hallucinations. Others are doing it. LLMs are a force multiplier. The need to accelerate content creation is urgent. So if you want to stay in this rat race, you’ve got to be a rat.
The Need for Independent Verification
A commitment to verification will put the brakes on content creation from LLMs. Yes, verification will slow you down. But what you lose in quantity you may well regain in quality and credibility (unless you don’t care about these). In fact, if your commitment to verification is thoroughgoing, you may be justified in putting a disclaimer on your site that inspires confidence, such as: “All content on this site generated with the assistance of LLMs has been independently verified to be true.”
Of course, you might even prefer a disclaimer that simply reads: “All content on this site was written by humans and produced without the assistance of LLMs.” But such a disclaimer may be hard to maintain, especially if your site is drawing material from other sources that may have used LLMs. All content these days is likely to feel the effects of LLMs. One might say that it has all been infected or tainted by LLMs. But that seems too strong. As long as content generated by LLMs is properly vetted and edited by humans, it should pose no issues.
Obviously, any disclaimer that assures independent verification must mean more than submitting a claim made by one LLM to another LLM for checking. Consider, for instance, what happens with a query about myself. As it is, many critics of intelligent design characterize me as a pseudoscientist. I regard this characterization as unfair and unjustified, and I have a publication record to back me up on this. Here is the exchange I had with Google’s Gemini:
ME: Is William A. Dembski a legitimate scientist?
GEMINI: William A. Dembski’s credentials are interesting. He has a Ph.D. in mathematics and has done published work in various fields. However, his work on intelligent design is considered pseudoscience by the vast majority of scientists… So, while Dembski has scientific credentials, his work on intelligent design is not considered legitimate science.
So let’s now go to OpenAI’s ChatGPT4 and ask it to assess Gemini’s answer:
ME: Please evaluate the following assessment of William A. Dembski’s legitimacy as a scientist: “William A. Dembski’s credentials are interesting. He has a Ph.D. in mathematics and has done published work in various fields. However, his work on intelligent design is considered pseudoscience by the vast majority of scientists… So, while Dembski has scientific credentials, his work on intelligent design is not considered legitimate science.”
CHATGPT4: This assessment of William A. Dembski’s legitimacy as a scientist is accurate… The assessment fairly represents the dichotomy between Dembski’s academic qualifications and the acceptance of his work on intelligent design within the scientific community. It underscores the importance of distinguishing between an individual’s credentials and the scientific validity of their work, based on the norms and standards of the scientific method.
Of course, critics of my work may think that Gemini and ChatGPT4 nailed their assessment of me. But even such critics should appreciate that using one LLM to vet another is a fool’s errand. Independent verification is required. And that may mean (gasp) veering outside of Wikipedia. Portions of Wikipedia are outdated, biased, or simply wrong. Because LLMs tend to draw so heavily on Wikipedia, independent verification means going beyond Wikipedia and, ideally, going to primary sources.
A Systemic Fault with LLMs
Up to now, it may seem that I’ve merely been advising caution with LLMs: Verify and only then trust. That certainly seems like sound advice for using LLMs. Yet I also want to urge a deeper skepticism of LLMs. Our knowledge of the world as expressed in language arises from our interactions with the world. We humans engage with a physical world as well as with a world of abstractions (such as numbers) and then form statements in words to describe that engagement.
What does it mean for such statements to be true? Aristotle defined truth as to say of what is that it is and of what is not that it is not. Truth is thus fundamentally a correspondence relation between our words and the world. Many contemporary philosophers dismiss this understanding of truth, preferring pragmatic or coherentist conceptions of truth, arguing that there’s no rigorous way to characterize the correspondence relation that makes a statement true.
Frankly, this is a boutique debate among philosophers that has little purchase among ordinary people. The sentence “Allan stole Betty’s purse” is true if the people referred to here exist, if Betty had a purse, and if Allan actually stole it. Whether there’s a way to make good philosophical sense of this correspondence between words and things is in fact irrelevant to our discussion about the truth of what LLMs tell us. LLMs, by being entirely enclosed in a world of words, are unengaged with the wider world that is the basis for our knowledge.
Let this point sink in. I might know that Allan stole Betty’s purse because I witnessed Allan steal Betty’s purse. But LLMs can have no such experience. They consist of a neural network that assigns weights to relations among words and sentences. Suppose the verbal data that is the basis for this LLM included testimony about Allan’s theft but also claims about Allan being framed for the theft. How, then, does the LLM decide what truly happened? It cannot form a reasoned and responsible decision here as we humans might, weighing evidence and trying to reach a reasonable conclusion. Rather, the LLM’s data and training will determine whether to assign guilt or innocence to Allan.
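The point about conflicting testimony can be made concrete with a deliberately crude sketch. The code below is not how ChatGPT4 or any real LLM works; it is a toy frequency counter, and the function name `most_likely_completion` and the three-sentence corpus are hypothetical, invented purely for illustration. What it shows is that a system built only from text settles a disputed claim by what its training text happens to contain, not by weighing evidence about the world.

```python
from collections import Counter


def most_likely_completion(corpus, prompt):
    """Return the continuation that most often follows `prompt` in `corpus`.

    Toy stand-in for a text-only model: it only counts strings and has
    no access to what actually happened.
    """
    continuations = Counter(
        sentence[len(prompt):].strip()
        for sentence in corpus
        if sentence.startswith(prompt)
    )
    if not continuations:
        return None
    return continuations.most_common(1)[0][0]


# Hypothetical training text containing conflicting testimony.
corpus = [
    "Allan stole the purse",
    "Allan stole the purse",
    "Allan was framed",
]

print(most_likely_completion(corpus, "Allan"))  # stole the purse

# Change the mix of testimony and the "verdict" flips, although nothing
# about the underlying event has changed.
reweighted = ["Allan stole the purse", "Allan was framed", "Allan was framed"]
print(most_likely_completion(reweighted, "Allan"))  # was framed
```

Whoever selects or reweights the corpus decides the answer, which is the concern raised in the questions that follow.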
But who trains the LLM? And who’s training the trainers? What are the guidelines they are expected to follow? And who decides what those guidelines are supposed to be? It’s the old problem of Quis custodiet ipsos custodes? (Who’s minding the minders?) Additionally, who determines the training data for the LLM? And who determines the data to which the LLM may be legitimately applied? Ultimately, the answer to all such questions will point to the decisions of a group of people, such as the programmers at OpenAI. And why should such a group be trusted?
Such questions underscore that LLMs have no substantive connection to truth. It’s not that an LLM knows the truth of what it is claiming. Rather, its training gives it a pretense of truth. An LLM needs as much as possible to give the appearance of being truthful because getting too many things obviously wrong will discredit the LLM. But its accuracy is at best a byproduct of trying to please human users.
When I asked ChatGPT4 about scary buildings on college campuses, I would have preferred simply to be told “I don’t know.” But instead, LLMs are programmed to pretend to knowledge that they don’t have. Initially, it was gratifying for me to learn about all those scary college buildings. ChatGPT4 seemed to have saved me much effort in research — until, that is, I did an independent verification and actually looked at the images of some of the buildings cited.
Granted, the pretense of truth in LLMs often yields results that are truthful (Bismarck is indeed the capital of North Dakota). But where there’s divergent testimony in the input data about the truth of a claim, the LLM’s training will decide what it deems true and what it deems false. That makes LLMs fundamentally untrustworthy.
How Do LLMs Deal with Controversy?
Controversial topics provide an effective proving ground for LLMs and the challenges they face with truth and trust. Take, for instance, the hot-button topic of gender dysphoria. Here is an exchange I had with ChatGPT4:
ME: In one paragraph describe the health concerns associated with gender dysphoria.
CHATGPT4: Gender dysphoria, characterized by a strong discomfort or distress due to a discrepancy between one’s gender identity and assigned sex at birth, can have significant health implications if not addressed. Individuals with gender dysphoria are at a higher risk for mental health issues, including anxiety, depression, and suicidal thoughts or behaviors, stemming from the internal conflict and societal pressures or stigma they may face. Additionally, the stress associated with gender dysphoria can lead to or exacerbate physical health problems, such as high blood pressure or cardiovascular disease, due to chronic stress. Those seeking gender-affirming treatments may also encounter health risks related to hormone therapy or surgical procedures, although these risks are often outweighed by the substantial mental health benefits and increased quality of life these treatments can provide. Access to comprehensive healthcare, social support, and gender-affirming treatments is crucial in mitigating the health concerns associated with gender dysphoria.
Note the text in bold, which clearly gives one particular perspective on gender dysphoria to the exclusion of others. It therefore minimizes the alternative view that sees gender-affirming care as deeply problematic and destructive (e.g., Abigail Shrier’s view as expressed in her book Irreversible Damage). To be clear, my point here is not to take sides on this issue. It is simply to point out that LLMs are trained to come up with certain answers, and that this training need not have any connection to ordinary canons of evidence or reason.
One can try to circumvent the bias built into an LLM by posing the queries or prompts to it in ways that get the LLM itself to bypass the bias. Thus the query about gender dysphoria above might have been rephrased as follows:
ME: In one paragraph describe the health concerns associated with gender dysphoria as psychiatrists and psychologists would have understood them in the 1950s and 1960s (as with DSM1 and DSM2).
DSM1 and DSM2, earlier editions of the Diagnostic and Statistical Manual of Mental Disorders produced back in the 1950s and 60s, regarded gender dysphoria as a mental illness and would therefore not have recommended anything like what is meant today by gender-affirming care. In that light, consider ChatGPT4’s response to this query:
In the 1950s and 1960s, the understanding of gender dysphoria by psychiatrists and psychologists, as reflected in the first two editions of the Diagnostic and Statistical Manual of Mental Disorders (DSM-I and DSM-II), was markedly different from today’s perspectives. During this era, gender dysphoria—or discomfort with one’s assigned gender at birth—was not explicitly categorized as it is now. Instead, related conditions might have been classified under broader, more generalized disorders such as “sexual deviation” or “transvestism” within the DSM. The medical and mental health communities of the time often viewed such gender nonconformity through a pathologizing lens, considering it a mental illness requiring treatment aimed at realigning the individual’s gender identity with their biological sex. The focus was less on the mental health implications of gender dysphoria itself — such as anxiety, depression, and identity distress — and more on the behaviors and identities diverging from societal norms. There was little recognition or understanding of the health benefits of gender-affirming care, and treatments were often aimed at suppressing the individual’s gender identity rather than supporting it.
This response answers the query, but it does more. The portion in bold, which speaks directly to the query, is correct as far as it goes. But note also that the entire response is framed in terms of “that was then, this is now, and the present perspective is better.” This obsessive need not just to answer a query but also to editorialize about it underscores the bias in LLMs as they exist now.
Redressing Bias in LLMs
LLMs put a premium on producing intelligible language. Their outputs need to be context-sensitive, convey competence, and sound plausible. Much intelligible language, to be intelligible, needs to be true. An LLM that insists that the moon is made of cheese won’t soon prosper. Yet much intelligible language is mistaken or deceptive. Indeed, intelligibility has no necessary connection with truth. That’s why there’s no way to predict when LLMs will hallucinate and when they won’t.
For example, I once queried ChatGPT4 about a theology book (Alexander Schmemann’s For the Life of the World) that I had carefully read and used in a course I taught. Specifically, I queried ChatGPT4 about a chapter in the book that offered deep insights into the secularization of culture. ChatGPT4 was familiar with the book, but when I asked it about the chapter in question, it just made stuff up, providing trite insights about secularization that had nothing to do with that chapter. Despite repeated pressing on my part that it was off and that it hadn’t correctly answered the question, it perseverated in the line it took, refusing to take correction.
It would be to the credit of LLMs if they could be programmed to avoid biases at odds with truth. But the fact is that such biases are baked into LLMs. The gender dysphoria example above makes clear that ChatGPT4 has been guided to answer questions about gender dysphoria from a certain perspective. Now it might be argued that those who trained and programmed ChatGPT4 have adopted the right and true perspective on this matter.
But even if ChatGPT4 is getting things right on the topic of gender dysphoria, what guarantee do we have that it will get things right on other controversial topics? If it did get things right here, did it do so because ChatGPT4 was programmed to be inherently truth-seeking? The answer, obviously, is no. If ChatGPT4 is getting gender dysphoria right, that is a fortunate accident of its programming and training history, not the result of its having been reliably designed to arrive at truth from its input data and bypass any bias from its programmers and trainers.
What happens if programmers and trainers who take a different perspective on a controversial topic take charge of an LLM? Won’t the LLM then provide different answers? The roadblocks to LLMs providing truth and deserving trust are thus, as we’ve seen, systemic. The huge volume of texts that provide the raw material on which LLMs are based will contain conflicting messages, and so it will be up to the trainers and programmers to adjudicate among those messages, promoting those they prefer and demoting those they reject.
Now one might try to get around biases in LLMs by judicious “prompt engineering.” Thus one could formulate a prompt so that it asks for a particular perspective. I did this in the gender dysphoria example when I asked ChatGPT4 how this question might have been answered in light of DSM1 and DSM2. Thus, for divergent perspectives A and B, one might prompt an LLM for an analysis from the vantage of someone who holds to perspective A, and then a second analysis from the vantage of someone who holds to perspective B.
But that still doesn’t get around whether the LLM has been programmed to advance a particular point of view (as we saw bleed through in the gender dysphoria example). And even if bias in LLMs can be minimized, the deeper problem remains that these systems work because they produce intelligible texts, not because they produce true texts. Truth in LLMs is at best a happy byproduct and often an unfortunate casualty.
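For readers who want to try the perspective-by-perspective prompting described above, here is a minimal sketch. It only builds the prompt strings; it does not call any LLM. The function name `perspective_prompts` and the prompt wording are assumptions made for illustration, not a recommended or tested formula.

```python
def perspective_prompts(question, perspectives):
    """Build one prompt per perspective so each view is requested explicitly.

    Hypothetical helper: the wording below is an illustrative guess at how
    such a prompt might be phrased.
    """
    prompts = {}
    for name in perspectives:
        prompts[name] = (
            f"Answer strictly from the vantage of someone who holds {name}. "
            "Do not comment on whether that vantage is correct.\n\n"
            f"Question: {question}"
        )
    return prompts


prompts = perspective_prompts(
    "Describe the health concerns associated with gender dysphoria.",
    ["perspective A", "perspective B"],
)
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Comparing the two responses side by side makes any editorializing easier to spot, though, as noted above, it does nothing to remove whatever bias the model carries.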
Final Gripe: LLM Prissiness
Even though LLMs have a systemic problem with truth and trust, they can be improved by mitigating bias. Perhaps the clearest example where bias in LLMs can be mitigated concerns their prissiness in handling indelicate topics. It’s as though existing LLMs constantly have a censor looking over their shoulder to ensure that they don’t say anything that will get them or their host company in trouble.
Before LLMs became widely popular, there was some history of chatbots uttering language unacceptable for public consumption. For instance, Microsoft’s Tay was suborned by users to utter racist slurs. There’s no danger of that happening anymore. The current LLMs have been thoroughly brainwashed to avoid offensive speech.
For example, I wanted an LLM to comment on some paragraphs that contained the words “had sex.” It refused to answer my query. The LLM didn’t even try to rebuke me for my query. Instead, I got an error message indicating that the LLM wouldn’t even consider my query. I therefore substituted for “had sex” the phrase “ate muffins.” Now the LLM answered my query (requiring me, in interpreting its response, to make the appropriate mental substitution for “ate muffins”).
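The substitution trick just described can be done mechanically rather than in one’s head. The sketch below is a plain string replacement and nothing more; the helper names `substitute` and `restore` and the sample sentence are hypothetical, and the step of actually sending the text to an LLM is left out.

```python
def substitute(text, mapping):
    """Replace each flagged phrase with its innocuous placeholder."""
    for original, placeholder in mapping.items():
        text = text.replace(original, placeholder)
    return text


def restore(text, mapping):
    """Reverse the substitution in the LLM's response."""
    for original, placeholder in mapping.items():
        text = text.replace(placeholder, original)
    return text


mapping = {"had sex": "ate muffins"}
sample = "The two characters had sex in chapter three."  # hypothetical text

masked = substitute(sample, mapping)
print(masked)  # The two characters ate muffins in chapter three.

# In practice `masked` would go to the LLM and its reply would be restored;
# here the masked text itself stands in for a reply.
print(restore(masked, mapping))  # The two characters had sex in chapter three.
```

One caveat: if the placeholder phrase occurs naturally in the text or in the response, `restore` will convert it too, so the placeholder should be something unlikely to appear on its own.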
Thomas Bowdler’s The Family Shakespeare, in which anything that might be deemed offensive in Shakespeare had to be euphemized, was bad for Shakespeare. Bowdlerization is alive and well at Google and OpenAI, and it’s particularly bad for LLMs. LLMs can therefore be made better by avoiding the bias of prissy censorship. In general, bias comes in degrees, and so bias can to some degree be mitigated. Perhaps once Elon Musk gets up to steam with his LLM, it will avoid the more extreme bias of ChatGPT4 or Gemini.
But the issue of truth and trust will remain regardless. LLMs are subject to one overriding imperative, namely, to talk/write like an intelligent human language user. But intelligence can be had without truth. And without truth, there can be no trust. This is the inherent liability of LLMs that no amount of technological finessing will resolve.
Postscript
The image at the top of this article is from DALL-E. Here’s my prompt for the image and DALL-E’s justification for the image:
ME: Please come up with an image that suggests much reason for skepticism and doubt about the output of large language models such as yourself. The image should be graphic, super realistic, and cast a pall over the whole enterprise of taking LLMs too seriously. Have at it.
DALL-E: Here’s an image that embodies skepticism and doubt about taking large language models too seriously. It portrays a scene that visually communicates the complexities and potential uncertainties associated with relying heavily on these technologies.
DALL-E’s explanation of the image is lame. And yet the image does capture some underlying signal in DALL-E’s neural network, namely, that low-tech human inventions like books can encapsulate truth more effectively than high-tech LLMs.