A few months ago, in a lab meeting at Georgia Tech, a PhD student said something that stuck with me: “the well has been poisoned,” referring to the fact that we can no longer guarantee the purity (from AI) of any newly written text. Though not necessarily a novel idea, it is a worrying prospect - is AI/LLM-generated text so pervasive that it has become truly inescapable?

There’s a tense interplay between AI-generated text and natural human language: each day brings hundreds of funny posts about professional slip-ups with LLMs, numerous college professors trying to catch students using ChatGPT, an nth paper about how the latest _PO variant beat a “human” benchmark by 1.8%, etc.

In a world where the zeitgeist is so heavily defined by the role of AI and text (when in recent history has the general public cared so much about language?), it seems only natural that we should wonder about the semiotic implications and consequences of language technologies.

on Baudrillard

In his seminal work Simulacra and Simulation, Jean Baudrillard examines the relationship between signifier and signified in modern society. Traditional semiotics holds that the signifier and the signified are the two components of a sign - which is itself an abstract or material representation of some concept to which we can ascribe meaning.

“Nothing is a sign unless it is interpreted as a sign” - Charles Sanders Peirce

The signifier, then, is the perceivable form that a sign takes - a word (itself a combination of letters), an image, a sound - and the signified is the concept that the sign represents.

Baudrillard, however, posits that the stable relationship between the signifier and the signified has grown more and more unclear, leading to a state where the grounding of a sign to a concrete world object or concept becomes increasingly arbitrary, or may not exist at all. This blurriness between signs and reality defines Baudrillard’s conception of simulacra; the signifiers of concepts within the world grow detached from their signifieds and adopt meaning that is separated from any original “reality.” At some point, the relationship between signifier and signified inverts - the precession of simulacra, where the map not only precedes the territory but generates it. In turn, these signs produce “hyperreality” - a state in society where the simulation of reality via simulacra precedes any true reality, where the signs that model concepts become stand-ins for the concepts themselves.

LLM Hyperreality

Here lies the crux of the problem: at what point can we argue that the language generated by LLMs has become a simulacrum for true human language? I believe it can be argued that we are increasingly building the hyperreality in which AI-generated text continuously displaces and precedes human text. The Dead Internet Theory (a well-known conspiracy theory) already crudely touches on this idea, arguing that the majority of internet interactions and activity are by bots that manipulate social media algorithms. Now, with the advent of LLMs and increasingly powerful Generative AI tools, this prospect becomes more and more likely. It seems that almost all online activity could be marked by the influence of LLMs, and each subsequent interaction, generation, or post more deeply cements the hyperreality built by the simulacrum that is artificial text. This also may lead to what Baudrillard calls implosion, where the nuance and meaning of human language gets lost, plagued by the loss of a stable referent as AI text becomes more and more prevalent.

At some point in the past few years, “ChatGPT” became a stand-in for “LLM”. To people who are not particularly versed in the NLP space (or who don’t keep up with emerging technologies), other variants of LLMs (Gemini, Claude, Llama, etc.) are bleached to a generic “ChatGPT”. I find this notably different from traditional brand genericide (“Google” for “Internet search”, for example) because, to the general public, “ChatGPT” encompasses the totality of the signified. ChatGPT as a signifier has gained its own autonomy, and can continuously perpetuate that autonomy via hyperreality. For so many people now, LLMs do not even exist outside of ChatGPT; the sign of ChatGPT has lost its stable signified and now stands for the concept in its entirety.

Baudrillard finds that there are four stages of simulacra. The third stage (the “order of sorcery”) is a state where the simulacrum becomes a copy without an underlying reality, while maintaining the mimicry of something real. I would argue that the digital world, with LLMs, has reached this stage. AI-generated text promises to be faithful to human language, and often represents it in a way that seeks to convince a human reader that it is in fact the same. The more generated text that exists and permeates our everyday encounters with language, the more the simulacra build upon the hyperreality that we inhabit online. In fact, LLM-generated text is no longer a faithful representation of its human-language referents; LLMs hallucinate, craft incorrect answers, and draw false conclusions, acting as a representative copy of human text while existing independently, without a grounding in reality. The meaning that we ascribe to this language cannot be rooted in any true reality, as the signifier has uprooted the signified.

Latent Spaces

I was also curious about how we might understand the operational mechanics of these models under the lens of hyperreality. Latent spaces are lower-dimensional spaces that encode representations of higher-dimensional data. In LLMs, vast, high-dimensional text data is compressed into a lower-dimensional latent space where it is more compact and more easily manipulated, allowing us to search for patterns or novel characteristics in the distribution - semantic relationships, conceptual proximity, stylistic patterns. However, these encodings are merely internal signifiers, and their relationships to external, concrete signifieds are often complex. Are these latent encodings direct representations of real-world concepts, or do they just signify patterns that are purely internal to the training set, which is itself a high-dimensional collection of signifiers? This raises the question of whether LLMs construct their output from a simulated understanding, creating texts that are simulations derived from these embedded, abstract signifiers. This self-referential system generates outputs that can appear more “real” than real, a perfect, smoothed-out version of language that masks its own lack of grounding.
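To make “internal signifiers” a little more concrete, here is a minimal sketch of how one might probe a latent space for conceptual proximity. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model (illustrative choices on my part, not anything canonical): sentences are mapped to vectors, and the cosine similarity between those vectors stands in for semantic relatedness.

```python
# A minimal sketch: probing an embedding (latent) space for semantic
# proximity. Assumes the sentence-transformers library and the
# all-MiniLM-L6-v2 model - illustrative choices, nothing canonical.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The desert hides a well somewhere.",
    "Somewhere in the sand, water waits to be found.",
    "Quarterly earnings exceeded analyst expectations.",
]

# Each sentence becomes a 384-dimensional vector - an internal signifier.
embeddings = model.encode(sentences)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two latent vectors, roughly in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Proximity" here is computed entirely between representations.
print(cosine_similarity(embeddings[0], embeddings[1]))  # high: related imagery
print(cosine_similarity(embeddings[0], embeddings[2]))  # low: unrelated
```

Notice that the similarity is computed entirely between representations; at no point does anything in this pipeline touch the world the sentences describe. The signifiers relate only to each other.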

The Well in the Desert of the Real

In Simulacra and Simulation, Baudrillard also introduces the concept of the desert of the real, describing the state of our reality where we are increasingly surrounded by the hyperreal, and where encounters with authentic reality become more and more scarce. He argues that we exist now in this desert where everything is simulated, just copies of copies, signifiers of signifiers. In many ways, our digital world is becoming such a desert, where we are surrounded by AI-generations of generations of generations.

One of my favorite books of all time is Antoine de Saint-Exupéry’s The Little Prince. In it, there is this wonderful little quote:

“What makes the desert beautiful,” said the little prince, “is that somewhere it hides a well.”

In our modern desert of the real, this well represents unfettered, raw human expression via language, grounded in the human experience, truth, and meaning. The well represents the value that our humanity imparts to the words that we say and the text that we write. If the metaphorical well truly has been poisoned, as the PhD student expressed, then the well itself has become a simulacrum of human expression. In the hyperreal desert, LLMs hold the power to continuously and infinitely build these simulated wells.

Each might appear to offer refreshment, knowledge, or connection, engaging in a seduction where their fluency and responsiveness charm us into accepting their outputs as equivalent, or even superior, to human interaction. Yet collectively, they risk drawing us further into the hyperreal. If the desert of our digital world becomes populated by countless, easily accessible, yet artificial sources, does the inherent value of seeking out a genuine well diminish? The question then becomes not only whether we can distinguish the authentic from the artificial, but whether the sheer volume and sophistication of the simulacra will fundamentally alter our relationship with human-generated language itself.

and Derrida?

Still thinking! Check back later for more :D