The Machine That Doesn’t Know It Exists
Inside the architecture of large language models: the training runs, the tokens, the attention heads, and the unsettling truth that no one, including their creators, fully understands what happens between input and output.
Every few months, someone publishes a piece declaring that AI “finally understands language,” or alternatively, that it is “merely autocomplete.” Both camps are wrong in the same way: they mistake familiarity for comprehension. The tools exist. The outputs are real. The mechanism behind them stays, for most people (including many who build with these systems daily), genuinely obscure. This article will not speculate about consciousness or sentience. It will not forecast the apocalypse or announce the arrival of god. It will do something more radical: explain, with precision, what these systems actually are and how they actually work.
What follows is the first installment of a five-part series written under a single constraint: honesty. No hype. No panic. Just the mechanics: the architecture, the training, the inference, and the strange epistemic position we find ourselves in when the machine on the other end of your prompt produces something that feels, against all reason, like thought.
The “merely autocomplete” framing was always a half-truth used as a full dismissal. Yes, language models generate one token at a time based on probability distributions. But “autocomplete” on your phone predicts a word. GPT-4 drafts appellate briefs, finds bugs in cryptographic libraries, and proposes functional drug synthesis pathways. Not because it was programmed with those skills, but because something structurally remarkable emerges when you compress enough of human language into a sufficiently large statistical model. The dismissal fails not because it is technically incorrect, but because it confuses mechanism with capability. A human neuron is “just” electrochemical signaling. That description is accurate and completely useless.
The Atom of AI: What Is a Token?
Before you can understand how a language model works, you need to understand what it actually reads. The answer is neither words nor letters. It is tokens: subword units produced by an algorithm called Byte Pair Encoding (BPE), which carves language at its statistically most efficient joints.1 The word “unbelievable” becomes three tokens: un, believ, able. The word “cat” is one. A Python function signature might be four. An emoji is often two.
The algorithm doesn’t think, it predicts.
This matters more than it sounds. Every input a language model receives is converted into a sequence of integer IDs: your question, the document you paste, the developer’s instructions. The model never reads “English.” It reads numbers. What it has learned to do, through training on hundreds of billions of these numerical sequences, is build extraordinary probabilistic intuitions about which numbers tend to follow which other numbers, in what contexts, under what conditions.
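To make that concrete, here is a minimal sketch using tiktoken, the open-source tokenizer library OpenAI publishes; the "cl100k_base" encoding is the roughly 100,000-token BPE vocabulary associated with GPT-4-era models. Exact token boundaries differ between tokenizers, so treat the output as illustrative rather than canonical.

```python
import tiktoken  # open-source BPE tokenizer library from OpenAI

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era vocabulary, ~100k tokens

text = "The trophy didn't fit in the suitcase because it was too big."
ids = enc.encode(text)

print(ids)              # a list of integer token IDs; this is all the model sees
print(len(ids))         # number of tokens, which is what context limits count
print(enc.n_vocab)      # size of the vocabulary (roughly 100,000)
print(enc.decode(ids))  # decoding the IDs reproduces the original text
```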
GPT-4 operates with a vocabulary of roughly 100,000 tokens. Claude 3 and subsequent models use similar scales. Each input sequence has a hard upper limit called the context window, beyond which the model cannot “see.” Early GPT-2 models had context windows of 1,024 tokens. State-of-the-art models as of 2026 handle one million or more. This is not a cosmetic improvement. Longer context enables qualitatively different behaviors: maintaining coherence across book-length documents, multi-hour coding sessions, sustained legal review. The context window is the model’s working memory. Working memory is cognition.
This is the first and most fundamental point: a language model’s relationship with language begins with a compression. Human expression, reduced to integer sequences, passed into an architecture that will spend months and millions of dollars learning what those sequences mean to each other.
Attention Is All You Need, and All You Are
In 2017, a team of eight researchers at Google published a paper titled “Attention Is All You Need.”2 It introduced the Transformer architecture, the structural backbone of every major language model currently in production. The paper was understated in tone. Its implications were anything but.
The central innovation of the Transformer is the self-attention mechanism. Before attention, neural networks processed text sequentially, one word after the next, like reading aloud. Long-range dependencies were difficult: the model might forget the beginning of a sentence by the time it reached the end. The Transformer discarded sequential processing entirely. Instead, it processes the entire input simultaneously, and for each token, computes a weighted relationship with every other token in the context window. This is attention.
For every token in a sequence, attention asks: which other tokens are most relevant to predicting what comes next? The answer is computed as a score, and those scores determine what information flows forward into the prediction.
Consider the sentence: “The trophy didn’t fit in the suitcase because it was too big.” What does “it” refer to? The trophy. Humans resolve this instantly through contextual weighting. Attention mechanisms learned to do something structurally equivalent: they assign high relevance scores between “it” and “trophy,” and low scores between “it” and “suitcase,” because statistically, across billions of training examples, that pattern of co-reference is what the data demanded.3
Modern language models do not have a single attention mechanism. They have multi-head attention: dozens to hundreds of parallel attention computations running simultaneously, each learning to track different kinds of relationships. Syntactic structure, semantic meaning, coreference, tone, factual association. GPT-3 uses 96 attention heads in each of its 96 layers. The outputs of all heads are concatenated and passed forward through feedforward layers that compress and transform the information further.
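The mechanism is compact enough to sketch. Below is a simplified NumPy implementation of masked multi-head self-attention; the weight matrices are random stand-ins for parameters a real model learns during training, and production implementations add biases, dropout, KV caching, and heavy kernel-level optimization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model) token embeddings. W*: (d_model, d_model) weights."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project every token into query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split the projections into separate heads: (n_heads, seq_len, d_head).
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product attention scores between every pair of tokens.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)

    # Causal mask: each token may only attend to itself and earlier tokens.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)

    weights = softmax(scores)        # relevance weights; each row sums to 1
    heads = weights @ Vh             # weighted mix of value vectors per head

    # Concatenate the heads and project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

# Toy usage: 6 tokens, width 16, 4 heads, random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(16, 16)) for _ in range(4))
out, attn = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=4)
print(out.shape, attn.shape)  # (6, 16) (4, 6, 6)
```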
The Transformer does not read language the way humans do. It sees the entire context at once, like a photograph, and computes a web of relationships across every token simultaneously.
— Structural feature of decoder-only Transformer architectures
Stacked atop one another (12, 24, 96, 120 layers deep in the largest models), these attention blocks and feedforward networks form an extraordinarily deep computational graph. Information enters as token embeddings and is progressively transformed through each layer. By the final layer, the model produces a probability distribution over its entire vocabulary: these are the odds that each of the ~100,000 possible tokens should come next. The highest-probability token is typically selected, appended to the sequence, and the process repeats, one token at a time, until the response is complete.
Generation is always autoregressive. There is no “plan” that gets executed. The model does not know what it will say at token 200 when it generates token 1. Each token is a fresh probability computation over the entire accumulated context. What feels like paragraphs of coherent reasoning is, mechanistically, a sequence of decisions made one word-fragment at a time.
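In sketch form, the whole generation loop fits in a few lines; here `model` is a hypothetical stand-in for the full Transformer forward pass, assumed to return a row of logits for every position in the input.

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Greedy autoregressive decoding: one full forward pass per new token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(np.array(ids))          # logits over the whole vocabulary
        next_id = int(np.argmax(logits[-1]))   # pick the single most likely token
        ids.append(next_id)                    # the choice becomes part of the context
        if next_id == eos_id:                  # stop when the model emits end-of-sequence
            break
    return ids
```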
The Training Run: Compressing Human Civilization Into Weights
The architecture is the skeleton. Pre-training is where the model acquires everything it will ever know. The objective is deceptively simple: given a sequence of tokens, predict the next one. Do this trillions of times across the breadth of digitized human writing (books, academic papers, code repositories, forums, legal filings, scientific datasets, news archives) and adjust the model’s billions of parameters after each prediction to reduce the error.4
This process is called gradient descent via backpropagation. When the model predicts the wrong token, a mathematical signal flows backward through the network and nudges every parameter that contributed to the error in a direction that would have made a better prediction. This happens continuously, across batches of thousands of training examples, across months of training on clusters of thousands of specialized AI chips.
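The core of that loop is small enough to write down. The sketch below uses PyTorch and assumes a `model` that maps a batch of token IDs to next-token logits; everything around it, the data pipeline, the distributed sharding across thousands of chips, the checkpointing, is where the real engineering lives.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One step of next-token prediction on a (batch, seq_len) tensor of token IDs."""
    inputs, targets = batch[:, :-1], batch[:, 1:]   # predict token t+1 from tokens up to t
    logits = model(inputs)                          # (batch, seq_len - 1, vocab_size)

    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),        # flatten to (batch * seq, vocab)
        targets.reshape(-1),                        # flatten to (batch * seq,)
    )

    loss.backward()        # backpropagation: how much did each parameter contribute to the error?
    optimizer.step()       # gradient descent: nudge every parameter to reduce that error
    optimizer.zero_grad()
    return loss.item()
```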
The scale of modern pre-training defies easy comprehension. GPT-4’s training run consumed an estimated 25,000+ A100 GPUs running for months, at a cost widely estimated between $50 million and $100 million.5 Frontier models in 2025 and 2026 are trained at larger scales still, with training compute doubling roughly every eight to twelve months. What emerges from this process is a set of numerical weights, typically 70 billion to 1 trillion parameters, encoding, in a form no human can read directly, everything the model has learned about how language and ideas relate to each other.
The model does not store facts the way a database does. It stores relationships: the statistical texture of how concepts co-occur, contradict, modify, and imply each other across the entire scope of training data.
— Emergent structure in weight space, not symbolic rule storage
One of the most consequential and least-understood phenomena of pre-training is emergence: the appearance of qualitatively new capabilities at specific scales, entirely absent in smaller models.6 Chain-of-thought reasoning, in-context learning, multi-step mathematical problem-solving: none of these were programmed. They arrived, largely unpredictably, as the number of parameters and training tokens crossed certain thresholds.
This is philosophically significant. It means the capabilities of a frontier model cannot be fully predicted from the capabilities of its predecessors. Researchers building the next model do not know exactly what it will be able to do until it is trained. And it means the intuition that “the AI only does what it’s programmed to do” is not merely wrong. It is wrong in a way that has serious implications for how we reason about risk and capability going forward.
From Raw Model to Assistant: The Alignment Stack
A model immediately after pre-training is not an assistant. It is a completion engine: extraordinarily powerful, but undirected. If you prompt it with “How do I make a bomb?” it will complete the prompt in the style of whatever training document most closely resembles that opening. Instruction-following, safety refusals, helpfulness, coherent multi-turn conversation: none of these are properties of the pre-trained base model. They are added afterward through what the field calls the alignment stack.
That stack typically combines supervised fine-tuning on curated demonstrations, which teaches the model the format and tone of an assistant, with reinforcement learning from human feedback (RLHF), in which raters’ preferences between candidate responses train a reward signal that further shapes its behavior. The alignment stack is not a guarantee of aligned behavior. It is a learned behavioral distribution. The model has not internalized a value system the way a human might. It has developed strong statistical associations between certain inputs and certain outputs that were preferred by the humans and AI systems that shaped it. Whether this constitutes genuine alignment, or a simulacrum of alignment that will break under sufficient distributional shift, is among the most important open questions in AI safety research.
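One ingredient of that stack, the reward model trained on human preference comparisons, can be sketched as follows. `reward_model` is a hypothetical scorer that maps a prompt and a response to a single number; real pipelines pair this with supervised fine-tuning and a reinforcement learning step that optimizes the assistant against the learned reward.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise preference loss: push the preferred response's score above the rejected one's."""
    r_chosen = reward_model(prompt, chosen)      # scalar reward for the response raters preferred
    r_rejected = reward_model(prompt, rejected)  # scalar reward for the response they rejected
    # Bradley-Terry objective: the larger the margin, the smaller the loss.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```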
The Black Box Problem: What No One Knows
A frontier language model has billions of parameters. When it generates a response, the specific parameters responsible for any given claim, judgment, or inference cannot be identified. The computation is distributed, non-linear, and irreducible to human-readable rules. We can describe the architecture. We cannot describe the reasoning.
This is the uncomfortable epistemic situation the field calls interpretability, or rather, the lack of it. Mechanistic interpretability attempts to reverse-engineer what specific circuits inside a Transformer are actually doing.8 Researchers at Anthropic, DeepMind, and academic labs have identified individual attention heads that track grammar, circuits that detect sentiment, features that represent specific factual associations. But these findings, impressive as they are, represent a tiny fraction of what would be needed to fully understand even a small model.
The practical consequences of this opacity are significant. It is why hallucinations are architecturally inevitable rather than simply a bug to be patched. The model does not “look up” facts. It generates the statistically most plausible continuation of your prompt given its trained weights. When its training data contained a confident, factual-sounding sentence about a topic, the model learned to produce confident, factual-sounding sentences about that topic, regardless of whether the underlying claim is true.
Hallucination is not a malfunction. It is the predictable output of a system optimized to produce plausible continuations, not verified ones. The model has no mechanism for distinguishing “I know this because it was in my training data” from “I am generating a plausible-sounding claim.” Both outputs come from the same forward pass through the same weights.
The correct mental model: when a language model states a fact, it is saying “the completion most consistent with my trained weights and the context you’ve given me is this token sequence,” not “I have verified this against a ground truth.” The former can produce remarkable accuracy. It also produces authoritative-sounding fiction with the same structural confidence.
What we do know is growing. Techniques like activation patching, sparse autoencoders, and logit attribution are beginning to decompose model behavior into identifiable circuits.9 Anthropic’s interpretability team, among others, has demonstrated that it is possible to identify neural features corresponding to specific concepts (including abstract ones like “the presence of deception” or “emotional valence”) and to causally intervene on those features to predict and control model behavior. This is early and partial. But it is real progress toward a science of machine cognition.
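As a rough illustration of one of those techniques, here is a toy sparse autoencoder in PyTorch. The real versions are trained on billions of activation vectors pulled from a live model, and tuning the sparsity penalty is most of the art; this sketch only shows the shape of the idea.

```python
import torch
import torch.nn.functional as F

class SparseAutoencoder(torch.nn.Module):
    """Decompose a layer's activations into a larger set of sparsely active features."""

    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, n_features)
        self.decoder = torch.nn.Linear(n_features, d_model)

    def forward(self, activations, l1_weight=1e-3):
        features = F.relu(self.encoder(activations))      # sparse feature activations
        reconstruction = self.decoder(features)           # rebuild the original activations
        recon_loss = F.mse_loss(reconstruction, activations)
        sparsity = features.abs().mean()                  # L1 penalty keeps most features silent
        return features, recon_loss + l1_weight * sparsity
```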
We have built minds we cannot read. They respond to our questions with fluency and apparent coherence. We do not know, in any rigorous sense, why they say what they say. This is either the most important engineering problem in history, or it is fine. The answer depends on what these systems do next.
— The interpretability problem, stated plainly
Inference: What Happens When You Press Send
Most writing about AI focuses on training. Inference, the moment the model actually runs, receives comparatively little attention, despite being where all the value is created and all the risk is realized. Here is what happens, precisely, when you submit a prompt.
Your text is tokenized into integer IDs. A positional encoding is added, a vector that tells the model where each token sits in the sequence, since the Transformer processes all tokens simultaneously and has no inherent sense of order. The resulting token embeddings are passed through all the model’s layers in a single forward pass: attention, feedforward, normalization, repeat, for every layer in the stack. At the final layer, a linear projection maps the last hidden state to a vector of logits, one per vocabulary token: unnormalized scores that a softmax converts into next-token probabilities.
A sampling strategy then selects the actual next token. At temperature zero, the model always picks the highest-probability token: deterministic, reproducible, often stiff and repetitive. At higher temperatures, lower-probability tokens become more likely, increasing creativity at the cost of occasional incoherence. Parameters like top-p and top-k constrain which tokens are eligible before sampling. The selected token is appended to the context, and the entire forward pass runs again for the next token.
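A minimal sketch of that sampling step, combining temperature with top-p (nucleus) filtering, assuming you already have the logits from a forward pass:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Turn a vector of raw logits into a single sampled token ID."""
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))            # greedy: deterministic, often repetitive

    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()

    # Nucleus filtering: keep the smallest set of tokens whose probability mass >= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]

    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))
```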
On current hardware, a single forward pass through a 70-billion parameter model takes approximately 1-5 milliseconds. Generating a 500-token response requires 500 forward passes, roughly 0.5-2.5 seconds before accounting for memory bandwidth costs, KV-cache management, and batching overhead. The economics of inference at scale are why companies like Groq, Cerebras, and specialized inference providers exist as a distinct layer of the AI stack.
The most important recent architectural development in inference is chain-of-thought reasoning and its productionized descendant, extended thinking. Rather than generating a final answer in a single sequence, models like OpenAI’s o3 and Anthropic’s Claude 3.7 Sonnet are trained to generate extended reasoning traces: thousands of tokens of intermediate working before producing a final response. Reasoning traces systematically improve performance on logic, mathematics, and multi-step problems by giving the model more computational space to work through a problem before committing to an answer.
The irony is significant. A system with no genuine planning, no working memory in the traditional sense, and no awareness of its own reasoning process can, by generating enough tokens of apparent deliberation, produce outputs that are structurally indistinguishable from those of a careful human reasoner. Whether that distinction matters, whether process matters if the outputs are reliable, is a question this series will return to.
What This Architecture Cannot Do, and What Comes After
The Transformer is not a complete theory of intelligence. It is an extraordinarily effective architecture for pattern completion at scale, but it has clear limitations the current training paradigm cannot fully resolve. It has no persistent memory beyond its context window. It cannot update its weights at inference time. Everything it knows was fixed at the end of training. It cannot autonomously take actions in the world, verify its own claims against live data, or engage in genuine recursive self-improvement.
The systems being built in 2025 and 2026 are increasingly aimed at these gaps. Retrieval-augmented generation (RAG) connects model inference to external databases, allowing a model to look up facts rather than confabulate them. Tool use and function calling give models the ability to execute code, browse the web, and interact with APIs, converting language capability into real-world agency. Long-context memory systems persist relevant information across sessions, partially simulating the kind of durable world model that humans carry effortlessly.
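A minimal sketch of the RAG pattern, with `embed` and `llm` as hypothetical stand-ins for an embedding model and a chat completion call, and a plain in-memory list standing in for a real vector database:

```python
import numpy as np

def retrieve_then_answer(question, documents, embed, llm, top_k=3):
    """Look up the most relevant documents, then ground the model's answer in them."""
    q_vec = embed(question)
    doc_vecs = np.stack([embed(d) for d in documents])

    # Cosine similarity between the question and every document.
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    best = np.argsort(sims)[::-1][:top_k]

    context = "\n\n".join(documents[i] for i in best)
    prompt = (
        "Answer using only the sources below. If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```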
The larger architectural question of whether the Transformer is the final form of general AI, or merely the current dominant paradigm, remains genuinely open. State Space Models like Mamba offer linear-time inference compared to the Transformer’s quadratic attention cost, enabling much longer effective contexts at lower computational expense.10 Mixture-of-Experts architectures, used in GPT-4 and Gemini Ultra, route different inputs to different specialized sub-networks, achieving effective parameter counts that far exceed what any single dense model could deploy efficiently.
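The routing idea behind Mixture-of-Experts is simple to state, even though production systems add load balancing and batched expert dispatch. A sketch for a single token, with `experts` as a hypothetical list of small feedforward networks:

```python
import numpy as np

def moe_layer(x, gate_weights, experts, top_k=2):
    """x: (d_model,) activation. gate_weights: (d_model, n_experts). experts: callables."""
    gate_logits = x @ gate_weights

    # Route the token to its top-k experts; every other expert stays idle,
    # which is how MoE models carry huge parameter counts at modest per-token cost.
    top = np.argsort(gate_logits)[::-1][:top_k]
    weights = np.exp(gate_logits[top] - gate_logits[top].max())
    weights /= weights.sum()

    return sum(w * experts[i](x) for w, i in zip(weights, top))
```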
We are not at the end of a story. We are very possibly at the beginning of a different kind of story: one in which the architecture of intelligence becomes itself a design space, not a fixed fact.
— The frontier as open question
What is certain is this: the systems operating today are not magic, and they are not simple. They are the product of specific architectural choices, specific training objectives, and specific alignment procedures, each with known properties, known failure modes, and known limitations. Understanding them at this level of precision is not optional for anyone who intends to build with them, govern them, or live alongside them. The mythology of AI collapses under technical scrutiny. What remains is something simultaneously more modest and more important: a genuinely new kind of information processing, with capabilities no one fully predicted and risks no one fully understands.
That should be the starting point for every serious conversation about what comes next.
What Detroit Exposure Believes About Covering AI
We started this series because the coverage of artificial intelligence has bifurcated into two equally useless camps: breathless celebration and reflexive dismissal. Neither serves the people who are going to live with the consequences. What does serve them is precision, the technical kind, the historical kind, and the moral kind.
The mechanics covered in this article are not background knowledge. They are the argument. Every policy debate, every labor displacement concern, every question about AI consciousness or legal liability, every fear about loss of control: all of it flows downstream from these architectural facts. You cannot evaluate the risk of hallucination without understanding why hallucination is intrinsic. You cannot reason about alignment without knowing what alignment actually is and how it is implemented. You cannot form a meaningful position on AI governance if the only thing you know about the technology is that it “uses a lot of water.”
Subsequent articles in this series will cover memory and agency, the economics of model deployment, the alignment problem in depth, and what an honest accounting of AI risk actually looks like, without the hype, without the panic, and without the comfortable illusion that any of this is simple.