What Bluebook Citations Reveal About the Limits of AI, or: Why ChatGPT Isn’t the Answer to All Your Bluebooking Woes

February 2026 · BBright Editorial Note

This post is part of BBright’s Editorial Notes series: practical explanations of how the engine resolves hard Bluebook questions, especially where the rules under-specify real-world captions.

There is a persistent intuition — one I shared until fairly recently — that legal citation should be an ideal task for modern “AI.” The Bluebook is, after all, a rulebook. Citations are text. Large language models (LLMs) are extraordinarily good at manipulating text. The fit seems natural.

And yet, anyone who has tried to use an LLM to produce consistently correct Bluebook citations knows the experience is… unsatisfying. Outputs look confident. They are often almost right. And they are wrong in ways that are difficult to predict, difficult to diagnose, and — most troubling — difficult to fix permanently.

That gap between expectation and reality turns out to be instructive. Bluebook citation is not just a case where “AI isn’t quite there yet.” It exposes something deeper about what LLMs are, what they are not, and what kinds of problems they are structurally unsuited to solve.

I. The Reasonable Expectation

At first glance, citation formatting seems like an archetypal AI problem. The inputs are structured language; the outputs are structured language; the rules are written down; and the task is repetitive.

If an LLM can draft a passable essay, summarize a Supreme Court opinion, or generate code in half a dozen languages, why shouldn’t it be able to apply a finite set of citation rules?

This expectation is not naïve; it is reasonable. That is precisely why the failure is so revealing.

II. Bluebooking Is Not a Language Task

The key mistake is assuming that Bluebook citation is primarily about language.

Bluebooking is a process of normalization. Given an input that may be messy, ambiguous, incomplete, or inconsistent, the task is to produce a single, canonical output that complies with a shared editorial standard.

That standard is only partially specified. The Bluebook leaves gaps. It relies on conventions, judgment calls, and community norms. And crucially, once a judgment is made, it must be made the same way every time.

This is much closer to compilation than composition. The goal is not to generate a plausible answer, but to generate the answer.

III. Why LLMs Struggle (By Design)

Large language models are often described as “reasoning engines” or “rule-following machines.” In practice, they are neither. They are probabilistic sequence predictors. This is not a criticism; it is the source of their power. But it comes with consequences.

A Useful Comparison: Chess Engines

One way to see this more clearly is to compare LLMs to something much older and less fashionable: chess engines.

For decades now, computers have dominated human players at chess. Long before the term “AI” became ubiquitous, relatively simple programs — built on search, evaluation functions, and rigid rule enforcement — could outperform even grandmasters. Crucially, those systems never attempted illegal moves. They could be relied upon, at a minimum, to follow the rules of the game.

Given that history, it is natural to expect modern “AI” systems to do at least as well. And yet, they frequently do not. Ask a large language model to play chess, and it will often attempt illegal moves: moving pieces through other pieces, placing a king in check, or inventing positions that cannot exist. These errors are not edge cases. They occur because the model is not enforcing rules against an internal game state at all. It is generating plausible-looking text that resembles chess notation.
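
To make the contrast concrete, here is a minimal sketch using the open-source python-chess library. The particular library is incidental; the point is the shape of the mechanism: every candidate move is validated against an explicit board state, and anything illegal is simply refused.

```python
# Minimal sketch of rule enforcement against explicit state, using the
# python-chess library. The engine-style check never "guesses": a move is
# either legal in the current position or it is rejected outright.
import chess

board = chess.Board()                    # explicit, authoritative game state
candidate = chess.Move.from_uci("e2e5")  # a pawn cannot advance three squares

if board.is_legal(candidate):
    board.push(candidate)                # accepted only because the rules allow it
else:
    print(f"Rejected: {candidate.uci()} is not legal in this position.")
```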

Bluebook citation turns out to have the same structure. It looks like a problem about symbols and rules, but correctness depends on an underlying mechanism that refuses to violate constraints — even when doing so would produce something that looks “reasonable.”

1. Probabilistic Decision-Making

When the Bluebook is silent or ambiguous — and it often is — a human editor will typically resolve the ambiguity once and then apply that resolution consistently.

An LLM does something very different. When faced with ambiguity, it samples from a distribution of plausible continuations. Ask the same question on different days, or in a slightly different context, and you may get different answers — each individually reasonable, but collectively incompatible.

A related but subtler limitation emerges when large language models bring strong statistical priors to bear on tasks that demand categorical correctness. Even when given explicit, carefully worded instructions, an LLM may override those instructions in favor of patterns it has learned from training data. In Bluebook work, this often shows up when a model silently substitutes a “plausible” heuristic for a stated rule: for example, treating a Western personal name with a lowercase particle (“de,” “van,” “von”) as if it belonged to a surname-first language, despite explicit instructions restricting that treatment to a narrow set of East Asian naming conventions. From the model’s perspective, the output looks reasonable; from an editorial perspective, it is simply wrong.

This is not a failure of comprehension so much as a consequence of how LLMs resolve conflicts: learned priors are always in competition with prompt-level instructions, and under ambiguity or instruction saturation, the priors can win. For domains like citation, where certain inferences are not merely unlikely but impermissible, this behavior is not an edge case — it is a structural mismatch.

In citation work, this conflict between statistical plausibility and rule-based instruction manifests as flip-flopping. The same caption produces different formatting decisions across runs. Nothing is “wrong” in isolation, but consistency over time, and therefore correctness, is impossible.
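
A toy simulation makes the flip-flopping visible. The “model” below is nothing more than a weighted random choice between two renderings of a hypothetical caption; the caption and the weights are invented for illustration, but the behavior, different answers across runs with nothing wrong in isolation, is the failure mode at issue.

```python
# Toy simulation of sampling under ambiguity. This is not a real language
# model; it only mimics the relevant behavior: each run draws from a
# distribution of plausible outputs instead of committing to a single answer.
import random

# Two renderings of the same hypothetical caption, each individually defensible.
PLAUSIBLE = [
    "United States ex rel. Smith v. Acme Corp.",
    "U.S. ex rel. Smith v. Acme Corp.",
]

def sampled_citation() -> str:
    return random.choices(PLAUSIBLE, weights=[0.6, 0.4])[0]

if __name__ == "__main__":
    seen = {sampled_citation() for _ in range(20)}
    print(seen)  # usually both variants appear; neither is "wrong" on its own
```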

2. Silent Editorial Choices

Much of Bluebooking consists of decisions that are never explicitly labeled as decisions: quiet judgment calls that an editor makes once and then applies without comment.

Humans absorb these conventions tacitly. They internalize editorial norms. LLMs, by contrast, infer these choices statistically. They may imitate the majority pattern, but they cannot commit to a choice in the way an editorial system must. There is no stable notion of “we do it this way.”
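
For contrast, here is a sketch of what “we do it this way” looks like when it is made explicit. The keys and values below are hypothetical, but the structure is the point: each judgment call is written down once and applied mechanically ever after, rather than being re-inferred from statistical patterns on each encounter.

```python
# Sketch of an explicit editorial-convention record. Entries are hypothetical;
# what matters is that each choice is made once, recorded, and then reused.
HOUSE_DECISIONS = {
    # Where the Bluebook is silent, the editorial call is recorded here,
    # not re-derived citation by citation.
    "western_particle_names": "'de', 'van', and 'von' do not trigger surname-first treatment",
    "united_states_as_party": "abbreviate to 'U.S.' in all captions",
}

def editorial_choice(question: str) -> str:
    return HOUSE_DECISIONS[question]  # the same answer every time it is asked
```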

3. Instruction Saturation

As the number of constraints increases, LLMs begin to triage. They focus on the “big picture” and let smaller requirements slide. This looks very much like human cognitive overload — and it is not accidental. The model is optimizing for coherence and plausibility, not for exhaustive rule compliance.

In Bluebook terms, this means that as instructions pile up, the model becomes less reliable, not more. The outer structure may be right, but the fussy, compulsive details that editors care about are precisely what gets dropped.

IV. This Is Not an Indictment of AI

None of this is an argument that “AI is bad” or that LLMs are failures. On the contrary: LLMs are extraordinary at what they are designed to do. They are powerful assistants for brainstorming, drafting, summarizing, and exploring ideas. They shine in domains where flexibility and approximation are virtues.

But Bluebook citation is not one of those domains. The failure here is not technological immaturity; it is a category error. We are asking a probabilistic language generator to behave like a rule engine with memory and institutional judgment.

V. What Citation Work Actually Requires

To produce reliable Bluebook citations, a system must do several things that LLMs cannot guarantee: resolve each ambiguity once and apply that resolution identically every time; enforce formatting constraints absolutely, refusing outputs that violate them; and record its editorial judgments explicitly so that they persist across documents, sessions, and editors.

These are not language problems. And this is why tools that advertise themselves as “AI-powered” so often disappoint: the buzzword is doing marketing work, not technical work.
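
A short sketch shows what that difference looks like in code. The rule table below is deliberately tiny and its entries are only illustrative, but the two properties at issue are both present: the same input always yields the same output, and an input not covered by an explicit rule produces an error rather than a plausible guess.

```python
# Sketch of deterministic normalization over an explicit rule table.
# The table is illustrative, not a complete statement of any Bluebook rule.
import re

REPORTER_FORMS = {
    "f3d": "F.3d",          # single capitals and numerals closed up
    "fsupp2d": "F. Supp. 2d",
    "us": "U.S.",
}

def normalize_reporter(raw: str) -> str:
    key = re.sub(r"[^a-z0-9]", "", raw.lower())  # ignore spacing, dots, case
    if key not in REPORTER_FORMS:
        # No statistical fallback: an uncovered input is a question for a
        # human editor, not a gap to fill with something that looks right.
        raise ValueError(f"No rule for reporter {raw!r}; refusing to guess.")
    return REPORTER_FORMS[key]

# The same input produces the same output on every run, in every context.
assert normalize_reporter("F. 3d") == normalize_reporter("f.3d") == "F.3d"
```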

VI. Why This Doesn’t Go Away as Models Improve

It is tempting to treat these limitations as temporary — the sort of thing that will disappear as models get larger, training data gets better, and reasoning abilities improve. That intuition makes sense if one thinks of large language models as immature versions of a general problem-solving system.

But the limitations exposed by Bluebook citation are not about intelligence or scale. They are about architecture. LLMs are designed to produce probabilistically appropriate outputs. Determinism is not a goal; variability is a feature. Even when an LLM is “right,” it is right in a way that is contingent, contextual, and sensitive to framing.

Citation engines operate under the opposite constraints. They must be boring. They must be predictable. They must make the same choice today that they made yesterday — and will make tomorrow — even when the underlying rule is ambiguous. They must encode editorial judgment explicitly and then refuse to deviate from it.

No increase in model size changes that tension. A smarter model may explain Bluebook rules more eloquently. It may notice edge cases more often. It may even approximate correct output more frequently. But it will never become a rule-enforcing system with institutional memory, because doing so would require abandoning the very properties that make it useful elsewhere.

VII. A Broader Lesson

Bluebook citation turns out to be a useful stress test for modern AI. It reminds us that not all intellectual labor is probabilistic. Some domains demand consistency above creativity, judgment above plausibility, and stability above fluency.

AI will continue to transform many parts of legal practice. But it will not replace every kind of rigor — and in some corners, the very features that make it impressive become liabilities. That is not a limitation to be embarrassed about. It is simply a fact to be understood.

Future notes may explore specific pathological cases — “ex rel.” captions come to mind — where these limits become especially stark.


Questions or disagreements?

We’re happy to compare approaches and explain why the engine makes a particular editorial choice. Email support@thebluebooker.com.