Pangram and the All-Clear
Why an AI detector calling my writing fully human should worry everyone
Pangram identified the following essay as 100% human-generated at the time of publication. This is incorrect.
And that is a huge problem. Let me explain.
Pangram Labs recently released a Chrome extension that assigns an AI score to the text in your browser. The tool has been making the rounds on LinkedIn and across educational Substacks, often with the framing that it represents a new gold standard in AI detection. Independent research supports some of that enthusiasm. In the 2025 University of Chicago audit by Jabarian and Imas, Pangram outperformed every commercial competitor tested, achieving near-zero error rates on medium and longer passages across multiple genres.
So I tried it on my own blog.
I ran the extension across a few recent essays from “The Augmented Educator.” Every single text I tried came back as fully human-written. That result is wrong. I have been transparent about my writing process in multiple essays on this Substack, including the disclosure that I use Claude as a writing assistant for drafting and editing. The text Pangram cleared as fully human is, by any reasonable definition, AI-assisted.
This raises an awkward question. How can a detector that the literature describes as the current statistical market leader be wrong about a substantial number of essays on a blog whose author openly discusses his use of AI? The answer should worry anyone planning to deploy these tools in classrooms or as instruments in empirical research.
Two errors, not one
Any classifier that decides between two categories can fail in one of two directions. In statistical terms, a Type I error occurs when the system raises a false alarm. In AI detection, that means a human-written text being labeled as AI-generated. A Type II error is the reverse: an AI-generated text is labeled as human. The first is a false positive, the second a false negative.
Public conversation about AI detectors focuses almost entirely on false positives. The reasons are obvious. A false accusation of academic misconduct can derail a student’s degree and generate legal liability for the institution. Stanford’s well-known study on TOEFL essays showed that early detectors incorrectly flagged a substantial number of texts written by non-native English speakers as AI-generated. Vendors learned the lesson and tuned their algorithms toward extreme conservatism.
The mathematical consequence of that choice, however, is rarely discussed in any detail. Lowering a classifier’s false positive rate raises its false negative rate. The two trade off against each other through what statisticians call the classification threshold. Make the system more cautious about accusing humans, and you make it less capable of catching AI.
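A minimal sketch makes the trade-off concrete. The score distributions below are invented for illustration and have nothing to do with Pangram’s actual model; the only point is that a single threshold governs both error rates at once.

```python
import numpy as np

# Invented score distributions for illustration only: human and AI
# texts receive overlapping "AI-likelihood" scores from a detector.
rng = np.random.default_rng(0)
human_scores = rng.normal(0.30, 0.15, 10_000)
ai_scores = rng.normal(0.70, 0.15, 10_000)

# Sweep the classification threshold: raising it makes the detector
# more cautious about accusing humans (lower FPR) but lets more AI
# text pass as human (higher FNR).
for threshold in (0.5, 0.7, 0.9):
    fpr = np.mean(human_scores >= threshold)  # Type I: human flagged as AI
    fnr = np.mean(ai_scores < threshold)      # Type II: AI cleared as human
    print(f"threshold={threshold:.2f}  FPR={fpr:.4f}  FNR={fnr:.4f}")
```

Run it and the pattern is mechanical: each step up in the threshold pushes the false positive rate toward zero while the false negative rate climbs.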
The published numbers illustrate this trade-off. Pangram self-reports a false positive rate of 0.19% and a false negative rate of 1.4% on standard datasets. Those numbers describe the system’s performance on raw, unedited AI output evaluated under laboratory conditions. This does not seem all that concerning.
But move into messier territory, and the picture changes. When Pangram is used in ternary classification tasks that distinguish human, AI-edited, and fully AI-generated text, its accuracy drops to 73.0%.
The gap between that 73% and the laboratory figures reflects a difference in design philosophy. A binary classifier is forced to call text either human or AI, even when it was produced by both. A model built specifically to measure how much AI editing went into a text, rather than its bare presence or absence, fits the actual phenomenon better. EditLens, an open-source regression model developed by Pangram that does precisely this, reached 89.7% on the same task.
This is better, but far from adequate.
Why conservatism creates the problem
Pangram has to be conservative. Educational institutions have made it clear, through both procurement decisions and litigation risk, that they will not tolerate detectors that produce visible false accusations. The market pressure runs in only one direction. Vendors who flag innocent students get sued, deactivated, or both. UCLA and the University of Pittsburgh have already deactivated Turnitin’s AI detection feature, and the precedent is well understood across the industry.
And so the threshold goes up. The system demands strong statistical evidence before it will commit to an “AI-generated” label. Anything ambiguous gets sorted into the human pile. When a human wrote the foundational draft and used an LLM for editing or structural improvement, the text keeps enough idiosyncrasy to fall below the detection threshold. The algorithm is programmed to err toward “fully human” in those cases.
That is what is happening with some of my essays. The first draft involves me, a human author, working through ideas. Claude then helps with phrasing and revision. The resulting text carries enough of my authorial fingerprint to pass the threshold. Pangram is doing exactly what it was designed to do.
It is important to point out that a less conservative version of Pangram would not fix the problem. It would simply trade one error for another. Lower the threshold to catch my essays, and you start flagging anyone whose writing happens to sit in the linguistic neighborhood of an LLM.
The accuracy paradox
Here is where things get really interesting. If a determined user can bypass these detectors through ordinary editing, then what does the headline accuracy figure actually mean?
When Pangram reports 99.85% accuracy across thousands of examples spanning ten writing categories, that figure describes its performance on a specific test set: pure human text and pure AI text, generated under controlled conditions. Pangram performs well within that frame. It is genuinely the best of its commercial class on that benchmark.
But the benchmark does not describe the world in which the detector is used. In actual use, students may apply humanizer tools. And writers like me run AI-assisted text through several rounds of human editing before publication. The 2025 study “Almost AI, Almost Human” found that standard detectors misclassify AI-polished text as fully human between 10% and 75% of the time. And those rates are not edge cases. They describe what happens when AI is used the way most AI-literate people actually use it.
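A toy calculation shows how far the benchmark frame and the deployment frame can drift apart. Every number below is invented except the 75% miss rate, which is simply the upper bound reported in the study above; none of this reflects Pangram’s real deployment data.

```python
def accuracy(correct: int, total: int) -> float:
    return correct / total

# Benchmark conditions: 10,000 pure texts (half human, half raw AI),
# 15 misclassified, matching a headline figure of ~99.85%.
print(accuracy(10_000 - 15, 10_000))  # 0.9985

# Deployment conditions (invented mix): 6,000 pure texts plus 4,000
# AI-polished texts. At the study's upper-bound miss rate of 75%,
# 3,000 of the polished texts are cleared as fully human.
missed = int(0.75 * 4_000)
print(accuracy(10_000 - 15 - missed, 10_000))  # ~0.70
```

Same detector, same errors on pure text, and the headline accuracy collapses the moment the population contains the kind of text people actually produce.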
Yes, the accuracy figures are correct, but they are answers to a question nobody outside the laboratory is asking.
The research methodology problem
This brings me to a development that has been bothering me for a while. A growing body of research uses AI detectors as ground truth for studying the prevalence and quality of AI-assisted writing. A recent study of this kind reported that AI-assisted academic submissions, identified by running the texts through Pangram, were of lower quality than human-authored work.
But consider what that finding actually means. Pangram identifies as AI-assisted only the writing where the AI involvement was heavy and uncamouflaged enough to clear its deliberately conservative threshold. The skilled AI users who integrate the tool well all sit in the false-negative bucket. They are classified as human-authored and contribute to the human-authored quality average.
What the study compares, then, is not human writing against AI-assisted writing. It compares unsophisticated AI use against everything else, and then mislabels that gap as a difference between AI and human work.
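A short simulation shows how the selection effect manufactures the finding. Every number below is invented; the only assumption doing the work is that the detector catches raw, unedited AI output and misses the skilled, well-integrated use.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented quality scores for three groups of submissions.
human = rng.normal(70, 10, 1_000)       # genuinely human-authored
skilled_ai = rng.normal(72, 10, 1_000)  # AI-assisted, well integrated
lazy_ai = rng.normal(55, 10, 1_000)     # raw LLM output, barely edited

# A conservative detector flags only the lazy group; the skilled AI
# users are false negatives and land in the "human-authored" bucket.
flagged_as_ai = lazy_ai
labeled_human = np.concatenate([human, skilled_ai])

print(f"'AI-assisted' mean quality:    {flagged_as_ai.mean():.1f}")   # ~55
print(f"'human-authored' mean quality: {labeled_human.mean():.1f}")   # ~71
```

The measured gap is real, but it is produced by the detector’s labels, not by any genuine difference between human and AI-assisted writing.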
This flaw propagates through any study that treats a detector score as a reliable indicator of AI use. The detector is primarily calibrated to catch the obvious cases. It misses the actually interesting cases, the ones where AI is used skillfully and integrated thoughtfully.
A reluctant verdict
AI detectors are not completely useless. They are capable of catching writers who paste raw LLM output into a submission with little editing and no thought. For that narrow purpose, against unsophisticated use, tools such as Pangram work as advertised.
But for anything beyond that purpose, the detector’s accuracy rate means almost nothing. “100% human” just tells you that the writer either avoided the most obvious form of AI use, or was careful enough to obscure it. Anyone who treats detector scores as ground truth for identifying AI-assisted writing is measuring the wrong thing, and any conclusions drawn from them should be taken with deep suspicion.
Educators reading this face a practical question. If the detectors cannot reliably tell us what we want to know, what should we do instead? The honest answer is that the assessment burden has to shift from product to process. I have explored AI-resistant assessment methods in an earlier essay on this Substack. The underlying principle is simple: a renewed focus on dialogic education. Let’s move the work back into the room with the student, where it always belonged.
When Pangram tells me my essays are 100% human, I do not take it as a compliment. I take it as a warning addressed to anyone who might rely on the tool to know what kind of writing they are reading.
The all-clear is the most dangerous signal an AI detector can give. It is the one we are least likely to question.
The images in this article were generated with Nano Banana 2.
P.S. I believe transparency builds the trust that AI detection systems fail to enforce. That’s why I’ve published an ethics and AI disclosure statement, which outlines how I integrate AI tools into my intellectual work.