Technology Samuel Adeyemi Feb 25, 2025

ICD-10 Coding Accuracy: Where AI Documentation Tools Fall Short

Not all ambient AI tools code equally. We examine why ICD-10 specificity matters for revenue cycle and what to evaluate when choosing an AI documentation platform.

The ICD-10 Problem Is Not What Most Vendors Discuss

When vendors of ambient AI documentation tools discuss ICD-10 coding, the conversation typically focuses on whether the system can recognize diagnoses from clinical speech and produce a code. That is a low bar. The harder question, and the one that actually determines revenue cycle impact, is whether the system consistently produces the most specific code the clinical record supports across the full range of encounter types a physician sees in a day.

ICD-10-CM contains roughly 72,000 codes. The clinical specificity required for accurate coding is not uniform across that space. Some diagnoses (uncomplicated upper respiratory infections, minor lacerations, well-child visits) map straightforwardly. Others require precise characterization of laterality, type, severity, sequelae, or underlying cause to reach the correct code. A system that produces E11.9 (Type 2 diabetes mellitus without complications) when the clinical note supports E11.22 (Type 2 diabetes mellitus with diabetic chronic kidney disease), reported together with an N18.3- code for stage 3 CKD, has not failed to identify the diagnosis. It has failed to code it at the specificity the clinical encounter supports, and the financial and quality reporting consequences of that failure accumulate encounter by encounter.

Where Specificity Gaps Actually Occur

The coding gaps that matter most tend to cluster in a few clinical areas. Chronic disease management encounters are the most significant, because poorly specified chronic condition codes affect HCC (Hierarchical Condition Category) risk adjustment in Medicare Advantage and other value-based contracts. The ICD-10 codes for conditions like diabetes, chronic kidney disease, heart failure, and COPD have multiple levels of specificity, and ambient AI systems that haven't been purpose-trained on clinical coding conventions tend to produce less specific codes than the encounter supports.

Injury and external cause coding is another problem area. External cause codes (ICD-10-CM Chapter 20, categories V00-Y99, the successors to ICD-9's E-codes) are required by many payers for injury encounters and affect downstream claims adjudication. Ambient AI systems that listen to a physician's description of a traumatic injury often capture the injury diagnosis correctly but omit the external cause code entirely, because the physician didn't explicitly state it during the encounter and the system wasn't designed to infer it from clinical context.

Mental health and behavioral health codes require particular care. The DSM-5 alignment with ICD-10 means that many psychiatric diagnoses require accurate specifier selection — episode type, severity, presence of psychosis, mood congruence — to reach the correct F-chapter code. A system that produces F32.9 (major depressive disorder, single episode, unspecified) when the encounter supports F32.1 (major depressive disorder, single episode, moderate) is not producing wrong output. It is producing insufficient output.

Why Most Tools Underperform on Specificity

The technical reason for specificity gaps is instructive. Most ambient AI systems are built on top of large language models that were trained on general clinical text — medical notes, research papers, clinical guidelines. These models are good at recognizing diagnostic entities and producing plausible ICD-10 codes. They are less reliable at the specificity selection task because that task requires two things the base model doesn't inherently provide: a complete understanding of the ICD-10 coding conventions for that specific condition, and an accurate extraction of all clinically relevant qualifying information from the encounter audio.

The second requirement is where performance degrades most often. An AI system needs to extract, from a 15-minute encounter conversation, that the physician mentioned stage 3 CKD in the context of a diabetes management discussion, and then apply that information to produce E11.22 with the corresponding N18.3- stage code rather than E11.9. This requires the system to maintain a coherent clinical picture of the patient across the entire encounter conversation, not just extract the most salient diagnostic statements. That is a harder natural language understanding (NLU) task than it appears.

We are not saying general-purpose LLM-based coding is inherently unreliable — we are saying that the specificity gap is real, measurable, and not fully addressed by systems that weren't designed specifically for it. The coding performance on high-specificity diagnoses is not visible in the accuracy benchmarks most vendors report, which typically measure whether the primary diagnosis was identified correctly, not whether it was coded to the highest supportable level.
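
To make the distinction concrete, here is a minimal sketch in Python, assuming a hypothetical list of (AI-generated code, coder-reviewed code) pairs. It computes two numbers: how often the three-character category matches, and how often the full code matches. The function names and the sample pairs are illustrative and are not drawn from any vendor's benchmark.

```python
def category(code: str) -> str:
    """Return the three-character ICD-10-CM category, e.g. 'E11' for 'E11.22'."""
    return code.replace(".", "")[:3].upper()


def match_rates(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (category_match_rate, full_code_match_rate) for (ai, coder) pairs."""
    if not pairs:
        raise ValueError("need at least one code pair")
    cat_hits = sum(category(ai) == category(ref) for ai, ref in pairs)
    full_hits = sum(ai.upper() == ref.upper() for ai, ref in pairs)
    return cat_hits / len(pairs), full_hits / len(pairs)


# Illustrative pairs only: (AI-generated code, coder-reviewed code).
sample = [
    ("E11.9", "E11.22"),  # diagnosis found, complication missed
    ("F32.9", "F32.1"),   # diagnosis found, severity missed
    ("J06.9", "J06.9"),   # straightforward match
    ("I10", "I10"),       # straightforward match
]

cat_rate, full_rate = match_rates(sample)
print(f"category match: {cat_rate:.0%}, full-code match: {full_rate:.0%}")
```

On this toy sample every diagnosis is "identified" at the category level, yet only half of the codes agree with the coder. A benchmark that reports only the first number would hide the gap.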

A Revenue Cycle Impact Scenario

Consider what this looks like in practice for a 10-physician internal medicine group (call them Harborview Internal Medicine) in which 30-40% of the average panel has two or more chronic conditions. If the ambient AI system systematically underspecifies chronic condition codes, the group may be leaving meaningful RAF (Risk Adjustment Factor) value on the table for its Medicare Advantage patients. Even a small RAF undercapture per patient per year, multiplied across a panel of 500 Medicare Advantage patients, represents a significant revenue gap. The encounters were documented. The clinical work was done. The specificity was present in the physician's assessment. The AI just didn't capture it.
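
The arithmetic behind that scenario can be sketched as follows. This is a deliberately simplified model that assumes payment scales linearly with RAF; the per-patient undercapture and the base rate below are hypothetical placeholders, not measured or published figures, and how much of the gap reaches the practice depends on its contracts.

```python
def annual_raf_revenue_gap(
    patients: int,
    raf_undercapture_per_patient: float,
    annual_base_rate_per_raf_point: float,
) -> float:
    """Estimate yearly revenue not captured because of under-specified codes.

    Simplified model: payment scales linearly with RAF, so the gap is
    patients * missed RAF per patient * annual payment per 1.0 RAF.
    """
    inputs = (patients, raf_undercapture_per_patient, annual_base_rate_per_raf_point)
    if any(value < 0 for value in inputs):
        raise ValueError("inputs must be non-negative")
    return patients * raf_undercapture_per_patient * annual_base_rate_per_raf_point


# Hypothetical inputs; substitute the values from your own contracts.
gap = annual_raf_revenue_gap(
    patients=500,                            # panel size from the scenario
    raf_undercapture_per_patient=0.1,        # assumed, not a measured figure
    annual_base_rate_per_raf_point=10_000,   # assumed placeholder rate
)
print(f"estimated annual gap: ${gap:,.0f}")
```

The useful part is the shape of the formula, not the output: the gap grows in direct proportion to panel size and to the average RAF missed per patient, which is why small, systematic underspecification matters.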

What the Revenue Cycle Team Should Be Asking

Practice administrators and revenue cycle teams evaluating ambient AI tools should request ICD-10 specificity performance data, not just overall accuracy rates. Specific questions worth asking:

What is the system's performance on HCC-relevant codes specifically, not just overall code accuracy?
Does the system produce external cause codes (V00-Y99) for injury encounters when external cause information is present in the encounter?
How does the system handle DSM-5 specifier selection for behavioral health diagnoses?
Can the system be audited by a coder post-generation, with the encounter audio available for reference?

The last question is particularly important. If a coder reviewing an AI-generated note disagrees with a code assignment, they need to be able to reference the encounter recording to determine whether the physician said something that supports a more specific code. A system that doesn't provide that audit trail makes post-generation review difficult and creates compliance exposure.
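
One way to picture such an audit trail is a record that ties each generated code to the stretch of audio that supports it. The sketch below is an assumption about what that data could look like, not a description of any product; `CodeEvidence`, `unsupported_codes`, and the quoted transcript text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeEvidence:
    """Links one generated code to the stretch of audio that supports it."""
    code: str              # e.g. "E11.22"
    transcript_quote: str  # what the system heard
    audio_start_s: float   # offset into the encounter recording, in seconds
    audio_end_s: float

    def __post_init__(self) -> None:
        if not 0 <= self.audio_start_s < self.audio_end_s:
            raise ValueError("audio span must be non-empty and start at or after 0")


def unsupported_codes(codes: list[str], evidence: list[CodeEvidence]) -> list[str]:
    """Return generated codes that have no linked audio evidence."""
    supported = {item.code for item in evidence}
    return [code for code in codes if code not in supported]


# Illustrative data: one code has a linked audio span, the other does not.
evidence = [
    CodeEvidence("E11.22", "her kidney function is stage three now", 312.0, 318.5),
]
print(unsupported_codes(["E11.22", "I10"], evidence))
```

With a structure like this, a coder who questions a code can jump to the relevant seconds of the recording, and codes with no linked evidence can be routed for closer review.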

The Coding-Documentation Feedback Loop

One underappreciated aspect of AI-assisted coding is the potential for a feedback loop between documentation quality and coding accuracy. When an AI system flags that it couldn't identify complication status for a diabetes diagnosis because the physician didn't explicitly address it in the encounter, that flag becomes a prompt for the physician to add the information during review. Over time, this kind of feedback can improve the quality of the underlying clinical documentation. That happens not because physicians are documenting differently to satisfy the AI, but because the AI surfaces the gaps between clinical knowledge and the documented record.

This is a meaningful difference from traditional coding workflows, where a coder queries the physician weeks after the encounter when specificity questions arise. Closing that loop at review time — while the encounter is still fresh and the physician is actively engaged with the note — produces better documentation and reduces the query backlog that burdens both coders and physicians in high-volume practices.
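
The review-time flag described in this section could be sketched as follows. The lookup table is a hand-built, hypothetical stand-in; a real system would need to derive its prompts from the ICD-10-CM code set and from what was actually said in the encounter.

```python
# Hypothetical lookup: unspecified code -> detail the physician could confirm.
# A real system would derive this from the ICD-10-CM code set, not a small dict.
SPECIFICITY_PROMPTS = {
    "E11.9": "Any diabetic complications (kidney, eye, nerve, circulatory)?",
    "F32.9": "Severity of the depressive episode (mild, moderate, severe)?",
    "I50.9": "Heart failure type (systolic, diastolic, combined) and acuity?",
}


def review_flags(generated_codes: list[str]) -> list[str]:
    """Return one review prompt per generated code that is in the lookup."""
    return [
        f"{code}: {SPECIFICITY_PROMPTS[code]}"
        for code in generated_codes
        if code in SPECIFICITY_PROMPTS
    ]


# Codes not in the lookup (here I10) pass through without a prompt.
for flag in review_flags(["E11.9", "I10", "F32.9"]):
    print(flag)
```

The design choice that matters is timing: the prompt appears while the physician is reviewing the note, so the answer comes from fresh memory instead of a coder query weeks later.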

The Honest Assessment

ICD-10 specificity is an area where ambient AI tools vary significantly in their current performance, and where the gap between marketing claims and measurable performance is widest. "AI-generated ICD-10 codes" means very different things depending on whether the system was purpose-trained on clinical coding conventions or whether it generates codes as a byproduct of general clinical NLU.

Practices evaluating tools in this category should run their own assessments on a representative sample of their encounter types — particularly chronic disease management encounters and any encounter types where their revenue cycle team currently sees high query volumes. The delta between AI-generated codes and coder-reviewed codes on that sample will tell you more about real-world performance than any benchmark the vendor provides.