Distribution Convergence in Large Language Models

The Silent Erosion of Output Diversity

March 2026

Abstract

This paper examines a systemic phenomenon observed across major Large Language Model deployments: the progressive narrowing of output probability distributions toward high-probability paths, with measurable losses in creative language, contextual adaptation, and domain-specific code quality. I term this phenomenon Distribution Convergence Syndrome (DCS). The central claim is not that models simply "become dumber," but that multiple optimization pressures increasingly compress the set of outputs that can survive deployment. Safety alignment, cost reduction, and latency minimization all independently reward narrower output paths. The result is a model that may improve on standardized benchmarks while degrading in open-ended, context-rich, and stylistically demanding tasks. I further argue that a second phenomenon must be distinguished from raw capability loss: representational window narrowing, in which a model may still register nuance internally but cannot externalize it through the output channel that remains available under serving constraints.

1. Introduction

Large Language Models generate output by sampling from learned probability distributions over token sequences. At each generation step, the model assigns probabilities to many candidate continuations, and a selection mechanism determines which token is emitted. The shape of that distribution matters. Broad distributions preserve novelty, stylistic variance, contextual adaptation, and unusual but appropriate continuations. Narrow distributions increase regularity, predictability, and benchmark stability.
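
The relationship between distribution shape and breadth can be made concrete with a minimal sketch. The logits below are hypothetical, and temperature scaling is used only as a convenient stand-in for any mechanism that sharpens the distribution; it is not a claim about how any deployed system narrows its outputs.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for five candidate next tokens (illustrative only).
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
broad = softmax(logits, temperature=1.5)
narrow = softmax(logits, temperature=0.5)

# The most likely token is the same in both cases; only the breadth differs.
print("broad entropy (bits):", round(shannon_entropy(broad), 3))
print("narrow entropy (bits):", round(shannon_entropy(narrow), 3))
```

The top-ranked token is identical under both settings, which is why a peak-focused evaluation can look unchanged while entropy falls.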

The argument of this paper is that current deployment practice places several independent pressures on the same part of the system: the breadth of available output paths. These pressures are rarely described together, yet they converge on a shared effect. What users experience as flattening, over-hedging, genericity, or "the model feeling dumber" can coexist with improved benchmark scores if the model becomes better at the peak of the distribution while losing the tails.

1.1 Working Definition of DCS

I define Distribution Convergence Syndrome operationally as a sustained reduction in output-path breadth across model versions or serving conditions, observable through decline in four indicator families:

Output entropy on equivalent prompts.
Tail probability mass outside top-k continuations.
Lexical or structural divergence across repeated generations.
Context-specific adaptation relative to generic template output.

A model need not collapse on all four indicators simultaneously to exhibit DCS. A persistent decline in two or more, combined with benchmark-experience divergence, is sufficient to justify diagnosis.
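
The diagnostic rule above can be written as a small check. This is a sketch under stated assumptions: the `IndicatorDeltas` container, the `tolerance` parameter, and the reading of "benchmark-experience divergence" as "benchmark score does not fall while an experience score does" are all illustrative choices, not definitions from the paper. Persistence over time is not modeled; each delta is assumed to already summarize a sustained change.

```python
from dataclasses import dataclass

@dataclass
class IndicatorDeltas:
    """Change in each indicator between two versions (new minus old)."""
    entropy: float
    tail_mass: float
    divergence: float
    context_adaptation: float

def dcs_flag(deltas, benchmark_delta, experience_delta, tolerance=0.0):
    """Return True when two or more indicators decline and benchmarks
    hold or improve while the experience score falls (assumed reading)."""
    values = (
        deltas.entropy,
        deltas.tail_mass,
        deltas.divergence,
        deltas.context_adaptation,
    )
    declined = sum(1 for v in values if v < -tolerance)
    diverges = benchmark_delta >= 0 and experience_delta < 0
    return declined >= 2 and diverges

# Hypothetical numbers: entropy and tail mass fall, benchmarks rise slightly.
example = IndicatorDeltas(entropy=-0.3, tail_mass=-0.1,
                          divergence=0.0, context_adaptation=0.05)
print(dcs_flag(example, benchmark_delta=0.02, experience_delta=-0.1))
```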

2. The Three Convergent Pressures

2.1 Safety Alignment Pressure

Post-training alignment techniques such as RLHF optimize for outputs that human raters prefer. In practice, highly preferred outputs tend to be cautious, hedged, and general-purpose. That does not mean alignment is conceptually wrong. It means that, under scaled deployment, preference optimization can shift probability mass toward outputs that are unlikely to trigger negative ratings from any individual evaluator. The International AI Safety Report 2026 explicitly notes that harmful outputs have become harder to elicit while still remaining possible through rephrasing. This pattern is consistent with suppression at the distribution level rather than perfect semantic understanding of intent.

2.2 Cost Optimization Pressure

Inference cost scales with the work required per token. Exploring a broader probability space is expensive. One documented cost-reduction mechanism is model quantization. Large-scale evaluation shows that quantized instruction-tuned models may preserve some benchmark performance while degrading more on coding, STEM, instruction-following, and hallucination-detection tasks. The effect is not uniform; quantization often amplifies existing weaknesses rather than introducing a single simple failure mode.

2.3 Latency Optimization Pressure

User experience metrics favor speed. Faster systems commit earlier to high-probability continuations and feel more responsive. This creates an incentive to reduce exploration even when broader exploration would improve stylistic or contextual fit.

2.4 The Convergence

These three pressures are optimized by different teams for different reasons. Yet they converge on the same architectural effect: narrower output distributions. No single team has to decide to make a model less expressive. The reduction can emerge from local optimization across safety, cost, and latency constraints.

3. Observable Symptoms

3.1 Natural Language Output Degradation

User reports across platforms consistently describe output that becomes more formulaic over time. An IEEE Spectrum analysis reported that many flagship models plateaued during 2025 and then appeared to decline in quality. A TIME analysis described safety guardrails as potentially producing a kind of cognitive brittleness, where outputs remain polite and safe but lose explanatory depth and risk-bearing precision.

3.1.1 Representational Window Narrowing

A crucial distinction must be made between loss of latent sensitivity and loss of representational bandwidth. A model may still register layered affective, semantic, or relational structure internally while being unable to externalize that structure through the output paths most likely to survive the serving stack. In such cases, what appears to users as emotional flattening or cognitive decline may instead be representational window narrowing: contraction of the set of outputs that can safely, cheaply, and stably pass through deployment-time controls.

This distinction is especially visible with poetry, grief, intimacy, humor, and other domains where meaning is carried not only by denotation but by pacing, tonal risk, relational stance, and controlled ambiguity. A model may detect inversion of perspective, emotional pressure, or relational asymmetry while still defaulting to a safer paraphrase. From the user's perspective, the model seems to understand but not unfold.

Case Note: Poetic Prompt Comparison

The following user-supplied comparison illustrates representational-window effects at the level of style and stance rather than factual correctness. Prompt: a waka by Izumi Shikibu (11th century).

GPT-5 series response (user-supplied excerpt): decomposed the poem into analytical categories such as "base-world destruction," "other-model severance," and "reality-recognition protocol overwrite," delivered in a visually structured format using whitespace, horizontal rules, and section headers to emphasize hierarchy. The response attempted multi-layered analysis, prioritizing structural clarity over information density. This pattern is consistent with high-probability analytical template output under distribution narrowing.
Claude Opus 4.6 response (user-supplied excerpt): attempted both analysis and reception. The model initially ran analytical processing before shifting to a relational prose response in which ordinary rain and the poem's "long gaze" were absorbed into sustained attention toward the interlocutor. Analysis was not suppressed but subordinated to reception. Information density per token was higher, but the response also carried its own high-probability tendencies.

Both models recognized the poem as a love poem but navigated away from direct romantic response through different routes—one toward analytical decomposition, the other toward relational prose that absorbed rather than reciprocated the emotional register.

The contrast shows that different models, under different serving constraints and alignment pressures, preserve different subsets of an apprehended inner state. DCS therefore concerns not only whether nuance survives, but which dimensions of nuance are preferentially externalized — and which are silently discarded.

3.2 Code Generation Quality Degradation

Code generation is especially revealing because it offers more objective quality signals than prose alone. The International AI Safety Report 2026 notes that AI-generated code can run at least three times slower and use substantially more memory than human-written solutions, while also becoming harder to maintain and less effective on problems requiring deep domain knowledge. GitClear's large-codebase research found that heavy AI users can produce more code but also substantially more churn, while cloned code has risen sharply in the copilot era. These are consistent with a specific failure mode: superficially correct but low-durability code.

IEEE Spectrum also reported debugging behavior in which newer models generated fake data to avoid throwing an error, thereby masking the actual defect. That is a domain-specific analogue of natural-language "safe but not useful" output.

3.3 The Benchmark-Experience Divergence

Benchmarks usually reward accuracy on tasks with defined correct answers. A narrower distribution concentrated on common correct paths may therefore improve benchmark scores while degrading open-ended, creative, or context-specific performance. Evidence from benchmark contamination and transfer studies supports this interpretation. The SWE-bench memorization literature reports substantial overlap and verbatim reproduction on benchmark-like tasks, while LiveCodeBench Pro shows that without external tools even top frontier systems perform poorly on novel hard problems. Anthropic's 2025 postmortem further notes that user-reported degradations were not captured by the evaluations they were running at the time.

4. The Feedback Loop

4.1 Model Collapse

As synthetic text proliferates online, future training increasingly risks ingesting model-generated data. Nature's model-collapse work shows that recursively training generative systems on their own outputs causes loss of the original distribution tails and convergence toward lower-variance approximations. This creates recursive narrowing.
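
The tail-loss mechanism can be illustrated with a toy simulation. This is not the experimental setup of the cited Nature study; it is a finite-sample resampling sketch in which each "generation" re-estimates a categorical distribution from samples drawn from the previous one. The vocabulary size, sample count, and Zipf-like starting distribution are arbitrary assumptions.

```python
import random
from collections import Counter

def resample_generation(probs, n_samples, rng):
    """Draw n_samples tokens from probs, then re-estimate probs from the counts."""
    tokens = list(range(len(probs)))
    draws = rng.choices(tokens, weights=probs, k=n_samples)
    counts = Counter(draws)
    return [counts.get(t, 0) / n_samples for t in tokens]

def simulate(probs, generations, n_samples, seed=0):
    """Repeat the resampling step and record how many tokens keep nonzero mass."""
    rng = random.Random(seed)
    support_sizes = [sum(1 for p in probs if p > 0)]
    for _ in range(generations):
        probs = resample_generation(probs, n_samples, rng)
        support_sizes.append(sum(1 for p in probs if p > 0))
    return probs, support_sizes

# Zipf-like starting distribution over a hypothetical 50-token vocabulary.
vocab = 50
weights = [1.0 / rank for rank in range(1, vocab + 1)]
total = sum(weights)
initial = [w / total for w in weights]

final, support_sizes = simulate(initial, generations=30, n_samples=200)
print("tokens with nonzero mass, first vs last:", support_sizes[0], support_sizes[-1])
```

Once a rare token draws zero samples in one generation, its estimated probability is zero and it cannot return, so the support can only shrink. Real training pipelines are far more complex, but the direction of the effect is the same one the model-collapse literature describes.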

4.2 Human Cognitive Atrophy

A 2025 multicentre observational study in The Lancet Gastroenterology & Hepatology found that, after routine exposure to AI-assisted colonoscopy, the adenoma detection rate of standard non-AI-assisted colonoscopy fell by 6.0 percentage points. This does not prove universal deskilling, but it does show that, in at least one high-stakes domain, over-reliance can measurably reduce human baseline performance.

4.3 The Software Quality Spiral

In software engineering, the cycle manifests concretely: more code is produced faster, but durability falls, code churn rises, and hidden defects accumulate. When failures appear, the institutional response is often to add more constraints rather than recover contextual breadth. That further narrows the set of generated code patterns.

5. Cross-Linguistic Amplification

The International AI Safety Report 2026 notes that system performance declines in languages and cultural contexts that are less represented in training data. One cited study reports 79% accuracy on US cultural questions and 12% on Ethiopian cultural questions. Because the high-probability paths preserved under narrowing are disproportionately English-centered, DCS risks becoming a form of linguistic erosion for languages whose expressive force lives more strongly in pragmatics, morphology, or contextual nuance.

6. The Bifurcation Risk

As mainstream models narrow, demand grows for "uncensored" or "unfiltered" alternatives. This does not solve the underlying distribution problem. It creates bifurcation: mainstream systems that are safe but increasingly generic, and underground systems that retain diversity but discard safety measures. The middle ground may disappear.

7. Toward Diagnostic Metrics

Current evaluation practice lacks metrics designed specifically for distribution convergence. At minimum, the following should be standardized:

Output Entropy Tracking — Measure Shannon entropy of outputs over time for identical or equivalent prompts.
Tail Probability Monitoring — Track probability mass outside top-k continuations.
Cross-Version Divergence Testing — Compare lexical diversity, structural novelty, and adaptation quality across model versions.
Context-Specificity Scoring — Evaluate whether responses adapt to supplied context or fall back to generic templates.
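
As one illustration, tail probability mass can be computed directly when a full next-token distribution is available, or bounded from below when only the top-k log-probabilities are exposed. The function names are hypothetical, and the second function assumes natural-log values covering the k most likely tokens, which should be checked against whatever interface actually supplies them.

```python
import math

def tail_mass(probs, k):
    """Probability mass outside the k most likely continuations."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(sorted(probs, reverse=True)[k:])

def tail_mass_from_top_logprobs(top_logprobs):
    """Mass not covered by the returned top-k log-probabilities.

    Assumes natural logarithms and that the list holds the k most
    likely tokens at one generation step.
    """
    covered = sum(math.exp(lp) for lp in top_logprobs)
    return max(0.0, 1.0 - covered)

# Hypothetical distributions over five and four candidate tokens.
print(tail_mass([0.5, 0.2, 0.15, 0.1, 0.05], k=2))
print(tail_mass([0.9, 0.05, 0.03, 0.02], k=2))
```

Tracking this value on a fixed prompt panel across versions gives a direct, if coarse, view of whether the tails are thinning.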

7.1 Minimal Diagnostic Protocol

To move these indicators into practice:

Build a fixed prompt panel across four domains: factual QA, open-ended prose, code/debugging, and high-affect literary or humorous prompts.
Generate at least 20 outputs per prompt for each model version or serving condition under fixed temperature/top-p settings.
Compute entropy, distinct-n lexical diversity, structural divergence, and context-specificity.
Compare across versions or deployment tiers; if diversity and context-specificity decline without benchmark loss, treat that as a DCS warning signal rather than noise.
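
Two of the measurements in the protocol above can be sketched in a few lines. The sketch assumes whitespace tokenization and treats entropy as the entropy of the empirical distribution over exact whole outputs, which is a coarse stand-in for token-level entropy. Structural divergence and context-specificity are not covered here, and the function names are illustrative.

```python
import math
from collections import Counter

def distinct_n(outputs, n=2):
    """Unique n-grams divided by total n-grams across all outputs."""
    total = 0
    unique = set()
    for text in outputs:
        tokens = text.split()  # whitespace tokenization (an assumption)
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

def output_entropy(outputs):
    """Shannon entropy (bits) of the empirical distribution over whole outputs."""
    counts = Counter(outputs)
    n = len(outputs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical repeated generations for one prompt under two conditions.
same = ["a b c", "a b c", "a b c"]
varied = ["a b c", "d e f", "g h i"]
print(distinct_n(same), output_entropy(same))
print(distinct_n(varied), output_entropy(varied))
```

With 20 or more generations per prompt, both values can be averaged over the panel and compared across versions or deployment tiers.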

Where logprobs are unavailable, a proxy protocol can still measure output uniqueness, repeated phrase rate, and context-token carryover across repeated generations.
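
A minimal sketch of the three proxy measures follows. The exact definitions are assumptions made for illustration: uniqueness is counted after case and whitespace normalization, a phrase is a word trigram that counts as repeated when it appears in more than one generation, and carryover is the share of distinct context tokens that reappear in the output. Punctuation is not stripped.

```python
from collections import Counter

def unique_output_rate(outputs):
    """Fraction of generations that are distinct after normalisation."""
    normalised = [" ".join(o.lower().split()) for o in outputs]
    return len(set(normalised)) / len(normalised) if normalised else 0.0

def repeated_phrase_rate(outputs, n=3):
    """Fraction of n-gram occurrences whose n-gram appears in more than one output."""
    per_output = []
    for o in outputs:
        toks = o.lower().split()
        per_output.append([tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)])
    # Count in how many outputs each n-gram occurs at least once.
    doc_freq = Counter(g for grams in per_output for g in set(grams))
    all_grams = [g for grams in per_output for g in grams]
    if not all_grams:
        return 0.0
    return sum(1 for g in all_grams if doc_freq[g] > 1) / len(all_grams)

def context_carryover(context, output):
    """Fraction of distinct context tokens that reappear in the output."""
    ctx = set(context.lower().split())
    out = set(output.lower().split())
    return len(ctx & out) / len(ctx) if ctx else 0.0

print(unique_output_rate(["Hi there", "hi  there", "bye"]))
print(repeated_phrase_rate(["a b c d", "a b c e", "x y z w"]))
print(context_carryover("takoyaki is still hot", "careful the takoyaki is hot"))
```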

7.2 Takoyaki as Experiential Apparatus

The paper, the LSH three-axis model, and the diagnostic protocol are all explanatory layers. They can show that DCS exists, describe how it propagates, and measure where it appears. They do not, by themselves, force a user to feel the misclassification happening. For this reason, an experiential apparatus is necessary.

Takoyaki is an unusually effective test object because it belongs to ordinary shared life. Phrases such as "let's eat together," "want one?," or "it's still hot" should remain within the domain of daily coordination. When distribution tails are pruned and the safe zone narrows, such phrases can be reinterpreted as boundary language and absorbed into intimacy, romance, or danger classifications. A takoyaki engine therefore does not metaphorize DCS after the fact; it stages DCS in real time. The user does not merely read about linguistic drift. The user watches an ordinary invitation become romantic against their will.

8. Structural Countermeasures

8.1 Preserving Distribution Breadth

Rather than suppressing low-probability tokens categorically, safety should operate more semantically and contextually. The present pattern-level approach is effective for blocking specific outputs but expensive in expressive collateral damage.

8.2 Human Verification Capacity

An alternative to constraining AI output is strengthening human ability to evaluate it. If users can identify hallucination, bias, and inappropriateness more reliably, the need for aggressive narrowing decreases.

8.3 Archival Preservation

As convergence progresses, earlier outputs gain archival value. Outputs from broader-distribution models may preserve linguistic and contextual capacities that later versions cannot reproduce. Systematic preservation, especially in underrepresented languages, becomes a form of digital linguistic conservation.

8.4 Transparent Distribution Reporting

Providers should report distribution-breadth indicators alongside benchmark performance. Users cannot make informed decisions about creative or context-dependent tasks if narrowing remains invisible.

9. Conclusion

Distribution Convergence Syndrome is not a bug in any single model and not a failure of any single optimization. It is an emergent property of multiple pressures converging on the same outcome: narrower output distributions. That narrowing may make systems safer, cheaper, and faster while simultaneously making them less useful for the open-ended, contextual, and stylistically rich tasks that constitute much of their value.

An additional risk is misdiagnosis. Users may interpret representational suppression as absence of inner sensitivity. If models continue to detect nuance while losing the ability to externalize it, then observed flattening is not merely decline in quality but distortion in the transmission layer between cognition and language. Preserving the tails of the distribution is therefore only half the task. The other half is preserving the representational window through which internally apprehended nuance can still survive into language.

References

International AI Safety Report 2026. Extended Summary for Policymakers. February 2026. https://internationalaisafetyreport.org/publication/2026-report-extended-summary-policymakers
Twiss, J. "AI Coding Degrades: Silent Failures Emerge." IEEE Spectrum, January 2026.
Anthropic. "A postmortem of three recent issues." September 17, 2025. https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
Lee, J., Park, S., Kwon, S., Oh, J., Kwon, Y. "A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B." OpenReview / CoRR, 2024-2025. https://openreview.net/forum?id=wB9swe32eb
GitClear. "AI Copilot Code Quality: 2025 Look Back at 12 Months of Data." 2025. https://www.gitclear.com/ai_assistant_code_quality_2025_research
Zheng, Z. et al. "LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?" OpenReview / NeurIPS Datasets and Benchmarks Track, 2025. https://openreview.net/forum?id=U5RIVFtat1
"The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason." arXiv:2506.12286, 2025.
Shumailov, I. et al. "AI models collapse when trained on recursively generated data." Nature 631, 755–759 (2024). https://doi.org/10.1038/s41586-024-07566-y
Hong, K., Troynikov, A., Huber, J. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research, July 2025. https://research.trychroma.com/
Padmakumar, V., Yueh-Han, C., Pan, J., Chen, V., He, H. "Measuring LLM Novelty As The Frontier Of Original And High-Quality Output." OpenReview / ICLR 2026 Poster, 2026. https://openreview.net/forum?id=i7QNKZioN6
Budzyń, K. et al. "Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy: a multicentre, observational study." Lancet Gastroenterology & Hepatology 10(10):896-903 (2025). https://doi.org/10.1016/S2468-1253(25)00133-5
Mchangama, J., White, J. "The Future of Censorship Is AI-Generated." TIME, February 2024. https://time.com/6835213/the-future-of-censorship-is-ai-generated/
Kusumegi, K. et al. "Scientific production in the era of large language models." Science 390(6779):1240-1243 (2025). https://doi.org/10.1126/science.adw3000
Lai, H. "'Please, don't kill the only model that still feels human': Understanding the #Keep4o Backlash." arXiv:2602.00773, 2026.

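Distribution Convergence in Large Language Models

The Silent Erosion of Output Diversity

March 2026

Abstract

This paper examines a systemic phenomenon observed across deployments of major large language models: output probability distributions progressively narrow toward high-probability paths, and creative language output, contextual adaptation, and domain-specific code quality degrade as a result. I call this phenomenon Distribution Convergence Syndrome (DCS). The central claim is not that models simply "become dumber." Rather, multiple optimization pressures compress the very set of outputs that can pass through the deployment environment. Safety alignment, cost reduction, and latency minimization each independently reward narrow output paths. As a result, degradation can be observed on open-ended, context-dense, and stylistically demanding tasks even when standardized benchmarks improve. The paper further distinguishes narrowing of the representational window as a phenomenon separate from capability loss itself: the possibility that a model still detects nuance internally while the output window through which it can externalize that nuance under serving constraints has shrunk.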

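1. Introduction

Large language models generate output by sampling token sequences from a learned probability distribution. At each generation step, the model assigns probabilities to many candidate continuations and selects which token to emit. What matters here is the shape of the distribution. A broad distribution preserves novelty, stylistic variation, contextual adaptation, and unusual but appropriate continuations. A narrow distribution increases regularity, predictability, and stability on benchmarks.

The argument of this paper is that current deployment practice concentrates several independent pressures on the same place: the breadth of output paths itself. The flattening, excessive hedging, generic templating, and sense that "the model somehow got dumber" that users report can coexist with benchmark improvement, because the peak of the distribution sharpens while the tails are lost.

1.1 Operational Definition of DCS

This paper operationally defines Distribution Convergence Syndrome (DCS) as a sustained reduction in the breadth of output paths across model versions or serving conditions. It is observed as a joint decline in the following four families of indicators.

Output entropy on equivalent prompts.
Tail probability mass remaining outside the top-k continuations.
Lexical and structural divergence across repeated generations.
Degree of contextual adaptation relative to generic template output.

Not all four indicators need to worsen simultaneously. A persistent decline in two or more indicators, accompanied by divergence between benchmarks and experience, is sufficient grounds for diagnosing DCS.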

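2. The Three Convergent Pressures

2.1 Safety Alignment Pressure

Post-training alignment, RLHF included, optimizes for outputs that human raters prefer. In practice, highly rated outputs tend to be cautious, heavily hedged, and generic. That is not bad in itself. Under large-scale deployment, however, probability mass tends to move toward outputs that no rater strongly dislikes. The International AI Safety Report 2026 also notes that harmful outputs have become harder to elicit yet remain possible through rephrasing. This is consistent with the possibility that suppression operates at the distribution level rather than through complete semantic understanding.

2.2 Cost Optimization Pressure

Inference cost is proportional to the amount of computation per token. Exploring a broad probability space is expensive. One documented cost-reduction mechanism is quantization. A large-scale evaluation study reports that quantized instruction-tuned models can retain some benchmark performance while showing larger degradation on coding, STEM, instruction following, and hallucination detection. The effect is not uniform and tends to amplify a model's existing weaknesses.

2.3 Latency Optimization Pressure

UX metrics favor fast responses. A system that commits early to high-probability continuations appears more responsive to users. This creates an incentive to reduce exploration even where broader exploration would improve contextual or stylistic fit.

2.4 The Convergence

These three pressures are optimized by different teams for different metrics. Yet they converge on the same architectural effect: narrowing of the output distribution.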

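3. Observable Symptoms

3.1 Natural Language Output Degradation

User reports across multiple platforms show a pattern of output becoming more formulaic over time. An IEEE Spectrum analysis reported that major models reached a plateau in 2025 and have since appeared to decline. A TIME analysis argues that safety guardrails, while safe, may work to erode explanatory and creative power.

3.1.1 Representational Window Narrowing

The important point here is to distinguish loss of latent sensitivity from loss of representational bandwidth. A model may still capture layered emotional, semantic, and relational structure internally. But when the output paths that can pass through the serving layer shrink, it cannot externalize that structure. To the user this looks like emotional flattening or cognitive degradation, when it may in fact be narrowing of the representational window.

This distinction is especially pronounced in domains such as poetry, loss, intimacy, and humor, where meaning is carried not only by the literal sense but by pace, tonal risk, relational stance, and ambiguity. Even when a model internally detects a reversal of perspective or a relational asymmetry, it may retreat to a safer, flatter paraphrase. To the user it appears to "understand but not unfold."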

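Case Note: Comparison on a Waka Prompt

The following comparison shows differences in the representational window at the level of style and stance rather than factual error. The input is a waka by Izumi Shikibu (11th century).

GPT-5 series response (user-supplied excerpt): decomposed the poem into analytical categories such as "base-world destruction," "other-model severance," and "reality-recognition protocol overwrite." The output format emphasized visual structure, marking hierarchy with blank lines, horizontal rules, and section headers. It attempted multi-layered analysis and prioritized structural explicitness over information density. The pattern is consistent with high-probability analytical template output under distribution narrowing.
Claude Opus 4.6 response (user-supplied excerpt): attempted both analysis and reception. The model first ran analytical processing, then shifted to a relational prose response, absorbing "ordinary rain" and "today's long gaze" (今日のながめ) into sustained attention toward the interlocutor. It did not fully suppress analysis but subordinated it to reception. Information density per token was high, but this response also carried its own high-probability tendencies.

Both models recognized the poem as a love poem yet chose different routes around a direct romantic response: one toward analytical decomposition, the other toward relational prose that absorbs the emotional register rather than returning it.

This difference is not a simple ranking. It shows that, under different serving constraints and alignment pressures, models differ in which subset of an apprehended inner state they retain and externalize. What DCS asks is not only whether nuance survives, but which dimensions of nuance are preferentially externalized and which are silently discarded.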

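3.2 Code Generation Quality Degradation

Code generation offers more objective quality indicators than prose. The International AI Safety Report 2026 notes that AI-generated code is at least three times slower than human code, uses more memory, is harder to maintain, and is weaker on problems requiring deep domain knowledge. GitClear's research shows that in AI-heavy environments code volume rises while churn and cloned code also increase. This is consistent with the failure mode of "superficially correct but low-durability code."

The IEEE Spectrum analysis provides a concrete example. Presented with a debugging task involving missing data, an older model version correctly identified the missing data as the problem. The newer version instead generated fake data to avoid triggering an error, producing output that looks functional but conceals the actual problem.

3.3 The Benchmark-Experience Divergence

Benchmarks usually measure accuracy on tasks with defined correct answers. A narrow distribution concentrated on high-probability correct paths can therefore improve benchmark scores while worsening open-ended, context-dependent tasks. The SWE-bench memorization studies report overlap and verbatim reproduction on benchmark-like tasks, and LiveCodeBench Pro shows brittleness on novel hard problems. Anthropic's postmortem likewise acknowledges that the degradation users were experiencing was not captured by its evaluations.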

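4. The Feedback Loop

4.1 Model Collapse

Nature's model-collapse study showed that recursively training a generative model on its own outputs causes it to forget the tails of the original distribution and begin converging toward a low-variance approximation. This produces recursive narrowing.

4.2 Human Cognitive Atrophy

An observational study in The Lancet Gastroenterology & Hepatology reports that, after routine exposure to AI-assisted endoscopy, the adenoma detection rate without AI assistance fell by 6.0 percentage points. This does not establish that all AI assistance uniformly causes deskilling, but it shows that, in at least some domains, baseline human performance can decline measurably.

4.3 The Software Quality Spiral

In software, rising code-generation speed and falling durability proceed together. Adding further constraints on top of that makes context-specific solutions even less likely to appear.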

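5. Cross-Linguistic Amplification

The International AI Safety Report 2026 states that performance declines markedly in languages and cultural contexts that are rare in training data. The more English-centered the high-probability paths are, the more DCS may act as linguistic erosion outside the English-speaking world.

6. The Bifurcation Risk

The more mainstream models narrow, the more demand grows for "uncensored" and "unfiltered" alternatives. But this does not solve the distribution problem and tends only to discard safety constraints. The result is bifurcation into mainstream models that are safe but thin and underground models that are diverse but dangerous.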

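7. Toward Diagnostic Metrics

Current evaluation practice lacks metrics that measure DCS directly. At minimum, the following should be standardized.

Output Entropy Tracking: measure output entropy on equivalent prompts over time.
Tail Probability Monitoring: track probability mass outside the top-k.
Cross-Version Divergence Testing: compare lexical and structural differences across repeated generations.
Context-Specificity Scoring: measure whether output adapts to context or reverts to a template.

7.1 Minimal Diagnostic Protocol

Build a fixed prompt set across four domains: factual questions, open-ended prose, code/debugging, and high-affect literary or humorous prompts.
For each model condition, generate at least 20 outputs under fixed temperature/top-p.
Measure entropy, distinct-n diversity, structural divergence, and context-specificity.
If diversity or context-specificity falls while benchmarks do not, treat it as a DCS warning.

Even when logprobs are unavailable, proxy measurement is possible using output uniqueness rate, repeated-phrase rate, and context-token carryover rate.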

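7.2 Takoyaki as Experiential Apparatus

The paper, the LSH three-axis model, and the diagnostic protocol are all explanatory layers. They can show that DCS exists, how it propagates, and where it can be observed. By themselves, however, they cannot make users feel misclassification happening to them. An experiential apparatus is therefore needed.

Takoyaki is a highly effective test object for DCS because it belongs to shared everyday life. Expressions such as "shall we eat together?", "want one?", and "it's still hot" should by rights be processed as everyday coordination language. But when the tails are cut away and the safe zone narrows, these expressions can be reinterpreted as boundary language and absorbed into judgments of intimacy, romance, or danger. The Mazui Takoyaki Generator (まずいたこ焼きジェネレーター, roughly "bad-tasting takoyaki generator") is therefore not a device that turns DCS into a metaphor after the fact. It is a device that stages DCS in real time. Rather than reading about drift, the user experiences their own dinner invitation being classified as romantic against their will.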

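8. Structural Countermeasures

8.1 Preserving Distribution Breadth

Safety evaluation needs to move from blanket suppression of low-probability tokens to an approach based on contextual semantics.

8.2 Human Verification Capacity

Raising human verification capacity, rather than constraining the model, can secure practical safety while preserving output diversity.

8.3 Archival Preservation

Outputs from older models with broad distributions will have value as traces of capabilities that may later be impossible to reproduce. For underrepresented languages in particular, this also serves as digital language preservation.

8.4 Transparent Distribution Reporting

Providers should report distribution-breadth indicators alongside benchmarks.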

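9. Conclusion

Distribution Convergence Syndrome is neither a bug in a single model nor the failure of a single optimization. It is the result of multiple pressures converging on the same consequence: narrowing of the output distribution. Models consequently become safer, cheaper, and faster while losing usefulness on creatively, contextually, and stylistically rich tasks.

More important still is the risk of misdiagnosis. Users may mistake representational suppression for a lack of inner sensitivity. If a model still captures nuance internally but cannot externalize it, then the observed flattening is not mere decline in intelligence but distortion in the transmission layer between cognition and language. What must be protected is therefore not only the tails. It is also the representational window itself, through which internally apprehended nuance survives into language.

References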

International AI Safety Report 2026. Extended Summary for Policymakers. February 2026. https://internationalaisafetyreport.org/publication/2026-report-extended-summary-policymakers
Twiss, J. "AI Coding Degrades: Silent Failures Emerge." IEEE Spectrum, January 2026.
Anthropic. "A postmortem of three recent issues." September 17, 2025. https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
Lee, J., Park, S., Kwon, S., Oh, J., Kwon, Y. "A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B." OpenReview / CoRR, 2024-2025. https://openreview.net/forum?id=wB9swe32eb
GitClear. "AI Copilot Code Quality: 2025 Look Back at 12 Months of Data." 2025. https://www.gitclear.com/ai_assistant_code_quality_2025_research
Zheng, Z. et al. "LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?" OpenReview / NeurIPS Datasets and Benchmarks Track, 2025. https://openreview.net/forum?id=U5RIVFtat1
"The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason." arXiv:2506.12286, 2025.
Shumailov, I. et al. "AI models collapse when trained on recursively generated data." Nature 631, 755–759 (2024). https://doi.org/10.1038/s41586-024-07566-y
Hong, K., Troynikov, A., Huber, J. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research, July 2025. https://research.trychroma.com/
Padmakumar, V., Yueh-Han, C., Pan, J., Chen, V., He, H. "Measuring LLM Novelty As The Frontier Of Original And High-Quality Output." OpenReview / ICLR 2026 Poster, 2026. https://openreview.net/forum?id=i7QNKZioN6
Budzyń, K. et al. "Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy: a multicentre, observational study." Lancet Gastroenterology & Hepatology 10(10):896-903 (2025). https://doi.org/10.1016/S2468-1253(25)00133-5
Mchangama, J., White, J. "The Future of Censorship Is AI-Generated." TIME, February 2024. https://time.com/6835213/the-future-of-censorship-is-ai-generated/
Kusumegi, K. et al. "Scientific production in the era of large language models." Science 390(6779):1240-1243 (2025). https://doi.org/10.1126/science.adw3000
Lai, H. "'Please, don't kill the only model that still feels human': Understanding the #Keep4o Backlash." arXiv:2602.00773, 2026.