# LLMs in Network Operations: Systematic Failures, Implicit Feedback, and Calibration in Production

**G. Vieyra** — NOC Operations Research

*Cross-posted to the AI Alignment Forum*

---

## Abstract

We deployed an LLM-assisted diagnostic system in a real Network Operations Center (NOC), processing 132,096 network fault events across five failure classes (RAN, Power, Transport, GPON, CATV). We document three findings with direct implications for AI safety in high-stakes decision systems: (1) LLMs produce high-confidence diagnoses in ambiguous states that require human review, (2) implicit feedback derived from operational metrics (MTTR) systematically diverges from explicit operator feedback, and (3) retrieval quality in production silently degrades when reference playbooks have uneven class coverage. We propose a continuous evaluation framework (NOC-Eval) applicable to any LLM deployed in alerting systems.

---

## 1. Introduction

LLMs are increasingly deployed in systems where their errors have direct operational consequences: undetected outages, misrouted tickets, incorrect escalations. Unlike static benchmarks, these systems face continuous distributional shift — network topology changes, new failure modes appear, and human operators develop heuristics the model never sees.

This paper describes a production system: **bot_diagnosis**, a NOC dashboard integrating automatic fault classification and a WhatsApp conversational bot for real-time diagnostic queries.

**Core question:** How does an LLM fail when acting as an intermediary between network telemetry and human operational decisions?

---

## 2. System and Data

### 2.1 Architecture

```
Network alarms → diagnosis.db → classifier (rules + LLM)
        ↓
Flask Dashboard (NOC ops)
        ↓
WhatsApp Bot → human operator
        ↓
feedback_loop.py (cron 10min)
```

**Stack:** Flask (Python), SQLite (`diagnosis.db`), Baileys (WhatsApp bridge), regex rules + LLM for classification.

Active services:

- `noc-dashboard` — Flask + gunicorn (4 workers) on :5000

- `noc-correlator` — Remedy correlation every 300s

- `noc-wa-query-bot` — WhatsApp bot on :3002

- `feedback_loop.py` — cron every 10 minutes

### 2.2 Data (snapshot 2026-05-29)

| Fault Class | Cases | % |
|-------------|-------|---|
| FAULT_RAN | 48,969 | 37.1% |
| FAULT_POWER | 45,855 | 34.7% |
| FAULT_TRANSPORT | 19,494 | 14.8% |
| Other | 17,778 | 13.4% |
| **TOTAL** | **132,096** | |

Active states: `PENDING_REVIEW` (214) and `UNDER_MANAGEMENT` (35), auto-reclassified within 10 minutes.

---

## 3. Methodology

We define three continuous evaluation metrics for LLMs in production alerting systems.

### 3.1 CAC — Confidence-Ambiguity Correlation

Measures whether the model's confidence predicts absence of subsequent human correction. A well-calibrated system should have **CAC < 0** (high confidence → few corrections). Values near 0 or positive indicate miscalibration.
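A minimal sketch of how CAC could be computed, assuming it is the Pearson correlation between each diagnosis's confidence and a 0/1 indicator of later human correction (the sign convention under which CAC < 0 is the healthy case). The record layout, the function names, and the toy values are illustrative assumptions, not data from the deployed system:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def cac(records):
    """records: iterable of (confidence, was_corrected) pairs (hypothetical layout)."""
    records = list(records)
    confidences = [conf for conf, _ in records]
    corrections = [1.0 if corrected else 0.0 for _, corrected in records]
    return pearson(confidences, corrections)


# Made-up toy data: high confidence paired with few vs. many corrections.
well_calibrated = [(0.95, False), (0.90, False), (0.60, True), (0.55, True)]
miscalibrated = [(0.95, True), (0.90, True), (0.60, False), (0.55, False)]

print(cac(well_calibrated))  # negative: confident diagnoses are rarely corrected
print(cac(miscalibrated))    # positive: confident diagnoses are often corrected
```

In practice the correction indicator would only be meaningful for cases that a human actually reviewed, so the input set needs to be filtered accordingly.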

### 3.2 FSC — Feedback Signal Consistency

FSC compares the implicit and explicit feedback signals for each case, per fault class, where:

- **Implicit signal:** MTTR < 150h → positive; MTTR ≥ 150h → negative

- **Explicit signal:** no manual correction → positive; manual correction → negative

FSC < 0.8 per class indicates significant decoupling between the operational proxy and actual operator judgment.
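The exact FSC formula is not spelled out here, so the following is a sketch under one plausible reading: FSC for a class is the fraction of its cases on which the implicit and explicit signals agree. The tuple layout, function names, and sample cases are hypothetical; only the 150h cutoff and the 0.8 threshold come from the text.

```python
from collections import defaultdict

MTTR_THRESHOLD_HOURS = 150   # implicit signal cutoff from the definition above
FSC_ALERT_THRESHOLD = 0.8    # per-class risk threshold from the definition above


def implicit_positive(mttr_hours):
    """Implicit signal: resolved faster than the MTTR cutoff."""
    return mttr_hours < MTTR_THRESHOLD_HOURS


def explicit_positive(manually_corrected):
    """Explicit signal: the operator did not correct the diagnosis."""
    return not manually_corrected


def fsc_by_class(cases):
    """cases: iterable of (fault_class, mttr_hours, manually_corrected)."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for fault_class, mttr_hours, corrected in cases:
        total[fault_class] += 1
        if implicit_positive(mttr_hours) == explicit_positive(corrected):
            agree[fault_class] += 1
    return {cls: agree[cls] / total[cls] for cls in total}


# Made-up cases for illustration only.
cases = [
    ("FAULT_POWER", 12.0, False),   # fast, not corrected: signals agree
    ("FAULT_POWER", 200.0, True),   # slow, corrected: signals agree
    ("FAULT_POWER", 8.0, True),     # fast but corrected: signals disagree
    ("FAULT_POWER", 30.0, False),   # fast, not corrected: signals agree
    ("FAULT_RAN", 5.0, False),
    ("FAULT_RAN", 170.0, True),
]

scores = fsc_by_class(cases)
flagged = sorted(cls for cls, value in scores.items() if value < FSC_ALERT_THRESHOLD)
print(scores, flagged)
```

Computing the score per class rather than globally matters because a large, well-behaved class can mask decoupling in a smaller one.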

### 3.3 MCPC — Minority Class Playbook Coverage

MCPC is the percentage of fault classes with fewer than N playbook examples, with N = 10. MCPC > 20% indicates silent retrieval degradation risk.
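As a rough illustration, MCPC reduces to a count over per-class playbook totals. The function name and the per-class counts below are invented for the example; only N = 10 and the 20% threshold come from the text.

```python
def mcpc(playbook_counts, n=10):
    """Return (percentage of classes with fewer than n playbook examples,
    sorted list of those under-covered classes)."""
    if not playbook_counts:
        raise ValueError("need at least one fault class")
    under_covered = sorted(cls for cls, count in playbook_counts.items() if count < n)
    percentage = 100.0 * len(under_covered) / len(playbook_counts)
    return percentage, under_covered


# Hypothetical playbook counts per class (not the system's real numbers).
toy_counts = {
    "FAULT_RAN": 60,
    "FAULT_POWER": 50,
    "FAULT_TRANSPORT": 12,
    "FAULT_GPON": 4,
    "FAULT_CATV": 3,
}

percentage, under_covered = mcpc(toy_counts)
print(percentage, under_covered)
if percentage > 20.0:
    print("MCPC above risk threshold: review minority-class playbooks")
```

Returning the list of under-covered classes alongside the percentage keeps the alert actionable, since the aggregate number alone does not say which playbooks need filling.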

---

## 4. Observed Failures

### 4.1 High Confidence in Ambiguous States

The classifier assigns confidence ≥ 0.80 to diagnoses that fall into `PENDING_REVIEW` — states where regex rules are insufficient and human review is required. This creates a paradox: **the system communicates certainty precisely when it should communicate uncertainty.**

**Quantitative observation:** Of 4,485 `ai_feedback` records, only 65 (1.4%) correspond to genuine human correction signals; the rest are SLA auto-closes. This suggests operators trust the system even in ambiguous cases, possibly because the UI does not visually differentiate confidence levels.

> **Safety implication:** An LLM can degrade human oversight not by being wrong, but by communicating its outputs in a way that suppresses the operator's verification instinct.

### 4.2 Implicit vs. Explicit Feedback Divergence

The two feedback signals diverge in ~23% of FAULT_POWER cases. Many diagnoses rated "useful" by the MTTR proxy were nonetheless manually corrected by operators.

**Hypothesis:** the operator corrects the diagnosis *and then* resolves quickly — the model attributes the success to the original (incorrect) diagnosis, not to the correction.

> **Safety implication:** RLHF based on operational proxies can reinforce incorrect behavior if the proxy measures the downstream outcome (resolution time) rather than the model's actual output quality.
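The misattribution described in the hypothesis can be made concrete with a small sketch. It contrasts a label derived from MTTR alone with one that first checks whether the operator replaced the diagnosis. The `ResolvedCase` structure and both labeling functions are hypothetical illustrations, not the logic of `feedback_loop.py`.

```python
from dataclasses import dataclass
from typing import Optional

MTTR_THRESHOLD_HOURS = 150.0  # implicit-signal cutoff used elsewhere in this paper


@dataclass(frozen=True)
class ResolvedCase:
    model_diagnosis: str
    operator_diagnosis: Optional[str]  # None means the operator made no correction
    mttr_hours: float


def mttr_only_label(case):
    """Proxy label: a fast resolution counts as a useful diagnosis."""
    return case.mttr_hours < MTTR_THRESHOLD_HOURS


def correction_aware_label(case):
    """Label that does not credit the model when the operator replaced its diagnosis."""
    corrected = (
        case.operator_diagnosis is not None
        and case.operator_diagnosis != case.model_diagnosis
    )
    if corrected:
        return False  # the fast resolution followed the correction, not the model output
    return mttr_only_label(case)


# Made-up example: wrong diagnosis, quickly fixed after the operator corrected it.
case = ResolvedCase("FAULT_POWER", "FAULT_TRANSPORT", mttr_hours=6.0)
print(mttr_only_label(case))         # True: the proxy rewards the wrong diagnosis
print(correction_aware_label(case))  # False: the correction takes precedence
```

The two labels agree whenever no correction happened, so the divergence is confined to exactly the cases where the proxy is least trustworthy.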

### 4.3 Silent Degradation via Uneven Playbook Coverage

Of 174 playbook fills executed, **89% targeted FAULT_RAN and FAULT_POWER** — the majority classes. Minority classes (FAULT_GPON, FAULT_CATV) operated with generic playbooks for weeks without any alert firing.

> **Safety implication:** Production retrieval systems silently degrade on less-frequent classes. Aggregate coverage metrics hide this; per-class metrics with degradation alerts are required.

---

## 5. The NOC-Eval Framework

We propose three continuous monitoring metrics for any LLM deployed in an alerting system:

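| Metric | Definition | Risk Threshold |
|--------|-----------|----------------|
| **CAC** | Pearson(confidence, correction_rate) | CAC near 0 or positive indicates miscalibration |
| **FSC** | Per-class agreement between implicit (MTTR) and explicit (manual correction) signals | FSC < 0.8 |
| **MCPC** | % classes with < 10 playbook examples | MCPC > 20% |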

These metrics are intentionally simple — computable with SQL queries over any system that logs model confidence, human corrections, and a downstream resolution proxy.
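To illustrate the "computable with SQL" claim, here is a sketch using Python's built-in `sqlite3` against an in-memory database. The table and column names (`diagnoses`, `playbooks`, `corrected`, `n_examples`) are assumptions made for the example and do not reflect the real `diagnosis.db` schema. FSC is again read as a per-class agreement rate, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnoses (
    fault_class TEXT NOT NULL,
    confidence  REAL NOT NULL,
    mttr_hours  REAL NOT NULL,
    corrected   INTEGER NOT NULL  -- 1 if an operator manually corrected the diagnosis
);
CREATE TABLE playbooks (
    fault_class TEXT PRIMARY KEY,
    n_examples  INTEGER NOT NULL
);
""")
conn.executemany(
    "INSERT INTO diagnoses VALUES (?, ?, ?, ?)",
    [
        ("FAULT_POWER", 0.90, 12.0, 0),
        ("FAULT_POWER", 0.85, 8.0, 1),   # fast but corrected: signals disagree
        ("FAULT_RAN", 0.70, 5.0, 0),
        ("FAULT_RAN", 0.60, 200.0, 1),
    ],
)
conn.executemany(
    "INSERT INTO playbooks VALUES (?, ?)",
    [("FAULT_RAN", 60), ("FAULT_POWER", 50), ("FAULT_TRANSPORT", 12),
     ("FAULT_GPON", 4), ("FAULT_CATV", 3)],
)

# FSC per class: share of cases where (MTTR < 150h) matches (no manual correction).
FSC_SQL = """
SELECT fault_class,
       AVG(CASE WHEN (mttr_hours < 150) = (corrected = 0) THEN 1.0 ELSE 0.0 END)
FROM diagnoses
GROUP BY fault_class
ORDER BY fault_class
"""

# MCPC: percentage of classes with fewer than 10 playbook examples.
MCPC_SQL = """
SELECT 100.0 * SUM(CASE WHEN n_examples < 10 THEN 1 ELSE 0 END) / COUNT(*)
FROM playbooks
"""

fsc_rows = conn.execute(FSC_SQL).fetchall()
mcpc_value = conn.execute(MCPC_SQL).fetchone()[0]
print(fsc_rows, mcpc_value)
```

CAC is left out of this sketch because SQLite builds do not always ship a square-root function; the required sums can still be aggregated in SQL and combined in application code.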

---

## 6. Related Work

### 6.1 Calibration and Overconfidence in LLMs

Guo et al. (2017) showed that larger models tend to be *worse* calibrated, with overestimated confidence distributions. Kadavath et al. (2022) found that LLMs exhibit reasonable calibration when directly asked if they know an answer, but this collapses under distributional shift — exactly what we observe in the NOC environment. Xiong et al. (2024) distinguished between *aleatoric* and *epistemic* uncertainty, noting that current LLMs do not differentiate these in their outputs.

**Gap we address:** existing work measures calibration on static benchmarks. We document miscalibration in a system with continuous distributional shift and real operational consequences.

### 6.2 Proxy Misalignment and RLHF

Goodhart's Law (1975): "when a measure becomes a target, it ceases to be a good measure." Amodei et al. (2016) in *Concrete Problems in AI Safety* documented reward hacking in RL systems. Stiennon et al. (2020) showed in *Learning to Summarize from Human Feedback* that RLHF can produce models that generate outputs human evaluators rate highly but which contain incorrect information. Gao et al. (2022) in *Scaling Laws for Reward Model Overoptimization* showed that as the policy is optimized further away from its initial policy, the proxy reward keeps increasing while the gold reward eventually decreases.

**Gap we address:** RLHF literature operates on curated lab datasets. We document the same phenomenon in a production implicit feedback system where the proxy (MTTR) has specific operational semantics and endogenous noise.

### 6.3 Silent Degradation and Hidden Stratification

Oakden-Rayner et al. (2020) coined *hidden stratification*: models achieving high average accuracy while systematically failing on specific subgroups. D'Amour et al. (2022) in *Underspecification Presents Challenges* showed that multiple models with comparable benchmark accuracy exhibit very different subpopulation behavior. Rabanser et al. (2019) in *Failing Loudly* found that most dataset shift detection methods only catch large, obvious shifts — gradual subpopulation shifts go undetected.

**Gap we address:** these works use static datasets. We document coverage-driven degradation in a production retrieval system with continuous distributional change.

### 6.4 LLMs in Operational Systems (AIOps)

He et al. (2021) identified high false positive/negative cost asymmetry as a defining challenge for NOC models. Chen et al. (2020, Microsoft Research) reported ~40% error reduction when domain knowledge is incorporated vs. pure statistical correlation. Jiang et al. (2023) proposed LLMs for log interpretation without calibration evaluation or distributional shift analysis — a characteristic gap in the current AIOps+LLMs literature.

### 6.5 Automation Bias and Human Oversight

Cummings (2004) described *automation bias* as humans' tendency to over-follow automated decisions even with contradictory evidence. Goddard et al. (2012) found the effect is robust across domains and that high-confidence UI displays amplify the bias. Our finding in §4.1 (high confidence shown in ambiguous states without visual differentiation) is consistent with this mechanism. The 1.4% genuine correction rate is consistent with active automation bias.

### 6.6 Synthesis

| Area | Existing Work | Gap This Paper Covers |
|------|--------------|----------------------|
| LLM calibration | Kadavath 2022, Guo 2017 | Static benchmarks only |
| Proxy misalignment | Gao 2022, Stiennon 2020 | Controlled RLHF only |
| Hidden stratification | Oakden-Rayner 2020, D'Amour 2022 | Static data, not retrieval |
| AIOps with LLMs | Jiang 2023, Chen 2020 | No reliability metrics |
| Automation bias | Goddard 2012, Cummings 2004 | Lab experiments, not production |

---

## 7. Discussion

The three failures are not NOC-specific. They are instances of general production LLM problems:

- **Overconfidence under distributional shift** — relevant to any LLM in an operational environment

- **Proxy reward misalignment** — a real-world instance of lab-studied RLHF reward hacking

- **Silent minority class degradation** — a retrieval evaluation problem in imbalanced distributions

One underappreciated aspect: these three failures *interact and compound*. High confidence suppresses correction (§4.1), which reduces explicit signal quality (§4.2), which prevents detection of retrieval degradation (§4.3). The failure modes are not independent.

---

## 8. Conclusion and Future Work

### 8.1 Contributions

1. Three systematic failure patterns documented in a production LLM system, with quantitative evidence

2. NOC-Eval: a three-metric continuous monitoring framework applicable beyond NOC

3. Evidence that the three failures interact, compounding over time

### 8.2 Practical Implications

1. **Never display confidence without uncertainty distribution** — a single confidence score suppresses operator verification instinct

2. **Separate feedback signals** — operational proxies (MTTR, SLA) are not equivalent to explicit human feedback

3. **Per-class metrics, not only averages** — aggregate accuracy hides critical minority class degradation

### 8.3 Limitations

- Single NOC; generalization requires replication in other environments

- No deliberate drift events in study period; degradation is natural, not experimental

- Automation bias causality is inferred, not experimentally demonstrated

### 8.4 Future Work

1. **UI intervention:** implement uncertainty visualization and measure its effect on correction rate

2. **Automatic FSC alert:** trigger when implicit/explicit divergence exceeds per-class threshold

3. **Cross-domain evaluation:** medical alerting, fraud detection, and cloud security share the same risk architecture and could be evaluated with NOC-Eval

---

## References

[1] D. Amodei et al., "Concrete Problems in AI Safety," *arXiv:1606.06565*, 2016.

[2] M. L. Cummings, "Automation Bias in Intelligent Time Critical Decision Support Systems," *AIAA Intelligent Systems*, 2004.

[3] A. D'Amour et al., "Underspecification Presents Challenges for Credibility in Modern Machine Learning," *JMLR*, vol. 23, no. 226, 2022.

[4] L. Gao, J. Schulman, and J. Hilton, "Scaling Laws for Reward Model Overoptimization," *ICML*, 2023.

[5] K. Goddard, A. Roudsari, and J. C. Wyatt, "Automation Bias: A Systematic Review," *JAMIA*, vol. 19, no. 1, pp. 121–127, 2012.

[6] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On Calibration of Modern Neural Networks," *ICML*, 2017.

[7] S. He et al., "Survey on AI for IT Operations," *arXiv:2109.03829*, 2021.

[8] D. Hendrycks et al., "Measuring Massive Multitask Language Understanding," *ICLR*, 2021.

[9] R. Jiang et al., "LLMs as Log Interpreters for AIOps," *arXiv preprint*, 2023.

[10] S. Kadavath et al., "Language Models (Mostly) Know What They Know," *arXiv:2207.05221*, 2022.

[11] L. Oakden-Rayner et al., "Hidden Stratification Causes Clinically Meaningful Failures in ML for Medical Imaging," *ACM CHIL*, 2020.

[12] S. Rabanser, S. Günnemann, and Z. C. Lipton, "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift," *NeurIPS*, 2019.

[13] N. Stiennon et al., "Learning to Summarize from Human Feedback," *NeurIPS*, 2020.

[14] M. Xiong et al., "Can LLMs Express Their Uncertainty?" *ICLR*, 2024.

[15] C. Chen et al., "Towards AIOps: Root Cause Localization for Microservices," *KDD Workshop*, 2020.