Each case study documents how I approach unfamiliar systems, time-critical failures, and high-consequence environments.
Context
In the early 2000s, I worked on large-scale streaming media systems used for internal testing, benchmarking, and marketing demonstrations. At the time, streaming platforms were constrained primarily by disk I/O behavior, controller scheduling, and network throughput rather than application-level logic.
These systems were intentionally driven at or near saturation for extended periods so we could generate repeatable measurements, performance graphs, and documentation for both engineering and marketing audiences.
The Situation
During sustained streaming tests, systems operated normally for approximately 30 minutes before errors began to appear. If the workload continued uninterrupted, the errors recurred at 30-minute intervals.
The issue had several consistent characteristics:
- Appeared only under sustained, high I/O load
- Did not occur during short or burst workloads
- Affected only specific SCSI drive models
- Reproduced reliably across identical systems
- Disappeared when load was reduced
Because most systems at the time relied on 100 Mbps network interfaces and disk-based media sources, even brief disruptions were immediately visible as stream failures.
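To give a sense of how little margin those links left, here is a back-of-the-envelope capacity calculation. The stream bitrate and overhead factor are illustrative assumptions, not measured values from these systems:

```python
# Rough capacity math for a saturated 100 Mbps streaming host.
# Bitrate and overhead below are illustrative assumptions, not measurements.
LINK_MBPS = 100
STREAM_KBPS = 300          # a plausible early-2000s stream bitrate (assumed)
OVERHEAD = 0.85            # usable fraction after protocol overhead (assumed)

usable_mbps = LINK_MBPS * OVERHEAD
streams = int(usable_mbps * 1000 // STREAM_KBPS)
print(streams)             # concurrent streams before the link saturates
```

At these assumed rates a single host carries a few hundred streams, so even a sub-second disk stall backs up buffers across every active client at once.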
Investigation
Initial troubleshooting focused on the most obvious constraints:
- Network saturation and NIC behavior
- Disk bandwidth limits
- Controller queue depth exhaustion
- Operating system scheduling
- Application buffering and timing
The regularity of the failures was the critical clue. Errors occurred at predictable 30-minute intervals rather than randomly, suggesting an internal device-level process rather than a software defect.
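That kind of periodicity can be checked mechanically from an error log. A minimal sketch, using synthetic timestamps for illustration: a tight spread in the inter-arrival gaps points at a timer-driven process rather than random software faults.

```python
# Check whether error timestamps cluster around a fixed interval.
# Timestamps (seconds since test start) are synthetic, for illustration.
from statistics import mean, pstdev

errors = [1805, 3598, 5410, 7203, 9012]      # hypothetical error log times
gaps = [b - a for a, b in zip(errors, errors[1:])]

print(mean(gaps) / 60)     # mean gap in minutes (~30 here)
print(pstdev(gaps))        # small spread relative to the mean => periodic
```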
Root Cause
The issue was traced to background diagnostic behavior in certain SCSI disk drives.
SCSI devices expose operational controls through mode pages, accessed via MODE SENSE and MODE SELECT commands. These pages govern behaviors such as caching, error recovery, and background diagnostics.
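As a sketch of what such a request looks like at the command level, the following builds a MODE SENSE(6) CDB per the SPC layout. The page code 0x1C (Informational Exceptions Control) is used as an example; actually issuing the command requires an OS pass-through interface (e.g. SG_IO on Linux), which is omitted here.

```python
# Build a MODE SENSE(6) CDB requesting a single mode page.
# Layout per SPC: opcode 0x1A; byte 2 = PC (2 bits) | page code (6 bits).
def mode_sense6(page_code: int, pc: int = 0, alloc_len: int = 0xFF) -> bytes:
    assert 0 <= page_code <= 0x3F
    return bytes([
        0x1A,                    # MODE SENSE(6) opcode
        0x00,                    # DBD=0: include block descriptors
        (pc << 6) | page_code,   # page control + page code
        0x00,                    # subpage code (0 = no subpage)
        alloc_len,               # allocation length for the returned data
        0x00,                    # control byte
    ])

cdb = mode_sense6(0x1C)          # 0x1C: Informational Exceptions Control page
print(cdb.hex())
```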
In this case:
- The affected drives supported background self-tests similar to what later became widely known as SMART diagnostics
- These diagnostics were deferred while the drive was under heavy I/O load
- After approximately 30 minutes of deferral, the test was forced to run
- Execution of the test temporarily disrupted internal scheduling
Under light workloads, this behavior was effectively invisible. Under sustained streaming workloads, even brief internal interruptions caused measurable failures.
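The failure pattern above can be captured in a toy model of the drive's deferral policy. The 30-minute ceiling matches the observed failures; everything else here is an assumption about internal behavior, not drive documentation:

```python
# Toy model of the observed behavior: a continuously busy drive defers
# its background self-test, but the firmware forces it to run once the
# deferral reaches a ceiling.  Only the 30-minute ceiling is observed;
# the rest of this model is assumed.
DEFER_LIMIT_S = 30 * 60        # max deferral before the test is forced

def forced_test_times(load_duration_s: int) -> list[int]:
    """Times (s) at which a continuously busy drive stalls for a self-test."""
    return list(range(DEFER_LIMIT_S, load_duration_s, DEFER_LIMIT_S))

print(forced_test_times(2 * 3600))   # stalls during a 2-hour sustained run
```

Under this model, an idle drive never hits the ceiling (the test runs opportunistically), which matches the issue disappearing at reduced load.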
Resolution
By identifying the specific mode page bits controlling background diagnostic behavior, including vendor-specific extensions, we were able to adjust the configuration:
- Forced execution of background self-tests during sustained load was disabled
- Normal foreground error detection and reporting remained intact
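The change itself follows the standard read-modify-write pattern for mode pages: fetch the page with MODE SENSE, clear the relevant bit, and write it back with MODE SELECT. A minimal sketch of the modify step; the bit position and page contents are hypothetical, since real vendor-specific pages document their own layouts:

```python
# Read-modify-write for a mode page: clear one control bit, leave the
# rest of the page untouched.  The byte offset and bit mask below are
# hypothetical stand-ins for a vendor-documented flag.
FORCE_SELFTEST_BIT = 0x04          # hypothetical vendor-specific flag

def disable_forced_selftest(page: bytearray) -> bytearray:
    page = bytearray(page)                 # work on a copy
    page[2] &= ~FORCE_SELFTEST_BIT & 0xFF  # clear the flag only
    return page

before = bytearray([0x1C, 0x0A, 0x14, 0, 0, 0, 0, 0, 0, 0, 0, 0])
after = disable_forced_selftest(before)
print(hex(after[2]))               # 0x14 with bit 0x04 cleared -> 0x10
```

Preserving every other bit matters: the same page typically carries the settings that keep normal foreground error detection enabled.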
After applying the change:
- The 30-minute periodic failures disappeared
- Systems ran indefinitely under full load
- Benchmark results became stable and repeatable
Outcome
With the issue resolved:
- Clean performance graphs were produced
- Results were incorporated directly into internal documentation and marketing material
- Long-running demonstrations became reliable
- Findings were shared with the storage engineering group
The configuration changes were later incorporated at the factory level for affected drive models, preventing recurrence in future deployments.
Takeaway
This case illustrates how system reliability is often shaped by low-level hardware behavior that sits below the operating system and application stack.
By identifying and controlling SCSI mode page behavior, we converted a difficult periodic failure into a permanent, documented fix that improved both engineering confidence and customer-facing results.