Each case study documents how I approach unfamiliar systems, time-critical failures, and high-consequence environments.

Context

In the early 2000s, I worked on large-scale streaming media systems used for internal testing, benchmarking, and marketing demonstrations. At the time, streaming platforms were constrained primarily by disk I/O behavior, controller scheduling, and network throughput rather than application-level logic.

These systems were intentionally driven at or near saturation for extended periods so we could generate repeatable measurements, performance graphs, and documentation for both engineering and marketing audiences.


The Situation

During sustained streaming tests, systems operated normally for approximately 30 minutes, then errors began to appear. If the workload continued uninterrupted, the errors recurred at regular 30-minute intervals.

The issue had several consistent characteristics:

  • Appeared only under sustained, high I/O load
  • Did not occur during short or burst workloads
  • Affected only specific SCSI drive models
  • Reproduced reliably across identical systems
  • Disappeared when load was reduced

Because most systems at the time relied on 100 Mbps network interfaces and disk-based media sources, even brief disruptions were immediately visible as stream failures.


Investigation

Initial troubleshooting focused on the most obvious constraints:

  • Network saturation and NIC behavior
  • Disk bandwidth limits
  • Controller queue depth exhaustion
  • Operating system scheduling
  • Application buffering and timing

The regularity of the failures was the critical clue. Errors occurred at predictable 30-minute intervals rather than randomly, suggesting an internal device-level process rather than a software defect.


Root Cause

The issue was traced to background diagnostic behavior in certain SCSI disk drives.

SCSI devices expose operational controls through mode pages, accessed via MODE SENSE and MODE SELECT commands. These pages govern behaviors such as caching, error recovery, and background diagnostics.
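The command structure is compact. As an illustrative sketch, the 6-byte MODE SENSE CDB that requests a mode page can be assembled in a few lines; page code 0x1C (the standard Informational Exceptions Control page) is used here only as an example, since the pages involved in this case were vendor-specific:

```python
# Illustrative sketch of the SCSI MODE SENSE(6) CDB layout per the SPC
# standard. Page code 0x1C is shown only as an example; the pages that
# mattered in this case were vendor-specific.

def build_mode_sense6_cdb(page_code: int, alloc_len: int = 0xFF,
                          dbd: bool = True, pc: int = 0) -> bytes:
    """Build a 6-byte MODE SENSE CDB for the given mode page.

    pc (page control): 0 = current values, 1 = changeable mask,
    2 = default values, 3 = saved values.
    """
    return bytes([
        0x1A,                                     # MODE SENSE(6) opcode
        0x08 if dbd else 0x00,                    # DBD: skip block descriptors
        ((pc & 0x3) << 6) | (page_code & 0x3F),   # PC field + page code
        0x00,                                     # subpage code (0 = none)
        alloc_len & 0xFF,                         # allocation length
        0x00,                                     # control byte
    ])

cdb = build_mode_sense6_cdb(0x1C)
print(cdb.hex())  # -> 1a081c00ff00
```

Requesting page control 1 (the changeable mask) is the usual first step, since it reveals which bits the drive will actually let you modify with MODE SELECT.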

In this case:

  • The affected drives supported background self-tests similar to what later became widely known as SMART diagnostics
  • These diagnostics were deferred while the drive was under heavy I/O load
  • After approximately 30 minutes of deferral, the test was forced to run
  • Execution of the test temporarily disrupted internal scheduling

Under light workloads, this behavior was effectively invisible. Under sustained streaming workloads, even brief internal interruptions caused measurable failures.
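The interaction between deferral and forced execution can be sketched with a toy model. All timings below are assumptions chosen to mirror the observed behavior, not measurements from the actual drives:

```python
# Toy model of the deferral behavior described above. The deferral
# window and test duration are assumptions, not measured values.
# A background self-test is postponed while I/O is pending, but only
# up to a deferral limit; once that limit expires, the test runs
# anyway and stalls I/O for its duration.

DEFER_LIMIT_S = 30 * 60     # assumed 30-minute deferral window
TEST_DURATION_S = 2         # assumed stall while the self-test runs

def stall_times(busy_seconds: int) -> list[int]:
    """Return the times (in seconds) at which a sustained workload
    would see an I/O stall, over busy_seconds of continuous load."""
    stalls = []
    next_forced = DEFER_LIMIT_S
    t = 0
    while t < busy_seconds:
        if t >= next_forced:
            stalls.append(t)
            # The next test is deferred again once this one completes.
            next_forced = t + TEST_DURATION_S + DEFER_LIMIT_S
        t += 1
    return stalls

# Two hours of sustained load: stalls land on a ~30-minute cadence.
print(stall_times(2 * 3600))  # -> [1800, 3602, 5404]
```

A short or bursty workload never keeps the drive busy long enough to exhaust the deferral window, which is exactly why the failure only appeared under sustained load.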


Resolution

By identifying the specific mode page bits controlling background diagnostic behavior, including vendor-specific extensions, we were able to adjust the configuration:

  • Forced execution of background self-tests during sustained load was disabled
  • Normal foreground error detection and reporting remained intact
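The safe way to make such a change is a read-modify-write cycle: MODE SENSE to fetch the current page, clear only the offending bit, then MODE SELECT to write the page back. A minimal sketch of the modify step, with a hypothetical page layout and bit position (the real drives used vendor-specific pages):

```python
# Hedged sketch of the read-modify-write pattern used with MODE SELECT:
# clear only the bit that forces the deferred self-test, preserving
# every other setting in the page. The byte offset and bit position
# below are hypothetical, chosen for illustration only.

FORCE_TEST_BIT = 0x04       # hypothetical "force deferred self-test" bit
CONTROL_BYTE_OFFSET = 2     # hypothetical offset within the mode page

def disable_forced_selftest(page: bytes) -> bytes:
    """Return a copy of the mode page with only the forced-self-test
    bit cleared; all other settings are preserved."""
    data = bytearray(page)
    data[CONTROL_BYTE_OFFSET] &= ~FORCE_TEST_BIT & 0xFF
    return bytes(data)

# Example: a page whose control byte has the bit set (0x07 -> 0x03).
before = bytes([0x9C, 0x0A, 0x07, 0x00])
after = disable_forced_selftest(before)
print(before.hex(), "->", after.hex())  # 9c0a0700 -> 9c0a0300
```

Preserving the untouched bits matters: writing back a partially zeroed page would silently disable caching or error-recovery settings along with the diagnostic.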

After applying the change:

  • The 30-minute periodic failures disappeared
  • Systems ran indefinitely under full load
  • Benchmark results became stable and repeatable


Outcome

With the issue resolved:

  • Clean performance graphs were produced
  • Results were incorporated directly into internal documentation and marketing material
  • Long-running demonstrations became reliable
  • Findings were shared with the storage engineering group

The configuration changes were later incorporated at the factory level for affected drive models, preventing recurrence in future deployments.


Takeaway

This case illustrates how system reliability is often shaped by low-level hardware behavior that sits below the operating system and application stack.

By identifying and controlling SCSI mode page behavior, a difficult periodic failure was converted into a permanent, documented fix that improved both engineering confidence and customer-facing results.