DRACAS in practice

DRACAS: Data Recording, Analysis, and Corrective Action System.

Field failures are rarely surprising in retrospect. They usually expose decisions made without enough data, or behaviours that were understood but deprioritised. The question isn’t whether faults will occur. It’s whether the system is prepared to observe, categorise, and respond when they do.

For embedded systems, this means having a structured approach to incident reporting. It's less about ticket volume and more about quality: ensuring faults are captured meaningfully, analysed properly, and either closed or escalated in a way that feeds back into the broader design and development.

Observation is the first constraint

For most systems in the field, the first constraint on handling a failure is observing it at all. Hardware restarts unexpectedly. A log is incomplete. A warning state is visible only briefly, then gone. If the system can’t capture the event or the context leading to it, the fault becomes anecdotal.

We structure our devices to emit minimal, durable telemetry snapshots: event codes, battery level, various statuses, and a classification of the fault source (e.g. hardware, network, timeout, watchdog). These are persisted locally and synchronised when convenient. The point isn’t to capture full state. It’s to confirm that something happened, that it had a traceable fingerprint, and that it’s distinguishable from normal variation.
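A snapshot like this can be sketched as a small fixed-size record. The field layout, widths, and source codes below are illustrative assumptions, not the article's actual format; the point is that the record is compact, fixed-size, and carries a classifiable fingerprint rather than full state.

```python
import struct

# Hypothetical 10-byte snapshot layout (little-endian):
# event code (u16), battery in mV (u16), status flags (u8),
# fault source class (u8), uptime in seconds (u32).
SNAPSHOT_FMT = "<HHBBI"

# Illustrative mapping of fault-source codes to the classes named in the text.
FAULT_SOURCES = {0: "hardware", 1: "network", 2: "timeout", 3: "watchdog"}

def encode_snapshot(event_code, battery_mv, flags, source, uptime_s):
    """Pack a snapshot into a fixed-size record suitable for local persistence."""
    return struct.pack(SNAPSHOT_FMT, event_code, battery_mv, flags, source, uptime_s)

def decode_snapshot(raw):
    """Unpack a persisted record back into named fields for analysis."""
    event_code, battery_mv, flags, source, uptime_s = struct.unpack(SNAPSHOT_FMT, raw)
    return {
        "event": event_code,
        "battery_mv": battery_mv,
        "flags": flags,
        "source": FAULT_SOURCES.get(source, "unknown"),
        "uptime_s": uptime_s,
    }

# A watchdog fault at 3.71 V, one day of uptime — a 10-byte record.
rec = encode_snapshot(259, 3710, 1, 3, 86400)
assert len(rec) == struct.calcsize(SNAPSHOT_FMT)
```

Fixed-size records make flash-friendly ring buffers straightforward, and the decoder gives the analysis side a stable schema to review against.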

Without structured observation, analysis becomes speculative.

Analysis: root cause, repeatability, and impact

Once captured, faults enter the DRACAS. They are not treated equally. Each is assigned a cause category, reviewed against historical records, and assessed for repeatability. Most faults fall into one of three classes:

  1. Explained but incomplete: the system failed as designed, but the design didn’t anticipate the context
  2. Unexplained but non-destructive: a defect occurred, but no operational failure resulted
  3. Operationally critical: the fault caused, or would cause, a breach of intended behaviour
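The three classes reduce to two questions: did the fault breach intended behaviour, and is its cause understood? A minimal triage sketch, with hypothetical names and a deliberately simplified rule:

```python
from enum import Enum

class FaultClass(Enum):
    EXPLAINED_INCOMPLETE = 1    # failed as designed; context not anticipated
    UNEXPLAINED_BENIGN = 2      # defect occurred; no operational failure
    OPERATIONALLY_CRITICAL = 3  # caused, or would cause, a behaviour breach

def triage(breached_behaviour: bool, cause_understood: bool) -> FaultClass:
    """Assign a fault to one of the three classes. Illustrative rule only:
    a real DRACAS review also weighs repeatability and history."""
    if breached_behaviour:
        return FaultClass.OPERATIONALLY_CRITICAL
    if cause_understood:
        return FaultClass.EXPLAINED_INCOMPLETE
    return FaultClass.UNEXPLAINED_BENIGN
```

Making the rule explicit, even this crudely, keeps triage consistent across reviewers and makes the classification auditable later.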

For each fault, the outcome is either a documented rationale, a change request, or an open investigation. These reviews are brief and focused. Knowledge is shared not just for audit, but for internal continuity. The aim is not to maintain a perfect log, but to avoid re-investigating the same failure mode multiple times under different names.

Feeding back into the design

A functioning DRACAS is not just a ticket system. Its value is in its feedback loop. Each confirmed fault is evaluated against previous entries. Has this failure mode already been anticipated? If so, was the mitigation adequate? If not, where should it be documented, and how should the test approach change?
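Those questions can be sketched as a lookup against prior entries. The fingerprint choice here (source class plus event code) and the field names are assumptions for illustration; the outcomes mirror the three named earlier: documented rationale, change request, or open investigation.

```python
def find_prior_entry(fault, history):
    """Return the first historical entry sharing this fault's fingerprint,
    or None. Fingerprint = (source, event) — an illustrative choice."""
    key = (fault["source"], fault["event"])
    for entry in history:
        if (entry["source"], entry["event"]) == key:
            return entry
    return None

def review(fault, history):
    """Map a confirmed fault to one of the three review outcomes."""
    prior = find_prior_entry(fault, history)
    if prior is None:
        return "open investigation"   # never anticipated: investigate
    if prior.get("mitigated"):
        return "documented rationale" # anticipated and mitigated: record why
    return "change request"           # anticipated, mitigation inadequate
```

Keying on a stable fingerprint is also what prevents the same failure mode being re-investigated under different names.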

In practice, this has led us to adjust test logic, modify fallback states, and revise observability. Not because the application was incorrect per se, but because the edge case wasn’t previously captured as a risk.

Over time, the DRACAS becomes a technical history. It tells you not just what broke, but what’s been learned, and whether that learning has reached the systems that need it.