Tuesday, February 21, 2006

Application Triage

It is said that people will have several careers in their lifetime. For me, a past career was in first aid. Part of first aid is triage: a system for delegating and prioritizing the treatment of injuries. While I shelved this concept for a while, it has resurfaced in my current profession: IT. Using triage as an analogue from the medical profession, let's look at some IT problems and sort out what takes priority.

Note the title of this piece: "Application Triage." There is a wholly separate protocol for large scale hardware failures. Let's focus on what happens when applications fail and there are more problems than developers/sysadmins.

The idea is START: Simple Triage And Rapid Treatment, a triage system that can be performed by lightly trained lay people and IT staff alike. The first rule: triage is only done when there are two or more problems to deal with.

Triage separates crises into four groups. Let's lift these directly from the medical rules for triage: the DECEASED, who are beyond help; the injured who can be helped by IMMEDIATE attention; those whose treatment can be DELAYED; and those with MINOR problems, the walking wounded who need help less urgently.

If the application in question won't run, it's DECEASED. In medical triage, an apparently deceased patient is put at the bottom of the triage order. The good thing about a DECEASED application: it won't write errors or give users improper information. The bad thing: a dead application is still required, so it must eventually be repaired. In application triage, we revisit DECEASED applications after those that require IMMEDIATE attention.
If the application runs but writes errors (e.g. inserts empty records), it needs IMMEDIATE attention. As long as it runs, it will make additional errors. That's a big deal.
If the application runs but does not write errors (e.g. produces erroneous output that is not stored or used in a critical process), it can be DELAYED.
If the application runs and has errors that are minor (spelling and display errors), it can be considered MINOR. If something isn't right, it never gets below MINOR unless it's been REPAIRED.
REPAIRED: you've dealt with it and all is well with this particular issue.
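The rules above can be sketched as a small classifier. This is only an illustration: the function name `classify` and its boolean parameters are made up for this example, and placing DECEASED directly after IMMEDIATE in the handling order is my reading of "revisit after IMMEDIATE", not a fixed rule.

```python
from enum import IntEnum


class Triage(IntEnum):
    """Lower value = handled earlier (ordering is an assumption)."""
    IMMEDIATE = 1   # runs and writes bad data
    DECEASED = 2    # won't run; revisited after IMMEDIATE
    DELAYED = 3     # runs, wrong output, nothing bad stored
    MINOR = 4       # spelling and display errors
    REPAIRED = 5    # dealt with, nothing left to do


def classify(runs: bool, writes_errors: bool, cosmetic_only: bool,
             repaired: bool = False) -> Triage:
    """Map the observable symptoms of one problem to a triage group."""
    if repaired:
        return Triage.REPAIRED
    if not runs:
        return Triage.DECEASED
    if writes_errors:
        return Triage.IMMEDIATE
    if cosmetic_only:
        return Triage.MINOR
    return Triage.DELAYED
```

For example, an application that runs but inserts empty records would be `classify(runs=True, writes_errors=True, cosmetic_only=False)`, which lands in IMMEDIATE.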

To further divide these four categories, there are four more considerations: reach, impact, proximity and parts.

Reach. Does the problem affect one user or every user? If it affects more users, it's obviously a higher priority.
Impact. How does it relate to the key goals of your organization? For example: the payment processing part of an e-commerce site is critical while the "contact us" section is less so. When the impact is great, push the issue to the top of the list for that category of problems.
Proximity. If you have two related problems, deal with them both at the same time. Why open up a script, fix it, close it and move on, only to return to the same part of the same script later? For example: if you have two DELAYED problems and one MINOR problem, and one DELAYED problem happens in the same script as the MINOR problem, repair the DELAYED and the MINOR problem while the script is open.
Parts. You never think this one is a factor: what if you can't repair something because you are missing something essential to your repair? What if you know something is missing, but you do not know what to replace it with? For example: you are told that a web link is linking to the wrong place. You ask what the right location is but do not get an answer. You don't have the parts required to carry out the repair. If that's the case, leaving it where it is means it is always next in the queue for repair. If you put in pro tem (placeholder) information, you haven't repaired it. If you cannot get the information, downgrade the item (e.g. from IMMEDIATE to DELAYED). If the information is available by the time you reach the item again on the triage list: great. If not, downgrade it again.
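Here is a rough sketch of how the four considerations might order a work queue. Everything in it is an assumption for illustration: the `Issue` fields, the numeric `impact` scale, breaking ties by impact before reach, and letting a same-script problem jump ahead only when its parts are available.

```python
from dataclasses import dataclass

# Handling order; DECEASED right after IMMEDIATE is an assumption.
ORDER = ["IMMEDIATE", "DECEASED", "DELAYED", "MINOR"]
# Missing parts push an item down one step; MINOR is the floor.
DOWNGRADE = {"IMMEDIATE": "DELAYED", "DELAYED": "MINOR"}


@dataclass
class Issue:
    name: str
    category: str           # one of ORDER
    reach: int              # number of users affected
    impact: int             # 1 = peripheral ... 3 = critical (made-up scale)
    script: str             # where the fix has to be made
    has_parts: bool = True  # do we have what the repair needs?


def plan(issues):
    """Return issue names in the order they would be worked on."""
    def effective(issue):
        if issue.has_parts:
            return issue.category
        return DOWNGRADE.get(issue.category, issue.category)

    # Category first, then impact, then reach (higher values first).
    ranked = sorted(
        issues,
        key=lambda i: (ORDER.index(effective(i)), -i.impact, -i.reach),
    )

    ordered, seen = [], set()
    for issue in ranked:
        if issue.name in seen:
            continue
        # Proximity: while this script is open, fix its other problems too.
        batch = [issue] + [
            o for o in ranked
            if o is not issue and o.name not in seen
            and o.script == issue.script and o.has_parts
        ]
        for item in batch:
            ordered.append(item.name)
            seen.add(item.name)
    return ordered
```

With this sketch, a MINOR typo that lives in the same script as a DELAYED bug gets fixed right after that bug, ahead of other DELAYED work, and an IMMEDIATE item with no parts is queued as if it were DELAYED.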

Interested in medical triage? See http://www.absoluteastronomy.com/reference/triage

tags: triage, application development, debugging
