The Laptop that Stopped Working
One fine day, a couple of months ago, a laptop that we owned stopped working. We heard 4 beeps coming from the machine at intervals but nothing appeared on the screen.
The service person quickly looked up the symptoms in his knowledge base and informed us that 4 beeps meant a memory error.
I first replaced the two memory modules one at a time, but the machine still wouldn’t start. I then tried two spare memory modules that I had in the cupboard, but the computer still wouldn’t start.
I had a brand new computer with me that used the same type and speed of memory as the one we were fixing. I pulled out its memory chips and inserted them into the faulty computer, but still no luck.
At that point, the service person told me that it must be the motherboard itself that was not working.
Second Attempt at Triage
So the next day, a motherboard and some memory arrived at my office. A little later a field engineer showed up and replaced the motherboard. The computer still wouldn’t start up.
When the field engineer heard 4 beeps, he said it MUST BE THE MEMORY.
Third Attempt at Triage
A few days later, a new set of memory modules arrived.
The engineer returned and tried the new memory. Still no luck. The computer would not start and you could still hear the 4 beeps.
A third set of brand new memory modules and a new motherboard were sent over.
Fourth Attempt at Triage
The engineer tried both motherboards and various combinations of memory modules, but still, all you could hear were 4 beeps and the computer would not start.
During one of his attempts to combine memory and motherboards, the engineer noticed that though the computer did not start, it did not beep either.
So, the engineer guessed that it was the screen that was not working. But just to be safe, he said he would ask the manufacturer to send another motherboard and another set of memory modules to go with it.
Fifth Attempt at Triage
The screen, the third motherboard and the fourth set of memory modules arrived in our office and an engineer spent the day trying various combinations of screens, motherboards and memory modules.
But the man on the phone said: “Sir, 4 beeps means there is something wrong with your memory. I will have them replaced.”
I had to take out my new laptop’s memory and pop it into the faulty machine to convince the engineer and support staff that replacing the memory would not fix the problem.
All the parts were now sent over – the memory, motherboard, processor, drive, and screen.
Sixth Attempt at Triage
Finally, the field engineer found that once he replaced the processor, the computer booted up with no problems.
Better Root Cause Analysis
The manufacturer could have spared themselves all that expense, time and effort had they used an expert system that relied on a probabilistic model of the symptoms and their causes.
Such a model would be able to tell, given the symptoms, which component was the most likely to have failed.
Such a model would be able to direct a field engineer to the component or components whose replacement would be most likely to fix the problem.
If the attempted fix did not work, the model would simply update its understanding of the problem and recommend a different course of action.
I will illustrate the process using what is known in the machine learning community as a directed probabilistic graphical model.
Run-Through of Root Cause Analysis
Let’s say a failure has occurred and there is only one symptom that can be observed: the laptop won’t start and emits 4 beeps.
The first step is to enter this information into the probabilistic graphical model. From a list of symptoms, we select the ones that we observe (all observed symptoms are represented as yellow circles in this document).
So the following diagram has only one circle (observed symptom).
Model 1: The symptom of 4 beeps is modeled in a probabilistic graphical model with a yellow circle as follows:
Now, let’s assume that this symptom can be caused by the failure of memory, the motherboard or the processor.
Model 2: I can add that information to the predictive model, so that the model now looks like this:
The model captures the belief that the causes of the symptom (processor, memory and motherboard failure) are independent of each other in the absence of any observed symptoms.
It also captures the belief that given a symptom like 4 beeps, evidence for one cause will explain away (or decrease the probability of) the other causes.
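The explaining-away effect can be checked numerically. What follows is only a minimal sketch: the prior failure rates, the per-component trigger probabilities and the noisy-OR way of combining causes are all assumptions made up for illustration, not figures or modelling choices taken from a real repair database.

```python
from itertools import product

# Hypothetical prior failure probabilities (illustrative, not measured).
PRIOR = {"memory": 0.05, "motherboard": 0.03, "processor": 0.01}
# Hypothetical chance that each failed component, on its own, causes 4 beeps.
TRIGGER = {"memory": 0.9, "motherboard": 0.6, "processor": 0.7}
LEAK = 0.001  # assumed chance of 4 beeps when no modeled component has failed

CAUSES = list(PRIOR)

def p_beeps(state):
    """Noisy-OR: P(4 beeps | which components have failed)."""
    p_silent = 1.0 - LEAK
    for c in CAUSES:
        if state[c]:
            p_silent *= 1.0 - TRIGGER[c]
    return 1.0 - p_silent

def joint(state):
    """P(causes = state, 4 beeps); causes are independent a priori."""
    p = p_beeps(state)
    for c in CAUSES:
        p *= PRIOR[c] if state[c] else 1.0 - PRIOR[c]
    return p

def posterior(cause, given=None):
    """P(cause failed | 4 beeps, plus any known cause states in `given`)."""
    given = given or {}
    num = den = 0.0
    for values in product([False, True], repeat=len(CAUSES)):
        state = dict(zip(CAUSES, values))
        if any(state[k] != v for k, v in given.items()):
            continue  # skip states that contradict what is already known
        w = joint(state)
        den += w
        if state[cause]:
            num += w
    return num / den

p_mem = posterior("memory")
p_mem_given_cpu = posterior("memory", {"processor": True})
print(f"P(memory failed | 4 beeps) = {p_mem:.3f}")
print(f"P(memory failed | 4 beeps, processor failed) = {p_mem_given_cpu:.3f}")
```

With these made-up numbers, learning that the processor has failed pulls the probability of a memory failure back down, because the processor already accounts for the beeps.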
Once such a model is built, it can tell a field engineer the most probable cause of a symptom, the second most probable cause and so on.
So, the engineer will only have to look at the output of the model’s analysis to know whether he needs to replace one component, or two, and which ones.
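A ranked list of causes of this kind could be produced roughly as follows. This is a sketch under assumed numbers: the priors, the trigger probabilities, the leak term and the noisy-OR combination are hypothetical, and `rank_causes` is a name invented for this illustration.

```python
from itertools import product

# Hypothetical numbers for illustration only.
PRIOR = {"memory": 0.05, "motherboard": 0.03, "processor": 0.01}
TRIGGER = {"memory": 0.9, "motherboard": 0.6, "processor": 0.7}
LEAK = 0.001

def rank_causes(prior, trigger, leak):
    """Return (cause, P(cause failed | 4 beeps)) pairs, most probable first."""
    causes = list(prior)
    marginal = dict.fromkeys(causes, 0.0)
    evidence = 0.0
    for values in product([False, True], repeat=len(causes)):
        state = dict(zip(causes, values))
        p_silent = 1.0 - leak
        weight = 1.0
        for c in causes:
            weight *= prior[c] if state[c] else 1.0 - prior[c]
            if state[c]:
                p_silent *= 1.0 - trigger[c]
        weight *= 1.0 - p_silent  # multiply by P(4 beeps | state)
        evidence += weight
        for c in causes:
            if state[c]:
                marginal[c] += weight
    posterior = {c: marginal[c] / evidence for c in causes}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_causes(PRIOR, TRIGGER, LEAK)
for cause, p in ranking:
    print(f"{cause}: {p:.3f}")
```

The engineer would start with the component at the top of the list and work downwards. The order depends entirely on the numbers fed in, which in practice would have to come from repair records.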
When the field engineer goes out and replaces the components, his actions can also be fed into the model.
Model 3: Below is an extended model into which attempts to fix the problem by replacing the memory can be incorporated.
If a field engineer were to tell the system that the memory had been replaced with a new module and that this did not fix the problem, the system would immediately infer that the memory could not be the sole cause of the problem (provided the new module was itself good), and it would suggest the next most probable cause of failure.
Finally, if new memory modules sent to customers for repairs frequently turned out to be defective, that information could also be added to the model as follows:
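A failed repair attempt can be treated as one more observation. The sketch below assumes that the replacement module is known to be good, that the two boot attempts are independent given the true component states, and that causes combine as a noisy-OR. All the numbers are hypothetical.

```python
from itertools import product

# Hypothetical numbers for illustration only.
PRIOR = {"memory": 0.05, "motherboard": 0.03, "processor": 0.01}
TRIGGER = {"memory": 0.9, "motherboard": 0.6, "processor": 0.7}
LEAK = 0.001

def p_beeps(failed):
    """Noisy-OR: P(4 beeps | set of failed components currently installed)."""
    p_silent = 1.0 - LEAK
    for c in failed:
        p_silent *= 1.0 - TRIGGER[c]
    return 1.0 - p_silent

def posteriors(memory_replaced_still_beeps):
    """P(each component failed | observations so far)."""
    causes = list(PRIOR)
    marginal = dict.fromkeys(causes, 0.0)
    evidence = 0.0
    for values in product([False, True], repeat=len(causes)):
        failed = {c for c, v in zip(causes, values) if v}
        weight = 1.0
        for c in causes:
            weight *= PRIOR[c] if c in failed else 1.0 - PRIOR[c]
        weight *= p_beeps(failed)  # first observation: 4 beeps
        if memory_replaced_still_beeps:
            # Assumes the spare is good, so memory no longer contributes.
            weight *= p_beeps(failed - {"memory"})
        evidence += weight
        for c in failed:
            marginal[c] += weight
    return {c: marginal[c] / evidence for c in causes}

before = posteriors(False)
after = posteriors(True)
print("before swap:", {c: round(p, 3) for c, p in before.items()})
print("after swap: ", {c: round(p, 3) for c, p in after.items()})
```

With these numbers the memory starts out as the leading suspect. Once the swap fails, its probability falls sharply and the motherboard moves to the top of the list.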
Now, if the error rate for new memory modules in the supply chain happens to be high for a particular type of memory, then if memory replacement failed to fix a 4-beep problem, the model would understand that faulty memory could still be the cause of the problem.
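The effect of unreliable spares can be sketched by adding one more uncertain quantity: whether the replacement module is itself defective. The defect rates tried below, the other numbers and the noisy-OR structure are all assumptions for illustration, and a defective spare is assumed to behave like any other failed module.

```python
from itertools import product

# Hypothetical numbers for illustration only.
PRIOR = {"memory": 0.05, "motherboard": 0.03, "processor": 0.01}
TRIGGER = {"memory": 0.9, "motherboard": 0.6, "processor": 0.7}
LEAK = 0.001

def p_beeps(failed):
    """Noisy-OR: P(4 beeps | set of failed components currently installed)."""
    p_silent = 1.0 - LEAK
    for c in failed:
        p_silent *= 1.0 - TRIGGER[c]
    return 1.0 - p_silent

def p_memory_failed(spare_defect_rate):
    """P(original memory failed | 4 beeps before and after the memory swap)."""
    causes = list(PRIOR)
    num = den = 0.0
    for values in product([False, True], repeat=len(causes)):
        failed = {c for c, v in zip(causes, values) if v}
        weight = 1.0
        for c in causes:
            weight *= PRIOR[c] if c in failed else 1.0 - PRIOR[c]
        weight *= p_beeps(failed)  # 4 beeps before the swap
        others = failed - {"memory"}
        # After the swap the memory slot is bad only if the spare is defective.
        weight *= (
            (1.0 - spare_defect_rate) * p_beeps(others)
            + spare_defect_rate * p_beeps(others | {"memory"})
        )
        den += weight
        if "memory" in failed:
            num += weight
    return num / den

p_reliable = p_memory_failed(0.0)    # spares never defective
p_unreliable = p_memory_failed(0.3)  # assumed 30% of spares defective
print(f"reliable spares:   P(memory failed) = {p_reliable:.3f}")
print(f"unreliable spares: P(memory failed) = {p_unreliable:.3f}")
```

When spares are reliable, a failed swap nearly clears the memory. When many spares are bad, the same failed swap says much less, and the memory remains a serious suspect.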
Applications to Supply Chain Management
The probabilities of all the nodes are updated continually as evidence arrives, and this information can be used to detect whether the error rate in new memory module deliveries suddenly goes up.
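One simple way to watch for such a jump is to keep a running estimate of the defect rate and raise a flag when it crosses a threshold. The sketch below uses a Beta prior updated with per-module outcomes. The class name `DefectRateMonitor`, the prior counts and the 10% threshold are hypothetical, and it assumes each shipped module is eventually confirmed as good or defective.

```python
from dataclasses import dataclass

@dataclass
class DefectRateMonitor:
    """Beta-Bernoulli estimate of the defect rate of newly shipped modules."""
    alpha: float = 1.0      # prior pseudo-count of defective modules
    beta: float = 49.0      # prior pseudo-count of good modules (prior mean 2%)
    threshold: float = 0.10  # assumed alarm level for the estimated rate

    def record(self, defective: bool) -> None:
        """Update the estimate with the outcome for one shipped module."""
        if defective:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean of the defect rate."""
        return self.alpha / (self.alpha + self.beta)

    def alarm(self) -> bool:
        """True when the estimated defect rate exceeds the threshold."""
        return self.mean > self.threshold

monitor = DefectRateMonitor()
for _ in range(20):          # a run of good modules
    monitor.record(False)
alarm_before = monitor.alarm()
for _ in range(15):          # followed by a bad batch
    monitor.record(True)
alarm_after = monitor.alarm()
print(f"estimated defect rate: {monitor.mean:.3f}, alarm: {alarm_after}")
```

The same running estimate could be fed back into the diagnostic model as the defect rate of replacement parts, so that the triage advice and the supply chain alert stay consistent with each other.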
Benefits to a Customer Service Process
1. Formal capture and storage of triage history
2. Suggestion of cause(s) given the effects (symptoms)
3. Suggestion of other causes given triage steps performed
What the system will seem to be doing (to the layman):
1. Recording symptoms
2. Recommending a course of action
3. Recording the outcome of the course of action
4. Recommending next steps