I am continually amazed at how modeling is made into a task much more complex than it needs to be! Modelers often get enamored with the power of the TEAMS analysis tool, or obsessed with the intricacies of their equipment, and get lost in the details. You do not need to be Einstein to model; just follow his advice: "Make things as simple as possible, but not simpler."
In this blog, I want to walk you through a simple modeling approach in which you match the depth of the model to the quality and quantity of information available, and to the results you want to get out of it.
The model in this example was developed for the field service organization entrusted with the support and maintenance of a sophisticated bio-medical instrument. The purpose of the model is to capture enough information for the reasoner to effectively use all available observables (error codes, status LEDs) to narrow down the list of suspects, and then to add just enough manual troubleshooting steps to determine the root cause, so that service agents can fix the problem right the first time.
We start by identifying the error codes from the service manuals of the system. Why start with these? Because the built-in checks behind them run continuously and can automatically detect malfunctions. Many instruments are capable of uploading these error codes back to the field service center, or the operator will call in with the error codes displayed on the instrument; so error codes are usually the first pieces of information available from the instrument.
For the reasoner to be able to interpret the error code data, we need to provide it with the list of possible causes (or faults) that would trigger each of these error codes. Many OEMs document these relationships rather well in their service manuals; in other cases, you may need to dig deeper into engineering knowledge to identify them. Once the list is available, it is a simple task in TEAMS to lay in the FRUs (field-replaceable units) and their interconnections, identify the error codes, and capture their relationships.
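Conceptually, what gets captured is a dependency mapping between faults and the error codes they can trigger. Here is a minimal sketch of that idea in Python, with made-up FRU faults and error codes (the names and structure are purely illustrative and have nothing to do with how TEAMS stores a model):

```python
# Illustrative fault-to-error-code mapping (all names are made up).
fault_to_error_codes = {
    "pump_motor_failure":     {"E101", "E203"},
    "valve_stuck_closed":     {"E101"},
    "heater_open_circuit":    {"E305"},
    "controller_board_fault": {"E101", "E203", "E305"},
}

# Inverted view: for each error code, the faults that could have caused it.
error_code_to_faults = {}
for fault, codes in fault_to_error_codes.items():
    for code in codes:
        error_code_to_faults.setdefault(code, set()).add(fault)

print(sorted(error_code_to_faults["E101"]))
# ['controller_board_fault', 'pump_motor_failure', 'valve_stuck_closed']
```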
Most error codes are broad functional tests, and therefore each error code can be triggered by multiple faults in multiple FRUs. Traditionally, field service organizations have disregarded such error codes because they do not point to a specific root cause. However, the reasoner in TEAMS can look across all error codes, and use the presence and absence of each one to rule possible faults in or out, significantly reducing the number of possible causes. In fact, we can run an analysis in TEAMS to quantify how often the reasoner will be able to identify the root cause using just the error codes! In those situations the reasoner can pinpoint the root cause on its own, so no manual test procedures need to be modeled for them.
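To see how the presence and absence of error codes both carry information, here is a toy single-fault filter, reusing the same made-up mapping as before (a back-of-the-envelope approximation of the idea, not how the TEAMS reasoner actually works):

```python
def candidate_faults(fault_to_error_codes, fired, silent):
    """Return faults consistent with the observed error codes.

    A fault is ruled out if it cannot explain a code that fired,
    or if it would have triggered a code that stayed silent.
    (Crude single-fault approximation, for illustration only.)
    """
    candidates = set()
    for fault, codes in fault_to_error_codes.items():
        explains_fired = fired <= codes        # accounts for every fired code
        contradicted = bool(codes & silent)    # would have fired a silent code
        if explains_fired and not contradicted:
            candidates.add(fault)
    return candidates

fault_to_error_codes = {
    "pump_motor_failure":     {"E101", "E203"},
    "valve_stuck_closed":     {"E101"},
    "heater_open_circuit":    {"E305"},
    "controller_board_fault": {"E101", "E203", "E305"},
}

# E101 fired; E203 and E305 are known to be silent:
print(candidate_faults(fault_to_error_codes, fired={"E101"}, silent={"E203", "E305"}))
# {'valve_stuck_closed'}
```

Even though E101 by itself is ambiguous, the fact that E203 and E305 did not fire is enough to rule out the other suspects.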
More than likely, however, there will be a significant percentage of cases where additional information is required to identify the root cause. So now we look for additional observables that are also low-effort and readily available. An example would be the status LEDs, which may provide information not captured by the error codes. These status LEDs can be modeled in the same way as the error codes, and the model updated with these new sources of observation.
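In terms of the toy sketch, adding the status LEDs simply extends each fault's observable signature; the same ruling-in and ruling-out logic applies unchanged (all names are again invented for illustration):

```python
# Each fault now maps to a signature over error codes *and* LED states.
fault_signatures = {
    "pump_motor_failure":     {"E101", "E203", "LED_PUMP_RED"},
    "valve_stuck_closed":     {"E101", "LED_FLUIDICS_AMBER"},
    "heater_open_circuit":    {"E305", "LED_THERMAL_RED"},
    "controller_board_fault": {"E101", "E203", "E305"},
}
```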
Next, we run the TEAMS analysis again and identify the remaining faults that cannot be isolated uniquely. TEAMS identifies these groups of faults, called ambiguity groups, where every member of the group causes exactly the same combination of error codes and status LED states, and the members are therefore indistinguishable from each other. A field service agent would need to troubleshoot further to identify the root cause from within such an ambiguity group. From this point on, the modeling effort only needs to identify the manual procedures that help differentiate between the faults of an ambiguity group: if a group has N members, one would need to model anywhere from roughly log2(N) (with well-chosen pass/fail tests) up to (N - 1) troubleshooting steps to achieve full fault isolation. This focused approach, capturing only the manual procedures that are needed after the reasoner has used all available information to narrow the set of suspects to a minimum, not only reduces the modeling effort, but also lets the subject matter experts assisting with the model concentrate on what actually improves the serviceability of the instrument.
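Finding ambiguity groups in the toy sketch amounts to grouping faults with identical observable signatures. TEAMS does this analysis for you, but a quick illustration (with invented names) shows the idea and the rough bound on the number of extra tests:

```python
from collections import defaultdict
from math import ceil, log2

def ambiguity_groups(fault_signatures):
    """Group faults that produce exactly the same observable signature."""
    groups = defaultdict(list)
    for fault, signature in fault_signatures.items():
        groups[frozenset(signature)].append(fault)
    return [sorted(members) for members in groups.values() if len(members) > 1]

fault_signatures = {
    "pump_motor_failure":  {"E101", "E203"},
    "pump_driver_fault":   {"E101", "E203"},   # same signature as the motor fault
    "valve_stuck_closed":  {"E101"},
    "heater_open_circuit": {"E305"},
}

for group in ambiguity_groups(fault_signatures):
    n = len(group)
    print(group, "needs between", ceil(log2(n)), "and", n - 1, "extra test(s)")
# ['pump_driver_fault', 'pump_motor_failure'] needs between 1 and 1 extra test(s)
```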
Once the desired level of fault isolation capability is achieved (remember, sometimes it is cheaper to replace a couple of FRUs than to figure out which one is the real culprit), we can augment the model with better instructions and illustrations for some of the more complex troubleshooting steps. We can link reference material, images, repair instructions, etc. to the manual procedures to help less experienced service agents perform complex tasks. We can identify the skill or certification required for some of the steps, and the reasoner can determine to what degree a problem can be resolved given the service agent's skills, and when to transfer the problem to an agent with a higher level of certification. The important thing is to do all of this only for the procedures that are actually needed for troubleshooting, and not get bogged down modeling the long, tedious troubleshooting sequences service agents have traditionally had to follow because they could not extract the full information content of the error codes and status LEDs.
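The skill gating, for example, is conceptually nothing more than filtering the recommended next steps by the agent's certification level. A purely hypothetical sketch (procedure names and skill levels are made up; TEAMS handles this through its own model attributes):

```python
# Hypothetical manual procedures, each tagged with a required skill level.
procedures = [
    {"name": "Check pump drive fuse",       "skill": 1},
    {"name": "Swap pump driver board",      "skill": 2},
    {"name": "Recalibrate fluidics module", "skill": 3},
]

def allowed_steps(procedures, agent_skill):
    """Steps an agent may perform; anything above their level is escalated."""
    return [p["name"] for p in procedures if p["skill"] <= agent_skill]

print(allowed_steps(procedures, agent_skill=2))
# ['Check pump drive fuse', 'Swap pump driver board']
```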
The best model for the job is the simplest model that gets the job done!
If you have any questions or need further clarification, please send me your comments and I will be happy to respond.