Troubleshooting
From Wikipedia, the free encyclopedia
Troubleshooting is a form of problem solving,
often applied to repair failed products or processes. It is a logical,
systematic search for the source of a problem so that it can be solved,
and so the product or process can be made operational again.
Troubleshooting is needed to develop and maintain complex systems where
the symptoms of a problem can have many possible causes. Troubleshooting
is used in many fields such as engineering, system administration, electronics, automotive repair, and diagnostic medicine.
Troubleshooting requires identification of the malfunction(s) or
symptoms within a system. Then, experience is commonly used to generate
possible causes of the symptoms. Determining which cause is most likely
is often a process of elimination: ruling out potential causes
of a problem. Finally, troubleshooting
requires confirmation that the solution restores the product or process
to its working state.
In general, troubleshooting is the identification or diagnosis
of "trouble" in the management flow of a corporation or a system caused
by a failure of some kind. The problem is initially described as
symptoms of malfunction, and troubleshooting is the process of
determining and remedying the causes of these symptoms.
A system can be described in terms of its expected, desired or
intended behavior (usually, for artificial systems, its purpose). Events or
inputs to the system are expected to generate specific results or
outputs. (For example, selecting the "print" option from various computer
applications is intended to result in a hardcopy
emerging from some specific device.) Any unexpected or undesirable
behavior is a symptom. Troubleshooting is the process of isolating the
specific cause or causes of the symptom. Frequently the symptom is a
failure of the product or process to produce any results. (Nothing was
printed, for example.)
The methods of forensic engineering
are especially useful in tracing problems in products or processes, and
a wide range of analytical techniques are available to determine the
cause or causes of specific failures. Corrective action can then be taken to prevent further failures of a similar kind. Preventative action is possible using failure mode and effects analysis (FMEA) and fault tree analysis (FTA) before full scale production, and these methods can also be used for failure analysis.
Aspects
Most discussion of troubleshooting, and especially training in formal
troubleshooting procedures, tends to be domain specific, even though
the basic principles are universally applicable.
Usually troubleshooting is applied to something that has suddenly
stopped working, since its previously working state forms the
expectations about its continued behavior. So the initial focus is often
on recent changes to the system or to the environment in which it
exists. (For example, a printer that "was working when it was plugged in
over there".) However, there is a well-known principle that correlation does not imply causation.
(For example, the failure of a device shortly after it has been plugged
into a different outlet doesn't necessarily mean that the events were
related. The failure could have been a matter of coincidence.) Therefore troubleshooting demands critical thinking rather than magical thinking.
It's useful to consider the common experience we have with light
bulbs. A light bulb "burns out" more or less at random; eventually the
repeated heating and cooling of its filament,
and fluctuations in the power supplied to it, cause the filament to
crack or vaporize. The same principle applies to most other electronic
devices, and similar principles apply to mechanical devices. Some
failures are part of the normal wear-and-tear of components in a system.
A basic principle in troubleshooting is to start with the simplest and most probable
problems. This is illustrated by the old saying "When
you see hoof prints, look for horses, not zebras", or, to use another maxim, the KISS principle. This principle leads to the common complaint about help desks
or manuals that they sometimes first ask: "Is it plugged in and does
that receptacle have power?" This should not be taken as an
affront; rather, it should serve as a reminder to always check the simple things first before calling for help.
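The "simple things first" rule can be sketched as a sort over candidate checks. This is only an illustration: the `Check` record, the probabilities and the costs in minutes are hypothetical values invented for the example, and ranking by probability per unit of effort is one reasonable heuristic rather than a prescribed procedure.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    probability: float   # estimated chance this is the cause (assumed value)
    cost_minutes: float  # estimated effort to test it (assumed value)


def order_checks(checks):
    """Return checks sorted so that cheap, likely causes are tested first."""
    return sorted(checks, key=lambda c: c.probability / c.cost_minutes, reverse=True)


checks = [
    Check("replace main board", 0.10, 60),
    Check("is it plugged in?", 0.30, 1),
    Check("reseat data cable", 0.25, 2),
    Check("reinstall driver", 0.35, 20),
]

for check in order_checks(checks):
    print(check.name)
```

With these made-up numbers the power check comes first even though reinstalling the driver is the single most probable cause, because the power check costs almost nothing to perform.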
A troubleshooter could check each component in a system
one by one, substituting known good components for each potentially
suspect one. However, this process of "serial substitution" can be
considered degenerate when components are substituted without regard to
a hypothesis concerning how their failure could result in the symptoms
being diagnosed.
Simple and intermediate systems are characterized by lists or trees
of dependencies among their components or subsystems. More complex
systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to "bisection" troubleshooting techniques.
It also helps to start from a known good state, the best example being a computer reboot. A cognitive walkthrough is also a good thing to try. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.
A common cause of problems is bad design, for example bad human factors design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad if accompanied by habituation,
where the user just doesn't notice the incorrect usage, for instance if
two parts have different functions but share a common case so that it
isn't apparent on a casual inspection which part is being used.
Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart
or table that is made before a problem occurs. Developing
troubleshooting procedures in advance allows sufficient thought about
the steps to take and about how to organize them
into the most efficient process. Troubleshooting tables
can be computerized to make them more efficient for users.
Some computerized troubleshooting services (such as Primefax, later
renamed Maxserve) immediately show the top 10 solutions with the
highest probability of fixing the underlying problem. The technician can
either answer additional questions to advance through the
troubleshooting procedure, each step narrowing the list of solutions, or
immediately implement the solution that seems most likely to fix the problem. These
services give a rebate if the technician takes an additional step after
the problem is solved: report back the solution that actually fixed the
problem. The computer uses these reports to update its estimates of
which solutions have the highest probability of fixing that particular
set of symptoms.[1]
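The feedback loop described above can be sketched in a few lines. This is not how Primefax or Maxserve was implemented; it is a minimal sketch that assumes the "probability" of a solution is simply its share of the confirmed fixes reported for a symptom. The class name `SolutionRanker` and the sample symptom and fixes are hypothetical.

```python
from collections import defaultdict


class SolutionRanker:
    """Rank candidate fixes for a symptom by how often they were reported to work."""

    def __init__(self):
        # symptom -> solution -> number of confirmed fixes
        self._counts = defaultdict(lambda: defaultdict(int))

    def report_fix(self, symptom, solution):
        """Record a technician's report that `solution` fixed `symptom`."""
        self._counts[symptom][solution] += 1

    def top_solutions(self, symptom, limit=10):
        """Return up to `limit` (solution, estimated probability) pairs."""
        counts = self._counts.get(symptom, {})
        total = sum(counts.values())
        if total == 0:
            return []
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        return [(solution, count / total) for solution, count in ranked[:limit]]


ranker = SolutionRanker()
for fix in ["replace fuser", "replace fuser", "clean rollers", "replace fuser"]:
    ranker.report_fix("paper jam", fix)

print(ranker.top_solutions("paper jam"))
```

Each new report shifts the estimates, so the ranking improves as more technicians report which solution actually worked.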
Half-splitting
Efficient methodical troubleshooting starts with a clear
understanding of the expected behavior of the system and the symptoms
being observed. From there the troubleshooter forms hypotheses on
potential causes, and devises (or perhaps references a standardized
checklist of) tests to eliminate these prospective causes. This approach
is often called "Divide and Conquer".
Two strategies are common among troubleshooters. The first is to check for
frequently encountered or easily tested conditions (for example,
checking to ensure that a printer's light is on and that its cable is
firmly seated at both ends). This is often referred to as "milking the
front panel."[2]
The second is to "bisect" the system (for example, in a network printing system,
checking to see if the job reached the server to determine whether a
problem exists in the subsystems "towards" the user's end or "towards"
the device).
This latter technique can be particularly efficient in systems with
long chains of serialized dependencies or interactions among their
components. It is simply the application of a binary search across the range of dependencies and is often referred to as "half-splitting".[3]
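Half-splitting over a serial chain can be written as an ordinary binary search. The sketch below assumes a single fault, so that every stage before the fault produces good output and every stage from the fault onward does not. The stage names and the `output_is_good` probe are hypothetical stand-ins for whatever measurement is available at each point in the chain.

```python
def find_first_bad_stage(stages, output_is_good):
    """Return the first stage whose output is bad, or None if all are good.

    Assumes a single fault in a serial chain: stages before the fault
    test good, and the faulty stage and everything after it test bad.
    """
    low, high = 0, len(stages)  # stages[:low] known good, stages[high:] known bad
    while low < high:
        mid = (low + high) // 2
        if output_is_good(stages[mid]):
            low = mid + 1   # fault lies after the midpoint
        else:
            high = mid      # fault is at or before the midpoint
    return stages[low] if low < len(stages) else None


# Hypothetical network printing chain with a fault at the print server.
stages = ["application", "spooler", "network", "print server", "printer"]
faulty_index = stages.index("print server")


def output_is_good(stage):
    return stages.index(stage) < faulty_index


print(find_first_bad_stage(stages, output_is_good))  # print server
```

Each probe discards about half of the remaining stages, which is why the technique pays off most on long chains. It does not apply as written when more than one stage is faulty or when dependencies are cyclical.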
Reproducing symptoms
One of the core principles of troubleshooting is that reproducible
problems can be reliably isolated and resolved. Often considerable
effort and emphasis in troubleshooting is placed on reproducibility, that is,
on finding a procedure to reliably induce the symptom to occur.
Once this is done, systematic strategies can be employed to
isolate the cause or causes of a problem; the resolution generally
involves repairing or replacing those components which are at fault.
Intermittent symptoms
Some of the most difficult troubleshooting issues relate to symptoms that are only intermittent.
In electronics this often is the result of components that are
thermally sensitive (since resistance of a circuit varies with the
temperature of the conductors in it). Compressed air can be used to cool
specific spots on a circuit board and a heat gun can be used to raise
the temperatures; thus troubleshooting of electronic systems frequently
entails applying these tools in order to reproduce a problem.
In computer programming, race conditions
often lead to intermittent symptoms which are extremely difficult to
reproduce. Various techniques can be used to force the particular
function or module to be called more rapidly than it would be in normal
operation (analogous to "heating up" a component in a hardware circuit),
while other techniques can be used to introduce greater delays in, or
force synchronization among, other modules or interacting processes.
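As an illustration of forcing synchronization, the following sketch turns an intermittent lost update into one that occurs on every run. It is a contrived example rather than a general debugging tool: the function name `run_lost_update` is hypothetical, and the barrier is inserted purely to hold both threads between their read and their write.

```python
import threading


def run_lost_update():
    """Run two unlocked increments with a forced interleaving; return the final count."""
    state = {"counter": 0}
    both_have_read = threading.Barrier(2)  # releases only when both threads arrive

    def unsafe_increment():
        value = state["counter"]      # both threads read the same starting value
        both_have_read.wait()         # injected synchronization point
        state["counter"] = value + 1  # the second write overwrites the first

    threads = [threading.Thread(target=unsafe_increment) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["counter"]


print(run_lost_update())  # 1, not 2: one update was lost
```

Without the barrier the two increments would usually complete one after the other and the bug would rarely show. With it, the faulty interleaving is the only one possible, which gives the troubleshooter a known procedure to reproduce the symptom.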
Intermittent issues can be defined thus:
An intermittent is a problem for which there is no known procedure to consistently reproduce its symptom.—Steven Litt, [1]
In particular, Litt asserts that there is a distinction between
frequency of occurrence and a "known procedure to consistently
reproduce" an issue. For example, knowing that an intermittent problem
occurs "within" an hour of a particular stimulus or event, but
that sometimes it happens in five minutes and other times it takes
almost an hour, does not constitute a "known procedure" even if the
stimulus does increase the frequency of observable exhibitions of the
symptom.
Nevertheless, sometimes troubleshooters must resort to statistical
methods, and can only find procedures to increase the symptom's
occurrence to a point at which serial substitution or some other
technique is feasible. In such cases, even when the symptom seems to
disappear for significantly longer periods, there is low confidence
that the root cause has been found and that the problem is truly solved.
Also, tests may be run to stress certain components to determine if those components have failed.[4]
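The low confidence attached to statistically reproduced symptoms can be made concrete with a small calculation. The sketch below rests on a strong simplifying assumption that is not from the source: each trial is independent and, while the fault is present, shows the symptom with a fixed probability `failure_rate`. Under that assumption the chance of seeing n clean trials with the fault still present is (1 - failure_rate) ** n. The function name `clean_trials_needed` is hypothetical.

```python
import math


def clean_trials_needed(failure_rate, confidence):
    """Consecutive symptom-free trials needed before a fix is credible.

    Assumes independent trials that each showed the symptom with
    probability `failure_rate` before the attempted fix.
    """
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


print(clean_trials_needed(0.1, 0.95))  # 29
```

A symptom that appears in one trial out of ten therefore needs 29 clean trials in a row before there is 95% confidence that the fix, rather than luck, explains its absence. Real intermittent faults are rarely this well behaved, so the figure is better read as a lower bound on caution than as a guarantee.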
Multiple problems
Isolating single component failures which cause reproducible symptoms is relatively straightforward.
However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault tolerant systems, or those with built-in redundancy. Features which add redundancy, fault detection and failover to a system may also be subject to failure, and enough different component failures in any system will "take it down."
Even in simple systems the troubleshooter must always consider the
possibility that there is more than one fault. (Replacing each
component, using serial substitution, and then swapping each new
component back out for the old one when the symptom is found to persist,
can fail to resolve such cases. More importantly, the replacement of any
component with a defective one can actually increase the number of
problems rather than eliminating them.)
Note that, while we talk about "replacing components", the resolution
of many problems involves adjustments or tuning rather than
"replacement." For example, intermittent breaks in conductors, or
"dirty or loose contacts", might simply need to be cleaned and/or
tightened. All discussion of "replacement" should be taken to mean
"replacement or adjustment or other maintenance."