What is the most
challenging aspect of a Hardware Engineer's job? Is it the creation
of a functional specification for a board/system from sketchy or
non-existent customer requirements? Is it creating and reviewing the
schematics based on the functional specifications? Is it taking the
schematics through CAD and Signal/Power Integrity simulations? Does
the excitement during board bringup beat all of the above? Or is it
the creative tension (:-)) during software-hardware integration that
is more challenging? Undoubtedly, all these activities are what we
Engineers seek to do over and over again in our careers. But this
piece is about the activity that gives the ultimate adrenalin rush
and delight – the act of debugging, especially under the
constraints of an impending product launch or worse still, a
production-stop situation. Debugging complex issues requires us to
summon all our knowledge, gather data, analyze data, root cause the
issue and implement a fix. It is no exaggeration to liken the debug
process to a detective solving a complex crime or a doctor
diagnosing/curing a disease.
The first phase in the
debug process is denial. Most Engineers are very possessive about
their designs and believe that they can do no wrong. When a bug is
reported by a customer, the initial impulse is to attribute it to
wrong software programming, incorrect usage of hardware, exceeding
the recommended operating conditions and what not. Many a time, the
bug could be due to one of these reasons. But this is not always the
case. There ain't no such thing as a perfect design and there could
always be a corner case not taken care of by even the best of
designers. Once the designers get convinced that maybe the bug
report is genuine, the next major hurdle is to replicate the problem
in the lab. More often than not, the biggest challenge is to
replicate the problem well enough so that we have a sufficient number
of observations and data to chew on. Experience tells us that a
problem that is repeatable is solvable. After the problem is repeated
enough times and data (waveforms, register dumps) is gathered, it is
time to go through these with a fine-toothed comb to identify a
pattern and march towards an eventual solution. Sometimes, the
Engineers involved in the debug exercise get so caught up in
the maze that they miss the way out. In such situations, bouncing the
problem and observations off a colleague who is not in the debug
team is a very effective idea. The colleague could offer a lead that
could help you find the solution. The key ingredients for a
successful debug are disciplined experimentation,
recording/documenting observations, data analysis and many times, a
flash of insight based on gut feel. The preferred “solutions” are
software workarounds, then component value changes and, if all
else fails, a hardware revision. Remember never to waste a
crisis – the lessons learnt from the exercise should be formally
documented and incorporated into future projects.
In the course of my
career, I have been very fortunate to be part of many debug exercises
- some exciting, some mundane, most with a happy ending, a few
unsolvable. There are too many to recount, but a few instances came
to mind when I sat down to write this blog. In describing the
instances below, I have deliberately left out company names, product
part numbers and intricate technical details, in the interest of not
disclosing confidential and proprietary information. But, it would be
intellectually dishonest to not disclose the names of the colleagues
who participated in these sorties.
Story #1: Waking Up Blues
Power management is a
mandatory feature in modern day systems. In this project, the
System-On-Chip (SOC) is programmed to enter a power down mode when
there is no activity and to wake up when it receives a packet over
WiFi. The WiFi card is interfaced to the SOC via PCI Express (PCIe).
You can trust a Japanese customer to do the most rigorous testing
that can uncover any corner case bug. The customer had 1000 systems
banging away in the lab just before field deployment. Sixteen of
these systems crashed when they were exiting from the power down mode
on arrival of a WiFi packet. You can also trust Japanese customers
to generate tons of data for analysis. When the bug report came in,
we went through the usual denial and eventually understood the
criticality after going through the data set. Gokul, Sumant and I
tried to replicate the issue, first with WiFi and then with Ethernet
packets as the wakeup events. Despite many attempts, the problem
eluded us. Then we hit upon the idea of using a simpler scenario:
a timer interrupt as the wakeup event. Through structured experiments,
we found that the crash happened when the wakeup event occurred too
close to the power down event. The power down event takes a few clock
cycles to complete, and if a wakeup event occurs during this critical
window, the wakeup process is not graceful and causes a system
crash. With the problem reproduced, it was time to propose a solution.
The only way to avoid the crash was to block wakeup events during
the critical window. The channel for the wakeup event was PCIe. A
crude solution was to reset the PCIe serdes link just before entering
the power down mode. Since the link takes some time to recover, a
WiFi packet would not find its way into the SOC during the critical
window. But this could lead to loss of packets. The Japanese are
intolerant of crude fixes and insisted on a “proper” solution. At
this time, Sumant and I described the problem to Shibin, who is the
guru of intuitive solutions. He pointed us to the L0s low-power link
state of PCIe. This introduces an inherent latency when the link exits
from L0s, thereby ensuring that no wakeup event reaches the SOC during
the critical window. Sumant and Santosh travelled to Japan to demonstrate the
solution. The customer had no issue in accepting this solution since
it used a documented protocol feature, resulted in no packet loss and
avoided the crash. As an added bonus, the L0s mode gave extra power
saving. There is nothing more elegant than a proper Engineering
solution. In the final stages when Sumant and Santosh were testing
out the solution in Japan, I had boarded a flight to Paris. On
reaching the airport, I received a text message that the solution
was working and had been accepted by the customer, and my joy knew no bounds.
Eventually, Shankar and
Swati identified the design bug in the power management controller,
again after painstaking efforts.
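For readers who like to see the race in code, here is a minimal sketch of the critical-window behaviour described in this story. It is only a toy timing model, not the actual SOC logic: the window length, the link exit latency and the function names are all assumptions made up for illustration. It shows why a wakeup that lands inside the window is fatal, and why adding a link exit latency longer than the window removes the crash without dropping the event.

```python
import random

CRITICAL_WINDOW_NS = 100    # assumed duration of the power-down sequence
L0S_EXIT_LATENCY_NS = 500   # assumed link exit latency, longer than the window


def wakeup_crashes(powerdown_ns, wakeup_ns, link_exit_latency_ns=0):
    """True if the wakeup reaches the SOC while power-down is still in progress."""
    arrival_ns = wakeup_ns + link_exit_latency_ns
    return powerdown_ns <= arrival_ns < powerdown_ns + CRITICAL_WINDOW_NS


def count_crashes(trials, link_exit_latency_ns=0, seed=1):
    """Count crashes over many power-down cycles with randomly timed wakeups."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(trials):
        # Power-down starts at t=0; the wakeup fires somewhere in the next 10 us.
        wakeup_ns = rng.uniform(0, 10_000)
        if wakeup_crashes(0, wakeup_ns, link_exit_latency_ns):
            crashes += 1
    return crashes


if __name__ == "__main__":
    print("crashes without added latency:", count_crashes(100_000))
    print("crashes with L0s exit latency:", count_crashes(100_000, L0S_EXIT_LATENCY_NS))
```

In this model only a small fraction of randomly timed wakeups hit the window, which is in the spirit of a handful of failures across a large population of systems, and it also hints at why a timer interrupt with controllable timing made the problem so much easier to reproduce.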
Story #2: Brushing up the laws of Physics
Pressman's book on power supply design
is dedicated to the “Power supply designer, the unsung hero of
system design”. Well, it is no exaggeration. The design, testing
and debugging of power supplies require us to summon all our
knowledge of Electronics, Electrical Engineering and many times, even
fundamental Physics. More than half a decade back, Portable Media
Players (PMP) were starting to gain popularity, before Smartphones
drove out that entire market segment. We had designed a form-factor
ready PMP reference design that was intended to be demonstrated at
CES. Just a couple of weeks prior to CES, the battery-based SMPS
was not working properly at full load. Ambudhar and I had
designed the circuit, and the guidelines in a Linear Technology
application note were followed to the letter, or so we thought. Ritesh,
who had recently joined, has deep expertise in power supply design
(among many other things). After several intense debug sessions,
Ritesh identified that the inductor core was saturating due to the
excess current and this was causing the inductor not to behave as
one. The challenge was to get an inductor with a better core. The
usual sources such as Digikey and Mouser did not have the required
inductors in stock. The lead time for a custom inductor was too
long. Ritesh came up with the idea of hand-winding an
inductor, provided we could get the core. A search showed a Hong
Kong-based company that had the required ferrite core in its catalog. I
made a frantic cold call to this company and to our surprise the
gentleman agreed to send us tens of cores as samples. As luck would
have it, the parts arrived on FedEx priority in no time. But the
challenge was that Ritesh had to travel to Mumbai for his wedding.
It was a real delight to watch him wind the copper wires on to the
cores, just in time before he flew out. Ambudhar and I continued the
debug, constantly checking with Ritesh on the phone. By the time
Ritesh reached his home, we had tested out the full load case
successfully, with the inductor finally starting to behave itself.
:-)
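The saturation check that bit us can be written down in a few lines. The sketch below assumes an ideal buck converter in continuous conduction, which may or may not match the actual topology of that design, and every component value in it is invented for illustration. It estimates the peak inductor current at a given load and compares it with the saturation current rating of the core.

```python
def buck_peak_inductor_current(v_in, v_out, i_load, inductance_h, f_sw_hz):
    """Peak inductor current of an ideal buck converter in continuous conduction."""
    duty = v_out / v_in                                         # ideal duty cycle
    ripple = (v_in - v_out) * duty / (inductance_h * f_sw_hz)   # peak-to-peak ripple
    return i_load + ripple / 2                                  # load current plus half the ripple


def saturates(i_peak, i_sat):
    """True if the peak current exceeds the inductor's saturation rating."""
    return i_peak > i_sat


if __name__ == "__main__":
    # Hypothetical numbers: Li-ion cell input, 1.2 V rail, 1.5 A full load.
    i_peak = buck_peak_inductor_current(3.7, 1.2, 1.5, 4.7e-6, 1e6)
    print(f"peak inductor current: {i_peak:.3f} A")
    print("original core (1.2 A rating) saturates:", saturates(i_peak, 1.2))
    print("better core (2.0 A rating) saturates:", saturates(i_peak, 2.0))
```

Once the core saturates, the effective inductance collapses and the ripple grows far beyond what this formula predicts, which is why the supply misbehaved only at full load.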
Story #3: No level playing “ground”
This story happened a
decade and a half ago, but is still very vivid in my memory. From the
very beginning of my career, I had realized that hardware engineers
are from Mars and software engineers from Venus. The thought
processes and approaches to problem solving are quite different, and
rightly so. The fun part is the buck passing when things go wrong.
The software guys think that their code is perfect and the fault lies
with the hardware and the hardware guys reciprocate the same feeling.
The procedure we had was to test a board fully using test scripts
developed by hardware folks and hand over the boards along with the
programming considerations document to the software team. Often the
recommended programming values/sequences or even board handling
instructions are violated resulting in unexpected crashes or even
board damage due to ESD issues. Tensions, escalations, delays ensue
until finally the folks from the different planets converge at the
lab bench to jointly debug and resolve the issue. A tested terminal
controller card was handed over to the software team for code
development. After a couple of days of struggle, the software team
returned the board classifying it as dead-on-arrival as the UART port
was not coming up. The hardware engineer took it to his setup and his
diagnostic tests passed. With the ball back in their court, the
software folks reluctantly reviewed their code, but still the problem
persisted. Enter Rajendra (aka RK), whose debug style is unique. He
needs very little data to zero in on the solution. After
receiving a quick dump of the problem, he took a multimeter and
measured the ground difference between the PC's COM port and the
Controller card. A large difference meant that the signals were not
being understood by the PC and the Controller card. RK surmised that
the earth connection of the AC mains socket into which the software
guy's PC was plugged had a connectivity issue, and with the
Electrician confirming and fixing this, a level playing “ground”
was established.
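Why a ground difference breaks a serial link can be shown with a small sketch. It assumes the COM port uses RS-232 signalling, where a receiver treats anything between -3 V and +3 V as undefined, and it assumes a driver swing of ±5 V; the offset values are hypothetical, not measurements from that day.

```python
RX_THRESHOLD_V = 3.0   # RS-232 receivers treat levels between -3 V and +3 V as undefined
DRIVER_SWING_V = 5.0   # assumed driver output magnitude


def decode_rs232(driver_v, ground_offset_v):
    """Classify the level the receiver sees once the ground offset is added."""
    seen_v = driver_v + ground_offset_v
    if seen_v >= RX_THRESHOLD_V:
        return "space"
    if seen_v <= -RX_THRESHOLD_V:
        return "mark"
    return "undefined"


if __name__ == "__main__":
    for offset_v in (0.0, 3.0, 9.0):   # hypothetical ground differences in volts
        mark = decode_rs232(-DRIVER_SWING_V, offset_v)
        space = decode_rs232(+DRIVER_SWING_V, offset_v)
        print(f"offset {offset_v:4.1f} V: mark decoded as {mark}, space decoded as {space}")
```

With no offset both levels decode correctly. A few volts of offset pushes the mark level into the undefined band, and a larger offset makes a mark look like a space, so neither side understands the other even though each board passes its own diagnostics.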
Many other stories come
to my mind – the idle channel noise and echo canceller debug with
Jayaram, PMP hang issue debug with Ambudhar, the many Quicc Engine
debug episodes with Sumant and the IEEE1588 debug with Venkat and
Pawan. More on these in a later blog.
For an excellent reference
on the essentials of debugging, check out this article by Prof
Terence Parr.
really nice to start sharing information on debugs; normally these are only available when you are a participant ;-) having worked on PCIe before, i found the first one really interesting.
Thanks for your comments, Manish.
Excellent post on debugging. Completely agree that it is most exciting phase as well as best source of learning. I have always felt that debugging is balance of observation and analysis. Most of the time we get caught in one phase without spending enough effort in other.