Saturday, December 22, 2012

Debug Stories

What is the most challenging aspect of a Hardware Engineer's job? Is it the creation of a functional specification for a board/system from sketchy or non-existent customer requirements? Is it creating and reviewing the schematics based on the functional specifications? Is it taking the schematics through CAD and Signal/Power Integrity simulations? Does the excitement during board bringup beat all of the above? Or is it the creative tension (:-)) during software-hardware integration that is more challenging? Undoubtedly, all these activities are what we Engineers seek to do over and over again in our careers. But this piece is about the activity that gives the ultimate adrenalin rush and delight – the act of debugging, especially under the constraints of an impending product launch or worse still, a production-stop situation. Debugging complex issues requires us to summon all our knowledge, gather data, analyze data, root cause the issue and implement a fix. It is no exaggeration to liken the debug process to a detective solving a complex crime or a doctor diagnosing/curing a disease.
The first phase in the debug process is denial. Most Engineers are very possessive about their designs and believe that they can do no wrong. When a bug is reported by a customer, the initial impulse is to attribute it to wrong software programming, incorrect usage of hardware, exceeding the recommended operating conditions and what not. Many a time, the bug could be due to one of these reasons. But this is not always the case. There ain't no such thing as a perfect design and there could always be a corner case not taken care of by even the best of designers. Once the designers are convinced that the bug report may be genuine, the next major hurdle is to replicate the problem in the lab. More often than not, the biggest challenge is to replicate the problem often enough that we have a sufficient number of observations and data to chew on. Experience tells us that a problem that is repeatable is solvable. After the problem is repeated enough times and data (waveforms, register dumps) is gathered, it is time to go through these with a fine-toothed comb to identify a pattern and march towards an eventual solution. Sometimes, the Engineers involved in the debug exercise get so caught up in the maze that they miss the way out. In such situations, bouncing the problem and observations off a colleague who is not in the debug team is a very effective idea. The colleague could throw up a lead that helps you find the solution. The key ingredients for a successful debug are disciplined experimentation, recording/documenting observations, data analysis and many times, an idea flash based on gut feel. The preferred “solutions” are, in order, software workarounds, component value changes and, if all else fails, a hardware revision. Remember never to waste a crisis – the lessons learnt from the exercise should be formally documented and incorporated into future projects.
In the course of my career, I have been very fortunate to be part of many debug exercises - some exciting, some mundane, most with a happy ending, a few unsolvable. There are too many to recount, but a few instances came to mind when I sat down to write this blog. In describing the instances below, I have deliberately left out company names, product part numbers and intricate technical details, in the interest of not disclosing confidential and proprietary information. But it would be intellectually dishonest not to disclose the names of the colleagues who participated in these sorties.
 
Story #1: Waking Up Blues
 
Power management is a mandatory feature in modern day systems. In this project, the System-On-Chip (SOC) is programmed to enter a power down mode when there is no activity and to wake up when it receives a packet over WiFi. The WiFi card is interfaced to the SOC via PCI Express (PCIe). You can trust a Japanese customer to do the most rigorous testing that can uncover any corner case bug. The customer had 1000 systems banging away in the lab just before field deployment. Sixteen of these systems crashed when they were exiting from the power down mode on arrival of a WiFi packet. You can also trust the Japanese customers to generate tons of data for analysis. When the bug report came in, we went through the usual denial and eventually understood the criticality after going through the data set. Gokul, Sumant and I tried to replicate the issue, first with WiFi and then with Ethernet packets as the wakeup events. Despite many attempts, the problem eluded us. Then we struck upon the idea of using a simpler scenario – a timer interrupt as the wakeup event. Through structured experiments, we found that the crash happens when the wakeup event occurs too close to the power down event. The power down event takes a few clock cycles to complete, and if a wakeup event occurs during this critical window, the wakeup process is not graceful and causes a system crash. With the problem repeated, it was time to propose a solution. The only way to avoid the crash was to block wakeup events during the critical window. The channel for the wakeup event was PCIe. A crude solution was to reset the PCIe SerDes link just before entering the power down mode. Since the link takes some time to recover, a WiFi packet would not find its way into the SOC during the critical window. But this could lead to loss of packets. The Japanese are intolerant of crude fixes and insisted on a “proper” solution. At this time, Sumant and I described the problem to Shibin, who is the guru of intuitive solutions. 
He pointed us to the L0s power down mode of PCIe. This introduces an inherent latency when the link exits from power down, thereby ensuring that no wakeup event reaches the SOC during the critical window. Sumant and Santosh travelled to Japan to demonstrate the solution. The customer had no issue in accepting this solution since it used a documented protocol feature, resulted in no packet loss and avoided the crash. As an added bonus, the L0s mode gave extra power saving. There is nothing more elegant than a proper Engineering solution. In the final stages, when Sumant and Santosh were testing out the solution in Japan, I had boarded a flight to Paris. On reaching the airport, I received a text message that the solution was working and had been accepted by the customer, and my joy knew no bounds.
Eventually, Shankar and Swati identified the design bug in the power management controller, again after painstaking efforts.
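The race described in this story can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual SOC behaviour: the window length, the sweep range and the function names are hypothetical, and the model assumes that wake events only originate after the power down sequence has started and that the link exit latency simply delays delivery of the event to the SOC.

```python
def wake_crashes(arrival_cycle, window_cycles, link_exit_latency=0):
    """Return True if a wake event lands inside the power-down critical window.

    arrival_cycle: cycles after power-down starts at which the packet
                   reaches the link (assumed >= 0).
    window_cycles: cycles the power-down sequence needs to complete
                   (hypothetical value, not from the real design).
    link_exit_latency: extra cycles before the link hands the event to
                       the SoC, e.g. the exit latency of a low-power
                       link state (assumed to be a fixed delay).
    """
    delivered_at = arrival_cycle + link_exit_latency
    return delivered_at < window_cycles


def count_crashing_offsets(window_cycles, max_arrival_cycle, link_exit_latency=0):
    """Sweep every arrival offset in [0, max_arrival_cycle) and count the crashing ones."""
    return sum(
        wake_crashes(arrival, window_cycles, link_exit_latency)
        for arrival in range(max_arrival_cycle)
    )


if __name__ == "__main__":
    WINDOW = 8   # assumed length of the critical window, in cycles
    SWEEP = 500  # assumed range of arrival offsets to try

    # Without any added latency, every offset inside the window crashes.
    print(count_crashing_offsets(WINDOW, SWEEP))
    # With an exit latency at least as long as the window, none do.
    print(count_crashing_offsets(WINDOW, SWEEP, link_exit_latency=WINDOW))
```

The sweep also suggests why a timer interrupt was the better reproduction tool: it lets the arrival offset be stepped deliberately instead of waiting for a packet to land in a window only a few cycles wide.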
 
Story #2: Brushing up the laws of Physics
 
Pressman's book on power supply is dedicated to the “Power supply designer, the unsung hero of system design”. Well, it is no exaggeration. The design, testing and debugging of power supplies requires us to summon all our knowledge of Electronics, Electrical Engineering and many times, even fundamental Physics. More than half a decade back, Portable Media Players (PMP) were starting to gain popularity, before Smartphones drove out that entire market segment. We had designed a form-factor ready PMP reference design that was intended to be demonstrated at the CES. Just a couple of weeks prior to CES, the battery based SMPS was not working properly at full load. Ambudhar and I had designed the circuit, and the guidelines in the Linear Technology Application Note had been followed to the letter, or so we thought. Ritesh, who had recently joined, has deep expertise in power supply design (among many other things). After several intense debug sessions, Ritesh identified that the inductor core was saturating due to the excess current and this was causing the inductor not to behave as one. The challenge was to get an inductor with a better core. The usual sources such as Digikey and Mouser did not have the required inductors in stock. The lead time to get a custom inductor delivered was too long. Ritesh came up with the idea of hand-winding an inductor, provided we could get the core. A search showed a Hong Kong based company that had the required ferrite core in its catalog. I made a frantic cold call to this company and, to our surprise, the gentleman agreed to send us tens of cores as samples. As luck would have it, the parts arrived on FedEx priority in no time. But the challenge was that Ritesh had to travel to Mumbai for his wedding. It was a real delight to watch him wind the copper wires on to the cores, just in time before he flew out. Ambudhar and I continued the debug, constantly checking with Ritesh on the phone. 
By the time Ritesh reached his home, we had tested out the full load case successfully, with the inductor finally starting to behave itself. :-)
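A quick sanity check of the kind that would flag this problem on paper is sketched below. The story does not say which converter topology was used, so this assumes a buck (step-down) converter in continuous conduction and uses the standard textbook ripple formula for it. All component values and function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def buck_peak_inductor_current(v_in, v_out, i_out, inductance, f_sw):
    """Peak inductor current of an ideal buck converter in continuous conduction.

    Uses the textbook relations D = Vout / Vin and
    ripple = (Vin - Vout) * D / (L * f_sw); the peak is the load
    current plus half the peak-to-peak ripple.
    """
    duty = v_out / v_in
    ripple = (v_in - v_out) * duty / (inductance * f_sw)
    return i_out + ripple / 2.0


def core_saturates(i_peak, i_sat, margin=0.0):
    """True if the peak current exceeds the (optionally derated) saturation rating."""
    return i_peak > i_sat * (1.0 - margin)


if __name__ == "__main__":
    # Hypothetical full-load operating point, not the values from the PMP design.
    i_peak = buck_peak_inductor_current(
        v_in=4.0, v_out=1.0, i_out=2.0, inductance=1.5e-6, f_sw=1e6
    )
    print(f"peak inductor current: {i_peak:.2f} A")
    print("saturates with 2.0 A core:", core_saturates(i_peak, i_sat=2.0))
    print("saturates with 3.0 A core:", core_saturates(i_peak, i_sat=3.0))
```

Once the core saturates, the effective inductance collapses and the ripple grows far beyond what the formula predicts, which is why the check has to be made against the peak current at full load rather than the average.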
 
Story #3: No level playing “ground”
 
This story happened a decade and a half back, but is still very vivid in my memory. From the very beginning of my career, I had realized that hardware engineers are from Mars and software engineers from Venus. The thought processes and approaches to problem solving are quite different, and rightly so. The fun part is the buck passing when things go wrong. The software guys think that their code is perfect and the fault lies with the hardware, and the hardware guys reciprocate the same feeling. The procedure we had was to test a board fully using test scripts developed by hardware folks and hand over the boards along with the programming considerations document to the software team. Often the recommended programming values/sequences or even board handling instructions are violated, resulting in unexpected crashes or even board damage due to ESD issues. Tensions, escalations and delays ensue until finally the folks from the different planets converge at the lab bench to jointly debug and resolve the issue. A tested terminal controller card was handed over to the software team for code development. After a couple of days of struggle, the software team returned the board, classifying it as dead-on-arrival as the UART port was not coming up. The hardware engineer took it to his setup and his diagnostic tests passed. With the ball back in their court, the software folks reluctantly reviewed their code, but still the problem persisted. Enter Rajendra (aka RK), whose debug style is unique. He needs very little data to zoom in on the solution. After receiving a quick dump of the problem, he took a multimeter and measured the ground difference between the PC's COM port and the Controller card. A large difference meant that the signals were not being understood by the PC and the Controller card. 
RK surmised that the earth connection of the AC mains socket into which the software guy's PC was plugged had a connectivity issue, and with the Electrician confirming and fixing this, a level playing “ground” was established.
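Why a ground difference breaks a serial link can be shown with a small sketch. It assumes the COM port uses RS-232 signalling, where a receiver treats +3 V and above as a space, -3 V and below as a mark, and anything in between as undefined. The transmit level and the ground offsets are hypothetical, and the offset is modelled as a simple DC shift, which is a simplification of what a faulty earth connection really does.

```python
RS232_THRESHOLD_V = 3.0  # receiver threshold magnitude for a valid level


def received_level(tx_level_v, ground_offset_v):
    """Voltage the receiver sees, measured against its own ground.

    tx_level_v: line voltage relative to the transmitter's ground.
    ground_offset_v: receiver ground minus transmitter ground
                     (modelled here as a DC shift, an assumption).
    """
    return tx_level_v - ground_offset_v


def classify(level_v):
    """Classify a received voltage as 'space', 'mark' or 'undefined'."""
    if level_v >= RS232_THRESHOLD_V:
        return "space"
    if level_v <= -RS232_THRESHOLD_V:
        return "mark"
    return "undefined"


if __name__ == "__main__":
    # Hypothetical -6 V mark sent by the card, seen by the PC under three conditions.
    for offset in (0.0, -2.0, -5.0):
        level = received_level(-6.0, offset)
        print(f"ground offset {offset:+.1f} V -> receiver sees {level:+.1f} V ({classify(level)})")
```

With grounds at the same potential the mark is read correctly, but a large enough offset pushes the same signal into the undefined band, which matches a port that simply does not come up on one bench while the board passes diagnostics on another.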
Many other stories come to my mind – the idle channel noise and echo canceller debug with Jayaram, the PMP hang issue debug with Ambudhar, the many QUICC Engine debug episodes with Sumant and the IEEE 1588 debug with Venkat and Pawan. More on these in a later blog.
For an excellent reference on the essentials of debugging, check out this article by Prof Terence Parr.

3 comments:

  1. really nice to start sharing information on debugs; normally these are only available when you are a participant ;-) having worked on PCIe before, i found the first one really interesting.

  2. Excellent post on debugging. Completely agree that it is most exciting phase as well as best source of learning. I have always felt that debugging is balance of observation and analysis. Most of the time we get caught in one phase without spending enough effort in other.
