
Update (Good News!) on my Stress Level - The Helicopter is FLYING!

Thursday, January 31, 2013

Way, way back in the mists of time (97 days ago according to SP) my status update (in several chunks) read:

In order to understand my job right now, imagine that you have a Lego set that can be built up as either a cargo truck or a helicopter. Now imagine that you've built the truck. Now imagine that overnight someone broke into your house and glued half the pieces together. Now imagine that you've been given the task to build a helicopter. It doesn't have to be exactly the same as the original helicopter, it just has to work like it, but you aren't allowed to use any pieces that weren't there in the original set. You don't *have* to use the pieces that were glued together, but so many of them were glued together that it would be difficult to make anything useful without using some of them, glue and all. I'm choosing to look at this as an opportunity for creativity, not as a challenge. It's the only way to keep sane.

Oh, what prophetic words!

To give some context for what was actually going on, it helps to know what I actually do for a living. My company designs and manufactures microchips that go into networks, storage systems, and large RF systems. The chips we design aren't household names, and most of the companies we sell to aren't household names either. However, there is an excellent chance that each and every byte of data that you send over the internet or store "on the cloud" will go through one of our chips, most likely many of them. There are even a couple of microchips we've built that might be in your home today, depending upon where you live and what type of broadband connection you have.

When I say we manufacture microchips, what I really mean is that we contract with large foundries in Asia who actually do the manufacturing, then contract with other large companies (mostly in Asia) to do the assembly and packaging and testing, although we do some of our testing ourselves. We design the chips, develop the software that runs on them, create the test programs, do the debugging, and then sell them. Hopefully through all of this, we make money.

As I said, we have the chips manufactured at a foundry. The steps involved in doing this are many, and I'm not going to go through them all, but the Wikipedia page on "Microfabrication" has a reasonable description. The pictures down the side show a representative set of steps for a relatively simple process with a single metal layer (we use much more advanced process steps with many more metal layers), but they show how everything is built up using layer after layer of photolithographic masks. Think of each mask as a negative for a picture, and think of a completed microchip as a stack of 20-40 layers of different materials, each patterned by a single mask. Depending upon what a mask is used for and how precise it has to be, it can be anywhere from relatively cheap (less than $10,000) to quite expensive (more than $100,000) to produce, and as a result a complete "mask set" required to produce a particular type of microchip is an expensive piece of equipment. Because of this, creative people (like me) are often asked to make changes and fix errors by editing a smaller set of masks. It's quite common to talk about "all-layer" revisions and "metal-only" revisions, meaning revisions that change all mask layers compared with revisions that only change metal layers. Another class of revision, a "base-layer" revision, isn't all that common because the masks for the lowest level of the microelectronic stack (steps 1-8 in the Wikipedia figure) are relatively expensive, and more often than not if you have to edit the base layers it's because the error is so massive that you may as well do an all-layer revision.

Metal-only revisions have another advantage: time. When a company like mine orders a new type of chip for the first time, we often start more wafers (the big circular silicon substrates that microchips are built on) than we need. After the base layers are processed, we will "hold" some of the wafers, meaning that they're taken off the production line and safely stored away at the foundry until such time as we re-start them. This is an insurance policy of a sort, ensuring that if we find errors in the first run of chips we have the opportunity to re-start the processing on the held wafers with a new set of metal masks to fix the errors. This can be very important because it often takes more than half the processing time to process the base layers, and less than half the time to do the metals. When you have customers waiting for your chips, the time that you can save by using the held wafers can be incredibly valuable!

We've had a chip back in our labs since June, and like most new devices it had what can politely be called "issues". OK, they were bugs. The most serious bugs were in a circuit whose development we had sub-contracted out to a company in Maryland, and identifying what was wrong took a massive amount of work. We had held wafers, and we had tight customer commit dates, and given that the fixes to the sub-contracted block were metal-only, we decided we would attempt to do a metal-only revision. Of course, we then had to fix the rest of the bugs. We had a "hit list", and there were two blocks that had minor problems, but we had another block that had major problems - a key feature of this chip, one that our customers were depending on, did not work. A co-worker and I were given the task of deciding if we could indeed do a metal-only revision. After looking at it for a couple of days, we decided we had a fighting chance, and given that we really didn't have time to do an all-layer revision we said we could do it.

This is when I made my update.

In this context, the Lego pieces are the individual transistors that the block was originally built from. The transistors, being on the base layers, were in fixed locations, and many of them were interconnected with diffusion and polysilicon, also base layers. That's where the "glue" comes from. We were free to re-wire the glued-together Lego (i.e. the interconnected transistors) in the bottom three layers of metal. I think Barb's the only person who really understood what I was talking about.

It was not easy, and involved a lot of cursing and late nights as we continually fought with the original design (it was done badly) and had to find ever more creative ways of using the transistors we had. Along the way we actually found several new bugs in the original design, bugs we did not know about when we started. Had we known about them at the start, I suspect we would have said "all layers, please". Our stress levels went through the roof, and I burned out. I took a couple of unplanned days off (not enough) but returned to work and we eventually succeeded, came up with robust solutions to all the bugs, and sent the revised metal masks off to the foundry. Over that period of time I made various cryptic status updates about helicopters and was probably not the nicest person to be around.

December was a quiet time and my stress levels went back down to a more typical level. The foundry was processing the new metal layers on the old base layers, and then we got the new revision back in early January.

Disaster. Not only did it not look like we had fixed the original bugs, but it looked like we had created new ones. The helicopter wasn't flying - it was crashing and creating collateral damage. For the past three weeks, we have been staring at it, trying to figure out what was up, trying all sorts of different tests and simulating all sorts of crazy scenarios. Nothing made sense. Sometimes, under some conditions, the block would work perfectly. Other times it would fail in ways that made no sense, and we were looking at the very real likelihood that we wouldn't be able to meet our customer commit dates. As you can imagine, my stress levels have been going up. I've tried to compartmentalize and keep the work stress at work, but I've not been 100% successful on that. Thankfully my Achilles injury has healed, so I've been able to re-start running at lunch hour, which lets me burn off some stress and shut off my brain for a while in the middle of the day.

Today we found the root cause of the issues we're having in the lab. In order to clean up a problem on an unrelated block, we had made a modification to the power supply for that block. In doing so, we modified the power supplies for all the blocks, including the one we fixed. As it turns out, the problem isn't with the block, it's with its power supply. Continuing the helicopter analogy, the problem wasn't with the helicopter itself; we had put sugar in the gas tank. Remove the sugar, and the helicopter flies.

There are probably a couple more days to follow up on this front, but when I look ahead in time, I see my stress level dropping rapidly. I so need this.