Balance of Power: Sleuthing Through Your Code
by DAVE EVANS
The night was well advanced, but the bright glow of fluorescent lamps misrepresented time. As I sat back in my comfortable chair, rubbing tired eyes, I wondered what the venerable but fictional Mr. Sherlock Holmes would offer me as advice. Perhaps because I was so weary from the long hours of debugging, I easily imagined Mr. Holmes sitting near me in a tweed suit smoking his pipe. Certainly he would address me as he once addressed his compatriot Dr. Watson, with a slightly condescending tone, and he would tell me that in my debugging I was missing the key iota of information.
At that moment, a solitary number seemed brighter on my monitor. Perhaps I have an overactive imagination, but it seemed as if MacsBug were magically illuminating that crucial, overlooked information. My computer was at interrupt level 2, yet it was waiting for a driver request to complete. How could I have missed the interrupt level earlier? It was no wonder that the computer froze. My software had most likely called the driver synchronously at exactly the wrong time. The voice of Mr. Holmes rang again in my ears. This time he quoted from that unfortunate story "A Case of Identity" when he said, "It has long been an axiom of mine that the little things are infinitely the most important."
Sir Arthur's famous detective was unsurpassed as an observer of detail. He believed that keen attention to all things -- even the mundane -- was the key to good detective work. In debugging software, I've found this advice is also true. Although many software bugs can be solved quite easily, the most challenging problems demand more attention. This is especially true of crashes or freezes in your software. To find the detail we need for those, we often have to go below source-level tools and get comfortable with lower-level aids.
In this column I'll take you through some low-level debugging techniques. I'll start with basic strategy and then discuss particular methods and examples. Although many details will be PowerPC-specific, much of the information here is useful on all Macintosh computers.
THE STRATEGY OF A SLEUTH
The experienced engineer starts with a basic strategy when faced with a troublesome software crash or freeze. The strategy is similar to Mr. Holmes's approach to solving difficult crimes. Using the scientific method, he starts by collecting key information and details. When he has finished researching, he begins to analyze the information and eliminates hypothesis after hypothesis. Once close to a solution, he seeks out more detail to narrow his suspects to a single culprit. Similarly, your strategy for debugging software should start with careful observation and research. Then you should hypothesize, test your theories, and collect more detail. This narrowing approach will draw you closer to the pernicious coding error in your software.
It's tempting when faced with a difficult crash to experiment instead of researching it first. But beware! Don't just reimplement your code with new approaches until it stops crashing. Though some may cynically suggest that that's the Macintosh way to program, don't be lulled into this strategy. I've found that it usually produces unstable code and ultimately takes longer than researching the original problem.
In researching a crash or a freeze, the private bug detective should first ask these few basic questions:
- What kind of crash or freeze is this?
- What code did the computer stop in?
- How did I get to that code?
For these, you'll need a low-level debugger (such as MacsBug). Let's look at each one in turn.
GET YOUR BEARINGS
The first step is to determine the kind of problem you've got. For crashes there are a number of possible problems, including the all-too-familiar illegal instruction and bus errors. Note that PowerPC exception handlers don't currently distinguish between these or other types. In MacsBug the correct type will be reported, but your debugger may instead describe all crashes as general spurious interrupts or type 11 errors.
If your crash is from an illegal instruction error, it's possible that the processor jumped to an invalid address or the intended code moved in memory. In this case you'll notice (in a disassembly where execution stopped) that most instructions are invalid or nonsense. This can also occur if the emulator tries to emulate PowerPC code, or if the processor tries to execute 680x0 code as PowerPC code. Try disassembling memory as both PowerPC code (using ipp pc) and 680x0 code (using ip pc).
If your crash is from a bus error, the most likely cause is an invalid address in some register. Disassemble memory where execution stopped and examine the instructions. If there are instructions that dereference registers, inspect those registers for addresses that aren't in a valid range. If you're debugging 680x0 code on a Power Macintosh, you'll need to look at all the instructions near the crash, because the 680x0 emulator won't tell you exactly which instruction caused the error.
Researching a freeze requires a different approach. If the freeze prevents you from using any debugging tools, you must isolate the offending code by watching the computer execute up to the freeze. Setting breakpoints, tracing, and stopping execution at known locations will bring you closer. This approach is slow but will lead you to the code that caused the error or to the state that prompted it. If the computer is frozen but you can still use debugging tools, it's very possible that you're in an infinite loop.
THE LAYOUT OF THE CRIME SCENE
Sherlock Holmes sometimes astonished readers by deducing crimes just from hearing second-hand details. He was also known, however, to walk the back alleys of London and gumshoe the scene of a crime when necessary. Learning the layout of the crime scene was crucial for a number of his deductions. When staring at your newly crashed software, do you recognize the code that your debugger is displaying? Disassemble memory near the location of the crash and snoop around for clues. Check for the following to determine how your computer came to this final resting place:
- If you're using MacsBug, use the wh pc command to check where the code is.
- Display memory and disassemble from the beginning of the code's block of memory.
- Does the code nearby reference strings or Gestalt selectors?
- Look for text symbols and strings in the code.
If you've crashed in PowerPC code, most low-level debuggers will give great information about where you are. This is because most PowerPC code is registered and linked using the Code Fragment Manager, which these debuggers can access for hints. For example, if you use the wh pc command in MacsBug, after crashing in PowerPC code you'll see something like this:
    Address 000BAE18 is in the System heap at 00002800 at NQDColor2Index+00018
    The address is in a CFM fragment "NQD"
    It is 0001AD28 bytes into this heap block:
         Start    Length      Tag  Mstr Ptr  Lock
      *  000A00F0 0003DB00+04  R   00002AC4   L
Here we see that the computer crashed at a location 24 bytes from the beginning of the NQDColor2Index routine. This routine is in the NQD (or Native QuickDraw) code fragment. Since this address is close to the beginning of the routine, we can disassemble from its start and examine the six instructions that executed before the crash for more clues:
    Disassembling PowerPC code from bae00
      NQDColor2Index
         +00000 000BAE00   li     r5,0x0000
         +00004 000BAE04   lwz    r4,TheGDevice(r0)
         +00008 000BAE08   sth    r5,QDErr(r0)
         +0000C 000BAE0C   stw    r31,-0x0004(SP)
         +00010 000BAE10   lwz    r5,0x0000(r4)
         +00014 000BAE14   addi   r31,r3,0x0000
         +00018 000BAE18  *lwz    r3,0x000C(r5)
A bus error at NQDColor2Index+00018 would occur if register R5 contained an invalid address. Look at the register display to validate that hypothesis. Notice in the code that R5 is a dereference of R4, which comes from the low-memory global TheGDevice. Here we crashed because TheGDevice had become invalid, so now your investigation turns toward that global.
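The offset arithmetic behind that reading is easy to double-check. Here is a minimal sketch in Python, assuming only what the disassembly shows (the routine starts at 000BAE00 and the flagged instruction sits at +00018) and the fact that every PowerPC instruction is 4 bytes long. The function names are my own, invented for illustration.

```python
PPC_INSTRUCTION_SIZE = 4  # every PowerPC instruction is one 32-bit word


def crash_address(routine_start, offset):
    """Absolute address of the instruction at routine_start+offset."""
    return routine_start + offset


def instructions_before(offset):
    """Number of whole instructions between the routine's start and offset."""
    if offset % PPC_INSTRUCTION_SIZE != 0:
        raise ValueError("PowerPC instructions are word-aligned")
    return offset // PPC_INSTRUCTION_SIZE


start = 0x000BAE00   # NQDColor2Index, taken from the disassembly
offset = 0x18        # the +00018 reported by the debugger

print(f"{crash_address(start, offset):08X}")   # 000BAE18
print(offset, instructions_before(offset))     # 24 6
```

The same subtraction works in the other direction: take the program counter, subtract the start of the routine or heap block, and compare the result with the offset your debugger reports.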
A freeze will typically occur because of a double page fault or exception or because of an infinite loop. Synchronous driver calls will also freeze if called when the interrupt level is above 0. A double fault or exception is common only if you're writing driver software. Your computer can handle only one page fault or exception at a time. A double fault or exception occurs when software that services a fault subsequently causes a second fault. For example, disk drivers are sometimes called by the Virtual Memory Manager to help service page faults; therefore, if you develop a disk driver you must take care not to cause page faults since you may be asked to service one as well.
A good way to detect infinite loops is to trace for a few instructions using your debugger. If you notice the same set of instructions being repetitively executed, you could be in an infinite loop. Look at branch instructions for clues to why the loop isn't completing. A special case of these loops is the vSyncWait routine. It looks like this:
    MOVE.W   $0010(A0),D0
    BGT.S    *-6
This tight loop is waiting for the two-byte value located 16 bytes from register A0 to become 0 or negative. This is a standard sequence to wait for a driver request to complete. The driver request is described in an IOParam record pointed to by register A0. When the driver is done servicing the request, it will interrupt the loop and modify the ioResult field 16 bytes into that record. It will then return from the interrupt, and the loop will complete normally. A freeze in this loop means the driver isn't servicing the request. If you typed dm a0 iopb in MacsBug, you might see something like this:
    Displaying IOParamBlockRec at 000003A4
     000003A4  qLink         NIL
     000003A8  qType         0002
     000003AA  ioTrap        A003
     000003AC  ioCmdAddr     NIL
     000003B0  ioCompletion  NIL
     000003B4  ioResult      0001
     000003B6  ioNamePtr     NIL
     000003BA  ioVRefNum     0008
     000003BC  ioRefNum      FFDF
     000003BE  ioVersNum     #0
     000003BF  ioPermssn     #23
     000003C0  ioMisc        NIL
     000003C4  ioBuffer      01C7E2B0
     000003C8  ioReqCount    00010000
     000003CC  ioActCount    00010000
     000003D0  ioPosMode     0001
     000003D2  ioPosOffset   1B84AA00
Take note of the ioTrap and ioRefNum fields. In this case, ioTrap is $A003, which is the synchronous Write trap ($A002 would be Read). Using the drvr dcmd in MacsBug, you'll find that the driver with refNum $FFDF is .ASYC00, which is the SCSI driver. This hang, then, occurs during a synchronous Write call to the SCSI driver. Perhaps I should next check the current interrupt level.
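If you'd rather not decode those fields in your head, the reasoning can be sketched in a few lines of Python. This is only an illustration with hypothetical helper names. It assumes that bit 10 ($0400) of the trap word marks an asynchronous Device Manager call, that ioResult and ioRefNum are signed 16-bit values, and that a driver's unit number is the one's complement of its reference number.

```python
ASYNC_BIT = 0x0400   # set in the trap word for asynchronous Device Manager calls


def to_signed16(value):
    """Interpret a 16-bit value as signed, the way MOVE.W followed by BGT does."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def describe_request(io_trap, io_result, io_ref_num):
    """Summarize the fields of a parameter block that matter for a hang."""
    ref_num = to_signed16(io_ref_num)
    return {
        "asynchronous": bool(io_trap & ASYNC_BIT),
        # vSyncWait keeps looping while ioResult is greater than zero
        "still_pending": to_signed16(io_result) > 0,
        "ref_num": ref_num,
        # the unit number is the one's complement of the reference number
        "unit_number": ~ref_num,
    }


# Values copied from the IOParamBlockRec display
info = describe_request(io_trap=0xA003, io_result=0x0001, io_ref_num=0xFFDF)
print(info)
```

For the parameter block shown here, the sketch reports a synchronous request that is still pending on the driver with reference number -33, which is exactly the situation that leaves vSyncWait spinning.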
HOW DID WE GET THERE?
After a long, ponderous silence, while sharply focused on the current enigma, Holmes might startle you by saying, "Let us reconstruct, Watson." Then he would describe the probable series of events that preceded that particular criminal act. If the reconstruction wasn't adequate to identify a perpetrator, at least it would review the crucial discoveries so far. It would show Holmes's appreciable progress toward a solution. Similarly, while in the midst of a difficult debugging task, you should reconstruct the turn of events to gain extremely helpful information.
Figuring out what happened, once the computer is stopped cold in a crash or a freeze, isn't easy. In effect, you're looking for footsteps in the sand that are often obscured or covered with other false marks. For this task, the technique we most often use is the stack crawl.
Procedural programming on the Macintosh uses a stack. For each procedure call, the stack is added to, and vital clues such as return addresses and stack frame pointers are left for us to find. In PowerPC code, the link register adds to our clues and often points back to the penultimate procedure of interest, although it's reliable only until the current routine makes a call of its own. Your low-level debugger will certainly have a stack crawl tool to use as well.
In MacsBug, the sc and sc7 commands are your basic stack-crawling aids. Start your search with the sc command, which looks for stack frames. Frames are structures found on the stack containing both the return address and a pointer to the previous frame. In PowerPC code the frames also contain a standard area to preserve basic registers. Fortunately, frames are required in PowerPC code and follow a standard format. Most 680x0 compilers will generate stack frames as well, although much of the 680x0 system software was written in assembly language without frames. If during your crash you have a valid stack frame address in register A6 or R1, the sc command will show you a history of which code execution preceded your software's demise. Listing 1 shows a basic sc command's result.
Listing 1. Display from the sc command
    Calling chain using A6/R1 links
     Back chain  ISA  Caller
     01C8A0AC    68K  01C139CA  'CODE 0001 0F6E Main'+03A1A
     01C8A0A0    68K  01C132EA  'CODE 0001 0F6E Main'+0333A
     01C89F4A    68K  00058748  'scod BFB1 011C'+01A38
     01C89E6A    68K  00064090  'scod BFB1 011C'+0D380
     01C89E40    68K  408787FC  CHECKUPDATESEARCH+0003E
     01C89E16    68K  40878426  __GETSUBWINDOWS+000D6
In this example the first two links are in a CODE resource from file number $0F6E. Use the MacsBug file command to determine which file they were loaded from. It's likely that they're from the current application, and the return addresses displayed in the Caller column (01C139CA and 01C132EA) are most likely in the application's binary. The return addresses listed are crucial to your sleuthing. They not only point out where execution would have returned to but, more important, they show which instructions were recently executed: the ones just before the return address. Those addresses are your footprints in the sand. They are clues in your reconstruction, and they hint to the turn of events that led to the crash or freeze.
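To make the idea of a frame-based crawl concrete, here is a small Python model of one. It is a sketch under stated assumptions, not a description of how MacsBug is implemented: it assumes the 680x0 layout produced by LINK A6, where the long word at the frame pointer holds the caller's frame pointer and the long word 4 bytes above it holds the return address. The memory contents are invented for the example.

```python
def crawl_frames(memory, frame_pointer, max_frames=32):
    """Follow saved frame pointers, collecting (frame, return_address) pairs.

    memory maps an address to the 32-bit value stored there. Assumed layout:
    memory[fp] is the caller's frame pointer, memory[fp + 4] the return address.
    """
    chain = []
    seen = set()
    while frame_pointer in memory and len(chain) < max_frames:
        if frame_pointer in seen or frame_pointer % 2:
            break  # a cycle or an odd address means the chain is corrupt
        seen.add(frame_pointer)
        return_address = memory.get(frame_pointer + 4)
        if return_address is None:
            break
        chain.append((frame_pointer, return_address))
        previous = memory[frame_pointer]
        if previous <= frame_pointer:
            break  # older frames must sit at higher addresses
        frame_pointer = previous
    return chain


# A made-up stack with three linked frames, newest first
memory = {
    0x1000: 0x1040, 0x1004: 0x2222,
    0x1040: 0x1080, 0x1044: 0x3333,
    0x1080: 0x0000, 0x1084: 0x4444,
}
for frame, caller in crawl_frames(memory, 0x1000):
    print(f"{frame:08X}  {caller:08X}")
```

The sanity checks matter as much as the walk itself. A trashed frame pointer is a common symptom of the very bug you're chasing, so a crawl that stops early is itself a clue.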
Note the third and fourth lines in Listing 1, which show return addresses in an 'scod' resource. Those 'scod' resources implement the Process Manager. It's possible that the application binary, probably at the instruction just before address 1C132EA, made a call to the Process Manager.
The fifth and sixth lines of the display show return addresses in the Macintosh ROM. The symbols are shown because I've installed a ROM map file in my MacsBug Preferences folder. You should use the provided ROM map file for your computer, because it will often give you better stack crawl information. You can also deduce that these return addresses are in the ROM from the addresses themselves. Most Macintosh ROMs begin at memory address $40800000. PCI-based Macintosh ROMs currently begin at $FFC00000, and PowerPC processor-based PowerBook ROMs at $40000000. You can determine the beginning address of your ROM by looking at the ROMBase low-memory global. In MacsBug, for example, type dl ROMBase to display the beginning ROM address.
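The address test I just described takes only a comparison or two. In this Python sketch the base address is the common $40800000 mentioned above, but the ROM size is a placeholder of my own choosing (4 MB), so substitute the real values for your machine.

```python
def in_rom(address, rom_base, rom_size):
    """True if address falls inside the ROM's address range."""
    return rom_base <= address < rom_base + rom_size


ROM_BASE = 0x40800000                 # what dl ROMBase shows on many machines
ASSUMED_ROM_SIZE = 4 * 1024 * 1024    # hypothetical size, for illustration only

# Some of the Caller values from Listing 1
for caller in (0x01C139CA, 0x00058748, 0x408787FC, 0x40878426):
    where = "ROM" if in_rom(caller, ROM_BASE, ASSUMED_ROM_SIZE) else "RAM"
    print(f"{caller:08X}  {where}")
```

Under those assumptions the last two callers land in the ROM and the first two don't, which matches what the symbols in Listing 1 suggest.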
The sc7 command in MacsBug gives you less precise information. In cases when you don't have stack frames, you can ask your debugger to display all possible return addresses on the stack. Your debugger will intelligently guess which values on the stack are possible return addresses, but most of the information displayed will be extraneous. You must pick through this information for clues -- an arduous task. The stack frame-based crawl is neat and tidy, whereas the same situation would produce the sc7 display shown in Listing 2. I've added an asterisk (*) on each line that's also in the sc command's display.
Listing 2. Display from the sc7 command
    Return addresses on the stack
     Stack Addr  Frame Addr  ISA  Caller
     01C8A0B0                68K  01C16D62  'CODE 0001 0F6E Main'+06DB2
     01C8A0A4    01C8A0A0    68K  01C139CA  'CODE 0001 0F6E Main'+03A1A *
     01C8A094                68K  40849116  UNLOADSEG+00046
     01C8A06A    01C8A066    68K  409CFFFC  DISPTABLE+8D0BC
     01C8A018                68K  4087EAF0  GETRESOURCE+000B2
     01C8A00E                68K  408806F6
     01C8A008                PPC  00094BE8  EmToNatEndMoveParams+00014
     01C89FF8                68K  0011ACDA
     01C89FE0                68K  4087ECFE  VRMGRSTDENTRY+000B0
     01C89FDC                68K  4087ECFE  VRMGRSTDENTRY+000B0
     01C89FD8                68K  0011A5B4
     01C89F4E    01C89F4A    68K  01C132EA  'CODE 0001 0F6E Main'+0333A *
     01C89F4A                68K  01C8A09E
     01C89F22    01C89F1E    68K  00058748  'scod BFB1 011C'+01A38 *
     01C89F1E                68K  01C89F48
     01C89EDE    01C89EDA    68K  00163E30
     01C89EDA                68K  01C89F1C
     01C89E62                68K  01C8AFBE
     01C89E44    01C89E40    68K  00064090  'scod BFB1 011C'+0D380 *
     01C89E1A    01C89E16    68K  408787FC  CHECKUPDATESEARCH+0003E *
     01C89DF4    01C89DF0    68K  40878426  __GETSUBWINDOWS+000D6 *
     01C89DE2                68K  4087876E  CALCANCESTORRGNS+0002A
     01C89DDE                68K  001191E6
In this example, there were a number of values on the stack that might have been valid return addresses. The six we saw in the sc command's display are there. Many of the other lines will not be relevant return addresses, because many procedures reserve space on the stack but don't always use it or initialize it. There will often be old return addresses in that unused part of the stack. These old return addresses are like very faint footprints in the sand -- from some previous execution -- and they may tell you what occurred much earlier in time. More often, though, they'll just be distracting and irrelevant.
Be very wary of an sc7 command when tracing through PowerPC code. PowerPC code typically has large stack frames, at least 56 bytes for each procedure, and the code often doesn't use all those bytes. This will cause many old return addresses to stay in the unused parts of the stack frame, and those old addresses will appear in your sc7 command's display.
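To see why a frameless crawl turns up so much noise, consider this deliberately simplified Python stand-in. It is not MacsBug's actual heuristic, which I'm not describing here. It simply reads a long word at every even offset of a block of stack memory and reports any even value that falls inside a known code range. The stack bytes, the stack address, and the code range are all invented for the example.

```python
def scan_for_return_addresses(stack_bytes, stack_pointer, code_ranges):
    """Return (stack_address, value) for every long word that points into code.

    code_ranges is a list of (start, end) pairs, with end exclusive.
    """
    hits = []
    for offset in range(0, len(stack_bytes) - 3, 2):   # every even offset
        value = int.from_bytes(stack_bytes[offset:offset + 4], "big")
        if value % 2:
            continue   # code addresses are never odd
        if any(start <= value < end for start, end in code_ranges):
            hits.append((stack_pointer + offset, value))
    return hits


# Made-up stack contents: two plausible code addresses among other data
stack = bytes.fromhex("00000000" "408787FC" "0001" "40878426" "DEADBEEF")
rom = [(0x40800000, 0x40C00000)]   # hypothetical code range

for address, value in scan_for_return_addresses(stack, 0x1000, rom):
    print(f"{address:08X}  {value:08X}")
```

Notice that nothing in the scan can tell a live return address from a stale one left behind by an earlier call. Both look identical on the stack, which is why you have to do the picking yourself.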
Sometimes you'll notice that the sc and sc7 commands fail to work. In MacsBug, you may see the error
Bad stack: stack pointer must be even and <= stack base
There's more than one stack that the system uses, but the stack base that MacsBug refers to in this error is the application stack's base or top address. The sc and sc7 commands first check to see if the A6, A7, and R1 registers point to locations below the application stack's base. If they don't, MacsBug returns this error. The executing code may be using a different stack, however. Many parts of the Mac OS system software use separate stacks. To force MacsBug to execute a stack crawl anyway, specify the register to use and the amount of memory to search through. For example, the MacsBug commands sc7 a7 4000 and sc a6 4000 will execute a stack crawl even if the A6 and A7 registers point above the application stack's base.
System stacks vary in size from about 8000 bytes up to 48000 bytes. There's no easy way to determine the base of a system stack that's in use. If you don't get interesting clues from 16384 bytes ($4000 in hex), vary the number of bytes you specify and compare your results.
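The condition in that error message is simple enough to state in code. This Python sketch mirrors only the wording of the message (even, and no higher than the stack base). The function name and the sample addresses are mine, chosen for illustration.

```python
def stack_pointer_ok(stack_pointer, stack_base):
    """Mirror the error text: the pointer must be even and <= the stack base."""
    return stack_pointer % 2 == 0 and stack_pointer <= stack_base


APP_STACK_BASE = 0x01C8B000   # hypothetical application stack base

print(stack_pointer_ok(0x01C89DDE, APP_STACK_BASE))   # True
print(stack_pointer_ok(0x01C89DDF, APP_STACK_BASE))   # False: odd address
print(stack_pointer_ok(0x02000000, APP_STACK_BASE))   # False: above the base

# The search length in "sc7 a7 4000" is hexadecimal
print(0x4000)   # 16384
```

When the second kind of failure shows up, remember that the pointer may be perfectly healthy and simply belong to one of the system's other stacks.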
ELEMENTARY, OF COURSE
Don't be pacified by source-level debuggers. Lower-level tools give you a much better understanding of the Mac OS and your code. These tools also give you the ability to research the most complicated problems. Strive to be a software sleuth, and you'll gain some truly useful expertise.