Balance of Power: Sleuthing Through Your Code

by DAVE EVANS

The night was well advanced, but the bright glow of fluorescent lamps misrepresented time. As I sat back in my comfortable chair, rubbing tired eyes, I wondered what the venerable but fictional Mr. Sherlock Holmes would offer me as advice. Perhaps because I was so weary from the long hours of debugging, I easily imagined Mr. Holmes sitting near me in a tweed suit smoking his pipe. Certainly he would address me as he once addressed his compatriot Dr. Watson, with a slightly condescending tone, and he would tell me that in my debugging I was missing the key iota of information.

At that moment, a solitary number seemed brighter on my monitor. Perhaps I have an overactive imagination, but it seemed as if MacsBug were magically illuminating that crucial, overlooked information. My computer was at interrupt level 2, yet it was waiting for a driver request to complete. How could I have missed the interrupt level earlier? It was no wonder that the computer froze. My software had most likely called the driver synchronously at exactly the wrong time. The voice of Mr. Holmes rang again in my ears. This time he quoted from that unfortunate story "A Case of Identity" when he said, "It has long been an axiom of mine that the little things are infinitely the most important."

Sir Arthur's famous detective was unsurpassed as an observer of detail. He believed that keen attention to all things -- even the mundane -- was the key to good detective work. In debugging software, I've found this advice is also true. Although many software bugs can be solved quite easily, the most challenging problems demand more attention. This is especially true of crashes or freezes in your software. To find the detail we need for those, we often have to go below source-level tools and get comfortable with lower-level aids.

In this column I'll take you through some low-level debugging techniques. I'll start with basic strategy and then discuss particular methods and examples. Although many details will be PowerPC-specific, much of the information here is useful on all Macintosh computers.

THE STRATEGY OF A SLEUTH

The experienced engineer starts with a basic strategy when faced with a troublesome software crash or freeze. The strategy is similar to Mr. Holmes's approach to solving difficult crimes. Using the scientific method, he starts by collecting key information and details. When he has finished researching, he begins to analyze the information and eliminates hypothesis after hypothesis. Once close to a solution, he seeks out more detail to narrow his suspects to a single culprit. Similarly, your strategy for debugging software should start with careful observation and research. Then you should hypothesize, test your theories, and collect more detail. This narrowing approach will draw you closer to the pernicious coding error in your software.

It's tempting when faced with a difficult crash to experiment instead of researching it first. But beware! Don't just reimplement your code with new approaches until it stops crashing. Though some may cynically suggest that that's the Macintosh way to program, don't be lulled into this strategy. I've found that it usually produces unstable code and ultimately takes longer than researching the original problem.

In researching a crash or a freeze, the private bug detective should first ask these few basic questions:

  • What kind of crash or freeze is this?

  • What code did the computer stop in?

  • How did I get to that code?

To answer these, you'll need a low-level debugger (such as MacsBug). Let's look at each one in turn.

GET YOUR BEARINGS

The first step is to determine the kind of problem you've got. For crashes there are a number of possible problems, including the all-too-familiar illegal instruction and bus errors. Note that PowerPC exception handlers don't currently distinguish between these or other types. In MacsBug the correct type will be reported, but your debugger may instead describe all crashes as general spurious interrupts or type 11 errors.

If your crash is from an illegal instruction error, it's possible that the processor jumped to an invalid address or the intended code moved in memory. In this case you'll notice (in a disassembly where execution stopped) that most instructions are invalid or nonsense. This can also occur if the emulator tries to emulate PowerPC code, or if the processor tries to execute 680x0 code as PowerPC code. Try disassembling memory as both PowerPC code (using ipp pc) and 680x0 code (using ip pc).

If your crash is from a bus error, the most likely cause is an invalid address in some register. Disassemble memory where execution stopped and examine the instructions. If there are instructions that dereference registers, inspect those registers for addresses that aren't in a valid range. If you're debugging 680x0 code on a Power Macintosh, you'll need to look at all the instructions near the crash, because the 680x0 emulator won't tell you exactly which instruction caused the error.

Researching a freeze requires a different approach. If the freeze prevents you from using any debugging tools, you must isolate the offending code by watching the computer execute up to the freeze. Setting breakpoints, tracing, and stopping execution at known locations will bring you closer. This approach is slow but will lead you to the code that caused the error or to the state that prompted it. If the computer is frozen but you can still use debugging tools, it's very possible that you're in an infinite loop.
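One way to picture the infinite-loop check is as a search for a repeating tail in the list of addresses you've traced. The following is only a sketch in Python, not anything a debugger provides: the function name, the trace format (a plain list of program counter values), and the sample addresses are all assumptions made for illustration.

```python
def find_repeating_cycle(pcs, min_repeats=3):
    """Return the shortest run of addresses that ends the trace and
    repeats back to back at least min_repeats times, or None."""
    n = len(pcs)
    for length in range(1, n // min_repeats + 1):
        cycle = pcs[n - length:]
        # Compare each earlier window of the same length against the tail.
        if all(pcs[n - (k + 1) * length:n - k * length] == cycle
               for k in range(min_repeats)):
            return cycle
    return None


# Hypothetical trace: straight-line code, then a two-instruction loop.
trace = [0x1000, 0x1004, 0x1008,
         0x2000, 0x2004, 0x2000, 0x2004, 0x2000, 0x2004]
cycle = find_repeating_cycle(trace)
print([hex(pc) for pc in cycle] if cycle else "no loop seen")
```

A repeating tail only suggests a loop. A long but finite loop looks the same over a short trace, so the branch conditions still need to be examined.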

THE LAYOUT OF THE CRIME SCENE

Sherlock Holmes sometimes astonished readers by deducing crimes just from hearing second-hand details. He was also known, however, to walk the back alleys of London and gumshoe the scene of a crime when necessary. Learning the layout of the crime scene was crucial for a number of his deductions. When staring at your newly crashed software, do you recognize the code that your debugger is displaying? Disassemble memory near the location of the crash and snoop around for clues. Check for the following to determine how your computer came to this final resting place:

  • If you're using MacsBug, use the wh pc command to check where the code is.

  • Display memory and disassemble from the beginning of the code's block of memory.

  • Does the code nearby reference strings or Gestalt selectors?

  • Look for text symbols and strings in the code.

If you've crashed in PowerPC code, most low-level debuggers will give great information about where you are. This is because most PowerPC code is registered and linked using the Code Fragment Manager, which these debuggers can access for hints. For example, if you use the wh pc command in MacsBug after crashing in PowerPC code, you'll see something like this:

 Address 000BAE18 is in the System heap 
    at 00002800 at NQDColor2Index+00018
 The address is in a CFM fragment "NQD"

 It is 0001AD28 bytes into this heap block:
     Start    Length      Tag  Mstr Ptr Lock
  * 000A00F0 0003DB00+04   R   00002AC4   L

Here we see that the computer crashed at a location 24 bytes from the beginning of the NQDColor2Index routine. This routine is in the NQD (or Native QuickDraw) code fragment. Since this address is close to the beginning of the routine, we can disassemble from its start and examine the six instructions that executed before the crash for more clues:

Disassembling PowerPC code from bae00
  NQDColor2Index
    +00000 000BAE00   li      r5,0x0000
    +00004 000BAE04   lwz     r4,TheGDevice(r0)
    +00008 000BAE08   sth     r5,QDErr(r0)
    +0000C 000BAE0C   stw     r31,-0x0004(SP)
    +00010 000BAE10   lwz     r5,0x0000(r4)
    +00014 000BAE14   addi    r31,r3,0x0000
    +00018 000BAE18  *lwz     r3,0x000C(r5)

A bus error at NQDColor2Index+00018 would occur if register R5 contained an invalid address. Look at the register display to validate that hypothesis. Notice in the code that R5 is a dereference of R4, which comes from the low-memory global TheGDevice. Here we crashed because TheGDevice had become invalid, so now your investigation turns toward that global.
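The chain of loads in that disassembly can be modeled in a few lines of Python to show how a stale value propagates until the third load faults. This is a sketch under stated assumptions: memory is a dictionary of valid word addresses, THE_GDEVICE is a placeholder address rather than a value taken from the listing, and BusError is a made-up exception standing in for the processor's fault.

```python
class BusError(Exception):
    """Stand-in for the fault raised on an invalid address."""


THE_GDEVICE = 0x0CC8  # placeholder address for the low-memory global


def load_word(memory, address):
    # Only addresses present in the dictionary count as valid memory.
    if address not in memory:
        raise BusError(f"invalid address {address:#010x}")
    return memory[address]


def color2index_loads(memory):
    r4 = load_word(memory, THE_GDEVICE)  # lwz r4,TheGDevice(r0)
    r5 = load_word(memory, r4)           # lwz r5,0x0000(r4)
    r3 = load_word(memory, r5 + 0x0C)    # lwz r3,0x000C(r5)  <- crash site
    return r3


healthy = {THE_GDEVICE: 0x5000, 0x5000: 0x6000, 0x600C: 0x1234}
stale = {THE_GDEVICE: 0x5000, 0x5000: 0xDEAD0000}  # handle points at junk

print(hex(color2index_loads(healthy)))
try:
    color2index_loads(stale)
except BusError as err:
    print("bus error:", err)
```

The first two loads succeed even in the stale case, which is why the crash shows up at +00018 rather than at the instruction that first read the bad global.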

A freeze will typically occur because of a double page fault or exception or because of an infinite loop. Synchronous driver calls will also freeze if called when the interrupt level is above 0. A double fault or exception is common only if you're writing driver software. Your computer can handle only one page fault or exception at a time. A double fault or exception occurs when software that services a fault subsequently causes a second fault. For example, disk drivers are sometimes called by the Virtual Memory Manager to help service page faults; therefore, if you develop a disk driver you must take care not to cause page faults since you may be asked to service one as well.

A good way to detect infinite loops is to trace for a few instructions using your debugger. If you notice the same set of instructions being repetitively executed, you could be in an infinite loop. Look at branch instructions for clues to why the loop isn't completing. A special case of these loops is the vSyncWait routine. It looks like this:

MOVE.W      $0010(A0),D0
BGT.S         *-6

This tight loop is waiting for the two-byte value located 16 bytes from register A0 to become 0 or negative. This is a standard sequence to wait for a driver request to complete. The driver request is described in an IOParam record pointed to by register A0. When the driver is done servicing the request, it will interrupt the loop and modify the ioResult field 16 bytes into that record. It will then return from the interrupt, and the loop will complete normally. A freeze in this loop means the driver isn't servicing the request. If you typed dm a0 iopb in MacsBug, you might see something like this:

 Displaying IOParamBlockRec at 000003A4
  000003A4  qLink              NIL
  000003A8  qType              0002 
  000003AA  ioTrap             A003 
  000003AC  ioCmdAddr          NIL 
  000003B0  ioCompletion       NIL 
  000003B4  ioResult           0001
  000003B6  ioNamePtr          NIL
  000003BA  ioVRefNum          0008 
  000003BC  ioRefNum           FFDF 
  000003BE  ioVersNum          #0 
  000003BF  ioPermssn          #23 
  000003C0  ioMisc             NIL 
  000003C4  ioBuffer           01C7E2B0
  000003C8  ioReqCount         00010000 
  000003CC  ioActCount         00010000 
  000003D0  ioPosMode          0001 
  000003D2  ioPosOffset        1B84AA00

Take note of the ioTrap and ioRefNum fields. In this case, ioTrap is $A003, which is the synchronous Write trap (the synchronous Read trap is $A002). Using the drvr dcmd in MacsBug, you'll find that the driver with refNum $FFDF is .ASYC00, which is the SCSI driver. This hang, then, occurs during a synchronous Write call to the SCSI driver. Perhaps I should next check the current interrupt level.
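The raw fields in that display are easier to reason about once they're decoded as signed 16-bit values. The Python sketch below does that for the three fields of interest. The function names are invented for illustration. The rules it applies are that a positive ioResult means the request is still in progress (which is what the wait loop tests), that bit 10 ($0400) of the trap word marks an asynchronous call, and that a driver reference number is the one's complement of its unit table index.

```python
def to_signed16(value):
    """Interpret a 16-bit field as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def describe_request(io_trap, io_result, io_ref_num):
    result = to_signed16(io_result)
    ref_num = to_signed16(io_ref_num)
    return {
        "synchronous": (io_trap & 0x0400) == 0,  # async bit clear
        "in_progress": result > 0,               # loop spins while > 0
        "ref_num": ref_num,
        # Driver refNums are negative; unit table index is ~refNum.
        "unit_number": ~ref_num if ref_num < 0 else None,
    }


# Values copied from the parameter block display above.
info = describe_request(io_trap=0xA003, io_result=0x0001, io_ref_num=0xFFDF)
print(info)
```

For these values the request is synchronous, still marked in progress, and addressed to driver reference number -33, which is unit table entry 32.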

HOW DID WE GET THERE?

After a long, ponderous silence, while sharply focused on the current enigma, Holmes might startle you by saying, "Let us reconstruct, Watson." Then he would describe the probable series of events that preceded that particular criminal act. If the reconstruction wasn't adequate to identify a perpetrator, at least it would review the crucial discoveries so far. It would show Holmes's appreciable progress toward a solution. Similarly, while in the midst of a difficult debugging task, you should reconstruct the turn of events to gain extremely helpful information.

Figuring out what happened, once the computer is stopped cold in a crash or a freeze, isn't easy. In effect, you're looking for footsteps in the sand that are often obscured or covered with other false marks. For this task, the technique we most often use is the stack crawl.

Procedural programming on the Macintosh uses a stack. For each procedure call, the stack is added to, and vital clues such as return addresses and stack frame pointers are left for us to find. In PowerPC code, the link register adds to our clues and often points back to the penultimate procedure of interest. It isn't guaranteed to, however, because any routine that makes a call of its own overwrites the link register. Your low-level debugger will certainly have a stack crawl tool to use as well.

In MacsBug, the sc and sc7 commands are your basic stack-crawling aids. Start your search with the sc command, which looks for stack frames. Frames are structures found on the stack containing both the return address and a pointer to the previous frame. In PowerPC code the frames also contain a standard area to preserve basic registers. Fortunately, frames are required in PowerPC code and follow a standard format. Most 680x0 compilers will generate stack frames as well, although much of the 680x0 system software was written in assembly language without frames. If during your crash you have a valid stack frame address in register A6 or R1, the sc command will show you a history of the code that executed before your software's demise. Listing 1 shows a basic sc command's result.

Listing 1. Display from the sc command

 Calling chain using A6/R1 links
  Back chain  ISA  Caller
  01C8A0AC    68K  01C139CA  'CODE 0001 0F6E Main'+03A1A
  01C8A0A0    68K  01C132EA  'CODE 0001 0F6E Main'+0333A
  01C89F4A    68K  00058748  'scod BFB1 011C'+01A38
  01C89E6A    68K  00064090  'scod BFB1 011C'+0D380
  01C89E40    68K  408787FC  CHECKUPDATESEARCH+0003E
  01C89E16    68K  40878426  __GETSUBWINDOWS+000D6

In this example the first two links are in a CODE resource from file number $0F6E. Use the MacsBug file command to determine which file they were loaded from. It's likely that they're from the current application, and the return addresses displayed in the Caller column (01C139CA and 01C132EA) are most likely in the application's binary. The return addresses listed are crucial to your sleuthing. They not only point out where execution would have returned to but, more important, they show which instructions were recently executed: the ones just before the return address. Those addresses are your footprints in the sand. They are clues in your reconstruction, and they hint at the turn of events that led to the crash or freeze.
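The idea behind a frame-based crawl can be sketched in Python. The sketch assumes the usual 680x0 frame layout produced by LINK A6, in which the frame pointer addresses the saved previous frame pointer and the return address sits 4 bytes above it. It isn't MacsBug's implementation, and the memory contents and addresses are invented for the example.

```python
def crawl_frames(memory, frame_ptr, stack_base, max_frames=64):
    """Follow saved frame pointers, collecting (frame, return address)."""
    chain = []
    while len(chain) < max_frames:
        # A usable frame pointer is even and not above the stack base.
        if frame_ptr % 2 or frame_ptr > stack_base:
            break
        if frame_ptr not in memory or frame_ptr + 4 not in memory:
            break
        chain.append((frame_ptr, memory[frame_ptr + 4]))
        previous = memory[frame_ptr]
        # Older frames live at higher addresses; anything else ends the walk.
        if previous <= frame_ptr:
            break
        frame_ptr = previous
    return chain


# Hypothetical stack: three frames, the oldest ending the chain with 0.
memory = {
    0x8F00: 0x8F40, 0x8F04: 0x1111,
    0x8F40: 0x8F80, 0x8F44: 0x2222,
    0x8F80: 0x0000, 0x8F84: 0x3333,
}
for frame, caller in crawl_frames(memory, 0x8F00, stack_base=0x9000):
    print(f"{frame:08X}  {caller:08X}")
```

The crawl is only as good as the links. One frameless routine in the middle leaves a gap, and a corrupted frame pointer stops the walk early.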

Note the third and fourth lines in Listing 1, which show return addresses in an 'scod' resource. Those 'scod' resources implement the Process Manager. It's possible that the application binary, probably at the instruction just before address 1C132EA, made a call to the Process Manager.

The fifth and sixth lines of the display show return addresses in the Macintosh ROM. The symbols are shown because I've installed a ROM map file in my MacsBug Preferences folder. You should use the provided ROM map file for your computer, because it will often give you better stack crawl information. You can also deduce that these return addresses are in the ROM from the addresses themselves. Most Macintosh ROMs begin at memory address $40800000. PCI-based Macintosh ROMs currently begin at $FFC00000, and PowerPC processor-based PowerBook ROMs at $40000000. You can determine the beginning address of your ROM by looking at the ROMBase low-memory global. In MacsBug, for example, type dl ROMBase to display the beginning ROM address.
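The address test is simple enough to write down. In this Python sketch the ROM base is whatever ROMBase reports, and the 4 MB size is an assumption made for illustration only; substitute the real size for your machine.

```python
ASSUMED_ROM_SIZE = 4 * 1024 * 1024  # assumption: a 4 MB ROM


def in_rom(address, rom_base, rom_size=ASSUMED_ROM_SIZE):
    """True if address falls inside the ROM's address range."""
    return rom_base <= address < rom_base + rom_size


rom_base = 0x40800000  # value you would read from ROMBase
for caller in (0x01C139CA, 0x00058748, 0x408787FC, 0x40878426):
    print(f"{caller:08X}", "ROM" if in_rom(caller, rom_base) else "RAM")
```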

The sc7 command in MacsBug gives you less precise information. In cases when you don't have stack frames, you can ask your debugger to display all possible return addresses on the stack. Your debugger will intelligently guess which values on the stack are possible return addresses, but most of the information displayed will be extraneous. You must pick through this information for clues -- an arduous task. The stack frame-based crawl is neat and tidy, whereas the same situation would produce the sc7 display shown in Listing 2. I've added an asterisk (*) on each line that's also in the sc command's display.

Listing 2. Display from the sc7 command

Return addresses on the stack
  Stack Addr  Frame Addr   ISA   Caller
   01C8A0B0                68K   01C16D62 'CODE 0001 0F6E Main'+06DB2
   01C8A0A4    01C8A0A0    68K   01C139CA 'CODE 0001 0F6E Main'+03A1A      *
   01C8A094                68K   40849116 UNLOADSEG+00046
   01C8A06A    01C8A066    68K   409CFFFC DISPTABLE+8D0BC
   01C8A018                68K   4087EAF0 GETRESOURCE+000B2
   01C8A00E                68K   408806F6 
   01C8A008                PPC   00094BE8 EmToNatEndMoveParams+00014
   01C89FF8                68K   0011ACDA
   01C89FE0                68K   4087ECFE VRMGRSTDENTRY+000B0
   01C89FDC                68K   4087ECFE VRMGRSTDENTRY+000B0
   01C89FD8                68K   0011A5B4
   01C89F4E    01C89F4A    68K   01C132EA 'CODE 0001 0F6E Main'+0333A      *
   01C89F4A                68K   01C8A09E
   01C89F22    01C89F1E    68K   00058748 'scod BFB1 011C'+01A38      *
   01C89F1E                68K   01C89F48
   01C89EDE    01C89EDA    68K   00163E30
   01C89EDA                68K   01C89F1C
   01C89E62                68K   01C8AFBE
   01C89E44    01C89E40    68K   00064090 'scod BFB1 011C'+0D380      *
   01C89E1A    01C89E16    68K   408787FC CHECKUPDATESEARCH+0003E      *
   01C89DF4    01C89DF0    68K   40878426 __GETSUBWINDOWS+000D6      *
   01C89DE2                68K   4087876E CALCANCESTORRGNS+0002A
   01C89DDE                68K   001191E6

In this example, there were a number of values on the stack that might have been valid return addresses. The six we saw in the sc command's display are there. Many of the other lines will not be relevant return addresses, because many procedures reserve space on the stack but don't always use it or initialize it. There will often be old return addresses in that unused part of the stack. These old return addresses are like very faint footprints in the sand -- from some previous execution -- and they may tell you what occurred much earlier in time. More often, though, they'll just be distracting and irrelevant to your search.
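A frameless scan of this kind can be approximated in Python. The sketch reads a big-endian 32-bit value at every even offset of a stack dump and keeps the ones that land in a known code range. It's a deliberately crude stand-in, not the heuristic MacsBug uses, and the code ranges and stack bytes are made up for the example.

```python
import struct


def scan_for_return_addresses(stack_bytes, stack_addr, code_ranges):
    """Return (stack address, value) for values that point into code."""
    hits = []
    # 680x0 stacks are word aligned, so step two bytes at a time.
    for offset in range(0, len(stack_bytes) - 3, 2):
        (value,) = struct.unpack_from(">I", stack_bytes, offset)
        if value % 2 == 0 and any(lo <= value < hi for lo, hi in code_ranges):
            hits.append((stack_addr + offset, value))
    return hits


# Hypothetical code ranges: an application segment and the ROM.
code_ranges = [(0x01C0FFB0, 0x01C20000), (0x40800000, 0x40C00000)]
# A saved frame pointer, a return address, a word of data, a return address.
dump = struct.pack(">IIHI", 0x01C8A09E, 0x01C132EA, 0x0000, 0x408787FC)

for where, caller in scan_for_return_addresses(dump, 0x01C89F4A, code_ranges):
    print(f"{where:08X}  {caller:08X}")
```

Nothing in the scan can tell a live return address from a stale one left in unused stack space, which is exactly why the real display needs to be picked through by hand.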

Be very wary of an sc7 command when tracing through PowerPC code. PowerPC code typically has large stack frames, at least 56 bytes for each procedure, and the code often doesn't use all those bytes. This will cause many old return addresses to stay in the unused parts of the stack frame, and those old addresses will appear in your sc7 command's display.

Sometimes you'll notice that the sc and sc7 commands fail to work. In MacsBug, you may see the error

Bad stack: stack pointer must be even and
   <= stack base

There's more than one stack that the system uses, but the stack base that MacsBug refers to in this error is the application stack's base or top address. The sc and sc7 commands first check to see if the A6, A7, and R1 registers point to locations below the application stack's base. If they don't, MacsBug returns this error. The executing code may be using a different stack, however. Many parts of the Mac OS system software use separate stacks. To force MacsBug to execute a stack crawl anyway, specify the register to use and the amount of memory to search through. For example, the MacsBug commands sc7 a7 4000 and sc a6 4000 will execute a stack crawl even if the A6 and A7 registers point above the application stack's base.

System stacks vary in size from about 8000 bytes up to 48000 bytes. There's no easy way to determine the base of a system stack that's in use. If you don't get interesting clues from 16384 bytes ($4000 in hex), vary the number of bytes you specify and compare your results.
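The rule quoted in the error message, and the window that an explicit length gives you, can be written out as a short Python sketch. The function names and the sample addresses are assumptions for illustration. The check itself is just the one the message states.

```python
def stack_pointer_ok(pointer, stack_base):
    """The test from the error text: even, and not above the stack base."""
    return pointer % 2 == 0 and pointer <= stack_base


def crawl_window(pointer, length=0x4000):
    """Address range searched when a register and length are given."""
    return pointer, pointer + length


app_stack_base = 0x01C8B000  # hypothetical application stack base
print(stack_pointer_ok(0x01C89DDE, app_stack_base))  # on the application stack
print(stack_pointer_ok(0x02F40010, app_stack_base))  # some other stack
start, end = crawl_window(0x02F40010)
print(f"search {start:08X} to {end:08X} ({end - start} bytes)")
```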

ELEMENTARY, OF COURSE

Don't be pacified by source-level debuggers. Lower-level tools give you a much better understanding of the Mac OS and your code. These tools also give you the ability to research the most complicated problems. Strive to be a software sleuth, and you'll gain some truly useful expertise.