PowerPC 601

From Higher Intellect Vintage Wiki

The PowerPC 601 is the first processor in the PowerPC line, developed by the AIM alliance of Apple, IBM and Motorola.

The first Macintosh models to use the PowerPC include the Power Macintosh 6100, Power Macintosh 7100 and Power Macintosh 8100.

General Information

The principal features of the PowerPC 601 microprocessor include:

  • full RISC processing architecture
  • parallel processing units: one integer unit and one floating-point unit
  • a branch manager that can usually implement branches by reloading the incoming instruction queue without using any processing time
  • an internal memory management unit (MMU)
  • a single built-in 32 KB cache for data and instructions

The die measures 132 square millimeters, about 40 percent larger than the Intel 486 chip, but it contains approximately 2.8 million transistors on four layers of metal, nearly twice as many transistors as the 486. The 601 also contains a 32K cache, whereas the Intel 486 chip only has room for an 8K on-chip cache. To manufacture such a tightly packed chip, IBM used an advanced process that etches electrical-current pathways one-half micron (0.5 millionths of a meter) wide into the silicon wafer; this dimension is called the trace width. With such small trace widths, IBM was able to pack many more transistors onto a single chip. In addition, because the chips could be kept small, IBM could cut many more flawless 601 chips from each 8-inch wafer, thus lowering the price of the chip.
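The size comparison implies a couple of figures that are easy to check. The Python sketch below derives them using only the numbers quoted in this section; the constant and function names are illustrative, and nothing here comes from a datasheet.

```python
# Figures quoted in the text (all approximate).
DIE_AREA_MM2 = 132.0       # PowerPC 601 die area
TRANSISTORS = 2.8e6        # approximate transistor count
AREA_RATIO_VS_486 = 1.40   # "40 percent larger than the Intel 486"


def implied_486_area_mm2(area_601=DIE_AREA_MM2, ratio=AREA_RATIO_VS_486):
    """Die area of the 486 implied by the '40 percent larger' claim."""
    return area_601 / ratio


def transistors_per_mm2(transistors=TRANSISTORS, area=DIE_AREA_MM2):
    """Average transistor density across the whole 601 die."""
    return transistors / area


if __name__ == "__main__":
    print(f"Implied 486 die area: {implied_486_area_mm2():.1f} mm^2")
    print(f"601 density: {transistors_per_mm2():,.0f} transistors per mm^2")
```

These are derived values, so they are only as accurate as the rounded figures they start from.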

The 601 processor contains a 32K on-chip cache. Instructions and data are stored in the same cache, an arrangement called a unified cache. Later processors such as the PowerPC 604 use a split-cache scheme, which improves CPU performance by separating data and instructions into two 16K caches.

IBM released the 601 chips at five clock speeds: 60, 66, 80, 100, and 110 MHz. These speeds are achieved with a very low power consumption of 7 watts, further lowering the price of the chip. The 601 chip is currently used in Power Macintosh models 6100, 7100, 8100, and 7200.

The 601 Instruction Set

The 601 chip supports 69 instructions that are defined in the PowerPC architecture but are not in the POWER architecture. Most of these instructions are included to improve cache management, single-precision floating-point operations, and bit-shifting operations. In turn, ten 32-bit instructions defined in the PowerPC architecture are not in the PowerPC 601 chip. (Remember that the PowerPC architecture is a design standard; it’s not essential for every chip in the PowerPC series to implement every instruction defined in the architecture.)

Of course, all 64-bit instructions defined in the PowerPC architecture are illegal on the 601, which has 32-bit integer registers only and thus can’t support 64-bit instructions. Illegal instructions are ones that are available in the generic architecture, but can’t be trapped and emulated on a particular CPU implementation — in this case, the 601.

The 601 Chip Design

The 601 chip contains three main execution units to support superscalar dispatch of instructions: the integer unit (also called the fixed-point unit), the floating-point unit, and the branch-processing unit. The branch-processing unit is actually a subsystem of the main instruction unit, which is responsible for fetching and prefetching instructions. The instruction unit can store up to eight instructions at a time in its instruction prefetch queue. Since the instruction unit can prefetch instructions concurrently with the execution of instructions in the integer, floating-point, and branch-processing units (IU, FPU, and BPU respectively), instruction prefetch provides a way to speed up instruction throughput from memory or the cache into the CPU.

The PowerPC 601 chip includes fully interlocking hardware between its execution units and all pipelines. Interlocks allow different pipelines to communicate with one another to ensure that an instruction doesn’t complete if it requires data currently being manipulated by an instruction in a different pipeline or even in a different stage of the same pipe. Also, the 601 chip provides a 64-bit data bus (although it only supports 32-bit instructions).

The Instruction Unit and BPU

Since the PowerPC 601’s instruction unit and BPU work in close coordination, it makes sense to discuss the two units together.

The instruction unit contains issue logic that determines whether the current instruction should be sent to the IU or the FPU. This step is important because it allows integer and floating-point instructions to execute independently of branch prediction and branch instruction processing. The BPU contains its own set of dedicated registers, which allows branch instructions to execute directly within the BPU.

The BPU can actually fetch branch instructions from the queue on its own, separately from the logic used to evaluate integer and floating-point instructions. Specifically, the BPU searches the bottom half of the instruction queue for the presence of branch instructions. When the BPU finds a conditional branch instruction (one that may or may not branch to a different location in the program code based on the results of a conditional test), it removes the instruction from the queue and evaluates it within the BPU.
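A minimal Python sketch of this scan follows. It assumes that the "bottom" of the queue is the end nearest dispatch (index 0) and that each entry is tagged with the unit that executes it. The representation and the function name are illustrative and are not taken from the 601 documentation.

```python
QUEUE_DEPTH = 8  # the prefetch queue holds up to eight instructions


def pull_branch(queue):
    """Remove and return the first branch in the bottom half of the queue.

    `queue` is a list of (unit, text) tuples; index 0 is assumed to be the
    bottom of the queue. Returns None if the bottom half holds no branch.
    """
    for position, (unit, _text) in enumerate(queue[:QUEUE_DEPTH // 2]):
        if unit == "branch":
            return queue.pop(position)  # the BPU takes it out of the queue
    return None
```

Because the branch is removed from the queue, the integer and floating-point instructions around it close up and can be dispatched without waiting for it.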

At this point, the BPU scans the IU and FPU pipelines to determine whether a currently executing instruction will affect the condition bit that the BPU uses to evaluate the conditional branch instruction. If not, the BPU can resolve the instruction immediately by using the current value in the condition bit. (In other words, previously executed instructions have already produced the operand that determines whether the branch instruction will be taken.)

If the BPU needs to wait for a currently executing instruction to complete before the branch instruction can be evaluated, the BPU performs a speculative branch. Here’s what happens: while the BPU is waiting to evaluate the condition bit, it determines whether to take the branch based on the direction of the branch itself. This approach is called static branch prediction.

If a conditional branch instruction specifies that the program will branch backward if the condition test is true, then the BPU fetches and issues the target branch instructions. If the branch instruction specifies that the program will branch forward for a true condition, the BPU will not fetch the target branch instructions. Different RISC designers, operating independently, have conducted extensive research in branch prediction logic, with more or less uniform results. The static branch prediction method I’ve just described produces correct predictions far more often than it produces incorrect ones, resulting in increased superscalar execution and pipeline throughput.
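The rule itself is small enough to state as code. This Python sketch assumes the predictor sees only the address of the branch and the address of its target; the function and parameter names are illustrative.

```python
def predict_taken(branch_address, target_address):
    """Static branch prediction as described above.

    A backward branch (target below the branch) is predicted taken;
    a forward branch is predicted not taken.
    """
    return target_address < branch_address


if __name__ == "__main__":
    # A branch at 0x1020 that jumps back to 0x1000, as at the end of a loop.
    print(predict_taken(0x1020, 0x1000))
    # A branch at 0x1000 that skips ahead to 0x1020.
    print(predict_taken(0x1000, 0x1020))
```

One common explanation for why this works is that backward branches usually close loops, and a loop branch is taken on every iteration except the last.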

If the BPU finds a conditional branch instruction, pulls it from the queue, and is then forced to execute the instruction speculatively, the BPU determines the direction of the branch. Software can also influence whether the CPU predicts the branch as taken by setting a branch prediction bit in the branch instruction. For this discussion, assume that the BPU takes the branch only if the target address is backward, rather than forward, within the program.

If a backward branch is identified, the BPU takes the branch. It requests the prefetcher to replace the instructions that were behind the branch instruction in the queue with instructions starting at the branch target address. This effectively flushes the instruction queue. The newly fetched instructions are allowed to be decoded and executed within the IU and FPU. However, their results cannot be written back to registers until the BPU determines whether the branch was predicted correctly.

If a forward branch is identified, the branch itself is not taken. The BPU marks all instructions that were behind the branch instruction as conditional, allowing them to be decoded and executed within the IU and FPU. However, the results of these instructions cannot be written back to registers until the BPU determines that the branch was predicted correctly.
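In both cases the key point is that speculative results are held back until the branch is resolved. The Python sketch below models that behavior with two dictionaries; it is a simplified illustration with hypothetical names, not a description of the 601's actual hardware structures.

```python
def resolve_branch(registers, pending_results, predicted_correctly):
    """Commit or discard speculative results once the branch is resolved.

    `registers` maps register names to committed values. `pending_results`
    holds values computed by instructions that were executed conditionally.
    """
    if predicted_correctly:
        registers.update(pending_results)  # write back is now permitted
    pending_results.clear()                # nothing stays pending either way
    return registers
```

With a correct prediction the work done during the wait is kept; with an incorrect one the committed register state is exactly what it was before the branch.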

If the BPU predicts a branch incorrectly, regardless of whether the branch is taken or not taken, the penalty is a complete flush of the instruction queue and all pipelines. The BPU then tells the prefetch unit the starting address for the instructions to be reloaded into the instruction queue. If the prefetcher can find all of the instructions in the on-chip cache (all are cache hits), then the performance penalty is minimal; the bus between the instruction queue and the cache can transfer eight instructions at a time — in a single burst. So, an empty instruction queue can be reloaded within one clock cycle.

If the prefetcher cannot find the desired instructions within the on-chip cache (one or more cache misses), it must fetch the instructions either from the off-chip (L2) cache, if one exists, or from RAM. In either case, performance penalties are greater because both L2 cache and RAM chips are much slower than the on-chip cache. This fact demonstrates how valuable the 601 chip’s large on-chip cache can be in keeping pipelines busy, even when a conditional branch is predicted incorrectly.
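To see how strongly the cache hit rate affects the misprediction penalty, consider this rough Python model. The one-cycle refill on a cache hit comes from the text above; the cost of a refill that misses the cache is a hypothetical parameter, since no figure is given here.

```python
CACHE_HIT_REFILL_CYCLES = 1  # from the text: eight instructions in one burst


def expected_refill_cycles(hit_rate, miss_refill_cycles):
    """Average cycles needed to refill the queue after a misprediction.

    `hit_rate` is the fraction of refills served entirely from the on-chip
    cache. `miss_refill_cycles` is an assumed cost for refills that must go
    to an L2 cache or to RAM.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (hit_rate * CACHE_HIT_REFILL_CYCLES
            + (1.0 - hit_rate) * miss_refill_cycles)
```

For example, with an assumed 21-cycle miss cost, a 50 percent hit rate gives an average of 11 cycles per refill, while a 100 percent hit rate gives 1 cycle.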

Feed-Forwarding

The instruction unit also controls all feed-forward operations. Feed-forward operations allow one instruction in a pipe to forward the results of data manipulation to a second instruction, farther back in the pipe, that will need the results of the first instruction’s data. Feed-forwarding improves cycle time because the processing results of one instruction can be forwarded directly to the next instruction in the pipe, without having to store the results in a separate register for use by the second instruction.
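The benefit can be illustrated with a deliberately simplified timing model in Python. It assumes that a forwarded result is usable in the cycle after the producing instruction executes, and that without forwarding the dependent instruction must also wait for the write-back stage. These cycle counts are assumptions for illustration and are not cycle-accurate for the 601.

```python
WRITE_BACK_CYCLES = 1  # assumed cost of storing the result to a register first


def earliest_execute_cycle(producer_execute_cycle, forwarding):
    """Earliest cycle in which a dependent instruction can execute."""
    ready = producer_execute_cycle + 1   # result exists after execute
    if not forwarding:
        ready += WRITE_BACK_CYCLES       # must wait for the register write
    return ready


def stall_cycles_saved(producer_execute_cycle=0):
    """Cycles saved by forwarding for one dependent pair in this model."""
    return (earliest_execute_cycle(producer_execute_cycle, False)
            - earliest_execute_cycle(producer_execute_cycle, True))
```

In this model each dependent pair of instructions saves one cycle, which adds up quickly in code where most instructions consume the result of the one before.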

The Integer Unit

The IU (fixed-point, or integer, unit) is responsible for executing all integer and load/store (memory access) operations, including load/store operations on floating-point registers. The IU is a four-stage pipeline that implements fetch, dispatch and decode, execute, and write back.
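The following Python sketch shows how instructions overlap in such a pipeline. It models an ideal case in which one instruction enters per cycle and nothing stalls, which is an assumption; real code on the 601 will see stalls. The function names are illustrative.

```python
IU_STAGES = ("fetch", "dispatch/decode", "execute", "write back")


def total_cycles(n_instructions, stages=IU_STAGES):
    """Cycles needed to complete n instructions in an ideal pipeline."""
    if n_instructions <= 0:
        return 0
    return len(stages) + n_instructions - 1


def schedule(n_instructions, stages=IU_STAGES):
    """Map each cycle to the (instruction, stage) pairs active in it."""
    timeline = {}
    for instr in range(n_instructions):
        for offset, stage in enumerate(stages):
            timeline.setdefault(instr + offset, []).append((instr, stage))
    return timeline


if __name__ == "__main__":
    for cycle, active in sorted(schedule(3).items()):
        print(cycle, active)
```

Once the pipeline is full, one instruction completes per cycle, so five instructions need eight cycles in this model rather than twenty.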

The PowerPC 601’s IU is considered to be fairly typical of RISC implementations, although its feed-forward capabilities are not available on many RISC and RISC-like CPUs, including the Pentium.

The Floating-Point Unit

The FPU in the 601 chip supports floating-point operations on both single-precision and double-precision values, using the IEEE-754 floating-point standard. (IEEE stands for Institute of Electrical and Electronics Engineers; IEEE-754 specifies the format for single-precision — 32-bit — floating-point values and double-precision — 64-bit — floating-point values.)
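The two formats can be inspected from Python using the standard `struct` module. This is an illustration of the IEEE-754 encodings only and has nothing specific to the 601; the helper name `encode` is made up for this example.

```python
import struct

FORMATS = {"single": ">f", "double": ">d"}  # big-endian IEEE-754 encodings


def encode(value, precision):
    """Return (width in bits, hex encoding) of value in the given precision."""
    raw = struct.pack(FORMATS[precision], value)
    return len(raw) * 8, raw.hex()


if __name__ == "__main__":
    for precision in ("single", "double"):
        print(precision, encode(1.0, precision))
```

The narrower format trades precision for size: packing 0.1 as a single and unpacking it again does not return exactly the same Python float, because Python floats are double-precision.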

The FPU includes 32 registers for performing floating-point operations. As you might expect, each register is 64 bits in length to support double-precision values. The FPU contains six pipeline stages: fetch, dispatch, decode, execute1, execute2, and write back. The execute1 and execute2 stages allow two floating-point instructions to be executed back-to-back.

An added bonus: The 601’s FPU can search the bottom half of the instruction queue (just like the BPU) and then execute floating-point instructions that do not depend on the results of other instructions in the queue. This technique, of course, helps increase pipeline throughput.

And one drawback: The FPU doesn’t support feed-forwarding. If a floating-point instruction needs the result stored by another floating-point instruction further along in the pipe, the instruction must stall until the previous instruction has stored its result in an FPU register. The omission of a feed-forwarding mechanism in the FPU pipe is more understandable than it would have been with the IU, because floating-point routines often interleave floating-point and integer instructions, so pipeline stalls in the FPU are often not critical. Nevertheless, the addition of feed-forwarding to the FPU pipe in later PowerPC chips would be very welcome.

The Cache Unit

The 601 chip contains a 32K on-chip cache — larger than any cache available on CISC CPUs, and larger than the on-chip caches on most RISC CPUs. The ability of IBM to stuff a 32K cache onto an affordable and relatively small RISC processor says volumes about the carefully crafted design of the 601 and about the sophistication of IBM’s fabrication technology.

One interesting quirk is that the cache is unified, which means that instructions and data are stored in the same cache. IBM pioneered the split-cache concept, which separates data and instructions into separate caches. Split caches are a staple of the POWER RS/6000 technology. But it’s important to keep in mind that the main goal of the 601 design was to provide maximum performance at the lowest possible price. A unified cache is clearly less expensive to implement than a split cache, and does provide a few benefits. But overall, a split cache provides better CPU throughput than a unified cache. In fact, the PowerPC 604 chip contains a split cache — a 16K data cache and a separate 16K instruction cache.

Cache control logic can be somewhat complex and involves such concepts as set associativity, cache data and cache tag memories, translation lookaside buffers, and read/write protocols. These concepts require more technical depth than I think most readers would be willing to sift through, so I’m going to stay away from those topics.

Evaluating the 601’s Performance

IBM currently offers PowerPC 601 chips that run at 60, 66, 80, 100, and 110 MHz. Since the 66 MHz version provides an intermediate standard, and since the most widely sold Pentium chip runs at 66 MHz, it serves our purposes to use the 66 MHz PowerPC as the benchmark for comparison with the Pentium. Just keep in mind that there are faster PowerPC 601s and faster Pentiums. The differences between the two chips, then, really have little if anything to do with clock speed.

For the 66 MHz PowerPC 601 chip, IBM reports a value of 62 for the SPECint92 rating (in other words, 62 is the geometric mean for the full battery of integer tests) and a value of 72 for the SPECfp92 rating (the geometric mean for all floating-point tests). Apple reports a SPECint92 rating of 60 and a SPECfp92 rating of 80. As a point of comparison, consider these Pentium ratings: Intel’s published benchmark results for the 66 MHz chip are 65 SPECint92 and 57 SPECfp92.
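For readers unfamiliar with the geometric mean, here is a short Python sketch of the calculation. The per-test scores in the usage example are invented for illustration; the individual test results behind the published ratings are not given in this article.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of a sequence of positive per-test scores."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("at least one score is required")
    if any(r <= 0 for r in ratios):
        raise ValueError("scores must be positive")
    # Averaging the logarithms avoids overflow from a long running product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    made_up_scores = [50.0, 60.0, 75.0, 90.0]  # hypothetical test results
    print(f"{geometric_mean(made_up_scores):.1f}")
```

Compared with an arithmetic mean, the geometric mean is less affected by a single unusually high score, which makes it a reasonable way to summarize a battery of tests.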

These numbers show that the Pentium’s integer instructions perform at a slightly higher rate than the PowerPC 601, but that floating-point ratings for the PowerPC 601 are substantially higher than the floating-point rating for the Pentium. When you consider these numbers, it seems clear why Intel is downplaying the importance of floating-point computation in new software designs.

Another important factor in evaluating chip performance is power consumption. On average, the PowerPC 601 66 MHz chip dissipates about 7 watts. By comparison, the Pentium averages out at 13 watts. This has been Intel’s single biggest problem in getting the Pentium chips into notebook computers. Power dissipation for desktop PCs isn’t as critical as it is for notebook computers, which don’t have the space to support heavy and bulky cooling systems, but it is important. Lower power consumption makes it easier for system designers to design inexpensive cooling systems. So, in the end, low power consumption — even for desktop systems — translates into lower system prices.
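Combining the published ratings with the power figures gives a rough performance-per-watt comparison. This Python sketch uses only numbers quoted in this section (IBM's ratings for the 601, Intel's for the Pentium, and the approximate average wattages); the dictionary layout and function name are illustrative.

```python
CHIPS = {
    "601": {"specint92": 62, "specfp92": 72, "watts": 7},       # IBM figures
    "pentium": {"specint92": 65, "specfp92": 57, "watts": 13},  # Intel figures
}


def per_watt(chip, metric):
    """Benchmark rating divided by average power dissipation."""
    data = CHIPS[chip]
    return data[metric] / data["watts"]


if __name__ == "__main__":
    for chip in CHIPS:
        for metric in ("specint92", "specfp92"):
            print(f"{chip:8s} {metric}: {per_watt(chip, metric):.1f} per watt")
```

Because the wattages are averages and the ratings come from each vendor's own testing, the result should be read as a rough indication rather than a precise measurement.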

See Also