MIPS architecture

Revision as of 00:22, 17 September 2018 by Netfreak

MIPS, for Microprocessor without Interlocked Pipeline Stages, is a RISC microprocessor architecture originally developed at Stanford University and later commercialized by MIPS Technologies. By the late 1990s it was estimated that one in three RISC chips produced were MIPS-based designs.

MIPS designs are used in Silicon Graphics' computer product line; many embedded systems; Windows CE devices; Cisco routers; and video game consoles such as the Nintendo 64 and the Sony PlayStation, PlayStation 2, and PlayStation Portable handheld system.

The early MIPS architectures were 32-bit implementations (generally 32-bit wide registers and data paths), while later versions were 64-bit implementations. Five backward-compatible revisions of the MIPS instruction set exist, named MIPS I, MIPS II, MIPS III, MIPS IV, and MIPS 32/64. The latest of these, MIPS 32/64 Release 2, defines a control register set as well as the instruction set. Several "add-on" extensions are also available, including MIPS-3D, a simple set of floating-point SIMD instructions dedicated to common 3D tasks; MDMX (MaDMaX), a more extensive integer SIMD instruction set using the 64-bit floating-point registers; MIPS16, which adds compression to the instruction stream to make programs take up less room (allegedly a response to the Thumb encoding in the ARM architecture); and MIPS MT, a more recent set of multithreading additions similar to HyperThreading in Intel's Pentium 4 processors.

Because the designers created such a clean instruction set, computer architecture courses in universities and technical schools often study the MIPS architecture. The design of the MIPS CPU family greatly influenced later RISC architectures such as DEC Alpha.

History

In 1981, a team led by John L. Hennessy at Stanford University started work on what would become the first MIPS processor. The basic concept was to dramatically increase performance through the use of deep instruction pipelines, a technique that was well known, but difficult to implement. Generally a pipeline spreads out the task of running an instruction into several steps, starting work on "step one" of an instruction before the preceding instruction is complete. In contrast, traditional designs of the era waited to complete an entire instruction before moving on, thereby leaving large areas of the CPU idle as the process continued. Moreover, the clock frequency of the entire CPU was dictated by the latency of the entire cycle, rather than by the critical path (i.e. the latency of the pipeline stage taking the longest time to complete).
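
The timing argument in the paragraph above can be illustrated with a small calculation. This is a minimal sketch under idealized assumptions (no stalls, no hazards); the function names and the stage latencies are hypothetical and are not taken from any real MIPS design.

```python
def unpipelined_time(n_instructions, stage_latencies):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * sum(stage_latencies)


def pipelined_time(n_instructions, stage_latencies):
    """A new instruction enters every clock; the clock period is set by the
    slowest stage (the critical path). Assumes an ideal pipeline, no stalls."""
    if n_instructions == 0:
        return 0
    clock_period = max(stage_latencies)
    cycles = len(stage_latencies) + n_instructions - 1  # fill time + one per instruction
    return cycles * clock_period


# Hypothetical five-stage pipeline, latencies in nanoseconds.
stages_ns = [2, 2, 3, 2, 2]
print(unpipelined_time(1000, stages_ns))  # 1000 * 11 = 11000
print(pipelined_time(1000, stages_ns))    # (5 + 999) * 3 = 3012
```

In this toy model the pipelined design finishes 1000 instructions in roughly a quarter of the time, even though a single instruction takes longer from start to finish (15 ns instead of 11 ns).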

One major barrier to pipelining was that it required interlocks to be set up to ensure that instructions that took multiple clock cycles to complete would stop the pipeline from loading more data, in effect pausing it until they completed. These interlocks can take a long time to set up, and were thought to be a major barrier to future speed improvements. A major aspect of the MIPS design was to demand that all instructions take only one cycle to complete, thereby removing any need for interlocking.

Although this design eliminated a number of useful instructions, notably things like multiply and divide which would take multiple steps, it was felt that the overall performance of the system would be dramatically improved because the chips could run at much higher clock rates. This ramping of the speed would be difficult with interlocking involved, as the time needed to set up locks is as much a function of die size as clock rate: adding the hardware needed might actually slow down the overall speed.

The elimination of these instructions became a contentious point. Many observers claimed the design (and RISC in general) would never live up to its hype. If one simply replaces the complex multiply instruction with many simpler additions, where is the speed increase? This overly-simple analysis ignored the fact that the speed of the design was in the pipelines, not the instructions.

In 1984 Hennessy was convinced of the future commercial potential of the design, and left Stanford to form MIPS Computer Systems. The company released its first design, the R2000, in 1985, and improved the design as the R3000 in 1988. These 32-bit CPUs formed the basis of the company through the 1980s, used primarily in Silicon Graphics' series of workstations. These commercial designs deviated from the Stanford academic research by implementing most of the interlocks in hardware and by supplying full multiply and divide instructions (among others).

In 1991 MIPS released the first 64-bit microprocessor, the R4000. However, MIPS had financial difficulties while bringing it to market. The design was so important to SGI, at the time one of MIPS' few major customers, that SGI bought the company outright in 1992 in order to guarantee the design would not be lost. As a subsidiary of SGI, the company became known as MIPS Technologies.

In the early 1990s MIPS started licensing their designs to third-party vendors. This proved fairly successful due to the simplicity of the core, which allowed it to be used in a number of applications that would have formerly used much less capable CISC designs of similar gate count and price (the two are strongly related: the price of a CPU is generally related to the number of gates and the number of external pins). Sun Microsystems attempted to follow their success by licensing their SPARC core, but it has never been anywhere near as successful. By the late 1990s MIPS was a powerhouse in the embedded processor field, and in 1997 the 48-millionth MIPS-based CPU shipped, making it the first RISC CPU to outship the famous 68k family. MIPS was so successful that SGI spun off MIPS Technologies in 1998. Fully half of MIPS' income today comes from licensing their designs, while much of the rest comes from contract design work on cores that will then be produced by third parties.

In 1999 MIPS formalized their licensing system around two basic designs, the 32-bit MIPS32 and 64-bit MIPS64. NEC, Toshiba and SiByte (later acquired by Broadcom) each obtained licenses for the MIPS64 as soon as it was announced. Philips, LSI Logic and IDT have since joined them. Success followed success, and today the MIPS cores are among the most-used "heavyweight" cores in the marketplace for computer-like devices (hand-held computers, set-top boxes, etc.), with other designers fighting it out for other niches. Some indication of their success is the fact that Freescale (spun off from Motorola) uses MIPS cores in their set-top box designs, instead of their own PowerPC-based cores.

Since the MIPS architecture is licensable, it has attracted several processor start-up companies over the years. One of the first start-ups to design MIPS processors was Quantum Effect Devices (see next section). The MIPS design team that designed the R4300 started the company SandCraft, which designed the R5432 for NEC and later produced the SR7100, one of the first out-of-order execution processors for the embedded market. The original DEC StrongARM team eventually split into two MIPS-based start-ups: SiByte, which produced the SB-1250, one of the first high-performance MIPS-based systems-on-a-chip (SoC), and Alchemy Semiconductor (later acquired by AMD), which produced the Au-1000 SoC for low-power applications. Lexra used a MIPS-like architecture and added DSP extensions for the audio chip market and multithreading support for the networking market. Because Lexra did not license the architecture, two lawsuits were started between the two companies. The first was quickly resolved when Lexra promised not to advertise their processors as MIPS-compatible. The second was protracted, hurt both companies' business, and culminated in MIPS Technologies giving Lexra a free license and a large cash payment.

CPU family

The first commercial MIPS CPU model, the R2000, was announced in 1985. It added multiple-cycle multiply and divide instructions in a somewhat independent on-chip unit. New instructions were added to retrieve the results from this unit back to the execution core; these result-retrieving instructions were interlocked.

The R2000 could be booted either big-endian or little-endian. It had thirty-two 32-bit general purpose registers, but no condition code register, which its designers considered a potential bottleneck; it shares this feature with the AMD 29000 and the DEC Alpha. Unlike other registers, the program counter is not directly accessible.

The R2000 also had support for up to four co-processors, one of which was built into the main CPU and handled exceptions and traps, while the other three were left for other uses. One of these could be filled by the optional R2010 FPU, which had thirty-two 32-bit registers that could be used as sixteen 64-bit registers for double-precision.

The R3000 succeeded the R2000 in 1988, adding 32 KB (soon increased to 64 KB) caches for instructions and data, along with cache coherency support for multi-processor use. While there were flaws in the R3000's multiprocessor support, it still managed to be a part of several successful multiprocessor designs. The R3000 also included a built-in MMU, a common feature on CPUs of the era. The R3000 was the first successful MIPS design in the marketplace, and eventually over 1 million were made. The R3000A, used in the extremely successful Sony PlayStation, was a speed-bumped version running at 40 MHz that delivered a performance of 32 VUPs. Just as the R2000 was paired with the R2010, the R3000 was paired with the R3010 FPU. Pacemips produced an R3400 and IDT produced the R3500; both were R3000s with the R3010 FPU on a single chip. Toshiba's R3900 was virtually the first SoC for the early handheld PCs based on Windows CE.

The R4000 series, released in 1991, extended the MIPS instruction set to a full 64-bit architecture, moved the FPU onto the main die to create a single-chip system, and operated at a radically high internal clock speed (it was introduced at 100 MHz). However, in order to achieve the clock speed the caches were reduced to 8 KB each and took three cycles to access. The high operating frequencies were achieved through the technique of deep pipelining (called super-pipelining at the time). With the introduction of the R4000 a number of improved versions soon followed, including the R4400 of 1993 which included 16 KB caches, largely bug-free 64-bit operation, and a controller for another 1 MB external (level 2) cache.

MIPS, now a division of SGI called MTI, designed the lower-cost R4200, and later the even lower cost R4300, which was the R4200 with a 32-bit external bus. The Nintendo 64 used an NEC VR4300 CPU that was based upon the low-cost MIPS R4300i.

Quantum Effect Devices (QED), a separate company started by refugees from MIPS, designed the R4600 "Orion", the R4700 "Orion", the R4650 and the R5000. Where the R4000 had pushed clock frequency and sacrificed cache capacity, the QED designs emphasized large caches which could be accessed in just two cycles and efficient use of silicon area. The R4600 and R4700 were used in low-cost versions of the SGI Indy workstation as well as the first MIPS-based Cisco routers, such as the 36x0 and 7x00-series routers. The R4650 was used in the original WebTV set-top boxes (now Microsoft TV). The R5000 FPU had more flexible single precision floating-point scheduling than the R4000, and as a result, R5000-based SGI Indys had much better graphics performance than similarly clocked R4400 Indys with the same graphics hardware. SGI gave the old graphics board a new name when it was combined with the R5000 in order to emphasize the improvement. QED later designed the RM7000 and RM9000 family of devices for embedded markets like networking and laser printers. QED was acquired by the semiconductor manufacturer PMC-Sierra in August 2000, the latter company continuing to invest in the MIPS architecture.

The R8000 (1994) was the first superscalar MIPS design, able to execute two ALU and two memory operations per cycle. The design was spread over six chips: an integer unit (with 16 KB instruction and 16 KB L1 data caches), a floating-point unit, three full-custom secondary cache tag RAMs (two for secondary cache accesses, one for bus snooping), and a cache controller ASIC. The design had two fully pipelined double precision multiply-add units, which could stream data from the 4 MB off-chip secondary cache. The R8000 powered SGI's Power Challenge computer servers in the mid 1990s and later became available in the Power Indigo2 workstation. Its limited integer performance and high cost dampened appeal for most users, although its FPU performance fit scientific users quite well, and the R8000 was in the marketplace for only a year and remains fairly rare.

In 1995, the R10000 was released. This processor was a single-chip design, ran at a faster clock speed than the R8000, and had larger 32 KB primary instruction and data caches. It was also superscalar, but its major innovation was out-of-order execution. Even with a single memory pipeline and simpler FPU, the vastly improved integer performance, lower price, and higher density made the R10000 preferable for most customers.

Recent designs have all been based upon the R10000 core. The R12000 used improved manufacturing to shrink the chip and operate at higher clock rates. The revised R14000 allowed higher clock rates with additional support for DDR SRAM in the off-chip cache, and a faster front side bus clocked to 200 MHz for better throughput. Later iterations are named the R16000 and the R16000A and feature increased clock speed, additional L1 cache, and a smaller manufacturing process than their predecessors.

MIPS microprocessor specifications
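| Model | Frequency [MHz] | Year | Process [µm] | Transistors [millions] | Die size [mm²] | I/O pins | Power [W] | Voltage [V] | Dcache [KB] | Icache [KB] | Scache [KB] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| R2000 | 8-16.7 | 1985 | 2.0 | 0.11 | -- | -- | -- | -- | 32 | 64 | none |
| R3000 | 12-40 | 1988 | 1.2 | 0.11 | 66.12 | 145 | 4 | -- | 64 | 64 | none |
| R4000 | 100 | 1991 | 0.8 | 1.35 | 213 | 179 | 15 | 5 | 8 | 8 | 1024 |
| R4400 | 100-250 | 1992 | 0.6 | 2.3 | 186 | 179 | 15 | 5 | 16 | 16 | 1024 |
| R4600 | 100-133 | 1994 | 0.64 | 2.2 | 77 | 179 | 4.6 | 5 | 16 | 16 | 512 |
| R5000 | 150-200 | 1996 | 0.35 | 3.7 | 84 | 223 | 10 | 3.3 | 32 | 32 | 1024 |
| R8000 | 75-90 | 1994 | 0.5 | 2.6 | 299 | 591 | 30 | 3.3 | 16 | 16 | 1024 |
| R10000 | 150-250 | 1995 | 0.35 | 6.8 | 299 | 599 | 30 | 3.3 | 32 | 32 | 16384 |
| R12000 | 270-400 | 1998 | 0.18–0.25 | 6.9 | 204 | 600 | 20 | 2.3 | 32 | 32 | 16384 |
| R14000 | 500-600 | 2001 | 0.13 | 7.2 | 204 | 527 | 17 | 1.5 | 32 | 32 | 16384 |
| R16000 | 700-1000 | 2002 | 0.11 | -- | -- | -- | 20 | 1.5 | 32 | 32 | 16384 |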

Note: These specifications are only common processor configurations. Variations exist, especially in Level 2 cache.

Summary by MIPS version


Applications

Among the manufacturers which made computer workstation systems using MIPS processors are SGI, MIPS Computer Systems, Inc., Olivetti, Siemens-Nixdorf, Acer, Digital Equipment Corporation, NEC, and DeskStation. Various operating systems have been ported to the architecture, such as SGI's IRIX, Microsoft's Windows NT (although support for MIPS ended with the release of Windows NT 4.0) and Windows CE, Linux, BSD, UNIX System V, SINIX, MIPS Computer Systems' own RISC/os, and others.

However, use of MIPS as the main processor of computer workstations has declined, and SGI has announced its plans to cease developing high-performance iterations of the MIPS architecture in favor of using Intel IA64-based processors (see "Other models and future plans" section below).

On the other hand, use of MIPS microprocessors in embedded roles is likely to remain common, because of the low power-consumption and heat characteristics of embedded MIPS implementations, the wide availability of embedded development tools for MIPS, as well as experts knowledgeable about the architecture.

Other models and future plans

Other members of the MIPS family include the R6000, an ECL implementation of the MIPS architecture which was produced by Bipolar Integrated Technology. The R6000 microprocessor introduced the MIPS II instruction set. Its TLB and cache architecture are different from all other members of the MIPS family. The R6000 did not deliver the promised performance benefits, and although it saw some use in Control Data machines, it quickly disappeared from the mainstream market. The PMC-Sierra RM7000 was a version of the R5000 with a built-in 256 kB level 2 cache and a controller for optional level three cache. It was primarily targeted at embedded designs, including SGI's graphics processors and various networking solutions, primarily by Cisco. The R9000 name was never used.

At one time SGI had intended to move off the MIPS platform to the Intel Itanium, and development was to have ended with the R10000. The ever-longer delays in introducing the Itanium meant that the installed base of MIPS-based machines continued to increase. By 1999 it was clear that development had ended too soon, and the R14000 and R16000 were released as a result. SGI had hinted at a more complex R8000 style FPU for later R-series, and also a dual core processor, but SGI's financial troubles and the officially supported use of QuickTransit emulation to run IRIX binaries on Altix have essentially eliminated IRIX/MIPS hardware development.

Cores

In recent years most of the technology used in the various MIPS generations has been offered as IP-cores (building-blocks) for embedded processor designs. Both 32-bit and 64-bit basic cores are offered, known as the 4K and 5K respectively, and the design itself can be licensed as MIPS32 and MIPS64. These cores can be mixed with add-in units such as FPUs, SIMD systems, various input/output devices, etc.

MIPS cores have been commercially successful, now being used in many consumer and industrial applications. MIPS cores can be found in newer Cisco and Linksys routers, cable modems and ADSL modems, smartcards, laser printer engines, set-top boxes, robots, handheld computers, Sony PlayStation 2 and Sony PlayStation Portable. In cellphone/PDA applications, the MIPS core has been unable to displace the incumbent, competing ARM core.

Examples of MIPS-powered devices: Broadcom BCM5352E - WiFi router processor with 54g WLAN, fast Ethernet, 200 MHz, 16KiB ins. 8KiB data cache, 256B prefetch cache, MMU, 16-bit 100 MHz SDRAM controller, serial/parallel flash, 5-port 100 Mbit/s Ethernet (switch), 16 GPIO, JTAG, 2xUART, 336-ball BGA. BCM 11xx, 12xx, 14xx - 64bit "SiByte" MIPS line.

IDT RC32438, ATI Xilleon, Alchemy Au1000, 1100, 1200, Broadcom Sentry5, Cavium Octeon CN34xx and CN38xx, Infineon Technologies EasyPort, Amazon, ADM5120, WildPass, INCA-IP, INCA-IP2, NEC EMMA and EMMA2, NEC VR4181A, VR4121, VR4122, VR4181A, VR5432, VR5500, Oak Technologies Generation, PMC-Sierra RM11200, QuickLogic QuickMIPS ESP, Toshiba "Donau", Toshiba TMPR492x, TX4925, TX9956, TX7901

Programming and emulation

There is a freely available "MIPS R2000/R3000 Simulator" called SPIM for several operating systems (specifically Unix or GNU/Linux; Mac OS X; MS Windows 95, 98, NT, 2000, XP; and DOS) which is good for learning MIPS assembly language programming and the general concepts of RISC-assembly language programming: http://www.cs.wisc.edu/~larus/spim.html

A more feature-rich free MIPS emulator is available from the GXemul project (formerly known as the mips64emul project), which emulates not only the various MIPS III and higher microprocessors (from the R4000 through the R10000), but also emulates entire computer systems which use the microprocessors. For example, GXemul can emulate both a DECstation with a MIPS R4400 CPU (and boot to Ultrix), and an SGI O2 with a MIPS R10000 CPU (although the ability to boot Irix is limited), among others, as well as the various framebuffers, SCSI controllers, and the like which comprise those systems.

Examples of system calls (used by SPIM)
| Service | Trap code | Input | Output | Notes |
|---|---|---|---|---|
| print_int | $v0=1 | $a0 = integer to print | | Prints $a0 to standard output |
| print_string | $v0=4 | $a0 = address of first character | | Prints a character string to standard output |
| sbrk | $v0=9 | $a0 = number of bytes required | $v0 = address of allocated memory | Allocates memory from the heap |

Summary of R3000 instruction set

Instructions are divided into three types: R, I and J. Every instruction starts with a 6-bit opcode. In addition to the opcode, R-type instructions specify three registers, a shift amount field, and a function field; I-type instructions specify two registers and a 16-bit immediate value; J-type instructions follow the opcode with a 26-bit jump target.
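
The three layouts can be illustrated by packing the fields into a 32-bit word. The following is a minimal sketch: the helper names encode_r, encode_i and encode_j are hypothetical, and the opcode and funct values used in the examples are the standard MIPS I encodings for add, addi and j.

```python
def encode_r(opcode, rs, rt, rd, shamt, funct):
    """R-type: opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return ((opcode & 0x3F) << 26 | (rs & 0x1F) << 21 | (rt & 0x1F) << 16 |
            (rd & 0x1F) << 11 | (shamt & 0x1F) << 6 | (funct & 0x3F))


def encode_i(opcode, rs, rt, immediate):
    """I-type: opcode(6) rs(5) rt(5) immediate(16)."""
    return ((opcode & 0x3F) << 26 | (rs & 0x1F) << 21 | (rt & 0x1F) << 16 |
            (immediate & 0xFFFF))


def encode_j(opcode, byte_address):
    """J-type: opcode(6) address(26). The field holds a word address,
    so the two low-order bits of the byte address are dropped."""
    return (opcode & 0x3F) << 26 | ((byte_address >> 2) & 0x03FFFFFF)


# add $1,$2,$3  ->  rd=1, rs=2, rt=3, funct 0x20
print(hex(encode_r(0, 2, 3, 1, 0, 0x20)))   # 0x430820
# addi $1,$2,100  ->  rt=1, rs=2, opcode 0x8
print(hex(encode_i(0x8, 2, 1, 100)))        # 0x20410064
# j 0x00400000  ->  opcode 0x2
print(hex(encode_j(0x2, 0x00400000)))       # 0x8100000
```

Note that in the assembly syntax the destination register comes first, whereas in the encoded word the rd field follows rs and rt.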

Register name, number, use, and call conventions:

Registers
| Name | Number | Use | Callee must preserve? |
|---|---|---|---|
| $zero | $0 | constant 0 | N/A |
| $at | $1 | assembler temporary | no |
| $v0–$v1 | $2–$3 | values for function returns and expression evaluation | no |
| $a0–$a3 | $4–$7 | function arguments | no |
| $t0–$t7 | $8–$15 | temporaries | no |
| $s0–$s7 | $16–$23 | saved temporaries | yes |
| $t8–$t9 | $24–$25 | temporaries | no |
| $k0–$k1 | $26–$27 | reserved for OS kernel | no |
| $gp | $28 | global pointer | yes |
| $sp | $29 | stack pointer | yes |
| $fp | $30 | frame pointer | yes |
| $ra | $31 | return address | N/A |

Registers that are preserved across a call are registers that (by convention) will not be changed by a system call or procedure (function) call. For example, $s-registers must be saved to the stack by a procedure that needs to use them, and $sp and $fp are always incremented by constants, and decremented back after the procedure is done with them (and the memory they point to). By contrast, $ra is changed automatically by any normal function call (ones that use jal), and $t-registers must be saved by the program before any procedure call (if the program needs the values inside them after the call).

Real instructions

These are instructions that have direct hardware implementation, as opposed to pseudoinstructions which are translated into multiple real instructions before being assembled.

The following are the three formats used for the core instruction set:

| Type | Format, from bit 31 (left) to bit 0 (right); field widths in bits |
|---|---|
| R | opcode (6), rs (5), rt (5), rd (5), shamt (5), funct (6) |
| I | opcode (6), rs (5), rt (5), immediate (16) |
| J | opcode (6), address (26) |
  • CONST denotes a constant ("immediate").
  • In the following, the register numbers are only examples, and any other registers can be used in their places.
  • All the following instructions are native instructions, with the exception of la, which is an assembler pseudoinstruction.
  • Opcodes and funct codes are in hexadecimal.
| Category | Name | Instruction syntax | Meaning | Format/opcode/funct | Notes |
|---|---|---|---|---|---|
| Arithmetic | Add | add $1,$2,$3 | $1 = $2 + $3 (signed) | R / 0 / 0x20 | Adds two registers. |
| Arithmetic | Add unsigned | addu $1,$2,$3 | $1 = $2 + $3 (unsigned) | R / 0 / 0x21 | |
| Arithmetic | Subtract | sub $1,$2,$3 | $1 = $2 - $3 (signed) | R / 0 / 0x22 | Subtracts two registers. |
| Arithmetic | Add immediate | addi $1,$2,CONST | $1 = $2 + CONST (signed) | I / 0x8 | Used to add constants (and also to copy one register to another: "addi $1, $2, 0"). |
| Arithmetic | Add immediate unsigned | addiu $1,$2,CONST | $1 = $2 + CONST (unsigned) | I / 0x9 | |
| Arithmetic | Multiply | mult $1,$2 | LO = (($1 * $2) << 32) >> 32; HI = ($1 * $2) >> 32 | R / 0 / 0x18 | Multiplies two registers and puts the 64-bit result in two special registers, LO and HI. Alternatively, one could say the result of this operation is: (int HI, int LO) = (64-bit) $1 * $2. |
| Arithmetic | Divide | div $1,$2 | LO = $1 / $2; HI = $1 % $2 | R | Divides two registers and puts the 32-bit integer result in LO and the remainder in HI. |
| Data transfer | Load address | la $1,Label | $1 = Memory Address | I | Loads the address of a label. |
| Data transfer | Load word | lw $1,CONST($2) | $1 = Memory[$2 + CONST] | I / 0x23 | Loads the word stored at ($2 + CONST) and the following 3 bytes. |
| Data transfer | Load halfword | lh $1,CONST($2) | $1 = Memory[$2 + CONST] | I / 0x21 | Loads the halfword stored at ($2 + CONST) and the following byte. |
| Data transfer | Load byte | lb $1,CONST($2) | $1 = Memory[$2 + CONST] | I | Loads the byte stored at ($2 + CONST). |
| Data transfer | Store word | sw $1,CONST($2) | Memory[$2 + CONST] = $1 | I | Stores a word into ($2 + CONST) and the following 3 bytes. The order of the operands is a large source of confusion. |
| Data transfer | Store half | sh $1,CONST($2) | Memory[$2 + CONST] = $1 | I | Stores the low-order half of a register (a halfword) into ($2 + CONST) and the following byte. |
| Data transfer | Store byte | sb $1,CONST($2) | Memory[$2 + CONST] = $1 | I | Stores the low-order fourth of a register (a byte) into ($2 + CONST). |
| Data transfer | Load upper immediate | lui $1,CONST | $1 = CONST << 16 | I | Loads a 16-bit immediate operand into the upper 16 bits of the register specified. The maximum value of the constant is <math>2^{16}-1</math>. |
| Data transfer | Move from high | mfhi $1 | $1 = HI | R | Moves a value from HI to a register. Do not use a multiply or a divide instruction within two instructions of mfhi (that action is undefined because of the MIPS pipeline). |
| Data transfer | Move from low | mflo $1 | $1 = LO | R / 0 / 0x12 | Moves a value from LO to a register. Do not use a multiply or a divide instruction within two instructions of mflo (that action is undefined because of the MIPS pipeline). |
| Logical | And | and $1,$2,$3 | $1 = $2 & $3 | R | Bitwise and. |
| Logical | And immediate | andi $1,$2,CONST | $1 = $2 & CONST | I | |
| Logical | Or | or $1,$2,$3 | $1 = $2 \| $3 | R | Bitwise or. |
| Logical | Or immediate | ori $1,$2,CONST | $1 = $2 \| CONST | I | |
| Logical | Exclusive or | xor $1,$2,$3 | $1 = $2 ^ $3 | R | |
| Logical | Nor | nor $1,$2,$3 | $1 = ~($2 \| $3) | R | Bitwise nor. |
| Logical | Set on less than | slt $1,$2,$3 | $1 = ($2 < $3) | R | Tests if one register is less than another. |
| Logical | Set on less than immediate | slti $1,$2,CONST | $1 = ($2 < CONST) | I | Tests if one register is less than a constant. |
| Bitwise shift | Shift left logical | sll $1,$2,CONST | $1 = $2 << CONST | R | Shifts CONST bits to the left (multiplies by <math>2^{CONST}</math>). |
| Bitwise shift | Shift right logical | srl $1,$2,CONST | $1 = $2 >> CONST | R | Shifts CONST bits to the right; zeros are shifted in (divides by <math>2^{CONST}</math>). Note that this instruction only works as division of a two's complement number if the value is positive. |
| Bitwise shift | Shift right arithmetic | sra $1,$2,CONST | $1 = ($2 >> CONST) + <math>\bigg(\sum_{n=1}^{CONST}2^{32-n}\bigg)\cdot</math> ($2 >> 31), where >> is the logical shift | R | Shifts CONST bits to the right; the sign bit is shifted in (divides a two's complement number by <math>2^{CONST}</math>). |
| Conditional branch | Branch on equal | beq $1,$2,CONST | if ($1 == $2) go to PC+4+CONST | I | Goes to the instruction at the specified address if two registers are equal. |
| Conditional branch | Branch on not equal | bne $1,$2,CONST | if ($1 != $2) go to PC+4+CONST | I | Goes to the instruction at the specified address if two registers are not equal. |
| Unconditional jump | Jump | j CONST | goto address CONST | J | Unconditionally jumps to the instruction at the specified address. |
| Unconditional jump | Jump register | jr $1 | goto address $1 | R | Jumps to the address contained in the specified register. |
| Unconditional jump | Jump and link | jal CONST | $31 = PC + 4; goto CONST | J | For procedure calls: used to call a subroutine, with $31 holding the return address. Returning from a subroutine is done by: jr $31. |

NOTE: in the branching and jump instructions, the offset can be replaced by a label present somewhere in the code.
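
The behaviour of mult, div and the two right shifts can be sketched in a few lines. This is an illustrative model of the results only, not of timing or the pipeline; the function names are hypothetical, registers are modelled as unsigned 32-bit integers, and div is assumed to truncate toward zero with the remainder taking the sign of the dividend.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mult(rs, rt):
    """Signed 32x32 multiply; returns (HI, LO), the two halves of the 64-bit product."""
    product = to_signed32(rs) * to_signed32(rt)
    return (product >> 32) & MASK32, product & MASK32


def div(rs, rt):
    """Signed divide; returns (HI, LO) = (remainder, quotient)."""
    a, b = to_signed32(rs), to_signed32(rt)
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    remainder = a - quotient * b
    return remainder & MASK32, quotient & MASK32


def srl(value, shamt):
    """Logical right shift: zeros are shifted in."""
    return (value & MASK32) >> shamt


def sra(value, shamt):
    """Arithmetic right shift: copies of the sign bit are shifted in."""
    return (to_signed32(value) >> shamt) & MASK32


print([hex(v) for v in mult(0x10000, 0x10000)])   # ['0x1', '0x0']
print([hex(v) for v in div(-7 & MASK32, 2)])      # ['0xffffffff', '0xfffffffd']
print(hex(srl(0x80000000, 4)), hex(sra(0x80000000, 4)))  # 0x8000000 0xf8000000
```

The last line shows why srl only works as a division for non-negative values: the same bit pattern shifted with sra keeps its sign.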

Pseudoinstructions are translated into multiple real instructions (see above) before being assembled.

| Name | Instruction syntax | Real instruction translation | Meaning |
|---|---|---|---|
| Branch greater than | bgt | | if (R[rs] > R[rt]) PC = Label |
| Branch less than | blt | | if (R[rs] < R[rt]) PC = Label |
| Branch greater than or equal | bge | | if (R[rs] >= R[rt]) PC = Label |
| Branch less than or equal | ble | | if (R[rs] <= R[rt]) PC = Label |
| Branch greater than unsigned | bgtu | | if (R[rs] > R[rt]) PC = Label (unsigned comparison) |
| Branch greater than zero | bgtz | | if (R[rs] > 0) PC = Label |
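
As an illustration of how such a translation can work, the signed comparisons can each be built from slt followed by beq or bne, using the assembler temporary $at. The sketch below shows one common expansion; it is an assumption about a typical assembler rather than a statement of what any particular assembler emits, and the names RULES, expand and branch_taken are hypothetical.

```python
# For each pseudoinstruction: (swap the slt operands?, branch applied to $at)
RULES = {
    "blt": (False, "bne"),  # $at = rs < rt ; branch if $at != 0
    "bgt": (True,  "bne"),  # $at = rt < rs ; branch if $at != 0
    "bge": (False, "beq"),  # $at = rs < rt ; branch if $at == 0
    "ble": (True,  "beq"),  # $at = rt < rs ; branch if $at == 0
}


def expand(mnemonic, rs, rt, label):
    """Return the two real instructions for a pseudo branch, as text."""
    swap, branch = RULES[mnemonic]
    a, b = (rt, rs) if swap else (rs, rt)
    return [f"slt $at, {a}, {b}", f"{branch} $at, $zero, {label}"]


def branch_taken(mnemonic, rs_value, rt_value):
    """Evaluate the expanded sequence on two signed values."""
    swap, branch = RULES[mnemonic]
    a, b = (rt_value, rs_value) if swap else (rs_value, rt_value)
    at = 1 if a < b else 0          # slt
    return at != 0 if branch == "bne" else at == 0


print(expand("blt", "$t0", "$t1", "loop"))
# ['slt $at, $t0, $t1', 'bne $at, $zero, loop']
```

This also shows why $at is reserved for the assembler: a pseudoinstruction may overwrite it without the programmer having written any instruction that names it.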

Other instructions

These instructions are to be placed in either the "real instructions" or "pseudoinstructions" sections above.

Common logical instructions (bitwise)

addiu $1,$2,100 $1 = $2 + 100 (unsigned immediate)
addu $1,$2,$3 $1 = $2 + $3 (unsigned)
div $1,$2 HI = $1 % $2 ; LO = $1 / $2
subu $1,$2,$3 $1 = $2 - $3 (unsigned)

Memory to register transfer instructions

lbu $1,100($2) loads an unsigned byte
lhu $1,100($2) loads an unsigned halfword
lwcz $1,100($2) loads a word to the "z" coprocessor ("z" is the number of the coprocessor)

Note that there is no corresponding "load lower immediate" instruction; this can be done by using addi (add immediate, see above) or ori (or immediate) with the register $0 (whose value is always zero). For example, both addi $1, $0, 100 and ori $1, $0, 100 load the decimal value 100 into register $1.
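
Combining lui with ori gives the usual way to build a full 32-bit constant in two instructions. The sketch below models this on plain integers; the function names are hypothetical, and it assumes the standard behaviour that ori zero-extends its 16-bit immediate while addi sign-extends it (the overflow trap of addi is ignored here).

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Interpret a 16-bit immediate as a signed value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def lui(imm):
    """Load upper immediate: the lower 16 bits become zero."""
    return (imm & 0xFFFF) << 16


def ori(reg, imm):
    """Or immediate: the immediate is zero-extended."""
    return (reg | (imm & 0xFFFF)) & MASK32


def addi(reg, imm):
    """Add immediate: the immediate is sign-extended (overflow trap not modelled)."""
    return (reg + sign_extend16(imm)) & MASK32


def load_constant(value):
    """lui $1, upper ; ori $1, $1, lower"""
    upper = (value >> 16) & 0xFFFF
    lower = value & 0xFFFF
    return ori(lui(upper), lower)


print(hex(load_constant(0xDEADBEEF)))          # 0xdeadbeef
print(hex(ori(0, 0x8000)), hex(addi(0, 0x8000)))  # 0x8000 0xffff8000
```

The second line shows that addi and ori with $0 only agree for immediates below 0x8000, which is why ori is the safer choice for the lower half.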

Register to memory transfer instructions

swcz $1,100($2) stores a word from the "z" coprocessor ("z" is the number of the coprocessor).

Register to register (move) instructions

mfcz $1,$c1 moves a value in the coprocessor register $1 to the main processor register $1 ("z" is the number of the coprocessor)
mtcz $1,$c1 moves a value from the main processor register $1 to the coprocessor register $1
mov.d $fp1,$fp3 moves a value with double precision from the FPU register $3 to the f.p. register $1
mov.s $fp1,$fp3 moves a value with single precision from the FPU register $3 to the f.p. register $1

(values with double precision use two adjacent FPU registers)

The "unsigned" variants of the arithmetic instructions (addu, addiu, subu) differ from the signed ones in that they do not raise an exception on overflow. Subtracting an immediate can be done by adding the negation of that value as the immediate.

Some other important instructions

  • nop (no operation) (machine code 0x00000000, interpreted by CPU as sll $0,$0,0)
  • break (breaks the program, used by debuggers)
  • syscall (used for system calls to the operating system)
  • a set of FPU-related instructions
  • a vast set of virtual instructions, decomposed by the assembler into native instructions


The R10000, code-named "T5", is a microprocessor implementation of the MIPS IV instruction set architecture (ISA) developed by MIPS Technologies, Inc. (MTI), then a division of Silicon Graphics, Inc. (SGI). The chief designers were Chris Rowen and Kenneth C. Yeager. The R10000 microarchitecture was known as ANDES, an abbreviation for Architecture with Non-sequential Dynamic Execution Scheduling.[1] The R10000 largely replaced the R8000 in the high-end and the R4400 elsewhere. Because MTI was a fabless semiconductor company, the R10000 was fabricated by NEC and Toshiba. Previous fabricators of MIPS microprocessors such as Integrated Device Technology (IDT) and three others did not fabricate the R10000, as it was more expensive to do so than the R4000 and R4400.

History

The R10000 was introduced in January 1996 at clock frequencies ranging from 150 MHz to 200 MHz, but was not available in large volumes until later in the year due to fabrication problems at MIPS's foundries. The 200 MHz version was in short supply throughout 1996, and was priced at US$3,000 as a result.[2]

On 25 September 1996, SGI announced that R10000s fabricated by NEC between March and the end of July that year were faulty, drawing too much current and causing systems to shut down during operation. SGI recalled 10,000 R10000s that had shipped in systems as a result, which impacted the company's earnings.[3][4]

In 1997, a version of R10000 fabricated in a 0.25 µm process enabled the microprocessor to reach 250 MHz.

Users

The R10000 users were:

  • SGI, in their workstations, servers and supercomputers
  • NEC, in their Cenju-4 supercomputer
  • Siemens Nixdorf, in their servers
  • Tandem Computers, in their Himalaya fault-tolerant servers

Description

The R10000 is a four-way superscalar design that implements register renaming and executes instructions out-of-order. Its design was a departure from previous MTI microprocessors such as the R4000, which was a much simpler scalar in-order design that relied largely on high clock rates for performance.

The R10000 fetches four instructions every cycle from its instruction cache. These instructions are decoded and then placed into the integer, floating-point or load/store instruction queues depending on the type of the instruction. The decode unit is assisted by pre-decoding performed when instructions are placed in the instruction cache: four bits are appended to every instruction so that the unit can quickly identify which execution unit the instruction is executed in, and the format of the instruction is rearranged to optimize the decode process.

Each of the instruction queues can accept up to four instructions from the decoder, avoiding any bottlenecks. The instruction queues issue their instructions to their execution units dynamically depending on the availability of operands and resources. Each of the queues except for the load/store queue can issue up to two instructions every cycle to its execution units. The load/store queue can only issue one instruction. The R10000 can thus issue up to five instructions every cycle.
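The per-cycle issue limits described above can be illustrated with a minimal sketch. The queue names, the `issue_cycle` helper and the sample instructions are hypothetical; only the limits of two, two and one instruction per cycle come from the text. The model ignores execution-unit conflicts and simply lets any ready instruction issue, regardless of its position in the queue.

```python
# Per-cycle issue limits taken from the text; queue names are illustrative.
ISSUE_LIMIT = {"integer": 2, "floating_point": 2, "load_store": 1}


def issue_cycle(queues):
    """Return the instructions issued in one cycle.

    `queues` maps a queue name to a list of (name, ready) pairs in queue
    order. Any ready instruction may issue, whatever its position, up to
    the per-cycle limit of its queue (a simplification of dynamic issue).
    """
    issued = []
    for queue_name, entries in queues.items():
        ready = [name for name, is_ready in entries if is_ready]
        issued.extend(ready[:ISSUE_LIMIT[queue_name]])
    return issued


# Upper bound on instructions issued per cycle: 2 + 2 + 1.
peak_issue = sum(ISSUE_LIMIT.values())

queues = {
    "integer": [("add", False), ("sub", True), ("or", True), ("xor", True)],
    "floating_point": [("mul.d", True)],
    "load_store": [("lw", False), ("sw", True)],
}
issued = issue_cycle(queues)
```

In this example the stalled `add` and `lw` are bypassed by younger ready instructions, and the third ready integer instruction waits for the next cycle.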

Integer unit

The integer unit consists of the integer register file and three pipelines: two integer and one load/store. The integer register file is 64 bits wide and contains 64 entries, of which 32 are architectural registers and 32 are rename registers used to implement register renaming. The register file has seven read ports and three write ports. Both integer pipelines have an adder and a logic unit. However, only the first pipeline has a barrel shifter and hardware for confirming the prediction of conditional branches. The second pipeline is used to access the multiplier and divider. Multiplies are pipelined, and have a six-cycle latency for 32-bit integers and a ten-cycle latency for 64-bit integers. Division is not pipelined. The divider uses a non-restoring algorithm that produces one bit per cycle. Latencies for 32-bit and 64-bit divides are 35 and 67 cycles, respectively.
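A non-restoring divider of the kind mentioned above can be sketched in software as follows. This is a textbook unsigned formulation, not a description of the R10000 circuit. The `divide_latency` helper and its three-cycle overhead are an inference from the quoted 35- and 67-cycle figures (one cycle per quotient bit plus three), not a documented breakdown.

```python
def nonrestoring_divide(dividend, divisor, bits):
    """Unsigned non-restoring division, one quotient bit per iteration.

    Requires 0 <= dividend < 2**bits and divisor > 0.
    Returns (quotient, remainder).
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be positive")
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        next_bit = (dividend >> i) & 1
        # Shift in the next dividend bit, then subtract the divisor if the
        # partial remainder is non-negative, or add it back if negative.
        if remainder >= 0:
            remainder = 2 * remainder + next_bit - divisor
        else:
            remainder = 2 * remainder + next_bit + divisor
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor  # single correction step at the end
    return quotient, remainder


OVERHEAD_CYCLES = 3  # assumption: inferred from 35 = 32 + 3 and 67 = 64 + 3


def divide_latency(bits):
    """Latency model: one cycle per quotient bit plus a fixed overhead."""
    return bits + OVERHEAD_CYCLES
```

For example, `nonrestoring_divide(13, 3, 4)` steps through four iterations and returns a quotient of 4 with a remainder of 1.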

Floating-point unit

The floating-point unit (FPU) consists of four functional units: an adder, a multiplier, a divide unit and a square root unit. The adder and multiplier are pipelined, but the divide and square root units are not. Adds and multiplies have a latency of three cycles, and the adder and multiplier can each accept a new instruction every cycle. The divide unit has a 12- or 19-cycle latency, depending on whether the divide is single precision or double precision, respectively.

The square root unit executes square root and reciprocal square root instructions. Square root instructions have an 18- or 33-cycle latency for single precision or double precision, respectively. A new square root instruction can be issued to the divide unit every 20 or 35 cycles for single precision and double precision, respectively. Reciprocal square roots have longer latencies: 30 or 52 cycles for single precision (32-bit) and double precision (64-bit), respectively.

The floating-point register file contains sixty-four 64-bit registers, of which thirty-two are architectural and the remaining thirty-two are rename registers. The adder has its own dedicated read and write ports, whereas the multiplier shares its ports with the divider and square root unit.
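Register renaming over a file of 64 physical registers, 32 of them architectural, can be illustrated with a small sketch. The `RenameMap` class and its method names are hypothetical, and the scheme shown (a mapping table plus a free list) is a generic one that is not claimed to match the R10000 implementation in detail.

```python
from collections import deque


class RenameMap:
    """Toy renamer: 32 architectural names over 64 physical registers."""

    def __init__(self, architectural=32, physical=64):
        # Initially architectural register i lives in physical register i.
        self.mapping = list(range(architectural))
        self.free_list = deque(range(architectural, physical))

    def rename(self, dest, sources):
        """Rename one instruction.

        Returns (new physical dest, physical sources, previous physical dest).
        Sources are looked up before the destination is remapped, so an
        instruction that reads and writes the same register sees the old value.
        """
        phys_sources = [self.mapping[s] for s in sources]
        if not self.free_list:
            raise RuntimeError("no free physical register; decode would stall")
        previous = self.mapping[dest]
        new_dest = self.free_list.popleft()
        self.mapping[dest] = new_dest
        return new_dest, phys_sources, previous

    def release(self, physical_register):
        """Return a physical register to the free list once its value is dead."""
        self.free_list.append(physical_register)


renamer = RenameMap()
first = renamer.rename(dest=1, sources=[2, 3])   # r1 = r2 op r3
second = renamer.rename(dest=1, sources=[1, 4])  # r1 = r1 op r4
```

The two writes to architectural register 1 land in different physical registers, which removes the write-after-write dependency and lets the second instruction read the first one's result by its physical name.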

The divide and square root units use the SRT division algorithm. The MIPS IV ISA has a multiply-add instruction. The R10000 implements this instruction with a bypass: the result of the multiply can bypass the register file and be delivered to the add pipeline as an operand. It is therefore not a fused multiply-add, and it has a four-cycle latency.
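The numerical difference between an unfused multiply-add, which rounds after the multiply and again after the add, and a fused one, which rounds once, can be shown with a short sketch. Python floats are IEEE 754 double precision; the fused reference below is computed with exact rational arithmetic and is only a model for comparison. The function names and operand values are illustrative.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Round after the multiply and again after the add (two roundings)."""
    product = a * b      # rounded to double precision
    return product + c   # rounded a second time


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once (reference model)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
```

With these operands the unfused sequence loses the low-order term in the intermediate rounding and returns zero, while the single-rounding reference keeps it.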

Caches

The R10000 has a 32 KB instruction cache and a 32 KB data cache, which were large for the time (1996). The instruction cache is two-way set-associative and has a 64-byte line size. Instructions are partially decoded by appending four bits to each instruction (which has a length of 32 bits) before they are placed in the cache. The 32 KB data cache is two-way interleaved, consisting of two 16 KB banks that are two-way set-associative. It is virtually indexed and physically tagged, which enables the cache to be indexed in the same clock cycle and to maintain cache coherency with the secondary cache.
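The instruction cache parameters given above (32 KB, two-way set-associative, 64-byte lines) determine how an address splits into tag, set index and line offset. The sketch below derives that split. The `split_address` helper is hypothetical, and the sketch treats the address as a plain integer without modelling which bits are virtual or physical.

```python
CACHE_BYTES = 32 * 1024
WAYS = 2
LINE_BYTES = 64

# 32 KB / (2 ways * 64 bytes per line) = 256 sets.
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2(64) = 6
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2(256) = 8


def split_address(address):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345)
```

Two addresses with the same set index but different tags compete for the two ways of one set.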

The secondary cache supports capacities between 512 KB and 16 MB, and is implemented externally with synchronous static random access memory (SSRAM). It is accessed via a dedicated 128-bit bus with 9 bits for error correcting code (ECC). The cache and bus operate at the same clock rate as the R10000, whose maximum was 200 MHz. At 200 MHz, the bus yields a peak bandwidth of 3.2 GB/s.

Addressing

MIPS IV is a 64-bit architecture, but to reduce cost the R10000 does not implement the full 64-bit physical or virtual address. Instead, it has a 40-bit physical address and a 44-bit virtual address, and is thus capable of addressing 1 TB of physical memory and 16 TB of virtual memory.
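The capacities quoted above follow directly from the address widths. A short check, assuming that "TB" in the text means 2^40 bytes:

```python
TIB = 2**40  # assumption: the text's "TB" is taken as 2**40 bytes


def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2**address_bits


physical_tb = addressable_bytes(40) // TIB
virtual_tb = addressable_bytes(44) // TIB
```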

Avalanche system bus

The R10000 used the Avalanche bus, a 64-bit bus that operated at frequencies up to 100 MHz. Avalanche is a multiplexed address and data bus, so at 100 MHz it yielded a maximum theoretical bandwidth of 800 MB/s, but its peak bandwidth was 640 MB/s as it required some cycles to transmit addresses.
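The bandwidth figures quoted for the Avalanche bus, and for the 128-bit secondary cache bus described under Caches, can be reproduced from bus width and clock rate. The sketch uses decimal units (1 MB = 10^6 bytes); the function name is illustrative. The efficiency ratio is simply the quotient of the two Avalanche figures given in the text.

```python
def peak_bandwidth_bytes_per_s(data_width_bits, clock_hz, transfers_per_clock=1):
    """Peak bandwidth of a bus that moves `data_width_bits` per transfer."""
    return data_width_bits // 8 * clock_hz * transfers_per_clock


# Secondary cache bus: 128 data bits at 200 MHz.
secondary_cache_bw = peak_bandwidth_bytes_per_s(128, 200_000_000)

# Avalanche system bus: 64 bits at 100 MHz, before address cycles are counted.
avalanche_bw = peak_bandwidth_bytes_per_s(64, 100_000_000)

# Fraction of cycles left for data if the usable peak is 640 MB/s.
avalanche_efficiency = 640_000_000 / avalanche_bw
```

The resulting ratio of 0.8 indicates that about one cycle in five is spent on something other than data transfer at peak.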

The system interface controller supported glue-less symmetrical multiprocessing (SMP) of up to four microprocessors. Systems using the R10000 with external logic could scale to hundreds of processors. An example of such a system is the Origin 2000.

Fabrication

The R10000 consisted of approximately 6.8 million transistors, of which approximately 4.4 million were contained in the primary caches.[5] The die measured 16.640 by 17.934 mm, for a die area of 298.422 mm². It was fabricated in a 0.35 µm process and packaged in a 599-pad ceramic land grid array (LGA). Before the R10000 was introduced, the Microprocessor Report, covering the 1994 Microprocessor Forum, reported that it was packaged in a 527-pin ceramic pin grid array (CPGA), and that vendors had also investigated the possibility of using a 339-pin multi-chip module (MCM) containing the microprocessor die and 1 MB of cache.[6]

Derivatives

The R10000 was extended by multiple successive derivatives. All derivatives after the R12000 have their clock frequency kept as low as possible to maintain power dissipation in the 15 to 20 W range so they could be densely packaged in SGI's high performance computing (HPC) systems.

R12000

The R12000 was a derivative of the R10000 started by MIPS and completed by SGI. It was fabricated by NEC and Toshiba. The version fabricated by NEC was called the VR12000. The microprocessor was introduced in November 1998. It was available at 270, 300 and 360 MHz. The R12000 was developed as a stop-gap solution following the cancellation of the "Beast" project, which intended to deliver a successor to the R10000. R12000 users included NEC, Siemens-Nixdorf, SGI and Tandem Computers (and later Compaq, after their acquisition of Tandem).

The R12000 improved upon the R10000 microarchitecture by: inserting an extra pipeline stage to improve clock frequency by resolving a critical path; increasing the number of entries in the branch history table, improving prediction; and modifying the instruction queues so they take into account the age of a queued instruction, enabling older instructions to be executed before newer ones where possible.
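Why a larger branch history table can improve prediction is illustrated by the sketch below. It uses two-bit saturating counters indexed by the branch address, a common textbook scheme. The counter scheme, the table sizes and the class and function names are assumptions chosen for illustration and are not taken from the source.

```python
class BranchHistoryTable:
    """Table of two-bit saturating counters indexed by branch address."""

    def __init__(self, entries):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a positive power of two")
        self.entries = entries
        self.counters = [1] * entries  # start in the weakly not-taken state

    def _index(self, pc):
        return (pc >> 2) % self.entries  # instructions are 4 bytes long

    def predict(self, pc):
        """Predict taken when the counter is in one of its two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(table, trace):
    """Count wrong predictions over a list of (pc, taken) outcomes."""
    wrong = 0
    for pc, taken in trace:
        if table.predict(pc) != taken:
            wrong += 1
        table.update(pc, taken)
    return wrong


# Two branches that share an entry in a 4-entry table but not in a 16-entry one.
trace = [(0x1000, True), (0x1010, False)] * 8
small_table_misses = mispredictions(BranchHistoryTable(4), trace)
large_table_misses = mispredictions(BranchHistoryTable(16), trace)
```

In the small table the two branches keep overwriting each other's history, so every prediction is wrong; in the larger table each branch has its own counter and only the first taken branch is mispredicted.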

The R12000 was fabricated by NEC and Toshiba in a 0.25 µm CMOS process with four levels of aluminum interconnect. The use of a new process did not mean that the R12000 was a simple die shrink with a tweaked microarchitecture; the layout of the die was optimized to take advantage of the 0.25 µm process.[7][8] The NEC-fabricated VR12000 contained 7.15 million transistors and measured 15.7 by 14.6 mm (229.22 mm²).

R12000A

The R12000A was a derivative of the R12000 developed by SGI. Introduced in July 2000, it operated at 400 MHz and was fabricated by NEC in a 0.18 µm process with aluminum interconnects.

R14000

The R14000 was a further development of the R12000 announced in July 2001. The R14000 operated at 500 MHz, enabled by the 0.13 µm CMOS process with five levels of copper interconnect in which it was fabricated. It improved on the microarchitecture of the R12000 by supporting double data rate (DDR) SSRAMs for the secondary cache and a 200 MHz system bus.[9]

The R14000 is a typical representative of modern RISC processors capable of out-of-order and speculative instruction execution. As in the Compaq Alpha processor, there are two independent floating-point units for addition and multiplication and, additionally, two units that perform floating-point division and square root operations (not shown in Figure 13). The latter, however, are not pipelined and, with latencies of about 20 to 30 cycles, are relatively slow. In all there are five pipelined functional units to be fed: an address calculation unit, which is responsible for address calculations and the loading and storing of data and instructions; two ALUs for general integer computation; and the floating-point add and multiply pipes already mentioned. The level 1 instruction and data caches have a moderate size of 32 KB and are two-way set-associative. In contrast, the secondary cache can be very large: up to 16 MB. Both the integer and the floating-point register files have a physical size of 64 entries; however, only 32 of them are accessible by software, while the other half is under direct CPU control for register re-mapping. The clock frequency of the MIPS R1x000 processors has always been on the low side. The first R10000 appeared at a frequency of 180 MHz, while the new R14000A runs at 600 MHz, and this will rise slightly during its lifetime. At the initial 600 MHz frequency the theoretical peak performance is 1.2 Gflop/s. Because the floating-point units are independent and have no fused multiply-add capability, a fair fraction of that speed can often be realised. Some improvements have also been made with respect to the earlier chips: the bus speed has been doubled from 100 MHz to 200 MHz, and the L1 cache that ran at 2/3 speed in the predecessor R12000 has been sped up to full speed in the R14000A.
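The 1.2 Gflop/s peak quoted above is consistent with one floating-point add and one floating-point multiply completing per cycle at 600 MHz. A minimal check follows; the figure of two operations per cycle is an assumption drawn from the two pipelined floating-point units described, and the function name is illustrative.

```python
def peak_flops(clock_hz, flops_per_cycle):
    """Theoretical peak floating-point rate in operations per second."""
    return clock_hz * flops_per_cycle


# Assumption: one add and one multiply can complete each cycle.
r14000a_peak = peak_flops(600_000_000, 2)
```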

The R14000A is built in an advanced 0.13 µm technology and, at its present 600 MHz clock frequency, has an extremely low power consumption: only 17 W, several times lower than that of the other processors discussed here. SGI intentionally keeps the clock frequency as low as possible so that it can build "dense" systems that accommodate a large number of processors in a small volume.

R14000A

The R14000A was a further development of the R14000 announced in February 2002. It operated at 600 MHz, dissipated approximately 17 W, and was fabricated by NEC Corporation in a 0.13 µm CMOS process with seven levels of copper interconnect.[9]

R16000

The R16000, code-named "N0", was the last derivative of the R10000. It was developed by SGI and fabricated by NEC in their 0.11 µm process with eight levels of copper interconnect. The microprocessor was introduced on 9 January 2003, debuting at 700 MHz for the Fuel.[10] In April 2003, a 600 MHz version was introduced for the Origin 350. Improvements included 64 KB instruction and data caches.

R16000A

The R16000A refers to R16000 microprocessors with clock rates higher than 700 MHz. The first R16000A was an 800 MHz version, introduced on 4 February 2004. Later, a 900 MHz version was introduced, and this version was, for some time, the fastest publicly known R16000A; SGI later revealed that 1.0 GHz R16000s had been shipped to selected customers. R16000 users included HP and SGI. SGI used the microprocessor in their Fuel and Tezro workstations, and in the Origin 3000 servers and supercomputers. HP used the R16000A in their NonStop Himalaya S-Series fault-tolerant servers, inherited from Compaq via Tandem.

R18000

The R18000 was a canceled further development of the R10000 microarchitecture that featured major improvements. It was described by Silicon Graphics, Inc. at the Hot Chips symposium in 2001. The R18000 was designed specifically for SGI's ccNUMA servers and supercomputers. Each node would have two R18000s connected via a multiplexed bus to a system controller, which interfaced the microprocessors to their local memory and the rest of the system via a hypercube network.

The R18000 improved the floating-point instruction queues and revised the floating-point unit to feature two multiply-add units, quadrupling the peak FLOPS count. Division and square root were performed in separate non-pipelined units in parallel with the multiply-add units. The system interface and memory hierarchy were also significantly reworked. It would have a 52-bit virtual address and a 48-bit physical address. The bidirectional multiplexed address and data system bus of its predecessors would be replaced by two unidirectional DDR links: a 64-bit multiplexed address and write path, and a 128-bit read path. Although the paths are unidirectional, each could be shared with another R18000 through multiplexing. The bus could also be configured in the SysAD or Avalanche configuration for backwards compatibility with R10000 systems.

The R18000 would have included a 1 MB four-way set-associative secondary cache on-die, supplemented by an optional tertiary cache built from single data rate (SDR) or double data rate (DDR) SSRAM or DDR SDRAM with capacities of 2 to 64 MB. The L3 cache had its cache tags, equivalent to 400 KB, located on-die to reduce latency. The L3 cache was accessed via a 144-bit bus, of which 128 bits were for data and 16 bits for ECC. The L3 cache's clock rate was to have been programmable.

The R18000 was to be fabricated in NEC's UX5 process, a 0.13 µm CMOS process with nine levels of copper interconnect. It would have used a 1.2 V power supply and dissipated less heat than contemporary server microprocessors so that it could be densely packed into systems.

References

Further reading

  • Patterson and Hennessy: Computer Organization and Design. The Hardware/Software Interface. Morgan Kaufmann Publishers. ISBN 1-55860-604-1
  • Dominic Sweetman: See MIPS Run. Morgan Kaufmann Publishers. ISBN 1-55860-410-3

See also

  • Godson, a MIPS-like processor architecture developed at Chinese Academy of Sciences

External links

  1. "MIPS Claims Floating-Point Record With R10000, The Hottest Chip At The Microprocessor Forum".
  2. Gwennap, "Alpha Sails, PowerPC Flails", p. 8.
  3. "Defects Revealed In SGI R10000 MIPS Systems, Revenues Hit".
  4. "SGI To Recall 10,000 R10000s".
  5. Yeager, "The R10000 Superscalar Microprocessor", p. 28.
  6. "MIPS R10000 Uses Decoupled Architecture", p. 4.
  7. Gwennap, "MIPS R12000 to Hit 300 MHz".
  8. Halfhill, "RISC Fights Back with the Mips R12000".
  9. "SGI to develop MIPS chips for Origin, Onyx".
  10. Silicon Graphics, Inc., "SGI Boosts Price/Performance on Silicon Graphics Fuel Visual Workstation Family up to 25%".