Onyx2/Origin2000 Node boards
An Onyx2 node fits on a single 16" by 11" printed circuit board that contains one or two processors, the main memory, the directory memory and the Hub ASIC. The node board plugs into the backplane through a 300-pad CPOP (Compression Pad-on-Pad) connector. The connector actually combines two connections, one to the NUMAlink router network and another to the XIO I/O subsystem.
Processor
Each processor and its secondary cache are contained on a HIMM (Horizontal Inline Memory Module) daughter card that plugs into the node board. At the time of introduction, the Onyx2 used the IP27 board, featuring one or two R10000 processors clocked at 180 MHz, each with a 1 MB secondary cache. A high-end model with two 195 MHz R10000 processors with 4 MB secondary caches was also available. In February 1998, the IP31 board was introduced with two 250 MHz R10000 processors with 4 MB secondary caches. Later, the IP31 board was upgraded to support two 300, 350 or 400 MHz R12000 processors. The 300 and 400 MHz models had 8 MB L2 caches, while the 350 MHz model had 4 MB L2 caches. Near the end of its life, a variant of the IP31 board that could use the 500 MHz R14000 with 8 MB L2 caches was made available.
Known node board CPU speeds
IP27: CPUs are mounted directly to the node board individually.
180 MHz R10000 (cannot be mixed with node boards of other speeds)
195 MHz R10000
IP31: CPUs are mounted in pairs (along with their respective caches) to a PIMM, a pluggable module which then mounts to the node board.
250 MHz R10000
300 MHz R12000
350 MHz R12000 (cannot be used in configurations with more than 8 CPUs)
400 MHz R12000
500 MHz R14000
Main memory and directory memory
Each node board supports a maximum of 4 GB of memory through 16 DIMM slots, using proprietary ECC SDRAM DIMMs with capacities of 16, 32, 64 and 256 MB. Because the memory bus is 144 bits wide (128 bits for data and 16 bits for ECC), memory modules are inserted in pairs. Directory memory, which contains information on the contents of remote caches for maintaining cache coherency, must be used in configurations with more than 32 processors, as the Onyx2 uses a distributed shared memory model. The directory memory is contained on proprietary DIMMs that are inserted into eight DIMM slots set aside for its use. In configurations with 32 or fewer processors, the directory memory is contained within the main memory.
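As a rough check on the figures above, the following Python sketch totals a node board's main memory from its populated DIMM pairs. The function name node_memory_mb and the rule that both DIMMs in a pair have the same capacity are assumptions made for illustration; the article only states that modules are inserted in pairs.

```python
# DIMM capacities (MB) and slot count as stated in the article.
DIMM_SIZES_MB = (16, 32, 64, 256)
DIMM_SLOTS = 16


def node_memory_mb(pair_sizes_mb):
    """Total main memory in MB, given the DIMM size used in each populated pair.

    Assumption for illustration: both DIMMs in a pair have the same capacity.
    """
    if len(pair_sizes_mb) > DIMM_SLOTS // 2:
        raise ValueError("a node board has only eight DIMM pairs")
    for size in pair_sizes_mb:
        if size not in DIMM_SIZES_MB:
            raise ValueError(f"unsupported DIMM size: {size} MB")
    return sum(2 * size for size in pair_sizes_mb)


# Eight pairs of 256 MB DIMMs give 4096 MB, the 4 GB maximum.
print(node_memory_mb([256] * 8))
# A single pair of 16 MB DIMMs is the smallest configuration in this model.
print(node_memory_mb([16]))
```

The 144-bit bus width is simply the 128 data bits plus the 16 ECC bits.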
Hub ASIC
The Hub ASIC interfaces the processors, memory and XIO to the NUMAlink 2 system interconnect. The ASIC contains five major sections: the crossbar (referred to as the "XB"), the I/O interface (referred to as the "II"), the network interface (referred to as the "NI"), the processor interface (referred to as the "PI") and the memory and directory interface (referred to as the "MD"), which also serves as the memory controller. The interfaces communicate with each other via FIFO buffers that are connected to the crossbar. When two processors are connected to the Hub ASIC, the node does not behave in an SMP fashion. Instead, the two processors operate separately and their buses are multiplexed over the single processor interface. This was done to save pins on the Hub ASIC. The Hub ASIC is clocked at 100 MHz and contains 900,000 gates fabricated in a five-layer metal process.
Origin2000/Onyx2 Node Board LEDs
There is a series of LEDs on the bulkhead of Origin2000/Onyx2 processor node boards. As the system is powered up, the LEDs display information as the system executes boot-time PROM code. If the system hangs or cannot complete execution of the PROM code, the LED display can be used as a diagnostic tool.
Each Origin2000 or Onyx2 node board has two columns of LEDs. Each column has eight LEDs; the LEDs in the left column are assigned to CPU A, the right column to CPU B.
The following diagram represents the position and function of the node board LEDs. The descriptions "Binary Group 1 and 2", "Processor Activity" and "Processor Heartbeat" apply to the LEDs in either column:
           CPU A CPU B
          ---X     X--
Binary  _ |  X     X  |
Group 1   |  X     X  |
          ---X     X  |
                      |--Processor Activity
          ---X     X  |
Binary  _ |  X     X  |
Group 2   |  X     X--
          ---X     X-----Processor Heartbeat (after IRIX loads)
The first seven pairs of LEDs indicate processor activity during PROM code execution and after the operating system (IRIX) loads. The eighth pair of LEDs indicates the processor heartbeat and is only active after IRIX loads.
Decoding the Node Board LED Display
If the system does not successfully execute PROM code, the coded pattern of LEDs illuminated on the node board can be used as a diagnostic tool to determine the failure point.
The LEDs should be read one column (CPU) at a time. Begin at the top of the column of LEDs, and read the first four LEDs as a group (marked as "Binary Group 1" in the diagram). Assign a binary numeric value to each LED, where an illuminated LED receives a value of "0", and an inactive LED receives a value of "1".
Note that the assignment of binary values is the reverse of typical "on"/"off" value assignments.
So if the first four LEDs were off, this pattern would be assigned the value 1111. Conversely, if all four LEDs were on, the value would be 0000. If the LED illumination pattern was on-off-off-on, the value would be 0110.
Move to the second set of four LEDs in the same processor column (marked as "Binary Group 2" in the diagram) and apply the same binary values used with the first group of four.
Assign each group of binary numbers a hexadecimal value using the following list of equivalents:
Binary Value    Hexadecimal Value
0000            0
0001            1
0010            2
0011            3
0100            4
0101            5
0110            6
0111            7
1000            8
1001            9
1010            a
1011            b
1100            c
1101            d
1110            e
1111            f
As an example, if the first group of LEDs displayed as on-on-on-on, the binary value would be 0000 and the first hexadecimal value would be 0. If the second group of LEDs displayed off-off-on-off, the binary value would be 1101 and the second hexadecimal value would be d. The first and second hexadecimal values are read in sequence to determine the PROM code execution point. In the previous example, the LED code would be read as 0d.
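The decoding steps above can be sketched in Python. This is a minimal illustration, not a tool from SGI: the function name decode_column and the use of booleans for LED states are assumptions made here.

```python
def decode_column(leds):
    """Decode one CPU column of eight LEDs (top to bottom) into a two-digit hex code.

    `leds` holds eight truthy/falsy values, where True means the LED is lit.
    A lit LED counts as binary 0 and a dark LED as binary 1, as described above.
    """
    if len(leds) != 8:
        raise ValueError("expected exactly eight LED states")
    bits = "".join("0" if lit else "1" for lit in leds)
    group1, group2 = bits[:4], bits[4:]  # Binary Group 1, Binary Group 2
    return format(int(group1, 2), "x") + format(int(group2, 2), "x")


ON, OFF = True, False
# The article's example: on-on-on-on followed by off-off-on-off.
print(decode_column([ON, ON, ON, ON, OFF, OFF, ON, OFF]))  # 0d
```

Reading both columns of a board means calling the function once for CPU A and once for CPU B.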
Knowing the location of certain system components is helpful during diagnosis. For instance, the CPU, Scache and Hub chip are all located on the node board. The Bridge chip is on the Base I/O board (IO6 or IO6G) or other I/O board. The Crossbow (XBOW) chip is on the midplane.
Diagnosing PROM Code Boot Progress
If a system, node board or CPU does not successfully complete PROM code initialization, the static value displayed on the respective LEDs can be used to determine the point at which the node board or CPU failed.
During the initialization process there are also times when the node boards execute certain parts of the PROM code sequentially. As each node board executes the code in turn, the remaining node boards wait for that board to signal completion. If the node board hangs while executing that part of the PROM code, the remaining boards continue to wait for the completion signal, so this might appear as a complete system hang. The LEDs on the node boards can be read to determine which boards have actually failed and which are merely waiting.
Hexadecimal values between 00 and 7f are used to indicate the progress of PROM code execution. Values within that range that are not listed are unused.
If a suspected point of failure is not listed, the cause cannot be isolated to a specific component without additional proprietary diagnostics.
LED Code  Boot Phase                              Suspected Point of Failure
00        System Reset                            CPU
01        Init CPU                                CPU
02        Test CPU                                CPU
03        Run TLB                                 CPU
04        Test Pri Instruction Cache              CPU
05        Test Pri Data Cache                     CPU
06        Test Secondary Cache                    CPU
07        Flush all Caches                        CPU
0a        Invalidate Pri Inst Cache               CPU
0b        Invalidate Pri Data Cache               CPU
0c        Invalidate Secondary Cache              Scache
0d        Succeed - Jump to Main
0e        About to increase PROM Access Speed     PROM
0f        Increased PROM Access Speed             PROM
10        Init Pri Data Cache                     CPU
11        Init Pri Instruction Cache              CPU
12        Init CPU COP0 Registers                 CPU
13        Flush TLB                               CPU
1a        Probe for MSC                           MSC, nodeboard, midplane
1b        Probe for Junk UART                     MSC, nodeboard, midplane
1c        Done with MSC Probe                     MSC, UART
1d        About to Init UART                      MSC
1e        Done with UART Init                     UART
20        Start Power on Diagnostics (POD)
21        About to enter POD mode C portion
22        About to enter POD prompt loop
23        About to enter POD mode (assembler)
24        Local CPU (A/B) Arbitration             CPU
25        Init Secondary Cache                    Scache
28        About to perform 1st Local Barrier      CPU, nodeboard hub
2a        Config DEX mode - Stack and Data        CPU, Scache
2b        Reached Main                            CPU
38        1st Local Barrier Succeeded
3c        About to Jump to UALIAS Space           RAM
3d        Jumped to UALIAS Space                  RAM
3e        About to Jump to Cached Space           Scache
3f        Jumped to Cached Space                  Scache
40        About to Test Stack Area                RAM
41        Done Testing Stack Area                 Scache, RAM
45        About to enter Slave Launch Loop        Master CPU
46        Received Launch Interrupt
47        Calling Launched Function               RAM, Scache
48        Launched Function Returned
4a        About to Init Hub MD & SIMM Controls    Hub
4b        About to Probe & Config Memory Size     RAM
4c        About to Init PCF8512C Chip             MSC, Midplane
4d        Done Init - PCF8512C Chip               MSC
4f        About to Discover Hub I/O               Hub
50        About to Write Hub Config info
51        About to Write Router Config info
52        About to Init Hub I/O                   Hub
53        About to Probe I/O for Console          Base I/O, Bridge, XBow
54        Probe I/O for Console Success           Base I/O
56        Hub I/O Init Done                       Midplane, I/O card, Base I/O
57        Saved Errors Stored from Reset          Hub
58        Cleared all Error Registers             Hub
59        Enabled Error Checking                  Hub
5a        Done Discovering Hub I/O                Hub
5b        About to Init NMI Handler Area          RAM, Scache
5c        About to Test Hub Interrupts            Hub
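A lookup against the boot progress table can be sketched as follows. Only a handful of rows are copied into the dictionary, and the names BOOT_PROGRESS and describe_hang are hypothetical; a real helper would carry the full table.

```python
# Excerpt of the boot progress table: code -> (boot phase, suspected point of failure).
# An empty string means the table lists no suspected component for that code.
BOOT_PROGRESS = {
    "00": ("System Reset", "CPU"),
    "06": ("Test Secondary Cache", "CPU"),
    "0c": ("Invalidate Secondary Cache", "Scache"),
    "0d": ("Succeed - Jump to Main", ""),
    "4b": ("About to Probe & Config Memory Size", "RAM"),
    "53": ("About to Probe I/O for Console", "Base I/O, Bridge, XBow"),
}


def describe_hang(code):
    """Return a one-line note for a static LED code read from a hung CPU."""
    code = code.lower()
    if code not in BOOT_PROGRESS:
        return f"{code}: not in this excerpt"
    phase, suspect = BOOT_PROGRESS[code]
    if not suspect:
        return f"{code}: stopped at '{phase}'; no suspected component listed"
    return f"{code}: stopped at '{phase}'; suspect {suspect}"


print(describe_hang("4B"))
print(describe_hang("0d"))
```

Codes with no suspected component fall under the earlier caveat: the cause cannot be narrowed down without additional diagnostics.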
Fatal Node Board Error Codes
If a node board suffers a fatal error during PROM diagnostics and is disabled, the LEDs of each CPU on that node board record the failure code. The failure codes displayed on the node board LEDs can be read using the information in the "Decoding the Node Board LED Display" section of this article.
LED Code  Reason for Failure                    Suspected Point of Failure
81        CoProcessor Failed Register Test      CPU
83        Pri Instruction Cache Failed Test     CPU
84        Pri Data Cache Failed Test            CPU
85        Secondary Cache Failed Test           Scache
86        CPU Disabled by Another CPU           CPU
87        Real-Time Counter Broken              CPU
8c        General Exception
91        Hub Local Failed                      CPU, Hub
93        Some Node Not Premium (>32 CPUs)      Directory Memory
98        Node has no Local memory              No RAM or Disabled RAM
9a        CPU is Disabled                       CPU Disabled
9b        Memory Download Failed                RAM, PROM
9e        Failed Writing Hub Config Info        RAM
9f        Failed Writing Router Config Info     RAM
a0        Hub I/O Init Failed                   Hub
a1        Node failed Init                      RAM
a4        Hub Chip Failed                       Hub
a5        Router Chip Failed                    Router
a6        Waiting for Reset To Go               Hub
a7        LLP Failed After Reset                Hub
a8        LLP Never Up After Reset              Hub
a9        Node Board - No Good Local Memory     No RAM or Disabled RAM
ab        Network Discovery Failed
ac        NASID Calculation Failed              CrayLink Cabling Error
ad        Route Calculation Failed              CrayLink Cabling Error
ae        Route Distribution Failed             Check Router LEDs for Error
af        NASID Distribution Failed             CrayLink Cabling Error
b0        Master Not Assigned NASID             Check Router LEDs for Error
b1        Module ID Arbitration Failed          MSC
b2        Origin2000 Craylinked to Origin200    Illegal Configuration
b3        Partition Config Error                User Error
Node Board Early Exception LED Codes
If an exception occurs in the PROM code execution before the exception can be displayed by normal means, the CPU LEDs will begin blinking an error code. As the LEDs flash, they will alternate between flashing all eight LEDs and flashing only the LEDs that indicate the error code.
LED Code  Exception                 Suspected Point of Failure
f2        General Exception         CPU
f3        ECC Exception             CPU
f4        TLB Exception             CPU
f5        XTLB Exception            CPU
f6        Unimplemented Exception   CPU
f7        Cache Error Exception     CPU
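The three families of codes in this article (boot progress, fatal node board errors and early exceptions) can be told apart by value. The sketch below is an illustration only: the function name classify_led_code is hypothetical, and the fatal range 81 to b5 is an assumption taken from the lowest and highest failure codes that appear in the tables, not a documented boundary.

```python
def classify_led_code(code):
    """Sort a two-digit hex LED code into the families described in this article."""
    value = int(code, 16)
    if not 0x00 <= value <= 0xFF:
        raise ValueError("LED codes are two hex digits")
    if value <= 0x7F:
        # Stated in the article: 00 through 7f indicate PROM boot progress.
        return "boot progress"
    if 0xF2 <= value <= 0xF7:
        # Early exception codes listed in the article (displayed blinking).
        return "early exception"
    if 0x81 <= value <= 0xB5:
        # Assumption: bounds taken from the listed failure codes.
        return "fatal node board error"
    return "unlisted"


for code in ("0d", "a4", "f3", "c0"):
    print(code, "->", classify_led_code(code))
```

This applies to codes read during PROM execution; the patterns shown after initialization completes are a separate display and are not failure indications.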
Post Initialization LED Displays
After the CPUs have completed initialization they display a different set of LED patterns.
Prior to IRIX booting, the master CPU will alternate the display of 55 and 00 (see "Decoding the Node Board LED Display" for additional information).
After IRIX loads, the bottom (eighth) LED is used to indicate the CPU heartbeat. LEDs 1 through 7 will progressively illuminate from bottom to top to indicate CPU activity.
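To picture the 55/00 alternation shown before IRIX boots, the following sketch converts a hex code back into a column of LED states using the article's inverted convention (binary 0 is lit, binary 1 is dark). The function names led_pattern and draw are hypothetical.

```python
def led_pattern(code):
    """Return eight booleans (top to bottom), True where the LED would be lit.

    Uses the article's inverted convention: binary 0 = lit, binary 1 = dark.
    """
    bits = format(int(code, 16), "08b")
    return [bit == "0" for bit in bits]


def draw(code):
    """Render a column as text, '*' for a lit LED and '.' for a dark one."""
    return "".join("*" if lit else "." for lit in led_pattern(code))


print(draw("55"))  # *.*.*.*.
print(draw("00"))  # ********
```

Under this convention the master CPU column alternates between every other LED lit (55) and all eight LEDs lit (00).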
Full list of codes
LED Value  Name             Description
0x01       INITCPU          Initialize the general-purpose registers (GPRs), floating-point registers (FPRs), and COP0 registers
0x02       TESTCP1          Test the COP1 registers
0x03       RUNTLB           Switch to mapped mode
0x04       TESTICACHE       Test the primary instruction cache
0x05       TESTDCACHE       Test the primary data cache
0x06       TESTSCACHE       Test secondary cache
0x07       FLUSHCACHES      Flush all caches
0x0a       INVICACHE        Invalidate the primary instruction cache
0x0b       INVDCACHE        Invalidate the primary data cache
0x0c       INVSCACHE        Invalidate secondary cache
0x0d       INMAIN           Successfully jumped to the main() function
0x0e       SPEEDUP          Prepare to increase PROM access speed
0x0f       SPEEDUPOK        Successfully increased PROM access speed
0x1a       MSCPROBE         Prepare to probe for presence of the MSC
0x1c       DONEPROBE        Completed the probe for the presence of the MSC
0x1d       UARTINIT         Prepare to initialize selected UART
0x1e       UARTINITDONE     Completed the initialization of the selected UART
0x21       PODLOOP          Prepare to enter POD mode (C code portion)
0x22       PODPROMPT        Prepare to enter the POD prompt loop
0x23       PODMODE          Prepare to enter POD mode (assembly code portion)
0x24       LOCALARB         Perform local arbitration (between CPU A and CPU B)
0x28       BARRIER          Prepare to perform first local barrier
0x2a       MAKESTACK        Prepare to configure Dex mode stack and data
0x2b       MAIN             Code execution has reached the main() function
0x31       NMI              Received external nonmaskable interrupt
0x35       RTCINIT          Prepare to initialize the HUB real-time counter
0x36       RTCINITDONE      Completed the initialization of the HUB real-time counter
0x38       BARRIEROK        Successfully completed the first local barrier operation
0x3c       JUMPRAMU         Prepare to jump to UALIAS space
0x3d       JUMPRAMUOK       Successfully jumped to UALIAS space
0x3e       JUMPRAMC         Prepare to jump to cached space
0x3f       JUMPRAMCOK       Successfully jumped to cached space
0x40       STACKRAM         Prepare to test the stack area of memory
0x41       STACKRAMOK       Successfully tested the stack area of memory
0x45       LAUNCHLOOP       Prepare to enter the slave launch loop
0x46       LAUNCHINTR       Received a launch interrupt
0x47       LAUNCHCALL       Call the launched() function
0x48       LAUNCHDONE       Returned from the launched() function
0x4a       MDIRINIT         Prepare to initialize the HUB MD and SIMM controls
0x4b       MDIRCONFIG       Prepare to determine and configure the memory size
0x4c       I2CINIT          Prepare to initialize the PCF8584 I2C chip
0x4d       I2CDONE          Completed the initialization of the PCF8584 I2C chip
0x4f       IODISCOVER       Prepare to discover Hub I/O
0x50       HUB_CONFIG       Prepare to write Hub configuration information into the KLCONFIG structure
0x51       ROUTER_CONFIG    Prepare to write the router configuration information into the KLCONFIG structure
0x52       INITIO           Prepare to initialize the I/O section of the Hub
0x53       CONSOLE_GET      Prepare to probe the I/O section of the Hub
0x54       CONSOLE_GET_OK   Successfully completed the probe of the I/O section for the console
0x56       INITIODONE       Completed the initialization of the I/O section of the Hub
0x57       STASH2           Reset error state saved
0x58       STASH3           Clear Hub error registers
0x59       STASH4           Enable Hub error checking
0x5a       IODISCOVER_DONE  Completed the discovery of the Hub I/O
0x5b       NMI_INIT         Prepare to initialize the NMI handler area
0x5c       TEST_INTS        Prepare to test Hub interrupts
0x5d       IORESET          Prepare to perform early reset of Hub I/O section
Failure codes
LED Value  Name             Description
0x81       CP1              Register test failed
0x82       RESTART          Restart master was unable to load the BaseIO PROM
0x83       ICACHE           Primary instruction cache test failed (the failing FRU is the node board)
0x84       DCACHE           Primary data cache test failed (the failing FRU is the node board)
0x85       SCACHE           Secondary cache test failed (the failing FRU is the node board)
0x86       KILLED           CPU disabled by another node
0x87       RTC              Real-time counter not counting
0x91       HUBLOCAL         Hub local arbitration failed; ignore this CPU
0x93       PREM_DIR_REQ     All nodes must have premium DIMMs for this configuration
0x97       MAINRET          Returned from main() function
0x98       NOMEM            Node board does not have local memory
0x9a       DISABLED         CPU disabled by an environment variable
0x9b       DOWNLOAD         Failure occurred while downloading the PROM code into RAM
0x9c       COREDEBUG        Boot process cannot set the CORE debug registers
0x9d       IODISCOVER       Failure occurred during the HUB I/O discovery process
0x9e       HUB_CONFIG       Failure occurred while writing the HUB information into the KLCONFIG structure
0x9f       ROUTER_CONFIG    Failure occurred while writing the router information into the KLCONFIG structure
0xa0       HUBIO_INIT       Failure occurred while trying to initialize the HUB I/O interface
0xa1       CONFIG_INIT      Failure occurred while trying to initialize the KLCONFIG structure
0xa2       RTRCHIP          Failure occurred while testing the Router chip
0xa3       LINKDEAD         Failure occurred while testing the LLP link
0xa4       HUBBIST          Failure occurred while the HUB chip executed the built-in self test (BIST)
0xa5       RTRBIST          Failure occurred while the router chip executed the built-in self test (BIST)
0xa6       RESETWAIT        Waiting for a reset to occur
0xa7       LLP_FAIL         LLP failed after the reset
0xa8       LLP_NORESET      LLP never came up after the reset
0xa9       BADMEM           Local memory is corrupted
0xab       NET_DISCOVER     Failure occurred for the Hub network discovery
0xac       NASID_CALC       Failure occurred for the NASID calculation
0xad       ROUTE_CALC       Failure occurred for the route calculation
0xae       ROUTE_DIST       Failure occurred for the route distribution
0xaf       NASID_DIST       Failure occurred for the NASID distribution
0xb0       NO_NASID         Master did not assign a NASID
0xb1       NO_MODULEID      Failure occurred for the module ID arbitration
0xb2       MIXED_SN00       Origin 200 system is configured with an Origin 2000 system
0xb3       ERRPART          Failure occurred in the partition configuration
0xb4       MODEBIT          Failure occurred while copying the processor mode bits
0xb5       BACK_CALC        Failure occurred while calculating the midplane frequency