CPU Node board failure on Origin 2000


Hi there. Here at SANBI we have an Origin 2000 made up of two machines in one cabinet. Historically they have appeared as a single 'virtual machine', with the 'bottom' one playing the role of 'master'. A few weeks ago the machine rebooted itself, and the 'top' machine took over as 'master'. Now, if you try to boot the bottom machine, or both machines, you get an error message like this:

 .......
 DONE
 Checking partitioning information .........              DONE
 Loading BASEIO prom .......................              DONE

 BASEIO PROM Monitor SGI Version 6.111  built 09:43:30 AM May 24, 2002
 (BE64) 13 CPUs on 7 nodes found.
 ****************************************************************
 *    PANIC: Boards in same module show different moduleids.    *
 *      PANIC: Failed to automatically assign moduleid(s)       *
 *    Please assign globally unique module id(s) at the MSC.   *
 ****************************************************************

 Switching into Power-On Diagnostics mode...


 1A 000: *** Software entry into POD mode from IO6 POD mode on node 0
 1A 000: POD IOC3 Dex>

As you can see, the result is that the machine drops into POD. Before this happened there was a problem with one of the CPU node boards, but when I tried to use the 'disable' command from the serial console, it didn't seem to do anything besides freeze the machine. BTW, the MMSC for this cabinet is not functional at present, so the only access I have to the hardware is by plugging a serial cable into one of the console ports of the two machines.

This 'accident' happened just as I was starting the job at this site, so I'm not very familiar with the machine setup. Does anyone have suggestions for further diagnostics?

Response

Hmm... if the machine is still connected with 2 (max. 4) CrayLink cables, you have a single machine. You can check this by opening the right baffle and looking for the thicker cables. I think that by 'virtual machine' you mean that your systems use the partition feature. With that you can create smaller subunits from a larger installation. The smallest subunit in an Origin 2000 is one module, and each subunit runs its own IRIX kernel (installation).

But I have never used the partition feature, so I can't give more specific hints on how to deal with a problem there. I find it hard to believe that you can run a partitioned system without the MMSC, because the MMSC is needed to shut down and restart a specific partition.

If an Origin goes down, the system creates a FRU analysis under /var/adm/crash. You will also find some information in the SYSLOG file. If you have no need for the FRU files, you can delete them to get the disk space back.

This looks like an older IRIX installation, because the latest PROM is 6.156 from Nov 18, 2003.

When the Origin modules are CrayLinked together, each one needs a unique id. In a standard installation the lower module of rack 1 is numbered '1' and the upper is '2'. The lower module of a second rack gets '3', and so on.
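The numbering convention above can be written down as a small helper. This is only a sketch: the function name `expected_module_id` is made up for illustration, and it assumes exactly two modules per rack and that "and so on" continues the same pattern (so the upper module of rack 2 would be '4').

```python
def expected_module_id(rack: int, position: str) -> int:
    """Return the conventional module id for a module in a rack.

    rack is 1-based; position is "lower" or "upper".
    Assumption: two modules per rack, numbered bottom to top.
    """
    if rack < 1:
        raise ValueError("rack numbers start at 1")
    offsets = {"lower": 1, "upper": 2}
    if position not in offsets:
        raise ValueError("position must be 'lower' or 'upper'")
    # Each earlier rack contributes two ids; then add the slot offset.
    return 2 * (rack - 1) + offsets[position]


if __name__ == "__main__":
    for rack in (1, 2):
        for position in ("lower", "upper"):
            print(f"rack {rack} {position}: id {expected_module_id(rack, position)}")
```

For a single cabinet like the one in the question, this gives id 1 for the bottom module and id 2 for the top one.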

If a module has lost its configuration, or after clearing the logs from POD, or after changing the MSC, it can happen that two or more modules use the same id. You then have to assign the ids manually by entering command-line mode from the maintenance menu. Type 'help' to get a list of commands. I'm sure there is a command like 'moduleid' or just 'module'. With this command you can get the current id and also assign a new one.
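Once you have noted the id that each module reports, spotting a clash is simple bookkeeping. The sketch below is not an SGI tool; the function name `find_duplicate_ids` and the module labels are invented for illustration, and the ids are assumed to have been read off by hand.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_ids(observed: Dict[str, int]) -> Dict[int, List[str]]:
    """Map each module id that is used more than once to the modules using it.

    observed maps a human-readable module label to the id it reported.
    """
    by_id: Dict[int, List[str]] = defaultdict(list)
    for label, module_id in observed.items():
        by_id[module_id].append(label)
    # Keep only the ids claimed by two or more modules.
    return {mid: sorted(labels) for mid, labels in by_id.items() if len(labels) > 1}


if __name__ == "__main__":
    # Hypothetical readings: both modules in the cabinet report id 1.
    readings = {"rack1-lower": 1, "rack1-upper": 1}
    print(find_duplicate_ids(readings))
```

An empty result means every module has a unique id; any entry in the result names the modules of which all but one need a new id.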

So shut down both installations and restart only the lower module. If possible, let the system boot into IRIX. The MSC then shows you the current ID. The LEDs show something like 'P0M 1 C', which means "Partition 0, Module 1, Console". Shut down the system and repeat the same step with the upper module. If both use the same ID, you have to reassign one of them. Restart the module and enter the maintenance menu. Press '5' for the command-line menu. Type 'moduleid 1', for example, followed by 'update' to save the new configuration. After this, restart the systems.
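If you note down the MSC display for each module, a short parser can pull out the partition and module numbers for comparison. This is a sketch based only on the single example 'P0M 1 C' given above; the exact display format, the spacing, and the names `MscDisplay` and `parse_msc_display` are assumptions.

```python
import re
from typing import NamedTuple, Optional


class MscDisplay(NamedTuple):
    partition: int
    module: int
    console: bool


# Assumed layout: 'P<partition>M <module>' with an optional trailing 'C'.
_PATTERN = re.compile(r"^P(\d+)\s*M\s*(\d+)\s*(C)?$")


def parse_msc_display(text: str) -> Optional[MscDisplay]:
    """Parse a display string such as 'P0M 1 C'; return None if it does not match."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    return MscDisplay(
        partition=int(match.group(1)),
        module=int(match.group(2)),
        console=match.group(3) is not None,
    )


if __name__ == "__main__":
    print(parse_msc_display("P0M 1 C"))
```

If the lower and upper modules both parse to the same module number, that is the clash the PROM is complaining about, and one of them needs a new id.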

Something similar can happen when moving or replacing node boards from one module to another. In my case (a 2-rack system with 32 CPUs) I tried clearalllogs and initlogs from the POD. After this the system started to renumber all nodes and modules. But don't try this until you have checked how to set up your partitions!


