CPU Node board failure on Origin 2000
Latest revision as of 15:26, 20 July 2019
Hi there. Here at SANBI we have an Origin 2000 made up of two machines in one cabinet. Historically they have appeared as a single 'virtual machine', with the 'bottom' one playing the role of 'master'. A few weeks ago, the machine rebooted itself, and the 'top' machine took over as 'master'. And now, if you try to boot up the bottom machine - or both machines - you get an error message like this:
    ....... DONE
    Checking partitioning information ......... DONE
    Loading BASEIO prom ....................... DONE
    BASEIO PROM Monitor SGI Version 6.111 built 09:43:30 AM May 24, 2002 (BE64)
    13 CPUs on 7 nodes found.
    ****************************************************************
    * PANIC: Boards in same module show different moduleids. *
    * PANIC: Failed to automatically assign moduleid(s) *
    * Please assign globally unique module id(s) at the MSC. *
    ****************************************************************
    Switching into Power-On Diagnostics mode...
    1A 000: *** Software entry into POD mode from IO6 POD mode on node 0
    1A 000: POD IOC3 Dex>
As you can see, the result is being dropped into POD. Before this happened there was a problem on one of the CPU node boards, but when I tried to use the 'disable' command from the serial console, it didn't seem to do anything besides freeze the machine up. BTW, the MMSC for this cabinet is not functional at present, so the only access I have to the hardware is by plugging a serial cable into one of the console ports of the two machines.
This 'accident' happened just as I was coming on the job on this site, so I'm not very familiar with the machine setup - does anyone have suggestions for further diagnostics?
Response
Hmm.... if the machine is still connected with 2 (max. 4) CrayLink cables you'll have a single machine. You can check this by opening the right baffle and looking for the thicker cables. I think with 'virtual machine' you mean that your systems use the partition feature. With that you can create smaller subunits from a larger installation. The smallest subunit in an Origin 2000 is one module, and each runs its own IRIX kernel (installation).
But... I have never used the partition feature, so I can't give more specific hints on how to deal with a problem there. I can't believe that you can run a system which uses partitions without the MMSC, because the MMSC is needed to shut down and restart a specific partition.
If an Origin goes down, the system creates a FRU analysis under /var/adm/crash. You will also find some info in the SYSLOG file. If you have no need for the FRU files, you can delete them to get the disk space back.
This looks like an older IRIX installation, because the latest PROM is 6.156 from Nov 18, 2003.
When the Origin modules are CrayLinked together, each one needs a unique id. In a standard installation the lower module of rack 1 is numbered '1' and the upper is '2'. The lower module of a 2nd rack gets '3', and so on.
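The numbering convention above can be written down as a small sketch. This is only an illustration in Python, not an SGI tool: the function names `standard_module_id` and `duplicate_ids` are made up here, and it assumes exactly two modules per rack, numbered as described.

```python
from collections import Counter


def standard_module_id(rack: int, position: str) -> int:
    """Module id under the convention described above (hypothetical helper).

    Rack 1 lower -> 1, rack 1 upper -> 2, rack 2 lower -> 3, and so on.
    Assumes two modules per rack and racks counted from 1.
    """
    if rack < 1:
        raise ValueError("rack numbers start at 1")
    if position not in ("lower", "upper"):
        raise ValueError("position must be 'lower' or 'upper'")
    return 2 * (rack - 1) + (1 if position == "lower" else 2)


def duplicate_ids(assigned: dict[str, int]) -> list[int]:
    """Return every id that is used by more than one module, sorted."""
    counts = Counter(assigned.values())
    return sorted(module_id for module_id, n in counts.items() if n > 1)
```

For example, `duplicate_ids({"lower": 1, "upper": 1})` reports id 1 as clashing, which is the kind of situation the PANIC message complains about.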
If a module has lost its configuration, or when clearing the logs from POD, or after changing the MSC, it can happen that two or more modules use the same id. So you have to assign the ids manually by entering the command line mode from the maintenance menu. Type 'help' to get a list. I'm sure there is a command like 'moduleid' or just 'module'. With this command you can get the current id and also assign a new one.
So shut down both installations and restart only the lower module. If possible, let the system boot into the IRIX OS. The MSC then shows you the current ID. The LEDs show something like 'P0M 1 C', which means "Partition 0, Module 1, Console". Shut down the system and try the same step with the upper module. If both use the same ID you have to re-assign one of them. Restart the module and enter the maintenance menu. Press '5' for the command line menu. Type in 'moduleid 1', for example, followed by 'update' to save the new configuration. After this, restart the systems.
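If you note down what each MSC display shows, comparing the readings can be sketched like this. It is a rough Python illustration only: the function name `parse_msc_display` is hypothetical, and the pattern is an assumption generalised from the single example 'P0M 1 C' above, so other display texts may not match it.

```python
import re

# Assumed layout, taken only from the example 'P0M 1 C':
# 'P' + partition number, 'M' + module number, optional trailing flag.
LED_PATTERN = re.compile(r"^P(\d+)M\s*(\d+)(?:\s+(\S+))?$")


def parse_msc_display(text: str):
    """Split a display string into partition, module and console flag.

    Returns None when the text does not follow the assumed layout.
    """
    match = LED_PATTERN.match(text.strip())
    if match is None:
        return None
    partition, module, flag = match.groups()
    return {
        "partition": int(partition),
        "module": int(module),
        "console": flag == "C",
    }
```

Two modules whose readings give the same `module` value are the ones that need a new id assigned from the command line menu.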
Something similar can happen when moving/replacing nodeboards from one module into another. In my case (2 rack system with 32 CPUs) I tried clearallogs and initlogs from the POD. After this the system started to re-number all nodes and modules. But don't try this until you have checked how to set up your partitions!