SGI ECC Issues
Every few months or so, my Indy (specs below) spits out a kernel panic such as the following. Note that none of the SIMMs have moved, and they're the same SIMMs that previously gave me years of solid performance in my Indigo2. I also used to get the same errors with a previous set of SIMMs in the same system:
example 1:
<6>Recoverable memory parity error corrected by CPU at 0xaa4c240 <0x302> code:30
<6>Memory Parity Error in SIMM S2
<6>CPU Error/Addr 0x202<PAR >: 0xaa4c240
<0>PANIC: IRIX Killed due to internal Error at PC:0x7fc045a0 ep:0x88cac750

example 2:
<6>Recoverable memory parity error corrected by CPU at 0x12da3788 <0x302> code:28
<6>Memory Parity Error in SIMM S8
<6>CPU Error/Addr 0x202<PAR >: 0x12da3788
<0>PANIC: IRIX Killed due to internal Error at PC:0x88016b5c ep:0xffffce88
First off, since when are parity errors recoverable? Or is the Indy doing ECC with the parity bits?
Second, why is the kernel panicking on a (so-claimed) recoverable and corrected error? systune panic_on_sbe is set to 0.
I am not claiming that SGI ECC works exactly as described below, but conceptually the description shows the manner in which SGI's ECC works.
The illustrations are presented in big-endian manner. The most significant bits and words are presented to the left and to the top.

Parity memory has, by tradition, an extra chip for each 8 bits of data; with the value stored in that chip that makes, again by tradition, the count of 1 bits in the parity bit and the 8 data bits to its right (parity) odd. This is sufficient to detect that a single bit among those bits has been inverted.

    MSW -66665555-55555544-44444444-33333333
        p32109876p54321098p76543210p98765432
    LSW -33222222-22221111-11111100-00000000
        p10987654p32109876p54321098p76543210

If the word size is at least 64 bits with parity at the usual ratio, then there are 72 bits available. If used in a different manner, these are sufficient to detect all single and double bit inversions and correct all single bit inversions.

For example, here is a practical model ECC system. It might not match your particular system (e.g. it might not match any SGI computer's system).

    MSW
       776666666666555555555544444444443333
       109876543210987654321098765432109876
       6666555p5555555444444444433333333332
       3210987h6543210987654321098765432109
    pa ------------------------------------
    pb -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
    pc --..--..--..--..--..--..--..--..--..
    pd ----....----....----....----....----
    pe ........--------........--------....
    pf ........----------------............
    pg ........----------------------------
    ph -------*

    LSW
       333333222222222211111111110000000000
       543210987654321098765432109876543210
       222p222222111111111p1......p...p.ppp
       876g543210987654321f0987654e321d0cba
    pa -----------------------------------*
    pb -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.*
    pc --..--..--..--..--..--..--..--..-*
    pd ....----....----....----....---*
    pe ....--------........-------*
    pf ....---------------*
    pg ---*

In each half, the first pair of numbering rows (tens digit above units digit) is the unstructured view of the bits in a memory protection unit. The second pair illustrates a possible assignment of those bits to data and protection.
The numbered bits are the data bits. The number gives the bit position in the power of 2 sense. The lettered bits are the 8 parity bits. Each parity bit gives the parity of the bits at the positions marked with hyphens in its row. By tradition, odd parity is used. The asterisked positions are where the parity bits are stored. Except to provide a linear mapping of the error signature to a bit, there is no significance to the bit numbering. Frequently, the bits are rearranged so that the parity bits are in the `ninth bit of the byte' position or to match some other needs.

If no bit inversion has occurred, then all of the parity bits will be correct.

If a single bit inversion occurs, then the pa bit will be wrong, and if ph-pg-pf-pe-pd-pc-pb is considered to be a 7-bit binary number, with the parity bits that are wrong taken as 1 bits, it indicates which bit has been inverted! The error can be corrected by inverting that bit.

If a double bit inversion occurs, then pa will be correct and ph-pg-pf-pe-pd-pc-pb will be non-zero. This is a detected, but uncorrectable, memory error.

If a triple or more bit inversion has occurred, then it is probable that either an apparent single bit case with a bit position greater than position 71 (1000111) or an apparent double bit case will be indicated. Both of these cases are considered an unrecoverable memory error. Not all triple or more bit inversions can be detected.

In order to sweep out soft single bit errors, which may be caused, for example, by cosmic or embedded radiation, before they become multiple bit errors, the operating system may have a very low priority task (either the wait task or just above it in priority) that periodically sweeps through real memory and triggers ECC correction.

Normally, the ECC data is created and checked by each major subsystem of the computer, e.g. CPU, memory bank, I/O channel.
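The model code described above can be sketched in a few lines of Python. This is only an illustration of the model layout (parity bits at positions 0, 1, 2, 4, 8, 16, 32 and 64, odd parity), not SGI's implementation; the names `encode` and `decode` and the status strings are made up for the example.

```python
# Sketch of the model 72-bit SEC-DED layout: pa at position 0,
# pb..ph at positions 1, 2, 4, 8, 16, 32, 64, data bits everywhere else.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)  # pb..ph
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]  # 64 non-powers of 2


def _group_is_odd(word, k):
    """True if the group covered by parity bit k holds an odd number of 1s."""
    return sum((word >> p) & 1 for p in range(1, 72) if p & k) % 2 == 1


def encode(data):
    """Spread 64 data bits over the data positions and add odd parity."""
    word = 0
    for i, p in enumerate(DATA_POSITIONS):
        word |= ((data >> i) & 1) << p
    for k in PARITY_POSITIONS:
        if not _group_is_odd(word, k):
            word |= 1 << k          # make the group's count of 1s odd
    if bin(word).count("1") % 2 == 0:
        word |= 1                   # pa: make the whole 72-bit count odd
    return word


def decode(word):
    """Return (data, status); data is None when the error is uncorrectable."""
    syndrome = 0
    for k in PARITY_POSITIONS:
        if not _group_is_odd(word, k):
            syndrome |= k           # a wrong parity bit counts as a 1 bit
    pa_wrong = bin(word).count("1") % 2 == 0
    status = "ok"
    if pa_wrong:
        if syndrome > 71:           # points outside the word: 3+ inversions
            return None, "uncorrectable"
        word ^= 1 << syndrome       # syndrome 0 means pa itself was inverted
        status = "corrected"
    elif syndrome:
        return None, "uncorrectable"  # pa correct, syndrome non-zero: double
    data = 0
    for i, p in enumerate(DATA_POSITIONS):
        data |= ((word >> p) & 1) << i
    return data, status
```

Inverting any one of the 72 bits of `encode(x)` gives back `(x, "corrected")`, and inverting any two gives `"uncorrectable"`. Inverting positions 1, 2 and 3 together yields a zero syndrome with pa wrong, so this sketch "corrects" pa and silently returns wrong data, which is one of the undetectable triple inversions mentioned above.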
In the case of a partial write, the subsystem may check the supplied data, obtain the complete word to be modified, merge the data and recompute the ECC signature. Frequently, in the interest of system performance, the ECC signature is checked in parallel with using the data, and the results are discarded if the data were bad. These behaviors can have significant impact on system performance.
it is also useful to say explicitly that the checkbit portion of the information word is itself protected against SBEs and DBEs (single and double bit errors).
the fun comes when considering address errors. when a single bit address error occurs that correlates with every bit in the read word, the result would be a 'corrupt read' (or whatever industry standard nomenclature you subscribe to). the ECC would be correct but from the wrong address. mild attempts to detect this involve XOR'ing the address into the checkbits, so the syndrome also contains bad address indications. there are more exotic methods though...
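a toy sketch of the address XOR idea, in Python. everything here is hypothetical: `check_bits` is a stand-in (a byte-wise XOR, not a real ECC), `fold_address` is one arbitrary way to squeeze an address into 8 bits, and `ToyMemory` models an inverted address line on the bus. it only shows that the mismatch becomes visible in the syndrome, not how real hardware classifies it.

```python
def check_bits(data):
    """Stand-in 8-bit check field for a 64-bit word (NOT a real ECC)."""
    c = 0
    for i in range(8):
        c ^= (data >> (8 * i)) & 0xFF
    return c


def fold_address(addr):
    """Hypothetical fold: XOR the address bytes together into 8 bits.

    A single inverted address bit always changes the result; some
    multi-bit address errors (e.g. bits 0 and 8 together) cancel out.
    """
    f = 0
    while addr:
        f ^= addr & 0xFF
        addr >>= 8
    return f


class ToyMemory:
    def __init__(self, with_address_fold):
        self.cells = {}
        self.with_address_fold = with_address_fold

    def _check(self, addr, data):
        c = check_bits(data)
        if self.with_address_fold:
            c ^= fold_address(addr)   # mix the address into the checkbits
        return c

    def write(self, addr, data):
        self.cells[addr] = (data, self._check(addr, data))

    def read(self, addr, flipped_line=None):
        """Read addr; flipped_line models one address line inverted on the bus."""
        bus_addr = addr if flipped_line is None else addr ^ (1 << flipped_line)
        data, stored = self.cells[bus_addr]
        syndrome = stored ^ self._check(addr, data)  # 0 means "looks fine"
        return data, syndrome
```

with two words written at 0x1000 and 0x1008, `read(0x1000, flipped_line=3)` returns the word from 0x1008 in both cases. without the fold the syndrome is 0 (a silent corrupt read); with the fold it is non-zero, so the wrong address shows up as an error.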
another delightful result of all this was the advent of 4 bit chips. faults can occur on address and control lines affecting 4 bits in a row (or some arbitrary combination). your ECC must now detect (and maybe correct) these types of errors too. heaven help you if you have 8 or 16 bit chips :-) unless you have a PEECEE; then it doesn't matter worth a F00F.
Single bit errors, which are recoverable with ECC memory, can be invisible to all but the machine error handling and logging components of the kernel.
Multiple bit, and therefore unrecoverable, errors need not be fatal if the memory location can be determined not to be in assigned use at the time, if its contents can be recovered from other storage or by recomputation, or if the error can be determined to be ignorable.
Examples of such not necessarily fatal multiple bit memory errors are: pages in the available-to-be-assigned pool; unchanged pages which are backed by swap or by files; the unused remainder of pages only partially allocated to processes; pages in the file cache that have already been written to output files; pages in the file cache from input files; and, with signal handling, errors unrecoverable at the kernel level which a process determines do not need to be recovered or which the program knows how to recover from. Which of these cases SGI IRIX permits is something I do not know.
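The cases above amount to a decision keyed on what the damaged page was being used for. A minimal Python sketch of such a policy follows. It is purely hypothetical: the state names, the action strings, and the choices for modified process pages without a handler and for kernel pages are assumptions made for illustration, and it does not describe what IRIX does.

```python
from enum import Enum, auto


class PageState(Enum):
    """Hypothetical page states, mirroring the cases listed above."""
    FREE = auto()              # in the available-to-be-assigned pool
    CLEAN_BACKED = auto()      # unchanged, backed by swap or by a file
    UNUSED_REMAINDER = auto()  # unallocated part of a partially used page
    CACHE_WRITTEN = auto()     # file cache, already written to the output file
    CACHE_INPUT = auto()       # file cache, read from an input file
    PROCESS_DIRTY = auto()     # modified process data with no other copy
    KERNEL = auto()            # kernel data (assumed unrecoverable here)


def handle_uncorrectable(state, process_has_handler=False):
    """Return a hypothetical action for an uncorrectable error in a page."""
    if state in (PageState.FREE, PageState.UNUSED_REMAINDER):
        return "discard"       # nothing of value was lost
    if state in (PageState.CLEAN_BACKED, PageState.CACHE_WRITTEN,
                 PageState.CACHE_INPUT):
        return "reload"        # a good copy exists in other storage
    if state is PageState.PROCESS_DIRTY:
        # With signal handling the process may ignore or repair the damage.
        return "signal" if process_has_handler else "kill-process"
    return "panic"             # assumption: kernel data cannot be rebuilt
```

For example, `handle_uncorrectable(PageState.CLEAN_BACKED)` gives `"reload"`, while `handle_uncorrectable(PageState.KERNEL)` gives `"panic"`.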
We estimate that memory errors caused by the combination of cosmic ray, background and intrinsic (radioactivity in the chip and its enclosure) radiation and quantum tunneling occur at a rate on the order of 1 per gigabyte-week. We have seen a few cases most easily explained by a cosmic ray shower causing multiple errors in multiple machines concurrently, i.e., several machines failing at about the same moment for memory related reasons.
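To get a feel for what that estimate means for one machine, here is a back-of-the-envelope Python sketch. It assumes the errors are independent events at a constant rate (a Poisson model, which is an assumption added here, not part of the estimate), and the 0.25 GB figure is just an example memory size.

```python
import math

ERRORS_PER_GB_WEEK = 1.0  # order-of-magnitude estimate quoted above


def expected_errors(memory_gb, weeks, rate=ERRORS_PER_GB_WEEK):
    """Expected number of soft errors for a given memory size and time span."""
    return rate * memory_gb * weeks


def prob_at_least_one(memory_gb, weeks, rate=ERRORS_PER_GB_WEEK):
    """Chance of one or more errors, assuming independent (Poisson) events."""
    return 1.0 - math.exp(-expected_errors(memory_gb, weeks, rate))


# Example: a machine with 0.25 GB of memory over 4 weeks.
print(expected_errors(0.25, 4))    # expected count
print(prob_at_least_one(0.25, 4))  # probability of at least one error
```

Under these assumptions a 0.25 GB machine expects about one soft error every four weeks, with roughly a 63% chance of seeing at least one in that time.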