SGI ECC Issues

From Higher Intellect Wiki

Every few months or so, my Indy (specs below) spits out a kernel panic such as the ones below. None of the SIMMs have moved, and they're the same SIMMs that previously gave me years of solid performance in my Indigo2. I also used to get the same errors with a previous set of SIMMs in the same system:

example 1:
    <6>Recoverable memory parity error corrected by CPU at 0xaa4c240 <0x302> 
    <6>Memory Parity Error in SIMM  S2
    <6>CPU Error/Addr 0x202<PAR >: 0xaa4c240

    <0>PANIC: IRIX Killed due to internal Error
        at PC:0x7fc045a0 ep:0x88cac750

example 2:
    <6>Recoverable memory parity error corrected by CPU at 0x12da3788 <0x302> 
    <6>Memory Parity Error in SIMM  S8
    <6>CPU Error/Addr 0x202<PAR >: 0x12da3788

    <0>PANIC: IRIX Killed due to internal Error
        at PC:0x88016b5c ep:0xffffce88

First off, since when are parity errors recoverable? Or is the Indy doing ECC with the parity bits?

Second, why is the kernel panicking on a (so-claimed) recoverable and corrected error? systune panic_on_sbe is set to 0.

I am not claiming that SGI ECC works exactly as described below; conceptually, though, the description shows how SGI's ECC works.

          The illustrations are presented in big-endian manner.  The
          most significant bits and words are presented to the left
          and to the top.

          Parity memory has, by tradition, an extra chip for each 8
          bits of data.  The value stored in that chip is chosen so
          that, again by tradition, the count of 1 bits among the
          parity bit and the 8 data bits to its right is odd (odd
          parity).  This is sufficient to detect that a single bit
          among those bits has been inverted.
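
As a minimal sketch of the byte parity just described (the function names are mine, not taken from any SGI source), odd parity over 8 data bits can be computed and checked as follows. It shows that one inverted bit is detected, while two inverted bits slip through unnoticed:

```python
def odd_parity_bit(byte: int) -> int:
    """Return the parity bit that makes the 9-bit group hold an odd number of 1s."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1


def parity_ok(byte: int, parity: int) -> bool:
    """True when data bits plus parity bit contain an odd number of 1s."""
    ones = bin(byte & 0xFF).count("1") + (parity & 1)
    return ones % 2 == 1


stored_byte = 0b1011_0010                  # four 1 bits, so the parity bit is 1
stored_parity = odd_parity_bit(stored_byte)

single_flip = stored_byte ^ 0b0000_1000    # one bit inverted
double_flip = stored_byte ^ 0b0001_1000    # two bits inverted

print(parity_ok(stored_byte, stored_parity))   # True
print(parity_ok(single_flip, stored_parity))   # False: error detected
print(parity_ok(double_flip, stored_parity))   # True: error missed
```

Parity alone says that something in the 9 bits is wrong, not which bit, so it cannot correct anything.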

               MSW -66665555-55555544-44444444-33333333
               LSW -33222222-22221111-11111100-00000000

          If the word size is at least 64 bits with parity at the
          usual ratio, then there are 72 bits available.  If used in a
          different manner, these are sufficient to detect all single
          and double bit inversions and correct all single bit
          inversions.

          For example, here is a practical model ECC system.  It might
          not match your particular system (e.g. it might not match
          any SGI computer's system).

               MSW 776666666666555555555544444444443333

                pa ------------------------------------
                pb -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
                pc --..--..--..--..--..--..--..--..--..
                pd ----....----....----....----....----
                pe ........--------........--------....
                pf ........----------------............
                pg ........----------------------------
                ph -------*

               LSW 333333222222222211111111110000000000

                pa -----------------------------------*
                pb -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.*
                pc --..--..--..--..--..--..--..--..-*
                pd ....----....----....----....---*
                pe ....--------........-------*
                pf ....---------------*
                pg ---*

          The first set of bit numberings is the unstructured view of
          the bits in a memory protection unit.  The second set
          illustrates a possible assignment of those bits to data and
          protection.  The numbered bits are the data bits.  The
          number gives the bit position in the power of 2 sense.  The
          lettered bits are the 8 parity bits.  The parity bits give
          the parity of the bits matching the positions with hyphens
          for that parity bit.  By tradition, odd parity is used.  The
          asterisked positions are where the parity bits are stored.
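
The parity rows in the diagram can be turned into a small encoder. What follows is only a sketch of the model system above, not of any SGI hardware. It assumes positions are numbered 0 to 71, that pa sits at position 0 and pb through ph at positions 1, 2, 4, 8, 16, 32 and 64 (where the asterisks fall), and that each of those check bits covers every position whose number has the matching binary digit set. The order in which data bits fill the remaining 64 positions is my own arbitrary choice, as are all the names:

```python
WIDTH = 72                                   # 64 data bits + 8 check bits
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32, 64)   # pb, pc, pd, pe, pf, pg, ph
# Assumption: data bit i goes into the i-th position that holds no check bit.
DATA_POSITIONS = [p for p in range(1, WIDTH) if p not in CHECK_POSITIONS]


def count_ones(value: int) -> int:
    return bin(value).count("1")


def coverage_mask(check_position: int) -> int:
    """Positions marked '-' or '*' on that check bit's row of the diagram."""
    return sum(1 << p for p in range(1, WIDTH) if p & check_position)


def encode(data: int) -> int:
    """Place 64 data bits and 8 odd-parity check bits into a 72-bit unit."""
    unit = 0
    for i, position in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            unit |= 1 << position
    for c in CHECK_POSITIONS:
        if count_ones(unit & coverage_mask(c)) % 2 == 0:
            unit |= 1 << c        # make this group's count of 1s odd
    if count_ones(unit) % 2 == 0:
        unit |= 1                 # pa (position 0) covers all 72 bits
    return unit


unit = encode(0xDEADBEEFCAFEF00D)
print(f"{unit:018x}")
```

Because every check position is a power of two, each of pb through ph appears in its own group only, so they can be computed in any order. pa has to come last because it covers the other check bits as well.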

          Except to provide a linear mapping of the error signature to
          a bit, there is no significance to the bit numbering.
          Frequently, the bits are rearranged so that the parity bits
          are in the `ninth bit of the byte' position or to match some
          other needs.

          If no bit inversion has occurred, then all of the parity bits
          will be correct.

          If a single bit inversion occurs, then the pa bit will be
          wrong and if ph-pg-pf-pe-pd-pc-pb is considered to be a 7-
          bit binary number, then the positions which are wrong
          considered as 1 bits indicate which bit has been inverted!
          The error can be corrected by inverting that bit.

          If a double bit inversion occurs, then pa will be correct
          and the ph-pg-pf-pe-pd-pc-pb will be non-zero.  This is
          a detected, but uncorrectable memory error.

          If a triple or more bit inversion has occurred, then it is
          probable that either an apparent single bit case with a
          bit position greater than position 71 (1000111) or an
          apparent double bit case will be indicated.  Both of these
          cases are considered an unrecoverable memory error.  Not all
          triple or more bit inversions can be detected.
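
The four cases above (no error, single, double, triple or more) can be sketched as a checker for the same model layout. The function names and status strings are my own, and the layout assumptions are the ones read off the diagram: pa at position 0, pb through ph at positions 1, 2, 4, 8, 16, 32 and 64, odd parity throughout. The known-good unit used in the demonstration is the one with all 64 data bits zero, in which pb through ph are all 1 and pa is 0:

```python
WIDTH = 72
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32, 64)   # pb .. ph


def count_ones(value: int) -> int:
    return bin(value).count("1")


def syndrome(unit: int) -> int:
    """ph-pg-pf-pe-pd-pc-pb as a number: 1 bits mark the groups that are wrong."""
    s = 0
    for c in CHECK_POSITIONS:
        group = sum(1 << p for p in range(1, WIDTH) if p & c)
        if count_ones(unit & group) % 2 == 0:   # odd parity violated
            s |= c
    return s


def decode(unit: int):
    """Return (status, unit); the unit is repaired when status is 'corrected'."""
    s = syndrome(unit)
    pa_wrong = count_ones(unit) % 2 == 0
    if not pa_wrong:
        # pa correct: either nothing happened or an even number of bits flipped
        return ("ok", unit) if s == 0 else ("uncorrectable", unit)
    if s >= WIDTH:
        return ("uncorrectable", unit)          # points past position 71
    return ("corrected", unit ^ (1 << s))       # s == 0 means pa itself flipped


good = sum(1 << c for c in CHECK_POSITIONS)     # valid unit, all data bits zero

print(decode(good)[0])                                       # ok
print(decode(good ^ (1 << 37)) == ("corrected", good))       # True
print(decode(good ^ (1 << 37) ^ (1 << 5))[0])                # uncorrectable
print(decode(good ^ (1 << 64) ^ (1 << 32) ^ (1 << 16))[0])   # uncorrectable
status, repaired = decode(good ^ 0b1110)                     # positions 1, 2, 3
print(status, repaired == good)                              # corrected False
```

The last line is the undetectable triple case: inverting positions 1, 2 and 3 leaves pb through ph consistent, so the checker believes only pa was hit and "corrects" the wrong bit.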

          In order to sweep out soft single bit errors, which may be
          caused, for example, by cosmic or embedded radiation, before
          they become multiple bit errors, the operating system may
          have a very low priority task (either the wait task or just
          above it in priority) that periodically sweeps through real
          memory and triggers ECC correction.
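
To show why sweeping matters, here is a toy simulation. It does not model SGI's ECC: as a stand-in I store three copies of each word and take a bitwise majority, which, like ECC, can repair one bad copy of a bit but not two. All names are hypothetical. The same two soft errors hit two memories, and only one of them gets a sweep in between:

```python
def majority(copies):
    """Bitwise majority vote over three stored copies (stand-in for ECC)."""
    a, b, c = copies
    return (a & b) | (a & c) | (b & c)


def scrub_pass(memory):
    """Read every word and rewrite any word whose stored copies disagree."""
    repaired = 0
    for i, copies in enumerate(memory):
        value = majority(copies)
        if copies != (value, value, value):
            memory[i] = (value, value, value)
            repaired += 1
    return repaired


def soft_error(memory, index, copy, bit):
    """Invert one bit in one stored copy of one word."""
    copies = list(memory[index])
    copies[copy] ^= 1 << bit
    memory[index] = tuple(copies)


scrubbed = [(0xA5, 0xA5, 0xA5)] * 4
unscrubbed = list(scrubbed)

for mem in (scrubbed, unscrubbed):
    soft_error(mem, 2, 0, 3)          # first hit on word 2
fixed = scrub_pass(scrubbed)          # the sweep runs between the two hits
for mem in (scrubbed, unscrubbed):
    soft_error(mem, 2, 1, 3)          # second hit, same word and bit

print(fixed)                          # 1
print(hex(majority(scrubbed[2])))     # 0xa5: still readable
print(hex(majority(unscrubbed[2])))   # 0xad: two errors piled up, data lost
```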

          Normally, the ECC data is created and checked by each major
          subsystem of the computer, e.g. CPU, memory bank, I/O
          channel.  In the case of a partial write, the subsystem may
          check the supplied data, obtain the complete word to be
          modified, merge the data and recompute the ECC signature.
          Frequently, in the interest of system performance, the ECC
          signature is checked in parallel with using the data and the
          results discarded if the data were bad.  These behaviors can
          have significant impact on system performance.
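
The partial write sequence can be sketched as a read-modify-write. The signature function here is a deliberately trivial stand-in (an XOR of the eight bytes), not a real ECC. The names are hypothetical, and the step of checking the incoming data's own signature is left out:

```python
def check_byte(word: int) -> int:
    """Stand-in signature: XOR of the word's eight bytes (NOT a real ECC)."""
    sig = 0
    for lane in range(8):
        sig ^= (word >> (8 * lane)) & 0xFF
    return sig


def partial_write(memory, index, lane, new_byte):
    """Write one byte lane of a 64-bit word whose signature covers all 64 bits."""
    word, signature = memory[index]              # obtain the complete word
    if check_byte(word) != signature:            # check it before merging
        raise ValueError(f"memory error at word {index}")
    shift = 8 * lane
    merged = (word & ~(0xFF << shift)) | ((new_byte & 0xFF) << shift)
    memory[index] = (merged, check_byte(merged)) # recompute the signature


word = 0x1122334455667788
memory = [(word, check_byte(word))]
partial_write(memory, 0, 1, 0xAB)
print(hex(memory[0][0]))    # 0x112233445566ab88
```

The point is the cost: a one-byte store turns into a full read, a check, a merge and a full write, because the signature depends on every bit of the word.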

it is also useful to say explicitly that the checkbit portion of the information word is itself protected against SBE and DBE.

the fun comes when considering address errors. when a single bit address error occurs that affects every bit in the read word, the result would be a 'corrupt read' (or whatever industry standard nomenclature you subscribe to). the ECC would be correct, but for data from the wrong address. mild attempts to detect this involve XOR'ing the address into the checkbits, so the syndrome also contains bad address indications. there are more exotic methods though...
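
a rough sketch of the address-XOR idea, under loud assumptions: the checkbits here are plain per-byte odd parity standing in for a real ECC, the address is folded to 8 bits by XOR'ing its bytes together, and every name is made up for illustration. how a real syndrome tells an address fault from a data fault depends on the code design and is not shown:

```python
def byte_parities(data: int) -> int:
    """Stand-in checkbits: one odd-parity bit per byte of a 64-bit word."""
    bits = 0
    for lane in range(8):
        ones = bin((data >> (8 * lane)) & 0xFF).count("1")
        if ones % 2 == 0:
            bits |= 1 << lane
    return bits


def fold_address(address: int) -> int:
    """XOR the address bytes together so the result fits the checkbit field."""
    folded = 0
    while address:
        folded ^= address & 0xFF
        address >>= 8
    return folded


def store(memory, address, data):
    memory[address] = (data, byte_parities(data) ^ fold_address(address))


def load(memory, wired_address, requested_address):
    """wired_address is the location the (possibly faulty) address lines select."""
    data, stored = memory[wired_address]
    mismatch = stored ^ byte_parities(data) ^ fold_address(requested_address)
    return data, mismatch


memory = {}
store(memory, 0x1000, 0xAAAAAAAAAAAAAAAA)
store(memory, 0x1400, 0x5555555555555555)

print(load(memory, 0x1000, 0x1000)[1])   # 0: right word, checkbits agree
print(load(memory, 0x1400, 0x1000)[1])   # 4: address bit 10 was wrong
```

without the address folded in, the second read would return a perfectly self-consistent word from the wrong location and nothing would complain.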

another delightful result of all this was the advent of 4 bit chips. faults can occur on address and control lines affecting 4 bits in a row (or some arbitrary combination); your ECC must now detect (and maybe correct) these types of errors too. heaven help you if you have 8 or 16 bit chips :-) unless you have a PEECEE; then it doesn't matter worth a F00F.

Single bit errors, which are recoverable with ECC memory, can be invisible to all but the machine error handling and logging components of the kernel.

Multiple-bit, and therefore unrecoverable, errors need not be fatal if the memory location can be determined not to be in assigned use at the time, if its contents can be recovered from other storage or by recomputation, or if the error can be determined to be ignorable.

Examples of such not necessarily fatal multiple-bit memory errors are errors in: pages in the available-to-be-assigned pool; unchanged pages that are backed by swap or by files; the unused remainder of pages only partially allocated to processes; pages in the file cache that have already been written to their output files; pages in the file cache from input files; and, with signal handling, errors unrecoverable at the kernel level that a process determines it does not need to recover from or knows how to recover from. Which of these cases SGI IRIX permits is something I do not know.

We estimate that memory errors caused by the combination of cosmic rays, background radiation, intrinsic radiation (radioactivity in the chip and its enclosure) and quantum tunneling occur at a rate on the order of 1 per gigabyte-week. We have seen a few cases most easily explained by a cosmic ray shower causing multiple errors in multiple machines concurrently, i.e., several machines failing at about the same moment for memory-related reasons.
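
To put that estimate in per-machine terms, here is a back-of-the-envelope calculation. The 0.25 GB memory size is a hypothetical example, not the spec of any machine discussed here, and treating the errors as independent events at a constant rate (a Poisson process) is my own simplifying assumption:

```python
import math

ERRORS_PER_GB_WEEK = 1.0      # the order-of-magnitude estimate quoted above


def expected_errors(memory_gb: float, weeks: float) -> float:
    """Mean number of soft errors for this much memory over this much time."""
    return ERRORS_PER_GB_WEEK * memory_gb * weeks


def chance_of_any_error(memory_gb: float, weeks: float) -> float:
    """Probability of at least one error, assuming a Poisson process."""
    return 1.0 - math.exp(-expected_errors(memory_gb, weeks))


print(expected_errors(0.25, 4))                  # 1.0
print(round(chance_of_any_error(0.25, 4), 3))    # 0.632
```

On those assumptions a machine with a quarter of a gigabyte would see about one soft error every four weeks, which is why single bit correction and periodic sweeping are worth the hardware.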