CPU Error Checking

Error detection in the 21164 is concentrated in the Bus Interface Unit (BIU). The Alpha CPU uses ECC to ensure data integrity. The following errors are detected in the CPU:

Icache Tag or Data Parity Error. The Icache is parity-protected. A machine check occurs before the instruction causing the parity error is executed.
ICPERR_STAT. DPE (data parity error) or TPE (tag parity error) is set.
EXC_ADDR contains either the PC of the instruction causing the parity error or that of an earlier trapping instruction. The Icache is not flushed by hardware in this event.
Scache Data Parity Error--Istream. A machine check occurs before the instruction that causes the parity error is executed. Bad data may be written to the Icache or Icache refill buffer and validated. The operation can be retried if there are no multiple errors.
SC_STAT. SC_DPERR<7:0> is set; <SC_SCND_ERR> is set if there are multiple errors.
SC_STAT. CBOX_CMD is IREAD.
SC_ADDR: Contains the address of the 32-byte block containing the error. (Bit 4 indicates which octaword was accessed first, but the error may be in either octaword.)
Scache Tag Parity Error--Istream. A machine check occurs before the instruction that causes the parity error is executed. Bad data may be written to the Icache or Icache refill buffer and validated. The operation cannot be retried.
SC_STAT. SC_TPERR<2:0> is set; <SC_SCND_ERR> is set if there are multiple errors.
SC_STAT. CBOX_CMD is IREAD.
SC_ADDR. Contains the address of the 32-byte block containing the error. (Bit 4 indicates which octaword was accessed first, but the error may be in either octaword.)
Scache Data Parity Error--Dstream. A machine check occurs. The machine state may have changed. You cannot retry this, but deleting the process may be sufficient if the data was confined to a single process and no second error occurred.
SC_STAT. SC_DPERR<7:0> is set; <SC_SCND_ERR> is set if there are multiple errors.
SC_STAT. CBOX_CMD is DREAD, DWRITE, or READ_DIRTY.
SC_ADDR. Contains the address of the 32-byte block containing the error. (Bit 4 indicates which octaword was accessed first, but the error may be in either octaword.)
Scache Tag Parity Error--Dstream. A machine check occurs. The machine state may have changed. You cannot retry this and probably will not be able to recover by deleting a single process because the exact address is unknown.
SC_STAT. SC_TPERR<2:0> is set; <SC_SCND_ERR> is set if there are multiple errors.
SC_STAT. CBOX_CMD is DREAD, DWRITE, READ_DIRTY, SET_SHARED, or INVAL.
SC_ADDR. Records physical address bits <39:04> of the location with the error.
Dcache Data Parity Error. The Dcache data is parity-protected. A machine check occurs. The machine state may have changed. You cannot retry this, but you may only need to delete the process if data is confined to a single process and no second error occurred.
DCPERR_STAT. <DP0> or <DP1> (data parity error in bank 0 or 1) is set. <LOCK> is set. <SEO> is set if there are multiple errors.
VA. Contains the virtual address of the quadword with the error.
MM_STAT. Locked. Contains information about the instruction causing the error.
Dcache Tag Parity Error. The Dcache tag is parity-protected. A machine check occurs. The machine state may have changed.
DCPERR_STAT. <TP0> or <TP1> (tag parity error in bank 0 or 1) is set. <LOCK> is set. <SEO> is set if there are multiple errors.
VA. Contains the virtual address of the Dcache block (hexword) with the error.
MM_STAT. Locked. Contains information about the instruction causing the error. The <WR> bit is set if the error occurred on a store instruction.
Istream Uncorrectable ECC Error. A machine check occurs before the instruction responsible for the error is executed. Bad data may be written to the Icache or Icache refill buffer and validated. You can retry the operation if there are no multiple errors. The Icache must be flushed to remove bad data. The Icache refill buffer can be flushed by executing enough instructions (32) to fill it with new data; then flush the Icache again.
EI_STAT. <UNC_ECC_ERR> is set; <SEO_HRD_ERR> is set if there are multiple errors.
EI_STAT. <EI_ES> is set if source of fill data is memory/system, clear if Bcache.
EI_STAT. <FIL_IRD> is set.
EI_ADDR. Contains the physical address bits <39:4> of the octaword associated with the error.
FILL_SYN. Contains the syndrome bits associated with the failing octaword.
BC_TAG_ADDR. Holds the result of external cache tag probe if external cache was enabled for this transaction.
Dstream Uncorrectable ECC Error. A machine check occurs. The machine state may have changed. You cannot retry the operation, but you may only need to delete the process if the data is confined to a single process and no second error occurred.
EI_STAT. <UNC_ECC_ERR> is set; <SEO_HRD_ERR> is set if there are multiple errors.
EI_STAT. <EI_ES> is set if source of fill data is memory/system, clear if Bcache.
EI_STAT. <FIL_IRD> is clear.
EI_ADDR. Contains the physical address bits <39:4> of the octaword associated with the error.
FILL_SYN. Contains the syndrome bits associated with the failing octaword.
BC_TAG_ADDR. Holds the result of external cache tag probe if external cache was enabled for this transaction.
Bcache Tag Parity Error--Istream. A machine check occurs before the instruction that causes the parity error is executed. Bad data may be written to the Icache or Icache refill buffer and validated. You can retry the operation if there are no multiple errors. The Icache must be flushed to remove bad data. The Icache refill buffer can be flushed by executing enough instructions (32) to fill it with new data; then flush the Icache again.
EI_STAT. <BC_TPERR> or <BC_TC_PERR> is set; <SEO_HRD_ERR> is set if there are multiple errors.
EI_STAT. <EI_ES> is clear.
EI_STAT. <FIL_IRD> is set.
EI_ADDR. Contains the physical address bits <39:4> of the octaword associated with the error.
BC_TAG_ADDR. Holds the result of external cache tag probe.
Bcache Tag Parity Error--Dstream. A machine check occurs. The machine state may have changed. You cannot retry the operation, but you may only need to delete the process if the data is confined to a single process and no second error occurred.
EI_STAT. <BC_TPERR> or <BC_TC_PERR> is set; <SEO_HRD_ERR> is set if there are multiple errors.
EI_STAT. <EI_ES> is clear.
EI_STAT. <FIL_IRD> is clear.
EI_ADDR. Contains the physical address bits <39:4> of the octaword associated with the error.
BC_TAG_ADDR. Holds the result of external cache tag probe.
System Command/Address Parity Error. A machine check occurs and the machine state may have changed.
EI_STAT. <EI_PAR_ERR> is set; <SEO_HRD_ERR> is set if there are multiple errors.
EI_STAT. <EI_ES> is set.
EI_ADDR. Contains the physical address bits <39:4> of the octaword associated with the error.
BC_TAG_ADDR. Holds the result of external cache tag probe if external cache was enabled for this transaction.
When the 21164 detects a command or address parity error, the command is unconditionally NOACKed.
Istream or Dstream Correctable ECC Errors. The 21164 hardware corrects the data before filling the Scache and Icache. The Dcache is completely invalidated. The data in the Bcache still contains the ECC error until PALcode scrubs it in the correctable error interrupt routine. A separately maskable correctable error interrupt occurs at IPL 31, the same level as a machine check; it is masked by clearing ICSR<CRDE>.
ISR: <CRD> is set.
EI_STAT. <COR_ECC_ERR> is set.
EI_STAT. <FIL_IRD> is set if Istream or clear if Dstream.
EI_STAT. <EI_ES> is clear if source of the error is Bcache, and set otherwise.
EI_ADDR. Contains the physical address bits <39:4> of the octaword associated with the error.
FILL_SYN. Contains the syndrome bits associated with the octaword containing the ECC error.
BC_TAG_ADDR. Unpredictable (not loaded on correctable errors).
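
The EI_STAT entries above amount to a small decision table, which the following Python sketch walks through. It is illustrative only: the function and type names are hypothetical, the input is a set of field names that are already known to be set (the register bit layout is not given in this section, so no bit positions are assumed), and the recovery hints paraphrase the text rather than reproduce any PALcode or operating system logic.

```python
from typing import NamedTuple, Set


class Diagnosis(NamedTuple):
    error: str      # error class, using the names from the list above
    stream: str     # "Istream", "Dstream", or "n/a"
    source: str     # "Bcache" or "memory/system", taken from <EI_ES>
    multiple: bool  # True when <SEO_HRD_ERR> reports a second error
    action: str     # recovery hint paraphrased from the text


def classify_ei_stat(flags: Set[str]) -> Diagnosis:
    """Classify an external-interface error from the EI_STAT fields that are set.

    `flags` holds field names such as "UNC_ECC_ERR". Extracting them from the
    raw register value is left out (assumption: done elsewhere).
    """
    istream = "FIL_IRD" in flags
    stream = "Istream" if istream else "Dstream"
    source = "memory/system" if "EI_ES" in flags else "Bcache"
    multiple = "SEO_HRD_ERR" in flags

    if "EI_PAR_ERR" in flags:
        return Diagnosis("system command/address parity error", "n/a", source,
                         multiple, "machine check; machine state may have changed")
    if "UNC_ECC_ERR" in flags:
        error = "uncorrectable ECC error"
    elif "BC_TPERR" in flags or "BC_TC_PERR" in flags:
        error = "Bcache tag parity error"
    elif "COR_ECC_ERR" in flags:
        return Diagnosis("correctable ECC error", stream, source, multiple,
                         "corrected by hardware; PALcode scrubs the Bcache")
    else:
        return Diagnosis("no external-interface error", "n/a", source,
                         multiple, "none")

    # Hard errors: only an Istream error without a second error can be retried.
    if istream and not multiple:
        action = "flush the Icache and refill buffer, then retry"
    elif istream:
        action = "no retry; multiple errors occurred"
    elif not multiple:
        action = "no retry; deleting the affected process may be enough"
    else:
        action = "no retry; a second error occurred"
    return Diagnosis(error, stream, source, multiple, action)


if __name__ == "__main__":
    print(classify_ei_stat({"UNC_ECC_ERR", "FIL_IRD", "EI_ES"}))
```

The hard-error checks come before the correctable check so that a logged uncorrectable error is never reported as a correctable one; that ordering is a design choice of this sketch, not something the text specifies.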

Bcache Error Detection

The Bcache does not detect errors itself, but its data is protected by ECC and its tags by parity. The CPU generates ECC for each group of 8 bytes written into the Bcache. Fill data from the Bcache to the system is not checked for errors. If a correctable (single-bit) error is detected during a fill from the Bcache, the CPU traps and the fill is replayed with corrected data.
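
To illustrate how check bits over an 8-byte group allow a single-bit error to be corrected and a double-bit error to be detected, here is a minimal Python sketch of a generic extended Hamming (72,64) code. This is an assumption made for illustration: it is not the 21164's actual check-bit equations, and its syndrome values do not correspond to FILL_SYN contents. The function names are hypothetical.

```python
# Generic extended Hamming (72,64) SEC-DED code; NOT the 21164's real ECC.
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]  # Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]  # 64 slots
# Bit 0 of the codeword holds an overall parity bit.


def _syndrome(word: int) -> int:
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s


def encode(data: int) -> int:
    """Return a 72-bit codeword for a 64-bit quadword."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = _syndrome(word)
    for p in CHECK_POSITIONS:      # set check bits so the syndrome becomes 0
        if s & p:
            word |= 1 << p
    if bin(word).count("1") % 2:   # overall parity: make the bit count even
        word |= 1
    return word


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    s = _syndrome(word)
    odd = bin(word).count("1") % 2 == 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        word ^= 1 << s             # single-bit error at position s (0 = parity bit)
        status = "corrected"
    else:
        status = "uncorrectable"   # e.g. two bits flipped
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


if __name__ == "__main__":
    cw = encode(0x0123456789ABCDEF)
    print(decode(cw ^ (1 << 37))[1])              # one flipped bit
    print(decode(cw ^ (1 << 5) ^ (1 << 60))[1])   # two flipped bits
```

In this model the "corrected" case corresponds to a correctable error and the "uncorrectable" case to the uncorrectable ECC errors listed earlier.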


Serial ROM (SROM)

The SROM and the SROM interface to the 21164 do not have error-checking capability.


Memory Error Detection

The memory SIMMs do not detect errors, but they furnish information through ECC bits to error-detection networks in the CPU and core logic. During CPU-initiated transactions, ECC is typically generated by the CPU (when a Bcache victim is written to memory, the ECC comes from the Bcache). For DMA write transactions, ECC is generated in the PCI portion of the core logic. If data being written from the PCI to memory during a DMA write has bad parity, the transaction from the PCI agent that initiated the DMA write is allowed to complete normally, but the write data is discarded and PYXIS_ERR<PCI_PERR> is set.


PCI Bus Error Detection

Not all PCI devices are required to detect and report parity errors, but all are required to generate parity on all of their transactions. Some do this more successfully than others. During the address phase, the PAR bit provides even parity for AD[31:0] and C/BE[3:0], regardless of whether the lines carry meaningful information.
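
As a small illustration of the even-parity rule, the Python sketch below computes the PAR value for given AD[31:0] and C/BE[3:0] line states, so that the total number of 1s across those 36 lines plus PAR is even. The function names are hypothetical and the code models only the arithmetic, not bus timing.

```python
def pci_par(ad: int, cbe: int) -> int:
    """Return the PAR bit giving even parity over AD[31:0], C/BE[3:0], and PAR."""
    ones = bin(ad & 0xFFFFFFFF).count("1") + bin(cbe & 0xF).count("1")
    return ones & 1  # 1 only when the 36 lines carry an odd number of 1s


def pci_parity_ok(ad: int, cbe: int, par: int) -> bool:
    """Check a sampled PAR value against the sampled AD and C/BE lines."""
    return pci_par(ad, cbe) == (par & 1)


if __name__ == "__main__":
    print(pci_par(0x80000000, 0x7))
    print(pci_parity_ok(0x80000000, 0x7, 1))
```

Every line participates in the count whether or not it carries meaningful information, which is why the masks cover all 32 AD bits and all 4 C/BE bits.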

Master devices drive PAR for the address and data phases on write transactions. Target devices drive PAR during the data phase of read transactions.

The PCI contains PERR and SERR to signal errors. PERR reports data parity errors for all transactions except special cycle commands. PERR can only be driven by one device at a time. Targets use PERR to signal data parity errors back to the master.

SERR reports address parity errors and data parity errors on special cycles. It is a wire-OR'd signal that can be driven by multiple devices at any one time. SERR will be sent to the CPU as an NMI.
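
The division of work between PERR and SERR described above can be summarized in a short Python sketch. The function name and its arguments are hypothetical; it encodes only the rules stated in the two preceding paragraphs.

```python
def pci_error_signal(phase: str, special_cycle: bool = False) -> str:
    """Return the signal that reports a parity error seen in the given phase.

    phase: "address" or "data".
    special_cycle: True when the transaction is a special cycle command.
    """
    if phase not in ("address", "data"):
        raise ValueError("phase must be 'address' or 'data'")
    if phase == "address" or special_cycle:
        return "SERR"  # address parity errors, and data errors on special cycles
    return "PERR"      # data parity errors on all other transactions


if __name__ == "__main__":
    for args in (("address",), ("data",), ("data", True)):
        print(args, pci_error_signal(*args))
```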


ISA Bus Error Detection

The ISA bus uses I/O Channel Check (IOCHK) to signal that some ISA device detected a parity error on the ISA bus. The assertion of IOCHK causes an NMI to be sent to the main interrupt controller, which, in turn, sends a machine check interrupt request to the CPU.