The SGI SPIDER Chip
Scalable Pipelined Interconnect for Distributed Endpoint Routing
Mike Galles ([email protected])
Silicon Graphics Computer Systems
2011 N. Shoreline Blvd.
Mountain View, CA 94039-7311
The SGI SPIDER chip provides a high speed, reliable switching network with a flexible interface and topology suitable for a variety of high end applications. Six full duplex ports and a non-blocking internal crossbar can sustain a data transfer rate of 4.8 GBytes/sec, either between chips in a single chassis or between remote chassis over cables up to 5 meters in length. Messages of arbitrary length travel over 4 independent virtual channels with 256 levels of priority, and are protected by CCITT-CRC with hardware retry. These features make the SGI SPIDER well suited to serve as an interprocessor communication fabric, a distributed graphics switch fabric, or a central switch for high end networking applications.
Design of the SGI SPIDER chip was guided by the principles of computer communications architecture. Isolation between the physical, data link, and message layers led to a well structured design which is transportable and more easily verified. Because all layers are implemented in hardware, latency was kept very low, so that the benefits of layering could be realized without sacrificing performance. The organization of this paper follows each communications layer, from physical to message, and concludes with physical design details and performance evaluation.
The physical transmission layer for each port is based on a pair of Source Synchronous Drivers and Receivers (SSD and SSR), which transmit and receive 20 data bits and a data framing signal at 400 MBaud. The data link level guarantees reliable transmission using a CCITT-CRC code with a go-back-n sliding window protocol  retry mechanism, and is referred to as the Link Level Protocol (LLP). The message layer defines 4 virtual channels and a credit based flow control scheme to support arbitrary message lengths, as well as a header format to specify message destination, priority, and congestion control options. Two distinct message types are supported, each with its own routing mechanism. Network administrative messages use source vector routing, while mainstream messages use distributed routing tables which exist in each chip. The routing tables define a static route between every pair of destinations in the network, and are programmed by soft ware whenever the network needs to be reconfigured. This allows hardware to take advantage of the static route and optimize routing latency, while software can step in at any time to define new routes which avoid detected faults or change existing routes based on new algorithms. The receive buffers of a port maintain a separate linked list of messages for each of the 5 possible output ports for each virtual channel to avoid the `block at head of queue' bottleneck. Each arbitration cycle, the arbiter chooses up to 6 winners from as many as 120 arbitration candidates to maximize crossbar utilization. Messages accumulate a net work age as they are routed, increasing their priority to
avoid starvation and promote network fairness.
Physical transmission layer
Physical data transmission for each of the 6 full duplex ports is provided by an SSD/SSR pair. A single link consists of 20 data bits, a frame bit, and a differential clock per direction. The 200 MHz differential clock is sampled on both edges, making the effective rate of data transmission 400 MBaud. The raw data bandwidth of each link is 1 GByte/sec per direction.
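The raw bandwidth figure follows directly from the link parameters. As a quick check (plain arithmetic; the constant names below are illustrative, not SPIDER register names):

```python
# Raw link bandwidth per direction: 20 data bits per transfer at 400 MBaud.
# The 200 MHz differential clock is sampled on both edges, giving 400 MBaud.
DATA_BITS = 20
BAUD_RATE = 400e6  # transfers per second

bytes_per_sec = DATA_BITS * BAUD_RATE / 8
assert bytes_per_sec == 1e9  # 1 GByte/sec per direction, as stated
```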
The chip core operates at 100 MHz, requiring a 4 to 1 serialization to occur between the core and the 400 MHz SSD data pins. At each 100 MHz clock edge, the core provides 80 bits of data, which the SSD serializes into a 20 bit stream over four 400 MHz clocks. The SSD and core use the same clock source with different divisors for synchronous transmission.
The SSR uses the received differential clock to sample incoming data bits. The differential clock has been delayed on the board by 1.25 nsec to ensure that the clock edge is centered on the desired data sample point across process variation. Once data is clocked into the SSR, it is de-serialized and placed into one of two 80 bit registers which serve as the core's receive buffer. The data framing bit is synchronized to the local core clock frequency, and data is read from the receive buffer into the core. To prevent receive buffer overruns, a dead cycle is inserted to break up continuous data bursts which exceed a programmable maximum length. This maximum burst length is a function of crystal frequency tolerances, but generally can be set to a burst several thousand clocks in length.
Data link layer
The data link layer guarantees reliable data transfer between chips and provides a clean interface to the chip core, hiding the control details of the physical transmission layer. The interface to the LLP receives and sends data in micropacket quantities, which consist of 128 bits of data plus 8 bits of sideband. The 8 bits of sideband are intended for flow control and out-of-band communication by a higher level protocol, and are not used by the LLP. This relatively small micropacket size was chosen because a large class of network transactions are short, fitting into 1 or 2 micropackets. Also, data received from a SPIDER chip is known to be error free in 16 byte quantities, allowing endpoints to use partial data from larger transfers immediately, without waiting for a full message CRC check. This is especially important for interprocessor networks, where early restart on a partial cache line fill can be implemented. Data protection on a single link basis instead of end to end tolerates message header modification as it flows through the network, allowing such features as message aging and virtual channel adaptation.
Once the LLP receives data and sideband from the chip core, it assigns a sequence number to the micropacket and stores it in a retransmit buffer. The LLP also appends an acknowledge sequence number to the micropacket, and calculates a 16 bit check code using CCITT-CRC. The entire micropacket is then transmitted over the link. If no errors occur, the remote peer will acknowledge the sequence number of the transmitted micropacket(s), which are then removed from the retransmit buffer. If an error does occur, a go-back-n sliding window protocol is used for recovery. Any error condition causes the packet to not be acknowledged, and the LLP will retransmit all micropackets in the retransmit buffer until a positive acknowledge is received. The combination of the CCITT-CRC and sliding window protocol protects the link from all single, double, and odd numbers of bit errors, all burst errors up to 16 bits in length, dropped and duplicate micropackets, and clock and data transmission errors.
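The paper identifies the check code only as "CCITT-CRC"; the sketch below uses the common CRC-16/CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF) to illustrate a 16 bit check code over a micropacket. The exact initial value and bit ordering used by the LLP are assumptions here.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16 with the CCITT polynomial x^16 + x^12 + x^5 + 1.

    MSB-first, no reflection (the CRC-16/CCITT-FALSE parameterization).
    """
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# A micropacket carries 128 bits (16 bytes) of data; the 16-bit CRC
# is computed over the packet and checked at the receiving LLP.
micropacket = bytes(16)
check = crc16_ccitt(micropacket)
assert 0 <= check <= 0xFFFF
```

A single flipped bit anywhere in the 16 bytes changes the check value, which is what lets the receiver refuse the acknowledge and trigger the go-back-n retransmit.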
The LLP uses transparent link negotiation during reset to support multiple port widths. The LLP can negotiate with chips having 10 bit, 20 bit, or wider interfaces. At reset time the peer LLPs negotiate using a narrow port width to arrive at the widest common port width. The interface to the chip core is identical in all bit modes, but the data rate will reflect the port width. This is useful for interfacing with devices that have varying data rate needs. In addition, if a SPIDER to SPIDER link is damaged in any of the upper 10 bits, it can negotiate to use the lower 10 bits and operate at half bandwidth.
Message layer

The message layer defines 4 virtual channels and provides independent flow control, header identification, and error notification on a micropacket basis. Flow control is credit based. After reset, all links credit their peers based on the size of their virtual channel receive buffers. This size can vary between interface chips, but SPIDER implements 256 byte buffers for each virtual channel on each port. This is sufficiently large to maintain full bandwidth over 5 meter cables, plus some extra buffering to absorb momentary congestion without degrading bandwidth. Virtual channel tagging and credit information is transmitted in the micropacket sideband field, leaving the remaining 128 bits for data transfer. The total overhead cost in bandwidth of providing reliable links, assigning virtual channels, and managing flow control is 32 bits out of 160, or 20%. This leaves the effective link bandwidth available for data transfer at 800 MBytes/sec per direction per link.
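Credit based flow control of this kind can be sketched as follows. The class and method names are hypothetical, not SPIDER's actual logic: the sender starts with credits equal to the peer's receive buffer size in micropackets and may transmit only while credits remain; credits are returned through the sideband as the peer drains its buffer.

```python
MICROPACKET_BYTES = 16  # 128 bits of data per micropacket

class CreditSender:
    """Per-virtual-channel credit counter on the sending side (illustrative)."""

    def __init__(self, peer_buffer_bytes: int = 256):
        # After reset, the peer credits us with its full buffer size.
        self.credits = peer_buffer_bytes // MICROPACKET_BYTES

    def can_send(self) -> bool:
        return self.credits > 0

    def send(self) -> None:
        assert self.can_send(), "would overrun the peer's receive buffer"
        self.credits -= 1

    def credit_returned(self) -> None:
        # Peer freed one micropacket slot; the credit arrives in the sideband.
        self.credits += 1

vc = CreditSender()        # 256-byte buffer -> 16 micropacket credits
for _ in range(16):
    vc.send()
assert not vc.can_send()   # peer buffer full; must wait for returned credits
vc.credit_returned()
assert vc.can_send()
```

With 256 bytes of buffering per virtual channel, enough micropackets can be in flight to cover the round-trip credit latency of a 5 meter cable, which is why full bandwidth is sustained.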
Each message contains a header micropacket followed by zero or more data micropackets. Twenty-three bits of each header micropacket are reserved to specify destination and routing information; the remaining bits of a header packet are treated as data, and may be used in any way by a higher level protocol. Message boundaries are delineated by a tail bit which is set on the last micropacket of a message. Message headers always follow micropackets with the tail bit set.
A 9 bit destination identifier specifies one of 512 possible network destinations. These identifiers map into the routing tables, which are described in the next section. The header direction field is 4 bits, and specifies an exit port on the next SPIDER chip. This direction format supports up to 15 ports, as direction 0 is always reserved for communicating with the local administration control block on a SPIDER. There are 256 levels of message age, which are used to prioritize arbitration decisions. Finally, 2 bits of congestion control (CC) are provided. The LSB of the CC field specifies that an individual message may adapt between two virtual channels, always choosing the least used virtual channel at each port. This increases performance in heavily loaded networks, but allows messages to arrive out of order.
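The field widths above (9-bit destination, 4-bit direction, 8-bit age, 2-bit CC) account for the 23 routing bits of a header micropacket. A minimal packing sketch, with bit positions assumed since the paper does not give the layout:

```python
# Illustrative packing of the 23 routing bits of a header micropacket.
# Field widths come from the text; the ordering of fields is an assumption.

def pack_header(dest: int, direction: int, age: int, cc: int) -> int:
    assert 0 <= dest < 512       # 9-bit destination ID
    assert 0 <= direction < 16   # 4-bit exit port on the next chip
    assert 0 <= age < 256        # 8-bit message age
    assert 0 <= cc < 4           # 2-bit congestion control
    return (dest << 14) | (direction << 10) | (age << 2) | cc

def unpack_header(h: int):
    return ((h >> 14) & 0x1FF, (h >> 10) & 0xF, (h >> 2) & 0xFF, h & 0x3)

h = pack_header(dest=300, direction=3, age=0, cc=1)
assert unpack_header(h) == (300, 3, 0, 1)
assert h < (1 << 23)  # all routing state fits in 23 bits
```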
Routing

The SGI SPIDER chip uses programmable, distributed tables to route messages. Once the tables are programmed, a static route between every two endpoints in the network is established. This allows hardware to make routing decisions in minimal time, while software is free to reconfigure the network to avoid faults or establish a new route map for performance or resource reasons. This scheme relieves the endpoints from global network knowledge and cumbersome source routing hardware in the endpoint interface chips.
Each message header specifies a unique destination ID. When a message enters a port, the destination ID is used to look up routing instructions in the tables. The table returns a direction, or exit port, which is used by the next SPIDER chip for crossbar arbitration. Table lookup is pipelined in this fashion to reduce latency.
The routing tables are organized into 2 levels of hierarchy. This reduces the number of entries required in each table from 512 down to 48, but also places some restrictions on routing possibilities. The 4 LSBs of the destination ID specify the local address, while the upper 5 bits specify the meta address. Each SPIDER chip has a meta ID register which is compared to the meta ID of the message. If the meta IDs match, the message is in the correct meta domain, and the local table is used to route to the precise destination. If the meta IDs do not match, the meta table is used to specify the route to the correct meta domain.
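The two-level lookup can be sketched directly from that description. The table contents below are made up for illustration; only the field split (4-bit local address, 5-bit meta address, 16 + 32 = 48 entries) comes from the text.

```python
# Two-level routing lookup (sketch). The 4 LSBs of the 9-bit destination
# ID select a local-table entry; the upper 5 bits select a meta-table entry.

def route(dest_id: int, my_meta_id: int, local_table, meta_table) -> int:
    local = dest_id & 0xF            # 4 LSBs: address within a meta domain
    meta = (dest_id >> 4) & 0x1F     # upper 5 bits: meta domain
    if meta == my_meta_id:
        return local_table[local]    # already in the right meta domain
    return meta_table[meta]          # head toward the right meta domain

local_table = [1] * 16   # 16 entries: exit port per local destination
meta_table = [2] * 32    # 32 entries: exit port per meta domain

# Destination 0x25 = meta domain 2, local address 5.
assert route(0x25, my_meta_id=2, local_table=local_table, meta_table=meta_table) == 1
assert route(0x25, my_meta_id=0, local_table=local_table, meta_table=meta_table) == 2
```

The restriction this hierarchy imposes is visible in the sketch: all destinations in a remote meta domain share one meta-table entry, so they must share the same exit port from this chip.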
It is possible to send messages without using the tables via a source routing protocol called vector routing. This mechanism is used for network configuration and administration. Vector routed packets contain a source relative, step by step series of direction fields which define the exact route of a message. Vector routed messages can be sent while standard table routed messages are in flight.
The programmable tables support a variety of topologies. Different topologies are chosen to trade off cost, network bandwidth, and route redundancy. Tables must be programmed in a cycle-free manner to avoid deadlock. Two possible topologies are briefly described below.
To achieve scalable bisection bandwidth with minimal average distance between endpoints, a hierarchical fat hypercube can be used. This topology grows as a standard hypercube for 2 to 32 endpoints, then expands to a second level of hypercubes for larger systems. The second level, or meta, hypercubes are `fat hypercubes' because there is actually a full hypercube connecting each vertex of every local hypercube. The two levels of hypercube need not be of the same dimension.
The fat hierarchical hypercube topology maintains a constant bisection bandwidth of 800 MBytes/sec per endpoint for all configurations up to 512 endpoints. Networks can be sparsely populated, and there is no requirement for powers of 2 endpoints or SPIDER chips. This topology can sustain multiple faults in both the meta hypercubes and the local hypercubes without losing connectivity. To reduce cost, it is also possible to place two or more endpoints on each SPIDER chip, which is called bristling. Increasing the bristling factor of a network will reduce the SPIDER chip count, but will also reduce network bisection bandwidth.
Another possible network topology is a non-blocking N by N switch. The topology uses O((N*N)/4) SPIDER chips to build a full crossbar between N ports for large N. This topology has a guaranteed bandwidth of 1.6 GBytes/sec between every endpoint pair simultaneously, but is expensive in its use of SPIDER chips for large configurations.
Crossbar arbitration

The central crossbar arbitration algorithm is crucial to performance. The goal is to maximize bandwidth through the crossbar while guaranteeing starvation avoidance and favoring higher priority messages. To minimize latency, the arbiter falls into a bypass mode which uses fixed priority evaluation for minimal arbitration time when there is no output port contention.
To maximize bandwidth through the crossbar without using unreasonable buffering, each virtual channel buffer is organized as a set of linked lists. There is one linked list for each possible output port for each virtual channel. This solution avoids the block at head of queue problem, since a blocked message targeting one output port will not prevent a subsequent message in that same buffer from exiting on a different output port.
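The per-output organization can be sketched as follows. The chip implements linked lists within a shared buffer; standard queues stand in for them here, and the class name is illustrative.

```python
from collections import deque

class VirtualChannelBuffer:
    """One virtual channel's receive buffer, organized as per-output queues.

    A message blocked toward one output port does not block messages
    bound for the other output ports (no head-of-line blocking).
    """

    def __init__(self, num_outputs: int = 5):
        # One queue per possible output port (5 other ports on the chip).
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, output_port: int, message: str) -> None:
        self.queues[output_port].append(message)

    def head(self, output_port: int):
        q = self.queues[output_port]
        return q[0] if q else None

buf = VirtualChannelBuffer()
buf.enqueue(0, "blocked msg")   # suppose output port 0 is congested
buf.enqueue(3, "ready msg")     # a later arrival bound for port 3
assert buf.head(3) == "ready msg"  # still routable despite the blocked head
```

With 6 ports, 4 virtual channels per port, and 5 candidate output queues per virtual channel, up to 6 x 4 x 5 = 120 heads of queue can compete in one arbitration cycle, which is where the figure of 120 candidates comes from.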
Using the linked list based virtual channel buffers, the central arbiter attempts to maximize the number of assigned ports for each arbitration. The arbiter organization is similar to the Wavefront Arbiter described by Tamir et al. This type of arbiter performs better as more arbitration candidates are present. To maximize crossbar efficiency, each virtual channel from each port can request arbitration for every possible destination, providing up to 120 arbitration candidates each cycle.
In order to avoid starvation and encourage network fairness, the arbiter is rotated each arbitration cycle to favor the highest priority requestor. Priority is based on the age field of a message header. Once a message enters a SPIDER chip, it will age at a programmable rate until it exits, taking its new age in the message header as it enters the next SPIDER. The concept of network age promotes fairness as a function of network flight time rather than just across a single chip. This tends to reduce starvation problems brought on by network hotspots, as endpoints distant from the hotspot will have a higher priority when they arrive at the hotspot.
Messages which are considered high priority can be injected into the network with an artificially high age. The top 16 age levels are reserved for the highest priority messages, as standard messages are not allowed to age beyond a value of 239.
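A minimal sketch of this aging rule, assuming (the paper does not say) that messages injected in the reserved high-priority band do not age further:

```python
# Message aging (sketch). Standard messages age toward a cap of 239;
# ages 240-255 are the 16 levels reserved for high-priority injection.

AGE_CAP = 239  # highest age a standard message may reach

def age_message(age: int, increment: int = 1) -> int:
    """Age a message by the programmable increment, saturating at the cap."""
    if age > AGE_CAP:
        return age  # reserved high-priority band (assumed not to age)
    return min(age + increment, AGE_CAP)

assert age_message(100) == 101
assert age_message(238, increment=5) == 239  # standard messages saturate
assert age_message(250) == 250               # injected priority is preserved
```

Because the reserved band sits strictly above the cap, a standard message can never age its way into the priority reserved for injected high-priority traffic.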
As messages travel through the network, gaps may be introduced between data packets of the same message. This occurs when a higher priority message on a different virtual channel cuts through a lower priority message in progress, or when port contention causes gaps due to buffer underflow. The SPIDER chip can be programmed to discourage gaps in messages. Register settings allow a message above a threshold age to cut through other messages in progress in order to minimize high priority message latency. If the high priority threshold for message cut through is increased or disabled, fewer gaps will appear in messages as they cross the network.
Network administration & error handling
The SGI SPIDER chip core contains an administrative module which provides control functions and tracks error conditions and performance data. Registers in this module control functions such as aging rates, message cut-through thresholds, table access, error state, performance metrics, port protection, and port shutdown, as well as scratch registers with value exchange access. Protection registers control access to the administrative module, and reset fences can be set to prevent reset from propagating through network partitions.
While the LLP layer ensures reliable transmission, it is important to track transient errors, as they impact performance and may indicate a failing part. Checkbit errors and sequence number errors are counted and stored on a per port basis. If a single packet is retried a large number of times without success, the link is shut down and no new data is accepted from the link until software examines the error state and resets the link.
Timers can also be set to watch for a tail timeout condition, which occurs when a message remains in progress over a certain virtual channel with no data flowing for a timeout period. When a tail timeout occurs, the SPIDER chip fabricates a tail micropacket with an error indicator set and sends it along the open channel. Another timer can be set to watch for deadlock conditions. If a given virtual channel remains full with no progress for a timeout period, a deadlock timeout bit is set and the port is reset. This condition should only occur due to misprogrammed tables or endpoint loops.
A real time clock network is also provided for applications which require tight synchronization between end points. Two additional physical wires per port distribute a clock signal throughout the network. Control registers select the source clock port, and all other endpoints receive the clock with minimal skew.
Physical design

The SGI SPIDER chip is built using CMOS 5L, a 0.5 micron process from IBM. The 850,000 gates are made up of a combination of standard cells, memory arrays, hand laid datapath macros, and custom I/O cells to handle high speed off chip signaling. The 160 mm² die with 5 layer metal connects to a custom 624 pin, 18 layer ceramic column grid array (CCGA) package using flip chip C4 style die attach. The core operates at 3.3 V and dissipates up to 29 watts when all ports are in operation.

Three of the six ports provide a proprietary single ended, open drain, low voltage interface. These ports can communicate with other chips in the same chassis, and tolerate multiple connector crossings. The other three ports drive complementary PECL output pairs. The differential signaling allows for communication between chassis over cables up to 5 meters in length, provided the ground shift between chassis is less than 500 mV. The differential ports also provide driver/receiver shutdown features which can be used in concert with remote power sensing to support hot plugging of cable connections. An external chip is also available to provide translation between single ended and differential signaling levels.
An important application of the SGI SPIDER chip is interprocessor communication, which is very latency sensitive. To address this, a great deal of design effort was spent minimizing latency. Operations are done in parallel whenever possible, arbitration is speculated before error status is known, and custom cell layout is used to speed chip transit.
After data is received by the SSR and synchronized, it enters the chip core and begins several operations in parallel. Table lookup and crossbar arbitration are normally serialized, as the exit port must be known before arbitration begins. To parallelize these operations, table lookup is pipelined across SPIDER chips. The direction field in the message header refers to the exit port targeted for the next hop, so crossbar arbitration can begin immediately. While arbitration progresses, the table lookup is performed for the next SPIDER chip, which depends on the destination ID and the direction field. This does increase table size, as a full table is required for each neighboring SPIDER chip, but it reduces latency by a full clock. Pipelined tables also add flexibility to possible routes, as different exit ports can be given depending on where a message came from as well as where it is going.
During arbitration and table lookup, the LLP checks the CRC code and sequence numbers, and will signal bad data two cycles later with a squash signal. When a squash occurs, all state affected by the squashed micropacket rewinds. In the event that the bad packet has already been routed, a stomp code is appended to the CRC on the outgoing micropacket so that it will be discarded downstream.
If a message wins arbitration, it will flow through the central crossbar and be joined by pre-computed flow control information at the sender block. Finally, CRC is computed and the micropacket is transmitted by the SSD. The crossbar consists of hand placed multiplexor cells and the CRC generation is optimized using wide parity cells. In the absence of congestion, the pin to pin latency of the SPIDER chip is 40 nsec. After adding inter-chip propagation delays such as circuit board trace and cables, uncongested delay is approximately 50 nsec per SPIDER chip. The table below shows the average latency for uniform accesses between endpoints for the non-bristled hierarchical fat hypercube topology discussed earlier:

# Endpoints    Avg. Latency    Bisection BW
          8        118 nsec      6.4 GB/sec
         16        156 nsec     12.8 GB/sec
         64        274 nsec     51.2 GB/sec
        256        344 nsec      205 GB/sec
        512        371 nsec      410 GB/sec
The SGI SPIDER chip implementation was made possible by a small but inspired team of SGI engineers. All design and performance goals were met at speed and on time. Special thanks to the hardware team, Yuval Koren, Bob Newhall, and David Parry, with Ron Nikel on high speed signaling. Thanks also for architecture feedback from Dan Lenoski and Jim Laudon, plus performance simulation and feedback from Dick Hessel and Yen-Wen Lu.

References

Stallings, William, "Data and Computer Communications", Macmillan Publishing Co., 1988, pp. 137-144.

Dally, William J., "Virtual Channel Flow Control", IEEE Proc. 17th Int. Symp. on Computer Architecture, May 1990, pp. 60-68.

Tamir, Yuval, and Chi, Hsin-Chou, "Symmetric Crossbar Arbiters for VLSI Communication Switches", IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 1, 1993, pp. 13-27.