Unix Review November 1990 4D 340
Tested Mettle - THE SILICON GRAPHICS 4D/340 VGX SUPERWORKSTATION
BY DAVID WILSON
The Silicon Graphics Inc. 4D/340 VGX is an exceptional system for a number of reasons. The system is powered by four 33 MHz R3000 processor chips, and at this time has the highest-performing graphics available on any SGI system. This pairing, albeit expensive, provides an extremely potent combination of compute server and display processor. The graphics subsystem provides drawing rates of more than one million operations per second for lines and some other shapes, while the four processors can operate cooperatively on a single user task.
SGI has two different families of multiprocessor systems. The 4D/200 series is based on 25 MHz R3000 CPU chips, and the 4D/300 series is based on 33 MHz R3000 chips. The middle digit in the model number indicates the number of processors in the system. The 4D/200 series is available with one to eight processors (4D/210 to 4D/280). Because the 4D/300 is based on a processor board that contains two complete processors, the 4D/300 is only available in configurations with two, four, or eight processors (4D/320, 4D/340, 4D/380).
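The naming scheme described above can be captured in a few lines. The following Python sketch is purely illustrative: the function and type names are hypothetical, and the only rules it encodes are the ones stated in this review (series digit gives the clock speed, middle digit gives the processor count, and the 4D/300 series comes only with two, four, or eight processors).

```python
import re
from typing import NamedTuple


class Model(NamedTuple):
    series: int      # 200 or 300
    clock_mhz: int   # 25 MHz for the 4D/200 series, 33 MHz for the 4D/300 series
    processors: int  # taken from the middle digit of the model number


# Clock speed per series, as stated in the review.
SERIES_CLOCK_MHZ = {2: 25, 3: 33}


def decode_model(name: str) -> Model:
    """Decode a multiprocessor model name such as '4D/340'."""
    match = re.fullmatch(r"4D/([23])([1-8])0", name)
    if match is None:
        raise ValueError(f"not a 4D/200 or 4D/300 series model: {name!r}")
    series_digit = int(match.group(1))
    processors = int(match.group(2))
    # The 4D/300 processor board carries two CPUs, so only 2, 4, or 8 are offered.
    if series_digit == 3 and processors not in (2, 4, 8):
        raise ValueError(f"no such 4D/300 configuration: {name!r}")
    return Model(series_digit * 100, SERIES_CLOCK_MHZ[series_digit], processors)


for name in ("4D/210", "4D/280", "4D/320", "4D/340", "4D/380"):
    print(name, decode_model(name))
```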
Also offered with the 4D/200 and 4D/300 series machines are two varieties of graphics accelerators, the VGX and GTX. These graphics subsystems differ substantially in both cost and performance. If purchased without a graphics subsystem, a 4D/200 or 4D/300 system is able to act as a multiprocessor compute and data server for a network. Adding a GTXB graphics subsystem adds $40,000 to the server price, while adding a VGXB graphics accelerator adds $60,000 to the total cost. For that difference, the VGX subsystem (also known as PowerVision) is specified as two or more times faster than the GTX.
This review will concentrate on only one of the SGI multiprocessor models, a 4D/340 VGX. With four 33 MHz R3000 processors and the higher-performing graphics accelerator, the reviewed system has few compromises. Models 4D/320 and 4D/340 are packaged in the same deskside chassis, with dimensions of 26 x 26 x 27 inches and a total weight of 185 pounds. The standard SGI monitor is a 19-inch color RGB display with a weight of 84 pounds. These dimensions are the same for the 4D/200 series machines up to the model 240. A different chassis, a full-size rack, is used in systems that have more than four processors.
The 4D/340 can have up to 128 MB of ECC memory. The base published price of all SGI systems includes 8 MB of RAM. That memory size might be usable in SGI's Personal Iris workstation, but must be supplemented in a multiprocessor system. A configuration with 32 MB or more of memory seems reasonable in a four-processor configuration.
The deskside chassis is a "twin tower" style. Processor and memory electronics are all contained in the larger chassis, while a smaller tower is used for peripheral expansion. A 150 MB tape drive and a SCSI disk drive are commonly found there. SGI offers both 380 MB and 780 MB SCSI disk drives. Additional SCSI device modules can easily be added to the system. In particular, an easily attached SCSI tape drive can be purchased, and this single drive can be moved quickly between systems. This gives the advantage of having a local tape drive on each system, but limits the cost to that of a single unit, regardless of the number of SGI systems.
SGI also offers higher-performing SMD and IPI disk drives for its systems. This makes sense where the system will often be used as a server. These disks are physically larger, requiring a separate chassis to hold the drive, and are faster and have more capacity than the SCSI drives. Both disks offer capacities of 1 GB or more per drive, and cost $20,000 or more for a controller, drive, and separate chassis.
Another item that is very useful for systems with large disk capacities is an Exabyte 8 mm cartridge tape drive. With a price of $8950, and a capacity of 2.3 GB per drive, this SCSI peripheral has sufficient capacity to make the system backup procedure a much more reasonable task. Because an SGI 4D/340 is nearly always going to be part of a network, the Exabyte will be a resource available to the network rather than just to the system on which it is located.
Since SGI is best known for its graphics performance and subsystems, it is not surprising that its high-end VGX models have several options. The VGXB differs from the VGX option by having fewer bit planes. The difference in cost is $5000 at the time of order, but a field upgrade is $20,000 for this option. There is also a second raster buffer available for the VGX, which costs $25,000. This can double performance for some graphics applications but its usefulness is clearly dependent upon particular customer applications.
The 4D/340 has four processors, which in normal operation execute tasks as they become ready to run. SGI also offers PowerFORTRAN and PowerC compilers, which allow (if possible) single programs to execute on multiple processors. These compilers (actually preprocessors) examine the programs and insert directives for the standard compiler, which generates parallelized code. While the standard FORTRAN compiler is capable of generating parallel code if given the directives manually, the PowerFORTRAN product automatically examines the user program to find and exploit any possible parallelism in it.
The system tested for this review was the base 4D/340 VGX model, with a base price of $174,900. (Gasp!) In addition, we added 56 MB of RAM (for a total of 64 MB) at $26,000, a 380 MB disk drive for $4000, and a built-in tape drive for $1500. The complement of software includes SGI's basic UNIX operating system (bundled), FORTRAN ($1195), NFS ($595), X11 ($950), and the software development package (bundled). PowerFORTRAN and PowerC are both $5000.
"Tested Mettle" reviewed an SGI 4D/85 in Vol. 8 No. 6, and most of the comments in that review on purchasing, installation, support, documentation, and operation also apply to the 4D/340 system.
Most SGI models of the 4D/340's size are usually purchased directly from local SGI sales offices, although they can also be purchased through some VARs.
Our experiences with SGI sales people have been very positive; they seem extremely knowledgeable and friendly. Purchasing a system such as a 4D/340 will require frequent consultation with the sales people, as configurations are fine-tuned.
In particular, with a system configuration as (potentially) large and expensive as a 4D/340's, the particular configuration chosen can be a delicate issue. There is a difficult tradeoff between purchasing options now or in the future: some are less expensive if purchased with the system, but others are the same price regardless of when they are purchased. RAM, at about $500 per megabyte, is priced reasonably. Although a 56 MB add-on board costs an appreciable $26,000, adding memory is, depending upon system load, one of the easiest ways to make marked improvements in performance on UNIX systems.
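The list prices quoted for the review configuration can be tallied directly. The Python sketch below uses only figures stated in this review; the grouping into hardware, unbundled software, and parallelizing compilers is an assumption made for illustration, as is the reading that PowerFORTRAN and PowerC cost $5000 apiece.

```python
# List prices in US dollars, taken from the review text.
HARDWARE = {
    "4D/340 VGX base system (includes 8 MB RAM)": 174_900,
    "56 MB RAM add-on": 26_000,
    "380 MB SCSI disk": 4_000,
    "built-in tape drive": 1_500,
}
SOFTWARE = {"FORTRAN": 1_195, "NFS": 595, "X11": 950}
# Assumption: "both $5000" is read as $5000 for each compiler.
POWER_COMPILERS = {"PowerFORTRAN": 5_000, "PowerC": 5_000}

hardware_total = sum(HARDWARE.values())
software_total = sum(SOFTWARE.values())
power_total = sum(POWER_COMPILERS.values())

# Cost of the memory add-on per megabyte.
ram_dollars_per_mb = HARDWARE["56 MB RAM add-on"] / 56

print(f"hardware:                 ${hardware_total:,}")
print(f"hardware + software:      ${hardware_total + software_total:,}")
print(f"with parallel compilers:  ${hardware_total + software_total + power_total:,}")
print(f"add-on RAM per megabyte:  ${ram_dollars_per_mb:,.0f}")
```

On these figures the add-on memory works out to somewhat under the "about $500 per megabyte" cited above.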
Government and VAR discounts from SGI are generally good, and even end users may be able to negotiate on the price of a 4D/340 system. Delivery on the 4D/340 models seems good, but because of the various configurations a customer should not expect "from-stock" delivery. However, SGI does a good job of providing accurate delivery estimates.
All SGI 4D/340 systems are installed by a field engineer. The system unit arrives in a very large box, with the display in a somewhat smaller one. SGI takes the responsibility for unpacking and doing hardware setup on the unit.
The system is supposed to ship with the operating system installed. Should there be any problems, the field engineer can re-install the operating system. Although this policy may not be firmly established, network hookup and checkout is also done by the SGI engineers. The time from when the boxes were opened until the system was connected to the network and ready for use was a respectable 90 minutes, including substantial system-diagnostic checkout time.
This hand-holding policy makes sense for a system as large as the 4D/340. Other than its weight, the installation of this system was no more difficult than that of a Personal Iris, but for a system of this size it is comforting to have an SGI engineer present.
Support and Documentation
We had no reason to use the SGI hardware or software support lines for the 4D/340. Previous experience indicates that SGI's support personnel provide good support, with accurate and helpful answers when calls are returned.
Local support through the SGI sales offices is likely to vary depending upon the particular location, but our experience is very positive. A friendly engineer at the local office is worth a great deal.
SGI documentation is generally quite good, especially for graphics programming issues. Manuals are all provided in a 6-x-9-inch page format in loose-leaf binders. The UNIX manual pages are also available on line. The overall organization is traditional for UNIX documentation, with additional SGI-unique manuals done in the company's own style.
Only for the monitor ROM do we find a lack of documentation. For the 4D/340 we never had to make use of the ROM other than to boot the system at power on, but in other contexts on other SGI systems we have needed to use this facility.
Operation and Ease of Use
SGI loads its 4Sight windowing package when the system boots. This software also allows X11 to run transparently. With the dual windowing available, and an optional icon-based interface for the system, the user interface has all of the features normally found in current graphical user interfaces. While we do not prefer or normally use an icon interface, its presence did not make us change how we do our work.
X11 worked as expected, and all of our tests ran without problems on the current (3.3) version of the SGI operating system. Some older versions of the operating system had problems with some of our tests, and version 3.3 seems to have cleaned up all the problems we encountered with the earlier releases. The system ran as expected, which is our best measure of how easy it is to use. NFS also worked as expected, and the FORTRAN and C compilers seem to be variants of compilers developed by Mips Computer Corp. The parallelizing FORTRAN and C compilers were enabled with a simple switch on the compiler command line and were easy to use. Output from the compiler was understandable, so it was possible to see what parts of the code could run in parallel.
For the administrator, SGI has a graphical administration shell. The shell is very easy to use and makes administration chores seem quite trivial for most standard functions. There is also a graphical performance monitor that separately shows the status of each of the processors and the disk controller.
One new product for SGI is its network visualizer. This software tool allows an SGI workstation to monitor network traffic, and display useful information about network activity. The interface to this package is easy to use, and there are many options available. The package provides a visual display of network nodes (by IP number), and draws lines to the other systems with which they are communicating. The package can also, among other things, grab and decode network packets.
With the network visualizer it is possible, for example, to monitor NFS packets being sent between two specific systems, or to monitor all packets sent from a particular system, and so on. While these facilities have long been available on network "sniffer" products, this is the first time we have seen a package running on a workstation that is able to do nearly so many tasks.
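The node-and-line display described above amounts to grouping observed packets by the pair of systems exchanging them. The Python sketch below illustrates that idea only; it is not SGI's implementation, the function names are hypothetical, and the packet list is made-up sample data using placeholder addresses.

```python
from collections import Counter


def conversations(packets):
    """Count packets exchanged between each unordered pair of addresses."""
    edges = Counter()
    for src, dst in packets:
        # Sort the pair so that A->B and B->A land on the same line.
        edges[tuple(sorted((src, dst)))] += 1
    return edges


def sent_from(packets, address):
    """Return only the packets whose source is the given address."""
    return [(src, dst) for src, dst in packets if src == address]


# Hypothetical (source, destination) pairs; the addresses are placeholders.
SAMPLE = [
    ("192.0.2.1", "192.0.2.2"),
    ("192.0.2.2", "192.0.2.1"),
    ("192.0.2.1", "192.0.2.3"),
    ("192.0.2.3", "192.0.2.1"),
    ("192.0.2.1", "192.0.2.2"),
]

for (a, b), count in sorted(conversations(SAMPLE).items()):
    print(f"{a} <-> {b}: {count} packets")
print("sent by 192.0.2.1:", len(sent_from(SAMPLE, "192.0.2.1")))
```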
The 4D/340's performance is interesting both in comparison to other systems and for what it indicates about the performance of parallel systems. The 4D/340 uses multiple high-performance processors, but their usefulness on a given test varies greatly.
Figure 1 shows the Dhrystone, Whetstone, Linpack, and SPECmark performance results for the 4D/340 and several other competitive systems. The Dhrystone results, a measure of CPU performance, vary by a factor of three across the systems being compared. The SGI 4D/340 rates second among the systems compared (behind the IBM RS/6000 Model 540). In comparison to 25 MHz R3000 systems (DECstation 5000), the Dhrystone values are about where they should be, and all of the R3000 systems have higher Dhrystone values than the other two systems.
The IBM system leads in Whetstones, a measure of floating-point performance, followed by the 4D/340 VGX, followed by systems in about the same order as for the Dhrystone test. The IBM has about twice the performance of the 4D/340 on double-precision Whetstones. This is a wider gap than that for Dhrystones, because of the exceptionally fast floating-point hardware in the IBM system.
For Linpack, another measure of floating-point performance, the same pattern appears again. The Sun-4/330 system has a slightly higher relative performance on this test than on Whetstone tests, and is faster than the 4D/25 system.
Also shown in Figure 1 is the SPECmark value. The SPEC group is a consortium of hardware vendors that have created a benchmark suite for the purpose of standardizing hardware-performance measurement. The SPECmark is the geometric mean of the results of 10 separate benchmark tests, covering both CPU- and FPU-intensive areas of performance. A SPECmark rating of 1 is the value achieved by a DEC VAX 11/780 running the same suite of tests. These benchmarks have been ported to systems by their hardware manufacturers and the results shown are those measured by each manufacturer and reported through the SPEC group. The two missing numbers have not been published by SPEC. The value for the 4D/340 is also footnoted, because it was actually run on a two-processor 4D/320 instead.
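For readers unfamiliar with how the single SPECmark figure is formed: in the first SPEC suite, each benchmark yields a ratio of the VAX 11/780 reference time to the measured time, and the composite is the geometric mean of those ratios. The Python sketch below shows that arithmetic. The ten ratios are made-up placeholders, not measurements from any system in this review, and the function names are hypothetical.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """How many times faster than the reference machine one benchmark ran."""
    return reference_seconds / measured_seconds


def specmark(ratios):
    """Geometric mean of the per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical ratios for ten benchmarks (placeholders only).
EXAMPLE_RATIOS = [18.0, 22.5, 15.2, 30.1, 19.8, 25.0, 17.3, 21.4, 28.6, 16.9]

print(f"composite: {specmark(EXAMPLE_RATIOS):.1f}")
# A machine that matches the reference on every test scores exactly 1.
print(f"reference machine: {specmark([1.0] * 10):.1f}")
```

A geometric mean keeps one unusually fast result from dominating the composite, which is one reason to prefer it when summarizing ratios.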
The ranking apparent in the previous tests appears again here, with the IBM leading the 4D/340, which in turn is faster than the other systems. The value for the DECstation 5000 is unexpectedly high on this test relative to the 4D/340.
For the Dhrystone, Whetstone, and Linpack tests the presence of four processors made no difference to the results. We ran the tests both in the normal one-processor mode and with code compiled for multiprocessors. Running these tests with all four processors enabled made no significant difference to the benchmark results. The SPECmark value was run on a 4D/320 dual processor system, but there is no single-processor system for comparison.
Figure 2 shows the data-transfer rates, in kilobytes per second, achieved when sequentially reading 2 MB of data from an 8 MB file. The block size for this test varies from 512 to 8192 bytes. We obtained excellent rates from the SCSI disk on the 4D/340, at 850 KB per second. This is slightly lower than the 4D/210 measured with a 15 MHz ESDI disk, and is slightly lower than the sequential rate on a DECstation 5000. The IBM RS/6000 Model 540 achieves the best transfer rate on this test, at an amazing 1.86 MB per second or higher.
Figure 3 shows data-transfer rates, in kilobytes per second, achieved when randomly reading 2 MB of data from an 8 MB file. The block size for this test varies from 512 bytes to 8192 bytes, affecting the number of logical reads performed by the system. Again, the SGI machine does an excellent job with an SCSI disk, achieving rates second only to the IBM RS/6000 Model 540. The IBM, Sun, and SGI systems achieve outstanding performance on this test, and the results for all three are generally quite close.
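The shape of the sequential and random read tests can be sketched as follows. This is a minimal Python illustration of the procedure described (2 MB read from an 8 MB file at block sizes from 512 to 8192 bytes), not the benchmark code used for this review; the function names are hypothetical, and the doubling of the block size between runs is an assumption. On a modern system the freshly written file will sit in the operating system's cache, so the printed rates reflect memory rather than disk.

```python
import os
import random
import tempfile
import time


def make_test_file(size_bytes):
    """Create a temporary file of size_bytes random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


def read_rate_kb_per_s(path, block_size, total_bytes, randomize, seed=0):
    """Read total_bytes from path in block_size pieces; return KB per second."""
    reads = total_bytes // block_size
    blocks_in_file = os.path.getsize(path) // block_size
    if randomize:
        rng = random.Random(seed)
        offsets = [rng.randrange(blocks_in_file) * block_size for _ in range(reads)]
    else:
        offsets = [i * block_size for i in range(reads)]

    bytes_read = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffer; the OS file cache still applies.
    with open(path, "rb", buffering=0) as f:
        for offset in offsets:
            f.seek(offset)
            bytes_read += len(f.read(block_size))
    elapsed = max(time.perf_counter() - start, 1e-9)
    return bytes_read / 1024 / elapsed


if __name__ == "__main__":
    MB = 1024 * 1024
    path = make_test_file(8 * MB)
    try:
        block_size = 512
        while block_size <= 8192:
            seq = read_rate_kb_per_s(path, block_size, 2 * MB, randomize=False)
            rnd = read_rate_kb_per_s(path, block_size, 2 * MB, randomize=True)
            print(f"{block_size:5d} bytes: sequential {seq:10.0f} KB/s, "
                  f"random {rnd:10.0f} KB/s")
            block_size *= 2
    finally:
        os.remove(path)
```

Smaller blocks mean more logical reads for the same 2 MB, which is why the block size is varied.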
Figure 4 shows the graphics performance we measure on a variety of SGI systems, using both X11 and SGI's GL graphics libraries. The test we use, Ghraphstone, is a set of 122 drawing primitives. For each test, we draw the individual graphics figure for one minute, and count how many times each figure can be drawn. Dividing the count by the time (approximately 60 seconds) provides a figures per second drawing rate for each of the 122 tests. These rates are then normalized into 11 different figure types (such as filled circles, lines, and unfilled rectangles), which in turn are combined into the final Ghraphstone number. This value represents the relative performance of the system when doing 2D drawing-intensive graphics.
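The first steps of that procedure, counting draws over a timed interval and grouping the resulting rates by figure type, can be sketched in Python as below. The way the per-type values are normalized and combined into the final number is not spelled out here, so the plain averages used in this sketch are an assumption, as are the function names and the sample counts.

```python
from collections import defaultdict
from statistics import mean


def drawing_rate(count, elapsed_seconds):
    """Figures drawn per second for one test."""
    return count / elapsed_seconds


def summarize(results):
    """results: iterable of (figure_type, count, elapsed_seconds) tuples.

    Returns (per_type_rates, combined). Both averaging steps are
    placeholders for the benchmark's real normalization, which is not
    described in the text.
    """
    by_type = defaultdict(list)
    for figure_type, count, elapsed in results:
        by_type[figure_type].append(drawing_rate(count, elapsed))
    per_type = {t: mean(rates) for t, rates in by_type.items()}
    return per_type, mean(list(per_type.values()))


# Made-up counts for three of the tests, each run for about a minute.
SAMPLE = [
    ("lines", 600, 60.0),
    ("lines", 1200, 60.0),
    ("filled circles", 300, 60.0),
]

per_type, combined = summarize(SAMPLE)
print(per_type)
print(combined)
```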
The Ghraphstone values shown represent a range of nearly a factor of 10 in performance across X11 and GL, and a factor of three across GL libraries. While SGI's figures would suggest that the 4D/340 VGX has at least twice the performance of the GTX, we do not see that large a difference. Also of interest is that X11 on the fastest system, the 4D/340 VGX, does not draw as rapidly as GL on the slowest of these systems, the 4D/25T. All of the Ghraphstone values for GL are excellent, and no other system has yet achieved a 47,000 or higher value. Those for X11 are acceptable to good.
The reason for the difference between the Ghraphstone-derived and SGI-cited performance values is interesting. SGI has a very large set of graphics performance numbers, detailing the test conditions under which each number is measured. For the VGX graphics those numbers are two or more times faster than the GTX numbers. We measured SGI graphics performance under their conditions, and the results turned out exactly as specified. But graphics is complicated, and changing a few of the many environmental conditions greatly changes performance. We can see, for example, a factor of five difference between best case and worst case performance in drawing 10-pixel lines, depending upon shading and other drawing parameters.
Our results were obtained by taking Ghraphstones for the GL library and running it without modification on the VGX system. We have no doubt that with proper modification, to take fullest advantage of the VGX hardware, we could have made the VGX graphics performance better. This becomes a philosophical issue—whether a benchmark program should be modified to take full advantage of new hardware is not an easy issue to resolve. Regardless, it is clear that the VGX hardware is substantially faster than the GTX graphics, and that further performance increases could be obtained by tuning the program for the particular graphics hardware at hand.
Figure 5 shows a test of the parallel nature of the system. When compiling a program, the compiler normally assumes only a single processor is available and generates code specific to this case. Optionally, a compiler switch can be used to enable parallelization of the code. At run-time an environment variable can be set to enable use of up to a specified number of processors. Thus, we have five sets of performance data, two of which run on only one processor.
The first two columns in Figure 5 represent matrix arithmetic, doing matrix operations on 1000 x 1000 or 100 x 100 matrices. For the 1000 x 1000 test case, adding processors improves performance by 58, 80, or 90 percent, depending upon the number of processors added. For the 100 x 100 case, additional processors give performance gains of 205, 312, or 449 percent as two, three, and four processors are used.
The 100 x 100 case shows greater performance increases than expected (using four processors should at best provide a four times performance increase) because of cache: four processors can get a lot more of the arrays into cache than can a single processor, and each additional processor provides full performance plus the effect of additional cache. The 1000 x 1000 matrices are too large to fit into cache, and thus provide an indication of the improvement possible for very large jobs.
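One way to read these numbers is as parallel efficiency, the speedup divided by the number of processors; a value above 1.0 is the better-than-expected behavior attributed to cache. The Python sketch below does that arithmetic with the Figure 5 percentages quoted in this review. Two readings are assumed: the 1000 x 1000 improvements of 58, 80, and 90 percent are taken as speedups of 1.58, 1.80, and 1.90 on two, three, and four processors, and the 100 x 100 figures are taken as ratios to the one-processor result (2.05, 3.12, 4.49). If the latter are instead gains over the baseline, the efficiencies are higher still.

```python
def efficiency(speedup, processors):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / processors


PROCESSORS = (2, 3, 4)
# 1000 x 1000 case: percentage improvements converted to speedup factors.
LARGE_SPEEDUP = tuple(1 + gain / 100 for gain in (58, 80, 90))
# 100 x 100 case: assumed to be ratios to the single-processor result.
SMALL_SPEEDUP = (2.05, 3.12, 4.49)

for n, large, small in zip(PROCESSORS, LARGE_SPEEDUP, SMALL_SPEEDUP):
    print(f"{n} processors: 1000 x 1000 efficiency {efficiency(large, n):.2f}, "
          f"100 x 100 efficiency {efficiency(small, n):.2f}")
```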
The mixed-code example in Figure 5 is more like normal user code. The test does a variety of things, and was never intended to be a benchmark for use on a vector parallel machine. Thus, the performance improvements are like those that might be expected in typical situations. Gains of 20, 29, or 37 percent are obtained for two, three, or four processors. Note that on two of the tests the single processor parallelized code runs slower than the normal code, and this appears to be due to a minimal overhead involved in code that supports multiple processors. But we did not see more than a 2 to 3 percent penalty running on a single processor after compiling code that can run on multiple processors.
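The mixed-code gains are consistent with a simple model in which only part of the program runs in parallel. The Python sketch below applies Amdahl's law, a standard model that is brought in here for illustration and is not part of the benchmark itself, to back out the parallel fraction implied by each of the quoted gains (20, 29, and 37 percent on two, three, and four processors). The function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def implied_parallel_fraction(speedup, processors):
    """Invert Amdahl's law: 1/S = 1 - p * (1 - 1/n), solved for p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


# Mixed-code gains quoted in the review, expressed as speedup factors.
MIXED_CODE = {2: 1.20, 3: 1.29, 4: 1.37}

for n, speedup in MIXED_CODE.items():
    p = implied_parallel_fraction(speedup, n)
    print(f"{n} processors: speedup {speedup:.2f} implies "
          f"about {p:.0%} of the run time is parallel")
```

All three measurements point to roughly a third of the run time being parallelizable, which is why additional processors bring diminishing returns on this kind of code.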
It appears that about 50 percent of our tests show little if any improvement in performance on a parallel system, while other tests show an improvement of 10 percent or more (up to the 449 percent shown in Figure 5). The decision about how many processors to purchase is not an easy one, and depends entirely upon the nature of the user's code. Also, it needs to be pointed out that there are alternatives. If the goal is to get a single program to run at maximum speed then another kind of system may be the solution. However, multiple processors can always be used in a server environment where there are many pending tasks to execute. Picking a multiprocessor system by the combined MIPS rating of the individual processors is not accurate for running a single job: few, if any, individual tasks take advantage of all the processing power available.
Figure 6 shows Workstation Laboratories' Khornerstone benchmark ratings for the systems compared. The Khornerstone test yields a normalized numerical rating of performance based on a mix of floating-point-, processor-, and disk-intensive tasks (21 in all), as well as another value representing the total time it takes to execute all of the tests. The Khornerstone test is designed to assess machines operating under a single-user workload.
The SGI 4D/340 achieves an outstanding 87,495 Khornerstones, second (by a narrow margin) only to the IBM RS/6000 Model 540. This is the second highest value we have ever measured on any system. While the 4D/340 has excellent CPU and FPU performance, it also maps its disk files into memory (as does the IBM machine) to get excellent disk performance. This advantage is an operating system/memory size issue, and makes the performance difference between small and large memory sizes stand out clearly.
An equivalently sized 4D/240 system would achieve a much higher value than shown since it has the same operating system. The performance value of extra memory is quite high on SGI machines.
Our price/performance measure is obtained by dividing the list price of a system by its Khornerstone rating. Figure 7 shows the comparative results of this calculation.
At $2.40 per Khornerstone, the 4D/340 has a good price/performance ratio. The reason that the ratio is even this high is the review configuration: SGI charges more for the VGX graphics subsystem than several of the other complete systems cost. Further, the configuration has 64 MB of RAM, which is more than required to garner the best performance on our tests. Even so, in a very large configuration with very high performance graphics, the 4D/340 has a good price/performance ratio. While some other systems shown have better price/performance ratios than the SGI machine, we have not seen a better ratio on as large and expandable a system that is able to support very high performance graphics.
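The ratio is simple division of list price by Khornerstone rating. The Python sketch below reproduces it from the figures given in this review. Exactly which line items make up the list price used for Figure 7 is not stated, so the two totals shown (hardware only, and hardware plus FORTRAN, NFS, and X11) are assumptions for illustration.

```python
KHORNERSTONES = 87_495  # rating measured for the 4D/340 VGX


def dollars_per_khornerstone(list_price, rating=KHORNERSTONES):
    """Price/performance: lower is better."""
    return list_price / rating


# Totals built from the prices quoted for the review configuration.
hardware_only = 174_900 + 26_000 + 4_000 + 1_500
with_software = hardware_only + 1_195 + 595 + 950

print(f"hardware only:     ${dollars_per_khornerstone(hardware_only):.2f}")
print(f"hardware+software: ${dollars_per_khornerstone(with_software):.2f}")
```

Both totals land close to the $2.40 figure cited above.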
The report card for the SGI 4D/340 VGX is quite favorable, reflecting how close this system comes to a dream machine for a programmer. Installation is "outstanding" because of the pre-installed nature of the software, and the friendly SGI engineer that comes with the system. Documentation rates "good"; it is solid and accurate but not pressing the limits of documentation quality in any area.
The expandability of the 4D/340 VGX is "excellent." With four processors, 128 MB of RAM, extremely high-performance graphics (with other options), and a SCSI bus for expansion, the system has a lot of room for growth. There are also disk alternatives, giving up to 20 GB of disk storage capacity if desired.
The operation of the system is "good," and there is an optional network monitor program that can be an aid to an entire network of systems. From a user's perspective, the operating system follows the conventions associated with workstations, and offers an outstanding environment in which to develop programs.
The performance of the machine is "outstanding," and it is among the first systems to use a 33 MHz R3000 CPU chip. This, combined with an operating system that takes advantage of all available memory for disk operations, and the fastest graphics available today, provides an extremely responsive system. However, multiple processors can be used effectively only with certain user codes (and then only when specially compiled), or in a server or heavy multitasking environment.
Price/performance rates "good." The 4D/340 is an expensive system, and our price/performance ratio is as high as it is because of the hardware that is being purchased, resulting in an overall "good" rating.
Overall, we consider the 4D/340 to be an "outstanding" system. The machine can double as a server and as a graphics box, although effective graphics performance probably requires much of the processing power for the system. The software environment, the parallel compilers, the multiple processors, and the powerful graphics all combine to make this system truly remarkable.
To us, this is a system everyone would want if it were a little smaller and a lot less expensive. It has the features enumerated above that make it a dream machine for programming and software development. It works as expected, works rapidly, and is never busy: unless unusual user code is running, there is always a processor available for direct use. Overall throughput is outstanding; no background task affects performance.
David Wilson is president of Workstation Laboratories, which performs software development and performance analysis on a variety of computer systems (including personal computers, technical workstations, and multiuser microcomputer systems). Written suggestions or inquiries are welcome. (PO Box 368, Humboldt, AZ 86329).