The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996.

lmbench: Portable Tools for Performance Analysis

Larry McVoy, Silicon Graphics
Carl Staelin, Hewlett-Packard Laboratories

For more information about the USENIX Association:
Phone: 510 528-8649
FAX: 510 548-5738
Email: [email protected]
WWW URL: http://www.usenix.org
lmbench: Portable tools for performance analysis

Larry McVoy, Silicon Graphics, Inc.
Carl Staelin, Hewlett-Packard Laboratories

Abstract

lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking. In almost all cases, the individual tests are the result of analysis and isolation of a customer's actual performance problem. These tools can be, and currently are, used to compare different system implementations from different vendors. In several cases, the benchmarks have uncovered previously unknown bugs and design flaws. The results have shown a strong correlation between memory system performance and overall performance. lmbench includes an extensible database of results from systems current as of late 1995.

1. Introduction

lmbench provides a suite of benchmarks that attempt to measure the most commonly found performance bottlenecks in a wide range of system applications. These bottlenecks have been identified, isolated, and reproduced in a set of small micro-benchmarks, which measure system latency and bandwidth of data movement among the processor and memory, network, file system, and disk. The intent is to produce numbers that real applications can reproduce, rather than the frequently quoted and somewhat less reproducible marketing performance numbers.

The benchmarks focus on latency and bandwidth because performance issues are usually caused by latency problems, bandwidth problems, or some combination of the two. Each benchmark exists because it captures some unique performance problem present in one or more important applications. For example, the TCP latency benchmark is an accurate predictor of the Oracle distributed lock manager's performance, the memory latency benchmark gives a strong indication of Verilog simulation performance, and the file system latency benchmark models a critical path in software development.

lmbench was developed to identify and evaluate system performance bottlenecks present in many machines in 1993-1995. It is entirely possible that computer architectures will have changed and advanced enough in the next few years to render parts of this benchmark suite obsolete or irrelevant.

lmbench is already in widespread use at many sites by both end users and system designers. In some cases, lmbench has provided the data necessary to discover and correct critical performance problems that might have gone unnoticed. lmbench uncovered a problem in Sun's memory management software that made all pages map to the same location in the cache, effectively turning a 512 kilobyte (K) cache into a 4K cache.

lmbench measures only a system's ability to transfer data between processor, cache, memory, network, and disk. It does not measure other parts of the system, such as the graphics subsystem, nor is it a MIPS, MFLOPS, throughput, saturation, stress, graphics, or multiprocessor test suite. It is frequently run on multiprocessor (MP) systems to compare their performance against uniprocessor systems, but it does not take advantage of any multiprocessor features.

The benchmarks are written using standard, portable system interfaces and facilities commonly used by applications, so lmbench is portable and comparable over a wide set of Unix systems. lmbench has been run on AIX, BSDI, HP-UX, IRIX, Linux, FreeBSD, NetBSD, OSF/1, Solaris, and SunOS. Part of the suite has been run on Windows/NT as well.

lmbench is freely distributed under the Free Software Foundation's General Public License [Stallman89], with the additional restriction that results may be reported only if the benchmarks are unmodified.

2. Prior work

Benchmarking and performance analysis is not a new endeavor. There are too many other benchmark suites to list all of them here. We compare lmbench to a set of similar benchmarks.

• I/O (disk) benchmarks: IOstone [Park90] wants to be an I/O benchmark, but actually measures the memory subsystem; all of the tests fit easily in the cache. IObench [Wolman89] is a systematic file system and disk benchmark, but it is complicated and unwieldy. In [McVoy91] we reviewed many I/O benchmarks and found them all lacking because they took too long to run and were too complex a solution to a fairly simple
problem. We wrote a small, simple I/O benchmark, lmdd, that measures sequential and random I/O far faster than either IOstone or IObench. As part of [McVoy91] the results from lmdd were checked against IObench (as well as some other Sun internal I/O benchmarks). lmdd proved to be more accurate than any of the other benchmarks. At least one disk vendor routinely uses lmdd to do performance testing of its disk drives.

Chen and Patterson [Chen93, Chen94] measure I/O performance under a variety of workloads that are automatically varied to test the range of the system's performance. Our efforts differ in that we are more interested in the CPU overhead of a single request, rather than the capacity of the system as a whole.

• Berkeley Software Distribution's microbench suite: The BSD effort generated an extensive set of test benchmarks to do regression testing (both quality and performance) of the BSD releases. We did not use this as a basis for our work (although we used ideas) for the following reasons: (a) missing tests — such as memory latency, (b) too many tests, the results tended to be obscured under a mountain of numbers, and (c) wrong copyright — we wanted the Free Software Foundation's General Public License.

• Ousterhout's Operating System benchmark: [Ousterhout90] proposes several system benchmarks to measure system call latency, context switch time, and file system performance. We used the same ideas as a basis for our work, while trying to go farther. We measured a more complete set of primitives, including some hardware measurements; went into greater depth on some of the tests, such as context switching; and went to great lengths to make the benchmark portable and extensible.

• Networking benchmarks: Netperf measures networking bandwidth and latency and was written by Rick Jones of Hewlett-Packard. lmbench includes a smaller, less complex benchmark that produces similar results.

ttcp is a widely used benchmark in the Internet community. Our version of the same benchmark routinely delivers bandwidth numbers that are within 2% of the numbers quoted by ttcp.

• McCalpin's stream benchmark: [McCalpin95] has memory bandwidth measurements and results for a large number of high-end systems. We did not use these because we discovered them only after we had results using our versions. We will probably include McCalpin's benchmarks in lmbench in the future.

In summary, we rolled our own because we wanted simple, portable benchmarks that accurately measured a wide variety of operations that we consider crucial to performance on today's systems. While portions of other benchmark suites include similar work, none includes all of it, few are as portable, and almost all are far more complex. Less filling, tastes great.

3. Benchmarking notes

3.1. Sizing the benchmarks

The proper sizing of various benchmark parameters is crucial to ensure that the benchmark is measuring the right component of system performance. For example, memory-to-memory copy speeds are dramatically affected by the location of the data: if the size parameter is too small so the data is in a cache, then the performance may be as much as ten times faster than if the data is in memory. On the other hand, if the memory size parameter is too big so the data is paged to disk, then performance may be slowed to such an extent that the benchmark seems to 'never finish.'

The lmbench suite takes the following approach to the cache and memory size issues:

• All of the benchmarks that could be affected by cache size are run in a loop, with increasing sizes (typically powers of two) until some maximum size is reached. The results may then be plotted to see where the benchmark no longer fits in the cache.

• The benchmark verifies that there is sufficient memory to run all of the benchmarks in main memory. A small test program allocates as much memory as it can, clears the memory, and then strides through that memory a page at a time, timing each reference. If any reference takes more than a few microseconds, the page is no longer in memory. The test program starts small and works forward until either enough memory is seen as present or the memory limit is reached.

3.2. Compile time issues

The GNU C compiler, gcc, is the compiler we chose because it gave the most reproducible results across platforms. When gcc was not present, we used the vendor-supplied cc. All of the benchmarks were compiled with -O optimization except the benchmarks that calculate clock speed and the context switch times, which must be compiled without optimization in order to produce correct results. No other optimization flags were enabled because we wanted results that would be commonly seen by application writers.

All of the benchmarks were linked using the default manner of the target system. For most if not all systems, the binaries were linked using shared libraries.

3.3. Multiprocessor issues

All of the multiprocessor systems ran the benchmarks in the same way as the uniprocessor systems. Some systems allow users to pin processes to a particular CPU, which sometimes results in better cache reuse. We do not pin processes because it defeats the MP scheduler.
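The memory-sizing check described in Section 3.1 can be sketched in C. This is a simplified illustration rather than the actual lmbench code; the function names and the 20-microsecond eviction threshold are our own assumptions:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long
usec_diff(struct timeval *start, struct timeval *stop)
{
	return (stop->tv_sec - start->tv_sec) * 1000000L +
	    (stop->tv_usec - start->tv_usec);
}

/*
 * Return nonzero if "size" bytes appear to stay resident: clear the
 * region, then stride through it a page at a time, timing each touch.
 * A touch that takes more than a few microseconds suggests the page
 * was paged out to disk.
 */
int
memory_resident(size_t size)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	char *buf = malloc(size);
	size_t off;
	volatile char sink;

	if (buf == NULL)
		return 0;
	memset(buf, 0, size);
	for (off = 0; off < size; off += pagesize) {
		struct timeval start, stop;

		gettimeofday(&start, NULL);
		sink = buf[off];	/* touch one byte per page */
		gettimeofday(&stop, NULL);
		if (usec_diff(&start, &stop) > 20) {	/* threshold: assumption */
			free(buf);
			return 0;
		}
	}
	(void)sink;
	free(buf);
	return 1;
}
```

As the text describes, the real program starts small and works forward, doubling the size until either enough memory is seen as present or the limit is reached.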
In certain cases, this decision yields interesting results discussed later.

3.4. Timing issues

• Clock resolution: The benchmarks measure elapsed time by reading the system clock via the gettimeofday interface. On some systems this interface has a resolution of 10 milliseconds, a long time relative to many of the benchmarks which have results measured in tens to hundreds of microseconds. To compensate for the coarse clock resolution, the benchmarks are hand-tuned to measure many operations within a single time interval lasting for many clock ticks. Typically, this is done by executing the operation in a small loop, sometimes unrolled if the operation is exceedingly fast, and then dividing the loop time by the loop count.

• Caching: If the benchmark expects the data to be in the cache, the benchmark is typically run several times; only the last result is recorded. If the benchmark does not want to measure cache performance it sets the size parameter larger than the cache. For example, the bcopy benchmark by default copies 8 megabytes to 8 megabytes, which largely defeats any second-level cache in use today. (Note that the benchmarks are not trying to defeat the file or process page cache, only the hardware caches.)

• Variability: The results of some benchmarks, most notably the context switch benchmark, had a tendency to vary quite a bit, up to 30%. We suspect that the operating system is not using the same set of physical pages each time a process is created and we are seeing the effects of collisions in the external caches. We compensate by running the benchmark in a loop and taking the minimum result. Users interested in the most accurate data are advised to verify the results on their own platforms.

Many of the results included in the database were donated by users and were not created by the authors. Good benchmarking practice suggests that one should run the benchmarks as the only user of a machine, without other resource intensive or unpredictable processes or daemons.

3.5. Using the lmbench database

lmbench includes a database of results that is useful for comparison purposes. It is quite easy to build the source, run the benchmark, and produce a table of results that includes the run. All of the tables in this paper were produced from the database included in lmbench. This paper is also included with lmbench and may be reproduced incorporating new results. For more information, consult the file lmbench-HOWTO in the lmbench distribution.

4. Systems tested

lmbench has been run on a wide variety of platforms. This paper includes results from a representative subset of machines and operating systems. Comparisons between similar hardware running different operating systems can be very illuminating, and we have included a few examples in our results.

The systems are briefly characterized in Table 1. Please note that the list prices are very approximate, as is the year of introduction. The SPECInt92 numbers are a little suspect since some vendors have been ''optimizing'' for certain parts of SPEC. We try and quote the original SPECInt92 numbers where we can.

Name            Vendor & model   Multi   Operating       CPU           Mhz  Year  SPEC   List
                                 or Uni  System                                   Int92  price
IBM PowerPC     IBM 43P          Uni     AIX 3.?         MPC604        133  '95   176    15k
IBM Power2      IBM 990          Uni     AIX 4.?         Power2         71  '93   126    110k
FreeBSD/i586    ASUS P55TP4XE    Uni     FreeBSD 2.1     Pentium       133  '95   190    3k
HP K210         HP 9000/859      MP      HP-UX B.10.01   PA 7200       120  '95   167    35k
SGI Challenge   SGI Challenge    MP      IRIX 6.2-α      R4400         200  '94   140    80k
SGI Indigo2     SGI Indigo2      Uni     IRIX 5.3        R4400         200  '94   135    15k
Linux/Alpha     DEC Cabriolet    Uni     Linux 1.3.38    Alpha 21064A  275  '94   189    9k
Linux/i586      Triton/EDO RAM   Uni     Linux 1.3.28    Pentium       120  '95   155    5k
Linux/i686      Intel Alder      Uni     Linux 1.3.37    Pentium Pro   200  '95   ~320   7k
DEC Alpha@150   DEC 3000/500     Uni     OSF1 3.0        Alpha 21064   150  '93   84     35k
DEC Alpha@300   DEC 8400 5/300   MP      OSF1 3.2        Alpha 21164   300  '95   341    250k
Sun Ultra1      Sun Ultra1       Uni     SunOS 5.5       UltraSPARC    167  '95   250    21k
Sun SC1000      Sun SC1000       MP      SunOS 5.5-β     SuperSPARC     50  '92   65     35k
Solaris/i686    Intel Alder      Uni     SunOS 5.5.1     Pentium Pro   133  '95   ~215   5k
Unixware/i686   Intel Aurora     Uni     Unixware 5.4.2  Pentium Pro   200  '95   ~320   7k

Table 1. System descriptions.

4.1. Reading the result tables

Throughout the rest of this paper, we present tables of results for many of the benchmarks. All of the tables are sorted, from best to worst. Some tables have multiple columns of results and those tables are sorted on only one of the columns. The sorted column's heading will be in bold.

5. Bandwidth benchmarks

By bandwidth, we mean the rate at which a particular facility can move data. We attempt to measure the data movement ability of a number of different facilities: the bcopy library, a hand-unrolled bcopy, direct-memory read and write (no copying), pipes, TCP sockets, the read interface, and the mmap interface.
5.1. Memory bandwidth

Data movement is fundamental to any operating system. In the past, performance was frequently measured in MFLOPS because floating point units were slow enough that microprocessor systems were rarely limited by memory bandwidth. Today, floating point units are usually much faster than memory bandwidth, so many current MFLOP ratings can not be maintained using memory-resident data; they are ''cache only'' ratings.

We measure the ability to copy, read, and write data over a varying set of sizes. There are too many results to report all of them here, so we concentrate on large memory transfers.

We measure copy bandwidth two ways. The first is the user-level library bcopy interface. The second is a hand-unrolled loop that loads and stores aligned 8-byte words. In both cases, we took care to ensure that the source and destination locations would not map to the same lines if any of the caches were direct-mapped. In order to test memory bandwidth rather than cache bandwidth, both benchmarks copy an 8M¹ area to another 8M area. (As secondary caches reach 16M, these benchmarks will have to be resized to reduce caching effects.)

The copy results actually represent one-half to one-third of the memory bandwidth used to obtain those results since we are reading and writing memory. If the cache line size is larger than the word stored, then the written cache line will typically be read before it is written. The actual amount of memory bandwidth used varies because some architectures have special instructions specifically designed for the bcopy function. Those architectures will move twice as much memory as reported by this benchmark; less advanced architectures move three times as much memory: the memory read, the memory read because it is about to be overwritten, and the memory written.

The bcopy results reported in Table 2 may be correlated with John McCalpin's stream [McCalpin95] benchmark results in the following manner: the stream benchmark reports all of the memory moved whereas the bcopy benchmark reports the bytes copied. So our numbers should be approximately one-half to one-third of his numbers.

System          Bcopy      Bcopy  Memory  Memory
                unrolled   libc   read    write
IBM Power2        242       171    205     364
Sun Ultra1        152       167    129      85
DEC Alpha@300     123        80    120      85
HP K210           126        57    117      78
Unixware/i686      65        58    235      88
Solaris/i686       52        48    159      71
DEC Alpha@150      46        45     79      91
Linux/i686         42        56    208      56
FreeBSD/i586       39        42     73      83
Linux/Alpha        39        39     73      71
Linux/i586         38        42     74      75
SGI Challenge      35        36     65      67
SGI Indigo2        31        32     69      66
IBM PowerPC        26        21     63      21
Sun SC1000         17        15     38      31

Table 2. Memory bandwidth (MB/s)

Memory reading is measured by an unrolled loop that sums up a series of integers. On most (perhaps all) systems measured the integer size is 4 bytes. The loop is unrolled such that most compilers generate code that uses a constant offset with the load, resulting in a load and an add for each word of memory. The add is an integer add that completes in one cycle on all of the processors. Given that today's processor typically cycles at 10 or fewer nanoseconds (ns) and that memory is typically 200-1,000 ns per cache line, the results reported here should be dominated by the memory subsystem, not the processor add unit.

The memory contents are added up because almost all C compilers would optimize out the whole loop when optimization was turned on, and would generate far too many instructions without optimization. The solution is to add up the data and pass the result as an unused argument to the ''finish timing'' function.

Memory reads represent about one-third to one-half of the bcopy work, and we expect that pure reads should run at roughly twice the speed of bcopy. Exceptions to this rule should be studied, for exceptions indicate a bug in the benchmarks, a problem in bcopy, or some unusual hardware.

Memory writing is measured by an unrolled loop that stores a value into an integer (typically a 4 byte integer) and then increments the pointer. The processor cost of each memory operation is approximately the same as the cost in the read case.

The numbers reported in Table 2 are not the raw hardware speed in some cases. The Power2² is capable of up to 800M/sec read rates [McCalpin95] and HP PA RISC (and other prefetching) systems also do better if higher levels of code optimization are used and/or the code is hand tuned.

The Sun libc bcopy in Table 2 is better because they use a hardware specific bcopy routine that uses instructions new in SPARC V9 that were added specifically for memory movement.

The Pentium Pro read rate in Table 2 is much higher than the write rate because, according to Intel, the write transaction turns into a read followed by a write to maintain cache consistency for MP systems.

¹ Some of the PCs had less than 16M of available memory; those machines copied 4M.
² Someone described this machine as a $1,000 processor on a $99,000 memory subsystem.
5.2. IPC bandwidth

Interprocess communication bandwidth is frequently a performance issue. Many Unix applications are composed of several processes communicating through pipes or TCP sockets. Examples include the groff documentation system that prepared this paper, the X Window System, remote file access, and World Wide Web servers.

Unix pipes are an interprocess communication mechanism implemented as a one-way byte stream. Each end of the stream has an associated file descriptor; one is the write descriptor and the other the read descriptor. TCP sockets are similar to pipes except they are bidirectional and can cross machine boundaries.

Pipe bandwidth is measured by creating two processes, a writer and a reader, which transfer 50M of data in 64K transfers. The transfer size was chosen so that the overhead of system calls and context switching would not dominate the benchmark time. The reader prints the timing results, which guarantees that all data has been moved before the timing is finished.

TCP bandwidth is measured similarly, except the data is transferred in 1M page aligned transfers instead of 64K transfers. If the TCP implementation supports it, the send and receive socket buffers are enlarged to 1M, instead of the default 4-60K. We have found that setting the transfer size equal to the socket buffer size produces the greatest throughput over the most implementations.

bcopy is important to this test because the pipe write/read is typically implemented as a bcopy into the kernel from the writer and then a bcopy from the kernel to the reader. Ideally, these results would be approximately one-half of the bcopy results. It is possible for the kernel bcopy to be faster than the C library bcopy since the kernel may have access to bcopy hardware unavailable to the C library.

It is interesting to compare pipes with TCP because the TCP benchmark is identical to the pipe benchmark except for the transport mechanism. Ideally, the TCP bandwidth would be as good as the pipe bandwidth. It is not widely known that the majority of the TCP cost is in the bcopy, the checksum, and the network interface driver. The checksum and the driver may be safely eliminated in the loopback case and if the costs have been eliminated, then TCP should be just as fast as pipes. From the pipe and TCP results in Table 3, it is easy to see that Solaris and HP-UX have done this optimization.

System          Libc bcopy   pipe   TCP
HP K210             57        93     34
Linux/i686          56        89     18
IBM Power2         171        84     10
Linux/Alpha         39        73      9
Unixware/i686       58        68     -1
Sun Ultra1         167        61     51
DEC Alpha@300       80        46     11
Solaris/i686        48        38     20
DEC Alpha@150       45        35      9
SGI Indigo2         32        34     22
Linux/i586          42        34      7
SGI Challenge       36        31     17
IBM PowerPC         21        30     17
FreeBSD/i586        42        23     13
Sun SC1000          15        11      9

Table 3. Pipe and local TCP bandwidth (MB/s)

Bcopy rates in Table 3 can be lower than pipe rates because the pipe transfers are done in 64K buffers, a size that frequently fits in caches, while the bcopy is typically an 8M-to-8M copy, which does not fit in the cache.

In Table 3, the SGI Indigo2, a uniprocessor, does better than the SGI MP on pipe bandwidth because of caching effects: in the UP case, both processes share the cache; on the MP, each process is communicating with a different cache.

All of the TCP results in Table 3 are in loopback mode — that is, both ends of the socket are on the same machine. It was impossible to get remote networking results for all the machines included in this paper. We are interested in receiving more results for identical machines with a dedicated network connecting them. The results we have for over the wire TCP bandwidth are shown below.

System              Network    TCP bandwidth
SGI PowerChallenge  hippi          79.3
Sun Ultra1          100baseT        9.5
HP 9000/735         fddi            8.8
FreeBSD/i586        100baseT        7.9
SGI Indigo2         10baseT          .9
HP 9000/735         10baseT          .9
Linux/[email protected]    10baseT          .7

Table 4. Remote TCP bandwidth (MB/s)

The SGI using 100MB/s Hippi is by far the fastest in Table 4. The SGI Hippi interface has hardware support for TCP checksums and the IRIX operating system uses virtual memory tricks to avoid copying data as much as possible. For larger transfers, SGI Hippi has reached 92MB/s over TCP.

100baseT is looking quite competitive when compared to FDDI in Table 4, even though FDDI has packets that are almost three times larger. We wonder how long it will be before we see gigabit ethernet interfaces.

5.3. Cached I/O bandwidth

Experience has shown us that reusing data in the file system page cache can be a performance issue. This section measures that operation through two interfaces, read and mmap. The benchmark here is not an I/O benchmark in that no disk activity is involved. We wanted to measure the overhead of reusing data, an overhead that is CPU intensive, rather than disk intensive.
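Looking back at the pipe measurement of Section 5.2, the two-process arrangement can be sketched in C (a simplified sketch, not the lmbench source; most error handling is omitted):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK	(64 * 1024)

/*
 * Move "total" bytes through a pipe in 64K transfers and return the
 * observed bandwidth in MB/s.  The reader does the timing, which
 * guarantees all data has arrived before the clock stops.
 */
double
pipe_bandwidth(long total)
{
	int fds[2];
	char *buf = malloc(CHUNK);
	long moved = 0;
	struct timeval start, stop;
	double secs;

	if (buf == NULL || pipe(fds) < 0)
		return -1;
	if (fork() == 0) {		/* child: the writer */
		close(fds[0]);
		memset(buf, 1, CHUNK);
		for (moved = 0; moved < total; moved += CHUNK)
			write(fds[1], buf, CHUNK);
		exit(0);
	}
	close(fds[1]);			/* parent: the reader */
	gettimeofday(&start, NULL);
	while (moved < total) {
		ssize_t n = read(fds[0], buf, CHUNK);

		if (n <= 0)
			break;
		moved += n;
	}
	gettimeofday(&stop, NULL);
	close(fds[0]);
	wait(NULL);
	secs = (stop.tv_sec - start.tv_sec) +
	    (stop.tv_usec - start.tv_usec) / 1e6;
	free(buf);
	return (moved / (1024.0 * 1024.0)) / secs;
}
```

The TCP variant is identical except that a loopback socket replaces the pipe and the transfers are 1M instead of 64K.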
The read interface copies data from the kernel's file system page cache into the process's buffer, using 64K buffers. The transfer size was chosen to minimize the kernel entry overhead while remaining realistically sized.

The difference between the bcopy and the read benchmarks is the cost of the file and virtual memory system overhead. In most systems, the bcopy speed should be faster than the read speed. The exceptions usually have hardware specifically designed for the bcopy function and that hardware may be available only to the operating system.

The read benchmark is implemented by rereading a file (typically 8M) in 64K buffers. Each buffer is summed as a series of integers in the user process. The summing is done for two reasons: for an apples-to-apples comparison the memory-mapped benchmark needs to touch all the data, and the file system can sometimes transfer data into memory faster than the processor can read the data. For example, SGI's XFS can move data into memory at rates in excess of 500M per second, but it can move data into the cache at only 68M per second. The intent is to measure performance delivered to the application, not DMA performance to memory.

The mmap interface provides a way to access the kernel's file cache without copying the data. The mmap benchmark is implemented by mapping the entire file (typically 8M) into the process's address space. The file is then summed to force the data into the cache.

System          Libc     Memory   File    File
                bcopy    read     read    mmap
IBM Power2       171      205      187     106
HP K210           57      117       88      52
Sun Ultra1       167      129       85     101
DEC Alpha@300     80      120       78      67
Unixware/i686     58      235       62     200
Solaris/i686      48      159       52      94
DEC Alpha@150     45       79       50      40
Linux/i686        56      208       40      36
IBM PowerPC       21       63       40      51
SGI Challenge     36       65       36      56
SGI Indigo2       32       69       32      44
FreeBSD/i586      42       73       30      53
Linux/Alpha       39       73       24      18
Linux/i586        42       74       23       9
Sun SC1000        15       38       20      28

Table 5. File vs. memory bandwidth (MB/s)

In Table 5, a good system will have File read as fast as (or even faster than) Libc bcopy because as the file system overhead goes to zero, the file reread case is virtually the same as the library bcopy case. However, file reread can be faster because the kernel may have access to bcopy assist hardware not available to the C library. Ideally, File mmap performance should approach Memory read performance, but mmap is often dramatically worse. Judging by the results, this looks to be a potential area for operating system improvements.

In Table 5 the Power2 does better on file reread than bcopy because it takes full advantage of the memory subsystem from inside the kernel. The mmap reread is probably slower because of the lower clock rate; the page faults start to show up as a significant cost.

It is surprising that the Sun Ultra1 was able to bcopy at the high rates shown in Table 2 but did not show those rates for file reread in Table 5. HP has the opposite problem: they get file reread faster than bcopy, perhaps because the kernel bcopy has access to hardware support.

The Unixware system has outstanding mmap reread rates, better than systems of substantially higher cost. Linux needs to do some work on the mmap code.

6. Latency measurements

Latency is an often-overlooked area of performance problems, possibly because resolving latency issues is frequently much harder than resolving bandwidth issues. For example, memory bandwidth may be increased by making wider cache lines and increasing memory ''width'' and interleave, but memory latency can be improved only by shortening paths or increasing (successful) prefetching. The first step toward improving latency is understanding the current latencies in a system.

The latency measurements included in this suite are memory latency, basic operating system entry cost, signal handling cost, process creation times, context switching, interprocess communication, file system latency, and disk latency.

6.1. Memory read latency background

In this section, we expend considerable effort to define the different memory latencies and to explain and justify our benchmark. The background is a bit tedious but important, since we believe the memory latency measurements to be one of the most thought-provoking and useful measurements in lmbench.

The most basic latency measurement is memory latency since most of the other latency measurements can be expressed in terms of memory latency. For example, context switches require saving the current process state and loading the state of the next process. However, memory latency is rarely accurately measured and frequently misunderstood.
Memory read latency has many definitions; the most common, in increasing time order, are memory chip cycle time, processor-pins-to-memory-and-back time, load-in-a-vacuum time, and back-to-back-load time.

• Memory chip cycle latency: Memory chips are rated in nanoseconds; typical speeds are around 60ns. A general overview on DRAM architecture may be found in [Hennessy96]. The specific information we describe here is from [Toshiba94] and pertains to the Toshiba THM361020AS-60 module and TC514400AJS DRAM used in SGI workstations. The 60ns time is the time from RAS assertion to when the data will be available on the DRAM pins (assuming CAS access time requirements were met). While it is possible to get data out of a DRAM in 60ns, that is not all of the time involved. There is a precharge time that must occur after every access. [Toshiba94] quotes 110ns as the random read or write cycle time and this time is more representative of the cycle time.

• Pin-to-pin latency: This number represents the time needed for the memory request to travel from the processor's pins to the memory subsystem and back again. Many vendors have used the pin-to-pin definition of memory latency in their reports. For example, [Fenwick95] while describing the DEC 8400 quotes memory latencies of 265ns; a careful reading of that paper shows that these are pin-to-pin numbers. In spite of the historical precedent in vendor reports, this definition of memory latency is misleading since it ignores actual delays seen when a load instruction is immediately followed by a use of the data being loaded. The number of additional cycles inside the processor can be significant and grows more significant with today's highly pipelined architectures.

It is worth noting that the pin-to-pin numbers include the amount of time it takes to charge the lines going to the SIMMs, a time that increases with the (potential) number of SIMMs in a system. More SIMMs mean more capacitance, which requires longer charge times. This is one reason why personal computers frequently have better memory latencies than workstations: the PCs typically have less memory capacity.

• Load-in-a-vacuum latency: A load in a vacuum is the time that the processor will wait for one load that must be fetched from main memory (i.e., a cache miss). The ''vacuum'' means that there is no other activity on the system bus, including no other loads. While this number is frequently used as the memory latency, it is not very useful. It is basically a ''not to exceed'' number important only for marketing reasons. Some architects point out that since most processors implement nonblocking loads (the load does not cause a stall until the data is used), the perceived load latency may be much less than the real latency. When pressed, however, most will admit that cache misses occur in bursts, resulting in perceived latencies of at least the load-in-a-vacuum latency.

• Back-to-back-load latency: Back-to-back-load latency is the time that each load takes, assuming that the instructions before and after are also cache-missing loads. Back-to-back loads may take longer than loads in a vacuum for the following reason: many systems implement something known as critical word first, which means that the subblock of the cache line that contains the word being loaded is delivered to the processor before the entire cache line has been brought into the cache. If another load occurs quickly enough after the processor gets restarted from the current load, the second load may stall because the cache is still busy filling the cache line for the previous load. On some systems, such as the current implementation of UltraSPARC, the difference between back-to-back and load-in-a-vacuum is about 35%.

lmbench measures back-to-back-load latency because it is the only measurement that may be easily measured from software and because we feel that it is what most software developers consider to be memory latency. Consider the following C code fragment:

	p = head;
	while (p->p_next)
		p = p->p_next;

On a DEC Alpha, the loop part turns into three instructions, including the load. A 300 Mhz processor has a 3.33ns cycle time, so the loop could execute in slightly less than 10ns. However, the load itself takes 400ns on a 300 Mhz DEC 8400. In other words, the instructions cost 10ns but the load stalls for 400. Another way to look at it is that 400/3.3, or 121, nondependent, nonloading instructions following the load would be needed to hide the load latency. Because superscalar processors typically execute multiple operations per clock cycle, they need even more useful operations between cache misses to keep the processor from stalling.

This benchmark illuminates the tradeoffs in processor cache design. Architects like large cache lines, up to 64 bytes or so, because the prefetch effect of gathering a whole line increases hit rate given reasonable spatial locality. Small stride sizes have high spatial locality and should have higher performance, but large stride sizes have poor spatial locality, causing the system to prefetch useless data. So the benchmark provides the following insight into the negative effects of large line prefetch:

• Multi-cycle fill operations are typically atomic events at the caches, and sometimes block other cache accesses until they complete.

• Caches are typically single-ported. Having a large line prefetch of unused data causes extra bandwidth demands at the cache, and can cause increased access latency for normal cache accesses.
In summary, we believe that processors are so fast that the average load latency for cache misses will be closer to the back-to-back-load number than to the load-in-a-vacuum number. We are hopeful that the industry will standardize on this definition of memory latency.

6.2. Memory read latency

The entire memory hierarchy can be measured, including on-board data cache latency and size, external data cache latency and size, and main memory latency. Instruction caches are not measured. TLB miss latency can also be measured, as in [Saavedra92], but we stopped at main memory. Measuring TLB miss time is problematic because different systems map different amounts of memory with their TLB hardware.

The benchmark varies two parameters, array size and array stride. For each size, a list of pointers is created for all of the different strides. Then the list is walked thus:

    mov  r4,(r4)    # C code: p = *p;

The time to do about 1,000,000 loads (the list wraps) is measured and reported. The time reported is pure latency time and may be zero even though the load instruction does not execute in zero time. Zero is defined as one clock cycle; in other words, the time reported is only memory latency time, as it does not include the instruction execution time. It is assumed that all processors can do a load instruction in one processor cycle (not counting stalls). In other words, if the processor cache load time is 60ns on a 20ns processor, the load latency reported would be 40ns; the additional 20ns is for the load instruction itself.³ Processors that can manage to get the load address out to the address pins before the end of the load cycle get some free time in this benchmark (we don't know of any processors that do that).

This benchmark has been validated by logic analyzer measurements on an SGI Indy by Ron Minnich while he was at the Maryland Supercomputer Research Center.

Results from the memory latency benchmark are plotted as a series of data sets as shown in Figure 1. Each data set represents a stride size, with the array size varying from 512 bytes up to 8M or more. The curves contain a series of horizontal plateaus, where each plateau represents a level in the memory hierarchy. The point where each plateau ends and the line rises marks the end of that portion of the memory hierarchy (e.g., external cache). Most machines have similar memory hierarchies: on-board cache, external cache, main memory, and main memory plus TLB miss costs. There are variations: some processors are missing a cache, while others add another cache to the hierarchy. For example, the Alpha 8400 has two on-board caches, one 8K and the other 96K.

The cache line size can be derived by comparing curves and noticing which strides are faster than main memory times. The smallest stride that is the same as main memory speed is likely to be the cache line size because the strides that are faster than memory are getting more than one hit per cache line.

[Figure 1. Memory latency. Latency in nanoseconds (0-500) versus log2(array size) for the DEC alpha@300; plateaus mark the 8KB on-chip cache, the .5MB external cache, and main memory.]

Figure 1 shows memory latencies on a nicely made machine, a DEC Alpha. We use this machine as the example because it shows the latencies and sizes of the on-chip level 1 and motherboard level 2 caches, and because it has good all-around numbers, especially considering it can support a 4M level 2 cache. The on-board cache is 2^13 bytes or 8K, while the external cache is 2^19 bytes or 512K.

Table 6 shows the cache size, cache latency, and main memory latency as extracted from the memory latency graphs. The graphs and the tools for extracting the data are included with lmbench. It is worthwhile to plot all of the graphs and examine them since the table is missing some details, such as the DEC Alpha 8400 processor's second 96K on-chip cache. We sorted Table 6 on level 2 cache latency because we think that many applications will fit in the level 2 cache.

The HP and IBM systems have only one level of cache so we count that as both level 1 and level 2. Those two systems have remarkable cache performance for caches of that size. In both cases, the cache delivers data in one clock cycle after the load instruction.

HP systems usually focus on large caches as close as possible to the processor. An older HP multiprocessor system, the 9000/890, has a 4M, split I&D, 2-way set associative cache, accessible in one clock (16ns). That system is primarily a database server.

³ In retrospect, this was a bad idea because we calculate the clock rate to get the instruction execution time. If the clock rate is off, so is the load time.
The 586 and PowerPC motherboards have quite poor second level caches; the caches are not substantially better than main memory.

The Pentium Pro and Sun Ultra second level caches are of medium speed at 5-6 clocks latency each. 5-6 clocks seems fast until it is compared against the HP and IBM one cycle latency caches of similar size. Given the tight integration of the Pentium Pro level 2 cache, it is surprising that it has such high latencies.

The 300Mhz DEC Alpha has a rather high 22 clock latency to the second level cache, which is probably one of the reasons that they needed a 96K level 1.5 cache. SGI and DEC have used large second level caches to hide their long latency from main memory.

                            Level 1        Level 2
    System          Clk.   lat.  size    lat.  size    Memory
    HP K210           8      8   256K     --    --       349
    IBM Power2       14     13   256K     --    --       260
    Unixware/i686     5      5     8K     25   256K      175
    Linux/i686        5     10     8K     30   256K      179
    Sun Ultra1        6      6    16K     42   512K      270
    Linux/Alpha       3      6     8K     46    96K      357
    Solaris/i686      7     14     8K     48   256K      281
    SGI Indigo2       8      5    16K     64     2M     1170
    SGI Challenge     9      5    16K     64     4M     1189
    DEC alpha@300     3      3     8K     66     4M      400
    DEC alpha@150     6     12     8K     67   512K      291
    FreeBSD/i586      7      7     8K     95   512K      182
    Linux/i586        8      8     8K    107   256K      150
    Sun SC1000       20     20     8K    140     1M     1236
    IBM PowerPC       7      6    16K    164     ?       394

    Table 6. Cache and memory latency (ns)

6.3. Operating system entry

Entry into the operating system is required for many system facilities. When calculating the cost of a facility, it is useful to know how expensive it is to perform a nontrivial entry into the operating system.

We measure nontrivial entry into the system by repeatedly writing one word to /dev/null, a pseudo device driver that does nothing but discard the data. This particular entry point was chosen because it has never been optimized in any system that we have measured. Other entry points, typically getpid and gettimeofday, are heavily used, heavily optimized, and sometimes implemented as user-level library routines rather than system calls. A write to the /dev/null driver will go through the system call table to write, verify the user area as readable, look up the file descriptor to get the vnode, call the vnode's write function, and then return.

    System          system call
    Linux/Alpha          2
    Linux/i586           2
    Linux/i686           3
    Unixware/i686        4
    Sun Ultra1           5
    FreeBSD/i586         6
    Solaris/i686         7
    Sun SC1000           9
    DEC alpha@300        9
    HP K210             10
    SGI Indigo2         11
    DEC alpha@150       11
    IBM PowerPC         12
    IBM Power2          16
    SGI Challenge       24

    Table 7. Simple system call time (microseconds)

Linux is the clear winner in the system call time. The reasons are twofold: Linux is a uniprocessor operating system, without any MP overhead, and Linux is a small operating system, without all of the ‘‘features’’ accumulated by the commercial offerings.

Unixware and Solaris are doing quite well, given that they are both fairly large, commercially oriented operating systems with a large accumulation of ‘‘features.’’

6.4. Signal handling cost

Signals in Unix are a way to tell another process to handle an event. They are to processes as interrupts are to the CPU.

Signal handling is often critical to layered systems. Some applications, such as databases, software development environments, and threading libraries, provide an operating system-like layer on top of the operating system, making signal handling a critical path in many of these applications.

lmbench measures both signal installation and signal dispatching in two separate loops, within the context of one process. It measures signal handling by installing a signal handler and then repeatedly sending itself the signal.

Table 8 shows the signal handling costs. Note that there are no context switches in this benchmark; the signal goes to the same process that generated the signal. In real applications, the signals usually go to another process, which implies that the true cost of sending that signal is the signal overhead plus the context switch overhead. We wanted to measure signal and context switch overheads separately since context switch times vary widely among operating systems.

SGI does very well on signal processing, especially since their hardware is of an older generation than many of the others.
    System          sigaction   sig handler
    SGI Indigo2          4            7
    SGI Challenge        4            9
    HP K210              4           13
    FreeBSD/i586         4           21
    Linux/i686           4           22
    Unixware/i686        6           25
    IBM Power2          10           27
    Solaris/i686         9           45
    IBM PowerPC         10           52
    Linux/i586           7           52
    DEC alpha@300        6           59
    Linux/Alpha         13          138

    Table 8. Signal times (microseconds)

The Linux/Alpha signal handling numbers are so poor that we suspect that this is a bug, especially given that the Linux/x86 numbers are quite reasonable.

6.5. Process creation costs

Process benchmarks are used to measure the basic process primitives, such as creating a new process, running a different program, and context switching. Process creation benchmarks are of particular interest in distributed systems since many remote operations include the creation of a remote process to shepherd the remote operation to completion. Context switching is important for the same reasons.

• Simple process creation. The Unix process creation primitive is fork, which creates a (virtually) exact copy of the calling process. Unlike VMS and some other operating systems, Unix starts any new process with a fork. Consequently, fork and/or execve should be fast and ‘‘light,’’ facts that many have been ignoring for some time.

lmbench measures simple process creation by creating a process and immediately exiting the child process. The parent process waits for the child process to exit. The benchmark is intended to measure the overhead for creating a new thread of control, so it includes the fork and the exit time.

The benchmark also includes a wait system call in the parent and context switches from the parent to the child and back again. Given that context switches of this sort are on the order of 20 microseconds and a system call is on the order of 5 microseconds, and that the entire benchmark time is on the order of a millisecond or more, the extra overhead is insignificant. Note that even this relatively simple task is very expensive and is measured in milliseconds while most of the other operations we consider are measured in microseconds.

• New process creation. The preceding benchmark did not create a new application; it created a copy of the old application. This benchmark measures the cost of creating a new process and changing that process into a new application, which forms the basis of every Unix command line interface, or shell. lmbench measures this facility by forking a new child and having that child execute a new program — in this case, a tiny program that prints ‘‘hello world’’ and exits.

The startup cost is especially noticeable on (some) systems that have shared libraries. Shared libraries can introduce a substantial (tens of milliseconds) startup cost.

                          fork     fork, exec    fork, exec
    System               & exit      & exit     sh -c & exit
    Linux/Alpha            0.7          3            12
    Linux/i686             0.4          5            14
    Linux/i586             0.9          5            16
    Unixware/i686          0.9          5            10
    DEC alpha@300          2.0          6            16
    IBM PowerPC            2.9          8            50
    SGI Indigo2            3.1          8            19
    IBM Power2             1.2          8            16
    FreeBSD/i586           2.0         11            19
    HP K210                3.1         11            20
    DEC alpha@150          4.6         13            39
    SGI Challenge          4.0         14            24
    Sun Ultra1             3.7         20            37
    Solaris/i686           4.5         22            46
    Sun SC1000            14.0         69           281

    Table 9. Process creation time (milliseconds)

• Complicated new process creation. When programs start other programs, they frequently use one of three standard interfaces: popen, system, and/or execlp. The first two interfaces start a new process by invoking the standard command interpreter, /bin/sh, to start the process. Starting programs this way guarantees that the shell will look for the requested application in all of the places that the user would look — in other words, the shell uses the user's $PATH variable as a list of places to find the application. execlp is a C library routine which also looks for the program using the user's $PATH variable.

Since this is a common way of starting applications, we felt it was useful to show the costs of the generality. We measure this by starting /bin/sh to start the same tiny program we ran in the last case. In Table 9 the cost of asking the shell to go look for the program is quite large, frequently ten times as expensive as just creating a new process, and four times as expensive as explicitly naming the location of the new program.

The results that stand out in Table 9 are the poor Sun Ultra 1 results. Given that the processor is one of the fastest, the problem is likely to be software. There is room for substantial improvement in the Solaris process creation code.
6.6. Context switching

Context switch time is defined here as the time needed to save the state of one process and restore the state of another process.

Context switches are frequently in the critical performance path of distributed applications. For example, the multiprocessor versions of the IRIX operating system use processes to move data through the networking stack. This means that the processing time for each new packet arriving at an idle system includes the time needed to switch in the networking process.

Typical context switch benchmarks measure just the minimal context switch time — the time to switch between two processes that are doing nothing but context switching. We feel that this is misleading because there are frequently more than two active processes, and they usually have a larger working set (cache footprint) than the benchmark processes.

Other benchmarks frequently include the cost of the system calls needed to force the context switches. For example, Ousterhout's context switch benchmark measures context switch time plus a read and a write on a pipe. In many of the systems measured by lmbench, the pipe overhead varies between 30% and 300% of the context switch time, so we were careful to factor out the pipe overhead.

• Number of processes. The context switch benchmark is implemented as a ring of two to twenty processes that are connected with Unix pipes. A token is passed from process to process, forcing context switches. The benchmark measures the time needed to pass the token two thousand times from process to process. Each transfer of the token has two costs: the context switch, and the overhead of passing the token. In order to calculate just the context switching time, the benchmark first measures the cost of passing the token through a ring of pipes in a single process. This overhead time is defined as the cost of passing the token and is not included in the reported context switch time.

• Size of processes. In order to measure more realistic context switch times, we add an artificial variable size ‘‘cache footprint’’ to the switching processes. The cost of the context switch then includes the cost of restoring user-level state (cache footprint). The cache footprint is implemented by having the process allocate an array of data⁴ and sum the array as a series of integers after receiving the token but before passing the token to the next process. Since most systems will cache data across context switches, the working set for the benchmark is slightly larger than the number of processes times the array size.

It is worthwhile to point out that the overhead mentioned above also includes the cost of accessing the data, in the same way as the actual benchmark. However, because the overhead is measured in a single process, the cost is typically the cost with ‘‘hot’’ caches. In Figure 2, each size is plotted as a line, with context switch times on the Y axis, number of processes on the X axis, and the process size as the data set. The process size and the hot cache overhead costs for the pipe read/writes and any data access is what is labeled as size=0KB overhead=10. The data size is in kilobytes and the overhead is in microseconds.

The context switch time does not include anything other than the context switch, provided that all the benchmark processes fit in the cache. If the total size of all of the benchmark processes is larger than the cache size, the cost of each context switch will include cache misses. We are trying to show realistic context switch times as a function of both size and number of processes.

[Figure 2. Context switch times. Time in microseconds versus number of processes (0 to 22) for Linux i686@167; one curve per process size: size=0KB overhead=10, size=4KB overhead=19, size=16KB overhead=66, size=32KB overhead=129, size=64KB overhead=255.]

Results for an Intel Pentium Pro system running Linux at 167 MHz are shown in Figure 2. The data points on the figure are labeled with the working set due to the sum of data in all of the processes. The actual working set is larger, as it includes the process and kernel overhead as well. One would expect the context switch times to stay constant until the working set is approximately the size of the second level cache. The Intel system has a 256K second level cache, and the context switch times stay almost constant until about 256K (marked as .25M in the graph).

⁴ All arrays are at the same virtual address in all processes.
• Cache issues. The context switch benchmark is a deliberate measurement of the effectiveness of the caches across process context switches. If the cache does not include the process identifier (PID, also sometimes called an address space identifier) as part of the address, then the cache must be flushed on every context switch. If the cache does not map the same virtual addresses from different processes to different cache lines, then the cache will appear to be flushed on every context switch.

If the caches do not cache across context switches there would be no grouping at the lower left corner of Figure 2; instead, the graph would appear as a series of straight, horizontal, parallel lines. The number of processes would not matter; the two process case would be just as bad as the twenty process case since the cache would not be useful across context switches.

                      2 processes      8 processes
    System            0KB    32KB     0KB     32KB
    Linux/i686          6      18       7      101
    Linux/i586         10      78      13      163
    Linux/Alpha        11      70      13      215
    IBM Power2         13      16      18       43
    Sun Ultra1         14      31      20      102
    DEC alpha@300      14      17      22       41
    IBM PowerPC        16      26      87      144
    HP K210            17      17      18       99
    Unixware/i686      17      17      18       72
    FreeBSD/i586       27      33      34      102
    Solaris/i686       36      54      43      118
    SGI Indigo2        40      47      38      104
    DEC alpha@150      53      68      59      134
    SGI Challenge      63      93      69       80
    Sun SC1000        104     107     142      197

    Table 10. Context switch time (microseconds)

We picked four points on the graph and extracted those values for Table 10. The complete set of values, as well as tools to graph them, are included with lmbench.

Note that multiprocessor context switch times are frequently more expensive than uniprocessor context switch times. This is because multiprocessor operating systems tend to have very complicated scheduling code. We believe that multiprocessor context switch times can be, and should be, within 10% of the uniprocessor times.

Linux does quite well on context switching, especially on the more recent architectures. By comparing the Linux 2-process 0KB case to the Linux 2-process 32KB case, it is apparent that there is something wrong with the Linux/i586 case. If we look back to Table 6, we can find at least part of the cause. The second level cache latency for the i586 is substantially worse than either the i686 or the Alpha.

Given the poor second level cache behavior of the PowerPC, it is surprising that it does so well on context switches, especially the larger sized cases.

The Sun Ultra1 context switches quite well in part because of enhancements to the register window handling in SPARC V9.

6.7. Interprocess communication latencies

Interprocess communication latency is important because many operations are control messages to another process (frequently on another system). The time to tell the remote process to do something is pure overhead and is frequently in the critical path of important functions such as distributed applications (e.g., databases, network servers).

The interprocess communication latency benchmarks typically have the following form: pass a small message (a byte or so) back and forth between two processes. The reported results are always the microseconds needed to do one round trip. For one-way timing, about half the round trip is right. However, the CPU cycles tend to be somewhat asymmetric for one trip: receiving is typically more expensive than sending.

• Pipe latency. Unix pipes are an interprocess communication mechanism implemented as a one-way byte stream. Each end of the stream has an associated file descriptor; one is the write descriptor and the other the read descriptor.

Pipes are frequently used as a local IPC mechanism. Because of the simplicity of pipes, they are frequently the fastest portable communication mechanism.

Pipe latency is measured by creating a pair of pipes, forking a child process, and passing a word back and forth. This benchmark is identical to the two-process, zero-sized context switch benchmark, except that it includes both the context switching time and the pipe overhead in the results. Table 11 shows the round trip latency from process A to process B and back to process A.

    System           Pipe latency
    Linux/i686            26
    Linux/i586            33
    Linux/Alpha           34
    Sun Ultra1            62
    IBM PowerPC           65
    Unixware/i686         70
    DEC alpha@300         71
    HP K210               78
    IBM Power2            91
    Solaris/i686         101
    FreeBSD/i586         104
    SGI Indigo2          131
    DEC alpha@150        179
    SGI Challenge        251
    Sun SC1000           278

    Table 11. Pipe latency (microseconds)

The time can be broken down to two context switches plus four system calls plus the pipe overhead.
The context switch component is two of the small processes in Table 10. This benchmark is identical to the context switch benchmark in [Ousterhout90].

• TCP and RPC/TCP latency. TCP sockets may be viewed as an interprocess communication mechanism similar to pipes with the added feature that TCP sockets work across machine boundaries.

TCP and RPC/TCP connections are frequently used in low-bandwidth, latency-sensitive applications. The default Oracle distributed lock manager uses TCP sockets, and the locks per second available from this service are accurately modeled by the TCP latency test.

Sun's RPC is layered either over TCP or over UDP. The RPC layer is responsible for managing connections (the port mapper), managing different byte orders and word sizes (XDR), and implementing a remote procedure call abstraction. Table 12 shows the same benchmark with and without the RPC layer to show the cost of the RPC implementation.

TCP latency is measured by having a server process that waits for connections and a client process that connects to the server. The two processes then exchange a word between them in a loop. The latency reported is one round-trip time. The measurements in Table 12 are local or loopback measurements, since our intent is to show the overhead of the software. The same benchmark may be, and frequently is, used to measure host-to-host latency.

    System           TCP    RPC/TCP
    Linux/i686       216      346
    Sun Ultra1       162      346
    DEC alpha@300    267      371
    FreeBSD/i586     256      440
    Solaris/i686     305      528
    Linux/Alpha      429      602
    HP K210          146      606
    SGI Indigo2      278      641
    IBM Power2       332      649
    IBM PowerPC      299      698
    Linux/i586       467      713
    DEC alpha@150    485      788
    SGI Challenge    546      900
    Sun SC1000       855     1386

    Table 12. TCP latency (microseconds)

Note that the RPC layer frequently adds hundreds of microseconds of additional latency. The problem is not the external data representation (XDR) layer — the data being passed back and forth is a byte, so there is no XDR to be done. There is no justification for the extra cost; it is simply an expensive implementation. DCE RPC is worse.

• UDP and RPC/UDP latency. UDP sockets are an alternative to TCP sockets. They differ in that UDP sockets are unreliable messages that leave the retransmission issues to the application. UDP sockets have a few advantages, however. They preserve message boundaries, whereas TCP does not; and a single UDP socket may send messages to any number of other sockets, whereas TCP sends data to only one place.

UDP and RPC/UDP messages are commonly used in many client/server applications. NFS is probably the most widely used RPC/UDP application in the world.

Like TCP latency, UDP latency is measured by having a server process that waits for connections and a client process that connects to the server. The two processes then exchange a word between them in a loop. The latency reported is round-trip time. The measurements in Table 13 are local or loopback measurements, since our intent is to show the overhead of the software. Again, note that the RPC library can add hundreds of microseconds of extra latency.

    System           UDP    RPC/UDP
    Linux/i686        93      180
    Sun Ultra1       197      267
    Linux/Alpha      180      317
    DEC alpha@300    259      358
    Linux/i586       187      366
    FreeBSD/i586     212      375
    Solaris/i686     348      454
    IBM Power2       254      531
    IBM PowerPC      206      536
    HP K210          152      543
    SGI Indigo2      313      671
    DEC alpha@150    489      834
    SGI Challenge    678      893
    Sun SC1000       739     1101

    Table 13. UDP latency (microseconds)

• Network latency. We have a few results for over-the-wire latency included in Table 14. As might be expected, the most heavily used network interfaces (i.e., ethernet) have the lowest latencies. The times shown include the time on the wire, which is about 130 microseconds for 10Mbit ethernet, 13 microseconds for 100Mbit ethernet and FDDI, and less than 10 microseconds for Hippi.

    System              Network    TCP latency    UDP latency
    Sun Ultra1          100baseT       280            308
    FreeBSD/i586        100baseT       365            304
    HP 9000/735         fddi           425            441
    SGI Indigo2         10baseT        543            602
    HP 9000/735         10baseT        603            592
    SGI PowerChallenge  hippi         1068           1099
    Linux/i586@90Mhz    10baseT       1912           2954

    Table 14. Remote latencies (microseconds)

• TCP connection latency. TCP is a connection-based, reliable, byte-stream-oriented protocol. As part of this reliability, a connection must be established before any data can be transferred. The connection is accomplished by a ‘‘three-way handshake,’’ an exchange of packets when the client attempts to connect to the server.
Unlike UDP, where no connection is established, TCP sends packets at startup time. If an application creates a TCP connection to send one message, then the startup time can be a substantial fraction of the total connection and transfer costs. The benchmark shows that the connection cost is approximately half of the cost.

Connection cost is measured by having a server, registered using the port mapper, waiting for connections. The client figures out where the server is registered and then repeatedly times a connect system call to the server. The socket is closed after each connect. Twenty connects are completed and the fastest of them is used as the result. The time measured will include two of the three packets that make up the three-way TCP handshake, so the cost is actually greater than the times listed.

    System           TCP connection
    HP K210               238
    Linux/i686            263
    IBM Power2            339
    FreeBSD/i586          418
    Linux/i586            606
    SGI Indigo2           667
    SGI Challenge         716
    Sun Ultra1            852
    Solaris/i686         1230
    Sun SC1000           3047

    Table 15. TCP connect latency (microseconds)

Table 15 shows that if the need is to send a quick message to another process, given that most packets get through, a UDP message will cost a send and a reply (if positive acknowledgments are needed, which they are in order to have an apples-to-apples comparison with TCP). If the transmission medium is 10Mbit Ethernet, the time on the wire will be approximately 65 microseconds each way, or 130 microseconds total. To do the same thing with a short-lived TCP connection would cost 896 microseconds of wire time alone.

The comparison is not meant to disparage TCP; TCP is a useful protocol. Nor is the point to suggest that all messages should be UDP. In many cases, the difference between 130 microseconds and 900 microseconds is insignificant compared with other aspects of application performance. However, if the application is very latency sensitive and the transmission medium is slow (such as a serial link or a message through many routers), then a UDP message may prove cheaper.

6.8. File system latency

File system latency is defined as the time required to create or delete a zero length file. We define it this way because in many file systems, such as the BSD fast file system, the directory operations are done synchronously in order to maintain on-disk integrity. Since the file data is typically cached and sent to disk at some later date, the file creation and deletion become the bottleneck seen by an application. This bottleneck is substantial: to do a synchronous update to a disk is a matter of tens of milliseconds. In many cases, this bottleneck is much more of a perceived performance issue than processor speed.

The benchmark creates 1,000 zero-sized files and then deletes them. All the files are created in one directory and their names are short, such as "a", "b", "c", ... "aa", "ab", ...

    System           FS        Create    Delete
    Unixware/i686    UFS          450       369
    HP K210          HFS          579        67
    Linux/i686       EXT2FS       751        45
    Linux/Alpha      EXT2FS       834       115
    Linux/i586       EXT2FS     1,114        95
    SGI Challenge    XFS        3,508     4,016
    DEC alpha@300    ADVFS      4,184     4,255
    Solaris/i686     UFS        7,246    23,809
    Sun Ultra1       UFS        8,333    18,181
    Sun SC1000       UFS       11,111    25,000
    FreeBSD/i586     UFS       11,235    28,571
    SGI Indigo2      EFS       11,494    11,904
    DEC alpha@150    ?         12,345    38,461
    IBM PowerPC      JFS       12,658    12,658
    IBM Power2       JFS       12,820    13,333

    Table 16. File system latency (microseconds)

The create and delete latencies are shown in Table 16. Notice that Linux does extremely well here, 2 to 3 orders of magnitude faster than the slowest systems. However, Linux does not guarantee anything about the disk integrity; the directory operations are done in memory. Other fast systems, such as SGI's XFS, use a log to guarantee the file system integrity. The slower systems, all those with ~10 millisecond file latencies, are using synchronous writes to guarantee the file system integrity. Unless Unixware has modified UFS substantially, they must be running in an unsafe mode since the FreeBSD UFS is much slower and both file systems are basically the 4BSD fast file system.

6.9. Disk latency

Included with lmbench is a small benchmarking program useful for measuring disk and file I/O. lmdd, which is patterned after the Unix utility dd, measures both sequential and random I/O, optionally generates patterns on output and checks them on input, supports flushing the data from the buffer cache on systems that support msync, and has a very flexible user interface. Many I/O benchmarks can be trivially replaced with a perl script wrapped around lmdd.
While we could have generated both sequential and random I/O results as part of this paper, we did not because those benchmarks are heavily influenced by the performance of the disk drives used in the test. We intentionally measure only the system overhead of a SCSI command since that overhead may become a bottleneck in large database configurations.

Some important applications, such as transaction processing, are limited by random disk IO latency. Administrators can increase the number of disk operations per second by buying more disks, until the processor overhead becomes the bottleneck. The lmdd benchmark measures the processor overhead associated with each disk operation, and it can provide an upper bound on the number of disk operations the processor can support. It is designed for SCSI disks, and it assumes that most disks have 32-128K read-ahead buffers and that they can read ahead faster than the processor can request the chunks of data. [5]

The benchmark simulates a large number of disks by reading 512-byte transfers sequentially from the raw disk device (raw disks are unbuffered and are not read ahead by Unix). Since the disk can read ahead faster than the system can request data, the benchmark is doing small transfers of data from the disk's track buffer. Another way to look at this is that the benchmark is doing memory-to-memory transfers across a SCSI channel. It is possible to generate loads of more than 1,000 SCSI operations/second on a single SCSI disk. For comparison, disks under database load typically run at 20-80 operations per second.

System            Disk latency
SGI Challenge          920
SGI Indigo2            984
HP K210               1103
DEC [email protected]        1436
Sun SC1000            1466
Sun Ultra1            2242

Table 17. SCSI I/O overhead (microseconds)

The resulting overhead number represents a lower bound on the overhead of a disk I/O. The real overhead numbers will be higher on SCSI systems because most SCSI controllers will not disconnect if the request can be satisfied immediately. During the benchmark, the processor simply sends the request and transfers the data, while during normal operation, the processor will send the request, disconnect, get interrupted, reconnect, and transfer the data. This technique can be used to discover how many drives a system can support before the system becomes CPU-limited, because it can produce the overhead load of a fully configured system with just a few disks.

[5] This may not always be true: a processor could be fast enough to make the requests faster than the rotating disk. If we take 6M/second to be the disk speed and divide that by 512 (the minimum transfer size), that is 12,288 IOs/second, or 81 microseconds/IO. We don't know of any processor/OS/IO controller combinations that can do an IO in 81 microseconds.

7. Future work

There are several known improvements and extensions that could be made to lmbench.

• Memory latency. The current benchmark measures clean-read latency. By clean, we mean that the cache lines being replaced are highly likely to be unmodified, so there is no associated write-back cost. We would like to extend the benchmark to measure dirty-read latency, as well as write latency. Other changes include making the benchmark impervious to sequential prefetching and measuring TLB miss cost.

• MP benchmarks. None of the benchmarks in lmbench is designed to measure any multiprocessor features directly. At a minimum, we could measure cache-to-cache latency as well as cache-to-cache bandwidth.

• Static vs. dynamic processes. In the process creation section, we allude to the cost of starting up processes that use shared libraries. When we figure out how to create statically linked processes on all or most systems, we could quantify these costs exactly.

• McCalpin's stream benchmark. We will probably incorporate part or all of this benchmark into lmbench.

• Automatic sizing. We have enough technology that we could determine the size of the external cache and autosize the memory used such that the external cache had no effect.

• More detailed papers. There are several areas that could yield some interesting papers. The memory latency section could use an in-depth treatment, and the context switching section could turn into an interesting discussion of caching technology.

8. Conclusion

lmbench is a useful, portable micro-benchmark suite designed to measure important aspects of system performance. We have found that a good memory subsystem is at least as important as the processor speed. As processors get faster and faster, more and more of the system design effort will need to move to the cache and memory subsystems.

9. Acknowledgments

Many people have provided invaluable help and insight into both the benchmarks themselves and the paper. The USENIX reviewers were especially helpful. We thank all of them and especially thank: Ken Okin (SUN), Kevin Normoyle (SUN), Satya Nishtala (SUN),
Greg Chesson (SGI), John Mashey (SGI), Neal Nuckolls (SGI), Ron Minnich (Sarnoff), John McCalpin (Univ. of Delaware), Tom Rokicki (Digidesign), Chris Ruemmler (HP), and John Weitz (HP).

We would also like to thank all of the people that have run the benchmark and contributed their results; none of this would have been possible without their assistance.

Our thanks to all of the free software community for tools that were used during this project. lmbench is currently developed on Linux, a copylefted Unix written by Linus Torvalds and his band of happy hackers. This paper and all of the lmbench documentation was produced using the groff suite of tools written by James Clark. Finally, all of the data processing of the results is done with perl written by Larry Wall.

Sun Microsystems, and in particular Paul Borrill, supported the initial development of this project. Silicon Graphics has supported ongoing development that turned into far more time than we ever imagined. We are grateful to both of these companies for their financial support.

10. Obtaining the benchmarks

The benchmarks are available at http://reality.sgi.com/employees/lm_engr/lmbench.tgz as well as via a mail server. You may request the latest version of lmbench by sending email to [email protected] with lmbench-current* as the subject.

References

[Chen94] P. M. Chen and D. A. Patterson, ‘‘A new approach to I/O performance evaluation − self-scaling I/O benchmarks, predicted I/O performance,’’ Transactions on Computer Systems, 12 (4), pp. 308-339, November 1994.

[Chen93] Peter M. Chen and David Patterson, ‘‘Storage performance − metrics and benchmarks,’’ Proceedings of the IEEE, 81 (8), pp. 1151-1165, August 1993.

[Fenwick95] David M. Fenwick, Denis J. Foley, William B. Gist, Stephen R. VanDoren, and Danial Wissell, ‘‘The AlphaServer 8000 series: high-end server platform development,’’ Digital Technical Journal, 7 (1), pp. 43-65, August 1995.

[Hennessy96] John L. Hennessy and David A. Patterson, ‘‘Computer Architecture: A Quantitative Approach, 2nd Edition,’’ Morgan Kaufman, 1996.

[McCalpin95] John D. McCalpin, ‘‘Memory bandwidth and machine balance in current high performance computers,’’ IEEE Technical Committee on Computer Architecture newsletter, to appear, December 1995.

[McVoy91] L. W. McVoy and S. R. Kleiman, ‘‘Extent-like Performance from a Unix File System,’’ pp. 33-43, Proceedings USENIX Winter Conference, January 1991.

[Ousterhout90] John K. Ousterhout, ‘‘Why aren't operating systems getting faster as fast as hardware?,’’ pp. 247-256, Proceedings USENIX Summer Conference, June 1990.

[Park90] Arvin Park and J. C. Becker, ‘‘IOStone: a synthetic file system benchmark,’’ Computer Architecture News, 18 (2), pp. 45-52, June 1990.

[Saavedra95] R. H. Saavedra and A. J. Smith, ‘‘Measuring cache and TLB performance and their effect on benchmark runtimes,’’ IEEE Transactions on Computers, 44 (10), pp. 1223-1235, October 1995.

[Stallman89] Free Software Foundation, Richard Stallman, ‘‘General Public License,’’ 1989. Included with lmbench.

[Toshiba94] Toshiba, ‘‘DRAM Components and Modules,’’ pp. A59-A77, C37-C42, Toshiba America Electronic Components, Inc., 1994.

[Wolman89] Barry L. Wolman and Thomas M. Olson, ‘‘IOBENCH: a system independent IO benchmark,’’ Computer Architecture News, 17 (5), pp. 55-70, September 1989.

Biographical information

Larry McVoy currently works for Silicon Graphics in the Networking Systems Division on high performance networked file systems and networking architecture. His computer interests include hardware architecture, software implementation and architecture, performance issues, and free (GPLed) software issues. Previously at Sun, he was the architect for the SPARC Cluster product line, redesigned and wrote an entire source management system (now productized as TeamWare), implemented UFS clustering, and implemented all of the Posix 1003.1 support in SunOS 4.1. Concurrent with his Sun work, he lectured at Stanford University on Operating Systems. Before Sun, he worked on the ETA Systems supercomputer Unix port. He may be reached electronically via lm[email protected] or by phone at (415) 933-1804.

Carl Staelin works for Hewlett-Packard Laboratories in the External Research Program. His research interests include network information infrastructures and high performance storage systems. He worked for HP at U.C. Berkeley on the 4.4BSD LFS port, the HighLight hierarchical storage file system, the Mariposa distributed database, and the NOW project. He received his PhD in Computer Science from Princeton University in 1991 in high performance file system design. He may be reached electronically via [email protected].