21st-Symmetric Multiprocessing

Symmetric Multiprocessing

When to Double Down, and When to Stand Pat

Whoops! Here go the year-end bonuses: "It's possible that some of our sales force overstated the benefits of dual-capable CPU systems. Unfortunately, by being overly critical of single-CPU-capable systems, if they have to eat some crow, so be it, but we do believe this (single CPU) system has value." These rather embarrassing words came from a who-can-blame-him, I-don't-want-to-be-named PC maker executive. Intel's abrupt about-face in positioning the new single-chip Pentium 4 workstations was the cause of this exec's black-feather comments. Intel will not be able to produce a multi-chip P4 architecture until possibly the second half of 2001, or maybe even later, given the way the company's P4 rollout luck has been going.

With this big delay in getting multi-CPU capable products out the door, Intel's only viable option is to say that its new and expensive P4 is so capable that a single-chip workstation is all you will ever need. However, up until the introduction of the new P4 in November 2000, Intel and its PC maker customers consistently promoted and differentiated workstations from mere mortal PCs by asserting that such heavy lifting duties demanded dual-chip SMP (symmetric multiprocessing) solutions, running factory-installed NT/Windows 2000, naturally. This dual-chip market tactic applies most especially to Dell, the single biggest supplier of workstations in the U.S. The Texas company has never sold single-CPU workstations.

But when Intel does finally ship its new multiprocessor solution, oy! Expect a P4 SMP PR spin that will make Florida presidential politics look positively amateur. Ironically, all this may ultimately prove to be just an Intel marketecture ploy, not an SMP technology holdup. Industry sources close to Intel have said that "Foster", the SMP version of the Pentium 4, will likely support industry-standard double data rate (DDR) memory, with the way more pricey Rambus memory (an Intel-pushed standard) remaining an option for high-end workstations. Foster apparently uses the same CPU core as the Pentium 4, but will utilize some new internal features to enable SMP. What's really intriguing is that Intel may simply be burying deep inside the P4 the go-fast switch to turn on SMP processing, as well as some other necessary features. And so, the regular uniprocessor P4 may eventually turn out to be exactly the same CPU as Foster, just with the SMP powerplant switched off. If this turns out to be the case, expect hackers around the world to jump on this like blood-starved fleas on a fat Saint Bernard.

Meanwhile, all this P4 posturing will keep thousands of low-rent Intel users busily chuckling as they merrily continue making screaming home-brew SMP systems using dual Celeron CPU motherboards from companies like Abit. And don't expect to find the costly and hugely complex Windows 2000 running on these built-in-the-backyard SMP systems. Instead, you will probably find Linux or BeOS, both of which can, or will soon, in the case of Linux 2.4, do a torrid SMP samba. Also not to be left out of this Where's Waldo P4 salsa is the new Mac OS X running on Apple's dual G4 machines. Underpinning MacOS X is "Darwin", an Open Source brew of the Mach and FreeBSD UNIX kernels.

SMP system speedup comes about primarily because all the machine's processors share a single memory and bus interface within a single computer. Commodity PC SMP systems have been around for some time. Linux-on-Intel, for example, has offered SMP support using custom-built kernels going all the way back to Linux 2.0. (Linux SMP support became available for UltraSparc, SparcServer, Alpha and PowerPC machines in 2.2.x.) For Intel machines, all that was needed was the right Linux kernel configuration and a multiprocessor 486, Pentium, or higher machine that met Intel's Multi-Processor Specification (MPS) 1.1/1.4 requirements.

MPS enables Intel architecture systems to run with two or more processors. Note, though, that MPS does not specify how the hardware implements shared memory, only how that implementation must function from a software point of view. Your SMP mileage may therefore vary and will ultimately depend on how each vendor implements access to physically shared memory. Your SMP Linux kernel and particular applications could therefore find some hardware environments more hospitable than others, especially if you develop for a particular make and model of SMP machine. One consequence is that blanket-statement SMP benchmarks should always be viewed with some skepticism.

Getting all this go-fast goodness depends on the application developer retooling their code to be SMP-aware, which also includes choosing what shared memory model to go with. In SMP machines, there are two fundamentally different models commonly used for shared memory programming: "shared everything" and "shared something." Both models use shared memory, except that shared everything places all data structures in shared memory, while shared something requires the user to explicitly indicate which data structures are potentially shared and which are private to a single processor.

Like a wool sweater, a program is typically made up of a complex fabric of threads that run down the CPU's chest in predictable, sequential order. But if you break up that rigid sequence and run those threads in parallel across multiple CPUs, then you can knit that application sweater much faster. In shared everything SMP, you only get to use one basket of thread: the system memory. So if you dive down to get some new thread to darn and the basket is tied up with something else ("locked"), no SMP joy. When multiple CPUs in an SMP box are executing the same code out of the same shared memory, these memory locks (sometimes referred to as interlocks) are necessary to guarantee exclusive access to critical sections of the code. A critical section is deemed to be any sequence of instructions that must be executed atomically, i.e., as an uninterrupted unit. The SMP lock/unlock mechanism is actually an implementation of semaphores, with appropriate queuing of requests when a resource is busy.
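The lock-around-a-critical-section idea can be sketched in a few lines. This is only an illustration under stated assumptions: Python's threading.Lock stands in for the SMP interlock, the function name count_with_lock is made up for this example, and in the usual CPython build only one thread executes at a time, so the sketch shows the locking discipline rather than a real multi-CPU speedup.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Have several threads bump one shared counter, guarded by a lock."""
    counter = 0
    lock = threading.Lock()  # stand-in for the SMP interlock

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:        # enter the critical section
                counter += 1  # read-modify-write must not be interleaved

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

A thread that finds the lock taken simply waits its turn, which is the "basket is tied up" case: correctness is preserved, but every waiting thread is time not spent knitting.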

You can thus put locks around things to prevent processor conflict when using shared everything, but that can be highly inefficient, if not disastrous, for some types of applications. Moreover, many libraries use data structures that are not sharable. And most fundamentally, shared everything only works if all processors are executing the exact same memory image; you cannot use shared everything across multiple different code images. Thus, you can easily be forced back to shared something and eyeballing your code for better systems tuning and debugging.

The most common type of shared everything programming support is a threads library. Such threads are often called "lightweight" processes. A thread can be viewed as a sequence of instructions that represents a specific task within a program, such as a specific subroutine. To greatly simplify, a thread is owned and mostly managed by the program, while the overall program is owned and managed by the operating system (unless you are running Mach, where the kernel handles all the threads, and then they are called heavyweight threads).

Linux application developers typically use a POSIX Pthreads package to tap the system's SMP shared memory capabilities. Unfortunately, the POSIX API doesn't require that threads truly run in parallel on your nifty multi-CPU Linux hardware. For example, versions like PCThreads (http://members.aa.net/~mtp/PCthreads.html), although based on the POSIX 1003.1c standard, do not necessarily implement true parallel thread execution -- all the threads of a program are kept within a single Linux process. On the other hand, the LinuxThreads package (pauillac.inria.fr/~xleroy/Linuxthreads) is apparently a solid implementation of the "shared everything" approach.

(For a very good tutorial overview of SMP programming in Linux, see "Linux Parallel Processing HOWTO" by Hank Dietz, www.Linuxdoc.org/HOWTO/Parallel-Processing-HOWTO.html, and also see the "Linux SMP HOWTO" by David Mentré, which can be found at www.phy.duke.edu/brahma/smp-faq/smp-howto.html.)

In Linux 2.2, building an SMP kernel was a configuration option left to the user. However, you can reasonably expect that some or most Linux 2.4 distributions will ship with SMP support already configured into the kernel. With the release of 2.4, Linux SMP support improves even more, with refinements that offer better scalability, add the ability to dynamically detect and handle bugs in problematic hardware, and remove the need for some 2.2 SMP hacks. A very good overview, "Wonderful World of Linux 2.4" by Joe Pranevich (www.Linuxdevices.com/files/misc/WWOL2.4.html), summarizes the features found in this newest iteration of Linux. The scheduler has been somewhat revised to be more efficient on systems with a larger number of concurrent processes, such as those running multithreaded applications. Linux 2.4 can also handle many more simultaneous processes than its predecessor by being more scalable on multiprocessor systems and by providing a configurable process limit. In Linux 2.4 the thread/process limit is bounded only by the amount of memory in the system (the supported maximum rises to 64GB on Intel gear running 2.4).

It has also been reported that on high-end servers with as little as half a gigabyte of RAM installed, it was easily possible to support as many as 16 thousand processes at once under Linux 2.4. Other 2.4 users have reported being able to run many more than that on their specific systems. The 2-gigabyte file size restriction has also been lifted with 2.4. The way Linux handles shared memory has also been changed in Linux 2.4 to be more standards compliant. One side effect of this set of changes is that Linux 2.4 will require a special "shared memory" filesystem to be mounted in order for shared memory segments to work. This should be handled by the distributions when they become ready for Linux 2.4.

It should also be noted that on the BSD front, BSDi has been working on significantly improving SMP performance these last couple of years for its next generation BSD/OS 5.0 code. High performance FreeBSD SMP systems may also be on the horizon, mostly thanks to BSDi recently making available its next gen code to the FreeBSD development community.

One multi-threaded SMP wunderkind is Apple's new MacOS X, built on the Open Source kernel marriage ("Darwin") of FreeBSD and Carnegie Mellon's Mach. In Mac OS X, FreeBSD is the outer operating system shell (a library providing low-level services and POSIX/BSD support), with Mach serving as the kernel. This BSD shell uses existing Mach system calls and doesn't add much extra layering. This shell is necessary as Mach only provides very primitive, low-level services. To make Mach truly usable you need a library on top of it, in this instance, using FreeBSD. (Note: NetBSD is utilized for the user space in MacOS X.)

Like Linux, MacOS X still requires programmer intervention to fully wring out its SMP capabilities. The really good news, though, is that its Mach kernel can handle up to thousands of processors. Moreover, Mach was explicitly designed to support diverse architectures, including multiprocessors with varying degrees of shared memory access: Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), and No Remote Memory Access (NORMA). Mach also features integrated memory management and interprocess communication to provide both efficient communication of large amounts of data and communication-based memory management.

Mach can also easily layer the emulation of several concurrently running operating systems, and hence, we see the great ease with which MacOS X hosts the older MacOS 9. In fact, there is no reason why a virtual Windows machine shouldn't run almost as well as MacOS 9 on Mach. By the way, PowerPC MKLinux is simply Red Hat Linux running on top of Mach.

Mach is very different from UNIX/Linux in the way it handles threads. Unlike a UNIX/Linux process, which contains both an executing program and a bundle of resources such as the file descriptor table and address space, a Mach task contains only a bundle of resources; threads handle all execution activities. Each thread belongs to exactly one task, and a task cannot do anything unless it has one or more threads. Thus, there is no notion of a UNIX/Linux-type "process" in Mach. Rather, a traditional UNIX process would be implemented as a task with a single thread of control.
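The task/thread split can be pictured as a small data structure. The sketch below is a toy model only, not the Mach API: the class names Task and MachThread, their fields, and the helper unix_style_process are all hypothetical and exist just to mirror the relationships described above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Task:
    """A bundle of resources only; it executes nothing by itself."""
    name: str
    address_space: dict = field(default_factory=dict)
    threads: list = field(default_factory=list)

    def spawn_thread(self, entry_point):
        # Every thread is created inside exactly one task.
        thread = MachThread(task=self, entry_point=entry_point)
        self.threads.append(thread)
        return thread

    def can_run(self):
        # A task with no threads cannot do anything.
        return len(self.threads) > 0


@dataclass(eq=False)
class MachThread:
    """The unit of execution; it borrows all resources from its task."""
    task: Task
    entry_point: str


def unix_style_process(name, entry_point):
    """A traditional UNIX process modeled as a task with a single thread."""
    task = Task(name)
    task.spawn_thread(entry_point)
    return task
```

In this model a multithreaded program is simply a Task whose threads list holds more than one entry, all sharing the same address_space.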

Unlike the situation under Linux, in Mach the kernel itself manages the threads, and for that reason, they are sometimes called heavyweight threads. Thread creation and destruction are done by the kernel, and involve updating kernel data structures. Threads provide the basic mechanisms for handling multiple activities within a single address space. Via Cthreads or Pthreads on Mach, you can write shared memory applications that run transparently on both uniprocessor and multiprocessor machines, taking advantage of additional processors when found. Devotees would argue that because of the special way it incorporates and manages threads, plus its numerous other features, the Mach kernel provides SMP performance superior to that of the Linux kernel. For all the same reasons, Mach is also asserted to be a much better system for building highly scalable network clusters.

In sum, Open Source Darwin, the heart of Power PC MacOS X, is a killer operating system and will also likely be finding its way onto many Intel machines. For its part, Apple will be shipping this powerful new SMP system to millions of users, probably making Darwin the number one Open Source distribution on the market. Apple has truly bet the corporate farm on the power of Open Source.

Standing in marked SMP contrast to Linux 2.4, MacOS X, BSD, Windows 2000 and almost every other operating system on the market is BeOS from Jean-Louis Gassée's Be. BeOS has always had native SMP built right into its object-oriented C++ bones. Simply put, BeOS has the most modern SMP implementation found in a commodity PC operating system. In BeOS, it is the operating system that is natively SMP aware, not the application. In BeOS, there is no need to worry about reprogramming code with POSIX threads, or wondering whether or not the application is taking full advantage of all your box's processors; with BeOS, just snap in a second (or more) CPU and go.

All the BeOS C++ object foundation classes are multithreaded, and at execution, every thread is assigned to the next available processor. In BeOS a process is called a "team." All the threads in the same team share the same memory space. Your application automatically becomes multithreaded simply by writing to the BeOS API. And as BeOS is preemptive multitasking, all the threads get a piece of the CPU pie. You could still explicitly write your code for maximum efficiency, like breaking up a photo image into four parts for doing a Gaussian blur across four CPUs, if available. But there is still no need to manually assign threads to the processors. The underlying breakout of all the OS functions is handled automatically. Note, though, that command line applications ported to BeOS from UNIX/Linux are typically not SMP aware. Note also that Be has made the Tracker, its BeOS user interface/file navigation system, Open Source. Likewise, the BeOS "Deskbar," which lists the various application menus and system services, has been made Open Source.
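The "break the image into four parts" idea looks roughly like this. It is a hedged sketch, not BeOS code: it uses Python's ThreadPoolExecutor rather than the BeOS API, the names blur_row, blur_strip and parallel_blur are invented for the example, and a simple 3-tap horizontal box blur stands in for a true Gaussian so that each strip can be processed with no data from its neighbors. As in the text, nothing here pins a thread to a particular processor; that is left to the runtime and the operating system.

```python
from concurrent.futures import ThreadPoolExecutor


def blur_row(row):
    """3-tap horizontal box blur of one row, with edge pixels clamped."""
    n = len(row)
    return [(row[max(i - 1, 0)] + row[i] + row[min(i + 1, n - 1)]) / 3
            for i in range(n)]


def blur_strip(strip):
    """Blur every row in one horizontal strip of the image."""
    return [blur_row(row) for row in strip]


def parallel_blur(image, parts=4):
    """Split the image into up to `parts` strips and blur them in a pool."""
    rows_per_part = max(1, -(-len(image) // parts))  # ceiling division
    strips = [image[i:i + rows_per_part]
              for i in range(0, len(image), rows_per_part)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        blurred = list(pool.map(blur_strip, strips))  # results keep strip order
    return [row for strip in blurred for row in strip]
```

A real two-dimensional Gaussian would also need a few rows of overlap at each strip boundary, which is exactly the sort of detail the programmer still has to handle by hand.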

So, if BeOS is so cool, why isn't it being used in lots of Intel servers? The answer lies in the fact that the current BeOS network stack is inefficient for server use. However, BONE (BeOS Network Environment) will fix these network inefficiencies by bringing BSD socket compliance to BeOS. Now in beta testing, BONE will finally make BeOS a truly formidable SMP server platform.

With all this SMP activity underway, you can expect a lot of vendor hype and confusion. As a PR fallout, "scalability" and "speed" may end up being equivalent in some users' minds. But these two terms are emphatically not equivalent. Scalability actually means that the effort to handle a problem grows reasonably with the hardness of the problem. Some problems will scale well, others simply will not. Another way of stating this is that some problems are tractable and others are intractable. A whole discipline has sprung up devoted to understanding which problems are tractable -- solvable in a reasonable amount of time on a computer -- and which will never be solved no matter how many computer resources you throw at them. Clearly, you want to avoid problems that become exponentially harder; i.e., they don't scale.

In the 1960s a classification method was devised to address this tractable/intractable issue. A problem of size N is said to be tractable if its solution takes a length of time that depends on a polynomial function of N -- that is, an algebraic power of N such as N squared, N cubed, and so on. The computational resources required for a tractable problem generally scale with the problem size in a moderate way. This, therefore, is the true meaning of "scalability." If, on the other hand, the time taken to solve a problem blows up exponentially with the size of the input -- for example, on the order of 2^N or greater -- the problem is deemed to be intractable. It just won't scale. Although this is a rich area of study, getting definitive answers to what is tractable/intractable has proven to be enormously difficult.
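A few numbers make the polynomial-versus-exponential gap concrete. The sketch below simply counts abstract "steps" for a cubic cost and for a cost of 2 to the power N; the function names are made up for the illustration, and the step counts say nothing about any particular machine.

```python
def polynomial_steps(n, power=3):
    """Work for a tractable problem: N raised to a fixed power."""
    return n ** power


def exponential_steps(n):
    """Work for an intractable problem: 2 raised to the power N."""
    return 2 ** n


# Compare the two growth rates side by side for a few problem sizes.
for n in (10, 20, 40, 80):
    print(n, polynomial_steps(n), exponential_steps(n))
```

At N = 10 the two are nearly equal (1,000 versus 1,024 steps); by N = 80 the cubic cost is about half a million steps while the exponential one has 25 digits.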

People will also sometimes confuse this matter by noting that a slow as molasses algorithm runs somewhat better on more than one CPU, and thus it "scales well". Wrong. You could add as many CPUs as there are stars in the universe and it still won't run properly if it's intractable; i.e., it just won't scale. On the other hand, if you have a problem that scales well, you can expect significant improvements in execution time as you add more processors. In this sense, one can accurately say that a tractable problem may yield a high performance algorithm that will scale well and possibly run quite fast on any given computer.
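The point about piling on CPUs can be checked with simple arithmetic. Assume, generously, that P processors give a perfect P-fold speedup (real SMP machines do not) and that each processor can afford a fixed budget of steps. For a cost of 2 to the power N, the largest solvable N then grows only with the base-2 logarithm of P. The function name largest_solvable_size is hypothetical.

```python
def largest_solvable_size(step_budget, processors):
    """Largest N with 2**N <= step_budget * processors (perfect speedup assumed)."""
    total_steps = step_budget * processors
    # For a positive integer x, x.bit_length() - 1 equals floor(log2(x)).
    return total_steps.bit_length() - 1
```

With a budget of 2**40 steps, one CPU reaches N = 40, 1,024 CPUs reach N = 50, and a million CPUs reach only N = 59. Each doubling of the hardware buys exactly one more unit of problem size.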

All of which is to say that scalability has absolutely nothing to do with the size of the machine, its operating system capabilities, or their combined overall performance. In truth, the real determining factor for SMP computer performance is how well any particular software/hardware system performs under a given load.

Fortunately, there are still a number of problems that are scalable and therefore lend themselves to SMP machines. However, these solvable problems tend not to be found in your typical business data center or MIS department. Rather, say hello, NASA or Fermilab. For your usual business environment, running a web server on an SMP box is often mentioned as a good example solution. However, one could also legitimately argue that you are better off buying many cheap uniprocessor boxes instead. Indeed, Yahoo, arguably one of the single most successful and busiest web sites, does just that. They use lots of cheap Intel-FreeBSD uniprocessor boxes to reliably serve their many millions of users.

There are also a considerable number of problems that will scale/work extremely well in loosely connected, cheap uniprocessor systems. The much ballyhooed success of Internet CPU cycle stealing client programs, like SETI@Home, underscores the power of having large numbers of machines at your disposal, shared memory elegance be damned. It's a rare event when twenty thousand cycle-sliced machines can't beat the pants off a 16-way SMP box.

You can also use PVM or MPI for enabling network parallel processing. Distributed applications that use message-passing schemes like PVM and MPI obviously do not share the same memory space. As a consequence, and also because of the overhead of opening and tearing down TCP/IP sockets and communications latency, there can be a noticeable performance loss in comparison to an SMP system. The operative words, though, are "can be," as there are many problems that scale and run extremely well on PVM/MPI systems. These PVM/MPI systems can also involve a very large number of networked resources spread out over a WAN.
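The message-passing style itself, as opposed to shared memory, can be sketched without any network at all. The example below is a stand-in, not the PVM or MPI API: in-process queues play the role of network links, threads play the role of remote machines, and the names worker and distributed_sum are invented here. The point is only that each worker receives its own copy of the data in a message and sends a result message back, never touching a shared data structure.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive work messages until a None sentinel arrives; reply with partial sums."""
    while True:
        message = inbox.get()
        if message is None:
            break
        chunk_id, numbers = message
        outbox.put((chunk_id, sum(numbers)))


def distributed_sum(numbers, workers=3):
    """Scatter slices of `numbers` to workers, then gather and add the replies."""
    inboxes = [queue.Queue() for _ in range(workers)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, results))
               for inbox in inboxes]
    for t in threads:
        t.start()
    for i, inbox in enumerate(inboxes):
        inbox.put((i, numbers[i::workers]))  # slicing hands each worker a private copy
        inbox.put(None)                      # sentinel: no more work
    for t in threads:
        t.join()
    total = 0
    for _ in range(workers):
        _, partial = results.get()
        total += partial
    return total
```

On a real network every put and get would carry the socket and latency costs described above, which is why this style pays off mainly when each message carries a large chunk of work.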

Significantly, especially from a PVM/MPI developer's perspective, Linux 2.4's TCP/IP stack is vastly improved. It has undergone a complete rewrite. Linux stack performance has now gone way up and the stack has also been made far more reentrant. Prior to 2.4, Linux serialized many TCP/IP activities, hurting performance and scalability (especially on multiprocessors).

If you are still not persuaded that network message passing is viable, then "Linda", a shared memory scheme for networked parallel systems, is also available. Linda is the software creation of David Gelernter of Yale University. (Tragically, Gelernter was badly maimed by a bomb sent to him by the infamous Unabomber.) Linda uses a network shared memory scheme called "tuple space." But unlike PVM or MPI, twenty-five or so networked machines are the upper practical limit with Linda.
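A tuple space is easy to model in miniature. The sketch below is a single-process toy, not a real Linda implementation: it supports Linda's out (deposit a tuple), rd (read a matching tuple) and in (remove a matching tuple, spelled in_ here because "in" is a Python keyword). Using None as the wildcard in a pattern is an assumption of this example, as is the class name TupleSpace.

```python
import threading


class TupleSpace:
    """A toy, in-memory tuple space shared by the threads of one program."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *fields):
        """Deposit a tuple and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()

    def _find(self, pattern):
        # A pattern matches a tuple of the same length whose fields are
        # equal wherever the pattern is not the None wildcard.
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                    p is None or p == f for p, f in zip(pattern, tup)):
                return tup
        return None

    def rd(self, *pattern):
        """Block until a tuple matches, then return it without removing it."""
        with self._cond:
            match = self._find(pattern)
            while match is None:
                self._cond.wait()
                match = self._find(pattern)
            return match

    def in_(self, *pattern):
        """Block until a tuple matches, then remove and return it."""
        with self._cond:
            match = self._find(pattern)
            while match is None:
                self._cond.wait()
                match = self._find(pattern)
            self._tuples.remove(match)
            return match
```

A producer might call space.out("task", 7) while any number of consumers block in space.in_("task", None); whichever consumer wakes first takes the tuple, and the others keep waiting.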

All the foregoing is just another way of saying that the single biggest market need for SMP is assuredly not in large expensive servers, but rather, in desktop machines. As desktop systems increasingly become overstressed multimedia machines (sometimes simultaneously working on MP3 files, playing DVD/MPEG movies, using video cams, web surfing, running other rich multimedia, as well as playing games), there is now an overwhelming reason to make the move to SMP. In fact, BeOS was born to serve this very market need.

So how come, as of yet, SMP hasn't stormed onto the desktop? Two very good reasons: 1. Windows 9.x/Me, the ruling desktop standard, is completely brain dead when it comes to SMP (and for doing most everything else for that matter). 2. Intel, to inflate its profits, has kept SMP-capable CPU prices artificially high. Indeed, it looks like we are about to see "Xeon II" played out with the new P4 "Foster" SMP chip.

Some people have wised up to Intel's game and were overjoyed to discover that the lowly and cheap Celeron chip was an incredibly great CPU for doing BeOS and Linux SMP chores. To make the CPU inexpensive, the original Celeron used the Pentium II ("Deschutes") core but did not have any on-die level 2 cache. However, as cache recompense, users found that these first Celerons were fiends for being overclocked. The Celerons also had reasonably good 3D and FPU performance. Then, in 1998, Intel introduced the new "A" series of Celerons with on-die level 2 cache running at CPU clock frequency. In contrast, a Pentium II's L2 cache ran at only half the CPU clock frequency. The upshot was that the new Celerons were now wolves in cheap sheep's clothing. But this performance fact was largely overlooked by the general and computer press (Intel's weak Celeron PR didn't help matters, either). However, savvy Celeron users caught on very quickly.

But it was with the release by Intel of Socket-370 Celerons that motherboard manufacturers finally wised up and began making Socket-370 boards. Socket-to-slot adapters, called slockets, were also made to make the new Celerons compatible with existing motherboards. For example, Abit's BP6 motherboard combines Celeron SMP processing, a built-in UltraDMA/66 controller, and numerous expansion slots, all at a very reasonable price. Adding new fuel to the Celeron SMP fire is that Intel will be shipping an 800MHz Celeron chip, paired with a 100MHz system bus, in Q1, 2001. An 850MHz Celeron is also planned for the second quarter of next year.

Sad to say, though, it's a fact of business life that no major PC vendor is going to risk Intel's wrath by churning out millions of hacked, low cost Celeron SMP systems. More to the point, these PC makers also like the fat profit margins that come with selling Xeon boxes. The same profit-taking thing will also happen when the SMP P4 finally ships.

If inexpensive SMP is ever truly going to take off on the desktop, there are only a couple of viable hardware vendor options remaining. One is Apple and its Power PC. Apple has already begun shipping dual G4 machines, systems that will truly come alive when multi-threaded MacOS X applications start to appear later in 2001. But that won't help dyed-in-the-wool PC users, and neither, apparently, will Intel. That leaves just one low cost PC SMP desktop possibility: AMD.

AMD's answer to the Celeron is its Duron CPU, which has 128K of L1 cache and 64K of "exclusive" L2 cache, and currently clocks at up to 800 MHz. Unlike in Intel's Celeron and Celeron II processors, data in the Duron's L1 cache is not duplicated in the L2 cache. On an Intel Celeron processor, you will normally find the contents of the 32K of L1 cache duplicated in 32K (of 128K) of the L2 cache. AMD says that the Duron performs up to 25% faster than an equivalent speed Celeron processor, and various published benchmarks seem to more than support the claim. A comparative 3DMark done by Ars Technica (http://arstechnica.com/reviews/2q00/duron/duron-3.html) showed a 700 MHz Duron almost trouncing a 700 MHz Pentium III as well as a "Classic" Athlon 750. Remarkably, the 700 MHz Duron whipped a Celeron 850 despite the Intel part's 150 MHz speed advantage.
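The practical effect of an exclusive L2 is simple arithmetic on the cache sizes quoted above. In the sketch below, the function name unique_cache_kb is hypothetical, and the inclusive case assumes the L2 is at least as large as the L1 and holds a full copy of it.

```python
def unique_cache_kb(l1_kb, l2_kb, exclusive):
    """Kilobytes of distinct data the L1 and L2 caches can hold together."""
    if exclusive:
        return l1_kb + l2_kb   # no duplication: the capacities add up
    return max(l1_kb, l2_kb)   # L1 contents are mirrored inside the L2


duron = unique_cache_kb(128, 64, exclusive=True)     # 192
celeron = unique_cache_kb(32, 128, exclusive=False)  # 128
```

So although the Duron's L2 is half the size of the Celeron's, its two cache levels together can hold 192K of distinct data against the Celeron's 128K.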

And now for the really good news: The Duron, which is built on the Athlon architecture, supports SMP according to AMD. All that's needed, therefore, is for someone to make a cheap SMP board with DDR SDRAM support. And lo, AMD is now releasing its AMD 760MP chipset, which has ATA100, DDR SDRAM support, and SMP support. Strap on a pair -- or more -- of inexpensive Durons (some versions cost only about $110) and get ready to fly.

In November 2000, AMD finally unveiled its multiprocessor plans, and with any luck, these new SMP systems will also cause some religious desktop conversions down in Santa Clara. AMD's SMP DDR system architecture uses two-way or four-way building blocks using the 266MHz Athlon bus, which is based on the new AMD-760 MP chipset. AMD Vice President Rich Heye, a veteran engineer of both the early DEC Alpha and the PowerPC, said that "Some people will look at bus speeds and say, 'Aw, 266MHz? That's as slow as sin'. But you can't just be looking at wire speeds or at clock rates - it's how you use those wires."

AMD has been on a market tear with its Athlon and Duron low cost/high performance CPUs. We can now hopefully look forward to Athlon Thunderbirds and Celery-eating Durons storming onto the desktop SMP scene. There are potentially millions of Linux 2.4, BeOS R5, or PC-Darwin users just waiting for this to happen.

SMP has well and truly arrived and with a little bit of luck, ordinary users around the world will soon be rejoicing in multithreaded chorus.

21st Pub Date: March 2001

21st, The VXM Network, https://vxm.com