Why Digital Signal Processors Might Put Cray & Intel on the Trailer

Franco Vitaliano

In this article sidebar, see how a spectacular new
ICE Machine may freeze Cray sales; and why an upcoming
TV Set Top box might claw its way to the top of the PC industry
Nielsen Ratings.

Drowning by Cycles

(Note: This article is from Jan. 1996) Bandwidth is to communications what faster clock cycles are to CPU power. The more you have, the more power you have. And the present contrived scarcity of communications bandwidth is the telecommunications industry's Watergate. There is absolutely no reason why you should not have multi-megabit delivery to your desktop today.

In fact, some large companies, like EDS, know this, and have already made their moves to open up the nation's dormant fiber pipes for their private, internal use. These are the so-called 'dark fiber' networks: millions of miles of unused sections of fiber laid down by the phone companies for future use.

A single strand of fiber, running free and unfettered by switching electronics, has an intrinsic bandwidth of 25 thousand gigahertz for each of the three groups of frequencies (three passbands) it can support. Just one such fiber thread can carry twice the peak hour telephone traffic in the U.S.

Within the next five years, if fiber were used at its true data carrying capacity, 500 megahertz two-way communications could be brought into the home (your house currently shuffles along at 4 kHz over copper). Via compression or more advanced technologies, speeds would soar into scores of gigabits. The ATM communications protocol may well emerge as the service provider's first choice, as its fixed-size, 53 byte cells facilitate fast networks.

So what's stopping all this bandwidth progress? For fiber to work at these awesome high speeds, it needs to be all optical. Put an intelligent, electrical switch in its path, and it slows way down. Thus, fiber nets work fastest when they are 'dumb'. But the telephone companies make their money by building intelligent switches that monitor your calls so they can then bill you. Thus, to make money, they have to artificially hamstring fiber's go-fast capabilities. The phone companies are therefore in no hurry to use the full bandwidth potential of what they already own.

But, as all will now ask in confused unison, if there are no intelligent switches, how do you route the dumb fiber call? The answer is surprisingly simple: With massive bandwidth, you can create any type of logical switch you desire. The best analogy is AM or FM radio. The 'air' bandwidth is sufficient that you simply tune to the desired station frequency on your radio dial.

Likewise, if fiber bandwidth is abundant enough, you just 'tune' your PC to the fiber frequency assigned you. That means millions of frequencies over fiber, of course. This incredible feat has already been accomplished: IBM has the first fully functional, all optically switched fiber system, called, appropriately enough, 'Rainbow'. Its opto-electronics interface fits on a single PS/2 Microchannel card. To the Bell operating companies and long distance carriers, Rainbow must rank right up there with flood, pestilence, and the FCC.
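To put the tuning idea into rough numbers, here is a back-of-the-envelope sketch in Python. It simply divides the fiber's quoted bandwidth (25 thousand gigahertz in each of three passbands) into equal channels, such as the 500 megahertz home connection mentioned earlier. The even split, the absence of guard bands between channels, and the function name are all simplifying assumptions, not a description of how Rainbow or any real optical system allocates frequencies.

```python
def channels_per_fiber(passband_hz: float, passbands: int, channel_hz: float) -> int:
    """Number of equal-width channels that fit in one fiber.

    Assumes an ideal, even split with no guard bands (a simplification).
    """
    total_hz = passband_hz * passbands
    return int(total_hz // channel_hz)


if __name__ == "__main__":
    # 25,000 GHz per passband, three passbands, 500 MHz per home
    homes = channels_per_fiber(25e12, 3, 500e6)
    print(f"{homes:,} channels of 500 MHz each")
```

Narrower channels raise the count proportionally, which is how one fiber could in principle carry a very large number of separately tunable 'stations'.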

From the computer side, the folks who control the CPUs (i.e., Intel) also see a market threat from such a bandwidth firehose. Get enough bandwidth into your PC, and it can flood even a fast Pentium. At 622 megabits/second ATM rates, a Pentium class machine has under 100 instruction cycles to read, store, display, and analyze each packet, plus do its other housekeeping chores, like running spreadsheets. And as networks approach gigabit+ speeds, the demands on conventional CPUs, no matter how fast, will simply be overwhelming.
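The 'under 100 cycles' budget can be checked with simple arithmetic. The Python sketch below divides the CPU clock rate by the number of 53 byte ATM cells arriving each second. The 120 MHz clock is an assumption borrowed from the Pentium figure quoted later in this article, and the function name is made up for illustration; the calculation also ignores cell header overhead and any superscalar issue width.

```python
ATM_CELL_BYTES = 53  # fixed ATM cell size


def cycles_per_cell(line_rate_bps: float, cpu_clock_hz: float) -> float:
    """Clock cycles available to handle one ATM cell at full line rate."""
    cell_bits = ATM_CELL_BYTES * 8
    cells_per_second = line_rate_bps / cell_bits
    return cpu_clock_hz / cells_per_second


if __name__ == "__main__":
    # 622 Mbit/s ATM link feeding an assumed 120 MHz Pentium
    budget = cycles_per_cell(622e6, 120e6)
    print(f"{budget:.1f} clock cycles per cell")
```

Under these assumptions the budget comes out at roughly 80 cycles per cell, and it shrinks in direct proportion as the line rate climbs.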

And if unlimited bandwidth becomes 'free', does your PC now become a multimedia/telephony/videoconferencing machine first, and a spreadsheet/WP/DB system second? In that possible event, what does it mean for the Intel/Microsoft PC cabal? What do you use for a PC processor/OS to deal with this bandwidth torrent?

If such a watershed event should come to pass, Gates and TV cable baron John Malone have already hedged their desktop/cable box CPU bets by each ponying up $15M to invest in a Silicon Valley startup called Microunity. This remarkable company's hopes are pinned on making communications processor chips that can process this coming bandwidth deluge 100 times faster than a P6, but do so at clock speeds that are only ten times faster than the new Intel CPU. The low clock speed is the key to Microunity's success. For otherwise, how do you develop memory subsystems that can keep up at this blinding pace?

But if Microunity does not succeed (Note: This now seems in doubt, as of 5/97--Ed.), other contenders are ready to take its -- and Intel's -- place. The most notable of them are digital signal processing (DSP) chips.


DSP (digital signal processing) chips are truly cool. One software driven DSP card can replace your PC's fax board, modem, sound card, MIDI card, and telephone answering machine. DSPs also do MPEG-1 and -2 video compression extremely well.

Fundamentally, a DSP chip is a device optimized for processing real time signals. Those signals can be audio, speech, music synthesizers, video, industrial/medical sensors, etc. DSP chips typically sport their own unique operating system. The DSP then processes the signals based on some type of algorithm. DSPs come in two basic flavors, fixed function (like MPEG processing), and programmable. They also come in two variants, those with floating point, and those without.

Complete, general purpose computer systems based on DSPs are also starting to appear. E.g., M.I.T. Lincoln Labs recently spun off Integrated Computing Engines, Inc. (ICE; 617 768 2300). Its premier product is a parallel processing, multi-purpose, supercomputer desktop system that uses 64 of Analog Devices Inc.'s ADSP21060 SHARC DSP chips. (Lincoln Labs was one of the SHARC design partners.)

This $100K ICE machine may freeze the sales of some multimillion dollar supercomputer systems. Significantly, the ICE machine offers high level (as in easy-to-use) DSP software development tools. With 64 SHARCs snapping away on just one 7.5" x 13.5" card, this desktop monster produces 7.7 Gflops (billion floating point ops/sec). The compact ICE system is designed to function as a back-end processor, using an NT PC or UNIX workstation as a front-end.

Just recently, an M.I.T. student, Andrew Twyman, broke Netscape's export encryption code in about a week using an ICE/SHARC desktop machine. (NOTE: This article was written in Jan. 1996). Previously, it had been estimated that it would cost thousands of dollars in computer time to break this encryption scheme. But ICE figures it cost Twyman about $584. This chump change, code-busting feat has tremendous implications for those people tasked with developing secure 'Net encryption schemes. Their job is going to get a lot harder.

The SHARC (Super Harvard Architecture) chip has some unique properties for a DSP. Perhaps one of the SHARC's most important features is that it has four megabits of on-board memory. As long as your application requirements stay within those 4 Mbits, then you get speed, indeed.

Up until the advent of the SHARC, Texas Instruments' 320C40 chip, with its peak of 80 MFlops (million floating point operations per second), was king of the DSP hill. But at 120 MFlops (peak), the SHARC zooms right past the TI part.
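The ICE machine's 7.7 Gflops rating follows from these per-chip peak numbers. As a quick illustrative check in Python (the function name is made up, and real sustained throughput would be lower than this simple multiplication of peak rates suggests):

```python
def aggregate_peak_gflops(chip_count: int, mflops_per_chip: float) -> float:
    """Combined peak rating of identical DSP chips, in Gflops."""
    return chip_count * mflops_per_chip / 1000.0


if __name__ == "__main__":
    sharc_card = aggregate_peak_gflops(64, 120)  # 64 SHARCs at 120 MFlops peak
    ti_card = aggregate_peak_gflops(64, 80)      # hypothetical card of 64 320C40s
    print(f"SHARC card:  {sharc_card:.2f} Gflops peak")
    print(f"320C40 card: {ti_card:.2f} Gflops peak")
```

Sixty-four SHARCs at 120 MFlops each work out to 7.68 Gflops, which rounds to the quoted 7.7.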

SHARC DSPs are in tremendous demand, and Analog Devices is reportedly way behind in shipments. Part of the backlog problem was caused by some faulty manufacturing runs. The military and aerospace markets have also shown a big interest in the SHARC, further compounding these delays. DSP market analysts say the demand for the chip is 'explosive.'

Interestingly, Integrated Computing Engines is not focused on military, aerospace, or cipher-busting applications. Instead, reports are that ICE is building the ultimate gaming machine for arcade use. So for 25 cents, adolescents of all ages will soon be enjoying the supercomputer thrill of a lifetime.

So how come we don't have a DSP in every PC pot?

DSP algorithms are not your Visual Basic programming plug and chug. But the developer's payoff can be one DSP-based card that performs many different, concurrent functions. Further, because they are programmable, it is possible to easily upgrade DSP-based systems with new software/algorithms. Thus, your investment won't be obsolete next week.

[A Technical Aside -- The origins of DSP chips are quite outside the traditional model of computing. Their divergent roots also help explain the dearth of general purpose DSP/PC software. DSPs were conceived as devices to solve real-time problems that have infinite-input data streams to process. The first sources for such signal-processing applications came from industrial and medical sensors, and focused on digital filtering and spectrum analysis.

As a result, DSPs were frequently used in sampling and analyzing discrete points of continuous analog signals. Typically, such calculations involve highly repetitive sets of arithmetic operations such as those found in the calculation of running sums. So early on, designers of DSPs turned to pipelined and superscalar chip architectures in an attempt to capitalize on the arithmetical parallelism inherent in signal-processing applications.

In addition, the real-time nature of the highly specialized problems that DSP chips were employed to solve compounded the difficulty of optimizing the devices. Typically, DSP vendors handled those problems by developing a unique operating system for each proprietary DSP chip.]
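To make the 'running sums' in the aside concrete, here is a small Python sketch of a moving-average filter, one of the simplest digital filters. It handles an unbounded sample stream one value at a time and keeps a running sum, so each new sample costs one add and one subtract. The function name and window size are illustrative choices, and a real DSP would do this in hardware multiply-accumulate units rather than interpreted code.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(samples: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` samples via a running sum."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque()      # samples currently inside the window
    running_sum = 0.0
    for sample in samples:
        recent.append(sample)
        running_sum += sample
        if len(recent) > window:
            # Drop the oldest sample instead of re-adding the whole window
            running_sum -= recent.popleft()
        yield running_sum / len(recent)


if __name__ == "__main__":
    noisy = [1, 2, 3, 4, 5]
    print(list(moving_average(noisy, window=3)))
```

Because the same add-and-subtract pattern repeats for every sample, it is exactly the kind of regular arithmetic that pipelined hardware can exploit.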

But while applications for DSPs are increasing, knowledgeable software developers have remained scarce. Fortunately, this DSP-developer drought is about to radically change, as high level DSP software development tools and computer languages, like C, C++, and even Ada, are now appearing on the scene.

These recent arrivals will help expedite innovation in DSP algorithms, as well as initiate a proliferation of new DSP-based systems. This software development progress is also being mirrored by significant advances in new DSP chip hardware, like the SHARC.

Of all the PC vendors, IBM has probably the most advanced PC-based DSP program. IBM's MDSP2780 DSP chip includes a complete real time operating system, as well as a canned suite of ready to wear algorithms that include Soundblaster card emulation, a 14.4 modem/fax, a 32 voice wave table synthesizer (as well as FM synthesis), a digital answering machine, and a full duplex answerphone. All packaged up, IBM calls its DSP/OS/algorithm product Mwave.

IBM is also bringing out a new DSP architecture, called Mfast. IBM claims its new design can do 10 billion-plus 16 bit operations/sec, and five billion-plus single-precision floating point operations/sec. 3-D graphics, multi-standard video, digital TV, and video conferencing are also being added by IBM to these new Mfast DSPs.

Indicative of a growing industry trend, the new Mfast architecture also uses VLIW (Very Long Instruction Word). VLIW is a special type of processing scheme. It allows multiple actions to execute in parallel on board the chip via a single complex instruction. In and of itself, this multiplicity of operations per computer cycle is not big news. Most new PC CPUs, and all UNIX workstation CPUs, have this simultaneous instruction execution capability.

But this type of multi-instruction parallel execution, called superscalar processing, is done via special hardware on board the chip. This hardware makes the CPU chip both more expensive and more complex. VLIW, by contrast, does its simultaneous instruction dance via software. This makes a VLIW chip simpler and, hence, less costly to make. (Also note that the new Intel/Hewlett Packard CPU venture, the 'P7', will utilize VLIW processing. But the ongoing and quite intense HP/Intel disagreements over the P7 design may produce something far removed from what was originally described. There is now an HP group design, and an Intel group design. Which camp wins the day will soon be decided in an upcoming P7 shoot-out within Intel.)

VLIW makes use of a much longer, fixed length, single complex instruction, with many regular and independent sub-parts. Translation: much more work can simultaneously happen in VLIW-based chips than with other CPU/DSP technologies.

To visualize a VLIW at work, think of a long string of sausages all linked together, with each separate link being eaten by a different person at the same time.

VLIW performs its magic thanks to a piece of pre-processing software called a compiler. In any type of computer, be it a VLIW, Pentium PC, UNIX workstation, or whatever, a compiler converts the original source language statements into binary code. A compiler takes in the original source code for the program, examines it, then decides how to parse the instructions for machine processing.

Good compilers can make or break a CPU's performance. Sometimes, you will see published new performance benchmarks for a CPU. Quite often, this speed increase was not the result of hardware improvements, but the result of using a better compiler. (This ever-clever compiler game can also lead to spurious SPEC'manship. E.g., Intel just had to do another well-publicized mea culpa. An alleged 'bug' in a compiler had caused Intel to overstate the speed of its microprocessors by about 10% in the SPECint92 test results. Intel was forced to make this admission after arch-rival Motorola pointed out the 'error.') In a VLIW, the cleverness of the compiler takes on even more significance.

Unlike a superscalar CPU, which relies on the hardware to dynamically assign the incoming instructions, the VLIW compiler has to be intelligent enough to correctly pre-schedule each instruction subunit prior to processing. Thus, a regular CPU does its instruction allocation dynamically, while a VLIW chip does it statically. If this instruction scheduling is not done properly by the VLIW compiler, chip performance can quickly degrade.
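The static scheduling described above can be illustrated with a toy example. The Python sketch below greedily packs operations into fixed-width 'long instructions', placing an operation only after everything it depends on has been issued in an earlier bundle. This is a deliberately simplified list scheduler, not the algorithm of any actual VLIW compiler; the function name, the operation names, and the default width of five (echoing the TriMedia figure quoted later in this article) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def pack_bundles(deps: Dict[str, Set[str]], width: int = 5) -> List[List[str]]:
    """Greedily pack operations into bundles of at most `width` operations.

    `deps` maps each operation to the set of operations whose results it
    needs. An operation is eligible only once all of its inputs were issued
    in an earlier bundle, so operations within a bundle are independent.
    """
    remaining = dict(deps)
    done: Set[str] = set()
    bundles: List[List[str]] = []
    while remaining:
        ready = [op for op, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("unsatisfiable or cyclic dependencies")
        bundle = ready[:width]   # fill as many slots as possible
        bundles.append(bundle)
        done.update(bundle)
        for op in bundle:
            del remaining[op]
    return bundles


if __name__ == "__main__":
    # c needs a and b; d needs c; e is independent of everything
    program = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}
    for cycle, bundle in enumerate(pack_bundles(program)):
        print(cycle, bundle)
```

In this example only three of five slots are filled in the first bundle and one in each of the next two, which hints at why a compiler that finds too little independent work leaves a VLIW chip running well below its peak.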

[A Technical Aside -- We would say that the differences between these computer architectures are in the formats and semantics of the instructions. In a number of ways, the challenges facing VLIW design are closely analogous to those of superscalar RISC design.

The idea behind VLIW is to "weld" a fixed number of basic instructions into fixed-length "composite" instructions whose components would be executed in parallel on multiple logic units within the DSP chip.

The real challenge to this idea comes in choosing the right composite instructions to avoid both data dependencies and branch-instruction dependencies within the very long instruction word. To solve this problem, VLIW compilers employ a two-step technique called trace scheduling.

In the first step, called trace selection, the compiler predicts the most likely sequence of operations that will be executed. Then, in a process called trace compaction, the compiler attempts to minimize the number of wide instructions required to execute the program. Like their RISC brethren, VLIW systems need lots and lots of memory to be effective. VLIW is also not easy to pull off.

In the mid-1980s, a start-up company called Multiflow put out one of the very first commercial computer systems that used VLIW techniques, the Multiflow Trace. Essentially, it was a SPARC machine, and ran under SunOS. Multiflow went belly up -- more from poor marketing than technology, the pundits said. But the legacy of Multiflow lives on.]

A consumer electronics giant takes on the PC big boys

The big news, though, is that Philips Semiconductor, a division of the Dutch conglomerate Philips Electronics, is also bringing out its own VLIW multimedia processor, which goes under the name of TriMedia. The TriMedia can process several multimedia types concurrently, at the rate of five operations per individual instruction. Philips has been working on the TriMedia since 1988.

Unlike IBM's new Mfast series, for which no release dates have yet been announced, the TriMedia chips are now available in sample quantities. Full production commences in the second half of 1996. Philips' new chip is a hybrid device, combining elements of both a DSP and a regular computer CPU. The TriMedia has been designed to operate either as a multimedia co-processor in a PC or standalone, as an integrated DSP/CPU in a consumer electronics device. In a nutshell, the TriMedia combines the multimedia power of a next generation DSP with the high level programmability of a regular CPU.

The TriMedia directly interfaces into the PCI local bus, as well as into a digital camera, video encoder, stereo A-D/D-A converters, and a V.34 analog modem or ISDN digital interface for telephony. Its 'glueless' interface means no extra circuitry or support chips are necessary. This feature greatly cuts down on system cost and complexity.

At its heart is a 400 MBps internal bus that links together autonomous modules. A single TriMedia VLIW instruction can do several things at once, like opening up data paths between main memory and any of its modules. These multimedia modules provide such functions as digital video, audio, MPEG, and image processing.

One VLIW instruction contains five distinct operations for each machine cycle, and each operation can comprise several media operations. So, for example, audio, video, and 3D image processing can all occur simultaneously. The TriMedia software development package comes with a suite of application libraries, such as MPEG-1 and -2, V.34 modem, H.320, H.324 (video-conferencing), and 3-D graphics.

It is the combination of VLIW, smart compilers, on-chip parallelism, and unique multimedia logic design which gives the TriMedia its performance edge. A 120-MHz Intel Pentium cruises along at 200 million operations per second (MOPS), while the 100-MHz TriMedia screams at up to 4 billion operations per second (BOPS).

Comparing DSP processing power with regular CPUs is always a case of apples and oranges, but generally speaking, one DSP operation is equivalent to three CPU operations. In other words, a CPU needs to run 3x faster to compete with a DSP device at a given performance rating.

In the case of the TriMedia, these comparisons are even more difficult, as the part is optimized for media processing, and not regular computer functions. Nonetheless, for a state of the art Pentium to compete with the TriMedia DSP functions, the Intel chip would have to run 100 or so times faster!
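For readers who want to see where such a multiple comes from, the Python sketch below works through the quoted figures: 200 MOPS at 120 MHz for the Pentium, up to 4 BOPS at 100 MHz for the TriMedia, five operations per VLIW instruction, and the three-to-one DSP-to-CPU rule of thumb. The variable and function names are made up for illustration, and the result is only as good as these peak ratings.

```python
def ops_per_cycle(ops_per_second: float, clock_hz: float) -> float:
    """Average operations completed per clock cycle."""
    return ops_per_second / clock_hz


PENTIUM_OPS, PENTIUM_CLOCK = 200e6, 120e6    # 200 MOPS at 120 MHz
TRIMEDIA_OPS, TRIMEDIA_CLOCK = 4e9, 100e6    # up to 4 BOPS at 100 MHz
VLIW_SLOTS = 5                               # operations per VLIW instruction
DSP_TO_CPU_OPS = 3                           # rule of thumb: 1 DSP op = 3 CPU ops

raw_ratio = TRIMEDIA_OPS / PENTIUM_OPS
cpu_equivalent_ratio = raw_ratio * DSP_TO_CPU_OPS
trimedia_per_cycle = ops_per_cycle(TRIMEDIA_OPS, TRIMEDIA_CLOCK)
media_ops_per_slot = trimedia_per_cycle / VLIW_SLOTS

if __name__ == "__main__":
    print(f"Pentium ops/cycle:  {ops_per_cycle(PENTIUM_OPS, PENTIUM_CLOCK):.2f}")
    print(f"TriMedia ops/cycle: {trimedia_per_cycle:.0f}")
    print(f"Media ops per VLIW slot: {media_ops_per_slot:.0f}")
    print(f"Raw throughput ratio: {raw_ratio:.0f}x")
    print(f"With the 3:1 rule:    {cpu_equivalent_ratio:.0f}x")
```

On these peak numbers the raw gap is 20x, or about 60x once the three-to-one rule is applied, which is the same order of magnitude as the rough hundredfold figure. The 40 operations per cycle also imply around eight media operations packed into each of the five VLIW slots at peak.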

Significantly, these TriMedia processing speeds are being quoted by Philips for C code. In fact, all performance optimization is done in C. Unlike prior generations of DSPs, no assembly or machine language is required with the TriMedia.

Remarkably, the projected price for the TriMedia chip is less than $50 in quantity. In contrast, a Pentium costs hundreds of dollars. But perhaps most stunning from a PC user's point of view is that Philips says the capabilities of the TriMedia chip can be transparently accessed by applications written to Microsoft's Windows 95 DirectX multimedia APIs.

As a consequence, if the PC has a TriMedia chip on board, the user will suddenly get a tremendous speedup in multimedia processing -- No software upgrade required. If all this works out as planned, the TriMedia might be a rare case of the proverbial, almost free lunch for PC users.

By the way, Philips is planning to use the TriMedia as a co-processor in a new type of CD-i/multimedia/TV set top box. This is the 'Magic Carpet' system, which also utilizes the Silicon Graphics Inc./MIPS R4X00 CPU. Interestingly, the Magic Carpet will also feature the new Web/3D graphics API being developed by the SGI/Sun/Netscape troika. This awesome new Philips box will supposedly sell for around $500.

If an A/V consumer electronics vendor like Philips (who is also a key industry player in the new DVD system, along with Sony, and Toshiba/Warner) were to follow the PC industry example, we might have:
1) A standardized system box with Internet/high speed modem TV cable connections;
2) sporting the new 32 bit, 33 MHz, 132 megabyte/sec PCI (peripheral component interconnect) local bus now being adopted by PC manufacturers, including Apple;
3) into which PCI bus could be added standardized, plug and play system cards offering specialized or enhanced functions;
4) administered by a blazingly fast DSP-style CPU;
5) executing a standardized real time operating system;
6) all selling for several thousand dollars less than current WinTel PCs.

And doesn't such a low cost unit fit the description of the 'Net Appliance we have been hearing so much about as of late?
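As a side note, the 132 megabyte/sec figure quoted for PCI is simply the bus width multiplied by the clock rate. A minimal Python check (the function name is illustrative, and this is the theoretical peak burst rate, not sustained throughput):

```python
def peak_bus_mbytes_per_sec(width_bits: int, clock_hz: float) -> float:
    """Theoretical peak bus transfer rate in megabytes per second."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_hz / 1e6


if __name__ == "__main__":
    # 32 bit bus clocked at 33 MHz
    print(f"{peak_bus_mbytes_per_sec(32, 33e6):.0f} megabytes/sec")
```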

Such a new PC-style systems approach to A/V electronics would likely mean an end to the contrived functional obsolescence treadmill that consumers are now forced to run on.

So stay tuned. The WCW spectacles will soon face stiff viewer competition as the PC and consumer industries wrestle each other down to the TV mat.


Copyright 1996, All Rights Reserved, Franco Vitaliano

21st, The VXM Network,