The Cell Broadband Engine (Cell BE) processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), developed jointly by Sony, Toshiba, and IBM. In addition to its use in the upcoming Sony PlayStation® 3 console, there is a great deal of interest in Cell BE-based workstations, media-rich consumer-electronics devices, video and image processing systems, and several other emerging applications.
The Cell BE includes one POWER™ Processing Element (PPE) and eight Synergistic Processing Elements (SPEs). The Cell BE architecture is designed to be well-suited for a wide variety of programming models, and allows for partitioning of work between the PPE and the eight SPEs. This paper shows that Cell BE can outperform other state-of-the-art processors by approximately an order of magnitude, or more in some cases.
Introduction
Until recently, improvements in the performance of general-purpose processor systems came primarily from higher processor clock frequencies, wider-issue superscalar designs, and deeper super-pipelines. However, without a commensurate increase in memory speed, these approaches only increased memory latencies and required even more complex logic to hide them. Further, because such complex cores cannot sustain a large number of concurrent accesses to memory, they end up under-utilizing both their execution pipelines and the memory bandwidth. Invariably, the end results are poor utilization of the chip area and a disproportionate increase in power dissipation relative to overall performance.
The approach taken by the Cell BE design was to focus on improving the performance/area and performance/power ratios (see Introduction to the Cell Multiprocessor, listed in Resources). These goals are largely achieved by using powerful yet simple cores that use area more efficiently with less power dissipation. Supported by an interconnect with high data bandwidth, these cores can work both independently and cooperatively. And by supporting a large number of simultaneous memory accesses from the cores, the memory bandwidth is used more efficiently as well. The design philosophy is somewhat similar to the recent trend of placing multiple general-purpose cores on the same chip; in the Cell BE, the cores are just much simpler, yet still powerful.
The following section, Cell BE overview and its performance characteristics, introduces the performance characteristics of the Cell BE, focusing on the PPE, the SPEs, the Element Interconnect Bus (EIB), the XDR memory, and the I/O interfaces (IOIFs). The final section, Application examples and their performance characteristics, characterizes the performance of several applications that exploit the Cell BE processor features and compares the results to other general-purpose processors.
Cell BE overview and its performance characteristics
Figure 1. Cell BE Processor Block Diagram

Figure 1 shows a high-level view of the first implementation of Cell BE. It includes a general-purpose 64-bit POWER Processing Element (PPE). In addition, the Cell BE incorporates eight Synergistic Processing Elements (SPEs) interconnected by a high-speed, memory-coherent Element Interconnect Bus (EIB). This initial implementation of Cell BE is targeted to run at 3.2GHz.
The SIMD units on the eight SPEs provide the majority of the computational power of the Cell BE. When using single-precision floating-point fused multiply-add instructions, the eight SPEs in the first-generation Cell BE chip can perform a total of 64 floating-point operations per cycle.
The integrated memory controller (MIC) provides a peak bandwidth of 25.6GB/s to external XDR memory, while the integrated I/O controller provides peak bandwidths of 25GB/s inbound and 35GB/s outbound. The EIB supports a peak bandwidth of 204.8GB/s for intra-chip data transfers among the PPE, the SPEs, and the memory and I/O interface controllers.
The POWER Processing Element (PPE)
The PPE consists of a POWER Processing Unit (PPU) connected to a 512KB L2 cache. The PPE is the main processor of the Cell BE, and is responsible for running the operating system and coordinating the SPEs. The key design goals of the PPE are to maximize the performance/power ratio as well as the performance/area ratio. The PPU is a dual-issue, in-order processor with dual-thread support. A pipeline timing diagram, as detailed in Introduction to the Cell Multiprocessor (see Resources for a link), is shown in Figure 2.
Figure 2. The PPU pipeline

The PPE core can fetch four instructions at a time, and issue two. In order to improve performance from its in-order pipeline, the PPE utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions. This allows the PPE to get some of the advantages of out-of-order execution without any significant increase in complexity. We do not focus on the PPE in this paper since most of the algorithms presented here do not utilize the PPE.
SPE performance
The SPE is a modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). An SPU is a compute engine with SIMD support and 256KB of dedicated local storage. The MFC contains a DMA controller with an associated MMU, as well as an Atomic Unit to handle synchronization operations with other SPUs and the PPU.
An SPU is a dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations. The SPU operates directly on instructions and data from its dedicated local store, and relies on a channel interface to access the main memory and other local stores. The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with the program execution.
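As a concrete illustration of the channel interface, the following minimal sketch shows an SPU-side DMA get, assuming the Cell SDK's spu_mfcio.h interface; the buffer and effective address are illustrative:

```c
#include <spu_mfcio.h>

/* Illustrative, 128B-aligned local-store buffer. Transfer sizes and
 * addresses should be at least 16B-aligned (128B-aligned for best EIB
 * efficiency), and a single DMA command may move at most 16KB. */
static volatile char buffer[4096] __attribute__((aligned(128)));

void fetch_from_main_memory(unsigned long long ea)
{
    const unsigned int tag = 0;          /* DMA tag group, 0..31 */

    /* Enqueue a get (effective address -> local store) on the MFC.
     * The SPU keeps executing while the MFC moves the data. */
    mfc_get(buffer, ea, sizeof(buffer), tag, 0, 0);

    /* ...independent computation can overlap with the transfer... */

    /* Block until all outstanding transfers in this tag group finish. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}
```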
Figure 3. SPE block diagram

The SPU's SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6GFLOPS in single precision. Figure 3 shows the main functional units in an SPU: (1) an SPU floating-point unit for single-precision, double-precision, and integer multiplies; (2) an even SPU fixed-point unit for arithmetic, logical operations, and word shifts; (3) an odd SPU fixed-point unit for permutes, shuffles, and quadword rotates; (4) an SPU control unit for instruction sequencing and branch execution; (5) an SPU local store unit for loads and stores, and for supplying instructions to the control unit; and (6) an SPU channel/DMA transport, which controls input and output through the MFC.
As Figure 3 shows, each functional unit is assigned to one of the two execution pipelines. The floating point and fixed point units are on the even pipeline while the rest of the functional units are on the odd pipeline. The SPU can issue and complete up to two instructions per cycle, one on each of the execution pipelines. A dual issue occurs when a group of fetched instructions has two issueable instructions, one of which is executed by a unit on the even pipeline and the other executed by a unit on the odd pipeline.
There are three types of instruction fetches: flush-initiated fetches, inline prefetches, and hint fetches. To fetch instructions, the instruction fetch logic reads 32 instructions at a time into its instruction line buffer (ILB), from which two instructions at a time are sent to the issue logic. When the operands are ready, the issue logic sends the instructions to the functional units for execution. Functional unit pipelines vary from two to seven cycles. Hint instructions can preload instructions into the ILB.
Features such as deterministic local store (LS) access time, simple issue rules, software-inserted branch hints, and a large register file are exposed to the compiler and applications for performance tuning. With some tuning effort, we have seen a wide variety of applications approach the theoretical IPC of 2 in the SPU. The SPEs' DMA support also delivers data-streaming bandwidth much higher than that of many modern processors.
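To make the SIMD widths above concrete, here is a minimal sketch using the SPU C-language intrinsics (assuming the spu_intrinsics.h interface); each call compiles to a single SIMD instruction:

```c
#include <spu_intrinsics.h>

/* Four-way single-precision fused multiply-add: a*x + y on all four
 * lanes at once, i.e. eight floating-point operations per instruction. */
vector float fma4(vector float a, vector float x, vector float y)
{
    return spu_madd(a, x, y);
}

/* Four-way 32-bit integer addition on the even fixed-point pipeline. */
vector signed int add4(vector signed int a, vector signed int b)
{
    return spu_add(a, b);
}
```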
The Element Interconnect Bus
The Element Interconnect Bus (EIB) in the Cell BE allows for communication among the PPE, the SPEs, the off-chip memory, and the external I/O (see Figure 4). The EIB consists of one address bus and four 16B-wide data rings, two of which run clockwise and the other two counter-clockwise. Each ring can potentially allow up to three concurrent data transfers as long as their paths do not overlap. The EIB operates at half the speed of the processor.
Each requestor on the EIB starts with a small number of initial command credits to send out requests on the bus. The number of credits is the size of the command buffer inside the EIB for that particular requestor. One command credit is used for each request on the bus. When a slot becomes open in the command buffer as a previous request makes progress in the EIB request pipeline, the EIB returns the credit back to the requestor.
Figure 4. The EIB and the connected units

When a requestor needs a data ring, it makes a single request to the EIB data bus arbiter. The arbiter arbitrates among multiple requestors and decides which data ring is granted to which requestor and when. The memory controller is given the highest priority to prevent stalling of requestors waiting on read data, while all others are treated equally with round-robin priority. Any bus requestor can use any of the four rings to send or receive data. The data arbiter does not grant a data ring to a requestor if the transfer would cross more than halfway around the ring on its way to the destination, or would interfere with a data transfer already in progress.
Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2GHz is 128Bx1.6GHz = 204.8GB/s.
The sustained data bandwidth on the EIB will, however, be lower than the peak bandwidth due to several factors: the locations of the destination and the source relative to each other, the potential for a new transfer to interfere with those already in progress, the number of Cell BE chips in the system, whether the data transfers are to or from memory or between local stores in the SPEs, and the efficiency of the data arbiter.
Reduced bus bandwidth can result when, for example: (1) all requestors access the same destination, such as the memory or the same local store, at the same time; (2) all transfers are in the same direction, leaving two of the four data rings idle; (3) a large number of partial cache-line transfers lowers the bus efficiency; or (4) data transfers are six hops long, preventing the units along the way from using the same ring.
We ran a series of experiments on the hardware in our lab, using a core frequency of 3.2GHz. In our experiments, all four pairs of SPEs did streaming reads or writes to each other's local stores. For instance: SPE0 reads from SPE2, SPE4 from SPE6, SPE1 from SPE3, and SPE5 from SPE7, and vice versa in each case. Table 1 below shows the results from a few of the runs. We use the notation SPEx <-> SPEy to indicate that SPEx reads from SPEy's local store, and vice versa.
Table 1. Sustained EIB bandwidth achieved for some SPE-to-SPE DMA transfers
Test configuration | Aggregate EIB BW @ 3.2GHz |
---|---|
SPE1 <-> SPE3, SPE5 <-> SPE7, SPE0 <-> SPE2, SPE4 <-> SPE6 | 186 GB/s |
SPE0 <-> SPE4, SPE1 <-> SPE5, SPE2 <-> SPE6, SPE3 <-> SPE7 | 197 GB/s |
SPE0 <-> SPE1, SPE2 <-> SPE3, SPE4 <-> SPE5, SPE6 <-> SPE7 | 197 GB/s |
SPE0 <-> SPE3, SPE1 <-> SPE2, SPE4 <-> SPE7, SPE5 <-> SPE6 | 197 GB/s |
SPE0 <-> SPE7, SPE1 <-> SPE6, SPE2 <-> SPE5, SPE3 <-> SPE4 | 78 GB/s |
SPE0 <-> SPE5, SPE1 <-> SPE4, SPE2 <-> SPE7, SPE3 <-> SPE6 | 95 GB/s |
SPE0 <-> SPE6, SPE1 <-> SPE7, SPE2 <-> SPE4, SPE3 <-> SPE5 | 197 GB/s |
The sustained effective EIB data bandwidth in our experiments varied from 78GB/s (38% of peak) to 197GB/s (96% of peak). In the 78GB/s case, the communicating SPEs are farthest apart (see Figure 4), and only one transfer happens on each of the four rings. In the 95GB/s case, the communicating SPEs are five hops away from each other, which still prevents other transfers from taking place because of path overlap. In the remaining cases the achieved bandwidth is close to the peak of 204.8GB/s.
The memory subsystem
The memory interface controller (MIC) in the Cell BE chip is connected to external RAMBUS XDR memory through two XIO channels that operate at a maximum effective frequency of 3.2GHz (400MHz, octal data rate). Each of the two RAMBUS channels can have eight banks operating concurrently and supports up to 256MB, for a total of 512MB.
The MIC has separate read and write request queues for each XIO channel, and the channels operate independently. For each channel, the MIC arbiter alternates dispatch between the read and write queues after at least eight dispatches from one queue, or as soon as that queue becomes empty, whichever comes first. High-priority read requests take precedence over normal reads and writes.
Writes of sizes 16B or more, but less than 128B, can be written directly to memory using a masked-write operation, but writes smaller than 16B need a read-modify-write operation. Due to a small number of buffers for read-modify-write operations, the read part of the read-modify-write operation is given a higher priority over normal reads, while the write part is given a higher priority over normal writes.
Other performance-enhancing features in the MIC include a fast-path mode where an incoming request with the request queues empty is dispatched immediately to save a few cycles of latency for reads, and a speculative read mode where reads can be dispatched even before the combined response is received from the EIB.
With both XIO channels operating at 3.2GHz, the peak raw memory bandwidth is 25.6GB/s. However, normal memory operations such as refresh, scrubbing, and so on, typically reduce the bandwidth by about 1GB/s. The peak bandwidth assumes that all the banks are kept active all the time by the incoming request streams, and requests are all of the same type (read or write), and each is exactly 128B in size. If streaming reads and writes are intermingled, the effective bandwidth can be reduced to about 21GB/s; the bandwidth loss in this case arises from the overhead of turning around the MIC-to-XIO bidirectional bus.
Finally, the Cell BE also implements resource allocation to provide controlled access to the critical shared resources such as the memory or the I/O links. This controlled access allows time-critical applications to meet timing targets by preventing contention for the shared resources.
Flexible I/O Interface
There are seven transmit and five receive RAMBUS RRAC FlexIO links that are 1B wide each. These links can be configured as two logical interfaces. With the FlexIO links operating at 5GHz, the I/O interface provides a raw bandwidth of 35GB/s outbound and 25GB/s inbound. A typical configuration may have one I/O interface configured with raw bandwidths of 30GB/s outbound and 20GB/s inbound; and the other I/O interface with raw bandwidths of 5GB/s outbound and 5GB/s inbound.
Data and commands on the I/O interface are transmitted as packets. In addition to the command, response, and data, each packet may carry information such as the data tag, data size, command id, and flow control information, as well as other information. Because of these overheads, and non-optimal arrival times of data and commands, the effective bandwidth on the two interfaces is typically lower, and ranges from 50% to 80% of the raw bandwidth. Of course, other factors such as the prevailing data traffic on the EIB, resource allocation, speed of the I/O devices, ordering characteristics of the I/O data traffic, interrupts, and so on, can potentially reduce the I/O bandwidth further.
Application examples and their performance characteristics
In this section, we present the results for a number of applications that showcase the performance of the Cell BE chip. The applications cover a wide range: Matrix Multiplication, Linpack, MPEG-2 video decoding, Triangle Transform and Lighting, and cryptography algorithms such as AES, TDES, MD5, and so on.
Some of the performance results in this section were obtained from our cycle-accurate SPEsim performance model, an internal tool that provides a great deal of insight into pipeline behavior. In other cases, the results were measured on the hardware, without operating-system support or overheads. The hardware results indicate that the simulator is typically more than 98% accurate for compute-intensive code; in cases with significant memory activity, such as Linpack, the modeling accuracy can drop to about 88% due to simplifying assumptions in the EIB and memory-system model.
Optimization of Matrix Multiplication
Our matrix multiplication program calculates C = A x B, where A, B, and C are square matrices of order N. Each element cij of C is computed as follows:
\[ c_{ij} = \sum_{k=1}^{N} a_{ik}\, b_{kj} \]
A well-known optimization to reduce the required memory bandwidth is to partition the matrices into smaller square matrices of order M < N:
\[
A = \begin{pmatrix} A_{11} & \cdots & A_{1q} \\ \vdots & \ddots & \vdots \\ A_{q1} & \cdots & A_{qq} \end{pmatrix},
\quad
B = \begin{pmatrix} B_{11} & \cdots & B_{1q} \\ \vdots & \ddots & \vdots \\ B_{q1} & \cdots & B_{qq} \end{pmatrix},
\quad
C = \begin{pmatrix} C_{11} & \cdots & C_{1q} \\ \vdots & \ddots & \vdots \\ C_{q1} & \cdots & C_{qq} \end{pmatrix},
\quad q = N/M
\]
where Aij, Bij and Cij are square matrices of order M. Each matrix block Cij is then solved by:
\[ C_{ij} = \sum_{k=1}^{N/M} A_{ik}\, B_{kj} \]
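In plain scalar C, before any SPE-specific tuning, the blocked algorithm can be sketched as follows; the order N and block size M are illustrative, M is assumed to divide N, and C is assumed to be zero-initialized by the caller:

```c
/* Block-partitioned C = A x B for square matrices of order N, using
 * blocks of order M. A sketch only: N and M are illustrative, M must
 * divide N, and C must be zeroed before the call. */
enum { N = 256, M = 64 };

void matmul_blocked(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int bi = 0; bi < N; bi += M)
        for (int bj = 0; bj < N; bj += M)
            for (int bk = 0; bk < N; bk += M)
                /* C_block(bi,bj) += A_block(bi,bk) * B_block(bk,bj) */
                for (int i = bi; i < bi + M; i++)
                    for (int j = bj; j < bj + M; j++) {
                        float s = C[i][j];
                        for (int k = bk; k < bk + M; k++)
                            s += A[i][k] * B[k][j];
                        C[i][j] = s;
                    }
}
```

Because each block of C is computed from a small working set of A and B blocks, the blocks can be sized to fit in an SPE's 256KB local store.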
The first optimization for the SPE architecture was to take advantage of the four-way SIMD pipeline, using 32-bit multiply-and-add operations to perform up to eight single-precision floating-point operations per cycle.
The next optimization was to take advantage of DMA engines in SPE by adopting a double-buffer approach to overlap computations on the previously fetched data blocks with transfers of subsequent data blocks to and from memory. This effectively removes memory stalls.
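A sketch of this double-buffering pattern, again assuming the Cell SDK's spu_mfcio.h interface; the block size and the process_block() routine are hypothetical placeholders:

```c
#include <spu_mfcio.h>

#define BLOCK_BYTES 16384   /* 16KB: the maximum size of one DMA command */

/* Two 128B-aligned local-store buffers, ping-ponged between DMA and compute. */
static volatile char buf[2][BLOCK_BYTES] __attribute__((aligned(128)));

extern void process_block(volatile char *block, unsigned int nbytes);

void stream_blocks(unsigned long long ea, unsigned int nblocks)
{
    unsigned int cur = 0;

    /* Prime the pipeline: start fetching block 0 into buffer 0 (tag 0). */
    mfc_get(buf[0], ea, BLOCK_BYTES, 0, 0, 0);

    for (unsigned int n = 0; n < nblocks; n++) {
        unsigned int next = cur ^ 1;

        /* Kick off the DMA for the next block before computing. */
        if (n + 1 < nblocks)
            mfc_get(buf[next],
                    ea + (n + 1) * (unsigned long long)BLOCK_BYTES,
                    BLOCK_BYTES, next, 0, 0);

        /* Wait only for the current buffer's tag group, then compute on
         * it while the next transfer proceeds in the background. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur], BLOCK_BYTES);

        cur = next;
    }
}
```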
Additional optimizations, such as overlapping the execution of blocks, loop interchange, and software pipelining, were applied to balance the usage of the two SPU pipelines and maximize the dual-issue rate. With these tuning efforts, the matrix multiplication program on a single SPU improved by almost 60x over the original scalar code running on an SPU.
Table 2. Performance of Matrix Multiplication on one SPU
Code version | # of Cycles | # of Insts. | CPI | Dual Issue | Channel Stalls | Other Stalls | # of Used Registers | GFLOPS |
---|---|---|---|---|---|---|---|---|
Original (scalar) | 258.9M | 247.1M | 1.05 | 26.1% | 11.4% | 26.3% | 47 | 0.42 |
SIMD optimized | 9.78M | 13.8M | 0.711 | 40.3% | 3.0% | 9.8% | 60 | 10.96 |
SIMD + dbl buf | 9.68M | 13.6M | 0.711 | 41.4% | 2.6% | 10.2% | 65 | 11.12 |
Optimized code | 4.27M | 8.42M | 0.508 | 80.1% | 0.2% | 0.4% | 69 | 25.12 |
The data presented in Table 2 above was obtained from a cycle-accurate SPEsim simulator. The accuracy of the results from our SPEsim simulator for matrix multiplication (with a matrix size of 256x256), as shown in the table below, is higher than 99%.
Number of SPUs | 1 |
---|---|
SPEsim (GFLOPS) | 25.12 |
Hardware (GFLOPS) | 25.01 |
Accuracy (%) | 99.6% |
Since operations in each data block are independent from those in other blocks, the matrix multiplication algorithm is easily parallelized to all eight SPUs. Figure 5 shows that the matrix multiplication performance increases almost linearly with the number of SPUs, especially with large matrix sizes. Using eight SPUs, the parallel version of matrix multiplication achieves 201GFLOPS, very close to the theoretical maximum of 204.8GFLOPS.
Figure 5. Performance of parallelized matrix multiplication for different sized matrices

Assuming that matrix multiplication can achieve its peak single-precision floating point capability, a Pentium4 with SSE3 at 3.2GHz can achieve 25.6GFLOPS, while the Cell BE could achieve 201GFLOPS, better by almost a factor of eight.
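On the PPE side, running such a kernel on all eight SPEs can be sketched with the libspe2 API, as below; matmul_spu is a hypothetical embedded SPU program handle, and the per-SPE work assignment (which block rows of C each SPE computes) is elided:

```c
#include <libspe2.h>
#include <pthread.h>

/* Hypothetical SPU program handle embedded by the SPU toolchain. */
extern spe_program_handle_t matmul_spu;

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    /* Blocks in this pthread until the SPU program stops. */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    return NULL;
}

int main(void)
{
    enum { NUM_SPES = 8 };
    spe_context_ptr_t ctx[NUM_SPES];
    pthread_t thread[NUM_SPES];

    for (int i = 0; i < NUM_SPES; i++) {
        ctx[i] = spe_context_create(0, NULL);
        spe_program_load(ctx[i], &matmul_spu);
        /* One hosting pthread per SPE context. */
        pthread_create(&thread[i], NULL, run_spe, ctx[i]);
    }
    for (int i = 0; i < NUM_SPES; i++) {
        pthread_join(thread[i], NULL);
        spe_context_destroy(ctx[i]);
    }
    return 0;
}
```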
Optimization of Linpack
Linpack solves a dense system of linear equations Ax = b, where A is a matrix of order N, and x and b are single dimension arrays of size N. Matrix multiplication is a key part of Linpack. The Linpack implementation on the Cell BE was based upon the Block Partitioned Algorithm for LU Factorization (see Software Libraries for Linear Algebra Computations on High Performance Computers listed in Resources).
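To make the block-partitioned structure concrete, below is our own illustrative C sketch of a right-looking blocked LU factorization (pivoting is omitted for brevity; the real Linpack performs partial pivoting). The trailing update in step 3 is a matrix multiply over the remaining submatrix, which is why the matmul kernel dominates the run time:

```c
/* In-place blocked LU, A = L*U, for a square matrix of order N with
 * block size M dividing N. Illustrative sketch only: no pivoting. */
enum { N = 1024, M = 64 };

void block_lu(float A[N][N])
{
    for (int k = 0; k < N; k += M) {
        int kend = k + M;

        /* 1. Unblocked LU of the diagonal block A11. */
        for (int p = k; p < kend; p++)
            for (int i = p + 1; i < kend; i++) {
                A[i][p] /= A[p][p];
                for (int j = p + 1; j < kend; j++)
                    A[i][j] -= A[i][p] * A[p][j];
            }

        /* 2a. U12 = inv(L11) * A12 (forward substitution). */
        for (int p = k; p < kend; p++)
            for (int i = p + 1; i < kend; i++)
                for (int j = kend; j < N; j++)
                    A[i][j] -= A[i][p] * A[p][j];

        /* 2b. L21 = A21 * inv(U11). */
        for (int p = k; p < kend; p++)
            for (int i = kend; i < N; i++) {
                A[i][p] /= A[p][p];
                for (int j = p + 1; j < kend; j++)
                    A[i][j] -= A[i][p] * A[p][j];
            }

        /* 3. Trailing update A22 -= L21 * U12: a matrix multiply that
         *    dominates the work and maps onto the SPE matmul kernel. */
        for (int i = kend; i < N; i++)
            for (int j = kend; j < N; j++) {
                float s = A[i][j];
                for (int p = k; p < kend; p++)
                    s -= A[i][p] * A[p][j];
                A[i][j] = s;
            }
    }
}
```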
Linpack single precision floating point performance
The initial implementation of single-precision Linpack was a scalar version. When run on the cycle-accurate SPEsim simulator with a matrix size of 1024x1024 using partitioned blocks of 64x64, it achieved only 0.26GFLOPS with a 3.2GHz SPE.
With optimizations added to hide the memory latency by overlapping data transfers with computation using double buffers, as well as maximizing dual issue with optimal instruction scheduling, the final optimized version of the program achieved 16.5GFLOPS on a 1Kx1K matrix, 22.0GFLOPS on a 4Kx4K matrix, and 23.5GFLOPS on an 8Kx8K matrix (Table 3).
Table 3. Performance of optimized Linpack on a single SPU
Code | Matrix size | # of Cycles | # of Insts. | CPI | Dual Issue | Channel Stalls | Other Stalls | # of Regs | GFLOPS | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|
Orig. | 1024x1024 | 9.11G | 6.57G | 1.39 | 16.0% | 1.9% | 50% | 32 | 0.26 | 1.02% |
Opt. | 1024x1024 | 140M | 205M | 0.68 | 56.7% | 18.1% | 3.9% | 72 | 16.5 | 64.5% |
Opt | 4096x4096 | 6.66G | 11.9G | 0.56 | 71.7% | 6.4% | 1.7% | 68 | 22.0 | 85.9% |
Opt. | 8192x8192 | 50.0G | 94G | 0.53 | 75.8% | 3.8% | 1.0% | 68 | 23.5 | 91.8% |
* Note that there is an approximately 10% error in these SPEsim performance results.
Table 4 shows the Linpack performance when optimized to run on all eight SPEs. Note that the computational efficiency improved significantly with the larger matrix size. Eight SPUs running Linpack at 3.2GHz achieves 155.5GFLOPS on a 4Kx4K matrix.
Table 4. Performance of parallelized Linpack on eight SPUs
Matrix size | Cycles | # of Insts. | CPI | Single Issue | Dual Issue | Channel Stalls | Other Stalls | # of Used Regs | SPEsim (GFLOPS) | Measured (GFLOPS) | Model accuracy | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1024x1024 | 27.6M | 2.92M | 0.95 | 27.9% | 32.6% | 26.9% | 12.6% | 126 | 83.12 | 73.04 | 87.87% | 35.7% |
4096x4096 | 918.0M | 1.51G | 0.61 | 29.0% | 56.7% | 10.8% | 3.4% | 126 | 160 | 155.5 | 97.2% | 75.9% |
Table 5 compares the performance results of parallel Linpack from the SPEsim and the hardware.
Table 5. Performance of parallelized Linpack on SPEsim and hardware
Number of SPUs | 1 | 2 | 3 | 4 | 5 | 8 | 8 |
---|---|---|---|---|---|---|---|
Matrix size | 1kx1k | 1kx1k | 1kx1k | 1kx1k | 1kx1k | 1kx1k | 4kx4k |
SPEsim (GFLOPS) | 16.03 | 32.79 | 45.62 | 56.33 | 64.94 | 83.12 | 160 |
Hardware (GFLOPS) | 14.94 | 30.46 | 42.04 | 50.8 | 58.21 | 73.04 | 155.5 |
Model Accuracy | 93.20% | 92.89% | 92.15% | 90.18% | 89.64% | 87.87% | 97.19% |
Efficiency | 58.36% | 59.49% | 54.74% | 49.61% | 45.48% | 35.66% | 75.93% |
As can be seen from Table 5, the SPEsim results for Linpack are off by 7% to 12%, except in the 4Kx4K case, because the model does not capture all the complexities of the DMA and memory system.
Linpack double precision floating point performance
Although the SPU's double-precision (DP) floating-point performance is not as high as its single-precision performance, it is still respectable. Each SPU is capable of executing two DP operations every seven cycles. With fused multiply-add, an SPU can achieve a peak of 1.83GFLOPS at 3.2GHz. With eight SPUs and fully pipelined DP floating-point support in the PPE's VMX, the Cell BE is capable of a peak of 21.03GFLOPS in DP floating point, compared to a peak of 230.4GFLOPS in single precision.
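As a quick sanity check on the per-SPU figure, treating each two-way DP fused multiply-add as four floating-point operations completed every seven cycles:

\[
3.2\ \mathrm{GHz} \times \frac{2\ \mathrm{ops} \times 2\ \mathrm{flops/FMA}}{7\ \mathrm{cycles}} \approx 1.83\ \mathrm{GFLOPS\ per\ SPU}
\]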
The initial implementation of double-precision Linpack was a scalar version. When run on the SPEsim with a matrix size of 1024x1024, using partitioned blocks of 64x64, it achieved 0.27GFLOPS on a 3.2GHz SPE. With additional tuning, the final version achieved 1.55GFLOPS on a 4Kx4K matrix, or about 85% of peak.
Table 6. Performance of double-precision FP Linpack on one SPU
Code | Matrix size | # of Cycles | # of Insts. | CPI | Dual Issue | Channel Stalls | Other Stalls | Used Regs | GFLOPS @3.2GHz | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|
Orig. | 1Kx1K | 8.46G | 4.91G | 1.72 | 9.5% | 3.00% | 58.0% | 33 | 0.27 | 14.88% |
Opt. | 1Kx1K | 1.57G | 466M | 3.36 | 2.8% | 0.80% | 74.1% | 74 | 1.46 | 80.06% |
Opt. | 4Kx4K | 94.5G | 26.0G | 3.63 | 1.9% | 0.20% | 75.8% | 74 | 1.55 | 84.88% |
The double-precision version of Linpack on the Cell BE was also parallelized to run on all eight SPUs. Table 7 summarizes Linpack performance parallelized to eight SPUs, running on SPEsim.
Table 7. Performance of parallelized double-precision Linpack on eight SPUs
matrix size | # of Cycles | # of Insts. | CPI | Dual Issue | Channel Stalls | Other Stalls | Used Regs | SPEsim GFLOPS | Measured GFLOPS | Model Accuracy | Efficiency |
---|---|---|---|---|---|---|---|---|---|---|---|
1Kx1K | 236.7M | 69.1M | 3.42 | 2.9% | 6.7% | 68.5% | 128 | 9.704 | 9.46 | 97.49% | 64.66% |
2Kx2K | 1.64G | 44.9M | 3.65 | 2.2% | 3.3% | 72.5% | 128 | 11.184 | 11.05 | 98.80% | 75.53% |
Table 8. Comparison of SPEsim and hardware performance results
Number of SPUs (1kx1k matrix) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
SPEsim (GFLOPS) | 1.46 | 2.84 | 4.15 | 5.39 | 6.56 | 7.66 | 8.67 | 9.71 |
Hardware (GFLOPS) | 1.45 | 2.81 | 4.11 | 5.32 | 6.46 | 7.52 | 8.51 | 9.46 |
Model Accuracy | 99.14% | 98.82% | 99.12% | 98.79% | 98.52% | 98.12% | 98.21% | 97.45% |
Efficiency | 79.23% | 76.78% | 74.86% | 72.68% | 70.60% | 68.49% | 66.43% | 64.62% |
Table 8 shows the results measured on hardware as compared to the SPEsim model. The modeling accuracy for DP Linpack is greater than 97% in all cases. With a 1024x1024 matrix, the computational efficiency of parallel Linpack is greater than 64% when running on all eight SPUs. The best parallel Linpack DP floating-point result measured from hardware is 11.82GFLOPS for a 4096x4096 matrix, with an efficiency of 81%. The Linpack performance of IA-32 and IA-64 machines ("Survey of High Performance Machines"; see Resources) is summarized in Table 9 along with the SPU results.
Table 9. Comparison of Linpack performance between Cell BE and other processors
Linpack 1kx1k (DP) | Peak GFLOPS | Actual GFLOPS | Efficiency |
---|---|---|---|
SPU, 3.2GHz | 1.83 | 1.45 | 79.23% |
8 SPUs, 3.2GHz | 14.63 | 9.46 | 64.66% |
Pentium4, 3.2GHz | 6.4 | 3.1 | 48.44% |
Pentium4 + SSE3, 3.6GHz | 14.4 | 7.2 | 50.00% |
Itanium, 1.6GHz | 6.4 | 5.95 | 92.97% |
The Linpack implementation on the Cell BE has the highest DP floating point performance in the chart, exceeding the most recent IA-32 and IA-64 machines available today.
Optimization of MPEG-2 video decoding
The Cell BE is primarily targeted at game applications (see Introduction to the Cell Multiprocessor), some of which demand high video-processing capability. Consequently, significant effort was devoted to optimizing MPEG-2 video decoding. Figure 6 shows the major components of an MPEG-2 video decoder: a Variable Length Decoder (VLD), an Inverse Quantizer (IQ), an Inverse Discrete Cosine Transform (IDCT), a Motion Compensation (MC) stage, and control logic.
The VLD code in an MPEG2 decoder has many branches and needs significant optimization to run well on the SPU. Our optimizations included algorithmic tuning, lookup-table modification, fast bit manipulation with SPU intrinsics, static branch prediction with the gcc compiler's __builtin_expect built-in, function inlining, and global register assignment. Most of these optimizations also reduced the instruction count and eliminated or minimized branch penalties.
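For illustration, here is a hedged sketch of how __builtin_expect marks the rare path in VLD-style code; the escape code and the two decode helpers are hypothetical:

```c
extern int decode_escape(unsigned int bits);  /* hypothetical slow path */
extern int table_lookup(unsigned int bits);   /* hypothetical fast path */
#define ESCAPE_CODE 0x3Fu                     /* hypothetical marker    */

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Biasing the branch toward the table-lookup path lets the compiler
 * arrange the fall-through path so the SPU rarely pays a branch penalty. */
int decode_coefficient(unsigned int bits)
{
    if (unlikely(bits == ESCAPE_CODE))
        return decode_escape(bits);
    return table_lookup(bits);
}
```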
Figure 6. Process flow of MPEG2 decoding

A fast IDCT algorithm (see "A Fast computational Algorithm for the Discrete Cosine Transform," listed in Resources) was adopted to compute an 8x8 2-D IDCT using 512 mpya (multiply-and-add) and 128 add/subtract instructions. The algorithm is regular, which made it easy to maintain the precision the MPEG2 standard requires for 16bx16b multiplication. Most operations in the IDCT were four-way SIMD for the inner product, except the matrix transpose, which was eight-way SIMD.
The Motion Compensation part was implemented primarily with eight-way SIMD, plus some 16-way SIMD. Function inlining and conversion of nested if-else chains to switch-case statements were adopted to eliminate some branches and to improve code scheduling by enlarging the basic-block size.
Motion compensation requires transferring small pixel blocks (for example, 16x16 pixels) from system memory to local store in order to construct the predicted blocks. If a video frame were stored in raster-scan order, motion compensation would require numerous small DMA transfers (for example, 16 transfers of 16 bytes each). In the Cell BE, however, DMA transfers are most efficient when performed on 128B naturally aligned boundaries. To increase efficiency, the data structure was rearranged to put each macroblock in a 384B contiguous area consisting of a 16x16-pixel luminance block and two 8x8-pixel chrominance blocks (in the case of the 4:2:0 format). With this arrangement, a macroblock can be retrieved from main memory to local store with just three 128B DMA transfers.
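The rearranged layout can be sketched as the following C structure (the field names are ours); 256B of luminance plus two 64B chrominance blocks pack into exactly three 128B lines:

```c
/* 4:2:0 macroblock packed contiguously for efficient DMA:
 * 256B luma + 64B Cb + 64B Cr = 384B = 3 x 128B lines. */
typedef struct __attribute__((aligned(128))) {
    unsigned char y[16][16];  /* 256B luminance   */
    unsigned char cb[8][8];   /*  64B chrominance */
    unsigned char cr[8][8];   /*  64B chrominance */
} macroblock_t;
/* sizeof(macroblock_t) == 384 */
```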
Table 10 shows our results from the optimized decoder running on the cycle-accurate SPEsim simulator. As is evident, a single 3.2GHz SPU can easily support real-time MPEG-2 video decoding at any of these resolutions.
Table 10. Single SPU MPEG2 decode performance at different resolutions
Resolution | Bit rate | # of Cycles | # of Insts. | CPI | # of Used Registers | Frames/sec @3.2GHz |
---|---|---|---|---|---|---|
CIF | 1Mbps | 63.4M | 51.9M | 1.22 | 126 | 1514 |
SDTV | 5Mbps | 263M | 220M | 1.20 | 126 | 365 |
SDTV | 8Mbps | 324M | 290M | 1.12 | 126 | 296 |
HDTV | 18Mbps | 1.25G | 1.01G | 1.24 | 126 | 77 |
* Note that there is an approximately 10% error in these SPEsim performance results.
On hardware, our optimized MPEG2 decoder was able to decode about 1379 CIF frames/sec, about 10% below the simulation result. Adjusted for this modeling error, each SPU is capable of decoding about 324 frames/sec from a 5Mbps SDTV video stream. Intel reports that its Pentium4 processor running at 2.8GHz can decode about 310 frames/sec of an SDTV 720x480x30fps video stream when using the highly tuned Intel Performance Primitives (IPP) library (see Resources). With linear scaling to 3.2GHz, this frame rate increases to 354fps. Although differences in the video stream data can change these numbers to some extent, given that the Cell BE has eight SPUs, it should outperform a general-purpose processor with SIMD capability by a fair margin.
Optimization of Triangle Transform and Lighting
Transform-light is a critical stage commonly found in vertex-based graphic pipelines. Because of its structured data, the transform-light implementation on Cell BE can benefit greatly from the SIMD support in the SPU. The optimizations here were primarily focused on loop unrolling, increasing the dual-issue rate, overlapping the DMAs with computation, and avoiding branches.
Because the SPU has many more registers than typical general-purpose processors, our Transform-Light implementation benefited significantly from aggressive loop unrolling.
Table 11. Performance of SPE versus PowerPC® 970 for Transform-Lighting
Loop Unrolling | PPC970 2GHz (Mvtx/sec) | PPC970 2.7GHz (Scaled) (Mvtx/sec) | SPE 3.2GHz (Mvtx/sec) | SPE Advantage |
---|---|---|---|---|
1 | 82.95 | 112 | 139.44 | 1.25 |
2 | 94.8 | 128 | 155.92 | 1.22 |
4 | 89.47 | 121 | 208.48 | 1.73 |
8 | 58.45 | 79 | 217.2 | 2.75 |
For comparative analysis, we implemented a best-effort Transform-Light on a single-core PowerPC 970 (PPC970) system running at 2GHz. The results in Table 11 show that Transform-Light on one SPU performs about 70% better than the PPC970/VMX system when the latter's results are scaled to 2.7GHz. Given that the Cell BE has eight SPUs, it should outperform the PPC970/VMX by a wide margin.
Cryptography performance
Cryptography is a fast-emerging workload, and is increasingly found in many applications. The Cell BE, with its SIMD pipelines and a large register file, can execute cryptography functions efficiently. The code optimization here was focused on aggressive loop unrolling and register-based table lookups to take advantage of the SPU's large register file. Tuning was also done for dual issue, bit permutation, and byte shuffling. Table 12 summarizes the performance results of several key cryptography algorithms and hash functions on a single SPE, compared to a leading brand CPU with SIMD.
Table 12. Performance of a single SPE and another processor for Cryptography
Algorithms | SPE (Gb/sec) | Leading Brand CPU with SIMD (Gb/sec) | SPE Performance Advantage |
---|---|---|---|
AES ECB Encrypt - 128 bit key | 2.059 | 1.029 | 2.002 |
- 192 bit key | 1.710 | 0.877 | 1.950 |
- 256 bit key | 1.462 | 0.762 | 1.918 |
AES CBC Encrypt - 128 bit key | 0.795 | 0.968 | 0.821 |
- 192 bit key | 0.664 | 0.823 | 0.807 |
- 256 bit key | 0.570 | 0.725 | 0.786 |
AES ECB Decrypt - 128 bit key | 1.499 | 1.035 | 1.448 |
- 192 bit key | 1.252 | 0.870 | 1.438 |
- 256 bit key | 1.068 | 0.758 | 1.410 |
AES CBC Decrypt - 128 bit key | 1.507 | 0.966 | 1.560 |
- 192 bit key | 1.249 | 0.829 | 1.507 |
- 256 bit key | 1.066 | 0.724 | 1.472 |
DES - ECB encrypt | 0.492 | 0.426 | 1.156 |
- CBC encrypt | 0.275 | 0.417 | 0.660 |
- ECB decrypt | 0.492 | 0.425 | 1.158 |
- CBC decrypt | 0.489 | 0.421 | 1.162 |
TDES - ECB encrypt | 0.174 | 0.133 | 1.313 |
- CBC encrypt | 0.097 | 0.132 | 0.733 |
- ECB decrypt | 0.174 | 0.133 | 1.313 |
- CBC decrypt | 0.174 | 0.132 | 1.321 |
MD5 | 2.448 | 2.862 | 0.855 |
SHA-1 | 2.116 | 0.902 | 2.347 |
SHA-256 | 0.854 | 0.518 | 1.649 |
Results demonstrate that the cryptography performance on a single SPE is typically better (up to 2.3x at the same frequency) than a leading brand processor with SIMD.
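The register-based table-lookup technique mentioned above can be sketched as follows, assuming the spu_intrinsics.h interface: spu_shuffle selects among the 32 bytes of two registers, so a 32-entry byte table held entirely in registers can service 16 lookups in a single instruction, with no loads:

```c
#include <spu_intrinsics.h>

/* Look up 16 indices (each 0..31) in a 32-entry byte table that is kept
 * in two registers, tbl_lo (entries 0..15) and tbl_hi (entries 16..31).
 * One shuffle-bytes instruction on the odd pipeline does all 16 lookups. */
vector unsigned char lookup32(vector unsigned char idx,
                              vector unsigned char tbl_lo,
                              vector unsigned char tbl_hi)
{
    return spu_shuffle(tbl_lo, tbl_hi, idx);
}
```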
Cell BE performance summary
In "Terrain Rendering Engine (TRE): Cell Broadband Optimized Real-Time Ray-Caster" (see Resources), Minor, et al. showed that their implementation of the Terrain Rendering Engine on the Cell BE runs 50x faster than a 2GHz PPC970 with VMX.
With the Cell BE's innovative micro-architectural features, we showed that a wide variety of algorithms on the Cell BE achieve performance equal to or significantly better than that of a general-purpose processor. Similar results can be expected for other applications that can exploit the SPUs' SIMD pipelines and the PPE's thread-level parallelism. Table 13 summarizes the performance advantage of the Cell BE (or a single SPU) over other processors.
Table 13. Performance comparison of Cell BE versus other processors for different applications
Type | Algorithm | 3.2 GHz GPP | 3.2 GHz Cell BE | Perf Advantage |
---|---|---|---|---|
HPC | Matrix Multiplication (S.P.) | 25.6 Gflops* (w/SIMD) | 200 GFlops (8SPEs) | 8x (8SPEs) |
Linpack (S.P.) 4kx4k | 25.6 GFlops* (w/SIMD) | 156 GFlops (8SPEs) | 6x (8SPEs) | |
Linpack (D.P.) 1kx1k | 7.2 GFlops (3.6GHz IA32/SSE3) | 9.67 GFLops (8SPEs) | 1.3x (8SPEs) | |
graphics | TRE | 0.85 fps (2.7GHz G5/VMX) | 30 fps (Cell BE) | 35x (Cell BE) |
transform-light | 128 MVPS (2.7GHz G5/VMX) | 217 MVPS (one SPE) | 1.7x (one SPE) | |
security | AES ECB encryp. 128b key | 1.03 Gbps | 2.06Gbps (one SPE) | 2x (one SPE) |
AES ECB decryp. 128b key | 1.04 Gbps | 1.5Gbps (one SPE) | 1.4x (one SPE) | |
TDES ECB encryp. | 0.13 Gbps | 0.17 Gbps (one SPE) | 1.3x (one SPE) | |
DES ECB encryp. | 0.43 Gbps | 0.49 Gbps (one SPE) | 1.1x (one SPE) | |
SHA-1 | 0.9 Gbps | 2.12 Gbps (one SPE) | 2.3x (one SPE) | |
video processing | MPEG2 decoder (SDTV) | 354 fps (w/SIMD) | 329 fps (one SPE) | 0.9x (one SPE) |
* Assuming 100% compute efficiency, achieving the theoretical peak of 25.6GFLOPS, in its single-precision Matrix Multiplication and Linpack implementations.
Summary
The Cell BE is a best-of-breed design that delivers significant computational power, high bandwidth, excellent performance per watt, and leading area efficiency within the constraints of its process technology.
With eight decoupled SPU SIMD engines, each with dedicated resources including DMA channels, the Cell BE has eight times more SIMD capability (for up to 16-way data parallelism) than traditional processors with vector architecture extensions, such as the PPC970 in the G5. The PPE is also free to process other applications even while all SPUs are in use. The Cell BE can perform especially well in cases where a general-purpose processor would normally get tied up by a single SIMD application. With its performance density and efficiency, the Cell BE can outperform leading-edge processors on a variety of workloads by approximately an order of magnitude, in some cases more, when software is tuned and optimized for this architecture.
Attributions
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
Resources
Learn
- "Introduction to the Cell Multiprocessor," by J. Kahle, et al., IBM Journal of Research and Development, Vol. 49, No. 4/5, July/Sept. 2005, pp. 589-604.
- The block-partitioned algorithm for LU factorization is discussed in "Software Libraries for Linear Algebra Computations on High Performance Computers," by J.J. Dongarra and D.W. Walker.
- "Survey of High Performance Machines," by J.J. Dongarra of the University of Tennessee / Oak Ridge national Laboratory.
- "A Fast computational Algorithm for the Discrete Cosine Transform;" by W.A. Chen, C. Harrison, S.C. Fralick; published in IEEE Transactions on Communications; Vol. COM-25, No. 9, Sept. 1977 (page 1004; Subscribe to see entire article).
- Intel white paper, "Performance Benchmarks for Intel Integrated Performance Primitives," Intel Corp.
- "Terrain Rendering Engine (TRE): Cell Broadband Optimized Real-Time Ray-Caster," by B. Minor, G. Fossum, V. To, from Proceedings of GPSx conference, October 2005, will soon be posted to the IBM Semiconductor solutions Technical library.
- The Cell BE Broadband Engine project page at IBM Research offers a wealth of links, diagrams, information and articles.
- The IBM Semiconductor Solutions Technical Library Cell Broadband Engine documentation section lists specifications, user manuals, and more.
- Find all Cell BE-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things CBE.
- Keep abreast of all the Cell BE news: subscribe to the Power Architecture Community Newsletter.
Get products and technologies
- Get Cell BE: Contact IBM E&TS for custom Cell BE-based or custom-processor based solutions.
- Get the alphaWorks Cell BE Broadband Engine downloads.
- See all Power Architecture-related downloads on one page.