Graphics Processing Units Paper PDF
GPUs
Róisín Howard (0850896) 18th April 2012
The use of GPUs to carry out general-purpose computing as well as graphics acceleration alongside CPUs in desktop and mobile computing.
based on the Multibus standard. The card accelerated the drawing of lines, arcs, rectangles and character bitmaps, and was based on the 82720 Graphic Display Controller. Direct memory access (DMA) was used to load the framebuffer, which accelerated drawing. It was intended that this board would be used with Intel's line of Multibus industrial single-board computer plug-in cards.[1, 2] Texas Instruments released the first microprocessor with on-chip graphics capabilities, the TMS34010, in 1986. This had a very graphics-oriented instruction set and could also run general-purpose code. The IBM 8514 graphics system was released in 1987 as one of the first video cards to implement fixed-function 2D primitives in electronic hardware for IBM PC compatibles.[1]
B. The purpose of a Graphics Processing Unit
A GPU, or Graphics Processing Unit, manipulates and alters memory so as to accelerate the building of images. A GPU is primarily used for the computation of three-dimensional (3D) functions. Lighting effects, transformations and 3D motion are some of the computations required. These are mathematically intensive tasks which would put a strain on the CPU.[1-3] Embedded systems, mobile phones, personal computers, workstations and game consoles are some of the devices in which GPUs are used. Computer graphics can be manipulated very efficiently by modern GPUs. Due to their highly parallel structure they are more effective than general-purpose CPUs for algorithms where the processing of large blocks of data is done in parallel. More CPU time is freed up for other tasks by using GPUs.[1, 3] In 1999 the term GPU was popularized by nVidia, who marketed the world's first GPU, "a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second"[4], the GeForce 256. This GPU is capable of billions of calculations per second.
It has over 22 million transistors, compared to the 9 million found on the Pentium III. The Quadro is the workstation version, designed for CAD applications. The Quadro can process over 200 billion operations a second and deliver up to 17 million triangles per second.[1, 3]
I. INTRODUCTION
User-programmable Graphics Processing Units (GPUs) for mainstream computing and scientific use are a hot topic in computer architecture. These GPUs are specialized processor systems that accelerate the processing of graphics on both desktops and laptops. OpenCL and CUDA are the main contending languages available for GPU programming. Shading languages such as Cg, HLSL and GLSL are available for programming the GPU's programmable rendering pipeline. NVidia is one of the main companies behind the GPU and the programming of the GPU. The GeForce nVidia GPU card is compatible with many graphics APIs; OpenGL and Microsoft's DirectX are among them. NVidia is also branching into the mobile space with the Tegra chips for smartphones and tablets. Multi-core CPUs are important for multi-tasking and for lowering power consumption. GPGPU, general-purpose computing on GPUs, is a newer concept. Here the GPU is utilized to exploit the data parallelism available in some applications and to perform non-graphics processing. The GPU takes on some of the mathematically intensive tasks, leaving the CPU free to deal with other user tasks.
II. GRAPHICS PROCESSING UNITS
A. The history of GPUs
Intel made the iSBX 275 Video Graphics Controller Multimode Board in 1983. This was for industrial systems
C. The benefits of GPUs
GPUs process large blocks of data in parallel because of their highly parallel structure. The processing of large blocks of data could be in the form of fast sort algorithms on large lists, or 2D fast wavelet transformations. This makes them more effective than general-purpose CPUs. They are used alongside CPUs for this purpose; by performing those mathematically intensive tasks the GPU relieves the strain that would have been put on the CPU, which is freed up to perform other tasks.[1, 5]
III. COMPUTE UNIFIED DEVICE ARCHITECTURE
Figure 1. CUDA Libraries and Tools[8]
NVidia developed the Compute Unified Device Architecture, CUDA, for graphics processing. CUDA is the computing engine in nVidia GPUs. By harnessing the power of the GPU, an increase in computing performance is facilitated. CUDA shares a range of computational interfaces with two competitors, the Khronos Group and Microsoft, whose architectures are OpenCL and DirectCompute.[5, 6] Access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs is given to developers through CUDA. Using CUDA, the latest nVidia GPUs become accessible for computation in the same way as CPUs. However, GPUs have parallel throughput architectures, unlike CPUs, which execute a single thread very quickly; the GPU architecture emphasises executing many threads slowly. Solving general-purpose problems on GPUs in this way is known as GPGPU.[5, 7]
A. Strengths of CUDA
There are several advantages of CUDA over traditional GPGPU using graphics APIs. CUDA offers full support for integer and bitwise operations, including integer texture lookups. Scattered reads are also implemented, meaning that code can read from arbitrary addresses in memory. CUDA has the advantage of faster downloads and readbacks to and from the GPU. A shared memory region is also offered; memory can be shared amongst threads. As a result, a user-managed cache can be availed of, which enables higher bandwidth than is possible using texture lookups. CUDA also supports a wide range of libraries and tools which are of use to developers, Figure 1.[5, 8]
B. CUDA's weaknesses
However, there are also some limitations. CUDA-enabled GPUs are only available from nVidia, unlike OpenCL. A performance hit due to system bus bandwidth and latency may be incurred by copying between host and device memory. Asynchronous memory transfers handled by the GPU's DMA engine can partially alleviate this. Due to the optimisation techniques the compiler is required to employ to use limited resources, valid C/C++ may sometimes be flagged and prevented from compiling.[5]
C. CUDA programming model
In the CUDA programming model the CPU is known as the host and the GPU is the compute device; it is the coprocessor to the CPU. Data needs to be shared between both devices as they each have their own memory. The kernel is an application or a program that runs on the GPU, and when it is launched it is executed as an array of parallel threads. This execution is shown in Figure 2. A block can only contain a certain number of threads, so blocks of threads are grouped together to form a grid of thread blocks.[9, 10]
IV. OPEN COMPUTING LANGUAGE
Open Computing Language, OpenCL, is a cross-platform, parallel programming framework. It is the first truly open and royalty-free programming standard for general-purpose computations on heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for devices using a diverse mix of multi-core CPUs, GPUs, and other parallel processors such as DSPs. OpenCL includes a language for writing kernels and APIs (Application Programming Interfaces) that are used to identify and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism.[12] OpenCL is maintained by the non-profit technology consortium the Khronos Group[13]. It has been adopted by Intel[14], Advanced Micro Devices (AMD)[15], nVidia[16], ARM Holdings[17] and IBM[18]. OpenCL gives any application access to the graphics processing unit for non-graphical computing, extending the power of the GPU beyond graphics.[12] Apple Inc. initially developed OpenCL and holds the trademark[19]. OpenCL was refined into an initial proposal in collaboration with technical teams at nVidia, Intel, AMD and IBM. The proposal was submitted to the Khronos Group in 2008. The goal was to have a cross-platform environment for general-purpose computing on GPUs. Representatives from software, CPU, GPU and embedded-processor companies joined together to form the Khronos Compute Working Group to finish the technical details of the specification for OpenCL 1.0. Once the specification was reviewed by Khronos members and approved, it was released to the public by the end of 2008. The world's first conformant GPU implementation of OpenCL for both Windows and Linux was shipped in June 2009.[12, 16]
A. Strengths of OpenCL
OpenCL is an open and royalty-free language, and the fact that code is portable across devices is a big advantage. It is a C-like language for heterogeneous devices.
It can be used on parallel CPU architectures and it is not vendor specific. OpenCL provides a common language for writing computational kernels, and a common API for managing execution on target devices. OpenCL implementations already exist for nVidia and AMD GPUs and for x86 CPUs.
B. Weaknesses of OpenCL
OpenCL has some limitations. It is a low-level API, which means that developers are responsible for a lot of plumbing, with many objects and handles to keep track of. They are also responsible for thread safety; certain types of multi-accelerator code are much more difficult to write than in CUDA. There is a need for OpenCL middleware and libraries, such as the libraries and tools available for CUDA. OpenCL code must
D. CUDA architecture
Figure 3 shows an example of the CUDA architecture. Here it can be seen that OpenCL and DirectCompute are supported applications on the CUDA platform for nVidia hardware; along with CUDA they make up the device-level API support offered by nVidia. Language integration is also possible: the CUDA run-time application can be in C, C++, Fortran, Python, Java, etc. The CUDA architecture consists of parallel compute engines inside nVidia GPUs (1). It also contains OS kernel-level support for hardware initialization and configuration (2). The user-mode driver, which provides a device-level API for developers (3), and a PTX instruction set architecture (4) for parallel computing kernels and functions, are also shown.[11]
deal with hardware diversity. Many features are optional and are not supported by certain devices. Due to the diversity of hardware on which OpenCL must operate, a single kernel is unlikely to achieve peak performance on all device types.[20]
C. OpenCL architecture
The main parts of the OpenCL framework handle passing data to and from the processing environment and compiling the OpenCL code. The main stages of execution are shown below in Figure 4. Setting up and coordinating the host environment (with n processors including multiple GPUs) so that it can distribute the data and compile the code efficiently are the key tasks OpenCL handles. OpenCL then has control of each process. This means it can store the main progress of the code until the completion of all the desired operations, when either more operations can be performed or the data from the GPU can be handed back to main memory on the CPU. OpenCL depends on the driver provided by the hardware vendor in order for a device to be supported.[21]
V. A COMPARISON OF CUDA AND OPENCL
Since the inception of OpenCL there have been many comparisons between it and CUDA. A correct implementation of OpenCL for the target architecture performs no worse than CUDA. Portability is the key feature of OpenCL. It is not vendor specific like CUDA, which only runs on nVidia devices. This has both advantages and disadvantages associated with it.[5, 12] CUDA is limited to nVidia hardware, thus it is more acutely aware of the platform upon which it will be executing. More mature compiler optimizations and execution techniques are provided as a result. This gives CUDA the upper hand, as OpenCL code needs to be prepared to deal with much greater hardware diversity, and GPU-specific technologies cannot be directly used by the programmer.[5, 12] CUDA has a much larger user base and codebase than OpenCL due to its maturity. The developer can manually add optimizations to the kernel code. OpenCL has less mature compilation techniques. As the OpenCL toolkit matures, the gap between it and the CUDA toolkit will narrow.[5, 12] The Scalable Heterogeneous Computing Benchmark Suite (SHOC) was used to compare CUDA and OpenCL kernels on nVidia GPUs. According to these tests, CUDA performs better than OpenCL on nVidia GPUs. The tests measure the number of floating-point operations per second, in GFLOPS, for the kernels. The graph of results is shown in Figure 6.[25]
D. OpenCL programming model
Similarly to CUDA, OpenCL has kernels. One or more kernels make up an application or a program that runs on the GPU, and when a kernel is launched it is executed as an array of parallel work items. Work groups contain the array of parallel work items. Kernels run over a global index range known as an NDRange, shown in Figure 5.[22-24]
VI. ALTERNATIVE LANGUAGES
A. An overview of DirectCompute
Microsoft developed DirectCompute. This is an API that supports GPGPU on Microsoft Windows Vista and Windows 7. DirectCompute is part of the DirectX collection of APIs. Although it was initially released with the DirectX 11 API, it runs on both DirectX 10 and DirectX 11 GPUs.[26-28] According to nVidia's DirectCompute programming guide, DirectCompute exposes the compute functionality of the GPU as a new type of shader. This compute shader has much more general-purpose processing capabilities than the normal shader.[29] With DirectCompute there does not have to be a fixed mapping between the data being processed and the threads doing the processing. This means that one thread can process one or many data elements, and the number of threads being used to perform the computation is controlled directly by the application.[29] DirectCompute has thread-group shared memory, which allows groups of threads to share data and can reduce bandwidth requirements significantly. Similarly to other compute APIs, compute shaders do not directly support any fixed-function graphics features, with the exception of texturing.[29]
B. Advantages of DirectCompute
There are several advantages of DirectCompute over other GPU computing solutions. Direct3D is integrated with DirectCompute, which means it has efficient interoperability with D3D graphics resources. All texture features are included, but LOD must be specified explicitly. The HLSL shading language is used by DirectCompute. A single API is provided across all graphics hardware vendors on Windows platforms; as a result there is some level of guarantee of consistent results across different hardware.[29]
C. An overview of OpenGL
The Open Graphics Library, OpenGL, is an API for GPUs. The procedures and functions used to specify the objects and operations needed to produce 3D images are contained in this interface. Silicon Graphics Incorporated designed OpenGL in 1992.
The Khronos Group manages OpenGL.[30-33] OpenGL is designed to be window-system and operating-system independent; it is also network-transparent. High-performance, visually compelling graphics software applications can be created using OpenGL on PCs, workstations or supercomputers. It is used in applications such as CAD and video games. All the features of the latest graphics hardware are exposed by OpenGL. Shown in Figure 8 is the OpenGL
A. Similarities between CUDA and OpenCL
The programming model used by CUDA is similar to that used by OpenCL. Figure 7 shows a comparison of terms for the data parallelism models. In both models the CPU is the host, and kernels executed from the application contain, in CUDA terms, parallel thread blocks in a grid.[23]
B. Differences between CUDA and OpenCL
CUDA is hardware specific, whereas OpenCL is not vendor specific. Due to this fact CUDA knows the hardware on which it runs and can be optimized for it. OpenCL has to be adapted for each different hardware vendor and may not perform as well as CUDA as a result.[5, 12] OpenCL is an open language, very portable and maintained by the Khronos Group; CUDA is not an open language. CUDA has been around for longer than OpenCL, thus it has a large code and user base; it is a more mature language. OpenCL's compilation techniques are less mature, and the programmer needs to do a lot more low-level programming than with CUDA.[5, 12]
client-server model. This model guarantees consistent presentation on any compliant hardware and software configuration.[30-34]
The GPU is able to function as a stream processor since all fragments can be thought of as independent, making the graphics pipeline well suited to the rendering process. All stages of the pipeline can be used simultaneously for different vertices or fragments; this independence allows the graphics processor to use parallel processing units. By using parallel processing units, multiple vertices or fragments can be processed in a single stage of the pipeline at the same time.[35, 36]
A. OpenGL shading language
The OpenGL shading language is known as GLSL. It is a high-level shading language. It has been designed to allow application programmers to express the processing that occurs at the programmable points of the OpenGL pipeline. Vertex and fragment processing are unified by GLSL in a single instruction set. This allows branches and conditional loops. GLSL has five shader stages: vertex, geometry, fragment, tessellation control, and tessellation evaluation.[37, 38]
D. Advantages of OpenGL
Due to the fact that OpenGL is a C-based API, it is extremely portable and widely supported. OpenGL provides functions for an application to generate 2D or 3D images and allows these rendered images to be copied to its own memory or displayed on the screen. The OpenGL specification is adhered to by every implementation of OpenGL, and a set of conformance tests must be passed, thus implementations are reliable. Similarly to OpenCL, OpenGL's specification is controlled by the Khronos Group. This guarantees industry acceptance, as the members of this industry consortium include many of the major companies in the computer graphics industry.[30-34]
VII. SHADERS
A computer program that is used to calculate rendering effects on graphics hardware is a shader. A shader is used to program the GPU programmable rendering pipeline. Programming languages adapted to map onto shader programming are known as shading languages. Instructions are sent to the GPU by the CPU in the form of a compiled shading-language program.[35, 36] The geometry is transformed and lighting calculations are performed within the vertex shader. Changes to the geometry in the scene are performed if a geometry shader is present on the GPU. The calculated geometry is subdivided into triangles, which are then broken down into pixel quads. Transformation of 3D data into useful 2D data for display in the frame buffer is done by the graphics pipeline using the above steps from the shader program.[35, 36]
OpenGL has the benefit of cross-platform compatibility on multiple operating systems. Shaders that are written can be used on any hardware vendor's graphics card once GLSL is supported. Each hardware vendor can create code optimised for their particular graphics card's architecture because the GLSL compiler is included in their driver.[37, 38]
B. Cg programming language
NVidia developed Cg, C for Graphics, which is a high-level shading language. It was developed in close collaboration with Microsoft for programming pixel and vertex shaders. This is not a general programming language; it is only suitable for GPU programming. Microsoft has a similar shading language called HLSL.[39] Cg features API independence, and a variety of free tools to improve asset management are available. It was designed for easy and efficient production pipeline integration. Connectors are special data structures used in Cg to link the various stages of processing. They define the input from the application to the vertex processing stage and the attributes to be used as inputs to the fragment processing.[39]
C. DirectX High-Level Shader Language
HLSL is the high-level shader language developed by Microsoft for DirectX and the Xbox. It is a C-type shader language supported by DirectX and the Xbox game consoles. Shaders for the Direct3D pipeline can be created using HLSL. There are three shader stages in HLSL: the vertex shader, the geometry shader and the pixel shader.[40, 41]
VIII. THE EVOLUTION OF GPUS
GPUs are extensively used in the computer games market. This is a booming market and it drives the sale of the GPU. This means that the future of the GPU is greater than that of the general-purpose CPU. The CPU will still remain the main processor, but there is much more potential for expanding the computing experience using the GPU. The GPU is much better at parallelism than the CPU, thus complex problems, both graphical and non-graphical, can be more easily solved by the GPU.[42] Due to the high volumes of GPUs being sold to PC gamers, and the resulting high demand, GPUs are relatively inexpensive. The trade-off of having high-cost special-purpose hardware is thus less of a factor. According to Moore's Law, CPU performance doubles every 18 months, whereas GPU performance doubles every 6 months. This makes it impossible for CPU manufacturers to keep up with the rapid growth of GPU advancement; it would prove too expensive to re-manufacture a new CPU every time a new GPU chip is released. Figure 9 shows how GPUs are obeying Moore's Law while CPUs are being left behind. "The graphical processing unit is visually and visibly changing the course of general purpose computing"[43].[42, 44]
generations of the GeForce design. The original release of the design was the GeForce 256 in 1999. The first GeForce products were intended for the high-margin PC gaming market. They were designed to be used on add-on graphics boards; they were discrete GPUs. All tiers of the PC gaming market were covered in subsequent designs. NVidia's embedded application processors include the most recent GeForce technology. These are designed for mobile phones.[46, 47]
A. The GeForce 6 Series
The sixth generation of GeForce is the GeForce 6 Series. It was released in 2004. This series can have a 4, 8, 12, or 16 pixel-pipeline GPU architecture. It contains an on-chip video processor with full MPEG-2 encoding and decoding, and advanced adaptive de-interlacing, called PureVideo. This design also has High Precision Dynamic Range technology and 8 times more shading performance than previous designs. There is DirectX 9 Shader Model 3.0 support, along with OpenGL 2.0 optimizations and support.[47-49]
B. Architecture of the GeForce 6 Series
The GPU memory interface has an available bandwidth of 35 GBps. The CPU memory interface has 6.4 GBps of available bandwidth, and the PCI Express bus has 8 GBps. This shows that there is a vast amount of internal bandwidth available on the GPU. More dramatic performance improvements can be made by making sure that algorithms running on the GPU take advantage of this bandwidth.[50]
GPU hardware architecture is moving from a single-core hardware pipeline implementation for graphics processing to highly parallel and programmable cores for more general-purpose computing. By adding more programmability and parallelism, the GPU core architecture is evolving towards a general-purpose CPU-like core.[45]
IX. GEFORCE
Figure 11 shows the block diagram of the GeForce 6 Series architecture. It shows the graphics process by which input arrives from the CPU (host) and is output as pixels drawn to the frame buffer. The CPU writes a command stream which sets and modifies the state, references the vertex and texture data, and sends rendering commands. These states, commands and vertices flow down through the block diagram, where they are used in subsequent pipeline stages.[50]
GeForce is a brand of GPUs designed by nVidia. The GeForce logo is shown below in Figure 10. There are over 10
The vertex shaders/processors, shown in Figure 12, allow a program to be applied to each vertex in the object. Transformations, skinning and other per-vertex operations are performed here. All operations in this processor are done in 32-bit floating-point (fp32) precision. There can be up to six vertex units on high-end models and there may be two on low-end models.[50] The vertex programs can fetch texture data. The texture cache is shared between the fragment processor and the vertex processor because the vertex processor can perform texture access. There is also a vertex cache to store all data before and after the vertex processor.[50] Primitives are points, lines or triangles. The vertices are grouped into these primitives. Three blocks, cull, clip and setup, perform per-primitive operations: primitives that aren't visible are removed (cull), primitives that intersect the view frustum are clipped (clip) and edge and plane equation set-up on the data is performed for the rasterization process (setup).[50]
The calculation of which pixels are covered by each primitive is done in the rasterization block. It uses the z-cull block to discard pixels. A fragment will then pass through the fragment processor, where tests are performed on it. Once it passes the tests, it will carry depth and colour information to a pixel on the frame buffer.[50] The fragment processor and texel pipeline together are also known as the pixel shader, Figure 13. This unit applies a shader program to each fragment independently. There can be a varying number of fragment pipelines on the GeForce 6 Series GPUs. Texture data is cached on-chip, similarly to the vertex processor. This reduces bandwidth requirements and increases performance.[50]
Quads are squares of four pixels. The texture and fragment-processing units operate on quads. This allows
direct computation of derivatives for calculating texture level of detail. The texture unit fetches the data from memory for the fragment processor and returns it in fp16 or fp32 format. The texture unit can read a 2D or 3D array of data. 16-bit floating-point precision filtering is supported by this design.[50] There are two fp32 shader units per pipeline in the fragment processor. Before the fragments re-circulate through the pipeline to execute the next set of instructions, they are passed through both shader units and the branch processor. This happens once every clock cycle.[50] Once the fragments have passed through the fragment-processing unit they are sent to the z-compare and blend units in the order in which they were rasterized. Stencil operations, alpha blending, depth testing and the final colour write to the target surface are performed in these units.[50] There are four DRAMs which divide up the memory system, all of which are independent. The memory subsystem can operate efficiently by having smaller, independent memory partitions, regardless of whether small or large blocks of data are transferred. The streaming of 32-byte memory accesses near the physical limit of 35 GBps is possible due to the four independent memory partitions, giving the GPU a wide, flexible memory subsystem of roughly 256 bits.[50]
C. Challenges and Opportunities for High-Performance Computing
To achieve optimal performance of the devices there are some actions that can be carried out. The z-cull block, shown in Figure 10, is used to discard pixels. It avoids work that doesn't contribute to the final result. To conclude early that a computation doesn't contribute, the z-values for all objects can be rendered first, before shading. For example, with general-purpose computing the z-cull can be used to select which parts are still active in the computation.
It will cull the computational threads that have already been resolved.[50] The texture math can be exploited when loading data. This unit filters data before it is returned to the fragment processor, reducing the total data needed by the shader. The total work done by the shader can be reduced if the texture unit's bilinear filter is used more frequently. Similarly, when performing compares, work can be offloaded from the processor by using the filtering support in shadow buffering; the result can then be filtered.[50] Branching can be very beneficial, provided that the work avoided by branching outweighs the cost of branching. The fragment processor operates on many fragments simultaneously. Fragments in a group may take different branches; in this case both branches have to be taken by the fragment processor. This can reduce the performance of branching in programs.
However, if branching is not an effective choice, conditional writes can be used.[50] A full-speed fp16 normalize instruction executing in parallel is supported by this design. Using fp16 intermediate values reduces the internal storage and datapath requirements; fp32 intermediate values can be saved for cases where the precision is needed, and performance is increased by using fp16 intermediate values elsewhere.[50] There is a fixed amount of register space per fragment, allowing the shader pipeline to keep hundreds of fragments in flight. Fewer fragments will remain in flight if this register space is exceeded. This will reduce the latency tolerance for texture fetches, thus adversely affecting performance. If the register file uses fp32x4 values exclusively it may run out of read and write bandwidth to feed all units; there is enough bandwidth when reading fp16x4 values to keep all units busy.[50] Extraordinary new performance is delivered by this new design. It streamlines the creation of stunning effects in games and other 3D real-time applications. The hardware power that is needed to create such detailed and vibrant images won't be too intense for the PC due to this new architecture.[52] The new superscalar shader architecture in this design doubles the number of operations executed per cycle. There is a significant performance increase as a result. Full 32-bit floating-point precision is provided to deliver higher-quality images. Developers can implement stunning visual effects. There is no compromise of speed for quality in this design.[52]
X. COMPARISON OF GPU PROGRAMMING MODELS TO DESKTOP MULTICORES
A. The difference between CPU and GPU
The CPU is the central processing unit; it is the brain of the computer system. The GPU is a graphics processing unit. The GPU is a complementary processing unit which handles computationally intensive graphics processing. The rest of the application still runs on the CPU. The application runs faster from a user's perspective because the processing power of the GPU boosts performance. Hybrid computing is using the GPU as a co-processor to the CPU. Graphics processing is inherently parallel, therefore it can be easily parallelized and accelerated.[2, 53]
B. How a multicore system differs from a GPU
A CPU is designed with a few cores; it can consist of 4 to 8 cores. These cores can handle a few software threads, which can be exploited in an application program. Figure 14 shows an example of a CPU with multiple cores. Compared to their single-core predecessors, multi-core CPUs can operate at lower
frequencies, consume less power, and complete work much faster by running tasks in parallel. A single-core CPU takes longer to complete a given task and runs at higher clock frequencies and voltages than a multi-core CPU. Workload sharing on multi-core CPUs is symmetrical multiprocessing: distributing the workload across multiple CPU cores. As a result, each CPU core can run at lower frequencies and voltages to complete multi-threaded tasks. Each core also consumes significantly less power and offers higher performance per watt than single-core CPUs due to the lower operating frequencies and voltages.[2, 53, 54]
A GPU, on the other hand, is designed with hundreds of cores to exploit the available parallelism, which is what gives the GPU its high compute performance. The difference between the multicore processor and the GPU can be seen in Figure 15.[2, 53]

The GPU is specialized for computation that is highly parallel and compute-intensive. It is designed with graphics rendering operations in mind, so more transistors are devoted to data processing than to data caching and flow control. This is the reason for the discrepancy in floating-point capability between the CPU and the GPU, which can be seen in Figure 16.[44]

Figure 16. The GPU devotes more transistors to data processing than the CPU[44]

XI. GPUS IN MOBILE COMPUTING

Nvidia has produced a number of GPUs under the GeForce brand for notebook computers; most of the features of the desktop parts are present in the mobile ones as well. Nvidia has also developed a family of system-on-a-chip (SoC) mobile processors called Tegra, aimed at devices such as smart-phones, tablets, PDAs and mobile Internet devices. The Tegra comprises a CPU, a GPU, and image, video and sound processors in a highly energy-efficient package, and it runs a variety of operating systems including Android, Linux and Windows. Figure 17 shows the block diagram of the Tegra 2.[55]

The Tegra 2 is a dual-core processor; as can be seen from the block diagram, it contains two ARM Cortex-A9 CPUs. It uses an ARM7 processor for system management and battery life features, and it has an ULP (ultra-low-power) GeForce GPU. The chip has been architected for low-powered applications, and it also includes dedicated processors for image, audio and video.[57] The Tegra offers extreme multitasking, web browsing, gaming and HD video playback. The Tegra 2 is known as the world's first mobile super chip, with the first mobile dual-core CPU[58]. The newer Tegra 3, which has quad-core processing, adds 4-PLUS-1 battery-saver technology for greater mobile performance.[57, 58] Phones from LG, Motorola and Samsung are among those powered by the Tegra 2[59], and there is a long list of Tegra 2-powered tablets, the most popular being the Samsung Galaxy Tablet, the Sony Tablet and the Toshiba Thrive[60]. The challenges that HD video playback, streaming video and 3D gaming pose for power consumption and performance were faced previously by desktop and notebook CPUs; now mobile application processors face the same challenge, which stretches the capabilities of current single-core mobile processors. To increase their performance while staying within mobile power budgets, mobile processors need multiple cores.[54, 61] The Tegra 2 was designed to harness the power of symmetrical multiprocessing, which delivers higher performance and lower power consumption. It offers faster Web page loading, higher-quality game play, faster multitasking and tremendous battery life improvements.[58, 61]

XII. DISCUSSION

GPUs are more effective than general-purpose CPUs at processing large blocks of data in parallel. They are used alongside CPUs for this purpose: by performing the mathematically intensive tasks, the GPU relieves the strain that would otherwise be put on the CPU, freeing it to perform other tasks.[1, 5] CUDA and OpenCL are the main contender languages for GPU programming, along with DirectCompute from Microsoft. When OpenCL is implemented correctly for the target architecture it performs no worse than CUDA, but CUDA is a more mature language with a larger code and user base, and OpenCL still lacks the middleware tools and libraries that CUDA has.
As the OpenCL toolkit matures, the gap between it and the CUDA toolkit will close.[5, 12, 26] The GPU has begun to evolve from a single-core, fixed-function hardware pipeline used only for graphics rendering into a set of highly parallel, programmable cores for more general-purpose computation. The architecture of many-core GPUs is starting to look more and more like that of multi-core, general-purpose CPUs.[45] A single-core CPU runs at higher clock frequencies and voltages than a multi-core CPU and takes longer to complete a given task. By distributing the workload across multiple CPU cores (workload sharing), each core can run at lower frequencies and voltages while completing multi-threaded tasks. Each core consumes significantly less power and offers higher performance per watt than a single-core CPU because of the lower operating frequencies and voltages.[61] Nvidia developed the Tegra to harness the power of multi-core CPUs, delivering higher performance and lower power consumption on mobile devices. The result is tremendous battery life improvements, along with extreme multitasking, a better gaming experience and faster Web browsing.[54, 61]

REFERENCES

[1] Wikipedia. (2012, 4th April 2012). Graphics processing unit. Available: http://en.wikipedia.org/wiki/Graphics_processing_unit
[2] nVidia. (2012, 12th April 2012). What is GPU computing? Available: http://www.nvidia.com/object/GPU_Computing.html
[3] TechTerms. (2012, 6th April 2012). GPU. Available: http://www.techterms.com/definition/gpu
[4] nVidia. (2012, 6th April 2012). GeForce 256. Available: http://www.nvidia.com/page/geforce256.html
[5] Wikipedia. (2012, 5th April 2012). CUDA. Available: http://en.wikipedia.org/wiki/CUDA
[6] nVidia. (2012, 6th April 2012). What is CUDA. Available: http://developer.nvidia.com/what-cuda
[7] nVidia. (2012, 6th April 2012). CUDA FAQ. Available: http://developer.nvidia.com/cuda-faq
[8] J. Cohen. (2009, 13th April 2012). CUDA Libraries and Tools. Available: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf
[9] DAAC. (2009, 13th April 2012). CUDA programming model. Available: http://www.visualization.hpc.mil/wiki/CUDA_Programming_Model
[10] M. F. Ahmed. (2010, 6th April 2012). CUDA Computer Unified Device Architecture. Available: http://mohamedfahmed.wordpress.com/2010/05/03/cuda-computer-unified-device-architecture/
[11] nVidia. (2009, 13th April 2012). Nvidia CUDA Architecture. Available: http://developer.download.nvidia.com/compute/cuda/docs/CUDA_Architecture_Overview.pdf
[12] Wikipedia. (2012, 5th April 2012). OpenCL. Available: http://en.wikipedia.org/wiki/OpenCL
[13] Khronos. (2012, 5th April 2012). OpenCL. Available: http://www.khronos.org/opencl/
[14] Intel. (2012, 5th April 2012). Intel OpenCL SDK. Available: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
[15] AMD. (2011, 5th April 2012). OpenCL Zone. Available: http://developer.amd.com/zones/openclzone/Pages/default.aspx
[16] nVidia. (2012, 5th April 2012). OpenCL. Available: http://developer.nvidia.com/opencl
[17] ARM. (2012, 5th April 2012). Khronos Standards. Available: http://www.arm.com/community/multimedia/standards-apis.php
[18] IBM. (2012, 5th April 2012). OpenCL. Available: http://researcher.ibm.com/view_project.php?id=1835
[19] Apple. (2012, 5th April 2012). OpenCL. Available: https://developer.apple.com/softwarelicensing/agreements/opencl.html
[20] C.-R. Lee. (2010, 13th April 2012). CUDA Programming. Available: http://www.cs.nthu.edu.tw/~cherung/teaching/2010gpucell/CUDA06.pdf
[21] B. Alun-Jones. (2010, 13th April 2012). A Quick Introduction to OpenCL. Available: http://www.mat.ucsb.edu/594cm/2010/benalunjonesrp1/index.html
[22] W. W. Hwu and J. Stone. (2010, 13th April 2012). The OpenCL Programming Model. Available: http://www.ks.uiuc.edu/Research/gpu/files/upcrc_opencl_lec1.pdf
[23] W. W. Hwu and J. Stone. (2010, 13th April 2012). The OpenCL Programming Model. Available: http://www.ks.uiuc.edu/Research/gpu/files/upcrc_opencl_lec2.pdf
[24] DrZaius. (2009, 13th April 2012). Matrix Multiplication 2 (OpenCL). Available: http://gpgpucomputing4.blogspot.com/
[25] NERSC. (2011, 13th April 2012). Performance and optimization. Available: http://www.nersc.gov/users/computationalsystems/dirac/performance-and-optimization/
[26] Wikipedia. (2012, 5th April 2012). DirectCompute. Available: http://en.wikipedia.org/wiki/DirectCompute
[27] nVidia. (2012, 6th April 2012). DirectCompute. Available: http://developer.nvidia.com/directcompute
[28] Microsoft. (2010, 6th April 2012). DirectX11 DirectCompute. Available: http://www.microsoftpdc.com/2009/P09-16
[29] nVidia. (2010, 15th April 2012). DirectCompute Programming Guide. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/DirectCompute/doc/DirectCompute_Programming_Guide.pdf
[30] (15th April 2012). OpenGL Programming Guide. Available: http://www.glprogramming.com/red/chapter01.html
[31] Khronos. (2012, 15th April 2012). OpenGL - The Industry's Foundation for High Performance Graphics. Available: http://www.khronos.org/opengl
[32] M. Segal and K. Akeley. (2011, 15th April 2012). The OpenGL Graphics System: A Specification. Available: http://www.opengl.org/registry/doc/glspec42.core.20110808.pdf
[33] Wikipedia. (2012, 15th April 2012). OpenGL. Available: http://en.wikipedia.org/wiki/OpenGL
[34] Apple. (2012, 15th April 2012). OpenGL Programming Guide for Mac OS X. Available: https://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_intro/opengl_intro.html
[35] Wikipedia. (2012, 15th April 2012). Shading language. Available: http://en.wikipedia.org/wiki/Shading_language
[36] Wikipedia. (2012, 15th April 2012). Graphics pipeline. Available: http://en.wikipedia.org/wiki/Graphics_pipeline
[37] J. Kessenich, D. Baldwin, and R. Rost. (2011, 15th April 2012). The OpenGL Shading Language. Available: http://www.opengl.org/registry/doc/GLSLangSpec.4.20.8.clean.pdf
[38] Wikipedia. (2012, 15th April 2012). GLSL. Available: http://en.wikipedia.org/wiki/GLSL
[39] Wikipedia. (2012, 12th April 2012). Cg (programming language). Available: http://en.wikipedia.org/wiki/Cg_(programming_language)
[40] Microsoft. (2012, 15th April 2012). Programming Guide for HLSL. Available: http://msdn.microsoft.com/en-us/library/windows/desktop/bb509635(v=vs.85).aspx
[41] Wikipedia. (2012, 15th April 2012). High Level Shader Language. Available: http://en.wikipedia.org/wiki/High_Level_Shader_Language
[42] T. S. Crow, "Evolution of the Graphical Processing Unit," Master of Science thesis, Computer Science, University of Nevada, Reno, 2004.
[43] M. Macedonia, "The GPU Enters Computing's Mainstream," Computer, vol. 36, pp. 106-108, 2003.
[44] nVidia. (2011, 16th April 2012). NVIDIA CUDA C Programming Guide. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
[45] C. McClanahan, "History and Evolution of GPU Architecture," 2010.
[46] Wikipedia. (2012, 15th April 2012). GeForce 256. Available: http://en.wikipedia.org/wiki/GeForce_256
[47] Wikipedia. (2012, 15th April 2012). GeForce. Available: http://en.wikipedia.org/wiki/GeForce
[48] Wikipedia. (2012, 15th April 2012). GeForce 6 Series. Available: http://en.wikipedia.org/wiki/GeForce_6_Series
[49] nVidia. (2012, 15th April 2012). GeForce 6 Series. Available: http://www.nvidia.com/page/geforce6.html
[50] E. Kilgariff and R. Fernando, "Chapter 30. The GeForce 6 Series GPU Architecture," in GPU Gems 2, M. Pharr, Ed., 2005.
[51] G. Chunev. (2009, 15th April 2012). Graphics Processing. Available: http://www.cs.indiana.edu/~gnchunev/files/Lecture.pdf
[52] nVidia. (2012, 16th April 2012). High-Performance, High-Precision Effects. Available: http://www.nvidia.com/object/feature_HPeffects.html
[53] R. Ragel. (2011, 13th April 2012). Difference Between CPU and GPU. Available: http://www.differencebetween.com/difference-between-cpu-and-vs-gpu/
[54] nVidia, "The Benefits of Quad Core CPUs in Mobile Devices," 2011.
[55] nVidia. (2012, 17th April 2012). Getting Started with Tegra. Available: http://developer.nvidia.com/tegrastart
[56] R. Pogson. (2011, 17th April 2012). Nvidia Tegra 2 block diagram. Available: http://mrpogson.com/2011/04/03/
[57] nVidia. (2012, 17th April 2012). NVidia Tegra Mobile Processor Features. Available: http://www.nvidia.com/object/tegra-features.html
[58] nVidia. (2012, 17th April 2012). NVIDIA Tegra 2. Available: http://www.nvidia.com/object/tegra2.html
[59] nVidia. (2012, 17th April 2012). Tegra Super Phones. Available: http://www.nvidia.com/object/tegrasuperphones.html
[60] nVidia. (2012, 17th April 2012). Tegra Super Tablets. Available: http://www.nvidia.com/object/tegra-supertablets.html
[61] nVidia, "The Benefits of Multiple CPU Cores in Mobile Devices," 2011.