Monday, July 12, 2010

Future of computing: GPGPU?

During the 2010 HealthGrid Conference, several papers presented results involving GPGPU calculations. I'll try to answer some questions about this technology:
  • What?
The acronym stands for "General-purpose computing on graphics processing units". The main idea is to use GPUs (Graphics Processing Units) as general-purpose computing devices.
  • Why?
A common CPU nowadays contains up to 12 cores on a single chip (AMD Opteron 6000), whereas a GPU contains hundreds of cores (240 for the Nvidia Tesla series). The following chart from the Nvidia CUDA programming guide clearly illustrates the theoretical performance of GPUs vs. CPUs.

Even though the raw per-core performance is lower than that of a CPU, a GPU is well suited to massively parallel computations.
  • Who?
Three major vendors have announced or already deployed commercial solutions. The main one is Nvidia, followed by AMD (ATI chips) and Intel. Each company has deployed its own solution (Stream for AMD, CUDA for Nvidia), but an open standard (OpenCL) is now emerging.
  • How?
Programming on graphics cards is not straightforward. Even though powerful APIs are available, the programmer has to rewrite algorithms to make them GPU-enabled and efficient.
  • A small example?
Here is a simple matrix addition in C++:

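// N and the matrices A, B, C are assumed to be defined elsewhere
void compute(float A[N][N], float B[N][N], float C[N][N]){
    // visit every element sequentially, one after the other
    for (int i=0;i<N;i++)
        for (int j=0;j<N;j++)
            C[i][j] = A[i][j] + B[i][j];
}

int main(){
    //call the CPU function
    compute(A, B, C);
}
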
And the GPU-enabled version of this algorithm (in Nvidia CUDA):

__global__ void compute(float A[N][N], float B[N][N], float C[N][N]){
    // each GPU thread computes a single element of C
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main(){
    //call the GPU-kernel
    compute<<<N_blocks,N_threads>>>(A, B, C);
}

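In fact, the programmer does not write the loops at all: the kernel definition (marked with the __global__ qualifier) describes the work of a single thread, and the launch configuration tells the CUDA runtime how many threads to create. During execution, the kernel is run in parallel by N_blocks blocks of N_threads threads each. Note that threadIdx only identifies a thread within its own block, so this kernel, as written, only makes sense for a single block of N x N threads; larger matrices require combining blockIdx, blockDim and threadIdx.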
  • Applications?
GPGPU is not a universal solution: only highly parallelizable algorithms can take advantage of it.
The main restriction is related to memory management, as hundreds of threads have to share a limited amount of memory (1 GB).

However, more and more applications in matrix computation, image and signal processing, and Monte Carlo simulation have already demonstrated the potential of this approach.

Nvidia CUDA:
AMD Stream:

1 comment:

Sam Skipsey said...

Technically, I suppose it should be mentioned that GPGPU computing is, in particular, only terribly useful for "SIMD" parallel computing - where you have the same process to apply to a lot of bits of data.
Obviously, this is better for some things than others, and the speed-ups from GPGPU vary dramatically for the various use cases - anything from 4x to 200x depending on the problem.
(Image processing is, of course, a trivial example of a really good application. After all, GPUs were originally designed the way they are to... process graphics.)