As a side note, Matsuoka told us about the discussions they had at the Tokyo Institute of Technology after the earthquake last year. The country was hit by serious power cuts in the aftermath, and the institute suggested turning off the supercomputer in the interests of saving power. The computing team pointed out that as Tsubame was much more efficient, it would be better to turn off the laptops. But of course you can’t access the supercomputer without the laptops - so in the end, they kept them both on.
Matsuoka walked us through some impressive examples of the modelling work carried out on Tsubame, such as weather prediction down to resolutions of a few hundred metres, dendritic crystal growth in lightweight metal alloys, and blood flow past the type of coronary obstructions suffered recently by the Japanese emperor. Modelling earthquakes and tsunamis is naturally of high importance in the region but, perhaps surprisingly, tsunamis are relatively easy to model – it’s their effects once they move into built-up areas that are more difficult to predict.
Looking to the future, Japan is setting up a High Performance Computing Infrastructure (HPCI) similar to PRACE in Europe and XSEDE in the US. HPCI will connect the national supercomputing centres, with the famous “K” computer as the focal point, and will be in operation by the summer of 2012. HPCI will be able to boast up to 100PB of dedicated shared storage.
Back in 1995, an iPad 2 would have been classed as a supercomputer. Providing the next generation of exascale rather than petascale computing is going to be a knotty problem, not least in working out how to manage the power that machines with up to a billion cores will eat up.
Today’s final keynote speaker, Thomas Sterling of Indiana University, talked about an execution model for HPC, moving towards the exascale. He described the “K” computer as the T-rex of the computing world, in both positive and negative senses – and not just because it has a big ‘byte’. (Pardon the pun!) He described both K and Tsubame as very impressive machines when you get under the hood, in terms of system, network and chip design. However, while there are lessons to be learnt from both of them, they are not where we will end up with future supercomputing.
Sterling’s question was: where do clouds fit into the HPC future? At the moment, we would need to see a paradigm shift to broaden clouds to HPC. Machines such as the Laser Interferometer Gravitational-Wave Observatory (LIGO) are the source of an enormous amount of data. The largest gravitational experiment going (other than the universe itself), LIGO tracks changes in length of less than the size of an atomic nucleus in legs that are kilometres long. The challenge is to extract the signal from the noise – and there is not a lot of parallelism in the computations. Unpredictable computations of this type are not supported well by static tools such as MPI.
HPC divides naturally into three modalities. The first is capacity computing, which is high throughput with job stream parallelism and without synchronisation or data exchange. HPC fits on the cloud via this throughput mode of computing, as it lends itself to sequential jobs of the same program on separate blocks of data, often carried out where the data is located. The second, capability computing, uses coupled parallel elements, has strong scaling, and is all about reducing the time it takes to do a particular, well-defined job. Sterling defined a third modality as “cooperative computing” – increasing both the speed at which a problem is solved and the size of the data sets. Interim results are exchanged cooperatively to drive a dynamic computation process. It is no longer about just leveraging Moore’s Law to improve performance.
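The capacity (throughput) mode described above can be sketched in a few lines of Python. This is only an illustration, not anything Sterling showed: `analyse_block` and `run_capacity_jobs` are hypothetical names, the data is made up, and a thread pool on one machine stands in for a pool of cloud nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse_block(block):
    """Hypothetical job: the same program applied to one block of data."""
    return sum(x * x for x in block)


def run_capacity_jobs(blocks, workers=4):
    """Run one independent job per block.

    The jobs never synchronise or exchange data with each other,
    which is what makes this style a natural fit for the cloud.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input blocks
        return list(pool.map(analyse_block, blocks))


# Made-up data blocks, purely for illustration
blocks = [[1, 2, 3], [4, 5], [6]]
results = run_capacity_jobs(blocks)
print(results)
```

Capability and cooperative codes would not decompose this neatly, since their tasks have to exchange interim results while they run.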
Some HPC applications are not a good fit for the cloud, such as capability codes, jobs with very large data sets, and jobs with data or results that are sensitive, proprietary or security related. There are several challenges for improving efficiency and scalability: starvation, i.e. not having enough concurrency; latency, which impacts efficiency when it is high; overhead, which affects the critical time needed to manage tasks and resources; and waiting, for example where there is contention for shared resources. Put Starvation, Latency, Overhead and Waiting together, and you get SLOW computing...
To bring a new cloud execution model online you need to decouple software from the hardware it runs on, be dynamic not static, hide latency and use message-driven computation. “It’s not just how many cores you can squeak together in the cloud,” said Sterling, “latency matters. We need new models and programming interfaces to make this work.”