After a stunning meal at the Mitsui Japanese restaurant last night (described as ‘exquisite’ by a visiting Japanese guest, so it must have been good!), I would say that the keynote speakers this morning probably had their work cut out to grab our attention. However, they managed it effortlessly – both presentations were instantly engaging.
Alex Szalay of Johns Hopkins University talked to us about extreme data-intensive scientific computing. He told us that while the big data sets, such as the LHC and genomic research, attract all the press, we shouldn't forget about the 'long tail' of data generated by day-to-day science. Together this adds up to such a significant quantity of data that we are actually living through a scientific revolution. Science is moving from hypothesis-driven to data-driven discovery, and funding agencies in particular need to catch up with this trend.
Alex’s research field is astronomy, and he discussed the Sloan Digital Sky Survey, which generated 100 TB of data presented through the SkyServer website, a site that has received an impressive 993 million hits over the last 10 years. Processing this data doesn’t just need computing power: with only 15,000 astronomers worldwide, additional pairs of eyes are needed to identify the astronomical objects in the images. This ‘internet scientist’ idea led to GalaxyZoo, a mash-up of the million brightest images in Sloan. To date there have been 40 million visual galaxy classifications by the public, including a few discoveries such as Hanny’s Voorwerp and the “Green Peas”.
HPC is an instrument in its own right and presents many challenges: how to move data, how to look at it, how to interface to it and how to analyse it. Scientists are using HPC to solve problems such as modelling turbulence, biological systems and the brain. The Milky Way Laboratory has just been funded by the NSF, and will use cosmology simulations as an immersive lab for general users. Data Intensive Scalable Computing challenges of this type lead to a whole host of trade-offs in computing. System designers need to balance the need for data storage, cost, sequential IO, fast streams, and low power – and this is a moving target that changes every year.
Storing 100 TB is really hard, especially if you need months to process the data, and moving it around isn’t much easier – the quickest option can sometimes be to throw disks into the back of the car, or send them via courier if you don’t fancy a long drive. However it travels, once your data arrives where it’s supposed to be, particularly in the cloud, interesting possibilities open up: bringing small, seemingly unrelated data sets together in a single cloud may allow new value to emerge. This has happened at Facebook, which now has billions of photos from hundreds of millions of users. What might this bring to the long tail of science?
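A quick back-of-envelope calculation shows why the car boot can beat the network. This is just a sketch: only the 100 TB figure comes from the talk, while the link speeds and the courier estimate are my own assumptions.

```python
# Rough comparison of moving 100 TB over a network link vs. shipping disks.
# Assumed: fully utilised links at 1 and 10 Gbit/s, and a one-to-two-day
# courier; only the 100 TB data volume is taken from the Sloan survey above.

def transfer_days(size_terabytes: float, link_gbps: float) -> float:
    """Days needed to move size_terabytes over a link of link_gbps
    (gigabits per second), assuming the link is fully utilised."""
    bits = size_terabytes * 1e12 * 8        # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)      # bits / (bits per second)
    return seconds / 86400                  # seconds -> days

SIZE_TB = 100.0
print(f"1 Gbit/s link:  {transfer_days(SIZE_TB, 1):.1f} days")   # ~9.3 days
print(f"10 Gbit/s link: {transfer_days(SIZE_TB, 10):.1f} days")  # ~0.9 days
print("Courier: ~1-2 days, regardless of data volume")
```

At a (generously assumed) sustained 1 Gbit/s, 100 TB takes over nine days to move, so a van full of disks arriving in a day or two really is the faster option.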