Sunday, March 22, 2009

WLCG Workshop - Day 1

The WLCG (Worldwide LHC Computing Grid) collaboration brings together grid infrastructures in Europe, the USA and Asia to meet the data analysis requirements of the experiments using the LHC. The WLCG Workshop in Prague, held just before the CHEP (Computing in High Energy and Nuclear Physics) conference, provides an opportunity to understand how ready the infrastructure is to support the planned LHC start-up later in the year.

While computing may have been the main focus in the early stages of the grid, data movement and access now provide many of the service challenges. Firstly, you have to move the data. When the LHC is running, data will stream from the sensors to CERN (Tier 0) and out to around a dozen major sites around the world (Tier 1), where it can be accessed by regional (Tier 2) and local (Tier 3) sites. Once the data is at a site it has to be stored. Depending (usually) on the size of the site, different software and hardware combinations are used to store the files. These range from a simple disk attached to a machine to complex dedicated hierarchies of tape and disk that can archive petabytes of data.

Most of the first day focused on the user and service experiences derived from the recent use of WLCG.

Clearly, the combination of complex software and hardware can lead to failures that have to be managed both from a service perspective (WLCG is there to enable research to be carried out) and within the applications built around the distributed infrastructures. All the experiments have continued to test the WLCG, either through test runs or by analysing cosmic ray data collected from the installed instruments.

The deliberate duplication and redundancy within WLCG means that, although there are frequent problems across the federated grids, the service as a whole is always up. Unexpected events (data centres catching fire, power failing, cables being cut, etc.) are rare for an individual site, but with so many sites within WLCG they become regular events that have to be handled. Despite the growth in the number of WLCG sites, availability and reliability continue to improve, although there are still improvements to be made, mainly in how we as a community learn from these incidents to improve the plans at our own sites in case similar events happen to us. This requires continual, honest communication between the sites!
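
The point about rare events becoming regular ones can be made concrete with a little arithmetic. The sketch below assumes that sites fail independently and that each has the same small chance of a major incident in a given month. Both the 1% figure and the site counts are made-up illustrations, not WLCG statistics.

```python
def prob_at_least_one(p_site: float, n_sites: int) -> float:
    """Chance that at least one of n_sites independent sites has an incident.

    p_site is the (assumed) chance that any single site has an incident
    in the period of interest.
    """
    return 1.0 - (1.0 - p_site) ** n_sites


if __name__ == "__main__":
    # Hypothetical value: a 1% chance per site per month.
    p_site = 0.01
    for n_sites in (1, 10, 100, 200):
        chance = prob_at_least_one(p_site, n_sites)
        print(f"{n_sites:4d} sites: {chance:.1%} chance of at least one incident")
```

With these assumed numbers, an incident that a single site would see about once in a hundred months has a better than even chance of happening somewhere in a 100-site grid in any given month.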

While these 'acts of God' are beyond our control, the reliability of the software (both the middleware and the application software) should be under better control and not subject to random acts.

If only that were true!

The demands made on software rarely stay the same. As the software is deployed and used, users may change or expand the way they use it. Eventually these requirements settle down... but until that happens the developers are always trying to catch up. While the experiments generally seem happy with the functionality offered by the data management software, scalability and reliability under load have started to appear as issues. This is especially true as the workloads have been scaling up in preparation for the LHC start-up.
As a result, further improvements may be made to the software and these releases deployed before a final data challenge is staged (a realistic stressing of the grid with the workload expected when the LHC is running), ahead of the LHC starting again in the autumn.

The first day was summarised by the organiser, Jamie Shiers, with: 'you want change and you also want everything to stay the same'. A challenge for us all!
