Monday, March 23, 2009

WLCG Workshop - Day II

The focus of the second day of the WLCG Workshop shifted to some of the higher-level issues around the use of the WLCG infrastructure. It is worth noting that the four experiments within the WLCG each take a different approach to using the available grid infrastructure.

Some use a generic middleware service (the Workload Management Service, or WMS, from gLite) to select a computing resource from those that are available. Users specify their job's compute and data requirements, a site is selected, and the job is monitored until it either completes successfully or is moved to another site because of a local failure.
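To make the idea of requirement-based site selection concrete, here is a minimal sketch in Python. It is not the gLite WMS interface: the `Site` and `Job` classes, the `healthy` flag and the ranking by free CPUs are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    # Hypothetical description of a grid site (not a real WMS data structure).
    name: str
    free_cpus: int
    datasets: set = field(default_factory=set)
    healthy: bool = True  # stands in for "the job ran to completion here"


@dataclass
class Job:
    # Hypothetical job requirements: CPUs needed and the dataset it reads.
    cpus: int
    dataset: str


def match_sites(job, sites):
    """Return the sites that satisfy the job's requirements, most free CPUs first."""
    candidates = [
        s for s in sites
        if s.free_cpus >= job.cpus and job.dataset in s.datasets
    ]
    return sorted(candidates, key=lambda s: s.free_cpus, reverse=True)


def run_with_resubmission(job, sites):
    """Try each matching site in turn, moving on when a site fails locally."""
    for site in match_sites(job, sites):
        if site.healthy:
            return site.name  # job completed at this site
    return None  # no matching site could run the job


sites = [
    Site("site-a", 100, {"run-1"}, healthy=False),
    Site("site-b", 50, {"run-1"}),
    Site("site-c", 200, {"run-2"}),
]
job = Job(cpus=10, dataset="run-1")
print(run_with_resubmission(job, sites))
```

In this toy setup site-c is excluded because it does not hold the dataset, site-a is tried first but fails, and the job ends up at site-b.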

Another approach taken by the experiments is to develop their own applications that execute a fixed workflow. The compute and data resources are selected from those the experiment knows to be available and working for its particular workload. Jobs are submitted directly to the sites from the application software.

Some of these jobs are applications that will execute directly once they are scheduled by the local batch system. Other jobs are classed as 'pilot jobs' (or placeholder jobs), in that it is only once they start running that they are given the work they need to do. This allows a community to decide in real time which work needs to be done. In the other models, submitted jobs are executed in an arbitrary sequence that depends on which site they run on and the load on that particular site.
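The pilot-job idea can be sketched in a few lines of Python. This is only an illustration under assumed names: the `TaskQueue` class, the `pilot` function and the priority scheme are hypothetical and do not correspond to any experiment's actual framework.

```python
import heapq


class TaskQueue:
    """Hypothetical central queue held by the experiment, ordered by priority."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities stay first-in, first-out

    def add(self, priority, task):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, self._counter, task))
        self._counter += 1

    def next_task(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def pilot(site, queue):
    """Runs when the batch system starts the pilot; only now does it ask for work."""
    task = queue.next_task()
    if task is None:
        return f"{site}: no work, pilot exits"
    return f"{site}: running {task}"


queue = TaskQueue()
queue.add(1, "reprocessing")
queue.add(5, "urgent-analysis")

# Whichever pilot starts first picks up whatever is most urgent at that moment.
print(pilot("site-a", queue))
print(pilot("site-b", queue))
print(pilot("site-c", queue))
```

The point of the design is that the decision about what runs is taken when a slot actually becomes free, not when the placeholder was submitted.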

Having reviewed some of the issues in these different computing models, the discussion moved on to the details of how the infrastructure will be kept operating. Recall that when the LHC is operating it will be running 24/7. Even when it is not running, scientists will want to analyse the data that has previously been collected. Of course, both activities also need to take place concurrently, and to keep operating as the user community grows.

A range of issues were discussed, covering how the infrastructure is being monitored, how infrastructure usage is being recorded, how the core infrastructure services are tested on a regular basis, and how these tests are being complemented with experiment-specific tests, including how any issues resulting from these tests are handled by the site operators.
