Too much data – a good problem to have

Author: Eckhard Elsen, Director for Research and Computing

Large volumes of high-quality data may be a challenge, but addressing it brings innovation

Last week, the 22nd International Conference on Computing in High-Energy and Nuclear Physics, CHEP 2016, took place in San Francisco, attracting some 500 experts from all over the world. This gave the LHC experiments a great opportunity to showcase the impressive progress they have made in mastering the ever-increasing data volumes and to highlight their plans for the High-Luminosity period of the LHC.

The experiments have made a fantastic effort to optimise their code and minimise unnecessary copying of data. Triggering is becoming more sophisticated with the inclusion of track and vertex information, allowing ATLAS and CMS to be more selective in what they record. Meanwhile, LHCb has introduced its turbo stream, which serves some 80% of the collaboration's analyses; it is based on a compact record containing all the information needed for analysis. ALICE is adopting a similar approach, blurring the division between online and offline: it records data from all events without a trigger decision, while reducing the amount of data stored per event.
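
To illustrate the compact-record idea in the abstract, here is a minimal Python sketch of trading full raw events for analysis-level summaries. It is purely illustrative: the class names and fields are invented for this column and do not reflect any experiment's actual data model or software.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RawEvent:
    # Full event as read out: bulky detector-level information.
    event_id: int
    detector_hits: List[float]
    reconstructed_candidates: List[Dict[str, float]]  # e.g. momenta, vertex positions

@dataclass
class CompactRecord:
    # Analysis-level summary: keeps only what analyses actually use.
    event_id: int
    candidates: List[Dict[str, float]]

def to_compact(raw: RawEvent) -> CompactRecord:
    """Drop the bulky raw hits; keep the reconstructed candidates."""
    return CompactRecord(event_id=raw.event_id, candidates=raw.reconstructed_candidates)

if __name__ == "__main__":
    raw = RawEvent(event_id=1,
                   detector_hits=[0.1] * 1000,
                   reconstructed_candidates=[{"pt": 5.2, "vz": 0.03}])
    print(to_compact(raw))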

With the LHC performing as well as it is, this is welcome news: machine availability has almost doubled. As a consequence, the experiments have recorded more events than anticipated so far in Run 2, and they continue to exceed their allocated computing resources. Too much high-quality data may be a challenge, but it is a good problem to have.

Progress like this keeps CERN in the vanguard of high-throughput computing (HTC). This matters not only for us, but also because it allows us to share experience with other fields of science in which HTC is becoming increasingly important. The conference programme at CHEP was bustling with presentations on new software tools, machine learning and progress in making effective use of the multi-core processors in modern computing platforms. The experiments are joining forces via the HEP Software Foundation. Key to LHC computing, however, is the development of the network itself, where the rate of progress has not slowed. National and transcontinental networks therefore figured prominently at the conference: with sufficient bandwidth in place, the physical location of a computing resource becomes largely irrelevant.
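
As a purely illustrative aside on the multi-core point, the toy Python sketch below spreads a CPU-bound per-event calculation across all available cores. The workload and function names are invented for illustration and are not taken from any experiment's production code.

from multiprocessing import Pool

def process_event(event_id: int) -> float:
    # Stand-in for CPU-bound per-event reconstruction work.
    return sum((event_id * i) % 97 for i in range(10_000)) / 10_000

if __name__ == "__main__":
    # Pool() starts one worker process per available CPU core by default.
    with Pool() as pool:
        results = pool.map(process_event, range(100))
    print(f"processed {len(results)} events")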

And that brings me to another recent conference, the International Conference on Research Infrastructures, ICRI, held in Cape Town from 3 to 5 October. There’s a good reason why ICRI was in South Africa this year: the country co-hosts an exciting new research infrastructure, the Square Kilometre Array, SKA, the world’s largest radio telescope. A precursor to the SKA, MeerKAT, is up and running, but it represents only a small fraction of the final SKA configuration. Once complete in 2025, the SKA will bring together dishes in South Africa and Australia with a total collecting area of one square kilometre. They will all be on stream all the time, producing data volumes that dwarf even those of the LHC.

South Africa already hosts a WLCG Tier 2 computing centre, and there was some discussion at ICRI on how to build on this to bring in other areas of science, such as the SKA. One way forward is for South Africa to build a Science Cloud – a public sector facility for scientific computing. Science Clouds are, I believe, the way forward for public sector science and an evolution of the WLCG. Such a facility would be a wonderful showcase for scientific cloud computing, and an asset for South African science.

It’s been an interesting few weeks for scientific computing, leading me to conclude that CERN remains in the vanguard not simply because of our high data volumes, but because we're developing new tools to deal with them. The bottom line for me is that we have much to give, and we have much to learn from others. In scientific computing, interdisciplinary collaboration is the future.