Finding and Downloading Datasets
The ATLAS Collaboration, as part of its commitment to open science, has made an extensive array of data for research available through the CERN Open Data portal.
Search for Data
Inside the main entry point of the open data for research you will find eleven different links that point to different datasets. These are explained below.
We have two datasets with detector data:
- Run 2 2015 proton-proton collision data: Detector data for the 2015 run, with an integrated luminosity of ~3.2 fb⁻¹
- Run 2 2016 proton-proton collision data: Detector data for the 2016 run, with an integrated luminosity of ~33 fb⁻¹
We have seven groups for Standard Model Monte Carlo samples. Four for nominal samples:
- MC simulation electroweak boson nominal samples.
- MC simulation Higgs nominal samples.
- MC simulation QCD jet nominal samples.
- MC simulation top nominal samples.
Three for alternative samples for the calculation of systematic variations (since the systematics for electroweak boson are included in the nominal datasets):
- MC simulation Higgs systematic variation samples.
- MC simulation QCD jet systematic variation samples.
- MC simulation top systematic variation samples.
And two groups for Beyond the Standard Model signal samples:
Downloading the Data
When you click any of the links, you will be directed to a webpage with information about the data. At the bottom of the page, under "File Indexes," you will find various "containers." These containers, which you can think of as folders, hold the actual data.
Fig.1 File indexes on the CERN Open Data Portal.
If you click on "List Files," you may find a single file or several. The data may be spread across multiple files.
Fig.2 Clicking on "List Files" will show you the list of files associated to a container.
To download a file, simply click on the download icon next to it.
Fig.3 Clicking on the download button will start the download of the data.
Downloading data from the website can be slow, so it is practical only if you need a few samples. To download entire datasets, we recommend using the CERN Open Data Client.
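As a rough sketch of how a bulk download could be scripted, the snippet below builds the `cernopendata-client download-files --recid <id>` command and, optionally, runs it. The helper names (`build_download_command`, `download_record`) and the record ID are hypothetical and only for illustration. The sketch assumes the client is installed and that you have read the record ID off the page of the dataset you want.

```python
import shutil
import subprocess


def build_download_command(record_id):
    """Return the command line that downloads every file of one record."""
    return ["cernopendata-client", "download-files", "--recid", str(record_id)]


def download_record(record_id):
    """Run the download. Requires cernopendata-client to be installed."""
    if shutil.which("cernopendata-client") is None:
        raise RuntimeError(
            "cernopendata-client not found; try `pip install cernopendata-client`"
        )
    # check=True raises CalledProcessError if the client exits with an error.
    subprocess.run(build_download_command(record_id), check=True)


# Hypothetical record ID, for illustration only: replace it with the ID of
# the record you actually want before calling download_record().
EXAMPLE_RECORD_ID = 12345
print(" ".join(build_download_command(EXAMPLE_RECORD_ID)))
```

Printing the command first makes it easy to check what will be run before starting a download that may transfer a large amount of data.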
File Naming convention
In ATLAS, we use specific nomenclature for naming files to ensure they are easily identifiable. The naming conventions vary based on the type of file (Monte Carlo simulations or detector data) to maintain clarity and organization.
Monte Carlo Simulations
The names of Monte Carlo simulation datasets are composed of several substrings separated by dots:
campaign.dataset_id.short_description.production_step.data_format.processing_tags
Each part represents the following:
- campaign: Indicates the MC simulation campaign and, when relevant, the center-of-mass energy. For example, the released data from the MC20 campaign of proton-proton collisions at a center-of-mass energy of 13 TeV use "mc20_13TeV".
- dataset_id: A 6- to 8-character numerical identifier, different for each dataset.
- short_description: Indicates the simulation tools used and the physical process described by the dataset. Common simulation tools include Powheg, Pythia, and Sherpa, among others. You can check the list of simulation tools or common abbreviations for more information about what you can find in this substring.
- production_step: The production step that generated the dataset. For the released data it is always "deriv", from derivation.
- data_format: The dataset format. All the released data is in PHYSLITE format, so this substring is always "DAOD_PHYSLITE".
- processing_tags: These tags indicate the configuration of the software used in each production step in the creation of the dataset. The released MC data have four tags: an e-tag, for the event generation configuration; an s-tag, for the simulation configuration; an r-tag, for the reconstruction configuration; and a p-tag, for the PHYSLITE production. To understand more about the tags you can read about the MC production chain.
An example of a name would be:
mc20_13TeV.364350.Sherpa_224_NNPDF30NNLO_Diphoton_myy_0_50.deriv.DAOD_PHYSLITE.e7081_s3681_r13167_p5855
This is a Monte Carlo simulation from the mc20 campaign, at a center-of-mass energy of 13 TeV. It contains simulated diphoton events (events where two photons are produced), generated with Sherpa 2.2.4 using the NNPDF 3.0 parton distribution functions at next-to-next-to-leading order. The dataset covers events where the invariant mass of the photon pair lies between 0 and 50 GeV. It is in PHYSLITE format, and we can identify it by its ID "364350".
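The convention above can be unpacked programmatically. The following is a minimal sketch, not an official ATLAS tool: `MCDatasetName`, `parse_mc_name`, and `describe_tags` are hypothetical names, and the code assumes that no substring contains a dot and that each tag starts with the letter identifying its production step.

```python
from dataclasses import dataclass

# Meaning of the first letter of each processing tag, as described above.
TAG_MEANINGS = {
    "e": "event generation",
    "s": "simulation",
    "r": "reconstruction",
    "p": "PHYSLITE production",
}


@dataclass
class MCDatasetName:
    campaign: str
    dataset_id: str
    short_description: str
    production_step: str
    data_format: str
    processing_tags: list


def parse_mc_name(name):
    """Split an MC dataset name into its six dot-separated substrings."""
    parts = name.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}")
    campaign, dataset_id, description, step, data_format, tags = parts
    # Tags are joined with underscores, e.g. "e7081_s3681_r13167_p5855".
    return MCDatasetName(
        campaign, dataset_id, description, step, data_format, tags.split("_")
    )


def describe_tags(tags):
    """Map each tag to the production step suggested by its first letter."""
    return {tag: TAG_MEANINGS.get(tag[0], "unknown") for tag in tags}


parsed = parse_mc_name(
    "mc20_13TeV.364350.Sherpa_224_NNPDF30NNLO_Diphoton_myy_0_50"
    ".deriv.DAOD_PHYSLITE.e7081_s3681_r13167_p5855"
)
print(parsed.dataset_id, parsed.data_format)
print(describe_tags(parsed.processing_tags))
```

Splitting on dots first and on underscores only inside the tag field matters, because the short description itself contains underscores.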
Detector Data
The naming convention for detector data is similar in form, but differs in content:
project.dataset_id.data_stream.production_step.data_format.processing_tags
- project: Indicates the data taking year and the center of mass energy.
- dataset_id: An 8-character numerical identifier, different for each dataset.
- data_stream: The released datasets include two primary data streams. The first is "physics_Main", the main physics data stream used for general-purpose physics analyses. The second is "physics_HardProbes", which contains data from lead-nucleus collisions.
- production_step: The production step that generated the dataset. For the released data it is always "deriv", from derivation.
- data_format: The dataset format. All the released data is in PHYSLITE format, so this substring is always "DAOD_PHYSLITE".
- processing_tags: For detector data this includes an r-tag and one or more p-tags.
An example of a detector dataset name:
data16_13TeV.00298633.physics_Main.deriv.DAOD_PHYSLITE.r13286_p4910_p5631
This is a detector dataset from the data taking period of 2016. It is a dataset for general-purpose physics analyses. It is in PHYSLITE format, and we can identify it by its ID "00298633".
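A detector dataset name can be taken apart in the same way. The sketch below is illustrative only: `parse_data_name` is a hypothetical helper, and it assumes that the project substring has the form "dataYY_energy" with a two-digit year in the 2000s, as in the example above.

```python
import re

# Assumed form of the project substring: "data" + two-digit year + "_" + energy.
PROJECT_PATTERN = re.compile(r"^data(\d{2})_(.+)$")


def parse_data_name(name):
    """Split a detector dataset name into labelled pieces."""
    parts = name.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}")
    project, dataset_id, stream, step, data_format, tags = parts

    match = PROJECT_PATTERN.match(project)
    if match is None:
        raise ValueError(f"unexpected project substring: {project!r}")

    tag_list = tags.split("_")
    return {
        "year": 2000 + int(match.group(1)),  # "16" -> 2016 (assumption)
        "energy": match.group(2),
        "dataset_id": dataset_id,  # kept as a string to preserve leading zeros
        "data_stream": stream,
        "production_step": step,
        "data_format": data_format,
        "r_tags": [t for t in tag_list if t.startswith("r")],
        "p_tags": [t for t in tag_list if t.startswith("p")],
    }


info = parse_data_name(
    "data16_13TeV.00298633.physics_Main.deriv.DAOD_PHYSLITE.r13286_p4910_p5631"
)
print(info["year"], info["data_stream"], info["p_tags"])
```

The identifier is kept as a string because converting "00298633" to an integer would drop the leading zeros that are part of the name.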