Accessing Datasets

You can explore the proton-proton data or the heavy ion data in the CERN Open Data portal. For more information about either of the collections, check the linked sections.

Downloading the Data

When you click any of the links, you will be directed to a webpage with information about the data. At the bottom of the page, under "File Indexes," you will find various "containers." These containers, which you can think of as folders, hold the actual data.

Fig.1 File indexes on the CERN Open Data Portal.

Clicking on "List Files" shows the files in a container. There may be a single file or several, since the data can be spread across multiple files.

Fig.2 Clicking on "List Files" will show you the list of files associated with a container.

To download a file, simply click on the download icon next to it.

Fig.3 Clicking on the download button will start the download of the data.

warning

Downloading data from the website can be slow, so it is only suitable if you need a few samples. For downloading entire datasets, we recommend using the CERN Open Data Client.
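As a rough illustration of the client-based route, the sketch below builds and optionally runs a download command from Python. It assumes the client is installed as the `cernopendata-client` command-line tool and that its `download-files` subcommand accepts a `--recid` option; check `cernopendata-client --help` for the options available in your version. The helper names and the record ID are hypothetical and used only for illustration.

```python
import shutil
import subprocess


def build_download_command(recid: int) -> list:
    """Return the command that asks the client to download all files of a record."""
    # Assumption: `download-files --recid <id>` is the client's download interface.
    return ["cernopendata-client", "download-files", "--recid", str(recid)]


def download_record(recid: int) -> None:
    """Run the download, failing early if the client is not installed."""
    if shutil.which("cernopendata-client") is None:
        raise RuntimeError("cernopendata-client was not found on PATH")
    subprocess.run(build_download_command(recid), check=True)


# Hypothetical record ID: take the real one from the record's page on the portal.
command = build_download_command(12345)
print(" ".join(command))
```

Only the command is built here; calling `download_record` would start the actual transfer.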

File Naming Convention

In ATLAS, we use specific nomenclature for naming files to ensure they are easily identifiable. The naming conventions vary based on the type of file (Monte Carlo simulations or detector data) to maintain clarity and organization.

Monte Carlo Simulations

The names of Monte Carlo simulation datasets are composed of several substrings, separated by dots:

campaign.dataset_id.short_description.production_step.data_format.processing_tags

Each part represents the following:

campaign: Indicates the MC simulation campaign and, when relevant, the center-of-mass energy. For example, data released from the MC20 campaign of proton-proton collisions at a center-of-mass energy of 13 TeV use "mc20_13TeV".
dataset_id: A 6- to 8-digit numerical identifier, different for each dataset.
short_description: Indicates the simulation tools used and the physical process described by the dataset. Common simulation tools include Powheg, Pythia, and Sherpa, among others. You can check the list of simulation tools or common abbreviations for more information about what you can find in this substring.
production_step: The production step that generated the dataset. For the released data it is always "deriv", from derivation.
data_format: The dataset format. All the released data is in PHYSLITE format, so this substring is always "DAOD_PHYSLITE".
processing_tags: These tags indicate the configuration of the software used in each production step in the creation of the dataset. In the released MC data there are four tags: an e-tag, for the event generation configuration; an s-tag, for the simulation configuration; an r-tag, for the reconstruction configuration; and a p-tag, for the PHYSLITE production. To understand more about the tags you can read about the MC production chain.

An example of a name would be:

mc20_13TeV.364350.Sherpa_224_NNPDF30NNLO_Diphoton_myy_0_50.deriv.DAOD_PHYSLITE.e7081_s3681_r13167_p5855

This name describes a Monte Carlo simulation from the mc20 campaign, at a center-of-mass energy of 13 TeV. It contains simulated diphoton events (events where two photons are produced), generated with Sherpa 2.2.4 using the NNPDF 3.0 parton distribution functions at next-to-next-to-leading order. This dataset focuses on events where the invariant mass of the photon pair lies between 0 and 50 GeV. The dataset is in PHYSLITE format, and we can identify it by its ID "364350".
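The convention above can be applied programmatically. The following is a minimal sketch, not an official ATLAS tool: the `MCDatasetName` class and `parse_mc_name` function are hypothetical names, and the code assumes that no field contains a dot and that the processing tags are joined by underscores, as in the example name.

```python
from dataclasses import dataclass


@dataclass
class MCDatasetName:
    campaign: str
    dataset_id: str
    short_description: str
    production_step: str
    data_format: str
    processing_tags: list


def parse_mc_name(name: str) -> MCDatasetName:
    """Split an MC dataset name into its six dot-separated fields."""
    parts = name.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}")
    campaign, dataset_id, description, step, data_format, tags = parts
    # Assumption: individual tags are separated by underscores.
    return MCDatasetName(
        campaign, dataset_id, description, step, data_format, tags.split("_")
    )


example = (
    "mc20_13TeV.364350.Sherpa_224_NNPDF30NNLO_Diphoton_myy_0_50"
    ".deriv.DAOD_PHYSLITE.e7081_s3681_r13167_p5855"
)
parsed = parse_mc_name(example)
print(parsed.dataset_id)
print(parsed.processing_tags)
```

The short description is kept as a single string because its internal structure (generator, PDF set, process, mass range) varies from dataset to dataset.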

Detector Data

The naming convention for detector data is similar in form, but differs in content:

project.dataset_id.data_stream.production_step.data_format.processing_tags

project: Indicates the data-taking year and the center-of-mass energy.
dataset_id: An 8-digit numerical identifier, different for each dataset.
data_stream: The released datasets include two primary data streams. The first is "physics_Main", the main physics data stream used for general-purpose physics analyses. The second is "physics_HardProbes", which contains data from lead nucleus collisions.
production_step: The production step that generated the dataset. For the released data it is always "deriv", from derivation.
data_format: The dataset format. All the released data is in PHYSLITE format, so this substring is always "DAOD_PHYSLITE".
processing_tags: For detector data this includes an r-tag and one or more p-tags.

An example of a detector dataset name:

data16_13TeV.00298633.physics_Main.deriv.DAOD_PHYSLITE.r13286_p4910_p5631

This is a detector dataset from the data taking period of 2016. It is a dataset for general-purpose physics analyses. It is in PHYSLITE format, and we can identify it by its ID "00298633".
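A detector dataset name can be decoded in a similar way. The sketch below is illustrative only: `parse_data_name` is a hypothetical helper, and it assumes that the project field has the form `dataYY_<energy>` with a two-digit year in the 2000s, and that the first letter of each processing tag gives its type.

```python
import re


def parse_data_name(name: str) -> dict:
    """Split a detector dataset name and decode the year and tag types."""
    parts = name.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}")
    project, dataset_id, stream, step, data_format, tags = parts

    # Assumption: the project looks like "data16_13TeV" (two-digit year, then energy).
    match = re.fullmatch(r"data(\d{2})_(.+)", project)
    if match is None:
        raise ValueError(f"unexpected project field: {project!r}")

    # Group tags by their leading letter, e.g. "r" or "p".
    tags_by_type = {}
    for tag in tags.split("_"):
        tags_by_type.setdefault(tag[0], []).append(tag)

    return {
        "project": project,
        "year": 2000 + int(match.group(1)),
        "energy": match.group(2),
        "dataset_id": dataset_id,  # kept as a string to preserve leading zeros
        "data_stream": stream,
        "production_step": step,
        "data_format": data_format,
        "tags_by_type": tags_by_type,
    }


example = "data16_13TeV.00298633.physics_Main.deriv.DAOD_PHYSLITE.r13286_p4910_p5631"
info = parse_data_name(example)
print(info["year"], info["data_stream"], info["tags_by_type"])
```

Keeping the identifier as a string matters here: converting "00298633" to an integer would drop the leading zeros that are part of the name.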
