Combining Samples

Many of the common Standard Model processes are not generated as a single inclusive process, but as many individual samples that have to be combined in order to be used in a physics analysis. In many cases, it's not obvious which samples are "safe" (or necessary) to combine. This is tricky for a variety of reasons, and this page is intended to help guide users through some of the common issues.

The metadata page is an extremely useful reference for identifying samples that are meant for specific use-cases (e.g. as the baseline estimate for a physics process, or for calculating a systematic uncertainty). With a bit of practice, reading the "Physics short" (the short description of a sample) becomes easier and can reveal most of the key features of a sample.

The QCD jet samples provide a nice example of a combination. They are named "JZ0", "JZ1", "JZ2", and so on, with each sample corresponding to a specific range of leading jet transverse momenta. In order to create a complete, smooth jet spectrum, one must combine all of these samples, weighting each according to its cross section, filter efficiency, and k-factor. An analysis that requires a 500 GeV jet does not need to consider the first few samples. "Slicing" the transverse momentum spectrum gives an analysis roughly constant statistical uncertainties with increasing jet transverse momentum, without the need to generate (and analyze) billions of events.
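The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the official recipe: the `Slice` class, its field names, and all of the numbers are hypothetical placeholders, and dividing by the sum of generator weights and scaling to a target luminosity are assumptions based on common practice rather than something stated on this page.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    """Hypothetical container for the per-sample normalization inputs."""
    name: str
    cross_section_pb: float   # generator cross section, in pb
    filter_efficiency: float  # fraction of generated events passing the filter
    k_factor: float           # higher-order correction factor
    sum_of_weights: float     # sum of generator weights over the full sample


def normalization_weight(s: Slice, luminosity_inv_pb: float) -> float:
    """Per-event scale factor that normalizes one slice to the target luminosity."""
    expected_events = (
        luminosity_inv_pb * s.cross_section_pb * s.filter_efficiency * s.k_factor
    )
    return expected_events / s.sum_of_weights


# Placeholder numbers only; real values come from the metadata page.
slices = [
    Slice("JZ1", cross_section_pb=1000.0, filter_efficiency=0.5,
          k_factor=1.0, sum_of_weights=2000.0),
    Slice("JZ2", cross_section_pb=10.0, filter_efficiency=0.1,
          k_factor=1.0, sum_of_weights=500.0),
]
luminosity_inv_pb = 100.0

weights = {s.name: normalization_weight(s, luminosity_inv_pb) for s in slices}
for name, weight in weights.items():
    print(f"{name}: scale each event by {weight:g}")
```

Each event's own generator weight would then be multiplied by the scale factor of the slice it came from before the slices are added into a single histogram.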

The divisions between samples might not be consistent between event generators, which can lead to further confusion. Some event generators, like Sherpa, can generate final states directly (e.g. "four leptons") without regard to the internal configuration of the physics process. Another generator might only be able to generate "two Z boson" events that then decay into four leptons. With Z bosons and four leptons, there is likely little ambiguity. Two charged leptons and two neutrinos, however, can be created by either two Z bosons or two W bosons. Some generators will provide these in two separate samples (ZZ and WW), and others will provide these as a single sample (llvv). When comparing these two configurations, it is important to compare apples to apples, and therefore to carefully consider these sorts of issues.

Explaining the origins and combinations of all the samples available is a significant undertaking. On this page, we aim to document the most common setups and questions. As users request help with other samples, we will continue to add to these pages to document features and clarify points of confusion. In order to make the discovery process as painless as possible, related samples in the Open Data portal have been gathered into collections.

The Starting Point

In general, it is a good idea to carefully consider the physics of an analysis (e.g. the final state being considered) before making a list of samples to be used. For example, QCD jet production samples do not need to be used when considering events with at least two leptons. If an analysis does not specifically select events with high-momentum photons, the inclusive top-quark production samples are sufficient and the top-quark+photon samples are probably not necessary. A little bit of intuition and careful thinking can save a great deal of processing time down the line.

Samples that are tagged with the keyword "Baseline" on the metadata page are meant to provide a solid starting point for all analyses. Samples tagged with "Systematic" are meant to be used for systematic uncertainty derivation (e.g. when one generator configuration can be compared to another to derive a systematic uncertainty). Samples tagged with "Alternative" are samples that are not normally used as baseline or to derive a systematic uncertainty, but which might be useful for analyses in specific final states or with particular kinematic features. Samples tagged with "Specialized" are meant to serve particular analyses and often don't need to be considered (e.g. QCD jet production with specific hadrons in the final state, or top samples with b-quark fragmentation variations or top mass variations).
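As a sketch of how these keywords might be used to build a first sample list, the snippet below filters a small in-memory table by tag. The record layout (`dataset_number`, `physics_short`, `keywords`), the dataset numbers, and the descriptions are all hypothetical placeholders; the real metadata may be organized differently.

```python
# Hypothetical metadata records; the field names and values are placeholders.
records = [
    {"dataset_number": 900001, "physics_short": "example_ttbar_nominal",
     "keywords": ["Baseline"]},
    {"dataset_number": 900002, "physics_short": "example_ttbar_altshower",
     "keywords": ["Systematic"]},
    {"dataset_number": 900003, "physics_short": "example_ttbar_massvar",
     "keywords": ["Specialized"]},
]


def select_by_keyword(records, keyword):
    """Return the records tagged with the given keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [
        r for r in records
        if wanted in (k.lower() for k in r["keywords"])
    ]


baseline = select_by_keyword(records, "Baseline")
systematic = select_by_keyword(records, "Systematic")
print([r["dataset_number"] for r in baseline])
```

Starting from the "Baseline" selection and adding "Systematic" samples only once uncertainties are being evaluated keeps the initial processing load small.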

Samples and generators are regularly marked obsolete and replaced. In some cases, old samples are made obsolete and new versions are provided only for 13.6 TeV analyses (those done using Run 3 data). This means, inevitably, that the lists of samples are slightly different for 13 TeV and 13.6 TeV, and additional care must be taken when examining both. If an analysis is complex enough to require correlations between 13 TeV and 13.6 TeV samples, we generally recommend testing a number of correlation schemes and identifying the one that provides the most conservative physics result (i.e. the largest uncertainty or weakest limit). Note that neither "uncorrelated" nor "correlated" necessarily leads to a larger or smaller uncertainty.

Top samples

For top quark pair production, we generally recommend using the samples 410470, 410471, and 410472 at 13 TeV, and 601229, 601230, and 601237 at 13.6 TeV. Note that the 13.6 TeV samples are disjoint by construction, while 410470 (non-all-hadronic) and 410472 (dileptonic) overlap, such that only one or the other should be used. Analyses that require significant missing transverse momentum (MET) or a large sum of jet transverse momenta (HT) in the final state can use dedicated samples filtered on MET or HT, which provide smaller statistical uncertainties.
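A simple guard against accidentally using both overlapping 13 TeV samples could look like the sketch below. The `OVERLAPPING_GROUPS` table and the `find_overlaps` helper are hypothetical names introduced for illustration; only the overlap between 410470 and 410472 is taken from the text above.

```python
# Groups of dataset numbers that must not be used together.
# Only the 410470/410472 overlap is documented above; extend as needed.
OVERLAPPING_GROUPS = [
    frozenset({410470, 410472}),
]


def find_overlaps(selected):
    """Return each group of mutually overlapping samples found in the selection."""
    chosen = set(selected)
    return [
        sorted(group & chosen)
        for group in OVERLAPPING_GROUPS
        if len(group & chosen) > 1
    ]


problems = find_overlaps([410470, 410471, 410472])
if problems:
    print("Overlapping samples selected:", problems)
```

Running the check once when the sample list is assembled is cheaper than discovering double-counted events after processing.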

For single top, we generally recommend the samples 410644-5, 410658-9, 600027-8, and 601352-5 at 13 TeV, and 601348-601355 at 13.6 TeV. These samples use the "diagram removal" (DR) scheme to treat interference; for analyses sensitive to interference between top pair production and single top production, alternative schemes are available.
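The abbreviated ranges above can be expanded into explicit dataset numbers with a small helper. This sketch assumes that a shortened end value such as the "5" in "410644-5" replaces the trailing digits of the start value; that reading of the shorthand, and the `expand_dsid_range` name, are assumptions made for illustration.

```python
def expand_dsid_range(text):
    """Expand '410644-5' or '601348-601355' into a list of dataset numbers.

    Assumes an abbreviated end value replaces the trailing digits of the start.
    """
    start_text, separator, end_text = text.strip().partition("-")
    start = int(start_text)
    if not separator:
        return [start]  # a single dataset number, no range
    # Rebuild the full end value from the start's leading digits.
    prefix = start_text[: len(start_text) - len(end_text)]
    end = int(prefix + end_text)
    if end < start:
        raise ValueError(f"range runs backwards: {text!r}")
    return list(range(start, end + 1))


single_top_13tev = []
for item in ["410644-5", "410658-9", "600027-8", "601352-5"]:
    single_top_13tev.extend(expand_dsid_range(item))

single_top_13p6tev = expand_dsid_range("601348-601355")
print(single_top_13tev)
print(single_top_13p6tev)
```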

Samples with extra W, Z, or Higgs bosons or additional top quarks do not overlap with the inclusive samples and must be included when relevant. Samples with photons or additional b-quarks generally do overlap with the inclusive samples, and the overlap must be removed; they are provided because they generally include higher-precision calculations of the photon or additional quark momenta.

Bosons

For boson samples, we generally recommend using Sherpa samples. These samples are separated by boson (Z and W separately), by heavy flavor in the final state (events with b-quarks, events without b-quarks and with c-quarks, and events with neither b- nor c-quarks), and by lepton flavor (electron, muon, and tau-lepton separately). Samples with additional electroweak vertices (vector boson scattering, VBS) are also provided separately and should be included when relevant. Samples in dilepton final states with dilepton masses between 10 and 40 GeV are also provided separately (labelled as "Drell-Yan production"). Events with dilepton masses below 10 GeV are generally not provided, as the resonance structure becomes complicated to model.

For diboson production, we generally recommend using Sherpa samples as well. These are separated by the number of charged leptons in the final state (4, 3, 2, 1, 0), and VBS is again provided separately.

Samples with extra W, Z, or Higgs bosons do not overlap with the inclusive samples and must be included when relevant. Samples with photons or additional b-quarks generally do overlap with the inclusive samples, and the overlap must be removed; they are provided because they generally include higher-precision calculations of the photon or additional quark momenta.

Samples with top quarks are generally classified as "top samples" and are included in those collections, rather than in the boson collections.
