binning comparison

Welcome back to the third part of this blog-post series. In case you missed it, the first and second parts may be found here and here. Last time we finished laying the necessary groundwork for parameter estimation, and in this post we'll look at the example presented in the INFERNO paper, and demonstrate how the typical approach in HEP would go about solving it.

For the rest of this blog-post series, we'll be using a PyTorch implementation of INFERNO which I've put together, but I'll do my best to show the relevant code, and explain what each component does. The package can be installed using:

!pip install pytorch_inferno==0.0.1

Docs are located here and the Github repo is here. I'm installing v0.0.1 here, since it was what I used when writing this post, but for your own work, you should grab the latest version.

As an additional note, these blog posts are fully runnable as Jupyter notebooks. You can either download them, or click the Google Colab badge at the top to follow along!

The example

Similar to our examples in the previous two posts, the paper considers a toy example in which two classes of process (signal and background) contribute to an observed count. The count density is defined using three arbitrary features, $\bar{x}=(x_0,x_1,x_2)$, and differences in the probability distributions functions (PDFs) of the two classes can be exploited in order to improve the precision of the parameter of interest, $\mu$, which is the signal strength scaling the contributions of the signal process: $N_o=\mu s + b$, where $N_o$ is the observed count and $s$ and $b$ are the expected contributions from the signal and background processes.

The PDFs of the two classes are specified using a 2D Gaussian distribution and an exponential distribution: $$f_b(\bar{x}) = \mathcal{N}\left((x_0,x_1)\,|\,(2,0),\begin{bmatrix}5&0\\0&9\end{bmatrix}\right)\mathrm{Exp}(x_2\,|\,3)$$ $$f_s(\bar{x}) = \mathcal{N}\left((x_0,x_1)\,|\,(0,0),\begin{bmatrix}1&0\\0&1\end{bmatrix}\right)\mathrm{Exp}(x_2\,|\,2)$$

N.B. Eq.15 in the paper states the means of the signal Gaussian are $(1,1)$, however the code on which the results are derived used $(0,0)$.

In the package, this is implemented as:

class _PaperData():
    r'''Callable class generating pseudodata from Inferno paper'''
    def __init__(self, mu:List[float], conv:List[List[float]], r:float, l:float):
        store_attr(but=['mu', 'r'])
        self.r = np.array([r,0])
        self.mu = np.array(mu)

    def sample(self, n:int) -> np.ndarray:
        return np.hstack((np.random.multivariate_normal(self.mu+self.r, self.conv, n),
                          np.random.exponential(1/self.l, size=n)[:,None]))

    def __call__(self, n:int) -> np.ndarray: return self.sample(n)

r is a nuisance parameter which can affect the mean of the Gaussian, but we'll get to that later on.

from pytorch_inferno.pseudodata import _PaperData
_PaperData(mu=[0,0], conv=[[1,0],[0,1]], r=0, l=2).sample(2)
array([[-0.78121616,  1.2277006 ,  0.12666304],
       [ 0.01064229, -2.22750248,  0.82898485]])

Sampling two points, returns the values of the three features for both points. Let's generate samples for signal and background, and plot out the features:

%%capture --no-display
from pytorch_inferno.pseudodata import PseudoData, paper_bkg, paper_sig
import seaborn as sns
n=1000
df = PseudoData(paper_sig, 1).get_df(n).append(PseudoData(paper_bkg, 0).get_df(n), ignore_index=True)
sns.pairplot(df, hue='gen_target', vars=[f for f in df.columns if f != 'gen_target'])
<seaborn.axisgrid.PairGrid at 0x1076efd68>

From the distributions above we can see that the signal (1-orange) overlaps considerably with the background (0-blue), which no particular feature offering good separation power; the features as they stand are unsuitable for inferring $\mu$ via binning. $x_1$ perhaps could be used, since the background displays a greater spread than the signal, however we can hope to do better.

Classifier training

As stated last time, what we want is a single feature in which the classes of signal and background are most linearly separable. This has to be a function of the observed features $\bar{x}$: $$f(\bar{x})=y,$$ where $y=0$ for background and $y=1$ for signal. I.e. $f$ maps $\bar{x}$ into a space in which the two processes are completely distinct. Unfortunately, such as function is not always achievable given the information available in $\bar{x}$. Instead we can attempt to approximate $f$ with a parameterisation: $$f_\theta(\bar{x})=\hat{y}\approx y,$$ Since $f_\theta$ is only an approximation, the resulting predictions $\hat{y}$ will be a continuous variable in which (hopefully) background data will be clustered close to zero and signal clustered close to one.

Depending on any domain knowledge available, suitable (monotonically related) forms of $f_\theta$ may be available in the form of high-level features: E.g. when searching for a Higgs boson decaying to a pair of tau leptons, the invariant mass of the taus should form a peak around 125 GeV for signal whereas background should not form a peak. Such a high-level feature can then either be transformed into the range [0,1] or just binned for inference directly.

A better approach (usually), especially when domain theory cannot help, however is to learn the form of $f_\theta$ from example data using a sufficiently flexible model by optimising the parameters $\theta$ such that: $$\theta=\arg\min_\theta[L(y,f_\theta(\bar{x})],$$ where the loss function $L$ quantifies the difference between the predictions of the model $f_\theta(\bar{x})$ and the targets $y$.

Common forms of $f_\theta(\bar{x})$ in HEP are Boosted Decision Trees and now more recently (deep) neural networks ((D)NNs. Eventually I'll write a new series on DNNs, but for now I can point you to my old series here. I'll proceed assuming a basic understanding of DNNs, I'm afraid, so as not to go beyond scope.

Let's train a DNN as a binary classifier to predict whether data belongs to signal or background. The pytorch_inferno package includes a basic boilerplate code for yielding data and training DNNs.

As per the paper, we'll use a network with three hidden layers network and ReLU activations to map the three input features to a single output with a sigmoid activation.

from pytorch_inferno.utils import init_net
from torch import nn
net = nn.Sequential(nn.Linear(3,100),  nn.ReLU(),
                    nn.Linear(100,100),nn.ReLU(),
                    nn.Linear(100,1),  nn.Sigmoid())
init_net(net)  # Initialise weights and biases

The ModelWrapper class then provides support for training the network, and saving, loading, and predicting.

from pytorch_inferno.model_wrapper import ModelWrapper
model = ModelWrapper(net)

A training and testing dataset can be generated and wrapped in a class to yield mini-batches.

from pytorch_inferno.data import get_paper_data
data, test = get_paper_data(n=200000, bs=32, n_test=1000000)

Now let's train the model. In contrast to the paper, we'll use ADAM for optimisation, at a slightly higher learning rate, to save time. We'll also avoid overtraining by saving when the model improves using callbacks. The Loss is binary cross-entropy, since we want to train a binary classifier.

from fastcore.all import partialler
from torch import optim
from pytorch_inferno.callback import LossTracker, SaveBest, EarlyStopping
model.fit(200, data=data, opt=partialler(optim.Adam,lr=8e-4), loss=nn.BCELoss(),
          cbs=[LossTracker(),SaveBest('weights/best_bce.h5'),EarlyStopping(5)])
<progress value='20' class='' max='200', style='width:300px; height:20px; vertical-align: middle;'></progress> 10.00% [20/200 04:04<36:44]
<progress value='3125' class='' max='3125', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [3125/3125 00:03<00:00]
1: Train=0.3446095490694046 Valid=0.3378930442857742
2: Train=0.3353524197614193 Valid=0.3383234238386154
3: Train=0.3335498013615608 Valid=0.3354001357984543
4: Train=0.3327178750550747 Valid=0.334430864071846
5: Train=0.3327170829820633 Valid=0.3354277195715904
6: Train=0.3322705823850632 Valid=0.33362953107357024
7: Train=0.3319925680863857 Valid=0.3360459116411209
8: Train=0.3318649928891659 Valid=0.3335313368606567
9: Train=0.3317312321817875 Valid=0.33429972858905793
10: Train=0.33157649008274076 Valid=0.3346566602230072
11: Train=0.3315507458817959 Valid=0.33596284610271454
12: Train=0.331496762791872 Valid=0.33324935112953186
13: Train=0.3312726239180565 Valid=0.3352210135364532
14: Train=0.3312872124123573 Valid=0.3343600254631042
15: Train=0.3312903489100933 Valid=0.33351335374832153
16: Train=0.33124875135302545 Valid=0.33321514286994935
17: Train=0.3312225800585747 Valid=0.3346719899749756
18: Train=0.33116737874269486 Valid=0.3332399683332443
19: Train=0.33107781036496164 Valid=0.3342269439792633
20: Train=0.3311147364091873 Valid=0.33339463568210603
21: Train=0.33108707209706306 Valid=0.3348931646203995
Early stopping
Loading best model with loss 0.33321514286994935

Having trained the model, we can now check the predictions on the test set

preds = model._predict_dl(test)
<progress value='15625' class='' max='15625', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [15625/15625 00:14<00:00]
import pandas as pd
df = pd.DataFrame({'pred':preds.squeeze()})
df['gen_target'] = test.dataset.y
df.head()
pred gen_target
0 0.298936 1.0
1 0.757607 1.0
2 0.758828 1.0
3 0.944337 1.0
4 0.446215 1.0
from pytorch_inferno.plotting import plot_preds
plot_preds(df)

From the above plot, we can see that the model was able to maps both classes towards their target values. It also shows good agreement with Fig. 3a in the paper.

Parameter estimation

No nuisance parameters

Now that we have learnt a feature (the DNN predictions) in which the signal and background a highly separated, we're now in a position to estimate our parameter of interest by binning the PDFs of signal and background, as per the last post.

Similar to the paper, we'll bin the predictions in 10 bins of equal size:

from pytorch_inferno.inference import bin_preds
import numpy as np
bin_preds(df, np.linspace(0,1,11))  # Bins each prediction

df.head()
pred gen_target pred_bin
0 0.298936 1.0 2
1 0.757607 1.0 7
2 0.758828 1.0 7
3 0.944337 1.0 9
4 0.446215 1.0 4

Now we want to get the PDFs for signal and background

from pytorch_inferno.inference import get_shape
f_s,f_b = get_shape(df,targ=1),get_shape(df,targ=0)  # Gets normalised shape for each class
f_s,f_b
(tensor([0.0065, 0.0099, 0.0137, 0.0189, 0.0274, 0.0431, 0.0716, 0.1421, 0.3973,
         0.2695]),
 tensor([0.6139, 0.0616, 0.0427, 0.0358, 0.0343, 0.0361, 0.0383, 0.0470, 0.0686,
         0.0217]))

In this example (and generally in HEP) we use simulated data to estimate the PDFs of signal and background. The expected yields (overall normalisation) are separate parameters. This is to say that although we trained our model on 200,000 datapoints, and estimated the PDFs on 1,000,000 data points, these sample sizes need not have any relationship to the expected yields of signal and background; for the paper example, the background yield is 1000, and the signal yield is our parameter of interest, $\mu=50$.

At this stage of a proper HEP analysis, we would still be optimising our analysis and so will not have access to the observed data. Instead, as mentioned last time, we can check expected performance by computing the Asimov dataset; the expected yield per bin, where any parameters are at their expected yield. I.e., the Asmiov yield in bin i will be: $$N_a=50\cdot f_{s,i}+1000\cdot f_{b,i}$$

asimov = (50*f_s)+(1000*f_b)
asimov, asimov.sum()
(tensor([614.2289,  62.0748,  43.3652,  36.7367,  35.6574,  38.2549,  41.9260,
          54.0901,  88.5101,  35.1559]), tensor(1050.))

Now we can compute the negative loglikelihood for a range of $\mu$ values, as usual

import torch
n = 1050
mu = torch.linspace(20,80,61)
nll = -torch.distributions.Poisson((mu[:,None]*f_s)+(1000*f_b)).log_prob(asimov).sum(1)
nll
tensor([31.9203, 31.7625, 31.6112, 31.4658, 31.3276, 31.1947, 31.0683, 30.9478,
        30.8333, 30.7245, 30.6218, 30.5242, 30.4327, 30.3465, 30.2655, 30.1904,
        30.1201, 30.0554, 29.9959, 29.9413, 29.8918, 29.8473, 29.8075, 29.7725,
        29.7428, 29.7177, 29.6973, 29.6814, 29.6701, 29.6635, 29.6615, 29.6634,
        29.6703, 29.6809, 29.6964, 29.7159, 29.7392, 29.7673, 29.7990, 29.8351,
        29.8750, 29.9190, 29.9668, 30.0181, 30.0739, 30.1335, 30.1964, 30.2634,
        30.3340, 30.4083, 30.4864, 30.5676, 30.6529, 30.7412, 30.8336, 30.9290,
        31.0277, 31.1302, 31.2356, 31.3449, 31.4572])
from pytorch_inferno.plotting import plot_likelihood
plot_likelihood(nll-nll.min())

Comparing the width to Tab.2 in the paper (NN classifier, Benchmark 0 = 14.99), our value seems to be in good agreement!

Nuisance parameters

There are two ways in which nuisance parameters can affect our measurement: 1) the nuisance parameter leads to a shift in the PDF of a class, but the overall normalisation stays the same; 2) the PDF stays the same, but the normalisation changes. Both of these types of nuisances can be constrained by auxiliary measurements, as described in part-1.

The paper considers a range of benchmarks for how the presence of different nuisances can affect the precision of the measurement:

  1. The mean of the Gaussian distribution of the background is allowed to shift in one dimension by an amount $r$
  2. In addition to 1) the rate of the Exponential distribution, $\lambda$, of the background is allowed to change
  3. Similar to 2), except $r$ and $\lambda$ are constrained by auxiliary measurements
  4. Similar to 3), except now the overall background rate, $b$ is allowed to shift and is constrained by an auxiliary measurement

The uncertainty of $r$ and $\lambda$ leads to changes in the input features of our model, and so leads to changes in the PDF of the background distribution of the inference feature. When attempting to minimise the negative log-likelihood, by optimising the nuisance parameters, these shifts must be accounted for: if the $r$ shift is +0.2, then all of the $x_0$ features for the background are shifted by +0.2, which means that the predictions of the DNN will also change in some more complex way. Similarly, if $\lambda$ is increased by 0.5, then the whole of $x_3$ for background are multiplied by a factor $3.5/3$.

We are lucky in this example, that the analytic effects on the input features of these two nuisances can be computed; sometimes this is not the case. In HEP, simulated data are generated by stochastic modelling, and the exact effect on a given variable of say jet energy scale or tau ID cannot be analytically derived. The approach, then, is to generate additional datasets in which the values affected by the systematics are moved up and down according to their uncertainty (e.g. one would have a nominal dataset, and then JES_UP and JES_DOWN datasets in which the energy scale is shift up/down by 1-sigma). These additional datasets are then also fed through the DNN, in order to arrive at up/down shifts in the PDFs. Since the analytical effects on the input features in our example systematics are known, we can simply adjust the existing dataset and pass it through our model.

Shape variations

Let's now compute all the shifts in the background PDF. Following the paper, $r$ is shifted by $\pm0.2$, and $\lambda$ by $\pm0.5$. Remember that these shifts only affect the background process. get_paper_syst_shapes will shift the data, recompute the new model predictions, and then compute the new PDF for background

from pytorch_inferno.inference import get_paper_syst_shapes
bkg = test.dataset.x[test.dataset.y.squeeze() == 0]  # Select the background test data
b_shapes = get_paper_syst_shapes(bkg, df, model=model, bins=np.linspace(0,1,11))
Running: r=-0.2
<progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [1/1 00:04<00:00]
Running: r=0
<progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [1/1 00:04<00:00]
Running: r=0.2
<progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [1/1 00:05<00:00]
Running: l=2.5
<progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [1/1 00:04<00:00]
Running: l=3
<progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [1/1 00:04<00:00]
Running: l=3.5
<progress value='1' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [1/1 00:04<00:00]

The function also saves the new predicted bin to the DataFrame, allowing us to see how the background shifts compared to the nominal prediction.

plot_preds(df, pred_names=['pred', 'pred_-0.2_3', 'pred_0.2_3', 'pred_0_2.5', 'pred_0_3.5'])

So we can see that the changes in the nuisance parameters really do result in shifts in the background PDF. In b_shapes we have a a dictionary of the nominal, up, and down shifts according to changes in $r$ and $\lambda$.

b_shapes
OrderedDict([('f_b_nom',
              tensor([0.6139, 0.0616, 0.0427, 0.0358, 0.0343, 0.0361, 0.0383, 0.0470, 0.0686,
                      0.0217])),
             ('f_b_up',
              tensor([[0.6361, 0.0595, 0.0414, 0.0344, 0.0322, 0.0345, 0.0358, 0.0434, 0.0628,
                       0.0198],
                      [0.6093, 0.0615, 0.0430, 0.0356, 0.0340, 0.0359, 0.0382, 0.0470, 0.0687,
                       0.0268]])),
             ('f_b_dw',
              tensor([[0.5921, 0.0634, 0.0446, 0.0374, 0.0356, 0.0377, 0.0409, 0.0502, 0.0744,
                       0.0237],
                      [0.6184, 0.0617, 0.0428, 0.0357, 0.0344, 0.0364, 0.0387, 0.0468, 0.0682,
                       0.0168]]))])

These shifts, however, are discrete and our optimisation of the values of $r$ and $\lambda$ needs to happen in a continuous manner, but generating new simulated data at the desired nuisance values is too time consuming. Instead we can interpolate between the systematic shifts and the nominal values in each bin in order to estimate the expected shape of the PDF for proposed values of the nuisances. Additionally, such shifts need to happen in a differentiable manner, so that we can optimise the nuisance values using the Newtonian method, as per part-1.

The interp_shape function performs a quadratic interpolation between up/down shifts and the nominal value, and a linear extrapolation outside the range of up/down shifts, and does so in a way that allows the resulting values to be differentiated w.r.t. the nuisance parameters. For example in bin zero, we can plot (in red) the computed PDF values at up (1.0), nominal (0.0), and down (-1.0), and also then plot the results of the interpolation and extrapolation for shifts in $r$ (blue scatter). The corresponding PDF value is shown in the y-axis.

from pytorch_inferno.inference import interp_shape
from torch import Tensor
import matplotlib.pyplot as plt

i = 0
d = b_shapes['f_b_dw'][0][i]
n = b_shapes['f_b_nom'][i]
u = b_shapes['f_b_up'][0][i]
interp = []
rs = np.linspace(-2,2)
for r in rs: interp.append(interp_shape(Tensor((r,0))[None,:], **b_shapes)[0][i].data.item())
plt.scatter(rs, interp)
plt.plot([-1,0,1],[d,n,u], label=i, color='r')
[<matplotlib.lines.Line2D at 0x1a37954438>]

Benchmark 2

Let's now evaluate our trained model under Benchmark 2, in which both $r$ and $\lambda$ are completely free. The process of minimising the NLL at each value of $\mu$ is now slightly more complicated than before since we now have multiple nuisances, and so need to compute the full Jacobian and Hessian matrices (as warned in part-1).

Unfortunately here either my PyTorch skills fail me, or PyTorch lacks sufficient flexibility, but I haven't yet managed to perform these calculations in a batch-wise manner. Meaning that the NLL optimsation must be done in series for every value of $\mu$, rather than in parallel as before. I think the problem is that PyTorch is overly secure and refuses to compute gradients when not all variables are present, even if they aren't necessary to compute the gradients. So if anyone has any ideas, please let me know!

Anyway, let's use the calc_profile function to minimise the nuisances for a range of signal strengths and return the NLL

%%capture --no-display
from pytorch_inferno.inference import calc_profile
profiler = partialler(calc_profile, f_s=f_s, n=1050, mu_scan=torch.linspace(20,80,61), true_mu=50, n_steps=100)
nll = profiler(**b_shapes).cpu().detach()
plot_likelihood(nll-nll.min())
<progress value='61' class='' max='61', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [61/61 00:10<00:00]

Right, so the uncertainty has gone way up! What a shame.

Benchmark 3

We can slightly improve things by including auxiliary measurements on $r$ and $\lambda$. Benchmark 3 in the paper considers constraints as Gaussian distributions with standard deviations equal to 2, so let's include those in the NLL minimisation.

N.B. in contrast to the paper and paper code, in the implementation here, systematics are treated as new parameters perturbing existing parameters from their nominal values, rather than being the existing parameters themselves.

%%capture --no-display
alpha_aux = [torch.distributions.Normal(0,2), torch.distributions.Normal(0,2)]
nll = profiler(alpha_aux=alpha_aux, **b_shapes).cpu().detach()
plot_likelihood(nll-nll.min())
<progress value='61' class='' max='61', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [61/61 00:13<00:00]

The constraints restore some sensitivity, but we're still doing much worse than before, when we didn't consider the nuisance parameters.

Normalisation uncertainty

The second way that nuisances can affect the measurement, is by globally rescaling the background contributions (i.e. letting $b$ float when minimising the NLL). As we saw in part-1 these kinds of nuisances really need to be constrained, however with the binned approach, we do have some slight resilience to them, unlike before where they completely killed our sensitivity.

%%capture --no-display
nll = profiler(alpha_aux=alpha_aux, float_b=True, b_aux=torch.distributions.Normal(1000,100), **b_shapes).cpu().detach()
plot_likelihood(nll-nll.min())
<progress value='61' class='' max='61', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [61/61 00:19<00:00]

Ouch, that really doesn't look good!

Closing

So what we've seen here is that by using a neural network we can map a set of weakly discriminating features down into a single powerful feature, and use it for statistical inference of parameters in our model. When applicable, this is the typical approach now in contemporary HEP, (and has been used to get some really great results over the years!).

Unfortunately, what we've also seen is that the presence of nuisance parameters (systematic uncertainties) can really spoil the performance of the measurement. Both by altering the input features, leading to flexibility in the shape of the inference variable, and by adjusting the overall normalisation of different processes contributing to the observed data (or its Asimov expectation).

The way in which we map the weak features down to the single, strong feature, and the way that we bin the resulting distribution, should ideally be done in a manner which accounts for the effects of the nuisance parameters that will later be included in the parameter estimation. This is precisely what the INFERNO method sets out to accomplish, and in the next post, we shall begin looking at how this can be achieved.