binning comparison

Welcome back to the concluding part of this blog-post series. In case you missed the previous posts: first, second, third, and fourth.

Last time we applied INFERNO to the toy-example presented in the INFERNO paper, and demonstrated how the method made our parameter estimation more resilient to the effects of nuisance parameters that affect both the shapes of input features and the overall normalisation of the classes. Here we'll look at how we can achieve the same effect without having to know the analytical effect of the nuisance parameters on the input features.

As a reminder, we're using a PyTorch implementation of INFERNO which I've put together. The package can be installed using:

!pip install pytorch_inferno==0.0.1

Docs are located here and the Github repo is here. I'm installing v0.0.1 here, since it was what I used when writing this post, but for your own work, you should grab the latest version.

As an additional note, these blog posts are fully runnable as Jupyter notebooks. You can either download them, or click the Google Colab badge at the top to follow along!

Approximating INFERNO

The approach as presented last time, requires modelling the effect of the shape nuisances on the inputs in an analytical manner. Usually, though, this is difficult. As an example, in high-energy physics, we often have systematic uncertainties like QCD scale, various energy scales, and correction factors for upstream algorithms. These affect a range of features of the data in complex ways. Additionally, their effects are sometimes modelled by changing fundamental parameters in the Monte Carlo simulator used to generate training data.

Knowing exactly how variations in a given nuisance parameter alter the input features is not normally possible. Indeed, when we have optimised the likelihood in previous posts, we have assumed this and so haven't fully propagated the effects of the parameters through our classifier, instead we generated up/down systematic variations for the model response and interpolated between them and the nominal template (or extrapolated away from them). The trick, then, is to also do this interpolation/extrapolation inside INFERNO.

When performing a HEP analysis with systematic uncertainties, these up/down variation samples are usually generated via modifying the relevant parameters in the simulation, i.e. for an analysis one has a nominal dataset and also datasets for up and down variations for each nuisance parameter separately. The approximation of INFERNO presented here is designed to work directly on such datasets.

One key point is that we will compare nominal and systematic datasets in terms of template shape, rather than on a per data-point level. This means that the systematic datasets don't have to match the nominal dataset event-wise, i.e. we don't need to know how a given event would change if a certain parameter were altered, instead we only need to know the average effect that changing the parameter causes. This certainly simplifies things from both a simulation and implementation point of view.

Abstract implementation

The interpolation-based INFERNO loss in pytorch_inferno is implemented similar to the exact version, i.e. as a callback. Again, there is an abstract class implementing the basic functionality, and the idea is that users can inherit from it to apply to their specific problems, e.g in our case PaperInferno. Note however that even the "abstract" class is still designed for a Poisson counting problem.

Let's go through the code function by function:

def __init__(self, n:int, true_mu:float, aug_alpha:bool=False, n_alphas:int=0, n_steps:int=100, lr:float=0.1,
                 float_b:bool=False, alpha_aux:Optional[List[Distribution]]=None, b_aux:Optional[Distribution]=None):
        super().__init__()
        store_attr()
        self.true_b = self.n-self.true_mu

The initialisation is similar to the exact version: Asimov count n, and the true signal strength true_mu. The true background count is then computed automatically. n_alphas is the number of shape systematics to expect and float_b is whether or not the background normalisation should be allowed to vary. Any auxiliary measurements for the shape parameters and background yield can be passed as PyTorch distributions to the alpha_aux and b_aux arguments.

There is also the option to augment the nuisance parameters aug_alpha. This is an idea I had which is still experimental - basically the nuisances start each minibatch at random points about their nominal values and undergo Newtonian optimisation according to lr and n_steps. This could perhaps be thought of as a form of regularisation or data-augmentation. In this example it seems to work well, but as I say, it's still experimental, so I won't talk about it much here.

def on_train_begin(self) -> None:
    self.wrapper.loss_func = None
    for c in self.wrapper.cbs:
        if hasattr(c, 'loss_is_meaned'): c.loss_is_meaned = False

Unlike the exact version, when the training starts we don't need to cache a tensor of the nuisance parameters, but we do still want to ensure that the model doesn't have any other losses.

Having passed the nominal input data through the network, we call:

def on_forwards_end(self) -> None:
    r'''Compute loss and replace wrapper loss value'''
    b = self.wrapper.y.squeeze() == 0
    f_s = self.to_shape(self.wrapper.y_pred[~b])
    f_b = self.to_shape(self.wrapper.y_pred[b])
    f_b_up,f_b_dw = self._get_up_down(self.wrapper.x[b])
    self.wrapper.loss_val = self.get_ikk(f_s=f_s, f_b_nom=f_b, f_b_up=f_b_up, f_b_dw=f_b_dw)

This extracts the normalised shapes for signal and background (their PDFs) from the network predictions (self.wrapper.y_pred). Unlike the exact version, the background predictions don't include the shape nuisances. In order to evaluate these, we pass the background data to self._get_up_down, which passes up/down systematic data through the network and returns the shape templates. Currently it an abstract function to be overridden in a problem-specific class.

@abstractmethod
def _get_up_down(self, x:Tensor) -> Tuple[Tensor,Tensor]:
    r'''Compute up/down shapes. Override this for specific problem.'''
    pass

Now that we have nominal shape templates for signal and background, and the shape templates for background after +1 and -1 sigma variations of each shape-affecting nuisance parameters, we are now at a similar state to when we are evaluating the trained models on the benchmark problems in the previous posts. I.e. we can compute the negative log-likelihood, and optimise it by varying the nuisance parameters based on interpolating between the nominal and ±1-sigma variations, for fixed values of the parameter of interest.

The key point is that we actually start with the nuisances at their nominal values, and we only need to evaluate the parameter of interest at its nominal value, so no optimisation is required. The actual implementation includes code for applying the data augmentation mentioned earlier, but I'll remove it here for simplicity.

def get_ikk(self, f_s:Tensor, f_b_nom:Tensor, f_b_up:Tensor, f_b_dw:Tensor) -> Tensor:
    r'''Compute full hessian at true param values, or at random starting values with Newton updates'''
    alpha = torch.zeros((self.n_alphas+1+self.float_b), requires_grad=True, device=self.wrapper.device)
    with torch.no_grad(): alpha[0] += self.true_mu
    get_nll = partialler(calc_nll, s_true=self.true_mu, b_true=self.true_b,
                         f_s=f_s, f_b_nom=f_b_nom, f_b_up=f_b_up, f_b_dw=f_b_dw, alpha_aux=self.alpha_aux, b_aux=self.b_aux)
    nll = get_nll(s_exp=alpha[0], b_exp=self.true_b+alpha[1] if self.float_b else self.true_b, alpha=alpha[1+self.float_b:])
    _,h = calc_grad_hesse(nll, alpha, create_graph=True)
    return torch.inverse(h)[0,0]

So, first we create a tensor, alpha, to contain the parameter of interest and the nuisance parameters (all zero except the parameter of interest (PoI)), and then we compute the NLL using calc_nll (we've called this function in the past when evaluating benchmarks, but it has been wrapped by calc_profile to handle multiple PoIs). Once we have the NLL, we can then compute the INFERNO loss as usual by inverting the Hessian of the NLL w.r.t. the PoI and nuisances (calc_grad_hesse), and returning the element corresponding to the PoI.

Problem-specific class

Similar to last time, we want to inherit from our abstract implementation of INFERNO to better suit the problem at hand. PaperInferno is included to work with the toy problem in the original paper. I mentioned at the start that this approximation of INFERNO was designed to work with pre-produced datasets that contain both nominal and systematic datasets, however for this example, we'll generate the systematic datasets on the fly since we can analytically modify the inputs. If we were to do this properly, we would generate the systematic datasets beforehand, and allow the callback to sample batches from them during training. Probably I'll eventually add a class to the library that can do this.

Anyway, for now, our class overrides _get_up_down like this:

def _get_up_down(self, x:Tensor) -> Tuple[Tensor,Tensor]:
    if self.r_mods is None and self.l_mods is None: return None,None
    u,d = [],[]
    if self.r_mods is not None:
        with torch.no_grad(): x = x+self.r_mod_t[0]
        d.append(self.to_shape(self.wrapper.model(x)))
        with torch.no_grad(): x = x+self.r_mod_t[1]-self.r_mod_t[0]
        u.append(self.to_shape(self.wrapper.model(x)))
        with torch.no_grad(): x = x-self.r_mod_t[1]
    if self.l_mods is not None:
        with torch.no_grad(): x = x*self.l_mod_t[0]
        d.append(self.to_shape(self.wrapper.model(x)))
        with torch.no_grad(): x = x*self.l_mod_t[1]/self.l_mod_t[0]
        u.append(self.to_shape(self.wrapper.model(x)))
        with torch.no_grad(): x = x/self.l_mod_t[1]
    return torch.stack(u),torch.stack(d)

I.e. it modifies the current data, passes it through the model, and constructs the templates for independent up and down systematics. A proper implementation would instead sample minibatch x from the respective systematic dataset and pass it through the model

DNN training

We train the network in a similar fashion as in part-4, except we now use the approximated version of INFERNO

from pytorch_inferno.utils import init_net
from pytorch_inferno.model_wrapper import ModelWrapper
from torch import nn
def get_model() -> ModelWrapper:
    net = nn.Sequential(nn.Linear(3,100),  nn.ReLU(),
                        nn.Linear(100,100),nn.ReLU(),
                        nn.Linear(100,10),  VariableSoftmax(0.1))
    init_net(net)  # Initialise weights and biases
    return ModelWrapper(net)    
from pytorch_inferno.data import get_paper_data
data, test = get_paper_data(n=200000, bs=2000, n_test=1000000)
from pytorch_inferno.inferno_interp import PaperInferno, VariableSoftmax
from torch import distributions
inferno = PaperInferno(r_mods=(-0.2,0.2),  # Let r be a shape nuisance and specify down/up values
                       l_mods=(2.5,3.5),   # Let lambda be a shape nuisance and specify down/up values
                       float_b=True,  # Let the background yield be a nusiance
                       alpha_aux=[distributions.Normal(0,2), distributions.Normal(0,2)],  # Include auxiliary measurements on r & lambda
                       b_aux=distributions.Normal(1000,100))  # Include auxiliary measurements on the background yield

Now let's train the model. In contrast to the paper, we'll use ADAM for optimisation, at a slightly higher learning rate, to save time. We'll also avoid overtraining by saving when the model improves using callbacks. Note that we don't specify a loss argument.

from fastcore.all import partialler
from torch import optim
from pytorch_inferno.callback import LossTracker, SaveBest, EarlyStopping
model = get_model()
model.fit(200, data=data, opt=partialler(optim.Adam,lr=8e-5), loss=None,
          cbs=[inferno,LossTracker(),SaveBest('weights/best_inferno.h5'),EarlyStopping(5)])
<progress value='103' class='' max='200', style='width:300px; height:20px; vertical-align: middle;'></progress> 51.50% [103/200 16:23<15:26]
<progress value='50' class='' max='50', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [50/50 00:03<00:00]
1: Train=1903.3741125488282 Valid=1099.821317138672
2: Train=949.5193017578125 Valid=871.2384777832032
3: Train=784.0374075317383 Valid=740.7831909179688
4: Train=676.6555770874023 Valid=623.6444390869141
5: Train=588.5027008056641 Valid=579.0961938476562
6: Train=556.9197653198243 Valid=553.1921911621093
7: Train=531.2430877685547 Valid=531.6476770019531
8: Train=511.1146792602539 Valid=512.2304235839844
9: Train=496.7247784423828 Valid=496.2313848876953
10: Train=475.7442514038086 Valid=481.18335876464846
11: Train=465.06764404296877 Valid=467.6013262939453
12: Train=453.10702941894533 Valid=456.12772888183594
13: Train=441.7800866699219 Valid=447.8130987548828
14: Train=433.41732116699217 Valid=437.7292547607422
15: Train=424.49437103271487 Valid=430.0029229736328
16: Train=416.7866101074219 Valid=426.0567614746094
17: Train=412.7526251220703 Valid=419.2637548828125
18: Train=408.7203356933594 Valid=416.4257537841797
19: Train=400.82221466064453 Valid=409.8280517578125
20: Train=399.35216857910154 Valid=407.7358648681641
21: Train=394.8920980834961 Valid=404.62935546875
22: Train=392.22727661132814 Valid=400.5902557373047
23: Train=387.8922058105469 Valid=396.4555090332031
24: Train=386.4134075927734 Valid=395.3793377685547
25: Train=384.1147979736328 Valid=389.80433471679686
26: Train=380.9291650390625 Valid=386.91029174804686
27: Train=377.0340869140625 Valid=388.1162451171875
28: Train=375.92713134765626 Valid=385.6933905029297
29: Train=374.33164367675784 Valid=382.5351153564453
30: Train=371.9804290771484 Valid=380.67197082519533
31: Train=368.66335632324217 Valid=378.97169189453126
32: Train=367.6147705078125 Valid=378.1017291259766
33: Train=363.45591827392576 Valid=377.79085327148437
34: Train=363.84340896606443 Valid=374.8612762451172
35: Train=361.1047927856445 Valid=374.572109375
36: Train=360.6424502563477 Valid=372.1866046142578
37: Train=359.7146035766602 Valid=370.7572198486328
38: Train=358.75559661865236 Valid=371.33667236328125
39: Train=355.9152355957031 Valid=368.7162133789063
40: Train=355.1406524658203 Valid=370.5099066162109
41: Train=352.2504364013672 Valid=367.6858612060547
42: Train=351.0478335571289 Valid=366.42741638183594
43: Train=352.3582815551758 Valid=365.37479431152343
44: Train=349.79847747802734 Valid=364.5373815917969
45: Train=351.1881408691406 Valid=362.6999938964844
46: Train=348.9755392456055 Valid=360.88500427246095
47: Train=347.34831604003904 Valid=360.76511657714843
48: Train=345.6294583129883 Valid=360.39421447753904
49: Train=343.59228576660155 Valid=359.66538818359373
50: Train=346.0175961303711 Valid=357.86841064453125
51: Train=344.6366781616211 Valid=359.06419250488284
52: Train=341.4972285461426 Valid=356.19458435058596
53: Train=341.9684248352051 Valid=356.2013336181641
54: Train=340.62243362426756 Valid=354.7986822509766
55: Train=339.8979489135742 Valid=355.1641271972656
56: Train=340.5874548339844 Valid=354.5460137939453
57: Train=337.78608337402346 Valid=352.9345196533203
58: Train=338.2697674560547 Valid=351.7689556884766
59: Train=336.8008772277832 Valid=352.7039239501953
60: Train=336.6271438598633 Valid=350.5672857666016
61: Train=333.62806350708007 Valid=350.3022137451172
62: Train=333.2799528503418 Valid=351.6690576171875
63: Train=334.6952801513672 Valid=349.2768273925781
64: Train=334.7041619873047 Valid=349.21451782226563
65: Train=333.62857818603516 Valid=351.12249572753905
66: Train=332.74018524169924 Valid=346.973759765625
67: Train=329.8943818664551 Valid=348.11360412597656
68: Train=332.9227757263184 Valid=347.2443560791016
69: Train=329.22062530517576 Valid=346.0074743652344
70: Train=329.7236764526367 Valid=346.23931884765625
71: Train=329.0929122924805 Valid=344.219853515625
72: Train=329.3751695251465 Valid=345.1238238525391
73: Train=329.4808396911621 Valid=345.1286047363281
74: Train=328.91672256469724 Valid=345.9363116455078
75: Train=327.8580014038086 Valid=343.5010699462891
76: Train=327.675544128418 Valid=343.328955078125
77: Train=326.15745666503904 Valid=341.92960876464844
78: Train=328.9135433959961 Valid=343.7149389648437
79: Train=326.5412274169922 Valid=340.72755249023436
80: Train=327.1108027648926 Valid=342.9395086669922
81: Train=325.51490631103513 Valid=342.25386932373044
82: Train=323.1186293029785 Valid=339.41467224121095
83: Train=323.5835791015625 Valid=340.5097528076172
84: Train=323.02451416015623 Valid=340.3460504150391
85: Train=323.5133598327637 Valid=339.0770629882812
86: Train=324.4291862487793 Valid=342.1396295166016
87: Train=322.38175201416016 Valid=341.73817932128907
88: Train=321.80700439453125 Valid=340.46791748046877
89: Train=323.08181732177735 Valid=336.96313232421875
90: Train=322.5729107666016 Valid=338.9583264160156
91: Train=321.12098159790037 Valid=340.28571838378906
92: Train=322.7699349975586 Valid=337.8809533691406
93: Train=320.90445220947265 Valid=336.4360284423828
94: Train=321.4063262939453 Valid=336.58940979003904
95: Train=319.529539642334 Valid=337.05964599609376
96: Train=321.07305908203125 Valid=334.4169158935547
97: Train=319.94526412963864 Valid=337.0082684326172
98: Train=320.0385334777832 Valid=335.60974365234375
99: Train=318.8555616760254 Valid=333.65868713378904
100: Train=317.8938897705078 Valid=336.39325256347655
101: Train=318.79115554809573 Valid=334.6770684814453
102: Train=317.1104347229004 Valid=334.6987811279297
103: Train=316.3874523925781 Valid=334.5063348388672
104: Train=317.27213409423825 Valid=335.46395202636717
Early stopping
Loading best model with loss 333.65868713378904

Having trained the model, we again want to bin the predictions via hard assignments

from pytorch_inferno.inferno import InfernoPred
preds = model._predict_dl(test, pred_cb=InfernoPred())
<progress value='250' class='' max='250', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [250/250 00:12<00:00]
import pandas as pd
df = pd.DataFrame({'pred':preds.squeeze()})
df['gen_target'] = test.dataset.y
df.head()
pred gen_target
0 9 1.0
1 9 1.0
2 6 1.0
3 4 1.0
4 9 1.0
from pytorch_inferno.plotting import plot_preds
import numpy as np
plot_preds(df, bin_edges=np.linspace(0,10,11))

Parameter estimation

To save time let's test the method on just the easiest and most difficult benchmarks. For readability, I'll hide the code cells (they'll still show up if you open the notebook yourself to run it).

Benchmark 0 - No nuisance parameters

from pytorch_inferno.plotting import plot_likelihood
plot_likelihood(nll-nll.min())

As a reminder, the binary cross-entropy classifier achieved a width of 15.02, and the exact version of INFERNO got 15.86, so the interpolation version does quite well!

Benchmark 4 - Normalisation uncertainty with auxiliary measurements

%%capture --no-display
alpha_aux = [torch.distributions.Normal(0,2), torch.distributions.Normal(0,2)]
nll = profiler(alpha_aux=alpha_aux, float_b=True, b_aux=torch.distributions.Normal(1000,100), **b_shapes).cpu().detach()
plot_likelihood(nll-nll.min())
<progress value='61' class='' max='61', style='width:300px; height:20px; vertical-align: middle;'></progress> 100.00% [61/61 00:18<00:00]

Again, the interpolation method does pretty well: the BCE classifier blew up to a width of 27.74, and the exact version of INFERNO got a much better constraint of 19.08 (slightly high compared to the paper value of 18.79).

Closing

In this post we've looked at how the INFERNO method can be implemented without requiring analytical knowledge of how the nuisance parameters affect the inputs in the data. Instead we have approximated their affects by interpolating between up/down systematics datasets about the the nominal values. Such datasets are already produced during the course of a typical HEP analysis, and so this approximation improves the potential to be a drop-in replacement for the event-level classifier used in contemporary analyses.

Whilst I've only tested the approximation using the same toy-data problem used by the paper, I have found that it was able to match the performance of the exact version, so I'll be interested to see how well it performs on real-world problems. Certainly, given the difference in performance illustrated by INFERNO (exact or interpolated) and a traditional BCE network, I think the method has a lot to offer in domains where uncertainties play a key role, and I'm looking forward to testing it out.