Philosophy of statistics

Statistical analysis is very important in addressing the problem of induction. Can inductive inference be formalized? What are the caveats? Can inductive inference be automated? How does machine learning work?

All knowledge is, in final analysis, history. All sciences are, in the abstract, mathematics. All judgements are, in their rationale, statistics.

– C. R. Rao1


  1. Introduction to the foundations of statistics
    1. Problem of induction
    2. Early investigators
    3. Foundations of modern statistics
    4. Pedagogy
  2. Probability and its related concepts
    1. Probability
    2. Expectation and variance
    3. Cross entropy
    4. Uncertainty
    5. Bayes’ theorem
    6. Likelihood and frequentist vs Bayesian probability
    7. Curse of dimensionality
  3. Statistical models
    1. Parametric models
    2. Canonical distributions
    3. Central limit theorem
    4. Mixture models
  4. Point estimation and confidence intervals
    1. Inverse problems
    2. Bias and variance
    3. Maximum likelihood estimation
    4. Variance of MLEs
    5. Bayesian credibility intervals
    6. Uncertainty on measuring an efficiency
    7. Examples
  5. Statistical hypothesis testing
    1. Null hypothesis significance testing
    2. Neyman-Pearson theory
    3. p-values and significance
    4. Asymptotics
    5. Student’s t-test
    6. Frequentist vs Bayesian decision theory
    7. Examples
  6. Uncertainty quantification
    1. Sinervo classification of systematic uncertainties
    2. Profile likelihoods
    3. Examples of poor estimates of systematic uncertainties
  7. Statistical classification
    1. Introduction
    2. Examples
  8. Causal inference
    1. Introduction
    2. Causal models
    3. Counterfactuals
  9. Exploratory data analysis
    1. Introduction
    2. Look-elsewhere effect
    3. Archiving and data science
  10. “Statistics Wars”
    1. Introduction
    2. Likelihood principle
    3. Discussion
  11. Replication crisis
    1. Introduction
    2. p-value controversy
  12. Classical machine learning
    1. Introduction
    2. History
    3. Logistic regression
    4. Softmax regression
    5. Decision trees
    6. Clustering
  13. Deep learning
    1. Introduction
    2. Gradient descent
    3. Deep double descent
    4. Regularization
    5. Batch size vs learning rate
    6. Normalization
    7. Finetuning
    8. Computer vision
    9. Natural language processing
    10. Reinforcement learning
    11. Applications in physics
  14. Theoretical machine learning
    1. Algorithmic information theory
    2. No free lunch theorems
    3. Connectivists vs symbolicists
    4. Graphical tensor notation
    5. Universal approximation theorem
    6. Relationship to statistical mechanics
    7. Relationship to gauge theory
    8. Thermodynamics of computation
  15. Information geometry
    1. Introduction
    2. Geometric understanding of classical statistics
    3. Geometric understanding of deep learning
  16. Automation
    1. AutoML
    2. Surrogate models
    3. AutoScience
  17. Implications for the realism debate
    1. Introduction
    2. Real clusters
    3. Word meanings
  18. My thoughts
  19. Annotated bibliography
    1. Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge.
    2. Cowan, G. (1998). Statistical Data Analysis.
    3. James, F. (2006). Statistical Methods in Experimental Physics.
    4. Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.
    5. ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.
    7. More articles to do
  21. References

Introduction to the foundations of statistics

Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. Inductive inferences are used in the scientific method to make generalizations from finite data. This introduces unique avenues of error not found in purely deductive inferences, like those in logic and mathematics. Deductive inferences are sound, and their conclusions necessarily follow, if the argument is valid and all of its premises obtain. Inductive inferences, by comparison, can be valid and yet only probably (not certainly) sound, and can therefore still result in error in some cases, because the support of the argument is ultimately probabilistic.

A skeptic may further probe if we are even justified in using the probabilities we use in inductive arguments. What is the probability the Sun will rise tomorrow? What kind of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.

See also:

Early investigators

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous.4

Foundations of modern statistics



Probability is of epistemic interest, being in some sense a measure of inductive confidence.


Expectation and variance


\[ \mathbb{E}(y) \equiv \int dx \: p(x) \: y(x) \label{eq:expectation} \]

Expectation values can be approximated with a sample mean over some data or a Monte Carlo sample, \(x_s\), of size \(n\):

\[ \mathbb{E}(y) \approx \frac{1}{n} \sum_{s=1}^n y(x_s) \label{eq:expectation_sum} \]
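As a minimal sketch of the Monte Carlo approximation above, the following averages \(y(x)\) over samples drawn from \(p(x)\). The choices \(p(x) = N(0, 1)\) and \(y(x) = x^2\) (for which the exact expectation is 1), the helper name `mc_expectation`, and the sample size are assumptions made only for illustration.

```python
import random


def mc_expectation(y, sampler, n):
    """Approximate E[y(x)] by averaging y over n samples drawn from p(x)."""
    return sum(y(sampler()) for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed example: x ~ N(0, 1) and y(x) = x^2, whose exact expectation is 1.
estimate = mc_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
print(estimate)
```

The statistical error of such an estimate shrinks roughly like \(1/\sqrt{n}\), so the printed value should be close to, but not exactly, 1.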

The variance of a random variable, \(y\), is defined as

\[\begin{align} \mathrm{Var}(y) &\equiv \mathbb{E}((y - \mathbb{E}(y))^2) \nonumber \\ &= \mathbb{E}(y^2 - 2 \: y \: \mathbb{E}(y) + \mathbb{E}(y)^2) \nonumber \\ &= \mathbb{E}(y^2) - 2 \: \mathbb{E}(y) \: \mathbb{E}(y) + \mathbb{E}(y)^2 \nonumber \\ &= \mathbb{E}(y^2) - \mathbb{E}(y)^2 \label{eq:variance} \end{align}\]

The covariance matrix, \(\boldsymbol{V}\), of random variables \(x_i\) is

\[\begin{align} V_{ij} &= \mathrm{Cov}(x_i, x_j) \equiv \mathbb{E}[(x_i - \mathbb{E}(x_i)) \: (x_j - \mathbb{E}(x_j))] \nonumber \\ &= \mathbb{E}(x_i \: x_{j} - \mu_i \: x_j - x_i \: \mu_j + \mu_i \: \mu_j ) \nonumber \\ &= \mathbb{E}(x_i \: x_{j}) - \mu_i \: \mu_j \label{eq:covariance_matrix_indexed} \end{align}\]

\[\begin{equation} \boldsymbol{V} = \begin{pmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_n) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_n, x_1) & \mathrm{Cov}(x_n, x_2) & \cdots & \mathrm{Var}(x_n) \end{pmatrix} \label{eq:covariance_matrix_array} \end{equation}\]

Diagonal elements of the covariance matrix are the variances of each variable.

\[ \mathrm{Cov}(x_i, x_i) = \mathrm{Var}(x_i) \]

Off-diagonal elements of a covariance matrix measure how related two variables are, linearly. Covariance can be normalized to give the correlation coefficient between variables:

\[ \mathrm{Cor}(x_i, x_j) \equiv \frac{ \mathrm{Cov}(x_i, x_j) }{ \sqrt{ \mathrm{Var}(x_i) \: \mathrm{Var}(x_j) } } \label{eq:correlation_matrix} \]

which is bounded: \(-1 \leq \mathrm{Cor}(x_i, x_j) \leq 1\).
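The covariance and correlation formulas above can be checked on a small data set. This is only a sketch: the three toy variables are made up for illustration, and expectations are replaced by sample means (so the variance here divides by \(n\), not \(n-1\)).

```python
def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """Sample analogue of Cov(x, y) = E(x y) - E(x) E(y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def cor(xs, ys):
    """Covariance normalized by the square root of the product of variances."""
    return cov(xs, ys) / (cov(xs, xs) * cov(ys, ys)) ** 0.5


# Assumed toy data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly 2 * x1, so strongly correlated
x3 = [5.0, 4.0, 3.0, 2.0, 1.0]   # exactly linear in x1 with negative slope

variables = [x1, x2, x3]
V = [[cov(a, b) for b in variables] for a in variables]  # covariance matrix
print(V[0][0], cor(x1, x2), cor(x1, x3))
```

The diagonal entry `V[0][0]` is the variance of `x1`, and the correlation of `x1` with `x3` sits at the lower bound of \(-1\) because the two are exactly linearly related.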

The covariance of two random vectors is given by

\[ \boldsymbol{V} = \mathrm{Cov}(\vec{x}, \vec{y}) = \mathbb{E}(\vec{x} \: \vec{y}^{\mathsf{T}}) - \vec{\mu}_x \: \vec{\mu}_{y}^{\mathsf{T}}\label{eq:covariance_matrix_vectors} \]

Cross entropy

TODO: discuss the Shannon entropy and Kullback-Leibler (KL) divergence.34

Shannon entropy:

\[ H(p) = - \underset{x\sim{}p}{\mathbb{E}}\big[ \log p(x) \big] \label{eq:shannon_entropy} \]

Cross entropy:

\[ H(p, q) = - \underset{x\sim{}p}{\mathbb{E}}\big[ \log q(x) \big] = - \sum_{x} p(x) \: \log q(x) \label{eq:cross_entropy} \]

Kullback-Leibler (KL) divergence:

\[\begin{align} D_\mathrm{KL}(p, q) &= \underset{x\sim{}p}{\mathbb{E}}\left[ \log \left(\frac{p(x)}{q(x)}\right) \right] = \underset{x\sim{}p}{\mathbb{E}}\big[ \log p(x) - \log q(x) \big] \label{eq:kl_divergence} \\ &= - H(p) + H(p, q) \\ \end{align}\]
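The relationship \(D_\mathrm{KL}(p, q) = H(p, q) - H(p)\) can be verified numerically for discrete distributions. The two distributions below are arbitrary examples chosen for illustration, and natural logarithms are assumed.

```python
import math


def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL divergence as the expectation under p of log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Assumed example distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

Since the KL divergence is non-negative and vanishes when \(q = p\), the cross entropy \(H(p, q)\) is minimized over \(q\) at \(q = p\), which is why it is usable as a loss function.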

See also the section on logistic regression.


Quantiles and standard error


Propagation of error

Given some vector of random variables, \(\vec{x}\), with estimated means, \(\vec{\mu}\), and estimated covariance matrix, \(\boldsymbol{V}\), suppose we are concerned with estimating the variance of some variable, \(y\), that is a function of \(\vec{x}\). The variance of \(y\) is given by

\[ \sigma^2_y = \mathbb{E}(y^2) - \mathbb{E}(y)^2 \,. \]

Taylor expanding \(y(\vec{x})\) about \(\vec{x}=\vec{\mu}\) to first order (with an implied sum over the repeated index \(i\)) gives

\[ y(\vec{x}) \approx y(\vec{\mu}) + \left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}} (x_i - \mu_i) \,. \]

Therefore, to first order

\[ \mathbb{E}(y) \approx y(\vec{\mu}) \]


\[\begin{align} \mathbb{E}(y^2) &\approx y^2(\vec{\mu}) + 2 \, y(\vec{\mu}) \, \left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}} \mathbb{E}(x_i - \mu_i) \nonumber \\ &+ \mathbb{E}\left[ \left(\left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}}(x_i - \mu_i)\right) \left(\left.\frac{\partial y}{\partial x_j}\right|_{\vec{x}=\vec{\mu}}(x_j - \mu_j)\right) \right] \\ &= y^2(\vec{\mu}) + \, \left.\frac{\partial y}{\partial x_i}\frac{\partial y}{\partial x_j}\right|_{\vec{x}=\vec{\mu}} V_{ij} \\ \end{align}\]
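Subtracting \(\mathbb{E}(y)^2 \approx y^2(\vec{\mu})\) from the last line leaves the first-order result \(\sigma^2_y \approx \frac{\partial y}{\partial x_i} \frac{\partial y}{\partial x_j} V_{ij}\), summed over \(i\) and \(j\). A minimal sketch of that formula follows, using finite-difference derivatives. The function \(y = x_0 \, x_1\), the means, and the covariance matrix are assumed example values.

```python
def gradient(y, mu, eps=1e-6):
    """Central finite-difference estimate of dy/dx_i evaluated at x = mu."""
    grad = []
    for i in range(len(mu)):
        up = list(mu)
        down = list(mu)
        up[i] += eps
        down[i] -= eps
        grad.append((y(up) - y(down)) / (2 * eps))
    return grad


def propagate_variance(y, mu, V):
    """First-order variance of y: sum_ij (dy/dx_i) (dy/dx_j) V_ij."""
    g = gradient(y, mu)
    n = len(mu)
    return sum(g[i] * g[j] * V[i][j] for i in range(n) for j in range(n))


# Assumed example: y = x0 * x1 with uncorrelated inputs.
mu = [2.0, 3.0]
V = [[0.01, 0.0],
     [0.0, 0.04]]
var_y = propagate_variance(lambda x: x[0] * x[1], mu, V)
print(var_y)  # analytically 3^2 * 0.01 + 2^2 * 0.04 = 0.25
```

For uncorrelated inputs only the diagonal terms survive, which recovers the familiar rule of adding squared contributions in quadrature.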



Bayes’ theorem

\[ P(A|B) = P(B|A) \: P(A) \: / \: P(B) \label{eq:bayes_theorem} \]
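A short worked example may help show how the theorem inverts a conditional probability. The numbers are assumed for illustration only: a condition \(A\) with prior \(P(A) = 0.01\), and a test result \(B\) with \(P(B|A) = 0.95\) and \(P(B|\neg A) = 0.05\).

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) from the law of total probability."""
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Assumed example values.
p_a_given_b = posterior(prior=0.01, likelihood=0.95, likelihood_given_not=0.05)
print(p_a_given_b)  # 0.0095 / 0.059, about 0.16
```

Even with a fairly accurate test, the posterior stays well below the likelihood \(P(B|A)\) because the prior is small.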

Likelihood and frequentist vs Bayesian probability

\[ P(H|D) = P(D|H) \: P(H) \: / \: P(D) \label{eq:bayes_theorem_hd} \]

\[ L(\theta) = P(D|\theta) \label{eq:likelihood_def_x} \]


To appeal to such a result is absurd. Bayes’ theorem ought only to be used where we have in past experience, as for example in the case of probabilities and other statistical ratios, met with every admissible value with roughly equal frequency. There is no such experience in this case.41

Curse of dimensionality

Statistical models

Parametric models

Canonical distributions

Bernoulli distribution

\[ \mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \label{eq:bernoulli} \]

which can also be written as

\[ \mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\} \]


\[ \mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\} \]

TODO: explain, another important relationship is

Figure 1: Relationships among Bernoulli, binomial, categorical, and multinomial distributions.

Normal/Gaussian distribution

\[ N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \label{eq:gaussian} \]

and in \(k\) dimensions:

\[ N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^{\mathsf{T}}\:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \label{eq:gaussian_k_dim} \]

where \(\boldsymbol{\Sigma}\) is the covariance matrix (defined in eq. \(\eqref{eq:covariance_matrix_indexed}\)) of the distribution.

Central limit theorem

Let \(X_{1}\), \(X_{2}\), … , \(X_{n}\) be a random sample drawn from any distribution with a finite mean \(\mu\) and finite variance \(\sigma^{2}\), and let \(\bar{X}\) be the sample mean. As \(n \rightarrow \infty\), the distribution of the standardized sample mean converges to the standard normal:

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \label{eq:central_limit_theorem} \]
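The theorem can be illustrated with a small simulation. This sketch assumes an exponential parent distribution with unit rate (so \(\mu = \sigma = 1\)), which is visibly non-Gaussian; the sample size and number of trials are arbitrary choices.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n, trials = 40, 10_000
mu, sigma = 1.0, 1.0     # mean and standard deviation of the Exp(1) distribution


def standardized_mean():
    """Draw n samples and return (x_bar - mu) / (sigma / sqrt(n))."""
    sample = [rng.expovariate(1.0) for _ in range(n)]
    x_bar = sum(sample) / n
    return (x_bar - mu) / (sigma / n ** 0.5)


z = [standardized_mean() for _ in range(trials)]
z_mean = sum(z) / trials
z_var = sum((v - z_mean) ** 2 for v in z) / trials
frac_within = sum(abs(v) < 1.96 for v in z) / trials
print(z_mean, z_var, frac_within)
```

If the standardized mean is approximately \(N(0, 1)\), the printed mean should be near 0, the variance near 1, and roughly 95% of the values should fall within \(\pm 1.96\).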

Figure 2: Detail of a figure showing relationships among univariate distributions. See the full figure here.

Mixture models

Point estimation and confidence intervals

Inverse problems

Recall that, in the context of parametric models, the pdf of the data, \(x_i\), is modeled by a function, \(f(x_i ; \theta_j)\), with parameters, \(\theta_j\). In a statistical inverse problem, the goal is to infer values of the model parameters, \(\theta_j\), given some finite set of data, \(\{x_i\}\), sampled from a probability density, \(f(x_i; \theta_j)\), that models the data reasonably well.48

Bias and variance

The bias of an estimator, \(\hat\theta\), is defined as

\[ \mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \label{eq:bias} \]

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. \(\eqref{eq:variance}\)) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

\[ \mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \label{eq:mse} \]

The MSE of an estimator can be related to its bias and its variance by the following proof:

\[\begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \nonumber \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align}\]

noting that

\[ \mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2 \]


\[\begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \mathbb{E}(\hat{\theta} - \theta)^2 \nonumber \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align}\]

we see that MSE is equivalent to

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \label{eq:mse_variance_bias} \]

For an unbiased estimator, the MSE is the variance of the estimator.
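The decomposition in eq. \(\eqref{eq:mse_variance_bias}\) can be checked by simulation. As an assumed example, the sketch below uses the variance estimator that divides by \(n\) rather than \(n-1\), applied to Gaussian samples with true variance \(\theta = 4\); this estimator has bias \(-\theta/n\).

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible
theta = 4.0             # true variance (assumed example value, sigma = 2)
n, trials = 5, 20_000


def biased_variance(sample):
    """Variance estimator that divides by n, which is biased low by theta / n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)


estimates = [
    biased_variance([rng.gauss(0.0, 2.0) for _ in range(n)])
    for _ in range(trials)
]

mean_est = sum(estimates) / trials
bias = mean_est - theta
variance = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - theta) ** 2 for e in estimates) / trials
print(bias, variance, mse, variance + bias ** 2)
```

The identity holds for the empirical moments up to floating-point rounding, and the simulated bias should be close to \(-\theta/n = -0.8\).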


See also:

Maximum likelihood estimation

A maximum likelihood estimator (MLE) was first used by Fisher.50

\[\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \label{eq:mle} \]

Maximizing \(\mathrm{log} \: L(\theta)\) is equivalent to maximizing \(L(\theta)\), because the logarithm is monotonically increasing, and the former is more convenient because for data that are independent and identically distributed (i.i.d.) the joint likelihood can be factored into a product of the likelihoods of the individual measurements:

\[ L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta) \]

and taking the log of the product makes it a sum:

\[ \mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is also equivalent to minimizing \(-\mathrm{log} \: L(\theta)\), the negative log-likelihood (NLL). For data that are i.i.d.,

\[ \mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i \]
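As a minimal sketch of maximum likelihood estimation by minimizing the NLL, consider an exponential model, \(P(x|\lambda) = \lambda \, e^{-\lambda x}\), for which the MLE has the closed form \(\hat{\lambda} = n / \sum_i x_i\). The model, the toy data, and the crude grid scan standing in for a real minimizer are all assumptions made for illustration.

```python
import math

data = [0.5, 1.5, 2.0, 1.0, 3.0]  # assumed toy measurements


def nll(lam, xs):
    """NLL = -sum_i log P(x_i | lam) for P(x | lam) = lam * exp(-lam * x)."""
    return -sum(math.log(lam) - lam * x for x in xs)


# Crude grid scan over lam in (0, 3]; a real analysis would use a proper minimizer.
grid = [i / 1000 for i in range(1, 3001)]
lam_hat = min(grid, key=lambda lam: nll(lam, data))

lam_closed_form = len(data) / sum(data)
print(lam_hat, lam_closed_form)
```

Because the NLL is a sum over measurements, each data point contributes one term, and the scan lands on the same value as the closed-form estimator.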

Invariance of likelihoods under reparametrization

See also:

Ordinary least squares

Variance of MLEs

Figure 3: Transformation of non-parabolic log-likelihood to parabolic (source: my slides, recreation of F. James (2006), p. 235).

Bayesian credibility intervals

Uncertainty on measuring an efficiency


Statistical hypothesis testing

Null hypothesis significance testing


[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.68

Neyman-Pearson theory


Figure 4: TODO: ROC explainer. (Wikimedia, 2015).

See also:

Neyman-Pearson lemma

Neyman-Pearson lemma:73

For a fixed signal efficiency, \(1-\alpha\), the selection that corresponds to the lowest possible misidentification probability, \(\beta\), is given by

\[ \frac{L(H_1)}{L(H_0)} > k_{\alpha} \,, \label{eq:np-lemma} \]

where \(k_{\alpha}\) is the cut value required to achieve a type-1 error rate of \(\alpha\).

Neyman-Pearson test statistic:

\[ q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \label{eq:qnp-test-stat} \]

Profile likelihood ratio:

\[ \lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \label{eq:profile-llh-ratio} \]

where \(\hat{\theta}\) is the (unconditional) maximum-likelihood estimator that maximizes \(L\), while \(\hat{\theta}_\mu\) is the conditional maximum-likelihood estimator that maximizes \(L\) for a specified signal strength, \(\mu\), and \(\theta\) as a vector includes all other parameters of interest and nuisance parameters.

Neyman construction

Cranmer: Neyman construction.

Figure 5: Neyman construction for a confidence belt for \(\theta\) (source: K. Cranmer, 2020).

TODO: fix

\[ q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \label{eq:q0-test-stat} \]


p-values and significance

Cowan et al. define a \(p\)-value as

a probability, under assumption of \(H\), of finding data of equal or greater incompatibility with the predictions of \(H\).76


It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the \(p\)-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery.”77

Upper limits

CLs method


Student’s t-test

Frequentist vs Bayesian decision theory

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.89

See also:


Uncertainty quantification

Sinervo classification of systematic uncertainties


In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error.93

Figure 6: Classification of measurement uncertainties (, 2016).

Profile likelihoods

Examples of poor estimates of systematic uncertainties

Figure 7: Demonstration of sensitivity to the jet energy scale for an alleged excess in Wjj by Tommaso Dorigo (2011) (see also: GIF).

Statistical classification



See also:

Causal inference


See also:

Causal models


Exploratory data analysis


Look-elsewhere effect

Archiving and data science

“Statistics Wars”



Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability \(P (A, B)\) in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability.107


Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.108

Figure 8: From Kruschke.

Likelihood principle


The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference.114


Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.120



Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors.121

Figure 9: The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference (, 2015).


The idea that the \(P\) value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).146

Replication crisis


p-value controversy

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.151

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application.155

Classical machine learning



See also:

Logistic regression

From a probabilistic point of view,169 logistic regression can be derived from doing maximum likelihood estimation of a vector of model parameters, \(\vec{w}\), in a dot product with the input features, \(\vec{x}\), and squashed with a logistic function that yields the probability, \(\mu\), of a Bernoulli random variable, \(y \in \{0, 1\}\).

\[ p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)} \]

The negative log-likelihood of multiple trials is

\[\begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \nonumber \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \nonumber \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \nonumber \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \label{eq:cross_entropy_loss0} \end{align}\]

which is the cross entropy loss. Note that the first term is non-zero only when the true target is \(y_i=1\), and similarly the second term is non-zero only when \(y_i=0\).170 Therefore, we can reparametrize the target \(y_i\) in favor of \(t_{ki}\) that is one-hot in an index \(k\) over classes.

\[ \mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \label{eq:cross_entropy_loss1} \]


\[ t_{ki} = \begin{cases} 1 & \mathrm{if}\ (k = y_i = 0)\ \mathrm{or}\ (k = y_i = 1) \\ 0 & \mathrm{otherwise} \end{cases} \]


\[ \mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases} \]

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, \(k\), only one term for the true class contributes.

\[ \mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{is\ such\ that}\ y_k=1} \label{eq:cross_entropy_loss2} \]
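The three forms of the loss above (the binary NLL, the one-hot sum over classes, and the sum over only the true class) can be checked against each other numerically. The labels and predicted probabilities below are assumed toy values.

```python
import math


def nll_binary(y, mu):
    """-sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ]."""
    return -sum(
        yi * math.log(mi) + (1 - yi) * math.log(1 - mi) for yi, mi in zip(y, mu)
    )


def cel_one_hot(y, mu):
    """-sum_i sum_k t_ki log mu_ki, with t_ki one-hot over the classes k = 0, 1."""
    total = 0.0
    for yi, mi in zip(y, mu):
        mu_k = [1 - mi, mi]                          # mu_ki for k = 0, 1
        t_k = [1 if k == yi else 0 for k in (0, 1)]  # t_ki for k = 0, 1
        total -= sum(t * math.log(m) for t, m in zip(t_k, mu_k))
    return total


def cel_true_class(y, mu):
    """-sum_i log of the probability assigned to the true class."""
    return -sum(math.log(mi if yi == 1 else 1 - mi) for yi, mi in zip(y, mu))


# Assumed toy targets and predicted probabilities of y = 1.
y = [1, 0, 1, 1]
mu = [0.9, 0.2, 0.6, 0.7]
print(nll_binary(y, mu), cel_one_hot(y, mu), cel_true_class(y, mu))
```

All three agree because, for each trial, every term except the one for the true class is multiplied by zero.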

Logistic regression uses the logit function,171 which is the logarithm of the odds—the ratio of the chance of success to failure. Let \(\mu\) be the probability of success in a Bernoulli trial, then the logit function is defined as

\[ \mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \label{eq:logit} \]

Logistic regression assumes that the logit function is a linear function of the explanatory variable, \(x\).

\[ \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x \]

where \(\beta_0\) and \(\beta_1\) are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, \(\vec{x}\), where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, \(\beta_0\), in a dot product.

\[ \vec{x} = (1, x_1, x_2, \ldots, x_n)^{\mathsf{T}}\]

\[ \vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^{\mathsf{T}}\]

\[ \log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^{\mathsf{T}}\vec{x} \]

For the moment, let \(z \equiv \vec{w}^{\mathsf{T}}\vec{x}\). Exponentiating and solving for \(\mu\) gives

\[ \mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} } \]

This function is called the logistic or sigmoid function.

\[ \mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \label{eq:logistic} \]

Since we inverted the logit function by solving for \(\mu\), the inverse of the logit function is the logistic or sigmoid.

\[ \mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z) \]

And therefore,

\[ \mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^{\mathsf{T}}\vec{x}) \]
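A short sketch of the logit and sigmoid functions, and of the prediction \(\mu = \mathrm{sigm}(\vec{w}^{\mathsf{T}}\vec{x})\), follows. The weight values and the helper name `predict` are assumptions made for illustration, not fitted parameters.

```python
import math


def sigm(z):
    """Logistic (sigmoid) function, the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(mu):
    """Log of the odds, mu / (1 - mu)."""
    return math.log(mu / (1.0 - mu))


def predict(w, features):
    """mu = sigm(w^T x), with a 1 prepended to the features for the bias beta_0."""
    x = [1.0] + list(features)
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigm(z)


w = [-1.0, 2.0]         # assumed example values for (beta_0, beta_1)
mu = predict(w, [0.5])  # z = -1 + 2 * 0.5 = 0, so mu = 0.5
print(mu, logit(mu))
```

The input at which \(z = 0\) is where the predicted probability crosses 0.5, so the condition \(\vec{w}^{\mathsf{T}}\vec{x} = 0\) marks the decision boundary between the two classes.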

Figure 10: Logistic regression.

See also:

Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by \(k\)) as

\[ p(\vec{y} | \vec{\mu}) = \mathrm{Cat}(\vec{y} | \vec{\mu}) = \prod_k {\mu_k}^{y_k} \label{eq:categorical_distribution} \]

which is the categorical or multinoulli distribution, where \(\vec{y}\) is a one-hot vector over the classes. The negative log-likelihood of multiple independent trials is

\[ \mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \label{eq:nll_multinomial} \]

Noting again that \(y_{ki} = 1\) only when \(k\) is the true class, and is 0 otherwise, this simplifies to eq. \(\eqref{eq:cross_entropy_loss2}\).
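A minimal sketch of the multi-class loss follows. It assumes, as is conventional in softmax regression but not derived above, that the class probabilities \(\mu_{ki}\) come from applying a softmax to a vector of real-valued scores (logits) for each trial. The example logits and labels are made up.

```python
import math


def softmax(z):
    """Map a list of logits to probabilities that sum to one."""
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_batch, labels):
    """NLL = -sum_i sum_k y_ki log mu_ki, with y_ki one-hot in the true class."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        mu = softmax(logits)
        one_hot = [1.0 if k == label else 0.0 for k in range(len(mu))]
        total -= sum(y * math.log(m) for y, m in zip(one_hot, mu))
    return total


# Assumed toy logits for two trials over three classes, with true classes 0 and 2.
logits_batch = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
print(nll(logits_batch, labels))
```

For the second trial the logits are equal, so each class gets probability \(1/3\) and that trial contributes \(\log 3\) to the loss regardless of the label.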

See also:

Decision trees


See also:

Deep learning


Figure 11: Raw input image is transformed into gradually higher levels of representation.

Gradient descent

\[ \hat{f} = \underset{f \in \mathcal{F}}{\mathrm{argmin}} \underset{x \sim \mathcal{X}}{\mathbb{E}} L(f, x) \]

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

\[ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] \]

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), \(\vec{x}_{bt}\), of \(m\) examples each, and take a gradient descent step with each minibatch, averaging the gradient over the examples in it:

\[ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{i=1}^m \frac{\partial L}{\partial \vec{w}}[\vec{x}_{bt}] \]
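The minibatch update can be sketched on a toy problem. Everything here is an assumption for illustration: a one-parameter linear model \(y = w \, x\), a mean squared error loss, synthetic data generated with \(w = 3\), and the particular learning rate \(\eta\) and batch size \(m\).

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed toy data: y = 3 x plus small Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]


def grad(w, batch):
    """Gradient with respect to w of the batch-averaged loss (w x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd(data, eta=0.1, m=20, epochs=50):
    """Minibatch SGD: one gradient descent step per minibatch of size m."""
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random minibatches each epoch
        for start in range(0, len(data), m):
            batch = data[start:start + m]
            w -= eta * grad(w, batch)
    return w


w_hat = sgd(data)
print(w_hat)
```

Each step uses a noisy estimate of the full gradient, so the final value fluctuates slightly around the least-squares solution instead of converging to it exactly.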


Deep double descent



Twitter threads:


Regularization = any change we make to the training algorithm in order to reduce the generalization error but not the training error.220

Most common regularizations:


Batch size vs learning rate


  1. Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

  1. Hoffer, E. et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

    • \(\eta \propto \sqrt{m}\)
  2. Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

    • \(\eta \propto m\)
  3. You, Y. et al. (2017). Large batch training of convolutional networks.

    • Layer-wise Adaptive Rate Scaling (LARS)
  4. You, Y. et al. (2017). ImageNet training in minutes.

    • Layer-wise Adaptive Rate Scaling (LARS)
  5. Jastrzebski, S. (2018). Three factors influencing minima in SGD.

    • \(\eta \propto m\)
  6. Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

  7. Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

    • \(m \propto \eta\)
  8. Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when \(m \ll M\), the covariance matrix of the weight update \(\mathrm{Cov(\eta \Delta\theta)}\) scales linearly with the quantity \(\eta^2/m\).

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update \(\eta \Delta\theta\). Conversely, to keep the scaling of the covariance of the weight update vector \(\eta \Delta\theta\) constant would require scaling \(\eta\) with the square root of the batch size \(m\) (Krizhevsky, 2014; Hoffer et al., 2017).

  1. Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.
    - Post-local SGD.

  2. Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as \(\eta \propto \sqrt{m}\) attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

  1. McCandlish, S. et al. (2018). An empirical model of large-batch training.
    • Critical batch size
  2. Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling (\(b\)-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

  1. Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.
    • \(\eta \propto m\)

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

  1. You, Y. et al. (2019). Large-batch training for LSTM and beyond.
    • Warmup and use \(\eta \propto m\)

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

  1. You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.
    • LARS and LAMB
  2. Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

  1. Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

  1. Kaplan, J. et al. (2020). Scaling laws for neural language models.

  2. Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.




Computer vision


Natural language processing




Chain rule of language modeling (chain rule of probability):

\[ P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) \label{eq:chain_rule_of_lm} \]

or for the whole sequence:

\[ P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) \label{eq:chain_rule_of_lm_2} \]

\[ = P(x_1) \: P(x_2 | x_1) \: P(x_3 | x_1 x_2) \: P(x_4 | x_1 x_2 x_3) \ldots \]
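
As a small illustration of the chain rule, the sketch below scores a sequence by summing log conditional probabilities. The toy table `BIGRAM` and the helper `cond_prob` are hypothetical; for simplicity the toy model only looks at the previous token, although the chain rule itself allows the full context.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the previous token.
BIGRAM = {
    None: {"a": 0.6, "b": 0.4},   # P(x_1)
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}

def cond_prob(token, context):
    """P(token | context) for the toy model; context is a tuple of earlier tokens."""
    prev = context[-1] if context else None
    return BIGRAM[prev][token]

def sequence_log_prob(tokens):
    """Sum of log P(x_t | x_1 ... x_{t-1}) over the whole sequence."""
    return sum(
        math.log(cond_prob(tok, tuple(tokens[:t])))
        for t, tok in enumerate(tokens)
    )

# P(a, b, a) = P(a) * P(b | a) * P(a | a b) = 0.6 * 0.9 * 0.5
print(math.exp(sequence_log_prob(["a", "b", "a"])))
```

Working in log space avoids underflow when the product runs over many tokens.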

A language model (LM) predicts the next token given the previous context. The model outputs a vector of logits, which a softmax converts to probabilities for the next token.

\[ P(x_t | x_1 \ldots x_{t-1}) = \mathrm{softmax}\left( \mathrm{model}(x_1 \ldots x_{t-1}) \right) \]
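
The softmax step can be sketched in a few lines. This is a minimal pure-Python version; subtracting the maximum logit before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities that sum to one."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```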

Auto-regressive inference follows this chain rule. If done with greedy search:

\[ \hat{x}_t = \underset{x_t \in V}{\mathrm{argmax}} \: P(x_t | x_1 \ldots x_{t-1}) \label{eq:greedy_search} \]
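
A minimal sketch of greedy auto-regressive decoding follows. The vocabulary `VOCAB` and the function `toy_logits` are hypothetical stand-ins for a real tokenizer and model.

```python
VOCAB = ["<eos>", "the", "cat", "sat"]

def toy_logits(context):
    """Hypothetical stand-in for model(x_1 ... x_{t-1}): favors one fixed successor."""
    successor = {None: "the", "the": "cat", "cat": "sat", "sat": "<eos>"}
    prev = context[-1] if context else None
    return [5.0 if tok == successor[prev] else 0.0 for tok in VOCAB]

def greedy_decode(logits_fn, max_len=10):
    """Repeatedly append the most probable next token until <eos> or max_len."""
    tokens = []
    for _ in range(max_len):
        logits = logits_fn(tokens)
        # softmax is monotonic, so the argmax over logits equals the argmax over probabilities
        best = max(range(len(VOCAB)), key=lambda i: logits[i])
        if VOCAB[best] == "<eos>":
            break
        tokens.append(VOCAB[best])
    return tokens

print(greedy_decode(toy_logits))
```

Greedy search commits to the locally best token at each step, so it does not in general find the most probable whole sequence.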

Beam search:

Backpropagation through time (BPTT):

Neural Machine Translation (NMT):


Figure 12: Diagram of the Transformer model (source:

\[ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\, K^\intercal}{\sqrt{d_k}}\right) V \label{eq:attention} \]
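
The attention formula can be written out directly. The sketch below is a single-head, unmasked version in pure Python, with matrices as lists of row vectors; real implementations use batched tensor libraries.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # each output row is a weighted average of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```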

Attention and Transformers
Figure 13: Diagram of the BERT model (source:
Computational complexity of transformers
Efficient transformers
What comes after Transformers?

Evaluation methods

Scaling laws in NLP

Language understanding

See also:


Linear probes:

Reinforcement learning






Regret minimization

Regret matching (RM)

Consider a game like rock-paper-scissors, where there is only one action per round. Let \(v^{t}(a)\) be the value observed when playing action \(a\) on iteration \(t\).

TODO: explain that the entire rewards vector, \(v^{t}(a)\), over \(a\) is observable after the chosen action is played.

Let a strategy, \(\sigma^t\), be a probability distribution over actions, \(a \in A\). Then the value of a strategy, \(v^{t}(\sigma^{t})\), is the expectation of its value over actions.

\[ v^{t}(\sigma^{t}) = \sum_{a \in A} \sigma^{t}(a) \: v^{t}(a) \label{eq:value_of_strategy} \]
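
For example, the value of a mixed strategy in one round of rock-paper-scissors can be computed as below. The payoff numbers assume the opponent played rock and are for illustration only.

```python
def strategy_value(strategy, values):
    """Expected value of a mixed strategy: sum over actions of sigma(a) * v(a)."""
    return sum(strategy[a] * values[a] for a in strategy)

# Values of each action in a round where the opponent played rock.
values = {"rock": 0.0, "paper": 1.0, "scissors": -1.0}
sigma = {"rock": 0.5, "paper": 0.25, "scissors": 0.25}
print(strategy_value(sigma, values))
```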

Regret, \(R^{T}\), measures how much better some sequence of strategies, \(\sigma'\), would do compared to the chosen sequence of strategies, \(\sigma = \{\sigma^1, \sigma^2, \ldots \sigma^T\}\).

\[ R^{T} \equiv \sum_{t=1}^{T} \left( v^{t}({\sigma'}^{t}) - v^{t}(\sigma^{t}) \right) \label{eq:regret} \]

External regret, \(R^{T}(a)\), measures the regret of the chosen sequence of strategies versus a hypothetical strategy where action \(a\) is always chosen.

\[ R^{T}(a) \equiv \sum_{t=1}^{T} \left( v^{t}(a) - v^{t}(\sigma^{t}) \right) \label{eq:external_regret} \]
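
A minimal sketch of external regret over a short history of rounds is given below. The two-round example, in which the opponent plays rock and then paper while we always play rock, is made up for illustration.

```python
def external_regret(action, values_by_round, strategies_by_round):
    """R^T(a): value of always playing `action` minus the value actually obtained."""
    total = 0.0
    for values, sigma in zip(values_by_round, strategies_by_round):
        achieved = sum(sigma[a] * values[a] for a in sigma)
        total += values[action] - achieved
    return total

always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
values_by_round = [
    {"rock": 0.0, "paper": 1.0, "scissors": -1.0},   # opponent played rock
    {"rock": -1.0, "paper": 0.0, "scissors": 1.0},   # opponent played paper
]
strategies_by_round = [always_rock, always_rock]
for a in ("rock", "paper", "scissors"):
    print(a, external_regret(a, values_by_round, strategies_by_round))
```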

Regret Matching (RM) is a rule to determine the strategy for the next iteration:

\[ \sigma^{t+1}(a) \equiv \frac{ R^{t}_{+}(a) }{ \sum_{b \in A} R^{t}_{+}(b) } \label{eq:regret_matching} \]

where \(R_{+} \equiv \mathrm{max}(R, 0)\).

At the end of training, the resulting recommended strategy with convergence bounds is not the final strategy used in training, \(\sigma^{T}\), but the average strategy over all time steps:

\[ \bar{\sigma}^{T}(a) = \frac{1}{T} \sum_{t=1}^{T} \sigma^{t}(a) \]
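
The pieces above can be combined into a small self-play sketch for rock-paper-scissors. Several choices here are assumptions made for illustration: both players update simultaneously, each action is valued by its expectation against the opponent's current strategy, the strategy falls back to uniform when no regret is positive, and player 0 starts with an arbitrary nonzero regret so that training does not begin at equilibrium.

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(a, b):
    """Payoff to the player choosing a against b: +1 win, -1 loss, 0 tie."""
    if a == b:
        return 0.0
    return 1.0 if BEATS[a] == b else -1.0

def regret_matching(regret):
    """Strategy proportional to positive regrets; uniform if none are positive."""
    pos = [max(r, 0.0) for r in regret]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / len(pos)] * len(pos)

def train(iterations=10000):
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # arbitrary asymmetric start
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        sigmas = [regret_matching(r) for r in regrets]
        for p in (0, 1):
            opp = sigmas[1 - p]
            # v^t(a): expected value of each action against the opponent's strategy
            values = [sum(opp[j] * payoff(a, b) for j, b in enumerate(ACTIONS))
                      for a in ACTIONS]
            achieved = sum(s * v for s, v in zip(sigmas[p], values))
            for i in range(3):
                regrets[p][i] += values[i] - achieved
                strategy_sums[p][i] += sigmas[p][i]
    # the recommended strategy is the average over iterations, not the last iterate
    return [[s / iterations for s in sums] for sums in strategy_sums]

avg = train()
print(avg[0], avg[1])
```

The average strategies should approach the uniform equilibrium of rock-paper-scissors, while the per-iteration strategies may keep cycling.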

TODO: explain the convergence of \(\bar{\sigma}^{t}\) to an \(\varepsilon\)-Nash equilibrium.

Counterfactual regret minimization (CFR)

TODO: explain extensive-form games.

A finite extensive game with imperfect information has the following components:340

The player reach, \(\pi^{\sigma}_{i}(h)\), of a history \(h\) is the product of the probabilities for all agent \(i\) actions leading to \(h\). Formally,341

\[ \pi^{\sigma}_{i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') = i} \sigma_{i}(h', a') \label{eq:player_reach} \]

Due to perfect recall, any two histories in infoset \(I_i\) have the same player reach for player \(i\). Thus, we similarly define the player reach \(\pi^{\sigma}_{i}(I_i)\) of infoset \(I_i\) as

\[ \pi^{\sigma}_{i}(I_i) \equiv \prod_{ {I'}_{i} \cdot a' \sqsubseteq I_i | P({I'}_{i}) = i } \sigma_{i}({I'}_{i}, a') = \left.\pi^{\sigma}_{i}(h)\right|_{h \in I_i} \label{eq:player_reach_from_infoset} \]

The external reach (also known as the opponent reach), \(\pi^{\sigma}_{-i}(h)\), of a history \(h\) is the contribution of chance and of all players other than \(i\). Formally,

\[ \pi^{\sigma}_{-i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') \neq i} \sigma_{P(h')}(h', a') \label{eq:external_reach} \]
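
Both reach probabilities are products over the actions along a history, split by who acted. The sketch below assumes a hypothetical encoding in which a history is a list of `(actor, decision_point, action)` tuples and chance is treated as just another actor; the probabilities are made up.

```python
def player_reach(history, player, strategies):
    """Product of `player`'s own action probabilities along `history`."""
    reach = 1.0
    for actor, point, action in history:
        if actor == player:
            reach *= strategies[actor][point][action]
    return reach

def external_reach(history, player, strategies):
    """Product of the action probabilities of everyone except `player`, chance included."""
    reach = 1.0
    for actor, point, action in history:
        if actor != player:
            reach *= strategies[actor][point][action]
    return reach

# strategies[actor][decision_point][action] -> probability (illustrative numbers)
strategies = {
    "chance": {"deal": {"K": 0.5, "Q": 0.5}},
    1: {"K": {"bet": 0.8, "check": 0.2}},
    2: {"bet": {"call": 0.25, "fold": 0.75}},
}
history = [("chance", "deal", "K"), (1, "K", "bet"), (2, "bet", "call")]
print(player_reach(history, 1, strategies), external_reach(history, 1, strategies))
```

The product of the two is the total probability of reaching the history under the strategy profile.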

We also define the external reach of an infoset as

\[ \pi^{\sigma}_{-i}(I_i) \equiv \sum_{h \in I_{i}} \pi^{\sigma}_{-i}(h) \label{eq:external_reach_from_infoset} \]

The counterfactual value of an infoset \(I\) is the expected utility to player \(i\) given that \(I\) has been reached, weighted by the external reach of \(I\) for player \(i\). Formally,342

\[ v(I) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \label{eq:counter_factual_value} \]

The counterfactual value of an action, \(a\), is

\[ v(I, a) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h \cdot a, z) \: u_{i}(z) \label{eq:counter_factual_value_of_a} \]

Consider the case where, as in no-limit hold'em (NLHE), each pair of private hole cards determines a single unique history \(h\), and each infoset contains a single hand, so \(I=h\). Then

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

making explicit the player reach and the external reach,

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi_{i}^{\sigma}(h, z) \: \pi_{-i}^{\sigma}(h, z) \: u_{i}(z) \]

At a leaf node where we finally calculate the rewards,

\[ v(z) = \pi^{\sigma}_{-i}(z) \: u_{i}(z) \]

TODO: explain CFR.

The instantaneous regret is

\[ r^{t}(I, a) = v^{\sigma^t}(I, a) - v^{\sigma^t}(I) \]

The (cumulative) counterfactual regret is

\[ R^{T}(I, a) = \sum_{t=1}^{T} r^{t}(I, a) \]

Similar to the single-node game discussed above, eq. \(\eqref{eq:regret_matching}\), applying regret matching during training means updating the strategy at each infoset according to the following rule.

\[ \sigma^{t+1}(I, a) \equiv \frac{ R^{t}_{+}(I, a) }{ \sum_{b \in A} R^{t}_{+}(I, b) } \label{eq:regret_matching_cfr} \]

The average strategy is

\[ \bar{\sigma}^{T}(I, a) = \frac{ \sum_{t=1}^{T} \pi^{t}_{i}(I) \: \sigma^{t}(I, a) }{ \sum_{t=1}^{T} \pi^{t}_{i}(I) } \]
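
To make the update rules concrete, here is a minimal sketch of vanilla CFR on Kuhn poker (three cards, one bet), a standard toy game that is not otherwise discussed in these notes. The `Node` class, the string encoding of infosets, and the uniform fallback in regret matching are all assumptions made for illustration. Regrets are weighted by the opponent's reach and the average strategy by the player's own reach.

```python
from itertools import permutations

ACTIONS = "pb"  # p = pass (check or fold), b = bet (bet or call)

class Node:
    """Per-infoset storage for regrets, the current strategy, and the strategy sum."""
    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]
        self.strategy = [0.5, 0.5]

    def update_strategy(self):
        """Regret matching on positive cumulative regrets (uniform if none are positive)."""
        pos = [max(r, 0.0) for r in self.regret_sum]
        total = sum(pos)
        self.strategy = [p / total for p in pos] if total > 0 else [0.5, 0.5]

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [s / total for s in self.strategy_sum] if total > 0 else [0.5, 0.5]

def cfr(nodes, cards, history, reach):
    """Return the expected utility for the player to act; reach = [pi_0, pi_1]."""
    player = len(history) % 2
    if len(history) >= 2:
        higher = cards[player] > cards[1 - player]
        if history[-1] == "p":
            if history == "pp":            # both checked: showdown for the antes
                return 1.0 if higher else -1.0
            return 1.0                     # the opponent folded to a bet
        if history[-2:] == "bb":           # bet and call: showdown for 2
            return 2.0 if higher else -2.0
    node = nodes.setdefault(str(cards[player]) + history, Node())
    utils = []
    for a, action in enumerate(ACTIONS):
        next_reach = list(reach)
        next_reach[player] *= node.strategy[a]
        # zero-sum: the child's value is from the opponent's point of view
        utils.append(-cfr(nodes, cards, history + action, next_reach))
    node_util = sum(s * u for s, u in zip(node.strategy, utils))
    for a in range(2):
        # counterfactual regret, weighted by the opponent's reach
        node.regret_sum[a] += reach[1 - player] * (utils[a] - node_util)
        # numerator of the average strategy, weighted by the player's own reach
        node.strategy_sum[a] += reach[player] * node.strategy[a]
    return node_util

def train(iterations):
    nodes, total_value = {}, 0.0
    deals = list(permutations(range(3), 2))   # all 6 ways to deal two of three cards
    for _ in range(iterations):
        for cards in deals:
            # the uniform chance probability 1/6 is left out of the reaches: it rescales
            # every regret equally, so regret matching is unaffected
            total_value += cfr(nodes, cards, "", [1.0, 1.0]) / len(deals)
        for node in nodes.values():
            node.update_strategy()            # strategies change only between iterations
    return nodes, total_value / iterations

nodes, value = train(5000)
print("average value for the first player:", value)
for key in sorted(nodes):
    print(key, nodes[key].average_strategy())
```

Infoset keys are the player's card (0, 1, 2 for jack, queen, king) followed by the betting history; each printed pair gives the average probabilities of passing and betting.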

Monte Carlo Counterfactual Regret Minimization (MCCFR)

TODO: explain MCCFR.

External sampling MCCFR:

\[ \tilde{v}^{\sigma}_{i}(I) = \sum_{z \in Q} u_{i}(z) \: \pi^{\sigma}_{i}(z[I] \rightarrow z) \label{eq:external_sample_mccfr} \]

Best response and exploitability

Best response:

\[ \mathrm{BR}(\sigma_{-i}) = \underset{\sigma_{i}^{\prime}}{\mathrm{argmax}} \: u_{i}(\sigma_{i}^{\prime}, \sigma_{-i}) \label{eq:best_response} \]
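
In a matrix game the best response can be found by evaluating each pure action against the opponent's mixture, since some pure action always attains the maximum. A minimal sketch, using rock-paper-scissors as an assumed example:

```python
def best_response(payoffs, opponent_strategy):
    """Pure best response of the row player to a mixed column strategy.

    payoffs[a][b] is the row player's utility for row a against column b.
    Returns (best_action_index, best_value).
    """
    values = [sum(q * u for q, u in zip(opponent_strategy, row)) for row in payoffs]
    best = max(range(len(values)), key=lambda a: values[a])
    return best, values[best]

# Rows and columns are ordered rock, paper, scissors.
RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]
print(best_response(RPS, [0.5, 0.25, 0.25]))
```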

TODO: Local Best Response (LBR).355

Exploitability:

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \mathrm{BR}(\sigma_{i})) \label{eq:exploitability} \]

NashConv356 exploitability uses the convention:

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \sigma_{-i}) \label{eq:nc_exploitability} \]

The average exploitability per player is

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_{i}(\sigma) \]

Note that in zero-sum games, when summing over players, the second terms in NashConv sum to zero.357

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i=1}^{n} u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) \label{eq:average_exploitability} \]

In two-player games:

\[ \varepsilon(\sigma) = \frac{1}{2} \Big( u_{1}(\mathrm{BR}(\sigma_{2}), \sigma_{2}) + u_{2}(\sigma_{1}, \mathrm{BR}(\sigma_{1})) \Big) \label{eq:average_exploitability_two_player} \]
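
For a two-player zero-sum matrix game the two-player formula reduces to two maximizations. The sketch below assumes `payoffs[a][b]` holds player 1's utility and that player 2 receives its negative; rock-paper-scissors is used as an example.

```python
def exploitability(payoffs, sigma1, sigma2):
    """Average exploitability of (sigma1, sigma2) in a zero-sum matrix game."""
    # u_1(BR(sigma_2), sigma_2): best row against player 2's mixture
    br1 = max(sum(q * u for q, u in zip(sigma2, row)) for row in payoffs)
    # u_2(sigma_1, BR(sigma_1)): best column against player 1's mixture
    n_cols = len(payoffs[0])
    br2 = max(-sum(p * payoffs[a][b] for a, p in enumerate(sigma1))
              for b in range(n_cols))
    return 0.5 * (br1 + br2)

RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]
uniform = [1.0 / 3.0] * 3
print(exploitability(RPS, uniform, uniform))          # equilibrium profile
print(exploitability(RPS, [1.0, 0.0, 0.0], uniform))  # player 1 always plays rock
```

A profile with zero exploitability is a Nash equilibrium; always playing rock can be exploited by always playing paper.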

Solving poker

Applications in physics

See also:

Theoretical machine learning

Algorithmic information theory

No free lunch theorems

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available.404


From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon.405

Connectionists vs symbolicists

Graphical tensor notation

Universal approximation theorem

Relationship to statistical mechanics

Relationship to gauge theory


Invariance:

\[ f(g x) = f(x) \]


Equivariance:

\[ f(g x) = g' f(x) \]

Same-equivariance is the case where \(g' = g\).
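
The two properties can be checked numerically for the permutation group acting on a list. The functions `total` and `square_each` are illustrative choices: a sum does not change under permutation, while an elementwise map commutes with it, so \(g' = g\).

```python
def permute(x, perm):
    """Apply the group element g (a permutation of indices) to a list."""
    return [x[i] for i in perm]

def total(x):
    """Sum of the entries: f(g x) = f(x)."""
    return sum(x)

def square_each(x):
    """Elementwise square: f(g x) = g f(x)."""
    return [v * v for v in x]

x = [1.0, 2.0, 3.0]
g = [2, 0, 1]
print(total(permute(x, g)) == total(x))
print(square_each(permute(x, g)) == permute(square_each(x), g))
```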

Thermodynamics of computation

Information geometry


Geometric understanding of classical statistics

Geometric understanding of deep learning



Surrogate models



Figure 14: The inference cycle for the process of scientific inquiry. The three distinct forms of inference (abduction, deduction, and induction) facilitate an all-encompassing vision, enabling HPC and HDA to converge in a rational and structured manner. HPC: high-performance computing; HDA: high-end data analysis.

See also:

Implications for the realism debate


See also:

Real clusters

See also:

Word meanings

Wittgenstein in PI:

The meaning of a word is its use in the language.474


One cannot guess how a word functions. One has to look at its use, and learn from that.475


Modern large language models integrate syntax and semantics in the underlying representations: encoding words as vectors in a high-dimensional space, without an effort to separate out e.g. part of speech categories from semantic representations, or even predict at any level of analysis other than the literal word. Part of making these models work well was in determining how to encode semantic properties into vectors, and in fact initializing word vectors via encodings of distribution semantics from e.g. Mikolov et al. 2013 (Radford et al. 2019). Thus, an assumption of the autonomy of syntax is not required to make models that predict syntactic material and may well hinder it.476

See also:

My thoughts

My docs:

My talks:

Annotated bibliography

Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge.

  • Mayo (1996)

My thoughts

  • TODO

Cowan, G. (1998). Statistical Data Analysis.

  • Cowan (1998) and Cowan (2016)

My thoughts

  • TODO

James, F. (2006). Statistical Methods in Experimental Physics.

  • F. James (2006)

My thoughts

  • TODO

Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.

  • Cowan et al. (2011)
  • Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

My thoughts

  • TODO

ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.

My thoughts

  • TODO

Cranmer, K. (2015). Practical statistics for the LHC.

  • Cranmer (2015)

My thoughts

  • TODO

  • All of Statistics477
  • The Foundations of Statistics478







