# Philosophy of statistics

Statistics are *way* important in addressing the problem of induction.

### Contents

- Introduction to the foundations of statistics
- Probability and its related concepts
- Statistical models
- Point estimation and confidence intervals
- Statistical hypothesis testing
- Uncertainty quantification
- Statistical classification
- Causal inference
- Exploratory data analysis
- “Statistics Wars”
- Replication crisis
- Classical machine learning
- Deep learning
- Theoretical machine learning
- Information geometry
- Automation
- Implications for the realism debate
- My thoughts
- Annotated bibliography
  - Mayo, D.G. (1996). *Error and the Growth of Experimental Knowledge*.
  - Cowan, G. (1998). *Statistical Data Analysis*.
  - James, F. (2006). *Statistical Methods in Experimental Physics*.
  - Cowan, G. *et al.* (2011). Asymptotic formulae for likelihood-based tests of new physics.
  - ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.
  - Cranmer, K. (2015). Practical statistics for the LHC.
  - More articles to do
- Links and encyclopedia articles
- References

## Introduction to the foundations of statistics

### Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. Inductive inferences are used in the scientific method to make generalizations from finite data. This introduces avenues of error not found in purely deductive inferences, like those in logic and mathematics. In a deductive argument that is valid and whose premises all obtain, the conclusion necessarily follows and the argument is sound. An inductive argument, by contrast, can at best make its conclusion probable (not certain), and can therefore still result in error in some cases, because the support for the conclusion is ultimately probabilistic.

A skeptic may further probe if we are even justified in using the probabilities we use in inductive arguments. What is the probability the Sun will rise tomorrow? What kind of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.


### Early investigators

- “Ibn al-Haytham was an early proponent of the concept that a hypothesis must be supported by experiments based on confirmable procedures or mathematical evidence—an early pioneer in the scientific method five centuries before Renaissance scientists.” - Wikipedia
- Gerolamo Cardano (1501-1576)
  - *Book on Games of Chance* (1564)

- John Graunt (1620-1674)
- Jacob Bernoulli (1655-1705)

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous.^{3}

- Thomas Bayes (1701-1761)
- Pierre-Simon Laplace (1749-1827)
  - The rule of succession, bayesian

- Carl Friedrich Gauss (1777-1855)
- John Stuart Mill (1806-1873)
- John Venn (1834-1923)
  - *The Logic of Chance* (1866)^{4}

### Foundations of modern statistics

- Central limit theorem
- De Moivre-Laplace theorem (1738)
- Glivenko-Cantelli theorem (1933)

- Charles Sanders Peirce (1839-1914)
  - Formulated modern statistics in “Illustrations of the Logic of Science,” a series published in *Popular Science Monthly* (1877-1878), and also “A Theory of Probable Inference” in *Studies in Logic* (1883).^{5}
  - With a repeated measures design, introduced blinded, controlled randomized experiments (before Fisher).
- Karl Pearson (1857-1936)
  - *The Grammar of Science* (1892)
  - “On the criterion that a given system of deviations…” (1900)^{6}
    - Proposed testing the validity of hypothesized values by evaluating the chi distance between the hypothesized and the empirically observed values via the \(p\)-value.
  - With Frank Raphael Weldon, he established the journal *Biometrika* in 1901.
  - Founded the world’s first university statistics department at University College, London in 1911.

- Ronald Fisher (1890-1962)
  - Fisher significance of the null hypothesis (\(p\)-values)
    - “On an absolute criterion for fitting frequency curves”^{7}
    - “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population”^{8}
    - “On the ‘probable error’ of a coefficient of correlation deduced from a small sample”^{9}
      - Definition of *likelihood*
      - ANOVA
    - *Statistical Methods for Research Workers* (1925)
    - *The Design of Experiments* (1935)
    - “Statistical methods and scientific induction”^{10}
    - *The Lady Tasting Tea*^{11}
- Jerzy Neyman (1894-1981)
  - biography by Reid^{12}
  - Neyman, J. (1955). The problem of inductive inference.^{13}
    - Shows that Neyman read Carnap, but did Carnap read Neyman?
    - Discussion: Mayo, D.G. (2014). Power taboos: Statue of Liberty, Senn, Neyman, Carnap, Severity.
- Egon Pearson (1895-1980)
  - Neyman-Pearson confidence intervals with fixed error probabilities (also \(p\)-values but considering two hypotheses involves two types of errors)
- Harold Jeffreys (1891-1989)
  - objective (non-informative) Jeffreys priors

- Andrey Kolmogorov (1903-1987)
- C.R. Rao (1920-2023)

### Pedagogy

- Kendall^{14}
- James^{15}
- Cowan^{16}
- Cranmer^{17}
- Lista: book,^{18} notes^{19}
- Cox^{20}
- Cousins^{21}
- Weisberg^{22}
- Cranmer, K. (2020). *Statistics and Data Science*.
- Cosma Shalizi’s notes on

## Probability and its related concepts

### Probability

Probability is of epistemic interest, being in some sense a measure of inductive confidence.

TODO:

- Kolmogorov axioms
- Probability vs odds: \(p/(p+q)\) vs \(p/q\)
- Carnap: “Probability as a guide in life”^{23}

### Expectation and variance

Expectation:

\[ \mathbb{E}(y) \equiv \int dx \: p(x) \: y(x) \label{eq:expectation} \]

Expectation values can be approximated with a partial sum over some data or Monte Carlo sample:

\[ \mathbb{E}(y) \approx \frac{1}{n} \sum_{s=1}^{n} y(x_s) \label{eq:expectation_sum} \]
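
As a rough illustration of the sample-average approximation, the following sketch estimates \(\mathbb{E}(y)\) for \(y(x) = x^2\) with \(x\) drawn from a standard normal distribution, for which the exact answer is 1. The function name `mc_expectation`, the sample size, and the seed are illustrative choices, not part of any particular library.

```python
import random

def mc_expectation(y, sampler, n):
    """Approximate E[y(x)] by averaging y over n draws from sampler()."""
    return sum(y(sampler()) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed, chosen only for reproducibility

# y(x) = x^2 with x ~ N(0, 1); the exact expectation is Var(x) = 1
estimate = mc_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
print(estimate)
```

The estimate fluctuates around the exact value with a spread that shrinks like \(1/\sqrt{n}\).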

The variance of a random variable, \(y\), is defined as

\[\begin{align} \mathrm{Var}(y) &\equiv \mathbb{E}((y - \mathbb{E}(y))^2) \nonumber \\ &= \mathbb{E}(y^2 - 2 \: y \: \mathbb{E}(y) + \mathbb{E}(y)^2) \nonumber \\ &= \mathbb{E}(y^2) - 2 \: \mathbb{E}(y) \: \mathbb{E}(y) + \mathbb{E}(y)^2 \nonumber \\ &= \mathbb{E}(y^2) - \mathbb{E}(y)^2 \label{eq:variance} \end{align}\]

The covariance matrix, \(\boldsymbol{V}\), of random variables \(x_i\) is

\[\begin{align} V_{ij} &= \mathrm{Cov}(x_i, x_j) \equiv \mathbb{E}[(x_i - \mathbb{E}(x_i)) \: (x_j - \mathbb{E}(x_j))] \nonumber \\ &= \mathbb{E}(x_i \: x_{j} - \mu_i \: x_j - x_i \: \mu_j + \mu_i \: \mu_j ) \nonumber \\ &= \mathbb{E}(x_i \: x_{j}) - \mu_i \: \mu_j \label{eq:covariance_matrix_indexed} \end{align}\]

\[\begin{equation} \boldsymbol{V} = \begin{pmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_n) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_n, x_1) & \mathrm{Cov}(x_n, x_2) & \cdots & \mathrm{Var}(x_n) \end{pmatrix} \label{eq:covariance_matrix_array} \end{equation}\]

Diagonal elements of the covariance matrix are the variances of each variable.

\[ \mathrm{Cov}(x_i, x_i) = \mathrm{Var}(x_i) \]

Off-diagonal elements of a covariance matrix measure how related two variables are, linearly. Covariance can be normalized to give the correlation coefficient between variables:

\[ \mathrm{Cor}(x_i, x_j) \equiv \frac{ \mathrm{Cov}(x_i, x_j) }{ \sqrt{ \mathrm{Var}(x_i) \: \mathrm{Var}(x_j) } } \label{eq:correlation_matrix} \]

which is bounded: \(-1 \leq \mathrm{Cor}(x_i, x_j) \leq 1\).

The covariance of two random vectors is given by

\[ \boldsymbol{V} = \mathrm{Cov}(\vec{x}, \vec{y}) = \mathbb{E}(\vec{x} \: \vec{y}^{\mathsf{T}}) - \vec{\mu}_x \: \vec{\mu}_{y}^{\mathsf{T}}\label{eq:covariance_matrix_vectors} \]
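
A minimal sketch of these definitions, assuming a small made-up dataset and replacing expectations with sample means (so variances are divided by \(n\), not \(n-1\)). The helper names `mean`, `cov`, and `cor` are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Cov(x, y) = E(x y) - E(x) E(y), with expectations taken as sample means."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

def cor(x, y):
    """Correlation coefficient: covariance normalized by the standard deviations."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

# Toy data, invented for illustration
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly 2 * x1
x3 = [5.0, 3.0, 4.0, 1.0, 2.0]   # tends to decrease as x1 increases

data = [x1, x2, x3]
V = [[cov(a, b) for b in data] for a in data]  # covariance matrix
R = [[cor(a, b) for b in data] for a in data]  # correlation matrix

for row in R:
    print([round(r, 3) for r in row])
```

The diagonal of `V` holds the variances, `V` is symmetric, and every entry of `R` lies between -1 and 1.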

### Cross entropy

TODO: discuss the Shannon entropy and Kullback-Leibler (KL) divergence.^{24}

Shannon entropy:

\[ H(p) = - \underset{x\sim{}p}{\mathbb{E}}\big[ \log p(x) \big] \label{eq:shannon_entropy} \]

Cross entropy:

\[ H(p, q) = - \underset{x\sim{}p}{\mathbb{E}}\big[ \log q(x) \big] \label{eq:cross_entropy} \]

Kullback-Leibler (KL) divergence:

\[\begin{align} D_\mathrm{KL}(p, q) &= \underset{x\sim{}p}{\mathbb{E}}\left[ \log \left(\frac{p(x)}{q(x)}\right) \right] = \underset{x\sim{}p}{\mathbb{E}}\big[ \log p(x) - \log q(x) \big] \label{eq:kl_divergence} \\ &= - H(p) + H(p, q) \end{align}\]

See also the section on logistic regression.
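
For discrete distributions the expectations above become sums, and the relation \(D_\mathrm{KL}(p, q) = H(p, q) - H(p)\) can be checked numerically. The sketch below assumes two made-up three-outcome distributions and uses natural logarithms; the function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence D_KL(p, q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions, invented for illustration
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

The divergence is non-negative, vanishes when \(q = p\), and is not symmetric in its arguments.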

### Uncertainty

#### Quantiles and standard error

TODO:

- Quantiles
- Practice of standard error for uncertainty quantification.

#### Propagation of error

Given some vector of random variables, \(\vec{x}\), with estimated means, \(\vec{\mu}\), and estimated covariance matrix, \(\boldsymbol{V}\), suppose we are concerned with estimating the variance of some variable, \(y\), that is a function of \(\vec{x}\). The variance of \(y\) is given by

\[ \sigma^2_y = \mathbb{E}(y^2) - \mathbb{E}(y)^2 \,. \]

Taylor expanding \(y(\vec{x})\) about \(\vec{x}=\vec{\mu}\) to first order, with an implied sum over the repeated index \(i\), gives

\[ y(\vec{x}) \approx y(\vec{\mu}) + \left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}} (x_i - \mu_i) \,. \]

Therefore, to first order

\[ \mathbb{E}(y) \approx y(\vec{\mu}) \]

and

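\[\begin{align} \mathbb{E}(y^2) &\approx y^2(\vec{\mu}) + 2 \, y(\vec{\mu}) \, \left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}} \mathbb{E}(x_i - \mu_i) \nonumber \\ &\quad + \mathbb{E}\left[ \left(\left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}}(x_i - \mu_i)\right) \left(\left.\frac{\partial y}{\partial x_j}\right|_{\vec{x}=\vec{\mu}}(x_j - \mu_j)\right) \right] \nonumber \\ &= y^2(\vec{\mu}) + \, \left.\frac{\partial y}{\partial x_i}\frac{\partial y}{\partial x_j}\right|_{\vec{x}=\vec{\mu}} V_{ij} \,, \end{align}\]

where repeated indices are summed over and the middle term vanishes because \(\mathbb{E}(x_i - \mu_i) = 0\). Subtracting \(\mathbb{E}(y)^2 \approx y^2(\vec{\mu})\) gives the first-order propagation of error formula

\[ \sigma^2_y \approx \left.\frac{\partial y}{\partial x_i}\frac{\partial y}{\partial x_j}\right|_{\vec{x}=\vec{\mu}} V_{ij} \,. \]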

TODO: clarify above, then specific examples.

See Cowan.^{25}
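
As a hedged numerical sketch of the first-order result, in which the variance of \(y\) is approximated by contracting the gradient of \(y\) at \(\vec{\mu}\) with \(\boldsymbol{V}\) on both sides, the code below uses a finite-difference gradient. The function names, the example \(y = x_0 \, x_1\), and the input values are assumptions made for illustration.

```python
def gradient(f, mu, eps=1e-6):
    """Central finite-difference estimate of the partial derivatives of f at mu."""
    grad = []
    for i in range(len(mu)):
        up = list(mu)
        dn = list(mu)
        up[i] += eps
        dn[i] -= eps
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad

def propagate_variance(f, mu, V):
    """First-order variance of y = f(x): sum_ij (df/dx_i) (df/dx_j) V_ij."""
    g = gradient(f, mu)
    n = len(mu)
    return sum(g[i] * g[j] * V[i][j] for i in range(n) for j in range(n))

# Toy example: y = x0 * x1 with uncorrelated inputs
mu = [10.0, 5.0]
V = [[0.04, 0.0],
     [0.0, 0.01]]  # sigma_0 = 0.2, sigma_1 = 0.1

var_y = propagate_variance(lambda x: x[0] * x[1], mu, V)
print(var_y)  # analytically 5^2 * 0.04 + 10^2 * 0.01 = 2
```

For a product of uncorrelated variables this reproduces the familiar rule that relative uncertainties add in quadrature.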

### Bayes’ theorem

- Bayes, Thomas (1701-1761)
- Bayes’ theorem

\[ P(A|B) = P(B|A) \: P(A) \: / \: P(B) \label{eq:bayes_theorem} \]

- Extended version of Bayes’ theorem
- Example of conditioning with medical diagnostics
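
A minimal sketch of conditioning in a medical-diagnostics setting, assuming a hypothetical test with 95% sensitivity and 95% specificity applied to a population with 1% prevalence. The denominator \(P(B)\) is expanded with the law of total probability, which is one way to read the "extended" form of the theorem.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) from Bayes' theorem.

    P(+) is expanded as P(+|D) P(D) + P(+|not D) P(not D).
    """
    p_pos_given_d = sensitivity
    p_pos_given_not_d = 1.0 - specificity
    evidence = p_pos_given_d * prior + p_pos_given_not_d * (1.0 - prior)
    return p_pos_given_d * prior / evidence

# Hypothetical numbers, chosen only for illustration
p = posterior(prior=0.01, sensitivity=0.95, specificity=0.95)
print(p)  # about 0.16
```

Even with an accurate test, a positive result for a rare condition leaves the posterior probability well below one half, because false positives from the large healthy population dominate.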

### Likelihood and frequentist vs bayesian probability

- Frequentist vs bayesian probability
  - Frequentism grew out of theories of statistical sampling error.
  - Bayesianism grew out of what used to be called “inverse probability.”
    - Fienberg, S.E. (2006). When did Bayesian inference become “Bayesian?”^{26}
- Weisberg: “Two Schools”^{27}

\[ P(H|D) = P(D|H) \: P(H) \: / \: P(D) \label{eq:bayes_theorem_hd} \]

- Likelihood

\[ L(\theta) = P(D|\theta) \label{eq:likelihood_def_x} \]

- We will return to the frequentist vs bayesian debate in the section on the “Statistics Wars”.

To appeal to such a result is absurd. Bayes’ theorem ought only to be used where we have in past experience, as for example in the case of probabilities and other statistical ratios, met with every admissible value with roughly equal frequency. There is no such experience in this case.^{28}

### Curse of dimensionality

- Curse of dimensionality
  - The volume of the space increases so fast that the available data become sparse.
- Stein’s paradox
  - The ordinary decision rule for estimating the mean of a multivariate Gaussian distribution (with dimensions, \(n \geq 3\)) is inadmissible under mean squared error risk.
  - Proof of Stein’s example
- Probability in high dimensions^{29}
- *High-Dimensional Probability: An introduction with applications in data science*^{30}

## Statistical models

### Parametric models

- Data: \(x_i\)
- Parameters: \(\theta_j\)
- Model: \(f(\vec{x} ; \vec{\theta})\)

### Canonical distributions

#### Bernoulli distribution

\[ \mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \label{eq:bernoulli} \]

which can also be written as

\[ \mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\} \]

or

\[ \mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\} \]

- Binomial distribution
- Poisson distribution

TODO: explain, another important relationship is

#### Normal/Gaussian distribution

\[ N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \label{eq:gaussian} \]

and in \(k\) dimensions:

\[ N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^{\mathsf{T}}\:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \label{eq:gaussian_k_dim} \]

where \(\boldsymbol{\Sigma}\) is the covariance matrix (defined in eq. \(\eqref{eq:covariance_matrix_indexed}\)) of the distribution.

- Central limit theorem
- \(\chi^2\) distribution
- Univariate distribution relationships

### Mixture models

- Gaussian mixture models (GMM)
- Marked Poisson
- pyhf model description
- HistFactory^{32}

## Point estimation and confidence intervals

### Inverse problems

Recall that in the context of parametric models, the pdf of the data, \(x_i\), is modeled by a function, \(f(x_i ; \theta_j)\), with parameters, \(\theta_j\). In a statistical inverse problem, the goal is to infer the values of the model parameters, \(\theta_j\), given some finite set of data, \(\{x_i\}\), sampled from a probability density, \(f(x_i; \theta_j)\), that models the data reasonably well.^{33}

- Inverse problem
- Inverse probability (Fisher)
- Statistical inference
- See also: Structural realism

- Estimators
- Regression
- Accuracy vs precision^{34}

### Bias and variance

The bias of an estimator, \(\hat\theta\), is defined as

\[ \mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \label{eq:bias} \]

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. \(\eqref{eq:variance}\)) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

\[ \mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \label{eq:mse} \]

The MSE of an estimator can be related to its bias and its variance by the following proof:

\[\begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \nonumber \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align}\]

noting that

\[ \mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2 \]

and

\[\begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \mathbb{E}(\hat{\theta} - \theta)^2 \nonumber \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align}\]

we see that MSE is equivalent to

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \label{eq:mse_variance_bias} \]

For an unbiased estimator, the MSE is the variance of the estimator.
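
The decomposition can be checked with a toy simulation. The sketch below assumes Gaussian data with a known true mean and uses a deliberately biased estimator (the sample mean scaled by 0.8); the true value, sample size, number of trials, and seed are illustrative assumptions.

```python
import random

rng = random.Random(2)
theta = 3.0            # true parameter, assumed for this toy study
n, trials = 5, 20_000

estimates = []
for _ in range(trials):
    sample = [rng.gauss(theta, 1.0) for _ in range(n)]
    estimates.append(0.8 * sum(sample) / n)  # deliberately biased estimator

mean_est = sum(estimates) / trials
bias = mean_est - theta
variance = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - theta) ** 2 for e in estimates) / trials

print(bias, variance, mse, variance + bias ** 2)
```

Here the bias is near \(-0.6\) and the variance near \(0.64/5\), and the last two printed numbers agree up to floating-point rounding.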

TODO:

- Note the discussion of the bias-variance tradeoff by Cranmer.
- Note the new deep learning view. See Deep learning.

### Maximum likelihood estimation

A maximum likelihood estimator (MLE) was first used by Fisher.^{35}

\[\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \label{eq:mle} \]

Maximizing \(\mathrm{log} \: L(\theta)\) is equivalent to maximizing \(L(\theta)\), and the former is more convenient because for data that are independent and identically distributed (*i.i.d.*) the joint likelihood can be factored into a product of the likelihoods of the individual measurements:

\[ L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta) \]

and taking the log of the product makes it a sum:

\[ \mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is also equivalent to minimizing \(-\mathrm{log} \: L(\theta)\), the negative log-likelihood (NLL). For data that are *i.i.d.*,

\[ \mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i \]
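
A minimal sketch of maximum likelihood estimation by minimizing the NLL, assuming *i.i.d.* data from an exponential distribution with mean \(\tau\), for which the MLE is known analytically to be the sample mean. The simple golden-section minimizer, the true value of \(\tau\), and the seed are illustrative choices; in practice one would typically use an established minimizer.

```python
import math
import random

def nll(tau, data):
    """Negative log-likelihood of i.i.d. exponential data with mean tau."""
    return sum(math.log(tau) + t / tau for t in data)

def minimize_1d(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

rng = random.Random(3)
data = [rng.expovariate(1.0 / 2.0) for _ in range(1000)]  # true tau = 2 (assumed)

tau_hat = minimize_1d(lambda tau: nll(tau, data), 0.1, 10.0)
print(tau_hat, sum(data) / len(data))
```

The numerical minimum of the NLL agrees with the analytic estimator, the sample mean, to within the precision of the search.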

#### Invariance of likelihoods under reparametrization

- Likelihoods are invariant under reparametrization.^{36}
- Bayesian posteriors are not invariant in general.


#### Ordinary least squares

- Least squares from MLE of gaussian models: \(\chi^2\)
- Ordinary Least Squares (OLS)
- Geometric interpretation

### Variance of MLEs

- Taylor expansion of a likelihood near its maximum
- Cramér-Rao bound^{39}
  - Define efficiency of an estimator.
  - Common formula for variance of unbiased and efficient estimators
  - Proof in Rice^{40}
  - Cranmer: Cramér-Rao bound
  - Under some reasonable conditions, one can show that MLEs are asymptotically efficient and unbiased. TODO: find ref.
- Fisher information matrix
  - “is the key part of the proof of Wilks’ theorem, which allows confidence region estimates for maximum likelihood estimation (for those conditions for which it applies) without needing the Likelihood Principle.”
- Variance of MLEs
  - Wilks’s theorem
  - Method of \(\Delta\chi^2\) or \(\Delta{}L\)
  - Frequentist confidence intervals (e.g. at 95% CL)
  - Cowan^{41}
  - Likelihood need not be Gaussian^{42}
  - Minos method in particle physics in MINUIT^{43}
  - See slides for my talk: Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing
- Asymptotics
  - Cowan, G., Cranmer, K., Gross, E., & Vitells, O. (2012). Asymptotic distribution for two-sided tests with lower and upper boundaries on the parameter of interest.^{44}
- Common error bars
  - Poisson error bars
    - Gaussian approximation: \(\sqrt{n}\)
    - Wilson-Hilferty approximation
  - Binomial error bars
    - Error on efficiency or proportion
    - See: Statistical classification
- Discussion
  - Wainer, H. (2007). The most dangerous equation. (de Moivre’s equation for variance of means)^{45}
- Misc
  - Karhunen-Loève eigenvalue problems in cosmology: How should we tackle large data sets?^{46}

### Bayesian credibility intervals

- Inverse problem to find a posterior probability distribution.
- Maximum a posteriori estimation (MAP)
- Prior sensitivity
  - Betancourt, M. (2018). Towards a principled Bayesian workflow - ipynb
- Not invariant to reparametrization in general
  - Jeffreys priors are invariant to reparametrization
  - TODO: James

### Uncertainty on measuring an efficiency

- Binomial proportion confidence interval
  - Normal/Gaussian/Wald interval
  - Wilson score interval
  - Clopper-Pearson interval^{47}
  - Agresti-Coull interval
  - Rule of three^{48}
- Review by Brown, Cai, & DasGupta^{49}
- Precision vs recall for classification, again
- Classification and logistic regression
- See also:
  - Logistic regression in the section on Classical machine learning.
  - Clustering in the section on Classical machine learning.
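
To make a few of the intervals listed above concrete, here is a hedged sketch of the Wald and Wilson score intervals and the rule of three, assuming \(k\) successes in \(n\) trials. The function names and the example counts (9 out of 10) are illustrative; the Clopper-Pearson and Agresti-Coull intervals are not implemented here.

```python
import math
from statistics import NormalDist

def z_quantile(cl):
    """Two-sided Gaussian quantile for confidence level cl (about 1.96 for 95%)."""
    return NormalDist().inv_cdf(0.5 + cl / 2.0)

def wald_interval(k, n, cl=0.95):
    """Normal-approximation (Wald) interval: p_hat +/- z sqrt(p_hat (1 - p_hat) / n)."""
    z = z_quantile(cl)
    p = k / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p - half, p + half

def wilson_interval(k, n, cl=0.95):
    """Wilson score interval for a binomial proportion."""
    z = z_quantile(cl)
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

def rule_of_three(n):
    """Approximate 95% CL upper limit on p after observing 0 successes in n trials."""
    return 3.0 / n

wald = wald_interval(9, 10)
wilson = wilson_interval(9, 10)
print(wald)    # upper edge exceeds 1 for this small sample
print(wilson)  # stays inside [0, 1]
print(rule_of_three(100), 1.0 - 0.05 ** (1.0 / 100))  # approximation vs exact
```

The Wald interval can extend beyond the physical range of an efficiency when \(n\) is small or \(\hat{p}\) is near 0 or 1, which is one motivation for the alternatives.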

### Examples

- Some sample mean
- Bayesian lighthouse
- Measuring an efficiency
- Some HEP fit

## Statistical hypothesis testing

### Null hypothesis significance testing

- Karl Pearson observing how rare sequences of roulette spins are
- Null hypothesis significance testing (NHST)
- goodness of fit
- Fisher

Fisher:

[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.^{50}

### Neyman-Pearson theory

#### Introduction

- probes an alternative hypothesis^{51}
- Type-1 and type-2 errors
- Power and confidence
- Cranmer, K. (2020). Thumbnail of LHC statistical procedures.
- ATLAS and CMS Collaborations. (2011). Procedure for the LHC Higgs boson search combination in Summer 2011.^{52}
- Cowan, G., Cranmer, K., Gross, E., & Vitells, O. (2011). Asymptotic formulae for likelihood-based tests of new physics.^{53}


#### Neyman-Pearson lemma

Neyman-Pearson lemma:^{54}

For a fixed signal efficiency, \(1-\alpha\), the selection that corresponds to the lowest possible misidentification probability, \(\beta\), is given by

\[ \frac{L(H_1)}{L(H_0)} > k_{\alpha} \,, \label{eq:np-lemma} \]

where \(k_{\alpha}\) is the cut value required to achieve a type-1 error rate of \(\alpha\).

Neyman-Pearson test statistic:

\[ q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \label{eq:qnp-test-stat} \]

Profile likelihood ratio:

\[ \lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \label{eq:profile-llh-ratio} \]

where \(\hat{\mu}\) and \(\hat{\theta}\) are the (unconditional) maximum-likelihood estimators that jointly maximize \(L\), while \(\hat{\theta}_\mu\) is the conditional maximum-likelihood estimator that maximizes \(L\) for a specified signal strength, \(\mu\). Here \(\theta\), as a vector, includes all other parameters of interest and nuisance parameters.

#### Neyman construction

Cranmer: Neyman construction.

TODO: fix

\[ q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \label{eq:q0-test-stat} \]

#### Flip-flopping

- Flip-flopping and Feldman-Cousins confidence intervals^{55}

#### *p*-values and significance

- \(p\)-values and significance^{56}
- Coverage
- Fisherian vs Neyman-Pearson \(p\)-values

Cowan *et al.* define a \(p\)-value as

a probability, under assumption of \(H\), of finding data of equal or greater incompatibility with the predictions of \(H\).^{57}

Also:

It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the \(p\)-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery.”^{58}
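
In particle physics a \(p\)-value is commonly quoted as an equivalent one-sided Gaussian significance, \(Z = \Phi^{-1}(1 - p)\), where \(\Phi\) is the standard normal cumulative distribution. The following sketch of the conversion uses only the Python standard library; the function names are illustrative.

```python
from statistics import NormalDist

def z_from_p(p):
    """One-sided Gaussian significance for a p-value: Z = Phi^{-1}(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p)

def p_from_z(z):
    """One-sided p-value for a significance Z: p = 1 - Phi(Z) = Phi(-Z)."""
    return NormalDist().cdf(-z)

print(p_from_z(5.0))   # the conventional "5 sigma" discovery threshold
print(z_from_p(0.05))  # about 1.64
```

The conventional discovery threshold of \(Z = 5\) corresponds to a \(p\)-value of about \(2.87 \times 10^{-7}\).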

#### Upper limits

- Cousins, R.D. & Highland, V.L. (1992). Incorporating systematic uncertainties into an upper limit.^{59}

#### CLs method

### Asymptotics

- Analytic variance of the likelihood-ratio of gaussians: \(\chi^2\)
  - Wilks^{63}
    - Under the null hypothesis, \(-2 \ln(\lambda) \sim \chi^{2}_{k}\), where \(k\), the degrees of freedom for the \(\chi^{2}\) distribution is the number of parameters of interest (including signal strength) in the signal model but not in the null hypothesis background model.
  - Wald^{64}
    - Wald generalized the work of Wilks for the case of testing some nonzero signal for exclusion, showing \(-2 \ln(\lambda) \approx (\hat{\theta} - \theta)^{\mathsf{T}}V^{-1} (\hat{\theta} - \theta) \sim \mathrm{noncentral}\:\chi^{2}_{k}\).
    - In the simplest case where there is only one parameter of interest (the signal strength, \(\mu\)), then \(-2 \ln(\lambda) \approx \frac{ (\hat{\mu} - \mu)^{2} }{ \sigma^2 } \sim \mathrm{noncentral}\:\chi^{2}_{1}\).
  - Pearson \(\chi^2\)-test
- Cowan *et al.*^{65}
  - Wald approximation
  - Asimov dataset
  - Talk by Armbruster: Asymptotic formulae (2013).
- Criteria for projected discovery and exclusion sensitivities of counting experiments^{66}
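
One widely used result of the asymptotic formulae with the Asimov dataset is the median expected discovery significance for a single-bin counting experiment with expected signal \(s\) and known background \(b\): \(Z = \sqrt{2 \left( (s+b) \ln(1 + s/b) - s \right)}\). The sketch below compares it with the simpler \(s/\sqrt{b}\), to which it reduces when \(s \ll b\). The function names and the example yields are illustrative assumptions, and the formula neglects any uncertainty on \(b\).

```python
import math

def asimov_z(s, b):
    """Asymptotic median discovery significance for a counting experiment."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def naive_z(s, b):
    """Simple s / sqrt(b) approximation, valid when s << b."""
    return s / math.sqrt(b)

# Hypothetical yields, for illustration only
print(asimov_z(10.0, 100.0), naive_z(10.0, 100.0))  # similar when s << b
print(asimov_z(5.0, 1.0), naive_z(5.0, 1.0))        # s / sqrt(b) overestimates at low b
```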

### Student’s *t*-test

- Student’s *t*-test
- ANOVA
- A/B-testing

### Frequentist vs bayesian decision theory

- Frequentist vs bayesian decision theory^{67}
- Goodman, S.N. (1999). Toward evidence-based medical statistics 2: The Bayes factor.^{68}

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.^{69}


### Examples

- Difference of two means: \(t\)-test
- A/B-testing
- New physics

## Uncertainty quantification

### Sinervo classification of systematic uncertainties

- Class-1, class-2, and class-3 systematic uncertainties (good, bad, ugly), classification by Pekka Sinervo (PhyStat2003)^{70}
- Not to be confused with type-1 and type-2 errors in Neyman-Pearson theory
- Heinrich, J. & Lyons, L. (2007). Systematic errors.^{71}
- Caldeira & Nord^{72}

Lyons:

In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error.^{73}

- Poincaré’s three levels of ignorance

### Profile likelihoods

- Profiling and the profile likelihood
  - Importance of Wald and Cowan *et al*.
  - hybrid Bayesian-frequentist method

### Examples of poor estimates of systematic uncertainties

- Unaccounted-for effects
- CDF \(Wjj\) bump
  - Phys.Rev.Lett.106:171801 (2011) / arxiv:1104.0699
  - Invariant mass distribution of jet pairs produced in association with a \(W\) boson in \(p\bar{p}\) collisions at \(\sqrt{s}\) = 1.96 TeV
  - Dorigo, T. (2011). The jet energy scale as an explanation of the CDF signal.
- OPERA. (2011). Faster-than-light neutrinos.
- BICEP2 claimed evidence of B-modes in the CMB as evidence of cosmic inflation without accounting for cosmic dust.

## Statistical classification

### Introduction

- Precision vs recall
- Recall is sensitivity
- Sensitivity vs specificity
- Accuracy
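
These quantities follow directly from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) in a confusion matrix. A minimal sketch, with hypothetical counts chosen for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Common summary metrics computed from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),    # fraction of selected items that are true
        "recall": tp / (tp + fn),       # sensitivity, or true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for an imbalanced dataset
m = classification_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in m.items():
    print(name, round(value, 4))
```

With imbalanced classes the accuracy can be high even when precision and recall are modest, which is why the latter are usually reported alongside it.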

### Examples

- TODO


## Causal inference

### Introduction

- Pearl, J. (2018). *The Book of Why: The new science of cause and effect*.^{74}


### Causal models

- Structural Causal Model (SCM)
- Pearl, J. (2009). Causal inference in statistics: An overview.^{75}
- Robins, J.M. & Wasserman, L. (1999). On the impossibility of inferring causation from association without background knowledge.^{76}
- Peters, J., Janzing, D., & Schölkopf, B. (2017). *Elements of Causal Inference*.^{77}

### Counterfactuals

- TODO

## Exploratory data analysis

### Introduction

- Tukey, John (1915-2000)
  - Exploratory data analysis
  - *Exploratory Data Analysis* (1977)^{78}

### Look-elsewhere effect

- Look-elsewhere effect (LEE)
  - AKA File-drawer effect
- Stopping rules
  - validation dataset
  - statistical issues, violates the likelihood principle
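
A simple way to see the size of the look-elsewhere effect is the trials-factor correction. Under the simplifying assumption of \(N\) independent search regions, a local \(p\)-value, \(p_\mathrm{local}\), corresponds to a global \(p\)-value of \(1 - (1 - p_\mathrm{local})^N\). The function name and the example numbers below are illustrative; real searches usually have correlated regions and need a more careful treatment.

```python
def global_p_value(p_local, n_trials):
    """Chance of at least one fluctuation as extreme as p_local in n independent trials."""
    return 1.0 - (1.0 - p_local) ** n_trials

# A local 3-sigma excess (one-sided p of about 0.00135) searched for in 100 places
print(global_p_value(0.00135, 100))  # about 0.13
```

An excess that looks rare locally becomes unremarkable once the number of places it could have appeared is accounted for.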

### Archiving and data science

- “Data science”
  - Data collection, quality, analysis, archival, and reinterpretation
  - RECAST
- Scientific research and big data

## “Statistics Wars”

### Introduction

Cranmer:

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability \(P (A, B)\) in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability.^{81}

Neyman:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.^{82}

### Likelihood principle

- Likelihood principle
  - The likelihood principle is the proposition that, given a statistical model and a data sample, all the evidence relevant to model parameters is contained in the likelihood function.
  - The history of likelihood^{84}
    - Allan Birnbaum proved that the likelihood principle follows from two more primitive and seemingly reasonable principles, the conditionality principle and the sufficiency principle.^{85}
    - Hacking identified the “law of likelihood.”^{86}
  - Berger & Wolpert. (1988). *The Likelihood Principle*.^{87}

O’Hagan:

The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference.^{88}

- Gandenberger
  - “A new proof of the likelihood principle”^{89}
  - Thesis: *Two Principles of Evidence and Their Implications for the Philosophy of Scientific Method* (2015)
  - gandenberger.org/research
  - Do frequentist methods violate the likelihood principle?
- Criticisms:
- Likelihoodist statistics

Mayo:

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.^{93}

### Discussion

Lyons:

Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors.^{94}

- Carnap
- Sznajder on the alleged evolution of Carnap’s views of inductive logic
^{95}

- Sznajder on the alleged evolution of Carnap’s views of inductive logic
- David Cox
- Ian Hacking
*Logic of Statistical Inference*^{96}

- Neyman
- “Frequentist probability and frequentist statistics”
^{97}

- “Frequentist probability and frequentist statistics”
- Zech
- “Comparing statistical data to Monte Carlo simulation”
^{98}

- “Comparing statistical data to Monte Carlo simulation”
- Richard Royall
*Statistical Evidence: A likelihood paradigm*^{99}

- Jim Berger
- “Could Fisher, Jeffreys, and Neyman have agreed on testing?”
^{100}

- “Could Fisher, Jeffreys, and Neyman have agreed on testing?”
- Deborah Mayo
  - “In defense of the Neyman-Pearson theory of confidence intervals”^{101}
  - Concept of “Learning from error” in *Error and the Growth of Experimental Knowledge*^{102}
  - “Severe testing as a basic concept in a Neyman-Pearson philosophy of induction”^{103}
  - “Error statistics”^{104}
  - *Statistical Inference as Severe Testing*^{105}
  - Statistics Wars: Interview with Deborah Mayo - APA blog
  - Review of *SIST* by Prasanta S. Bandyopadhyay
  - LSE Research Seminar: Current Controversies in Phil Stat (May 21, 2020)
  - Meeting 5 (June 18, 2020)
- Andrew Gelman
  - Confirmationist and falsificationist paradigms of science - Sept. 5, 2014
  - Beyond subjective and objective in statistics^{106}
  - Retire Statistical Significance: The discussion
  - Exchange with Deborah Mayo on abandoning statistical significance
  - Several reviews of *SIST*

- Larry Wasserman
- Kevin Murphy
- Greg Gandenberger
  - An introduction to likelihoodist, bayesian, and frequentist methods (1/3)
  - As Neyman and Pearson put it in their original presentation of the frequentist approach, “without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not too often be wrong” (1933, 291).
  - An introduction to likelihoodist, bayesian, and frequentist methods (2/3)
  - An introduction to likelihoodist, bayesian, and frequentist methods (3/3)
  - An argument against likelihoodist methods as genuine alternatives to bayesian and frequentist methods
  - “Why I am not a likelihoodist”^{109}

- Jon Wakefield
  - *Bayesian and Frequentist Regression Methods*^{110}
- Efron & Hastie
  - “Flaws in Frequentist Inference”^{111}
- Kruschke & Liddell^{112}

Goodman:

The idea that the \(P\) value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).

^{113}
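
The long-run, error-based perspective that Goodman describes can be made concrete with a small simulation. This is only an illustrative sketch: the two-sided z-test, the sample size, the threshold `alpha = 0.05`, and the function names are assumptions chosen for the example, not anything taken from Goodman. It repeats an experiment many times under a true null hypothesis and counts how often \(p < \alpha\).

```python
import numpy as np
from math import erf, sqrt


def two_sided_p_value(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return 1.0 - erf(abs(z) / sqrt(2.0))


def long_run_type1_rate(alpha=0.05, n_obs=25, n_experiments=20000, seed=0):
    """Fraction of repeated experiments that reject a true null (mean 0, sigma 1)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_obs))
    # z statistic of the sample mean when sigma = 1 is known
    z = data.mean(axis=1) * np.sqrt(n_obs)
    p = np.array([two_sided_p_value(zi) for zi in z])
    return float(np.mean(p < alpha))


rate = long_run_type1_rate()
print(f"long-run rejection rate under the null: {rate:.3f}")
```

The rejection rate should come out close to `alpha`. That is a statement about the procedure over hypothetical repetitions; it says nothing by itself about the evidential meaning of the \(p\)-value from any single experiment, which is the distinction the quotation draws.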

## Replication crisis

### Introduction

- Ioannidis, J.P. (2005). Why most published research findings are false.^{114}

### *p*-value controversy

- Wasserstein, R.L. & Lazar, N.A. (2016). The ASA’s statement on \(p\)-values: Context, process, and purpose.^{115}
- Wasserstein, R.L., Schirm, A.L., & Lazar, N.A. (2019). Moving to a World Beyond “p<0.05.”^{116}
- Big names in statistics want to shake up much-maligned P value^{117}
- Hi-Phi Nation, episode 7
- Fisher:

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.

^{118}

- Relationship to the LEE
- Tukey, John (1915-2000)
- Wasserman
- Mayo
  - “Les stats, c’est moi: We take that step here!”
  - “Significance tests: Vitiated or vindicated by the replication crisis in psychology?”^{119}
  - At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability
- Gorard & Gorard. (2016). What to do instead of significance testing.^{120}
- Vox: What a nerdy debate about p-values shows about science–and how to fix it
- Karen Kafadar: The Year in Review … And More to Come
- The JASA Reproducibility Guide

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application.^{121}

## Classical machine learning

### Introduction

- Classification vs regression
- Supervised and unsupervised learning
  - Classification = supervised; clustering = unsupervised
- Arthur Samuel (1901-1990)
- Hastie, Tibshirani, & Friedman^{122}
- *Information Theory, Inference, and Learning*^{123}
- Murphy, K.P. (2012). *Machine Learning: A probabilistic perspective*. MIT Press.^{124}
- Murphy, K.P. (2022). *Probabilistic Machine Learning: An introduction*. MIT Press.^{125}
- Shalev-Shwartz, S. & Ben-David, S. (2014). *Understanding Machine Learning: From Theory to Algorithms*.^{126}
- VC-dimension

### Logistic regression

From a probabilistic point of view,^{129} logistic regression follows from maximum likelihood estimation of a vector of model parameters, \(\vec{w}\). The parameters enter through a dot product with the input features, \(\vec{x}\), which is squashed by a logistic function to give the probability, \(\mu\), of a Bernoulli random variable, \(y \in \{0, 1\}\).

\[ p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)} \]

The negative log-likelihood of multiple trials is

\[\begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \nonumber \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \nonumber \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \nonumber \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \label{eq:cross_entropy_loss0} \end{align}\]

which is the **cross entropy loss**. Note that the first term is non-zero only when the true target is \(y_i=1\), and similarly the second term is non-zero only when \(y_i=0\).^{130} Therefore, we can reparametrize the target \(y_i\) in favor of \(t_{ki}\) that is one-hot in an index \(k\) over classes.

\[ \mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \label{eq:cross_entropy_loss1} \]

where

\[ t_{ki} = \begin{cases} 1 & \mathrm{if}\ (k = y_i = 0)\ \mathrm{or}\ (k = y_i = 1) \\ 0 & \mathrm{otherwise} \end{cases} \]

and

\[ \mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases} \]

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, \(k\), only one term for the true class contributes.

\[ \mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{is\ such\ that}\ t_{ki}=1} \label{eq:cross_entropy_loss2} \]
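
As a numerical check on the derivation above, the three ways of writing the cross entropy loss (Bernoulli form, one-hot form, and true-class form) can be evaluated on the same data. This is a minimal sketch; the function names and the toy targets and probabilities are made up for illustration.

```python
import numpy as np


def binary_nll(y, mu):
    """-sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ] for targets y_i in {0, 1}."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(-np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu)))


def one_hot_cel(y, mu):
    """-sum_i sum_k t_ki log(mu_ki) with one-hot targets over classes k in {0, 1}."""
    y = np.asarray(y, dtype=int)
    mu = np.asarray(mu, dtype=float)
    t = np.stack([1 - y, y]).astype(float)   # t[k, i] = 1 iff k == y_i
    mu_k = np.stack([1.0 - mu, mu])          # mu_k[0] = 1 - mu_i, mu_k[1] = mu_i
    return float(-np.sum(t * np.log(mu_k)))


def true_class_cel(y, mu):
    """-sum_i log(mu_ki), keeping only the k that is the true class of trial i."""
    y = np.asarray(y, dtype=int)
    mu = np.asarray(mu, dtype=float)
    mu_k = np.stack([1.0 - mu, mu])
    return float(-np.sum(np.log(mu_k[y, np.arange(len(y))])))


y = [1, 0, 1, 1]              # toy targets
mu = [0.9, 0.2, 0.6, 0.7]     # toy predicted probabilities of y = 1
print(binary_nll(y, mu), one_hot_cel(y, mu), true_class_cel(y, mu))
```

All three printed values should agree, since each form only regroups the same terms.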

Logistic regression uses the **logit function**,^{131} which is the logarithm of the odds—the ratio of the chance of success to failure. Let \(\mu\) be the probability of success in a Bernoulli trial, then the logit function is defined as

\[ \mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \label{eq:logit} \]

Logistic regression assumes that the logit function is a linear function of the explanatory variable, \(x\).

\[ \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x \]

where \(\beta_0\) and \(\beta_1\) are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, \(\vec{x}\), where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, \(\beta_0\), in a dot product.

\[ \vec{x} = (1, x_1, x_2, \ldots, x_n)^{\mathsf{T}}\]

\[ \vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^{\mathsf{T}}\]

\[ \log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^{\mathsf{T}}\vec{x} \]

For the moment, let \(z \equiv \vec{w}^{\mathsf{T}}\vec{x}\). Exponentiating and solving for \(\mu\) gives

\[ \mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} } \]

This function is called the **logistic or sigmoid function**.

\[ \mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \label{eq:logistic} \]

Since we inverted the logit function by solving for \(\mu\), the inverse of the logit function is the logistic or sigmoid.

\[ \mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z) \]

And therefore,

\[ \mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^{\mathsf{T}}\vec{x}) \]
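
The pieces above can be put together in a short sketch: `logit` and `sigm` as inverse functions, and a fit of \(\vec{w}\) by gradient descent on the mean negative log-likelihood. The synthetic data, the "true" parameters, the learning rate, and the number of steps are all assumptions made for this example; plain gradient descent is used only because it is the simplest optimizer to write down.

```python
import numpy as np


def sigm(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))


def logit(mu):
    """Log-odds; the inverse of sigm."""
    return np.log(mu / (1.0 - mu))


def fit_logistic(X, y, lr=0.5, n_steps=2000):
    """Minimize the mean NLL by gradient descent. X carries a leading column of ones."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        mu = sigm(X @ w)
        grad = X.T @ (mu - y) / len(y)   # gradient of the mean NLL with respect to w
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
x = rng.normal(size=2000)
X = np.column_stack([np.ones_like(x), x])   # prepend 1 so that w[0] is the bias beta_0
w_true = np.array([-0.5, 2.0])              # hypothetical parameters used to simulate data
y = (rng.random(2000) < sigm(X @ w_true)).astype(float)

w_hat = fit_logistic(X, y)
print("estimated (beta_0, beta_1):", w_hat)
```

The gradient has the simple form \(\sum_i (\mu_i - y_i)\, \vec{x}_i\) because the derivative of the sigmoid cancels against the derivative of the log terms in the cross entropy loss.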

See also:

- Logistic regression
- Harlan, W.S. (2007). Bounded geometric growth: motivation for the logistic function.
- Heesch, D. A short intro to logistic regression.
- Roelants, P. (2019). Logistic classification with cross-entropy.

### Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by \(k\)) as

\[ p(\vec{y} | \vec{\mu}) = \mathrm{Cat}(\vec{y} | \vec{\mu}) = \prod_k {\mu_k}^{y_k} \label{eq:categorical_distribution} \]

which is the categorical or multinoulli distribution, where \(\vec{y}\) is one-hot over the classes. The negative log-likelihood of multiple independent trials is

\[ \mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \label{eq:nll_multinomial} \]

Noting again that \(y_{ki} = 1\) only when \(k\) is the true class, and is 0 otherwise, this simplifies to eq. \(\eqref{eq:cross_entropy_loss2}\).
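
A minimal sketch of the multi-class loss follows. The softmax used to turn scores into probabilities is the standard definition, but it is an addition here (the text above does not write it out), and the scores and labels are invented for the example. Note that the arrays are indexed `[i, k]` (trial first), the transpose of the \(\mu_{ki}\) convention in the equations.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def categorical_nll(y_onehot, mu):
    """NLL = -sum_i sum_k y_ki log(mu_ki), with arrays indexed [i, k]."""
    return float(-np.sum(y_onehot * np.log(mu)))


scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])   # hypothetical class scores for two trials
labels = np.array([0, 1])              # true class of each trial
y_onehot = np.eye(3)[labels]

mu = softmax(scores)
nll = categorical_nll(y_onehot, mu)
# Only the true-class term of each trial contributes.
nll_true_class = float(-np.sum(np.log(mu[np.arange(len(labels)), labels])))
print(nll, nll_true_class)
```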

See also:

- Multinomial logistic regression
- McFadden^{132}
- Softmax is really a soft argmax. TODO: find ref.
- Softmax is not unique. There are other squashing functions.^{133}
- Roelants, P. (2019). Softmax classification with cross-entropy.
- Gradients from backprop through a softmax
- Goodfellow et al. point out that *any* negative log-likelihood is a cross entropy between the training data and the probability distribution predicted by the model.^{134}

### Decision trees

- TODO
- Chen, T. & Guestrin, C. (2016). XGBoost: A scalable tree boosting system.^{135}

### Clustering

- unsupervised
- Gaussian Mixture Models (GMMs)
- Gaussian discriminant analysis
- \(\chi^2\)

- Generalized Linear Models (GLMs)
- Exponential family of PDFs
- Multinoulli \(\mathrm{Cat}(x|\mu)\)
- GLMs

- EM algorithm
- \(k\)-means

- Clustering high-dimensional data
  - *t*-distributed stochastic neighbor embedding (t-SNE)
  - Slonim, N., Atwal, G.S., Tkacik, G. & Bialek, W. (2005). Information-based clustering.^{136}

- Topological data analysis
- Dindin, M. (2018). TDA To Rule Them All: ToMATo Clustering.

- Relationship of clustering and autoencoding
- Olah, C. (2014). Neural networks, manifolds, and topology.
- Batson et al. (2021). Topological obstructions to autoencoding.^{137}

- “What are the true clusters?”^{138}

## Deep learning

### Introduction

- Conceptual reviews of deep learning
  - Lower to higher level representations^{139}
  - LeCun, Y., Bengio, Y., & Hinton, G. (2015). Review: Deep learning.^{140}
  - *Deep Learning*^{141}
  - Kaplan, J. (2019). Notes on contemporary machine learning.^{142}
- Backpropagation
  - Rumelhart^{143}
  - Schmidhuber’s Critique of Honda Prize for Dr. Hinton.
  - Schmidhuber: Who invented backpropagation?
  - Schmidhuber: The most cited neural networks all build on work done in my labs.
- Practical guides
  - Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures.
  - Li, H. et al. (2017). Visualizing the loss landscape of neural nets.
- Discussion
  - Sutton, R. (2019). The bitter lesson.^{144}
  - Watson, D. & Floridi, L. (2019). The explanation game: A formal framework for interpretable machine learning.^{145}
  - AIMyths.com

### Deep double descent

- Bias and variance trade-off. See Bias and variance.
- MSE and model capacity
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off.^{147}
- Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt.^{148}
  - Deep Double Descent. *OpenAI Blog*.
- Hubinger, E. (2019). Understanding “Deep Double Descent”. *LessWrong*.
- Dar, Y., Muthukumar, V., & Baraniuk, R.G. (2021). A farewell to the bias-variance tradeoff? An overview of the theory of overparameterized machine learning.^{149}
- Balestriero, R., Pesenti, J., & LeCun, Y. (2021). Learning in high dimension always amounts to extrapolation.^{150}
- Nagarajan, V. (2021). Explaining generalization in deep learning: progress and fundamental limits.^{151}
- Bubeck, S. & Sellke, M. (2021). A universal law of robustness via isoperimetry.^{152}
- Bach, F. (2022). *Learning Theory from First Principles*.^{153}

### Regularization

Regularization is any change we make to the training algorithm that is intended to reduce the generalization error but not the training error.^{154}

Most common regularizations:

- L2 Regularization
- L1 Regularization
- Data Augmentation
- Dropout
- Early Stopping
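
As a small illustration of the first item, L2 regularization, the sketch below uses ridge regression, where the penalized least-squares problem has a closed-form solution. The data, the penalty values, and the function names are invented for the example. It only demonstrates the mechanism (smaller weights at the cost of a larger training error); whether the generalization error actually falls depends on the data and on the penalty strength.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


rng = np.random.default_rng(0)
w_true = np.zeros(10)
w_true[0] = 1.0                                  # only the first feature matters
X_train = rng.normal(size=(20, 10))
y_train = X_train @ w_true + 0.5 * rng.normal(size=20)

results = []
for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X_train, y_train, lam)
    results.append((lam, float(np.linalg.norm(w)), mse(X_train, y_train, w)))
    print(f"lambda={lam:5.1f}  |w|={results[-1][1]:.3f}  train MSE={results[-1][2]:.3f}")
```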

Papers:

- Loshchilov, I. & Hutter, F. (2019). Decoupled weight decay regularization.
- A group theoretic framework for data augmentation^{155}

### Batch size vs learning rate

Papers:

- Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

Hoffer, E. et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

- \(\eta \propto \sqrt{m}\)

Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

- \(\eta \propto m\)

You, Y. et al. (2017). Large batch training of convolutional networks.

- Layer-wise Adaptive Rate Scaling (LARS)

You, Y. et al. (2017). ImageNet training in minutes.

- Layer-wise Adaptive Rate Scaling (LARS)

Jastrzebski, S. (2018). Three factors influencing minima in SGD.

- \(\eta \propto m\)

Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

- \(m \propto \eta\)

Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when \(m \ll M\), the covariance matrix of the weight update \(\mathrm{Cov(\eta \Delta\theta)}\) scales linearly with the quantity \(\eta^2/m\).

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update \(\eta \Delta\theta\). Conversely, to keep the scaling of the covariance of the weight update vector \(\eta \Delta\theta\) constant would require scaling \(\eta\) with the square root of the batch size \(m\) (Krizhevsky, 2014; Hoffer et al., 2017).

Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.

- Post-local SGD.

Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as \(\eta \propto \sqrt{m}\) attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

- McCandlish, S. et al. (2018). An empirical model of large-batch training.
- Critical batch size

- Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling (\(b\)-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

- Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.
- \(\eta \propto m\)

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

- You, Y. et al. (2019). Large-batch training for LSTM and beyond.
- Warmup and use \(\eta \propto \sqrt{m}\)

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

- You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.
- LARS and LAMB

- Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

- Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

Kaplan, J. et al. (2020). Scaling laws for neural language models.

Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.

Blogs:

- Shen, K. (2018). Effect of batch size on training dynamics.
- Chang, D. (2020). Effect of batch size on neural net training.
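
The two learning-rate scaling rules that recur in the papers above, \(\eta \propto m\) and \(\eta \propto \sqrt{m}\), can be compared with a toy calculation. The sketch assumes i.i.d. scalar per-example gradients with a fixed variance, so that the variance of a weight update is \(\eta^2 \sigma^2 / m\), in line with the \(\eta^2/m\) scaling quoted from Masters & Luschi. The function names and numbers are hypothetical.

```python
import numpy as np


def scaled_learning_rate(base_lr, base_batch, batch, rule="linear"):
    """Rescale a learning rate tuned at base_batch to a new batch size."""
    ratio = batch / base_batch
    if rule == "linear":
        return base_lr * ratio            # eta proportional to m
    if rule == "sqrt":
        return base_lr * np.sqrt(ratio)   # eta proportional to sqrt(m)
    raise ValueError(f"unknown rule: {rule}")


def update_variance(lr, batch, per_example_var=1.0):
    """Variance of lr * (mean of `batch` i.i.d. gradients) = lr^2 * var / batch."""
    return lr ** 2 * per_example_var / batch


base_lr, base_batch = 0.1, 32
for batch in (32, 128, 512):
    for rule in ("linear", "sqrt"):
        lr = scaled_learning_rate(base_lr, base_batch, batch, rule)
        print(f"m={batch:4d}  rule={rule:6s}  eta={lr:.3f}  "
              f"update variance={update_variance(lr, batch):.6f}")
```

Under these assumptions the square-root rule keeps the update variance constant, while the linear rule lets it grow linearly with the batch size.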

### Normalization

- BatchNorm
- LayerNorm, GroupNorm
- OnlineNorm
- Kiani, B., Balestriero, R., Lecun, Y., & Lloyd, S. (2022). projUNN: efficient method for training deep networks with unitary matrices.^{156}

### Computer vision

- Computer Vision (CV)
- Fukushima: neocognitron^{157}
- LeCun: OCR with backpropagation^{158}
- LeCun: LeNet-5^{159}
- Ciresan: MCDNN^{160}
- Krizhevsky, Sutskever, and Hinton: AlexNet^{161}
- VGG^{162}
- ResNet^{163}
  - ResNet is performing a forward Euler discretisation of the ODE: \(\dot{x} = \sigma(F(x))\).^{164}
- MobileNet^{165}
- Neural ODEs^{166}
- EfficientNet^{167}
- VisionTransformer^{168}
- EfficientNetV2^{169}
- gMLP^{170}
- Liu, Y. et al. (2021). A survey of visual transformers.
- Ingrosso, A. & Goldt, S. (2022). Data-driven emergence of convolutional structure in neural networks.^{171}
- Park, N. & Kim, S. (2022). How do vision transformers work?^{172}
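
The remark in the list above, that a ResNet performs a forward Euler discretisation of \(\dot{x} = \sigma(F(x))\), can be sketched in a few lines. The choices \(F(x) = Wx\), \(\sigma = \tanh\), the weight matrix, and the weight sharing across blocks are all simplifying assumptions for illustration; a real ResNet has different learned weights in every block.

```python
import numpy as np


def residual_block(x, W, h=1.0):
    """One residual update x + h * sigma(F(x)), here with F(x) = W x and sigma = tanh."""
    return x + h * np.tanh(W @ x)


def integrate(x0, W, t_end=1.0, n_blocks=10):
    """Stacking n_blocks weight-tied residual blocks is forward Euler with step t_end / n_blocks."""
    h = t_end / n_blocks
    x = np.array(x0, dtype=float)
    for _ in range(n_blocks):
        x = residual_block(x, W, h)
    return x


W = np.array([[0.0, -1.0],
              [1.0, 0.0]])      # hypothetical weights
x0 = [1.0, 0.5]
reference = integrate(x0, W, n_blocks=10000)   # fine discretisation as a stand-in for the ODE solution

errors = {}
for n in (1, 10, 100):
    errors[n] = float(np.linalg.norm(integrate(x0, W, n_blocks=n) - reference))
    print(f"{n:4d} blocks: distance to reference = {errors[n]:.5f}")
```

A standard residual block corresponds to a step size of \(h = 1\); using more, smaller steps moves the stacked blocks towards the continuous-time flow, which is the limit that Neural ODEs work with.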

Resources:

- Neptune.ai. (2021). Object detection algorithms and libraries.
- facebookresearch/vissl
- PyTorch Geometric (PyG)

### Natural language processing

- Natural Language Processing (NLP)
- History
  - Firth (1957): “You shall know a word by the company it keeps.”^{173}
  - Nirenburg, S. (1996). Bar Hillel and Machine Translation: Then and Now.^{174}
  - Hutchins, J. (2000). Yehoshua Bar-Hillel: A philosopher’s contribution to machine translation.^{175}
- word2vec
  - Mikolov^{176}
  - Julia Bazińska
  - Olah, C. (2014). Deep learning, NLP, and representations.
  - Migdal, P. (2017). king - man + woman is queen; but why?
  - Ethayarajh, K. (2019). Word embedding analogies: Understanding King - Man + Woman = Queen.
  - Allen, C. (2019). “Analogies Explained” … Explained.
- RNNs and LSTMs
  - Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory.^{177}
  - Olah, C. (2015). Understanding LSTM networks.
  - Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks.
- Backpropagation through time (BPTT)
- Neural Machine Translation (NMT)
- Scaling laws in NLP
  - Rationalism and empiricism in artificial intelligence: A survey of 25 years of evaluation [in NLP].^{182}
  - Kaplan, J. et al. (2020). Scaling laws for neural language models.^{183}
- Attention and Transformers
  - Transformer^{184}
  - BERT^{185}
  - Horev, R. (2018). BERT Explained: State of the art language model for NLP.
  - Horan, C. (2021). 10 things you need to know about BERT and the transformer architecture that are reshaping the AI landscape.
  - Video: What are transformer neural networks?
  - Video: How to get meaning from text with language model BERT.
  - ALBERT^{186}
  - GPT-1,^{187} 2,^{188} 3^{189}
  - Alammar, J. (2019). The illustrated GPT-2.
  - Yang, Z. et al. (2019). XLNet: Generalized autoregressive pretraining for language understanding.^{190}
  - Daily Nous: Philosophers On GPT-3.
  - AlphaFold: see DeepMind’s blog posts on AlphaFold1 and AlphaFold2 (2020); slides from the CASP14 conference are publicly available.
  - Joshi, C. (2020). Transformers are GNNs.
  - Lakshmanamoorthy, R. (2020). A complete learning path to transformers (with guide to 23 architectures).
  - Zaheer, M. et al. (2020). Big Bird: Transformers for longer sequences.^{191}
  - Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2022). Efficient transformers: A survey.^{192}
- Textbooks
  - Jurafsky, D. & Martin, J.H. (2022). *Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition*.^{193}

### Reinforcement learning

- Reinforcement Learning (RL)
- Dynamic programming
- Bellman equation
- Backward induction
  - John von Neumann & Oskar Morgenstern. (1944). *Theory of Games and Economic Behavior*.

Tutorials:

- RL course by David Silver
- RL course by Emma Brunskill
- DeepMind Reinforcement Learning Lecture Series 2021

#### Q-learning

- Q-learning and DQN
- Uses the Markov Decision Process (MDP) framework
- The Bellman equation^{196}
- Q-learning is a value-based learning algorithm. Value-based algorithms update the value function using an equation (in particular, the Bellman equation), whereas policy-based algorithms estimate the value function with a greedy policy obtained from the last policy improvement (source: towardsdatascience.com).
- DQN masters Atari^{197}
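
A minimal sketch of tabular Q-learning shows how the Bellman equation drives the value update. The environment is a made-up five-state chain (move left or right, reward 1 on reaching the right end), and the step size, discount, and exploration rate are arbitrary choices for the example. DQN replaces the table with a neural network, which is not attempted here.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2      # actions: 0 = left, 1 = right
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic chain: reward 1 on reaching the right end, which is terminal."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning on the chain MDP."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(N_ACTIONS))
            else:
                # break ties at random so the untrained agent still explores
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = int(rng.choice(best))
            s_next, r, done = step(s, a)
            # Bellman target: r + gamma * max_a' Q(s', a'), with no bootstrap at terminal states
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q


Q = q_learning()
print("greedy action per non-terminal state:", np.argmax(Q[:GOAL], axis=1))
```

The greedy policy read off from the table should move right in every state, with values that decay by a factor of `gamma` per step away from the goal.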

#### AlphaZero

- AlphaGo Lee^{198} → AlphaGo Zero^{199} → AlphaZero^{200}
- OpenAI Five masters Dota 2
- AlphaStar masters StarCraft II
- AlphaZero
- \(\pi(a|s)\) and \(V(s)\)
- Monte Carlo Tree Search (MCTS)

#### Counterfactual regret minimization

- Counterfactual Regret Minimization (CFR)
- CFR differs from traditional RL algorithms in that it does not try to maximize expected return. Instead, it minimizes exploitability. CFR does not use the MDP framework; instead, it uses extensive-form games (source: Quora).
- Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2007). Regret minimization in games with incomplete information.^{201}
- Lanctot, M. (2009). Monte Carlo sampling for regret minimization.^{202}
  - Monte Carlo Counterfactual Regret Minimization (MCCFR)
- Neller, T.W. & Lanctot, M. (2013). An introduction to counterfactual regret minimization.^{203}
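
CFR is built on regret matching, which is easiest to see in a one-shot game. The sketch below runs regret matching in self-play on rock-paper-scissors, using expected payoffs rather than sampled actions so that the run is deterministic. It is not CFR itself (there is no game tree and no counterfactual weighting), and the asymmetric starting regrets are an arbitrary choice to make the dynamics non-trivial.

```python
import numpy as np

# Row player's payoff in rock-paper-scissors; the column player receives the negative.
PAYOFF = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])


def regret_matching_strategy(cum_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))


def self_play(n_iters=20000):
    """Return the time-averaged strategies of both players."""
    regrets = [np.array([1.0, 0.0, 0.0]), np.zeros(3)]
    strategy_sums = [np.zeros(3), np.zeros(3)]
    for _ in range(n_iters):
        s0 = regret_matching_strategy(regrets[0])
        s1 = regret_matching_strategy(regrets[1])
        u0 = PAYOFF @ s1             # expected payoff of each row action
        u1 = -(PAYOFF.T @ s0)        # expected payoff of each column action
        regrets[0] += u0 - s0 @ u0   # regret for not having played each action
        regrets[1] += u1 - s1 @ u1
        strategy_sums[0] += s0
        strategy_sums[1] += s1
    return [s / n_iters for s in strategy_sums]


avg_row, avg_col = self_play()
print("average strategies:", avg_row, avg_col)
print("exploitability of the column average:", float(np.max(PAYOFF @ avg_col)))
```

The current strategies keep cycling, but the averaged strategies approach the uniform equilibrium, which is the sense in which these methods minimize exploitability rather than maximize expected return.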

#### Solving poker

- Earlier poker work
  - Bowling, M., Burch, N., Johanson, M., & Tammelin, O. (2015). Heads-up limit hold’em poker is solved.^{204}
    - CFR+
  - Heinrich & Silver. (2016). Deep reinforcement learning from self-play in imperfect-information games.^{205}
    - Q-learning
  - Moravcik, M. et al. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker.^{206}
- Libratus
  - Brown, N. & Sandholm, T. (2018). Superhuman AI for heads-up no-limit poker: Libratus beats top professionals.^{207}
    - bet and card abstraction
    - MCCFR used to find a solution of the abstracted game: blueprint
  - Brown, N. & Sandholm, T. (2019). Solving imperfect-information games via discounted regret minimization.^{208}
  - Brown, N., Lerer, A., Gross, S., & Sandholm, T. (2019). Deep counterfactual regret minimization.^{209}
  - Hart, S. & Mas-Colell, A. (2000). A simple adaptive procedure leading to correlated equilibrium.^{210}
- Pluribus
  - Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker.^{211}
  - Brown, N. (2019). Facebook, Carnegie Mellon build first AI that beats pros in 6-player poker.
  - No limit: AI poker bot is first to beat professionals at multiplayer game
- ReBeL
  - Brown, N. et al. (2020). Combining deep reinforcement learning and search.^{212}
  - ReBeL: A general game-playing AI bot that excels at poker and more
  - YouTube by Brown: Combining deep reinforcement learning and search for imperfect-information games.
  - Brown, N. (2020). *Equilibrium finding for large adversarial imperfect-information games*.^{213}

### Applications in physics

- HEPML-LivingReview: A Living Review of Machine Learning for Particle Physics
- Spears, B.K. et al. (2018). Deep learning: A guide for practitioners in the physical sciences.^{214}
- Cranmer, K., Seljak, U., & Terao, K. (2021). Machine learning (Review in the PDG).^{215}

## Theoretical machine learning

### Algorithmic information theory

- Ray Solomonoff (1926-2009)
- Solomonoff induction
  - Naturally formalizes Occam’s razor
  - Incomputable
- Rathmanner, S. & Hutter, M. (2011). A philosophical treatise of universal induction.^{216}

### No free lunch theorems

- David Wolpert and William G. Macready
  - No free lunch theorems for search (1995)^{217}
  - The lack of a priori distinctions between learning algorithms (1996)^{218}
  - No free lunch theorems for optimization (1997)^{219}
  - Shalev-Shwartz, S. & Ben-David, S. (2014).^{220}
  - McDermott, J. (2019). When and why metaheuristics researchers can ignore “no free lunch” theorems.^{221}
  - Wolpert, D.H. (2007). Physical limits of inference.^{222}
  - Wolpert, D.H. & Kinney, D. (2020). Noisy deductive reasoning: How humans construct math, and how math constructs universes.^{223}
- Blogs:
  - Fedden, L. (2017). The no free lunch theorem.
  - Lokesh, M. (2020). The intuition behind the no free lunch theorem.
  - Mueller, A. (2019). Don’t cite the no free lunch theorem.
  - Quora answer by Luis Argerich
- Inductive bias
  - Yudkowsky, E. (2007). Inductive bias. *LessWrong*.
  - Ugly duckling theorem
  - Hamilton, L.D. (2014). The inductive biases of various machine learning algorithms.
  - Mitchell, T.M. (1980). The need for biases in learning generalizations.^{224}
- Gerhard Schurz
- Dan A. Roberts. (2021). Why is AI hard and physics simple?^{225}
  - See also: Unreasonable effectiveness
- More
  - Goldreich, O. & Ron, D. (1997). On universal learning algorithms.^{226}
  - Nakkiran, P. (2021). Turing-universal learners with optimal scaling laws.^{227}
  - Bousquet, O., Hanneke, S., Moran, S., Van Handel, R., & Yehudayoff, A. (2021). A theory of universal learning.^{228}

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available.

^{229}

Roberts:

From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon.

^{230}
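
The core of the no free lunch idea for supervised learning can be checked by brute force on a tiny problem. The sketch below enumerates every boolean target function on 3-bit inputs, trains two hypothetical learners on a fixed subset of inputs, and averages their accuracy on the remaining (off-training-set) inputs with all targets weighted equally. The learners and the train/test split are arbitrary choices for the example.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=3))    # all 8 possible inputs
TRAIN, TEST = INPUTS[:5], INPUTS[5:]        # arbitrary split; TEST is off-training-set


def majority_learner(train_x, train_y):
    """Always predict the most common training label (ties go to 0)."""
    guess = int(2 * sum(train_y) > len(train_y))
    return lambda x: guess


def nearest_neighbor_learner(train_x, train_y):
    """Predict the label of the closest training input in Hamming distance."""
    def predict(x):
        dists = [sum(a != b for a, b in zip(x, xt)) for xt in train_x]
        return train_y[dists.index(min(dists))]
    return predict


def mean_ots_accuracy(learner):
    """Off-training-set accuracy averaged uniformly over all 2^8 boolean targets."""
    total, n_targets = 0.0, 0
    for labels in product([0, 1], repeat=len(INPUTS)):
        target = dict(zip(INPUTS, labels))
        predict = learner(TRAIN, [target[x] for x in TRAIN])
        total += sum(predict(x) == target[x] for x in TEST) / len(TEST)
        n_targets += 1
    return total / n_targets


print("majority learner:        ", mean_ots_accuracy(majority_learner))
print("nearest-neighbor learner:", mean_ots_accuracy(nearest_neighbor_learner))
```

Both learners average to chance, because for any fixed training labels the unseen labels range over every combination equally often. Any advantage in practice therefore has to come from an inductive bias that matches a non-uniform distribution over targets, which is the point of the Raissi and Roberts passages above.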

### Graphical tensor notation

- Penrose graphical notation
- Predrag Cvitanovic
- Matrices as Tensor Network Diagrams
- Multi-layer perceptrons

### Universal approximation theorem

- Minsky, M. & Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*.^{231}
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators.^{232}
- Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The expressive power of neural networks: A view from the width.^{233}
- Ismailov, V. (2020). A three layer neural network can represent any multivariate function.^{234}
- Multi-layer perceptrons with two or more layers are universal approximators.^{235}
- Seemed to slow the interest in deeper networks?
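
The flavor of the universal approximation result can be seen numerically with a single hidden layer of increasing width. This sketch is a simplification and not a proof: the hidden weights are random and left untrained, only the output weights are fit (by least squares), the target function is an arbitrary choice, and the error is measured on the fitting grid only.

```python
import numpy as np


def hidden_features(x, W, b):
    """tanh hidden layer of a one-hidden-layer perceptron, plus a constant column."""
    return np.column_stack([np.ones_like(x), np.tanh(np.outer(x, W) + b)])


def fit_error(x, y, W, b, width):
    """RMS error after a least-squares fit of the output weights on the first `width` units."""
    H = hidden_features(x, W[:width], b[:width])
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return float(np.sqrt(np.mean((H @ coef - y) ** 2)))


rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = np.sin(3 * np.pi * x)                 # hypothetical target function
W = rng.normal(scale=5.0, size=64)        # random, untrained hidden weights
b = rng.normal(scale=5.0, size=64)        # random, untrained hidden biases

errors = {width: fit_error(x, y, W, b, width) for width in (2, 8, 64)}
for width, err in errors.items():
    print(f"hidden width {width:3d}: RMS error {err:.4f}")
```

Because the narrower networks reuse a subset of the wider network's hidden units, the fit can only improve as the width grows. The theorem is about what such networks can represent; it does not say that training will find the representation or that it generalizes.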

### Relationship to statistical mechanics

- Logistic/softmax and Boltzmann factors
- Bahri^{236}
- Halverson^{237}
- Canatar, A., Bordelon, B., & Pehlevan, C. (2020). Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks.^{238}
- Roberts, Yaida, & Hanin. (2021). *The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks*.^{239}
  - Introduced in Facebook AI’s blog
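
The first item above, the link between the logistic/softmax functions and Boltzmann factors, can be written out directly. The notation here (energies \(E_k\), temperature \(T\), Boltzmann constant \(k_B\)) is the standard one from statistical mechanics and is an addition to these notes. The Boltzmann distribution over states \(k\) is a softmax of the scores \(z_k = -E_k / k_B T\):

\[ p_k = \frac{ e^{-E_k / k_B T} }{ \sum_j e^{-E_j / k_B T} } = \mathrm{softmax}(\vec{z})_k \quad \mathrm{with} \quad z_k = -\frac{E_k}{k_B T} \]

For a two-state system with energies \(E_0\) and \(E_1\), this reduces to the logistic function of the energy difference:

\[ p_1 = \frac{ e^{-E_1 / k_B T} }{ e^{-E_0 / k_B T} + e^{-E_1 / k_B T} } = \frac{ 1 }{ 1 + e^{-(E_0 - E_1) / k_B T} } = \mathrm{sigm}\left( \frac{E_0 - E_1}{k_B T} \right) \]

In this reading the logit is an energy difference in units of \(k_B T\), and the normalizing sum in the softmax plays the role of the partition function.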

### Relationship to gauge theory

- Cohen & Welling. (2016). Group equivariant convolutional networks.^{240}
- Gauge equivariant convolutional networks and the icosahedral CNN (2019)^{241}
- Pavlus, J. (2020). An idea from physics helps AI see in higher dimensions.
- SE(3)-Transformers^{242} and blog post.
- e3nn: a modular PyTorch framework for Euclidean neural networks
- List of papers: Chen-Cai-OSU/awesome-equivariant-network
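
The property these papers generalize is equivariance: transforming the input and then applying the layer gives the same result as applying the layer and then transforming the output. A minimal sketch for the simplest case, translations acting on a 1D periodic signal, is below. The signal, kernel, and the non-equivariant comparison map are made up for the example, and nothing here touches the group or gauge constructions of the cited work.

```python
import numpy as np


def circular_conv(x, kernel):
    """1D circular cross-correlation: out[i] = sum_j kernel[j] * x[(i + j) mod n]."""
    n = len(x)
    return np.array([sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
                     for i in range(n)])


def shift(x, s):
    """Cyclic translation by s positions."""
    return np.roll(x, s)


def position_dependent(x):
    """A map with a different weight at every position; not translation equivariant."""
    return np.arange(len(x)) * x


x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 0.5])
kernel = np.array([0.5, -1.0, 0.25])

lhs = circular_conv(shift(x, 2), kernel)   # translate, then convolve
rhs = shift(circular_conv(x, kernel), 2)   # convolve, then translate
print("convolution commutes with translation:", np.allclose(lhs, rhs))
print("position-dependent map commutes:",
      np.allclose(position_dependent(shift(x, 2)), shift(position_dependent(x), 2)))
```

Weight sharing across positions is what makes the convolution commute with the shift; group equivariant networks extend the same weight sharing to larger symmetry groups such as rotations.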

### Thermodynamics of computation

- Wolpert, D. (2018). Why do computers use so much energy?
- Santa Fe Institute: Thermodynamics of Computation

## Information geometry

### Introduction

- Smith, L. (2019). A gentle introduction to information geometry.^{243}
- Nielsen, F. (2018). An elementary introduction to information geometry.^{244}
- Amari, S. (2016). *Information Geometry and Its Applications*.^{245}
- Geomstats tutorial: Information geometry

### Geometric understanding of classical statistics

- Balasubramanian, V. (1996). A geometric formulation of Occam’s razor for inference of parametric distributions.^{246}
- Balasubramanian, V. (1996). Statistical inference, Occam’s razor and statistical mechanics on the space of probability distributions.^{247}
- Calin, O. & Udriste, C. (2014). *Geometric Modeling in Probability and Statistics*.^{248}
- Cranmer: Information geometry (coming soon?)

### Geometric understanding of deep learning

- Lei, N. et al. (2018). Geometric understanding of deep learning.^{249}
- Gao, Y. & Chaudhari, P. (2020). An information-geometric distance on the space of tasks.^{250}
- Bronstein, M.M. et al. (2021). Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.^{251}

## Automation

### AutoML

- Neural Architecture Search (NAS)
- AutoML frameworks
- RL-driven NAS
- learned sparsity

### Surrogate models

- Autoencoders, latent variables
- The manifold hypothesis
  - Olah, C. (2014). Neural networks, manifolds, and topology.
  - Fefferman, C., Mitter, S., & Narayanan, H. (2016). Testing the manifold hypothesis.^{252}
- Physical constraints in loss functions
  - Raissi, M., Perdikaris, P., & Karniadakis, G.E. (2017). Physics informed deep learning (Part I) and (Part II).^{253}
  - Karniadakis, G.E. et al. (2021). Physics-informed machine learning.^{254}
  - Howard, J.N. et al. (2021). Foundations of a fast, data-driven, machine-learned simulator.^{255}
  - Thuerey, N. et al. (2021). Physics-based deep learning.^{256}
  - physicsbaseddeeplearning.org
- Simulation-based inference
  - Cranmer, K., Brehmer, J., & Louppe, G. (2019). The frontier of simulation-based inference.^{257}
  - Baydin, A.G. et al. (2019). Etalumis: Bringing probabilistic programming to scientific simulators at scale.^{258}

Lectures:

- Paul Hand. (2020). Invertible neural networks and inverse problems.
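
A toy illustration of the "physical constraints in loss functions" idea listed above. This is a minimal sketch under stated assumptions, not the method of any of the cited papers: the differential equation du/dt = -u, the finite-difference derivative (physics-informed networks typically use automatic differentiation instead), and the names `physics_informed_loss`, `observations`, and `weight` are all chosen here purely for illustration.

```python
import math


def physics_informed_loss(u, t, observations, weight=1.0):
    """Data misfit plus a penalty on the residual of du/dt = -u.

    u, t: candidate solution values on a uniform time grid.
    observations: dict mapping grid index -> measured value.
    weight: hypothetical knob trading off data fit against the physics term.
    """
    # Ordinary supervised term: mean squared error at the observed points.
    data_loss = sum((u[i] - y) ** 2 for i, y in observations.items()) / len(observations)

    # Physics term: central-difference estimate of du/dt at interior points.
    # The residual du/dt + u vanishes when the candidate obeys du/dt = -u.
    h = t[1] - t[0]
    residuals = [(u[i + 1] - u[i - 1]) / (2 * h) + u[i] for i in range(1, len(u) - 1)]
    physics_loss = sum(r ** 2 for r in residuals) / len(residuals)

    return data_loss + weight * physics_loss


# Three noiseless observations of the true solution u(t) = exp(-t).
t = [i / 100 for i in range(101)]
observations = {i: math.exp(-t[i]) for i in (0, 50, 100)}

# A candidate that satisfies the equation versus one that does not.
loss_exact = physics_informed_loss([math.exp(-x) for x in t], t, observations)
loss_wrong = physics_informed_loss([1.0 - x for x in t], t, observations)
print(loss_exact < loss_wrong)
```

The physics term is evaluated on the whole grid, not only where data exist, so it constrains the candidate between the sparse observations as well.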

### AutoScience

- Automated discovery
- Anderson, C. (2008). The End of Theory: The data deluge makes the scientific method obsolete.^{259}
- Cranmer, K. (2017). Active sciencing.
- Asch, M. et al. (2018). Big data and extreme-scale computing: Pathways to Convergence-Toward a shaping strategy for a future software and data ecosystem for scientific inquiry.^{260}
  - Note that this description of abduction is missing that it is normative (i.e. “best-fit”).
- D’Agnolo, R.T. & Wulzer, A. (2019). Learning New Physics from a Machine.^{261}
- Udrescu, S. & Tegmark, M. (2020). Symbolic pregression: Discovering physical laws from raw distorted video.^{262}
- Cranmer, M. et al. (2020). Discovering symbolic models from deep learning with inductive biases.^{263}
  - Video: Discussion by Yannic Kilcher
- Liu, Z., Madhavan, V., & Tegmark, M. (2022). AI Poincare 2.0: Machine learning conservation laws from differential equations.^{264}

See also:

## Implications for the realism debate

### Introduction

See also:

### Real clusters

- Nope: Hennig

See also:

### Word meanings

- Note that NLP has implications for the philosophy of language and realism
- NLP word representations and the Wittgenstein philosophy of language.^{269}

See also:

## My thoughts

My docs:

My talks:

- Likelihood functions for supersymmetric observables
- Machine learning and realism
- Primer on statistics: MLE, confidence intervals, and hypothesis testing

## Annotated bibliography

### Mayo, D.G. (1996). *Error and the Growth of Experimental Knowledge*.

- Mayo (1996)

#### My thoughts

- TODO

### Cowan, G. (1998). *Statistical Data Analysis*.

- Cowan (1998) and Cowan (2016)

#### My thoughts

- TODO

### James, F. (2006). *Statistical Methods in Experimental Physics*.

- James (2006)

#### My thoughts

- TODO

### Cowan, G. *et al.* (2011). Asymptotic formulae for likelihood-based tests of new physics.

- Cowan et al. (2011)
- Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

#### My thoughts

- TODO

### ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.

- ATLAS Collaboration (2012)
- arXiv:1207.0319

#### My thoughts

- TODO

### Cranmer, K (2015). Practical statistics for the LHC.

- Cranmer (2015)

#### My thoughts

- TODO

### More articles to do

## Links and encyclopedia articles

### SEP

- Abduction
- Analysis of knowledge
- Bayes’ theorem
- Bayesian epistemology
- Carnap, Rudolf (1891-1970)
- Causal models
- Causal processes
- Confirmation
- Dutch book arguments
- Epistemology
- Foundationalist Theories of Epistemic Justification
- Hume, David (1711-1776)
- Identity of indiscernibles
- Induction, The problem of
- Logic and Probability
- Naturalized epistemology
- Peirce, Charles Sanders (1839-1914)
- Popper, Karl (1902-1994)
- Principle of sufficient reason
- Probability, Interpretations of
- Probabilistic causation
- Reichenbach, Hans (1891-1953)
- Scientific explanation
- Scientific research and big data
- Statistics, Philosophy of

### IEP

- Carnap, Rudolf (1891-1970)
- Epistemology
- Hempel, Carl Gustav (1905-1997)
- Hume, David (1711-1776)
- Naturalism
- Naturalistic Epistemology
- Peirce, Charles Sanders (1839-1914)
- Reductionism
- Safety Condition for Knowledge, The
- Simplicity in the philosophy of science
- William of Ockham (1280-1349)

### Scholarpedia

### Wikipedia

- Abductive reasoning
- Akaike information criterion
- Algorithmic information theory
- Algorithmic probability
- Analysis of variance
- Aumann’s agreement theorem
- Bayes, Thomas (1701-1761)
- Bayesian inference
- Bernoulli, Jacob (1655-1705)
- Birnbaum, Allan (1923-1976)
- Bootstrapping
- Carnap, Rudolf (1891-1970)
- Confidence interval
- Cosmic variance
- Cramér-Rao bound
- Cramér, Harald (1893-1985)
- Data science
- Decision theory
- Deductive-nomological model
- Empiricism
- Epistemology
- Exploratory data analysis
- Fisher, Ronald (1890-1962)
- Frequentist inference
- Foundations of statistics
- Gauss, Carl Friedrich (1777-1855)
- German tank problem
- Gosset, William Sealy (1876-1937)
- Graunt, John (1620-1674)
- History of probability
- History of statistics
- Hume, David (1711-1776)
- Induction, The problem of
- Inductive reasoning
- Interval estimation
- Inverse probability
- Inverse problem
- Ivakhnenko, Alexey (1913-2007)
- Jeffrey, Richard (1926-2002)
- Jeffreys, Harold (1891-1989)
- Jeffreys prior
- Kolmogorov, Andrey (1903-1987)
- Kolmogorov complexity
- Lady tasting tea
- Laplace, Pierre-Simon (1749-1827)
- Likelihood principle
- Likelihoodist statistics
- List of important publications in statistics
- Machine learning
- Maximum likelihood estimation
- Mill, John Stuart (1806-1873)
- Misuse of p-values
- Neyman, Jerzy (1894-1981)
- Neyman construction
- Neyman-Pearson lemma
- Ockham, William of (1287-1347)
- Pearson, Egon (1895-1980)
- Pearson, Karl (1857-1936)
- Peirce, Charles Sanders (1839-1914)
- Poisson, Siméon Denis (1781-1840)
- Popper, Karl (1902-1994)
- Principle of sufficient reason
- Precision and recall
- Proteus phenomenon
- P-value
- Rao, C.R. (1920-2023)
- Replication crisis
- Rule of three
- Savage, Leonard Jimmie (1917-1971)
- Solomonoff, Ray (1926-2009)
- Solomonoff’s theory of inductive inference
- Statistical classification
- Statistical hypothesis testing
- Statistical inference
- Statistical sensitivity and specificity
- Statistical significance
- Statistics
- Statistics, Founders of
- Statistics, History of
- Statistics, Mathematical
- Statistics, Outline of
- Statistics, Philosophy of
- Student’s t-test
- Systematic error
- Thorp, Edward O. (b. 1932)
- Trial and error
- Tukey, John (1915-2000)
- Type-I and type-II errors
- Uncomfortable science
- Uniformitarianism
- Unsolved problems in statistics, List of
- Venn, John (1834-1923)
- Wilks, S.S. (1906-1964)
- Wilks’s theorem

### Others

- Deep Learning: Our Miraculous Year 1990-1991 - Schmidhuber
- errorstatistics.com - Deborah Mayo’s blog
- Graunt, John (1620-1674) - statprob.com
- Peng, R. (2016). A Simple Explanation for the Replication Crisis in Science. - simplystatistics.org
- Why is binary classification not a hypothesis test? - stackexchange.com
- If the likelihood principle clashes with frequentist probability then do we discard one of them? - stackexchange.com
- Wilks’s theorem - fiveMinuteStats
- Dallal, G.E. (2012). The Little Handbook of Statistical Practice.

## References

*Statistical Science*, *12*, 162–176.

*Information Geometry and Its Applications*. Springer Japan.

*Wired*. June 23, 2008. https://www.wired.com/2008/06/pb-theory/

*IEEE Signal Processing Magazine*, *34*, 26–38.

*The International Journal of High Performance Computing Applications*, *32*, 435–479.

*Physical Review D*, *86*, 032003. https://arxiv.org/abs/1207.0319

*Learning Theory from First Principles*. (Draft). https://www.di.ens.fr/~fbach/ltfp_book.pdf

*International Conference on Learning Representations, 3rd*, *2015*. https://arxiv.org/abs/1409.0473

*Annual Review of Condensed Matter Physics*, *11*, 501–528.

*Proceedings of the National Academy of Sciences*, *116*, 15849–15854. https://arxiv.org/abs/1812.11118

*Proceedings of the National Academy of Sciences*, *38*, 716–719.

*Foundations and Trends in Machine Learning*, *2*, 1–127. https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf

*PsyArXiv*. July 22, 2017. https://psyarxiv.com/mky9j/

*Annals of Applied Statistics*, *16*, 1–2. https://magazine.amstat.org/blog/2021/08/01/task-force-statement-p-value/

*ECAI2000 Workshop notes on scientific Reasoning in Artificial Intelligence and the Philosophy of Science* (pp. 9–14).

*Statistical Science*, *18*, 1–32.

*The Likelihood Principle* (2nd ed.). Hayward, CA: The Institute of Mathematical Statistics.

*Journal of the American Statistical Association*, *57*, 269–326.

*Pattern Recognition and Machine Learning*. Springer.

*Journal of Machine Learning Research*, *21*, 1–69.

*Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing* (pp. 532–541). https://dl.acm.org/doi/pdf/10.1145/3406325.3451087

*Science*, *347*, 145–149. http://science.sciencemag.org/content/347/6218/145

*Statistical Science*, *16*, 101–133. https://projecteuclid.org/euclid.ss/1009213286

*Equilibrium finding for large adversarial imperfect-information games*. (Ph.D. thesis). http://www.cs.cmu.edu/~noamb/thesis.pdf

*Science*, *359*, 418–424. https://science.sciencemag.org/content/359/6374/418

*Proceedings of the AAAI Conference on Artificial Intelligence*, *33*, 1829–1836. https://arxiv.org/abs/1809.04040

*Science*, *365*, 885–890. https://science.sciencemag.org/content/365/6456/885

*Machine Learning: Science and Technology*, *2*, 015002. https://iopscience.iop.org/article/10.1088/2632-2153/aba6f3

*Geometric Modeling in Probability and Statistics*. Springer Switzerland.

*Philosophy and Phenomenological Research*, *5*, 513–32.

*Journal of Philosophy*, *44*, 141–48.

*Natural Language Engineering*, *25*, 753–767. https://www.cambridge.org/core/journals/natural-language-engineering/article/survey-of-25-years-of-evaluation/E4330FAEB9202EC490218E3220DDA291

*Neural Networks*, *32*, 333–338. https://arxiv.org/abs/1202.2745

*Biometrika*, *26*, 404–413.

*Proceedings of International Conference on Machine Learning*, *2016*, 2990–9. http://proceedings.mlr.press/v48/cohenc16.pdf

*Nuclear Instruments and Methods in Physics Research Section A*, *320*, 331–335. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.193.1581&rep=rep1&type=pdf

*Statistical Data Analysis*. Clarendon Press.

*Chinese Physics C*, *40*, 100001. http://pdg.lbl.gov/2016/reviews/rpp2016-rev-statistics.pdf

*European Physical Journal C*, *71*, 1544. https://arxiv.org/abs/1007.1727

*Principles of Statistical Inference*. Cambridge University Press.

*Skandinavisk Aktuarietidskrift*, *29*, 85–94.

*Progress of Theoretical and Experimental Physics*. 2020, 083C01. (and 2021 update). https://pdg.lbl.gov/2021-rev/2021/reviews/contents_sports.html

*Physical Review D*, *99*, 015014. https://arxiv.org/abs/1806.02350

*International Statistical Review*, *42*, 9–15.

*Computer Age Statistical Inference: Algorithms, evidence, and data science*. Cambridge University Press.

*Journal of the American Mathematical Society*, *29*, 983–1049. https://www.ams.org/journals/jams/2016-29-04/S0894-0347-2016-00852-4/S0894-0347-2016-00852-4.pdf

*Physical Review D*, *57*, 3873. https://arxiv.org/abs/physics/9711021

*Bayesian Analysis*, *1*, 1–40. https://projecteuclid.org/journals/bayesian-analysis/volume-1/issue-1/When-did-Bayesian-inference-become-Bayesian/10.1214/06-BA101.full

*Studies in Linguistic Analysis* (pp. 1–31). Oxford: Blackwell.

*Statistical Science*, *12*, 39–41.

*Biometrika*, *10*, 507–521.

*Metron*, *1*, 1–32.

*The Design of Experiments*. Hafner.

*Journal of the Royal Statistical Society, Series B*, *17*, 69–78.

*Revue de l’Institut International de Statistique*, *11*, 182–205.

*Pattern Recognition*, *15*, 455–469.

*British Journal for the Philosophy of Science*, *66*, 475–503. https://www.journals.uchicago.edu/doi/abs/10.1093/bjps/axt039

*Philosopher’s Imprint*, *16*, 1–22. https://quod.lib.umich.edu/p/phimp/3521354.0016.007/--why-i-am-not-a-likelihoodist

*Journal of the Royal Statistical Society: Series A (Statistics in Society)*, *180*, 967–1033.

*Information Processing Letters*, *63*, 131–136. https://www.wisdom.weizmann.ac.il/~oded/p_ul.html

*Deep Learning*. MIT Press. http://www.deeplearningbook.org

*Annals of Internal Medicine*, *130*, 995–1004. https://courses.botany.wisc.edu/botany_940/06EvidEvol/papers/goodman1.pdf

*Annals of Internal Medicine*, *130*, 1005–1013. https://courses.botany.wisc.edu/botany_940/06EvidEvol/papers/goodman2.pdf

*International Journal of Social Research Methodology*, *19*, 481–490.

*Logic of Statistical Inference*. Cambridge University Press.

*The British Journal for the Philosophy of Science*, *22*, 209–229.

*JAMA*, *249*, 1743–1745.

*The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer.

*Annual Review of Nuclear and Particle Science*, *57*, 145–169. https://www.annualreviews.org/doi/abs/10.1146/annurev.nucl.57.090506.123052

*Pattern Recognition Letters*, *64*, 53–62. https://arxiv.org/abs/1502.02555

*Neural Computation*, *9*, 1735–1780.

*Neural Networks*, *2*, 359–366. https://cognitivemedium.com/magic_paper/assets/Hornik.pdf

*PLOS Medicine*, *2*, 696–701.

*Statistical Methods in Experimental Physics* (2nd ed.). World Scientific.

*Computer Physics Communications*, *10*, 343–367. https://cds.cern.ch/record/310399

*Nuclear Instruments and Methods in Physics Research Section A*, *434*, 435–443. https://arxiv.org/abs/hep-ex/9902006

*Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition* (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/ed3book_jan122022.pdf

*Nature Reviews Physics*, *3*, 422–440. https://doi.org/10.1038/s42254-021-00314-5

*Proceedings of the ECML-PKDD-01 Workshop on Machine Learning as Experimental Philosophy of Science*. Freiburg.

*Advances in Neural Information Processing Systems*, *2012*, 1097–1105. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

*Psychonomic Bulletin & Review*, *25*, 178–206. https://link.springer.com/article/10.3758/s13423-016-1221-4

*Advances in Neural Information Processing Systems*, *22*, 1078–1086.

*Nature*, *521*, 436–44.

*Proceedings of the IEEE*, *86*, 2278–2324. http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf

*Neural Computation*, *1*, 541–551. http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf

*The American Statistician*, *62*, 45–53. http://www.stat.rice.edu/~dobelman/courses/texts/leemis.distributions.2008amstat.pdf

*Statistical Methods for Data Analysis in Particle Physics*. Springer. http://foswiki.oris.mephi.ru/pub/Main/Literature/st_methods_for_data_analysis_in_particle_ph.pdf

*Advances in Neural Information Processing Systems*, *30*. https://proceedings.neurips.cc/paper/2017/file/32cbf687880eb1674a07bf717761dd3a-Paper.pdf

*The Annals of Applied Statistics*, *2*, 887–915. https://projecteuclid.org/journals/annals-of-applied-statistics/volume-2/issue-3/Open-statistical-issues-in-Particle-Physics/10.1214/08-AOAS163.full

*Information Theory, Inference, and Learning Algorithms*. Cambridge University Press.

*Philosophy of Science*, *48*, 269–280.

*Error and the Growth of Experimental Knowledge*. Chicago University Press.

*Statistical Science*, *29*, 227–266.

*Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars*. Cambridge University Press.

*Review of Philosophy and Psychology*, *12*, 101–121. https://link.springer.com/article/10.1007/s13164-020-00501-w

*British Journal for the Philosophy of Science*, *57*, 323–357.

*Philosophy of Statistics* (pp. 153–198). North-Holland.

*Frontiers in Econometrics* (pp. 105–142). New York: Academic Press.

*Perceptrons: An Introduction to Computational Geometry*. MIT Press.

*Readings in Machine Learning* (pp. 184–192). San Mateo, CA, USA. http://www.cs.cmu.edu/afs/cs/usr/mitchell/ftp/pubs/NeedForBias_1980.pdf

*Nature*, *518*, 529–533. http://files.davidqiu.com//research/nature14236.pdf

*Science*, *356*, 508–513. https://arxiv.org/abs/1701.01724

*Machine Learning: A probabilistic perspective*. MIT Press.

*Probabilistic Machine Learning: An introduction*. MIT Press.

*Explaining generalization in deep learning: progress and fundamental limits*. (Ph.D. thesis). https://arxiv.org/abs/2110.08922

*Proceedings of Model AI Assignments*, *11*. http://cs.gettysburg.edu/~tneller/modelai/2013/cfr/cfr.pdf

*Communications on Pure and Applied Mathematics*, *8*, 13–45. https://errorstatistics.files.wordpress.com/2017/04/neyman-1955-the-problem-of-inductive-inference-searchable.pdf

*Synthese*, *36*, 97–131.

*Philosophical Transactions of the Royal Society A*, *231*, 289–337.

*Kendall’s Advanced Theory of Statistics, Vol 2B: Bayesian Inference*. Wiley.

*Statistics Surveys*, *3*, 96–146. https://projecteuclid.org/journals/statistics-surveys/volume-3/issue-none/Causal-inference-in-statistics-An-overview/10.1214/09-SS057.pdf

*The Book of Why: The new science of cause and effect*. Basic Books.

*The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science*, *50*, 157–175.

*Studies in Logic*. Boston: Little, Brown, and Co.

*Elements of Causal Inference*. MIT Press.

*Bulletin of the Calcutta Mathematical Society*, *37*, 81–91.

*Mathematical Proceedings of the Cambridge Philosophical Society*. 43, 280–283. Cambridge University Press.

*Entropy*, *13*, 1076–1136. https://www.mdpi.com/1099-4300/13/6/1076/pdf

*Journal of Physics G: Nuclear and Particle Physics*, *28*, 2693. https://indico.cern.ch/event/398949/attachments/799330/1095613/The_CLs_Technique.pdf

*Neyman*. Springer-Verlag.

*Mathematical Statistics and Data Analysis* (3rd ed.). Thomson.

*The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks*. Cambridge University Press. https://deeplearningtheory.com/PDLT.pdf

*Computation, Causation, and Discovery* (pp. 305–321). AAAI & MIT Press.

*Statistical Evidence: A likelihood paradigm*. CRC Press.

*Nature*, *323*, 533–536.

*The Lady Tasting Tea*. Holt.

*The Foundations of Statistics*. John Wiley & Sons.

*Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press. https://www.cs.huji.ac.il/w~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf

*Nature*, *529*, 484–489.

*Nature*, *550*, 354–359.

*Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics*. Durham, UK: Institute of Particle Physics Phenomenology. https://arxiv.org/abs/hep-ex/0208005v1

*Proceedings of the Conference on Statistical Problems in Particle Physics, Astrophysics, and Cosmology (PhyStat2003)* (pp. 122–129). Stanford Linear Accelerator Center. https://www.slac.stanford.edu/econf/C030908/papers/TUAT004.pdf

*Proceedings of the National Academy of Sciences*, *102*, 18297–18302. https://arxiv.org/abs/q-bio/0511043

*Physics of Plasmas*, *25*, 080901.

*Kendall’s Advanced Theory of Statistics, Vol 2A: Classical Inference and the Linear Model*. Wiley.

*Advances in Neural Information Processing Systems*, *2014*, 3104–3112. https://arxiv.org/abs/1409.3215

*Reinforcement Learning* (2nd ed.). MIT Press.

*The Monist*, *101*, 417–440.

*The Astrophysical Journal*, *480*, 22–35. https://arxiv.org/abs/astro-ph/9603021

*Exploratory Data Analysis*. Pearson.

*Neural Computation*, *6*, 851–876.

*Advances in Neural Information Processing Systems*, *2017*, 5998–6008. https://arxiv.org/abs/1706.03762

*The Logic of Chance*. London: MacMillan and Co. (Originally published in 1866).

*High-Dimensional Probability: An introduction with applications in data science*. Cambridge University Press. https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.pdf

*American Scientist*, *95*, 249–256. https://sites.stat.washington.edu/people/peter/498.Sp16/Equation.pdf

*Bayesian and Frequentist Regression Methods*. Springer.

*Transactions of the American Mathematical Society*, *54*, 426–482. https://www.ams.org/journals/tran/1943-054-03/S0002-9947-1943-0012401-3/S0002-9947-1943-0012401-3.pdf

*All of Statistics: A Concise Course in Statistical Inference*. Springer.

*American Statistician*, *73*, 1–19.

*American Statistician*, *70*, 129–133.

*SSRN*, *3509737*. https://ssrn.com/abstract=3509737

*Odds & Ends: Introducing Probability & Decision with a Visual Emphasis*. https://jonathanweisberg.org/vip/

*The Annals of Mathematical Statistics*, *9*, 60–62. https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-9/issue-1/The-Large-Sample-Distribution-of-the-Likelihood-Ratio-for-Testing/10.1214/aoms/1177732360.full

*Scientific Data Mining and Knowledge Discovery* (pp. 77–89). Springer, Berlin, Heidelberg.

*Neural Computation*, *8*, 1341–1390.

*IEEE Transactions on Evolutionary Computation*, *1*, 67–82.

*Advances in Neural Information Processing Systems*, *20*, 1729–1736.

Edwards (1974), p. 9.↩︎

Hacking (1971).↩︎

Bernoulli, J. (1713). *Ars Conjectandi*, Chapter II, Part IV, defining the art of conjecture [wikiquote].↩︎

Venn (1888).↩︎

Peirce (1883), p. 126–181.↩︎

Pearson (1900).↩︎

Fisher (1912).↩︎

Fisher (1915).↩︎

Fisher (1921).↩︎

Fisher (1955).↩︎

Salsburg (2001).↩︎

Reid (1998).↩︎

Neyman (1955).↩︎

Stuart, Ord, & Arnold (2010).↩︎

James (2006).↩︎

Cowan (1998) and Cowan (2016).↩︎

Cranmer (2015).↩︎

Lista (2016b).↩︎

Lista (2016a).↩︎

Cox (2006).↩︎

Cousins (2018).↩︎

Weisberg (2019).↩︎

Carnap (1947).↩︎

Goodfellow, Bengio, & Courville (2016), p. 72-73.↩︎

Cowan (1998), p. 20-22.↩︎

Fienberg (2006).↩︎

Weisberg (2019), ch. 15.↩︎

Fisher (1921), p. 15.↩︎

van Handel (2016).↩︎

Vershynin (2018).↩︎

Leemis & McQueston (2008).↩︎

Cranmer, K. et al. (2012).↩︎

The assumption that the model describes the data “reasonably” well means that, to the degree required by your analysis, the important features of the data match the model within the systematic uncertainties parametrized in it. If the model is incomplete because it is missing an important feature of the data, then this is the “*ugly*” (class-3) error in the Sinervo classification of systematic uncertainties.↩︎

Cowan (1998) and Cowan (2016), p. TODO.↩︎

Aldrich (1997).↩︎

James (2006), p. 234.↩︎

Cox (2006), p. 11.↩︎

Murphy (2012), p. 222.↩︎

Fréchet (1943), Cramér (1946), Rao (1945), and Rao (1947).↩︎

Rice (2007), p. 300–2.↩︎

Cowan (1998), p. 130-5.↩︎

James (2006), p. 234.↩︎

James & Roos (1975).↩︎

Cowan, Cranmer, Gross, & Vitells (2012).↩︎

Wainer (2007).↩︎

Tegmark, Taylor, & Heavens (1997).↩︎

Clopper & Pearson (1934).↩︎

Hanley & Lippman-Hand (1983).↩︎

L. D. Brown, Cai, & DasGupta (2001).↩︎

Fisher (1935), p. 16.↩︎

Goodman (1999a). p. 998.↩︎

ATLAS and CMS Collaborations (2011).↩︎

Cowan, Cranmer, Gross, & Vitells (2011).↩︎

Neyman & Pearson (1933).↩︎

Feldman & Cousins (1998).↩︎

Sinervo (2002) and Cowan (2012).↩︎

Cowan et al. (2011), p. 2–3.↩︎

Cowan et al. (2011), p. 3.↩︎

Cousins & Highland (1992).↩︎

Junk (1999).↩︎

Read (2002).↩︎

ATLAS Statistics Forum (2011).↩︎

Wilks (1938).↩︎

Wald (1943).↩︎

Cowan et al. (2011).↩︎

Bhattiprolu, Martin, & Wells (2020).↩︎

Murphy (2012), p. 197.↩︎

Goodman (1999b).↩︎

Goodman (1999a). p. 995.↩︎

Sinervo (2003).↩︎

Heinrich & Lyons (2007).↩︎

Caldeira & Nord (2020).↩︎

Lyons (2008), p. 890.↩︎

Pearl (2018).↩︎

Pearl (2009).↩︎

Robins & Wasserman (1999).↩︎

Peters, Janzing, & Scholkopf (2017).↩︎

Tukey (1977).↩︎

Carnap (1945).↩︎

Royall (1997), p. 171–2.↩︎

Cranmer (2015), p. 6.↩︎

Neyman & Pearson (1933).↩︎

Kruschke & Liddell (2018).↩︎

Edwards (1974).↩︎

Birnbaum (1962).↩︎

Hacking (1965).↩︎

Berger & Wolpert (1988).↩︎

O’Hagan (2010), p. 17–18.↩︎

Gandenberger (2015).↩︎

Evans (2013).↩︎

Mayo (2014).↩︎

Mayo (2019).↩︎

Mayo (2019).↩︎

Lyons (2008), p. 891.↩︎

Sznajder (2018).↩︎

Hacking (1965).↩︎

Neyman (1977).↩︎

Zech (1995).↩︎

Royall (1997).↩︎

Berger (2003).↩︎

Mayo (1981).↩︎

Mayo (1996).↩︎

Mayo & Spanos (2006).↩︎

Mayo & Spanos (2011).↩︎

Mayo (2018).↩︎

Gelman & Hennig (2017).↩︎

Murphy (2012), ch. 6.6.↩︎

Murphy (2022), p. 195–198.↩︎

Gandenberger (2016).↩︎

Wakefield (2013), ch. 4.↩︎

Efron & Hastie (2016), p. 30–36.↩︎

Kruschke & Liddell (2018).↩︎

Goodman (1999a). p. 999.↩︎

Ioannidis (2005).↩︎

Wasserstein & Lazar (2016).↩︎

Wasserstein, Allen, & Lazar (2019).↩︎

Benjamin, D.J. et al. (2017).↩︎

Fisher (1935), p. 13–14.↩︎

Mayo (2021).↩︎

Gorard & Gorard (2016).↩︎

Benjamini, Y. et al. (2021), p. 1.↩︎

Hastie, Tibshirani, & Friedman (2009).↩︎

MacKay (2003).↩︎

Murphy (2012).↩︎

Murphy (2022), p. 195–198.↩︎

Shalev-Shwartz & Ben-David (2014).↩︎

Vapnik, Levin, & LeCun (1994).↩︎

Shalev-Shwartz & Ben-David (2014), p. 67–82.↩︎

Murphy (2012), p. 21.↩︎

Note: *Label smoothing* is a regularization technique that smears the activation over other labels, but we don’t do that here.↩︎

“Logit” was coined by Joseph Berkson (1899-1982).↩︎

McFadden & Zarembka (1973).↩︎

Blondel, Martins, & Niculae (2020).↩︎

Goodfellow et al. (2016), p. 129.↩︎

T. Chen & Guestrin (2016).↩︎

Slonim, Atwal, Tkacik, & Bialek (2005).↩︎

Batson, Haaf, Kahn, & Roberts (2021).↩︎

Hennig (2015).↩︎

Bengio (2009).↩︎

LeCun, Bengio, & Hinton (2015).↩︎

Goodfellow et al. (2016).↩︎

Kaplan, J. et al. (2019).↩︎

Rumelhart, Hinton, & Williams (1986).↩︎

Sutton (2019).↩︎

Watson & Floridi (2019).↩︎

Bengio (2009).↩︎

Belkin, Hsu, Ma, & Mandal (2019).↩︎

Nakkiran, P. et al. (2019).↩︎

Dar, Muthukumar, & Baraniuk (2021).↩︎

Balestriero, Pesenti, & LeCun (2021).↩︎

Nagarajan (2021).↩︎

Bubeck & Sellke (2021).↩︎

Bach (2022), p. 225–230.↩︎

Mishra, D. (2020). Weight Decay == L2 Regularization?↩︎

S. Chen, Dobriban, & Lee (2020).↩︎

Kiani, Balestriero, LeCun, & Lloyd (2022).↩︎

Fukushima & Miyake (1982).↩︎

LeCun, Y. et al. (1989).↩︎

LeCun, Bottou, Bengio, & Haffner (1998).↩︎

Ciresan, Meier, Masci, & Schmidhuber (2012).↩︎

Krizhevsky, Sutskever, & Hinton (2012).↩︎

Simonyan & Zisserman (2014).↩︎

He, Zhang, Ren, & Sun (2015).↩︎

Haber & Ruthotto (2017).↩︎

Howard, A.G. et al. (2017).↩︎

R. T. Q. Chen, Rubanova, Bettencourt, & Duvenaud (2018).↩︎

Tan & Le (2019).↩︎

Dosovitskiy, A. et al. (2020).↩︎

Tan & Le (2021).↩︎

H. Liu, Dai, So, & Le (2021).↩︎

Ingrosso & Goldt (2022).↩︎

Park & Kim (2022).↩︎

Firth (1957).↩︎

Nirenburg (1996).↩︎

Hutchins (2000).↩︎

Mikolov, Chen, Corrado, & Dean (2013), Mikolov, Yih, & Zweig (2013), and Mikolov, T. et al. (2013).↩︎

Hochreiter & Schmidhuber (1997).↩︎

Sutskever, Vinyals, & Le (2014).↩︎

Bahdanau, Cho, & Bengio (2015).↩︎

Wu, Y. et al. (2016).↩︎

Stahlberg (2019).↩︎

Church & Hestness (2019).↩︎

Kaplan, J. et al. (2020).↩︎

Vaswani, A. et al. (2017).↩︎

Devlin, Chang, Lee, & Toutanova (2018).↩︎

Lan, Z. et al. (2019).↩︎

Radford, Narasimhan, Salimans, & Sutskever (2018).↩︎

Radford, A. et al. (2019).↩︎

Brown, T.B. et al. (2020).↩︎

Yang, Z. et al. (2019).↩︎

Zaheer, M. et al. (2020).↩︎

Tay, Dehghani, Bahri, & Metzler (2022).↩︎

Jurafsky & Martin (2022).↩︎

Sutton & Barto (2018).↩︎

Arulkumaran, Deisenroth, Brundage, & Bharath (2017).↩︎

Bellman (1952).↩︎

Mnih, V. et al. (2013) and Mnih, V. et al. (2015).↩︎

Silver, D. et al. (2016).↩︎

Silver, D. et al. (2017b).↩︎

Silver, D. et al. (2017a).↩︎

Zinkevich, Johanson, Bowling, & Piccione (2007).↩︎

Lanctot, Waugh, Zinkevich, & Bowling (2009).↩︎

Neller & Lanctot (2013).↩︎

Bowling, Burch, Johanson, & Tammelin (2015).↩︎

Heinrich & Silver (2016).↩︎

Moravcik, M. et al. (2017).↩︎

N. Brown & Sandholm (2018).↩︎

N. Brown & Sandholm (2019a).↩︎

N. Brown, Lerer, Gross, & Sandholm (2019).↩︎

Hart & Mas‐Colell (2000).↩︎

N. Brown & Sandholm (2019b).↩︎

N. Brown, Bakhtin, Lerer, & Gong (2020).↩︎

N. Brown (2020).↩︎

Spears, B.K. et al. (2018).↩︎

Cranmer, Seljak, & Terao (2021).↩︎

Rathmanner & Hutter (2011).↩︎

Wolpert & Macready (1995).↩︎

Wolpert (1996).↩︎

Wolpert & Macready (1997).↩︎

Shalev-Shwartz & Ben-David (2014), p. 60–66.↩︎

McDermott (2019).↩︎

Wolpert (2007).↩︎

Wolpert & Kinney (2020).↩︎

Mitchell (1980).↩︎

Roberts (2021).↩︎

Goldreich & Ron (1997).↩︎

Nakkiran (2021).↩︎

Bousquet, O. et al. (2021).↩︎

Raissi, Perdikaris, & Karniadakis (2017a), p. 2.↩︎

Roberts (2021), p. 7.↩︎

Minsky & Papert (1969).↩︎

Hornik, Stinchcombe, & White (1989).↩︎

Lu, Z. et al. (2017).↩︎

Ismailov (2020).↩︎

Bishop (2006), p. 230.↩︎

Bahri, Y. et al. (2020).↩︎

Halverson, Maiti, & Stoner (2020).↩︎

Canatar, Bordelon, & Pehlevan (2020).↩︎

Roberts, Yaida, & Hanin (2021).↩︎

Cohen & Welling (2016).↩︎

Cohen, Weiler, Kicanaoglu, & Welling (2019).↩︎

Fuchs, Worrall, Fischer, & Welling (2020).↩︎

Smith (2019).↩︎

Nielsen (2018).↩︎

Amari (2016).↩︎

Balasubramanian (1996a).↩︎

Balasubramanian (1996b).↩︎

Calin & Udriste (2014).↩︎

Lei, Luo, Yau, & Gu (2018).↩︎

Gao & Chaudhari (2020).↩︎

Bronstein, Bruna, Cohen, & Velickovic (2021).↩︎

Fefferman, Mitter, & Narayanan (2016).↩︎

Raissi et al. (2017a) and Raissi, Perdikaris, & Karniadakis (2017b).↩︎

Karniadakis, G.E. et al. (2021).↩︎

Howard, Mandt, Whiteson, & Yang (2021).↩︎

Thuerey, N. et al. (2021).↩︎

Cranmer, Brehmer, & Louppe (2019).↩︎

Baydin, A.G. et al. (2019).↩︎

Anderson (2008).↩︎

Asch, M. et al. (2018).↩︎

D’Agnolo & Wulzer (2019).↩︎

Udrescu & Tegmark (2020).↩︎

Cranmer, M. et al. (2020).↩︎

Z. Liu, Madhavan, & Tegmark (2022).↩︎

Asch, M. et al. (2018).↩︎

Korb (2001).↩︎

Williamson (2009).↩︎

Bensusan (2000).↩︎

Perone (2018).↩︎

Wasserman (2003).↩︎

Savage (1954).↩︎