3  Philosophy of statistics

Published

January 27, 2026

Statistical analysis is central to addressing the problem of induction. Can inductive inference be formalized? What are the caveats? Can inductive inference be automated? How does machine learning work?

All knowledge is, in final analysis, history. All sciences are, in the abstract, mathematics. All judgements are, in their rationale, statistics.

– C. R. Rao 1

1 Rao (1997), p. x.

3.1 Introduction to the foundations of statistics

3.1.1 Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. The scientific method uses inductive inferences to make generalizations from finite data. This introduces unique avenues of error not found in purely deductive inferences, as in logic and mathematics. A deductive inference is sound, and its conclusion follows necessarily, if the argument is valid and all of its premises obtain; an inductive inference, by contrast, can at best be probably (not certainly) sound, and can therefore still err in some cases because the support it lends its conclusion is ultimately probabilistic.

A skeptic may further probe if we are even justified in using the probabilities we use in inductive arguments. What is the probability the Sun will rise tomorrow? What kind of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.

See also:

3.1.2 Early investigators

  • Ibn al-Haytham (c. 965-1040)
    • “Ibn al-Haytham was an early proponent of the concept that a hypothesis must be supported by experiments based on confirmable procedures or mathematical evidence—an early pioneer in the scientific method five centuries before Renaissance scientists.” - Wikipedia
  • Gerolamo Cardano (1501-1576)
    • Book on Games of Chance (1564)
  • John Graunt (1620-1674)
  • Jacob Bernoulli (1655-1705)
    • Ars Conjectandi (1713, posthumous)
    • First modern phrasing of the problem of parameter estimation 2
    • See Hacking 3
    • Early vision of decision theory:

2 Edwards (1974), p. 9.

3 Hacking (1971).

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous. 4

4 Bernoulli, J. (1713). Ars Conjectandi, Chapter II, Part IV, defining the art of conjecture [wikiquote].

5 Venn (1888).

6 Jevons (1873a).

7 Jevons (1873b).

3.1.3 Foundations of modern statistics

  • Central limit theorem
  • Charles Sanders Peirce (1839-1914)
    • Formulated modern statistics in “Illustrations of the Logic of Science”, a series published in Popular Science Monthly (1877-1878), and also “A Theory of Probable Inference” in Studies in Logic (1883). 8
    • Introduced blinded, randomized controlled experiments with a repeated-measures design (before Fisher).
  • Karl Pearson (1857-1936)
    • The Grammar of Science (1892)
    • “On the criterion that a given system of deviations…” (1900) 9
      • Proposed testing the validity of hypothesized values by evaluating the \(\chi^2\) distance between the hypothesized and the empirically observed values and quantifying the discrepancy with a \(p\)-value.
    • With Walter Frank Raphael Weldon, he established the journal Biometrika in 1901.
    • Founded the world’s first university statistics department at University College, London in 1911.
  • John Maynard Keynes (1883-1946)
    • Keynes, J. M. (1921). A Treatise on Probability. 10
  • Ronald Fisher (1890-1972)
    • Fisherian significance testing of the null hypothesis (\(p\)-values)
      • “On an absolute criterion for fitting frequency curves” 11
      • “Frequency distribution of the values of the correlation coefficient in samples of indefinitely large population” 12
    • “On the ‘probable error’ of a coefficient of correlation deduced from a small sample” 13
      • Definition of likelihood
      • ANOVA
    • Statistical Methods for Research Workers (1925)
    • The Design of Experiments (1935)
    • “Statistical methods and scientific induction” 14
    • The Lady Tasting Tea 15
  • Jerzy Neyman (1894-1981)
  • Egon Pearson (1895-1980)
    • Neyman-Pearson hypothesis tests and confidence intervals with fixed error probabilities (also \(p\)-values, but considering two hypotheses introduces two types of error)
  • Harold Jeffreys (1891-1989)
    • objective (non-informative) Jeffreys priors
  • Andrey Kolmogorov (1903-1987)
  • C.R. Rao (1920-2023)
  • Ray Solomonoff (1926-2009)
  • Shun’ichi Amari (b. 1936)
  • Judea Pearl (b. 1936)

8 Peirce (1883), p. 126–181.

9 Pearson (1900).

10 Keynes (1921).

11 Fisher (1912).

12 Fisher (1915).

13 Fisher (1921).

14 Fisher (1955).

15 Salsburg (2001).

16 Reid (1998).

17 Neyman (1955).

18 Carnap (1960).

3.1.4 Pedagogy

19 Stuart, Ord, & Arnold (2010).

20 F. James (2006).

21 Cowan (1998) and Cowan (2016).

22 Cranmer (2015).

23 Jaynes (2003).

24 Lista (2016b).

25 Lista (2016a).

26 Cox (2006).

27 Behnke, Kröninger, Schott, & Schörner-Sadenius (2013).

28 Cousins (2018).

29 Weisberg (2019).

30 Gelman & Vehtari (2021).

31 Otsuka (2023).

3.3 Statistical models

3.3.1 Parametric models

46 McCullagh (2002).

3.3.2 Canonical distributions

3.3.2.1 Bernoulli distribution

\[ \mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \]

which can also be written as

\[ \mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\} \]

or

\[ \mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\} \]

  • Binomial distribution
  • Poisson distribution

TODO: explain, another important relationship is

Figure 3.1: Relationships among Bernoulli, binomial, categorical, and multinomial distributions.

3.3.2.2 Normal/Gaussian distribution

\[ N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \]

and in \(k\) dimensions:

\[ N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^\intercal \:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \]

where \(\boldsymbol{\Sigma}\) is the covariance matrix of the distribution (defined in eq. 3.2).
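A minimal numerical check of the two densities above, assuming NumPy and SciPy are available; the particular numbers are arbitrary.

```python
import numpy as np
from scipy import stats

# 1-D Gaussian: closed form vs. scipy.stats.norm
x, mu, sigma2 = 1.3, 0.5, 2.0
pdf_formula = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
assert np.isclose(pdf_formula, stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

# k-dimensional Gaussian: closed form vs. scipy.stats.multivariate_normal
mu_vec = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # covariance matrix
xv = np.array([0.2, 0.8])
d = xv - mu_vec
k = len(mu_vec)
pdf_formula = ((2 * np.pi)**(-k / 2) * np.linalg.det(Sigma)**(-0.5)
               * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))
assert np.isclose(pdf_formula, stats.multivariate_normal(mu_vec, Sigma).pdf(xv))
```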

3.3.3 Central limit theorem

Let \(X_{1}\), \(X_{2}\), … , \(X_{n}\) be a random sample drawn from any distribution with a finite mean \(\mu\) and variance \(\sigma^{2}\). As \(n \rightarrow \infty\), the distribution of the standardized sample mean approaches the standard normal:

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
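A small simulation illustrating the theorem, assuming NumPy: standardized means of exponential samples (a visibly skewed distribution) behave like \(N(0,1)\) draws for large \(n\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0               # mean and standard deviation of Exponential(1)
n, n_experiments = 1000, 100_000

samples = rng.exponential(scale=1.0, size=(n_experiments, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))   # standardized sample means

print(z.mean(), z.std())           # ~0 and ~1
print(np.mean(np.abs(z) < 1.96))   # ~0.95, as for a standard normal
```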

  • \(\chi^2\) distribution
  • Univariate distribution relationships
Figure 3.2: Detail of a figure showing relationships among univariate distributions. See the full figure here 47.

47 Leemis & McQueston (2008).

  • The exponential family of distributions are maximum entropy distributions.

3.3.4 Mixture models

48 Cranmer, K. et al. (2012).

3.4 Point estimation and confidence intervals

3.4.1 Inverse problems

Recall that in the context of parametric models, the data, \(x_i\), are modeled by a pdf, \(f(x_i ; \theta_j)\), with parameters, \(\theta_j\). In a statistical inverse problem, the goal is to infer values of the model parameters, \(\theta_j\), given some finite set of data, \(\{x_i\}\), sampled from a probability density, \(f(x_i; \theta_j)\), that models the data reasonably well 49.

49 This assumption that the model describes the data “reasonably well” means that, to the degree required by your analysis, the important features of the data are matched within the systematic uncertainties parametrized in the model. If the model is incomplete because it is missing an important feature of the data, then this is the “ugly” (class-3) error in the Sinervo classification of systematic uncertainties.

50 Cowan (1998) and Cowan (2016), p. TODO.

3.4.2 Bias and variance

The bias of an estimator, \(\hat\theta\), is defined as

\[ \mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \]

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. 3.1) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

\[ \mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \]

The MSE of an estimator can be related to its bias and its variance by the following proof:

\[ \begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

noting that

\[ \mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2 \]

and

\[ \begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \left(\mathbb{E}(\hat{\theta} - \theta)\right)^2 = \left(\mathbb{E}(\hat{\theta}) - \theta\right)^2 \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

we see that MSE is equivalent to

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \]

For an unbiased estimator, the MSE is the variance of the estimator.
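A numerical sanity check of \(\mathrm{MSE} = \mathrm{Var} + \mathrm{Bias}^2\), assuming NumPy, using a deliberately biased estimator (the sample mean plus a constant offset) of a Gaussian mean; the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 3.0
# estimator: mean of 20 samples, plus a deliberate bias of +0.5
estimates = rng.normal(theta_true, 1.0, size=(200_000, 20)).mean(axis=1) + 0.5

mse = np.mean((estimates - theta_true)**2)
var = estimates.var()
bias = estimates.mean() - theta_true
print(mse, var + bias**2)   # the two agree up to Monte Carlo error
```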

TODO:

  • Note the discussion of the bias-variance tradeoff by Cranmer.
  • Note the new deep learning view. See Deep learning.

See also:

3.4.3 Maximum likelihood estimation

A maximum likelihood estimator (MLE) was first used by Fisher. 51

51 Aldrich (1997).

\[\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is equivalent to maximizing \(L(\theta)\), and the former is more convenient because for data that are independent and identically distributed (i.i.d.) the joint likelihood can be factored into a product of individual measurements:

\[ L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta) \]

and taking the log of the product makes it a sum:

\[ \mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is also equivalent to minimizing \(-\mathrm{log} \: L(\theta)\), the negative log-likelihood (NLL). For distributions that are i.i.d.,

\[ \mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i \]
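A minimal sketch of MLE in practice, assuming NumPy and SciPy: numerically minimize the NLL of i.i.d. Gaussian data over \((\mu, \sigma)\). The synthetic data and starting values are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # observed data

def nll(params):
    mu, log_sigma = params                      # parametrize sigma > 0 via its log
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and (biased) sample std
```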

3.4.3.1 Invariance of likelihoods under reparametrization

  • Likelihoods are invariant under reparametrization. 52
  • Bayesian posteriors are not invariant in general.

52 F. James (2006), p. 234.

See also:

3.4.3.2 Ordinary least squares

  • Least squares from MLE of Gaussian models: \(\chi^2\)
  • Ordinary Least Squares (OLS)
  • Geometric interpretation

53 Cox (2006), p. 11.

54 Murphy (2012), p. 222.

3.4.4 Variance of MLEs

55 Fréchet (1943), Cramér (1946), Rao (1945), and Rao (1947).

56 Rice (2007), p. 300–2.

57 Nielsen (2013).

58 Cowan (1998), p. 130-5.

59 F. James (2006), p. 234.

60 F. James & Roos (1975).

Figure 3.3: Transformation of non-parabolic log-likelihood to parabolic (source: my slides, recreation of F. James (2006), p. 235).

61 Loh (1987).

62 Wainer (2007).

63 Tegmark, Taylor, & Heavens (1997).

3.4.5 Bayesian credibility intervals

3.4.6 Uncertainty on measuring an efficiency

64 Clopper & Pearson (1934).

65 Agresti & Coull (1998).

66 Hanley & Lippman-Hand (1983).

67 L. D. Brown, Cai, & DasGupta (2001).

68 Casadei (2012).

3.4.7 Examples

3.5 Statistical hypothesis testing

3.5.1 Null hypothesis significance testing

  • Karl Pearson observing how rare sequences of roulette spins are
  • Null hypothesis significance testing (NHST)
  • goodness of fit
  • Fisher

Fisher:

[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. 69

69 Fisher (1935), p. 16.

3.5.2 Neyman-Pearson theory

3.5.2.1 Introduction

70 Goodman (1999a). p. 998.

71 J. Cohen (1992).

72 ATLAS and CMS Collaborations (2011).

73 Cowan, Cranmer, Gross, & Vitells (2011).

Figure 3.4: TODO: ROC explainer. (Wikimedia, 2015).

See also:

3.5.2.2 Neyman-Pearson lemma

Neyman-Pearson lemma: 74

74 Neyman & Pearson (1933).

For a fixed signal efficiency, \(1-\alpha\), the selection that corresponds to the lowest possible misidentification probability, \(\beta\), is given by

\[ \frac{L(H_1)}{L(H_0)} > k_{\alpha} \]

where \(k_{\alpha}\) is the cut value required to achieve a type-1 error rate of \(\alpha\).

Neyman-Pearson test statistic:

\[ q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \]

Profile likelihood ratio:

\[ \lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \]

where \(\hat{\mu}\) and \(\hat{\theta}\) are the (unconditional) maximum-likelihood estimators that maximize \(L\), while \(\hat{\theta}_\mu\) is the conditional maximum-likelihood estimator that maximizes \(L\) for a specified signal strength, \(\mu\), and the vector \(\theta\) includes the nuisance parameters and any other parameters of interest.
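A toy illustration of the Neyman-Pearson test statistic, assuming NumPy and SciPy, for a single-bin counting experiment: \(H_0\) is background-only with mean \(b\), \(H_1\) is signal-plus-background with mean \(s+b\). The yields and observed count are hypothetical.

```python
import numpy as np
from scipy.stats import poisson

s, b = 10.0, 50.0      # hypothetical expected signal and background yields
n_obs = 62             # hypothetical observed count

def q_np(n):
    # q_NP = -2 ln [ L(H1) / L(H0) ] for a Poisson likelihood
    return -2 * (poisson.logpmf(n, s + b) - poisson.logpmf(n, b))

print(q_np(n_obs))

# Distribution of q_NP under H0 from toy experiments; the alpha-quantile gives a
# cut value such that P(q_NP < cut | H0) = alpha (a type-1 error rate of alpha).
toys = poisson.rvs(b, size=100_000, random_state=0)
print(np.quantile(q_np(toys), 0.05))
```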

3.5.2.3 Neyman construction

Cranmer: Neyman construction.

Figure 3.5: Neyman construction for a confidence belt for \(\theta\) (source: K. Cranmer, 2020).

TODO: fix

\[ q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \]

3.5.2.4 Flip-flopping

  • Flip-flopping and Feldman-Cousins confidence intervals 75

75 Feldman & Cousins (1998).

3.5.3 p-values and significance

  • \(p\)-values and significance 76
  • Coverage
  • Fisherian vs Neyman-Pearson \(p\)-values

76 Sinervo (2002) and Cowan (2012).

Cowan et al. define a \(p\)-value as

a probability, under assumption of \(H\), of finding data of equal or greater incompatibility with the predictions of \(H\). 77

77 Cowan et al. (2011), p. 2–3.

Also:

It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the \(p\)-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery”. 78

78 Cowan et al. (2011), p. 3.
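Converting between a one-sided \(p\)-value and a Gaussian significance \(Z = \Phi^{-1}(1-p)\), the convention used in Cowan et al. (2011); a minimal sketch assuming SciPy.

```python
from scipy.stats import norm

def p_to_z(p):
    return norm.isf(p)    # inverse survival function, Phi^{-1}(1 - p)

def z_to_p(z):
    return norm.sf(z)     # survival function, 1 - Phi(z)

print(p_to_z(0.05))       # ~1.64
print(z_to_p(5.0))        # ~2.9e-7, the "5 sigma" discovery convention
```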

3.5.3.1 Upper limits

79 Cousins & Highland (1992).

3.5.3.2 CLs method

  • Conservative coverage; used in particle physics
  • Junk 80
  • Read 81
  • ATLAS 82

80 Junk (1999).

81 Read (2002).

82 ATLAS Statistics Forum (2011).

3.5.4 Asymptotics

83 Wilks (1938).

84 Wald (1943).

85 Cowan et al. (2011).

86 Cowan, Cranmer, Gross, & Vitells (2012).

87 Bhattiprolu, Martin, & Wells (2020).

3.5.5 Student’s t-test

  • Student’s t-test
  • ANOVA
  • A/B-testing

3.5.6 Decision theory

88 Murphy (2012), p. 197.

89 Goodman (1999b).

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings. 90

90 Goodman (1999a). p. 995.

See also:

3.5.7 Examples

3.6 Uncertainty quantification

3.6.1 Sinervo classification of systematic uncertainties

  • Class-1, class-2, and class-3 systematic uncertainties (good, bad, ugly), classification by Pekka Sinervo (PhyStat2003) 91
  • Not to be confused with type-1 and type-2 errors in Neyman-Pearson theory
  • Heinrich, J. & Lyons, L. (2007). Systematic errors. 92
  • Caldeira & Nord 93

91 Sinervo (2003).

92 Heinrich & Lyons (2007).

93 Caldeira & Nord (2020).

Lyons:

In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error. 94

94 Lyons (2008), p. 890.

Figure 3.6: Classification of measurement uncertainties (philosophy-in-figures.tumblr.com, 2016).
  • Poincaré’s three levels of ignorance

3.6.2 Profile likelihoods

  • Profiling and the profile likelihood
    • Importance of Wald and Cowan et al.
    • hybrid Bayesian-frequentist method

3.6.3 Examples of poor estimates of systematic uncertainties

Figure 3.7: Demonstration of sensitivity to the jet energy scale for an alleged excess in \(Wjj\) by Tommaso Dorigo (2011) (see also: GIF).
  • OPERA. (2011). Faster-than-light neutrinos.
  • BICEP2 claimed evidence of B-modes in the CMB as evidence of cosmic inflation without accounting for cosmic dust.

3.6.4 Conformal prediction

95 Gammerman, Vovk, & Vapnik (1998).

3.7 Statistical classification

3.7.1 Introduction

  • Precision vs recall
  • Recall is sensitivity
  • Sensitivity vs specificity
  • Accuracy

3.7.2 Examples

  • TODO

See also:

3.8 Causal inference

3.8.1 Introduction

96 Rubin (1974).

97 Lewis (1981).

98 Pearl (2018).

99 Bareinboim, Correa, Ibeling, & Icard (2021).

See also:

3.8.2 Causal models

100 Pearl (2009).

101 Robins & Wasserman (1999).

102 Peters, Janzing, & Scholkopf (2017).

103 Lundberg, Johnson, & Stewart (2021).

3.8.3 Counterfactuals

104 Ismael (2023).

105 Chevalley, Schwab, & Mehrjou (2024).

3.9 Exploratory data analysis

3.9.1 Introduction

106 Tukey (1977).

3.9.2 Look-elsewhere effect

  • Look-elsewhere effect (LEE)
    • AKA File-drawer effect
  • Stopping rules
    • validation dataset
    • statistical issues, violates the likelihood principle

3.9.3 Archiving and data science

107 Chen, X. et al. (2018).

3.10 “Statistics Wars”

3.10.1 Introduction

  • Kruschke
  • Carnap
    • “The two concepts of probability” 108
  • Royall
    • “What do these data say?” 109

108 Carnap (1945).

109 Royall (1997), p. 171–2.

Cranmer:

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability \(P (A, B)\) in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability. 110

110 Cranmer (2015), p. 6.

Neyman:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. 111

111 Neyman & Pearson (1933).

Figure 3.8: From Kruschke. 112

112 Kruschke & Liddell (2018).

3.10.2 Likelihood principle

  • Likelihood principle
  • The likelihood principle is the proposition that, given a statistical model and a data sample, all the evidence relevant to model parameters is contained in the likelihood function.
  • The history of likelihood 113
  • Berger & Wolpert. (1988). The Likelihood Principle. 116

113 Edwards (1974).

114 Birnbaum (1962).

115 Hacking (1965).

116 Berger & Wolpert (1988).

O’Hagan:

The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference. 117

117 O’Hagan (2010), p. 17–18.

118 Gandenberger (2015).

119 Evans (2013).

120 Mayo (2014).

121 Mayo (2019).

122 Dawid (2014).

Mayo:

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true. 123

123 Mayo (2019).

3.10.3 Discussion

Lyons:

Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors. 124

124 Lyons (2008), p. 891.

125 Carnap (1952).

126 Carnap (1960).

127 Sznajder (2018).

128 Hacking (1965).

129 Neyman (1977).

130 Rozeboom (1960).

131 Meehl (1978).

132 Zech (1995).

133 Royall (1997).

134 Berger (2003).

135 Stuart et al. (2010), p. 460.

136 Mayo (1981).

137 Mayo (1996).

138 Mayo & Spanos (2006).

139 Mayo & Spanos (2011).

140 Mayo (2018).

141 Gelman & Hennig (2017).

142 Murphy (2012), ch. 6.6.

143 Murphy (2022), p. 195–198.

144 Gandenberger (2016).

145 Wakefield (2013), ch. 4.

146 Efron & Hastie (2016), p. 30–36.

147 Kruschke & Liddell (2018).

148 Steinhardt (2012).

149 Clayton (2021).

150 Wagenmakers (2021).

Figure 3.9: The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference (gandenberger.org/research/, 2015).

Goodman:

The idea that the \(P\) value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations). 151

151 Goodman (1999a). p. 999.

3.11 Replication crisis

3.11.1 Introduction

  • Ioannidis, J.P. (2005). Why most published research findings are false. 152

152 Ioannidis (2005).

3.11.2 p-value controversy

153 Wasserstein & Lazar (2016).

154 Wasserstein, Allen, & Lazar (2019).

155 Benjamin, D.J. et al. (2017).

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. 156

156 Fisher (1935), p. 13–14.

157 Rao & Lovric (2016).

158 Mayo (2021).

159 Gorard & Gorard (2016).

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application. 160

160 Benjamini, Y. et al. (2021), p. 1.

3.12 Classical machine learning

3.12.1 Introduction

  • Classification vs regression
  • Supervised and unsupervised learning
    • Classification = supervised; clustering = unsupervised
  • Hastie, Tibshirani, & Friedman 161
  • Information Theory, Inference, and Learning 162
  • Murphy, K.P. (2012). Machine Learning: A probabilistic perspective. MIT Press. 163
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An introduction. MIT Press. 164
  • Shalev-Shwarz, S. & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. 165
  • VC-dimension
    • Vapnik (1994) 166
    • Shalev-Shwarz, S. & Ben-David, S. (2014). 167

161 Hastie, Tibshirani, & Friedman (2009).

162 MacKay (2003).

163 Murphy (2012).

164 Murphy (2022), p. 195–198.

165 Shalev-Shwarz & Ben-David (2014).

166 Vapnik, Levin, & LeCun (1994).

167 Shalev-Shwarz & Ben-David (2014), p. 67–82.

3.12.2 History

168 McCarthy, Minsky, Rochester, & Shannon (1955).

169 Solomonoff (2016).

170 Kardum (2020).

171 McCulloch & Pitts (1943).

172 J. A. Anderson & Rosenfeld (1998).

173 Rosenblatt (1961).

174 Minsky & Papert (1969).

175 Cartwright (2001).

See also:

3.12.3 Logistic regression

From a probabilistic point of view, 176 logistic regression can be derived as maximum likelihood estimation of a vector of model parameters, \(\vec{w}\), whose dot product with the input features, \(\vec{x}\), is squashed by a logistic function to yield the probability, \(\mu\), of a Bernoulli random variable, \(y \in \{0, 1\}\).

176 Murphy (2012), p. 21.

\[ p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)} \]

The negative log-likelihood of multiple trials is

\[ \begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \end{align} \]

which is the cross entropy loss. Note that the first term is non-zero only when the true target is \(y_i=1\), and similarly the second term is non-zero only when \(y_i=0\). 177 Therefore, we can reparametrize the target \(y_i\) in favor of \(t_{ki}\) that is one-hot in an index \(k\) over classes.

177 Note: Label smoothing is a regularization technique that smears the activation over other labels, but we don’t do that here.

\[ \mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \]

where

\[ t_{ki} = \begin{cases} 1 & \mathrm{if}\ (k = y_i = 0)\ \mathrm{or}\ (k = y_i = 1) \\ 0 & \mathrm{otherwise} \end{cases} \]

and

\[ \mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases} \]

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, \(k\), only one term for the true class contributes.

\[ \mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{such\ that}\ t_{ki}=1} \tag{3.3}\]

Logistic regression uses the logit function 178, which is the logarithm of the odds—the ratio of the chance of success to failure. Let \(\mu\) be the probability of success in a Bernoulli trial, then the logit function is defined as

178 “Logit” was coined by Joseph Berkson (1899-1982).

\[ \mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \]

Logistic regression assumes that the logit of the probability, \(\mu\), is a linear function of the explanatory variable, \(x\).

\[ \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x \]

where \(\beta_0\) and \(\beta_1\) are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, \(\vec{x}\), where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, \(\beta_0\), in a dot product.

\[ \vec{x} = (1, x_1, x_2, \ldots, x_n)^\intercal \]

\[ \vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^\intercal \]

\[ \log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^\intercal \vec{x} \]

For the moment, let \(z \equiv \vec{w}^\intercal \vec{x}\). Exponentiating and solving for \(\mu\) gives

\[ \mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} } \]

This function is called the logistic or sigmoid function.

\[ \mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \]

Since we inverted the logit function by solving for \(\mu\), the inverse of the logit function is the logistic or sigmoid.

\[ \mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z) \]

And therefore,

\[ \mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^\intercal \vec{x}) \]

Figure 3.10: Logistic regression.
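A minimal NumPy sketch of the derivation above: fit \(\vec{w}\) by batch gradient descent on the Bernoulli NLL (cross-entropy). The data are synthetic and the learning rate and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # prepend 1 for the bias
w_true = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))            # Bernoulli targets

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(3)
eta = 0.1
for _ in range(2000):
    mu = sigmoid(X @ w)
    grad = X.T @ (mu - y) / n   # gradient of the mean NLL w.r.t. w
    w -= eta * grad

print(w, w_true)                # roughly recovers w_true (up to sampling noise)
```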

See also:

3.12.4 Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by \(k\)) as

\[ p(\vec{y} \,|\, \vec{\mu}) = \mathrm{Cat}(\vec{y} \,|\, \vec{\mu}) = \prod_k {\mu_k}^{y_k} \]

which is the categorical or multinoulli distribution. The negative-log likelihood of multiple independent trials is

\[ \mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \]

Noting again that \(y_{ki} = 1\) only when \(k\) is the true class, and is 0 otherwise, this simplifies to eq. 3.3.
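A small NumPy sketch of the categorical case: a softmax over logits and the cross-entropy loss, which keeps only the log-probability of the true class for each example. The logits and labels are arbitrary.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])   # two examples, three classes
y = np.array([0, 2])                    # true class indices

mu = softmax(logits)
cel = -np.log(mu[np.arange(len(y)), y]).sum()   # eq. 3.3: only the true class contributes
print(mu, cel)
```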

See also:

179 McFadden & Zarembka (1973).

180 Blondel, Martins, & Niculae (2020).

181 Goodfellow et al. (2016), p. 129.

3.12.5 Decision trees

182 Freund & Schapire (1997).

183 T. Chen & Guestrin (2016).

184 Aytekin (2022).

185 Grinsztajn, Oyallon, & Varoquaux (2022).

186 Coadou (2022).

3.12.6 Clustering

187 Hartigan (1985).

188 Slonim, Atwal, Tkacik, & Bialek (2005).

189 Batson, Haaf, Kahn, & Roberts (2021).

190 Hennig (2015).

191 Lauc (2020), p. 103–4.

192 Ronen, Finder, & Freifeld (2022).

193 Fang, Z. et al. (2022).

See also:

3.13 Deep learning

3.13.1 Introduction

194 Gardner & Dorling (1998).

195 Schmidhuber (2024).

196 Bengio (2009).

197 LeCun, Bengio, & Hinton (2015).

198 Sutskever (2015).

199 Goodfellow et al. (2016).

200 Kaplan, J. et al. (2019).

201 Rumelhart, Hinton, & Williams (1986).

202 Amari (1993).

203 LeCun & Bottou (1998).

204 Scardapane (2024).

205 Bottou (1998).

206 Norvig (2011).

207 Sutton (2019).

208 Frankle & Carbin (2018).

Figure 3.11: Raw input image is transformed into gradually higher levels of representation. 209

209 Bengio (2009).

3.13.2 Gradient descent

\[ \hat{f} = \underset{f \in \mathcal{F}}{\mathrm{argmin}} \underset{x \sim \mathcal{X}}{\mathbb{E}} L(f, x) \]

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

\[ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] \]

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), \(\vec{x}_{bt}\), and take a gradient descent step with each minibatch, averaging the gradient over the \(m\) samples \(\vec{x}_{i}\) in the minibatch (a minimal code sketch follows the symbol list below):

\[ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{i=1}^{m} \frac{\partial L(\vec{x}_{i})}{\partial \vec{w}}[t] \]

where

  • \(t \in \mathbf{N}\) is the learning step number
  • \(\eta\) is the learning rate
  • \(m\) is the number of samples in a minibatch, called the batch size
  • \(L\) is the loss function
  • \(\frac{\partial L}{\partial \vec{w}}\) is the gradient
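A minimal sketch of the SGD update, assuming NumPy, for a linear model with a squared-error loss; minibatches are drawn by shuffling the training set each epoch. The data, batch size, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)                  # model parameters
eta, m, n_epochs = 0.05, 32, 5   # learning rate, batch size, number of epochs

for _ in range(n_epochs):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), m):
        batch = idx[start:start + m]
        residual = X[batch] @ w - y[batch]
        grad = X[batch].T @ residual / len(batch)   # gradient of the mean 0.5*(x.w - y)^2 loss
        w -= eta * grad                             # one SGD step per minibatch

print(w, w_true)   # w approaches w_true
```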

3.13.3 Deep double descent

Papers:

210 Belkin, Hsu, Ma, & Mandal (2019).

211 Muthukumar, Vodrahalli, Subramanian, & Sahai (2019).

212 Nakkiran, P. et al. (2019).

213 Chang, Li, Oymak, & Thrampoulidis (2020).

214 Holzmüller (2020).

215 Dar, Muthukumar, & Baraniuk (2021).

216 Balestriero, Pesenti, & LeCun (2021).

217 Belkin (2021).

218 Nagarajan (2021).

219 Bach (2022), p. 225–230.

220 Ghosh & Belkin (2022).

221 Singh, Lucchi, Hofmann, & Schölkopf (2022).

222 Hastie, Montanari, Rosset, & Tibshirani (2022).

223 Bubeck & Sellke (2023).

224 Gamba, Englesson, Björkman, & Azizpour (2022).

225 Schaeffer, R. et al. (2023).

226 Yang & Suzuki (2023).

227 Maddox, Benton, & Wilson (2023).

Blogs:

228 Steinhardt (2022).

229 Henighan, T. et al. (2023).

Twitter threads:

3.13.4 Regularization

Regularization = any change we make to the training algorithm in order to reduce the generalization error but not the training error. 230

230 Mishra, D. (2020). Weight Decay == L2 Regularization?

Most common regularizations:

  • L2 Regularization
  • L1 Regularization
  • Data Augmentation
  • Dropout
  • Early Stopping

Papers:

231 S. Chen, Dobriban, & Lee (2020).

3.13.5 Batch size vs learning rate

Papers:

  1. Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

  2. Hoffer, E. et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

    • \(\eta \propto \sqrt{m}\)
  3. Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

    • \(\eta \propto m\)
  4. You, Y. et al. (2017). Large batch training of convolutional networks.

    • Layer-wise Adaptive Rate Scaling (LARS)
  5. You, Y. et al. (2017). ImageNet training in minutes.

    • Layer-wise Adaptive Rate Scaling (LARS)
  6. Jastrzebski, S. (2018). Three factors influencing minima in SGD.

    • \(\eta \propto m\)
  7. Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

  8. Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

    • \(m \propto \eta\)
  9. Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when \(m \ll M\), the covariance matrix of the weight update \(\mathrm{Cov(\eta \Delta\theta)}\) scales linearly with the quantity \(\eta^2/m\).

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update \(\eta \Delta\theta\). Conversely, to keep the scaling of the covariance of the weight update vector \(\eta \Delta\theta\) constant would require scaling \(\eta\) with the square root of the batch size \(m\) (Krizhevsky, 2014; Hoffer et al., 2017).

  10. Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.
    • Post-local SGD.

  11. Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as \(\eta \propto \sqrt{m}\) attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

  12. McCandlish, S. et al. (2018). An empirical model of large-batch training.
    • Critical batch size
  13. Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling (\(b\)-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

  14. Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.
    • \(\eta \propto m\)

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

  15. You, Y. et al. (2019). Large-batch training for LSTM and beyond.
    • Warmup and use \(\eta \propto m\)

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

  16. You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.
    • LARS and LAMB
  17. Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

  18. Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

  19. Kaplan, J. et al. (2020). Scaling laws for neural language models.

  20. Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.

Blogs:

3.13.6 Normalization

232 Chiley, V. et al. (2019).

233 Kiani, Balestriero, Lecun, & Lloyd (2022).

234 Huang, L. et al. (2020).

3.13.7 Finetuning

235 Hu, E.J. et al. (2021).

236 Dettmers, Pagnoni, Holtzman, & Zettlemoyer (2023).

237 Zhao, J. et al. (2024).

238 Huh, M. et al. (2024).

3.13.8 Computer vision

239 Fukushima & Miyake (1982).

240 LeCun, Y. et al. (1989).

241 LeCun, Bottou, Bengio, & Haffner (1998).

242 Ciresan, Meier, Masci, & Schmidhuber (2012).

243 Krizhevsky, Sutskever, & Hinton (2012).

244 Simonyan & Zisserman (2014).

245 He, Zhang, Ren, & Sun (2015).

246 Haber & Ruthotto (2017) and Haber, E. et al. (2018).

247 Howard, A.G. et al. (2017).

248 R. T. Q. Chen, Rubanova, Bettencourt, & Duvenaud (2018).

249 Tan & Le (2019).

250 Dosovitskiy, A. et al. (2020).

251 Tan & Le (2021).

252 H. Liu, Dai, So, & Le (2021).

253 Dhariwal & Nichol (2021).

254 Liu, Y. et al. (2021).

255 Ingrosso & Goldt (2022).

256 Park & Kim (2022).

257 Zhao, Y. et al. (2023).

258 Nakkiran, Bradley, Zhou, & Advani (2024).

Resources:

3.13.9 Natural language processing

3.13.9.1 Introduction

259 Firth (1957).

260 Nirenburg (1996).

261 Hutchins (2000).

262 Jurafsky & Martin (2022).

263 Z. Liu, Lin, & Sun (2023).

3.13.9.2 word2vec

264 Mikolov, Chen, Corrado, & Dean (2013), Mikolov, Yih, & Zweig (2013), and Mikolov, T. et al. (2013).

265 Kun (2018), p. 176–8.

3.13.9.3 RNNs

266 Hochreiter & Schmidhuber (1997).

267 Graves (2013).

Chain rule of language modeling (chain rule of probability):

\[ P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

or for the whole sequence:

\[ P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

\[ = P(x_1) \: P(x_2 | x_1) \: P(x_3 | x_1 x_2) \: P(x_4 | x_1 x_2 x_3) \ldots \]

A language model (LM) predicts the next token given previous context. The output of the model is a vector of logits, which is given to a softmax to convert to probabilities for the next token.

\[ P(x_t | x_1 \ldots x_{t-1}) = \mathrm{softmax}\left( \mathrm{model}(x_1 \ldots x_{t-1}) \right) \]

Auto-regressive inference follows this chain rule. If done with greedy search:

\[ \hat{x}_t = \underset{x_t \in V}{\mathrm{argmax}} \: P(x_t | x_1 \ldots x_{t-1}) \]
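A minimal sketch of greedy autoregressive decoding, assuming NumPy. The `model` here is a hypothetical stand-in returning next-token logits; any trained LM exposing this interface would do, and the toy model and vocabulary size below are for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(model, context, max_new_tokens, eos_id=None):
    tokens = list(context)
    for _ in range(max_new_tokens):
        probs = softmax(model(tokens))       # P(x_t | x_1 ... x_{t-1})
        next_token = int(np.argmax(probs))   # greedy search: take the argmax token
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens

# Toy "model": always favors token (last_token + 1) mod V.
V = 10
toy_model = lambda toks: np.eye(V)[(toks[-1] + 1) % V] * 5.0
print(greedy_decode(toy_model, context=[0], max_new_tokens=5))   # [0, 1, 2, 3, 4, 5]
```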

Beam search:

268 Sutskever, Vinyals, & Le (2014), p. 4.

269 Zhang (1998).

270 Zhou & Hansen (2005).

271 Collobert, Hannun, & Synnaeve (2019).

Backpropagation through time (BPTT):

272 Werbos (1990).

Neural Machine Translation (NMT):

  • Sutskever seq2seq 273
  • Bahdanau attention 274 and GNMT 275
  • Review by Stahlberg 276

273 Sutskever et al. (2014).

274 Bahdanau, Cho, & Bengio (2015).

275 Wu, Y. et al. (2016).

276 Stahlberg (2019).

3.13.9.4 Transformers

Figure 3.12: Diagram of the Transformer model (source: d2l.ai).

\[ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\, K^\intercal}{\sqrt{d_k}}\right) V \]
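A direct NumPy transcription of the scaled dot-product attention formula above (single head, no masking), as a minimal sketch; the shapes are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ V  # weighted sum of value vectors

rng = np.random.default_rng(5)
Q = rng.normal(size=(4, 8))    # 4 query vectors of dimension d_k = 8
K = rng.normal(size=(6, 8))    # 6 key vectors
V = rng.normal(size=(6, 16))   # 6 value vectors of dimension 16
print(attention(Q, K, V).shape)   # (4, 16)
```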

3.13.9.4.1 Attention and Transformers

277 Vaswani, A. et al. (2017).

278 Devlin, Chang, Lee, & Toutanova (2018).

279 Liu, Y. et al. (2019).

280 Raffel, C. et al. (2019).

281 Lan, Z. et al. (2019).

282 Lewis, M. et al. (2019).

283 Radford, Narasimhan, Salimans, & Sutskever (2018).

284 Radford, A. et al. (2019).

285 Brown, T.B. et al. (2020).

286 Yang, Z. et al. (2019).

287 Zaheer, M. et al. (2020).

288 Edelman, Goel, Kakade, & Zhang (2021).

289 Tay, Dehghani, Bahri, & Metzler (2022).

290 Phuong & Hutter (2022).

291 Chowdhery, A. et al. (2022).

292 Ouyang, L. et al. (2022).

293 Wolfram (2023).

294 OpenAI (2023).

295 Mohamadi, S. et al. (2023).

296 Zhao, W.X. et al. (2023).

297 Golovneva, Wang, Weston, & Sukhbaatar (2024).

Figure 3.13: Diagram of the BERT model (source: peltarion.com).
3.13.9.4.2 Computational complexity of transformers
3.13.9.4.3 Efficient transformers

298 Dao, T. et al. (2022).

3.13.9.4.4 What comes after Transformers?

299 Gu, Goel, & Ré (2021).

300 Merrill & Sabharwal (2022).

301 Bulatov, Kuratov, & Burtsev (2022).

302 Bulatov, Kuratov, & Burtsev (2023).

303 Bertsch, Alon, Neubig, & Gormley (2023).

304 Mialon, G. et al. (2023).

305 Peng, B. et al. (2023).

306 Sun, Y. et al. (2023).

307 Gu & Dao (2023).

308 Wang, H. et al. (2023).

309 Ma, S. et al. (2024).

310 Ma, X. et al. (2024).

311 Bhargava, Witkowski, Shah, & Thomson (2023).

312 Dao & Gu (2024).

313 Banerjee, Agarwal, & Singla (2024).

314 Jingyu Liu et al (2025).

3.13.9.5 Evaluation methods

3.13.9.6 Scaling laws in NLP

315 Hestness, J. et al. (2017).

316 Church & Hestness (2019).

317 Kaplan, J. et al. (2020).

318 Rae, J.W. et al. (2022).

319 Hoffmann, J. et al. (2022).

320 Caballero, Gupta, Rish, & Krueger (2022).

321 Muennighoff, N. et al. (2023).

322 Pandey (2024).

323 Bach (2024).

324 Finzi, M. et al. (2025).

3.13.9.7 Language understanding

325 Mahowald, K. et al. (2023).

326 Kosinski (2023).

See also:

3.13.9.8 Interpretability

327 Watson & Floridi (2019).

328 Gurnee, W. et al. (2023).

329 Meng, Bau, Andonian, & Belinkov (2023).

330 McDougall, C. et al. (2023).

Linear probes:

331 Alain & Bengio (2016).

332 Belinkov (2022).

333 Gurnee & Tegmark (2023).

3.13.10 Reinforcement learning

Pedagogy:

334 Sutton & Barto (2018).

335 Arulkumaran, Deisenroth, Brundage, & Bharath (2017).

336 Cesa-Bianchi & Lugosi (2006).

Tutorials:

More:

337 Xu, Hasselt, & Silver (2018).

338 Silver, Singh, Precup, & Sutton (2024).

339 Javed & Sutton (2024).

See also:

3.13.10.1 Q-learning

  • Q-learning and DQN
  • Uses the Markov Decision Process (MDP) framework
  • The Bellman equation 340
  • Q-learning is a value-based learning algorithm. Value-based algorithms update the value function based on an equation (particularly the Bellman equation), whereas the other type, policy-based algorithms, estimate the value function with a greedy policy obtained from the last policy improvement (source: towardsdatascience.com).
  • DQN masters Atari 341

340 Bellman (1952).

341 Mnih, V. et al. (2013) and Mnih, V. et al. (2015).

3.13.10.2 AlphaZero

342 Silver, D. et al. (2016).

343 Silver, D. et al. (2017b).

344 Silver, D. et al. (2017a).

3.13.10.3 Regret minimization

Regret matching (RM)

345 Hart & Mas‐Colell (2000).

Consider a game like rock-paper-scissors, where there is only one action per round. Let \(v^{t}(a)\) be the value observed when playing action \(a\) on iteration \(t\).

TODO: explain that the entire rewards vector, \(v^{t}(a)\), over \(a\) is observable after the chosen action is played.

Let a strategy, \(\sigma^t\), be a probability distribution over actions, \(a \in A\). Then the value of a strategy, \(v^{t}(\sigma^{t})\), is the expectation of its value over actions.

\[ v^{t}(\sigma^{t}) = \sum_{a \in A} \sigma^{t}(a) \: v^{t}(a) \]

Regret, \(R^{T}\), measures how much better some sequence of strategies, \(\sigma'\), would do compared to the chosen sequence of strategies, \(\sigma = \{\sigma^1, \sigma^2, \ldots \sigma^T\}\).

\[ R^{T} \equiv \sum_{t=1}^{T} \left( v^{t}({\sigma'}^{t}) - v^{t}(\sigma^{t}) \right) \]

External regret, \(R^{T}(a)\), measures the regret of the chosen sequence of strategies versus a hypothetical strategy where action \(a\) is always chosen.

\[ R^{T}(a) \equiv \sum_{t=1}^{T} \left( v^{t}(a) - v^{t}(\sigma^{t}) \right) \]

Regret Matching (RM) is a rule to determine the strategy for the next iteration:

\[ \sigma^{t+1}(a) \equiv \frac{ R^{t}_{+}(a) }{ \sum_{b \in A} R^{t}_{+}(b) } \]

where \(R_{+} \equiv \mathrm{max}(R, 0)\).

At the end of training, the recommended strategy (the one with convergence guarantees) is not the final strategy used in training, \(\sigma^{T}\), but the average strategy over all time steps:

\[ \bar{\sigma}^{T}(a) = \frac{1}{T} \sum_{t=1}^{T} \sigma^{t}(a) \]
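A minimal self-play sketch of regret matching for rock-paper-scissors, assuming NumPy; the iteration count is arbitrary. Each player updates external regrets from the observed reward vector and plays the regret-matching strategy; the average strategies approach the Nash equilibrium \((1/3, 1/3, 1/3)\).

```python
import numpy as np

# payoff[a, b] = utility to player 1 when player 1 plays a and player 2 plays b
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def strategy_from_regret(regret):
    pos = np.maximum(regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.ones(3) / 3

regret = [np.zeros(3), np.zeros(3)]
strategy_sum = [np.zeros(3), np.zeros(3)]
rng = np.random.default_rng(6)

for t in range(20_000):
    sigma = [strategy_from_regret(r) for r in regret]
    actions = [rng.choice(3, p=s) for s in sigma]
    for i in range(2):
        strategy_sum[i] += sigma[i]
        # reward vector v^t(a) over own actions, against the opponent's chosen action
        v = payoff[:, actions[1]] if i == 0 else -payoff[actions[0], :]
        regret[i] += v - sigma[i] @ v        # external regret update, R(a) += v(a) - v(sigma)

print([s / s.sum() for s in strategy_sum])   # both near [1/3, 1/3, 1/3]
```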

  • TODO: explain the convergence of \(\bar{\sigma}^{t}\) to an \(\varepsilon\)-Nash equilibrium.
  • Roughgarden, T. (2016). Twenty Lectures on Algorithmic Game Theory. 346

346 Roughgarden (2016).

Counterfactual regret minimization (CFR)

347 Zinkevich, Johanson, Bowling, & Piccione (2007).

348 N. Brown (2020), p. 12.

349 Zinkevich et al. (2007), p. 4.

350 Tammelin (2014).

351 Tammelin, Burch, Johanson, & Bowling (2015).

352 Burch, Moravcik, & Schmid (2019).

353 N. Brown & Sandholm (2019a).

TODO: explain extensive-form games.

A finite extensive game with imperfect information has the following components: 354

354 Zinkevich et al. (2007) and Lanctot, Waugh, Zinkevich, & Bowling (2009).

  • A finite set \(N\) of players. A finite set \(H\) of sequences, the possible histories of actions, such that the empty sequence is in \(H\) and every prefix of a sequence in \(H\) is also in \(H\). Define \(h \sqsubseteq h'\) to mean \(h\) is a prefix of \(h'\). \(Z \subseteq H\) are the terminal histories (those which are not a prefix of any other sequences). \(A(h) = \{a : ha \in H\}\) are the actions available after a non-terminal history, \(h \in H \backslash Z\).
  • A function \(P\) that assigns to each non-terminal history a member of \(N \cup \{c\}\). \(P\) is the player function. \(P(h)\) is the player who takes an action after the history \(h\). If \(P(h) = c\) then chance determines the action taken after history \(h\).
  • For each player \(i \in N \cup \{c\}\) a partition \(\mathcal{I}_i\) of \(\{h \in H : P (h) = i\}\) with the property that \(A(h) = A(h')\) whenever \(h\) and \(h'\) are in the same member of the partition. For \(I \in \mathcal{I}_i\) we denote by \(A(I_i)\) the set \(A(h)\) and by \(P(I_i)\) the player \(P(h)\) for any \(h \in I_i\) . \(\mathcal{I}_i\) is the information partition of player \(i\); a set \(I_i \in \mathcal{I}_i\) is an information set of player \(i\).
  • A function \(f_c\) that associates with every information set \(I\) where \(P(I) = c\), a probability measure \(f_c(a|I)\) on \(A(h)\); \(f_c(a|I)\) is the probability that \(a\) occurs given some \(h \in I\), where each such probability measure is independent of every other such measure.
  • For each player \(i \in N\) there is a utility function \(u_i\) from the terminal states \(Z\) to the reals \(\mathbb{R}\). If \(N = \{1, 2\}\) and \(u_1 = -u_2\), it is a zero-sum extensive game. Define \(\Delta_{u,i} \equiv \mathrm{max}_z \: u_i(z) - \mathrm{min}_z \: u_i(z)\) to be the range of utilities to player \(i\).

The player reach, \(\pi^{\sigma}_{i}(h)\), of a history \(h\) is the product of the probabilities for all agent \(i\) actions leading to \(h\). Formally, 355

355 N. Brown (2020), p. 6.

\[ \pi^{\sigma}_{i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') = i} \sigma_{i}(h', a') \]

Due to perfect recall, any two histories in infoset \(I_i\) have the same player reach for player \(i\). Thus, we similarly define the player reach \(\pi^{\sigma}_{i}(I_i)\) of infoset \(I_i\) as

\[ \pi^{\sigma}_{i}(I_i) \equiv \prod_{ {I'}_{i} \cdot a' \sqsubseteq I_i | P(I_i) = i } \sigma_{i}({I'}_{i}, a') = \left.\pi^{\sigma}_{i}(h)\right|_{h \in I_i} \]

The external reach AKA opponent reach, \(\pi^{\sigma}_{-i}(h)\), of a history \(h\) is the contribution of chance and all other players than \(i\). Formally,

\[ \pi^{\sigma}_{-i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') \neq i} \sigma_{P(h')}(h', a') \]

We also define the external reach of an infoset as

\[ \pi^{\sigma}_{-i}(I_i) \equiv \sum_{h \in I_{i}} \pi^{\sigma}_{-i}(h) \]

The counterfactual value of an infoset \(I\) is the expected utility to player \(i\) given that \(I\) has been reached, weighted by the external reach of \(I\) for player \(i\). Formally, 356

356 N. Brown (2020), p. 12.

\[ v(I) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

The counterfactual value of an action, \(a\), is

\[ v(I, a) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h \cdot a, z) \: u_{i}(z) \]

Let’s consider the case where, like in NLHE, our two private hole cards each make a single unique history \(h\), and we form infosets with a single hand, so \(I=h\). Then

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

making explicit the player reach and the external reach,

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi_{i}^{\sigma}(h, z) \: \pi_{-i}^{\sigma}(h, z) \: u_{i}(z) \]

At a leaf node where we finally calculate the rewards,

\[ v(z) = \pi^{\sigma}_{-i}(z) \: u_{i}(z) \]

TODO: explain CFR.

The instantaneous regret is

\[ r^{t}(I, a) = v^{\sigma^t}(I, a) - v^{\sigma^t}(I) \]

The (cumulative) counterfactual regret is

\[ R^{t}(I, a) = \sum_{t=1}^{T} r^{t}(I, a) \]

Similar to the regret-matching rule for the single-node game discussed above, applying regret matching during training means updating strategies according to the following rule.

\[ \sigma^{t+1}(I, a) \equiv \frac{ R^{t}_{+}(I, a) }{ \sum_{b \in A} R^{t}_{+}(I, b) } \]

The average strategy is

\[ \bar{\sigma}^{T}(I, a) = \frac{ \sum_{t=1}^{T} \pi^{t}_{i}(I) \: \sigma^{t}(I, a) }{ \sum_{t=1}^{T} \pi^{t}_{i}(I) } \]

Monte Carlo Counterfactual Regret Minimization (MCCFR)

357 Lanctot et al. (2009).

358 Neller & Lanctot (2013).

359 Burch, Lanctot, Szafron, & Gibson (2012).

360 Johanson, M. et al. (2012).

361 Schmid, M. et al. (2019).

362 Li, H. et al. (2020).

363 Habara, Fukuda, & Yamashita (2023).

364 Lanctot (2013).

365 Gibson (2014).

366 Johanson (2016).

367 Burch (2018).

368 Horacek (2022).

TODO: explain MCCFR.

External sampling MCCFR:

\[ \tilde{v}^{\sigma}_{i}(I) = \sum_{z \in Q} u_{i}(z) \: \pi^{\sigma}_{i}(z[I] \rightarrow z) \]

Best response and exploitability

Best response:

\[ \mathrm{BR}(\sigma_{-i}) = \underset{\sigma_{i}^{\prime}}{\mathrm{argmax}} \: u_{i}(\sigma_{i}^{\prime}, \sigma_{-i}) \]

TODO: Local Best Response (LBR). 369

369 Lisy & Bowling (2016), p. 2.

Exploitability:

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \mathrm{BR}(\sigma_{i})) \]

NashConv 370 exploitability uses the convention:

370 See NashConv exploitability defined in Lanctot, M. et al. (2017).

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \sigma_{-i}) \]

The average exploitability per player is

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} \varepsilon_{i}(\sigma) \]

Note that in zero-sum games, when summing over players, the second terms in NashConv sum to zero. 371

371 Timbers (2020), p. 3.

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) \]

In two-player games:

\[ \varepsilon(\sigma) = \frac{1}{2} \Big( u_{1}(\mathrm{BR}(\sigma_{2}), \sigma_{2}) + u_{2}(\sigma_{1}, \mathrm{BR}(\sigma_{1})) \Big) \]

372 Johanson, Waugh, Bowling, & Zinkevich (2011).

373 Ponsen, De Jong, & Lanctot (2011).

374 Lisy & Bowling (2016).

375 Timbers (2020).

3.13.10.4 Solving poker

376 Kuhn (1950).

377 Southey, F. et al. (2012).

378 Billings, Davidson, Schaeffer, & Szafron (2002).

379 Billings, D. et al. (2003).

380 Johanson (2013).

381 Bowling, Burch, Johanson, & Tammelin (2015).

382 Heinrich & Silver (2016).

383 Moravcik, M. et al. (2017).

384 N. Brown & Sandholm (2017).

385 N. Brown & Sandholm (2018).

386 N. Brown & Sandholm (2019a).

387 N. Brown, Lerer, Gross, & Sandholm (2019).

388 N. Brown & Sandholm (2019b).

389 N. Brown, Bakhtin, Lerer, & Gong (2020).

390 N. Brown (2020).

391 Schmid, M. et al. (2021).

392 Kovarik, V. et al. (2022).

3.13.11 Applications in physics

393 Denby (1988).

394 Denby (1993).

395 Spears, B.K. et al. (2018).

396 Cranmer, Seljak, & Terao (2021).

397 Liu, Z. et al. (2024).

398 Jiao, L. et al. (2024).

See also:

3.14 Theoretical machine learning

3.14.1 Algorithmic information theory

399 Cilibrasi & Vitanyi (2005).

400 Hutter (2007).

401 Rathmanner & Hutter (2011).

3.14.2 No free lunch theorems

402 Wolpert & Macready (1995).

403 Wolpert (1996).

404 Wolpert & Macready (1997).

405 Shalev-Shwarz & Ben-David (2014), p. 60–66.

406 McDermott (2019).

407 Wolpert (2007).

408 Wolpert & Kinney (2020).

409 Wolpert (2023).

410 Mitchell (1980).

411 Roberts (2021).

412 Goldreich & Ron (1997).

413 Joyce & Herrmann (2017).

414 H. W. Lin, Tegmark, & Rolnick (2017).

415 Lauc (2020).

416 Nakkiran (2021).

417 Bousquet, O. et al. (2021).

418 Andrews (2023).

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available. 419

419 Raissi, Perdikaris, & Karniadakis (2017a), p. 2.

Roberts:

From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon. 420

420 Roberts (2021), p. 7.

  • TODO: Note Dennett’s discussion of compression. 421

421 Dennett (1991), p. TODO.

3.14.3 Connectivists vs symbolicists

3.14.4 Graphical tensor notation

3.14.5 Universal approximation theorem

422 Minsky & Papert (1969), p. TODO.

423 Hornik, Stinchcombe, & White (1989).

424 Lu, Z. et al. (2017).

425 H. Lin & Jegelka (2018).

426 Ismailov (2020).

427 Bishop (2006), p. 230.

3.14.6 Relationship to statistical mechanics

428 Opper & Kinzel (1996).

429 Opper (2001).

430 Bahri, Y. et al. (2020).

431 Halverson, Maiti, & Stoner (2020).

432 Canatar, Bordelon, & Pehlevan (2020).

433 Roberts, Yaida, & Hanin (2021).

434 Cantwell (2022).

435 Dinan, Yaida, & Zhang (2023).

436 Sohl-Dickstein (2020).

437 Aifer, M. et al. (2023).

438 Geshkovski, Letrouit, Polyanskiy, & Rigollet (2023).

3.14.7 Relationship to gauge theory

Invariant:

\[ f(g x) = f(x) \]

Equivariant:

\[ f(g x) = g' f(x) \]

Same-equivariant is the case that \(g' = g\).
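A tiny NumPy check of the equivariance relation \(f(gx) = g f(x)\), where \(f\) is a circular convolution and \(g\) is a cyclic shift (translation on a periodic signal); the kernel and shift are arbitrary.

```python
import numpy as np

def circ_conv(x, k):
    # circular convolution: output[i] = sum_j x[(i - j) mod n] * k[j]
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(len(k)))
                     for i in range(n)])

rng = np.random.default_rng(7)
x = rng.normal(size=8)
kernel = np.array([1.0, -2.0, 1.0])
g = lambda v: np.roll(v, 3)          # the group action: cyclic shift by 3

print(np.allclose(circ_conv(g(x), kernel), g(circ_conv(x, kernel))))   # True
```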

439 Dieleman, Fauw, & Kavukcuoglu (2016).

440 T. S. Cohen & Welling (2016).

441 T. S. Cohen, Weiler, Kicanaoglu, & Welling (2019).

442 Fuchs, Worrall, Fischer, & Welling (2020).

443 Bogatskiy, A. et al. (2023).

444 Marchetti, Hillar, Kragic, & Sanborn (2023).

445 Battiloro, C. et al. (2024).

446 Erdogan & Lucic (2025).

3.14.8 Thermodynamics of computation

447 Bérut, A. et al. (2012).

448 Bérut, Petrosyan, & Ciliberto (2015).

3.15 Information geometry

3.15.1 Introduction

449 Smith (2019).

450 Nielsen (2020).

451 Amari (1998).

452 Amari (2016).

3.15.2 Geometric understanding of classical statistics

453 Balasubramanian (1996a).

454 Balasubramanian (1996b).

455 Calin & Udriste (2014).

456 de Carvalho, Page, & Barney (2019).

3.15.3 Geometric understanding of deep learning

457 Lei, Luo, Yau, & Gu (2018).

458 Gao & Chaudhari (2020).

459 Bronstein, Bruna, Cohen, & Velickovic (2021).

3.16 Automation

3.16.1 Cybernetics

3.16.2 AutoML

  • Neural Architecture Search (NAS)
  • AutoML frameworks
  • RL-driven NAS
  • learned sparsity

3.16.3 Surrogate models

460 Fefferman, Mitter, & Narayanan (2016).

461 Raissi et al. (2017a) and Raissi, Perdikaris, & Karniadakis (2017b).

462 Karniadakis, G.E. et al. (2021).

463 Howard, Mandt, Whiteson, & Yang (2021).

464 Thuerey, N. et al. (2021).

465 Cranmer, Pavez, & Louppe (2015).

466 Cranmer, Brehmer, & Louppe (2019).

467 Baydin, A.G. et al. (2019).

Lectures:

3.16.4 AutoScience

468 C. Anderson (2008).

469 Asch, M. et al. (2018).

470 D’Agnolo & Wulzer (2019).

471 Krenn, M. et al. (2022).

472 Udrescu & Tegmark (2020).

473 Cranmer, M. et al. (2020).

474 Z. Liu, Madhavan, & Tegmark (2022).

475 Lu, C. et al. (2024).

Figure 3.14: The inference cycle for the process of scientific inquiry. The three distinct forms of inference (abduction, deduction, and induction) facilitate an all-encompassing vision, enabling HPC and HDA to converge in a rational and structured manner. HPC: high-performance computing; HDA: high-end data analysis. 476

476 Asch, M. et al. (2018).

See also:

3.17 Implications for the realism debate

3.17.1 Introduction

477 Korb (2001).

478 Williamson (2009).

479 Bensusan (2000).

See also:

3.17.2 Real clusters

  • Nope: Hennig

See also:

3.17.3 Word meanings

480 Kornai (2023).

481 Perone (2018).

482 Skelac & Jandric (2020).

483 Grietzer (2017b).

484 Grietzer (2017a).

485 Tenney, I. et al. (2019).

486 Nissim, Noord, & Goot (2019).

487 Bender & Koller (2020).

488 Elhage, N. et al. (2022).

489 Patel & Pavlick (2022).

490 Lovering & Pavlick (2022).

491 Huh, Cheung, Wang, & Isola (2024).

492 Jamali, M. et al. (2024).

493 Coecke, Sadrzadeh, & Clark (2010).

494 Grefenstette & Sadrzadeh (2011).

495 Balkir, Kartsaklis, & Sadrzadeh (2015).

496 Tyrrell (2018).

497 Kartsaklis & Sadrzadeh (2013).

498 Grefenstette, E. et al. (2013).

499 de Felice, Meichanetzidis, & Toumi (2019).

Wittgenstein in PI:

The meaning of a word is its use in the language. 500

500 Wittgenstein (2009), §43.

and

One cannot guess how a word functions. One has to look at its use, and learn from that. 501

501 Wittgenstein (2009), §340.

Piantadosi:

Modern large language models integrate syntax and semantics in the underlying representations: encoding words as vectors in a high-dimensional space, without an effort to separate out e.g. part of speech categories from semantic representations, or even predict at any level of analysis other than the literal word. Part of making these models work well was in determining how to encode semantic properties into vectors, and in fact initializing word vectors via encodings of distribution semantics from e.g. Mikolov et al. 2013 (Radford et al. 2019). Thus, an assumption of the autonomy of syntax is not required to make models that predict syntactic material and may well hinder it. 502

502 Piantadosi (2023), p. 15.

See also:

3.18 My thoughts

My docs:

My talks:

3.19 Annotated bibliography

3.19.1 Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge.

3.19.1.1 My thoughts

  • TODO

3.19.2 Cowan, G. (1998). Statistical Data Analysis.

3.19.2.1 My thoughts

  • TODO

3.19.3 James, F. (2006). Statistical Methods in Experimental Physics.

3.19.3.1 My thoughts

  • TODO

3.19.4 Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.

  • Cowan et al. (2011)
  • Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

3.19.4.1 My thoughts

  • TODO

3.19.5 ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.

3.19.5.1 My thoughts

  • TODO

3.19.6 Cranmer, K. (2015). Practical statistics for the LHC.

3.19.6.1 My thoughts

  • TODO