3  Philosophy of statistics

Published

November 28, 2025

Statistical analysis is central to addressing the problem of induction. Can inductive inference be formalized? What are the caveats? Can inductive inference be automated? How does machine learning work?

All knowledge is, in final analysis, history. All sciences are, in the abstract, mathematics. All judgements are, in their rationale, statistics.

– C. R. Rao 1

1 Rao (1997), p. x.

3.1 Introduction to the foundations of statistics

3.1.1 Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. The scientific method uses inductive inferences to make generalizations from finite data. This introduces unique avenues of error not found in purely deductive inferences, as in logic and mathematics. A deductive argument that is valid and whose premises all obtain is sound, and its conclusion follows necessarily; an inductive argument, by contrast, can at best be valid and probably (not certainly) sound, and so can still lead to error in some cases, because the support it lends its conclusion is ultimately probabilistic.

A skeptic may further probe if we are even justified in using the probabilities we use in inductive arguments. What is the probability the Sun will rise tomorrow? What kind of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.

See also:

3.1.2 Early investigators

  • Ibn al-Haytham (c. 965-1040)
    • “Ibn al-Haytham was an early proponent of the concept that a hypothesis must be supported by experiments based on confirmable procedures or mathematical evidence—an early pioneer in the scientific method five centuries before Renaissance scientists.” - Wikipedia
  • Gerolamo Cardano (1501-1576)
    • Liber de ludo aleae (Book on Games of Chance; written c. 1564, published posthumously in 1663)
  • John Graunt (1620-1674)
  • Jacob Bernoulli (1655-1705)
    • Ars Conjectandi (1713, posthumous)
    • First modern phrasing of the problem of parameter estimation 2
    • See Hacking 3
    • Early vision of decision theory:

2 Edwards (1974), p. 9.

3 Hacking (1971).

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous. 4

4 Bernoulli, J. (1713). Ars Conjectandi, Chapter II, Part IV, defining the art of conjecture [wikiquote].

5 Venn (1888).

6 Jevons (1873a).

7 Jevons (1873b).

3.1.3 Foundations of modern statistics

  • Central limit theorem
  • Charles Sanders Peirce (1839-1914)
    • Formulated modern statistics in “Illustrations of the Logic of Science”, a series published in Popular Science Monthly (1877-1878), and also “A Theory of Probable Inference” in Studies in Logic (1883). 8
    • Introduced blinded, randomized controlled experiments with a repeated-measures design (before Fisher).
  • Karl Pearson (1857-1936)
    • The Grammar of Science (1892)
    • “On the criterion that a given system of deviations…” (1900) 9
      • Proposed testing the validity of hypothesized values by evaluating the chi-squared distance between the hypothesized and the empirically observed values via the \(p\)-value.
    • With Walter Frank Raphael Weldon, he established the journal Biometrika in 1901.
    • Founded the world’s first university statistics department at University College, London in 1911.
  • John Maynard Keynes (1883-1946)
    • Keynes, J. M. (1921). A Treatise on Probability. 10
  • Ronald Fisher (1890-1972)
    • Fisher significance testing of the null hypothesis (\(p\)-values)
      • “On an absolute criterion for fitting frequency curves” 11
      • “Frequency distribution of the values of the correlation coefficient in samples of indefinitely large population” 12
    • “On the ‘probable error’ of a coefficient of correlation deduced from a small sample” 13
      • Definition of likelihood
      • ANOVA
    • Statistical Methods for Research Workers (1925)
    • The Design of Experiments (1935)
    • “Statistical methods and scientific induction” 14
    • The Lady Tasting Tea 15
  • Jerzy Neyman (1894-1981)
  • Egon Pearson (1895-1980)
    • Neyman-Pearson hypothesis tests and confidence intervals with fixed error probabilities (also based on \(p\)-values, but considering two hypotheses introduces two types of errors)
  • Harold Jeffreys (1891-1989)
    • objective (non-informative) Jeffreys priors
  • Andrey Kolmogorov (1903-1987)
  • C.R. Rao (1920-2023)
  • Ray Solomonoff (1926-2009)
  • Shun’ichi Amari (b. 1936)
  • Judea Pearl (b. 1936)

8 Peirce (1883), p. 126–181.

9 Pearson (1900).

10 Keynes (1921).

11 Fisher (1912).

12 Fisher (1915).

13 Fisher (1921).

14 Fisher (1955).

15 Salsburg (2001).

16 Reid (1998).

17 Neyman (1955).

18 Carnap (1960).

3.1.4 Pedagogy

19 Stuart, Ord, & Arnold (2010).

20 F. James (2006).

21 Cowan (1998) and Cowan (2016).

22 Cranmer (2015).

23 Jaynes (2003).

24 Lista (2016b).

25 Lista (2016a).

26 Cox (2006).

27 Behnke, Kröninger, Schott, & Schörner-Sadenius (2013).

28 Cousins (2018).

29 Weisberg (2019).

30 Gelman & Vehtari (2021).

31 Otsuka (2023).

3.3 Statistical models

3.3.1 Parametric models

  • Data: \(x_i\)
  • Parameters: \(\theta_j\)
  • Model: \(f(\vec{x} ; \vec{\theta})\)

3.3.2 Canonical distributions

3.3.2.1 Bernoulli distribution

\[ \mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \]

which can also be written as

\[ \mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\} \]

or

\[ \mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\} \]
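As a quick numerical check, the following minimal Python sketch (assuming NumPy and SciPy are available) verifies that the three equivalent forms of the Bernoulli pmf above agree with `scipy.stats.bernoulli`:

```python
import numpy as np
from scipy.stats import bernoulli

p = 0.3
for k in (0, 1):
    piecewise = p if k == 1 else 1 - p
    product_form = p**k * (1 - p)**(1 - k)
    linear_form = p * k + (1 - p) * (1 - k)
    assert np.isclose(piecewise, product_form)
    assert np.isclose(piecewise, linear_form)
    assert np.isclose(piecewise, bernoulli.pmf(k, p))
print("All three forms agree with scipy.stats.bernoulli.")
```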

  • Binomial distribution
  • Poisson distribution

TODO: explain, another important relationship is

Figure 3.1: Relationships among Bernoulli, binomial, categorical, and multinomial distributions.

3.3.2.2 Normal/Gaussian distribution

\[ N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \]

and in \(k\) dimensions:

\[ N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^\intercal \:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \]

where \(\boldsymbol{\Sigma}\) is the covariance matrix of the distribution (defined in eq. 3.2).
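As a sanity check of the \(k\)-dimensional density above, here is a small Python sketch (assuming NumPy and SciPy; the mean, covariance, and test point are arbitrary) comparing a direct transcription of the formula against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                   # covariance matrix
x = np.array([0.5, -1.5])

k = len(mu)
d = x - mu
manual = ((2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5)
          * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))
assert np.isclose(manual, multivariate_normal(mu, Sigma).pdf(x))
```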

3.3.3 Central limit theorem

Let \(X_{1}\), \(X_{2}\), … , \(X_{n}\) be a random sample drawn from any distribution with a finite mean \(\mu\) and variance \(\sigma^{2}\). As \(n \rightarrow \infty\), the distribution of

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
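A quick Monte Carlo sketch (assuming NumPy; the exponential population is an arbitrary choice of a non-Gaussian distribution) illustrates how the standardized sample mean approaches \(N(0,1)\) as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                       # mean and std of the Exp(1) population
for n in (2, 10, 100, 500):
    means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    z = (means - mu) / (sigma / np.sqrt(n))
    # For N(0, 1), P(z < 1.96) is about 0.975.
    print(f"n={n:4d}  mean={z.mean():+.3f}  std={z.std():.3f}  "
          f"P(z<1.96)={np.mean(z < 1.96):.3f}")
```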

  • \(\chi^2\) distribution
  • Univariate distribution relationships
Figure 3.2: Detail of a figure showing relationships among univariate distributions. See the full figure here 46.

46 Leemis & McQueston (2008).

  • The distributions in the exponential family are maximum entropy distributions.

3.3.4 Mixture models

47 Cranmer, K. et al. (2012).

3.4 Point estimation and confidence intervals

3.4.1 Inverse problems

Recall the context of parametric models of data, \(x_i\), whose pdf is modeled by a function, \(f(x_i ; \theta_j)\), with parameters, \(\theta_j\). In a statistical inverse problem, the goal is to infer values of the model parameters, \(\theta_j\), given some finite set of data, \(\{x_i\}\), sampled from a probability density, \(f(x_i; \theta_j)\), that models the data reasonably well 48.

48 This assumption that the model describes the data “reasonably” well reflects that, to the degree required by your analysis, the important features of the data are matched within the systematic uncertainties parametrized in the model. If the model is incomplete because it is missing an important feature of the data, then this is the “ugly” (class-3) error in the Sinervo classification of systematic uncertainties.

49 Cowan (1998) and Cowan (2016), p. TODO.

3.4.2 Bias and variance

The bias of an estimator, \(\hat\theta\), is defined as

\[ \mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \]

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. 3.1) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

\[ \mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \]

The MSE of an estimator can be related to its bias and its variance by the following proof:

\[ \begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

noting that

\[ \mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2 \]

and

\[ \begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \left(\mathbb{E}(\hat{\theta} - \theta)\right)^2 \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

we see that MSE is equivalent to

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \]

For an unbiased estimator, the MSE is the variance of the estimator.
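A small numerical sketch (assuming NumPy; the biased \(1/n\) variance estimator of a normal sample is used as an illustrative \(\hat{\theta}\)) confirms the decomposition \(\mathrm{MSE} = \mathrm{Var} + \mathrm{Bias}^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0                                   # the parameter theta being estimated
n, trials = 10, 200_000
x = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
theta_hat = x.var(axis=1)                        # biased 1/n variance estimator

mse = np.mean((theta_hat - true_var) ** 2)
var = theta_hat.var()
bias = theta_hat.mean() - true_var
print(mse, var + bias**2)                        # equal up to Monte Carlo error
```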

TODO:

  • Note the discussion of the bias-variance tradeoff by Cranmer.
  • Note the new deep learning view. See Deep learning.

See also:

3.4.3 Maximum likelihood estimation

A maximum likelihood estimator (MLE) was first used by Fisher. 50

50 Aldrich (1997).

\[\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is equivalent to maximizing \(L(\theta)\), and the former is more convenient because for data that are independent and identically distributed (i.i.d.) the joint likelihood can be factored into a product of individual measurements:

\[ L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta) \]

and taking the log of the product makes it a sum:

\[ \mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is also equivalent to minimizing \(-\mathrm{log} \: L(\theta)\), the negative log-likelihood (NLL). For distributions that are i.i.d.,

\[ \mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i \]
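As a minimal illustration of maximum likelihood by direct NLL minimization (assuming NumPy and SciPy; a Gaussian model for i.i.d. data is an arbitrary example), the following sketch recovers the analytic MLEs:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.5, size=1000)    # i.i.d. data from N(3, 1.5^2)

def nll(params):
    mu, sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))   # summed per-event NLL

fit = minimize(nll, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
print(fit.x, [x.mean(), x.std(ddof=0)])          # numerical vs analytic MLEs
```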

3.4.3.1 Invariance of likelihoods under reparametrization

  • Likelihoods are invariant under reparametrization. 51
  • Bayesian posteriors are not invariant in general.

51 F. James (2006), p. 234.

See also:

3.4.3.2 Ordinary least squares

  • Least squares from MLE of gaussian models: \(\chi^2\)
  • Ordinary Least Squares (OLS)
  • Geometric interpretation

52 Cox (2006), p. 11.

53 Murphy (2012), p. 222.

3.4.4 Variance of MLEs

54 Fréchet (1943), Cramér (1946), Rao (1945), and Rao (1947).

55 Rice (2007), p. 300–2.

56 Nielsen (2013).

57 Cowan (1998), p. 130-5.

58 F. James (2006), p. 234.

59 F. James & Roos (1975).

Figure 3.3: Transformation of non-parabolic log-likelihood to parabolic (source: my slides, recreation of F. James (2006), p. 235).

60 Loh (1987).

61 Wainer (2007).

62 Tegmark, Taylor, & Heavens (1997).

3.4.5 Bayesian credibility intervals

3.4.6 Uncertainty on measuring an efficiency

63 Clopper & Pearson (1934).

64 Agresti & Coull (1998).

65 Hanley & Lippman-Hand (1983).

66 L. D. Brown, Cai, & DasGupta (2001).

67 Casadei (2012).

3.4.7 Examples

3.5 Statistical hypothesis testing

3.5.1 Null hypothesis significance testing

  • Karl Pearson observing how rare sequences of roulette spins are
  • Null hypothesis significance testing (NHST)
  • goodness of fit
  • Fisher

Fisher:

[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. 68

68 Fisher (1935), p. 16.

3.5.2 Neyman-Pearson theory

3.5.2.1 Introduction

69 Goodman (1999a). p. 998.

70 J. Cohen (1992).

71 ATLAS and CMS Collaborations (2011).

72 Cowan, Cranmer, Gross, & Vitells (2011).

Figure 3.4: TODO: ROC explainer. (Wikimedia, 2015).

See also:

3.5.2.2 Neyman-Pearson lemma

Neyman-Pearson lemma: 73

73 Neyman & Pearson (1933).

For a fixed signal efficiency, \(1-\alpha\), the selection that corresponds to the lowest possible misidentification probability, \(\beta\), is given by

\[ \frac{L(H_1)}{L(H_0)} > k_{\alpha} \]

where \(k_{\alpha}\) is the cut value required to achieve a type-1 error rate of \(\alpha\).

Neyman-Pearson test statistic:

\[ q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \]
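The following toy sketch (assuming NumPy and SciPy; the single-bin Poisson counting model with hypothetical rates \(b=10\) and \(s=5\) is illustrative only) builds the most powerful rejection region for a given \(\alpha\) by cutting on the likelihood ratio, in the spirit of the lemma above:

```python
import numpy as np
from scipy.stats import poisson

b, s = 10.0, 5.0                               # hypothetical background and signal rates
n = np.arange(0, 60)                           # possible observed counts
q_np = -2 * (poisson.logpmf(n, s + b) - poisson.logpmf(n, b))   # -2 ln L(H1)/L(H0)

# Fill the rejection region with the most signal-like counts (smallest q_np)
# until the type-1 error rate under H0 would exceed alpha = 0.05.
order = np.argsort(q_np)
p0 = poisson.pmf(n, b)[order]
region = np.cumsum(p0) <= 0.05
alpha = p0[region].sum()
power = poisson.pmf(n, s + b)[order][region].sum()   # 1 - beta under H1
print(f"alpha = {alpha:.3f}, power = {power:.3f}")
```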

Profile likelihood ratio:

\[ \lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \]

where \(\hat{\theta}\) is the (unconditional) maximum-likelihood estimator that maximizes \(L\), while \(\hat{\theta}_\mu\) is the conditional maximum-likelihood estimator that maximizes \(L\) for a specified signal strength, \(\mu\), and \(\theta\) as a vector includes all other parameters of interest and nuisance parameters.

3.5.2.3 Neyman construction

Cranmer: Neyman construction.

Figure 3.5: Neyman construction for a confidence belt for \(\theta\) (source: K. Cranmer, 2020).

TODO: fix

\[ q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \]

3.5.2.4 Flip-flopping

  • Flip-flopping and Feldman-Cousins confidence intervals 74

74 Feldman & Cousins (1998).

3.5.3 p-values and significance

  • \(p\)-values and significance 75
  • Coverage
  • Fisherian vs Neyman-Pearson \(p\)-values

75 Sinervo (2002) and Cowan (2012).

Cowan et al. define a \(p\)-value as

a probability, under assumption of \(H\), of finding data of equal or greater incompatibility with the predictions of \(H\). 76

76 Cowan et al. (2011), p. 2–3.

Also:

It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the \(p\)-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery”. 77

77 Cowan et al. (2011), p. 3.
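As a small worked example (assuming NumPy and SciPy; the background expectation and observed count are hypothetical), a counting-experiment \(p\)-value under the background-only hypothesis can be converted to the usual one-sided Gaussian significance, \(Z = \Phi^{-1}(1-p)\):

```python
from scipy.stats import poisson, norm

b = 3.2                            # hypothetical expected background
n_obs = 9                          # hypothetical observed count
p = poisson.sf(n_obs - 1, b)       # p-value: P(n >= n_obs | background-only)
Z = norm.isf(p)                    # one-sided significance, Z = Phi^{-1}(1 - p)
print(f"p = {p:.2e}, Z = {Z:.2f} sigma")
```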

3.5.3.1 Upper limits

78 Cousins & Highland (1992).

3.5.3.2 CLs method

  • Conservative coverage; used in particle physics
  • Junk 79
  • Read 80
  • ATLAS 81

79 Junk (1999).

80 Read (2002).

81 ATLAS Statistics Forum (2011).

3.5.4 Asymptotics

82 Wilks (1938).

83 Wald (1943).

84 Cowan et al. (2011).

85 Cowan, Cranmer, Gross, & Vitells (2012).

86 Bhattiprolu, Martin, & Wells (2020).

3.5.5 Student’s t-test

  • Student’s t-test
  • ANOVA
  • A/B-testing

3.5.6 Decision theory

87 Murphy (2012), p. 197.

88 Goodman (1999b).

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings. 89

89 Goodman (1999a). p. 995.

See also:

3.5.7 Examples

3.6 Uncertainty quantification

3.6.1 Sinervo classification of systematic uncertainties

  • Class-1, class-2, and class-3 systematic uncertainties (good, bad, ugly): classification by Pekka Sinervo (PhyStat2003) 90
  • Not to be confused with type-1 and type-2 errors in Neyman-Pearson theory
  • Heinrich, J. & Lyons, L. (2007). Systematic errors. 91
  • Caldeira & Nord 92

90 Sinervo (2003).

91 Heinrich & Lyons (2007).

92 Caldeira & Nord (2020).

Lyons:

In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error. 93

93 Lyons (2008), p. 890.

Figure 3.6: Classification of measurement uncertainties (philosophy-in-figures.tumblr.com, 2016).
  • Poincaré’s three levels of ignorance

3.6.2 Profile likelihoods

  • Profiling and the profile likelihood
    • Importance of Wald and Cowan et al.
    • hybrid Bayesian-frequentist method

3.6.3 Examples of poor estimates of systematic uncertainties

Figure 3.7: Demonstration of sensitivity to the jet energy scale for an alleged excess in \(Wjj\) by Tommaso Dorigo (2011) (see also: GIF).
  • OPERA. (2011). Faster-than-light neutrinos.
  • BICEP2 claimed evidence of B-modes in the CMB as evidence of cosmic inflation without accounting for cosmic dust.

3.6.4 Conformal prediction

94 Gammerman, Vovk, & Vapnik (1998).

3.7 Statistical classification

3.7.1 Introduction

  • Precision vs recall
  • Recall is sensitivity
  • Sensitivity vs specificity
  • Accuracy

3.7.2 Examples

  • TODO

See also:

3.8 Causal inference

3.8.1 Introduction

95 Rubin (1974).

96 Lewis (1981).

97 Pearl (2018).

See also:

3.8.2 Causal models

98 Pearl (2009).

99 Robins & Wasserman (1999).

100 Peters, Janzing, & Scholkopf (2017).

101 Lundberg, Johnson, & Stewart (2021).

3.8.3 Counterfactuals

102 Ismael (2023).

103 Chevalley, Schwab, & Mehrjou (2024).

3.9 Exploratory data analysis

3.9.1 Introduction

104 Tukey (1977).

3.9.2 Look-elsewhere effect

  • Look-elsewhere effect (LEE)
    • AKA File-drawer effect
  • Stopping rules
    • validation dataset
    • statistical issues, violates the likelihood principle

3.9.3 Archiving and data science

105 Chen, X. et al. (2018).

3.10 “Statistics Wars”

3.10.1 Introduction

  • Kruschke
  • Carnap
    • “The two concepts of probability” 106
  • Royall
    • “What do these data say?” 107

106 Carnap (1945).

107 Royall (1997), p. 171–2.

Cranmer:

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability \(P (A, B)\) in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability. 108

108 Cranmer (2015), p. 6.

Neyman:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. 109

109 Neyman & Pearson (1933).

Figure 3.8: From Kruschke. 110

110 Kruschke & Liddell (2018).

3.10.2 Likelihood principle

  • Likelihood principle
  • The likelihood principle is the proposition that, given a statistical model and a data sample, all the evidence relevant to model parameters is contained in the likelihood function.
  • The history of likelihood 111
  • Berger & Wolpert. (1988). The Likelihood Principle. 114

111 Edwards (1974).

112 Birnbaum (1962).

113 Hacking (1965).

114 Berger & Wolpert (1988).

O’Hagan:

The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference. 115

115 O’Hagan (2010), p. 17–18.

116 Gandenberger (2015).

117 Evans (2013).

118 Mayo (2014).

119 Mayo (2019).

120 Dawid (2014).

Mayo:

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true. 121

121 Mayo (2019).

3.10.3 Discussion

Lyons:

Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors. 122

122 Lyons (2008), p. 891.

123 Carnap (1952).

124 Carnap (1960).

125 Sznajder (2018).

126 Hacking (1965).

127 Neyman (1977).

128 Rozeboom (1960).

129 Meehl (1978).

130 Zech (1995).

131 Royall (1997).

132 Berger (2003).

133 Stuart et al. (2010), p. 460.

134 Mayo (1981).

135 Mayo (1996).

136 Mayo & Spanos (2006).

137 Mayo & Spanos (2011).

138 Mayo (2018).

139 Gelman & Hennig (2017).

140 Murphy (2012), ch. 6.6.

141 Murphy (2022), p. 195–198.

142 Gandenberger (2016).

143 Wakefield (2013), ch. 4.

144 Efron & Hastie (2016), p. 30–36.

145 Kruschke & Liddell (2018).

146 Steinhardt (2012).

147 Clayton (2021).

148 Wagenmakers (2021).

Figure 3.9: The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference (gandenberger.org/research/, 2015).

Goodman:

The idea that the \(P\) value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations). 149

149 Goodman (1999a). p. 999.

3.11 Replication crisis

3.11.1 Introduction

  • Ioannidis, J.P. (2005). Why most published research findings are false. 150

150 Ioannidis (2005).

3.11.2 p-value controversy

151 Wasserstein & Lazar (2016).

152 Wasserstein, Allen, & Lazar (2019).

153 Benjamin, D.J. et al. (2017).

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. 154

154 Fisher (1935), p. 13–14.

155 Rao & Lovric (2016).

156 Mayo (2021).

157 Gorard & Gorard (2016).

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application. 158

158 Benjamini, Y. et al. (2021), p. 1.

3.12 Classical machine learning

3.12.1 Introduction

  • Classification vs regression
  • Supervised and unsupervised learning
    • Classification = supervised; clustering = unsupervised
  • Hastie, Tibshirani, & Friedman 159
  • Information Theory, Inference, and Learning 160
  • Murphy, K.P. (2012). Machine Learning: A probabilistic perspective. MIT Press. 161
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An introduction. MIT Press. 162
  • Shalev-Shwarz, S. & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. 163
  • VC-dimension
    • Vapnik (1994) 164
    • Shalev-Shwarz, S. & Ben-David, S. (2014). 165

159 Hastie, Tibshirani, & Friedman (2009).

160 MacKay (2003).

161 Murphy (2012).

162 Murphy (2022), p. 195–198.

163 Shalev-Shwarz & Ben-David (2014).

164 Vapnik, Levin, & LeCun (1994).

165 Shalev-Shwarz & Ben-David (2014), p. 67–82.

3.12.2 History

166 McCarthy, Minsky, Rochester, & Shannon (1955).

167 Solomonoff (2016).

168 Kardum (2020).

169 McCulloch & Pitts (1943).

170 J. A. Anderson & Rosenfeld (1998).

171 Rosenblatt (1961).

172 Minsky & Papert (1969).

173 Cartwright (2001).

See also:

3.12.3 Logistic regression

From a probabilistic point of view, 174 logistic regression can be derived by doing maximum likelihood estimation of a vector of model parameters, \(\vec{w}\), which enters through a dot product with the input features, \(\vec{x}\), squashed by a logistic function to yield the probability, \(\mu\), of a Bernoulli random variable, \(y \in \{0, 1\}\).

174 Murphy (2012), p. 21.

\[ p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)} \]

The negative log-likelihood of multiple trials is

\[ \begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \end{align} \]

which is the cross entropy loss. Note that the first term is non-zero only when the true target is \(y_i=1\), and similarly the second term is non-zero only when \(y_i=0\). 175 Therefore, we can reparametrize the target \(y_i\) in favor of \(t_{ki}\) that is one-hot in an index \(k\) over classes.

175 Note: Label smoothing is a regularization technique that smears the activation over other labels, but we don’t do that here.

\[ \mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \]

where

\[ t_{ki} = \begin{cases} 1 & \mathrm{if}\ (k = y_i = 0)\ \mathrm{or}\ (k = y_i = 1) \\ 0 & \mathrm{otherwise} \end{cases} \]

and

\[ \mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases} \]

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, \(k\), only one term for the true class contributes.

\[ \mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{such\ that}\ t_{ki}=1} \tag{3.3}\]

Logistic regression uses the logit function 176, which is the logarithm of the odds—the ratio of the chance of success to failure. Let \(\mu\) be the probability of success in a Bernoulli trial, then the logit function is defined as

176 “Logit” was coined by Joseph Berkson (1899-1982).

\[ \mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \]

Logistic regression assumes that the logit function is a linear function of the explanatory variable, \(x\).

\[ \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x \]

where \(\beta_0\) and \(\beta_1\) are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, \(\vec{x}\), where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, \(\beta_0\), in a dot product.

\[ \vec{x} = (1, x_1, x_2, \ldots, x_n)^\intercal \]

\[ \vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^\intercal \]

\[ \log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^\intercal \vec{x} \]

For the moment, let \(z \equiv \vec{w}^\intercal \vec{x}\). Exponentiating and solving for \(\mu\) gives

\[ \mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} } \]

This function is called the logistic or sigmoid function.

\[ \mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \]

Since we inverted the logit function by solving for \(\mu\), the inverse of the logit function is the logistic or sigmoid.

\[ \mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z) \]

And therefore,

\[ \mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^\intercal \vec{x}) \]

Figure 3.10: Logistic regression.
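To make the derivation concrete, here is a minimal sketch (assuming NumPy; the synthetic data, learning rate, and iteration count are arbitrary) of logistic regression trained by gradient descent on the cross-entropy NLL derived above:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # prepend 1 for the bias
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(3)
eta = 0.1
for _ in range(2000):
    mu = 1 / (1 + np.exp(-X @ w))            # sigm(w^T x)
    grad = X.T @ (mu - y) / n                # gradient of the mean NLL w.r.t. w
    w -= eta * grad
print(w)                                     # roughly recovers w_true given enough data
```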

See also:

3.12.4 Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by \(k\)) as

\[ p(\vec{y} \,|\, \vec{\mu}) = \mathrm{Cat}(\vec{y} \,|\, \vec{\mu}) = \prod_k {\mu_k}^{y_k} \]

which is the categorical or multinoulli distribution. The negative-log likelihood of multiple independent trials is

\[ \mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \]

Noting again that \(y_{ki} = 1\) only when \(k\) is the true class, and is 0 otherwise, this simplifies to eq. 3.3.
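A compact sketch (assuming NumPy; the logits and class labels are made up) of the categorical cross-entropy in eq. 3.3, computed from a softmax over logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
y = np.array([0, 2])                          # true class indices
mu = softmax(logits)
cel = -np.log(mu[np.arange(len(y)), y]).sum() # eq. 3.3: only the true class contributes
print(cel)
```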

See also:

177 McFadden & Zarembka (1973).

178 Blondel, Martins, & Niculae (2020).

179 Goodfellow et al. (2016), p. 129.

3.12.5 Decision trees

180 Freund & Schapire (1997).

181 T. Chen & Guestrin (2016).

182 Aytekin (2022).

183 Grinsztajn, Oyallon, & Varoquaux (2022).

184 Coadou (2022).

3.12.6 Clustering

185 Hartigan (1985).

186 Slonim, Atwal, Tkacik, & Bialek (2005).

187 Batson, Haaf, Kahn, & Roberts (2021).

188 Hennig (2015).

189 Lauc (2020), p. 103–4.

190 Ronen, Finder, & Freifeld (2022).

191 Fang, Z. et al. (2022).

See also:

3.13 Deep learning

3.13.1 Introduction

192 Gardner & Dorling (1998).

193 Schmidhuber (2024).

194 Bengio (2009).

195 LeCun, Bengio, & Hinton (2015).

196 Sutskever (2015).

197 Goodfellow et al. (2016).

198 Kaplan, J. et al. (2019).

199 Rumelhart, Hinton, & Williams (1986).

200 Amari (1993).

201 LeCun & Bottou (1998).

202 Scardapane (2024).

203 Bottou (1998).

204 Norvig (2011).

205 Sutton (2019).

206 Frankle & Carbin (2018).

Figure 3.11: Raw input image is transformed into gradually higher levels of representation. 207

207 Bengio (2009).

3.13.2 Gradient descent

\[ \hat{f} = \underset{f \in \mathcal{F}}{\mathrm{argmin}} \underset{x \sim \mathcal{X}}{\mathbb{E}} L(f, x) \]

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

\[ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] \]

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), \(\vec{x}_{bt}\), and take a gradient descent step with each minibatch:

\[ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{\vec{x}_i \in \vec{x}_{bt}} \frac{\partial L}{\partial \vec{w}}[\vec{x}_{i}] \]

where

  • \(t \in \mathbf{N}\) is the learning step number
  • \(\eta\) is the learning rate
  • \(m\) is the number of samples in a minibatch, called the batch size
  • \(L\) is the loss function
  • \(\frac{\partial L}{\partial \vec{w}}\) is the gradient
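A minimal SGD loop (assuming NumPy; linear regression with squared-error loss is an arbitrary stand-in for the model and loss) matching the minibatch update rule above:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta, m = 0.05, 32                              # learning rate and batch size
for step in range(500):
    idx = rng.choice(len(X), size=m, replace=False)   # draw a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / m               # gradient of the minibatch loss
    w -= eta * grad
print(w)                                       # approaches w_true
```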

3.13.3 Deep double descent

Papers:

208 Belkin, Hsu, Ma, & Mandal (2019).

209 Muthukumar, Vodrahalli, Subramanian, & Sahai (2019).

210 Nakkiran, P. et al. (2019).

211 Chang, Li, Oymak, & Thrampoulidis (2020).

212 Holzmüller (2020).

213 Dar, Muthukumar, & Baraniuk (2021).

214 Balestriero, Pesenti, & LeCun (2021).

215 Belkin (2021).

216 Nagarajan (2021).

217 Bach (2022), p. 225–230.

218 Ghosh & Belkin (2022).

219 Singh, Lucchi, Hofmann, & Schölkopf (2022).

220 Hastie, Montanari, Rosset, & Tibshirani (2022).

221 Bubeck & Sellke (2023).

222 Gamba, Englesson, Björkman, & Azizpour (2022).

223 Schaeffer, R. et al. (2023).

224 Yang & Suzuki (2023).

225 Maddox, Benton, & Wilson (2023).

Blogs:

226 Steinhardt (2022).

227 Henighan, T. et al. (2023).

Twitter threads:

3.13.4 Regularization

Regularization = any change we make to the training algorithm in order to reduce the generalization error but not the training error. 228

228 Mishra, D. (2020). Weight Decay == L2 Regularization?

Most common regularizations:

  • L2 Regularization
  • L1 Regularization
  • Data Augmentation
  • Dropout
  • Early Stopping

Papers:

229 S. Chen, Dobriban, & Lee (2020).

3.13.5 Batch size vs learning rate

Papers:

  1. Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

  1. Hoffer, E. et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

    • \(\eta \propto \sqrt{m}\)
  2. Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

    • \(\eta \propto m\)
  3. You, Y. et al. (2017). Large batch training of convolutional networks.

    • Layer-wise Adaptive Rate Scaling (LARS)
  4. You, Y. et al. (2017). ImageNet training in minutes.

    • Layer-wise Adaptive Rate Scaling (LARS)
  5. Jastrzebski, S. (2018). Three factors influencing minima in SGD.

    • \(\eta \propto m\)
  6. Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

  7. Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

    • \(m \propto \eta\)
  8. Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when \(m \ll M\), the covariance matrix of the weight update \(\mathrm{Cov}(\eta \, \Delta\theta)\) scales linearly with the quantity \(\eta^2/m\).

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update \(\eta \Delta\theta\). Conversely, to keep the scaling of the covariance of the weight update vector \(\eta \Delta\theta\) constant would require scaling \(\eta\) with the square root of the batch size \(m\) (Krizhevsky, 2014; Hoffer et al., 2017).

  1. Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.
    - Post-local SGD.

  2. Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as \(\eta \propto \sqrt{m}\) attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

  1. McCandlish, S. et al. (2018). An empirical model of large-batch training.
    • Critical batch size
  2. Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling (\(b\)-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

  1. Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.
    • \(\eta \propto m\)

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

  1. You, Y. et al. (2019). Large-batch training for LSTM and beyond.
    • Warmup and use \(\eta \propto m\)

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

  1. You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.
    • LARS and LAMB
  2. Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

  1. Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

  1. Kaplan, J. et al. (2020). Scaling laws for neural language models.

  2. Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.

Blogs:

3.13.6 Normalization

230 Chiley, V. et al. (2019).

231 Kiani, Balestriero, Lecun, & Lloyd (2022).

232 Huang, L. et al. (2020).

3.13.7 Finetuning

233 Hu, E.J. et al. (2021).

234 Dettmers, Pagnoni, Holtzman, & Zettlemoyer (2023).

235 Zhao, J. et al. (2024).

236 Huh, M. et al. (2024).

3.13.8 Computer vision

237 Fukushima & Miyake (1982).

238 LeCun, Y. et al. (1989).

239 LeCun, Bottou, Bengio, & Haffner (1998).

240 Ciresan, Meier, Masci, & Schmidhuber (2012).

241 Krizhevsky, Sutskever, & Hinton (2012).

242 Simonyan & Zisserman (2014).

243 He, Zhang, Ren, & Sun (2015).

244 Haber & Ruthotto (2017) and Haber, E. et al. (2018).

245 Howard, A.G. et al. (2017).

246 R. T. Q. Chen, Rubanova, Bettencourt, & Duvenaud (2018).

247 Tan & Le (2019).

248 Dosovitskiy, A. et al. (2020).

249 Tan & Le (2021).

250 H. Liu, Dai, So, & Le (2021).

251 Dhariwal & Nichol (2021).

252 Liu, Y. et al. (2021).

253 Ingrosso & Goldt (2022).

254 Park & Kim (2022).

255 Zhao, Y. et al. (2023).

256 Nakkiran, Bradley, Zhou, & Advani (2024).

Resources:

3.13.9 Natural language processing

3.13.9.1 Introduction

257 Firth (1957).

258 Nirenburg (1996).

259 Hutchins (2000).

260 Jurafsky & Martin (2022).

261 Z. Liu, Lin, & Sun (2023).

3.13.9.2 word2vec

262 Mikolov, Chen, Corrado, & Dean (2013), Mikolov, Yih, & Zweig (2013), and Mikolov, T. et al. (2013).

263 Kun (2018), p. 176–8.

3.13.9.3 RNNs

264 Hochreiter & Schmidhuber (1997).

265 Graves (2013).

Chain rule of language modeling (chain rule of probability):

\[ P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

or for the whole sequence:

\[ P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

\[ = P(x_1) \: P(x_2 | x_1) \: P(x_3 | x_1 x_2) \: P(x_4 | x_1 x_2 x_3) \ldots \]

A language model (LM) predicts the next token given the previous context. The output of the model is a vector of logits, which is given to a softmax to convert to probabilities for the next token.

\[ P(x_t | x_1 \ldots x_{t-1}) = \mathrm{softmax}\left( \mathrm{model}(x_1 \ldots x_{t-1}) \right) \]

Auto-regressive inference follows this chain rule. If done with greedy search:

\[ \hat{x}_t = \underset{x_t \in V}{\mathrm{argmax}} \: P(x_t | x_1 \ldots x_{t-1}) \]
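A toy sketch of auto-regressive greedy decoding (assuming NumPy; `model_logits` is a hypothetical stand-in for a trained language model, and the vocabulary is made up):

```python
import numpy as np

vocab = ["<eos>", "the", "cat", "sat"]

def model_logits(context):                     # hypothetical stand-in for a trained LM
    rng = np.random.default_rng(len(context))  # deterministic toy logits
    return rng.normal(size=len(vocab))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

tokens = ["the"]
for _ in range(5):                             # greedy search: argmax of P(x_t | x_<t)
    p = softmax(model_logits(tokens))
    nxt = vocab[int(np.argmax(p))]
    if nxt == "<eos>":
        break
    tokens.append(nxt)
print(tokens)
```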

Beam search:

266 Sutskever, Vinyals, & Le (2014), p. 4.

267 Zhang (1998).

268 Zhou & Hansen (2005).

269 Collobert, Hannun, & Synnaeve (2019).

Backpropagation through time (BPTT):

270 Werbos (1990).

Neural Machine Translation (NMT):

  • Sutskever seq2seq 271
  • Bahdanau attention 272 and GNMT 273
  • Review by Stahlberg 274

271 Sutskever et al. (2014).

272 Bahdanau, Cho, & Bengio (2015).

273 Wu, Y. et al. (2016).

274 Stahlberg (2019).

3.13.9.4 Transformers

Figure 3.12: Diagram of the Transformer model (source: d2l.ai).

\[ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\, K^\intercal}{\sqrt{d_k}}\right) V \]
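A direct NumPy transcription of the scaled dot-product attention formula above, as a sketch for a single head (the shapes are arbitrary):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) similarity matrix
    return softmax(scores, axis=-1) @ V        # weighted average of the values

rng = np.random.default_rng(5)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(attention(Q, K, V).shape)                # (4, 16)
```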

3.13.9.4.1 Attention and Transformers

275 Vaswani, A. et al. (2017).

276 Devlin, Chang, Lee, & Toutanova (2018).

277 Liu, Y. et al. (2019).

278 Raffel, C. et al. (2019).

279 Lan, Z. et al. (2019).

280 Lewis, M. et al. (2019).

281 Radford, Narasimhan, Salimans, & Sutskever (2018).

282 Radford, A. et al. (2019).

283 Brown, T.B. et al. (2020).

284 Yang, Z. et al. (2019).

285 Zaheer, M. et al. (2020).

286 Edelman, Goel, Kakade, & Zhang (2021).

287 Tay, Dehghani, Bahri, & Metzler (2022).

288 Phuong & Hutter (2022).

289 Chowdhery, A. et al. (2022).

290 Ouyang, L. et al. (2022).

291 Wolfram (2023).

292 OpenAI (2023).

293 Mohamadi, S. et al. (2023).

294 Zhao, W.X. et al. (2023).

295 Golovneva, Wang, Weston, & Sukhbaatar (2024).

Figure 3.13: Diagram of the BERT model (source: peltarion.com).
3.13.9.4.2 Computational complexity of transformers
3.13.9.4.3 Efficient transformers

296 Dao, T. et al. (2022).

3.13.9.4.4 What comes after Transformers?

297 Gu, Goel, & Ré (2021).

298 Merrill & Sabharwal (2022).

299 Bulatov, Kuratov, & Burtsev (2022).

300 Bulatov, Kuratov, & Burtsev (2023).

301 Bertsch, Alon, Neubig, & Gormley (2023).

302 Mialon, G. et al. (2023).

303 Peng, B. et al. (2023).

304 Sun, Y. et al. (2023).

305 Gu & Dao (2023).

306 Wang, H. et al. (2023).

307 Ma, S. et al. (2024).

308 Ma, X. et al. (2024).

309 Bhargava, Witkowski, Shah, & Thomson (2023).

310 Dao & Gu (2024).

311 Banerjee, Agarwal, & Singla (2024).

312 Liu, J. et al. (2025).

3.13.9.5 Evaluation methods

3.13.9.6 Scaling laws in NLP

313 Hestness, J. et al. (2017).

314 Church & Hestness (2019).

315 Kaplan, J. et al. (2020).

316 Rae, J.W. et al. (2022).

317 Hoffmann, J. et al. (2022).

318 Caballero, Gupta, Rish, & Krueger (2022).

319 Muennighoff, N. et al. (2023).

320 Pandey (2024).

321 Bach (2024).

322 Finzi, M. et al. (2025).

3.13.9.7 Language understanding

323 Mahowald, K. et al. (2023).

324 Kosinski (2023).

See also:

3.13.9.8 Interpretability

325 Watson & Floridi (2019).

326 Gurnee, W. et al. (2023).

327 Meng, Bau, Andonian, & Belinkov (2023).

328 McDougall, C. et al. (2023).

Linear probes:

329 Alain & Bengio (2016).

330 Belinkov (2022).

331 Gurnee & Tegmark (2023).

3.13.10 Reinforcement learning

Pedagogy:

332 Sutton & Barto (2018).

333 Arulkumaran, Deisenroth, Brundage, & Bharath (2017).

334 Cesa-Bianchi & Lugosi (2006).

Tutorials:

More:

335 Xu, Hasselt, & Silver (2018).

336 Silver, Singh, Precup, & Sutton (2024).

337 Javed & Sutton (2024).

3.13.10.1 Q-learning

  • Q-learning and DQN
  • Uses the Markov Decision Process (MDP) framework
  • The Bellman equation 338
  • Q-learning is a value-based learning algorithm. Value-based algorithms update the value function based on an equation (in particular, the Bellman equation), whereas policy-based algorithms estimate the value function with a greedy policy obtained from the last policy improvement (source: towardsdatascience.com).
  • DQN masters Atari 339

338 Bellman (1952).

339 Mnih, V. et al. (2013) and Mnih, V. et al. (2015).

3.13.10.2 AlphaZero

340 Silver, D. et al. (2016).

341 Silver, D. et al. (2017b).

342 Silver, D. et al. (2017a).

3.13.10.3 Regret minimization

Regret matching (RM)

343 Hart & Mas‐Colell (2000).

Consider a game like rock-paper-scissors, where there is only one action per round. Let \(v^{t}(a)\) be the value observed when playing action \(a\) on iteration \(t\).

TODO: explain that the entire rewards vector, \(v^{t}(a)\), over \(a\) is observable after the chosen action is played.

Let a strategy, \(\sigma^t\), be a probability distribution over actions, \(a \in A\). Then the value of a strategy, \(v^{t}(\sigma^{t})\), is the expectation of its value over actions.

\[ v^{t}(\sigma^{t}) = \sum_{a \in A} \sigma^{t}(a) \: v^{t}(a) \]

Regret, \(R^{T}\), measures how much better some sequence of strategies, \(\sigma'\), would do compared to the chosen sequence of strategies, \(\sigma = \{\sigma^1, \sigma^2, \ldots \sigma^T\}\).

\[ R^{T} \equiv \sum_{t=1}^{T} \left( v^{t}({\sigma'}^{t}) - v^{t}(\sigma^{t}) \right) \]

External regret, \(R^{T}(a)\), measures the regret of the chosen sequence of strategies versus a hypothetical strategy in which action \(a\) is always chosen.

\[ R^{T}(a) \equiv \sum_{t=1}^{T} \left( v^{t}(a) - v^{t}(\sigma^{t}) \right) \]

Regret Matching (RM) is a rule to determine the strategy for the next iteration:

\[ \sigma^{t+1}(a) \equiv \frac{ R^{t}_{+}(a) }{ \sum_{b \in A} R^{t}_{+}(b) } \]

where \(R_{+} \equiv \mathrm{max}(R, 0)\).

At the end of training, the recommended strategy, the one with convergence bounds, is not the final strategy used in training, \(\sigma^{T}\), but the average strategy over all time steps:

\[ \bar{\sigma}^{T}(a) = \frac{1}{T} \sum_{t=1}^{T} \sigma^{t}(a) \]
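A self-contained sketch of regret matching (assuming NumPy; the fixed opponent mixture is hypothetical) for rock-paper-scissors, showing the average strategy \(\bar{\sigma}^{T}\) shifting toward the best response:

```python
import numpy as np

rng = np.random.default_rng(6)
A = 3                                          # rock, paper, scissors
payoff = np.array([[ 0, -1,  1],               # payoff[a, b] = value of playing a vs b
                   [ 1,  0, -1],
                   [-1,  1,  0]])
opponent = np.array([0.4, 0.3, 0.3])           # hypothetical fixed opponent strategy

regret_sum = np.zeros(A)
strategy_sum = np.zeros(A)
T = 10_000
for t in range(T):
    pos = np.maximum(regret_sum, 0)            # R_+ = max(R, 0)
    sigma = pos / pos.sum() if pos.sum() > 0 else np.ones(A) / A  # regret matching
    strategy_sum += sigma
    b = rng.choice(A, p=opponent)
    v = payoff[:, b]                           # rewards for every action this round
    regret_sum += v - v @ sigma                # external regret update
avg_sigma = strategy_sum / T
print(avg_sigma)                               # puts most weight on the best response (paper)
```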

TODO: explain the convergence of \(\bar{\sigma}^{t}\) to an \(\varepsilon\)-Nash equilibrium.

344 Roughgarden (2016).

Counterfactual regret minimization (CFR)

345 Zinkevich, Johanson, Bowling, & Piccione (2007).

346 N. Brown (2020), p. 12.

347 Zinkevich et al. (2007), p. 4.

348 Tammelin (2014).

349 Tammelin, Burch, Johanson, & Bowling (2015).

350 Burch, Moravcik, & Schmid (2019).

351 N. Brown & Sandholm (2019a).

TODO: explain extensive-form games.

A finite extensive game with imperfect information has the following components: 352

352 Zinkevich et al. (2007) and Lanctot, Waugh, Zinkevich, & Bowling (2009).

  • A finite set \(N\) of players. A finite set \(H\) of sequences, the possible histories of actions, such that the empty sequence is in \(H\) and every prefix of a sequence in \(H\) is also in \(H\). Define \(h \sqsubseteq h'\) to mean \(h\) is a prefix of \(h'\). \(Z \subseteq H\) are the terminal histories (those which are not a prefix of any other sequences). \(A(h) = \{a : ha \in H\}\) are the actions available after a non-terminal history, \(h \in H \backslash Z\).
  • A function \(P\) that assigns to each non-terminal history a member of \(N \cup \{c\}\). \(P\) is the player function. \(P(h)\) is the player who takes an action after the history \(h\). If \(P(h) = c\) then chance determines the action taken after history \(h\).
  • For each player \(i \in N \cup \{c\}\) a partition \(\mathcal{I}_i\) of \(\{h \in H : P (h) = i\}\) with the property that \(A(h) = A(h')\) whenever \(h\) and \(h'\) are in the same member of the partition. For \(I \in \mathcal{I}_i\) we denote by \(A(I_i)\) the set \(A(h)\) and by \(P(I_i)\) the player \(P(h)\) for any \(h \in I_i\) . \(\mathcal{I}_i\) is the information partition of player \(i\); a set \(I_i \in \mathcal{I}_i\) is an information set of player \(i\).
  • A function \(f_c\) that associates with every information set \(I\) where \(P(I) = c\), a probability measure \(f_c(a|I)\) on \(A(h)\); \(f_c(a|I)\) is the probability that \(a\) occurs given some \(h \in I\), where each such probability measure is independent of every other such measure.
  • For each player \(i \in N\) there is a utility function \(u_i\) from the terminal states \(Z\) to the reals \(\mathbb{R}\). If \(N = \{1, 2\}\) and \(u_1 = -u_2\), it is a zero-sum extensive game. Define \(\Delta_{u,i} \equiv \mathrm{max}_z \: u_i(z) - \mathrm{min}_z \: u_i(z)\) to be the range of utilities to player \(i\).

The player reach, \(\pi^{\sigma}_{i}(h)\), of a history \(h\) is the product of the probabilities for all agent \(i\) actions leading to \(h\). Formally, 353

353 N. Brown (2020), p. 6.

\[ \pi^{\sigma}_{i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') = i} \sigma_{i}(h', a') \]

Due to perfect recall, any two histories in infoset \(I_i\) have the same player reach for player \(i\). Thus, we similarly define the player reach \(\pi^{\sigma}_{i}(I_i)\) of infoset \(I_i\) as

\[ \pi^{\sigma}_{i}(I_i) \equiv \prod_{ {I'}_{i} \cdot a' \sqsubseteq I_i | P(I_i) = i } \sigma_{i}({I'}_{i}, a') = \left.\pi^{\sigma}_{i}(h)\right|_{h \in I_i} \]

The external reach AKA opponent reach, \(\pi^{\sigma}_{-i}(h)\), of a history \(h\) is the contribution of chance and all other players than \(i\). Formally,

\[ \pi^{\sigma}_{-i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') \neq i} \sigma_{P(h')}(h', a') \]

We also define the external reach of an infoset as

\[ \pi^{\sigma}_{-i}(I_i) \equiv \sum_{h \in I_{i}} \pi^{\sigma}_{-i}(h) \]

The counterfactual value of an infoset \(I\) is the expected utility to player \(i\) given that \(I\) has been reached, weighted by the external reach of \(I\) for player \(i\). Formally, 354

354 N. Brown (2020), p. 12.

\[ v(I) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

The counterfactual value of an action, \(a\), is

\[ v(I, a) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h \cdot a, z) \: u_{i}(z) \]

Let’s consider the case where, like in NLHE, our two private hole cards each make a single unique history \(h\), and we form infosets with a single hand, so \(I=h\). Then

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

making explicit the player reach and the external reach,

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi_{i}^{\sigma}(h, z) \: \pi_{-i}^{\sigma}(h, z) \: u_{i}(z) \]

At a leaf node where we finally calculate the rewards,

\[ v(z) = \pi^{\sigma}_{-i}(z) \: u_{i}(z) \]

TODO: explain CFR.

The instantaneous regret is

\[ r^{t}(I, a) = v^{\sigma^t}(I, a) - v^{\sigma^t}(I) \]

The (cumulative) counterfactual regret is

\[ R^{T}(I, a) = \sum_{t=1}^{T} r^{t}(I, a) \]

Similar to the regret-matching rule for the single-node game discussed above, applying regret matching during training means updating strategies according to the following rule.

\[ \sigma^{t+1}(I, a) \equiv \frac{ R^{t}_{+}(I, a) }{ \sum_{b \in A} R^{t}_{+}(I, b) } \]

The average strategy is

\[ \bar{\sigma}^{T}(I, a) = \frac{ \sum_{t=1}^{T} \pi^{t}_{i}(I) \: \sigma^{t}(I, a) }{ \sum_{t=1}^{T} \pi^{t}_{i}(I) } \]

Monte Carlo Counterfactual Regret Minimization (MCCFR)

355 Lanctot et al. (2009).

356 Neller & Lanctot (2013).

357 Burch, Lanctot, Szafron, & Gibson (2012).

358 Johanson, M. et al. (2012).

359 Schmid, M. et al. (2019).

360 Li, H. et al. (2020).

361 Habara, Fukuda, & Yamashita (2023).

362 Lanctot (2013).

363 Gibson (2014).

364 Johanson (2016).

365 Burch (2018).

366 Horacek (2022).

TODO: explain MCCFR.

External sampling MCCFR:

\[ \tilde{v}^{\sigma}_{i}(I) = \sum_{z \in Q} u_{i}(z) \: \pi^{\sigma}_{i}(z[I] \rightarrow z) \]

Best response and exploitability

Best response:

\[ \mathrm{BR}(\sigma_{-i}) = \underset{\sigma_{i}^{\prime}}{\mathrm{argmax}} \: u_{i}(\sigma_{i}^{\prime}, \sigma_{-i}) \]

TODO: Local Best Response (LBR). 367

367 Lisy & Bowling (2016), p. 2.

Exploitability:

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \mathrm{BR}(\sigma_{i})) \]

NashConv 368 exploitability uses the convention:

368 See NashConv exploitability defined in Lanctot, M. et al. (2017).

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \sigma_{-i}) \]

The average exploitability per player is

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} \varepsilon_{i}(\sigma) \]

Note that in zero-sum games, when summing over players, the second terms in NashConv sum to zero. 369

369 Timbers (2020), p. 3.

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) \]

In two-player games:

\[ \varepsilon(\sigma) = \frac{1}{2} \Big( u_{1}(\mathrm{BR}(\sigma_{2}), \sigma_{2}) + u_{2}(\sigma_{1}, \mathrm{BR}(\sigma_{1})) \Big) \]
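A tiny numerical example (assuming NumPy; the strategies are made up) of this two-player exploitability for rock-paper-scissors, where a best response is simply the best pure action:

```python
import numpy as np

payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])               # u_1(a1, a2); u_2 = -u_1

sigma1 = np.array([1/3, 1/3, 1/3])              # player 1 strategy
sigma2 = np.array([0.5, 0.25, 0.25])            # player 2 strategy

br1_value = np.max(payoff @ sigma2)             # u_1(BR(sigma2), sigma2)
br2_value = np.max(-(sigma1 @ payoff))          # u_2(sigma1, BR(sigma1))
exploitability = 0.5 * (br1_value + br2_value)
print(exploitability)                           # 0 iff (sigma1, sigma2) is a Nash equilibrium
```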

370 Johanson, Waugh, Bowling, & Zinkevich (2011).

371 Ponsen, De Jong, & Lanctot (2011).

372 Lisy & Bowling (2016).

373 Timbers (2020).

3.13.10.4 Solving poker

374 Kuhn (1950).

375 Southey, F. et al. (2012).

376 Billings, Davidson, Schaeffer, & Szafron (2002).

377 Billings, D. et al. (2003).

378 Johanson (2013).

379 Bowling, Burch, Johanson, & Tammelin (2015).

380 Heinrich & Silver (2016).

381 Moravcik, M. et al. (2017).

382 N. Brown & Sandholm (2017).

383 N. Brown & Sandholm (2018).

384 N. Brown & Sandholm (2019a).

385 N. Brown, Lerer, Gross, & Sandholm (2019).

386 N. Brown & Sandholm (2019b).

387 N. Brown, Bakhtin, Lerer, & Gong (2020).

388 N. Brown (2020).

389 Schmid, M. et al. (2021).

390 Kovarik, V. et al. (2022).

3.13.11 Applications in physics

391 Denby (1988).

392 Denby (1993).

393 Spears, B.K. et al. (2018).

394 Cranmer, Seljak, & Terao (2021).

395 Liu, Z. et al. (2024).

396 Jiao, L. et al. (2024).

See also:

3.14 Theoretical machine learning

3.14.1 Algorithmic information theory

397 Cilibrasi & Vitanyi (2005).

398 Hutter (2007).

399 Rathmanner & Hutter (2011).

3.14.2 No free lunch theorems

400 Wolpert & Macready (1995).

401 Wolpert (1996).

402 Wolpert & Macready (1997).

403 Shalev-Shwarz & Ben-David (2014), p. 60–66.

404 McDermott (2019).

405 Wolpert (2007).

406 Wolpert & Kinney (2020).

407 Wolpert (2023).

408 Mitchell (1980).

409 Roberts (2021).

410 Goldreich & Ron (1997).

411 Joyce & Herrmann (2017).

412 H. W. Lin, Tegmark, & Rolnick (2017).

413 Lauc (2020).

414 Nakkiran (2021).

415 Bousquet, O. et al. (2021).

416 Andrews (2023).

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available. 417

417 Raissi, Perdikaris, & Karniadakis (2017a), p. 2.

Roberts:

From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon. 418

418 Roberts (2021), p. 7.

  • TODO: Note Dennett’s discussion of compression. 419

419 Dennett (1991), p. TODO.

3.14.3 Connectivists vs symbolicists

3.14.4 Graphical tensor notation

3.14.5 Universal approximation theorem

420 Minsky & Papert (1969), p. TODO.

421 Hornik, Stinchcombe, & White (1989).

422 Lu, Z. et al. (2017).

423 H. Lin & Jegelka (2018).

424 Ismailov (2020).

425 Bishop (2006), p. 230.

3.14.6 Relationship to statistical mechanics

426 Opper & Kinzel (1996).

427 Opper (2001).

428 Bahri, Y. et al. (2020).

429 Halverson, Maiti, & Stoner (2020).

430 Canatar, Bordelon, & Pehlevan (2020).

431 Roberts, Yaida, & Hanin (2021).

432 Cantwell (2022).

433 Dinan, Yaida, & Zhang (2023).

434 Sohl-Dickstein (2020).

435 Aifer, M. et al. (2023).

436 Geshkovski, Letrouit, Polyanskiy, & Rigollet (2023).

3.14.7 Relationship to gauge theory

Invariant:

\[ f(g x) = f(x) \]

Equivariant:

\[ f(g x) = g' f(x) \]

Same-equivariance is the special case where \(g' = g\).
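A small numerical illustration (assuming NumPy; the circular convolution and cyclic shift are illustrative choices): a sum-pooled feature is invariant under a cyclic shift \(g\), while the convolution is same-equivariant, \(f(gx) = g\,f(x)\):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=16)
k = np.array([1.0, -2.0, 1.0])                 # convolution kernel of length 3

def conv(x):                                   # circular convolution with kernel k
    return np.array([np.dot(k, np.roll(x, -i)[:3]) for i in range(len(x))])

g = lambda v: np.roll(v, 4)                    # group action: cyclic shift by 4
print(np.isclose(x.sum(), g(x).sum()))         # invariance of the pooled feature
print(np.allclose(conv(g(x)), g(conv(x))))     # equivariance of the convolution
```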

437 Dieleman, Fauw, & Kavukcuoglu (2016).

438 T. S. Cohen & Welling (2016).

439 T. S. Cohen, Weiler, Kicanaoglu, & Welling (2019).

440 Fuchs, Worrall, Fischer, & Welling (2020).

441 Bogatskiy, A. et al. (2023).

442 Marchetti, Hillar, Kragic, & Sanborn (2023).

443 Battiloro, C. et al. (2024).

444 Erdogan & Lucic (2025).

3.14.8 Thermodynamics of computation

445 Bérut, A. et al. (2012).

446 Bérut, Petrosyan, & Ciliberto (2015).

3.15 Information geometry

3.15.1 Introduction

447 Smith (2019).

448 Nielsen (2020).

449 Amari (1998).

450 Amari (2016).

3.15.2 Geometric understanding of classical statistics

451 Balasubramanian (1996a).

452 Balasubramanian (1996b).

453 Calin & Udriste (2014).

454 de Carvalho, Page, & Barney (2019).

3.15.3 Geometric understanding of deep learning

455 Lei, Luo, Yau, & Gu (2018).

456 Gao & Chaudhari (2020).

457 Bronstein, Bruna, Cohen, & Velickovic (2021).

3.16 Automation

3.16.1 AutoML

  • Neural Architecture Search (NAS)
  • AutoML frameworks
  • RL-driven NAS
  • learned sparsity

3.16.2 Surrogate models

458 Fefferman, Mitter, & Narayanan (2016).

459 Raissi et al. (2017a) and Raissi, Perdikaris, & Karniadakis (2017b).

460 Karniadakis, G.E. et al. (2021).

461 Howard, Mandt, Whiteson, & Yang (2021).

462 Thuerey, N. et al. (2021).

463 Cranmer, Pavez, & Louppe (2015).

464 Cranmer, Brehmer, & Louppe (2019).

465 Baydin, A.G. et al. (2019).

Lectures:

3.16.3 AutoScience

466 C. Anderson (2008).

467 Asch, M. et al. (2018).

468 D’Agnolo & Wulzer (2019).

469 Krenn, M. et al. (2022).

470 Udrescu & Tegmark (2020).

471 Cranmer, M. et al. (2020).

472 Z. Liu, Madhavan, & Tegmark (2022).

473 Lu, C. et al. (2024).

Figure 3.14: The inference cycle for the process of scientific inquiry. The three distinct forms of inference (abduction, deduction, and induction) facilitate an all-encompassing vision, enabling HPC and HDA to converge in a rational and structured manner. HPC: high-performance computing; HDA: high-end data analysis. 474

474 Asch, M. et al. (2018).

See also:

3.17 Implications for the realism debate

3.17.1 Introduction

475 Korb (2001).

476 Williamson (2009).

477 Bensusan (2000).

See also:

3.17.2 Real clusters

  • Nope: Hennig

See also:

3.17.3 Word meanings

478 Perone (2018).

479 Tenney, I. et al. (2019).

480 Nissim, Noord, & Goot (2019).

481 Skelac & Jandric (2020).

482 Bender & Koller (2020).

483 Elhage, N. et al. (2022).

484 Patel & Pavlick (2022).

485 Lovering & Pavlick (2022).

486 Kornai (2023).

487 Huh, Cheung, Wang, & Isola (2024).

488 Jamali, M. et al. (2024).

Wittgenstein in PI:

The meaning of a word is its use in the language. 489

489 Wittgenstein (2009), §43.

and

One cannot guess how a word functions. One has to look at its use, and learn from that. 490

490 Wittgenstein (2009), §340.

Piantadosi:

Modern large language models integrate syntax and semantics in the underlying representations: encoding words as vectors in a high-dimensional space, without an effort to separate out e.g. part of speech categories from semantic representations, or even predict at any level of analysis other than the literal word. Part of making these models work well was in determining how to encode semantic properties into vectors, and in fact initializing word vectors via encodings of distribution semantics from e.g. Mikolov et al. 2013 (Radford et al. 2019). Thus, an assumption of the autonomy of syntax is not required to make models that predict syntactic material and may well hinder it. 491

491 Piantadosi (2023), p. 15.

See also:

3.18 My thoughts

My docs:

My talks:

3.19 Annotated bibliography

3.19.1 Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge.

3.19.1.1 My thoughts

  • TODO

3.19.2 Cowan, G. (1998). Statistical Data Analysis.

3.19.2.1 My thoughts

  • TODO

3.19.3 James, F. (2006). Statistical Methods in Experimental Physics.

3.19.3.1 My thoughts

  • TODO

3.19.4 Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.

  • Cowan et al. (2011)
  • Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

3.19.4.1 My thoughts

  • TODO

3.19.5 ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.

3.19.5.1 My thoughts

  • TODO

3.19.6 Cranmer, K. (2015). Practical statistics for the LHC.

3.19.6.1 My thoughts

  • TODO