3  Philosophy of statistics

Published

January 27, 2026

Statistical analysis is central to addressing the problem of induction. Can inductive inference be formalized? What are the caveats? Can inductive inference be automated? How does machine learning work?

All knowledge is, in final analysis, history. All sciences are, in the abstract, mathematics. All judgements are, in their rationale, statistics.

– C. R. Rao 1

1 Rao (1997), p. x.

3.1 Introduction to the foundations of statistics

3.1.1 Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. The scientific method uses inductive inferences to make generalizations from finite data. This introduces unique avenues of error not found in purely deductive inferences, as in logic and mathematics. A deductive inference is sound, and its conclusion follows necessarily, if the argument is valid and all of its premises obtain; an inductive inference, by contrast, can at best be probably (not certainly) sound, and can therefore still err in some cases because the support it lends its conclusion is ultimately probabilistic.

A skeptic may further probe if we are even justified in using the probabilities we use in inductive arguments. What is the probability the Sun will rise tomorrow? What kind of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.

See also:

3.1.2 Early investigators

  • Ibn al-Haytham (c. 965-1040)
    • “Ibn al-Haytham was an early proponent of the concept that a hypothesis must be supported by experiments based on confirmable procedures or mathematical evidence—an early pioneer in the scientific method five centuries before Renaissance scientists.” - Wikipedia
  • Gerolamo Cardano (1501-1576)
    • Book on Games of Chance (1564)
  • John Graunt (1620-1674)
  • Jacob Bernoulli (1655-1705)
    • Ars Conjectandi (1713, posthumous)
    • First modern phrasing of the problem of parameter estimation 2
    • See Hacking 3
    • Early vision of decision theory:

2 Edwards (1974), p. 9.

3 Hacking (1971).

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous. 4

4 Bernoulli, J. (1713). Ars Conjectandi, Chapter II, Part IV, defining the art of conjecture [wikiquote].

5 Venn (1888).

6 Jevons (1873a).

7 Jevons (1873b).

3.1.3 Foundations of modern statistics

  • Central limit theorem
  • Charles Sanders Peirce (1839-1914)
    • Formulated modern statistics in “Illustrations of the Logic of Science”, a series published in Popular Science Monthly (1877-1878), and also “A Theory of Probable Inference” in Studies in Logic (1883). 8
    • Introduced blinded, randomized controlled experiments with a repeated-measures design (before Fisher).
  • Karl Pearson (1857-1936)
    • The Grammar of Science (1892)
    • “On the criterion that a given system of deviations…” (1900) 9
      • Proposed testing the validity of hypothesized values by evaluating the \(\chi^2\) distance between the hypothesized and the empirically observed values and quantifying the discrepancy with a \(p\)-value.
    • With Walter Frank Raphael Weldon, he established the journal Biometrika in 1901.
    • Founded the world’s first university statistics department at University College, London in 1911.
  • John Maynard Keynes (1883-1946)
    • Keynes, J. M. (1921). A Treatise on Probability. 10
  • Ronald Fisher (1890-1972)
    • Fisherian significance testing of the null hypothesis (\(p\)-values)
      • “On an absolute criterion for fitting frequency curves” 11
      • “Frequency distribution of the values of the correlation coefficient in samples of indefinitely large population” 12
    • “On the ‘probable error’ of a coefficient of correlation deduced from a small sample” 13
      • Definition of likelihood
      • ANOVA
    • Statistical Methods for Research Workers (1925)
    • The Design of Experiments (1935)
    • “Statistical methods and scientific induction” 14
    • The Lady Tasting Tea 15
  • Jerzy Neyman (1894-1981)
  • Egon Pearson (1895-1980)
    • Neyman-Pearson hypothesis tests and confidence intervals with fixed error probabilities (also \(p\)-values, but considering two hypotheses introduces two types of error)
  • Harold Jeffreys (1891-1989)
    • objective (non-informative) Jeffreys priors
  • Andrey Kolmogorov (1903-1987)
  • C.R. Rao (1920-2023)
  • Ray Solomonoff (1926-2009)
  • Shun’ichi Amari (b. 1936)
  • Judea Pearl (b. 1936)

8 Peirce (1883), p. 126–181.

9 Pearson (1900).

10 Keynes (1921).

11 Fisher (1912).

12 Fisher (1915).

13 Fisher (1921).

14 Fisher (1955).

15 Salsburg (2001).

16 Reid (1998).

17 Neyman (1955).

18 Carnap (1960).

3.1.4 Pedagogy

19 Stuart, Ord, & Arnold (2010).

20 F. James (2006).

21 Cowan (1998) and Cowan (2016).

22 Cranmer (2015).

23 Jaynes (2003).

24 Lista (2016b).

25 Lista (2016a).

26 Cox (2006).

27 Behnke, Kröninger, Schott, & Schörner-Sadenius (2013).

28 Cousins (2018).

29 Weisberg (2019).

30 Gelman & Vehtari (2021).

31 Otsuka (2023).

3.3 Statistical models

3.3.1 Parametric models

46 McCullagh (2002).

3.3.2 Canonical distributions

3.3.2.1 Bernoulli distribution

\[ \mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \]

which can also be written as

\[ \mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\} \]

or

\[ \mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\} \]

  • Binomial distribution
  • Poisson distribution

TODO: explain, another important relationship is

Figure 3.1: Relationships among Bernoulli, binomial, categorical, and multinomial distributions.

3.3.2.2 Normal/Gaussian distribution

\[ N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \]

and in \(k\) dimensions:

\[ N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^\intercal \:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \]

where \(\boldsymbol{\Sigma}\) is the covariance matrix of the distribution (defined in eq. 3.2).
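A minimal numerical check of the two densities above, assuming NumPy and SciPy are available; the particular numbers are arbitrary.

```python
import numpy as np
from scipy import stats

# 1-D Gaussian: closed form vs. scipy.stats.norm
x, mu, sigma2 = 1.3, 0.5, 2.0
pdf_formula = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
assert np.isclose(pdf_formula, stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

# k-dimensional Gaussian: closed form vs. scipy.stats.multivariate_normal
mu_vec = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # covariance matrix
xv = np.array([0.2, 0.8])
d = xv - mu_vec
k = len(mu_vec)
pdf_formula = ((2 * np.pi)**(-k / 2) * np.linalg.det(Sigma)**(-0.5)
               * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))
assert np.isclose(pdf_formula, stats.multivariate_normal(mu_vec, Sigma).pdf(xv))
```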

3.3.3 Central limit theorem

Let \(X_{1}\), \(X_{2}\), … , \(X_{n}\) be a random sample drawn from any distribution with a finite mean \(\mu\) and variance \(\sigma^{2}\). As \(n \rightarrow \infty\), the distribution of the standardized sample mean approaches the standard normal:

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
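A small simulation illustrating the theorem, assuming NumPy: standardized means of exponential samples (a visibly skewed distribution) behave like \(N(0,1)\) draws for large \(n\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0               # mean and standard deviation of Exponential(1)
n, n_experiments = 1000, 100_000

samples = rng.exponential(scale=1.0, size=(n_experiments, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))   # standardized sample means

print(z.mean(), z.std())           # ~0 and ~1
print(np.mean(np.abs(z) < 1.96))   # ~0.95, as for a standard normal
```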

  • \(\chi^2\) distribution
  • Univariate distribution relationships
Figure 3.2: Detail of a figure showing relationships among univariate distributions. See the full figure here 47.

47 Leemis & McQueston (2008).

  • The exponential family of distributions are maximum entropy distributions.

3.3.4 Mixture models

48 Cranmer, K. et al. (2012).

3.4 Point estimation and confidence intervals

3.4.1 Inverse problems

Recall that in the context of parametric models, the data, \(x_i\), are modeled by a pdf, \(f(x_i ; \theta_j)\), with parameters, \(\theta_j\). In a statistical inverse problem, the goal is to infer values of the model parameters, \(\theta_j\), given some finite set of data, \(\{x_i\}\), sampled from a probability density, \(f(x_i; \theta_j)\), that models the data reasonably well 49.

49 This assumption that the model describes the data “reasonably well” means that, to the degree required by your analysis, the important features of the data are matched within the systematic uncertainties parametrized in the model. If the model is incomplete because it is missing an important feature of the data, then this is the “ugly” (class-3) error in the Sinervo classification of systematic uncertainties.

50 Cowan (1998) and Cowan (2016), p. TODO.

3.4.2 Bias and variance

The bias of an estimator, \(\hat\theta\), is defined as

\[ \mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \]

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. 3.1) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

\[ \mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \]

The MSE of an estimator can be related to its bias and its variance by the following proof:

\[ \begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

noting that

\[ \mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2 \]

and

\[ \begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \left(\mathbb{E}(\hat{\theta} - \theta)\right)^2 = \left(\mathbb{E}(\hat{\theta}) - \theta\right)^2 \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

we see that MSE is equivalent to

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \]

For an unbiased estimator, the MSE is the variance of the estimator.
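A numerical sanity check of \(\mathrm{MSE} = \mathrm{Var} + \mathrm{Bias}^2\), assuming NumPy, using a deliberately biased estimator (the sample mean plus a constant offset) of a Gaussian mean; the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 3.0
# estimator: mean of 20 samples, plus a deliberate bias of +0.5
estimates = rng.normal(theta_true, 1.0, size=(200_000, 20)).mean(axis=1) + 0.5

mse = np.mean((estimates - theta_true)**2)
var = estimates.var()
bias = estimates.mean() - theta_true
print(mse, var + bias**2)   # the two agree up to Monte Carlo error
```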

TODO:

  • Note the discussion of the bias-variance tradeoff by Cranmer.
  • Note the new deep learning view. See Deep learning.

See also:

3.4.3 Maximum likelihood estimation

A maximum likelihood estimator (MLE) was first used by Fisher. 51

51 Aldrich (1997).

\[\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is equivalent to maximizing \(L(\theta)\), and the former is more convenient because for data that are independent and identically distributed (i.i.d.) the joint likelihood can be factored into a product of individual measurements:

\[ L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta) \]

and taking the log of the product makes it a sum:

\[ \mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is also equivalent to minimizing \(-\mathrm{log} \: L(\theta)\), the negative log-likelihood (NLL). For distributions that are i.i.d.,

\[ \mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i \]
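A minimal sketch of MLE in practice, assuming NumPy and SciPy: numerically minimize the NLL of i.i.d. Gaussian data over \((\mu, \sigma)\). The synthetic data and starting values are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # observed data

def nll(params):
    mu, log_sigma = params                      # parametrize sigma > 0 via its log
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and (biased) sample std
```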

3.4.3.1 Invariance of likelihoods under reparametrization

  • Likelihoods are invariant under reparametrization. 52
  • Bayesian posteriors are not invariant in general.

52 F. James (2006), p. 234.

See also:

3.4.3.2 Ordinary least squares

  • Least squares from MLE of Gaussian models: \(\chi^2\)
  • Ordinary Least Squares (OLS)
  • Geometric interpretation

53 Cox (2006), p. 11.

54 Murphy (2012), p. 222.

3.4.4 Variance of MLEs

55 Fréchet (1943), Cramér (1946), Rao (1945), and Rao (1947).

56 Rice (2007), p. 300–2.

57 Nielsen (2013).

58 Cowan (1998), p. 130-5.

59 F. James (2006), p. 234.

60 F. James & Roos (1975).

Figure 3.3: Transformation of non-parabolic log-likelihood to parabolic (source: my slides, recreation of F. James (2006), p. 235).

61 Loh (1987).

62 Wainer (2007).

63 Tegmark, Taylor, & Heavens (1997).

3.4.5 Bayesian credibility intervals

3.4.6 Uncertainty on measuring an efficiency

64 Clopper & Pearson (1934).

65 Agresti & Coull (1998).

66 Hanley & Lippman-Hand (1983).

67 L. D. Brown, Cai, & DasGupta (2001).

68 Casadei (2012).

3.4.7 Examples

3.5 Statistical hypothesis testing

3.5.1 Null hypothesis significance testing

  • Karl Pearson observing how rare sequences of roulette spins are
  • Null hypothesis significance testing (NHST)
  • goodness of fit
  • Fisher

Fisher:

[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. 69

69 Fisher (1935), p. 16.

3.5.2 Neyman-Pearson theory

3.5.2.1 Introduction

70 Goodman (1999a). p. 998.

71 J. Cohen (1992).

72 ATLAS and CMS Collaborations (2011).

73 Cowan, Cranmer, Gross, & Vitells (2011).

Figure 3.4: TODO: ROC explainer. (Wikimedia, 2015).

See also:

3.5.2.2 Neyman-Pearson lemma

Neyman-Pearson lemma: 74

74 Neyman & Pearson (1933).

For a fixed signal efficiency, \(1-\alpha\), the selection that corresponds to the lowest possible misidentification probability, \(\beta\), is given by

\[ \frac{L(H_1)}{L(H_0)} > k_{\alpha} \]

where \(k_{\alpha}\) is the cut value required to achieve a type-1 error rate of \(\alpha\).

Neyman-Pearson test statistic:

\[ q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \]

Profile likelihood ratio:

\[ \lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \]

where \(\hat{\mu}\) and \(\hat{\theta}\) are the (unconditional) maximum-likelihood estimators that maximize \(L\), while \(\hat{\theta}_\mu\) is the conditional maximum-likelihood estimator that maximizes \(L\) for a specified signal strength, \(\mu\), and the vector \(\theta\) includes the nuisance parameters and any other parameters of interest.
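A toy illustration of the Neyman-Pearson test statistic, assuming NumPy and SciPy, for a single-bin counting experiment: \(H_0\) is background-only with mean \(b\), \(H_1\) is signal-plus-background with mean \(s+b\). The yields and observed count are hypothetical.

```python
import numpy as np
from scipy.stats import poisson

s, b = 10.0, 50.0      # hypothetical expected signal and background yields
n_obs = 62             # hypothetical observed count

def q_np(n):
    # q_NP = -2 ln [ L(H1) / L(H0) ] for a Poisson likelihood
    return -2 * (poisson.logpmf(n, s + b) - poisson.logpmf(n, b))

print(q_np(n_obs))

# Distribution of q_NP under H0 from toy experiments; the alpha-quantile gives a
# cut value such that P(q_NP < cut | H0) = alpha (a type-1 error rate of alpha).
toys = poisson.rvs(b, size=100_000, random_state=0)
print(np.quantile(q_np(toys), 0.05))
```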

3.5.2.3 Neyman construction

Cranmer: Neyman construction.

Figure 3.5: Neyman construction for a confidence belt for \(\theta\) (source: K. Cranmer, 2020).

TODO: fix

\[ q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \]

3.5.2.4 Flip-flopping

  • Flip-flopping and Feldman-Cousins confidence intervals 75

75 Feldman & Cousins (1998).

3.5.3 p-values and significance

  • \(p\)-values and significance 76
  • Coverage
  • Fisherian vs Neyman-Pearson \(p\)-values

76 Sinervo (2002) and Cowan (2012).

Cowan et al. define a \(p\)-value as

a probability, under assumption of \(H\), of finding data of equal or greater incompatibility with the predictions of \(H\). 77

77 Cowan et al. (2011), p. 2–3.

Also:

It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the \(p\)-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery”. 78

78 Cowan et al. (2011), p. 3.
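Converting between a one-sided \(p\)-value and a Gaussian significance \(Z = \Phi^{-1}(1-p)\), the convention used in Cowan et al. (2011); a minimal sketch assuming SciPy.

```python
from scipy.stats import norm

def p_to_z(p):
    return norm.isf(p)    # inverse survival function, Phi^{-1}(1 - p)

def z_to_p(z):
    return norm.sf(z)     # survival function, 1 - Phi(z)

print(p_to_z(0.05))       # ~1.64
print(z_to_p(5.0))        # ~2.9e-7, the "5 sigma" discovery convention
```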

3.5.3.1 Upper limits

79 Cousins & Highland (1992).

3.5.3.2 CLs method

  • Conservative coverage; used in particle physics
  • Junk 80
  • Read 81
  • ATLAS 82

80 Junk (1999).

81 Read (2002).

82 ATLAS Statistics Forum (2011).

3.5.4 Asymptotics

83 Wilks (1938).

84 Wald (1943).

85 Cowan et al. (2011).

86 Cowan, Cranmer, Gross, & Vitells (2012).

87 Bhattiprolu, Martin, & Wells (2020).

3.5.5 Student’s t-test

  • Student’s t-test
  • ANOVA
  • A/B-testing

3.5.6 Decision theory

88 Murphy (2012), p. 197.

89 Goodman (1999b).

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings. 90

90 Goodman (1999a). p. 995.

See also:

3.5.7 Examples

3.6 Uncertainty quantification

3.6.1 Sinervo classification of systematic uncertainties

  • Class-1, class-2, and class-3 systematic uncertainties (good, bad, ugly), classification by Pekka Sinervo (PhyStat2003) 91
  • Not to be confused with type-1 and type-2 errors in Neyman-Pearson theory
  • Heinrich, J. & Lyons, L. (2007). Systematic errors. 92
  • Caldeira & Nord 93

91 Sinervo (2003).

92 Heinrich & Lyons (2007).

93 Caldeira & Nord (2020).

Lyons:

In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error. 94

94 Lyons (2008), p. 890.

Figure 3.6: Classification of measurement uncertainties (philosophy-in-figures.tumblr.com, 2016).
  • Poincaré’s three levels of ignorance

3.6.2 Profile likelihoods

  • Profiling and the profile likelihood
    • Importance of Wald and Cowan et al.
    • hybrid Bayesian-frequentist method

3.6.3 Examples of poor estimates of systematic uncertainties

Figure 3.7: Demonstration of sensitivity to the jet energy scale for an alleged excess in \(Wjj\) by Tommaso Dorigo (2011) (see also: GIF).
  • OPERA. (2011). Faster-than-light neutrinos.
  • BICEP2 claimed evidence of B-modes in the CMB as evidence of cosmic inflation without accounting for cosmic dust.

3.6.4 Conformal prediction

95 Gammerman, Vovk, & Vapnik (1998).

3.7 Statistical classification

3.7.1 Introduction

  • Precision vs recall
  • Recall is sensitivity
  • Sensitivity vs specificity
  • Accuracy

3.7.2 Examples

  • TODO

See also:

3.8 Causal inference

3.8.1 Introduction

96 Rubin (1974).

97 Lewis (1981).

98 Pearl (2018).

99 Bareinboim, Correa, Ibeling, & Icard (2021).

See also:

3.8.2 Causal models

100 Pearl (2009).

101 Robins & Wasserman (1999).

102 Peters, Janzing, & Scholkopf (2017).

103 Lundberg, Johnson, & Stewart (2021).

3.8.3 Counterfactuals

104 Ismael (2023).

105 Chevalley, Schwab, & Mehrjou (2024).

3.9 Exploratory data analysis

3.9.1 Introduction

106 Tukey (1977).

3.9.2 Look-elsewhere effect

  • Look-elsewhere effect (LEE)
    • AKA File-drawer effect
  • Stopping rules
    • validation dataset
    • statistical issues, violates the likelihood principle

3.9.3 Archiving and data science

107 Chen, X. et al. (2018).

3.10 “Statistics Wars”

3.10.1 Introduction

  • Kruschke
  • Carnap
    • “The two concepts of probability” 108
  • Royall
    • “What do these data say?” 109

108 Carnap (1945).

109 Royall (1997), p. 171–2.

Cranmer:

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability \(P (A, B)\) in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability. 110

110 Cranmer (2015), p. 6.

Neyman:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. 111

111 Neyman & Pearson (1933).

Figure 3.8: From Kruschke. 112

112 Kruschke & Liddell (2018).

3.10.2 Likelihood principle

  • Likelihood principle
  • The likelihood principle is the proposition that, given a statistical model and a data sample, all the evidence relevant to model parameters is contained in the likelihood function.
  • The history of likelihood 113
  • Berger & Wolpert. (1988). The Likelihood Principle. 116

113 Edwards (1974).

114 Birnbaum (1962).

115 Hacking (1965).

116 Berger & Wolpert (1988).

O’Hagan:

The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference. 117

117 O’Hagan (2010), p. 17–18.

118 Gandenberger (2015).

119 Evans (2013).

120 Mayo (2014).

121 Mayo (2019).

122 Dawid (2014).

Mayo:

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true. 123

123 Mayo (2019).

3.10.3 Discussion

Lyons:

Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors. 124

124 Lyons (2008), p. 891.

125 Carnap (1952).

126 Carnap (1960).

127 Sznajder (2018).

128 Hacking (1965).

129 Neyman (1977).

130 Rozeboom (1960).

131 Meehl (1978).

132 Zech (1995).

133 Royall (1997).

134 Berger (2003).

135 Stuart et al. (2010), p. 460.

136 Mayo (1981).

137 Mayo (1996).

138 Mayo & Spanos (2006).

139 Mayo & Spanos (2011).

140 Mayo (2018).

141 Gelman & Hennig (2017).

142 Murphy (2012), ch. 6.6.

143 Murphy (2022), p. 195–198.

144 Gandenberger (2016).

145 Wakefield (2013), ch. 4.

146 Efron & Hastie (2016), p. 30–36.

147 Kruschke & Liddell (2018).

148 Steinhardt (2012).

149 Clayton (2021).

150 Wagenmakers (2021).

Figure 3.9: The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference (gandenberger.org/research/, 2015).

Goodman:

The idea that the \(P\) value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations). 151

151 Goodman (1999a). p. 999.

3.11 Replication crisis

3.11.1 Introduction

  • Ioannidis, J.P. (2005). Why most published research findings are false. 152

152 Ioannidis (2005).

3.11.2 p-value controversy

153 Wasserstein & Lazar (2016).

154 Wasserstein, Allen, & Lazar (2019).

155 Benjamin, D.J. et al. (2017).

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. 156

156 Fisher (1935), p. 13–14.

157 Rao & Lovric (2016).

158 Mayo (2021).

159 Gorard & Gorard (2016).

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application. 160

160 Benjamini, Y. et al. (2021), p. 1.

3.12 Classical machine learning

3.12.1 Introduction

  • Classification vs regression
  • Supervised and unsupervised learning
    • Classification = supervised; clustering = unsupervised
  • Hastie, Tibshirani, & Friedman 161
  • Information Theory, Inference, and Learning 162
  • Murphy, K.P. (2012). Machine Learning: A probabilistic perspective. MIT Press. 163
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An introduction. MIT Press. 164
  • Shalev-Shwarz, S. & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. 165
  • VC-dimension
    • Vapnik (1994) 166
    • Shalev-Shwarz, S. & Ben-David, S. (2014). 167

161 Hastie, Tibshirani, & Friedman (2009).

162 MacKay (2003).

163 Murphy (2012).

164 Murphy (2022), p. 195–198.

165 Shalev-Shwarz & Ben-David (2014).

166 Vapnik, Levin, & LeCun (1994).

167 Shalev-Shwarz & Ben-David (2014), p. 67–82.

3.12.2 History

168 McCarthy, Minsky, Rochester, & Shannon (1955).

169 Solomonoff (2016).

170 Kardum (2020).

171 McCulloch & Pitts (1943).

172 J. A. Anderson & Rosenfeld (1998).

173 Rosenblatt (1961).

174 Minsky & Papert (1969).

175 Cartwright (2001).

See also:

3.12.3 Logistic regression

From a probabilistic point of view, 176 logistic regression can be derived as maximum likelihood estimation of a vector of model parameters, \(\vec{w}\), whose dot product with the input features, \(\vec{x}\), is squashed by a logistic function to yield the probability, \(\mu\), of a Bernoulli random variable, \(y \in \{0, 1\}\).

176 Murphy (2012), p. 21.

\[ p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)} \]

The negative log-likelihood of multiple trials is

\[ \begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \end{align} \]

which is the cross entropy loss. Note that the first term is non-zero only when the true target is \(y_i=1\), and similarly the second term is non-zero only when \(y_i=0\). 177 Therefore, we can reparametrize the target \(y_i\) in favor of \(t_{ki}\) that is one-hot in an index \(k\) over classes.

177 Note: Label smoothing is a regularization technique that smears the activation over other labels, but we don’t do that here.

\[ \mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \]

where

\[ t_{ki} = \begin{cases} 1 & \mathrm{if}\ (k = y_i = 0)\ \mathrm{or}\ (k = y_i = 1) \\ 0 & \mathrm{otherwise} \end{cases} \]

and

\[ \mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases} \]

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, \(k\), only one term for the true class contributes.

\[ \mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{such\ that}\ t_{ki}=1} \tag{3.3}\]

Logistic regression uses the logit function 178, which is the logarithm of the odds—the ratio of the chance of success to failure. Let \(\mu\) be the probability of success in a Bernoulli trial, then the logit function is defined as

178 “Logit” was coined by Joseph Berkson (1899-1982).

\[ \mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \]

Logistic regression assumes that the logit of the probability, \(\mu\), is a linear function of the explanatory variable, \(x\).

\[ \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x \]

where \(\beta_0\) and \(\beta_1\) are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, \(\vec{x}\), where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, \(\beta_0\), in a dot product.

\[ \vec{x} = (1, x_1, x_2, \ldots, x_n)^\intercal \]

\[ \vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^\intercal \]

\[ \log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^\intercal \vec{x} \]

For the moment, let \(z \equiv \vec{w}^\intercal \vec{x}\). Exponentiating and solving for \(\mu\) gives

\[ \mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} } \]

This function is called the logistic or sigmoid function.

\[ \mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \]

Since we inverted the logit function by solving for \(\mu\), the inverse of the logit function is the logistic or sigmoid.

\[ \mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z) \]

And therefore,

\[ \mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^\intercal \vec{x}) \]

Figure 3.10: Logistic regression.
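A minimal NumPy sketch of the derivation above: fit \(\vec{w}\) by batch gradient descent on the Bernoulli NLL (cross-entropy). The data are synthetic and the learning rate and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # prepend 1 for the bias
w_true = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))            # Bernoulli targets

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(3)
eta = 0.1
for _ in range(2000):
    mu = sigmoid(X @ w)
    grad = X.T @ (mu - y) / n   # gradient of the mean NLL w.r.t. w
    w -= eta * grad

print(w, w_true)                # roughly recovers w_true (up to sampling noise)
```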

See also:

3.12.4 Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by \(k\)) as

\[ p(\vec{y} \,|\, \vec{\mu}) = \mathrm{Cat}(\vec{y} \,|\, \vec{\mu}) = \prod_k {\mu_k}^{y_k} \]

which is the categorical or multinoulli distribution. The negative-log likelihood of multiple independent trials is

\[ \mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \]

Noting again that \(y_{ki} = 1\) only when \(k\) is the true class, and is 0 otherwise, this simplifies to eq. 3.3.
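A small NumPy sketch of the categorical case: a softmax over logits and the cross-entropy loss, which keeps only the log-probability of the true class for each example. The logits and labels are arbitrary.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])   # two examples, three classes
y = np.array([0, 2])                    # true class indices

mu = softmax(logits)
cel = -np.log(mu[np.arange(len(y)), y]).sum()   # eq. 3.3: only the true class contributes
print(mu, cel)
```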

See also:

179 McFadden & Zarembka (1973).

180 Blondel, Martins, & Niculae (2020).

181 Goodfellow et al. (2016), p. 129.

3.12.5 Decision trees

182 Freund & Schapire (1997).

183 T. Chen & Guestrin (2016).

184 Aytekin (2022).

185 Grinsztajn, Oyallon, & Varoquaux (2022).

186 Coadou (2022).

3.12.6 Clustering

187 Hartigan (1985).

188 Slonim, Atwal, Tkacik, & Bialek (2005).

189 Batson, Haaf, Kahn, & Roberts (2021).

190 Hennig (2015).

191 Lauc (2020), p. 103–4.

192 Ronen, Finder, & Freifeld (2022).

193 Fang, Z. et al. (2022).

See also:

3.13 Deep learning

3.13.1 Introduction

194 Gardner & Dorling (1998).

195 Schmidhuber (2024).

196 Bengio (2009).

197 LeCun, Bengio, & Hinton (2015).

198 Sutskever (2015).

199 Goodfellow et al. (2016).

200 Kaplan, J. et al. (2019).

201 Rumelhart, Hinton, & Williams (1986).

202 Amari (1993).

203 LeCun & Bottou (1998).

204 Scardapane (2024).

205 Bottou (1998).

206 Norvig (2011).

207 Sutton (2019).

208 Frankle & Carbin (2018).

Figure 3.11: Raw input image is transformed into gradually higher levels of representation. 209

209 Bengio (2009).

3.13.2 Gradient descent

\[ \hat{f} = \underset{f \in \mathcal{F}}{\mathrm{argmin}} \underset{x \sim \mathcal{X}}{\mathbb{E}} L(f, x) \]

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

\[ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] \]

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), \(\vec{x}_{bt}\), and take a gradient descent step with each minibatch, averaging the gradient over the \(m\) samples \(\vec{x}_{i}\) in the minibatch (a minimal code sketch follows the symbol list below):

\[ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{i=1}^{m} \frac{\partial L(\vec{x}_{i})}{\partial \vec{w}}[t] \]

where

  • \(t \in \mathbf{N}\) is the learning step number
  • \(\eta\) is the learning rate
  • \(m\) is the number of samples in a minibatch, called the batch size
  • \(L\) is the loss function
  • \(\frac{\partial L}{\partial \vec{w}}\) is the gradient
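A minimal sketch of the SGD update, assuming NumPy, for a linear model with a squared-error loss; minibatches are drawn by shuffling the training set each epoch. The data, batch size, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)                  # model parameters
eta, m, n_epochs = 0.05, 32, 5   # learning rate, batch size, number of epochs

for _ in range(n_epochs):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), m):
        batch = idx[start:start + m]
        residual = X[batch] @ w - y[batch]
        grad = X[batch].T @ residual / len(batch)   # gradient of the mean 0.5*(x.w - y)^2 loss
        w -= eta * grad                             # one SGD step per minibatch

print(w, w_true)   # w approaches w_true
```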

3.13.3 Deep double descent

Papers:

210 Belkin, Hsu, Ma, & Mandal (2019).

211 Muthukumar, Vodrahalli, Subramanian, & Sahai (2019).

212 Nakkiran, P. et al. (2019).

213 Chang, Li, Oymak, & Thrampoulidis (2020).

214 Holzmüller (2020).

215 Dar, Muthukumar, & Baraniuk (2021).

216 Balestriero, Pesenti, & LeCun (2021).

217 Belkin (2021).

218 Nagarajan (2021).

219 Bach (2022), p. 225–230.

220 Ghosh & Belkin (2022).

221 Singh, Lucchi, Hofmann, & Schölkopf (2022).

222 Hastie, Montanari, Rosset, & Tibshirani (2022).

223 Bubeck & Sellke (2023).

224 Gamba, Englesson, Björkman, & Azizpour (2022).

225 Schaeffer, R. et al. (2023).

226 Yang & Suzuki (2023).

227 Maddox, Benton, & Wilson (2023).

Blogs:

228 Steinhardt (2022).

229 Henighan, T. et al. (2023).

Twitter threads:

3.13.4 Regularization

Regularization = any change we make to the training algorithm in order to reduce the generalization error but not the training error. 230

230 Mishra, D. (2020). Weight Decay == L2 Regularization?

Most common regularizations:

  • L2 Regularization
  • L1 Regularization
  • Data Augmentation
  • Dropout
  • Early Stopping

Papers:

231 S. Chen, Dobriban, & Lee (2020).

3.13.5 Batch size vs learning rate

Papers:

  1. Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

  2. Hoffer, E. et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

    • \(\eta \propto \sqrt{m}\)
  3. Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

    • \(\eta \propto m\)
  4. You, Y. et al. (2017). Large batch training of convolutional networks.

    • Layer-wise Adaptive Rate Scaling (LARS)
  5. You, Y. et al. (2017). ImageNet training in minutes.

    • Layer-wise Adaptive Rate Scaling (LARS)
  6. Jastrzebski, S. (2018). Three factors influencing minima in SGD.

    • \(\eta \propto m\)
  7. Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

  8. Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

    • \(m \propto \eta\)
  9. Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when \(m \ll M\), the covariance matrix of the weight update \(\mathrm{Cov(\eta \Delta\theta)}\) scales linearly with the quantity \(\eta^2/m\).

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update \(\eta \Delta\theta\). Conversely, to keep the scaling of the covariance of the weight update vector \(\eta \Delta\theta\) constant would require scaling \(\eta\) with the square root of the batch size \(m\) (Krizhevsky, 2014; Hoffer et al., 2017).

  10. Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.
    • Post-local SGD.

  11. Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as \(\eta \propto \sqrt{m}\) attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

  12. McCandlish, S. et al. (2018). An empirical model of large-batch training.
    • Critical batch size
  13. Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling (\(b\)-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

  14. Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.
    • \(\eta \propto m\)

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

  15. You, Y. et al. (2019). Large-batch training for LSTM and beyond.
    • Warmup and use \(\eta \propto m\)

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

  16. You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.
    • LARS and LAMB
  17. Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

  18. Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

  19. Kaplan, J. et al. (2020). Scaling laws for neural language models.

  20. Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.

Blogs:

3.13.6 Normalization

232 Chiley, V. et al. (2019).

233 Kiani, Balestriero, Lecun, & Lloyd (2022).

234 Huang, L. et al. (2020).

3.13.7 Finetuning

235 Hu, E.J. et al. (2021).

236 Dettmers, Pagnoni, Holtzman, & Zettlemoyer (2023).

237 Zhao, J. et al. (2024).

238 Huh, M. et al. (2024).

3.13.8 Computer vision

239 Fukushima & Miyake (1982).

240 LeCun, Y. et al. (1989).

241 LeCun, Bottou, Bengio, & Haffner (1998).

242 Ciresan, Meier, Masci, & Schmidhuber (2012).

243 Krizhevsky, Sutskever, & Hinton (2012).

244 Simonyan & Zisserman (2014).

245 He, Zhang, Ren, & Sun (2015).

246 Haber & Ruthotto (2017) and Haber, E. et al. (2018).

247 Howard, A.G. et al. (2017).

248 R. T. Q. Chen, Rubanova, Bettencourt, & Duvenaud (2018).

249 Tan & Le (2019).

250 Dosovitskiy, A. et al. (2020).

251 Tan & Le (2021).

252 H. Liu, Dai, So, & Le (2021).

253 Dhariwal & Nichol (2021).

254 Liu, Y. et al. (2021).

255 Ingrosso & Goldt (2022).

256 Park & Kim (2022).

257 Zhao, Y. et al. (2023).

258 Nakkiran, Bradley, Zhou, & Advani (2024).

Resources:

3.13.9 Natural language processing

3.13.9.1 Introduction

259 Firth (1957).

260 Nirenburg (1996).

261 Hutchins (2000).

262 Jurafsky & Martin (2022).

263 Z. Liu, Lin, & Sun (2023).

3.13.9.2 word2vec

264 Mikolov, Chen, Corrado, & Dean (2013), Mikolov, Yih, & Zweig (2013), and Mikolov, T. et al. (2013).

265 Kun (2018), p. 176–8.

3.13.9.3 RNNs

266 Hochreiter & Schmidhuber (1997).

267 Graves (2013).

Chain rule of language modeling (chain rule of probability):

\[ P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

or for the whole sequence:

\[ P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

\[ = P(x_1) \: P(x_2 | x_1) \: P(x_3 | x_1 x_2) \: P(x_4 | x_1 x_2 x_3) \ldots \]

A language model (LM) predicts the next token given previous context. The output of the model is a vector of logits, which is given to a softmax to convert to probabilities for the next token.

\[ P(x_t | x_1 \ldots x_{t-1}) = \mathrm{softmax}\left( \mathrm{model}(x_1 \ldots x_{t-1}) \right) \]

Auto-regressive inference follows this chain rule. If done with greedy search:

\[ \hat{x}_t = \underset{x_t \in V}{\mathrm{argmax}} \: P(x_t | x_1 \ldots x_{t-1}) \]
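A minimal sketch of greedy autoregressive decoding, assuming NumPy. The `model` here is a hypothetical stand-in returning next-token logits; any trained LM exposing this interface would do, and the toy model and vocabulary size below are for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(model, context, max_new_tokens, eos_id=None):
    tokens = list(context)
    for _ in range(max_new_tokens):
        probs = softmax(model(tokens))       # P(x_t | x_1 ... x_{t-1})
        next_token = int(np.argmax(probs))   # greedy search: take the argmax token
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens

# Toy "model": always favors token (last_token + 1) mod V.
V = 10
toy_model = lambda toks: np.eye(V)[(toks[-1] + 1) % V] * 5.0
print(greedy_decode(toy_model, context=[0], max_new_tokens=5))   # [0, 1, 2, 3, 4, 5]
```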

Beam search:

268 Sutskever, Vinyals, & Le (2014), p. 4.

269 Zhang (1998).

270 Zhou & Hansen (2005).

271 Collobert, Hannun, & Synnaeve (2019).

Backpropagation through time (BPTT):

272 Werbos (1990).

Neural Machine Translation (NMT):

  • Sutskever seq2seq 273
  • Bahdanau attention 274 and GNMT 275
  • Review by Stahlberg 276

273 Sutskever et al. (2014).

274 Bahdanau, Cho, & Bengio (2015).

275 Wu, Y. et al. (2016).

276 Stahlberg (2019).

3.13.9.4 Transformers

Figure 3.12: Diagram of the Transformer model (source: d2l.ai).

\[ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\, K^\intercal}{\sqrt{d_k}}\right) V \]
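A direct NumPy transcription of the scaled dot-product attention formula above (single head, no masking), as a minimal sketch; the shapes are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ V  # weighted sum of value vectors

rng = np.random.default_rng(5)
Q = rng.normal(size=(4, 8))    # 4 query vectors of dimension d_k = 8
K = rng.normal(size=(6, 8))    # 6 key vectors
V = rng.normal(size=(6, 16))   # 6 value vectors of dimension 16
print(attention(Q, K, V).shape)   # (4, 16)
```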

3.13.9.4.1 Attention and Transformers

277 Vaswani, A. et al. (2017).

278 Devlin, Chang, Lee, & Toutanova (2018).

279 Liu, Y. et al. (2019).

280 Raffel, C. et al. (2019).

281 Lan, Z. et al. (2019).

282 Lewis, M. et al. (2019).

283 Radford, Narasimhan, Salimans, & Sutskever (2018).

284 Radford, A. et al. (2019).

285 Brown, T.B. et al. (2020).

286 Yang, Z. et al. (2019).

287 Zaheer, M. et al. (2020).

288 Edelman, Goel, Kakade, & Zhang (2021).

289 Tay, Dehghani, Bahri, & Metzler (2022).

290 Phuong & Hutter (2022).

291 Chowdhery, A. et al. (2022).

292 Ouyang, L. et al. (2022).

293 Wolfram (2023).

294 OpenAI (2023).

295 Mohamadi, S. et al. (2023).

296 Zhao, W.X. et al. (2023).

297 Golovneva, Wang, Weston, & Sukhbaatar (2024).

Figure 3.13: Diagram of the BERT model (source: peltarion.com).
3.13.9.4.2 Computational complexity of transformers
3.13.9.4.3 Efficient transformers

298 Dao, T. et al. (2022).

3.13.9.4.4 What comes after Transformers?

299 Gu, Goel, & Ré (2021).

300 Merrill & Sabharwal (2022).

301 Bulatov, Kuratov, & Burtsev (2022).

302 Bulatov, Kuratov, & Burtsev (2023).

303 Bertsch, Alon, Neubig, & Gormley (2023).

304 Mialon, G. et al. (2023).

305 Peng, B. et al. (2023).

306 Sun, Y. et al. (2023).

307 Gu & Dao (2023).

308 Wang, H. et al. (2023).

309 Ma, S. et al. (2024).

310 Ma, X. et al. (2024).

311 Bhargava, Witkowski, Shah, & Thomson (2023).

312 Dao & Gu (2024).

313 Banerjee, Agarwal, & Singla (2024).

314 Jingyu Liu et al (2025).

3.13.9.5 Evaluation methods

3.13.9.6 Scaling laws in NLP

315 Hestness, J. et al. (2017).

316 Church & Hestness (2019).

317 Kaplan, J. et al. (2020).

318 Rae, J.W. et al. (2022).

319 Hoffmann, J. et al. (2022).

320 Caballero, Gupta, Rish, & Krueger (2022).

321 Muennighoff, N. et al. (2023).

322 Pandey (2024).

323 Bach (2024).

324 Finzi, M. et al. (2025).

3.13.9.7 Language understanding

325 Mahowald, K. et al. (2023).

326 Kosinski (2023).

See also:

3.13.9.8 Interpretability

327 Watson & Floridi (2019).

328 Gurnee, W. et al. (2023).

329 Meng, Bau, Andonian, & Belinkov (2023).

330 McDougall, C. et al. (2023).

Linear probes:

331 Alain & Bengio (2016).

332 Belinkov (2022).

333 Gurnee & Tegmark (2023).

3.13.10 Reinforcement learning

Pedagogy:

334 Sutton & Barto (2018).

335 Arulkumaran, Deisenroth, Brundage, & Bharath (2017).

336 Cesa-Bianchi & Lugosi (2006).

Tutorials:

More:

337 Xu, Hasselt, & Silver (2018).

338 Silver, Singh, Precup, & Sutton (2024).

339 Javed & Sutton (2024).

See also:

3.13.10.1 Q-learning

  • Q-learning and DQN
  • Uses the Markov Decision Process (MDP) framework
  • The Bellman equation 340
  • Q-learning is a value-based learning algorithm. Value-based algorithms update the value function based on an equation (particularly the Bellman equation), whereas the other type, policy-based algorithms, estimate the value function with a greedy policy obtained from the last policy improvement (source: towardsdatascience.com).
  • DQN masters Atari 341

340 Bellman (1952).

341 Mnih, V. et al. (2013) and Mnih, V. et al. (2015).

3.13.10.2 AlphaZero

342 Silver, D. et al. (2016).

343 Silver, D. et al. (2017b).

344 Silver, D. et al. (2017a).

3.13.10.3 Regret minimization

Regret matching (RM)

345 Hart & Mas‐Colell (2000).

Consider a game like rock-paper-scissors, where there is only one action per round. Let \(v^{t}(a)\) be the value observed when playing action \(a\) on iteration \(t\).

TODO: explain that the entire rewards vector, \(v^{t}(a)\), over \(a\) is observable after the chosen action is played.

Let a strategy, \(\sigma^t\), be a probability distribution over actions, \(a \in A\). Then the value of a strategy, \(v^{t}(\sigma^{t})\), is the expectation of its value over actions.

\[ v^{t}(\sigma^{t}) = \sum_{a \in A} \sigma^{t}(a) \: v^{t}(a) \]

Regret, \(R^{T}\), measures how much better some sequence of strategies, \(\sigma'\), would do compared to the chosen sequence of strategies, \(\sigma = \{\sigma^1, \sigma^2, \ldots \sigma^T\}\).

\[ R^{T} \equiv \sum_{t=1}^{T} \left( v^{t}({\sigma'}^{t}) - v^{t}(\sigma^{t}) \right) \]

External regret, \(R^{T}(a)\), measures the regret of the chosen sequence of strategies versus a hypothetical strategy where action \(a\) is always chosen.

\[ R^{T}(a) \equiv \sum_{t=1}^{T} \left( v^{t}(a) - v^{t}(\sigma^{t}) \right) \]

Regret Matching (RM) is a rule to determine the strategy for the next iteration:

\[ \sigma^{t+1}(a) \equiv \frac{ R^{t}_{+}(a) }{ \sum_{b \in A} R^{t}_{+}(b) } \]

where \(R_{+} \equiv \mathrm{max}(R, 0)\).

At the end of training, the recommended strategy (the one with convergence guarantees) is not the final strategy used in training, \(\sigma^{T}\), but the average strategy over all time steps:

\[ \bar{\sigma}^{T}(a) = \frac{1}{T} \sum_{t=1}^{T} \sigma^{t}(a) \]
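A minimal self-play sketch of regret matching for rock-paper-scissors, assuming NumPy; the iteration count is arbitrary. Each player updates external regrets from the observed reward vector and plays the regret-matching strategy; the average strategies approach the Nash equilibrium \((1/3, 1/3, 1/3)\).

```python
import numpy as np

# payoff[a, b] = utility to player 1 when player 1 plays a and player 2 plays b
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def strategy_from_regret(regret):
    pos = np.maximum(regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.ones(3) / 3

regret = [np.zeros(3), np.zeros(3)]
strategy_sum = [np.zeros(3), np.zeros(3)]
rng = np.random.default_rng(6)

for t in range(20_000):
    sigma = [strategy_from_regret(r) for r in regret]
    actions = [rng.choice(3, p=s) for s in sigma]
    for i in range(2):
        strategy_sum[i] += sigma[i]
        # reward vector v^t(a) over own actions, against the opponent's chosen action
        v = payoff[:, actions[1]] if i == 0 else -payoff[actions[0], :]
        regret[i] += v - sigma[i] @ v        # external regret update, R(a) += v(a) - v(sigma)

print([s / s.sum() for s in strategy_sum])   # both near [1/3, 1/3, 1/3]
```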

  • TODO: explain the convergence of \(\bar{\sigma}^{t}\) to an \(\varepsilon\)-Nash equilibrium.
  • Roughgarden, T. (2016). Twenty Lectures on Algorithmic Game Theory. 346

346 Roughgarden (2016).

Counterfactual regret minimization (CFR)

347 Zinkevich, Johanson, Bowling, & Piccione (2007).

348 N. Brown (2020), p. 12.

349 Zinkevich et al. (2007), p. 4.

350 Tammelin (2014).

351 Tammelin, Burch, Johanson, & Bowling (2015).

352 Burch, Moravcik, & Schmid (2019).

353 N. Brown & Sandholm (2019a).

TODO: explain extensive-form games.

A finite extensive game with imperfect information has the following components: 354

354 Zinkevich et al. (2007) and Lanctot, Waugh, Zinkevich, & Bowling (2009).

  • A finite set \(N\) of players. A finite set \(H\) of sequences, the possible histories of actions, such that the empty sequence is in \(H\) and every prefix of a sequence in \(H\) is also in \(H\). Define \(h \sqsubseteq h'\) to mean \(h\) is a prefix of \(h'\). \(Z \subseteq H\) are the terminal histories (those which are not a prefix of any other sequences). \(A(h) = \{a : ha \in H\}\) are the actions available after a non-terminal history, \(h \in H \backslash Z\).
  • A function \(P\) that assigns to each non-terminal history a member of \(N \cup \{c\}\). \(P\) is the player function. \(P(h)\) is the player who takes an action after the history \(h\). If \(P(h) = c\) then chance determines the action taken after history \(h\).
  • For each player \(i \in N \cup \{c\}\) a partition \(\mathcal{I}_i\) of \(\{h \in H : P (h) = i\}\) with the property that \(A(h) = A(h')\) whenever \(h\) and \(h'\) are in the same member of the partition. For \(I \in \mathcal{I}_i\) we denote by \(A(I_i)\) the set \(A(h)\) and by \(P(I_i)\) the player \(P(h)\) for any \(h \in I_i\) . \(\mathcal{I}_i\) is the information partition of player \(i\); a set \(I_i \in \mathcal{I}_i\) is an information set of player \(i\).
  • A function \(f_c\) that associates with every information set \(I\) where \(P(I) = c\), a probability measure \(f_c(a|I)\) on \(A(h)\); \(f_c(a|I)\) is the probability that \(a\) occurs given some \(h \in I\), where each such probability measure is independent of every other such measure.
  • For each player \(i \in N\) there is a utility function \(u_i\) from the terminal states \(Z\) to the reals \(\mathbb{R}\). If \(N = \{1, 2\}\) and \(u_1 = -u_2\), it is a zero-sum extensive game. Define \(\Delta_{u,i} \equiv \mathrm{max}_z \: u_i(z) - \mathrm{min}_z \: u_i(z)\) to be the range of utilities to player \(i\).

The player reach, \(\pi^{\sigma}_{i}(h)\), of a history \(h\) is the product of the probabilities for all agent \(i\) actions leading to \(h\). Formally, 355

355 N. Brown (2020), p. 6.

\[ \pi^{\sigma}_{i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') = i} \sigma_{i}(h', a') \]

Due to perfect recall, any two histories in infoset \(I_i\) have the same player reach for player \(i\). Thus, we similarly define the player reach \(\pi^{\sigma}_{i}(I_i)\) of infoset \(I_i\) as

\[ \pi^{\sigma}_{i}(I_i) \equiv \prod_{ {I'}_{i} \cdot a' \sqsubseteq I_i | P(I_i) = i } \sigma_{i}({I'}_{i}, a') = \left.\pi^{\sigma}_{i}(h)\right|_{h \in I_i} \]

The external reach AKA opponent reach, \(\pi^{\sigma}_{-i}(h)\), of a history \(h\) is the contribution of chance and all other players than \(i\). Formally,

\[ \pi^{\sigma}_{-i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') \neq i} \sigma_{P(h')}(h', a') \]

We also define the external reach of an infoset as

\[ \pi^{\sigma}_{-i}(I_i) \equiv \sum_{h \in I_{i}} \pi^{\sigma}_{-i}(h) \]

The counterfactual value of an infoset \(I\) is the expected utility to player \(i\) given that \(I\) has been reached, weighted by the external reach of \(I\) for player \(i\). Formally, 356

356 N. Brown (2020), p. 12.

\[ v(I) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

The counterfactual value of an action, \(a\), is

\[ v(I, a) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h \cdot a, z) \: u_{i}(z) \]

Let’s consider the case where, like in NLHE, our two private hole cards each make a single unique history \(h\), and we form infosets with a single hand, so \(I=h\). Then

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

making explicit the player reach and the external reach,

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi_{i}^{\sigma}(h, z) \: \pi_{-i}^{\sigma}(h, z) \: u_{i}(z) \]

At a leaf node where we finally calculate the rewards,

\[ v(z) = \pi^{\sigma}_{-i}(z) \: u_{i}(z) \]

TODO: explain CFR.

The instantaneous regret is

\[ r^{t}(I, a) = v^{\sigma^t}(I, a) - v^{\sigma^t}(I) \]

The (cumulative) counterfactual regret is

\[ R^{t}(I, a) = \sum_{t=1}^{T} r^{t}(I, a) \]

Similar to the regret-matching rule for the single-node game discussed above, applying regret matching during training means updating strategies according to the following rule.

\[ \sigma^{t+1}(I, a) \equiv \frac{ R^{t}_{+}(I, a) }{ \sum_{b \in A} R^{t}_{+}(I, b) } \]

The average strategy is

\[ \bar{\sigma}^{T}(I, a) = \frac{ \sum_{t=1}^{T} \pi^{t}_{i}(I) \: \sigma^{t}(I, a) }{ \sum_{t=1}^{T} \pi^{t}_{i}(I) } \]

Monte Carlo Counterfactual Regret Minimization (MCCFR)

357 Lanctot et al. (2009).

358 Neller & Lanctot (2013).

359 Burch, Lanctot, Szafron, & Gibson (2012).

360 Johanson, M. et al. (2012).

361 Schmid, M. et al. (2019).

362 Li, H. et al. (2020).

363 Habara, Fukuda, & Yamashita (2023).

364 Lanctot (2013).

365 Gibson (2014).

366 Johanson (2016).

367 Burch (2018).

368 Horacek (2022).

TODO: explain MCCFR.

External sampling MCCFR:

\[ \tilde{v}^{\sigma}_{i}(I) = \sum_{z \in Q} u_{i}(z) \: \pi^{\sigma}_{i}(z[I] \rightarrow z) \]

Best response and exploitability

Best response:

\[ \mathrm{BR}(\sigma_{-i}) = \underset{\sigma_{i}^{\prime}}{\mathrm{argmax}} \: u_{i}(\sigma_{i}^{\prime}, \sigma_{-i}) \]

TODO: Local Best Response (LBR). 369

369 Lisy & Bowling (2016), p. 2.

Exploitability:

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \mathrm{BR}(\sigma_{i})) \]

NashConv 370 exploitability uses the convention:

370 See NashConv exploitability defined in Lanctot, M. et al. (2017).

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \sigma_{-i}) \]

The average exploitability per player is

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} \varepsilon_{i}(\sigma) \]

Note that in zero-sum games, when summing over players, the second terms in NashConv sum to zero. 371

371 Timbers (2020), p. 3.

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) \]

In two-player games:

\[ \varepsilon(\sigma) = \frac{1}{2} \Big( u_{1}(\mathrm{BR}(\sigma_{2}), \sigma_{2}) + u_{2}(\sigma_{1}, \mathrm{BR}(\sigma_{1})) \Big) \]

372 Johanson, Waugh, Bowling, & Zinkevich (2011).

373 Ponsen, De Jong, & Lanctot (2011).

374 Lisy & Bowling (2016).

375 Timbers (2020).

3.13.10.4 Solving poker

376 Kuhn (1950).

377 Southey, F. et al. (2012).

378 Billings, Davidson, Schaeffer, & Szafron (2002).

379 Billings, D. et al. (2003).

380 Johanson (2013).

381 Bowling, Burch, Johanson, & Tammelin (2015).

382 Heinrich & Silver (2016).

383 Moravcik, M. et al. (2017).

384 N. Brown & Sandholm (2017).

385 N. Brown & Sandholm (2018).

386 N. Brown & Sandholm (2019a).

387 N. Brown, Lerer, Gross, & Sandholm (2019).

388 N. Brown & Sandholm (2019b).

389 N. Brown, Bakhtin, Lerer, & Gong (2020).

390 N. Brown (2020).

391 Schmid, M. et al. (2021).

392 Kovarik, V. et al. (2022).

3.13.11 Applications in physics

393 Denby (1988).

394 Denby (1993).

395 Spears, B.K. et al. (2018).

396 Cranmer, Seljak, & Terao (2021).

397 Liu, Z. et al. (2024).

398 Jiao, L. et al. (2024).

See also:

3.14 Theoretical machine learning

3.14.1 Algorithmic information theory

399 Cilibrasi & Vitanyi (2005).

400 Hutter (2007).

401 Rathmanner & Hutter (2011).

3.14.2 No free lunch theorems

402 Wolpert & Macready (1995).

403 Wolpert (1996).

404 Wolpert & Macready (1997).

405 Shalev-Shwarz & Ben-David (2014), p. 60–66.

406 McDermott (2019).

407 Wolpert (2007).

408 Wolpert & Kinney (2020).

409 Wolpert (2023).

410 Mitchell (1980).

411 Roberts (2021).

412 Goldreich & Ron (1997).

413 Joyce & Herrmann (2017).

414 H. W. Lin, Tegmark, & Rolnick (2017).

415 Lauc (2020).

416 Nakkiran (2021).

417 Bousquet, O. et al. (2021).

418 Andrews (2023).

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available. 419

419 Raissi, Perdikaris, & Karniadakis (2017a), p. 2.

Roberts:

From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon. 420

420 Roberts (2021), p. 7.

  • TODO: Note Dennett’s discussion of compression. 421

421 Dennett (1991), p. TODO.

3.14.3 Connectivists vs symbolicists

3.14.4 Graphical tensor notation

3.14.5 Universal approximation theorem

422 Minsky & Papert (1969), p. TODO.

423 Hornik, Stinchcombe, & White (1989).

424 Lu, Z. et al. (2017).

425 H. Lin & Jegelka (2018).

426 Ismailov (2020).

427 Bishop (2006), p. 230.

3.14.6 Relationship to statistical mechanics

428 Opper & Kinzel (1996).

429 Opper (2001).

430 Bahri, Y. et al. (2020).

431 Halverson, Maiti, & Stoner (2020).

432 Canatar, Bordelon, & Pehlevan (2020).

433 Roberts, Yaida, & Hanin (2021).

434 Cantwell (2022).

435 Dinan, Yaida, & Zhang (2023).

436 Sohl-Dickstein (2020).

437 Aifer, M. et al. (2023).

438 Geshkovski, Letrouit, Polyanskiy, & Rigollet (2023).

3.14.7 Relationship to gauge theory

Invariant:

\[ f(g x) = f(x) \]

Equivariant:

\[ f(g x) = g' f(x) \]

Same-equivariant is the case that \(g' = g\).
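A tiny NumPy check of the equivariance relation \(f(gx) = g f(x)\), where \(f\) is a circular convolution and \(g\) is a cyclic shift (translation on a periodic signal); the kernel and shift are arbitrary.

```python
import numpy as np

def circ_conv(x, k):
    # circular convolution: output[i] = sum_j x[(i - j) mod n] * k[j]
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(len(k)))
                     for i in range(n)])

rng = np.random.default_rng(7)
x = rng.normal(size=8)
kernel = np.array([1.0, -2.0, 1.0])
g = lambda v: np.roll(v, 3)          # the group action: cyclic shift by 3

print(np.allclose(circ_conv(g(x), kernel), g(circ_conv(x, kernel))))   # True
```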

439 Dieleman, Fauw, & Kavukcuoglu (2016).

440 T. S. Cohen & Welling (2016).

441 T. S. Cohen, Weiler, Kicanaoglu, & Welling (2019).

442 Fuchs, Worrall, Fischer, & Welling (2020).

443 Bogatskiy, A. et al. (2023).

444 Marchetti, Hillar, Kragic, & Sanborn (2023).

445 Battiloro, C. et al. (2024).

446 Erdogan & Lucic (2025).

3.14.8 Thermodynamics of computation

447 Bérut, A. et al. (2012).

448 Bérut, Petrosyan, & Ciliberto (2015).

3.15 Information geometry

3.15.1 Introduction

449 Smith (2019).

450 Nielsen (2020).

451 Amari (1998).

452 Amari (2016).

3.15.2 Geometric understanding of classical statistics

453 Balasubramanian (1996a).

454 Balasubramanian (1996b).

455 Calin & Udriste (2014).

456 de Carvalho, Page, & Barney (2019).

3.15.3 Geometric understanding of deep learning

457 Lei, Luo, Yau, & Gu (2018).

458 Gao & Chaudhari (2020).

459 Bronstein, Bruna, Cohen, & Velickovic (2021).

3.16 Automation

3.16.1 Cybernetics

3.16.2 AutoML

  • Neural Architecture Search (NAS)
  • AutoML frameworks
  • RL-driven NAS
  • learned sparsity

3.16.3 Surrogate models

460 Fefferman, Mitter, & Narayanan (2016).

461 Raissi et al. (2017a) and Raissi, Perdikaris, & Karniadakis (2017b).

462 Karniadakis, G.E. et al. (2021).

463 Howard, Mandt, Whiteson, & Yang (2021).

464 Thuerey, N. et al. (2021).

465 Cranmer, Pavez, & Louppe (2015).

466 Cranmer, Brehmer, & Louppe (2019).

467 Baydin, A.G. et al. (2019).

Lectures:

3.16.4 AutoScience

468 C. Anderson (2008).

469 Asch, M. et al. (2018).

470 D’Agnolo & Wulzer (2019).

471 Krenn, M. et al. (2022).

472 Udrescu & Tegmark (2020).

473 Cranmer, M. et al. (2020).

474 Z. Liu, Madhavan, & Tegmark (2022).

475 Lu, C. et al. (2024).

Figure 3.14: The inference cycle for the process of scientific inquiry. The three distinct forms of inference (abduction, deduction, and induction) facilitate an all-encompassing vision, enabling HPC and HDA to converge in a rational and structured manner. HPC: high-performance computing; HDA: high-end data analysis. 476

476 Asch, M. et al. (2018).

See also:

3.17 Implications for the realism debate

3.17.1 Introduction

477 Korb (2001).

478 Williamson (2009).

479 Bensusan (2000).

See also:

3.17.2 Real clusters

  • Nope: Hennig

See also:

3.17.3 Word meanings

480 Kornai (2023).

481 Perone (2018).

482 Skelac & Jandric (2020).

483 Grietzer (2017b).

484 Grietzer (2017a).

485 Tenney, I. et al. (2019).

486 Nissim, Noord, & Goot (2019).

487 Bender & Koller (2020).

488 Elhage, N. et al. (2022).

489 Patel & Pavlick (2022).

490 Lovering & Pavlick (2022).

491 Huh, Cheung, Wang, & Isola (2024).

492 Jamali, M. et al. (2024).

493 Coecke, Sadrzadeh, & Clark (2010).

494 Grefenstette & Sadrzadeh (2011).

495 Balkir, Kartsaklis, & Sadrzadeh (2015).

496 Tyrrell (2018).

497 Kartsaklis & Sadrzadeh (2013).

498 Grefenstette, E. et al. (2013).

499 de Felice, Meichanetzidis, & Toumi (2019).

Wittgenstein in PI:

The meaning of a word is its use in the language. 500

500 Wittgenstein (2009), §43.

and

One cannot guess how a word functions. One has to look at its use, and learn from that. 501

501 Wittgenstein (2009), §340.

Piantadosi:

Modern large language models integrate syntax and semantics in the underlying representations: encoding words as vectors in a high-dimensional space, without an effort to separate out e.g. part of speech categories from semantic representations, or even predict at any level of analysis other than the literal word. Part of making these models work well was in determining how to encode semantic properties into vectors, and in fact initializing word vectors via encodings of distribution semantics from e.g. Mikolov et al. 2013 (Radford et al. 2019). Thus, an assumption of the autonomy of syntax is not required to make models that predict syntactic material and may well hinder it. 502

502 Piantadosi (2023), p. 15.

See also:

3.18 My thoughts

My docs:

My talks:

3.19 Annotated bibliography

3.19.1 Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge.

3.19.1.1 My thoughts

  • TODO

3.19.2 Cowan, G. (1998). Statistical Data Analysis.

3.19.2.1 My thoughts

  • TODO

3.19.3 James, F. (2006). Statistical Methods in Experimental Physics.

3.19.3.1 My thoughts

  • TODO

3.19.4 Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.

  • Cowan et al. (2011)
  • Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

3.19.4.1 My thoughts

  • TODO

3.19.5 ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.

3.19.5.1 My thoughts

  • TODO

3.19.6 Cranmer, K. (2015). Practical statistics for the LHC.

3.19.6.1 My thoughts

  • TODO