3  Philosophy of statistics

Published

February 15, 2026

Statistical analysis plays a central role in addressing the problem of induction. Can inductive inference be formalized? What are the caveats? Can inductive inference be automated? How does machine learning work?

All knowledge is, in final analysis, history. All sciences are, in the abstract, mathematics. All judgements are, in their rationale, statistics.

– C. R. Rao 1

1 Rao (1997), p. x.

3.1 Introduction to the foundations of statistics

3.1.1 Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. Inductive inferences are used in the scientific method to make generalizations from finite data. This introduces unique avenues of error not found in purely deductive inferences, as in logic and mathematics. A valid deductive argument is sound if all of its premises are true, in which case its conclusion necessarily follows; a strong inductive argument, by contrast, makes its conclusion only probable even when its premises are true, and so can still result in error in some cases because its support is ultimately probabilistic.

A skeptic may further ask whether we are even justified in using the probabilities we use in inductive arguments. What is the probability the Sun will rise tomorrow? What kinds of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.

See also:

3.1.2 Early investigators

  • Ibn al-Haytham (c. 965-1040)
    • “Ibn al-Haytham was an early proponent of the concept that a hypothesis must be supported by experiments based on confirmable procedures or mathematical evidence—an early pioneer in the scientific method five centuries before Renaissance scientists.” - Wikipedia
  • Gerolamo Cardano (1501-1576)
    • Book on Games of Chance (1564)
  • John Graunt (1620-1674)
  • Jacob Bernoulli (1655-1705)
    • Ars Conjectandi (1713, posthumous)
    • First modern phrasing of the problem of parameter estimation 2
    • See Hacking 3
    • Early vision of decision theory:

2 Edwards (1974), p. 9.

3 Hacking (1971).

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous. 4

4 Bernoulli, J. (1713). Ars Conjectandi, Chapter II, Part IV, defining the art of conjecture [wikiquote].

5 Venn (1888).

6 Jevons (1873a).

7 Jevons (1873b).

3.1.3 Foundations of modern statistics

  • Central limit theorem
  • Charles Sanders Peirce (1839-1914)
    • Formulated modern statistics in “Illustrations of the Logic of Science”, a series published in Popular Science Monthly (1877-1878), and also “A Theory of Probable Inference” in Studies in Logic (1883). 8
    • Introduced blinded, randomized controlled experiments with a repeated-measures design (before Fisher).
  • Karl Pearson (1857-1936)
    • The Grammar of Science (1892)
    • “On the criterion that a given system of deviations…” (1900) 9
      • Proposed testing the validity of hypothesized values by evaluating the chi-squared distance between the hypothesized and the empirically observed values via the \(p\)-value.
    • With Walter Frank Raphael Weldon, he established the journal Biometrika in 1902.
    • Founded the world’s first university statistics department at University College, London in 1911.
  • John Maynard Keynes (1883-1946)
    • Keynes, J. M. (1921). A Treatise on Probability. 10
  • Ronald Fisher (1890-1972)
    • Fisher significance of the null hypothesis (\(p\)-values)
      • “On an absolute criterion for fitting frequency curves” 11
      • “Frequency distribution of the values of the correlation coefficient in samples of indefinitely large population” 12
    • “On the ‘probable error’ of a coefficient of correlation deduced from a small sample” 13
      • Definition of likelihood
      • ANOVA
    • Statistical Methods for Research Workers (1925)
    • The Design of Experiments (1935)
    • “Statistical methods and scientific induction” 14
    • The Lady Tasting Tea 15
  • Jerzy Neyman (1894-1981)
  • Egon Pearson (1895-1980)
    • Neyman-Pearson hypothesis tests and confidence intervals with fixed error probabilities (also \(p\)-values, but considering two hypotheses introduces two types of error)
  • Harold Jeffreys (1891-1989)
    • objective (non-informative) Jeffreys priors
  • Andrey Kolmogorov (1903-1987)
  • C.R. Rao (1920-2023)
  • Ray Solomonoff (1926-2009)
  • Shun’ichi Amari (b. 1936)
  • Judea Pearl (b. 1936)

8 Peirce (1883), p. 126–181.

9 Pearson (1900).

10 Keynes (1921).

11 Fisher (1912).

12 Fisher (1915).

13 Fisher (1921).

14 Fisher (1955).

15 Salsburg (2001).

16 Reid (1998).

17 Neyman (1955).

18 Carnap (1960).

3.1.4 Pedagogy

19 Stuart, Ord, & Arnold (2010).

20 F. James (2006).

21 Cowan (1998) and Cowan (2016).

22 Cranmer (2015).

23 Jaynes (2003).

24 Lista (2016b).

25 Lista (2016a).

26 Cox (2006).

27 Behnke, Kröninger, Schott, & Schörner-Sadenius (2013).

28 Cousins (2018).

29 Weisberg (2019).

30 Gelman & Vehtari (2021).

31 Otsuka (2023).

3.3 Statistical models

3.3.1 Parametric models

46 McCullagh (2002).

3.3.2 Canonical distributions

3.3.2.1 Bernoulli distribution

\[ \mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \]

which can also be written as

\[ \mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\} \]

or

\[ \mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\} \]
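The equivalence of these three forms is easy to check numerically. A minimal sketch in Python (the function names are ours, not from any library):

```python
def ber_cases(k, p):
    # piecewise definition
    return p if k == 1 else 1 - p

def ber_power(k, p):
    # product form: p^k (1-p)^(1-k)
    return p**k * (1 - p)**(1 - k)

def ber_linear(k, p):
    # linear form: p k + (1-p)(1-k)
    return p * k + (1 - p) * (1 - k)

p = 0.3
assert all(ber_cases(k, p) == ber_power(k, p) == ber_linear(k, p) for k in (0, 1))
```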

  • Binomial distribution
  • Poisson distribution

TODO: explain, another important relationship is

Figure 3.1: Relationships among Bernoulli, binomial, categorical, and multinomial distributions.

3.3.2.2 Normal/Gaussian distribution

\[ N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \]

and in \(k\) dimensions:

\[ N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^\intercal \:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \]

where \(\boldsymbol{\Sigma}\) is the covariance matrix of the distribution (defined in eq. 3.2).
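As a quick numeric check of the univariate formula (a sketch using only the standard library; the function name is ours):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, following the univariate formula above."""
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# the peak of the standard normal is 1/sqrt(2 pi) ~ 0.3989,
# and the density is symmetric about mu
peak = normal_pdf(0.0, 0.0, 1.0)
```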

3.3.3 Central limit theorem

Let \(X_{1}\), \(X_{2}\), … , \(X_{n}\) be a random sample drawn from any distribution with a finite mean \(\mu\) and variance \(\sigma^{2}\). As \(n \rightarrow \infty\), the distribution of

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
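The convergence is easy to see in a quick simulation (a sketch with only the standard library; the seed, sample size, and number of trials are arbitrary choices): standardized means of uniform samples should look approximately standard normal.

```python
import random
import statistics

random.seed(0)
n, trials = 200, 2000
mu, sigma = 0.5, (1 / 12) ** 0.5          # mean and sd of Uniform(0, 1)

z = []
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z.append((xbar - mu) / (sigma / n ** 0.5))

# the standardized sample means are approximately N(0, 1)
z_mean, z_sd = statistics.mean(z), statistics.stdev(z)
```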

  • \(\chi^2\) distribution
  • Univariate distribution relationships
Figure 3.2: Detail of a figure showing relationships among univariate distributions. See the full figure here 47.

47 Leemis & McQueston (2008).

  • The exponential family of distributions are maximum entropy distributions.

3.3.4 Mixture models

48 Cranmer, K. et al. (2012).

3.4 Point estimation and confidence intervals

3.4.1 Inverse problems

Recall that in the context of parametric models, the data, \(x_i\), have a pdf modeled by a function, \(f(x_i; \theta_j)\), with parameters, \(\theta_j\). In a statistical inverse problem, the goal is to infer values of the model parameters, \(\theta_j\), given some finite set of data, \(\{x_i\}\), sampled from a probability density, \(f(x_i; \theta_j)\), that models the data reasonably well 49.

49 This assumption that the model models the data “reasonably” well reflects that, to the degree required by your analysis, the important features of the data are matched within the systematic uncertainties parametrized in the model. If the model is incomplete because it is missing an important feature of the data, then this is the “ugly” (class-3) error in the Sinervo classification of systematic uncertainties.

50 Cowan (1998) and Cowan (2016), p. TODO.

3.4.2 Bias and variance

The bias of an estimator, \(\hat\theta\), is defined as

\[ \mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \]

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. 3.1) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

\[ \mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \]

The MSE of an estimator can be related to its bias and its variance by the following proof:

\[ \begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

noting that

\[ \mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2 \]

and

\[ \begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \left(\mathbb{E}(\hat{\theta} - \theta)\right)^2 \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

we see that MSE is equivalent to

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \]

For an unbiased estimator, the MSE is the variance of the estimator.
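The decomposition can be verified numerically: when bias, variance, and MSE are estimated from the same ensemble of repeated experiments, the identity holds up to floating-point error. A sketch with a deliberately biased estimator of a Gaussian mean (the seed, constants, and shrinkage choice are illustrative):

```python
import random

random.seed(1)
theta, n, reps = 2.0, 10, 5000

est = []
for _ in range(reps):
    xs = [random.gauss(theta, 1.0) for _ in range(n)]
    est.append(sum(xs) / (n + 1))          # shrunken (biased) estimator of the mean

m = sum(est) / reps                         # ensemble mean of the estimator
mse = sum((e - theta) ** 2 for e in est) / reps
var = sum((e - m) ** 2 for e in est) / reps
bias2 = (m - theta) ** 2

assert abs(mse - (var + bias2)) < 1e-9      # MSE = Var + Bias^2
```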

TODO:

  • Note the discussion of the bias-variance tradeoff by Cranmer.
  • Note the new deep learning view. See Deep learning.

See also:

3.4.3 Maximum likelihood estimation

The method of maximum likelihood estimation (MLE) was first used by Fisher. 51

51 Aldrich (1997).

\[\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is equivalent to maximizing \(L(\theta)\), and the former is more convenient because, for data that are independent and identically distributed (i.i.d.), the joint likelihood factors into a product over individual measurements:

\[ L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta) \]

and taking the log of the product makes it a sum:

\[ \mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is also equivalent to minimizing \(-\mathrm{log} \: L(\theta)\), the negative log-likelihood (NLL). For distributions that are i.i.d.,

\[ \mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i \]
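As a concrete sketch (standard library only; the exponential model, seed, and grid are our illustrative choices), minimizing the NLL over a grid of candidate rates recovers the closed-form MLE \(\hat{\lambda} = 1/\bar{x}\) for i.i.d. exponential data:

```python
import math
import random

random.seed(2)
data = [random.expovariate(1.5) for _ in range(1000)]   # true rate 1.5

def nll(lam):
    # NLL of i.i.d. Exponential(lam) data: -sum_i log(lam exp(-lam x_i))
    return -sum(math.log(lam) - lam * x for x in data)

grid = [l / 100 for l in range(50, 301)]                # candidate rates 0.50 .. 3.00
lam_hat = min(grid, key=nll)                            # grid-search minimizer of the NLL
closed_form = len(data) / sum(data)                     # 1 / sample mean
```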

3.4.3.1 Invariance of likelihoods under reparametrization

  • Likelihoods are invariant under reparametrization. 52
  • Bayesian posteriors are not invariant in general.

52 F. James (2006), p. 234.

See also:

3.4.3.2 Ordinary least squares

  • Least squares from MLE of gaussian models: \(\chi^2\)
  • Ordinary Least Squares (OLS)
  • Geometric interpretation

53 Cox (2006), p. 11.

54 Murphy (2012), p. 222.
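For the one-variable case, the OLS estimates have a simple closed form obtained by setting the gradient of the sum of squared residuals to zero. A minimal sketch (the data values are made up for illustration):

```python
# closed-form OLS for y = beta0 + beta1 * x, minimizing the sum of squared residuals
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]              # lies exactly on y = 1 + 2x

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
beta0 = ybar - beta1 * xbar
```

Here the fit recovers the generating line exactly because the toy data are noiseless.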

3.4.4 Variance of MLEs

55 Fréchet (1943), Cramér (1946), Rao (1945), and Rao (1947).

56 Rice (2007), p. 300–2.

57 Nielsen (2013).

58 Cowan (1998), p. 130-5.

59 F. James (2006), p. 234.

60 F. James & Roos (1975).

Figure 3.3: Transformation of non-parabolic log-likelihood to parabolic (source: my slides, recreation of F. James (2006), p. 235).

61 Loh (1987).

62 Wainer (2007).

63 Tegmark, Taylor, & Heavens (1997).

3.4.5 Bayesian credibility intervals

3.4.6 Uncertainty on measuring an efficiency

64 Clopper & Pearson (1934).

65 Agresti & Coull (1998).

66 Hanley & Lippman-Hand (1983).

67 L. D. Brown, Cai, & DasGupta (2001).

68 Casadei (2012).

3.4.7 Examples

3.5 Statistical hypothesis testing

3.5.1 Null hypothesis significance testing

  • Karl Pearson observing how rare sequences of roulette spins are
  • Null hypothesis significance testing (NHST)
  • goodness of fit
  • Fisher

Fisher:

[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. 69

69 Fisher (1935), p. 16.

3.5.2 Neyman-Pearson theory

3.5.2.1 Introduction

70 Goodman (1999a). p. 998.

71 J. Cohen (1992).

72 ATLAS and CMS Collaborations (2011).

73 Cowan, Cranmer, Gross, & Vitells (2011).

Figure 3.4: TODO: ROC explainer. (Wikimedia, 2015).

See also:

3.5.2.2 Neyman-Pearson lemma

Neyman-Pearson lemma: 74

74 Neyman & Pearson (1933).

For a fixed signal efficiency, \(1-\alpha\), the selection that corresponds to the lowest possible misidentification probability, \(\beta\), is given by

\[ \frac{L(H_1)}{L(H_0)} > k_{\alpha} \]

where \(k_{\alpha}\) is the cut value required to achieve a type-1 error rate of \(\alpha\).

Neyman-Pearson test statistic:

\[ q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \]

Profile likelihood ratio:

\[ \lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \]

where \(\hat{\theta}\) is the (unconditional) maximum-likelihood estimator that maximizes \(L\), while \(\hat{\theta}_\mu\) is the conditional maximum-likelihood estimator that maximizes \(L\) for a specified signal strength, \(\mu\), and the vector \(\theta\) comprises all other parameters of interest and nuisance parameters.

3.5.2.3 Neyman construction

Cranmer: Neyman construction.

Figure 3.5: Neyman construction for a confidence belt for \(\theta\) (source: K. Cranmer, 2020).

TODO: fix

\[ q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \]
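For a single-bin counting experiment with Poisson likelihoods evaluated at the observed count, this statistic has a simple closed form because the factorials cancel in the ratio. A minimal sketch (the function name and numbers are illustrative):

```python
import math

def q_counting(n_obs, s, b, mu=1.0):
    """q = -2 ln [ Pois(n_obs | mu*s + b) / Pois(n_obs | b) ] for one bin;
    the n_obs! terms cancel in the ratio."""
    return -2.0 * (n_obs * math.log((mu * s + b) / b) - mu * s)

# an excess over background (n_obs > b) drives q negative,
# favoring the signal-plus-background hypothesis
```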

3.5.2.4 Flip-flopping

  • Flip-flopping and Feldman-Cousins confidence intervals 75

75 Feldman & Cousins (1998).

3.5.3 p-values and significance

  • \(p\)-values and significance 76
  • Coverage
  • Fisherian vs Neyman-Pearson \(p\)-values

76 Sinervo (2002) and Cowan (2012).

Cowan et al. define a \(p\)-value as

a probability, under assumption of \(H\), of finding data of equal or greater incompatibility with the predictions of \(H\). 77

77 Cowan et al. (2011), p. 2–3.

Also:

It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the \(p\)-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery”. 78

78 Cowan et al. (2011), p. 3.
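In particle physics the threshold is conventionally expressed as a Gaussian significance \(Z\), defined by \(p = 1 - \Phi(Z)\), so that \(Z = 5\) corresponds to \(p \approx 2.9 \times 10^{-7}\). A sketch of the conversion using only the standard library (the bisection bounds are arbitrary choices):

```python
import math

def p_value(z):
    """One-sided tail probability P(N(0,1) > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def significance(p, lo=0.0, hi=10.0):
    """Invert p_value by bisection: find the Z with P(N(0,1) > Z) = p."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if p_value(mid) > p:    # tail probability too large: Z must be bigger
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```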

3.5.3.1 Upper limits

79 Cousins & Highland (1992).

3.5.3.2 CLs method

  • Conservative coverage; used in particle physics
  • Junk 80
  • Read 81
  • ATLAS 82

80 Junk (1999).

81 Read (2002).

82 ATLAS Statistics Forum (2011).

3.5.4 Asymptotics

83 Wilks (1938).

84 Wald (1943).

85 Cowan et al. (2011).

86 Cowan, Cranmer, Gross, & Vitells (2012).

87 Bhattiprolu, Martin, & Wells (2020).

3.5.5 Student’s t-test

  • Student’s t-test
  • ANOVA
  • A/B-testing

3.5.6 Decision theory

88 Murphy (2012), p. 197.

89 Goodman (1999b).

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings. 90

90 Goodman (1999a). p. 995.

See also:

3.5.7 Examples

3.6 Uncertainty quantification

3.6.1 Sinervo classification of systematic uncertainties

  • Class-1, class-2, and class-3 systematic uncertainties (good, bad, ugly), a classification by Pekka Sinervo (PhyStat2003) 91
  • Not to be confused with type-1 and type-2 errors in Neyman-Pearson theory
  • Heinrich, J. & Lyons, L. (2007). Systematic errors. 92
  • Caldeira & Nord 93

91 Sinervo (2003).

92 Heinrich & Lyons (2007).

93 Caldeira & Nord (2020).

Lyons:

In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error. 94

94 Lyons (2008), p. 890.

Figure 3.6: Classification of measurement uncertainties (philosophy-in-figures.tumblr.com, 2016).
  • Poincaré’s three levels of ignorance

3.6.2 Profile likelihoods

  • Profiling and the profile likelihood
    • Importance of Wald and Cowan et al.
    • hybrid Bayesian-frequentist method

3.6.3 Examples of poor estimates of systematic uncertainties

Figure 3.7: Demonstration of sensitivity to the jet energy scale for an alleged excess in \(Wjj\) by Tommaso Dorigo (2011) (see also: GIF).
  • OPERA. (2011). Faster-than-light neutrinos.
  • BICEP2 claimed evidence of B-modes in the CMB as evidence of cosmic inflation without accounting for cosmic dust.

3.6.4 Conformal prediction

95 Gammerman, Vovk, & Vapnik (1998).

3.7 Statistical classification

3.7.1 Introduction

  • Precision vs recall
  • Recall is sensitivity
  • Sensitivity vs specificity
  • Accuracy
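These metrics are all simple ratios of confusion-matrix counts. A minimal sketch (the function name and counts are illustrative):

```python
def metrics(tp, fp, fn, tn):
    """Summary metrics from binary confusion-matrix counts."""
    return {
        "precision":   tp / (tp + fp),             # of predicted positives, how many are real
        "recall":      tp / (tp + fn),             # sensitivity, true-positive rate
        "specificity": tn / (tn + fp),             # true-negative rate
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
    }

m = metrics(tp=8, fp=2, fn=4, tn=6)
```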

3.7.2 Examples

  • TODO

See also:

3.8 Causal inference

3.8.1 Introduction

96 Rubin (1974).

97 Lewis (1981).

98 Pearl (2018).

99 Bareinboim, Correa, Ibeling, & Icard (2021).

See also:

3.8.2 Causal models

100 Pearl (2009).

101 Robins & Wasserman (1999).

102 Peters, Janzing, & Scholkopf (2017).

103 Lundberg, Johnson, & Stewart (2021).

3.8.3 Counterfactuals

104 Ismael (2023).

105 Chevalley, Schwab, & Mehrjou (2024).

3.9 Exploratory data analysis

3.9.1 Introduction

106 Tukey (1977).

3.9.2 Look-elsewhere effect

  • Look-elsewhere effect (LEE)
    • AKA File-drawer effect
  • Stopping rules
    • validation dataset
    • statistical issues, violates the likelihood principle

3.9.3 Archiving and data science

107 Chen, X. et al. (2018).

3.10 “Statistics Wars”

3.10.1 Introduction

  • Kruschke
  • Carnap
    • “The two concepts of probability” 108
  • Royall
    • “What do these data say?” 109

108 Carnap (1945).

109 Royall (1997), p. 171–2.

Cranmer:

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability \(P (A, B)\) in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability. 110

110 Cranmer (2015), p. 6.

Neyman:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. 111

111 Neyman & Pearson (1933).

Figure 3.8: From Kruschke. 112

112 Kruschke & Liddell (2018).

3.10.2 Likelihood principle

  • Likelihood principle
  • The likelihood principle is the proposition that, given a statistical model and a data sample, all the evidence relevant to model parameters is contained in the likelihood function.
  • The history of likelihood 113
  • Berger & Wolpert. (1988). The Likelihood Principle. 116

113 Edwards (1974).

114 Birnbaum (1962).

115 Hacking (1965).

116 Berger & Wolpert (1988).

O’Hagan:

The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference. 117

117 O’Hagan (2010), p. 17–18.

118 Gandenberger (2015).

119 Evans (2013).

120 Mayo (2014).

121 Mayo (2019).

122 Dawid (2014).

Mayo:

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true. 123

123 Mayo (2019).

3.10.3 Discussion

Lyons:

Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors. 124

124 Lyons (2008), p. 891.

125 Carnap (1952).

126 Carnap (1960).

127 Sznajder (2018).

128 Hacking (1965).

129 Neyman (1977).

130 Rozeboom (1960).

131 Meehl (1978).

132 Zech (1995).

133 Royall (1997).

134 Berger (2003).

135 Stuart et al. (2010), p. 460.

136 Mayo (1981).

137 Mayo (1996).

138 Mayo & Spanos (2006).

139 Mayo & Spanos (2011).

140 Mayo (2018).

141 Gelman & Hennig (2017).

142 Murphy (2012), ch. 6.6.

143 Murphy (2022), p. 195–198.

144 Gandenberger (2016).

145 Wakefield (2013), ch. 4.

146 Efron & Hastie (2016), p. 30–36.

147 Kruschke & Liddell (2018).

148 Steinhardt (2012).

149 Clayton (2021).

150 Wagenmakers (2021).

Figure 3.9: The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference (gandenberger.org/research/, 2015).

Goodman:

The idea that the \(P\) value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations). 151

151 Goodman (1999a). p. 999.

3.11 Replication crisis

3.11.1 Introduction

  • Ioannidis, J.P. (2005). Why most published research findings are false. 152

152 Ioannidis (2005).

3.11.2 p-value controversy

153 Wasserstein & Lazar (2016).

154 Wasserstein, Allen, & Lazar (2019).

155 Benjamin, D.J. et al. (2017).

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. 156

156 Fisher (1935), p. 13–14.

157 Rao & Lovric (2016).

158 Mayo (2021).

159 Gorard & Gorard (2016).

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application. 160

160 Benjamini, Y. et al. (2021), p. 1.

More:

161 Bassily, R. et al. (2015).

3.12 Classical machine learning

3.12.1 Introduction

  • Classification vs regression
  • Supervised and unsupervised learning
    • Classification = supervised; clustering = unsupervised
  • Hastie, Tibshirani, & Friedman 162
  • Information Theory, Inference, and Learning 163
  • Murphy, K.P. (2012). Machine Learning: A probabilistic perspective. MIT Press. 164
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An introduction. MIT Press. 165
  • Shalev-Shwarz, S. & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. 166
  • VC-dimension
    • Vapnik (1994) 167
    • Shalev-Shwarz, S. & Ben-David, S. (2014). 168

162 Hastie, Tibshirani, & Friedman (2009).

163 MacKay (2003).

164 Murphy (2012).

165 Murphy (2022), p. 195–198.

166 Shalev-Shwarz & Ben-David (2014).

167 Vapnik, Levin, & LeCun (1994).

168 Shalev-Shwarz & Ben-David (2014), p. 67–82.

3.12.2 History

169 McCarthy, Minsky, Rochester, & Shannon (1955).

170 Solomonoff (2016).

171 Kardum (2020).

172 McCulloch & Pitts (1943).

173 J. A. Anderson & Rosenfeld (1998).

174 Rosenblatt (1961).

175 Minsky & Papert (1969).

176 Cartwright (2001).

See also:

3.12.3 Logistic regression

From a probabilistic point of view, 177 logistic regression can be derived from doing maximum likelihood estimation of a vector of model parameters, \(\vec{w}\), in a dot product with the input features, \(\vec{x}\), and squashed with a logistic function that yields the probability, \(\mu\), of a Bernoulli random variable, \(y \in \{0, 1\}\).

177 Murphy (2012), p. 21.

\[ p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)} \]

The negative log-likelihood of multiple trials is

\[ \begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \end{align} \]

which is the cross entropy loss. Note that the first term is non-zero only when the true target is \(y_i=1\), and similarly the second term is non-zero only when \(y_i=0\). 178 Therefore, we can reparametrize the target \(y_i\) in favor of \(t_{ki}\) that is one-hot in an index \(k\) over classes.

178 Note: Label smoothing is a regularization technique that smears the activation over other labels, but we don’t do that here.

\[ \mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \]

where

\[ t_{ki} = \begin{cases} 1 & \mathrm{if}\ k = y_i \\ 0 & \mathrm{otherwise} \end{cases} \]

and

\[ \mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases} \]

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, \(k\), only one term for the true class contributes.

\[ \mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{is\ such\ that}\ t_{ki}=1} \tag{3.3}\]

Logistic regression uses the logit function 179, which is the logarithm of the odds—the ratio of the chance of success to failure. Let \(\mu\) be the probability of success in a Bernoulli trial, then the logit function is defined as

179 “Logit” was coined by Joseph Berkson (1899-1982).

\[ \mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \]

Logistic regression assumes that the logit function is a linear function of the explanatory variable, \(x\).

\[ \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x \]

where \(\beta_0\) and \(\beta_1\) are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, \(\vec{x}\), where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, \(\beta_0\), in a dot product.

\[ \vec{x} = (1, x_1, x_2, \ldots, x_n)^\intercal \]

\[ \vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^\intercal \]

\[ \log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^\intercal \vec{x} \]

For the moment, let \(z \equiv \vec{w}^\intercal \vec{x}\). Exponentiating and solving for \(\mu\) gives

\[ \mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} } \]

This function is called the logistic or sigmoid function.

\[ \mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \]

Since we inverted the logit function by solving for \(\mu\), the inverse of the logit function is the logistic or sigmoid.

\[ \mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z) \]

And therefore,

\[ \mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^\intercal \vec{x}) \]
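The pieces above assemble into a working fit: maximize the Bernoulli likelihood (equivalently, minimize the cross-entropy NLL) by gradient descent on \(\beta_0\) and \(\beta_1\). A self-contained sketch on synthetic 1-D data (the seed, constants, and names are our illustrative choices, not a reference implementation):

```python
import math
import random

random.seed(3)
# synthetic data: class 1 tends to have larger x
data = ([(random.gauss(-1.0, 1.0), 0) for _ in range(200)]
        + [(random.gauss(+1.0, 1.0), 1) for _ in range(200)])

def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1, eta = 0.0, 0.0, 0.05
for _ in range(300):
    g0 = g1 = 0.0
    for x, y in data:
        err = sigm(b0 + b1 * x) - y     # d(NLL)/dz for sigmoid + cross entropy
        g0 += err
        g1 += err * x
    b0 -= eta * g0 / len(data)          # gradient descent step on the mean NLL
    b1 -= eta * g1 / len(data)

accuracy = sum((sigm(b0 + b1 * x) > 0.5) == (y == 1) for x, y in data) / len(data)
```

The learned slope \(b_1\) comes out positive, reflecting that larger \(x\) favors class 1; for equal-variance Gaussian classes the logistic model is in fact well specified.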

Figure 3.10: Logistic regression.

See also:

3.12.4 Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by \(k\)) as

\[ p(\vec{y} \,|\, \vec{\mu}) = \mathrm{Cat}(\vec{y} \,|\, \vec{\mu}) = \prod_k {\mu_k}^{y_k} \]

which is the categorical or multinoulli distribution. The negative log-likelihood of multiple independent trials is

\[ \mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \]

Noting again that \(y_{ki} = 1\) only when \(k\) is the true class, and is 0 otherwise, this simplifies to eq. 3.3.
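A minimal sketch of the resulting loss for one sample (the function names are ours; subtracting the max logit is a standard numerical-stability trick, not part of the definition):

```python
import math

def softmax(zs):
    m = max(zs)                           # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(zs, k_true):
    """NLL of the true class k_true under softmax(zs): the single surviving term."""
    return -math.log(softmax(zs)[k_true])
```

For uniform logits over \(K\) classes the loss is \(\log K\), the entropy of a maximally uncertain prediction.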

See also:

180 McFadden & Zarembka (1973).

181 Blondel, Martins, & Niculae (2020).

182 Goodfellow et al. (2016), p. 129.

3.12.5 Decision trees

183 Freund & Schapire (1997).

184 T. Chen & Guestrin (2016).

185 Aytekin (2022).

186 Grinsztajn, Oyallon, & Varoquaux (2022).

187 Coadou (2022).

3.12.6 Clustering

188 Hartigan (1985).

189 Slonim, Atwal, Tkacik, & Bialek (2005).

190 Batson, Haaf, Kahn, & Roberts (2021).

191 Hennig (2015).

192 Lauc (2020), p. 103–4.

193 Ronen, Finder, & Freifeld (2022).

194 Fang, Z. et al. (2022).

See also:

3.13 Deep learning

3.13.1 Introduction

195 Gardner & Dorling (1998).

196 Schmidhuber (2024).

197 Bengio (2009).

198 LeCun, Bengio, & Hinton (2015).

199 Sutskever (2015).

200 Goodfellow et al. (2016).

201 Kaplan, J. et al. (2019).

202 Rumelhart, Hinton, & Williams (1986).

203 Amari (1993).

204 LeCun & Bottou (1998).

205 Scardapane (2024).

206 Bottou (1998).

207 Norvig (2011).

208 Sutton (2019).

209 Frankle & Carbin (2018).

Figure 3.11: Raw input image is transformed into gradually higher levels of representation. 210

210 Bengio (2009).

3.13.2 Gradient descent

Training can be framed as finding the function \(\hat{f}\), within a family \(\mathcal{F}\), that minimizes the expected loss over the data distribution:

\[ \hat{f} = \underset{f \in \mathcal{F}}{\mathrm{argmin}} \underset{x \sim \mathcal{X}}{\mathbb{E}} L(f, x) \]

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

\[ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] \]

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), \(\vec{x}_{bt}\), and take a gradient descent step with each minibatch:

\[ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{\vec{x}_i \in \vec{x}_{bt}} \frac{\partial L(\vec{x}_i)}{\partial \vec{w}}[t] \]

where

  • \(t \in \mathbf{N}\) is the learning step number
  • \(\eta\) is the learning rate
  • \(m\) is the number of samples in a minibatch, called the batch size
  • \(L\) is the loss function
  • \(\frac{\partial L}{\partial \vec{w}}\) is the gradient
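The update rule above can be sketched end-to-end on a toy problem: minimize the squared loss \(L(w) = \frac{1}{N}\sum_i (w - x_i)^2\), whose optimum is the sample mean (the seed, batch size, and learning rate are arbitrary illustrative choices):

```python
import random

random.seed(4)
data = [random.gauss(3.0, 1.0) for _ in range(1000)]

w, eta, m = 0.0, 0.1, 32                    # initial weight, learning rate, batch size
for step in range(500):
    batch = random.sample(data, m)           # draw a minibatch
    grad = sum(2.0 * (w - x) for x in batch) / m
    w -= eta * grad                          # one SGD step
```

After training, \(w\) hovers near the sample mean, with residual fluctuations set by the batch size and learning rate.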

3.13.3 Deep double descent

Papers:

211 Belkin, Hsu, Ma, & Mandal (2019).

212 Muthukumar, Vodrahalli, Subramanian, & Sahai (2019).

213 Nakkiran, P. et al. (2019).

214 Chang, Li, Oymak, & Thrampoulidis (2020).

215 Holzmüller (2020).

216 Dar, Muthukumar, & Baraniuk (2021).

217 Balestriero, Pesenti, & LeCun (2021).

218 Belkin (2021).

219 Nagarajan (2021).

220 Bach (2022), p. 225–230.

221 Ghosh & Belkin (2022).

222 Singh, Lucchi, Hofmann, & Schölkopf (2022).

223 Hastie, Montanari, Rosset, & Tibshirani (2022).

224 Bubeck & Sellke (2023).

225 Gamba, Englesson, Björkman, & Azizpour (2022).

226 Schaeffer, R. et al. (2023).

227 Yang & Suzuki (2023).

228 Maddox, Benton, & Wilson (2023).

Blogs:

229 Steinhardt (2022).

230 Henighan, T. et al. (2023).

Twitter threads:

3.13.4 Regularization

Regularization = any change we make to the training algorithm in order to reduce the generalization error but not the training error. 231

231 Mishra, D. (2020). Weight Decay == L2 Regularization?

Most common regularizations:

  • L2 Regularization
  • L1 Regularization
  • Data Augmentation
  • Dropout
  • Early Stopping
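As a concrete instance, L2 regularization adds a penalty \(\frac{\lambda}{2}\|\vec{w}\|^2\) to the loss; for plain SGD this is equivalent to weight decay. A minimal sketch (the gradient values are hypothetical stand-ins):

```python
import numpy as np

def l2_regularized_grad(w, grad_loss, lam):
    """Gradient of L(w) + (lam/2) * ||w||^2."""
    return grad_loss + lam * w

# One SGD step with the L2 penalty, given some gradient of the data loss:
w = np.array([1.0, -2.0, 0.5])
grad_loss = np.array([0.1, 0.0, -0.1])
eta, lam = 0.1, 0.01
w_new = w - eta * l2_regularized_grad(w, grad_loss, lam)

# Equivalent "weight decay" form: shrink the weights, then take the plain step.
w_decay = (1 - eta * lam) * w - eta * grad_loss
assert np.allclose(w_new, w_decay)
```

The equivalence holds for plain SGD but not for adaptive optimizers like Adam, which is the point of the Mishra post cited above.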

Papers:

232 S. Chen, Dobriban, & Lee (2020).

3.13.5 Batch size vs learning rate

Papers:

  1. Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

  2. Hoffer, E. et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

    • \(\eta \propto \sqrt{m}\)
  3. Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

    • \(\eta \propto m\)
  4. You, Y. et al. (2017). Large batch training of convolutional networks.

    • Layer-wise Adaptive Rate Scaling (LARS)
  5. You, Y. et al. (2017). ImageNet training in minutes.

    • Layer-wise Adaptive Rate Scaling (LARS)
  6. Jastrzebski, S. (2018). Three factors influencing minima in SGD.

    • \(\eta \propto m\)
  7. Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

  8. Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

    • \(m \propto \eta\)
  9. Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when \(m \ll M\), the covariance matrix of the weight update \(\mathrm{Cov}(\eta \Delta\theta)\) scales linearly with the quantity \(\eta^2/m\).

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update \(\eta \Delta\theta\). Conversely, to keep the scaling of the covariance of the weight update vector \(\eta \Delta\theta\) constant would require scaling \(\eta\) with the square root of the batch size \(m\) (Krizhevsky, 2014; Hoffer et al., 2017).

  10. Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.

    • Post-local SGD
  11. Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as \(\eta \propto \sqrt{m}\) attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

  12. McCandlish, S. et al. (2018). An empirical model of large-batch training.

    • Critical batch size
  13. Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling (\(b\)-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

  14. Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.

    • \(\eta \propto m\)

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

  15. You, Y. et al. (2019). Large-batch training for LSTM and beyond.

    • Warmup and use \(\eta \propto m\)

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

  16. You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.

    • LARS and LAMB
  17. Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

  18. Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

  19. Kaplan, J. et al. (2020). Scaling laws for neural language models.

  20. Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.

Blogs:

3.13.6 Normalization

233 Chiley, V. et al. (2019).

234 Kiani, Balestriero, Lecun, & Lloyd (2022).

235 Huang, L. et al. (2020).

3.13.7 Finetuning

236 Hu, E.J. et al. (2021).

237 Dettmers, Pagnoni, Holtzman, & Zettlemoyer (2023).

238 Zhao, J. et al. (2024).

239 Huh, M. et al. (2024).

3.13.8 Computer vision

240 Fukushima & Miyake (1982).

241 LeCun, Y. et al. (1989).

242 LeCun, Bottou, Bengio, & Haffner (1998).

243 Ciresan, Meier, Masci, & Schmidhuber (2012).

244 Krizhevsky, Sutskever, & Hinton (2012).

245 Simonyan & Zisserman (2014).

246 He, Zhang, Ren, & Sun (2015).

247 Haber & Ruthotto (2017) and Haber, E. et al. (2018).

248 Howard, A.G. et al. (2017).

249 R. T. Q. Chen, Rubanova, Bettencourt, & Duvenaud (2018).

250 Tan & Le (2019).

251 Dosovitskiy, A. et al. (2020).

252 Tan & Le (2021).

253 H. Liu, Dai, So, & Le (2021).

254 Dhariwal & Nichol (2021).

255 Liu, Y. et al. (2021).

256 Ingrosso & Goldt (2022).

257 Park & Kim (2022).

258 Zhao, Y. et al. (2023).

259 Nakkiran, Bradley, Zhou, & Advani (2024).

Resources:

3.13.9 Natural language processing

3.13.9.1 Introduction

260 Firth (1957).

261 Nirenburg (1996).

262 Hutchins (2000).

263 Jurafsky & Martin (2022).

264 Z. Liu, Lin, & Sun (2023).

3.13.9.2 word2vec

265 Mikolov, Chen, Corrado, & Dean (2013), Mikolov, Yih, & Zweig (2013), and Mikolov, T. et al. (2013).

266 Kun (2018), p. 176–8.

3.13.9.3 RNNs

267 Hochreiter & Schmidhuber (1997).

268 Graves (2013).

Chain rule of language modeling (chain rule of probability):

\[ P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

or for the whole sequence:

\[ P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

\[ = P(x_1) \: P(x_2 | x_1) \: P(x_3 | x_1 x_2) \: P(x_4 | x_1 x_2 x_3) \ldots \]

A language model (LM) predicts the next token given the previous context. The model outputs a vector of logits, which a softmax converts into probabilities for the next token.

\[ P(x_t | x_1 \ldots x_{t-1}) = \mathrm{softmax}\left( \mathrm{model}(x_1 \ldots x_{t-1}) \right) \]

Auto-regressive inference follows this chain rule. If done with greedy search:

\[ \hat{x}_t = \underset{x_t \in V}{\mathrm{argmax}} \: P(x_t | x_1 \ldots x_{t-1}) \]
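The chain rule and greedy decoding above can be sketched with a toy stand-in for a trained model; the bigram logit table `W` is purely illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical stand-in for a trained LM: logits depend only on the last token.
V = 4  # vocabulary size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, V))  # W[prev] = logits for the next token

def model(context):
    return W[context[-1]]  # logits for x_t given x_1 ... x_{t-1}

def greedy_decode(context, steps):
    context = list(context)
    for _ in range(steps):
        probs = softmax(model(context))      # P(x_t | x_1 ... x_{t-1})
        context.append(int(probs.argmax()))  # greedy: take the most probable token
    return context

seq = greedy_decode([0], steps=5)

# Chain rule: log P(x_1 ... x_T) = sum_t log P(x_t | x_1 ... x_{t-1})
logp = sum(np.log(softmax(W[seq[t - 1]])[seq[t]]) for t in range(1, len(seq)))
```

Beam search replaces the single `argmax` with the top-\(k\) partial sequences ranked by cumulative log probability.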

Beam search:

269 Sutskever, Vinyals, & Le (2014), p. 4.

270 Zhang (1998).

271 Zhou & Hansen (2005).

272 Collobert, Hannun, & Synnaeve (2019).

Backpropagation through time (BPTT):

273 Werbos (1990).

Neural Machine Translation (NMT):

  • Sutskever seq2seq 274
  • Bahdanau attention 275 and GNMT 276
  • Review by Stahlberg 277

274 Sutskever et al. (2014).

275 Bahdanau, Cho, & Bengio (2015).

276 Wu, Y. et al. (2016).

277 Stahlberg (2019).

3.13.9.4 Transformers

Figure 3.12: Diagram of the Transformer model (source: d2l.ai).

\[ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\, K^\intercal}{\sqrt{d_k}}\right) V \]
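The scaled dot-product attention formula translates directly into NumPy; the shapes below are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V                   # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))    # 5 queries of dimension d_k = 8
K = rng.normal(size=(7, 8))    # 7 keys
V = rng.normal(size=(7, 16))   # 7 values of dimension 16
out = attention(Q, K, V)       # shape (5, 16)
```

The \(\sqrt{d_k}\) factor keeps the dot products from growing with dimension, which would otherwise saturate the softmax.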

3.13.9.4.1 Attention and Transformers

278 Vaswani, A. et al. (2017).

279 Devlin, Chang, Lee, & Toutanova (2018).

280 Liu, Y. et al. (2019).

281 Raffel, C. et al. (2019).

282 Lan, Z. et al. (2019).

283 Lewis, M. et al. (2019).

284 Radford, Narasimhan, Salimans, & Sutskever (2018).

285 Radford, A. et al. (2019).

286 Brown, T.B. et al. (2020).

287 Yang, Z. et al. (2019).

288 Zaheer, M. et al. (2020).

289 Edelman, Goel, Kakade, & Zhang (2021).

290 Tay, Dehghani, Bahri, & Metzler (2022).

291 Phuong & Hutter (2022).

292 Chowdhery, A. et al. (2022).

293 Ouyang, L. et al. (2022).

294 Wolfram (2023).

295 OpenAI (2023).

296 Mohamadi, S. et al. (2023).

297 Zhao, W.X. et al. (2023).

298 Golovneva, Wang, Weston, & Sukhbaatar (2024).

Figure 3.13: Diagram of the BERT model (source: peltarion.com).
3.13.9.4.2 Computational complexity of transformers
3.13.9.4.3 Efficient transformers

299 Dao, T. et al. (2022).

3.13.9.4.4 What comes after Transformers?

300 Gu, Goel, & Ré (2021).

301 Merrill & Sabharwal (2022).

302 Bulatov, Kuratov, & Burtsev (2022).

303 Bulatov, Kuratov, & Burtsev (2023).

304 Bertsch, Alon, Neubig, & Gormley (2023).

305 Mialon, G. et al. (2023).

306 Peng, B. et al. (2023).

307 Sun, Y. et al. (2023).

308 Gu & Dao (2023).

309 Wang, H. et al. (2023).

310 Ma, S. et al. (2024).

311 Ma, X. et al. (2024).

312 Bhargava, Witkowski, Shah, & Thomson (2023).

313 Dao & Gu (2024).

314 Banerjee, Agarwal, & Singla (2024).

315 Jingyu Liu et al (2025).

3.13.9.5 Evaluation methods

3.13.9.6 Scaling laws in NLP

316 Hestness, J. et al. (2017).

317 Church & Hestness (2019).

318 Kaplan, J. et al. (2020).

319 Rae, J.W. et al. (2022).

320 Hoffmann, J. et al. (2022).

321 Caballero, Gupta, Rish, & Krueger (2022).

322 Muennighoff, N. et al. (2023).

323 Pandey (2024).

324 Bach (2024).

325 Finzi, M. et al. (2025).

3.13.9.7 Language understanding

326 Mahowald, K. et al. (2023).

327 Kosinski (2023).

See also:

3.13.9.8 Interpretability

328 Watson & Floridi (2019).

329 Gurnee, W. et al. (2023).

330 Meng, Bau, Andonian, & Belinkov (2023).

331 McDougall, C. et al. (2023).

Linear probes:

332 Alain & Bengio (2016).

333 Belinkov (2022).

334 Gurnee & Tegmark (2023).

3.13.10 Reinforcement learning

Pedagogy:

335 Sutton & Barto (2018).

336 Arulkumaran, Deisenroth, Brundage, & Bharath (2017).

337 Cesa-Bianchi & Lugosi (2006).

Tutorials:

More:

338 Xu, Hasselt, & Silver (2018).

339 Silver, Singh, Precup, & Sutton (2024).

340 Javed & Sutton (2024).

See also:

3.13.10.1 Q-learning

  • Q-learning and DQN
  • Uses the Markov Decision Process (MDP) framework
  • The Bellman equation 341
  • Q-learning is a value-based learning algorithm: it updates the value function directly using an update rule (in particular, the Bellman equation). Policy-based algorithms, in contrast, estimate the value function with a greedy policy obtained from the last policy improvement (source: towardsdatascience.com).
  • DQN masters Atari 342

341 Bellman (1952).

342 Mnih, V. et al. (2013) and Mnih, V. et al. (2015).
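The tabular Q-learning update derived from the Bellman optimality equation can be sketched on a hypothetical toy chain MDP (all states, rewards, and hyperparameters here are illustrative):

```python
import numpy as np

# Hypothetical chain MDP: states 0..4, actions 0 (left) / 1 (right),
# reward 1 on reaching the terminal state 4.
n_states, n_actions = 5, 2
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s2 == n_states - 1 else 0.0
    return s2, reward, s2 == n_states - 1  # next state, reward, done

for _ in range(300):  # episodes under a uniform-random behavior policy (off-policy)
    s, done = 0, False
    while not done:
        a = int(rng.integers(n_actions))
        s2, r, done = step(s, a)
        # Q-learning update from the Bellman optimality equation:
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)  # greedy policy: moves right in states 0..3
```

Because Q-learning is off-policy, the greedy policy is recovered even though the data was collected by a purely random behavior policy.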

3.13.10.2 AlphaZero

343 Silver, D. et al. (2016).

344 Silver, D. et al. (2017b).

345 Silver, D. et al. (2017a).

3.13.10.3 Regret minimization

Regret matching (RM)

346 Hart & Mas‐Colell (2000).

Consider a game like rock-paper-scissors, where there is only one action per round. Let \(v^{t}(a)\) be the value observed when playing action \(a\) on iteration \(t\).

TODO: explain that the entire rewards vector, \(v^{t}(a)\), over \(a\) is observable after the chosen action is played.

Let a strategy, \(\sigma^t\), be a probability distribution over actions, \(a \in A\). Then the value of a strategy, \(v^{t}(\sigma^{t})\), is the expectation of its value over actions.

\[ v^{t}(\sigma^{t}) = \sum_{a \in A} \sigma^{t}(a) \: v^{t}(a) \]

Regret, \(R^{T}\), measures how much better some sequence of strategies, \(\sigma'\), would do compared to the chosen sequence of strategies, \(\sigma = \{\sigma^1, \sigma^2, \ldots \sigma^T\}\).

\[ R^{T} \equiv \sum_{t=1}^{T} \left( v^{t}({\sigma'}^{t}) - v^{t}(\sigma^{t}) \right) \]

External regret, \(R^{T}(a)\), measures the regret of the chosen sequence of strategies versus a hypothetical strategy where action \(a\) is always chosen.

\[ R^{T}(a) \equiv \sum_{t=1}^{T} \left( v^{t}(a) - v^{t}(\sigma^{t}) \right) \]

Regret Matching (RM) is a rule to determine the strategy for the next iteration:

\[ \sigma^{t+1}(a) \equiv \frac{ R^{t}_{+}(a) }{ \sum_{b \in A} R^{t}_{+}(b) } \]

where \(R_{+} \equiv \mathrm{max}(R, 0)\).

At the end of training, the recommended strategy, the one carrying the convergence guarantees, is not the final strategy used in training, \(\sigma^{T}\), but the average strategy over all time steps:

\[ \bar{\sigma}^{T}(a) = \frac{1}{T} \sum_{t=1}^{T} \sigma^{t}(a) \]
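For intuition, here is a minimal self-play sketch of regret matching on rock-paper-scissors (the random regret initialization, which breaks the symmetric fixed point, and the iteration count are illustrative choices). The average strategies approach the uniform Nash equilibrium.

```python
import numpy as np

# Rock-paper-scissors payoffs for player 0: U[a, b] = u_0(a, b); u_1 = -u_0.
U = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def rm_strategy(R):
    """Regret matching: play actions in proportion to positive regret."""
    Rp = np.maximum(R, 0)
    return Rp / Rp.sum() if Rp.sum() > 0 else np.full(len(R), 1 / len(R))

rng = np.random.default_rng(0)
R = [rng.random(3), rng.random(3)]   # small random regrets to break symmetry
avg = [np.zeros(3), np.zeros(3)]

T = 20000
for t in range(T):
    sig = [rm_strategy(R[0]), rm_strategy(R[1])]
    for i in range(2):
        avg[i] += sig[i]
        payoff = U if i == 0 else -U.T
        v_a = payoff @ sig[1 - i]     # v^t(a): value of each pure action
        v_sigma = sig[i] @ v_a        # v^t(sigma^t)
        R[i] += v_a - v_sigma         # accumulate external regrets R^t(a)

sigma_bar = [a / T for a in avg]      # average strategy: approaches uniform Nash
```

The current strategies \(\sigma^{t}\) keep cycling; it is only the time average \(\bar{\sigma}^{T}\) that converges, which is the point made above.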

  • TODO: explain the convergence of \(\bar{\sigma}^{t}\) to an \(\varepsilon\)-Nash equilibrium.
  • Roughgarden, T. (2016). Twenty Lectures on Algorithmic Game Theory. 347

347 Roughgarden (2016).

Counterfactual regret minimization (CFR)

348 Zinkevich, Johanson, Bowling, & Piccione (2007).

349 N. Brown (2020), p. 12.

350 Zinkevich et al. (2007), p. 4.

351 Tammelin (2014).

352 Tammelin, Burch, Johanson, & Bowling (2015).

353 Burch, Moravcik, & Schmid (2019).

354 N. Brown & Sandholm (2019a).

TODO: explain extensive-form games.

A finite extensive game with imperfect information has the following components: 355

355 Zinkevich et al. (2007) and Lanctot, Waugh, Zinkevich, & Bowling (2009).

  • A finite set \(N\) of players. A finite set \(H\) of sequences, the possible histories of actions, such that the empty sequence is in \(H\) and every prefix of a sequence in \(H\) is also in \(H\). Define \(h \sqsubseteq h'\) to mean \(h\) is a prefix of \(h'\). \(Z \subseteq H\) are the terminal histories (those which are not a prefix of any other sequences). \(A(h) = \{a : ha \in H\}\) are the actions available after a non-terminal history, \(h \in H \backslash Z\).
  • A function \(P\) that assigns to each non-terminal history a member of \(N \cup \{c\}\). \(P\) is the player function. \(P(h)\) is the player who takes an action after the history \(h\). If \(P(h) = c\) then chance determines the action taken after history \(h\).
  • For each player \(i \in N \cup \{c\}\) a partition \(\mathcal{I}_i\) of \(\{h \in H : P(h) = i\}\) with the property that \(A(h) = A(h')\) whenever \(h\) and \(h'\) are in the same member of the partition. For \(I \in \mathcal{I}_i\) we denote by \(A(I_i)\) the set \(A(h)\) and by \(P(I_i)\) the player \(P(h)\) for any \(h \in I_i\). \(\mathcal{I}_i\) is the information partition of player \(i\); a set \(I_i \in \mathcal{I}_i\) is an information set of player \(i\).
  • A function \(f_c\) that associates with every information set \(I\) where \(P(I) = c\), a probability measure \(f_c(a|I)\) on \(A(h)\); \(f_c(a|I)\) is the probability that \(a\) occurs given some \(h \in I\), where each such probability measure is independent of every other such measure.
  • For each player \(i \in N\) there is a utility function \(u_i\) from the terminal states \(Z\) to the reals \(\mathbb{R}\). If \(N = \{1, 2\}\) and \(u_1 = -u_2\), it is a zero-sum extensive game. Define \(\Delta_{u,i} \equiv \mathrm{max}_z \: u_i(z) - \mathrm{min}_z \: u_i(z)\) to be the range of utilities to player \(i\).

The player reach, \(\pi^{\sigma}_{i}(h)\), of a history \(h\) is the product of the probabilities for all agent \(i\) actions leading to \(h\). Formally, 356

356 N. Brown (2020), p. 6.

\[ \pi^{\sigma}_{i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') = i} \sigma_{i}(h', a') \]

Due to perfect recall, any two histories in infoset \(I_i\) have the same player reach for player \(i\). Thus, we similarly define the player reach \(\pi^{\sigma}_{i}(I_i)\) of infoset \(I_i\) as

\[ \pi^{\sigma}_{i}(I_i) \equiv \prod_{ {I'}_{i} \cdot a' \sqsubseteq I_i | P(I_i) = i } \sigma_{i}({I'}_{i}, a') = \left.\pi^{\sigma}_{i}(h)\right|_{h \in I_i} \]

The external reach AKA opponent reach, \(\pi^{\sigma}_{-i}(h)\), of a history \(h\) is the contribution of chance and of all players other than \(i\). Formally,

\[ \pi^{\sigma}_{-i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') \neq i} \sigma_{P(h')}(h', a') \]

We also define the external reach of an infoset as

\[ \pi^{\sigma}_{-i}(I_i) \equiv \sum_{h \in I_{i}} \pi^{\sigma}_{-i}(h) \]

The counterfactual value of an infoset \(I\) is the expected utility to player \(i\) given that \(I\) has been reached, weighted by the external reach of \(I\) for player \(i\). Formally, 357

357 N. Brown (2020), p. 12.

\[ v(I) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

The counterfactual value of an action, \(a\), is

\[ v(I, a) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h \cdot a, z) \: u_{i}(z) \]

Let’s consider the case where, as in No-Limit Hold’em (NLHE), each holding of private hole cards corresponds to a single unique history \(h\), and we form infosets with a single hand, so \(I = h\). Then

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

making explicit the player reach and the external reach,

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi_{i}^{\sigma}(h, z) \: \pi_{-i}^{\sigma}(h, z) \: u_{i}(z) \]

At a leaf node where we finally calculate the rewards,

\[ v(z) = \pi^{\sigma}_{-i}(z) \: u_{i}(z) \]

TODO: explain CFR.

The instantaneous regret is

\[ r^{t}(I, a) = v^{\sigma^t}(I, a) - v^{\sigma^t}(I) \]

The (cumulative) counterfactual regret is

\[ R^{T}(I, a) = \sum_{t=1}^{T} r^{t}(I, a) \]

As in the single-action game discussed above, applying regret matching during training means updating strategies according to the following rule.

\[ \sigma^{t+1}(I, a) \equiv \frac{ R^{t}_{+}(I, a) }{ \sum_{b \in A} R^{t}_{+}(I, b) } \]

The average strategy is

\[ \bar{\sigma}^{T}(I, a) = \frac{ \sum_{t=1}^{T} \pi^{t}_{i}(I) \: \sigma^{t}(I, a) }{ \sum_{t=1}^{T} \pi^{t}_{i}(I) } \]

Monte Carlo Counterfactual Regret Minimization (MCCFR)

358 Lanctot et al. (2009).

359 Neller & Lanctot (2013).

360 Burch, Lanctot, Szafron, & Gibson (2012).

361 Johanson, M. et al. (2012).

362 Schmid, M. et al. (2019).

363 Li, H. et al. (2020).

364 Habara, Fukuda, & Yamashita (2023).

365 Lanctot (2013).

366 Gibson (2014).

367 Johanson (2016).

368 Burch (2018).

369 Horacek (2022).

TODO: explain MCCFR.

External sampling MCCFR:

\[ \tilde{v}^{\sigma}_{i}(I) = \sum_{z \in Q} u_{i}(z) \: \pi^{\sigma}_{i}(z[I] \rightarrow z) \]

Best response and exploitability

Best response:

\[ \mathrm{BR}(\sigma_{-i}) = \underset{\sigma_{i}^{\prime}}{\mathrm{argmax}} \: u_{i}(\sigma_{i}^{\prime}, \sigma_{-i}) \]

TODO: Local Best Response (LBR). 370

370 Lisy & Bowling (2016), p. 2.

Exploitability:

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \mathrm{BR}(\sigma_{i})) \]

NashConv 371 exploitability uses the convention:

371 See NashConv exploitability defined in Lanctot, M. et al. (2017).

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \sigma_{-i}) \]

The average exploitability per player is

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} \varepsilon_{i}(\sigma) \]

Note that in zero-sum games, when summing over players, the second terms in NashConv sum to zero. 372

372 Timbers (2020), p. 3.

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) \]

In two-player games:

\[ \varepsilon(\sigma) = \frac{1}{2} \Big( u_{1}(\mathrm{BR}(\sigma_{2}), \sigma_{2}) + u_{2}(\sigma_{1}, \mathrm{BR}(\sigma_{1})) \Big) \]
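To make the definitions concrete, here is a small sketch computing NashConv exploitability in a two-player zero-sum matrix game (rock-paper-scissors; the strategy profiles are hypothetical). In a matrix game the best response reduces to a max over pure actions.

```python
import numpy as np

# RPS payoffs for player 1: U[a1, a2]; player 2 receives -U[a1, a2].
U = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def nashconv_exploitability(s1, s2):
    """Average NashConv exploitability of the profile (s1, s2)."""
    br1 = (U @ s2).max()       # u_1(BR(s2), s2): best pure-action value vs s2
    br2 = (-U.T @ s1).max()    # u_2(s1, BR(s1)): best pure-action value vs s1
    u1 = s1 @ U @ s2           # u_1(s1, s2); u_2 = -u1 in a zero-sum game
    # Average of NashConv_i = u_i(BR(sigma_{-i}), sigma_{-i}) - u_i(sigma):
    return 0.5 * ((br1 - u1) + (br2 + u1))

uniform = np.ones(3) / 3
biased = np.array([0.5, 0.25, 0.25])

print(nashconv_exploitability(uniform, uniform))  # 0.0 at the Nash equilibrium
print(nashconv_exploitability(uniform, biased))   # positive: biased is exploitable
```

Note how the \(u_1\) terms cancel in the sum, matching the zero-sum simplification above.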

373 Johanson, Waugh, Bowling, & Zinkevich (2011).

374 Ponsen, De Jong, & Lanctot (2011).

375 Lisy & Bowling (2016).

376 Timbers (2020).

3.13.10.4 Solving poker

377 Kuhn (1950).

378 Southey, F. et al. (2012).

379 Billings, Davidson, Schaeffer, & Szafron (2002).

380 Billings, D. et al. (2003).

381 Johanson (2013).

382 Bowling, Burch, Johanson, & Tammelin (2015).

383 Heinrich & Silver (2016).

384 Moravcik, M. et al. (2017).

385 N. Brown & Sandholm (2017).

386 N. Brown & Sandholm (2018).

387 N. Brown & Sandholm (2019a).

388 N. Brown, Lerer, Gross, & Sandholm (2019).

389 N. Brown & Sandholm (2019b).

390 N. Brown, Bakhtin, Lerer, & Gong (2020).

391 N. Brown (2020).

392 Schmid, M. et al. (2021).

393 Kovarik, V. et al. (2022).

3.13.11 Applications in physics

394 Denby (1988).

395 Denby (1993).

396 Spears, B.K. et al. (2018).

397 Cranmer, Seljak, & Terao (2021).

398 Liu, Z. et al. (2024).

399 Jiao, L. et al. (2024).

See also:

3.14 Theoretical machine learning

3.14.1 Algorithmic information theory

400 Cilibrasi & Vitanyi (2005).

401 Hutter (2007).

402 Rathmanner & Hutter (2011).

3.14.2 No free lunch theorems

403 Wolpert & Macready (1995).

404 Wolpert (1996).

405 Wolpert & Macready (1997).

406 Shalev-Shwarz & Ben-David (2014), p. 60–66.

407 McDermott (2019).

408 Wolpert (2007).

409 Wolpert & Kinney (2020).

410 Wolpert (2023).

411 Mitchell (1980).

412 Roberts (2021).

413 Goldreich & Ron (1997).

414 Joyce & Herrmann (2017).

415 H. W. Lin, Tegmark, & Rolnick (2017).

416 Lauc (2020).

417 Nakkiran (2021).

418 Bousquet, O. et al. (2021).

419 Andrews (2023).

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available. 420

420 Raissi, Perdikaris, & Karniadakis (2017a), p. 2.

Roberts:

From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon. 421

421 Roberts (2021), p. 7.

  • TODO: Note Dennett’s discussion of compression. 422

422 Dennett (1991), p. TODO.

3.14.3 Connectivists vs symbolicists

3.14.4 Graphical tensor notation

3.14.5 Universal approximation theorem

423 Minsky & Papert (1969), p. TODO.

424 Hornik, Stinchcombe, & White (1989).

425 Lu, Z. et al. (2017).

426 H. Lin & Jegelka (2018).

427 Ismailov (2020).

428 Bishop (2006), p. 230.

3.14.6 Relationship to statistical mechanics

429 Opper & Kinzel (1996).

430 Opper (2001).

431 Bahri, Y. et al. (2020).

432 Halverson, Maiti, & Stoner (2020).

433 Canatar, Bordelon, & Pehlevan (2020).

434 Roberts, Yaida, & Hanin (2021).

435 Cantwell (2022).

436 Dinan, Yaida, & Zhang (2023).

437 Sohl-Dickstein (2020).

438 Aifer, M. et al. (2023).

439 Geshkovski, Letrouit, Polyanskiy, & Rigollet (2023).

3.14.7 Relationship to gauge theory

Invariant:

\[ f(g x) = f(x) \]

Equivariant:

\[ f(g x) = g' f(x) \]

Same-equivariance is the special case where \(g' = g\).
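These definitions can be checked numerically. The sketch below assumes a circular 1-D convolution, so the translation group acts cleanly by `np.roll`; the signal and filter are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)       # signal
k = rng.normal(size=5)        # filter

def circ_conv(x, k):
    """Circular cross-correlation of x with k (a 'convolution' in the ML sense)."""
    n, m = len(x), len(k)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(m)) for i in range(n)])

g = lambda v: np.roll(v, 3)   # group action: translate by 3

# Equivariance with g' = g: f(g x) = g f(x) for translation and convolution.
assert np.allclose(circ_conv(g(x), k), g(circ_conv(x, k)))

# Invariance: global sum-pooling after convolution gives f(g x) = f(x).
assert np.isclose(circ_conv(g(x), k).sum(), circ_conv(x, k).sum())
```

This is the translation-group instance of the group-convolution story developed in the references below.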

440 Dieleman, Fauw, & Kavukcuoglu (2016).

441 T. S. Cohen & Welling (2016).

442 T. S. Cohen, Weiler, Kicanaoglu, & Welling (2019).

443 Fuchs, Worrall, Fischer, & Welling (2020).

444 Bogatskiy, A. et al. (2023).

445 Marchetti, Hillar, Kragic, & Sanborn (2023).

446 Battiloro, C. et al. (2024).

447 Erdogan & Lucic (2025).

3.14.8 Thermodynamics of computation

448 Bérut, A. et al. (2012).

449 Bérut, Petrosyan, & Ciliberto (2015).

3.15 Information geometry

3.15.1 Introduction

450 Smith (2019).

451 Nielsen (2020).

452 Amari (1998).

453 Amari (2016).

3.15.2 Geometric understanding of classical statistics

454 Balasubramanian (1996a).

455 Balasubramanian (1996b).

456 Calin & Udriste (2014).

457 de Carvalho, Page, & Barney (2019).

3.15.3 Geometric understanding of deep learning

458 Lei, Luo, Yau, & Gu (2018).

459 Gao & Chaudhari (2020).

460 Bronstein, Bruna, Cohen, & Velickovic (2021).

3.16 Automation

3.16.1 Cybernetics

3.16.2 AutoML

  • Neural Architecture Search (NAS)
  • AutoML frameworks
  • RL-driven NAS
  • learned sparsity

3.16.3 Surrogate models

461 Fefferman, Mitter, & Narayanan (2016).

462 Raissi et al. (2017a) and Raissi, Perdikaris, & Karniadakis (2017b).

463 Karniadakis, G.E. et al. (2021).

464 Howard, Mandt, Whiteson, & Yang (2021).

465 Thuerey, N. et al. (2021).

466 Cranmer, Pavez, & Louppe (2015).

467 Cranmer, Brehmer, & Louppe (2019).

468 Baydin, A.G. et al. (2019).

Lectures:

3.16.4 AutoScience

469 C. Anderson (2008).

470 Asch, M. et al. (2018).

471 D’Agnolo & Wulzer (2019).

472 Krenn, M. et al. (2022).

473 Udrescu & Tegmark (2020).

474 Cranmer, M. et al. (2020).

475 Z. Liu, Madhavan, & Tegmark (2022).

476 Lu, C. et al. (2024).

Figure 3.14: The inference cycle for the process of scientific inquiry. The three distinct forms of inference (abduction, deduction, and induction) facilitate an all-encompassing vision, enabling HPC and HDA to converge in a rational and structured manner. HPC: high-performance computing; HDA: high-end data analysis. 477

477 Asch, M. et al. (2018).

See also:

3.17 Implications for the realism debate

3.17.1 Introduction

478 Korb (2001).

479 Williamson (2009).

480 Bensusan (2000).

See also:

3.17.2 Real clusters

  • Nope: Hennig

See also:

3.17.3 Word meanings

481 Kornai (2023).

482 Perone (2018).

483 Skelac & Jandric (2020).

484 Grietzer (2017b).

485 Grietzer (2017a).

486 Tenney, I. et al. (2019).

487 Nissim, Noord, & Goot (2019).

488 Bender & Koller (2020).

489 Elhage, N. et al. (2022).

490 Patel & Pavlick (2022).

491 Lovering & Pavlick (2022).

492 Huh, Cheung, Wang, & Isola (2024).

493 Jamali, M. et al. (2024).

494 Coecke, Sadrzadeh, & Clark (2010).

495 Grefenstette & Sadrzadeh (2011).

496 Balkir, Kartsaklis, & Sadrzadeh (2015).

497 Tyrrell (2018).

498 Kartsaklis & Sadrzadeh (2013).

499 Grefenstette, E. et al. (2013).

500 de Felice, Meichanetzidis, & Toumi (2019).

Wittgenstein in PI:

The meaning of a word is its use in the language. 501

501 Wittgenstein (2009), §43.

and

One cannot guess how a word functions. One has to look at its use, and learn from that. 502

502 Wittgenstein (2009), §340.

Piantadosi:

Modern large language models integrate syntax and semantics in the underlying representations: encoding words as vectors in a high-dimensional space, without an effort to separate out e.g. part of speech categories from semantic representations, or even predict at any level of analysis other than the literal word. Part of making these models work well was in determining how to encode semantic properties into vectors, and in fact initializing word vectors via encodings of distribution semantics from e.g. Mikolov et al. 2013 (Radford et al. 2019). Thus, an assumption of the autonomy of syntax is not required to make models that predict syntactic material and may well hinder it. 503

503 Piantadosi (2023), p. 15.

See also:

3.18 My thoughts

My docs:

My talks:

3.19 Annotated bibliography

3.19.1 Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge.

3.19.1.1 My thoughts

  • TODO

3.19.2 Cowan, G. (1998). Statistical Data Analysis.

3.19.2.1 My thoughts

  • TODO

3.19.3 James, F. (2006). Statistical Methods in Experimental Physics.

3.19.3.1 My thoughts

  • TODO

3.19.4 Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.

  • Cowan et al. (2011)
  • Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

3.19.4.1 My thoughts

  • TODO

3.19.5 ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.

3.19.5.1 My thoughts

  • TODO

3.19.6 Cranmer, K. (2015). Practical statistics for the LHC.

3.19.6.1 My thoughts

  • TODO