3  Philosophy of statistics

Published

October 10, 2025

Statistical analysis is central to addressing the problem of induction. Can inductive inference be formalized? What are its caveats? Can inductive inference be automated? How does machine learning work?

All knowledge is, in final analysis, history. All sciences are, in the abstract, mathematics. All judgements are, in their rationale, statistics.

– C. R. Rao 1

1 Rao (1997), p. x.

3.1 Introduction to the foundations of statistics

3.1.1 Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. The scientific method uses inductive inferences to make generalizations from finite data. This introduces unique avenues of error not found in purely deductive inferences, as in logic and mathematics. A deductive inference is necessarily sound if the argument is valid and all of its premises obtain; an inductive inference, by contrast, can be valid and only probably (not certainly) sound, and can therefore still err in some cases because the support it lends its conclusion is ultimately probabilistic.

A skeptic may further ask whether we are even justified in using the probabilities we use in inductive arguments. What is the probability that the Sun will rise tomorrow? What kinds of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.

See also:

3.1.2 Early investigators

  • Ibn al-Haytham (c. 965-1040)
    • “Ibn al-Haytham was an early proponent of the concept that a hypothesis must be supported by experiments based on confirmable procedures or mathematical evidence—an early pioneer in the scientific method five centuries before Renaissance scientists.” - Wikipedia
  • Gerolamo Cardano (1501-1576)
    • Book on Games of Chance (1564)
  • John Graunt (1620-1674)
  • Jacob Bernoulli (1655-1705)
    • Ars Conjectandi (1713, posthumous)
    • First modern phrasing of the problem of parameter estimation 2
    • See Hacking 3
    • Early vision of decision theory:

2 Edwards (1974), p. 9.

3 Hacking (1971).

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous. 4

4 Bernoulli, J. (1713). Ars Conjectandi, Chapter II, Part IV, defining the art of conjecture [wikiquote].

5 Venn (1888).

6 Jevons (1873a).

7 Jevons (1873b).

3.1.3 Foundations of modern statistics

  • Central limit theorem
  • Charles Sanders Peirce (1839-1914)
    • Formulated modern statistics in “Illustrations of the Logic of Science”, a series published in Popular Science Monthly (1877-1878), and also “A Theory of Probable Inference” in Studies in Logic (1883). 8
    • Introduced blinded, randomized controlled experiments with a repeated-measures design (before Fisher).
  • Karl Pearson (1857-1936)
    • The Grammar of Science (1892)
    • “On the criterion that a given system of deviations…” (1900) 9
      • Proposed testing the validity of hypothesized values by evaluating the chi-squared distance between the hypothesized and the empirically observed values, quantified via the \(p\)-value.
    • With Walter Frank Raphael Weldon, he established the journal Biometrika in 1901.
    • Founded the world’s first university statistics department at University College, London in 1911.
  • John Maynard Keynes (1883-1946)
    • Keynes, J. M. (1921). A Treatise on Probability. 10
  • Ronald Fisher (1890-1972)
    • Fisher significance of the null hypothesis (\(p\)-values)
      • “On an absolute criterion for fitting frequency curves” 11
      • “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population” 12
    • “On the ‘probable error’ of a coefficient of correlation deduced from a small sample” 13
      • Definition of likelihood
      • ANOVA
    • Statistical Methods for Research Workers (1925)
    • The Design of Experiments (1935)
    • “Statistical methods and scientific induction” 14
    • The Lady Tasting Tea 15
  • Jerzy Neyman (1894-1981)
  • Egon Pearson (1895-1980)
    • Neyman-Pearson confidence intervals with fixed error probabilities (also based on \(p\)-values, but considering two hypotheses introduces two types of error)
  • Harold Jeffreys (1891-1989)
    • objective (non-informative) Jeffreys priors
  • Andrey Kolmogorov (1903-1987)
  • C.R. Rao (1920-2023)
  • Ray Solomonoff (1926-2009)
  • Shun’ichi Amari (b. 1936)
  • Judea Pearl (b. 1936)

8 Peirce (1883), p. 126–181.

9 Pearson (1900).

10 Keynes (1921).

11 Fisher (1912).

12 Fisher (1915).

13 Fisher (1921).

14 Fisher (1955).

15 Salsburg (2001).

16 Reid (1998).

17 Neyman (1955).

18 Carnap (1960).

3.1.4 Pedagogy

19 Stuart, Ord, & Arnold (2010).

20 F. James (2006).

21 Cowan (1998) and Cowan (2016).

22 Cranmer (2015).

23 Jaynes (2003).

24 Lista (2016b).

25 Lista (2016a).

26 Cox (2006).

27 Behnke, Kröninger, Schott, & Schörner-Sadenius (2013).

28 Cousins (2018).

29 Weisberg (2019).

30 Gelman & Vehtari (2021).

31 Otsuka (2023).

3.3 Statistical models

3.3.1 Parametric models

  • Data: \(x_i\)
  • Parameters: \(\theta_j\)
  • Model: \(f(\vec{x} ; \vec{\theta})\)

3.3.2 Canonical distributions

3.3.2.1 Bernoulli distribution

\[ \mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \]

which can also be written as

\[ \mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\} \]

or

\[ \mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\} \]
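The equivalence of these three forms can be checked numerically (a quick sketch; \(p = 0.3\) is an arbitrary choice):

```python
# Three equivalent ways of writing the Bernoulli pmf Ber(k; p).
def ber_cases(k, p):
    return p if k == 1 else 1 - p

def ber_power(k, p):
    return p**k * (1 - p)**(1 - k)

def ber_linear(k, p):
    return p * k + (1 - p) * (1 - k)

p = 0.3
results = [(ber_cases(k, p), ber_power(k, p), ber_linear(k, p)) for k in (0, 1)]
agree = all(abs(a - b) < 1e-12 and abs(b - c) < 1e-12 for a, b, c in results)
```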

  • Binomial distribution
  • Poisson distribution

TODO: explain, another important relationship is

Figure 3.1: Relationships among Bernoulli, binomial, categorical, and multinomial distributions.

3.3.2.2 Normal/Gaussian distribution

\[ N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \]

and in \(k\) dimensions:

\[ N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^\intercal \:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \]

where \(\boldsymbol{\Sigma}\) is the covariance matrix of the distribution (defined in eq. 3.2).
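A minimal numerical sketch of the univariate density: it peaks at \(1/\sqrt{2\pi} \approx 0.399\) for a standard normal and integrates to approximately 1 (crude Riemann sum over an arbitrary grid):

```python
import math

# Univariate normal density N(x | mu, sigma^2), as defined above.
def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

peak = normal_pdf(0.0, 0.0, 1.0)   # 1/sqrt(2*pi) ~ 0.3989 at the mode
# Crude Riemann sum over [-10, 10]; should be close to 1 (normalization).
dx = 0.01
integral = sum(normal_pdf(i * dx, 0.0, 1.0) * dx for i in range(-1000, 1001))
```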

3.3.3 Central limit theorem

Let \(X_{1}\), \(X_{2}\), … , \(X_{n}\) be a random sample drawn from any distribution with a finite mean \(\mu\) and variance \(\sigma^{2}\). As \(n \rightarrow \infty\), the distribution of the standardized sample mean converges to the standard normal distribution:

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \rightarrow N(0, 1) \]
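A toy Monte Carlo illustration (sample size and trial count are arbitrary): standardized means of draws from a skewed exponential distribution are approximately standard normal:

```python
import random

random.seed(0)

# Standardized means of samples from Exponential(1) (mu = sigma = 1);
# by the CLT these should be approximately N(0, 1).
n, trials = 100, 2000
z = []
for _ in range(trials):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z.append((xbar - 1.0) / (1.0 / n**0.5))

z_mean = sum(z) / trials
z_var = sum((v - z_mean)**2 for v in z) / trials
# z_mean should be near 0 and z_var near 1, up to Monte Carlo error.
```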

  • \(\chi^2\) distribution
  • Univariate distribution relationships
Figure 3.2: Detail of a figure showing relationships among univariate distributions. See the full figure here 46.

46 Leemis & McQueston (2008).

  • The exponential family of distributions are maximum entropy distributions.

3.3.4 Mixture models

47 Cranmer, K. et al. (2012).

3.4 Point estimation and confidence intervals

3.4.1 Inverse problems

Recall that in the context of parametric models, the data, \(x_i\), have a pdf modeled by a function, \(f(x_i; \theta_j)\), with parameters, \(\theta_j\). In a statistical inverse problem, the goal is to infer values of the model parameters, \(\theta_j\), given some finite set of data, \(\{x_i\}\), sampled from a probability density, \(f(x_i; \theta_j)\), that models the data reasonably well 48.

48 This assumption that the model models the data “reasonably” well reflects that to the degree required by your analysis, the important features of the data match well within the systematic uncertainties parametrized within the model. If the model is incomplete because it is missing an important feature of the data, then this is the “ugly” (class-3) error in the Sinervo classification of systematic uncertainties.

49 Cowan (1998) and Cowan (2016), p. TODO.

3.4.2 Bias and variance

The bias of an estimator, \(\hat\theta\), is defined as

\[ \mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \]

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. 3.1) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

\[ \mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \]

The MSE of an estimator can be related to its bias and its variance by the following proof:

\[ \begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

noting that

\[ \mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2 \]

and

\[ \begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \left(\mathbb{E}(\hat{\theta} - \theta)\right)^2 \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align} \]

we see that MSE is equivalent to

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \]

For an unbiased estimator, the MSE is the variance of the estimator.
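This decomposition can be verified by Monte Carlo (a sketch with an arbitrary, deliberately biased estimator of a normal mean):

```python
import random

random.seed(1)

# Check MSE = Var + Bias^2 for the biased estimator theta_hat = 0.9 * sample
# mean, with true theta = 2.0 (toy choices).
theta, n, trials = 2.0, 10, 5000
estimates = []
for _ in range(trials):
    sample = [random.gauss(theta, 1.0) for _ in range(n)]
    estimates.append(0.9 * sum(sample) / n)

mean_est = sum(estimates) / trials
bias = mean_est - theta                              # expect about -0.2
var = sum((e - mean_est)**2 for e in estimates) / trials
mse = sum((e - theta)**2 for e in estimates) / trials
gap = abs(mse - (var + bias**2))                     # ~0 up to rounding
```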

TODO:

  • Note the discussion of the bias-variance tradeoff by Cranmer.
  • Note the new deep learning view. See Deep learning.

See also:

3.4.3 Maximum likelihood estimation

Maximum likelihood estimation (MLE) was first used by Fisher. 50

50 Aldrich (1997).

\[\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is equivalent to maximizing \(L(\theta)\), and the former is more convenient because for data that are independent and identically distributed (i.i.d.) the joint likelihood can be factored into a product of the likelihoods of individual measurements:

\[ L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta) \]

and taking the log of the product makes it a sum:

\[ \mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta) \]

Maximizing \(\mathrm{log} \: L(\theta)\) is also equivalent to minimizing \(-\mathrm{log} \: L(\theta)\), the negative log-likelihood (NLL). For i.i.d. data,

\[ \mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i \]
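As a small worked example (toy Poisson counts assumed), the NLL for i.i.d. Poisson data is minimized at the sample mean, which a crude grid search confirms:

```python
import math

# For i.i.d. Poisson counts, the MLE of the rate is the sample mean.
data = [3, 7, 4, 6, 5, 2, 8, 5]

def nll(lam):
    # -log L = sum_i [lam - k_i log(lam) + log(k_i!)]
    return sum(lam - k * math.log(lam) + math.lgamma(k + 1) for k in data)

# Crude grid search for the minimizer of the NLL.
grid = [0.5 + 0.001 * i for i in range(10000)]
lam_hat = min(grid, key=nll)
sample_mean = sum(data) / len(data)   # closed-form MLE: 5.0
```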

3.4.3.1 Invariance of likelihoods under reparametrization

  • Likelihoods are invariant under reparametrization. 51
  • Bayesian posteriors are not invariant in general.

51 F. James (2006), p. 234.

See also:

3.4.3.2 Ordinary least squares

  • Least squares from MLE of gaussian models: \(\chi^2\)
  • Ordinary Least Squares (OLS)
  • Geometric interpretation

52 Cox (2006), p. 11.

53 Murphy (2012), p. 222.
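To make the Gaussian-MLE connection concrete, here is a minimal OLS sketch (toy data assumed): minimizing the squared error with one regressor gives the familiar closed-form slope and intercept:

```python
# OLS for y = b0 + b1*x + noise: maximizing the Gaussian likelihood is
# minimizing the sum of squared residuals, with the closed form below.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]   # roughly y = 1 + 2x (toy data)

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar)**2 for x in xs)
b0 = ybar - b1 * xbar
```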

3.4.4 Variance of MLEs

54 Fréchet (1943), Cramér (1946), Rao (1945), and Rao (1947).

55 Rice (2007), p. 300–2.

56 Nielsen (2013).

57 Cowan (1998), p. 130-5.

58 F. James (2006), p. 234.

59 F. James & Roos (1975).

Figure 3.3: Transformation of non-parabolic log-likelihood to parabolic (source: my slides, recreation of F. James (2006), p. 235).

60 Loh (1987).

61 Wainer (2007).

62 Tegmark, Taylor, & Heavens (1997).

3.4.5 Bayesian credibility intervals

3.4.6 Uncertainty on measuring an efficiency

63 Clopper & Pearson (1934).

64 Agresti & Coull (1998).

65 Hanley & Lippman-Hand (1983).

66 L. D. Brown, Cai, & DasGupta (2001).

67 Casadei (2012).
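A sketch of the Clopper-Pearson construction cited above, implemented by bisection on the binomial tail probabilities rather than the usual beta-quantile shortcut (the 8-successes-in-10-trials example is hypothetical):

```python
import math

# Clopper-Pearson ("exact") confidence interval for an efficiency k/n.
def binom_cdf(k, n, p):
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    lo, hi = 0.0, 1.0
    if k > 0:
        a, b = 0.0, 1.0
        for _ in range(60):   # solve P(X >= k | p_lo) = alpha/2 by bisection
            mid = 0.5 * (a + b)
            if 1 - binom_cdf(k - 1, n, mid) < alpha / 2:
                a = mid
            else:
                b = mid
        lo = 0.5 * (a + b)
    if k < n:
        a, b = 0.0, 1.0
        for _ in range(60):   # solve P(X <= k | p_hi) = alpha/2 by bisection
            mid = 0.5 * (a + b)
            if binom_cdf(k, n, mid) > alpha / 2:
                a = mid
            else:
                b = mid
        hi = 0.5 * (a + b)
    return lo, hi

lo, hi = clopper_pearson(8, 10)   # 8 successes in 10 trials, 95% CL
```

The interval covers the true efficiency with probability at least \(1-\alpha\) (it over-covers, which is why it is called conservative).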

3.4.7 Examples

3.5 Statistical hypothesis testing

3.5.1 Null hypothesis significance testing

  • Karl Pearson observing how rare sequences of roulette spins are
  • Null hypothesis significance testing (NHST)
  • goodness of fit
  • Fisher

Fisher:

[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. 68

68 Fisher (1935), p. 16.

3.5.2 Neyman-Pearson theory

3.5.2.1 Introduction

69 Goodman (1999a). p. 998.

70 J. Cohen (1992).

71 ATLAS and CMS Collaborations (2011).

72 Cowan, Cranmer, Gross, & Vitells (2011).

Figure 3.4: TODO: ROC explainer. (Wikimedia, 2015).

See also:

3.5.2.2 Neyman-Pearson lemma

Neyman-Pearson lemma: 73

73 Neyman & Pearson (1933).

For a fixed signal efficiency, \(1-\alpha\), the selection that corresponds to the lowest possible misidentification probability, \(\beta\), is given by

\[ \frac{L(H_1)}{L(H_0)} > k_{\alpha} \]

where \(k_{\alpha}\) is the cut value required to achieve a type-1 error rate of \(\alpha\).

Neyman-Pearson test statistic:

\[ q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \]
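As a toy illustration (assuming simple hypotheses \(H_0: N(0,1)\) and \(H_1: N(1,1)\) for a single observation \(x\)), the statistic reduces to a monotone function of \(x\), so a cut on the likelihood ratio is equivalent to a cut on \(x\) itself:

```python
import math

# For H0: N(0,1) and H1: N(1,1), q_NP = -2 ln[L(H1)/L(H0)] = 1 - 2x.
def log_l(x, mu):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu)**2

def q_np(x):
    return -2.0 * (log_l(x, 1.0) - log_l(x, 0.0))

checks = [abs(q_np(x) - (1 - 2 * x)) < 1e-9 for x in (-1.0, 0.0, 0.5, 2.3)]
```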

Profile likelihood ratio:

\[ \lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \]

where \(\hat{\mu}\) and \(\hat{\theta}\) are the (unconditional) maximum-likelihood estimators that jointly maximize \(L\), while \(\hat{\theta}_\mu\) is the conditional maximum-likelihood estimator that maximizes \(L\) for a specified signal strength, \(\mu\), and the vector \(\theta\) collects all other parameters of interest and the nuisance parameters.

3.5.2.3 Neyman construction

Cranmer: Neyman construction.

Figure 3.5: Neyman construction for a confidence belt for \(\theta\) (source: K. Cranmer, 2020).

TODO: fix

\[ q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \]

3.5.2.4 Flip-flopping

  • Flip-flopping and Feldman-Cousins confidence intervals 74

74 Feldman & Cousins (1998).

3.5.3 p-values and significance

  • \(p\)-values and significance 75
  • Coverage
  • Fisherian vs Neyman-Pearson \(p\)-values

75 Sinervo (2002) and Cowan (2012).

Cowan et al. define a \(p\)-value as

a probability, under assumption of \(H\), of finding data of equal or greater incompatibility with the predictions of \(H\). 76

76 Cowan et al. (2011), p. 2–3.
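A one-sided \(p\)-value is often quoted as a significance \(Z = \Phi^{-1}(1-p)\); a minimal sketch with the standard library (the threshold values are the common particle-physics conventions):

```python
from statistics import NormalDist

# Convert a one-sided p-value to a significance Z and back.
nd = NormalDist()

def z_from_p(p):
    return nd.inv_cdf(1.0 - p)

def p_from_z(z):
    return 1.0 - nd.cdf(z)

z5 = z_from_p(2.87e-7)   # ~5 sigma, the conventional discovery threshold
p3 = p_from_z(3.0)       # ~0.00135, the conventional "evidence" threshold
```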

Also:

It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the \(p\)-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery”. 77

77 Cowan et al. (2011), p. 3.

3.5.3.1 Upper limits

78 Cousins & Highland (1992).

3.5.3.2 CLs method

  • Conservative coverage; used in particle physics
  • Junk 79
  • Read 80
  • ATLAS 81

79 Junk (1999).

80 Read (2002).

81 ATLAS Statistics Forum (2011).

3.5.4 Asymptotics

82 Wilks (1938).

83 Wald (1943).

84 Cowan et al. (2011).

85 Cowan, Cranmer, Gross, & Vitells (2012).

86 Bhattiprolu, Martin, & Wells (2020).

3.5.5 Student’s t-test

  • Student’s t-test
  • ANOVA
  • A/B-testing

3.5.6 Decision theory

87 Murphy (2012), p. 197.

88 Goodman (1999b).

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings. 89

89 Goodman (1999a). p. 995.

See also:

3.5.7 Examples

3.6 Uncertainty quantification

3.6.1 Sinervo classification of systematic uncertainties

  • Class-1, class-2, and class-3 systematic uncertainties (good, bad, ugly): classification by Pekka Sinervo (PhyStat2003) 90
  • Not to be confused with type-1 and type-2 errors in Neyman-Pearson theory
  • Heinrich, J. & Lyons, L. (2007). Systematic errors. 91
  • Caldeira & Nord 92

90 Sinervo (2003).

91 Heinrich & Lyons (2007).

92 Caldeira & Nord (2020).

Lyons:

In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error. 93

93 Lyons (2008), p. 890.

Figure 3.6: Classification of measurement uncertainties (philosophy-in-figures.tumblr.com, 2016).
  • Poincaré’s three levels of ignorance

3.6.2 Profile likelihoods

  • Profiling and the profile likelihood
    • Importance of Wald and Cowan et al.
    • hybrid Bayesian-frequentist method

3.6.3 Examples of poor estimates of systematic uncertainties

Figure 3.7: Demonstration of sensitivity to the jet energy scale for an alleged excess in \(Wjj\) by Tommaso Dorigo (2011) (see also: GIF).
  • OPERA. (2011). Faster-than-light neutrinos.
  • BICEP2 claimed evidence of B-modes in the CMB as evidence of cosmic inflation without accounting for cosmic dust.

3.6.4 Conformal prediction

94 Gammerman, Vovk, & Vapnik (1998).

3.7 Statistical classification

3.7.1 Introduction

  • Precision vs recall
  • Recall is sensitivity
  • Sensitivity vs specificity
  • Accuracy

3.7.2 Examples

  • TODO

See also:

3.8 Causal inference

3.8.1 Introduction

95 Rubin (1974).

96 Lewis (1981).

97 Pearl (2018).

See also:

3.8.2 Causal models

98 Pearl (2009).

99 Robins & Wasserman (1999).

100 Peters, Janzing, & Scholkopf (2017).

101 Lundberg, Johnson, & Stewart (2021).

3.8.3 Counterfactuals

102 Ismael (2023).

103 Chevalley, Schwab, & Mehrjou (2024).

3.9 Exploratory data analysis

3.9.1 Introduction

104 Tukey (1977).

3.9.2 Look-elsewhere effect

  • Look-elsewhere effect (LEE)
    • AKA File-drawer effect
  • Stopping rules
    • validation dataset
    • statistical issues, violates the likelihood principle

3.9.3 Archiving and data science

105 Chen, X. et al. (2018).

3.10 “Statistics Wars”

3.10.1 Introduction

  • Kruschke
  • Carnap
    • “The two concepts of probability” 106
  • Royall
    • “What do these data say?” 107

106 Carnap (1945).

107 Royall (1997), p. 171–2.

Cranmer:

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability \(P (A, B)\) in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability. 108

108 Cranmer (2015), p. 6.

Neyman:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. 109

109 Neyman & Pearson (1933).

Figure 3.8: From Kruschke. 110

110 Kruschke & Liddell (2018).

3.10.2 Likelihood principle

  • Likelihood principle
  • The likelihood principle is the proposition that, given a statistical model and a data sample, all the evidence relevant to model parameters is contained in the likelihood function.
  • The history of likelihood 111
  • Berger & Wolpert. (1988). The Likelihood Principle. 114

111 Edwards (1974).

112 Birnbaum (1962).

113 Hacking (1965).

114 Berger & Wolpert (1988).

O’Hagan:

The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference. 115

115 O’Hagan (2010), p. 17–18.

116 Gandenberger (2015).

117 Evans (2013).

118 Mayo (2014).

119 Mayo (2019).

120 Dawid (2014).

Mayo:

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true. 121

121 Mayo (2019).

3.10.3 Discussion

Lyons:

Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors. 122

122 Lyons (2008), p. 891.

123 Carnap (1952).

124 Carnap (1960).

125 Sznajder (2018).

126 Hacking (1965).

127 Neyman (1977).

128 Rozeboom (1960).

129 Meehl (1978).

130 Zech (1995).

131 Royall (1997).

132 Berger (2003).

133 Stuart et al. (2010), p. 460.

134 Mayo (1981).

135 Mayo (1996).

136 Mayo & Spanos (2006).

137 Mayo & Spanos (2011).

138 Mayo (2018).

139 Gelman & Hennig (2017).

140 Murphy (2012), ch. 6.6.

141 Murphy (2022), p. 195–198.

142 Gandenberger (2016).

143 Wakefield (2013), ch. 4.

144 Efron & Hastie (2016), p. 30–36.

145 Kruschke & Liddell (2018).

146 Steinhardt (2012).

147 Clayton (2021).

148 Wagenmakers (2021).

Figure 3.9: The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference (gandenberger.org/research/, 2015).

Goodman:

The idea that the \(P\) value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations). 149

149 Goodman (1999a). p. 999.

3.11 Replication crisis

3.11.1 Introduction

  • Ioannidis, J.P. (2005). Why most published research findings are false. 150

150 Ioannidis (2005).

3.11.2 p-value controversy

151 Wasserstein & Lazar (2016).

152 Wasserstein, Allen, & Lazar (2019).

153 Benjamin, D.J. et al. (2017).

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. 154

154 Fisher (1935), p. 13–14.

155 Rao & Lovric (2016).

156 Mayo (2021).

157 Gorard & Gorard (2016).

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application. 158

158 Benjamini, Y. et al. (2021), p. 1.

3.12 Classical machine learning

3.12.1 Introduction

  • Classification vs regression
  • Supervised and unsupervised learning
    • Classification = supervised; clustering = unsupervised
  • Hastie, Tibshirani, & Friedman 159
  • Information Theory, Inference, and Learning 160
  • Murphy, K.P. (2012). Machine Learning: A probabilistic perspective. MIT Press. 161
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An introduction. MIT Press. 162
  • Shalev-Shwartz, S. & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. 163
  • VC-dimension
    • Vapnik (1994) 164
    • Shalev-Shwartz, S. & Ben-David, S. (2014). 165

159 Hastie, Tibshirani, & Friedman (2009).

160 MacKay (2003).

161 Murphy (2012).

162 Murphy (2022), p. 195–198.

163 Shalev-Shwartz & Ben-David (2014).

164 Vapnik, Levin, & LeCun (1994).

165 Shalev-Shwartz & Ben-David (2014), p. 67–82.

3.12.2 History

166 McCarthy, Minsky, Rochester, & Shannon (1955).

167 Solomonoff (2016).

168 Kardum (2020).

169 McCulloch & Pitts (1943).

170 J. A. Anderson & Rosenfeld (1998).

171 Rosenblatt (1961).

172 Minsky & Papert (1969).

173 Cartwright (2001).

See also:

3.12.3 Logistic regression

From a probabilistic point of view, 174 logistic regression can be derived from doing maximum likelihood estimation of a vector of model parameters, \(\vec{w}\), in a dot product with the input features, \(\vec{x}\), and squashed with a logistic function that yields the probability, \(\mu\), of a Bernoulli random variable, \(y \in \{0, 1\}\).

174 Murphy (2012), p. 21.

\[ p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)} \]

The negative log-likelihood of multiple trials is

\[ \begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \end{align} \]

which is the cross entropy loss. Note that the first term is non-zero only when the true target is \(y_i=1\), and similarly the second term is non-zero only when \(y_i=0\). 175 Therefore, we can reparametrize the target \(y_i\) in favor of \(t_{ki}\) that is one-hot in an index \(k\) over classes.

175 Note: Label smoothing is a regularization technique that smears the activation over other labels, but we don’t do that here.

\[ \mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \]

where

\[ t_{ki} = \begin{cases} 1 & \mathrm{if}\ k = y_i \\ 0 & \mathrm{otherwise} \end{cases} \]

and

\[ \mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases} \]

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, \(k\), only one term for the true class contributes.

\[ \mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{is\ the\ true\ class\ of\ sample}\ i} \tag{3.3}\]
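A quick numerical check (toy targets and predicted probabilities assumed) that the two-term binary form of the loss equals the one-hot form of eq. 3.3:

```python
import math

# Binary cross entropy vs. the one-hot categorical form.
ys  = [1, 0, 1, 1, 0]            # targets y_i
mus = [0.9, 0.2, 0.6, 0.8, 0.3]  # predicted P(y_i = 1)

bce = -sum(y * math.log(m) + (1 - y) * math.log(1 - m)
           for y, m in zip(ys, mus))

# One-hot form: mu_{ki} is (1 - mu_i, mu_i) for classes k = 0, 1, and only
# the true class contributes.
cel = -sum(math.log((1 - m, m)[y]) for y, m in zip(ys, mus))
```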

Logistic regression uses the logit function 176, which is the logarithm of the odds—the ratio of the chance of success to failure. Let \(\mu\) be the probability of success in a Bernoulli trial, then the logit function is defined as

176 “Logit” was coined by Joseph Berkson (1899-1982).

\[ \mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \]

Logistic regression assumes that the logit of the mean, \(\mu\), is a linear function of the explanatory variable, \(x\).

\[ \log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x \]

where \(\beta_0\) and \(\beta_1\) are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, \(\vec{x}\), where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, \(\beta_0\), in a dot product.

\[ \vec{x} = (1, x_1, x_2, \ldots, x_n)^\intercal \]

\[ \vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^\intercal \]

\[ \log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^\intercal \vec{x} \]

For the moment, let \(z \equiv \vec{w}^\intercal \vec{x}\). Exponentiating and solving for \(\mu\) gives

\[ \mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} } \]

This function is called the logistic or sigmoid function.

\[ \mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \]

Since we inverted the logit function by solving for \(\mu\), the inverse of the logit function is the logistic or sigmoid.

\[ \mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z) \]

And therefore,

\[ \mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^\intercal \vec{x}) \]

Figure 3.10: Logistic regression.
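A minimal sketch of the logit/sigmoid pair and the resulting model probability (the weight vector is hypothetical, with the bias prepended as above):

```python
import math

def logit(mu):
    return math.log(mu / (1 - mu))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid inverts logit: sigmoid(logit(mu)) == mu.
roundtrip = [abs(sigmoid(logit(mu)) - mu) < 1e-12 for mu in (0.1, 0.5, 0.9)]

w = [-1.0, 2.0]   # hypothetical (beta_0, beta_1)
x = [1.0, 0.75]   # input with a 1 prepended for the bias
mu = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))   # P(y=1 | x)
```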

See also:

3.12.4 Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by \(k\)) as

\[ p(\vec{y} \,|\, \vec{\mu}) = \mathrm{Cat}(\vec{y} \,|\, \vec{\mu}) = \prod_k {\mu_k}^{y_k} \]

which is the categorical or multinoulli distribution. The negative-log likelihood of multiple independent trials is

\[ \mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \]

Noting again that \(y_{ki} = 1\) only when \(k\) is the true class, and is 0 otherwise, this simplifies to eq. 3.3.
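The softmax function generalizes the sigmoid to produce the class probabilities \(\mu_{ki}\); a sketch (toy scores assumed) showing that the categorical NLL reduces to the negative log-probability of the true class:

```python
import math

# Softmax turns unnormalized scores into class probabilities.
def softmax(scores):
    m = max(scores)   # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

y_onehot = [0, 1, 0]   # true class k = 1
nll_full = -sum(y * math.log(p) for y, p in zip(y_onehot, probs))
nll_true_class = -math.log(probs[1])   # same value: only k = 1 contributes
```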

See also:

177 McFadden & Zarembka (1973).

178 Blondel, Martins, & Niculae (2020).

179 Goodfellow et al. (2016), p. 129.

3.12.5 Decision trees

180 Freund & Schapire (1997).

181 T. Chen & Guestrin (2016).

182 Aytekin (2022).

183 Grinsztajn, Oyallon, & Varoquaux (2022).

184 Coadou (2022).

3.12.6 Clustering

185 Hartigan (1985).

186 Slonim, Atwal, Tkacik, & Bialek (2005).

187 Batson, Haaf, Kahn, & Roberts (2021).

188 Hennig (2015).

189 Lauc (2020), p. 103–4.

190 Ronen, Finder, & Freifeld (2022).

191 Fang, Z. et al. (2022).

See also:

3.13 Deep learning

3.13.1 Introduction

192 Gardner & Dorling (1998).

193 Bengio (2009).

194 LeCun, Bengio, & Hinton (2015).

195 Sutskever (2015).

196 Goodfellow et al. (2016).

197 Kaplan, J. et al. (2019).

198 Rumelhart, Hinton, & Williams (1986).

199 Amari (1993).

200 LeCun & Bottou (1998).

201 Scardapane (2024).

202 Bottou (1998).

203 Norvig (2011).

204 Sutton (2019).

205 Frankle & Carbin (2018).

Figure 3.11: Raw input image is transformed into gradually higher levels of representation. 206

206 Bengio (2009).

3.13.2 Gradient descent

\[ \hat{f} = \underset{f \in \mathcal{F}}{\mathrm{argmin}} \underset{x \sim \mathcal{X}}{\mathbb{E}} L(f, x) \]

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

\[ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] \]

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), \(\vec{x}_{bt}\), and take a gradient descent step with each minibatch:

\[ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{x_i \in \vec{x}_{bt}} \frac{\partial L}{\partial \vec{w}}[x_i] \]

where

  • \(t \in \mathbf{N}\) is the learning step number
  • \(\eta\) is the learning rate
  • \(m\) is the number of samples in a minibatch, called the batch size
  • \(L\) is the loss function
  • \(\frac{\partial L}{\partial \vec{w}}\) is the gradient
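The update rule above can be sketched end-to-end on a toy linear model (the true parameters \(w=3\), \(b=-1\) and all hyperparameters are assumed for illustration):

```python
import random

random.seed(0)

# Minibatch SGD sketch: fit y = w*x + b by gradient steps on the squared
# error L = (w*x + b - y)^2, using toy data with true w = 3, b = -1.
data = [(x, 3.0 * x - 1.0 + random.gauss(0, 0.1))
        for x in [random.uniform(-1, 1) for _ in range(256)]]

w, b = 0.0, 0.0
eta, m = 0.1, 32                 # learning rate and batch size
for epoch in range(200):
    random.shuffle(data)
    for i in range(0, len(data), m):
        batch = data[i:i + m]
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= eta * gw
        b -= eta * gb
```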

3.13.3 Deep double descent

Papers:

207 Belkin, Hsu, Ma, & Mandal (2019).

208 Muthukumar, Vodrahalli, Subramanian, & Sahai (2019).

209 Nakkiran, P. et al. (2019).

210 Chang, Li, Oymak, & Thrampoulidis (2020).

211 Holzmüller (2020).

212 Dar, Muthukumar, & Baraniuk (2021).

213 Balestriero, Pesenti, & LeCun (2021).

214 Belkin (2021).

215 Nagarajan (2021).

216 Bach (2022), p. 225–230.

217 Ghosh & Belkin (2022).

218 Singh, Lucchi, Hofmann, & Schölkopf (2022).

219 Hastie, Montanari, Rosset, & Tibshirani (2022).

220 Bubeck & Sellke (2023).

221 Gamba, Englesson, Björkman, & Azizpour (2022).

222 Schaeffer, R. et al. (2023).

223 Yang & Suzuki (2023).

224 Maddox, Benton, & Wilson (2023).

Blogs:

225 Steinhardt (2022).

226 Henighan, T. et al. (2023).

Twitter threads:

3.13.4 Regularization

Regularization = any change we make to the training algorithm in order to reduce the generalization error but not the training error. 227

227 Mishra, D. (2020). Weight Decay == L2 Regularization?

Most common regularizations:

  • L2 Regularization
  • L1 Regularization
  • Data Augmentation
  • Dropout
  • Early Stopping
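As a sketch of the first two items, L2 regularization adds \(\frac{\lambda}{2}\|\vec{w}\|^2\) to the loss, so its gradient contribution is \(\lambda \vec{w}\); for vanilla SGD this coincides with weight decay (the equivalence that footnote 227 discusses). The values below are hypothetical:

```python
import numpy as np

def l2_regularized_grad(w, grad_loss, lam):
    # Gradient of L(w) + (lam/2) * ||w||^2 is grad_loss + lam * w
    return grad_loss + lam * w

# For vanilla SGD this is the same update as weight decay:
#   w - eta * (g + lam * w) == (1 - eta * lam) * w - eta * g
w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])   # hypothetical loss gradient
eta, lam = 0.1, 0.01
w_l2 = w - eta * l2_regularized_grad(w, g, lam)
w_wd = (1.0 - eta * lam) * w - eta * g
print(np.allclose(w_l2, w_wd))  # True
```

Note the equivalence holds for plain SGD but not for adaptive optimizers like Adam, which is the point of the cited blog post.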

Papers:

228 S. Chen, Dobriban, & Lee (2020).

3.13.5 Batch size vs learning rate

Papers:

  1. Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

  1. Hoffer, E. et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

    • \(\eta \propto \sqrt{m}\)
  2. Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

    • \(\eta \propto m\)
  3. You, Y. et al. (2017). Large batch training of convolutional networks.

    • Layer-wise Adaptive Rate Scaling (LARS)
  4. You, Y. et al. (2017). ImageNet training in minutes.

    • Layer-wise Adaptive Rate Scaling (LARS)
  5. Jastrzebski, S. (2018). Three factors influencing minima in SGD.

    • \(\eta \propto m\)
  6. Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

  7. Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

    • \(m \propto \eta\)
  8. Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when \(m \ll M\), the covariance matrix of the weight update \(\mathrm{Cov}(\eta \Delta\theta)\) scales linearly with the quantity \(\eta^2/m\).

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update \(\eta \Delta\theta\). Conversely, to keep the scaling of the covariance of the weight update vector \(\eta \Delta\theta\) constant would require scaling \(\eta\) with the square root of the batch size \(m\) (Krizhevsky, 2014; Hoffer et al., 2017).

  1. Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.
    - Post-local SGD.

  2. Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as \(\eta \propto \sqrt{m}\) attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

  1. McCandlish, S. et al. (2018). An empirical model of large-batch training.
    • Critical batch size
  2. Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling (\(b\)-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

  1. Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.
    • \(\eta \propto m\)

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

  1. You, Y. et al. (2019). Large-batch training for LSTM and beyond.
    • Warmup and use \(\eta \propto m\)

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

  1. You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.
    • LARS and LAMB
  2. Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

  1. Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

  1. Kaplan, J. et al. (2020). Scaling laws for neural language models.

  2. Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.

Blogs:

3.13.6 Normalization

229 Chiley, V. et al. (2019).

230 Kiani, Balestriero, Lecun, & Lloyd (2022).

231 Huang, L. et al. (2020).

3.13.7 Finetuning

232 Hu, E.J. et al. (2021).

233 Dettmers, Pagnoni, Holtzman, & Zettlemoyer (2023).

234 Zhao, J. et al. (2024).

235 Huh, M. et al. (2024).

3.13.8 Computer vision

236 Fukushima & Miyake (1982).

237 LeCun, Y. et al. (1989).

238 LeCun, Bottou, Bengio, & Haffner (1998).

239 Ciresan, Meier, Masci, & Schmidhuber (2012).

240 Krizhevsky, Sutskever, & Hinton (2012).

241 Simonyan & Zisserman (2014).

242 He, Zhang, Ren, & Sun (2015).

243 Haber & Ruthotto (2017) and Haber, E. et al. (2018).

244 Howard, A.G. et al. (2017).

245 R. T. Q. Chen, Rubanova, Bettencourt, & Duvenaud (2018).

246 Tan & Le (2019).

247 Dosovitskiy, A. et al. (2020).

248 Tan & Le (2021).

249 H. Liu, Dai, So, & Le (2021).

250 Dhariwal & Nichol (2021).

251 Liu, Y. et al. (2021).

252 Ingrosso & Goldt (2022).

253 Park & Kim (2022).

254 Zhao, Y. et al. (2023).

255 Nakkiran, Bradley, Zhou, & Advani (2024).

Resources:

3.13.9 Natural language processing

3.13.9.1 Introduction

256 Firth (1957).

257 Nirenburg (1996).

258 Hutchins (2000).

259 Jurafsky & Martin (2022).

260 Z. Liu, Lin, & Sun (2023).

3.13.9.2 word2vec

261 Mikolov, Chen, Corrado, & Dean (2013), Mikolov, Yih, & Zweig (2013), and Mikolov, T. et al. (2013).

262 Kun (2018), p. 176–8.

3.13.9.3 RNNs

263 Hochreiter & Schmidhuber (1997).

264 Graves (2013).

Chain rule of language modeling (chain rule of probability):

\[ P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

or for the whole sequence:

\[ P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) \]

\[ = P(x_1) \: P(x_2 | x_1) \: P(x_3 | x_1 x_2) \: P(x_4 | x_1 x_2 x_3) \ldots \]

A language model (LM) predicts the next token given the previous context. The model outputs a vector of logits, which a softmax converts into probabilities for the next token.

\[ P(x_t | x_1 \ldots x_{t-1}) = \mathrm{softmax}\left( \mathrm{model}(x_1 \ldots x_{t-1}) \right) \]

Auto-regressive inference follows this chain rule. If done with greedy search:

\[ \hat{x}_t = \underset{x_t \in V}{\mathrm{argmax}} \: P(x_t | x_1 \ldots x_{t-1}) \]
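A minimal sketch of greedy auto-regressive decoding, with a stand-in `model` function that produces arbitrary logits (hypothetical; a real LM would be a trained network):

```python
import numpy as np

V = 5  # toy vocabulary size

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def model(tokens):
    # Hypothetical stand-in for a trained LM: returns a logit per
    # vocabulary item, here a deterministic function of the last token.
    rng = np.random.default_rng(tokens[-1])
    return rng.normal(size=V)

def greedy_decode(prompt, n_steps):
    tokens = list(prompt)
    for _ in range(n_steps):
        probs = softmax(model(tokens))        # P(x_t | x_1 ... x_{t-1})
        tokens.append(int(np.argmax(probs)))  # greedy: pick the mode
    return tokens

print(greedy_decode([0], 4))  # prompt token plus 4 generated tokens
```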

Beam search:

265 Sutskever, Vinyals, & Le (2014), p. 4.

266 Zhang (1998).

267 Zhou & Hansen (2005).

268 Collobert, Hannun, & Synnaeve (2019).

Backpropagation through time (BPTT):

269 Werbos (1990).

Neural Machine Translation (NMT):

  • Sutskever seq2seq 270
  • Bahdanau attention 271 and GNMT 272
  • Review by Stahlberg 273

270 Sutskever et al. (2014).

271 Bahdanau, Cho, & Bengio (2015).

272 Wu, Y. et al. (2016).

273 Stahlberg (2019).

3.13.9.4 Transformers

Figure 3.12: Diagram of the Transformer model (source: d2l.ai).

\[ \mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\, K^\intercal}{\sqrt{d_k}}\right) V \]
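The attention equation can be sketched directly in NumPy (a minimal single-head version without masking or batching; the shapes are illustrative assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k) similarity logits
    return softmax(scores, axis=-1) @ V   # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # 2 queries with d_k = 4
K = rng.normal(size=(3, 4))  # 3 keys with d_k = 4
V = rng.normal(size=(3, 8))  # 3 values with d_v = 8
print(attention(Q, K, V).shape)  # (2, 8)
```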

3.13.9.4.1 Attention and Transformers

274 Vaswani, A. et al. (2017).

275 Devlin, Chang, Lee, & Toutanova (2018).

276 Liu, Y. et al. (2019).

277 Raffel, C. et al. (2019).

278 Lan, Z. et al. (2019).

279 Lewis, M. et al. (2019).

280 Radford, Narasimhan, Salimans, & Sutskever (2018).

281 Radford, A. et al. (2019).

282 Brown, T.B. et al. (2020).

283 Yang, Z. et al. (2019).

284 Zaheer, M. et al. (2020).

285 Edelman, Goel, Kakade, & Zhang (2021).

286 Tay, Dehghani, Bahri, & Metzler (2022).

287 Phuong & Hutter (2022).

288 Chowdhery, A. et al. (2022).

289 Ouyang, L. et al. (2022).

290 Wolfram (2023).

291 OpenAI (2023).

292 Mohamadi, S. et al. (2023).

293 Zhao, W.X. et al. (2023).

294 Golovneva, Wang, Weston, & Sukhbaatar (2024).

Figure 3.13: Diagram of the BERT model (source: peltarion.com).
3.13.9.4.2 Computational complexity of transformers
3.13.9.4.3 Efficient transformers

295 Dao, T. et al. (2022).

3.13.9.4.4 What comes after Transformers?

296 Gu, Goel, & Ré (2021).

297 Merrill & Sabharwal (2022).

298 Bulatov, Kuratov, & Burtsev (2022).

299 Bulatov, Kuratov, & Burtsev (2023).

300 Bertsch, Alon, Neubig, & Gormley (2023).

301 Mialon, G. et al. (2023).

302 Peng, B. et al. (2023).

303 Sun, Y. et al. (2023).

304 Gu & Dao (2023).

305 Wang, H. et al. (2023).

306 Ma, S. et al. (2024).

307 Ma, X. et al. (2024).

308 Bhargava, Witkowski, Shah, & Thomson (2023).

309 Dao & Gu (2024).

310 Banerjee, Agarwal, & Singla (2024).

3.13.9.5 Evaluation methods

3.13.9.6 Scaling laws in NLP

311 Hestness, J. et al. (2017).

312 Church & Hestness (2019).

313 Kaplan, J. et al. (2020).

314 Rae, J.W. et al. (2022).

315 Hoffmann, J. et al. (2022).

316 Caballero, Gupta, Rish, & Krueger (2022).

317 Muennighoff, N. et al. (2023).

318 Pandey (2024).

319 Bach (2024).

320 Finzi, M. et al. (2025).

3.13.9.7 Language understanding

321 Mahowald, K. et al. (2023).

322 Kosinski (2023).

See also:

3.13.9.8 Interpretability

323 Watson & Floridi (2019).

324 Gurnee, W. et al. (2023).

325 Meng, Bau, Andonian, & Belinkov (2023).

326 McDougall, C. et al. (2023).

Linear probes:

327 Alain & Bengio (2016).

328 Belinkov (2022).

329 Gurnee & Tegmark (2023).

3.13.10 Reinforcement learning

Pedagogy:

330 Sutton & Barto (2018).

331 Arulkumaran, Deisenroth, Brundage, & Bharath (2017).

332 Cesa-Bianchi & Lugosi (2006).

Tutorials:

More:

333 Xu, Hasselt, & Silver (2018).

334 Silver, Singh, Precup, & Sutton (2024).

335 Javed & Sutton (2024).

3.13.10.1 Q-learning

  • Q-learning and DQN
  • Uses the Markov Decision Process (MDP) framework
  • The Bellman equation 336
  • Q-learning is a value-based learning algorithm: value-based algorithms update the value function using an equation (in particular, the Bellman equation), whereas policy-based algorithms estimate the value function with a greedy policy obtained from the last policy improvement (source: towardsdatascience.com).
  • DQN masters Atari 337

336 Bellman (1952).

337 Mnih, V. et al. (2013) and Mnih, V. et al. (2015).

3.13.10.2 AlphaZero

338 Silver, D. et al. (2016).

339 Silver, D. et al. (2017b).

340 Silver, D. et al. (2017a).

3.13.10.3 Regret minimization

Regret matching (RM)

341 Hart & Mas‐Colell (2000).

Consider a game like rock-paper-scissors, where there is only one action per round. Let \(v^{t}(a)\) be the value observed when playing action \(a\) on iteration \(t\).

TODO: explain that the entire rewards vector, \(v^{t}(a)\), over \(a\) is observable after the chosen action is played.

Let a strategy, \(\sigma^t\), be a probability distribution over actions, \(a \in A\). Then the value of a strategy, \(v^{t}(\sigma^{t})\), is the expectation of its value over actions.

\[ v^{t}(\sigma^{t}) = \sum_{a \in A} \sigma^{t}(a) \: v^{t}(a) \]

Regret, \(R^{T}\), measures how much better some sequence of strategies, \(\sigma'\), would do compared to the chosen sequence of strategies, \(\sigma = \{\sigma^1, \sigma^2, \ldots \sigma^T\}\).

\[ R^{T} \equiv \sum_{t=1}^{T} \left( v^{t}({\sigma'}^{t}) - v^{t}(\sigma^{t}) \right) \]

External regret, \(R^{T}(a)\), measures the regret of the chosen sequence of strategies versus a hypothetical strategy where action \(a\) is always chosen.

\[ R^{T}(a) \equiv \sum_{t=1}^{T} \left( v^{t}(a) - v^{t}(\sigma^{t}) \right) \]

Regret Matching (RM) is a rule to determine the strategy for the next iteration:

\[ \sigma^{t+1}(a) \equiv \frac{ R^{t}_{+}(a) }{ \sum_{b \in A} R^{t}_{+}(b) } \]

where \(R_{+} \equiv \mathrm{max}(R, 0)\).

At the end of training, the recommended strategy that carries convergence guarantees is not the final strategy used in training, \(\sigma^{T}\), but the average strategy over all time steps:

\[ \bar{\sigma}^{T}(a) = \frac{1}{T} \sum_{t=1}^{T} \sigma^{t}(a) \]
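A minimal sketch of regret matching in a one-node game, using expected values against a hypothetical fixed, rock-heavy opponent (a simplification: against a fixed opponent the average strategy \(\bar{\sigma}^T\) concentrates on the best response, here paper):

```python
import numpy as np

A = 3  # actions: 0 = rock, 1 = paper, 2 = scissors
# payoff[a, b]: utility of playing a against b
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def rm_strategy(regret):
    # Regret matching: play in proportion to positive regret
    pos = np.maximum(regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(A, 1.0 / A)

opp = np.array([0.4, 0.3, 0.3])  # hypothetical fixed, rock-heavy opponent
regret = np.zeros(A)
strategy_sum = np.zeros(A)
T = 10_000
for t in range(T):
    sigma = rm_strategy(regret)
    strategy_sum += sigma
    v = payoff @ opp          # v^t(a): expected value of each action
    v_sigma = sigma @ v       # v^t(sigma^t)
    regret += v - v_sigma     # accumulate external regret R^T(a)

avg_strategy = strategy_sum / T  # the average strategy over all steps
print(avg_strategy)              # concentrates on paper (index 1)
```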

TODO: explain the convergence of \(\bar{\sigma}^{t}\) to an \(\varepsilon\)-Nash equilibrium.

342 Roughgarden (2016).

Counterfactual regret minimization (CFR)

343 Zinkevich, Johanson, Bowling, & Piccione (2007).

344 N. Brown (2020), p. 12.

345 Zinkevich et al. (2007), p. 4.

346 Tammelin (2014).

347 Tammelin, Burch, Johanson, & Bowling (2015).

348 Burch, Moravcik, & Schmid (2019).

349 N. Brown & Sandholm (2019a).

TODO: explain extensive-form games.

A finite extensive game with imperfect information has the following components: 350

350 Zinkevich et al. (2007) and Lanctot, Waugh, Zinkevich, & Bowling (2009).

  • A finite set \(N\) of players. A finite set \(H\) of sequences, the possible histories of actions, such that the empty sequence is in \(H\) and every prefix of a sequence in \(H\) is also in \(H\). Define \(h \sqsubseteq h'\) to mean \(h\) is a prefix of \(h'\). \(Z \subseteq H\) are the terminal histories (those which are not a prefix of any other sequences). \(A(h) = \{a : ha \in H\}\) are the actions available after a non-terminal history, \(h \in H \backslash Z\).
  • A function \(P\) that assigns to each non-terminal history a member of \(N \cup \{c\}\). \(P\) is the player function. \(P(h)\) is the player who takes an action after the history \(h\). If \(P(h) = c\) then chance determines the action taken after history \(h\).
  • For each player \(i \in N \cup \{c\}\) a partition \(\mathcal{I}_i\) of \(\{h \in H : P (h) = i\}\) with the property that \(A(h) = A(h')\) whenever \(h\) and \(h'\) are in the same member of the partition. For \(I \in \mathcal{I}_i\) we denote by \(A(I_i)\) the set \(A(h)\) and by \(P(I_i)\) the player \(P(h)\) for any \(h \in I_i\) . \(\mathcal{I}_i\) is the information partition of player \(i\); a set \(I_i \in \mathcal{I}_i\) is an information set of player \(i\).
  • A function \(f_c\) that associates with every information set \(I\) where \(P(I) = c\), a probability measure \(f_c(a|I)\) on \(A(h)\); \(f_c(a|I)\) is the probability that \(a\) occurs given some \(h \in I\), where each such probability measure is independent of every other such measure.
  • For each player \(i \in N\) there is a utility function \(u_i\) from the terminal states \(Z\) to the reals \(\mathbb{R}\). If \(N = \{1, 2\}\) and \(u_1 = -u_2\), it is a zero-sum extensive game. Define \(\Delta_{u,i} \equiv \mathrm{max}_z \: u_i(z) - \mathrm{min}_z \: u_i(z)\) to be the range of utilities to player \(i\).

The player reach, \(\pi^{\sigma}_{i}(h)\), of a history \(h\) is the product of the probabilities for all agent \(i\) actions leading to \(h\). Formally, 351

351 N. Brown (2020), p. 6.

\[ \pi^{\sigma}_{i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') = i} \sigma_{i}(h', a') \]

Due to perfect recall, any two histories in infoset \(I_i\) have the same player reach for player \(i\). Thus, we similarly define the player reach \(\pi^{\sigma}_{i}(I_i)\) of infoset \(I_i\) as

\[ \pi^{\sigma}_{i}(I_i) \equiv \prod_{ {I'}_{i} \cdot a' \sqsubseteq I_i | P(I_i) = i } \sigma_{i}({I'}_{i}, a') = \left.\pi^{\sigma}_{i}(h)\right|_{h \in I_i} \]

The external reach AKA opponent reach, \(\pi^{\sigma}_{-i}(h)\), of a history \(h\) is the contribution of chance and all other players than \(i\). Formally,

\[ \pi^{\sigma}_{-i}(h) \equiv \prod_{h' \cdot a' \sqsubseteq h | P(h') \neq i} \sigma_{P(h')}(h', a') \]

We also define the external reach of an infoset as

\[ \pi^{\sigma}_{-i}(I_i) \equiv \sum_{h \in I_{i}} \pi^{\sigma}_{-i}(h) \]

The counterfactual value of an infoset \(I\) is the expected utility to player \(i\) given that \(I\) has been reached, weighted by the external reach of \(I\) for player \(i\). Formally, 352

352 N. Brown (2020), p. 12.

\[ v(I) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

The counterfactual value of an action, \(a\), is

\[ v(I, a) = \sum_{h \in I} \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h \cdot a, z) \: u_{i}(z) \]

Let’s consider the case where, as in NLHE, each deal of two private hole cards yields a single unique history, \(h\), and we form infosets containing a single hand, so \(I = h\). Then

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi^{\sigma}(h, z) \: u_{i}(z) \]

making explicit the player reach and the external reach,

\[ v(h) = \pi^{\sigma}_{-i}(h) \sum_{z \in Z} \pi_{i}^{\sigma}(h, z) \: \pi_{-i}^{\sigma}(h, z) \: u_{i}(z) \]

At a leaf node where we finally calculate the rewards,

\[ v(z) = \pi^{\sigma}_{-i}(z) \: u_{i}(z) \]

TODO: explain CFR.

The instantaneous regret is

\[ r^{t}(I, a) = v^{\sigma^t}(I, a) - v^{\sigma^t}(I) \]

The (cumulative) counterfactual regret is

\[ R^{t}(I, a) = \sum_{t=1}^{T} r^{t}(I, a) \]

Similar to the regret-matching rule for the single-node game discussed above, applying regret matching during training means updating the strategies according to the following rule.

\[ \sigma^{t+1}(I, a) \equiv \frac{ R^{t}_{+}(I, a) }{ \sum_{b \in A} R^{t}_{+}(I, b) } \]

The average strategy is

\[ \bar{\sigma}^{T}(I, a) = \frac{ \sum_{t=1}^{T} \pi^{t}_{i}(I) \: \sigma^{t}(I, a) }{ \sum_{t=1}^{T} \pi^{t}_{i}(I) } \]

Monte Carlo Counterfactual Regret Minimization (MCCFR)

353 Lanctot et al. (2009).

354 Neller & Lanctot (2013).

355 Burch, Lanctot, Szafron, & Gibson (2012).

356 Johanson, M. et al. (2012).

357 Schmid, M. et al. (2019).

358 Li, H. et al. (2020).

359 Habara, Fukuda, & Yamashita (2023).

360 Lanctot (2013).

361 Gibson (2014).

362 Johanson (2016).

363 Burch (2018).

364 Horacek (2022).

TODO: explain MCCFR.

External sampling MCCFR:

\[ \tilde{v}^{\sigma}_{i}(I) = \sum_{z \in Q} u_{i}(z) \: \pi^{\sigma}_{i}(z[I] \rightarrow z) \]

Best response and exploitability

Best response:

\[ \mathrm{BR}(\sigma_{-i}) = \underset{\sigma_{i}^{\prime}}{\mathrm{argmax}} \: u_{i}(\sigma_{i}^{\prime}, \sigma_{-i}) \]

TODO: Local Best Response (LBR). 365

365 Lisy & Bowling (2016), p. 2.

Exploitability:

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \mathrm{BR}(\sigma_{i})) \]

NashConv 366 exploitability uses the convention:

366 See NashConv exploitability defined in Lanctot, M. et al. (2017).

\[ \varepsilon_{i}(\sigma) = u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_{i}(\sigma_{i}, \sigma_{-i}) \]

The average exploitability per player is

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} \varepsilon_{i}(\sigma) \]

Note that in zero-sum games, when summing over players, the second terms in NashConv sum to zero. 367

367 Timbers (2020), p. 3.

\[ \varepsilon(\sigma) = \frac{1}{n} \sum_{i}^{n} u_{i}(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) \]

In two-player games:

\[ \varepsilon(\sigma) = \frac{1}{2} \Big( u_{1}(\mathrm{BR}(\sigma_{2}), \sigma_{2}) + u_{2}(\sigma_{1}, \mathrm{BR}(\sigma_{1})) \Big) \]
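For a two-player zero-sum matrix game, the expression above reduces to a few lines of NumPy (a sketch using rock-paper-scissors; at the uniform Nash equilibrium the exploitability is exactly zero):

```python
import numpy as np

# Rock-paper-scissors: payoff[a, b] is the utility to player 1
# when player 1 plays a and player 2 plays b (zero-sum: u2 = -u1).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def exploitability(sigma1, sigma2):
    # (1/2) * ( u1(BR(sigma2), sigma2) + u2(sigma1, BR(sigma1)) )
    br1 = np.max(payoff @ sigma2)       # best-response value for player 1
    br2 = np.max(-(sigma1 @ payoff))    # best-response value for player 2
    return 0.5 * (br1 + br2)

uniform = np.full(3, 1.0 / 3.0)
rock = np.array([1.0, 0.0, 0.0])
print(exploitability(uniform, uniform))  # 0.0: uniform is the Nash equilibrium
print(exploitability(rock, uniform))     # positive: pure rock is exploitable
```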

368 Johanson, Waugh, Bowling, & Zinkevich (2011).

369 Ponsen, De Jong, & Lanctot (2011).

370 Lisy & Bowling (2016).

371 Timbers (2020).

3.13.10.4 Solving poker

372 Kuhn (1950).

373 Southey, F. et al. (2012).

374 Billings, Davidson, Schaeffer, & Szafron (2002).

375 Billings, D. et al. (2003).

376 Johanson (2013).

377 Bowling, Burch, Johanson, & Tammelin (2015).

378 Heinrich & Silver (2016).

379 Moravcik, M. et al. (2017).

380 N. Brown & Sandholm (2017).

381 N. Brown & Sandholm (2018).

382 N. Brown & Sandholm (2019a).

383 N. Brown, Lerer, Gross, & Sandholm (2019).

384 N. Brown & Sandholm (2019b).

385 N. Brown, Bakhtin, Lerer, & Gong (2020).

386 N. Brown (2020).

387 Schmid, M. et al. (2021).

388 Kovarik, V. et al. (2022).

3.13.11 Applications in physics

389 Denby (1988).

390 Denby (1993).

391 Spears, B.K. et al. (2018).

392 Cranmer, Seljak, & Terao (2021).

393 Liu, Z. et al. (2024).

394 Jiao, L. et al. (2024).

See also:

3.14 Theoretical machine learning

3.14.1 Algorithmic information theory

395 Cilibrasi & Vitanyi (2005).

396 Hutter (2007).

397 Rathmanner & Hutter (2011).

3.14.2 No free lunch theorems

398 Wolpert & Macready (1995).

399 Wolpert (1996).

400 Wolpert & Macready (1997).

401 Shalev-Shwarz & Ben-David (2014), p. 60–66.

402 McDermott (2019).

403 Wolpert (2007).

404 Wolpert & Kinney (2020).

405 Wolpert (2023).

406 Mitchell (1980).

407 Roberts (2021).

408 Goldreich & Ron (1997).

409 Joyce & Herrmann (2017).

410 H. W. Lin, Tegmark, & Rolnick (2017).

411 Lauc (2020).

412 Nakkiran (2021).

413 Bousquet, O. et al. (2021).

414 Andrews (2023).

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available. 415

415 Raissi, Perdikaris, & Karniadakis (2017a), p. 2.

Roberts:

From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon. 416

416 Roberts (2021), p. 7.

  • TODO: Note Dennett’s discussion of compression. 417

417 Dennett (1991), p. TODO.

3.14.3 Connectivists vs symbolicists

3.14.4 Graphical tensor notation

3.14.5 Universal approximation theorem

418 Minsky & Papert (1969), p. TODO.

419 Hornik, Stinchcombe, & White (1989).

420 Lu, Z. et al. (2017).

421 H. Lin & Jegelka (2018).

422 Ismailov (2020).

423 Bishop (2006), p. 230.

3.14.6 Relationship to statistical mechanics

424 Opper & Kinzel (1996).

425 Opper (2001).

426 Bahri, Y. et al. (2020).

427 Halverson, Maiti, & Stoner (2020).

428 Canatar, Bordelon, & Pehlevan (2020).

429 Roberts, Yaida, & Hanin (2021).

430 Cantwell (2022).

431 Dinan, Yaida, & Zhang (2023).

432 Sohl-Dickstein (2020).

433 Aifer, M. et al. (2023).

434 Geshkovski, Letrouit, Polyanskiy, & Rigollet (2023).

3.14.7 Relationship to gauge theory

Invariant:

\[ f(g x) = f(x) \]

Equivariant:

\[ f(g x) = g' f(x) \]

Same-equivariant is the case that \(g' = g\).
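A toy NumPy check of the two definitions for the permutation group (assumptions for illustration: sum pooling as an invariant map, elementwise ReLU as a same-equivariant map):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)
g = rng.permutation(6)  # a group element: a permutation of indices

# Invariant: f(g x) = f(x); sum pooling ignores the ordering
f_inv = np.sum
print(bool(np.isclose(f_inv(x[g]), f_inv(x))))  # True

# Same-equivariant: f(g x) = g f(x); elementwise ReLU commutes
# with permutations
f_eq = lambda v: np.maximum(v, 0.0)
print(bool(np.allclose(f_eq(x[g]), f_eq(x)[g])))  # True
```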

435 Dieleman, Fauw, & Kavukcuoglu (2016).

436 T. S. Cohen & Welling (2016).

437 T. S. Cohen, Weiler, Kicanaoglu, & Welling (2019).

438 Fuchs, Worrall, Fischer, & Welling (2020).

439 Bogatskiy, A. et al. (2023).

440 Marchetti, Hillar, Kragic, & Sanborn (2023).

441 Battiloro, C. et al. (2024).

3.14.8 Thermodynamics of computation

442 Bérut, A. et al. (2012).

443 Bérut, Petrosyan, & Ciliberto (2015).

3.15 Information geometry

3.15.1 Introduction

444 Smith (2019).

445 Nielsen (2020).

446 Amari (1998).

447 Amari (2016).

3.15.2 Geometric understanding of classical statistics

448 Balasubramanian (1996a).

449 Balasubramanian (1996b).

450 Calin & Udriste (2014).

451 de Carvalho, Page, & Barney (2019).

3.15.3 Geometric understanding of deep learning

452 Lei, Luo, Yau, & Gu (2018).

453 Gao & Chaudhari (2020).

454 Bronstein, Bruna, Cohen, & Velickovic (2021).

3.16 Automation

3.16.1 AutoML

  • Neural Architecture Search (NAS)
  • AutoML frameworks
  • RL-driven NAS
  • learned sparsity

3.16.2 Surrogate models

455 Fefferman, Mitter, & Narayanan (2016).

456 Raissi et al. (2017a) and Raissi, Perdikaris, & Karniadakis (2017b).

457 Karniadakis, G.E. et al. (2021).

458 Howard, Mandt, Whiteson, & Yang (2021).

459 Thuerey, N. et al. (2021).

460 Cranmer, Pavez, & Louppe (2015).

461 Cranmer, Brehmer, & Louppe (2019).

462 Baydin, A.G. et al. (2019).

Lectures:

3.16.3 AutoScience

463 C. Anderson (2008).

464 Asch, M. et al. (2018).

465 D’Agnolo & Wulzer (2019).

466 Krenn, M. et al. (2022).

467 Udrescu & Tegmark (2020).

468 Cranmer, M. et al. (2020).

469 Z. Liu, Madhavan, & Tegmark (2022).

470 Lu, C. et al. (2024).

Figure 3.14: The inference cycle for the process of scientific inquiry. The three distinct forms of inference (abduction, deduction, and induction) facilitate an all-encompassing vision, enabling HPC and HDA to converge in a rational and structured manner. HPC: high-performance computing; HDA: high-end data analysis. 471

471 Asch, M. et al. (2018).

See also:

3.17 Implications for the realism debate

3.17.1 Introduction

472 Korb (2001).

473 Williamson (2009).

474 Bensusan (2000).

See also:

3.17.2 Real clusters

  • Nope: Hennig

See also:

3.17.3 Word meanings

475 Perone (2018).

476 Tenney, I. et al. (2019).

477 Nissim, Noord, & Goot (2019).

478 Skelac & Jandric (2020).

479 Bender & Koller (2020).

480 Elhage, N. et al. (2022).

481 Patel & Pavlick (2022).

482 Lovering & Pavlick (2022).

483 Kornai (2023).

484 Huh, Cheung, Wang, & Isola (2024).

485 Jamali, M. et al. (2024).

Wittgenstein in PI:

The meaning of a word is its use in the language. 486

486 Wittgenstein (2009), §43.

and

One cannot guess how a word functions. One has to look at its use, and learn from that. 487

487 Wittgenstein (2009), §340.

Piantadosi:

Modern large language models integrate syntax and semantics in the underlying representations: encoding words as vectors in a high-dimensional space, without an effort to separate out e.g. part of speech categories from semantic representations, or even predict at any level of analysis other than the literal word. Part of making these models work well was in determining how to encode semantic properties into vectors, and in fact initializing word vectors via encodings of distribution semantics from e.g. Mikolov et al. 2013 (Radford et al. 2019). Thus, an assumption of the autonomy of syntax is not required to make models that predict syntactic material and may well hinder it. 488

488 Piantadosi (2023), p. 15.

See also:

3.18 My thoughts

My docs:

My talks:

3.19 Annotated bibliography

3.19.1 Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge.

3.19.1.1 My thoughts

  • TODO

3.19.2 Cowan, G. (1998). Statistical Data Analysis.

3.19.2.1 My thoughts

  • TODO

3.19.3 James, F. (2006). Statistical Methods in Experimental Physics.

3.19.3.1 My thoughts

  • TODO

3.19.4 Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.

  • Cowan et al. (2011)
  • Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

3.19.4.1 My thoughts

  • TODO

3.19.5 ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson.

3.19.5.1 My thoughts

  • TODO

3.19.6 Cranmer, K. (2015). Practical statistics for the LHC.

3.19.6.1 My thoughts

  • TODO