# Philosophy of statistics

Statistical analysis is central to addressing the problem of induction. Can inductive inference be formalized? What are the caveats? Can inductive inference be automated? How does machine learning work?

## Introduction to the foundations of statistics

### Problem of induction

A key issue for the scientific method, as discussed in the previous outline, is the problem of induction. Inductive inferences are used in the scientific method to make generalizations from finite data. This introduces avenues of error not found in purely deductive inferences, like those in logic and mathematics. In a deductive inference, the conclusion necessarily follows if the argument is valid and all of its premises obtain. In an inductive inference, the premises only make the conclusion probable (not certain), so even a strong inductive argument with true premises can still result in error, because the support of the argument is ultimately probabilistic.

A skeptic may further probe whether we are even justified in using the probabilities we use in inductive arguments. What is the probability the Sun will rise tomorrow? What kind of probabilities are reasonable?

In this outline, we sketch and explore how the mathematical theory of statistics has arisen to wrestle with the problem of induction, and how it equips us with careful ways of framing inductive arguments and notions of confidence in them.

### Early investigators

• “Ibn al-Haytham was an early proponent of the concept that a hypothesis must be supported by experiments based on confirmable procedures or mathematical evidence—an early pioneer in the scientific method five centuries before Renaissance scientists.” - Wikipedia
• Gerolamo Cardano (1501-1576)
• Book on Games of Chance (1564)
• John Graunt (1620-1674)
• Jacob Bernoulli (1655-1705)
• Ars Conjectandi (1713, posthumous)
• First modern phrasing of the problem of parameter estimation1
• See Hacking2
• Early vision of decision theory:

The art of measuring, as precisely as possible, probabilities of things, with the goal that we would be able always to choose or follow in our judgments and actions that course, which will have been determined to be better, more satisfactory, safer or more advantageous.3

### Foundations of modern statistics

• Central limit theorem
• Charles Sanders Peirce (1839-1914)
• Formulated modern statistics in “Illustrations of the Logic of Science,” a series published in Popular Science Monthly (1877-1878), and also “A Theory of Probable Inference” in Studies in Logic (1883).5
• With a repeated measures design, introduced blinded, controlled randomized experiments (before Fisher).
• Karl Pearson (1857-1936)
• The Grammar of Science (1892)
• “On the criterion that a given system of deviations…” (1900)6
• Proposed testing the validity of hypothesized values by evaluating the $$\chi^2$$ distance between the hypothesized and the empirically observed values via the $$p$$-value.
• With Francis Galton and Walter Frank Raphael Weldon, he established the journal Biometrika in 1901.
• Founded the world’s first university statistics department at University College, London in 1911.
• Ronald Fisher (1890-1962)
• Fisher significance of the null hypothesis ($$p$$-values)
• “On an absolute criterion for fitting frequency curves”7
• “Frequency distribution of the values of the correlation coefficient in samples of indefinitely large population”8
• “On the ‘probable error’ of a coefficient of correlation deduced from a small sample”9
• Definition of likelihood
• ANOVA
• Statistical Methods for Research Workers (1925)
• The Design of Experiments (1935)
• “Statistical methods and scientific induction”10
• The Lady Tasting Tea11
• Jerzy Neyman (1894-1981)
• Egon Pearson (1895-1980)
• Neyman-Pearson hypothesis tests and confidence intervals with fixed error probabilities (these also use $$p$$-values, but considering two hypotheses involves two types of errors)
• Harold Jeffreys (1891-1989)
• objective (non-informative) Jeffreys priors
• Andrey Kolmogorov (1903-1987)
• C.R. Rao (1920-2023)

### Probability

Probability is of epistemic interest, being in some sense a measure of inductive confidence.

TODO:

• Kolmogorov axioms
• Probability vs odds: $$p/(p+q)$$ vs $$p/q$$
• Carnap: “Probability as a guide in life”23

### Expectation and variance

Expectation:

$\mathbb{E}(y) \equiv \int dx \: p(x) \: y(x) \label{eq:expectation}$

Expectation values can be approximated with a partial sum over some data or Monte Carlo sample:

$\mathbb{E}(y) \approx \frac{1}{n} \sum_{s=1}^{n} y(x_s) \label{eq:expectation_sum}$

The variance of a random variable, $$y$$, is defined as

\begin{align} \mathrm{Var}(y) &\equiv \mathbb{E}((y - \mathbb{E}(y))^2) \nonumber \\ &= \mathbb{E}(y^2 - 2 \: y \: \mathbb{E}(y) + \mathbb{E}(y)^2) \nonumber \\ &= \mathbb{E}(y^2) - 2 \: \mathbb{E}(y) \: \mathbb{E}(y) + \mathbb{E}(y)^2 \nonumber \\ &= \mathbb{E}(y^2) - \mathbb{E}(y)^2 \label{eq:variance} \end{align}

The covariance matrix, $$\boldsymbol{V}$$, of random variables $$x_i$$ is

\begin{align} V_{ij} &= \mathrm{Cov}(x_i, x_j) \equiv \mathbb{E}[(x_i - \mathbb{E}(x_i)) \: (x_j - \mathbb{E}(x_j))] \nonumber \\ &= \mathbb{E}(x_i \: x_{j} - \mu_i \: x_j - x_i \: \mu_j + \mu_i \: \mu_j ) \nonumber \\ &= \mathbb{E}(x_i \: x_{j}) - \mu_i \: \mu_j \label{eq:covariance_matrix_indexed} \end{align}

$\begin{equation} \boldsymbol{V} = \begin{pmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_n) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_n, x_1) & \mathrm{Cov}(x_n, x_2) & \cdots & \mathrm{Var}(x_n) \end{pmatrix} \label{eq:covariance_matrix_array} \end{equation}$

Diagonal elements of the covariance matrix are the variances of each variable.

$\mathrm{Cov}(x_i, x_i) = \mathrm{Var}(x_i)$

Off-diagonal elements of a covariance matrix measure how related two variables are, linearly. Covariance can be normalized to give the correlation coefficient between variables:

$\mathrm{Cor}(x_i, x_j) \equiv \frac{ \mathrm{Cov}(x_i, x_j) }{ \sqrt{ \mathrm{Var}(x_i) \: \mathrm{Var}(x_j) } } \label{eq:correlation_matrix}$

which is bounded: $$-1 \leq \mathrm{Cor}(x_i, x_j) \leq 1$$.

The covariance of two random vectors is given by

$\boldsymbol{V} = \mathrm{Cov}(\vec{x}, \vec{y}) = \mathbb{E}(\vec{x} \: \vec{y}^{\mathsf{T}}) - \vec{\mu}_x \: \vec{\mu}_{y}^{\mathsf{T}}\label{eq:covariance_matrix_vectors}$
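The definitions in this section can be checked numerically with a Monte Carlo sample. The following is a minimal sketch, not a reference implementation: the helper names, the seed, and the toy relationship between `x1` and `x2` are all assumptions made for illustration. Expectations are approximated by sample averages, and the covariance uses the form $$\mathbb{E}(x_i \: x_j) - \mu_i \: \mu_j$$.

```python
import random

def mean(values):
    """Monte Carlo estimate of an expectation: (1/n) * sum over the sample."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Cov(x, y) = E(x y) - E(x) E(y), estimated from paired samples."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Covariance normalized by the square root of the product of variances."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
x1 = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
# Hypothetical toy model: x2 depends linearly on x1, plus independent noise.
x2 = [2.0 * a + rng.gauss(0.0, 1.0) for a in x1]

var_x1 = covariance(x1, x1)  # diagonal element: Cov(x, x) = Var(x)
cov_12 = covariance(x1, x2)  # off-diagonal element
cor_12 = correlation(x1, x2)
print(var_x1, cov_12, cor_12)
```

For this toy model the exact values are $$\mathrm{Var}(x_1) = 1$$, $$\mathrm{Cov}(x_1, x_2) = 2$$, and $$\mathrm{Cor}(x_1, x_2) = 2/\sqrt{5}$$, so the printed estimates should land close to those.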

### Cross entropy

TODO: discuss the Shannon entropy and Kullback-Leibler (KL) divergence.24

Shannon entropy:

$H(p) = - \underset{x\sim{}p}{\mathbb{E}}\big[ \log p(x) \big] \label{eq:shannon_entropy}$

Cross entropy:

$H(p, q) = - \underset{x\sim{}p}{\mathbb{E}}\big[ \log q(x) \big] \label{eq:cross_entropy}$

Kullback-Leibler (KL) divergence:

\begin{align} D_\mathrm{KL}(p, q) &= \underset{x\sim{}p}{\mathbb{E}}\left[ \log \left(\frac{p(x)}{q(x)}\right) \right] = \underset{x\sim{}p}{\mathbb{E}}\big[ \log p(x) - \log q(x) \big] \label{eq:kl_divergence} \\ &= - H(p) + H(p, q) \\ \end{align}

See also the section on logistic regression.
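For discrete distributions the expectations above become sums, which makes the relation $$D_\mathrm{KL}(p, q) = H(p, q) - H(p)$$ easy to verify. The sketch below assumes natural logarithms and that $$q(x) > 0$$ wherever $$p(x) > 0$$; the two example distributions are made up for illustration.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -E_p[log p(x)] for a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -E_p[log q(x)]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence E_p[log p(x) - log q(x)]."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```

The two printed numbers agree up to floating-point rounding, and the divergence is zero when the two distributions coincide.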

### Uncertainty

#### Quantiles and standard error

TODO:

• Quantiles
• Practice of standard error for uncertainty quantification.

#### Propagation of error

Given some vector of random variables, $$\vec{x}$$, with estimated means, $$\vec{\mu}$$, and estimated covariance matrix, $$\boldsymbol{V}$$, suppose we are concerned with estimating the variance of some variable, $$y$$, that is a function of $$\vec{x}$$. The variance of $$y$$ is given by

$\sigma^2_y = \mathbb{E}(y^2) - \mathbb{E}(y)^2 \,.$

Taylor expanding $$y(\vec{x})$$ about $$\vec{x}=\vec{\mu}$$ to first order, with a sum over the repeated index $$i$$ implied, gives

$y(\vec{x}) \approx y(\vec{\mu}) + \left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}} (x_i - \mu_i) \,.$

Therefore, to first order

$\mathbb{E}(y) \approx y(\vec{\mu})$

and

\begin{align} \mathbb{E}(y^2) &\approx y^2(\vec{\mu}) + 2 \, y(\vec{\mu}) \, \left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}} \mathbb{E}(x_i - \mu_i) \nonumber \\ &+ \mathbb{E}\left[ \left(\left.\frac{\partial y}{\partial x_i}\right|_{\vec{x}=\vec{\mu}}(x_i - \mu_i)\right) \left(\left.\frac{\partial y}{\partial x_j}\right|_{\vec{x}=\vec{\mu}}(x_j - \mu_j)\right) \right] \\ &= y^2(\vec{\mu}) + \, \left.\frac{\partial y}{\partial x_i}\frac{\partial y}{\partial x_j}\right|_{\vec{x}=\vec{\mu}} V_{ij} \\ \end{align}

TODO: clarify above, then specific examples.

See Cowan.25
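Subtracting $$\mathbb{E}(y)^2 \approx y^2(\vec{\mu})$$ from the expansion of $$\mathbb{E}(y^2)$$ leaves the first-order result $$\sigma^2_y \approx \frac{\partial y}{\partial x_i} \frac{\partial y}{\partial x_j} V_{ij}$$, with the derivatives evaluated at $$\vec{\mu}$$. A minimal sketch of that formula follows, assuming a finite-difference gradient and a made-up example, $$y = x_1 \, x_2$$, with hypothetical means and covariance matrix.

```python
def gradient(f, mu, h=1e-6):
    """Central-difference estimate of the partial derivatives of f at mu."""
    grad = []
    for i in range(len(mu)):
        up = list(mu)
        down = list(mu)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def propagate_variance(f, mu, V):
    """First-order variance of y = f(x): sum_ij (dy/dx_i)(dy/dx_j) V_ij."""
    g = gradient(f, mu)
    n = len(mu)
    return sum(g[i] * g[j] * V[i][j] for i in range(n) for j in range(n))

def product(x):
    """Hypothetical example function y = x_1 * x_2."""
    return x[0] * x[1]

mu = [2.0, 3.0]                    # assumed estimated means
V = [[0.01, 0.002], [0.002, 0.04]]  # assumed covariance matrix

var_y = propagate_variance(product, mu, V)
print(var_y)
```

By hand, the gradient at $$\vec{\mu}$$ is $$(3, 2)$$, so $$\sigma^2_y \approx 9 (0.01) + 2 (3)(2)(0.002) + 4 (0.04) = 0.274$$, which the sketch reproduces up to finite-difference rounding.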

### Bayes’ theorem

$P(A|B) = P(B|A) \: P(A) \: / \: P(B) \label{eq:bayes_theorem}$

• Extended version of Bayes’ theorem
• Example of conditioning with medical diagnostics
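As one way to make the medical-diagnostics example concrete, the sketch below applies Bayes' theorem with $$P(B)$$ expanded over the two cases, disease and no disease. The prevalence, sensitivity, and false-positive rate are invented numbers, chosen only to show how a low prior keeps the posterior modest even for a fairly accurate test.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) = P(B|A) P(A) + P(B|not A) P(not A)."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_disease_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(p_disease_given_positive)
```

With these inputs $$P(B) = 0.0095 + 0.0495 = 0.059$$, so the posterior is $$0.0095 / 0.059 \approx 0.16$$: most positive results come from the much larger healthy population.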

### Likelihood and frequentist vs Bayesian probability

$P(H|D) = P(D|H) \: P(H) \: / \: P(D) \label{eq:bayes_theorem_hd}$

• Likelihood

$L(\theta) = P(D|\theta) \label{eq:likelihood_def_x}$

To appeal to such a result is absurd. Bayes’ theorem ought only to be used where we have in past experience, as for example in the case of probabilities and other statistical ratios, met with every admissible value with roughly equal frequency. There is no such experience in this case.28

### Curse of dimensionality

• Curse of dimensionality
• The volume of the space increases so fast that the available data become sparse.
• The ordinary decision rule for estimating the mean of a multivariate Gaussian distribution (with dimensions, $$n \geq 3$$) is inadmissible under mean squared error risk.
• Proof of Stein’s example
• Probability in high dimensions29
• High-Dimensional Probability: An introduction with applications in data science30
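Stein's example can be illustrated by simulation. The sketch below compares the ordinary estimator (the observation itself) with the James-Stein shrinkage estimator, which is the standard example of a rule that dominates it for $$n \geq 3$$. The estimator is not spelled out in the notes above, and the dimension, true mean, seed, and number of trials here are assumptions.

```python
import random

def james_stein(x):
    """Shrink the observation toward the origin by the factor 1 - (n - 2)/|x|^2."""
    n = len(x)
    shrink = 1.0 - (n - 2) / sum(xi * xi for xi in x)
    return [shrink * xi for xi in x]

def squared_error(estimate, truth):
    """Total squared error summed over all components."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

rng = random.Random(1)
theta = [1.0] * 10  # hypothetical true mean in n = 10 dimensions
trials = 5000
loss_mle = 0.0
loss_js = 0.0
for _ in range(trials):
    # One observation per trial: x ~ N(theta, I)
    x = [rng.gauss(t, 1.0) for t in theta]
    loss_mle += squared_error(x, theta)
    loss_js += squared_error(james_stein(x), theta)

risk_mle = loss_mle / trials  # expected to be close to n = 10
risk_js = loss_js / trials
print(risk_mle, risk_js)
```

In this setup the simulated risk of the shrinkage estimator comes out below that of the ordinary estimator, which is the content of the inadmissibility claim.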

## Statistical models

### Parametric models

• Data: $$x_i$$
• Parameters: $$\theta_j$$
• Model: $$f(\vec{x} ; \vec{\theta})$$

### Canonical distributions

#### Bernoulli distribution

$\mathrm{Ber}(k; p) = \begin{cases} p & \mathrm{if}\ k = 1 \\ 1-p & \mathrm{if}\ k = 0 \end{cases} \label{eq:bernoulli}$

which can also be written as

$\mathrm{Ber}(k; p) = p^k \: (1-p)^{(1-k)} \quad \mathrm{for}\ k \in \{0, 1\}$

or

$\mathrm{Ber}(k; p) = p k + (1-p)(1-k) \quad \mathrm{for}\ k \in \{0, 1\}$
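The three ways of writing the Bernoulli distribution agree on $$k \in \{0, 1\}$$, which a short check makes explicit. This is a minimal sketch with hypothetical function names.

```python
def ber_cases(k, p):
    """Piecewise definition: p if k = 1, and 1 - p if k = 0."""
    return p if k == 1 else 1 - p

def ber_power(k, p):
    """Product form: p^k (1 - p)^(1 - k)."""
    return p ** k * (1 - p) ** (1 - k)

def ber_linear(k, p):
    """Linear form: p k + (1 - p)(1 - k)."""
    return p * k + (1 - p) * (1 - k)

for p in (0.0, 0.3, 1.0):
    for k in (0, 1):
        print(p, k, ber_cases(k, p), ber_power(k, p), ber_linear(k, p))
```

The product form is the one that carries over most directly to likelihoods, since its logarithm splits into a sum of two terms.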

• Binomial distribution
• Poisson distribution

TODO: explain. Another important relationship is shown in Figure 1.

Figure 1: Relationships among Bernoulli, binomial, categorical, and multinomial distributions.

#### Normal/Gaussian distribution

$N(x \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\,\pi\:\sigma^2}} \: \exp\left(\frac{-(x-\mu)^2}{2\,\sigma^2}\right) \label{eq:gaussian}$

and in $$k$$ dimensions:

$N(\vec{x} \,|\, \vec{\mu}, \boldsymbol{\Sigma}) = (2 \pi)^{-k/2}\:\left|\boldsymbol{\Sigma}\right|^{-1/2} \: \exp\left(\frac{-1}{2}\:(\vec{x}-\vec{\mu})^{\mathsf{T}}\:\boldsymbol{\Sigma}^{-1}\:(\vec{x}-\vec{\mu})\right) \label{eq:gaussian_k_dim}$

where $$\boldsymbol{\Sigma}$$ is the covariance matrix (defined in eq. $$\eqref{eq:covariance_matrix_indexed}$$) of the distribution.

• Central limit theorem
• $$\chi^2$$ distribution
• Univariate distribution relationships (Figure 2)
• Distributions in the exponential family are maximum entropy distributions.

Figure 2: Detail of a figure showing relationships among univariate distributions. See the full figure here.31

## Point estimation and confidence intervals

### Inverse problems

Recall that in a parametric model, the pdf of the data, $$x_i$$, is modeled by a function, $$f(x_i ; \theta_j)$$, with parameters, $$\theta_j$$. In a statistical inverse problem, the goal is to infer values of the model parameters, $$\theta_j$$, given some finite set of data, $$\{x_i\}$$, sampled from a probability density, $$f(x_i; \theta_j)$$, that models the data reasonably well.33

### Bias and variance

The bias of an estimator, $$\hat\theta$$, is defined as

$\mathrm{Bias}(\hat{\theta}) \equiv \mathbb{E}(\hat{\theta} - \theta) = \int dx \: P(x|\theta) \: (\hat{\theta} - \theta) \label{eq:bias}$

The mean squared error (MSE) of an estimator has a similar formula to variance (eq. $$\eqref{eq:variance}$$) except that instead of quantifying the square of the difference of the estimator and its expected value, the MSE uses the square of the difference of the estimator and the true parameter:

$\mathrm{MSE}(\hat{\theta}) \equiv \mathbb{E}((\hat{\theta} - \theta)^2) \label{eq:mse}$

The MSE of an estimator can be related to its bias and its variance by the following proof:

\begin{align} \mathrm{MSE}(\hat{\theta}) &= \mathbb{E}(\hat{\theta}^2 - 2 \: \hat{\theta} \: \theta + \theta^2) \nonumber \\ &= \mathbb{E}(\hat{\theta}^2) - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align}

noting that

$\mathrm{Var}(\hat{\theta}) = \mathbb{E}(\hat{\theta}^2) - \mathbb{E}(\hat{\theta})^2$

and

\begin{align} \mathrm{Bias}(\hat{\theta})^2 &= \mathbb{E}(\hat{\theta} - \theta)^2 \nonumber \\ &= \mathbb{E}(\hat{\theta})^2 - 2 \: \mathbb{E}(\hat{\theta}) \: \theta + \theta^2 \end{align}

we see that MSE is equivalent to

$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \label{eq:mse_variance_bias}$

For an unbiased estimator, the MSE is the variance of the estimator.
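The decomposition can be seen in a simulation. The sketch below uses, as an assumed example, the variance estimator that divides by $$n$$ rather than $$n - 1$$, applied to small Gaussian samples; the sample size, true variance, and seed are arbitrary choices.

```python
import random

def biased_variance(sample):
    """Variance estimator that divides by n, which is biased low."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

rng = random.Random(4)
true_var = 4.0  # hypothetical true parameter (sigma = 2)
n = 5           # small samples make the bias visible

# Repeat the experiment many times to sample the estimator's distribution.
estimates = [
    biased_variance([rng.gauss(0.0, 2.0) for _ in range(n)])
    for _ in range(20_000)
]

mean_est = sum(estimates) / len(estimates)
bias = mean_est - true_var
variance = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
mse = sum((e - true_var) ** 2 for e in estimates) / len(estimates)
print(bias, variance, mse, variance + bias ** 2)
```

For this estimator the expected value is $$\frac{n-1}{n} \sigma^2 = 3.2$$, so the simulated bias should be near $$-0.8$$, and the last two printed numbers agree up to rounding.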

TODO:

• Note the discussion of the bias-variance tradeoff by Cranmer.
• Note the new deep learning view. See Deep learning.

### Maximum likelihood estimation

A maximum likelihood estimator (MLE) was first used by Fisher.35

$\hat{\theta} \equiv \underset{\theta}{\mathrm{argmax}} \: \mathrm{log} \: L(\theta) \label{eq:mle}$

Maximizing $$\mathrm{log} \: L(\theta)$$ is equivalent to maximizing $$L(\theta)$$ because the logarithm is monotonically increasing. The former is more convenient because, for data that are independent and identically distributed (i.i.d.), the joint likelihood factors into a product of the likelihoods of the individual measurements:

$L(\theta) = \prod_i L(\theta|x_i) = \prod_i P(x_i|\theta)$

and taking the log of the product makes it a sum:

$\mathrm{log} \: L(\theta) = \sum_i \mathrm{log} \: L(\theta|x_i) = \sum_i \mathrm{log} \: P(x_i|\theta)$

Maximizing $$\mathrm{log} \: L(\theta)$$ is also equivalent to minimizing $$-\mathrm{log} \: L(\theta)$$, the negative log-likelihood (NLL). For data that are i.i.d.,

$\mathrm{NLL} \equiv - \log L = - \log \prod_i L_i = - \sum_i \log L_i = \sum_i \mathrm{NLL}_i$
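As an illustration, the sketch below minimizes a summed NLL by a simple scan over the parameter. The exponential model $$P(x|\lambda) = \lambda \, e^{-\lambda x}$$, the simulated data, and the grid are assumptions chosen because the MLE has a closed form, $$\hat{\lambda} = n / \sum_i x_i$$, to compare against.

```python
import math
import random

def nll(rate, data):
    """Sum of per-measurement NLL_i = -log(rate) + rate * x_i."""
    return sum(-math.log(rate) + rate * x for x in data)

rng = random.Random(2)
# Hypothetical i.i.d. data drawn from an exponential with true rate 2.
data = [rng.expovariate(2.0) for _ in range(500)]

# Scan candidate rates from 0.5 to 4.0 in steps of 0.005; keep the smallest NLL.
grid = [0.5 + 0.005 * i for i in range(701)]
rate_scan = min(grid, key=lambda rate: nll(rate, data))

rate_closed_form = len(data) / sum(data)
print(rate_scan, rate_closed_form)
```

A grid scan is used here only to keep the example dependency-free. In practice a numerical optimizer would normally replace it.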

#### Invariance of likelihoods under reparametrization

• Likelihoods are invariant under reparametrization.36
• Bayesian posteriors are not invariant in general.

#### Ordinary least squares

• Least squares from MLE of Gaussian models: $$\chi^2$$
• Ordinary Least Squares (OLS)
• Geometric interpretation

### Variance of MLEs

Figure 3: Transformation of non-parabolic log-likelihood to parabolic (source: my slides, recreation of James (2006), p. 235).

### Bayesian credibility intervals

• Inverse problem to find a posterior probability distribution.
• Maximum a posteriori estimation (MAP)
• Prior sensitivity
• Not invariant to reparametrization in general
• Jeffreys priors are
• TODO: James

## Statistical hypothesis testing

### Null hypothesis significance testing

• Karl Pearson observing how rare sequences of roulette spins are
• Null hypothesis significance testing (NHST)
• goodness of fit
• Fisher

Fisher:

[T]he null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.50

### Neyman-Pearson theory

#### Introduction

Figure 4: TODO: ROC explainer. (Wikimedia, 2015).

#### Neyman-Pearson lemma

Neyman-Pearson lemma:54

For a fixed signal efficiency, $$1-\alpha$$, the selection that corresponds to the lowest possible misidentification probability, $$\beta$$, is given by

$\frac{L(H_1)}{L(H_0)} > k_{\alpha} \,, \label{eq:np-lemma}$

where $$k_{\alpha}$$ is the cut value required to achieve a type-1 error rate of $$\alpha$$.

Neyman-Pearson test statistic:

$q_\mathrm{NP} = - 2 \ln \frac{L(H_1)}{L(H_0)} \label{eq:qnp-test-stat}$

Profile likelihood ratio:

$\lambda(\mu) = \frac{ L(\mu, \hat{\theta}_\mu) }{ L(\hat{\mu}, \hat{\theta}) } \label{eq:profile-llh-ratio}$

where $$\hat{\theta}$$ is the (unconditional) maximum-likelihood estimator that maximizes $$L$$, while $$\hat{\theta}_\mu$$ is the conditional maximum-likelihood estimator that maximizes $$L$$ for a specified signal strength, $$\mu$$, and $$\theta$$ as a vector includes all other parameters of interest and nuisance parameters.

#### Neyman construction

Cranmer: Neyman construction.

Figure 5: Neyman construction for a confidence belt for $$\theta$$ (source: K. Cranmer, 2020).

TODO: fix

$q = - 2 \ln \frac{L(\mu\,s + b)}{L(b)} \label{eq:q0-test-stat}$

#### Flip-flopping

• Flip-flopping and Feldman-Cousins confidence intervals55

### p-values and significance

• $$p$$-values and significance56
• Coverage
• Fisherian vs Neyman-Pearson $$p$$-values

Cowan et al. define a $$p$$-value as

a probability, under assumption of $$H$$, of finding data of equal or greater incompatibility with the predictions of $$H$$.57

Also:

It should be emphasized that in an actual scientific context, rejecting the background-only hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree of belief that a new process is present will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data. Here, however, we only consider the task of determining the $$p$$-value of the background-only hypothesis; if it is found below a specified threshold, we regard this as “discovery.”58

#### CLs method

• Conservative coverage; used in particle physics
• Junk60
• ATLAS62

### Asymptotics

• Analytic variance of the likelihood-ratio of gaussians: $$\chi^2$$
• Wilks63
• Under the null hypothesis, $$-2 \ln(\lambda) \sim \chi^{2}_{k}$$, where $$k$$, the number of degrees of freedom of the $$\chi^{2}$$ distribution, is the number of parameters of interest (including signal strength) in the signal model but not in the null hypothesis background model.
• Wald64
• Wald generalized the work of Wilks for the case of testing some nonzero signal for exclusion, showing $$-2 \ln(\lambda) \approx (\hat{\theta} - \theta)^{\mathsf{T}}V^{-1} (\hat{\theta} - \theta) \sim \mathrm{noncentral}\:\chi^{2}_{k}$$.
• In the simplest case where there is only one parameter of interest (the signal strength, $$\mu$$), then $$-2 \ln(\lambda) \approx \frac{ (\hat{\mu} - \mu)^{2} }{ \sigma^2 } \sim \mathrm{noncentral}\:\chi^{2}_{1}$$.
• Pearson $$\chi^2$$-test
• Cowan et al.65
• Criteria for projected discovery and exclusion sensitivities of counting experiments66

### Student’s t-test

• Student’s t-test
• ANOVA
• A/B-testing

### Frequentist vs Bayesian decision theory

Support for using Bayes factors:

which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.69

### Examples

• Difference of two means: $$t$$-test
• A/B-testing
• New physics

## Uncertainty quantification

### Sinervo classification of systematic uncertainties

• Class-1, class-2, and class-3 systematic uncertainties (good, bad, ugly), Classification by Pekka Sinervo (PhyStat2003)70
• Not to be confused with type-1 and type-2 errors in Neyman-Pearson theory
• Heinrich, J. & Lyons, L. (2007). Systematic errors.71
• Caldeira & Nord72

Lyons:

In analyses involving enough data to achieve reasonable statistical accuracy, considerably more effort is devoted to assessing the systematic error than to determining the parameter of interest and its statistical error.73

Figure 6: Classification of measurement uncertainties (philosophy-in-figures.tumblr.com, 2016).

• Poincaré’s three levels of ignorance

### Profile likelihoods

• Profiling and the profile likelihood
• Importance of Wald and Cowan et al.
• hybrid Bayesian-frequentist method

### Examples of poor estimates of systematic uncertainties

Figure 7: Demonstration of sensitivity to the jet energy scale for an alleged excess in $$Wjj$$ by Tommaso Dorigo (2011) (see also: GIF).

• OPERA. (2011). Faster-than-light neutrinos.
• BICEP2 claimed evidence of B-modes in the CMB as evidence of cosmic inflation without accounting for cosmic dust.

## Statistical classification

### Introduction

• Precision vs recall
• Recall is sensitivity
• Sensitivity vs specificity
• Accuracy
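These four quantities are all ratios of confusion-matrix counts. A minimal sketch with made-up counts for a binary classifier:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true positives that are found (sensitivity)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of true negatives that are correctly rejected."""
    return tn / (tn + fp)

def accuracy(tp, fp, fn, tn):
    """Fraction of all cases that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30

print(precision(tp, fp), recall(tp, fn), specificity(tn, fp),
      accuracy(tp, fp, fn, tn))
```

With these counts the precision is 0.8, the recall is 2/3, the specificity is 0.75, and the accuracy is 0.7, which shows that the four numbers can differ substantially for the same classifier.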

• TODO

## Causal inference

### Causal models

• Structural Causal Model (SCM)
• Pearl, J. (2009). Causal inference in statistics: An overview.76
• Robins, J.M. & Wasserman, L. (1999). On the impossibility of inferring causation from association without background knowledge.77
• Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference.78

• TODO

## Exploratory data analysis

### Look-elsewhere effect

• Look-elsewhere effect (LEE)
• AKA File-drawer effect
• Stopping rules
• validation dataset
• statistical issues, violates the likelihood principle

## “Statistics Wars”

### Introduction

• Kruschke
• Carnap
• “The two concepts of probability”80
• Royall
• “What do these data say?”81

Cranmer:

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one can define the joint probability $$P (A, B)$$ in a frequentist way, then a Frequentist is perfectly happy using Bayes theorem. Thus, the debate starts at the very definition of probability.82

Neyman:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.83

### Likelihood principle

• Likelihood principle
• The likelihood principle is the proposition that, given a statistical model and a data sample, all the evidence relevant to model parameters is contained in the likelihood function.
• The history of likelihood85
• Berger & Wolpert. (1988). The Likelihood Principle.88

O’Hagan:

The first key argument in favour of the Bayesian approach can be called the axiomatic argument. We can formulate systems of axioms of good inference, and under some persuasive axiom systems it can be proved that Bayesian inference is a consequence of adopting any of these systems… If one adopts two principles known as ancillarity and sufficiency principles, then under some statement of these principles it follows that one must adopt another known as the likelihood principle. Bayesian inference conforms to the likelihood principle whereas classical inference does not. Classical procedures regularly violate the likelihood principle or one or more of the other axioms of good inference. There are no such arguments in favour of classical inference.89

Mayo:

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. … [T]he same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.94

### Discussion

Lyons:

Particle Physicists tend to favor a frequentist method. This is because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors.95

Figure 9: The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference (gandenberger.org/research/, 2015).

Goodman:

The idea that the $$P$$ value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).115

## Replication crisis

### Introduction

• Ioannidis, J.P. (2005). Why most published research findings are false.116

### p-value controversy

[N]o isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.120

From “The ASA president’s task force statement on statistical significance and replicability”:

P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature. They are important tools that have advanced science through their proper application.123

## Classical machine learning

### Introduction

• Classification vs regression
• Supervised and unsupervised learning
• Classification = supervised; clustering = unsupervised
• Hastie, Tibshirani, & Friedman124
• Information Theory, Inference, and Learning125
• Murphy, K.P. (2012). Machine Learning: A probabilistic perspective. MIT Press.126
• Murphy, K.P. (2022). Probabilistic Machine Learning: An introduction. MIT Press.127
• Shalev-Shwartz, S. & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms.128
• VC-dimension
• Vapnik (1994)129
• Shalev-Shwartz, S. & Ben-David, S. (2014).130

### Logistic regression

From a probabilistic point of view,134 logistic regression can be derived from maximum likelihood estimation of a vector of model parameters, $$\vec{w}$$. The dot product of $$\vec{w}$$ with the input features, $$\vec{x}$$, is squashed with a logistic function to yield the probability, $$\mu$$, that a Bernoulli random variable, $$y \in \{0, 1\}$$, takes the value 1.

$p(y | \vec{x}, \vec{w}) = \mathrm{Ber}(y | \mu(\vec{x}, \vec{w})) = \mu(\vec{x}, \vec{w})^y \: (1-\mu(\vec{x}, \vec{w}))^{(1-y)}$

The negative log-likelihood of multiple trials is

\begin{align} \mathrm{NLL} &= - \sum_i \log p(y_i | \vec{x}_i, \vec{w}) \nonumber \\ &= - \sum_i \log\left( \mu(\vec{x}_i, \vec{w})^{y_i} \: (1-\mu(\vec{x}_i, \vec{w}))^{(1-y_i)} \right) \nonumber \\ &= - \sum_i \log\left( \mu_i^{y_i} \: (1-\mu_i)^{(1-y_i)} \right) \nonumber \\ &= - \sum_i \big( y_i \, \log \mu_i + (1-y_i) \log(1-\mu_i) \big) \label{eq:cross_entropy_loss0} \end{align}

which is the cross entropy loss. Note that the first term is non-zero only when the true target is $$y_i=1$$, and similarly the second term is non-zero only when $$y_i=0$$.135 Therefore, we can reparametrize the target $$y_i$$ in favor of $$t_{ki}$$ that is one-hot in an index $$k$$ over classes.

$\mathrm{CEL} = \mathrm{NLL} = - \sum_i \sum_k \big( t_{ki} \, \log \mu_{ki} \big) \label{eq:cross_entropy_loss1}$

where

$t_{ki} = \begin{cases} 1 & \mathrm{if}\ (k = y_i = 0)\ \mathrm{or}\ (k = y_i = 1) \\ 0 & \mathrm{otherwise} \end{cases}$

and

$\mu_{ki} = \begin{cases} 1-\mu_i & \mathrm{if}\ k = 0 \\ \mu_i & \mathrm{if}\ k =1 \end{cases}$

This readily generalizes from binary classification to classification over many classes as we will discuss more below. Note that in the sum over classes, $$k$$, only one term for the true class contributes.

$\mathrm{CEL} = - \left. \sum_i \log \mu_{ki} \right|_{k\ \mathrm{is\ such\ that}\ t_{ki}=1} \label{eq:cross_entropy_loss2}$

Logistic regression uses the logit function,136 which is the logarithm of the odds—the ratio of the chance of success to failure. Let $$\mu$$ be the probability of success in a Bernoulli trial, then the logit function is defined as

$\mathrm{logit}(\mu) \equiv \log\left(\frac{\mu}{1-\mu}\right) \label{eq:logit}$

Logistic regression assumes that the logit function is a linear function of the explanatory variable, $$x$$.

$\log\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x$

where $$\beta_0$$ and $$\beta_1$$ are trainable parameters. (TODO: Why would we assume this?) This can be generalized to a vector of multiple input variables, $$\vec{x}$$, where the input vector has a 1 prepended to be its zeroth component in order to conveniently include the bias, $$\beta_0$$, in a dot product.

$\vec{x} = (1, x_1, x_2, \ldots, x_n)^{\mathsf{T}}$

$\vec{w} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^{\mathsf{T}}$

$\log\left(\frac{\mu}{1-\mu}\right) = \vec{w}^{\mathsf{T}}\vec{x}$

For the moment, let $$z \equiv \vec{w}^{\mathsf{T}}\vec{x}$$. Exponentiating and solving for $$\mu$$ gives

$\mu = \frac{ e^z }{ 1 + e^z } = \frac{ 1 }{ 1 + e^{-z} }$

This function is called the logistic or sigmoid function.

$\mathrm{logistic}(z) \equiv \mathrm{sigm}(z) \equiv \frac{ 1 }{ 1 + e^{-z} } \label{eq:logistic}$

Since we inverted the logit function by solving for $$\mu$$, the inverse of the logit function is the logistic or sigmoid.

$\mathrm{logit}^{-1}(z) = \mathrm{logistic}(z) = \mathrm{sigm}(z)$

And therefore,

$\mu = \mathrm{sigm}(z) = \mathrm{sigm}(\vec{w}^{\mathsf{T}}\vec{x})$
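The pieces of this section fit together in a few lines of code. The sketch below is illustrative only: the function names, the tiny data set, and the weight values are assumptions, and no training loop is included. It evaluates $$\mu = \mathrm{sigm}(\vec{w}^{\mathsf{T}}\vec{x})$$ with a 1 prepended to each input, and the cross entropy loss summed over the data.

```python
import math

def sigm(z):
    """Logistic (sigmoid) function: the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(mu):
    """Log of the odds, mu / (1 - mu)."""
    return math.log(mu / (1.0 - mu))

def predict(w, x):
    """Probability that y = 1, given weights w and features x (x[0] = 1)."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return sigm(z)

def nll(w, xs, ys):
    """Cross entropy loss: -sum_i [y_i log mu_i + (1 - y_i) log(1 - mu_i)]."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = predict(w, x)
        total -= y * math.log(mu) + (1 - y) * math.log(1.0 - mu)
    return total

# Hypothetical data: each input has a leading 1 so w[0] acts as the bias.
xs = [(1.0, 0.5), (1.0, 2.0), (1.0, -1.0)]
ys = [0, 1, 0]
w = (-1.0, 1.5)  # assumed weights, not fitted

print(logit(sigm(0.7)), nll(w, xs, ys), nll((0.0, 0.0), xs, ys))
```

With all weights at zero every prediction is 0.5, so the loss is $$3 \log 2$$. The assumed weights separate the toy data better and give a smaller loss.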

### Softmax regression

Again, from a probabilistic point of view, we can derive the use of multi-class cross entropy loss by starting with the Bernoulli distribution, generalizing it to multiple classes (indexed by $$k$$) as

$p(y_k | \mu) = \mathrm{Cat}(y_k | \mu_k) = \prod_k {\mu_k}^{y_k} \label{eq:categorical_distribution}$

which is the categorical or multinoulli distribution. The negative log-likelihood of multiple independent trials is

$\mathrm{NLL} = - \sum_i \log \left(\prod_k {\mu_{ki}}^{y_{ki}}\right) = - \sum_i \sum_k y_{ki} \: \log \mu_{ki} \label{eq:nll_multinomial}$

Noting again that $$y_{ki} = 1$$ only when $$k$$ is the true class, and is 0 otherwise, this simplifies to eq. $$\eqref{eq:cross_entropy_loss2}$$.

• Multinomial logistic regression
• Softmax is really a soft argmax. TODO: find ref.
• Softmax is not unique. There are other squashing functions.138
• Roelants, P. (2019). Softmax classification with cross-entropy.
• Gradients from backprop through a softmax
• Goodfellow et al. point out that any negative log-likelihood is a cross entropy between the training data and the probability distribution predicted by the model.139
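A compact sketch of the multi-class case follows. It assumes the standard softmax definition, $$\mu_k = e^{z_k} / \sum_j e^{z_j}$$, which the notes above mention but do not write out, and it takes the true class as an integer index rather than a one-hot vector. The function names and inputs are hypothetical.

```python
import math

def softmax(z):
    """Map logits z_k to probabilities exp(z_k) / sum_j exp(z_j)."""
    shift = max(z)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(zk - shift) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(logits_batch, labels):
    """Sum over trials of -log(mu_k) for the true class k of each trial."""
    loss = 0.0
    for z, k_true in zip(logits_batch, labels):
        mu = softmax(z)
        loss -= math.log(mu[k_true])
    return loss

# Hypothetical logits for two trials over three classes.
logits_batch = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]

print(softmax(logits_batch[0]), cross_entropy_loss(logits_batch, labels))
```

With two classes and logits $$(0, z)$$, the softmax probability of the second class equals $$\mathrm{sigm}(z)$$, which recovers the binary case.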

## Deep learning

### Introduction

Figure 11: Raw input image is transformed into gradually higher levels of representation.156

### Regularization

Regularization = any change we make to the training algorithm in order to reduce the generalization error but not the training error.165

Most common regularizations:

• L2 Regularization
• L1 Regularization
• Data Augmentation
• Dropout
• Early Stopping

Papers:

### Batch size vs learning rate

Papers:

1. Keskar, N.S. et al. (2016). On large-batch training for deep learning: Generalization gap and sharp minima.

[L]arge-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known—sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

• $$\eta \propto \sqrt{m}$$
1. Goyal, P. et al. (2017). Accurate large minibatch SGD: Training ImageNet in 1 hour.

• $$\eta \propto m$$
2. You, Y. et al. (2017). Large batch training of convolutional networks.

• Layer-wise Adaptive Rate Scaling (LARS)
3. You, Y. et al. (2017). ImageNet training in minutes.

• Layer-wise Adaptive Rate Scaling (LARS)
4. Jastrzebski, S. et al. (2018). Three factors influencing minima in SGD.

• $$\eta \propto m$$
5. Smith, S.L. & Le, Q.V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent.

6. Smith, S.L. et al. (2018). Don’t decay the learning rate, increase the batch size.

• $$m \propto \eta$$
7. Masters, D. & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

This linear scaling rule has been widely adopted, e.g., in Krizhevsky (2014), Chen et al. (2016), Bottou et al. (2016), Smith et al. (2017) and Jastrzebski et al. (2017).

On the other hand, as shown in Hoffer et al. (2017), when $$m \ll M$$, the covariance matrix of the weight update $$\mathrm{Cov}(\eta \Delta\theta)$$ scales linearly with the quantity $$\eta^2/m$$.

This implies that, adopting the linear scaling rule, an increase in the batch size would also result in a linear increase in the covariance matrix of the weight update $$\eta \Delta\theta$$. Conversely, to keep the scaling of the covariance of the weight update vector $$\eta \Delta\theta$$ constant would require scaling $$\eta$$ with the square root of the batch size $$m$$ (Krizhevsky, 2014; Hoffer et al., 2017).

1. Lin, T. et al. (2020). Don’t use large mini-batches, use local SGD.
• Post-local SGD.

2. Golmant, N. et al. (2018). On the computational inefficiency of large batch sizes for stochastic gradient descent.

Scaling the learning rate as $$\eta \propto \sqrt{m}$$ attempts to keep the weight increment length statistics constant, but the distance between SGD iterates is governed more by properties of the objective function than the ratio of learning rate to batch size. This rule has also been found to be empirically sub-optimal in various problem domains. … There does not seem to be a simple training heuristic to improve large batch performance in general.

1. McCandlish, S. et al. (2018). An empirical model of large-batch training.
• Critical batch size
2. Shallue, C.J. et al. (2018). Measuring the effects of data parallelism on neural network training.

In all cases, as the batch size grows, there is an initial period of perfect scaling ($$b$$-fold benefit, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever.

1. Jastrzebski, S. et al. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio.
• $$\eta \propto m$$

We show this experimentally in Fig. 5, where similar learning dynamics and final performance can be observed when simultaneously multiplying the learning rate and batch size by a factor up to a certain limit.

1. You, Y. et al. (2019). Large-batch training for LSTM and beyond.
• Warmup and use $$\eta \propto \sqrt{m}$$

[W]e propose linear-epoch gradual-warmup approach in this paper. We call this approach Leg-Warmup (LEGW). LEGW enables a Sqrt Scaling scheme in practice and as a result we achieve much better performance than the previous Linear Scaling learning rate scheme. For the GNMT application (Seq2Seq) with LSTM, we are able to scale the batch size by a factor of 16 without losing accuracy and without tuning the hyper-parameters mentioned above.

1. You, Y. et al. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes.
• LARS and LAMB
2. Zhang, G. et al. (2019). Which algorithmic choices matter at which batch sizes? Insights from a Noisy Quadratic Model.

Consistent with the empirical results of Shallue et al. (2018), each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size.

1. Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks.

Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning “generalizing” kernels rather than “memorizing” kernels.

1. Kaplan, J. et al. (2020). Scaling laws for neural language models.

2. Jastrzebski, S. et al. (2020). The break-even point on optimization trajectories of deep neural networks.
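The two learning-rate scaling rules that recur in the papers above can be compared with a few lines of arithmetic. The sketch below takes the quantity $$\eta^2/m$$ from the Masters & Luschi passage (valid for $$m \ll M$$) as a proxy for the scale of the weight-update covariance and evaluates it under the linear rule $$\eta \propto m$$ and the square-root rule $$\eta \propto \sqrt{m}$$. The function names and the baseline values (learning rate 0.1 at batch size 32) are hypothetical choices for illustration.

```python
def update_covariance_scale(lr, batch_size):
    """The quantity eta^2 / m that the quoted passage says Cov(eta * dtheta) scales with."""
    return lr ** 2 / batch_size

def scaled_lr(base_lr, base_batch, batch_size, rule):
    """Learning rate for a new batch size under a 'linear' or 'sqrt' scaling rule."""
    ratio = batch_size / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

base_lr, base_batch = 0.1, 32     # hypothetical baseline values
baseline = update_covariance_scale(base_lr, base_batch)

for m in (32, 128, 512):
    for rule in ("linear", "sqrt"):
        lr = scaled_lr(base_lr, base_batch, m, rule)
        # Ratio of the covariance proxy to its baseline value.
        print(m, rule, lr, update_covariance_scale(lr, m) / baseline)
```

Under the linear rule the proxy grows in proportion to the batch size, while the square-root rule keeps it constant, which is the trade-off the quoted passages describe.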

### Natural language processing

Chain rule of language modeling (chain rule of probability):

$P(x_1, \ldots, x_T) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{T} P(x_t | x_1 \ldots x_{t-1}) \label{eq:chain_rule_of_lm}$

or for the whole sentence:

$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1 \ldots x_{t-1}) \label{eq:chain_rule_of_lm_2}$

$= P(x_1) \: P(x_2 | x_1) \: P(x_3 | x_1 x_2) \: P(x_4 | x_1 x_2 x_3) \ldots$
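The chain rule is an identity of probability theory, not a modeling assumption, and it can be checked numerically. The sketch below draws an arbitrary joint distribution over three binary variables (a toy stand-in for a three-token sequence), computes each conditional by marginalizing the joint, and confirms that the product of conditionals reproduces the joint for every sequence. The helper names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()              # arbitrary joint distribution P(x1, x2, x3)

def conditional(joint, prefix, x_t):
    """P(x_t | prefix), obtained by marginalizing the joint over all later positions."""
    t = len(prefix)
    later_axes = tuple(range(t + 1, joint.ndim))
    marg = joint.sum(axis=later_axes) if later_axes else joint   # P(x_1 .. x_t)
    numerator = marg[tuple(prefix) + (x_t,)]
    denominator = marg[tuple(prefix)].sum()                      # P(x_1 .. x_{t-1})
    return numerator / denominator

def chain_rule_probability(joint, xs):
    """Product over t of P(x_t | x_1 .. x_{t-1})."""
    p = 1.0
    for t, x_t in enumerate(xs):
        p *= conditional(joint, xs[:t], x_t)
    return p

# Largest discrepancy between the chain-rule product and the joint, over all sequences.
max_gap = max(abs(chain_rule_probability(joint, xs) - joint[xs])
              for xs in itertools.product(range(2), repeat=3))
print(max_gap)
```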

Auto-regressive inference follows this chain rule. If done with greedy search:

$\hat{x}_t = \underset{x_t \in V}{\mathrm{argmax}} \: P(x_t | x_1 \ldots x_{t-1}) \label{eq:greedy_search}$

Figure 12: Diagram of the Transformer model (source: d2l.ai).
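Greedy auto-regressive decoding, as in the argmax rule above, can be sketched with a toy model. Everything here is hypothetical: the five-token vocabulary, the probability table, and the function names. For brevity the toy model is a bigram model, which conditions only on the previous token; that is a simplification of the full conditional $$P(x_t | x_1 \ldots x_{t-1})$$, although `next_token_probs` still receives the whole prefix so that a richer model could be swapped in.

```python
import numpy as np

VOCAB = ["<s>", "the", "cat", "sat", "</s>"]

# Hypothetical bigram table: row = previous token, column = next token; rows sum to 1.
BIGRAM = np.array([
    [0.0, 0.8, 0.1, 0.1, 0.0],    # after <s>
    [0.0, 0.0, 0.7, 0.2, 0.1],    # after the
    [0.0, 0.1, 0.0, 0.8, 0.1],    # after cat
    [0.0, 0.2, 0.1, 0.0, 0.7],    # after sat
    [0.0, 0.0, 0.0, 0.0, 1.0],    # after </s>
])

def next_token_probs(prefix):
    """Distribution over the next token; this toy model looks only at the last token."""
    return BIGRAM[prefix[-1]]

def greedy_decode(max_len=10):
    """Repeatedly append the most probable next token until </s> or max_len."""
    tokens = [VOCAB.index("<s>")]
    log_prob = 0.0
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        nxt = int(np.argmax(probs))
        log_prob += float(np.log(probs[nxt]))   # chain rule: log-probabilities add up
        tokens.append(nxt)
        if VOCAB[nxt] == "</s>":
            break
    return [VOCAB[t] for t in tokens], log_prob

sentence, log_prob = greedy_decode()
print(sentence, np.exp(log_prob))
```

Greedy search picks the locally best token at each step, so it is not guaranteed to find the most probable sequence overall.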

The Transformer is built on scaled dot-product attention:

$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q\, K^{\mathsf{T}}}{\sqrt{d_k}}\right) V \label{eq:attention}$

Figure 13: Diagram of the BERT model (source: peltarion.com).
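The attention formula above translates almost line by line into NumPy. This is a minimal single-head sketch without masking, batching, or learned projections; the shapes (3 queries, 5 keys, key dimension 4, value dimension 2) and the random inputs are arbitrary illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys of dimension d_k = 4
V = rng.normal(size=(5, 2))   # 5 values of dimension 2

out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Each output row is a weighted average of the value vectors, with weights given by the softmax over that query's scores.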

### Reinforcement learning

#### Q-learning

• Q-learning and DQN
• Uses the Markov Decision Process (MDP) framework
• The Bellman equation211
• Q-learning is a value-based learning algorithm. Value-based algorithms update the value function using an equation (in particular, the Bellman equation), whereas the other type, policy-based algorithms, estimate the value function with a greedy policy obtained from the last policy improvement (source: towardsdatascience.com).
• DQN masters Atari212
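The value-based update described above can be sketched with tabular Q-learning on a toy problem. The environment is a hypothetical five-state chain invented for this example: the agent starts at the left end, moves left or right, and receives reward 1 on reaching the rightmost state. The hyperparameters (`alpha`, `gamma`, `epsilon`, number of episodes) are arbitrary illustrative values.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2        # actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Deterministic chain: reward 1 on reaching the rightmost state, else 0."""
    nxt = max(state - 1, 0) if action == 0 else min(state + 1, GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy behaviour policy: explore with probability epsilon.
            if rng.random() < epsilon:
                action = int(rng.integers(N_ACTIONS))
            else:
                action = int(np.argmax(Q[state]))
            nxt, reward, done = step(state, action)
            # Bellman target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * np.max(Q[nxt]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = nxt
    return Q

Q = q_learning()
greedy_policy = np.argmax(Q[:GOAL], axis=1)   # best action in each non-terminal state
print(Q.round(3))
print(greedy_policy)
```

The learned table makes "right" the greedy action in every non-terminal state, with values that decay by the discount factor for each step of distance from the goal.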

## Theoretical machine learning

### Algorithmic information theory

• Ray Solomonoff (1926-2009)
• Solomonoff induction
• Naturally formalizes Occam’s razor
• Incomputable
• Rathmanner, S. & Hutter, M. (2011). A philosophical treatise of universal induction.233

### No free lunch theorems

Raissi et al.:

encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available.248

Roberts:

From an algorithmic complexity standpoint it is somewhat miraculous that we can compress our huge look-up table of experiment/outcome into such an efficient description. In many senses, this type of compression is precisely what we mean when we say that physics enables us to understand a given phenomenon.249

## Automation

### AutoML

• Neural Architecture Search (NAS)
• AutoML frameworks
• RL-driven NAS
• learned sparsity

### AutoScience

Figure 14: The inference cycle for the process of scientific inquiry. The three distinct forms of inference (abduction, deduction, and induction) facilitate an all-encompassing vision, enabling HPC and HDA to converge in a rational and structured manner. HPC: high-performance computing; HDA: high-end data analysis.285

## Implications for the realism debate

• Nope: Hennig

### Word meanings

Wittgenstein in PI:

The meaning of a word is its use in the language.291

and

One cannot guess how a word functions. One has to look at its use, and learn from that.292

## Annotated bibliography

• Mayo (1996)

• TODO

### Cowan, G. (1998). Statistical Data Analysis.

• Cowan (1998) and Cowan (2016)

• TODO

• James (2006)

• TODO

### Cowan, G. et al. (2011). Asymptotic formulae for likelihood-based tests of new physics.

• Cowan et al. (2011)
• Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

• TODO

• TODO

### Cranmer, K. (2015). Practical statistics for the LHC.

• Cranmer (2015)

#### My thoughts

• TODO

• All of Statistics293
• The Foundations of Statistics294

Aldrich, J. (1997). R. A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12, 162–176.
Amari, S. (2016). Information Geometry and Its Applications. Springer Japan.
Anderson, C. (2008). The End of Theory: The data deluge makes the scientific method obsolete. Wired. June 23, 2008. https://www.wired.com/2008/06/pb-theory/
Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine, 34, 26–38.
Asch, M. et al. (2018). Big data and extreme-scale computing: Pathways to Convergence-Toward a shaping strategy for a future software and data ecosystem for scientific inquiry. The International Journal of High Performance Computing Applications, 32, 435–479.
ATLAS and CMS Collaborations. (2011). Procedure for the LHC Higgs boson search combination in Summer 2011. CMS-NOTE-2011-005, ATL-PHYS-PUB-2011-11. http://cds.cern.ch/record/1379837
ATLAS Collaboration. (2012). Combined search for the Standard Model Higgs boson in $$pp$$ collisions at $$\sqrt{s}$$ = 7 TeV with the ATLAS detector. Physical Review D, 86, 032003. https://arxiv.org/abs/1207.0319
ATLAS Statistics Forum. (2011). The CLs method: Information for conference speakers. http://www.pp.rhul.ac.uk/~cowan/stat/cls/CLsInfo.pdf
Bach, F. (2022). Learning Theory from First Principles. (Draft). https://www.di.ens.fr/~fbach/ltfp_book.pdf
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations, 3rd, 2015. https://arxiv.org/abs/1409.0473
Bahri, Y. et al. (2020). Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics, 11, 501–528.
Balasubramanian, V. (1996a). A geometric formulation of Occam’s razor for inference of parametric distributions. https://arxiv.org/abs/adap-org/9601001
———. (1996b). Statistical inference, Occam’s razor and statistical mechanics on the space of probability distributions. https://arxiv.org/abs/cond-mat/9601030
Balestriero, R., Pesenti, J., & LeCun, Y. (2021). Learning in high dimension always amounts to extrapolation. https://arxiv.org/abs/2110.09485
Batson, J., Haaf, C. G., Kahn, Y., & Roberts, D. A. (2021). Topological obstructions to autoencoding. https://arxiv.org/abs/2102.08380
Baydin, A.G. et al. (2019). Etalumis: Bringing probabilistic programming to scientific simulators at scale. https://arxiv.org/abs/1907.03382
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116, 15849–15854. https://arxiv.org/abs/1812.11118
Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38, 716–719.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1–127. https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf
Benjamin, D.J. et al. (2017). Redefine statistical significance. PsyArXiv. July 22, 2017. https://psyarxiv.com/mky9j/
Benjamini, Y. et al. (2021). The ASA president’s task force statement on statistical significance and replicability. Annals of Applied Statistics, 16, 1–2. https://magazine.amstat.org/blog/2021/08/01/task-force-statement-p-value/
Bensusan, H. (2000). Is machine learning experimental philosophy of science? In ECAI2000 Workshop notes on scientific Reasoning in Artificial Intelligence and the Philosophy of Science (pp. 9–14).
Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18, 1–32.
Berger, J. O. & Wolpert, R. L. (1988). The Likelihood Principle (2nd ed.). Hayward, CA: The Institute of Mathematical Statistics.
Bhattiprolu, P. N., Martin, S. P., & Wells, J. D. (2020). Criteria for projected discovery and exclusion sensitivities of counting experiments. https://arxiv.org/abs/2009.07249
Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57, 269–326.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blondel, M., Martins, A. F., & Niculae, V. (2020). Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21, 1–69.
Bottou, L. (1998). Stochastic gradient descent tricks. In G. B. Orr & K. R. Muller (Eds.), Neural Networks: Tricks of the trade. Springer. https://www.microsoft.com/en-us/research/publication/stochastic-gradient-tricks/
Bousquet, O. et al. (2021). A theory of universal learning. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (pp. 532–541). https://dl.acm.org/doi/pdf/10.1145/3406325.3451087
Bowling, M., Burch, N., Johanson, M., & Tammelin, O. (2015). Heads-up limit hold’em poker is solved. Science, 347, 145–149. http://science.sciencemag.org/content/347/6218/145
Bronstein, M. M., Bruna, J., Cohen, T., & Velickovic, P. (2021). Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. https://arxiv.org/abs/2104.13478
Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16, 101–133. https://projecteuclid.org/euclid.ss/1009213286
Brown, N. (2020). Equilibrium finding for large adversarial imperfect-information games. (Ph.D. thesis). http://www.cs.cmu.edu/~noamb/thesis.pdf
Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. https://arxiv.org/abs/2007.13544
Brown, N., Lerer, A., Gross, S., & Sandholm, T. (2019). Deep counterfactual regret minimization. https://arxiv.org/abs/1811.00164
Brown, N. & Sandholm, T. (2018). Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359, 418–424. https://science.sciencemag.org/content/359/6374/418
———. (2019a). Solving imperfect-information games via discounted regret minimization. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 1829–1836. https://arxiv.org/abs/1809.04040
———. (2019b). Superhuman AI for multiplayer poker. Science, 365, 885–890. https://science.sciencemag.org/content/365/6456/885
Brown, T.B. et al. (2020). Language models are few-shot learners. (Paper on the GPT-3 model by OpenAI). https://arxiv.org/abs/2005.14165
Bubeck, S. & Sellke, M. (2021). A universal law of robustness via isoperimetry. https://arxiv.org/abs/2105.12806
Burch, N. (2018). Time and Space: Why imperfect information games are hard. University of Alberta. (Ph.D. thesis). https://era.library.ualberta.ca/items/db44409f-b373-427d-be83-cace67d33c41/view/bcb00dca-39e6-4c43-9ec2-65026a50135e/Burch_Neil_E_201712_PhD.pdf
Caldeira, J. & Nord, B. (2020). Deeply uncertain: comparing methods of uncertainty quantification in deep learning algorithms. Machine Learning: Science and Technology, 2, 015002. https://iopscience.iop.org/article/10.1088/2632-2153/aba6f3
Calin, O. & Udriste, C. (2014). Geometric Modeling in Probability and Statistics. Springer Switzerland.
Canatar, A., Bordelon, B., & Pehlevan, C. (2020). Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. https://arxiv.org/abs/2006.13198
Carnap, R. (1945). The two concepts of probability. Philosophy and Phenomenological Research, 5, 513–32.
———. (1947). Probability as a guide in life. Journal of Philosophy, 44, 141–48.
Cesa-Bianchi, N. & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press. https://ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf
Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). Neural ordinary differential equations. https://arxiv.org/abs/1806.07366
Chen, S., Dobriban, E., & Lee, J. H. (2020). A group-theoretic framework for data augmentation. https://arxiv.org/abs/1907.10905
Chen, T. & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. https://arxiv.org/abs/1603.02754
Chiley, V. et al. (2019). Online normalization for training neural networks. NeurIPS 2019. https://arxiv.org/abs/1905.05894
Church, K. W. & Hestness, J. (2019). A survey of 25 years of evaluation. Natural Language Engineering, 25, 753–767. https://www.cambridge.org/core/journals/natural-language-engineering/article/survey-of-25-years-of-evaluation/E4330FAEB9202EC490218E3220DDA291
Ciresan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 32, 333–338. https://arxiv.org/abs/1202.2745
Clopper, C. J. & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26, 404–413.
Cohen, T. S., Weiler, M., Kicanaoglu, B., & Welling, M. (2019). Gauge equivariant convolutional networks and the icosahedral CNN. https://arxiv.org/abs/1902.04615
Cohen, T. S. & Welling, M. (2016). Group equivariant convolutional networks. Proceedings of International Conference on Machine Learning, 2016, 2990–9. http://proceedings.mlr.press/v48/cohenc16.pdf
Cousins, R. D. (2018). Lectures on statistics in theory: Prelude to statistics in practice. https://arxiv.org/abs/1807.05996
Cousins, R. D. & Highland, V. L. (1992). Incorporating systematic uncertainties into an upper limit. Nuclear Instruments and Methods in Physics Research Section A, 320, 331–335. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.193.1581&rep=rep1&type=pdf
Cowan, G. (1998). Statistical Data Analysis. Clarendon Press.
———. (2012). Discovery sensitivity for a counting experiment with background uncertainty. https://www.pp.rhul.ac.uk/~cowan/stat/notes/medsigNote.pdf
———. (2016). Statistics. In C. Patrignani et al. (Particle Data Group), Review of Particle Physics. Chinese Physics C, 40, 100001. http://pdg.lbl.gov/2016/reviews/rpp2016-rev-statistics.pdf
Cowan, G., Cranmer, K., Gross, E., & Vitells, O. (2011). Asymptotic formulae for likelihood-based tests of new physics. European Physical Journal C, 71, 1544. https://arxiv.org/abs/1007.1727
———. (2012). Asymptotic distribution for two-sided tests with lower and upper boundaries on the parameter of interest. https://arxiv.org/abs/1210.6948
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press.
Cramér, H. (1946). A contribution to the theory of statistical estimation. Skandinavisk Aktuarietidskrift, 29, 85–94.
Cranmer, K. (2015). Practical statistics for the LHC. https://arxiv.org/abs/1503.07622
Cranmer, K., Brehmer, J., & Louppe, G. (2019). The frontier of simulation-based inference. https://arxiv.org/abs/1911.01429
Cranmer, K. et al. (2012). HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report: CERN-OPEN-2012-016. http://inspirehep.net/record/1236448/
Cranmer, K., Seljak, U., & Terao, K. (2021). Machine learning. In P. A. Zyla et al. (Particle Data Group), Review of Particle Physics. Progress of Theoretical and Experimental Physics, 2020, 083C01. (and 2021 update). https://pdg.lbl.gov/2021-rev/2021/reviews/contents_sports.html
Cranmer, M. et al. (2020). Discovering symbolic models from deep learning with inductive biases. https://arxiv.org/abs/2006.11287
D’Agnolo, R. T. & Wulzer, A. (2019). Learning New Physics from a Machine. Physical Review D, 99, 015014. https://arxiv.org/abs/1806.02350
Dar, Y., Muthukumar, V., & Baraniuk, R. G. (2021). A farewell to the bias-variance tradeoff? An overview of the theory of overparameterized machine learning. https://arxiv.org/abs/2109.02355
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. https://arxiv.org/abs/1810.04805
Dosovitskiy, A. et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. https://arxiv.org/abs/2010.11929
Edwards, A. W. F. (1974). The history of likelihood. International Statistical Review, 42, 9–15.
Efron, B. & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, evidence, and data science. Cambridge University Press.
Evans, M. (2013). What does the proof of Birnbaum’s theorem prove? https://arxiv.org/abs/1302.5468
Fefferman, C., Mitter, S., & Narayanan, H. (2016). Testing the manifold hypothesis. Journal of the American Mathematical Society, 29, 983–1049. https://www.ams.org/journals/jams/2016-29-04/S0894-0347-2016-00852-4/S0894-0347-2016-00852-4.pdf
Feldman, G. J. & Cousins, R. D. (1998). A unified approach to the classical statistical analysis of small signals. Physical Review D, 57, 3873. https://arxiv.org/abs/physics/9711021
Fienberg, S. E. (2006). When did Bayesian inference become "Bayesian"? Bayesian Analysis, 1, 1–40. https://projecteuclid.org/journals/bayesian-analysis/volume-1/issue-1/When-did-Bayesian-inference-become-Bayesian/10.1214/06-BA101.full
Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955. In Studies in Linguistic Analysis (pp. 1–31). Oxford: Blackwell.
Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Statistical Science, 12, 39–41.
———. (1915). Frequency distribution of the values of the correlation coefficient in samples of indefinitely large population. Biometrika, 10, 507–521.
———. (1921). On the "probable error" of a coefficient of correlation deduced from a small sample. Metron, 1, 1–32.
———. (1935). The Design of Experiments. Hafner.
———. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B, 17, 69–78.
Fréchet, M. (1943). Sur l’extension de certaines évaluations statistiques au cas de petits échantillons. Revue de l’Institut International de Statistique, 11, 182–205.
Fuchs, F. B., Worrall, D. E., Fischer, V., & Welling, M. (2020). SE(3)-Transformers: 3D roto-translation equivariant attention networks. https://arxiv.org/abs/2006.10503
Fukushima, K. & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15, 455–469.
Gandenberger, G. (2015). A new proof of the likelihood principle. British Journal for the Philosophy of Science, 66, 475–503. https://www.journals.uchicago.edu/doi/abs/10.1093/bjps/axt039
———. (2016). Why I am not a likelihoodist. Philosophers’ Imprint, 16, 1–22. https://quod.lib.umich.edu/p/phimp/3521354.0016.007/--why-i-am-not-a-likelihoodist
Gao, Y. & Chaudhari, P. (2020). An information-geometric distance on the space of tasks. https://arxiv.org/abs/2011.00613
Gelman, A. & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180, 967–1033.
Gibson, R. G. (2014). Regret minimization in games and the development of champion multiplayer computer poker-playing agents. University of Alberta. (Ph.D. thesis). https://era.library.ualberta.ca/items/15d28cbf-49d4-42e5-a9c9-fc55b1d816af/view/5ee708c7-6b8b-4b96-b1f5-23cdd95b6a46/Gibson_Richard_Spring-202014.pdf
Goldreich, O. & Ron, D. (1997). On universal learning algorithms. Information Processing Letters, 63, 131–136. https://www.wisdom.weizmann.ac.il/~oded/p_ul.html
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org
Goodman, S. N. (1999a). Toward evidence-based medical statistics 1: The P value fallacy. Annals of Internal Medicine, 130, 995–1004. https://courses.botany.wisc.edu/botany_940/06EvidEvol/papers/goodman1.pdf
———. (1999b). Toward evidence-based medical statistics 2: The Bayes factor. Annals of Internal Medicine, 130, 1005–1013. https://courses.botany.wisc.edu/botany_940/06EvidEvol/papers/goodman2.pdf
Gorard, S. & Gorard, J. (2016). What to do instead of significance testing? Calculating the ’number of counterfactual cases needed to disturb a finding’. International Journal of Social Research Methodology, 19, 481–490.
Haber, E. & Ruthotto, L. (2017). Stable architectures for deep neural networks. https://arxiv.org/abs/1705.03341
Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.
———. (1971). Jacques Bernoulli’s Art of conjecturing. The British Journal for the Philosophy of Science, 22, 209–229.
Halverson, J., Maiti, A., & Stoner, K. (2020). Neural networks and quantum field theory. https://arxiv.org/abs/2008.08601
Hanley, J. A. & Lippman-Hand, A. (1983). If nothing goes wrong, is everything all right?: Interpreting zero numerators. JAMA, 249, 1743–1745.
Hart, S. & Mas‐Colell, A. (2000). A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68, 1127–1150. https://www.ma.imperial.ac.uk/~dturaev/Hart0.pdf
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. https://arxiv.org/abs/1512.03385
Heinrich, J. & Lyons, L. (2007). Systematic errors. Annual Review of Nuclear and Particle Science, 57, 145–169. https://www.annualreviews.org/doi/abs/10.1146/annurev.nucl.57.090506.123052
Heinrich, J. & Silver, D. (2016). Deep reinforcement learning from self-play in imperfect-information games. https://arxiv.org/abs/1603.01121
Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 64, 53–62. https://arxiv.org/abs/1502.02555
Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366. https://cognitivemedium.com/magic_paper/assets/Hornik.pdf
Howard, A.G. et al. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. https://arxiv.org/abs/1704.04861
Howard, J. N., Mandt, S., Whiteson, D., & Yang, Y. (2021). Foundations of a fast, data-driven, machine-learned simulator. https://arxiv.org/abs/2101.08944
Hutchins, J. (2000). Yehoshua Bar-Hillel: A philosopher’s contribution to machine translation.
Ingrosso, A. & Goldt, S. (2022). Data-driven emergence of convolutional structure in neural networks. https://arxiv.org/abs/2202.00565
Ioannidis, J. P. (2005). Why most published research findings are false. PLOS Medicine, 2, 696–701.
Ismailov, V. (2020). A three layer neural network can represent any multivariate function. https://arxiv.org/abs/2012.03016
James, F. (2006). Statistical Methods in Experimental Particle Physics (2nd ed.). World Scientific.
James, F. & Roos, M. (1975). MINUIT: A system for function minimization and analysis of the parameter errors and corrections. Computer Physics Communications, 10, 343–367. https://cds.cern.ch/record/310399
Joyce, T. & Herrmann, J. M. (2017). A review of no free lunch theorems, and their implications for metaheuristic optimisation. In X. S. Yang (Ed.), Nature-Inspired Algorithms and Applied Optimization (pp. 27–52).
Junk, T. (1999). Confidence level computation for combining searches with small statistics. Nuclear Instruments and Methods in Physics Research Section A, 434, 435–443. https://arxiv.org/abs/hep-ex/9902006
Jurafsky, D. & Martin, J. H. (2022). Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/ed3book_jan122022.pdf
Kaplan, J. et al. (2019). Notes on contemporary machine learning for physicists. https://sites.krieger.jhu.edu/jared-kaplan/files/2019/04/ContemporaryMLforPhysicists.pdf
———. (2020). Scaling laws for neural language models. https://arxiv.org/abs/2001.08361
Kardum, M. (2020). Rudolf Carnap–The grandfather of artificial neural networks: The influence of Carnap’s philosophy on Walter Pitts. In S. Skansi (Ed.), Guide To Deep Learning Basics: Logical, Historical And Philosophical Perspectives (pp. 55–66). Springer.
Karniadakis, G.E. et al. (2021). Physics-informed machine learning. Nature Reviews Physics, 3, 422–440. https://doi.org/10.1038/s42254-021-00314-5
Kiani, B., Balestriero, R., LeCun, Y., & Lloyd, S. (2022). projUNN: efficient method for training deep networks with unitary matrices. https://arxiv.org/abs/2203.05483
Korb, K. B. (2001). Machine learning as philosophy of science. In Proceedings of the ECML-PKDD-01 Workshop on Machine Learning as Experimental Philosophy of Science. Freiburg.
Krenn, M. et al. (2022). On scientific understanding with artificial intelligence. https://arxiv.org/abs/2204.01467
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012, 1097–1105. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Kruschke, J. K. & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178–206. https://link.springer.com/article/10.3758/s13423-016-1221-4
Lan, Z. et al. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. https://arxiv.org/abs/1909.11942
Lanctot, M., Waugh, K., Zinkevich, M., & Bowling, M. (2009). Monte Carlo sampling for regret minimization in extensive games. Advances in Neural Information Processing Systems, 22, 1078–1086.
Lauc, D. (2020). Machine learning and the philosophical problems of induction. In S. Skansi (Ed.), Guide To Deep Learning Basics: Logical, Historical And Philosophical Perspectives (pp. 93–106). Springer.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–44.
LeCun, Y. & Bottou, L. (1998). Efficient BackProp. In G. B. Orr & K. R. Muller (Eds.), Neural Networks: Tricks of the trade. Springer. http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324. http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf
LeCun, Y. et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551. http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf
Leemis, L. M. & McQueston, J. T. (2008). Univariate distribution relationships. The American Statistician, 62, 45–53. http://www.stat.rice.edu/~dobelman/courses/texts/leemis.distributions.2008amstat.pdf
Lei, N., Luo, Z., Yau, S., & Gu, D. X. (2018). Geometric understanding of deep learning. https://arxiv.org/abs/1805.10451
Lewis, D. (1981). Causal decision theory. Australasian Journal of Philosophy, 59, 5–30. https://www.andrewmbailey.com/dkl/Causal_Decision_Theory.pdf
Lista, L. (2016a). Practical statistics for particle physicists. https://arxiv.org/abs/1609.04150
———. (2016b). Statistical Methods for Data Analysis in Particle Physics. Springer. http://foswiki.oris.mephi.ru/pub/Main/Literature/st_methods_for_data_analysis_in_particle_ph.pdf
Liu, H., Dai, Z., So, D. R., & Le, Q. V. (2021). Pay attention to MLPs. https://arxiv.org/abs/2105.08050
Liu, Z., Madhavan, V., & Tegmark, M. (2022). AI Poincare 2: Machine learning conservation laws from differential equations. https://arxiv.org/abs/2203.12610
Lu, Z. et al. (2017). The expressive power of neural networks: A view from the width. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/file/32cbf687880eb1674a07bf717761dd3a-Paper.pdf
Lyons, L. (2008). Open statistical issues in particle physics. The Annals of Applied Statistics, 2, 887–915. https://projecteuclid.org/journals/annals-of-applied-statistics/volume-2/issue-3/Open-statistical-issues-in-Particle-Physics/10.1214/08-AOAS163.full
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
Mayo, D. G. (1981). In defense of the Neyman-Pearson theory of confidence intervals. Philosophy of Science, 48, 269–280.
———. (1996). Error and the Growth of Experimental Knowledge. University of Chicago Press.
———. (2014). On the Birnbaum Argument for the Strong Likelihood Principle. Statistical Science, 29, 227–266.
———. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press.
———. (2019). The law of likelihood and error statistics. https://errorstatistics.com/2019/04/04/excursion-1-tour-ii-error-probing-tools-versus-logics-of-evidence-excerpt/
———. (2021). Significance tests: Vitiated or vindicated by the replication crisis in psychology? Review of Philosophy and Psychology, 12, 101–121. https://link.springer.com/article/10.1007/s13164-020-00501-w
Mayo, D. G. & Spanos, A. (2006). Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. British Journal for the Philosophy of Science, 57, 323–357.
———. (2011). Error statistics. In Philosophy of Statistics (pp. 153–198). North-Holland.
McCarthy, J., Minsky, M. L., Rochester, N., & Shannon, C. E. (1955). A proposal for the Dartmouth Summer Research Project on Artificial Intelligence. http://www-formal.stanford.edu/jmc/history/dartmouth.pdf
McDermott, J. (2019). When and why metaheuristics researchers can ignore "no free lunch" theorems. https://arxiv.org/abs/1906.03280
McFadden, D. & Zarembka, P. (1973). Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics (pp. 105–142). New York: Academic Press.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. https://arxiv.org/abs/1301.3781
Mikolov, T. et al. (2013). Distributed representations of words and phrases and their compositionality. https://arxiv.org/abs/1310.4546
Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. NAACL HLT 2013. https://www.aclweb.org/anthology/N13-1090.pdf
Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
Mitchell, T. M. (1980). The need for biases in learning generalizations. In Readings in Machine Learning (pp. 184–192). San Mateo, CA, USA. http://www.cs.cmu.edu/afs/cs/usr/mitchell/ftp/pubs/NeedForBias_1980.pdf
Mnih, V. et al. (2013). Playing Atari with deep reinforcement learning. https://arxiv.org/abs/1312.5602
———. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. http://files.davidqiu.com//research/nature14236.pdf
Moravcik, M. et al. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356, 508–513. https://arxiv.org/abs/1701.01724
Murphy, K. P. (2012). Machine Learning: A probabilistic perspective. MIT Press.
———. (2022). Probabilistic Machine Learning: An introduction. MIT Press.
Nagarajan, V. (2021). Explaining generalization in deep learning: progress and fundamental limits. (Ph.D. thesis). https://arxiv.org/abs/2110.08922
Nakkiran, P. (2021). Turing-universal learners with optimal scaling laws. https://arxiv.org/abs/2111.05321
Nakkiran, P. et al. (2019). Deep double descent: Where bigger models and more data hurt. https://arxiv.org/abs/1912.02292
Neller, T. W. & Lanctot, M. (2013). An introduction to counterfactual regret minimization. Proceedings of Model AI Assignments, 11. http://cs.gettysburg.edu/~tneller/modelai/2013/cfr/cfr.pdf
Neyman, J. (1955). The problem of inductive inference. Communications on Pure and Applied Mathematics, 8, 13–45. https://errorstatistics.files.wordpress.com/2017/04/neyman-1955-the-problem-of-inductive-inference-searchable.pdf
———. (1977). Frequentist probability and frequentist statistics. Synthese, 36, 97–131.
Neyman, J. & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337.
Nielsen, F. (2018). An elementary introduction to information geometry. https://arxiv.org/abs/1808.08271
Nirenburg, S. (1996). Bar Hillel and Machine Translation: Then and Now.
O’Hagan, A. (2010). Kendall’s Advanced Theory of Statistics, Vol 2B: Bayesian Inference. Wiley.
Park, N. & Kim, S. (2022). How do vision transformers work? https://arxiv.org/abs/2202.06709
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146. https://projecteuclid.org/journals/statistics-surveys/volume-3/issue-none/Causal-inference-in-statistics-An-overview/10.1214/09-SS057.pdf
———. (2018). The Book of Why: The new science of cause and effect. Basic Books.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50, 157–175.
Peirce, C. S. (1883). Studies in Logic. Boston: Little, Brown, and Co.
Perone, C. S. (2018). NLP word representations and the Wittgenstein philosophy of language. http://blog.christianperone.com/2018/05/nlp-word-representations-and-the-wittgenstein-philosophy-of-language/
Peters, J., Janzing, D., & Scholkopf, B. (2017). Elements of Causal Inference. MIT Press.
Phuong, M. & Hutter, M. (2022). Formal algorithms for transformers. https://arxiv.org/abs/2207.09238
Radford, A. et al. (2019). Language models are unsupervised multitask learners. (Paper on the GPT-2 model by OpenAI). https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. (Paper on the GPT model by OpenAI). https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2017a). Physics informed deep learning (Part I): Data-driven solutions of nonlinear partial differential equations. https://arxiv.org/abs/1711.10561
———. (2017b). Physics informed deep learning (Part II): Data-driven discovery of nonlinear partial differential equations. https://arxiv.org/abs/1711.10566
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91.
———. (1947). Minimum variance and the estimation of several parameters. Mathematical Proceedings of the Cambridge Philosophical Society, 43, 280–283.
Rathmanner, S. & Hutter, M. (2011). A philosophical treatise of universal induction. Entropy, 13, 1076–1136. https://www.mdpi.com/1099-4300/13/6/1076/pdf
Read, A. L. (2002). Presentation of search results: the CLs technique. Journal of Physics G: Nuclear and Particle Physics, 28, 2693. https://indico.cern.ch/event/398949/attachments/799330/1095613/The_CLs_Technique.pdf
Reid, C. (1998). Neyman. Springer-Verlag.
Rice, J. A. (2007). Mathematical Statistics and Data Analysis (3rd ed.). Thomson.
Roberts, D. A. (2021). Why is AI hard and physics simple? https://arxiv.org/abs/2104.00008
Roberts, D. A., Yaida, S., & Hanin, B. (2021). The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge University Press. https://deeplearningtheory.com/PDLT.pdf
Robins, J. M. & Wasserman, L. (1999). On the impossibility of inferring causation from association without background knowledge. In C. Glymour & G. Cooper (Eds.), Computation, Causation, and Discovery (pp. 305–321). AAAI & MIT Press.
Ronen, M., Finder, S. E., & Freifeld, O. (2022). DeepDPM: Deep clustering with an unknown number of clusters. https://arxiv.org/abs/2203.14309
Royall, R. (1997). Statistical Evidence: A likelihood paradigm. CRC Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Salsburg, D. (2001). The Lady Tasting Tea. Holt.
Savage, L. J. (1954). The Foundations of Statistics. John Wiley & Sons.
Shalev-Shwartz, S. & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. https://www.cs.huji.ac.il/w~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf
Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.
———. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. https://arxiv.org/abs/1712.01815
———. (2017b). Mastering the game of Go without human knowledge. Nature, 550, 354–359.
Simonyan, K. & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. https://arxiv.org/abs/1409.1556
Sinervo, P. (2002). Signal significance in particle physics. In M. R. Whalley & L. Lyons (Eds.), Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics. Durham, UK: Institute of Particle Physics Phenomenology. https://arxiv.org/abs/hep-ex/0208005v1
———. (2003). Definition and treatment of systematic uncertainties in high energy physics and astrophysics. In L. Lyons, R. Mount, & R. Reitmeyer (Eds.), Proceedings of the Conference on Statistical Problems in Particle Physics, Astrophysics, and Cosmology (PhyStat2003) (pp. 122–129). Stanford Linear Accelerator Center. https://www.slac.stanford.edu/econf/C030908/papers/TUAT004.pdf
Skelac, I. & Jandric, A. (2020). Meaning as use: From Wittgenstein to Google’s Word2vec. In S. Skansi (Ed.), Guide To Deep Learning Basics: Logical, Historical And Philosophical Perspectives (pp. 41–53). Springer.
Slonim, N., Atwal, G. S., Tkacik, G., & Bialek, W. (2005). Information-based clustering. Proceedings of the National Academy of Sciences, 102, 18297–18302. https://arxiv.org/abs/q-bio/0511043
Smith, L. (2019). A gentle introduction to information geometry. September 27, 2019. http://www.robots.ox.ac.uk/~lsgs/posts/2019-09-27-info-geom.html
Solomonoff, G. (2016). Ray Solomonoff and the Dartmouth Summer Research Project in Artificial Intelligence, 1956. http://raysolomonoff.com/dartmouth/dartray.pdf
Spears, B.K. et al. (2018). Deep learning: A guide for practitioners in the physical sciences. Physics of Plasmas, 25, 080901.
Stahlberg, F. (2019). Neural machine translation: A review. https://arxiv.org/abs/1912.02047
Steinhardt, J. (2012). Beyond Bayesians and frequentists. https://jsteinhardt.stat.berkeley.edu/files/stats-essay.pdf
———. (2022). More is different for AI. https://bounded-regret.ghost.io/more-is-different-for-ai/
Stuart, A., Ord, K., & Arnold, S. (2010). Kendall’s Advanced Theory of Statistics, Vol 2A: Classical Inference and the Linear Model. Wiley.
Sutskever, I. (2015). A brief overview of deep learning. https://web.archive.org/web/20220728224752/http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 2014, 3104–3112. https://arxiv.org/abs/1409.3215
Sutton, R. S. (2019). The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning (2nd ed.). MIT Press.
Sznajder, M. (2018). Inductive logic as explication: The evolution of Carnap’s notion of logical probability. The Monist, 101, 417–440.
Tan, M. & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. https://arxiv.org/abs/1905.11946
———. (2021). EfficientNetV2: Smaller models and faster training. https://arxiv.org/abs/2104.00298
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2022). Efficient transformers: A survey. https://arxiv.org/abs/2009.06732
Tegmark, M., Taylor, A. N., & Heavens, A. F. (1997). Karhunen-Loeve eigenvalue problems in cosmology: How should we tackle large data sets? The Astrophysical Journal, 480, 22–35. https://arxiv.org/abs/astro-ph/9603021
Thuerey, N. et al. (2021). Physics-based deep learning. https://arxiv.org/abs/2109.05237
Tukey, J. W. (1977). Exploratory Data Analysis. Pearson.
Udrescu, S. & Tegmark, M. (2020). Symbolic pregression: Discovering physical laws from raw distorted video. https://arxiv.org/abs/2005.11212
van Handel, R. (2016). Probability in high dimensions. Lecture notes at Princeton. https://web.math.princeton.edu/~rvan/APC550.pdf
Vapnik, V., Levin, E., & LeCun, Y. (1994). Measuring the VC-dimension of a learning machine. Neural Computation, 6, 851–876.
Vaswani, A. et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 2017, 5998–6008. https://arxiv.org/abs/1706.03762
Venn, J. (1888). The Logic of Chance. London: MacMillan and Co. (Originally published in 1866).
Vershynin, R. (2018). High-Dimensional Probability: An introduction with applications in data science. Cambridge University Press. https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.pdf
Wainer, H. (2007). The most dangerous equation. American Scientist, 95, 249–256. https://sites.stat.washington.edu/people/peter/498.Sp16/Equation.pdf
Wakefield, J. (2013). Bayesian and Frequentist Regression Methods. Springer.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482. https://www.ams.org/journals/tran/1943-054-03/S0002-9947-1943-0012401-3/S0002-9947-1943-0012401-3.pdf
Wasserman, L. (2003). All of Statistics: A Concise Course in Statistical Inference. Springer.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond "p<0.05". The American Statistician, 73, 1–19.
Wasserstein, R. L. & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133.
Watson, D. & Floridi, L. (2019). The explanation game: A formal framework for interpretable machine learning. SSRN, 3509737. https://ssrn.com/abstract=3509737
Weisberg, J. (2019). Odds & Ends: Introducing Probability & Decision with a Visual Emphasis. https://jonathanweisberg.org/vip/
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78, 1550–1560. http://www.werbos.com/Neural/BTT.pdf
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9, 60–62. https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-9/issue-1/The-Large-Sample-Distribution-of-the-Likelihood-Ratio-for-Testing/10.1214/aoms/1177732360.full
Williamson, J. (2009). The philosophy of science and its relation to machine learning. In Scientific Data Mining and Knowledge Discovery (pp. 77–89). Springer, Berlin, Heidelberg.
Wittgenstein, L. (2009). Philosophical Investigations. (E. Anscombe & P. Hacker, Trans., P. Hacker & J. Schulte, Eds.) (4th ed.). Wiley-Blackwell. (Originally published in 1953).
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8, 1341–1390.
———. (2007). Physical limits of inference. https://arxiv.org/abs/0708.1362
Wolpert, D. H. & Kinney, D. (2020). Noisy deductive reasoning: How humans construct math, and how math constructs universes. https://arxiv.org/abs/2012.08298
Wolpert, D. H. & Macready, W. G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute.
———. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67–82.
Wu, Y. et al. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. https://arxiv.org/abs/1609.08144
Yang, Z. et al. (2019). XLNet: Generalized autoregressive pretraining for language understanding. https://arxiv.org/abs/1906.08237
Zaheer, M. et al. (2020). Big Bird: Transformers for longer sequences. https://arxiv.org/abs/2007.14062
Zech, G. (1995). Comparing statistical data to Monte Carlo simulation: Parameter fitting and unfolding. (DESY-95-113). Deutsches Elektronen-Synchrotron (DESY). https://cds.cern.ch/record/284321
Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2007). Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20, 1729–1736.

1. Edwards (1974), p. 9.↩︎

2. Hacking (1971).↩︎

3. Bernoulli, J. (1713). Ars Conjectandi, Chapter II, Part IV, defining the art of conjecture [wikiquote].↩︎

4. Venn (1888).↩︎

5. Peirce (1883), p. 126–181.↩︎

6. Pearson (1900).↩︎

7. Fisher (1912).↩︎

8. Fisher (1915).↩︎

9. Fisher (1921).↩︎

10. Fisher (1955).↩︎

11. Salsburg (2001).↩︎

12. Reid (1998).↩︎

13. Neyman (1955).↩︎

14. Stuart, Ord, & Arnold (2010).↩︎

15. James (2006).↩︎

16. Cowan (1998) and Cowan (2016).↩︎

17. Cranmer (2015).↩︎

18. Lista (2016b).↩︎

19. Lista (2016a).↩︎

20. Cox (2006).↩︎

21. Cousins (2018).↩︎

22. Weisberg (2019).↩︎

23. Carnap (1947).↩︎

24. Goodfellow, Bengio, & Courville (2016), p. 72–73.↩︎

25. Cowan (1998), p. 20–22.↩︎

26. Fienberg (2006).↩︎

27. Weisberg (2019), ch. 15.↩︎

28. Fisher (1921), p. 15.↩︎

29. van Handel (2016).↩︎

30. Vershynin (2018).↩︎

31. Leemis & McQueston (2008).↩︎

32. Cranmer, K. et al. (2012).↩︎

33. The assumption that the model describes the data “reasonably” well means that, to the degree required by your analysis, the important features of the data agree with the model within the systematic uncertainties parametrized in it. If the model is incomplete because it is missing an important feature of the data, this is the “ugly” (class-3) error in the Sinervo classification of systematic uncertainties.↩︎

34. Cowan (1998) and Cowan (2016), p. TODO.↩︎

35. Aldrich (1997).↩︎

36. James (2006), p. 234.↩︎

37. Cox (2006), p. 11.↩︎

38. Murphy (2012), p. 222.↩︎

39. Fréchet (1943), Cramér (1946), Rao (1945), and Rao (1947).↩︎

40. Rice (2007), p. 300–2.↩︎

41. Cowan (1998), p. 130–5.↩︎

42. James (2006), p. 234.↩︎

43. James & Roos (1975).↩︎

44. Cowan, Cranmer, Gross, & Vitells (2012).↩︎

45. Wainer (2007).↩︎

46. Tegmark, Taylor, & Heavens (1997).↩︎

47. Clopper & Pearson (1934).↩︎

48. Hanley & Lippman-Hand (1983).↩︎

49. L. D. Brown, Cai, & DasGupta (2001).↩︎

50. Fisher (1935), p. 16.↩︎

51. Goodman (1999a), p. 998.↩︎

52. ATLAS and CMS Collaborations (2011).↩︎

53. Cowan, Cranmer, Gross, & Vitells (2011).↩︎

54. Neyman & Pearson (1933).↩︎

55. Feldman & Cousins (1998).↩︎

56. Sinervo (2002) and Cowan (2012).↩︎

57. Cowan et al. (2011), p. 2–3.↩︎

58. Cowan et al. (2011), p. 3.↩︎

59. Cousins & Highland (1992).↩︎

60. Junk (1999).↩︎

62. ATLAS Statistics Forum (2011).↩︎

63. Wilks (1938).↩︎

64. Wald (1943).↩︎

65. Cowan et al. (2011).↩︎

66. Bhattiprolu, Martin, & Wells (2020).↩︎

67. Murphy (2012), p. 197.↩︎

68. Goodman (1999b).↩︎

69. Goodman (1999a), p. 995.↩︎

70. Sinervo (2003).↩︎

71. Heinrich & Lyons (2007).↩︎

72. Caldeira & Nord (2020).↩︎

73. Lyons (2008), p. 890.↩︎

74. Pearl (2018).↩︎

75. Lewis (1981).↩︎

76. Pearl (2009).↩︎

77. Robins & Wasserman (1999).↩︎

78. Peters, Janzing, & Scholkopf (2017).↩︎

79. Tukey (1977).↩︎

80. Carnap (1945).↩︎

81. Royall (1997), p. 171–2.↩︎

82. Cranmer (2015), p. 6.↩︎

83. Neyman & Pearson (1933).↩︎

84. Kruschke & Liddell (2018).↩︎

85. Edwards (1974).↩︎

86. Birnbaum (1962).↩︎

87. Hacking (1965).↩︎

88. Berger & Wolpert (1988).↩︎

89. O’Hagan (2010), p. 17–18.↩︎

90. Gandenberger (2015).↩︎

91. Evans (2013).↩︎

92. Mayo (2014).↩︎

93. Mayo (2019).↩︎

94. Mayo (2019).↩︎

95. Lyons (2008), p. 891.↩︎

96. Sznajder (2018).↩︎

97. Hacking (1965).↩︎

98. Neyman (1977).↩︎

99. Zech (1995).↩︎

100. Royall (1997).↩︎

101. Berger (2003).↩︎

102. Mayo (1981).↩︎

103. Mayo (1996).↩︎

104. Mayo & Spanos (2006).↩︎

105. Mayo & Spanos (2011).↩︎

106. Mayo (2018).↩︎

107. Gelman & Hennig (2017).↩︎

108. Murphy (2012), ch. 6.6.↩︎

109. Murphy (2022), p. 195–198.↩︎

110. Gandenberger (2016).↩︎

111. Wakefield (2013), ch. 4.↩︎

112. Efron & Hastie (2016), p. 30–36.↩︎

113. Kruschke & Liddell (2018).↩︎

114. Steinhardt (2012).↩︎

115. Goodman (1999a), p. 999.↩︎

116. Ioannidis (2005).↩︎

117. Wasserstein & Lazar (2016).↩︎

118. Wasserstein, Schirm, & Lazar (2019).↩︎

119. Benjamin, D.J. et al. (2017).↩︎

120. Fisher (1935), p. 13–14.↩︎

121. Mayo (2021).↩︎

122. Gorard & Gorard (2016).↩︎

123. Benjamini, Y. et al. (2021), p. 1.↩︎

124. Hastie, Tibshirani, & Friedman (2009).↩︎

125. MacKay (2003).↩︎

126. Murphy (2012).↩︎

127. Murphy (2022), p. 195–198.↩︎

128. Shalev-Shwartz & Ben-David (2014).↩︎

129. Vapnik, Levin, & LeCun (1994).↩︎

130. Shalev-Shwartz & Ben-David (2014), p. 67–82.↩︎

131. McCarthy, Minsky, Rochester, & Shannon (1955).↩︎

132. Solomonoff (2016).↩︎

133. Kardum (2020).↩︎

134. Murphy (2012), p. 21.↩︎

135. Note: Label smoothing is a regularization technique that smears the activation over other labels, but we don’t do that here.↩︎

136. “Logit” was coined by Joseph Berkson (1899-1982).↩︎

137. McFadden & Zarembka (1973).↩︎

138. Blondel, Martins, & Niculae (2020).↩︎

139. Goodfellow et al. (2016), p. 129.↩︎

140. T. Chen & Guestrin (2016).↩︎

141. Slonim, Atwal, Tkacik, & Bialek (2005).↩︎

142. Batson, Haaf, Kahn, & Roberts (2021).↩︎

143. Hennig (2015).↩︎

144. Lauc (2020), p. 103–4.↩︎

145. Ronen, Finder, & Freifeld (2022).↩︎

146. Bengio (2009).↩︎

147. LeCun, Bengio, & Hinton (2015).↩︎

148. Sutskever (2015).↩︎

149. Goodfellow et al. (2016).↩︎

150. Kaplan, J. et al. (2019).↩︎

151. Rumelhart, Hinton, & Williams (1986).↩︎

152. LeCun & Bottou (1998).↩︎

153. Bottou (1998).↩︎

154. Sutton (2019).↩︎

155. Watson & Floridi (2019).↩︎

156. Bengio (2009).↩︎

157. Belkin, Hsu, Ma, & Mandal (2019).↩︎

158. Nakkiran, P. et al. (2019).↩︎

159. Dar, Muthukumar, & Baraniuk (2021).↩︎

160. Balestriero, Pesenti, & LeCun (2021).↩︎

161. Nagarajan (2021).↩︎

162. Bubeck & Sellke (2021).↩︎

163. Bach (2022), p. 225–230.↩︎

164. Steinhardt (2022).↩︎

165. Mishra, D. (2020). Weight Decay == L2 Regularization?↩︎

166. S. Chen, Dobriban, & Lee (2020).↩︎

167. Chiley, V. et al. (2019).↩︎

168. Kiani, Balestriero, Lecun, & Lloyd (2022).↩︎

169. Fukushima & Miyake (1982).↩︎

170. LeCun, Y. et al. (1989).↩︎

171. LeCun, Bottou, Bengio, & Haffner (1998).↩︎

172. Ciresan, Meier, Masci, & Schmidhuber (2012).↩︎

173. Krizhevsky, Sutskever, & Hinton (2012).↩︎

174. Simonyan & Zisserman (2014).↩︎

175. He, Zhang, Ren, & Sun (2015).↩︎

176. Haber & Ruthotto (2017).↩︎

177. Howard, A.G. et al. (2017).↩︎

178. R. T. Q. Chen, Rubanova, Bettencourt, & Duvenaud (2018).↩︎

179. Tan & Le (2019).↩︎

180. Dosovitskiy, A. et al. (2020).↩︎

181. Tan & Le (2021).↩︎

182. H. Liu, Dai, So, & Le (2021).↩︎

183. Ingrosso & Goldt (2022).↩︎

184. Park & Kim (2022).↩︎

185. Firth (1957).↩︎

186. Nirenburg (1996).↩︎

187. Hutchins (2000).↩︎

188. Mikolov, Chen, Corrado, & Dean (2013), Mikolov, Yih, & Zweig (2013), and Mikolov, T. et al. (2013).↩︎

189. Hochreiter & Schmidhuber (1997).↩︎

190. Werbos (1990).↩︎

191. Sutskever, Vinyals, & Le (2014).↩︎

192. Bahdanau, Cho, & Bengio (2015).↩︎

193. Wu, Y. et al. (2016).↩︎

194. Stahlberg (2019).↩︎

195. Church & Hestness (2019).↩︎

196. Kaplan, J. et al. (2020).↩︎

197. Vaswani, A. et al. (2017).↩︎

198. Devlin, Chang, Lee, & Toutanova (2018).↩︎

199. Lan, Z. et al. (2019).↩︎

200. Radford, Narasimhan, Salimans, & Sutskever (2018).↩︎

201. Radford, A. et al. (2019).↩︎

202. Brown, T.B. et al. (2020).↩︎

203. Yang, Z. et al. (2019).↩︎

204. Zaheer, M. et al. (2020).↩︎

205. Tay, Dehghani, Bahri, & Metzler (2022).↩︎

206. Phuong & Hutter (2022).↩︎

207. Jurafsky & Martin (2022).↩︎

208. Sutton & Barto (2018).↩︎

209. Arulkumaran, Deisenroth, Brundage, & Bharath (2017).↩︎

210. Cesa-Bianchi & Lugosi (2006).↩︎

211. Bellman (1952).↩︎

212. Mnih, V. et al. (2013) and Mnih, V. et al. (2015).↩︎

213. Silver, D. et al. (2016).↩︎

214. Silver, D. et al. (2017b).↩︎

215. Silver, D. et al. (2017a).↩︎

216. Hart & Mas‐Colell (2000).↩︎

217. Zinkevich, Johanson, Bowling, & Piccione (2007).↩︎

218. Lanctot, Waugh, Zinkevich, & Bowling (2009).↩︎

219. Neller & Lanctot (2013).↩︎

220. Gibson (2014).↩︎

221. Burch (2018).↩︎

222. Bowling, Burch, Johanson, & Tammelin (2015).↩︎

223. Heinrich & Silver (2016).↩︎

224. Moravcik, M. et al. (2017).↩︎

225. N. Brown & Sandholm (2018).↩︎

226. N. Brown & Sandholm (2019a).↩︎

227. N. Brown, Lerer, Gross, & Sandholm (2019).↩︎

228. N. Brown & Sandholm (2019b).↩︎

229. N. Brown, Bakhtin, Lerer, & Gong (2020).↩︎

230. N. Brown (2020).↩︎

231. Spears, B.K. et al. (2018).↩︎

232. Cranmer, Seljak, & Terao (2021).↩︎

233. Rathmanner & Hutter (2011).↩︎

234. Wolpert & Macready (1995).↩︎

235. Wolpert (1996).↩︎

236. Wolpert & Macready (1997).↩︎

237. Shalev-Shwartz & Ben-David (2014), p. 60–66.↩︎

238. McDermott (2019).↩︎

239. Wolpert (2007).↩︎

240. Wolpert & Kinney (2020).↩︎

241. Mitchell (1980).↩︎

242. Roberts (2021).↩︎

243. Goldreich & Ron (1997).↩︎

244. Joyce & Herrmann (2017).↩︎

245. Lauc (2020).↩︎

246. Nakkiran (2021).↩︎

247. Bousquet, O. et al. (2021).↩︎

248. Raissi, Perdikaris, & Karniadakis (2017a), p. 2.↩︎

249. Roberts (2021), p. 7.↩︎

250. Minsky & Papert (1969).↩︎

251. Hornik, Stinchcombe, & White (1989).↩︎

252. Lu, Z. et al. (2017).↩︎

253. Ismailov (2020).↩︎

254. Bishop (2006), p. 230.↩︎

255. Bahri, Y. et al. (2020).↩︎

256. Halverson, Maiti, & Stoner (2020).↩︎

257. Canatar, Bordelon, & Pehlevan (2020).↩︎

258. Roberts, Yaida, & Hanin (2021).↩︎

259. Cohen & Welling (2016).↩︎

260. Cohen, Weiler, Kicanaoglu, & Welling (2019).↩︎

261. Fuchs, Worrall, Fischer, & Welling (2020).↩︎

262. Smith (2019).↩︎

263. Nielsen (2018).↩︎

264. Amari (2016).↩︎

265. Balasubramanian (1996a).↩︎

266. Balasubramanian (1996b).↩︎

267. Calin & Udriste (2014).↩︎

268. Lei, Luo, Yau, & Gu (2018).↩︎

269. Gao & Chaudhari (2020).↩︎

270. Bronstein, Bruna, Cohen, & Velickovic (2021).↩︎

271. Fefferman, Mitter, & Narayanan (2016).↩︎

272. Raissi et al. (2017a) and Raissi, Perdikaris, & Karniadakis (2017b).↩︎

273. Karniadakis, G.E. et al. (2021).↩︎

274. Howard, Mandt, Whiteson, & Yang (2021).↩︎

275. Thuerey, N. et al. (2021).↩︎

276. Cranmer, Brehmer, & Louppe (2019).↩︎

277. Baydin, A.G. et al. (2019).↩︎

278. Anderson (2008).↩︎

279. Asch, M. et al. (2018).↩︎

280. D’Agnolo & Wulzer (2019).↩︎

281. Udrescu & Tegmark (2020).↩︎

282. Cranmer, M. et al. (2020).↩︎

283. Z. Liu, Madhavan, & Tegmark (2022).↩︎

284. Krenn, M. et al. (2022).↩︎

285. Asch, M. et al. (2018).↩︎

286. Korb (2001).↩︎

287. Williamson (2009).↩︎

288. Bensusan (2000).↩︎

289. Perone (2018).↩︎

290. Skelac & Jandric (2020).↩︎

291. Wittgenstein (2009), §43.↩︎

292. Wittgenstein (2009), §340.↩︎

293. Wasserman (2003).↩︎

294. Savage (1954).↩︎