NB Page under construction - Very preliminary notes
Will rewrite soon in context of study of Information geometry, maximum entropy, etc.
Considerable amount of common ground.
One first question must be what are the goals of the two traditions. Now, it is clear that not even people working in a single tradition need share the same goals. For example, you might be at the more practical or more theoretical end of the spectrum. Bayesianism concerns an especially broad range of activity. This having been said, there is sufficient overlap that there is much for them to talk to each other about. One specific achievement hailed by one party, may be dismissed as of little value by the other, but if developments in one tradition emerging out a series of theoretical advances led to them sweeping the floor in all sorts of machine learning competitions, members of the other tradition would surely take note. And if these theoretical advances were not accessible in the language of the other tradition, and perhaps the first tradition could show how the second could not catch up, we might imagine a few people would change horses.
How principled should work within a tradition be?
...enquiry can only be systematic in its progress when its goal is to contribute to the construction of a system of thought and practice - including in the notion of construction such activities as those of more or less radical modification, and even partial demolition with a view to reconstruction - by participating in types of rational activity which have their telos in achieving for that system a perfected form in the light of the best standards for judging of that perfection so far to emerge. Particular problems are then partially, but in key ways, defined in terms of the constraints imposed by their place within the overall structure, and the significance of solving this or that particular problem derives from that place. It is one of the marks of the flourishing of such a developing system of thought and practice that from time to time its telos is reformulated. And those who contribute to the perfecting of such a system can do so only by developing and deploying their own skills in the manner characteristic of a craft and by participating in the activities in which those skills are put to work in the manner characteristic of a tradition. It is then only in contexts in which enquiry is understood by those participating in it in ways characteristic of system, craft, and tradition, whether or not that understanding is made explicit, that the concept of progress in enquiry will have application. (MacIntyre, After Virtue: 148-9)
"To understand the nature of the problem, its spirit, and its philosophy, one has to see the theory as a whole, not only as a collection of its different parts. Understanding the nature of the problem is extremely important because it leads to searching in the right directions for results and prevents searching in wrong directions." (Vapnik 1999: xi) [Cf. Bernhard's admission that SLT is not as unified as Bayesianism.]
A hundred parallel comments by Jaynes.
"Now a new methodological situation in the learning problem has developed where practical methods are the result of a deep theoretical analysis of the statistical bounds rather than the result of inventing new smart heuristics." (Vapnik 1999: p. 15 n 9)
Jaynes also criticises ad hocness of tricks.
Vapnik (199: 174 n. 19) Feynman remark on science, ad hoc ways of neural networks practitioners doesn't mean they're not good, but that the theory is not science.
Olivier Chapelle points out that one might designate a third group beyond Vapnikians and Bayesians: "99% of machine learning people who are not Bayesian, do their model selection by cross-validation and don't use bounds... So, maybe in your discussion, you should add a third camp (which is, in my opinion, by far the largest): non-Bayesian and non theoretician people, who are just very pragmatic and try to have the lowest possible test error. 15 years ago, those people were typically using neural networks. However, it is true that Vapnik had probably more impact on this group, in the sense that he formalized some of the heuristics they were using (Vapnik was actually in the group of Yann LeCun, a big shot in the NN community)."
Might one respond: Believing that a research tradition should be principled doesn't need to interfere with a certain pragmatism, but someone has to do the principle spadework. The major ideas that transform a field generally arise from principled thinking.
So a concern: does a MacIntyrean analysis unfairly rule out treating this third group as a tradition of enquiry?
Both Vapnik and Jaynes have provided historical accounts (perhaps could be improved) and lists of their own strengths, and the weaknesses of the other. In Jaynes' case the target is more Fisherian statistics, but some criticisms still apply. Other Bayesians (Bishop, Neal, Tipping,...) have criticisms targeted at statistical learning theory.
We need a better structured debate here.
As Aquinas wrote in Summa Theologica, his reconciliation of Aristotelianism with Augustianism:
Because the doctor of Catholic truth ought not only to teach the proficient, but also to instruct beginners (according to the Apostle: As unto little ones in Christ, I gave you milk to drink, not meat -- 1 Cor. 3:1-2), we purpose in this book to treat of whatever belongs to the Christian religion, in such a way as may tend to the instruction of beginners. We have considered that students in this doctrine have not seldom been hampered by what they have found written by other authors, partly on account of the multiplication of useless questions, articles, and arguments, partly also because those things that are needful for them to know are not taught according to the order of the subject matter, but according as the plan of the book might require, or the occasion of the argument offer, partly, too, because frequent repetition brought weariness and confusion to the minds of readers.
Endeavouring to avoid these and other like faults, we shall try, by God's help, to set forth whatever is included in this sacred doctrine as briefly and clearly as the matter itself may allow.
In the correct order for a beginner there are theses in the form of questions, the best arguments to date are put forward, then the best counterarguments, followed by Thomas' argument. Note the contrast with Spinoza's Ethics set out like Euclid's Elements as a finished product. The latter form of presentation is properly suited to the final product of enquiry.
Control and generalisability
Vapnik's 4 parts of learning theory
I) What are the (necessary and sufficient) conditions for consistency of a learning process based on the ERM principle?
II) How fast is the rate of convergence of the learning process?
III) How can one control the rate of convergence (the generalisation ability) of the learning process?
IV) How can one construct algorithms that can control the generalisation ability?
Vapnik: Bayesians use strong a priori information that the given set of functions of a machine coincide with the problems to be solved + an a priori distribution of P(α). SRM uses weak a priori information about reality: a structure on the admissible set of functions, ordered according to usefulness.
But, he says he is told, "The Bayesian approach also works in general situations". That it does work sometimes in situations where the functions of the machine don't coincide with those to be learned, is due to human choice of the regulariser ln P(α), put in by hand. SRM is a pure machine method for solving problems. "For any l [number of elements in the training set] they use the same structure on the set of admissible functions and the same formal mechanisms for capacity control."
Bayesians can reconstruct SRM, how natural is this? Why choose that prior over spaces of functions. See Herbrich "Learning Kernel Classifiers" pp.28-29.
p. 701 of Statistical Learning Theory "..for an uncountably infinite set of admissible functions the method of maximum a posteriori probability is not necessarily consistent. To construct a consistent method, one should modify the MAP method in the spirit of SRM in order to control capacity. Must know:
Given set of functions of learning machine must coincide with a set of target functions, otherwise Bayes' formula makes no sense. SRM is uniformly consistent because of capacity control.
Zero prior probability of any function from admissible set (Popper's point).
Maximum likelihood consistent only on finite set of admissible functions, Maximum a posteriori on countably many functions, SRM works for uncountably many. [Curious that SRM uses language of ML.]
Bayesian objection: Vapnikian bounds are usually too loose. What's so great about bounds which are algorithm and training set independent?
Meaning of convergence
Don't both sides use the notion of capacity? Isn't this an information theoretic term? Such as "What is the capacity of a neural network"?
Bayesians not so interested in these general convergence results. However, see chap. 5 of Herbrich "Learning Kernel Classifiers" for PAC-Bayesian, Luckiness frameworks. How does the Bayesian cope when they sense that they're not using all their background knowledge? How robust are their priors? Is there much more here than its debate with classical statistics?
Neal makes the point that:
P(|testing error rate − training error rate| > ε) < δ
should not be taken to mean things like
P(|testing error rate − 0.02| > ε | training rate = 0.02) < δ.
Usual problems with confidence intervals? "This suboptimality of sampling theory, achieved with great effort, is why I am passionate about Bayesian methods. Bayesian methods are straightforward, and they optimally use all the information in the data. Sampling theorists concentrate on having methods guaranteed to work most of the time, given minimal assumptions. Bayesians try to make inferences that take into account all available information and answer the question of interest given the particular data set." Mackay p. 457
Bayesianism encourages laziness
"while I believe that most well-posed inference problems have a unique correct answer, which can be found by Bayesian methods, not all problems are well-posed. A common question arising in data modelling is 'am I using an appropriate model?' Model criticism, that is, hunting for defects in a current model, is a task that may be aided by sampling theory tests, in which the null hypothesis ('the current model is correct') is well defined, but the alternative model is not specified. One could use sampling theory measures such as p-values to guide one's search for the aspects of the model most in need of scrutiny." (Mackay: 466)
Bayesians aren't lazy. Jaynes Chap. 17 points out the orthodox tend to overlook things on grounds that their not surprising on the basis of the null hypothesis is not very surprising, and that, fortunately, physicists don't behave like this.
Any system can make you stop thinking. Constantly practice the virtues of enquiry.
Examples of where SLT is incoherent from Bayesian point of view
There are useful averaging methods for P(α | data) which don't rely on Bayes theorem. (AdaBoost, Bagging averages). (Vapnik)
SVM looks for largest inscribed circle in version space, but if region were like a slowly narrowing rectangle this would be very far from the centre of mass. See Figs 2.8 and 3.7 of Herbrich.
Bounds for all classifiers vs bounds for specific algorithms, also bounds given training data. Does Bayesian see this as a place where information is lost?
Classification vs probability
Wouldn't it be nice to know how likely the machine thinks it's got it right? Then it could give small number of doubtful cases to human, like failures of digit recognition in S & S. Is the only way here for SLT via distance from optimal hyperplane?
"Predictions are not probabilistic. In regression the SVM outputs a point estimate, and in classification, a 'hard' binary decision. For may real world applications, as distinct from bench-marking, we require the conditional distribution p(t | x) of targets given inputs rather than just a point prediction. Such a distribution expresses our uncertainty in the prediction and offers numerous advantages (Bishop 1995) such as optimal rejection, flexible and optimal decision making, fusion of outputs with other sources of probabilistic information, and so on.
Choice of prior
Difficulty in extracting what you know. Everyone 'knows' that the distribution of the first digits of the area of the countries of the world in sq. km. is approximately Pr(X = n) = log(1 + 1/n), they just don't know they know. (Benford's Law).
Know they know but can't express it.
Can express it but can't work with it analytically. Temptation to resort to conjugate priors. Pseudo-Bayesianism: 'Using "technological" or "reference" priors chosen solely for convenience, or out of a mis-guided desire for objectivity' (Neal)
Can use MCMC, but this isn't Bayesian (Mackay). Use Bayesian MC?
Recognising that there are open problems for your tradition is not fatal. There are endless opportunities for theoretical advancement. For instance, regarding the last point why not try Carl's Bayesian MC.
Dynamism of program
Bayesians would never have arrived at Kernel methods. (BS)
"Gaussian processes and support vector learning machines have a lot in common. Both are kernel-based predictors, the kernel being another name for the covariance function." (Mackay: 548)
Of which Bayesian ideas could it be claimed that SLT would never have thought of them?
Vapnik: SLT suggested the search for a structured set of functions described by large numbers of parameters but with low VC dimension. Found only one (as of 1999), and this gave SVMs.
Transduction, selection,... as signs of a dynamic research programme?
Battle as played out locally: SVM vs Bayesian counterparts
"Nevertheless, the twin properties of accuracy and sparsity make the SVM a very attractive model. Bayesian approach naturally deals with over-fitting. Now generates sparser models overcoming above objections." (Tipping & Bishop on RVMs)
"The advantages of SVR are: a global minimum solution as the minimization of a convex programming problem; relatively fast training speed; and sparseness in solution representation. However, the performance of SVR crucially depends on the shape of the kernel function and other hyperparameters that represent the characteristics of the noise distribution in the training data. Re-sampling approaches, such as cross-validation, are commonly used in practice to decide values of these hyperparameters, but such approaches are very expensive when a large number of parameters are involved. Typically, Bayesian methods are regarded as suitable tools to determine the values of these hyperparameters." (Chu Wei's Thesis, p. 16) See following pages for translation between two languages.
Tipping&Bishop: It is necessary to estimate the error/margin trade-off parameter 'C' (and in regression, the insensitivity parameter 'epsilon' too). This generally entails a cross-validation procedure, which is wasteful both of data and computation.
Sollich: we can understand SVM in terms of GPs and MAP, helps choose good kernel and C. However, it leads to weird generative model.
"SVMs make unnecessarily liberal use of basis functions since the number of support vectors required typically grows linearly with the size of the training set." (Tipping & Bishop)
"The kernel function K(x,x_i) must satisfy Mercer's condition." (Tipping & Bishop)
SVMs generally faster than GPs.
Classification with more than two classes easily handled by Bayesian GP. Predicitions are probabilistic, and hyperparameters in covariance function easily inferred using, e.g., Automatic Relevance Determination.
Fruitful interplay?: "Our intention is to integrate support vector machines with Gaussian processes tightly, while preserving their individual advantages as much as possible. Hence, the contributions of this work might be two-fold: for classical support vector machines, we apply the standard Bayesian techniques to implement model selection, which is convenient to tune large numbers of hyperparameters automatically; for standard Gaussian processes, we introduce sparseness into Bayesian computation that helps to reduce the computational cost and makes it possible to tackle reasonably large-scale data sets. (Chu: 17)
Extra bits
Is there something to the thought that often it's easier not to encode background knowledge. Just play to the computer's strengths. If you wanted to search a document for the string 'kangaroo', you might assess each paragraph for the likelihood of it containing something about Australian animals, but clearly not the most efficient way for a machine.
For digit recognition, invariance should be encoded in allowable regions of feature space. They ought to be symmetric under group of allowable symmetries. 〈Φ(x), Φ(y)〉 = 〈Φ(gx), Φ(gy)〉.
Is this actually a group? More like a neighbourhood of the identity. Must include local thickening and thinning. Many of the obvious-to-human misclassified digits occur when pen doesn't work for a stretch.
Olivier Chapelle on loss function approach.