Monday, July 31, 2006

Feeling the master's superiority

There can be but two reasons for a philosopher to spend time, as I have, rethinking fundamental concepts of geometry. The first of these derives from the thesis that philosophy may legitimately contribute to the formulation of certain fundamental concepts, which include those of geometry. The second derives from the Collingwoodian thesis that to do the philosophy of a discipline well one must be 'thoroughly at home' with it. As for the first, the list of such legitimate concepts, as conceived by contemporary analytic philosophy, includes causation, probability, necessity, time, consciousness, identity, number and set, and yet would appear to exclude mathematical concepts such as (mathematical) space, point and dimension. I have no objection to the drawing of a distinction (it is hardly within the philosopher's brief to explore the nature of being igneous or of being crystalline), but I have yet to see a principled way of separating the legitimate from the illegitimate. Let me, here, pursue the second reason.

As I observe in my book,
For Collingwood, ... a capacity to experience the force of the absolute presuppositions of the contemporary form of the discipline about which one is philosophising is vital. While describing which qualities someone should possess to be able to answer the questions of philosophy of history, he remarks acidly that:
No one, for example, is likely to answer them worse than an Oxford philosopher, who, having read Greats in his youth, was once a student of history and thinks that this youthful experience of historical thinking entitles him to say what history is, what it is about, how it proceeds, and what it is for. (Collingwood 1946: 8)
A similar conclusion could be formulated for philosophy of mathematics, and indeed Kant is praised for dealing with the presuppositions of mathematics 'rather briefly' for 'he was not very much of a mathematician; and no philosopher can acquit himself with credit in philosophizing at length about a region of experience in which he is not very thoroughly at home' (Collingwood 1940: 240). Returning to history, he continues:
An historian who has never worked much at philosophy will probably answer our four questions in a more intelligent and valuable way than a philosopher who has never worked much at history. (Collingwood 1946: 9)
Evidence for the equivalent statement about mathematics is provided by the very many important contributions made by mathematicians thinking about their discipline, several of which I shall lean on in the course of this book. (pp. 17-18)

This Collingwood, Robin George, Waynflete Professor of Metaphysics at Oxford, was the son of William Gershom Collingwood, who worked with John Ruskin at Brantwood, the latter's home on the shores of Coniston Water. There could have been little direct personal influence on the young Collingwood, since Ruskin died in 1900, shortly before Collingwood turned 11, and suffered greatly from mental illness in his final years; most likely it was his father, who completed a biography of Ruskin in 1893, who provided the necessary immersion in Ruskinian principles.

In earlier posts (2 May & 5 May), I mentioned Ruskin's Unto This Last. I am currently reading Ruskin's autobiography Praeterita, where we read:

"Mostly a quiet stream there, through the bogs, with only a bit of step or tumble a foot or two high on occasion; above which I was able practically to ascertain for myself the exact power of level water in a current at the top of a fall. I need not say that on the Cumberland and Swiss lakes, and within and without the Lido, I had learned by this time how to manage a boat - an extremely different thing, be it observed, from steering one in a race; and the little two-foot steps of Tummel were, for scientific purposes, as good as falls twenty or two hundred feet high. I found that I could put the stern of my boat full six inches into the air over the top of one of these little falls, and hold it there, with very short sculls, against the level [Distinguish carefully between this and a sloping rapid.] stream, with perfect ease for any time I liked; and any child of ten years old may do the same. The nonsense written about the terror of feeling streams quicken as they approach a mill weir is in a high degree dangerous, in making giddy water-parties lose their presence of mind if any such chance take them unawares. And (to get this needful bit of brag, and others connected with it, out of the way at once), I have to say that half my power of ascertaining facts of any kind connected with the arts, is in my stern habit of doing the thing with my own hands till I know its difficulty; and though I have no time nor wish to acquire showy skill in anything, I make myself clear as to what the skill means, and is. Thus, when I had to direct road-making at Oxford, I sate, myself,with an iron-masked stone-breaker, on his heap, to break stones beside the London road, just under Iffley Hill, till I knew how to advise my too impetuous pupils to effect their purposes in that matter, instead of breaking the heads of their hammers off, (a serious item in our daily expenses). 
I learned from an Irish street crossing-sweeper what he could teach me of sweeping; but found myself in that matter nearly his match, from my boy-gardening; and again and again I swept bits of St Giles' foot-pavements, showing my corps of subordinates how to finish into depths of gutter. I worked with a carpenter until I could take an even shaving six feet long off a board; and painted enough with properly and delightfully soppy green paint to feel the master's superiority in the use of a blunt brush."
How much more important for Ruskin, then, to devote years to drawing, so as to be able to write intelligently about art.

Sunday, July 30, 2006

Gambling on the Riemann Hypothesis

I discussed the idea of Bayesianism and gambling in mathematics back on 3 November. Now, from sigpe, I see that you can trade in futures for mathematical results such as the Riemann Hypothesis. Of course, to put a finite time limit on a mathematical gamble, it has to be of the form, 'By 20XX, the Y conjecture will have been proved.' In the case of the Riemann hypothesis, it's 2020.

But clearly your degree of belief in RH's being proved by 2020 is bounded above by your degree of belief in its truth. Put the other way around, if you happened to believe that it is 70% likely that RH is false, you would be very happy to sell options at its current price. This price has fluctuated significantly over the decade of the future's existence. Presumably most of the fluctuation comes from evidence that someone is closing in on a proof. It would be interesting to see how, say, analogical evidence for or against its truth, such as Deninger provides, played out.
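
To make the arithmetic concrete, here is a minimal sketch of the seller's reasoning. The 70% figure is the one used above; the contract price is purely hypothetical, since I have not quoted the market's actual price, and the sketch ignores the time value of money.

```python
# Hypothetical numbers: the 70% figure comes from the text, the price does not.
belief_rh_true = 0.30   # i.e. 70% confident that RH is false
price = 0.50            # ASSUMED price of a contract paying 1 if RH is proved by 2020

# "Proved by 2020" implies "true", so belief in the former cannot exceed
# belief in the latter.
max_belief_proved_by_2020 = belief_rh_true


def expected_profit_from_selling(price, belief_proved):
    """Seller receives `price` now and pays out 1 if the proof arrives in time."""
    return price - belief_proved * 1.0


# The least favourable case for the seller is belief_proved at its upper bound.
worst_case = expected_profit_from_selling(price, max_belief_proved_by_2020)
print(f"expected profit per contract is at least {worst_case:.2f}")
```

On these assumed numbers, selling looks attractive at any price above 0.30.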

Wednesday, July 26, 2006

Information Geometry and Machine Learning

I'm drawing up a list of papers which formulate machine learning algorithms as maximum entropy or minimum relative entropy solutions, or more broadly are written in the general framework of information geometry. The list isn't aiming for completeness, rather coverage. If anyone knows of any obvious omissions, I'd be grateful to hear.

I'm interested at the moment in the topic at the end of the list, Bayesian Information Geometry, which might just be what a huge number of machine learning algorithms are approximating. The idea is simple enough. Starting out from a prior distribution in the space of distributions, for any data the decision rule minimises the divergence between the true distribution and its estimate. For a given data set this is equivalent to finding the distribution at the smallest mean distance from the true distribution, the mean being taken with respect to the posterior distribution. Snoussi's paper shows how broad this framework is, with the flexibility to choose your distance function and the weight you give to your choice of prior.
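
A toy version of "smallest mean distance under the posterior" can be sketched as follows. Everything here is an illustrative assumption rather than anything taken from Snoussi's paper: the posterior is over just three candidate distributions, and the distance is the Kullback-Leibler divergence with the true distribution in the first slot. For that particular choice the minimiser is the posterior mixture, which the code checks crudely against random competitors.

```python
import math
import random


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical posterior over three candidate "true" distributions on {0, 1, 2}.
candidates = [(0.7, 0.2, 0.1), (0.3, 0.4, 0.3), (0.1, 0.3, 0.6)]
posterior = (0.5, 0.3, 0.2)


def posterior_mean_divergence(q):
    """Mean of D(p || q), with p distributed according to the posterior."""
    return sum(w * kl(p, q) for w, p in zip(posterior, candidates))


# For this choice of divergence the minimising estimate is the posterior mixture.
mixture = tuple(
    sum(w * p[k] for w, p in zip(posterior, candidates)) for k in range(3)
)


def random_distribution(n=3):
    """A random point of the simplex, used only as a competitor estimate."""
    xs = [random.random() + 1e-9 for _ in range(n)]
    total = sum(xs)
    return tuple(x / total for x in xs)


# Crude check: no randomly drawn estimate beats the mixture.
random.seed(0)
best_random = min(
    posterior_mean_divergence(random_distribution()) for _ in range(2000)
)
print(mixture, posterior_mean_divergence(mixture), best_random)
```

Swapping the arguments of `kl` inside `posterior_mean_divergence` changes the answer (the minimiser becomes a normalised geometric mean instead), which is one small instance of the flexibility in the choice of distance function.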

At a sociological level, it's interesting to speculate why information geometry has been somewhat reluctantly taken up. I'm sure a large part of this is due to there being no straightforward introduction to the subject. Someone should write an exposition of its key successes without the usual huge dollop of differential geometry in the opening section.

Something I'm also curious to know is why there's not greater use of information geometry by the Gaussian process machine learning theorists. With Gaussian processes as maximum entropy solutions, you'd have thought they'd be tailor-made for the IG treatment, perhaps even to help in the choice of covariance function.

Monday, July 24, 2006

Philosophy's foreign relations

Jon Williamson has just paid me a visit here in Tübingen. Jon and I go back quite a way to the time when we were engaged on parallel projects, studying the interaction between AI and philosophy, at King's College London. We hope to work together in the future on a topic I've addressed in recent posts, namely, maximum entropy, perhaps in the context of information geometry. Over a wheat beer, we mulled over our thoughts on the proper relationship between philosophy and neighbouring disciplines. While I doubt Jon would wholeheartedly support my historical stance, we both are of the opinion that it is important for philosophers to leave their comfort zone periodically to engage with these disciplines.

A good way to catch a glimpse of how practitioners of these disciplines view philosophy comes in the introductory spiel to their invited contributions to philosophical collections. A case in point is the Handbook on the Philosophy of Information. Here are two such views:
The philosophy of X, where X is a science, often involves philosophers analyzing the concepts of X and commenting on what concepts are or are not likely to be coherent. AI necessarily shares many concepts with philosophy, e.g. action, consciousness, epistemology (what it is sensible to say about the world), and even free will. This article treats the philosophy of AI but also reverses the usual course and analyzes some basic concepts of philosophy from the standpoint of AI. The philosophy of X often involves advice to practitioners of X about what they can and cannot do. We reverse the usual course and offer advice to philosophers, especially philosophers of mind. The point is that philosophical theories can make sense only if they don’t preclude human-level artificial systems, and this fact has further consequences.
Information in Artificial Intelligence by J. McCarthy

Philosophers of science are concerned with explaining various aspects of science, and often, moreover, with viewing science as a kind of gold-mine of philosophical opportunity. The direction in both cases is philosophy from science. For a theoretical scientist, the primary inclination is often to see conceptual analysis as a preliminary to a more technical investigation, which may lead to a new theoretical development. In short: science from philosophy. This article is written mainly in the latter spirit, from the stand-point of Theoretical Computer Science, or perhaps more broadly “Theoretical Informatics”: a — still largely putative — general science of information. That being said, we hope that our conceptual discussions may also provide some useful grist to the philosopher’s mill.
Information, Processes and Games by Samson Abramsky
Both, then, are seemingly open to dialogue with philosophy. Of course, we should not underestimate the preparation necessary to be able to engage fruitfully with practitioners. The main danger of their breaking off contact comes from their perception that you have a peculiarly lop-sided view of their field, driven by some quixotic philosophical position.

Thursday, July 20, 2006

Renyi entropy

Following the discussion we had here about the merits of Tsallis and Renyi entropies, here's an interesting paper by Peter Harremoës - Interpretations of Renyi Entropies And Divergences. Harremoës is looking for an information theoretic interpretation of the Renyi entropies in terms of what he calls an operational definition:
To us an operational definition of a quantity means that the quantity is the natural way to answer a natural question and that the quantity can be estimated by feasible measurements combined with a reasonable number of computations. In this sense the Shannon entropy has an operational definition as a compression rate and the Kolmogorov entropy has an operational definition as shortest program describing data. (p. 2)
Via an introductory account of codes, we learn that "the Renyi divergence measures how much a probabilistic mixture of two codes can be compressed".

Like the Kullback-Leibler divergence (Shannon relative entropy), the Renyi divergence is additive or extensive in the sense that
Dq(P1 x P2//Q1 x Q2) = Dq(P1//Q1) + Dq(P2//Q2).
[KL-divergence equals the Renyi divergence for q = 1.] So too is the corresponding Renyi entropy. But for q > 1 it lacks a property possessed by the Shannon entropy, and also by all Renyi entropies with q in [0,1], namely concavity. The Tsallis entropy chooses the other option, and so while concave for q > 1, it is no longer additive/extensive.
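
These three claims (additivity of the Renyi divergence over products, its agreement with KL as q tends to 1, and the failure of additivity for the Tsallis entropy) can be checked numerically. The sketch below uses the standard textbook formulas, with `order` playing the role of q and `p, q` standing for the two distributions written P//Q above; the particular distributions are arbitrary illustrations.

```python
import math
from itertools import product


def renyi_divergence(p, q, order):
    """D_order(p || q) = log(sum p^order * q^(1 - order)) / (order - 1), order != 1."""
    s = sum(pi ** order * qi ** (1 - order) for pi, qi in zip(p, q))
    return math.log(s) / (order - 1)


def kl(p, q):
    """Kullback-Leibler divergence D(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tsallis_entropy(p, order):
    """S_order(p) = (1 - sum p^order) / (order - 1), order != 1."""
    return (1 - sum(pi ** order for pi in p)) / (order - 1)


def product_distribution(p1, p2):
    """Joint distribution of two independent variables, flattened to a list."""
    return [a * b for a, b in product(p1, p2)]


# Arbitrary illustrative distributions.
P1, Q1 = [0.2, 0.8], [0.5, 0.5]
P2, Q2 = [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]
order = 2.0

# Additivity of the Renyi divergence over independent products.
joint = renyi_divergence(
    product_distribution(P1, P2), product_distribution(Q1, Q2), order
)
parts = renyi_divergence(P1, Q1, order) + renyi_divergence(P2, Q2, order)

# Orders close to 1 approach the Kullback-Leibler divergence.
near_one = renyi_divergence(P1, Q1, 1 + 1e-6)

# The Tsallis entropy of a product is not the sum of the Tsallis entropies.
tsallis_joint = tsallis_entropy(product_distribution(P1, P2), order)
tsallis_sum = tsallis_entropy(P1, order) + tsallis_entropy(P2, order)

print(joint, parts)
print(near_one, kl(P1, Q1))
print(tsallis_joint, tsallis_sum)
```

The Tsallis gap is exactly (1 - q) times the product of the two entropies, which is why the discrepancy vanishes only as q tends to 1.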

Wednesday, July 19, 2006

Klein 2-Geometry III

Update: I'm floating this post to the top again so that we don't lose it.

Time to begin the new month's posting on categorified geometry, continuing May and June. Fortunately John Baez, although now in Shanghai, is on broadband. I wouldn't fancy solo Kleincategorification (second and final comments). It's like when you're learning to ski, you can manage much trickier slopes with an expert to follow.

I guess the biggest worry in a venture of this kind is that all you achieve is a repackaging of what's already known. There's a discussion here, involving John, about whether Lie 2-algebras bring into the light anything new (cf. posts 9, 13 and 14). (The archives of sci.physics.research are full of delights. Here's another thread on 2-groups.)

In this discussion, John mentions his reasons for quitting his role as moderator of sci.physics.research. I'm not sure I've characterised all that well there what it is I'm looking for beyond individuals exposing their ideas in a free and informal way. John can see I want something a little more agonistic, but fears the tendency towards antagonism. I see agonism is a political position. One of its advocates has this to say:

Agonism implies a deep respect and concern for the other; indeed, the Greek agon refers most directly to an athletic contest oriented not merely toward victory or defeat, but emphasizing the importance of the struggle itself-a struggle that cannot exist without the opponent. Victory through forfeit or default, or over an unworthy opponent, comes up short compared to a defeat at the hands of a worthy opponent-a defeat that still brings honor. An agonistic discourse will therefore be one marked not merely by conflict but just as importantly, by mutual admiration. (Samuel Chambers)
Bloggers of the world, forgo antagonism, choose agonism.

Of course, you may be able to internalise the agon, by taking also the part of the opponent. Indeed, this is how I arrived at the idea behind this series on the categorification of Kleinian geometry. I imagined what someone highly dubious about the scope of worthwhile categorification might say. "Let us for the moment accept that the 'categorification' (such an ugly name) of arithmetic and combinatorial identities via groupoids and species has been worthwhile, what do you have to say about Euclidean geometry, the jewel of Greek mathematics? If you have nothing new to tell me about points, lines and circles, I shall remain unconvinced."

Tuesday, July 18, 2006

We've hardly begun

From Barry Mazur's Foreword to Fearless Symmetry: Exposing the Hidden Patterns of Numbers by Avner Ash & Robert Gross (to be published by Princeton University Press):
At some point in his or her life every working mathematician has to explain to someone, usually a relative, that mathematics is hardly a finished project. The mathematicians know, of course, that it is far too early to put the glorious achievements of their trade into a big museum and become happy curators. Our subject has, in certain respects, hardly begun. But, at least in the past, this seems not to have been universally acknowledged.
Yes, forget the big museum, it's the seed bank we need.

Friday, July 14, 2006

Conceptual essentialism

John Baez recently added a comment to this post, which is too old now for comments to appear in 'recent comments'. I had remarked that something he had said earlier sounded like it came straight from the Jaffe-Quinn debate. For those of you who don't remember it, these two mathematical physicists launched a passionate attack on slipping standards in mathematics, brought about by an imitation of the sloppier ways of physicists. Many very interesting responses were made, not least William Thurston's wonderful On proof and progress in mathematics.

Anyway, John replied:

I hope it's clear that I'm not complaining about the lack of rigor. I'm complaining about a swarm of people writing hundreds of short papers on the same subject in a short time, each referring to many of the previous ones, nobody taking the time to distill the matter to its essence. Even if all the papers contained nothing but rigorous theorems, I would still find this annoying. It's fine if you wish to devote yourself to one specialized subject, rapidly master the literature, and compete with the crowd to extract some big nuggets before this vein of ore looks exhausted and it's time to move on. I'm sure this is fun for people with a competitive streak. But there are other people who like to slowly mull over one topic and nurse it to perfection - or like me, mull over lots of topics and gradually form a web of connections until something interesting emerges. And, you know, it's just possible that some of the people in that Jaffe-Quinn dispute were secretly annoyed about the fast-paced "swarming" style of theoretical physics more than any lack of rigor. I forget if any of them came out and said this.
I think the phrase 'nobody taking the time to distill the matter to its essence' is the key one here. Remember, two posts ago we had Borovik saying "The work of three generations of mathematicians confirmed that matroids, indeed, capture the essence of linear dependence" (my emphasis).

I've done my damnedest to get the idea of mathematical activity at its highest level aiming to extract the essence of a situation to be the principal topic of philosophy of mathematics, but with little success. It's not that I'm the only philosopher thinking about such things. For instance, Kenny Easwaran posted Do Mathematical Concepts Have Essences? on his blog, where you can follow up the reference to a paper I wrote on the subject. But it never stays on the agenda for long.

I think what is needed is a name. Essentialism is overused. Conceptualism is also already taken. It concerns the kind of problem faced when wondering what the tallness is shared by, say, a 2 metre man, a 30 metre tree, and a 300 metre building. As this paper explains:

Conceptualism, along with nominalism and realism, is one of three traditional families of views about universals. There are many species of each family, but the story line goes like this. Realists hold that there are universal properties and that these solve the problems of universals. Conceptualists deny this, arguing that concepts can do most of the work realists invoke properties to do. And nominalists, at least traditional ones, spurn both universals and concepts, arguing that words alone can do all the legitimate aspects of this work.
Blending the two, conceptual essentialism has been used in philosophy of science to designate a similar position. But is it snappy enough? How much of Kuhn's success was down to his choice of the word revolution?

Tuesday, July 11, 2006

The prevalence of Kullback-Leibler

How's this for an explanation of the prevalence of the Kullback-Leibler divergence:

Much statistical inference takes the form of finding an optimal distribution satisfying some set of constraints. Very often these constraints are such that for any two distributions, P and Q, satisfying them, so do all mixtures of the form bP + (1 - b)Q. This is what Amari calls m-flatness (m for mixture), i.e., these paths of mixtures are geodesics with respect to the m-connection. Now, the dual affine connection to the m-connection is the e-connection (e for exponential), and e-flat manifolds of distributions are the ubiquitous exponential families. (To see e-flatness is not the same as m-flatness, consider that mixtures of Gaussians are not generally Gaussian.) Minimizing the relative KL-entropy of distributions satisfying the constraint is equivalent to finding where the exponential family meets the space of constrained distributions.
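
A small numerical sketch of this picture, under assumptions of my own choosing: the prior is a fair die, the constraint fixes the mean at 4.5, and the one-parameter exponential family through the prior is searched by bisection. The code checks that mixtures of feasible distributions stay feasible (m-flatness) and that the point where the exponential family meets the constraint set splits the divergence from the prior in Pythagorean fashion.

```python
import math


def tilt(prior, values, lam):
    """Exponential-family member: q(x) proportional to prior(x) * exp(lam * x)."""
    weights = [p * math.exp(lam * x) for p, x in zip(prior, values)]
    z = sum(weights)
    return [w / z for w in weights]


def mean(dist, values):
    return sum(d * x for d, x in zip(dist, values))


def kl(p, q):
    """Kullback-Leibler divergence D(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def project_onto_mean(prior, values, target, lo=-50.0, hi=50.0):
    """Bisection on lam; the mean of the tilted distribution increases with lam."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean(tilt(prior, values, mid), values) < target:
            lo = mid
        else:
            hi = mid
    return tilt(prior, values, (lo + hi) / 2)


faces = [1, 2, 3, 4, 5, 6]
uniform = [1 / 6] * 6
target_mean = 4.5  # illustrative constraint, chosen for this sketch

projection = project_onto_mean(uniform, faces, target_mean)

# Two other distributions satisfying the constraint, and a mixture of them.
a = [0, 0, 0.25, 0.25, 0.25, 0.25]
b = [0.3, 0, 0, 0, 0, 0.7]
mix = [0.4 * x + 0.6 * y for x, y in zip(a, b)]

print(mean(mix, faces))  # the mixture still has mean 4.5
print(kl(projection, uniform), kl(a, uniform), kl(b, uniform))
# Pythagorean split: D(a || prior) = D(a || projection) + D(projection || prior)
print(kl(a, uniform), kl(a, projection) + kl(projection, uniform))
```

The Pythagorean equality is what makes the meeting point the unique minimiser: every other feasible distribution pays an extra D(a || projection) on top of the projection's own divergence from the prior.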

So my question is whether presenting the m-flatness idea first, as arising out of common-or-garden linear constraints, such as fixing the values of moments, is a good way to motivate the KL-divergence. Then it would be interesting to think about other types of constraints which would lead to flatness with other connections, and what the generalized exponential families, flat according to the dual affine connection, would look like.

Monday, July 10, 2006

Discrimination against -oids

From Not Even Wrong, I see that the Institut Henri Poincare in Paris is holding a 3-month program on Groupoids and Stacks in Physics and Geometry. They have included an interesting overview of the subject to motivate the program. I have a particular soft spot for groupoids, having studied the case for their admission into the paradise of mathematics in chapter 9 of my book. Groupoids had a strangely difficult childhood, finding acceptance surprisingly late, the oddest explanation for which is Connes' claim that:
...it is fashionable among mathematicians to despise groupoids and to consider that only groups have authentic mathematical status, probably because of the pejorative suffix 'oid'. (Noncommutative Geometry, 6-7)
Rather than this persecution of suffixes, a more common sentiment is that they are really just dressed up groups. In an old e-mail I have from Saunders Mac Lane he adopts just this line, perhaps surprisingly for the co-inventor of category theory, groupoids being categories with invertible morphisms, and very present in Mac Lane's home of algebraic topology.

Someone who had read my book made the good suggestion that I subject matroids to a similar treatment. Now I hear (penultimate comment by Srandby) that Mac Lane didn't like these either.
Once, Mac Lane came to give a talk. During the talk, in front of a packed audience, he stated that matroid theory wasn’t good or important mathematics, pissing off several faculty who worked in matroid theory. I found this comment to be very bizarre. Here was an advocate of a vast generalization of dubious importance dismissing a generalization of vector spaces that has tremendous importance.
Is there something to Connes' anti-oid theory?

Anyone who wants to take up the challenge of assessing matroids should take a look at Coxeter Theory: The Cognitive Aspects, an article by Alexandre Borovik. In section 13 - Combinatorics as non-parametric mathematics - Borovik claims:
The work of three generations of mathematicians confirmed that matroids, indeed, capture the essence of linear dependence. Since linear dependence is a ubiquitous and really basic concept of mathematics, it is not surprising that the concept of matroid has proven to be one of the most pervasive and versatile in modern combinatorics. (p. 23)
This book with Gelfand should no doubt be consulted too: Coxeter Matroids, Birkhauser, xiv+264 pp., ISBN 0-8176-3764-8 (with I. M. Gelfand and N. White), 2003.

I believe that Borovik will include the Coxeter Theory article in a book to appear with Springer. I read a draft of this book a couple of years back and found it wonderfully rich.

Thursday, July 06, 2006

Conditionalization as I-projection

First Greenspan,

“In essence, the risk management approach to monetary policy-making is an application of Bayesian decision-making.” (p. 37)

“Our problem is not, as is sometimes alleged, the complexity of our policy-making process, but the far greater complexity of a world economy whose underlying linkages appear to be continuously evolving. Our response to that continuous evolution has been disciplined by the Bayesian type decision-making in which we have engaged.” (p. 39)


“Risk and Uncertainty in Monetary Policy,” American Economic Review, May, 2004, 33-40.

Now it's the turn of the US Food and Drug Administration to come out in favour of Bayesianism.

Thanks to Yet Another Machine Learning Blog for this. An earlier post - Maximum entropy and bayesian updating - on this interesting blog presents the following example from Kass of a possible clash between maximising entropy and conditionalization:
Consider a Die (6 sides), consider prior knowledge E[X]=3.5.

Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

Now consider a new piece of evidence A="X is an odd number"

Bayesian posterior P(X/A)= P(A/X) P(X) = (1/3, 0, 1/3, 0, 1/3, 0).

But MaxEnt with the constraints E[X]=3.5 and E[Indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0) !! (note that E[Indicator function of A]=P(A))

Indeed, for MaxEnt, because there is no more '6', big numbers must be more probable to ensure an average of 3.5. For bayesian updating, P(X/A) doesn’t have to have a 3.5 expectation. P(X) and P(X/A) are different distributions. Conclusion ? MaxEnt and bayesian updating are two different principles leading to different belief distributions. Am I right ?

Example 3 on p. 4 of Information topologies with applications by Peter Harremoës provides the answer here. Passing from a distribution P(X) to P(X/A) is just one simple case of a general process of projection from a point to a subspace of a space of distributions. Let P(X) be a distribution and A an event such that P(A)>0. Let C(A) be the set of distributions Q, with Q(A) = 1. Then P(./A) is the closest element of C(A) to P in the sense of Kullback-Leibler distance (relative entropy). It is a robust Bayes act to update thus.

More technically, C(A) is 'm-flat' in the sense of Amari, i.e., if Q and R are in C(A) then so is b.Q + (1 - b).R. The projection of P onto C(A) along the dual e-connection is P(./A). Forming the conditional distribution is but one small example of Csiszar's I-projection, which may use divergences other than the Kullback-Leibler.
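
The claim that P(./A) is the Kullback-Leibler-closest element of C(A) can be checked by brute force. The sketch below assumes a made-up loaded die as the prior (so that the result does not depend on uniformity) and compares the conditional distribution against randomly drawn members of C(A); the divergence is taken as D(Q // P), with the candidate Q in the first slot.

```python
import math
import random


def kl(q, p):
    """Kullback-Leibler divergence D(q || p)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


prior = [0.10, 0.25, 0.05, 0.20, 0.30, 0.10]     # hypothetical loaded die
event = [True, False, True, False, True, False]   # A = "X is an odd number"

# Ordinary conditionalization: restrict to A and renormalise.
p_event = sum(p for p, in_a in zip(prior, event) if in_a)
conditional = [p / p_event if in_a else 0.0 for p, in_a in zip(prior, event)]


def random_member_of_C():
    """A random distribution Q with Q(A) = 1."""
    raw = [random.random() if in_a else 0.0 for in_a in event]
    total = sum(raw)
    return [r / total for r in raw]


# How much further from the prior is each random competitor?
random.seed(1)
gaps = [
    kl(random_member_of_C(), prior) - kl(conditional, prior) for _ in range(1000)
]

print(kl(conditional, prior), -math.log(p_event))
print(min(gaps))
```

The minimum distance comes out as -log P(A), and every competitor Q exceeds it by D(Q // P(./A)), which is why none of the gaps is negative.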

Back to Kass' example, the MaxEnt formulation is projecting to the manifold of distributions satisfying both of the constraints, rather than just one as in the case of conditionalization.
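
The numbers in Kass' example are easy to reproduce. With both constraints in force the MaxEnt solution lives on the faces {1, 3, 5} and has the form p(x) proportional to r^((x - 1)/2); the mean constraint then reduces to the quadratic 3r^2 - r - 5 = 0. That reduction is my own working rather than something stated in the quoted post, so treat the sketch as a check, not a citation.

```python
import math

odd_faces = [1, 3, 5]

# Conditionalization on A = "X is odd", starting from the uniform MaxEnt prior.
conditional = {x: 1 / 3 for x in odd_faces}

# MaxEnt under both constraints: weights 1, r, r^2 on the faces 1, 3, 5,
# where r is the positive root of 3 r^2 - r - 5 = 0 (from E[X] = 3.5).
r = (1 + math.sqrt(61)) / 6
weights = [1.0, r, r * r]
z = sum(weights)
maxent = {x: w / z for x, w in zip(odd_faces, weights)}

conditional_mean = sum(x * p for x, p in conditional.items())
maxent_mean = sum(x * p for x, p in maxent.items())

print({x: round(p, 2) for x, p in maxent.items()})  # {1: 0.22, 3: 0.32, 5: 0.47}
print(conditional_mean, maxent_mean)
```

The conditional distribution has mean 3 while the two-constraint solution is forced back to 3.5, which is the whole of the apparent clash: the two procedures are projecting onto different sets.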