The basic idea behind the Bayesian approach to probability and statistics is fairly straightforward. In fact, it mimics the way we operate in daily life quite accurately. Bayes provides a means of updating the probability that some event will occur as we acquire new information. This sounds like a fairly uncontentious way of going about business, and yet for over two hundred years Bayes was rejected by the mainstream community of statisticians – although the military has always used it secretly, and left the egg-heads to debate the philosophical issues (Bayes was used in the Second World War to help crack the Enigma code). The basic philosophy behind Bayes is well expressed in the following quote:
When the facts change, I change my opinion. What do you do, sir?
John Maynard Keynes
The Bayes rule for computing probabilities was formulated by the Rev. Thomas Bayes, an eighteenth-century clergyman. It was published posthumously and didn’t really receive any attention until the great mathematician Laplace reformulated and extended the idea. Even then statisticians remained in denial for over two hundred years, until the overwhelming success of some high-profile applications meant they could no longer bury their heads in the sand – but believe it or not there is still a statisticians’ equivalent of the flat earth society.
Classical statistics is far more rigid. It assumes a population of objects we are interested in, and that some unknown parameters describe this population perfectly. This is flawed from the start – but let’s not get distracted by philosophical issues. We then sample the population and derive statistics which, through a whole lot of jiggery-pokery, are assumed to approximate the parameters of the whole population. Bayes is far more direct. Nothing is fixed, and probabilities vary as new information becomes available. Imagine it is summer and rain is a rarity. The weather forecast gives a high probability of rain for the following day. Do we choose to ignore this information, or do we make sure we go out with an umbrella the following day? The latter would seem to make sense – a perfect example of Bayesian thinking.
If we are going to get any further than anecdotes and analogies we unfortunately need a smattering of math. It isn’t onerous, and as long as we remember the meaning behind the math the end results should make sense, although I should warn that the results of Bayesian analysis are often counter-intuitive.
A central notion used in Bayes is that of conditional probability. Please do not fast forward at this point; conditional probabilities are fairly easy to understand. Going back to the rain example, we might say that the probability of rain on a given day in summer is just two percent – a probability of 0.02. So given it is summer we assume that the chance of rain tomorrow is fairly remote. This can be expressed as P(Rain|Summer), read as the probability of rain given it is summer; the vertical bar can be read as ‘given that’. We already know the answer: P(Rain|Summer) = 0.02. This is the conditional probability of rain given that it is summer.
Conditional probabilities can be expressed in terms of unconditional probabilities via a simple formula. To get to this formula we’ll use a Venn diagram.
The blue rectangle represents all the days in the year – 365 of them. The yellow circle represents all the summer days – say 90 of them. The large brown circle represents all the days it rains – 150 say. The bit we are really interested in is the small orange area, the days when it is summer and it rains. The probability of rain on any particular day in the year is the number of days it rains divided by the number of days in the year. We’ve already said it rains 150 days a year, and so the probability of rain on any given day is 150/365, which equals 0.411, or forty-one percent. This takes no account of the season; it is just the chance of rain on a day picked at random from the year. However, we know that it is summer. The minute we say ‘given that it is summer’ we lose interest in all the other days – the yellow circle becomes our universe.
To calculate P(Rain|Summer) we need to know the number of days it rains when it is summer, and divide by the total number of days in summer. In the Venn diagram it means dividing the number of days in the small orange area by the number of days in the yellow area.
Now for some more notation (and mathematics is nothing more than notation). When two areas in a Venn diagram overlap they form an intersection – because they intersect. Mathematicians use an inverted ‘U’ (∩) to symbolize this, but we’ll use the ‘&’ sign. In other words ‘rain and summer’ is represented by ‘Rain&Summer’ – the intersection on the diagram above. We’ve already stated that the probability of rain given that it is summer can be calculated by dividing the number of summer days on which it rains by the number of days in summer. Dividing both counts by the 365 days in the year turns them into probabilities without changing the ratio, so this can be expressed mathematically as:
P(Rain|Summer) = P(Rain&Summer)/P(Summer) – (i)
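Equation (i) can be checked numerically. The following is a minimal Python sketch, not part of the original argument: the day counts come from the Venn diagram, while the figure of 1.8 rainy summer days a year is an assumption, chosen because it is what a 2% chance of rain over a 90-day summer implies. The variable names are purely illustrative.

```python
# Day counts taken from the Venn diagram description.
days_in_year = 365
summer_days = 90
# Assumed: 1.8 rainy summer days per year on average, which is what
# P(Rain|Summer) = 0.02 implies for a 90-day summer.
rainy_summer_days = 1.8

p_summer = summer_days / days_in_year
p_rain_and_summer = rainy_summer_days / days_in_year

# Equation (i): P(Rain|Summer) = P(Rain&Summer) / P(Summer)
p_rain_given_summer = p_rain_and_summer / p_summer
print(round(p_rain_given_summer, 4))  # 0.02
```

The 365 cancels out, which is why counting days and dividing probabilities give the same answer.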
Now there is no reason why we cannot reverse the reasoning here and calculate the probability of summer given that it is raining – or P(Summer|Rain). Swapping rain and summer around we get:
P(Summer|Rain) = P(Summer&Rain)/P(Rain) – (ii)
Bear with it – we are almost there.
Now Summer&Rain is just the same as Rain&Summer. Since both the above equations contain P(Summer&Rain) we can isolate this term in both equations by multiplying through by P(Summer) in (i) and P(Rain) in (ii). Having done this we get:
P(Rain|Summer) x P(Summer) = P(Summer|Rain) x P(Rain)
By rearranging we get a simplified form of Bayes rule:
P(Summer|Rain) = (P(Rain|Summer) x P(Summer))/P(Rain)
So what, you might ask? Well, what we have just done is permit the calculation of one conditional probability given that we know its converse. In this trivial example we can ask what the probability is that it is summer given that it is raining. We know all the probabilities on the right-hand side: P(Rain|Summer) = 0.02 as stated earlier, P(Summer) = 90/365 = 0.247, and P(Rain) = 150/365 = 0.411.
So P(Summer|Rain) = (0.02 x 0.247) / 0.411 = 0.012
The probability it is summer given a rainy day is just over one percent, whereas the probability of a rainy day given summer is two percent. It is often difficult to see what this really means, but the Venn diagram shows that there are many more rainy days than summer days. The same small overlap of rainy summer days is therefore a smaller fraction of the rainy days than it is of the summer days.
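The same arithmetic can be reproduced in a few lines of Python. This is only a sketch with illustrative variable names; it keeps the exact fractions rather than the rounded values 0.247 and 0.411.

```python
# Values stated in the text.
p_rain_given_summer = 0.02   # P(Rain|Summer)
p_summer = 90 / 365          # P(Summer)
p_rain = 150 / 365           # P(Rain)

# Simplified Bayes rule:
# P(Summer|Rain) = P(Rain|Summer) x P(Summer) / P(Rain)
p_summer_given_rain = p_rain_given_summer * p_summer / p_rain

print(f"P(Summer)      = {p_summer:.3f}")             # P(Summer)      = 0.247
print(f"P(Summer|Rain) = {p_summer_given_rain:.3f}")  # P(Summer|Rain) = 0.012
```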
I’m going to end this piece by introducing two terms – prior and posterior probabilities. As the name suggests, a prior probability is one held prior to any conditions, and so P(Summer) is a prior because it is totally unconditioned, whereas P(Summer|Rain) is a posterior because it comes after a condition – namely that it is raining. This is what Bayes is really all about, and why at the beginning I talked about updating probabilities based on evidence. So our example can be formulated as:
P(Summer|Rain) = P(Summer) x P(Rain|Summer)/P(Rain)
In other words, how does the knowledge that it is raining modify our belief that this might be a summer day? Well, P(Summer), as we have shown before, is 0.247 or around 25%. As soon as we are given the evidence that it is raining, the probability that it is summer drops to just 0.012, or around 1%.
Now this is a trivial example, and of course most sane people would know the season of the year. Bayes is typically applied when we cannot observe the outcome directly and the best we can do is use evidence to establish the probability of a particular event happening. It is used in gambling, stock market forecasting, identifying spam, and very heavily by the military. It was even used to predict the likelihood of a space shuttle accident, before disaster struck Columbia. Bayes gave an estimate of about one in thirty, whereas traditional statistics put the probability at one in a thousand.
If you are unfamiliar with Bayes please read part 1 of this series.
In the article previous to this one we established a simplified form of Bayes rule for the very specific example involving the weather and the season of the year. The equation looked like this:
P(Summer|Rain) = P(Summer) x P(Rain|Summer)/P(Rain)
Obviously we would want to generalise this and so we can replace ‘summer’ and ‘rain’ with variable events A and B. For those not familiar with probability notation a slight detour is needed. Please read this to get the basic vocabulary.
So our expression for Bayes becomes the celebrated formula:
P(A|B) = P(A) x P(B|A) / P(B)
The way it is expressed here emphasises the relationship between the posterior and the prior probabilities, P(A|B) and P(A) respectively. This formula shows how knowledge that event B has occurred changes the probability that event A has occurred. We need to know all three terms on the right to compute the value on the left. Critics of Bayes quite rightly point out that this information is not always available – but that is not the point. Often it is, and when it is we can perform a little magic that almost always surprises.
The first application to a business scenario we are going to make is that of recruitment. I’m afraid there is a little more mathematical manipulation we have to plough through, but it shouldn’t be too arduous.
The term P(B) is often not known directly, and to create the full blown expression for Bayes we need to consider how P(B) might be broken down to something which is more general. To get to grips with this we need to consider the diagram below.
The rectangle represents our sample space and the lines divide it into a number of events (sets) E1, E2, E3, … , En. Of course in the diagram there are just four segments, but there could be any number n. The way the sample space has been divided up is known as a partition – because it partitions the sample space with no overlapping areas and no gaps. Or to use the jargon, E1, E2, … , En are pairwise disjoint and together cover the whole sample space. Without getting too bogged down in the math, it should be clear that any event A (an area drawn somewhere inside the rectangle) can be reconstructed from its intersections with each of these partitioned areas. More specifically we can state that (remembering we use the & symbol for an intersection):
A = (A&E1) U (A&E2) U … U (A&En)
The ‘U’ is being used as the symbol for union – similar to addition when numbers are considered instead of sets. Because the partition is pairwise disjoint we can represent the probability of A as the sum of the probabilities of the intersections.
P(A) = P(A&E1) + P(A&E2) + … + P(A&En)
Just one more trick. We saw in the previous article that P(A&E1) is equal to P(A|E1)*P(E1). So substituting this back into the equation above we get:
P(A) = P(A|E1) x P(E1) + P(A|E2) x P(E2) + … +P(A|En) x P(En)
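To see this formula at work, here is a hedged Python sketch that applies it to the weather example from the previous article, with the year split into just two parts: summer and not-summer. The function name total_probability is hypothetical, and the figure for P(Rain|NotSummer) is not given in the article; it is derived here on the assumption that 1.8 of the 150 rainy days fall in summer (2% of 90 days), leaving 148.2 for the other 275 days.

```python
import math

def total_probability(likelihoods, priors):
    """P(A) = sum of P(A|Ei) x P(Ei) over a partition E1..En."""
    if len(likelihoods) != len(priors):
        raise ValueError("need one likelihood per partition event")
    if not math.isclose(sum(priors), 1.0):
        raise ValueError("the partition probabilities must sum to 1")
    return sum(l * p for l, p in zip(likelihoods, priors))

# Partition of the year: E1 = summer (90 days), E2 = not summer (275 days).
p_summer = 90 / 365
p_not_summer = 275 / 365

# A = rain. P(Rain|Summer) is from the article; P(Rain|NotSummer) is
# derived under the assumption that 148.2 of the 150 rainy days fall
# outside summer.
p_rain_given_summer = 0.02
p_rain_given_not_summer = 148.2 / 275

p_rain = total_probability(
    [p_rain_given_summer, p_rain_given_not_summer],
    [p_summer, p_not_summer],
)
print(round(p_rain, 3))  # 0.411
```

The result matches the 150/365 we started with, which is all the formula promises: the pieces add back up to the whole.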
This pain is necessary unfortunately to get to the most useful form of Bayes rule. We are going to move a few things around. We will now consider P(Ei|A) as the posterior probability we wish to calculate. There is a rabbit hole here and I don’t want to go down it right now. But when we are updating a probability using Bayes rule we need to know all the possible scenarios that might have happened. In our ‘summer’, ‘rain’ example we were looking to calculate the probability it is summer given that it has rained. The other options are that it might be winter, spring or autumn. The four seasons together make up the whole sample space of days in the year, and we are using the knowledge that it has rained to establish the most likely season. We only calculated the value for summer and found it was pretty unlikely. The important point here is that we are using the evidence given to us to update the probabilities of a number of alternatives, which together should cover all possibilities. So without further ado let’s restate Bayes:
P(Ei|A) = P(Ei) x P(A|Ei) / (P(A|E1) x P(E1) + P(A|E2) x P(E2) + … + P(A|En) x P(En))
This is the general, and most useful formulation for Bayes rule, and it will become apparent why this is so very soon.
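A minimal Python sketch of this general form follows. The function posteriors is a hypothetical helper, not something from a library; it takes the prior P(Ei) and the likelihood P(A|Ei) for every event in the partition and returns every P(Ei|A) at once. The example reuses the weather figures with an assumed two-way split of the year into summer and not-summer, where P(Rain|NotSummer) = 148.2/275 is derived from the earlier day counts rather than stated in the article.

```python
def posteriors(priors, likelihoods):
    """Return P(Ei|A) for each event Ei of a partition.

    priors:      dict mapping event name -> P(Ei)
    likelihoods: dict mapping event name -> P(A|Ei)
    """
    # Numerator of Bayes rule for each event: P(Ei) x P(A|Ei)
    joint = {e: priors[e] * likelihoods[e] for e in priors}
    # Denominator: P(A), the sum of all the numerators.
    p_a = sum(joint.values())
    if p_a == 0:
        raise ValueError("the evidence has zero probability")
    return {e: j / p_a for e, j in joint.items()}

priors = {"summer": 90 / 365, "not summer": 275 / 365}
likelihoods = {"summer": 0.02, "not summer": 148.2 / 275}  # P(Rain|Ei)

result = posteriors(priors, likelihoods)
for event, p in result.items():
    print(f"P({event}|rain) = {p:.3f}")
# P(summer|rain) = 0.012
# P(not summer|rain) = 0.988
```

Because the denominator is just the sum of the numerators, the posteriors always add up to one.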
Probably a good time to take a break.
OK – here is the business problem we are going to consider, and not only will it provide a surprising insight into the recruitment process, but should help us put the pieces together. Your organisation is looking to recruit someone with very specific IT skills – C++, SAP, SOA, Insurance Industry, Linux – not too many of those around. Prior experience shows you that in a situation like this only one out of every twenty applicants is likely to be suitable. But to compensate for this, the recruitment process in your organisation is top notch and gets it right nine times out of ten – 90% of the time.
We are going to use a decision tree to represent this process, and to clarify the reasoning behind the Bayes rule.
The first split on the tree is whether a candidate is suitable. Of course we don’t actually know which candidates are suitable, just that 19 out of 20 are usually unsuitable. What we do know is that given a suitable candidate there is a 90% chance she or he will be selected. Similarly we know there is a 10% chance they will be rejected. The situation reverses for unsuitable candidates – with an acceptance rate of 10% and a rejection rate of 90%. If you have followed so far some bells should be ringing. We’ve just talked about the probability of accepting a candidate given he or she is suitable – in other words, a conditional probability.
What we want to know is the probability that a candidate is suitable given that they have been selected – this is the only thing we are really interested in. So now I’m going to list the things we know and plug them into Bayes rule. Writing S for suitable, U for unsuitable, A for accepted (selected) and R for rejected, remember we want P(S|A).
P(A|S) = 0.9
P(R|S) = 0.1
P(A|U) = 0.1
P(R|U) = 0.9
P(S) = 0.05
P(U) = 0.95
Expressed in terms of Bayes formula, we transform the general rule
P(Ei|A) = P(Ei) x P(A|Ei) / (P(A|E1) x P(E1) + P(A|E2) x P(E2) + … + P(A|En) x P(En))
into
P(S|A) = P(S) x P(A|S) / (P(A|S) x P(S) + P(A|U) x P(U))
Note that S and U partition our sample space.
When we put the numbers in this becomes:
P(S|A) = (0.05 x 0.9)/(0.05 x 0.9 + 0.1 x 0.95) = 0.32
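One way to make this number less mysterious is to count heads. The sketch below, in Python, pushes a hypothetical pool of 1,000 applicants through the decision tree and counts who ends up accepted. The pool size is an assumption (any size gives the same ratio) and the variable names are illustrative.

```python
applicants = 1000             # hypothetical pool size
p_suitable = 0.05             # P(S): one in twenty
p_accept_if_suitable = 0.9    # P(A|S)
p_accept_if_unsuitable = 0.1  # P(A|U)

suitable = applicants * p_suitable    # about 50 people
unsuitable = applicants - suitable    # about 950 people

accepted_suitable = suitable * p_accept_if_suitable        # about 45
accepted_unsuitable = unsuitable * p_accept_if_unsuitable  # about 95
accepted = accepted_suitable + accepted_unsuitable         # about 140

# P(S|A): the share of accepted candidates who are actually suitable.
p_suitable_if_accepted = accepted_suitable / accepted
print(round(p_suitable_if_accepted, 2))  # 0.32
```

Of roughly 140 people who get through, only about 45 are suitable; the 95 unsuitable ones who slipped past outnumber them two to one.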
The probability that a selected candidate is actually suitable is less than one third, at 32%. On the face of it this does not seem reasonable in light of the excellent recruitment skills of the organisation. The probability is skewed by the fact that there is such a large number of unsuitable candidates: even though 90% of them get rejected, the 10% that are accepted make up a large slice of the acceptance pie. The solution to this problem is simple enough – do a second round of interviews. We know that for this second round there is a 32% chance that a candidate is suitable, so 0.32 replaces 0.05 as the prior (and 0.68 replaces 0.95). Plug the numbers in, assuming the second round is just as accurate as the first, and the probability of getting a suitable candidate is 81% – somewhat better.
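The second-round figure can be checked with a short Python sketch. It assumes, as the text implies but does not state, that the second round gets it right 90% of the time just like the first, and that its mistakes are independent of the first round's. The function name update is illustrative.

```python
def update(prior, p_accept_if_suitable=0.9, p_accept_if_unsuitable=0.1):
    """Return P(S|A) given the prior P(S), using Bayes rule."""
    numerator = prior * p_accept_if_suitable
    denominator = numerator + (1 - prior) * p_accept_if_unsuitable
    return numerator / denominator

after_round_1 = update(0.05)           # prior: one applicant in twenty
after_round_2 = update(after_round_1)  # the first posterior becomes the new prior

print(round(after_round_1, 2))  # 0.32
print(round(after_round_2, 2))  # 0.81
```

Feeding one round's posterior in as the next round's prior is exactly the updating of probabilities on new evidence that Bayes is about.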
In the next article I’m going to go over this example with a fine-tooth comb to map the features of this problem to the Bayes rule.