
Wednesday, January 23, 2008

Our intuition about probability (part 1)

Peter Donnelly's TED talk highlights some common mistakes made in interpreting probabilities and statistics as well as the consequences of such mistakes.

One of the interesting questions he poses is this: Suppose we toss a fair coin repeatedly, counting the number of tosses it takes to first produce the pattern 'HTT' (head, tail, tail); for example, if we toss 'HHTHHTTH', the count would be 7. We perform a large number of trials, resetting the count of tosses between trials, to get the average number of tosses required to produce 'HTT'. We then repeat this experiment (counting the number of tosses required to produce a pattern across a large number of trials), but instead of looking for 'HTT', we look for 'HTH'. We then ask which, if either, is larger: the average number of tosses needed to produce 'HTT' or the average number of tosses needed to produce 'HTH'?
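
For concreteness, R's regexpr gives the same count almost directly; the '+ 2' shift simply moves from the start of the three-toss pattern to its end, and is the same correction used in the simulation code below. This is only a quick illustration of the counting convention:

regexpr('HTT', 'HHTHHTTH') + 2;   # 7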

If you are like most people (including me), your reasoning goes something like this: For a fair coin, the probability of a head P(H) = 1/2, and each toss is independent of the one before it. Thus, P(HTT) = P(HTH) = (1/2)(1/2)(1/2) = 1/8, and the average number of tosses required to produce 'HTT' is equal to the average number of tosses required to produce 'HTH'.
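
Part of that intuition is, in fact, correct: any fixed three-toss window is equally likely to read 'HTH' or 'HTT'. Here is a quick sanity check in R (just a sketch; the seed is arbitrary and only there for reproducibility):

# In a long run of fair-coin tosses (1 = head, 0 = tail), each three-toss
# window matches '101' (HTH) and '100' (HTT) about 1/8 of the time
set.seed(42);
tosses <- rbinom(1e6, 1, 0.5);
n <- length(tosses);
windows <- paste(tosses[1:(n - 2)], tosses[2:(n - 1)], tosses[3:n], sep='');
mean(windows == '101');   # ~0.125
mean(windows == '100');   # ~0.125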

However, when you actually perform this experiment, you discover that the average number of tosses required to produce 'HTH' is larger than the average number of tosses required to produce 'HTT'. We can demonstrate this using the following R code:

# Number of tosses per trial; this should be fairly large to ensure (so
# far as we can) that the 'HTH' and 'HTT' patterns occur in each trial
n.tosses <- 50;

# Number of trials to run; this should be large so we get a good
# estimate for the average number of tosses required to generate each
# pattern ('HTH' or 'HTT')
n.trials <- 10000;

# Perform n.trials of coin tosses; the value of each toss is either
# '1' (head) or '0' (tail), as drawn from a binomial distribution with
# one trial and a probability of success of 0.5
generate.string <- function(x)
    { paste(rbinom(n.tosses, 1, 0.5), collapse=''); }
data <- sapply(1:n.trials, generate.string);

# Number of tosses for the patterns of interest in the data.
# We need the '+ 2' correction factor since the number of tosses is
# counted at the end of the pattern, but regexpr returns the index of
# where the pattern starts.
hth <- regexpr('101', data) + 2;
htt <- regexpr('100', data) + 2;

# Drop the (rare) trials in which a pattern never appeared; regexpr
# returns -1 in that case, which would otherwise skew the averages
hth <- hth[hth >= 3];
htt <- htt[htt >= 3];

# Print the average number of tosses required to produce each pattern
cat('HTH:', mean(hth), 'tosses\n');
cat('HTT:', mean(htt), 'tosses\n');

Running this simulation, we see that 'HTH' takes an average of ~10 tosses to produce, while 'HTT' takes an average of only ~8 tosses. The explanation lies in the fact that, when we are searching for 'HTT', a failed attempt can overlap with the start of the next attempt, whereas a failed attempt at 'HTH' cannot.

Borrowing from Donnelly's talk, suppose we have already tossed the 'HT' pattern, and the next toss is a 'H'. If we are looking for 'HTH', we are done. If we are looking for 'HTT', we have to continue looking, but we can use the 'H' that was just produced as the possible start of the 'HTT' pattern we are looking for.

On the other hand, suppose we have already tossed the 'HT' pattern, and the next toss is a 'T'. If we are looking for 'HTT', we are done. If we are looking for 'HTH', we have to continue looking; unfortunately, unlike the case in the last paragraph, we cannot use our most recent toss (a 'T') as the possible start of the 'HTH' pattern we are looking for. We are forced to wait for the next 'H' to show up, and are thus now behind by at least one toss. This explains why the 'HTH' pattern takes more tosses, on average, to generate than the 'HTT' pattern.
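
For readers who want to see where the exact figures come from before the formal write-up, here is one way to check them numerically (a sketch only; the expected.tosses function below is just illustrative bookkeeping, not something from Donnelly's talk): treat the current partial match as a state, write the "first-step" equations for the expected number of remaining tosses from each state, and solve the resulting small linear system.

# Expected number of tosses to first see 'pattern', computed by solving
# the first-step equations: from a partial match of length i - 1, one more
# toss either extends the match or knocks us back to the longest suffix
# that is still a prefix of the pattern
expected.tosses <- function(pattern) {
    m <- nchar(pattern);
    # longest suffix of s that is also a prefix of the pattern
    longest.prefix <- function(s) {
        for (k in min(nchar(s), m):0) {
            if (k == 0 ||
                substr(s, nchar(s) - k + 1, nchar(s)) == substr(pattern, 1, k))
                return(k);
        }
    }
    A <- diag(m);                # state i = partial match of length i - 1
    b <- rep(1, m);
    for (i in 1:m) {
        for (toss in c('H', 'T')) {
            k <- longest.prefix(paste(substr(pattern, 1, i - 1), toss, sep=''));
            if (k < m)           # k == m means the pattern is complete
                A[i, k + 1] <- A[i, k + 1] - 0.5;
        }
    }
    solve(A, b)[1];              # expected tosses starting from scratch
}

expected.tosses('HTH');   # 10
expected.tosses('HTT');   # 8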

Obviously, a formal, mathematical proof of this exists, but that is a topic for another post.

EDIT [31 Jan. 2008]: I have now added a follow-up post with the mathematical proof.
