What is the probability that a sequence of events completes within a given time interval?

If an event has probability p of occurring in some time interval, then the probability q that the event does not occur in that interval is:$$q=1-p$$

The probability of the event not occurring by time t will be:

$$P(q_1 \cap q_2 \cap...q_t) =q_1q_2...q_t = q^t$$

So the probability that it did occur by time t is one minus this value ($q^t$), which is equal to the cumulative sum of p times the probability it had not occurred up to that point ($q^{i-1}$):$$p_t=1-q^t=\sum\limits_{i=1}^tpq^{i-1}$$
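As a quick numerical check of this identity, the following R sketch compares the closed form $1-q^t$ with the running sum of $pq^{i-1}$. The value of p is purely illustrative, and the names `closed_form` and `running_sum` are my own.

```r
# Illustrative values (assumed, not derived from data)
p <- 0.025
q <- 1 - p
t <- 1:100

closed_form <- 1 - q^t                 # P(event has occurred by time t)
running_sum <- cumsum(p * q^(t - 1))   # sum over i = 1..t of p * q^(i-1)

# The two vectors should agree up to floating-point error
print(all.equal(closed_form, running_sum))
```

Note that `cumsum` over the vector `t` is what supplies the summation index here, so `q^(t - 1)` in the code plays the role of $q^{i-1}$ in the formula.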

If we are concerned with n independent events occurring with probabilities $p_1, p_2, ... p_n$ by time t, then:$$P(p_{1t} \cap p_{2t} \cap...p_{nt}) =p_{1t}p_{2t}...p_{nt}$$

If $p_1=p_2=\dots=p_n=p$ then the above will be simply $p_t^n$. So the probability that all n events have occurred by time t will be:

$$P(t_{Allevents}\leq t)=(1-q^t)^n=\left(\sum\limits_{i=1}^tpq^{i-1}\right)^n$$
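The expression for all n events having occurred by time t can be checked by simulation. The sketch below assumes each event's waiting time is geometric with the same p, and that the events are independent. The values of `n`, `t_max`, `n_sim`, and the seed are arbitrary choices of mine.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
p <- 0.025; q <- 1 - p
n <- 4                           # number of independent events (illustrative)
t_max <- 60                      # time horizon (illustrative)
n_sim <- 20000                   # number of simulated replicates

# rgeom() counts failures before the first success, so add 1 to get
# the interval in which each event occurs.
times <- matrix(rgeom(n_sim * n, prob = p) + 1, nrow = n_sim, ncol = n)

# Time at which the last of the n events occurs in each replicate
t_all <- apply(times, 1, max)

simulated <- mean(t_all <= t_max)   # empirical P(all events by t_max)
analytic  <- (1 - q^t_max)^n        # closed form from the text
print(c(simulated = simulated, analytic = analytic))
```

This only checks the "all events by time t" step. It does not test the ordering argument that follows.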

If there is only one sequence of these events that results in the outcome of interest (e.g. $t_1<t_2<...t_n$, where $t_i$ refers to time of occurrence), the probability it is the observed sequence will be one over the total number of permutations ($1/n!$). So:

$$P(t_{Sequence}\leq t)=\frac{(1-q^t)^n}{n!}=\frac{1}{n!}\left(\sum\limits_{i=1}^tpq^{i-1}\right)^n$$

That gives us the CDF. To get the PDF we take the first derivative with respect to t, which is:$$P(t\leq t_{Sequence}\leq t+1)=\frac{-nq^t\ln(q)(1-q^t)^{n-1}}{n!}$$
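For reference, the differentiation step can be written out explicitly. This treats t as continuous and uses the chain rule together with $\frac{d}{dt}q^t=q^t\ln(q)$:

$$\frac{d}{dt}\left[\frac{(1-q^t)^n}{n!}\right]=\frac{n(1-q^t)^{n-1}}{n!}\cdot\frac{d}{dt}\left(1-q^t\right)=\frac{n(1-q^t)^{n-1}}{n!}\cdot\left(-q^t\ln(q)\right)$$

Since $0<q<1$, $\ln(q)$ is negative, so the leading minus sign makes the whole expression positive.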

Given the above assumptions, we would expect the probability that the sequence of events completed at any given time interval to follow the PDF, shown in the lower row of plots:

    t=1:100; p=.025; q=1-p
    par(mfrow=c(2,4))
    for(n in c(1,2,4,6)){
      plot(t,(cumsum(p*q^(t-1))^n)/factorial(n), xlab="Time",
           ylab="P(t.Seq <= t)",main=paste(n, "Events"))
      lines(t,((1-q^(t))^n)/factorial(n))
    }
    for(n in c(1,2,4,6)){
      plot(t,(-n*(q^t)*log(q)*(1-q^t)^(n-1))/factorial(n), log="xy", xlab="Time",
           ylab="P(t<= t.Seq<= t+1)",main=paste(n, "Events"))
      lines(t[-1]-.5,diff(((1-q^(t))^n)/factorial(n)), col="Red")
    }

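[Figure: top row, the CDF $P(t_{Sequence}\leq t)$ against time for 1, 2, 4, and 6 events; bottom row, the corresponding PDF on log-log axes, with the lag-1 difference of the CDF overlaid in red.]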

I can find no flaw with the above reasoning. So my question is: what did Armitage and Doll calculate in the linked question, "I have an epidemiology question with logs"?

Edit:

From MichaelM's comment I see that the usual way to deal with this distribution is to calculate the probability an event occurs at time t, while above I have calculated the probability it occurs within an interval of time. Is there something wrong with doing this? It seems that when modeling incidence of cancer as in the linked question, we are dealing with intervals of time.

Edit 2:

I realized that if we don't like taking the derivative, the discrete alternative is already shown as the red line on the plot. The PMF is the lag-1 difference of the CDF:$$P(t\leq t_{Sequence}\leq t+1)=\frac{(1-q^{t+1})^n}{n!}-\frac{(1-q^t)^n}{n!}$$
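To put numbers on how the lag-1 difference compares with the derivative, here is a small R sketch. The choice of n = 4 and the helper `cdf` are illustrative assumptions of mine; p matches the value used in the plotting code.

```r
p <- 0.025; q <- 1 - p; n <- 4    # illustrative values
t <- 0:99

# CDF from the text, as a function of time
cdf <- function(t) (1 - q^t)^n / factorial(n)

pmf  <- cdf(t + 1) - cdf(t)                                   # lag-1 difference
dens <- -n * q^t * log(q) * (1 - q^t)^(n - 1) / factorial(n)  # derivative at t

# The differences telescope: summing them recovers cdf(100) - cdf(0)
print(all.equal(sum(pmf), cdf(100) - cdf(0)))

# Largest absolute gap between the two versions over this range
print(max(abs(pmf - dens)))
```

The difference is taken over the interval from t to t+1, while the derivative is evaluated at the left endpoint t, which is why the plotting code shifts the red line by half a step.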

This is even more straightforward. Where is my mistake?
