5. Confidence Intervals

Let’s go back to the main goal of inferential statistics:

The goal of inferential statistics is:
To give a credible answer to an interesting question using data. Moreover, the uncertainty that is due to sampling should be quantified.

For example, suppose that we are willing to assume that the population distribution is a member of the family of Poisson distributions with parameter $\lambda$. The (potentially interesting) question is what the value of $\lambda$ is (knowing this value would pinpoint the population distribution).

We have learned to obtain an estimator (possibly using the Method of Moments or the method of Maximum Likelihood), as well as its repeated sampling distribution. That is, we can formulate statements like this:

$$\hat{\lambda} = \bar{X} \overset{approx}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(\lambda, \frac{\lambda}{n}\right).$$

This statement does not really answer our question. We need to do a final step: re-formulate the statement above into a statement that clearly answers our question. We are now going to do this final step.

Suppose that we have a random sample from the population distribution at our disposal. We will discuss in this chapter how one can use the information about our estimator and its RSD to construct a (possibly approximate) 95% (or 99%) confidence interval for the unknown parameter $\lambda$. A confidence interval is a statement that could, in a particular example, look like this:

$$P(2.1 < \lambda < 3.4) = 0.95.$$

Without having a random sample, we did not have any knowledge about the value of the unknown parameter $\lambda$. After applying the method for constructing a confidence interval, to be discussed below, we end up stating that we are 95% sure that the confidence interval contains the true value of $\lambda$. We have put the information that we have about estimators and their RSD into a form that everybody is able to understand.

In other words, we have used our data (a random sample) and an assumption about the population distribution (that may or may not be credible) to answer our question (what is the value of $\lambda$) and quantify the uncertainty.

The uncertainty arises because we are using only a sample of individuals (or firms), instead of the whole population. If we had information about all individuals (or firms) in the population (i.e. an infinitely large sample), we would be able to obtain the true value of $\lambda$ with 100% certainty (if our estimator is consistent). As we have only a random sample of finite size at our disposal, the confidence interval that we have obtained would be different if we were to have had another sample of the same size at our disposal.

The animation below illustrates the correct interpretation of a confidence interval: Suppose that the population distribution is a Normal distribution with mean equal to zero. In practice, we wouldn’t know that the mean is zero. We use our sample and the method of obtaining a confidence interval, to be discussed below, and find (using our observed sample):

$$P(-0.08 < \mu < 0.43) = 0.95.$$ The interpretation is that we are 95% sure that this interval contains the (unknown) true value. The reason for this is as follows: if we were to draw 10 million samples of the same sample size, the (10 million) confidence intervals would contain the true value ($0$ in our case) in 95% of the cases.

Note that the only confidence interval that we obtain is the first one. The other confidence intervals are obtained using hypothetical samples (repeated samples that we do not have at our disposal). We are confident that our observed confidence interval contains the true value because 95% of all possible realisations of the confidence interval do.
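To make the interpretation concrete, the following short simulation mimics the animation (a sketch with made-up settings: $\mu = 0$, $\sigma = 2$, $n = 25$, and R's t.test function, which implements the interval derived later in this chapter). It draws many repeated samples, computes a 95% confidence interval from each, and records how often the interval contains the true mean:

set.seed(1)
covered <- replicate(10000, {
  x  <- rnorm(25, mean = 0, sd = 2)            # one repeated sample
  ci <- t.test(x, conf.level = 0.95)$conf.int  # its 95% confidence interval
  ci[1] < 0 & 0 < ci[2]                        # does it contain the true mean 0?
})
mean(covered)   # fraction of intervals containing 0; should be close to 0.95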

Note also that we always make the implicit assumption that the population is infinitely large. This simplifies the analysis a little bit. For instance, assuming that the population distribution is Bernoulli with parameter $p$ with $0 < p < 1$ implicitly assumes that the population is infinitely large: otherwise, $p$ cannot take all values between zero and one.

1 Constructing a Confidence Interval

We will now discuss the method that can be used to obtain confidence intervals.

The Method: Constructing a 95% Confidence Interval For $\theta$.

Step 1: Turn $\hat{\theta}$ into a pivotal quantity $PQ(\text{data}, \theta)$.
Step 2: Find the quantiles $a$ and $b$ such that $Pr(a < PQ < b) = 0.95$.
Step 3: Rewrite the expression to $Pr(.. < \theta < ..) = 0.95$.

The hardest part is step 1. A pivotal quantity (PQ) is a function of the data (our random sample) and θ, the parameter for which we want to construct a 95% confidence interval. This function should satisfy two conditions:

  1. The distribution of PQ should be completely known (for instance, N(0,1) or $F_{2,13}$).
  2. PQ itself should not depend on any other unknown parameters. It is only allowed to depend on θ.
  • Condition 1 allows us to perform step 2 of the method.
  • Condition 2 allows us to perform step 3 of the method.

The easiest way to clarify the method is to apply it to some examples.

2 Confidence Interval for the Mean (or Proportion)

For the case of a confidence interval for the mean of a single population, we discuss examples where the population distribution is Bernoulli, Poisson, Normal or Exponential.


2.1 Example 1: $X \sim N(\mu, \sigma^2)$

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Normal distributions. We are interested in finding a 95% confidence interval for the unknown population mean μ.

Step 1: Turn $\hat{\mu}$ into a pivotal quantity $PQ(\text{data}, \mu)$:

The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on $\mu$. It therefore makes sense to consider the sample average (this is the Method of Moments estimator of $\mu$, as well as the Maximum Likelihood estimator of $\mu$). We know that linear combinations of Normals are Normal, so we know the RSD of $\bar{X}$:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

The quantity $\bar{X}$ is not a pivotal quantity, for two reasons:

  • The RSD of $\bar{X}$ is not completely known: it depends on the unknown parameters $\mu$ and $\sigma^2$.
  • $\bar{X}$ does depend on the data, but does not depend on $\mu$.

The first problem can be solved by standardising. Let’s try the following:

$$\frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}} \sim N(0,1).$$

The quantity on the left-hand side satisfies the first condition of a pivotal quantity: the standard normal distribution is completely known (i.e. it has no unknown parameters). The quantity itself does not satisfy the second condition: it depends on $\mu$, but also on the unknown parameter $\sigma^2$.

We can solve this problem by standardising with the estimated variance $S^2$ (instead of $\sigma^2$):

$$PQ = \frac{\bar{X} - \mu}{\sqrt{S^2/n}} \sim t_{n-1}.$$

We have obtained a pivotal quantity: PQ itself depends on μ and not on any other unknown parameters. Moreover, the RSD of PQ is completely known. We can progress to step two.

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

We need to find the values a and b such that P(a<PQ<b)=0.95. In other words, we need to find the following:

  • $a = q^{t_{n-1}}_{0.025}$: the 2.5% quantile of the $t$ distribution with $n-1$ degrees of freedom.
  • $b = q^{t_{n-1}}_{0.975}$: the 97.5% quantile of the $t$ distribution with $n-1$ degrees of freedom.

These are easily found using the statistical tables or by using R. If, for example, we have a sample of size n=50:

qt(0.025, df=49)
## [1] -2.009575
qt(0.975, df=49)
## [1] 2.009575

As the t distribution is symmetric, it would have sufficed to calculate only one of the two quantiles.

We now have the following true statement to work with:

$$P\left(q^{t_{n-1}}_{0.025} < \frac{\bar{X} - \mu}{\sqrt{S^2/n}} < q^{t_{n-1}}_{0.975}\right) = 0.95.$$

Step 3: Rewrite the expression to $Pr(.. < \mu < ..) = 0.95$.

Using a sample of size n=50, we can calculate the values of $\bar{X}$ and $S^2$. The pivotal quantity that appears in the middle of our probability statement therefore only depends on $\mu$. This is by construction, as it was one of the conditions for a pivotal quantity. As we saw, the quantiles are also easily determined.

We set out to obtain a statement like $P(.. < \mu < ..) = 0.95$. To get there, we only need to rewrite the expression inside the probability:

$$P\left(q^{t_{n-1}}_{0.025} < \frac{\bar{X} - \mu}{\sqrt{S^2/n}} < q^{t_{n-1}}_{0.975}\right) = 0.95$$
$$\Leftrightarrow P\left(q^{t_{n-1}}_{0.025}\sqrt{\frac{S^2}{n}} < \bar{X} - \mu < q^{t_{n-1}}_{0.975}\sqrt{\frac{S^2}{n}}\right) = 0.95$$
$$\Leftrightarrow P\left(-\bar{X} + q^{t_{n-1}}_{0.025}\sqrt{\frac{S^2}{n}} < -\mu < -\bar{X} + q^{t_{n-1}}_{0.975}\sqrt{\frac{S^2}{n}}\right) = 0.95$$
$$\Leftrightarrow P\left(\bar{X} - q^{t_{n-1}}_{0.975}\sqrt{\frac{S^2}{n}} < \mu < \bar{X} - q^{t_{n-1}}_{0.025}\sqrt{\frac{S^2}{n}}\right) = 0.95$$
$$\Leftrightarrow P\left(\bar{X} - q^{t_{n-1}}_{0.975}\sqrt{\frac{S^2}{n}} < \mu < \bar{X} + q^{t_{n-1}}_{0.975}\sqrt{\frac{S^2}{n}}\right) = 0.95.$$

Note that $a < -\mu < b$ is equivalent to $-b < \mu < -a$, and that $q^{t_{n-1}}_{0.025} = -q^{t_{n-1}}_{0.975}$, as the $t$ distribution is symmetric. We can now use the values of $\bar{X}$ and $S^2$ and the values of the quantiles to obtain the 95% confidence interval for $\mu$.

For illustration purposes, let’s draw a random sample of size 50 from $N(\mu = 5, \sigma^2 = 100)$:

set.seed(12345)
data <- rnorm(50, mean=5, sd=10)
data
##  [1]  10.8552882  12.0946602   3.9069669   0.4650283  11.0588746 -13.1795597
##  [7]  11.3009855   2.2381589   2.1584026  -4.1932200   3.8375219  23.1731204
## [13]   8.7062786  10.2021646  -2.5053199  13.1689984  -3.8635752   1.6842241
## [19]  16.2071265   7.9872370  12.7962192  19.5578508  -1.4432843 -10.5313741
## [25] -10.9770952  23.0509752   0.1835264  11.2037980  11.1212349   3.3768902
## [31]  13.1187318  26.9683355  25.4919034  21.3244564   7.5427119   9.9118828
## [37]   1.7591342 -11.6205024  22.6773385   5.2580105  16.2851083 -18.8035806
## [43]  -5.6026555  14.3714054  13.5445172  19.6072940  -9.1309878  10.6740325
## [49]  10.8318765  -8.0679883

We calculate the sample average and the adjusted sample variance:

xbar <- mean(data)
xbar
## [1] 6.795663
S2 <- var(data)
S2
## [1] 120.2501

We calculate the quantiles (it turns out that we only need b):

b <- qt(0.975, df=49)
b
## [1] 2.009575

The confidence interval is:

left <- xbar - b * sqrt( S2/50 )
left
## [1] 3.6792
right <- xbar + b * sqrt( S2/50 )
right
## [1] 9.912126

We conclude that P(3.68<μ<9.91)=0.95. The 95% confidence interval is (3.68,9.91). Because we simulated the data ourselves, we know that the true value of μ is equal to 5. In practice, one would not know this.

The t.test command can do the calculations for us:

t.test(data, conf.level=0.95)$"conf.int"
## [1] 3.679200 9.912126
## attr(,"conf.level")
## [1] 0.95

A 99% confidence interval for μ would be wider:

t.test(data, conf.level=0.99)$"conf.int"
## [1]  2.639575 10.951750
## attr(,"conf.level")
## [1] 0.99

This occurs because $q^{t_{n-1}}_{0.995}$ is larger than $q^{t_{n-1}}_{0.975}$:

qt(0.995, df=49)
## [1] 2.679952

2.1.1 Approximation using CLT

We are considering the situation where we either do not know anything about the population distribution, or we assume that the population distribution is not Normal (e.g. Bernoulli(p)). We are interested in finding an approximate 95% confidence interval for the unknown population mean μ (or p, in the Bernoulli case).

The narrative does not have to be changed much:

Step 1: Turn $\hat{\mu}$ into a pivotal quantity $PQ(\text{data}, \mu)$:

The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on μ. It therefore makes sense to consider the sample average. We know, from the central limit theorem, that

$$\bar{X} \overset{approx}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right).$$

We can obtain a pivotal quantity by standardising with the estimated variance:

$$PQ = \frac{\bar{X} - \mu}{\sqrt{S^2/n}} \overset{approx}{\sim} N(0,1).$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

The 97.5% quantile of the standard normal distribution is:

b2 <- qnorm(0.975, mean=0, sd=1)
b2
## [1] 1.959964

We can therefore say that the following probability statement is correct:

$$P\left(-q^{N(0,1)}_{0.975} < \frac{\bar{X} - \mu}{\sqrt{S^2/n}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$

Step 3: Rewrite the expression to $Pr(.. < \mu < ..) = 0.95$.

Using the same steps as before, we obtain:

$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \mu}{\sqrt{S^2/n}} < q^{N(0,1)}_{0.975}\right) = 0.95$$
$$\Leftrightarrow P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{S^2}{n}} < \mu < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{S^2}{n}}\right) = 0.95.$$

Hence, without assuming that the population distribution is Normal, we obtain the following confidence interval:

c(xbar - b2 * sqrt(S2/50), xbar + b2 * sqrt(S2/50))
## [1] 3.756137 9.835188

That is, we conclude that $P(3.76 < \mu < 9.84) = 0.95$. Not assuming that the population distribution is Normal makes the interval approximate; note that it is in fact slightly narrower than the $t$-based interval, because the normal quantile 1.96 is smaller than the $t$ quantile 2.01.

2.2 Example 2: $X \sim \text{Poisson}(\lambda)$

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Poisson($\lambda$) distributions. We are interested in finding a 95% confidence interval for the unknown population mean λ.

Step 1: Turn $\hat{\lambda}$ into a pivotal quantity $PQ(\text{data}, \lambda)$:

The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on λ. It therefore makes sense to consider the sample average (this is the Method of Moments estimator of λ, as well as the Maximum Likelihood estimator of λ). We do not know the repeated sampling distribution of $\bar{X}$ in this case. All we know is that

$$\sum_{i=1}^{n} X_i \sim \text{Poisson}(n\lambda).$$

This is not a pivotal quantity: the RSD of the sum is not completely known (it depends on λ) and the sum itself does not depend on λ. We cannot standardise to get an RSD that is completely known (as we could with the Normal distribution):

$$\frac{\sum_{i=1}^{n} X_i - n\lambda}{\sqrt{n\lambda}} \sim \; ?$$

Without a pivotal quantity, we cannot proceed to step 2. We are going to have to use a large sample approximation.

2.2.1 Approximation using CLT

We are now interested in finding an approximate 95% confidence interval for the unknown population mean λ.

Step 1: Turn $\hat{\lambda}$ into a pivotal quantity $PQ(\text{data}, \lambda)$:

As $E[X] = Var[X] = \lambda$ for Poisson($\lambda$) distributions, the central limit theorem states that

$$\hat{\lambda} = \bar{X} \overset{approx}{\sim} N\left(\lambda, \frac{\lambda}{n}\right).$$

We can now standardise:

$$\frac{\bar{X} - \lambda}{\sqrt{\lambda/n}} \overset{approx}{\sim} N(0,1).$$

This is a pivotal quantity. However, we can also standardise with the estimated variance:

$$\frac{\bar{X} - \lambda}{\sqrt{\bar{X}/n}} \overset{approx}{\sim} N(0,1).$$

This is also a pivotal quantity. Looking ahead to step 3, this pivotal quantity will be easier to work with.

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

qnorm(0.975)
## [1] 1.959964

The required quantile is $q^{N(0,1)}_{0.975} \approx 1.96$. The probability statement is equal to

$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \lambda}{\sqrt{\bar{X}/n}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$

Step 3: Rewrite the expression to $Pr(.. < \lambda < ..) = 0.95$.

Rewriting this statement gives

$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \lambda}{\sqrt{\bar{X}/n}} < q^{N(0,1)}_{0.975}\right) = 0.95$$
$$\Leftrightarrow P\left(q^{N(0,1)}_{0.025}\sqrt{\frac{\bar{X}}{n}} < \bar{X} - \lambda < q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95$$
$$\Leftrightarrow P\left(-\bar{X} + q^{N(0,1)}_{0.025}\sqrt{\frac{\bar{X}}{n}} < -\lambda < -\bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95$$
$$\Leftrightarrow P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}} < \lambda < \bar{X} - q^{N(0,1)}_{0.025}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95$$
$$\Leftrightarrow P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}} < \lambda < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95.$$

As an example, consider a random sample of size n=25 from the Poisson distribution (we take λ=10):

set.seed(12345)
data <- rpois(25, lambda=10)
data
##  [1] 11 12  9  8 11  4 11  7  8 11  9 10  9 11 13 10 12 14  7  4 15  8 11 11  9

We calculate the sample average and the 97.5% quantile:

xbar <- mean(data)
xbar
## [1] 9.8
b <- qnorm(0.975)
b
## [1] 1.959964

The approximate 95% confidence interval is:

c( xbar - b * sqrt(xbar/25) , xbar + b * sqrt(xbar/25))
## [1]  8.572868 11.027132

We conclude that P(8.57<λ<11.03)=0.95.
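As a cross-check (a sketch), base R's poisson.test gives an exact confidence interval for $\lambda$ based on the total count $\sum_{i=1}^{n} X_i$ with $T = 25$ observations; because it is exact rather than CLT-based, it need not coincide with the interval above:

# Exact confidence interval for lambda from the total count (need not match
# the approximate CLT interval exactly).
poisson.test(sum(data), T = 25)$conf.int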


2.3 Example 3: $X \sim \text{Exp}(\lambda)$

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Exponential distributions. We could be interested in finding a 95% confidence interval for the parameter λ, or we could be interested in finding a 95% confidence interval for the mean $1/\lambda$.

We focus on the first question. The parameter λ can be estimated by $\hat{\lambda} = 1/\bar{X}$, the Method of Moments and Maximum Likelihood estimator.

Step 1: Turn $\hat{\lambda}$ into a pivotal quantity $PQ(\text{data}, \lambda)$:

If we have a random sample from the Exponential distribution, all we know is that

$$\sum_{i=1}^{n} X_i \sim \text{Gamma}(n, \lambda).$$

This result does not allow us to find the exact RSD of $\hat{\lambda} = 1/\bar{X}$, let alone use this to find a pivotal quantity. We will therefore have to resort to a large sample approximation.

2.3.1 Approximation using CLT

If we have a random sample from the Exponential distribution, which has $E[X] = 1/\lambda$ and $Var[X] = 1/\lambda^2$, the CLT states that

$$\bar{X} \overset{approx}{\sim} N\left(\frac{1}{\lambda}, \frac{1}{n\lambda^2}\right).$$

Our estimator $\hat{\lambda} = 1/\bar{X}$ is a nonlinear function of an approximate Normal. We can therefore use the Delta Method, discussed in section 1.6 of the Estimators chapter, to obtain:

$$\hat{\lambda} = \frac{1}{\bar{X}} \overset{approx}{\sim} N\left(\lambda, \lambda^4 \cdot \frac{1}{n\lambda^2}\right) = N\left(\lambda, \frac{\lambda^2}{n}\right).$$

This is not a pivotal quantity.
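Before using this approximation, a quick simulation can make it plausible (a sketch, using the same values as the example below: $\lambda = 2$, $n = 30$). The spread of $1/\bar{X}$ over repeated samples should be roughly $\lambda/\sqrt{n}$:

# Simulation check of the Delta Method approximation.
set.seed(1)
lam <- 2; n <- 30
lam_hat <- replicate(10000, 1/mean(rexp(n, rate = lam)))
sd(lam_hat)     # spread of 1/xbar over repeated samples ...
lam/sqrt(n)     # ... should be roughly lambda/sqrt(n)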

Step 1: Turn ˆλ into a pivotal quantity PQ(data,λ):

We can standardise to obtain a quantity with an approximately standard normal distribution:

$$\frac{\frac{1}{\bar{X}} - \lambda}{\sqrt{\lambda^2/n}} \overset{approx}{\sim} N(0,1).$$

Looking ahead to step 3, however, we can obtain a pivotal quantity that is easier to work with by standardising with the estimated variance:

$$\frac{\frac{1}{\bar{X}} - \lambda}{\sqrt{\frac{1}{n\bar{X}^2}}} \overset{approx}{\sim} N(0,1).$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

As before, the 97.5% quantile of N(0,1) is (more or less) 1.96.

We can make the following probability statement:

$$P\left(q^{N(0,1)}_{0.025} < \frac{\frac{1}{\bar{X}} - \lambda}{\sqrt{\frac{1}{n\bar{X}^2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$

Step 3: Rewrite the expression to $Pr(.. < \lambda < ..) = 0.95$.

Rewriting this statement (in the usual way) gives

$$P\left(\frac{1}{\bar{X}} - q^{N(0,1)}_{0.975}\sqrt{\frac{1}{n\bar{X}^2}} < \lambda < \frac{1}{\bar{X}} + q^{N(0,1)}_{0.975}\sqrt{\frac{1}{n\bar{X}^2}}\right) = 0.95.$$

A confidence interval for the mean $1/\lambda$ would start from

$$\bar{X} \overset{approx}{\sim} N\left(\frac{1}{\lambda}, \frac{1}{n\lambda^2}\right).$$

Standardising with the estimated variance gives

$$\frac{\bar{X} - \frac{1}{\lambda}}{\sqrt{\bar{X}^2/n}} \overset{approx}{\sim} N(0,1).$$

Step 3 then gives

$$P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}^2}{n}} < \frac{1}{\lambda} < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}^2}{n}}\right) = 0.95.$$

As an example, we draw a random sample of size n=30 from the Exp(λ=2) distribution:

set.seed(12345)
data <- rexp(30, rate=2)
data
##  [1] 0.220903896 0.263736378 0.404233657 0.009224336 0.227705254 0.011969070
##  [7] 3.201094619 0.628980261 0.481094044 0.909001520 0.113243755 0.223381456
## [13] 0.626230560 0.198543639 0.044057864 0.861082101 2.692431711 0.943996964
## [19] 0.181833623 0.586781837 0.569824730 0.213983048 0.729273719 0.282193281
## [25] 0.614559271 0.365936084 0.489022937 1.401355220 0.125542797 0.109180467

The approximate 95% confidence interval for the parameter λ is:

n <- 30
xbar <- mean(data)
b <- qnorm(0.975)
c(1/xbar - b * sqrt( 1/(n*(xbar^2)) ), 1/xbar + b * sqrt( 1/(n*(xbar^2))) )
## [1] 1.086543 2.297476

The approximate 95% confidence interval for the mean of the population distribution $1/\lambda$ is:

c(xbar - b * sqrt( xbar^2 /n)  , xbar + b * sqrt( xbar^2 /n))
## [1] 0.3795258 0.8025008

2.4 Example 4: $X \sim \text{Bernoulli}(p)$

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Bernoulli distributions. We are interested in finding the 95% confidence interval for the mean (or: population proportion) p.

Both the Method of Moments estimator and the Maximum Likelihood estimator of p are equal to the sample average.

Step 1: Turn $\hat{p}$ into a pivotal quantity $PQ(\text{data}, p)$:

When sampling from the Bernoulli(p) distribution, we only have the following result:

$$\sum_{i=1}^{n} X_i \sim \text{Binomial}(n, p).$$

We can find the exact distribution by observing that

$$P\left(\sum_{i=1}^{n} X_i = k\right) = P\left(\bar{X} = \frac{k}{n}\right) = \binom{n}{k}\, p^k (1-p)^{n-k} \quad \text{for } k = 0, 1, \ldots, n.$$

The exact RSD of $\bar{X}$ has the Binomial probabilities, but assigned to $0, \frac{1}{n}, \frac{2}{n}, \ldots, 1$ instead of $0, 1, 2, \ldots, n$. The quantity $\sum_{i=1}^{n} X_i$ is not pivotal, and we cannot standardise to make it so:

$$\frac{\sum_{i=1}^{n} X_i - np}{\sqrt{np(1-p)}} \sim \; ?$$

We will have to resort to a large sample approximation.

2.4.1 Approximation using CLT

We are interested in finding an approximate 95% confidence interval for the population proportion p.

Step 1: Turn $\hat{p}$ into a pivotal quantity $PQ(\text{data}, p)$:

If $X \sim \text{Bernoulli}(p)$, we know that $E[X] = p$ and $Var[X] = p(1-p)$. The central limit theorem then states that

$$\bar{X} \overset{approx}{\sim} N\left(p, \frac{p(1-p)}{n}\right).$$

This is not a pivotal quantity. Standardising gives:

$$\frac{\bar{X} - p}{\sqrt{\frac{p(1-p)}{n}}} \overset{approx}{\sim} N(0,1).$$

Although this is a pivotal quantity, we prefer to standardise with the estimated variance:

$$\frac{\bar{X} - p}{\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}} \overset{approx}{\sim} N(0,1),$$

as this will simplify step 3 of the method.

The probability statement is:

$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - p}{\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

As before, the 97.5% quantile of N(0,1) is about 1.96.

Step 3: Rewrite the expression to $Pr(.. < p < ..) = 0.95$.

We can rewrite this statement in the usual way, to obtain

$$P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}} < p < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}\right) = 0.95.$$

As an example, we draw a random sample of size 40 from a Bernoulli(p=0.3) population:

set.seed(123456)
data <- rbinom(40, size=1, prob=0.3)
data
##  [1] 1 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1
## [39] 1 0

We calculate the sample average and the quantile:

xbar <- mean(data)
xbar
## [1] 0.475
b <- qnorm(0.975)

The approximate 95% confidence interval for p is:

c( xbar - b * (sqrt(xbar*(1-xbar)/40)), xbar + b * (sqrt(xbar*(1-xbar)/40)) )
## [1] 0.320245 0.629755

Notice that, using this particular sample, the true value p=0.3 is not in the interval. We were unlucky, as this happens in only 5% of the (infinitely many) possible repeated samples. Each repeated sample gives a confidence interval, and in 95% of the cases this interval contains the true value. This is precisely how a confidence interval should be interpreted.
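For comparison (a sketch), base R's binom.test gives an exact (Clopper–Pearson) interval based on the number of successes; it will generally differ somewhat from the approximate interval above, especially for small samples:

# Exact (Clopper-Pearson) interval based on the number of successes.
binom.test(sum(data), n = 40)$conf.int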



3 Confidence Interval for the Difference in Means (or Proportions)

For the case that we want a confidence interval for the difference in means of two populations, we will discuss examples where the population distributions are Normal and Bernoulli.

Suppose that we have two random samples. The first random sample $(X_{11}, \ldots, X_{1n_1})$ is a sample of incomes from Portugal and has size $n_1$. The second random sample $(X_{21}, \ldots, X_{2n_2})$ is a sample of incomes from England and has size $n_2$. We want to construct a 95% confidence interval for the difference in means: $\mu_1 - \mu_2$.


3.1 Example 5: $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$

Suppose that we are willing to assume that both income distributions are members of the Normal family of distributions.

Step 1: Turn $\bar{X}_1 - \bar{X}_2$ into a pivotal quantity $PQ(\text{data}, \mu_1 - \mu_2)$:

From example 5 of the RSD chapter, we know that:

$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right).$$

This is not a pivotal quantity. To obtain a pivotal quantity, we need to standardise. How we standardise will depend on whether or not $\sigma_1^2$ and $\sigma_2^2$ are known. If they are known, we will denote them by $\sigma_{10}^2$ and $\sigma_{20}^2$.

3.1.1 Variances Known: $\sigma_{10}^2$ and $\sigma_{20}^2$

If the variances are known, we can standardise using the known variances and obtain a repeated sampling distribution that is standard-normal:

$$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}}} \sim N(0,1).$$

This is a pivotal quantity. The probability statement is:

$$P\left(q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

The 97.5% quantile of N(0,1) is about 1.96.

Step 3: Rewrite the expression to $Pr(.. < \mu_1 - \mu_2 < ..) = 0.95$.

We can rewrite the probability statement in the usual way:

$$P\left((\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975}\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975}\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}}\right) = 0.95.$$

As an example, we draw a random sample of size $n_1 = 50$ from $N(\mu_1 = 10, \sigma_1^2 = 100)$ and another random sample of size $n_2 = 60$ from $N(\mu_2 = 15, \sigma_2^2 = 81)$. We know that the distributions are Normal and that $\sigma_1^2 = 100$ and $\sigma_2^2 = 81$. We do not know the values of $\mu_1$ and $\mu_2$.

set.seed(12345)
data1 <- rnorm(50, mean=10, sd=10)
data1
##  [1]  15.8552882  17.0946602   8.9069669   5.4650283  16.0588746  -8.1795597
##  [7]  16.3009855   7.2381589   7.1584026   0.8067800   8.8375219  28.1731204
## [13]  13.7062786  15.2021646   2.4946801  18.1689984   1.1364248   6.6842241
## [19]  21.2071265  12.9872370  17.7962192  24.5578508   3.5567157  -5.5313741
## [25]  -5.9770952  28.0509752   5.1835264  16.2037980  16.1212349   8.3768902
## [31]  18.1187318  31.9683355  30.4919034  26.3244564  12.5427119  14.9118828
## [37]   6.7591342  -6.6205024  27.6773385  10.2580105  21.2851083 -13.8035806
## [43]  -0.6026555  19.3714054  18.5445172  24.6072940  -4.1309878  15.6740325
## [49]  15.8318765  -3.0679883
data2 <- rnorm(60, mean=15, sd=9)
data2
##  [1] 10.136525 32.529234 15.482312 18.164966  8.961211 17.501583 21.220541
##  [8] 22.414158 34.305585 -6.122496 16.346328  2.917217 19.979728 29.309666
## [15]  9.718084 -1.491396 22.993255 29.341396 19.651692  3.338955 15.491540
## [22]  7.938156  5.555825 35.974608 27.624348 23.483408 22.436325  7.696136
## [29] 19.286235 24.191326 20.808448 24.388292 12.260678 37.293998 23.740986
## [36] 31.803893 21.048382 12.228420 19.828713 22.423831  6.324887  7.304257
## [43] 31.982522 11.473626  6.174303 21.185989 10.454608 34.419478  9.601822
## [50]  8.749080 17.015329  4.593990 18.801767  3.077203 16.269759 10.175568
## [57] 12.195545 29.004987 10.967700 17.890112

We calculate the sample averages:

xbar1 <- mean(data1)
xbar1
## [1] 11.79566
xbar2 <- mean(data2)
xbar2
## [1] 17.16441
b <- qnorm(0.975)

The 95% confidence interval for $\mu_1 - \mu_2$ is:

c( (xbar1 - xbar2) - b * (sqrt(100/50 + 81/60 )), (xbar1 - xbar2) + b * (sqrt(100/50 + 81/60 )) )
## [1] -8.956071 -1.781425

The true difference in means is equal to $\mu_1 - \mu_2 = 10 - 15 = -5$.

3.1.2 Variances Unknown but the Same: $\sigma_1^2 = \sigma_2^2 = \sigma^2$

If the variances $\sigma_1^2$ and $\sigma_2^2$ are not known, we need to estimate them. If $\sigma_1^2$ and $\sigma_2^2$ are unknown, but known to be equal (to the number $\sigma^2$, say), we can estimate the single unknown variance using only the first sample or using only the second sample. It would be better, however, to use the information contained in both samples. As discussed in example 5 of the RSD chapter, we can use the pooled adjusted sample variance $S^2_p$, leading to:

$$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}}} \sim t_{n_1+n_2-2}.$$

This is a pivotal quantity. The probability statement is:

$$P\left(q^{t_{n_1+n_2-2}}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}}} < q^{t_{n_1+n_2-2}}_{0.975}\right) = 0.95.$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

In our example, $n_1 = 50$ and $n_2 = 60$. The 97.5% quantile of the $t_{108}$ distribution is:

q2 <- qt(0.975, df=108)
q2
## [1] 1.982173

Step 3: Rewrite the expression to $Pr(.. < \mu_1 - \mu_2 < ..) = 0.95$.

We can rewrite the expression in the usual way:

$$P\left((\bar{X}_1 - \bar{X}_2) - q^{t_{n_1+n_2-2}}_{0.975}\sqrt{S^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{t_{n_1+n_2-2}}_{0.975}\sqrt{S^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}}\right) = 0.95.$$

As an example, we again take a random sample of size $n_1 = 50$ from $N(\mu_1 = 10, \sigma_1^2 = 100)$ and another random sample of size $n_2 = 60$ from $N(\mu_2 = 15, \sigma_2^2 = 100)$. Note that, this time, the variances are the same, but we do not know their value.

set.seed(12345)
data1 <- rnorm(50, mean=10, sd=10)
data2 <- rnorm(60, mean=15, sd=10)
S2_1 <- var(data1)
S2_1
## [1] 120.2501
S2_2 <- var(data2)
S2_2
## [1] 121.1479

We obtain the following confidence interval:

S2_p <- (49*S2_1 + 59*S2_2)/(50 + 60 - 2)
S2_p
## [1] 120.7406
c( (xbar1 - xbar2) - q2 * sqrt( (S2_p*(1/50 + 1/60)) ), (xbar1 - xbar2) + q2 * sqrt( (S2_p*(1/50 + 1/60)) ) )
## [1] -9.539399 -1.198097

Not knowing the variances (but knowing that they are the same) made the confidence interval wider.
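The same pooled-variance construction is available through t.test by setting var.equal=TRUE (a sketch; note that t.test recomputes the sample means from data1 and data2, so its numbers need not coincide exactly with the hand calculation above, which reuses xbar1 and xbar2 from the previous section):

# Pooled-variance (equal variances) interval computed by t.test.
t.test(data1, data2, var.equal = TRUE, conf.level = 0.95)$conf.int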

3.1.3 Variances Unknown, Approximation Using CLT

If the variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and not known to be the same, we have to estimate both of them. Standardising with two estimated variances does not lead to a $t$ distribution, so we have no exact RSD. We can approximate the RSD using the Central Limit Theorem:

$$\bar{X}_1 - \bar{X}_2 \overset{approx}{\sim} N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right).$$

Standardising with two estimated variances gives:

$$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \overset{approx}{\sim} N(0,1).$$

This is a pivotal quantity.

The probability statement is:

$$P\left(q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

The 97.5% quantile of N(0,1) is about 1.96.

Step 3: Rewrite the expression to $Pr(.. < \mu_1 - \mu_2 < ..) = 0.95$.

Rewriting in the usual way gives:

$$P\left((\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right) = 0.95.$$

We generate the data again according to our first example:

data1 <- rnorm(50, mean=10, sd=10)
xbar1 <- mean(data1)
data2 <- rnorm(60, mean=15, sd=9)
xbar2 <- mean(data2)

In this case, we do not know $\sigma_1^2$ and $\sigma_2^2$, so we estimate them:

S2_1 <- var(data1)
S2_1
## [1] 127.94
S2_2 <- var(data2)
S2_2
## [1] 74.59016

The 95% confidence interval is now:

b <- qnorm(0.975)
c( (xbar1 - xbar2) - b * sqrt(S2_1/50 + S2_2/60), (xbar1 - xbar2) + b * sqrt(S2_1/50 + S2_2/60)  )
## [1] -9.588273 -1.944948
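For comparison (a sketch), t.test without var.equal=TRUE computes the closely related Welch interval, which replaces the normal quantile with a $t$ quantile based on estimated degrees of freedom; for samples of this size the two intervals should be close:

# Welch interval (uses a t quantile with estimated degrees of freedom).
t.test(data1, data2, conf.level = 0.95)$conf.int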

3.2 Example 6: $X_1 \sim \text{Bernoulli}(p_1)$ and $X_2 \sim \text{Bernoulli}(p_2)$

We are now assuming that $X_1$ and $X_2$ both have Bernoulli distributions, with $E[X_1] = p_1$, $Var[X_1] = p_1(1-p_1)$ and $E[X_2] = p_2$, $Var[X_2] = p_2(1-p_2)$, respectively. For example, $p_1$ could represent the unemployment rate in Portugal and $p_2$ the unemployment rate in England. We want to obtain a 95% confidence interval for the difference in unemployment rates: $p_1 - p_2$.

Step 1: Turn $\bar{X}_1 - \bar{X}_2$ into a pivotal quantity $PQ(\text{data}, p_1 - p_2)$:

We base our analysis of $p_1 - p_2$ on $\bar{X}_1 - \bar{X}_2$, the difference between the fraction of unemployed in sample 1 and the fraction of unemployed in sample 2. The central limit theorem states that both sample averages have an RSD that is approximately Normal:

$$\bar{X}_1 \overset{approx}{\sim} N\left(p_1, \frac{p_1(1-p_1)}{n_1}\right) \quad \text{and} \quad \bar{X}_2 \overset{approx}{\sim} N\left(p_2, \frac{p_2(1-p_2)}{n_2}\right).$$

As $\bar{X}_1 - \bar{X}_2$ is a linear combination of (approximately) Normals, its RSD is (approximately) Normal:

$$\bar{X}_1 - \bar{X}_2 \overset{approx}{\sim} N\left(p_1 - p_2, \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\right).$$

This is not a pivotal quantity. Note that we have used that $Cov[\bar{X}_1, \bar{X}_2] = 0$: we assume that the two (random) samples are drawn independently of each other.

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

Section 3.1.1 tells us how to standardise with known variances:

$$\frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}} \overset{approx}{\sim} N(0,1).$$

We can estimate $p_1$ by $\bar{X}_1$ and $p_2$ by $\bar{X}_2$. Using section 3.1.3 of the RSD chapter, standardising with the estimated variances gives:

$$\frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}} \overset{approx}{\sim} N(0,1).$$

This is a pivotal quantity. The probability statement is:

$$P\left(q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$

Step 3: Rewrite the expression to $Pr(.. < p_1 - p_2 < ..) = 0.95$.

We can rewrite the statement in the usual way:

$$P\left((\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}} < p_1 - p_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}\right) = 0.95.$$

As an example, we draw a random sample of size n1=50 from Bernoulli(p1=0.7), and another random sample of size n2=40 from Bernoulli(p2=0.5):

set.seed(12345)
data1 <- rbinom(50, size=1, prob=0.7)
data1
##  [1] 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 0
## [39] 1 1 0 1 0 0 1 1 1 1 1 1
data2 <- rbinom(40, size=1, prob=0.5)
data2
##  [1] 1 1 0 0 1 0 1 0 0 0 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 1 1
## [39] 1 0

We calculate the sample averages:

xbar1 <- mean(data1)
xbar1
## [1] 0.68
xbar2 <- mean(data2)
xbar2
## [1] 0.55
diff <- xbar1 - xbar2
diff
## [1] 0.13

The approximate 95% confidence interval for $p_1 - p_2$ is equal to:

b <- qnorm(0.975)
c(  diff - b * sqrt(xbar1*(1-xbar1)/50 + xbar2*(1-xbar2)/40), diff + b * sqrt(xbar1*(1-xbar1)/50 + xbar2*(1-xbar2)/40))
## [1] -0.07121395  0.33121395

The true difference $p_1 - p_2$ is equal to $0.7 - 0.5 = 0.2$.
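A similar interval for $p_1 - p_2$ can be obtained with prop.test (a sketch; by default prop.test applies a continuity correction, so correct=FALSE is used here to get essentially the Wald-type interval computed above):

# Confidence interval for p1 - p2 based on the two sample counts.
prop.test(x = c(sum(data1), sum(data2)), n = c(50, 40), correct = FALSE)$conf.int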



4 Confidence Interval for the Variance

For the case of a confidence interval for the variance of a single population, we only discuss the example where the population distribution is Normal.


4.1 Example 7: $X \sim N(\mu, \sigma^2)$

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Normal distributions. We are now interested in constructing a 95% confidence interval for the variance of the population distribution: $\sigma^2$. The natural estimator for the population variance is the adjusted sample variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$

We do not know the repeated sampling distribution of $S^2$.

Step 1: Turn $S^2$ into a pivotal quantity $PQ(\text{data}, \sigma^2)$:

Although we do not know the RSD of $S^2$, there is one thing that we do know:

$$\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

This follows because a sum of $n$ squared standard normals has a $\chi^2_n$ distribution. The quantity $\frac{(n-1)S^2}{\sigma^2}$ is exactly this, except that $\mu$ is estimated by $\bar{X}$, which costs one degree of freedom. The quantity $\frac{(n-1)S^2}{\sigma^2}$ is a pivotal quantity: the quantity itself only depends on our random sample and the parameter of interest $\sigma^2$. Moreover, the $\chi^2_{n-1}$ distribution contains no unknown parameters.

The probability statement is:

$$P\left(q^{\chi^2_{n-1}}_{0.025} < \frac{(n-1)S^2}{\sigma^2} < q^{\chi^2_{n-1}}_{0.975}\right) = 0.95.$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

The Chi-squared distribution is not symmetric, so we need to calculate both quantiles separately. If we have a sample of size $n = 100$ from $N(\mu = 4, \sigma^2 = 16)$, we need the 0.025 and 0.975 quantiles of the $\chi^2_{99}$ distribution:

a <- qchisq(0.025, df=99)
a
## [1] 73.36108
b <- qchisq(0.975, df=99)
b
## [1] 128.422

Step 3: Rewrite the expression to $Pr(.. < \sigma^2 < ..) = 0.95$.

$$P\left(q^{\chi^2_{n-1}}_{0.025} < \frac{(n-1)S^2}{\sigma^2} < q^{\chi^2_{n-1}}_{0.975}\right) = 0.95$$
$$\Leftrightarrow P\left(\frac{q^{\chi^2_{n-1}}_{0.025}}{(n-1)S^2} < \frac{1}{\sigma^2} < \frac{q^{\chi^2_{n-1}}_{0.975}}{(n-1)S^2}\right) = 0.95$$
$$\Leftrightarrow P\left(\frac{(n-1)S^2}{q^{\chi^2_{n-1}}_{0.975}} < \sigma^2 < \frac{(n-1)S^2}{q^{\chi^2_{n-1}}_{0.025}}\right) = 0.95,$$

where we have used that $a < \frac{1}{x} < b$ is equivalent to $\frac{1}{b} < x < \frac{1}{a}$ (for positive quantities).

To apply this, we draw a sample of size $n = 100$ from $N(\mu = 4, \sigma^2 = 16)$ and calculate the adjusted sample variance:

set.seed(12345)
data <- rnorm(100, mean=4, sd=4)
S2 <- var(data)
S2
## [1] 15.95995

The exact 95% confidence interval for the unknown population variance σ2 is:

c( 99*S2/b , 99*S2/a)
## [1] 12.30346 21.53778

The true variance is 16.
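Although not discussed above, a 95% confidence interval for the standard deviation $\sigma$ follows immediately by taking square roots of the two endpoints, because the square root is a monotone transformation (a small addition to the example):

# 95% confidence interval for sigma: square roots of the endpoints for sigma^2.
sqrt( c( 99*S2/b , 99*S2/a) )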

4.2 Approximation using CLT

We will only consider the Normal population distribution for questions about the population variance.



5 Confidence Interval for the Difference in Variances

If we want a confidence interval for the difference in variances of two populations, we will only discuss the example where the population distributions are Normal.

Suppose that we have two random samples. The first random sample $(X_{11}, \ldots, X_{1n_1})$ is a sample of incomes from Portugal and has size $n_1$. The second random sample $(X_{21}, \ldots, X_{2n_2})$ is a sample of incomes from England and has size $n_2$. We want to construct a 95% confidence interval for the difference in variances: $\sigma_1^2 - \sigma_2^2$.

The obvious thing to do is to compare $S_1^2$, the adjusted sample variance of the Portuguese sample, with $S_2^2$, the adjusted sample variance of the English sample. In particular, we would compute $S_1^2 - S_2^2$. Unfortunately, there is no easy way to obtain the repeated sampling distribution of $S_1^2 - S_2^2$.

If we assume that both population distributions are Normal, we can derive a result that comes close enough.


5.1 Example 8: $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$

Instead of comparing the difference $S_1^2 - S_2^2$ of the two sample variances, we can also calculate the ratio $S_1^2/S_2^2$. If this number is close to 1, perhaps $\sigma_1^2/\sigma_2^2$ is also close to one, which implies that $\sigma_1^2 - \sigma_2^2$ is close to zero.

Step 1: Turn $\widehat{\sigma_1^2/\sigma_2^2} = S_1^2/S_2^2$ into a pivotal quantity $PQ(\text{data}, \sigma_1^2/\sigma_2^2)$:

We have seen in Example 8 of the RSD chapter that

$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{S_1^2/S_2^2}{\sigma_1^2/\sigma_2^2} \sim F_{n_1-1, n_2-1}.$$

The good news is that this is a pivotal quantity: the quantity itself only depends on our random samples (via $S_1^2$ and $S_2^2$) and the parameter of interest $\sigma_1^2/\sigma_2^2$. Moreover, the distribution of this quantity is completely known.

The probability statement is:

$$P\left(q^{F_{n_1-1,n_2-1}}_{0.025} < \frac{S_1^2/S_2^2}{\sigma_1^2/\sigma_2^2} < q^{F_{n_1-1,n_2-1}}_{0.975}\right) = 0.95.$$

Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.

As the F distribution is not symmetric, we need to calculate both quantiles. As an example, let $n_1 = 100$ and $n_2 = 150$. Then we have

a <- qf(0.025, df1=99, df2=149)
a
## [1] 0.6922322
b <- qf(0.975, df1=99, df2=149)
b
## [1] 1.425503

If you use the statistical tables, then you need the following result to compute $a$:

$$q^{F_{n_1,n_2}}_{p} = \frac{1}{q^{F_{n_2,n_1}}_{1-p}}.$$

See Example 8 of the RSD chapter.
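This relationship is easy to verify in R (a quick check):

# The lower quantile equals the reciprocal of the mirrored upper quantile.
qf(0.025, df1=99, df2=149)
1/qf(0.975, df1=149, df2=99)   # same value as the line above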

Step 3: Rewrite the expression to $Pr(.. < \sigma_1^2/\sigma_2^2 < ..) = 0.95$.

We can rewrite the probability statement as follows:

$$P\left(q^{F_{n_1-1,n_2-1}}_{0.025} < \frac{S_1^2/S_2^2}{\sigma_1^2/\sigma_2^2} < q^{F_{n_1-1,n_2-1}}_{0.975}\right) = 0.95$$
$$\Leftrightarrow P\left(q^{F_{n_1-1,n_2-1}}_{0.025}\frac{S_2^2}{S_1^2} < \frac{1}{\sigma_1^2/\sigma_2^2} < q^{F_{n_1-1,n_2-1}}_{0.975}\frac{S_2^2}{S_1^2}\right) = 0.95$$
$$\Leftrightarrow P\left(\frac{S_1^2}{q^{F_{n_1-1,n_2-1}}_{0.975} S_2^2} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{S_1^2}{q^{F_{n_1-1,n_2-1}}_{0.025} S_2^2}\right) = 0.95,$$

where we have (again) used that $a < \frac{1}{x} < b$ is equivalent to $\frac{1}{b} < x < \frac{1}{a}$.

To apply this, we use a random sample of size $n_1 = 100$ from $N(\mu_1 = 10, \sigma_1^2 = 100)$ and a second random sample of size $n_2 = 150$ from $N(\mu_2 = 20, \sigma_2^2 = 400)$, and calculate the adjusted sample variances:

set.seed(12345)
data1 <- rnorm(100, mean=10, sd=10)
S2_1 <- var(data1)
S2_1
## [1] 124.2625
data2 <- rnorm(150, mean=20, sd=sqrt(400))
S2_2 <- var(data2)
S2_2
## [1] 402.0311

The exact 95% confidence interval for $\sigma_1^2/\sigma_2^2$ is equal to:

c( S2_1 / (b*S2_2)  , S2_1 / (a*S2_2) )
## [1] 0.2168264 0.4465073

The true value of $\sigma_1^2/\sigma_2^2$ is equal to $100/400 = 0.25$.
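The var.test function computes exactly this F-based interval for the ratio of two Normal variances, so it can be used to check the calculation above (a sketch):

# F-based interval for sigma_1^2 / sigma_2^2; should reproduce the interval above.
var.test(data1, data2, conf.level = 0.95)$conf.int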

To summarise, we were not able to construct a confidence interval for the difference $\sigma_1^2 - \sigma_2^2$. We were, however, able to construct a confidence interval for the ratio $\sigma_1^2/\sigma_2^2$ when both population distributions are Normal.

5.2 Approximation using CLT

We will only consider Normal population distributions for confidence intervals for $\sigma_1^2/\sigma_2^2$.

