5. Confidence Intervals
- 1 Constructing a Confidence Interval
- 2 Confidence Interval for the Mean (or Proportion)
- 3 Confidence Interval for the Difference in Means (or Proportions)
- 4 Confidence Interval for the Variance
- 5 Confidence Interval for the Difference in Variances
Let’s go back to the main goal of inferential statistics:
The goal of inferential statistics is:
To give a credible answer to an interesting question using data. Moreover, the uncertainty that is due to sampling should be quantified.
For example, suppose that we are willing to assume that the population distribution is a member of the family of Poisson distributions with parameter $\lambda$. The (potentially interesting) question is what the value of $\lambda$ is (knowing this value would pinpoint the population distribution).
We have learned to obtain an estimator (possibly using the Method of Moments or the method of Maximum Likelihood), as well as its repeated sampling distribution. That is, we can formulate statements like this:
$$\hat{\lambda} = \bar{X} \overset{\text{approx}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(\lambda, \frac{\lambda}{n}\right).$$
This statement does not really answer our question. We need to do a final step: re-formulate the statement above into a statement that clearly answers our question. We are now going to do this final step.
Suppose that we have a random sample from the population distribution at our disposal. We will discuss in this chapter how one can use the information about our estimator and its RSD to construct a (possibly approximate) 95% (or 99%) confidence interval for the unknown parameter $\lambda$. A confidence interval is a statement that could, in a particular example, look like this:
$$P(2.1 < \lambda < 3.4) = 0.95.$$
Without having a random sample, we did not have any knowledge about the value of the unknown parameter $\lambda$. After applying the method for constructing a confidence interval, to be discussed below, we end up stating that we are 95% sure that the confidence interval contains the true value of $\lambda$. We have put the information that we have about estimators and their RSD into a form that everybody is able to understand.
In other words, we have used our data (a random sample) and an assumption about the population distribution (that may or may not be credible) to answer our question (what is the value of $\lambda$) and quantify the uncertainty.
The uncertainty arises because we are using only a sample of individuals (or firms), instead of the whole population. If we had information about all individuals (or firms) in the population (i.e. an infinitely large sample), we would be able to obtain the true value of $\lambda$ with 100% certainty (if our estimator is consistent). As we have only a random sample of finite size at our disposal, the confidence interval that we have obtained would be different if we were to have had another sample of the same size at our disposal.
The animation below illustrates the correct interpretation of a confidence interval: Suppose that the population distribution is a Normal distribution with mean equal to zero. In practice, we wouldn’t know that the mean is zero. We use our sample and the method of obtaining a confidence interval, to be discussed below, and find (using our observed sample):
$$P(-0.08 < \mu < 0.43) = 0.95.$$
The interpretation is that we are 95% sure that this interval contains the (unknown) true value. The reason for this is as follows: if we were to draw 10 million samples of the same sample size, the (10 million) confidence intervals would contain the true value ($0$ in our case) in 95% of the cases.
Note that the only confidence interval that we obtain is the first one. The other confidence intervals are obtained using hypothetical samples (repeated samples that we do not have at our disposal). We are confident that our observed confidence interval contains the true value because 95% of all possible realisations of the confidence interval do.
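A minimal simulation sketch of this repeated-sampling interpretation is given below. The specific choices here (a $N(0,1)$ population, samples of size 20, 10,000 replications instead of 10 million, and t.test to compute each interval) are ours, purely for illustration:
set.seed(1)
reps <- 10000
covered <- logical(reps)
for (r in 1:reps) {
  x <- rnorm(20, mean=0, sd=1)                 # one hypothetical sample of size 20
  ci <- t.test(x, conf.level=0.95)$conf.int    # 95% confidence interval for the mean
  covered[r] <- (ci[1] < 0) & (0 < ci[2])      # does it contain the true mean (zero)?
}
mean(covered)                                  # fraction of intervals containing the true value; should be close to 0.95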
Note also that we always make the implicit assumption that the population is infinitely large. This simplifies the analysis a little bit. For instance, assuming that the population distribution is Bernoulli with parameter $p$ with $0 < p < 1$, implicitly assumes that the population is infinitely large: otherwise, $p$ cannot take all values between zero and one.
1 Constructing a Confidence Interval
We will now discuss the method that can be used to obtain confidence intervals.
The Method: Constructing a 95% Confidence Interval for $\theta$.
Step 1: Turn $\hat{\theta}$ into a pivotal quantity $PQ(\text{data}, \theta)$.
Step 2: Find the quantiles $a$ and $b$ such that $\Pr(a < PQ < b) = 0.95$.
Step 3: Rewrite the expression to $\Pr(.. < \theta < ..) = 0.95$.
The hardest part is step 1. A pivotal quantity (PQ) is a function of the data (our random sample) and $\theta$, the parameter for which we want to construct a 95% confidence interval. This function should satisfy two conditions:
- The distribution of PQ should be completely known (for instance, $N(0,1)$ or $F_{2,13}$).
- PQ itself should not depend on any other unknown parameters. It is only allowed to depend on $\theta$.
- Condition 1 allows us to perform step 2 of the method.
- Condition 2 allows us to perform step 3 of the method.
The easiest way to clarify the method is to apply it to some examples.
2 Confidence Interval for the Mean (or Proportion)
For a confidence interval for the mean of a single population, we discuss examples where the population distribution is Normal, Poisson, Exponential or Bernoulli.
2.1 Example 1: $X \sim N(\mu, \sigma^2)$
We are considering the situation where we are willing to assume that the population distribution is a member of the family of Normal distributions. We are interested in finding a 95% confidence interval for the unknown population mean μ.
Step 1: Turn $\hat{\mu}$ into a pivotal quantity $PQ(\text{data}, \mu)$:
The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on $\mu$. It therefore makes sense to consider the sample average (this is the Method of Moments estimator of $\mu$, as well as the Maximum Likelihood estimator of $\mu$). We know that linear combinations of Normals are Normal, so that we know the RSD of $\bar{X}$:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$
The quantity $\bar{X}$ is not a pivotal quantity, for two reasons:
- The RSD of $\bar{X}$ is not completely known: it depends on the unknown parameters $\mu$ and $\sigma^2$.
- $\bar{X}$ does depend on the data, but it does not depend on $\mu$.
The first problem can be solved by standardising. Let’s try the following:
$$\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} \sim N(0,1).$$
The quantity on the left-hand side satisfies the first condition of a pivotal quantity: the standard normal distribution is completely known (i.e. has no unknown parameters). The quantity itself does not satisfy the second condition: it depends on $\mu$, but also on the unknown parameter $\sigma^2$.
We can solve this problem by standardising with the estimated variance $S'^2$ (instead of $\sigma^2$):
$$PQ = \frac{\bar{X} - \mu}{\sqrt{\frac{S'^2}{n}}} \sim t_{n-1}.$$
We have obtained a pivotal quantity: PQ itself depends on μ and not on any other unknown parameters. Moreover, the RSD of PQ is completely known. We can progress to step two.
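The claim in step 1 can be checked by simulation: if we repeatedly draw samples from a Normal population and evaluate the pivotal quantity at the true mean, the simulated values should behave like draws from the $t_{n-1}$ distribution. A small sketch (the population $N(5, 100)$, the sample size $n = 50$ and the number of replications are arbitrary choices; var() computes the adjusted sample variance $S'^2$):
set.seed(1)
n  <- 50
mu <- 5
pq <- replicate(10000, {
  x <- rnorm(n, mean=mu, sd=10)          # one sample from N(5, 100)
  (mean(x) - mu) / sqrt(var(x)/n)        # the pivotal quantity, evaluated at the true mu
})
quantile(pq, c(0.025, 0.975))            # should be close to the t quantiles below
qt(c(0.025, 0.975), df=n-1)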
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
We need to find the values a and b such that P(a<PQ<b)=0.95. In other words, we need to find the following:
- $a = q^{t_{n-1}}_{0.025}$: the 2.5% quantile of the $t$ distribution with $n-1$ degrees of freedom.
- $b = q^{t_{n-1}}_{0.975}$: the 97.5% quantile of the $t$ distribution with $n-1$ degrees of freedom.
These are easily found using the statistical tables or by using R. If, for example, we have a sample of size n=50:
qt(0.025, df=49)
## [1] -2.009575
qt(0.975, df=49)
## [1] 2.009575
As the t distribution is symmetric, it would have sufficed to calculate only one of the two quantiles.
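For example, the 2.5% quantile is simply minus the 97.5% quantile:
-qt(0.025, df=49)
## [1] 2.009575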
We now have the following true statement to work with:
$$P\left(q^{t_{n-1}}_{0.025} < \frac{\bar{X} - \mu}{\sqrt{\frac{S'^2}{n}}} < q^{t_{n-1}}_{0.975}\right) = 0.95.$$
Step 3: Rewrite the expression to $\Pr(.. < \mu < ..) = 0.95$.
Using a sample of size $n = 50$, we can calculate the values of $\bar{X}$ and $S'^2$. Our pivotal quantity that appears in the middle of our probability statement therefore only depends on $\mu$. This is by construction, as it was one of the conditions for a pivotal quantity. As we saw, the quantiles are also easily determined.
We set out to obtain a statement like $P(\ldots < \mu < \ldots) = 0.95$. To get there, we only need to rewrite the expression inside the probability:
$$\begin{aligned} P\left(q^{t_{n-1}}_{0.025} < \frac{\bar{X} - \mu}{\sqrt{\frac{S'^2}{n}}} < q^{t_{n-1}}_{0.975}\right) = 0.95 &\Longleftrightarrow P\left(q^{t_{n-1}}_{0.025}\sqrt{\frac{S'^2}{n}} < \bar{X} - \mu < q^{t_{n-1}}_{0.975}\sqrt{\frac{S'^2}{n}}\right) = 0.95 \\ &\Longleftrightarrow P\left(-\bar{X} + q^{t_{n-1}}_{0.025}\sqrt{\frac{S'^2}{n}} < -\mu < -\bar{X} + q^{t_{n-1}}_{0.975}\sqrt{\frac{S'^2}{n}}\right) = 0.95 \\ &\Longleftrightarrow P\left(\bar{X} - q^{t_{n-1}}_{0.975}\sqrt{\frac{S'^2}{n}} < \mu < \bar{X} - q^{t_{n-1}}_{0.025}\sqrt{\frac{S'^2}{n}}\right) = 0.95 \\ &\Longleftrightarrow P\left(\bar{X} - q^{t_{n-1}}_{0.975}\sqrt{\frac{S'^2}{n}} < \mu < \bar{X} + q^{t_{n-1}}_{0.975}\sqrt{\frac{S'^2}{n}}\right) = 0.95. \end{aligned}$$
Note that $a < -\mu < b$ is equivalent to $-b < \mu < -a$ and that $-q^{t_{n-1}}_{0.025} = q^{t_{n-1}}_{0.975}$, as the $t$ distribution is symmetric. We can now use the values of $\bar{X}$ and $S'^2$ and the values of the quantiles to obtain the 95% confidence interval for $\mu$.
For illustration purposes, let's draw a random sample of size 50 from $N(\mu = 5, \sigma^2 = 100)$:
set.seed(12345)
data <- rnorm(50, mean=5, sd=10)
data
## [1] 10.8552882 12.0946602 3.9069669 0.4650283 11.0588746 -13.1795597
## [7] 11.3009855 2.2381589 2.1584026 -4.1932200 3.8375219 23.1731204
## [13] 8.7062786 10.2021646 -2.5053199 13.1689984 -3.8635752 1.6842241
## [19] 16.2071265 7.9872370 12.7962192 19.5578508 -1.4432843 -10.5313741
## [25] -10.9770952 23.0509752 0.1835264 11.2037980 11.1212349 3.3768902
## [31] 13.1187318 26.9683355 25.4919034 21.3244564 7.5427119 9.9118828
## [37] 1.7591342 -11.6205024 22.6773385 5.2580105 16.2851083 -18.8035806
## [43] -5.6026555 14.3714054 13.5445172 19.6072940 -9.1309878 10.6740325
## [49] 10.8318765 -8.0679883
We calculate the sample average and the adjusted sample variance:
xbar <- mean(data)
xbar
## [1] 6.795663
S2 <- var(data)
S2
## [1] 120.2501
We calculate the quantiles (it turns out that we only need b):
b <- qt(0.975, df=49)
b
## [1] 2.009575
The confidence interval is:
left <- xbar - b * sqrt( S2/50 )
left
## [1] 3.6792
right <- xbar + b * sqrt( S2/50 )
right
## [1] 9.912126
We conclude that P(3.68<μ<9.91)=0.95. The 95% confidence interval is (3.68,9.91). Because we simulated the data ourselves, we know that the true value of μ is equal to 5. In practice, one would not know this.
The t.test command can do the calculations for us:
t.test(data, conf.level=0.95)$"conf.int"
## [1] 3.679200 9.912126
## attr(,"conf.level")
## [1] 0.95
A 99% confidence interval for μ would be wider:
t.test(data, conf.level=0.99)$"conf.int"
## [1] 2.639575 10.951750
## attr(,"conf.level")
## [1] 0.99
This occurs because $b = q^{t_{n-1}}_{0.995}$ is larger than $q^{t_{n-1}}_{0.975}$:
qt(0.995, df=49)
## [1] 2.679952
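The three steps can also be packaged into a small helper function. The sketch below is ours (ci_mean is not a built-in R function); applied to our sample, it should reproduce the t.test intervals shown above:
# Steps 1-3 for the mean of a Normal population: xbar +/- t-quantile * sqrt(S'^2/n)
ci_mean <- function(x, level=0.95) {
  n  <- length(x)
  q  <- qt(1 - (1 - level)/2, df=n-1)   # e.g. the 97.5% quantile for a 95% interval
  se <- sqrt(var(x)/n)
  c(mean(x) - q*se, mean(x) + q*se)
}
ci_mean(data, level=0.95)   # should match the 95% interval from t.test above
ci_mean(data, level=0.99)   # should match the wider 99% interval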
2.1.1 Approximation using CLT
We are considering the situation where we either do not know anything about the population distribution, or we assume that the population distribution is a distribution that is not Normal (e.g. Bernoulli($p$)). We are interested in finding an approximate 95% confidence interval for the unknown population mean $\mu$ (or $p$, in the Bernoulli case).
The narrative does not have to be changed much:
Step 1: Turn $\hat{\mu}$ into a pivotal quantity $PQ(\text{data}, \mu)$:
The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on μ. It therefore makes sense to consider the sample average. We know, from the central limit theorem, that
$$\bar{X} \overset{\text{approx}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right).$$
We can obtain a pivotal quantity by standardising with the estimated variance:
$$PQ = \frac{\bar{X} - \mu}{\sqrt{\frac{S'^2}{n}}} \overset{\text{approx}}{\sim} N(0,1).$$
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
The 97.5% quantile of the standard normal distribution is:
b2 <- qnorm(0.975, mean=0, sd=1)
b2
## [1] 1.959964
We can therefore say that the following probability statement is correct:
$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \mu}{\sqrt{\frac{S'^2}{n}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$
Step 3: Rewrite the expression to $\Pr(.. < \mu < ..) = 0.95$.
Using the same steps as before, we obtain:
$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \mu}{\sqrt{\frac{S'^2}{n}}} < q^{N(0,1)}_{0.975}\right) = 0.95 \Longleftrightarrow P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{S'^2}{n}} < \mu < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{S'^2}{n}}\right) = 0.95.$$
Hence, without assuming that the population distribution is Normal, we obtain the following confidence interval:
c(xbar - b2 * sqrt(S2/50), xbar + b2 * sqrt(S2/50))
## [1] 3.756137 9.835188
That is, we conclude that P(3.76<μ<9.84)=0.95. Not assuming that the population distribution is Normal makes this 95% confidence interval slightly narrower here, because the standard normal quantile (1.96) is smaller than the $t_{49}$ quantile (2.01); the price we pay is that the interval is now only approximate.
2.2 Example 2: $X \sim \text{Poisson}(\lambda)$
We are considering the situation where we are willing to assume that the population distribution is a member of the family of Poisson($\lambda$) distributions. We are interested in finding a 95% confidence interval for the unknown population mean $\lambda$.
Step 1: Turn $\hat{\lambda}$ into a pivotal quantity $PQ(\text{data}, \lambda)$:
The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on $\lambda$. It therefore makes sense to consider the sample average (this is the Method of Moments estimator of $\lambda$, as well as the Maximum Likelihood estimator of $\lambda$). We do not know the repeated sampling distribution of $\bar{X}$ in this case. All we know is that
$$\sum_{i=1}^{n} X_i \sim \text{Poisson}(n\lambda).$$
This is not a pivotal quantity: the RSD of the sum is not completely known and the sum itself does not depend on λ. We cannot standardise to get an RSD that is completely known (as we could with the Normal distribution):
$$\frac{\sum_{i=1}^{n} X_i - n\lambda}{\sqrt{n\lambda}} \sim \;?$$
Without a pivotal quantity, we cannot proceed to step 2. We are going to have to use a large sample approximation.
2.2.1 Approximation using CLT
We are now interested in finding an approximate 95% confidence interval for the unknown population mean λ.
Step 1: Turn $\hat{\lambda}$ into a pivotal quantity $PQ(\text{data}, \lambda)$:
As E[X]=Var[X]=λ for Poisson(λ) distributions, the central limit theorem states that
$$\hat{\lambda} = \bar{X} \overset{\text{approx}}{\sim} N\left(\lambda, \frac{\lambda}{n}\right).$$
We can now standardise:
$$\frac{\bar{X} - \lambda}{\sqrt{\frac{\lambda}{n}}} \overset{\text{approx}}{\sim} N(0,1).$$
This is a pivotal quantity. However, we can also standardise with the estimated variance:
$$\frac{\bar{X} - \lambda}{\sqrt{\frac{\bar{X}}{n}}} \overset{\text{approx}}{\sim} N(0,1).$$
This is also a pivotal quantity. Looking ahead to step 3, this pivotal quantity will be easier to work with.
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
qnorm(0.975)
## [1] 1.959964
The required quantile is $q^{N(0,1)}_{0.975} = 1.96$. The probability statement is equal to
$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \lambda}{\sqrt{\frac{\bar{X}}{n}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$
Step 3: Rewrite the expression to $\Pr(.. < \lambda < ..) = 0.95$.
Rewriting this statement gives
$$\begin{aligned} P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \lambda}{\sqrt{\frac{\bar{X}}{n}}} < q^{N(0,1)}_{0.975}\right) = 0.95 &\Longleftrightarrow P\left(q^{N(0,1)}_{0.025}\sqrt{\frac{\bar{X}}{n}} < \bar{X} - \lambda < q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95 \\ &\Longleftrightarrow P\left(-\bar{X} + q^{N(0,1)}_{0.025}\sqrt{\frac{\bar{X}}{n}} < -\lambda < -\bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95 \\ &\Longleftrightarrow P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}} < \lambda < \bar{X} - q^{N(0,1)}_{0.025}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95 \\ &\Longleftrightarrow P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}} < \lambda < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}}{n}}\right) = 0.95. \end{aligned}$$
As an example, consider a random sample of size n=25 from the Poisson distribution (we take λ=10):
set.seed(12345)
data <- rpois(25, lambda=10)
data
## [1] 11 12 9 8 11 4 11 7 8 11 9 10 9 11 13 10 12 14 7 4 15 8 11 11 9
We calculate the sample average and the 97.5% quantile:
xbar <- mean(data)
xbar
## [1] 9.8
b <- qnorm(0.975)
b
## [1] 1.959964
The approximate 95% confidence interval is:
c( xbar - b * sqrt(xbar/25) , xbar + b * sqrt(xbar/25))
## [1] 8.572868 11.027132
We conclude that P(8.57<λ<11.03)=0.95.
2.3 Example 3: $X \sim \text{Exp}(\lambda)$
We are considering the situation where we are willing to assume that the population distribution is a member of the family of Exponential distributions. We could be interested in finding a 95% confidence interval for the parameter $\lambda$, or we could be interested in finding a 95% confidence interval for the mean $1/\lambda$.
We focus on the first question. The parameter $\lambda$ can be estimated by $\hat{\lambda} = 1/\bar{X}$, the Method of Moments and Maximum Likelihood estimator.
Step 1: Turn $\hat{\lambda}$ into a pivotal quantity $PQ(\text{data}, \lambda)$:
If we have a random sample from the Exponential distribution, all we know is that
$$\sum_{i=1}^{n} X_i \sim \text{Gamma}(n, \lambda).$$
This result does not allow us to find the exact RSD of $\hat{\lambda} = 1/\bar{X}$, let alone use this to find a pivotal quantity. We will therefore have to resort to a large sample approximation.
2.3.1 Approximation using CLT
If we have a random sample from the Exponential distribution, which has $E[X] = 1/\lambda$ and $\text{Var}[X] = 1/\lambda^2$, the CLT states that
$$\bar{X} \overset{\text{approx}}{\sim} N\left(\frac{1}{\lambda}, \frac{1}{n\lambda^2}\right).$$
Our estimator $\hat{\lambda} = 1/\bar{X}$ is a nonlinear function of an approximate Normal. We can therefore use the Delta Method, discussed in section 1.6 of the Estimators chapter, to obtain:
$$\hat{\lambda} = \frac{1}{\bar{X}} \overset{\text{approx}}{\sim} N\left(\lambda, \frac{\lambda^4}{n\lambda^2}\right) = N\left(\lambda, \frac{\lambda^2}{n}\right).$$
This is not a pivotal quantity.
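Spelling out the Delta Method step used here: with $g(x) = 1/x$ evaluated at the mean $1/\lambda$ of $\bar{X}$, we have
$$g'(x) = -\frac{1}{x^2}, \qquad g'\!\left(\frac{1}{\lambda}\right) = -\lambda^2, \qquad \text{Var}\!\left[\frac{1}{\bar{X}}\right] \approx \left(-\lambda^2\right)^2 \cdot \frac{1}{n\lambda^2} = \frac{\lambda^4}{n\lambda^2} = \frac{\lambda^2}{n}.$$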
Step 1: Turn $\hat{\lambda}$ into a pivotal quantity $PQ(\text{data}, \lambda)$:
We can standardise to obtain a quantity with an approximately standard normal distribution:
$$\frac{\frac{1}{\bar{X}} - \lambda}{\sqrt{\frac{\lambda^2}{n}}} \overset{\text{approx}}{\sim} N(0,1).$$
Looking ahead to step 3, however, we can obtain a pivotal quantity that is easier to work with by standardising with the estimated variance:
$$\frac{\frac{1}{\bar{X}} - \lambda}{\sqrt{\frac{1}{n\bar{X}^2}}} \overset{\text{approx}}{\sim} N(0,1).$$
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
As before, the 97.5% quantile of N(0,1) is (more or less) 1.96.
We can make the following probability statement:
$$P\left(q^{N(0,1)}_{0.025} < \frac{\frac{1}{\bar{X}} - \lambda}{\sqrt{\frac{1}{n\bar{X}^2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$
Step 3: Rewrite the expression to $\Pr(.. < \lambda < ..) = 0.95$.
Rewriting this statement (in the usual way) gives
$$P\left(\frac{1}{\bar{X}} - q^{N(0,1)}_{0.975}\sqrt{\frac{1}{n\bar{X}^2}} < \lambda < \frac{1}{\bar{X}} + q^{N(0,1)}_{0.975}\sqrt{\frac{1}{n\bar{X}^2}}\right) = 0.95.$$
A confidence interval for the mean $1/\lambda$ would start from
$$\bar{X} \overset{\text{approx}}{\sim} N\left(\frac{1}{\lambda}, \frac{1}{n\lambda^2}\right).$$
Standardising with the estimated variance gives
$$\frac{\bar{X} - \frac{1}{\lambda}}{\sqrt{\frac{\bar{X}^2}{n}}} \overset{\text{approx}}{\sim} N(0,1).$$
Step 3 then gives
$$P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}^2}{n}} < \frac{1}{\lambda} < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}^2}{n}}\right) = 0.95.$$
As an example, we draw a random sample of size n=30 from the Exp(λ=2) distribution:
set.seed(12345)
data <- rexp(30, rate=2)
data
## [1] 0.220903896 0.263736378 0.404233657 0.009224336 0.227705254 0.011969070
## [7] 3.201094619 0.628980261 0.481094044 0.909001520 0.113243755 0.223381456
## [13] 0.626230560 0.198543639 0.044057864 0.861082101 2.692431711 0.943996964
## [19] 0.181833623 0.586781837 0.569824730 0.213983048 0.729273719 0.282193281
## [25] 0.614559271 0.365936084 0.489022937 1.401355220 0.125542797 0.109180467
The approximate 95% confidence interval for the parameter λ is:
n <- 30
xbar <- mean(data)
b <- qnorm(0.975)
c(1/xbar - b * sqrt( 1/(n*(xbar^2)) ), 1/xbar + b * sqrt( 1/(n*(xbar^2))) )
## [1] 1.086543 2.297476
The approximate 95% confidence interval for the mean of the population distribution $1/\lambda$ is:
c(xbar - b * sqrt( xbar^2 /n) , xbar + b * sqrt( xbar^2 /n))
## [1] 0.3795258 0.8025008
2.4 Example 4: $X \sim \text{Bernoulli}(p)$
We are considering the situation where we are willing to assume that the population distribution is a member of the family of Bernoulli distributions. We are interested in finding the 95% confidence interval for the mean (or: population proportion) p.
Both the Method of Moments estimator and the Maximum Likelihood estimator of $p$ are equal to the sample average.
Step 1: Turn $\hat{p}$ into a pivotal quantity $PQ(\text{data}, p)$:
When sampling from the Bernoulli(p) distribution, we only have the following result:
$$\sum_{i=1}^{n} X_i \sim \text{Binomial}(n, p).$$
We can find the exact distribution by observing that
$$P\left(\sum_{i=1}^{n} X_i = k\right) = P\left(\bar{X} = \frac{k}{n}\right) = \binom{n}{k}\, p^k (1-p)^{n-k} \quad \text{for } k = 0, 1, \ldots, n.$$
The exact RSD of $\bar{X}$ has the Binomial probabilities, but assigned to $0, \frac{1}{n}, \frac{2}{n}, \ldots, 1$ instead of $0, 1, 2, \ldots, n$. The quantity $\sum_{i=1}^{n} X_i$ is not pivotal, and we can't standardise to make it so:
$$\frac{\sum_{i=1}^{n} X_i - np}{\sqrt{np(1-p)}} \sim \;?$$
We will have to resort to a large sample approximation.
2.4.1 Approximation using CLT
We are interested in finding an approximate 95% confidence interval for the population proportion p.
Step 1: Turn $\hat{p}$ into a pivotal quantity $PQ(\text{data}, p)$:
If $X \sim \text{Bernoulli}(p)$, we know that $E[X] = p$ and $\text{Var}[X] = p(1-p)$. The central limit theorem then states that
$$\bar{X} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1-p)}{n}\right).$$
This is not a pivotal quantity. Standardising gives:
$$\frac{\bar{X} - p}{\sqrt{\frac{p(1-p)}{n}}} \overset{\text{approx}}{\sim} N(0,1).$$
Although this is a pivotal quantity, we prefer to standardise with the estimated variance:
$$\frac{\bar{X} - p}{\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}} \overset{\text{approx}}{\sim} N(0,1),$$
as this will simplify step 3 of the method.
The probability statement is:
$$P\left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - p}{\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
As before, the 97.5% quantile of N(0,1) is about 1.96.
Step 3: Rewrite the expression to $\Pr(.. < p < ..) = 0.95$.
We can rewrite this statement in the usual way, to obtain
$$P\left(\bar{X} - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}} < p < \bar{X} + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}\right) = 0.95.$$
As an example, we draw a random sample of size 40 from a Bernoulli(p=0.3) population:
set.seed(123456)
data <- rbinom(40, size=1, prob=0.3)
data
## [1] 1 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1
## [39] 1 0
We calculate the sample average and the quantile:
xbar <- mean(data)
xbar
## [1] 0.475
b <- qnorm(0.975)
The approximate 95% confidence interval for p is:
c( xbar - b * (sqrt(xbar*(1-xbar)/40)), xbar + b * (sqrt(xbar*(1-xbar)/40)) )
## [1] 0.3202450 0.6297550
Notice that, using this particular sample, the true value p=0.3 is not in the interval. We were unlucky, as this happens only in 5% of the (infinitely many) possible repeated samples. Each repeated sample gives a confidence interval, and in 95% of the cases, this interval contains the true value. This is precisely how a confidence interval should be interpreted.
3 Confidence Interval for the Difference in Means (or Proportions)
For a confidence interval for the difference in means of two populations, we discuss examples where the population distributions are Normal and Bernoulli.
Suppose that we have two random samples. The first random sample $(X_{11}, \ldots, X_{1n_1})$ is a sample of incomes from Portugal and has size $n_1$. The second random sample $(X_{21}, \ldots, X_{2n_2})$ is a sample of incomes from England and has size $n_2$. We want to construct a 95% confidence interval for the difference in means: $\mu_1 - \mu_2$.
3.1 Example 5: $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$
Suppose that we are willing to assume that both income distributions are members of the Normal family of distributions.
Step 1: Turn $\bar{X}_1 - \bar{X}_2$ into a pivotal quantity $PQ(\text{data}, \mu_1 - \mu_2)$:
From example 5 of the RSD chapter, we know that:
$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right).$$
This is not a pivotal quantity. To obtain a pivotal quantity, we need to standardise. How we standardise will depend on whether or not $\sigma_1^2$ and $\sigma_2^2$ are known. If they are known, we will denote them by $\sigma_{10}^2$ and $\sigma_{20}^2$.
3.1.1 Variances Known: $\sigma_{10}^2$ and $\sigma_{20}^2$
If the variances are known, we can standardise using the known variances and obtain a repeated sampling distribution that is standard-normal:
$$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}}} \sim N(0,1).$$
This is a pivotal quantity. The probability statement is:
$$P\left(q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
The 97.5% quantile of $N(0,1)$ is about 1.96.
Step 3: Rewrite the expression to $\Pr(.. < \mu_1 - \mu_2 < ..) = 0.95$.
We can rewrite the probability statement in the usual way:
$$P\left((\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975}\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975}\sqrt{\frac{\sigma_{10}^2}{n_1} + \frac{\sigma_{20}^2}{n_2}}\right) = 0.95.$$
As an example, we draw a random sample of size $n_1 = 50$ from $N(\mu_1 = 10, \sigma_1^2 = 100)$ and another random sample of size $n_2 = 60$ from $N(\mu_2 = 15, \sigma_2^2 = 81)$. We know that the distributions are Normal and that $\sigma_1^2 = 100$ and $\sigma_2^2 = 81$. We do not know the values of $\mu_1$ and $\mu_2$.
set.seed(12345)
data1 <- rnorm(50, mean=10, sd=10)
data1
## [1] 15.8552882 17.0946602 8.9069669 5.4650283 16.0588746 -8.1795597
## [7] 16.3009855 7.2381589 7.1584026 0.8067800 8.8375219 28.1731204
## [13] 13.7062786 15.2021646 2.4946801 18.1689984 1.1364248 6.6842241
## [19] 21.2071265 12.9872370 17.7962192 24.5578508 3.5567157 -5.5313741
## [25] -5.9770952 28.0509752 5.1835264 16.2037980 16.1212349 8.3768902
## [31] 18.1187318 31.9683355 30.4919034 26.3244564 12.5427119 14.9118828
## [37] 6.7591342 -6.6205024 27.6773385 10.2580105 21.2851083 -13.8035806
## [43] -0.6026555 19.3714054 18.5445172 24.6072940 -4.1309878 15.6740325
## [49] 15.8318765 -3.0679883
data2 <- rnorm(60, mean=15, sd=9)
data2
## [1] 10.136525 32.529234 15.482312 18.164966 8.961211 17.501583 21.220541
## [8] 22.414158 34.305585 -6.122496 16.346328 2.917217 19.979728 29.309666
## [15] 9.718084 -1.491396 22.993255 29.341396 19.651692 3.338955 15.491540
## [22] 7.938156 5.555825 35.974608 27.624348 23.483408 22.436325 7.696136
## [29] 19.286235 24.191326 20.808448 24.388292 12.260678 37.293998 23.740986
## [36] 31.803893 21.048382 12.228420 19.828713 22.423831 6.324887 7.304257
## [43] 31.982522 11.473626 6.174303 21.185989 10.454608 34.419478 9.601822
## [50] 8.749080 17.015329 4.593990 18.801767 3.077203 16.269759 10.175568
## [57] 12.195545 29.004987 10.967700 17.890112
We calculate the sample averages:
xbar1 <- mean(data1)
xbar1
## [1] 11.79566
xbar2 <- mean(data2)
xbar2
## [1] 17.16441
b <- qnorm(0.975)
The 95% confidence interval for $\mu_1 - \mu_2$ is:
c( (xbar1 - xbar2) - b * sqrt(100/50 + 81/60), (xbar1 - xbar2) + b * sqrt(100/50 + 81/60) )
## [1] -8.956071 -1.781425
The true difference in means is equal to −5.
3.1.2 Variances Unknown but the Same: $\sigma_1^2 = \sigma_2^2 = \sigma^2$
If the variances $\sigma_1^2$ and $\sigma_2^2$ are not known, we need to estimate them. If $\sigma_1^2$ and $\sigma_2^2$ are unknown, but known to be equal (to the number $\sigma^2$, say), we can estimate the single unknown variance using only the first sample or using only the second sample. It would be better, however, to use the information contained in both samples. As discussed in example 5 of the RSD chapter, we can use the pooled adjusted sample variance $S'^2_p$, leading to:
$$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S'^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}}} \sim t_{n_1 + n_2 - 2}.$$
This is a pivotal quantity. The probability statement is:
$$P\left(q^{t_{n_1+n_2-2}}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S'^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}}} < q^{t_{n_1+n_2-2}}_{0.975}\right) = 0.95.$$
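For reference, $S'^2_p$ is the pooled adjusted sample variance, which weights each adjusted sample variance by its degrees of freedom (this is the formula used in the R code below):
$$S'^2_p = \frac{(n_1 - 1)S'^2_1 + (n_2 - 1)S'^2_2}{n_1 + n_2 - 2}.$$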
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
In our example, $n_1 = 50$ and $n_2 = 60$. The 97.5% quantile of the $t_{108}$ distribution is:
q2 <- qt(0.975, df=108)
q2
## [1] 1.982173
Step 3: Rewrite the expression to $\Pr(.. < \mu_1 - \mu_2 < ..) = 0.95$.
We can rewrite the expression in the usual way:
$$P\left((\bar{X}_1 - \bar{X}_2) - q^{t_{n_1+n_2-2}}_{0.975}\sqrt{S'^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{t_{n_1+n_2-2}}_{0.975}\sqrt{S'^2_p\left\{\frac{1}{n_1} + \frac{1}{n_2}\right\}}\right) = 0.95.$$
As an example, we again take a random sample of size $n_1 = 50$ from $N(\mu_1 = 10, \sigma_1^2 = 100)$ and another random sample of size $n_2 = 60$ from $N(\mu_2 = 15, \sigma_2^2 = 100)$. Note that, this time, the variances are the same, but we do not know their value.
set.seed(12345)
data1 <- rnorm(50, mean=10, sd=10)
data2 <- rnorm(60, mean=15, sd=10)
xbar1 <- mean(data1)
xbar2 <- mean(data2)
S2_1 <- var(data1)
S2_1
## [1] 120.2501
S2_2 <- var(data2)
S2_2
## [1] 121.1479
We obtain the following confidence interval:
S2_p <- (49*S2_1 + 59*S2_2)/(50 + 60 - 2)
S2_p
## [1] 120.7405
c( (xbar1 - xbar2) - q2 * sqrt( S2_p*(1/50 + 1/60) ), (xbar1 - xbar2) + q2 * sqrt( S2_p*(1/50 + 1/60) ) )
## [1] -9.779889 -1.438587
Not knowing the variances (but knowing that they are the same) made the confidence interval wider.
3.1.3 Variances Unknown, Approximation Using CLT
If the variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and not known to be the same, we have to estimate both of them. Standardising with two estimated variances does not lead to a $t$ distribution, so we have no exact RSD. We can approximate the RSD using the Central Limit Theorem:
$$\bar{X}_1 - \bar{X}_2 \overset{\text{approx}}{\sim} N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right).$$
Standardising with two estimated variances gives:
$$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S'^2_1}{n_1} + \frac{S'^2_2}{n_2}}} \overset{\text{approx}}{\sim} N(0,1).$$
This is a pivotal quantity.
The probability statement is:
$$P\left(q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S'^2_1}{n_1} + \frac{S'^2_2}{n_2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
The 97.5% quantile of N(0,1) is about 1.96.
Step 3: Rewrite the expression to $\Pr(.. < \mu_1 - \mu_2 < ..) = 0.95$.
Rewriting in the usual way gives:
$$P\left((\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975}\sqrt{\frac{S'^2_1}{n_1} + \frac{S'^2_2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975}\sqrt{\frac{S'^2_1}{n_1} + \frac{S'^2_2}{n_2}}\right) = 0.95.$$
We generate the data again according to our first example:
data1 <- rnorm(50, mean=10, sd=10)
xbar1 <- mean(data1)
data2 <- rnorm(60, mean=15, sd=9)
xbar2 <- mean(data2)
In this case, we do not know $\sigma_1^2$ and $\sigma_2^2$, so we estimate them:
S2_1 <- var(data1)
S2_1
## [1] 127.94
S2_2 <- var(data2)
S2_2
## [1] 74.59016
The 95% confidence interval is now:
b <- qnorm(0.975)
c( (xbar1 - xbar2) - b * sqrt(S2_1/50 + S2_2/60), (xbar1 - xbar2) + b * sqrt(S2_1/50 + S2_2/60) )
## [1] -9.588274 -1.944947
3.2 Example 6: $X_1 \sim \text{Bernoulli}(p_1)$ and $X_2 \sim \text{Bernoulli}(p_2)$
We are now assuming that $X_1$ and $X_2$ both have Bernoulli distributions, with $E[X_1] = p_1$, $\text{Var}[X_1] = p_1(1-p_1)$ and $E[X_2] = p_2$, $\text{Var}[X_2] = p_2(1-p_2)$, respectively. For example, $p_1$ could represent the unemployment rate in Portugal and $p_2$ the unemployment rate in England. We want to obtain a 95% confidence interval for the difference in unemployment rates: $p_1 - p_2$.
Step 1: Turn $\bar{X}_1 - \bar{X}_2$ into a pivotal quantity $PQ(\text{data}, p_1 - p_2)$:
In that case we would base our analysis of $p_1 - p_2$ on $\bar{X}_1 - \bar{X}_2$, the difference between the fraction of unemployed in sample 1 and the fraction of unemployed in sample 2. The central limit theorem states that both sample averages have an RSD that is approximately Normal:
$$\bar{X}_1 \overset{\text{approx}}{\sim} N\left(p_1, \frac{p_1(1-p_1)}{n_1}\right) \quad \text{and} \quad \bar{X}_2 \overset{\text{approx}}{\sim} N\left(p_2, \frac{p_2(1-p_2)}{n_2}\right).$$
As $\bar{X}_1 - \bar{X}_2$ is a linear combination of (approximately) Normal random variables, its RSD is (approximately) Normal:
$$\bar{X}_1 - \bar{X}_2 \overset{\text{approx}}{\sim} N\left(p_1 - p_2, \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\right).$$
This is not a pivotal quantity. Note that we have used that $\text{Cov}[\bar{X}_1, \bar{X}_2] = 0$: we assume that both (random) samples are drawn independently from each other.
Section 3.1.1 tells us how to standardise with known variances:
$$\frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}} \overset{\text{approx}}{\sim} N(0,1).$$
We can estimate $p_1$ by $\bar{X}_1$ and $p_2$ by $\bar{X}_2$. Using section 3.1.3 of the RSD chapter, standardising with the estimated variances gives:
$$\frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}} \overset{\text{approx}}{\sim} N(0,1).$$
This is a pivotal quantity.
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
As before, the 97.5% quantile of $N(0,1)$ is about 1.96. The probability statement is:
$$P\left(q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}} < q^{N(0,1)}_{0.975}\right) = 0.95.$$
Step 3: Rewrite the expression to $\Pr(.. < p_1 - p_2 < ..) = 0.95$.
We can rewrite the statement in the usual way:
$$P\left((\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}} < p_1 - p_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975}\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}\right) = 0.95.$$
As an example, we draw a random sample of size $n_1 = 50$ from Bernoulli($p_1 = 0.7$), and another random sample of size $n_2 = 40$ from Bernoulli($p_2 = 0.5$):
set.seed(12345)
data1 <- rbinom(50, size=1, prob=0.7)
data1
## [1] 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 0
## [39] 1 1 0 1 0 0 1 1 1 1 1 1
data2 <- rbinom(40, size=1, prob=0.5)
data2
## [1] 1 1 0 0 1 0 1 0 0 0 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 1 1
## [39] 1 0
We calculate the sample averages:
xbar1 <- mean(data1)
xbar1
## [1] 0.68
xbar2 <- mean(data2)
xbar2
## [1] 0.55
diff <- xbar1 - xbar2
diff
## [1] 0.13
The approximate 95% confidence interval for $p_1 - p_2$ is equal to:
b <- qnorm(0.975)
c( diff - b * sqrt(xbar1*(1-xbar1)/50 + xbar2*(1-xbar2)/40), diff + b * sqrt(xbar1*(1-xbar1)/50 + xbar2*(1-xbar2)/40))
## [1] -0.07121395 0.33121395
The true difference $p_1 - p_2$ is 0.2.
4 Confidence Interval for the Variance
For a confidence interval for the variance of the population distribution, we only discuss the example where the population distribution is Normal.
4.1 Example 7: $X \sim N(\mu, \sigma^2)$
We are considering the situation where we are willing to assume that the population distribution is a member of the family of Normal distributions. We are now interested in constructing a 95% confidence interval for the variance of the population distribution: $\sigma^2$. The natural estimator for the population variance is the adjusted sample variance
$$S'^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$
We do not know the repeated sampling distribution of $S'^2$.
Step 1: Turn $S'^2$ into a pivotal quantity $PQ(\text{data}, \sigma^2)$:
Although we do not know the RSD of $S'^2$, there is one thing that we do know:
$$\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S'^2}{\sigma^2} \sim \chi^2_{n-1}.$$
This follows because a sum of $n$ squared standard normals has a $\chi^2_n$ distribution. The quantity $\frac{(n-1)S'^2}{\sigma^2}$ is exactly such a sum, except that $\mu$ is estimated by $\bar{X}$, which costs one degree of freedom. The quantity $\frac{(n-1)S'^2}{\sigma^2}$ is a pivotal quantity: the quantity itself only depends on our random sample and the parameter of interest $\sigma^2$. Moreover, the $\chi^2_{n-1}$ distribution contains no unknown parameters.
The probability statement is:
$$P\left(q^{\chi^2_{n-1}}_{0.025} < \frac{(n-1)S'^2}{\sigma^2} < q^{\chi^2_{n-1}}_{0.975}\right) = 0.95.$$
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
The Chi-squared distribution is not symmetric, so we need to calculate both quantiles separately. If we have a sample of size $n = 100$ from $N(\mu = 4, \sigma^2 = 16)$, we need the 0.025 and 0.975 quantiles of the $\chi^2_{99}$ distribution:
a <- qchisq(0.025, df=99)
a
## [1] 73.36108
b <- qchisq(0.975, df=99)
b
## [1] 128.422
Step 3: Rewrite the expression to $\Pr(.. < \sigma^2 < ..) = 0.95$.
$$P\left(q^{\chi^2_{n-1}}_{0.025} < \frac{(n-1)S'^2}{\sigma^2} < q^{\chi^2_{n-1}}_{0.975}\right) = 0.95 \Longleftrightarrow P\left(\frac{q^{\chi^2_{n-1}}_{0.025}}{(n-1)S'^2} < \frac{1}{\sigma^2} < \frac{q^{\chi^2_{n-1}}_{0.975}}{(n-1)S'^2}\right) = 0.95 \Longleftrightarrow P\left(\frac{(n-1)S'^2}{q^{\chi^2_{n-1}}_{0.975}} < \sigma^2 < \frac{(n-1)S'^2}{q^{\chi^2_{n-1}}_{0.025}}\right) = 0.95,$$
where we have used that $a < \frac{1}{x} < b$ is equivalent to $\frac{1}{b} < x < \frac{1}{a}$ (for positive $a$, $b$ and $x$).
To apply this, we draw a sample of size $n = 100$ from $N(\mu = 4, \sigma^2 = 16)$ and calculate the adjusted sample variance:
set.seed(12345)
data <- rnorm(100, mean=4, sd=4)
S2 <- var(data)
S2
## [1] 19.882
The exact 95% confidence interval for the unknown population variance $\sigma^2$ is:
c( 99*S2/b , 99*S2/a)
## [1] 15.32695 26.83055
The true variance is 16.
5 Confidence Interval for the Difference in Variances
For a confidence interval for the difference in variances of two populations, we only discuss the example where the population distributions are Normal.
Suppose that we have two random samples. The first random sample $(X_{11}, \ldots, X_{1n_1})$ is a sample of incomes from Portugal and has size $n_1$. The second random sample $(X_{21}, \ldots, X_{2n_2})$ is a sample of incomes from England and has size $n_2$. We want to construct a 95% confidence interval for the difference in variances: $\sigma_1^2 - \sigma_2^2$.
The obvious thing to do is to compare $S'^2_1$, the adjusted sample variance of the Portuguese sample, with $S'^2_2$, the adjusted sample variance of the English sample. In particular, we would compute $S'^2_1 - S'^2_2$. Unfortunately, there is no easy way to obtain the repeated sampling distribution of $S'^2_1 - S'^2_2$.
If we assume that both population distributions are Normal, we can derive a result that comes close enough.
5.1 Example 8: $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$
Instead of comparing the difference $S'^2_1 - S'^2_2$ of the two sample variances, we can also calculate the ratio $S'^2_1 / S'^2_2$. If this number is close to 1, perhaps $\frac{\sigma_1^2}{\sigma_2^2}$ is also close to one, which implies that $\sigma_1^2 - \sigma_2^2$ is close to zero.
Step 1: Turn $\widehat{\frac{\sigma_1^2}{\sigma_2^2}}$ into a pivotal quantity $PQ\!\left(\text{data}, \frac{\sigma_1^2}{\sigma_2^2}\right)$:
We have seen in Example 8 of the RSD chapter that
$$\frac{S'^2_1/\sigma_1^2}{S'^2_2/\sigma_2^2} = \frac{S'^2_1/S'^2_2}{\sigma_1^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}.$$
The good news is that this is a pivotal quantity: the quantity itself only depends on our random samples (via $S'^2_1$ and $S'^2_2$) and the parameter of interest $\frac{\sigma_1^2}{\sigma_2^2}$. Moreover, the distribution of this quantity is completely known.
The probability statement is:
$$P\left(q^{F_{n_1-1,n_2-1}}_{0.025} < \frac{S'^2_1/S'^2_2}{\sigma_1^2/\sigma_2^2} < q^{F_{n_1-1,n_2-1}}_{0.975}\right) = 0.95.$$
Step 2: Find the quantiles a and b such that Pr(a<PQ<b)=0.95.
As the $F$-distribution is not symmetric, we need to calculate both quantiles. As an example, let $n_1 = 100$ and $n_2 = 150$. Then we have
a <- qf(0.025, df1=99, df2=149)
a
## [1] 0.6922322
b <- qf(0.975, df1=99, df2=149)
b
## [1] 1.425503
If you use the statistical tables, then you need the following result to compute $a$:
$$q^{F_{n_1,n_2}}_{p} = \frac{1}{q^{F_{n_2,n_1}}_{1-p}}.$$
See Example 8 of the RSD chapter.
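As a quick numerical check of this identity in R (it reproduces the quantile a computed above):
1/qf(0.975, df1=149, df2=99)
## [1] 0.6922322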
Step 3: Rewrite the expression to $\Pr\!\left(.. < \frac{\sigma_1^2}{\sigma_2^2} < ..\right) = 0.95$.
We can rewrite the probability statement as follows:
$$P\left(q^{F_{n_1-1,n_2-1}}_{0.025} < \frac{S'^2_1/S'^2_2}{\sigma_1^2/\sigma_2^2} < q^{F_{n_1-1,n_2-1}}_{0.975}\right) = 0.95 \Longleftrightarrow P\left(q^{F_{n_1-1,n_2-1}}_{0.025}\,\frac{S'^2_2}{S'^2_1} < \frac{1}{\sigma_1^2/\sigma_2^2} < q^{F_{n_1-1,n_2-1}}_{0.975}\,\frac{S'^2_2}{S'^2_1}\right) = 0.95 \Longleftrightarrow P\left(\frac{S'^2_1}{q^{F_{n_1-1,n_2-1}}_{0.975}\,S'^2_2} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{S'^2_1}{q^{F_{n_1-1,n_2-1}}_{0.025}\,S'^2_2}\right) = 0.95,$$
where we have (again) used that $a < \frac{1}{x} < b$ is equivalent to $\frac{1}{b} < x < \frac{1}{a}$.
To apply this, we use a random sample of size $n_1 = 100$ from $N(\mu_1 = 10, \sigma_1^2 = 100)$ and a second random sample of size $n_2 = 150$ from $N(\mu_2 = 20, \sigma_2^2 = 400)$, and calculate the adjusted sample variances:
set.seed(12345)
data1 <- rnorm(100, mean=10, sd=10)
S2_1 <- var(data1)
S2_1
## [1] 124.2625
data2 <- rnorm(150, mean=20, sd=sqrt(400))
S2_2 <- var(data2)
S2_2
## [1] 402.0311
The exact 95% confidence interval for $\frac{\sigma_1^2}{\sigma_2^2}$ is equal to:
c( S2_1 / (b*S2_2) , S2_1 / (a*S2_2) )
## [1] 0.2168264 0.4465073
The true value of $\frac{\sigma_1^2}{\sigma_2^2}$ is equal to 0.25.
To summarize, we were not able to construct confidence intervals for $\sigma_1^2 - \sigma_2^2$. We were able to construct confidence intervals for $\frac{\sigma_1^2}{\sigma_2^2}$ if both population distributions are Normal.