5. Confidence Intervals

Let’s go back to the main goal of inferential statistics:

The goal of inferential statistics is:
To give a credible answer to an interesting question using data. Moreover, the uncertainty that is due to sampling should be quantified.

For example, suppose that we are willing to assume that the population distribution is a member of the family of Poisson distributions with parameter \(\lambda\). The (potentially interesting) question is what the value of \(\lambda\) is (knowing this value would pinpoint the population distribution).

We have learned to obtain an estimator (possibly using the Method of Moments or the method of Maximum Likelihood), as well as its repeated sampling distribution. That is, we can formulate statements like this:

\[ \hat{\lambda} = \bar{X} \overset{approx}{\sim} N \left( \mu, \frac{\sigma^2}{n} \right)=N \left( \lambda, \frac{\lambda}{n} \right). \]

This statement does not really answer our question. We need to do a final step: re-formulate the statement above into a statement that clearly answers our question. We are now going to do this final step.

Suppose that we have a random sample from the population distribution at our disposal. We will discuss in this chapter how one can use the information about our estimator and its RSD to construct a (possibly approximate) \(95\%\) (or \(99\%\)) confidence interval for the unknown parameter \(\lambda\). A confidence interval is a statement that could, in a particular example, look like this:

\[ P(2.1 < \lambda < 3.4) = 0.95. \]

Without having a random sample, we did not have any knowledge about the value of the unknown parameter \(\lambda\). After applying the method for constructing a confidence interval, to be discussed below, we end up stating that we are \(95\%\) sure that the confidence interval contains the true value of \(\lambda\). We have put the information that we have about estimators and their RSD into a form that everybody is able to understand.

In other words, we have used our data (a random sample) and an assumption about the population distribution (that may or may not be credible) to answer our question (what is the value of \(\lambda\)) and quantify the uncertainty.

The uncertainty arises because we are using only a sample of individuals (or firms), instead of the whole population. If we had information about all individuals (or firms) in the population (i.e. an infinitely large sample), we would be able to obtain the true value of \(\lambda\) with \(100\%\) certainty (if our estimator is consistent). As we have only a random sample of finite size at our disposal, the confidence interval that we have obtained would be different if we were to have had another sample of the same size at our disposal.

The following simulation experiment illustrates the correct interpretation of a confidence interval: Suppose that the population distribution is a Normal distribution with mean equal to zero. In practice, we wouldn’t know that the mean is zero. We use our observed sample and the method for obtaining a confidence interval, to be discussed below, and find:

\[ P(-0.08 < \mu < 0.43) = 0.95. \] The interpretation is that we are \(95\%\) sure that this interval contains the (unknown) true value. The reason for this is as follows: if we were to draw 10 million samples of the same sample size, the (10 million) confidence intervals will contain the true value (\(0\) in our case) in \(95\%\) of the cases.

Note that the only confidence interval that we actually obtain is the one based on our observed sample. The other confidence intervals are based on hypothetical samples (repeated samples that we do not have at our disposal). We are confident that our observed confidence interval contains the true value because \(95\%\) of all possible realisations of the confidence interval do.
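We can check this interpretation with a small simulation (an illustrative sketch, using R’s t.test command, which appears later in this chapter): draw many samples from a population with known mean zero and record how often the interval covers it.

set.seed(1)
covers <- replicate(10000, {
  x <- rnorm(25, mean=0, sd=1)              # sample from a population with true mean 0
  ci <- t.test(x, conf.level=0.95)$conf.int # 95% confidence interval from this sample
  ci[1] < 0 & 0 < ci[2]                     # does the interval contain the true mean?
})
mean(covers)                                # should be close to 0.95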

Note also that we always make the implicit assumption that the population is infinitely large. This simplifies the analysis a little bit. For instance, assuming that the population distribution is Bernoulli with parameter \(p\) with \(0<p<1\), implicitly assumes that the population is infinitely large: otherwise, \(p\) cannot take all values between zero and one.

1 Constructing a Confidence Interval

We will now discuss the method that can be used to obtain confidence intervals.

The Method: Constructing a \(95\%\) Confidence Interval For \(\theta\).

Step 1: Turn \(\hat{\theta}\) into a pivotal quantity \(PQ(data, \theta)\).
Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).
Step 3: Rewrite the expression to \(\Pr(..< \theta < ..)=0.95\).

The hardest part is step 1. A pivotal quantity (PQ) is a function of the data (our random sample) and \(\theta\), the parameter for which we want to construct a \(95\%\) confidence interval. This function should satisfy two conditions:

  1. The distribution of PQ should be completely known (for instance, \(N(0,1)\) or \(F_{2,13}\)).
  2. PQ itself should not depend on any other unknown parameters. It is only allowed to depend on \(\theta\).
  • Condition 1 allows us to perform step 2 of the method.
  • Condition 2 allows us to perform step 3 of the method.

The easiest way to clarify the method is to apply it to some examples.

2 Confidence Interval for the Mean (or Proportion)

For a confidence interval for the mean of a single population, we discuss examples where the population distribution is Bernoulli, Poisson, Normal or Exponential.


2.1 Example 1: \(X \sim N(\mu,\sigma^2)\)

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Normal distributions. We are interested in finding a \(95\%\) confidence interval for the unknown population mean \(\mu\).

Step 1: Turn \(\hat{\mu}\) into a pivotal quantity \(PQ(data, \mu):\)

The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on \(\mu\). It therefore makes sense to consider the sample average (this is the Method of Moments estimator of \(\mu\), as well as the Maximum Likelihood estimator of \(\mu\)). We know that linear combinations of Normals are Normal, so that we know the RSD of \(\bar{X}\):

\[ \bar{X} \sim N \left( \mu, \frac{\sigma^2}{n}\right). \] The quantity \(\bar{X}\) is not a pivotal quantity, for two reasons:

  • The RSD of \(\bar{X}\) is not completely known: it depends on the unknown parameters \(\mu\) and \(\sigma^2\).
  • \(\bar{X}\) does depend on the data, but does not depend on \(\mu\).

The first problem can be solved by standardising. Let’s try the following:

\[ \frac{\bar{X} - \mu}{ \sqrt{ \frac{\sigma^2}{n}}} \sim N(0,1). \]

The quantity on the left-hand side satisfies the first condition of a pivotal quantity: the standard normal distribution is completely known (i.e. has no unknown parameters). The quantity itself does not satisfy the second condition: it depends on \(\mu\), but also on the unknown parameter \(\sigma^2\).

We can solve this problem by standardising with the estimated variance \(S^{\prime 2}\) (instead of \(\sigma^2\)):

\[ PQ = \frac{\bar{X} - \mu}{ \sqrt{ \frac{S^{\prime 2}}{n}}} \sim t_{n-1}. \]

We have obtained a pivotal quantity: \(PQ\) itself depends on \(\mu\) and not on any other unknown parameters. Moreover, the RSD of PQ is completely known. We can progress to step two.

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

We need to find the values \(a\) and \(b\) such that \(P(a<PQ<b)=0.95\). In other words, we need to find the following:

  • \(a = q^{t_{n-1}}_{0.025}\): the \(2.5\%\) quantile of the \(t\) distribution with \(n-1\) degrees of freedom.
  • \(b = q^{t_{n-1}}_{0.975}\): the \(97.5\%\) quantile of the \(t\) distribution with \(n-1\) degrees of freedom.

These are easily found using the statistical tables or by using R. If, for example, we have a sample of size \(n=50\):

qt(0.025, df=49)
## [1] -2.009575
qt(0.975, df=49)
## [1] 2.009575

As the \(t\) distribution is symmetric, it would have sufficed to calculate only one of the two quantiles.

We now have the following true statement to work with:

\[ P\left( q^{t_{n-1}}_{0.025} < \frac{\bar{X} - \mu}{ \sqrt{ \frac{S^{\prime 2}}{n}}} < q^{t_{n-1}}_{0.975} \right) = 0.95. \]

Step 3: Rewrite the expression to \(\Pr(..< \mu < ..)=0.95\).

Using a sample of size \(n=50\), we can calculate the values of \(\bar{X}\) and \(S^{\prime 2}\). Our pivotal quantity that appears in the middle of our probability statement therefore only depends on \(\mu\). This is by construction, as it was one of the conditions for a pivotal quantity. As we saw, the quantiles are also easily determined.

We set out to obtain a statement like \(P(\ldots < \mu < \ldots) = 0.95\). To get there, we only need to rewrite the expression inside the probability:

\[\begin{align} &P\left( q^{t_{n-1}}_{0.025} < \frac{\bar{X} - \mu}{ \sqrt{ \frac{S^{\prime 2}}{n}}} < q^{t_{n-1}}_{0.975} \right) = 0.95 \iff \\ &P\left( q^{t_{n-1}}_{0.025} \sqrt{ \frac{S^{\prime 2}}{n}} < \bar{X} - \mu < q^{t_{n-1}}_{0.975} \sqrt{ \frac{S^{\prime 2}}{n}}\right) = 0.95 \iff \\ &P\left( -\bar{X} + q^{t_{n-1}}_{0.025} \sqrt{ \frac{S^{\prime 2}}{n}} < - \mu < -\bar{X} + q^{t_{n-1}}_{0.975} \sqrt{ \frac{S^{\prime 2}}{n}}\right) = 0.95 \iff \\ &P\left( \bar{X} - q^{t_{n-1}}_{0.975} \sqrt{ \frac{S^{\prime 2}}{n}} < \mu < \bar{X} - q^{t_{n-1}}_{0.025} \sqrt{ \frac{S^{\prime 2}}{n}}\right) = 0.95 \iff \\ &P\left( \bar{X} - q^{t_{n-1}}_{0.975} \sqrt{ \frac{S^{\prime 2}}{n}} < \mu < \bar{X} + q^{t_{n-1}}_{0.975} \sqrt{ \frac{S^{\prime 2}}{n}}\right) = 0.95. \end{align}\]

Note that \(a<-\mu<b\) is equivalent to \(-b<\mu<-a\) and that \(- q^{t_{n-1}}_{0.025}= q^{t_{n-1}}_{0.975}\), as the \(t\) distribution is symmetric. We can now use the values of \(\bar{X}\) and \(S^{\prime 2}\) and the values of the quantiles to obtain the \(95\%\) confidence interval for \(\mu\).

For illustration purposes, let’s draw a random sample of size \(50\) from \(N(\mu=5, \sigma^2=100)\):

set.seed(12345)
data <- rnorm(50, mean=5, sd=10)
data
##  [1]  10.8552882  12.0946602   3.9069669   0.4650283  11.0588746 -13.1795597
##  [7]  11.3009855   2.2381589   2.1584026  -4.1932200   3.8375219  23.1731204
## [13]   8.7062786  10.2021646  -2.5053199  13.1689984  -3.8635752   1.6842241
## [19]  16.2071265   7.9872370  12.7962192  19.5578508  -1.4432843 -10.5313741
## [25] -10.9770952  23.0509752   0.1835264  11.2037980  11.1212349   3.3768902
## [31]  13.1187318  26.9683355  25.4919034  21.3244564   7.5427119   9.9118828
## [37]   1.7591342 -11.6205024  22.6773385   5.2580105  16.2851083 -18.8035806
## [43]  -5.6026555  14.3714054  13.5445172  19.6072940  -9.1309878  10.6740325
## [49]  10.8318765  -8.0679883

We calculate the sample average and the adjusted sample variance:

xbar <- mean(data)
xbar
## [1] 6.795663
S2 <- var(data)
S2
## [1] 120.2501

We calculate the quantiles (it turns out that we only need \(b\)):

b <- qt(0.975, df=49)
b
## [1] 2.009575

The confidence interval is:

left <- xbar - b * sqrt( S2/50 )
left
## [1] 3.6792
right <- xbar + b * sqrt( S2/50 )
right
## [1] 9.912126

We conclude that \(P(3.68 < \mu < 9.91) = 0.95\). The \(95\%\) confidence interval is \((3.68,9.91)\). Because we simulated the data ourselves, we know that the true value of \(\mu\) is equal to \(5\). In practice, one would not know this.

The t.test command can do the calculations for us:

t.test(data, conf.level=0.95)$"conf.int"
## [1] 3.679200 9.912126
## attr(,"conf.level")
## [1] 0.95

A \(99\%\) confidence interval for \(\mu\) would be wider:

t.test(data, conf.level=0.99)$"conf.int"
## [1]  2.639575 10.951750
## attr(,"conf.level")
## [1] 0.99

This occurs because \(q_{0.995}^{t_{n-1}}\) is larger than \(q_{0.975}^{t_{n-1}}\):

qt(0.995, df=49)
## [1] 2.679952

2.1.1 Approximation using CLT

We are considering the situation where we either do not know anything about the population distribution, or we assume that the population distribution is a distribution that is not Normal (e.g. Bernoulli(\(p\))). We are interested in finding an approximate \(95\%\) confidence interval for the unknown population mean \(\mu\) (or \(p\), in the Bernoulli case).

The narrative does not have to be changed much:

Step 1: Turn \(\hat{\mu}\) into a pivotal quantity \(PQ(data, \mu)\):

The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on \(\mu\). It therefore makes sense to consider the sample average. We know, from the central limit theorem, that

\[ \bar{X} \overset{approx}{\sim} N \left( \mu, \frac{\sigma^2}{n}\right). \]

We can obtain a pivotal quantity by standardising with the estimated variance:

\[ PQ = \frac{\bar{X} - \mu}{ \sqrt{ \frac{S^{\prime 2}}{n}}} \overset{approx}{\sim} N(0,1). \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

The \(97.5\%\) quantile of the standard normal distribution is:

b2 <- qnorm(0.975, mean=0, sd=1)
b2
## [1] 1.959964

We can therefore say that the following probability statement is correct:

\[ P\left( q^{N(0,1)}_{0.025} < \frac{\bar{X} - \mu}{ \sqrt{ \frac{S^{\prime 2}}{n}}} < q^{N(0,1)}_{0.975} \right) = 0.95. \]

Step 3: Rewrite the expression to \(\Pr(..< \mu < ..)=0.95\).

Using the same steps as before, we obtain:

\[\begin{align} &P\left( q^{N(0,1)}_{0.025} < \frac{\bar{X} - \mu}{ \sqrt{ \frac{S^{\prime 2}}{n}}} < q^{N(0,1)}_{0.975} \right) = 0.95 \iff \\ &P\left( \bar{X} - q^{N(0,1)}_{0.975} \sqrt{ \frac{S^{\prime 2}}{n}} < \mu < \bar{X} + q^{N(0,1)}_{0.975} \sqrt{ \frac{S^{\prime 2}}{n}}\right) = 0.95. \end{align}\]

Hence, without assuming that the population distribution is Normal, we obtain the following confidence interval:

c(xbar - b2 * sqrt(S2/50), xbar + b2 * sqrt(S2/50))
## [1] 3.756137 9.835188

That is, we conclude that \(P(3.76 < \mu < 9.84) = 0.95\). This interval is slightly narrower than the \(t\)-based interval, because the normal quantile \(1.96\) is smaller than the \(t\) quantile \(2.01\).
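A quick check of this width difference (a sketch, reusing the S2 calculated above): the two intervals differ only through the quantiles.

# t-based width minus normal-based width (both standardise with S2)
2 * (qt(0.975, df=49) - qnorm(0.975)) * sqrt(S2/50)   # positive: the t interval is wider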

2.2 Example 2: \(X \sim Poisson(\lambda)\)

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Poisson(\(\lambda\)) distributions. We are interested in finding a \(95\%\) confidence interval for the unknown population mean \(\lambda\).

Step 1: Turn \(\hat{\lambda}\) into a pivotal quantity \(PQ(data, \lambda)\):

The method requires us to find a pivotal quantity PQ. One requirement is that PQ depends on \(\lambda\). It therefore makes sense to consider the sample average (this is the Method of Moments estimator of \(\lambda\), as well as the Maximum Likelihood estimator of \(\lambda\)). We do not know the repeated sampling distribution of \(\bar{X}\) in this case. All we know is that

\[ \sum_{i=1}^n X_i \sim Poisson(n \lambda). \]

This is not a pivotal quantity: the RSD of the sum is not completely known and the sum itself does not depend on \(\lambda\). We cannot standardise to get an RSD that is completely known (as we could with the Normal distribution):

\[ \frac{\sum_{i=1}^n X_i - n \lambda}{\sqrt{n\lambda}} \sim \;\; ? \]

Without a pivotal quantity, we cannot proceed to step 2. We are going to have to use a large sample approximation.

2.2.1 Approximation using CLT

We are now interested in finding an approximate \(95\%\) confidence interval for the unknown population mean \(\lambda\).

Step 1: Turn \(\hat{\lambda}\) into a pivotal quantity \(PQ(data, \lambda)\):

As \(E[X]=Var[X]=\lambda\) for Poisson(\(\lambda\)) distributions, the central limit theorem states that

\[ \hat{\lambda} = \bar{X} \overset{approx}{\sim} N \left( \lambda, \frac{\lambda}{n} \right). \]

We can now standardise:

\[ \frac{\bar{X} - \lambda}{ \sqrt{ \frac{\lambda}{n} }} \overset{approx}{\sim} N \left( 0, 1 \right). \]

This is a pivotal quantity. However, we can also standardise with the estimated variance:

\[ \frac{\bar{X} - \lambda}{ \sqrt{ \frac{\bar{X}}{n} }} \overset{approx}{\sim} N \left( 0, 1 \right). \] This is also a pivotal quantity. Looking ahead to step 3, this pivotal quantity will be easier to work with.

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

qnorm(0.975)
## [1] 1.959964

The required quantile is \(q^{N(0,1)}_{0.975}=1.96\). The probability statement is equal to

\[ P \left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \lambda}{ \sqrt{ \frac{\bar{X}}{n} }} < q^{N(0,1)}_{0.975} \right) = 0.95. \]

Step 3: Rewrite the expression to \(\Pr(..< \lambda < ..)=0.95\).

Rewriting this statement gives

\[\begin{align} &P \left(q^{N(0,1)}_{0.025} < \frac{\bar{X} - \lambda}{ \sqrt{ \frac{\bar{X}}{n} }} < q^{N(0,1)}_{0.975} \right) = 0.95 \iff \\ &P \left(q^{N(0,1)}_{0.025} \sqrt{ \frac{\bar{X}}{n} } < \bar{X} - \lambda < q^{N(0,1)}_{0.975} \sqrt{ \frac{\bar{X}}{n} } \right) = 0.95 \iff \\ &P \left( -\bar{X} + q^{N(0,1)}_{0.025} \sqrt{ \frac{\bar{X}}{n} } < - \lambda < -\bar{X} + q^{N(0,1)}_{0.975} \sqrt{ \frac{\bar{X}}{n} } \right) = 0.95 \iff \\ &P \left( \bar{X} - q^{N(0,1)}_{0.975} \sqrt{ \frac{\bar{X}}{n} } < \lambda < \bar{X} - q^{N(0,1)}_{0.025} \sqrt{ \frac{\bar{X}}{n} } \right) = 0.95 \iff \\ &P \left( \bar{X} - q^{N(0,1)}_{0.975} \sqrt{ \frac{\bar{X}}{n} } < \lambda < \bar{X} + q^{N(0,1)}_{0.975} \sqrt{ \frac{\bar{X}}{n} } \right) = 0.95. \end{align}\]

As an example, consider a random sample of size \(n=25\) from the Poisson distribution (we take \(\lambda=10\)):

set.seed(12345)
data <- rpois(25, lambda=10)
data
##  [1] 11 12  9  8 11  4 11  7  8 11  9 10  9 11 13 10 12 14  7  4 15  8 11 11  9

We calculate the sample average and the \(97.5\%\) quantile:

xbar <- mean(data)
xbar
## [1] 9.8
b <- qnorm(0.975)
b
## [1] 1.959964

The approximate \(95\%\) confidence interval is:

c( xbar - b * sqrt(xbar/25) , xbar + b * sqrt(xbar/25))
## [1]  8.572868 11.027132

We conclude that \(P(8.57 < \lambda < 11.03) = 0.95\).
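As a cross-check (a sketch; output not shown, and the result will be close to, but not identical to, the CLT interval), R’s poisson.test command computes an exact interval for \(\lambda\) from the total count:

# Exact interval for lambda, treating the sample as 25 observation periods
poisson.test(sum(data), T=25, conf.level=0.95)$conf.int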


2.3 Example 3: \(X \sim Exp(\lambda)\)

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Exponential distributions. We could be interested in finding a \(95\%\) confidence interval for the parameter \(\lambda\), or we could be interested in finding a \(95\%\) confidence interval for the mean \(\frac{1}{\lambda}\).

We focus on the first question. The parameter \(\lambda\) can be estimated by \(\hat{\lambda}=\frac{1}{\bar{X}}\), the Method of Moments and Maximum Likelihood estimator.

Step 1: Turn \(\hat{\lambda}\) into a pivotal quantity \(PQ(data, \lambda)\):

If we have a random sample from the Exponential distribution, all we know is that

\[ \sum_{i=1}^n X_i \sim Gamma(n, \lambda). \] This result does not allow us to find the exact RSD of \(\hat{\lambda}=\frac{1}{\bar{X}}\), let alone use this to find a pivotal quantity. We will therefore have to resort to a large sample approximation.

2.3.1 Approximation using CLT

If we have a random sample from the Exponential distribution, which has \(E[X]=\frac{1}{\lambda}\) and \(Var[X]=\frac{1}{\lambda^2}\), the CLT states that

\[ \bar{X} \overset{approx}{\sim} N \left( \frac{1}{\lambda}, \frac{1}{n \lambda^2} \right). \]

Our estimator \(\hat{\lambda}=\frac{1}{\bar{X}}\) is a nonlinear function of an approximate Normal. We can therefore use the Delta Method discussed in section 1.6 of the Estimators chapter, to obtain:

\[ \hat{\lambda} = \frac{1}{\bar{X}} \overset{approx}{\sim} N \left( \lambda, \frac{\lambda^4}{n \lambda^2} \right) = N \left( \lambda, \frac{\lambda^2}{n} \right). \] This is not a pivotal quantity.
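For completeness, the Delta Method step: with \(g(x)=\frac{1}{x}\) and \(g'(x)=-\frac{1}{x^2}\), evaluated at \(E[\bar{X}]=\frac{1}{\lambda}\),

\[ Var\left[ \frac{1}{\bar{X}} \right] \approx g'\left( \frac{1}{\lambda} \right)^2 Var[\bar{X}] = \left( -\lambda^2 \right)^2 \frac{1}{n\lambda^2} = \frac{\lambda^2}{n}. \]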

Step 1: Turn \(\hat{\lambda}\) into a pivotal quantity \(PQ(data, \lambda)\):

We can standardise to obtain a quantity with an approximately standard normal distribution:

\[ \frac{ \frac{1}{\bar{X}} - \lambda }{ \sqrt{ \frac{\lambda^2}{n} } } \overset{approx}{\sim} N(0,1). \] Looking ahead to step 3, however, we can obtain a pivotal quantity that is easier to work with by standardising with the estimated variance:

\[ \frac{ \frac{1}{\bar{X}} - \lambda }{ \sqrt{ \frac{1}{n\bar{X}^2} } } \overset{approx}{\sim} N(0,1). \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

As before, the \(97.5\%\) quantile of \(N(0,1)\) is (more or less) 1.96.

We can make the following probability statement:

\[ P \left(q^{N(0,1)}_{0.025} < \frac{ \frac{1}{\bar{X}} - \lambda }{ \sqrt{ \frac{1}{n\bar{X}^2} } } < q^{N(0,1)}_{0.975} \right) = 0.95. \]

Step 3: Rewrite the expression to \(\Pr(..< \lambda < ..)=0.95\).

Rewriting this statement (in the usual way) gives

\[ P \left( \frac{1}{\bar{X}} - q^{N(0,1)}_{0.975} \sqrt{ \frac{1}{n\bar{X}^2} } < \lambda < \frac{1}{\bar{X}} + q^{N(0,1)}_{0.975} \sqrt{ \frac{1}{n\bar{X}^2} } \right) = 0.95. \]

A confidence interval for the mean \(\frac{1}{\lambda}\) would start from

\[ \bar{X} \overset{approx}{\sim} N \left( \frac{1}{\lambda}, \frac{1}{n \lambda^2} \right). \]

Standardising with the estimated variance gives

\[ \frac{\bar{X} - \frac{1}{\lambda}}{ \sqrt{ \frac{ \bar{X}^2 }{n} }} \overset{approx}{\sim} N(0,1). \]

Step 3 then gives

\[ P \left( \bar{X} - q^{N(0,1)}_{0.975} \sqrt{ \frac{\bar{X}^2}{n} } < \frac{1}{\lambda} < \bar{X} + q^{N(0,1)}_{0.975} \sqrt{ \frac{\bar{X}^2}{n} } \right) = 0.95. \]

As an example, we draw a random sample of size \(n=30\) from the \(Exp(\lambda=2)\) distribution:

set.seed(12345)
data <- rexp(30, rate=2)
data
##  [1] 0.220903896 0.263736378 0.404233657 0.009224336 0.227705254 0.011969070
##  [7] 3.201094619 0.628980261 0.481094044 0.909001520 0.113243755 0.223381456
## [13] 0.626230560 0.198543639 0.044057864 0.861082101 2.692431711 0.943996964
## [19] 0.181833623 0.586781837 0.569824730 0.213983048 0.729273719 0.282193281
## [25] 0.614559271 0.365936084 0.489022937 1.401355220 0.125542797 0.109180467

The approximate \(95\%\) confidence interval for the parameter \(\lambda\) is:

n <- 30
xbar <- mean(data)
b <- qnorm(0.975)
c(1/xbar - b * sqrt( 1/(n*(xbar^2)) ), 1/xbar + b * sqrt( 1/(n*(xbar^2))) )
## [1] 1.086543 2.297476

The approximate \(95\%\) confidence interval for the mean of the population distribution \(\frac{1}{\lambda}\) is:

c(xbar - b * sqrt( xbar^2 /n)  , xbar + b * sqrt( xbar^2 /n))
## [1] 0.3795258 0.8025008

2.4 Example 4: \(X \sim Bernoulli(p)\)

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Bernoulli distributions. We are interested in finding the \(95\%\) confidence interval for the mean (or: population proportion) \(p\).

Both the Method of Moments estimator and the Maximum Likelihood estimator for \(p\) are equal to the sample average.

Step 1: Turn \(\hat{p}\) into a pivotal quantity \(PQ(data, p)\):

When sampling from the Bernoulli(\(p\)) distribution, we only have the following result:

\[ \sum_{i=1}^n X_i \sim Binomial(n,p). \] We can find the exact distribution by observing that

\[ P \left( \sum_{i=1}^n X_i = k \right) = P \left( \bar{X}=\frac{k}{n} \right) = \binom{n}{k}~p^k (1-p)^{n-k}\text{ for } k=0,1,\ldots,n. \]

The exact RSD of \(\bar{X}\) has the Binomial probabilities, but assigned to \(0, \frac{1}{n}, \frac{2}{n}, \dots , 1\) instead of \(0,1,2,\ldots, n\). The quantity \(\sum_{i=1}^n X_i\) is not pivotal, and we can’t standardise to make it so:

\[ \frac{\sum_{i=1}^n X_i - n p}{\sqrt{np(1-p)}} \sim \;\; ? \]

We will have to resort to a large sample approximation.

2.4.1 Approximation using CLT

We are interested in finding an approximate \(95\%\) confidence interval for the population proportion \(p\).

Step 1: Turn \(\hat{p}\) into a pivotal quantity \(PQ(data, p)\):

If \(X \sim Bernoulli(p)\), we know that \(E[X]=p\) and \(Var[X]=p(1-p)\). The central limit theorem then states that

\[ \bar{X} \overset{approx}{\sim} N \left( p, \frac{p(1-p)}{n} \right). \] This is not a pivotal quantity. Standardising gives:

\[ \frac{\bar{X} - p}{ \sqrt{\frac{p(1-p)}{n}} } \overset{approx}{\sim} N(0,1). \]

Although this is a pivotal quantity, we prefer to standardise with the estimated variance:

\[ \frac{\bar{X} - p}{ \sqrt{\frac{\bar{X}(1-\bar{X})}{n}} } \overset{approx}{\sim} N(0,1), \] as this will simplify step 3 of the method.

The probability statement is:

\[ P \left( q^{N(0,1)}_{0.025} < \frac{\bar{X} - p}{\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}} < q^{N(0,1)}_{0.975} \right) = 0.95. \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

As before, the \(97.5\%\) quantile of \(N(0,1)\) is about \(1.96\).

Step 3: Rewrite the expression to \(\Pr(..< p < ..)=0.95\).

We can rewrite this statement in the usual way, to obtain

\[ P \left( \bar{X} - q^{N(0,1)}_{0.975} \sqrt{\frac{\bar{X}(1-\bar{X})}{n}} < p < \bar{X} + q^{N(0,1)}_{0.975} \sqrt{\frac{\bar{X}(1-\bar{X})}{n}} \right) = 0.95. \]

As an example, we draw a random sample of size \(40\) from a Bernoulli(\(p=0.3\)) population:

set.seed(123456)
data <- rbinom(40, size=1, prob=0.3)
data
##  [1] 1 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1
## [39] 1 0

We calculate the sample average and the quantile:

xbar <- mean(data)
xbar
## [1] 0.475
b <- qnorm(0.975)

The approximate \(95\%\) confidence interval for \(p\) is:

c( xbar - b * (sqrt(xbar*(1-xbar)/40)), xbar + b * (sqrt(xbar*(1-xbar)/40)) )
## [1] 0.320245 0.629755

Notice that, using this particular sample, the true value \(p=0.3\) is not in the interval. We were unlucky, as this happens only in \(5\%\) of the (infinitely many) possible repeated samples. Each repeated sample gives a confidence interval, and in \(95\%\) of the cases, this interval contains the true value. This is precisely how a confidence interval should be interpreted.
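For comparison (a sketch; output not shown), R’s prop.test command computes a related approximate interval for \(p\), based on the score method with a continuity correction, so its endpoints will differ slightly from the interval above:

# Score-based interval for p from the number of successes out of 40
prop.test(sum(data), n=40, conf.level=0.95)$conf.int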



3 Confidence Interval for the Difference in Means (or Proportions)

For a confidence interval for the difference in means of two populations, we discuss examples where the population distributions are Normal and Bernoulli.

Suppose that we have two random samples. The first random sample (\(X_{11}, \ldots X_{1 n_1}\)) is a sample of incomes from Portugal and has size \(n_1\). The second random sample (\(X_{21}, \ldots X_{2 n_2}\)) is a sample of incomes from England and has size \(n_2\). We want to construct a \(95\%\) confidence interval of the difference in means: \(\mu_1 - \mu_2\).


3.1 Example 5: \(X_1 \sim N(\mu_1,\sigma^2_1)\) and \(X_2 \sim N(\mu_2,\sigma^2_2)\)

Suppose that we are willing to assume that both income distributions are members of the Normal family of distributions.

Step 1: Turn \(\bar{X}_1 - \bar{X}_2\) into a pivotal quantity \(PQ(data, \mu_1 - \mu_2)\):

From example \(5\) of the RSD chapter, we know that:

\[\bar{X}_1 - \bar{X}_2 \sim N \left(\mu_1 - \mu_2, \frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2} \right).\] This is not a pivotal quantity. To obtain a pivotal quantity, we need to standardise. How we standardise will depend on whether or not \(\sigma^2_1\) and \(\sigma^2_2\) are known. If they are known, we will denote them by \(\sigma^2_{10}\) and \(\sigma^2_{20}\).

3.1.1 Variances Known: \(\sigma^2_{10}\) and \(\sigma^2_{20}\)

If the variances are known, we can standardise using the known variances and obtain a repeated sampling distribution that is standard-normal:

\[ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 -\mu_2)}{\sqrt{\frac{\sigma^2_{10}}{n_1} + \frac{\sigma^2_{20}}{n_2}}} \sim N(0,1). \]

This is a pivotal quantity. The probability statement is:

\[ P \left( q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 -\mu_2)}{\sqrt{\frac{\sigma^2_{10}}{n_1} + \frac{\sigma^2_{20}}{n_2}}} < q^{N(0,1)}_{0.975} \right) = 0.95. \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

The \(97.5\%\) quantile of \(N(0,1)\) is about \(1.96\).

Step 3: Rewrite the expression to \(\Pr(..< \mu_1 - \mu_2 < ..)=0.95\).

We can rewrite the probability statement in the usual way:

\[ P \left( (\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975} \sqrt{\frac{\sigma^2_{10}}{n_1} + \frac{\sigma^2_{20}}{n_2}} < \mu_1 -\mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975} \sqrt{\frac{\sigma^2_{10}}{n_1} + \frac{\sigma^2_{20}}{n_2}} \right) = 0.95. \]

As an example, we draw a random sample of size \(n_1=50\) from \(N(\mu_1=10, \sigma^2_1=100)\) and another random sample of size \(n_2=60\) from \(N(\mu_2=15, \sigma_2^2=81)\). We know that the distributions are Normal and that \(\sigma^2_1=100\) and \(\sigma^2_2=81\). We do not know the values of \(\mu_1\) and \(\mu_2\).

set.seed(12345)
data1 <- rnorm(50, mean=10, sd=10)
data1
##  [1]  15.8552882  17.0946602   8.9069669   5.4650283  16.0588746  -8.1795597
##  [7]  16.3009855   7.2381589   7.1584026   0.8067800   8.8375219  28.1731204
## [13]  13.7062786  15.2021646   2.4946801  18.1689984   1.1364248   6.6842241
## [19]  21.2071265  12.9872370  17.7962192  24.5578508   3.5567157  -5.5313741
## [25]  -5.9770952  28.0509752   5.1835264  16.2037980  16.1212349   8.3768902
## [31]  18.1187318  31.9683355  30.4919034  26.3244564  12.5427119  14.9118828
## [37]   6.7591342  -6.6205024  27.6773385  10.2580105  21.2851083 -13.8035806
## [43]  -0.6026555  19.3714054  18.5445172  24.6072940  -4.1309878  15.6740325
## [49]  15.8318765  -3.0679883
data2 <- rnorm(60, mean=15, sd=9)
data2
##  [1] 10.136525 32.529234 15.482312 18.164966  8.961211 17.501583 21.220541
##  [8] 22.414158 34.305585 -6.122496 16.346328  2.917217 19.979728 29.309666
## [15]  9.718084 -1.491396 22.993255 29.341396 19.651692  3.338955 15.491540
## [22]  7.938156  5.555825 35.974608 27.624348 23.483408 22.436325  7.696136
## [29] 19.286235 24.191326 20.808448 24.388292 12.260678 37.293998 23.740986
## [36] 31.803893 21.048382 12.228420 19.828713 22.423831  6.324887  7.304257
## [43] 31.982522 11.473626  6.174303 21.185989 10.454608 34.419478  9.601822
## [50]  8.749080 17.015329  4.593990 18.801767  3.077203 16.269759 10.175568
## [57] 12.195545 29.004987 10.967700 17.890112

We calculate the sample averages:

xbar1 <- mean(data1)
xbar1
## [1] 11.79566
xbar2 <- mean(data2)
xbar2
## [1] 17.16441
b <- qnorm(0.975)

The \(95\%\) confidence interval for \(\mu_1-\mu_2\) is:

c( (xbar1 - xbar2) - b * (sqrt(100/50 + 81/60 )), (xbar1 - xbar2) + b * (sqrt(100/50 + 81/60 )) )
## [1] -8.956071 -1.781425

The true difference in means is equal to \(-5\).

3.1.2 Variances Unknown but the Same: \(\sigma^2_{1}=\sigma^2_{2}=\sigma^2\)

If the variances \(\sigma^2_1\) and \(\sigma^2_2\) are not known, we need to estimate them. If \(\sigma^2_1\) and \(\sigma^2_2\) are unknown, but known to be equal (to the number \(\sigma^2\), say), we can estimate the single unknown variance using only the first sample or using only the second sample. It would be better, however, to use the information contained in both samples. As discussed in example \(5\) of the RSD chapter, we can use the pooled adjusted sample variance \(S^{\prime 2}_p\), leading to:

\[ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 -\mu_2)}{\sqrt{ S^{\prime 2}_p \left\{ \frac{1}{n_1} + \frac{1}{n_2}\right\} }} \sim t_{n_1 + n_2 -2}. \]

This is a pivotal quantity. The probability statement is:

\[ P \left( q^{t_{n_1+n_2-2}}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 -\mu_2)}{\sqrt{ S^{\prime 2}_p \left\{ \frac{1}{n_1} + \frac{1}{n_2}\right\} }} < q^{t_{n_1+n_2-2}}_{0.975} \right) = 0.95. \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

In our example, \(n_1=50\) and \(n_2=60\). The \(97.5\%\) quantile of the \(t_{108}\) distribution is:

q2 <- qt(0.975, df=108)
q2
## [1] 1.982173

Step 3: Rewrite the expression to \(\Pr(..< \mu_1 - \mu_2 < ..)=0.95\).

We can rewrite the expression in the usual way:

\[ P \left( (\bar{X}_1 - \bar{X}_2) - q^{t_{n_1+n_2-2}}_{0.975} \sqrt{ S^{\prime 2}_p \left\{ \frac{1}{n_1} + \frac{1}{n_2}\right\}} < \mu_1 -\mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{t_{n_1+n_2-2}}_{0.975} \sqrt{S^{\prime 2}_p \left\{ \frac{1}{n_1} + \frac{1}{n_2}\right\}} \right) = 0.95. \]

As an example, we again take a random sample of size \(n_1=50\) from \(N(\mu_1=10, \sigma^2_1=100)\) and another random sample of size \(n_2=60\) from \(N(\mu_2=15, \sigma^2_2=100)\). Note that, this time, the variances are the same, but we do not know their value.

set.seed(12345)
data1 <- rnorm(50, mean=10, sd=10)
data2 <- rnorm(60, mean=15, sd=10)
xbar1 <- mean(data1)
xbar2 <- mean(data2)
S2_1 <- var(data1)
S2_1
## [1] 120.2501
S2_2 <- var(data2)
S2_2
## [1] 121.1479

We obtain the following confidence interval:

S2_p <- (49*S2_1 + 59*S2_2)/(50 + 60 - 2)
S2_p
## [1] 120.7405
c( (xbar1 - xbar2) - q2 * sqrt( S2_p*(1/50 + 1/60) ), (xbar1 - xbar2) + q2 * sqrt( S2_p*(1/50 + 1/60) ) )
## [1] -9.779888 -1.438586

Not knowing the variances (but knowing that they are the same) made the confidence interval wider.
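The same interval can be obtained directly with t.test (a sketch, using data1 and data2 as generated above):

# var.equal=TRUE uses the pooled variance and the t distribution with n1+n2-2 df
t.test(data1, data2, var.equal=TRUE, conf.level=0.95)$conf.int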

3.1.3 Variances Unknown, Approximation Using CLT

If the variances \(\sigma^2_1\) and \(\sigma^2_2\) are unknown and not known to be the same, we have to estimate both of them. Standardising with two estimated variances does not lead to a \(t\) distribution, so we have no exact RSD. We can approximate the RSD using the Central Limit Theorem:

\[\bar{X}_1 - \bar{X}_2 \overset{approx}{\sim} N \left(\mu_1 - \mu_2, \frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2} \right).\]

Standardising with two estimated variances gives:

\[ \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 -\mu_2)}{\sqrt{\frac{S^{\prime 2}_1}{n_1} + \frac{S^{\prime 2}_2}{n_2}}} \overset{approx}{\sim} N(0,1). \]

This is a pivotal quantity.

The probability statement is:

\[ P \left( q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 -\mu_2)}{\sqrt{\frac{S^{\prime 2}_1}{n_1} + \frac{S^{\prime 2}_2}{n_2}}} < q^{N(0,1)}_{0.975} \right) = 0.95. \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

The \(97.5\%\) quantile of \(N(0,1)\) is about \(1.96\).

Step 3: Rewrite the expression to \(\Pr(..< \mu_1 - \mu_2 < ..)=0.95\).

Rewriting in the usual way gives:

\[ P \left( (\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975} \sqrt{ \frac{S^{\prime 2}_1}{n_1} + \frac{S^{\prime 2}_2}{n_2}} < \mu_1 -\mu_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975} \sqrt{ \frac{S^{\prime 2}_1}{n_1} + \frac{S^{\prime 2}_2}{n_2}} \right) = 0.95. \]

We generate new data according to our first example (without resetting the seed, so the draws differ from those shown earlier):

data1 <- rnorm(50, mean=10, sd=10)
xbar1 <- mean(data1)
data2 <- rnorm(60, mean=15, sd=9)
xbar2 <- mean(data2)

In this case, we do not know \(\sigma^2_1\) and \(\sigma^2_2\), so we estimate them:

S2_1 <- var(data1)
S2_1
## [1] 127.94
S2_2 <- var(data2)
S2_2
## [1] 74.59016

The \(95\%\) confidence interval is now:

b <- qnorm(0.975)
c( (xbar1 - xbar2) - b * sqrt(S2_1/50 + S2_2/60), (xbar1 - xbar2) + b * sqrt(S2_1/50 + S2_2/60)  )
## [1] -9.588274 -1.944947
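For comparison (a sketch; output not shown), the default t.test interval is the Welch interval: it uses a \(t\) quantile with estimated degrees of freedom instead of the normal quantile, and is therefore slightly wider:

# Welch interval: same centre, t quantile with Welch-Satterthwaite df
t.test(data1, data2, conf.level=0.95)$conf.int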

3.2 Example 6: \(X_1 \sim Bernoulli(p_1)\) and \(X_2 \sim Bernoulli(p_2)\)

We are now assuming that \(X_1\) and \(X_2\) both have Bernoulli distributions, with \(E[X_1]=p_1\), \(Var[X_1]=p_1(1-p_1)\) and \(E[X_2]=p_2\), \(Var[X_2]=p_2(1-p_2)\), respectively. For example, \(p_1\) could represent the unemployment rate in Portugal and \(p_2\) the unemployment rate in England. We want to obtain a \(95\%\) confidence interval for the difference in unemployment rates: \(p_1 - p_2\).

Step 1: Turn \(\bar{X}_1 - \bar{X}_2\) into a pivotal quantity \(PQ(data, p_1 - p_2)\):

We base our analysis of \(p_1-p_2\) on \(\bar{X}_1 - \bar{X}_2\), the difference between the fraction of unemployed in sample 1 and the fraction of unemployed in sample 2. The central limit theorem states that both sample averages have an RSD that is approximately Normal:

\[ \bar{X}_1 \overset{approx}{\sim} N \left( p_1, \frac{p_1(1-p_1)}{n_1} \right) \text{ and } \bar{X}_2 \overset{approx}{\sim} N \left( p_2, \frac{p_2(1-p_2)}{n_2} \right). \]

As \(\bar{X}_1 - \bar{X}_2\) is a linear combination of (approximately) Normals, its RSD is (approximately) Normal:

\[ \bar{X}_1 - \bar{X}_2 \overset{approx}{\sim} N \left( p_1 - p_2, \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2} \right). \]

This is not a pivotal quantity. Note that we have used that \(Cov[\bar{X}_1, \bar{X}_2]=0\): we assume that both (random) samples are drawn independently from each other.

Section 3.1.1 tells us how to standardise with known variances:

\[ \frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}} \overset{approx}{\sim} N(0,1). \] We can estimate \(p_1\) by \(\bar{X}_1\) and \(p_2\) by \(\bar{X}_2\). Using section 3.1.3 of the RSD chapter, standardising with the estimated variances gives:

\[ \frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}} \overset{approx}{\sim} N(0,1). \]

This is a pivotal quantity.

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

As before, the \(97.5\%\) quantile of \(N(0,1)\) is about \(1.96\). The probability statement is:

\[ P \left( q^{N(0,1)}_{0.025} < \frac{(\bar{X}_1 - \bar{X}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}}} < q^{N(0,1)}_{0.975}\right) = 0.95. \]

Step 3: Rewrite the expression to \(\Pr(..< p_1 - p_2 < ..)=0.95\).

We can rewrite the statement in the usual way:

\[ P \left( (\bar{X}_1 - \bar{X}_2) - q^{N(0,1)}_{0.975} \sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}} < p_1 - p_2 < (\bar{X}_1 - \bar{X}_2) + q^{N(0,1)}_{0.975} \sqrt{\frac{\bar{X}_1(1-\bar{X}_1)}{n_1} + \frac{\bar{X}_2(1-\bar{X}_2)}{n_2}} \right) = 0.95. \]

As an example, we draw a random sample of size \(n_1=50\) from Bernoulli(\(p_1=0.7\)), and another random sample of size \(n_2=40\) from Bernoulli(\(p_2=0.5\)):

set.seed(12345)
data1 <- rbinom(50, size=1, prob=0.7)
data1
##  [1] 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 0
## [39] 1 1 0 1 0 0 1 1 1 1 1 1
data2 <- rbinom(40, size=1, prob=0.5)
data2
##  [1] 1 1 0 0 1 0 1 0 0 0 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 1 1
## [39] 1 0

We calculate the sample averages:

xbar1 <- mean(data1)
xbar1
## [1] 0.68
xbar2 <- mean(data2)
xbar2
## [1] 0.55
diff <- xbar1 - xbar2
diff
## [1] 0.13

The approximate \(95\%\) confidence interval for \(p_1-p_2\) is equal to:

b <- qnorm(0.975)
c(  diff - b * sqrt(xbar1*(1-xbar1)/50 + xbar2*(1-xbar2)/40), diff + b * sqrt(xbar1*(1-xbar1)/50 + xbar2*(1-xbar2)/40))
## [1] -0.07121395  0.33121395

The true difference \(p_1-p_2\) is \(0.2\).
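As a cross-check (a sketch; output not shown), prop.test computes a related approximate interval for \(p_1-p_2\), using the score method with a continuity correction:

# Two-sample interval for p1 - p2 from the success counts
prop.test(c(sum(data1), sum(data2)), n=c(50, 40), conf.level=0.95)$conf.int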



4 Confidence Interval for the Variance

For a confidence interval for the variance of a single population, we only discuss the example where the population distribution is Normal.


4.1 Example 7: \(X \sim N(\mu,\sigma^2)\)

We are considering the situation where we are willing to assume that the population distribution is a member of the family of Normal distributions. We are now interested in constructing a \(95\%\) confidence interval for the variance of the population distribution: \(\sigma^2\). The natural estimator for the population variance is the adjusted sample variance

\[ S^{\prime 2} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \]

We do not know the repeated sampling distribution of \(S^{\prime 2}\).

Step 1: Turn \(S^{\prime 2}\) into a pivotal quantity \(PQ(data, \sigma^2)\):

Although we do not know the RSD of \(S^{\prime 2}\) itself, there is one thing that we do know:

\[\begin{align} \sum_{i=1}^n \left( \frac{X_i-\bar{X}}{\sigma} \right)^2 &= \frac{\sum_{i=1}^n (X_i-\bar{X})^2}{\sigma^2} \\ &= \frac{(n-1) S^{\prime 2}}{\sigma^2} \sim \chi^2_{n-1}. \end{align}\]

This follows because a sum of \(n\) squared standard-normals has a \(\chi^2_n\) distribution. The quantity \(\frac{(n-1) S^{\prime 2}}{\sigma^2}\) is exactly such a sum, except that \(\mu\) is estimated by \(\bar{X}\), which costs one degree of freedom and leaves \(\chi^2_{n-1}\). The quantity \(\frac{(n-1) S^{\prime 2}}{\sigma^2}\) is a pivotal quantity: the quantity itself only depends on our random sample and the parameter of interest \(\sigma^2\). Moreover, the \(\chi^2_{n-1}\) distribution contains no unknown parameters.

The probability statement is:

\[ P \left( q^{\chi^2_{n-1}}_{0.025} < \frac{(n-1) S^{\prime 2}}{\sigma^2} < q^{\chi^2_{n-1}}_{0.975} \right) = 0.95. \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

The Chi-squared distribution is not symmetric, so we need to calculate both quantiles separately. If we have a sample of size \(n=100\) from \(N(\mu=4, \sigma^2=16)\), we need the \(0.025\) and \(0.975\) quantiles of the \(\chi^2_{99}\) distribution:

a <- qchisq(0.025, df=99)
a
## [1] 73.36108
b <- qchisq(0.975, df=99)
b
## [1] 128.422

Step 3: Rewrite the expression to \(\Pr(..< \sigma^2 < ..)=0.95\).

\[ \begin{align} &P \left( q^{\chi^2_{n-1}}_{0.025} < \frac{(n-1) S^{\prime 2}}{\sigma^2} < q^{\chi^2_{n-1}}_{0.975} \right) = 0.95 \iff \\ &P \left( \frac{q^{\chi^2_{n-1}}_{0.025}}{(n-1) S^{\prime 2}} < \frac{1}{\sigma^2} < \frac{q^{\chi^2_{n-1}}_{0.975}}{(n-1) S^{\prime 2}} \right) = 0.95 \iff \\ &P \left( \frac{(n-1) S^{\prime 2}}{q^{\chi^2_{n-1}}_{0.975}} < \sigma^2 < \frac{(n-1) S^{\prime 2}}{q^{\chi^2_{n-1}}_{0.025}} \right) = 0.95, \end{align} \]

where we have used that \(a<\frac{1}{x}<b\) is equivalent to \(\frac{1}{b}<x<\frac{1}{a}\).

To apply this, we draw a sample of size \(n=100\) from \(N(\mu=4, \sigma^2=16)\) and calculate the adjusted sample variance:

set.seed(12345)
data <- rnorm(100, mean=4, sd=4)
S2 <- var(data)
S2
## [1] 19.882

The exact \(95\%\) confidence interval for the unknown population variance \(\sigma^2\) is:

c( 99*S2/b , 99*S2/a)
## [1] 15.32695 26.83055

The true variance is \(16\).
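The calculation is easy to wrap in a small helper function (a sketch; var_ci is our own name, not a built-in R function):

var_ci <- function(x, conf.level=0.95) {
  n <- length(x)
  alpha <- 1 - conf.level
  # (n-1) S'^2 divided by the upper and lower chi-squared quantiles
  (n - 1) * var(x) / qchisq(c(1 - alpha/2, alpha/2), df=n - 1)
}
var_ci(data)   # reproduces the interval above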

4.2 Approximation using CLT

We only consider the Normal population distribution for questions about the population variance.



5 Confidence Interval for the Difference in Variances

For a confidence interval for the difference in variances of two populations, we only discuss the example where both population distributions are Normal.

Suppose that we have two random samples. The first random sample (\(X_{11}, \ldots X_{1 n_1}\)) is a sample of incomes from Portugal and has size \(n_1\). The second random sample (\(X_{21}, \ldots X_{2 n_2}\)) is a sample of incomes from England and has size \(n_2\). We want to construct a \(95\%\) confidence interval for the difference in variances: \(\sigma^2_1 - \sigma^2_2\).

The obvious thing to do is to compare \(S^{\prime 2}_1\), the adjusted sample variance of the Portuguese sample, with \(S^{\prime 2}_2\), the adjusted sample variance of the English sample. In particular, we would compute \(S^{\prime 2}_1 -S^{\prime 2}_2\). Unfortunately, there is no easy way to obtain the repeated sampling distribution of \(S^{\prime 2}_1 -S^{\prime 2}_2\).

If we assume that both population distributions are Normal, we can derive a result that comes close enough.


5.1 Example 8: \(X_1 \sim N(\mu_1,\sigma^2_1)\) and \(X_2 \sim N(\mu_2,\sigma^2_2)\)

Instead of comparing the difference \(S^{\prime 2}_1 - S^{\prime 2}_2\) of the two sample variances, we can also calculate the ratio \(S^{\prime 2}_1 / S^{\prime 2}_2\). If this number is close to \(1\), perhaps \(\frac{\sigma^2_1}{\sigma^2_2}\) is also close to one, which implies that \(\sigma^2_1-\sigma^2_2\) is close to zero.

Step 1: Turn \(\widehat{\frac{\sigma^2_1}{\sigma^2_2}}\) into a pivotal quantity \(PQ(data, \frac{\sigma^2_1}{\sigma^2_2})\):

We have seen in Example 8 of the RSD chapter that

\[ \frac{S^{\prime 2}_1 / \sigma^2_1}{S^{\prime 2}_2 / \sigma^2_2} = \frac{S^{\prime 2}_1/S^{\prime 2}_2 }{\sigma^2_1/\sigma^2_2} \sim F_{n_1-1,n_2-1}. \] The good news is that this is a pivotal quantity: the quantity itself only depends on our random samples (via \(S^{\prime 2}_1\) and \(S^{\prime 2}_2\)) and the parameter of interest \(\frac{\sigma^2_1}{\sigma^2_2}\). Moreover, the distribution of this quantity is completely known.

The probability statement is:

\[ P \left( q^{F_{n_1-1,n_2-1}}_{0.025} < \frac{S^{\prime 2}_1/S^{\prime 2}_2 }{\sigma^2_1/\sigma^2_2} < q^{F_{n_1-1,n_2-1}}_{0.975} \right) = 0.95. \]

Step 2: Find the quantiles \(a\) and \(b\) such that \(\Pr(a < PQ < b)=0.95\).

As the F-distribution is not symmetric, we need to calculate both quantiles. As an example, let \(n_1=100\) and \(n_2=150\). Then we have

a <- qf(0.025, df1=99, df2=149)
a
## [1] 0.6922322
b <- qf(0.975, df1=99, df2=149)
b
## [1] 1.425503

If you use the statistical tables, then you need the following result to compute \(a\):

\[ q_p^{F_{n_1,n_2}} = \frac{1}{q_{1-p}^{F_{n_2,n_1}}}. \] See Example \(8\) of the RSD chapter.
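A quick numerical check of this relation (a sketch):

qf(0.025, df1=99, df2=149)        # 0.6922322
1 / qf(0.975, df1=149, df2=99)    # the same value, via the reciprocal relation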

Step 3: Rewrite the expression to \(\Pr(..< \frac{\sigma^2_1}{\sigma^2_2} < ..)=0.95\).

We can rewrite the probability statement as follows:

\[\begin{align} &P \left( q^{F_{n_1-1,n_2-1}}_{0.025} < \frac{S^{\prime 2}_1/S^{\prime 2}_2 }{\sigma^2_1/\sigma^2_2} < q^{F_{n_1-1,n_2-1}}_{0.975} \right) = 0.95 \iff \\ &P \left( \frac{ q^{F_{n_1-1,n_2-1}}_{0.025} S^{\prime 2}_2}{S^{\prime 2}_1} < \frac{1 }{\sigma^2_1/\sigma^2_2} < \frac{ q^{F_{n_1-1,n_2-1}}_{0.975} S^{\prime 2}_2}{S^{\prime 2}_1} \right) = 0.95 \iff \\ &P \left( \frac{ S^{\prime 2}_1}{q^{F_{n_1-1,n_2-1}}_{0.975} S^{\prime 2}_2 } < \frac{\sigma^2_1}{\sigma^2_2} < \frac{ S^{\prime 2}_1}{q^{F_{n_1-1,n_2-1}}_{0.025} S^{\prime 2}_2 } \right) = 0.95, \end{align}\]

where we have (again) used that \(a<\frac{1}{x}<b\) is equivalent to \(\frac{1}{b}<x<\frac{1}{a}\).

To apply this, we use a random sample of size \(n_1=100\) from \(N(\mu_1=10,\sigma^2_1=100)\) and a second random sample of size \(n_2=150\) from \(N(\mu_2=20, \sigma^2_2=400)\), and calculate the adjusted sample variances:

set.seed(12345)
data1 <- rnorm(100, mean=10, sd=10)
S2_1 <- var(data1)
S2_1
## [1] 124.2625
data2 <- rnorm(150, mean=20, sd=sqrt(400))
S2_2 <- var(data2)
S2_2
## [1] 402.0311

The exact \(95\%\) confidence interval for \(\frac{\sigma^2_1}{\sigma^2_2}\) is equal to:

c( S2_1 / (b*S2_2)  , S2_1 / (a*S2_2) )
## [1] 0.2168264 0.4465073

The true value of \(\frac{\sigma^2_1}{\sigma^2_2}\) is equal to \(0.25\).
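The same interval can be obtained with var.test (a sketch), which implements exactly this F-based interval for the ratio of two Normal variances:

var.test(data1, data2, conf.level=0.95)$conf.int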

To summarize, we were not able to construct confidence intervals for \(\sigma^2_1-\sigma^2_2\). We were able to construct confidence intervals for \(\frac{\sigma^2_1}{\sigma^2_2}\) if both population distributions are Normal.

5.2 Approximation using CLT

We only consider Normal population distributions for confidence intervals for \(\frac{\sigma^2_1}{\sigma^2_2}\).

