Next: 6.2 Effects of measurement Up: 6. Linear Regression Analysis Previous: 6. Linear Regression Analysis

6.1 Simple linear regression

The least-squares fit of a straight line to a set of measurements was discussed in Section 5.2. The solution, for the case where the dependent variable $y_i$ is measured with constant uncertainty $\sigma$, is
\begin{displaymath}y = y_0 + b_x x \end{displaymath} (6.1)
 
\begin{displaymath}y_0 = \overline{y} - b_x\overline{x} \end{displaymath} (6.2)
 
\begin{displaymath}b_x = {{\overline{xy} -\overline{x}\thinspace\overline{y}}\over{\overline{x^2}-\overline{x}^2}} \end{displaymath} (6.3)
 

or, for the variables $x^\prime = x - \overline{x}$ and $y^\prime = y -\overline{y}$,

       \begin{displaymath}y_0^\prime = 0 \end{displaymath} (6.4)
 
\begin{displaymath}b_x = {{\overline{x^\prime y^\prime}}\over{\overline{{x^\prime}^2}}} ~ . \end{displaymath} (6.5)
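
As a check on the algebra, the following minimal sketch evaluates the slope both from the raw averages in (6.3) and from the centered variables in (6.5), and the intercept from (6.2). The function names and the five data points are hypothetical, chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean, written as an overbar in the text."""
    return sum(values) / len(values)


def fit_line(x, y):
    """Return (y0, b_x) from Eqs. (6.2) and (6.3)."""
    xbar, ybar = mean(x), mean(y)
    xybar = mean([xi * yi for xi, yi in zip(x, y)])
    x2bar = mean([xi * xi for xi in x])
    b_x = (xybar - xbar * ybar) / (x2bar - xbar ** 2)
    y0 = ybar - b_x * xbar
    return y0, b_x


def slope_centered(x, y):
    """Return b_x from Eq. (6.5), using deviations from the means."""
    xbar, ybar = mean(x), mean(y)
    xp = [xi - xbar for xi in x]
    yp = [yi - ybar for yi in y]
    return mean([a * b for a, b in zip(xp, yp)]) / mean([a * a for a in xp])


# Hypothetical illustrative data, not taken from the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

y0, b_x = fit_line(x, y)
b_x_centered = slope_centered(x, y)
print(y0, b_x, b_x_centered)
```

The two slope estimates agree to rounding error, as expected, since (6.5) is (6.3) rewritten in terms of deviations from the means.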
 

While this solution can be found for any set of measurements, it is still necessary to consider whether the solution is a useful representation of the measurements and whether the assumed dependency between $y$ and $x$ is valid. Indeed, if it is assumed that $x$ is the dependent variable and $y$ the independent variable, the solution is different:

        \begin{displaymath}x_0^\prime = 0 \end{displaymath} (6.6)
 
\begin{displaymath}b_y = {{\overline{x^\prime y^\prime}}\over{\overline{{y^\prime}^2}}} . \end{displaymath} (6.7)
 

Linear regression analysis addresses the dual tasks of finding the best-fit relationships and testing for correlations in the data that indicate a linear relationship between the variables.

The key measure of correlation in regression analysis is the correlation coefficient, defined in terms of the variables $x^\prime = x - \overline{x}$ and $y^\prime = y -\overline{y}$ as

\begin{displaymath}\rho = {{V_{xy}}\over{\sqrt{V_xV_y}}} = {{\overline{x^\prime y^\prime}}\over{\sqrt{\overline{{x^\prime}^2}\thinspace\overline{{y^\prime}^2}}}} = {{\overline{(x-\overline{x})(y-\overline{y})}}\over{\sqrt{\overline{(x-\overline{x})^2}\thinspace \overline{(y-\overline{y})^2}}}} = \sqrt{b_xb_y} ~ . \end{displaymath} (6.8)
 

The correlation coefficient is thus not dependent on which variable is considered independent (the square root in (6.8) is understood to carry the sign of $\overline{x^\prime y^\prime}$). However, the two regression lines coincide only in the case where the variables $x$ and $y$ are exactly linearly related, i.e., $y_i^\prime = (\mathrm{constant})\, x_i^\prime$, so that $b_xb_y = \rho^2 = 1$. If $x$ and $y$ are completely uncorrelated (so that $\overline{x^\prime y^\prime}=0$), both $b_x$ and $b_y$ are zero and hence the best-fit lines are perpendicular to each other. Figure 6.1 shows an example of the relationships between slope parameters for a case with correlation coefficient 0.8. When the data are considered in vertical slices (as appropriate for the assumption that $y$ is a function of $x$), each vertical slice appears centered on the regression line labeled y(x), but when considered in horizontal slices each slice appears centered on the line labeled x(y); this illustrates why the two regression fits must be different.
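
The relationship between the two slopes and the correlation coefficient can be sketched numerically. The example below is only an illustration: the function name and the five data points are hypothetical, chosen so that the correlation coefficient comes out at 0.8 as in Figure 6.1.

```python
import math


def regression_summary(x, y):
    """Return (b_x, b_y, rho) following Eqs. (6.5), (6.7) and (6.8)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
    sxx = sum((a - xbar) ** 2 for a in x) / n
    syy = sum((b - ybar) ** 2 for b in y) / n
    return sxy / sxx, sxy / syy, sxy / math.sqrt(sxx * syy)


# Hypothetical illustrative data, not taken from the text.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 0.0, 3.0, 2.0, 4.0]

b_x, b_y, rho = regression_summary(x, y)

# Slopes of the two regression lines when both are drawn on a y-versus-x plot.
slope_y_of_x = b_x        # line for y as a function of x
slope_x_of_y = 1.0 / b_y  # line for x as a function of y

print(b_x, b_y, rho, slope_y_of_x, slope_x_of_y)
```

For these points both slope parameters equal 0.8, yet on a common plot the y(x) line has slope 0.8 and the x(y) line has slope 1.25; the two lines would coincide only if the product of the slope parameters were 1.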


 
Figure 6.1: The linear regression lines for the assumptions that y is the dependent variable (y(x)) and that x is the dependent variable (x(y)). The data have a correlation coefficient of 0.80 and were generated from random Gaussian distributions with standard deviations of 1.0 and means of zero.

The following example helps illustrate the meaning of the correlation coefficient. Let $u_1$ and $u_2$ be two independent variables that obey normal distributions with standard deviations equal to unity and means of zero. Define variables $y_1$ and $y_2$ as

\begin{displaymath}y_1 = a_1 + b_1u_1 \end{displaymath} (6.9)
 
\begin{displaymath}y_2 = a_2 + b_2u_2 + b_3u_1 \end{displaymath} (6.10)
 

where $a_1$, $a_2$, $b_1$, $b_2$, and $b_3$ are non-zero constants. The variables $y_1$ and $y_2$ are then correlated, and the correlation coefficient (for $b_1>0$) is

\begin{displaymath}\rho = {{\langle y_1^\prime y_2^\prime\rangle}\over{\sqrt{\langle y_1^\prime y_1^\prime\rangle\langle y_2^\prime y_2^\prime\rangle}}} = {{b_3}\over{\sqrt{b_2^2+b_3^2}}} . \end{displaymath} (6.11)
 

The variance in $y_2$ is $(b_2^2+b_3^2)$, so the fraction of the variance contributed by the term in $u_1$, namely $b_3^2/(b_2^2+b_3^2)$, is the square of the correlation coefficient. The square of the correlation coefficient is sometimes said to measure the fraction of the variance in one variable that can be "explained" or accounted for by correlation with another variable. The remainder of the variance results from other sources, perhaps from correlation with other variables.
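
This result can be checked with a short simulation. The sketch below is illustrative only: the constants, the seed, and the sample size are arbitrary assumptions (with the coefficient of $u_1$ in $y_1$ taken positive), and the helper function is hypothetical.

```python
import math
import random


def sample_correlation(x, y):
    """Sample correlation coefficient, following Eq. (6.8)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Arbitrary illustrative constants (assumptions, not from the text).
a1, a2, b1, b2, b3 = 1.0, -2.0, 2.0, 3.0, 4.0

rng = random.Random(1)  # fixed seed so the run is repeatable
y1, y2 = [], []
for _ in range(50000):
    u1 = rng.gauss(0.0, 1.0)
    u2 = rng.gauss(0.0, 1.0)
    y1.append(a1 + b1 * u1)            # Eq. (6.9)
    y2.append(a2 + b2 * u2 + b3 * u1)  # Eq. (6.10)

r = sample_correlation(y1, y2)
expected = b3 / math.sqrt(b2 ** 2 + b3 ** 2)  # Eq. (6.11)
print(r, expected)
```

With these constants the predicted value is 0.8, so 64% of the variance in the second variable comes from the shared term.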

The correlation coefficient is notoriously dangerous to interpret, especially in the sense of statistical inference. To gain some sense of the variability to be expected in measurements of this parameter, consider the model of a general bivariate Gaussian distribution:

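\begin{displaymath}P(x_1,x_2) ={{\exp\left\{{{-1}\over {(1-\rho^2)}} \left[{{(x_1-\mu_1)^2}\over{2\sigma_1^2}} - {{\rho(x_1-\mu_1)(x_2-\mu_2)}\over{\sigma_1\sigma_2}} + {{(x_2-\mu_2)^2}\over{2\sigma_2^2}}\right]\right\}}\over{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}} . \end{displaymath} (6.12)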
 

It can be verified by direct integration that $\langle x_1\rangle = \mu_1$ and $\langle x_2\rangle = \mu_2$, that $\sigma_{x_1}=\sigma_1$ and $\sigma_{x_2}=\sigma_2$, and that

\begin{displaymath}\rho = {{\langle(x_1-\mu_1)(x_2-\mu_2)\rangle}\over{\sigma_1\sigma_2}} , \end{displaymath} (6.13)
 

so this distribution has appropriate properties to serve as a model for regression results. Using this model, we can ask what the probability will be for observing specific values r of the correlation coefficient. (We will distinguish r, the result of calculations with finite samples from the population, from the population correlation coefficient $\rho$.)

Two properties derivable from the bivariate Gaussian distribution illustrate the meaning of the correlation coefficient. To obtain them, consider the conditional probability of $x_2$ given $x_1$:

\begin{displaymath}P(x_2\vert x_1) = A P(x_1,x_2) \end{displaymath} (6.14)
 

where $A$ is defined to normalize the probability distribution when integrated over $x_2$:

\begin{displaymath}\int P(x_2\vert x_1) dx_2 = 1 . \end{displaymath} (6.15)
 

If $x_1^\prime = x_1-\mu_1$ and $x_2^\prime = x_2-\mu_2$ are the deviations from the means $\mu_1$ and $\mu_2$, then the conditional probability distribution obtained by integrating (6.15) to determine $A$ is

\begin{displaymath}P(x_2^\prime\vert x_1^\prime) = {{1}\over{\sqrt{2\pi}\sigma_2\sqrt{1-\rho^2}}} \exp\Bigl\{-\Bigl({{x_2^\prime - \rho x_1^\prime{{\sigma_2}\over{\sigma_1}}}\over{\sqrt{2}\sigma_2\sqrt{1-\rho^2}}}\Bigr)^2\Bigr\} . \end{displaymath} (6.16)
 

This probability distribution has the following two properties:
 
 

 
\begin{displaymath}\langle x_2\vert x_1\rangle = \mu_2 + \rho x_1^\prime{{\sigma_2}\over{\sigma_1}} . \end{displaymath} (6.17)
   
\begin{displaymath}\langle (x_2-\langle x_2\vert x_1\rangle)^2\vert x_1\rangle = \sigma_2^2 (1-\rho^2). \end{displaymath} (6.18)
 

Thus, if $\sigma_2^2$ represents the total variance in $x_2$, $\rho^2$ represents the fraction of this variance that is removed once $x_1$ is fixed.
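
The conditional mean and variance can be made concrete with a small sketch that draws values of the second variable for one fixed value of the first, using a Gaussian whose mean depends on the deviation of the first variable from its mean and whose variance is reduced by the factor $(1-\rho^2)$. The function name, parameter values, and seed below are assumptions made for illustration.

```python
import math
import random


def draw_x2_given_x1(x1, mu1, mu2, sigma1, sigma2, rho, rng):
    """Draw x2 from the conditional Gaussian described by (6.16)-(6.18)."""
    cond_mean = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
    cond_sd = sigma2 * math.sqrt(1.0 - rho ** 2)
    return rng.gauss(cond_mean, cond_sd)


# Arbitrary illustrative parameters (assumptions, not from the text).
mu1, mu2, sigma1, sigma2, rho = 0.0, 1.0, 1.0, 2.0, 0.8
x1_fixed = 1.5

rng = random.Random(2)  # fixed seed so the run is repeatable
draws = [
    draw_x2_given_x1(x1_fixed, mu1, mu2, sigma1, sigma2, rho, rng)
    for _ in range(50000)
]

m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
print(m, v)
```

For these parameters the conditional mean is 3.4 and the conditional variance is 1.44, compared with an unconditional variance of 4: fixing the first variable removes 64% of the variance.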

These properties of the bivariate Gaussian distribution make it possible to generate simulated experiments to study the expected distribution in $r$ by Monte Carlo techniques. For a given population correlation coefficient $\rho$ and sample size $N$, one can generate many random samples from the appropriate bivariate distribution and compute the sample correlation coefficient $r$ for each sample. Some results from such calculations are shown in Figures 6.2 and 6.3. These figures illustrate that the observed correlation coefficient will often differ significantly from the true population correlation coefficient, especially for small sample sizes, so it is important not to attribute unwarranted significance to correlation coefficients obtained from small samples.
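
A Monte Carlo experiment of this kind might be sketched as follows. This is not the code used to produce the figures; it is a minimal illustration that assumes unit standard deviations and zero means, and the function names, number of trials, and seed are arbitrary choices.

```python
import bisect
import math
import random


def sample_r(n, rho, rng):
    """Draw n correlated Gaussian pairs and return the sample correlation r."""
    xs, ys = [], []
    for _ in range(n):
        u1 = rng.gauss(0.0, 1.0)
        u2 = rng.gauss(0.0, 1.0)
        xs.append(u1)
        # Conditional construction: mean rho*u1, variance (1 - rho^2).
        ys.append(rho * u1 + math.sqrt(1.0 - rho ** 2) * u2)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    syy = sum((b - ybar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n, rho, trials, seed=0):
    """Return the sorted sample correlations from repeated experiments."""
    rng = random.Random(seed)
    return sorted(sample_r(n, rho, rng) for _ in range(trials))


def cumulative_fraction(sorted_r, value):
    """Fraction of simulated r values smaller than the given value."""
    return bisect.bisect_left(sorted_r, value) / len(sorted_r)


rs10 = simulate(10, 0.5, 2000)
rs25 = simulate(25, 0.5, 2000)

for label, rs in (("N=10", rs10), ("N=25", rs25)):
    print(label, [round(cumulative_fraction(rs, v), 3) for v in (0.0, 0.25, 0.5, 0.75)])
```

Plotting the cumulative fraction against the value of the correlation coefficient gives curves of the kind described in the figure captions; the spread of the simulated values is visibly wider for 10 points than for 25.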


 
Figure 6.2: Simulation results generated for a population correlation coefficient of $\rho=0$, using repeated random sequences. The plot shows the fraction of the results for which the calculated correlation coefficient was smaller than the plotted value, for sequences of 10 points and 25 points.

 
Figure 6.3: Simulation results similar to those in Fig. 6.2, for a population correlation coefficient of 0.5 and for sequences each containing 10 points or 25 points.

Some of the characteristics of the correlation coefficient illustrated by examples in this chapter are:

Two other properties of the bivariate Gaussian distribution are useful in some interpretations of regression relationships: (i) a rotation can always be selected that transforms the variables into new variables that are uncorrelated; and (ii) the contours of constant probability in a plot of $y_0$ vs $b$ are ellipses in the general case.
 


Example 6.1: It is a common error to show a regression fit as evidence that there is a difference in the measurements of two instruments. Suppose two wind vanes behave identically, and each has the same measurement error $\sigma$. They may produce a set of measurements like those shown in Fig. 6.1, for which both x and y have the same mean and standard deviation. It is an error to conclude that, because the regression line for y(x) has a slope of about 0.8, the response of the instrument measuring y is only 80% as large as the response of the instrument measuring x.

The distribution of r about $\rho$ is asymmetrical, so it is useful to transform r to another variable that will have an error distribution that is approximately Gaussian. A common example is the Fisher z transformation, based on the variable

 
\begin{displaymath}z_f = 0.5 \ln\left({{1+r}\over{1-r}}\right)\ . \end{displaymath} (6.19)
 

This variable is approximately Gaussian-distributed with standard deviation

 
\begin{displaymath}\sigma_z = {{1}\over{\sqrt{N-3}}} . \end{displaymath} (6.20)
 

The inverse transformation is

\begin{displaymath}r = {{e^{2z_f}-1}\over{e^{2z_f}+1}} . \end{displaymath} (6.21)
 
 


Example 6.2: Consider the case where a sample of 25 measurements gives a correlation coefficient of 0.5, as shown in Fig. 6.3. Find the one-standard-deviation uncertainty range for the correlation coefficient.

From (6.19) and (6.20), $z_f=0.549$ and $\sigma_z=0.213$, so the one-standard-deviation limits for $z_f$ are 0.336 and 0.762. The corresponding values of $r$ from (6.21) are 0.324 and 0.643, as shown on the plot.

The estimated uncertainty that would result from error propagation from the above formulas is

\begin{displaymath}\delta r \approx {{1}\over{\sqrt{N-3}}} (1+r)(1-r) \approx 0.16 ~ . \end{displaymath} (6.22)
 

This is approximately the same as the average of the two one-sided deviations found using the transformation equations (0.176 below and 0.143 above $r=0.5$). However, because of the skewed nature of the distribution functions, this estimate should only serve as a preliminary guide to the uncertainty.
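
The arithmetic of Example 6.2 can be reproduced with a short sketch. The function name is hypothetical; the inverse transformation (6.21) is evaluated with the hyperbolic tangent, which is algebraically the same expression, and the error-propagation estimate uses the derivative of (6.21), which equals $1-r^2 = (1+r)(1-r)$.

```python
import math


def fisher_interval(r, n, k=1.0):
    """Return (z_f, sigma_z, r_low, r_high) for a k-sigma range in z_f."""
    z_f = 0.5 * math.log((1.0 + r) / (1.0 - r))  # Eq. (6.19)
    sigma_z = 1.0 / math.sqrt(n - 3)             # Eq. (6.20)
    r_low = math.tanh(z_f - k * sigma_z)         # Eq. (6.21)
    r_high = math.tanh(z_f + k * sigma_z)
    return z_f, sigma_z, r_low, r_high


r, n = 0.5, 25
z_f, sigma_z, r_low, r_high = fisher_interval(r, n)

# Error-propagation estimate, Eq. (6.22).
delta_r = (1.0 + r) * (1.0 - r) / math.sqrt(n - 3)

print(round(z_f, 3), round(sigma_z, 3), round(r_low, 3), round(r_high, 3), round(delta_r, 2))
```

The asymmetry of the resulting range about 0.5 reflects the skewness of the distribution that the symmetric estimate (6.22) cannot capture.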


The uncertainties in the slope and intercept for the regression can be estimated using the results from Section 5.2. The elements of the covariance matrix, for the case where $x$ is the independent variable, are

\begin{displaymath}V_{y_0^\prime y_0^\prime} = {{\sigma^2}\over{N}} \end{displaymath} (6.23)
 
\begin{displaymath}V_{b_xb_x} = {{\sigma^2}\over{N\overline{{x^\prime}^2}}} \end{displaymath} (6.24)
 
\begin{displaymath}V_{y_0^\prime b_x} = 0 \ , \end{displaymath} (6.25)
 

so in the primed coordinates the results for the slope and intercept are not correlated. This is not generally true in the original coordinates.

The uncertainty in the parameter b, the slope of the regression line, can also be determined using the analysis previously applied to linear least-squares fitting. In that case, the error matrix (cf. 5.22) was

\begin{displaymath}H^{-1} = {{\sigma^2}\over{N(\overline{x^2} - \overline{x}^2)}}\pmatrix{\overline{x^2}&-\overline{x}\cr-\overline{x}&1\cr} \end{displaymath} (6.26)
 

or, with $\sigma_x^2=(\overline{x^2}-\overline{x}^2)$,

\begin{displaymath}\pmatrix{V_{y_0y_0}&V_{y_0b}\cr V_{by_0}&V_{bb}\cr}= {{\sigma^2}\over{N\sigma_x^2}}\pmatrix{\overline{x^2}&-\overline{x}\cr-\overline{x}&1\cr} ~ . \end{displaymath} (6.27)
 

In particular,

                  \begin{displaymath}V_{bb} = {{\sigma^2}\over{N\sigma_x^2}} \end{displaymath} (6.28)
 

where the standard deviation (for a true linear relationship) can be estimated from $\sigma \approx s$ where

\begin{displaymath}s^2=\sum_i(y_i-y_0-bx_i)^2/(N-2) . \end{displaymath} (6.29)
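
As an illustration of how these formulas combine, the sketch below fits a line, estimates the measurement variance from the residuals as in (6.29), and evaluates the variance of the slope from (6.28) together with the other covariance elements from (6.27). The function name and the data are hypothetical.

```python
import math


def slope_with_uncertainty(x, y):
    """Fit y = y0 + b*x and return the fit with its covariance elements."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    x2bar = sum(xi * xi for xi in x) / n
    var_x = x2bar - xbar ** 2  # sigma_x^2 as defined in the text
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
    b = sxy / var_x
    y0 = ybar - b * xbar
    # Residual variance with N - 2 degrees of freedom, Eq. (6.29).
    s2 = sum((yi - y0 - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    scale = s2 / (n * var_x)
    return {
        "y0": y0,
        "b": b,
        "s2": s2,
        "V_bb": scale,             # Eq. (6.28)
        "V_y0y0": scale * x2bar,   # from Eq. (6.27)
        "V_y0b": -scale * xbar,    # from Eq. (6.27)
    }


# Hypothetical illustrative data, not taken from the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = slope_with_uncertainty(x, y)
print(fit["b"], math.sqrt(fit["V_bb"]))
```

Because the mean of the independent variable is not zero here, the covariance between intercept and slope is non-zero, in line with the remark that the parameters are uncorrelated only in the primed coordinates.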
 



NCAR Advanced Study Program

http://www.asp.ucar.edu