| $y = y_0 + b_x x$ | (6.1) |
| (6.2) |
| (6.3) |
or, for the variables $x' = x - \bar{x}$ and $y' = y - \bar{y}$,
| |
(6.4) |
| |
(6.5) |
While this solution can be found for any set of measurements, it is still necessary to consider whether the solution is a useful representation of the measurements and whether the assumed dependency between y and x is valid. Indeed, if it is assumed that x is the dependent variable and y the independent variable, the solution is different:
| |
(6.6) |
| |
(6.7) |
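The difference between the two solutions can be illustrated numerically. The following is a minimal sketch, assuming the standard least-squares result in which the slope is the sum of products of deviations from the means divided by the sum of squared deviations of the independent variable. The function name `fit_line` and the simulated data are illustrative and do not come from the text.

```python
import numpy as np

def fit_line(u, v):
    """Least-squares line v = v0 + b*u, with u treated as the independent variable."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du = u - u.mean()          # deviations from the means
    dv = v - v.mean()
    b = np.sum(du * dv) / np.sum(du**2)
    v0 = v.mean() - b * u.mean()
    return v0, b

# Hypothetical data: unit-variance x and y with population correlation 0.8.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)

y0, bx = fit_line(x, y)          # y = y0 + bx * x
x0, by = fit_line(y, x)          # x = x0 + by * y
slope_of_x_on_y_line = 1.0 / by  # slope of the second line when drawn as y against x

print(f"y(x) slope: {bx:.3f}   x(y) slope in the same axes: {slope_of_x_on_y_line:.3f}")
```

For imperfectly correlated data the two lines have different slopes in the same axes; they coincide only when the points lie exactly on a line.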
Linear regression analysis addresses the dual tasks of finding the best-fit relationships and testing for correlations in the data that indicate a linear relationship between the variables.
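The key measure of correlation in regression analysis is the correlation coefficient, defined in terms of the variables $x'$ and $y'$ (the deviations of x and y from their means) as
| $r = \dfrac{\sum_i x'_i y'_i}{\sqrt{\sum_i x'^2_i \, \sum_i y'^2_i}}$ | (6.8) |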
The correlation coefficient is thus not dependent on which variable
is considered independent. However, the slope parameters $b_x$
and $b_y$ for the two cases are only equal in the case where
the variables x and y are linearly related, i.e., $y_i$
= (constant) $x_i$. If x and y are completely
uncorrelated (so that $r = 0$),
both $b_x$ and $b_y$ are zero and hence the
best-fit lines are perpendicular to each other. Figure 6.1 shows an example
of the relationships between slope parameters for a case with correlation
coefficient 0.8. When the data are considered in vertical slices (as appropriate
for the assumption that y is a function of x), each vertical
slice appears centered on the regression line labeled y(x),
but when considered in horizontal slices each slice appears centered on
the line labeled x(y); this illustrates why the two regression
fits must be different.
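The symmetry of the correlation coefficient and its connection to the two slopes can be checked on a small data set. This sketch assumes the usual product-moment definition of r and takes $b_y$ to be the slope in $x = x_0 + b_y y$; with that convention the product of the two slopes equals $r^2$, so both slopes vanish together when $r = 0$. The four data points and the function name `corr_coeff` are made up for illustration.

```python
import numpy as np

def corr_coeff(x, y):
    """Sample correlation coefficient from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 1.0, 2.0])

r_xy = corr_coeff(x, y)   # x treated as independent
r_yx = corr_coeff(y, x)   # y treated as independent: same value

dx, dy = x - x.mean(), y - y.mean()
bx = np.sum(dx * dy) / np.sum(dx**2)   # slope of y on x
by = np.sum(dx * dy) / np.sum(dy**2)   # slope of x on y

print(r_xy, r_yx, bx * by, r_xy**2)
```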

The following example helps illustrate the meaning of the correlation coefficient. Let $u_1$ and $u_2$ be two independent variables that obey normal distributions with standard deviations equal to unity and means of zero. Define variables $y_1$ and $y_2$ as
| $y_1 = a_1 + b_1 u_1$ | (6.9) |
| $y_2 = a_2 + b_2 u_2 + b_3 u_1$ | (6.10) |
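where $a_1$, $a_2$, $b_1$, $b_2$, and $b_3$ are non-zero constants. The variables $y_1$ and $y_2$ are then correlated, and the correlation coefficient (for $b_1 > 0$; the sign reverses for $b_1 < 0$) is
| $\rho = \dfrac{b_3}{\sqrt{b_2^2 + b_3^2}}$ | (6.11) |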
The variance in $y_2$ is $(b_2^2 + b_3^2)$, so the fraction of the variance contributed by the variable $u_1$, $b_3^2/(b_2^2 + b_3^2)$, is the square of the correlation coefficient. The square of the correlation coefficient is sometimes said to measure the fraction of the variance in one variable that can be "explained" or accounted for by correlation with another variable. The remainder of the variance results from other sources, perhaps from correlation with other variables.
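This example is easy to check by simulation. The sketch below draws $u_1$ and $u_2$ as independent standard normal variables, builds $y_1$ and $y_2$, and compares the sample correlation with $b_3/\sqrt{b_2^2 + b_3^2}$. The numerical values of the constants are arbitrary choices for illustration, with $b_1 > 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Arbitrary illustrative constants (not from the text); b1 is taken positive.
a1, b1 = 1.0, 2.0
a2, b2, b3 = -0.5, 1.5, 2.0

u1 = rng.standard_normal(n)
u2 = rng.standard_normal(n)
y1 = a1 + b1 * u1
y2 = a2 + b2 * u2 + b3 * u1

r = np.corrcoef(y1, y2)[0, 1]
expected_r = b3 / np.sqrt(b2**2 + b3**2)      # 2.0 / 2.5 = 0.8
explained_fraction = b3**2 / (b2**2 + b3**2)  # share of var(y2) due to u1

print(f"sample r = {r:.4f}, expected = {expected_r:.4f}")
print(f"r^2 = {r**2:.4f}, variance fraction from u1 = {explained_fraction:.4f}")
```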
The correlation coefficient is notoriously dangerous to interpret, especially in the sense of statistical inference. To gain some sense of the variability to be expected in measurements of this parameter, consider the model of a general bivariate Gaussian distribution:
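| $P(x_1,x_2) = \dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\dfrac{1}{2(1-\rho^2)}\left[\dfrac{(x_1-\bar{x}_1)^2}{\sigma_1^2} - \dfrac{2\rho\,(x_1-\bar{x}_1)(x_2-\bar{x}_2)}{\sigma_1\sigma_2} + \dfrac{(x_2-\bar{x}_2)^2}{\sigma_2^2}\right]\right\}$ | (6.12) |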
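It can be verified by direct integration that $\langle x_1 \rangle = \bar{x}_1$ and $\langle x_2 \rangle = \bar{x}_2$, that $\langle (x_1-\bar{x}_1)^2 \rangle = \sigma_1^2$ and $\langle (x_2-\bar{x}_2)^2 \rangle = \sigma_2^2$, and that
| $\dfrac{\langle (x_1-\bar{x}_1)(x_2-\bar{x}_2) \rangle}{\sigma_1\sigma_2} = \rho$ | (6.13) |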
so this distribution has appropriate properties to serve as a model
for regression results. Using this model, we can ask what the probability
will be for observing specific values r of the correlation coefficient.
(We will distinguish r, the result of calculations with finite samples
from the population, from the population correlation coefficient $\rho$.)
Two properties derivable from the bivariate Gaussian distribution illustrate the meaning of the correlation coefficient. To obtain them, consider the conditional probability of $x_2$ given $x_1$:
| $P(x_2 \mid x_1) = A\,P(x_1,x_2)$ | (6.14) |
where A is defined to normalize the probability distribution when integrated over $x_2$:
| $A \int_{-\infty}^{\infty} P(x_1,x_2)\,dx_2 = 1$ | (6.15) |
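If $\delta_1 = x_1 - \bar{x}_1$ and $\delta_2 = x_2 - \bar{x}_2$ are the deviations from the means $\bar{x}_1$ and $\bar{x}_2$, then the conditional probability distribution obtained by integrating (6.15) to determine A is
| $P(x_2 \mid x_1) = \dfrac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\dfrac{\left(\delta_2 - \rho\,\frac{\sigma_2}{\sigma_1}\,\delta_1\right)^2}{2\sigma_2^2\,(1-\rho^2)}\right\}$ | (6.16) |
This probability distribution has the following two properties:
| $\langle \delta_2 \rangle = \rho\,\dfrac{\sigma_2}{\sigma_1}\,\delta_1$ | (6.17) |
| $\langle (\delta_2 - \langle \delta_2 \rangle)^2 \rangle = \sigma_2^2\,(1-\rho^2)$ | (6.18) |
Thus, if $\sigma_2^2$ represents the total variance in $x_2$, $\rho^2$ represents the fraction of this variance that is removed once $x_1$ is fixed.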
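The reduction in variance once $x_1$ is fixed can be seen directly in simulated data. The sketch below assumes the standard bivariate Gaussian results: for fixed $x_1$, the mean deviation of $x_2$ is $\rho\,(\sigma_2/\sigma_1)$ times the deviation of $x_1$, and the remaining variance is $\sigma_2^2(1-\rho^2)$. It samples a large population, selects a thin vertical slice in $x_1$, and compares the slice statistics with those predictions. All parameter values, and the slice half-width of 0.05, are arbitrary illustrative choices.

```python
import numpy as np

# Arbitrary illustrative population parameters (not from the text).
rho, sigma1, sigma2 = 0.6, 2.0, 1.5
mean1, mean2 = 1.0, -2.0
cov = [[sigma1**2, rho * sigma1 * sigma2],
       [rho * sigma1 * sigma2, sigma2**2]]

rng = np.random.default_rng(2)
samples = rng.multivariate_normal([mean1, mean2], cov, size=400_000)
x1, x2 = samples[:, 0], samples[:, 1]

# Thin slice around a fixed deviation of x1 from its mean.
delta1 = 1.0
in_slice = np.abs(x1 - (mean1 + delta1)) < 0.05
d2 = x2[in_slice] - mean2          # deviations of x2 within the slice

cond_mean, cond_var = d2.mean(), d2.var()
pred_mean = rho * sigma2 / sigma1 * delta1   # 0.45
pred_var = sigma2**2 * (1.0 - rho**2)        # 1.44, versus total variance 2.25

print(f"slice mean {cond_mean:.3f} (predicted {pred_mean:.3f})")
print(f"slice variance {cond_var:.3f} (predicted {pred_var:.3f})")
```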
These properties of the bivariate Gaussian distribution make it possible
to generate simulated experiments to study the expected distribution of
r by Monte Carlo techniques. For a given population correlation
coefficient $\rho$
and sample size N, one can generate many random samples from the
appropriate bivariate distribution and compute the sample correlation coefficient
r for each sample. Some results from such calculations are shown
in Figures 6.2 and 6.3. These figures illustrate that the observed correlation
coefficient will often differ significantly from the true population correlation
coefficient, especially for small sample sizes, so it is important not
to attribute unwarranted significance to correlation coefficients obtained
from small samples.
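A Monte Carlo experiment of this kind can be sketched in a few lines. The example below is not the calculation behind the figures; it is a minimal illustration assuming a population correlation of 0.5, samples of 25 points, and 20,000 simulated experiments, all of which are arbitrary choices. The function name `simulate_r` is hypothetical.

```python
import numpy as np

def simulate_r(rho, n, trials, rng):
    """Sample correlation coefficient for `trials` independent samples of size n."""
    cov = [[1.0, rho], [rho, 1.0]]   # unit variances; r is scale-independent
    data = rng.multivariate_normal([0.0, 0.0], cov, size=(trials, n))
    d = data - data.mean(axis=1, keepdims=True)   # deviations within each sample
    sxy = np.sum(d[..., 0] * d[..., 1], axis=1)
    sxx = np.sum(d[..., 0] ** 2, axis=1)
    syy = np.sum(d[..., 1] ** 2, axis=1)
    return sxy / np.sqrt(sxx * syy)

rng = np.random.default_rng(3)
r_samples = simulate_r(rho=0.5, n=25, trials=20_000, rng=rng)

# Percentiles bracketing roughly one standard deviation either side of the median.
p16, p50, p84 = np.percentile(r_samples, [15.87, 50.0, 84.13])
print(f"mean r = {r_samples.mean():.3f}, std = {r_samples.std():.3f}")
print(f"16th/50th/84th percentiles: {p16:.3f}, {p50:.3f}, {p84:.3f}")
```

The spread of `r_samples` shows how far a single measured r can fall from the population value at this sample size, and the unequal distances of the outer percentiles from the median show the skewness of the distribution.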


Some of the characteristics of the correlation coefficient illustrated by examples in this chapter are:
The distribution of r about $\rho$
is asymmetrical, so it is useful to transform r to another variable
that will have an error distribution that is approximately Gaussian. A
common example is the Fisher z transformation, based on the variable
| $z_f = \dfrac{1}{2}\ln\left(\dfrac{1+r}{1-r}\right)$ | (6.19) |
This variable is approximately Gaussian-distributed with standard deviation
| $\sigma_z = \dfrac{1}{\sqrt{N-3}}$ | (6.20) |
The inverse transformation is
| $r = \dfrac{e^{2 z_f} - 1}{e^{2 z_f} + 1}$ | (6.21) |
From (6.19) and (6.20), $z_f = 0.549$ and $\sigma_z = 0.213$, so the
one-standard-deviation limits for $z_f$ are 0.336 and 0.762. The
corresponding values of r from (6.21) are 0.324 and 0.643, as shown on the plot.
The estimated uncertainty that would result from error propagation from the above formulas is
| $\sigma_r \approx \dfrac{1 - r^2}{\sqrt{N-3}}$ | (6.22) |
This is approximately the same as the average of the two standard deviations
found using the transformation equations. However, because of the skewed
nature of the distribution functions, this estimate should only serve as
a preliminary guide to the uncertainty.
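The numbers quoted above can be reproduced with a short calculation. The sketch assumes the usual form of the Fisher transformation, $z_f = \tanh^{-1} r$ with standard deviation $1/\sqrt{N-3}$, and the propagated estimate $(1 - r^2)/\sqrt{N-3}$. The inputs r = 0.5 and N = 25 are inferred from the quoted values of 0.549 and 0.213; they are not stated in the text. The function name `fisher_limits` is hypothetical.

```python
import math

def fisher_limits(r, n):
    """One-standard-deviation limits on r from the Fisher z transformation."""
    z = math.atanh(r)                  # same as 0.5 * ln((1 + r) / (1 - r))
    sigma_z = 1.0 / math.sqrt(n - 3)
    r_low = math.tanh(z - sigma_z)     # tanh is the inverse transformation
    r_high = math.tanh(z + sigma_z)
    sigma_r = (1.0 - r**2) * sigma_z   # simple error propagation through tanh
    return z, sigma_z, r_low, r_high, sigma_r

# Inferred inputs: r = 0.5 and N = 25 (an assumption, see the lead-in).
z, sigma_z, r_low, r_high, sigma_r = fisher_limits(0.5, 25)
half_width = 0.5 * (r_high - r_low)    # average of the two asymmetric deviations

print(f"z = {z:.3f}, sigma_z = {sigma_z:.3f}")
print(f"limits on r: {r_low:.3f} to {r_high:.3f}")
print(f"propagated sigma_r = {sigma_r:.3f}, average half-width = {half_width:.3f}")
```

The lower limit lies farther from r than the upper limit, which is the asymmetry that the single propagated value cannot represent.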
The uncertainties in the slope and intercept for the regression can be estimated using the results from section 5.2. The elements of the covariance matrix, for the case where x is the independent variable, are
| (6.23) |
| (6.24) |
| (6.25) |
so in the primed coordinates the results for the slope and intercept are not correlated. This is not generally true in the original coordinates.
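The effect of the coordinate shift on the covariance matrix can be illustrated numerically. This sketch assumes equal measurement uncertainties `sigma` on every y value and uses the standard linear least-squares result that the parameter covariance is $\sigma^2 (A^T A)^{-1}$ for design matrix A; the function name and the four x values are made up for illustration.

```python
import numpy as np

def line_fit_covariance(x, sigma):
    """Covariance matrix of (intercept, slope) for y = a + b*x with equal errors."""
    x = np.asarray(x, dtype=float)
    design = np.column_stack([np.ones_like(x), x])   # columns: intercept, slope
    return sigma**2 * np.linalg.inv(design.T @ design)

x = np.array([0.0, 1.0, 2.0, 3.0])

cov_original = line_fit_covariance(x, sigma=1.0)
cov_primed = line_fit_covariance(x - x.mean(), sigma=1.0)   # x measured from its mean

print("original coordinates:\n", cov_original)
print("primed coordinates:\n", cov_primed)
```

In the primed coordinates the off-diagonal element vanishes, while in the original coordinates the slope and intercept errors are anticorrelated whenever the mean of x is positive. The variance of the slope is the same in both cases.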
The uncertainty in the parameter b, the slope of the regression line, can also be determined using the analysis previously applied to linear least-squares fitting. In that case, the error matrix (cf. 5.22) was
| (6.26) |
or, with
,
| (6.27) |
In particular,
| |
(6.28) |
where the standard deviation (for a true linear relationship) can be
estimated from
where
| (6.29) |
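As a rough illustration of the slope uncertainty, the sketch below assumes the common estimate in which the variance of the points about the fitted line is the sum of squared residuals divided by N - 2, and the variance of the slope is that quantity divided by the sum of squared deviations of x from its mean. These forms are assumptions standing in for the equations above, and the function name and data are hypothetical.

```python
import numpy as np

def slope_with_uncertainty(x, y):
    """Least-squares slope, its standard error, and the residual standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()
    sxx = np.sum(dx**2)
    b = np.sum(dx * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    residuals = y - (a + b * x)
    sigma2 = np.sum(residuals**2) / (n - 2)   # two parameters were fitted
    sigma_b = np.sqrt(sigma2 / sxx)
    return b, sigma_b, np.sqrt(sigma2)

b, sigma_b, sigma = slope_with_uncertainty([0, 1, 2, 3], [0, 1, 1, 2])
print(f"slope = {b:.3f} +/- {sigma_b:.3f}, residual sigma = {sigma:.3f}")
```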