Regression

How one variable moves with another


Theory and Background.

In statistics, regression studies how one variable tends to change, on average, when another variable changes. It does not say that one value of X gives one exact value of Y. Instead, it estimates the expected or average value of Y for a given value of X.

That is why regression is not just a line on a graph. It is a way of describing the average tendency in a statistical relationship using observed data.

The word regression became famous through Sir Francis Galton in the nineteenth century. Galton studied the heights of parents and children and noticed that very tall parents tended to have children who were also tall, but usually closer to the general average than the parents. He called this "regression toward mediocrity", now usually called regression to the mean.

Later, Karl Pearson and other statisticians developed the mathematical tools that connected regression with correlation, averages, variation, and the older method of least squares. In modern statistics, regression is used to estimate, explain, compare, and predict using data, but its foundation remains statistical.

The basic idea.
Regression studies a dependent variable and one or more independent variables. The dependent variable is the result we are trying to understand. An independent variable is a variable used to explain changes in that result.

Simple linear regression.
When there is one explanatory variable and the relationship is represented by a straight line, the form is:
Y = a + bX

Here, Y is the estimated value of the dependent variable, X is the independent variable, a is the intercept, and b is the slope. The intercept is the estimated value of Y when X is zero. The slope tells us how much Y changes on average when X increases by one unit.

Regression is about average movement.
Regression does not say that every observation will fall exactly on the line. It says that the line represents the average tendency in the data. Individual points can be above or below the line because real data contains variation.

Errors and residuals.
The difference between an observed value and the value estimated by the regression line is called a residual. If actual Y is 80 and estimated Y is 75, the residual is 5. Regression tries to make these residuals as small as possible in a systematic way.

Least squares method.
The most common regression line is chosen by the least squares method. This method squares each residual, adds all squared residuals, and chooses the line with the smallest total. Squaring prevents positive and negative errors from cancelling each other.
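
The least squares choice can be sketched in a few lines of Python. This is only a minimal illustration: the data points are made up, and the helper names fit_line and sum_squared_residuals are hypothetical, not taken from any library.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how X and Y vary together, divided by how X varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Square each residual (actual - estimated) and add them up."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data, not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
best_total = sum_squared_residuals(xs, ys, a, b)
# A slightly steeper line through the same intercept gives a larger total.
other_total = sum_squared_residuals(xs, ys, a, b + 0.1)
print(a, b, best_total, other_total)
```

Trying other values in place of b + 0.1 should always give a total at least as large as best_total, which is what "least squares" means.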

Regression and correlation.
Correlation measures the strength and direction of association between two variables. Regression goes further and gives an equation for estimating one variable from another. Correlation is symmetric, but regression is directional. Correlation between X and Y is the same as between Y and X, but the regression of Y on X is generally not the same line as the regression of X on Y.

Positive and negative regression.
If the slope is positive, Y tends to increase when X increases. If the slope is negative, Y tends to decrease when X increases. If the slope is close to zero, X may not be useful for explaining the average movement of Y.

Multiple regression.
When more than one independent variable is used, the method is called multiple regression. For example, sales may be studied using price, advertising, season, and income level together. The statistical purpose is to estimate the separate contribution of each variable while holding the others constant.

Important assumptions.
Linear regression usually assumes that the relationship is approximately linear, that errors are independent, that error variation is reasonably constant across values of X, and that errors are centered around zero. Checking these assumptions helps us decide whether the regression result is reliable.

A careful warning.
Regression can show statistical association, but association alone does not prove causation. A good regression result still needs sensible data, a meaningful question, and statistical judgement.

Examples for Understanding.

1. Study hours and marks.
Suppose students study for different numbers of hours and get different marks. We may fit a regression equation:
Marks = 35 + 5 x Study Hours

Interpretation: if study time increases by one hour, estimated marks increase by about 5 on average. If a student studies 6 hours, estimated marks = 35 + 5 x 6 = 65. This does not mean every 6-hour student must score exactly 65. It is the estimated average tendency.

2. Residual example.
Using the same equation, if a student who studied 6 hours actually scored 70:
Estimated marks = 65
Actual marks = 70
Residual = Actual - Estimated = 70 - 65 = 5

The residual tells us how far the observation is from the regression line. A positive residual means the actual value is above the estimate. A negative residual means it is below.
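
The same arithmetic can be written as a short Python sketch. The function names estimated_marks and residual are hypothetical and chosen only for this illustration; the equation and the numbers come from the example above.

```python
def estimated_marks(study_hours):
    # Equation from the example: Marks = 35 + 5 x Study Hours
    return 35 + 5 * study_hours


def residual(actual, estimated):
    # Residual = Actual - Estimated
    return actual - estimated


estimate = estimated_marks(6)
above = residual(70, estimate)  # student scored above the estimate
below = residual(60, estimate)  # a second, made-up student scored below it
print(estimate, above, below)   # 65 5 -5
```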

3. Advertising and sales.
A shop studies monthly advertising spend and monthly sales:
Sales = 12000 + 8 x Advertising Spend

If advertising spend is Rs. 1000, estimated sales = 12000 + 8 x 1000 = Rs. 20000. The slope 8 means each extra rupee of advertising is associated with Rs. 8 higher sales on average, within the range of data studied.
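
The phrase "within the range of data studied" can be made concrete with a small guard. In this sketch the observed range of advertising spend (Rs. 500 to Rs. 3000) is an assumption made up for illustration, since the example does not state it, and estimated_sales is a hypothetical helper name.

```python
# Assumed range of advertising spend seen in the data (not given in the text).
OBSERVED_MIN, OBSERVED_MAX = 500, 3000


def estimated_sales(advertising_spend):
    """Return (estimate, inside_range) for Sales = 12000 + 8 x Advertising Spend."""
    estimate = 12000 + 8 * advertising_spend
    inside_range = OBSERVED_MIN <= advertising_spend <= OBSERVED_MAX
    return estimate, inside_range


print(estimated_sales(1000))   # (20000, True)
print(estimated_sales(50000))  # (412000, False): far outside the data, so treat with caution
```

The equation will return a number for any spend, but an estimate far outside the observed range rests on the untested hope that the same straight line continues there.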

4. Negative slope example.
A product's demand may be described as:
Demand = 500 - 20 x Price

If price increases by one unit, demand decreases by about 20 units on average. This is a negative regression relationship. The slope does not say price is the only reason demand changes; it only describes the average statistical movement captured in the data.

5. Regression to the mean.
Suppose very high-scoring students take another similar test. Many may still score high, but their second score may be closer to the class average than the first. This does not mean they became weaker. It may simply reflect natural variation around an average level.

This is the historical idea behind regression: extreme observations are often followed by observations that are less extreme, especially when random variation is present.
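
A small simulation can show this effect. The sketch below rests on an assumed model that is not from the text: each score is a stable ability plus random variation on the day, and all the numbers (class size, averages, spreads) are made up.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed model: score = stable ability + random variation on the day.
n_students = 10_000
ability = [rng.gauss(60, 10) for _ in range(n_students)]
test1 = [a + rng.gauss(0, 10) for a in ability]
test2 = [a + rng.gauss(0, 10) for a in ability]

# Select the top 10% of students on the first test.
cutoff = sorted(test1)[int(0.9 * n_students)]
top = [i for i in range(n_students) if test1[i] >= cutoff]

class_mean = mean(test1)
top_first = mean(test1[i] for i in top)
top_second = mean(test2[i] for i in top)
print(round(class_mean, 1), round(top_first, 1), round(top_second, 1))
```

In this model nobody's ability changes between the two tests, yet the top group's second average falls between the class average and its own first average. Part of what made the first scores extreme was luck on the day, and that luck does not repeat.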

6. Why correlation is not enough.
If height and weight have a positive correlation, we know they tend to move together. Regression lets us estimate:
Weight = 12 + 0.9 x Height

Now the relationship becomes an estimating rule. For a height of 170 cm:
Estimated weight = 12 + 0.9 x 170 = 12 + 153 = 165
The numbers here are only for understanding the method, not a medical rule.

7. Multiple regression example.
A statistical model for house price may be:
Price = a + b1 x Area + b2 x Age + b3 x Distance

Area may have a positive coefficient, age may have a negative coefficient, and distance from the city center may also have a negative coefficient. Each coefficient is interpreted while the other variables are held constant. That is the key statistical advantage of multiple regression.
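
As a rough sketch of how such coefficients can be estimated, the code below solves the least squares normal equations in plain Python. Everything in it is an assumption for illustration: the house data are made up, the prices are generated from known coefficients (a = 20, b1 = 0.5, b2 = -0.8, b3 = -2) so that the result can be checked, and solve and fit_multiple are hypothetical helper names. Real work would normally use a statistics library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


def fit_multiple(rows, ys):
    """Least squares coefficients [a, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # leading 1 stands for the intercept
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)


# Made-up houses: (area, age, distance from the city center).
houses = [(100, 5, 10), (150, 10, 5), (80, 20, 15),
          (200, 2, 8), (120, 15, 3), (90, 8, 12)]
# Prices generated from known coefficients, with no random variation added.
prices = [20 + 0.5 * area - 0.8 * age - 2 * dist for area, age, dist in houses]

a, b1, b2, b3 = fit_multiple(houses, prices)
print(round(a, 3), round(b1, 3), round(b2, 3), round(b3, 3))
```

Because the prices were built without random variation, the fit recovers the known coefficients: a positive coefficient for area and negative ones for age and distance. With real data the estimates would carry uncertainty.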

8. Averages inside regression.
A least squares regression line passes through the point made by the mean of X and the mean of Y. This is important because regression is anchored in averages. It explains how the average of Y changes across values of X.

9. Good fit and poor fit.
If points lie close to the regression line, the residuals are small and the fit is strong. If points are widely scattered, residuals are large and the fit is weak. A weak fit does not always mean regression is useless, but it warns us to interpret estimates carefully.

10. The main statistical reading.
Regression should be read as:
"For this data, when X changes by one unit, Y changes by this much on average."
That one sentence prevents many mistakes.

Regression becomes easier when we remember that it is not magic prediction. It is a disciplined statistical summary of how variables move together on average.

Regression explains average change,
estimates one variable from another,
studies residuals carefully,
and must never be confused with automatic causation.