Statistics - A Clear Way

Where Numbers Speak

Term and Definition

Example to Understand

Statistics

The discipline that helps machine learning convert data into evidence: what patterns exist, how reliable they are, and how much uncertainty remains.

Before trusting a churn model, statistics checks whether past patterns are reliable. If 300 of 1,000 past customers churned, the observed churn rate is 300 / 1000 = 0.30, but we still ask whether that rate will hold for future customers. Churn = a customer stops using a service, cancels, leaves, or becomes inactive.

Data

The recorded examples from which a model learns. In ML, data is not just numbers; it is the experience given to the algorithm. ML = Machine Learning.

For a loan-risk model, one row may contain income = 60,000, age = 35, loan amount = 500,000, past defaults = 1, and repaid = no. The model learns from many such rows.

Dataset

A collection of rows and columns prepared for analysis or model training, usually containing features and sometimes a target.

A CSV with 50,000 customers and columns such as usage_minutes, support_calls, plan_type, tenure_months, and churn is a dataset. Its shape may be 50,000 rows x 5 columns. CSV = Comma-Separated Values file. Churn = customer leaves or stops using the service.

Observation

One complete example in the dataset. In ML, each observation is one case from which the model can learn.

One passenger row with fare = 8,500, days_before_departure = 12, route = DEL-BOM, and no_show = 1 is one observation. It is one training example.

Feature

An input variable used by the model to learn patterns and make predictions.

In house-price prediction, features may be x1 = area, x2 = rooms, and x3 = age. A linear model may use them as price = b0 + b1x1 + b2x2 + b3x3.

Target Variable

The output variable the model is trained to predict. It is also called the response, label, or dependent variable.

In churn prediction, the target can be coded as y = 1 for churn and y = 0 for no churn. The model tries to predict this y from customer features. Churn = customer leaves, cancels, or becomes inactive.

Label

The known correct answer attached to a training example in supervised learning.

For an email classifier, label = 1 may mean spam and label = 0 may mean not spam. During training, the model compares its prediction with this known label.

Population

The full real-world group where the model is expected to work, not merely the data already collected.

If a fraud model will score 10 million future card transactions, those transactions are the population of interest. A training sample of 200,000 rows must represent that population well.

Sample

The available subset of the population used for training, validation, or testing. ML performance depends heavily on how representative this sample is.

If the population is 70% rural and 30% urban, but the sample is 90% urban, the model may learn urban behavior too strongly and fail for rural customers.

Sampling Bias

A distortion caused when the collected sample does not represent the population where the model will be used.

If 80% of past hires came from only 5 colleges, a resume model trained on that data may score those colleges higher even when skill is similar. The sample carries historical bias.

Training Data

The portion of data used by the algorithm to learn model parameters and patterns.

In logistic regression, training data is used to learn the coefficients b0, b1, b2, and b3 in default_log_odds = b0 + b1(income) + b2(age) + b3(default_history).

Validation Data

Data used during model development to tune choices such as thresholds, features, or hyperparameters.

If cutoff 0.50 gives recall 62% and cutoff 0.40 gives recall 78%, validation data helps decide which threshold better fits the business cost of missed churners.

Test Data

Data held back until the end to estimate how well the final model may perform on unseen cases.

If final model accuracy is 92% on training data but 84% on untouched test data, the 84% is the more honest estimate of future performance.

Train-Test Split

The practice of dividing available data into one part for learning and another part for honest evaluation.

From 10,000 rows, an 80:20 split gives 8,000 training rows and 2,000 test rows. The model learns from 8,000 rows and is judged on the unseen 2,000 rows.
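
The 80:20 split above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed method: the function name train_test_split, the seed value, and the integer stand-in rows are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows, then return (train, test) lists."""
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10_000))                     # stand-in for 10,000 data rows
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))         # 8000 2000
```

Shuffling before cutting avoids a test set made only of the last rows in the file, which matters when the file is sorted by date or by customer type.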

Supervised Learning

Machine learning where each training example includes the correct answer, called a label or target.

If each customer row contains features plus churn = 1 or 0, the model can learn from known answers. Classification and regression are common supervised tasks.

Unsupervised Learning

Machine learning where the data has no target label, so the algorithm searches for structure, groups, or patterns.

If customer data has age, spend, and usage but no churn label, clustering can still group customers into similar segments. Unsupervised = no known answer column is supplied during training.

Categorical Data

Data whose values are groups or labels. ML models usually need such values encoded before training.

Payment mode may take values card, cash, or UPI. Since these are categories, the model cannot treat them like normal quantities such as 1, 2, 3 without careful encoding.

Numerical Data

Data expressed as measurable or countable numbers that can often be used directly or after scaling.

Monthly spend = 12,500 and account age = 18 months are numerical features. A model can compare magnitudes, but scaling may still be needed if units differ greatly.

Encoding

The process of converting categorical values into numerical form so an ML algorithm can use them.

One-hot encoding converts payment_mode = UPI into card = 0, cash = 0, UPI = 1. This gives the model numeric inputs without imposing a false order. UPI = Unified Payments Interface.
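
A small sketch of the one-hot idea, assuming the full list of categories is known in advance. The helper name one_hot is hypothetical and written only for this example.

```python
def one_hot(value, categories):
    """Return a dict with 1 for the matching category and 0 for the others."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {category: int(category == value) for category in categories}

payment_modes = ["card", "cash", "UPI"]
encoded = one_hot("UPI", payment_modes)
print(encoded)  # {'card': 0, 'cash': 0, 'UPI': 1}
```

Raising on an unknown category is one possible design choice; real projects may instead map unseen values to an all-zero row.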

Feature Engineering

The process of creating, transforming, or selecting input variables so the model can learn useful patterns more easily.

From booking_date and travel_date, create days_before_departure = travel_date - booking_date. If travel is on May 20 and booking was May 5, the feature value is 15 days.
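
The date subtraction above might look like this in code. The year 2024 is an assumption, since the example gives only the day and month, and the function name simply mirrors the feature name.

```python
from datetime import date

def days_before_departure(booking_date, travel_date):
    """Number of whole days between booking and travel."""
    return (travel_date - booking_date).days

# Assumed year: 2024. Booking on May 5, travel on May 20.
feature_value = days_before_departure(date(2024, 5, 5), date(2024, 5, 20))
print(feature_value)  # 15
```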

Missing Values

Blank or unavailable values in the dataset. ML models usually need them handled before training.

If 600 of 10,000 income values are blank, missing rate is 600 / 10000 = 6%. You must decide whether to fill, flag, or remove them.

Imputation

The process of filling missing values with a reasonable substitute such as mean, median, mode, or a model-based estimate.

If ages are 20, 25, 30, and one value is missing, mean imputation fills it with (20 + 25 + 30) / 3 = 25.
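
Mean imputation for the ages above can be sketched as follows, assuming missing values are stored as None. The helper name impute_mean is hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [20, 25, None, 30]
filled_ages = impute_mean(ages)   # the missing age becomes (20 + 25 + 30) / 3
print(filled_ages)
```

In a real project the fill value would normally be computed on training data only and then reused on new data, so that test rows do not influence it.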

Scaling

Adjusting numerical features to comparable ranges so that large-unit features do not dominate some algorithms.

If age ranges 18-80 and income ranges 10,000-500,000, income may dominate distance-based models. Standard scaling uses z = (x - mean) / standard deviation.

Mean

The arithmetic average. In ML, the mean is often used for summaries, imputation, centering, and baseline comparisons.

For incomes 30,000, 40,000, and 50,000, the mean is (30000 + 40000 + 50000) / 3 = 40000. A missing income may be filled with 40,000.

Median

The middle value after sorting. It is useful when data has extreme values that would distort the mean.

For incomes 25,000, 30,000, 35,000, 40,000, and 900,000, the median is 35,000 while the mean is 206,000. The median better reflects the typical customer.

Mode

The most frequent value. It is often used to summarize or fill missing categorical data.

If payment modes are card, UPI, UPI, cash, UPI, the mode is UPI. Missing payment_mode values may be filled with UPI.
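
The mean, median, and mode examples above can be checked with Python's statistics module. The variable names are illustrative only.

```python
from statistics import mean, median, mode

incomes = [25_000, 30_000, 35_000, 40_000, 900_000]
payment_modes = ["card", "UPI", "UPI", "cash", "UPI"]

income_mean = mean(incomes)        # 206,000, pulled up by the one extreme income
income_median = median(incomes)    # 35,000, the middle value after sorting
common_mode = mode(payment_modes)  # "UPI", the most frequent category

print(income_mean, income_median, common_mode)
```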

Variance

A measure of spread around the mean. In ML, variance also describes how unstable a model's predictions are across different training samples.

If one training split gives churn predictions near 0.20 and another gives 0.75 for similar customers, the model has high variance. Its predictions are too sample-sensitive.

Standard Deviation

A spread measure in the original unit of the data, often used to understand feature dispersion and standardize features.

If mean age = 40 and standard deviation = 10, then age 55 becomes z = (55 - 40) / 10 = 1.5. This says 55 is 1.5 standard deviations above average.
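
The z-score formula used in the Scaling and Standard Deviation entries can be sketched like this. The three income values are hypothetical, chosen inside the range quoted earlier, and the population standard deviation (pstdev) is one of two common conventions.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """How many standard deviations x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

def standardize(values):
    """Scale a feature using its own mean and population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [z_score(v, mu, sigma) for v in values]

age_z = z_score(55, 40, 10)
print(age_z)  # 1.5

incomes = [10_000, 60_000, 500_000]       # hypothetical incomes
scaled_incomes = standardize(incomes)     # now centered on 0 with spread 1
```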

Outlier

An unusually extreme value that may represent rare behavior, data error, or an important signal for the model.

If a customer usually spends 500-2,000 but suddenly spends 100,000, the value is an outlier. It may be a data error, a rare genuine purchase, or a fraud signal.

Distribution

The shape of values in a feature or target. Distribution affects preprocessing, assumptions, model choice, and evaluation.

If most spends are 200-1,000 but a few are 100,000+, the distribution is right-skewed. A log transform such as log(spend) may make the feature easier to model.

Class Imbalance

A classification problem where one class is much more common than another, making accuracy alone misleading.

If 10,000 transactions contain only 100 frauds, predicting no fraud for all gives 9900 / 10000 = 99% accuracy but catches zero frauds.
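
The misleading 99% figure is easy to reproduce. The sketch below builds the 10,000 labels from the example and scores a "model" that always predicts no fraud.

```python
actual = [1] * 100 + [0] * 9_900     # 100 frauds among 10,000 transactions
predicted = [0] * len(actual)        # always predict "no fraud"

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
frauds_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(accuracy, frauds_caught)  # 0.99 0
```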

Correlation

A measure of how two numerical variables move together. It helps detect relationships, redundancy, and possible predictors.

If usage rises from 10 to 50 hours and renewal rate rises from 40% to 85%, usage and renewal are positively correlated. But correlation alone does not prove usage causes renewal.

Multicollinearity

A situation where input features are highly related to each other, making coefficient interpretation unstable in linear models.

Area in square feet and area in square meters are perfectly correlated (apart from rounding) because sq_m = sq_ft x 0.0929. Keeping both adds no new information and can make coefficients unstable.
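
A sketch of the Pearson correlation coefficient shows why the two area columns are redundant. The four house areas are hypothetical, and the function pearson is written out by hand here rather than taken from a library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

area_sq_ft = [800, 1200, 1500, 2000]            # hypothetical house areas
area_sq_m = [a * 0.0929 for a in area_sq_ft]    # same information, new unit
r = pearson(area_sq_ft, area_sq_m)
print(round(r, 6))  # 1.0
```

A correlation this close to 1 between two features is a common signal to drop one of them before fitting a linear model.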

Regression

An ML task where the target is numerical. The model predicts an amount, score, time, price, or other continuous value.

Predicting airfare as 8,750 or delivery time as 42 minutes is regression because the output is numeric. Error may be measured as actual - predicted.

Classification

An ML task where the target is a category or class.

If the model predicts no_show = yes/no or spam = yes/no, it is classification. The output is a class, often after comparing probability with a threshold.

Clustering

An unsupervised learning task that groups similar observations without using a known target label.

Customers may be grouped by spend and usage into low-value, regular, and premium clusters. No churn label is required for this grouping.

Dimensionality Reduction

Techniques that reduce many input features into fewer useful representations while trying to preserve important information.

If a dataset has 100 highly related survey questions, PCA may compress them into 5 components that capture most variation. PCA = Principal Component Analysis.

Logistic Regression

A classification algorithm that models the probability of an event, usually for a yes/no target.

It may compute P(churn) = 0.73 from usage, complaints, and payment delay. If the threshold is 0.50, the predicted class becomes churn. P(churn) = probability of churn. Churn = customer leaves or stops using the service.

Sigmoid Function

The S-shaped function used by logistic regression to convert any real-valued score into a probability between 0 and 1.

If a model score is z = 1.2, sigmoid gives 1 / (1 + e^(-1.2)) = 0.769. That can be read as about 76.9% probability.

Log-Odds

The scale used inside logistic regression before converting the result into probability.

If probability is 0.75, odds are 0.75 / 0.25 = 3, and log-odds are ln(3) = 1.099, using the natural logarithm.
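
The sigmoid and log-odds entries describe two functions that undo each other. A minimal sketch, with function names chosen for this example:

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Map a probability p (0 < p < 1) to its natural-log odds."""
    return math.log(p / (1.0 - p))

print(round(sigmoid(1.2), 3))     # 0.769
print(round(log_odds(0.75), 3))   # 1.099
print(round(sigmoid(log_odds(0.75)), 2))  # 0.75, the round trip
```

Logistic regression does its arithmetic on the log-odds scale, where the model is linear, and applies the sigmoid only at the end to report a probability.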

Decision Boundary

The line, curve, or surface that separates one predicted class from another.

With threshold 0.50, cases where P(churn) >= 0.50 are classified as churn and others as no churn. That cutoff creates the boundary.

Probability

A number between 0 and 1 expressing how likely an event is. Many ML classifiers output probabilities before converting them into class labels.

A model may output P(no-show) = 0.72. This does not mean the passenger will surely miss the flight; it means similar cases miss about 72 times out of 100. P(...) = probability of the event inside brackets.

Threshold

The cutoff used to convert a predicted probability into a class decision.

If P(fraud) = 0.83 and threshold = 0.80, flag fraud. If threshold is raised to 0.90, the same transaction is not flagged. P(fraud) = probability that the transaction is fraud.
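
The threshold rule is a one-line comparison. The sketch below assumes the convention used in the Decision Boundary entry, where a probability equal to the threshold counts as positive; the function name classify is hypothetical.

```python
def classify(probability, threshold):
    """Return 1 (flag) when probability reaches the threshold, else 0."""
    return int(probability >= threshold)

p_fraud = 0.83
flag_at_080 = classify(p_fraud, 0.80)
flag_at_090 = classify(p_fraud, 0.90)
print(flag_at_080, flag_at_090)  # 1 0
```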

Confusion Matrix

A table that compares predicted classes with actual classes, showing correct and incorrect classifications.

Example: TP = 70, FP = 30, FN = 20, TN = 880. This means 70 frauds were caught, 20 frauds were missed, and 30 genuine transactions were wrongly flagged. TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative.

Accuracy

The proportion of total predictions that are correct. It is most informative when classes are roughly balanced and both kinds of error cost about the same.

Using TP = 70, FP = 30, FN = 20, TN = 880, accuracy is (TP + TN) / total = (70 + 880) / 1000 = 95%. TP = True Positive, TN = True Negative. Accuracy counts all correct predictions.

Precision

Among the cases predicted as positive, the proportion that are truly positive. It matters when false alarms are costly.

Using TP = 70 and FP = 30, precision is TP / (TP + FP) = 70 / 100 = 70%. Of all flagged cases, 70% were truly fraud. TP = True Positive, FP = False Positive. Precision asks: among predicted positives, how many were correct?

Recall

Among all truly positive cases, the proportion the model successfully finds. It matters when missing positives is costly.

Using TP = 70 and FN = 20, recall is TP / (TP + FN) = 70 / 90 = 77.8%. The model caught 77.8% of actual frauds. TP = True Positive, FN = False Negative. Recall asks: among actual positives, how many did the model find?

F1 Score

A single metric that balances precision and recall, useful when both false alarms and missed cases matter.

If precision = 70% and recall = 77.8%, then F1 = 2PR / (P + R) = 2(0.70)(0.778) / (0.70 + 0.778) = 0.737. F1 = F1 score. P = Precision, R = Recall in this formula.

ROC-AUC

A metric showing how well a classifier ranks positive cases above negative cases across thresholds.

If Model A has AUC = 0.82 and Model B has AUC = 0.68, Model A is better at ranking actual churners above non-churners across thresholds. AUC = Area Under the Curve. ROC = Receiver Operating Characteristic. Churner = customer who leaves.
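
One way to read AUC is as the fraction of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The sketch below computes it that way on six hypothetical customers; this pairwise loop is meant for illustration and is slow on large data.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0, 0]                  # hypothetical: 1 = churned
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]      # hypothetical model scores
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

Here 8 of the 9 pairs are ordered correctly; the one miss is the churner scored 0.6 against the non-churner scored 0.7.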

Specificity

The proportion of actual negative cases that the model correctly identifies as negative.

Using TN = 880 and FP = 30, specificity is TN / (TN + FP) = 880 / 910 = 96.7%. TN = True Negative, FP = False Positive.
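
The Accuracy, Precision, Recall, F1 Score, and Specificity entries all use the same four confusion-matrix counts. A short sketch recomputes them in one place:

```python
tp, fp, fn, tn = 70, 30, 20, 880          # counts from the confusion matrix entry
total = tp + fp + fn + tn

accuracy = (tp + tn) / total               # all correct predictions
precision = tp / (tp + fp)                 # correct share of flagged cases
recall = tp / (tp + fn)                    # found share of actual positives
specificity = tn / (tn + fp)               # correct share of actual negatives
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity), ("f1", f1)]:
    print(f"{name}: {value:.3f}")
```

Accuracy (0.950) looks far better than precision (0.700) or recall (0.778) because the 880 true negatives dominate the total.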

Loss Function

The mathematical penalty a model tries to reduce during training. It defines what kind of error the model cares about.

For regression, squared error is (actual - predicted)^2. If actual = 100 and predicted = 90, squared error is 10^2 = 100.

Residual

The difference between the actual value and the predicted value in a regression model.

If actual delivery time is 45 minutes and predicted time is 38 minutes, residual is 45 - 38 = 7. Positive means the model under-predicted.

Error

The gap between what the model predicts and what is actually true. ML training and evaluation are mainly about understanding and reducing error.

A fare prediction of 9,500 when actual fare is 10,000 has error 10000 - 9500 = 500. Absolute error is 500, ignoring direction.

MAE

A regression metric that averages absolute prediction errors, keeping errors in the original unit.

If absolute errors are 100, 200, and 300, then MAE = (100 + 200 + 300) / 3 = 200. MAE = Mean Absolute Error.

MSE

A regression metric that averages squared errors, giving larger mistakes a stronger penalty.

If errors are 2, -3, and 5, then MSE = (2^2 + (-3)^2 + 5^2) / 3 = 38 / 3 = 12.67. MSE = Mean Squared Error.

RMSE

The square root of MSE, bringing squared-error measurement back to the original unit of the target.

If MSE = 12.67, then RMSE = sqrt(12.67) = 3.56. For delivery-time prediction, this means about 3.56 minutes typical error. RMSE = Root Mean Squared Error.
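
The MAE, MSE, and RMSE examples can be reproduced with three small functions. The function names are chosen for this sketch; each takes a list of errors (actual minus predicted).

```python
from math import sqrt

def mae(errors):
    """Mean Absolute Error: average size of the errors, ignoring sign."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean Squared Error: average of squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

def rmse(errors):
    """Root Mean Squared Error: MSE brought back to the original unit."""
    return sqrt(mse(errors))

print(mae([100, 200, 300]))          # 200.0
print(round(mse([2, -3, 5]), 2))     # 12.67
print(round(rmse([2, -3, 5]), 2))    # 3.56
```

For the errors 2, -3, and 5, MAE is about 3.33 while RMSE is 3.56; RMSE is higher because squaring gives the largest error extra weight.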

R-squared

A regression metric showing how much variation in the target is explained by the model.

If R-squared = 0.78, the model explains about 78% of the variation in the target. The remaining 22% is not explained by the model. R-squared is also written as R2.
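
R-squared is commonly computed as 1 minus the ratio of unexplained variation to total variation. The sketch below uses four hypothetical actual and predicted values, not data from this glossary.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [10, 20, 30, 40]        # hypothetical targets
predicted = [12, 18, 33, 37]     # hypothetical model outputs
r2 = r_squared(actual, predicted)
print(round(r2, 3))  # 0.948
```

Here the residual sum of squares is 26 and the total sum of squares is 500, so the model leaves about 5% of the variation unexplained.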

Bias

In ML, bias can mean systematic error from overly simple assumptions, or unfair distortion in the data or predictions.

If the true pattern is curved, such as y = x^2, but the model forces a straight line y = b0 + b1x, predictions may be systematically wrong.

Bias-Variance Tradeoff

The balance between a model that is too simple to capture patterns and a model that is too sensitive to training data noise.

A depth-2 tree may miss real structure and underfit. A depth-30 tree may memorize training rows. The useful model is usually between these extremes.

Overfitting

When a model learns training data too closely, including noise, and performs poorly on unseen data.

If training accuracy = 99% and test accuracy = 68%, the gap 99 - 68 = 31 percentage points is a strong overfitting warning.

Underfitting

When a model is too simple or poorly trained to capture the real patterns in the data.

Using only distance to predict airfare may produce high error because date, route, demand, and season also matter. The model is too simple for the problem.

Generalization

The ability of a model to perform well on new, unseen data from the same real-world problem.

If training AUC = 0.86 and test AUC = 0.84, the model generalizes reasonably. If test AUC falls to 0.60, the learned pattern may not transfer. AUC = Area Under the Curve, commonly used to judge ranking quality in classification models.

Cross-Validation

A resampling method that trains and evaluates a model across multiple data splits to estimate performance more reliably.

In 5-fold cross-validation, scores may be 0.81, 0.83, 0.79, 0.82, 0.80. The mean score is 0.81, giving a steadier estimate than one split.
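
The fold mechanics can be sketched as an index splitter in which every row is used for testing exactly once. The helper k_fold_indices is hypothetical, and it does not shuffle, which a real workflow would usually do first. The five scores are the ones quoted above.

```python
from statistics import mean

def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row is in one test fold."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))          # 10 rows, 5 folds of 2 rows each
fold_scores = [0.81, 0.83, 0.79, 0.82, 0.80]
mean_score = mean(fold_scores)
print(len(folds), round(mean_score, 2))      # 5 0.81
```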

Parameter

A value learned by the model from training data, such as a coefficient or weight.

In log_odds = b0 + b1x1 + b2x2, the learned values b0, b1, and b2 are parameters. Training estimates them from data.

Hyperparameter

A setting chosen before or during model training that controls how the algorithm learns.

For a decision tree, max_depth = 4 is a hyperparameter. It is chosen by the developer or tuning process, not directly learned like a coefficient.

Regularization

A technique that discourages overly complex models by penalizing large coefficients or excessive flexibility.

L2 regularization adds a penalty such as lambda x sum(coefficients^2). Large coefficients become costly, reducing overfitting risk.
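
The L2 penalty term can be computed directly. The coefficient values and lambda = 0.1 below are hypothetical, chosen only to show how the penalty grows with coefficient size.

```python
def l2_penalty(coefficients, lam):
    """lambda times the sum of squared coefficients."""
    return lam * sum(c ** 2 for c in coefficients)

small = l2_penalty([0.5, -0.3], lam=0.1)   # 0.1 * (0.25 + 0.09)
large = l2_penalty([5.0, -3.0], lam=0.1)   # 0.1 * (25 + 9)
print(round(small, 3), round(large, 3))    # 0.034 3.4
```

Coefficients ten times larger produce a penalty one hundred times larger, which is why training with this term added to the loss pushes coefficients toward smaller values.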

Gradient Descent

An optimization method that repeatedly adjusts model parameters in the direction that reduces the loss function.

If a coefficient starts at 0.50 and the update is -0.03, the next value becomes 0.50 - 0.03 = 0.47. Many such steps reduce loss.

Learning Rate

A hyperparameter that controls the step size during optimization.

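If the gradient is 4 and the learning rate is 0.01, the step is 0.01 x 4 = 0.04, and the parameter moves by that amount against the gradient (new value = old value - 0.04). Too large a learning rate may overshoot the minimum; too small a rate makes learning slow.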
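
Gradient descent and the learning rate can be seen together on a toy problem. The loss (w - 3)^2, the starting value 0.5, and the learning rates below are all assumptions made for this sketch; real models have many parameters and a loss computed from data.

```python
def gradient(w):
    """Gradient of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate, n_steps):
    """Repeatedly step against the gradient and return the final w."""
    w = w_start
    for _ in range(n_steps):
        w = w - learning_rate * gradient(w)
    return w

step = 0.01 * 4   # learning rate x gradient, as in the entry above
w_final = gradient_descent(w_start=0.5, learning_rate=0.1, n_steps=100)
w_diverged = gradient_descent(w_start=0.5, learning_rate=1.1, n_steps=20)

print(round(step, 2), round(w_final, 4))  # 0.04 3.0
```

With learning rate 0.1 the parameter settles at the minimum w = 3. With 1.1 each step overshoots by more than it corrects, so w_diverged ends far from 3.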

Data Leakage

A serious mistake where information unavailable at prediction time accidentally enters training, making performance look better than reality.

Using cancellation_date to predict whether a customer will cancel is leakage, because that date is known only after cancellation. Test accuracy may look high, but the model will fail in real use, where the date is not yet available.

Pipeline

A repeatable sequence of preprocessing and modeling steps applied consistently to training and future data.

A pipeline may perform imputation, one-hot encoding, scaling, and logistic regression in order. This reduces mistakes when the same steps are needed on new data.

p-value

A statistical measure of evidence against a null hypothesis. In ML, it is more common in model interpretation than in pure prediction.

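If fare_class has p-value = 0.003 in a regression summary, a relationship this strong would appear only about 0.3% of the time if fare_class truly had no effect (the null hypothesis). That is strong evidence against the null, though it does not say how large or useful the effect is.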

Confidence Interval

A range expressing uncertainty around an estimated value, coefficient, metric, or prediction.

A model's AUC may be reported as 0.82 with 95% CI [0.79, 0.85]. This communicates that performance is estimated, not known with perfect certainty. AUC = Area Under the Curve. CI = Confidence Interval.

Model

A learned representation of patterns in data, used to predict, classify, rank, or explain new cases.

A trained model may compute P(churn) = 0.15 for Customer A and P(churn) = 0.81 for Customer B, allowing ranking and action. P(churn) = probability that the customer will leave, cancel, or become inactive.

Inference

Drawing conclusions from data. In ML, inference can also mean using a trained model to generate predictions on new data.

After training, the model receives today's applicant features and outputs P(default) = 0.37. That prediction step is model inference. P(default) = probability that the applicant will default, meaning fail to repay as agreed.