Logistic Regression

Classification by probability and boundary

Machine Learning View.

In machine learning, logistic regression is a supervised learning algorithm used mainly for classification. It learns from labelled examples and estimates the probability that a new observation belongs to a particular class.

The word regression can be confusing here. Logistic regression is not mainly used to predict a continuous number. It predicts a probability first, and that probability is then converted into a class such as yes/no, pass/fail, accepted/rejected, or survived/not survived.

The learning problem.
The training data contains input features and a known target label. The model looks for a pattern that separates one class from another. During training, it adjusts its weights so that examples from one class receive higher probabilities and examples from the other class receive lower probabilities.

Features, weights, and score.
Each input feature is multiplied by a learned weight. These weighted values are combined into a single score. A positive weight pushes the prediction toward class 1. A negative weight pushes it toward class 0. A weight close to zero means that feature contributes very little to the decision.
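The score described above can be sketched in a few lines of Python. The feature values, weights, and the bias (intercept) term are hypothetical numbers chosen only for illustration; the text does not mention a bias, so treat it as an assumption.

```python
# Hypothetical feature values and weights, chosen only for illustration.
features = [3.0, 1.0, 0.5]   # e.g. link count, suspicious words, sender reputation
weights = [0.8, 1.2, -2.0]   # positive pushes toward class 1, negative toward class 0
bias = -1.5                  # assumed intercept term, learned alongside the weights


def linear_score(x, w, b):
    """Multiply each feature by its weight, add the results, then add the bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


score = linear_score(features, weights, bias)
print("Score:", score)
```

A score above zero leans toward class 1 and a score below zero leans toward class 0.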

Why the sigmoid function is used.
The raw score can be any number, but a probability must stay between 0 and 1. Logistic regression uses the sigmoid function to squash the score into that range. The output can then be read as the model's estimated probability of class 1.
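As a minimal sketch, the sigmoid can be written as 1 / (1 + e^(-z)), where z is the raw score. The two-branch form below is an implementation choice made here to avoid overflow for large scores; it is not something the text prescribes.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing when |z| is large.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Large negative scores approach 0, a score of 0 gives 0.5,
# and large positive scores approach 1.
for z in (-4.0, 0.0, 1.1, 4.0):
    print(z, round(sigmoid(z), 3))
```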

Probability comes before the class.
Logistic regression does not directly jump to an answer like "yes" or "no". It first gives a probability, such as 0.82, and then a threshold turns that probability into a final class.

The decision threshold.
A common threshold is 0.50. If the predicted probability is 0.50 or higher, the model predicts class 1. If it is below 0.50, it predicts class 0. In real projects, the threshold can be changed depending on the cost of false alarms and missed cases.
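The threshold rule can be sketched as a one-line function. The name classify and the sample probabilities are hypothetical, used only to show the rule in action.

```python
def classify(probability, threshold=0.50):
    """Return 1 when the probability reaches the threshold, otherwise 0."""
    return 1 if probability >= threshold else 0


probabilities = [0.82, 0.50, 0.49, 0.18]

# Default threshold of 0.50.
print([classify(p) for p in probabilities])                  # [1, 1, 0, 0]
# A lower threshold flags more cases as class 1.
print([classify(p, threshold=0.30) for p in probabilities])  # [1, 1, 1, 0]
```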

The decision boundary.
Logistic regression draws a boundary in the feature space. On one side of the boundary, the model leans toward class 0. On the other side, it leans toward class 1. With two features, this boundary is easy to imagine as a straight line. With three features it is a plane, and with more features it is a flat higher-dimensional surface called a hyperplane.
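With two features and a 0.50 threshold, the boundary is the set of points where the score is exactly zero, because that is where the sigmoid returns 0.5. The sketch below assumes hypothetical weights w1, w2 and a bias b, and checks one point on the line and one point on each side of it.

```python
import math

# Hypothetical learned values, for illustration only.
w1, w2, b = 1.5, -2.0, 0.5


def probability(x1, x2):
    """Probability of class 1 for a point with two features."""
    z = w1 * x1 + w2 * x2 + b
    return 1.0 / (1.0 + math.exp(-z))


def boundary_x2(x1):
    """x2 value on the line w1*x1 + w2*x2 + b = 0 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2


x1 = 2.0
x2_on_line = boundary_x2(x1)

print("On the boundary:", probability(x1, x2_on_line))
print("One side:       ", probability(x1, x2_on_line + 1.0))
print("Other side:     ", probability(x1, x2_on_line - 1.0))
```

Which side leans toward class 1 depends on the signs of the weights. Here w2 is negative, so increasing x2 lowers the probability.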

Training with loss.
The model is trained by reducing a loss value, typically the log loss (also called cross-entropy). The loss becomes large when the model gives high confidence to the wrong class. Training repeatedly changes the weights to reduce this loss over the training examples.
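The idea of repeatedly adjusting weights to reduce the loss can be sketched with one feature and a handful of made-up points. This example assumes the log loss and plain gradient descent with a fixed learning rate; libraries usually rely on more refined solvers, so read it as an illustration rather than how any particular library trains.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(xs, ys, w, b):
    """Average log loss; large when a confident prediction is wrong."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


# Tiny made-up dataset: one feature, binary label.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0          # start with no knowledge
learning_rate = 0.1      # assumed step size
losses = []

for step in range(200):
    losses.append(mean_log_loss(xs, ys, w, b))
    # Gradient of the mean log loss with respect to w and b.
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(errors) / len(xs)
    # Move the weight and bias a small step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("Loss at start:", round(losses[0], 4))
print("Loss at end:  ", round(losses[-1], 4))
```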

Regularization.
Logistic regression can overfit when it tries too hard to match the training data. Regularization controls this by discouraging unnecessarily large weights. This helps the model generalize better to unseen data.
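One common way to discourage large weights is to add a penalty to the loss. The sketch below assumes an L2 penalty (the sum of squared weights times a strength factor); the text does not say which penalty is meant, and the data-loss values are hypothetical.

```python
def l2_penalty(weights, strength):
    """Penalty that grows with the square of each weight."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, strength=0.1):
    """Total loss = fit to the training data + penalty on weight size."""
    return data_loss + l2_penalty(weights, strength)


small_weights = [0.5, -0.3, 0.2]
large_weights = [5.0, -3.0, 2.0]

# The large-weight model fits the training data slightly better (0.35 vs 0.40),
# but its penalty makes the total loss much worse.
print(regularized_loss(0.40, small_weights))
print(regularized_loss(0.35, large_weights))
```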

Binary and multi-class use.
The simplest form is binary classification. For more than two classes, machine learning systems commonly extend logistic regression using one-vs-rest or softmax-style approaches.
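For the softmax-style approach, each class gets its own score, and the scores are turned into probabilities that sum to 1. The class scores below are hypothetical; subtracting the maximum score is a numerical-stability choice made in this sketch.

```python
import math


def softmax(scores):
    """Convert one score per class into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the largest score avoids overflow and does not change the result.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_scores = [2.0, 1.0, -1.0]   # hypothetical scores for three classes
probs = softmax(class_scores)
predicted = max(range(len(probs)), key=probs.__getitem__)

print("Probabilities:", [round(p, 3) for p in probs])
print("Predicted class index:", predicted)
```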

Important caution.
Logistic regression is useful when the classes can be separated reasonably well by a linear boundary. If the boundary is highly curved or complex, another model may perform better.

A simple mental picture.
Think of logistic regression as a probability meter. The features push the meter upward or downward. The threshold decides where the meter changes from one class to the other.

Logistic regression is best understood as a classification model that learns a weighted boundary and reports probability before giving the final class.

Examples for Understanding.

Copy-paste starter code.
This small example uses built-in sample data, so a beginner can run it without downloading any file. It shows the most important idea: the model first creates probabilities, then a threshold converts those probabilities into classes.

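from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Built-in binary dataset: target 0 = malignant, target 1 = benign.
data = load_breast_cancer()
X = data.data
y = data.target

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A high max_iter gives the solver room to converge on these unscaled features.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Step 1: probabilities. Column 1 holds the probability of class 1.
probabilities = model.predict_proba(X_test)[:, 1]
# Step 2: the threshold converts probabilities into classes.
threshold = 0.50
predictions = (probabilities >= threshold).astype(int)

print("First 5 probabilities:", probabilities[:5])
print("First 5 predictions:", predictions[:5])
print("Accuracy:", accuracy_score(y_test, predictions))
# Rows are true classes, columns are predicted classes.
print("Confusion matrix:")
print(confusion_matrix(y_test, predictions))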

Try changing the same code.
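Change threshold from 0.50 to 0.30 or 0.70 and watch how the confusion matrix changes. Change test_size to 0.20 or 0.30 to see how train-test splitting affects evaluation. Add C to the model, for example LogisticRegression(C=0.5, max_iter=5000) or LogisticRegression(C=2.0, max_iter=5000), to observe regularization. In scikit-learn, C is the inverse of the regularization strength, so a smaller C regularizes more strongly.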

1. Email spam classification.
Features may include number of links, suspicious words, sender reputation, and message length. The target label is spam or not spam.

The model may output a spam probability of 0.91. With a 0.50 threshold, the email is classified as spam. If the probability is 0.18, it is classified as not spam.

2. Loan approval risk.
Features may include income, repayment history, loan amount, and existing debt. The model estimates the probability that the applicant may default.

If the probability of default is 0.76, the system may mark the case as high risk. If it is 0.22, the case may be marked as lower risk. The final business decision can still include human review.

3. Medical screening example.
A screening model may use age, test measurements, symptoms, and history to estimate the probability of a condition being present.

Here the threshold may not be 0.50. If missing a real case is dangerous, the threshold may be lowered so more patients are flagged for further testing. This increases false positives but reduces missed cases.

4. Customer churn.
A company may predict whether a customer will leave. Features can include complaint count, usage drop, plan price, tenure, and support history.

A churn probability of 0.68 may trigger a retention offer. A probability of 0.12 may need no immediate action. The model supports prioritization rather than replacing judgement.

5. Understanding weights.
In a churn model, a positive weight for complaint count means more complaints push the probability of churn upward. A negative weight for tenure means a longer relationship pushes the churn probability downward.
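A small sketch makes the direction of each weight visible. The weights, the bias, and the function name churn_probability are all hypothetical; only the signs matter for the illustration.

```python
import math

# Hypothetical values: positive weight for complaints, negative for tenure.
W_COMPLAINTS = 0.6
W_TENURE = -0.4
BIAS = -0.5


def churn_probability(complaints, tenure_years):
    """Probability of churn from two features, using the assumed weights."""
    z = W_COMPLAINTS * complaints + W_TENURE * tenure_years + BIAS
    return 1.0 / (1.0 + math.exp(-z))


# More complaints, same tenure: probability rises.
for complaints in (0, 2, 4):
    print("complaints =", complaints, round(churn_probability(complaints, 2), 3))

# Longer tenure, same complaints: probability falls.
for tenure in (1, 3, 5):
    print("tenure =", tenure, round(churn_probability(2, tenure), 3))
```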

6. Threshold trade-off.
Suppose fraud detection uses a threshold of 0.50. Many fraud cases are caught, but some genuine payments may be blocked. If the threshold is raised to 0.80, fewer genuine payments are blocked, but more fraud may pass.
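The trade-off can be counted directly on a small made-up set of payments. The labels and probabilities below are invented for illustration (1 = fraud, 0 = genuine), and count_errors is a hypothetical helper.

```python
# Invented data: 1 = fraud, 0 = genuine payment.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probabilities = [0.95, 0.85, 0.70, 0.55, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10]


def count_errors(labels, probabilities, threshold):
    """Return (genuine payments blocked, fraud cases missed) at a threshold."""
    blocked_genuine = 0
    missed_fraud = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            blocked_genuine += 1      # false positive
        elif predicted == 0 and y == 1:
            missed_fraud += 1         # false negative
    return blocked_genuine, missed_fraud


print("Threshold 0.50:", count_errors(labels, probabilities, 0.50))  # (2, 0)
print("Threshold 0.80:", count_errors(labels, probabilities, 0.80))  # (0, 2)
```

Raising the threshold in this toy data removes both false alarms but lets two fraud cases through.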

7. Confusion matrix reading.
After prediction, results can be counted as true positives, true negatives, false positives, and false negatives. This is more useful than accuracy alone when one mistake is more serious than another.
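The four counts can be computed by hand, which also shows why accuracy alone can mislead. The labels below are invented: only two of ten cases are positive, and the model finds one of them. The helper confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Invented, imbalanced example: two real positives, one of them missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

c = confusion_counts(y_true, y_pred)
accuracy = (c["tp"] + c["tn"]) / len(y_true)
recall = c["tp"] / (c["tp"] + c["fn"])   # share of real positives that were caught

print(c)
print("Accuracy:", accuracy)   # 0.9
print("Recall:", recall)       # 0.5
```

Accuracy looks strong at 0.9, yet half of the real positive cases were missed.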

8. Why scaling can help.
If one feature is measured in rupees and another as a small ratio, their numeric ranges can be very different. Scaling puts the features on comparable ranges, which helps the training process converge and lets regularization treat the weights more evenly.
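One common form of scaling is standardization: subtract the mean and divide by the standard deviation. The sketch below uses invented values for an income column in rupees and a debt-ratio column; standardize is a hypothetical helper written with the standard library only.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Invented values with very different numeric ranges.
income_rupees = [250000.0, 400000.0, 650000.0, 900000.0]
debt_ratio = [0.10, 0.35, 0.20, 0.45]

scaled_income = standardize(income_rupees)
scaled_ratio = standardize(debt_ratio)

print([round(v, 2) for v in scaled_income])
print([round(v, 2) for v in scaled_ratio])
```

After scaling, both columns sit on a similar range. In scikit-learn, StandardScaler applies the same transformation and is usually fitted on the training split only.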