PRE-PROCESSING METHODS - OVERVIEW

Preparing data for reliable machine learning


Methods List.

1. Data Cleaning: Used to correct or remove poor-quality data.
Handling missing values, removing duplicates, correcting wrong values, and treating outliers.

2. Data Transformation: Used to convert data into a more useful format for ML models.
Normalization, standardization, log transformation, and Box-Cox or Yeo-Johnson transformation.

3. Encoding Categorical Data: Used to convert text/category values into numbers.
Label encoding, one-hot encoding, ordinal encoding, and target encoding.

4. Feature Scaling: Important for models affected by distance or magnitude.
Min-max scaling, standard scaling, and robust scaling.

5. Feature Engineering: Used to create better input variables.
Creating new features, combining features, date-time extraction, and binning.

6. Feature Selection: Used to keep only useful variables.
Correlation analysis, chi-square test, ANOVA, recursive feature elimination, and lasso-based selection.

7. Dimensionality Reduction: Used when there are too many columns/features.
PCA, LDA, and visualization-oriented methods like t-SNE or UMAP.

8. Handling Imbalanced Data: Used when one class is much larger than another.
Oversampling, undersampling, SMOTE, and class weighting.

9. Text Preprocessing: Used for Natural Language Processing.
Lowercasing, punctuation removal, stopword removal, stemming, lemmatization, and TF-IDF conversion.

10. Data Splitting: Used to prepare data for training and testing.
Train-test split, validation split, and cross-validation.

11. Data Leakage Prevention: Used to keep test-set or future information out of model training.
Fitting preprocessors on training data only, time-based split, and pipeline usage.

Key principle.
Preprocessing is not a single step. It is a sequence of careful choices based on data type, model family, and project objective.

Important caution.
Wrong preprocessing can reduce model quality even when the model algorithm is correct.

Usage / Applicability.

1. Data Cleaning
Use at the beginning of every project.
Essential when datasets come from multiple sources or manual entries.
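
A minimal sketch of the cleaning steps in plain Python. The records, the 0-120 plausible age range, and the choice of median imputation are illustrative assumptions, not a prescribed recipe.

```python
from statistics import median

# Hypothetical raw records: one duplicate row, one missing age, one impossible age.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 2, "age": None},   # exact duplicate of the previous row
    {"id": 3, "age": 29},
    {"id": 4, "age": 410},    # data-entry error
    {"id": 5, "age": 41},
]

# 1. Remove exact duplicates, keeping the first occurrence.
seen, deduped = set(), []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(dict(row))

# 2. Treat values outside a plausible range as missing (range is an assumption).
for row in deduped:
    if row["age"] is not None and not 0 <= row["age"] <= 120:
        row["age"] = None

# 3. Impute missing ages with the median of the observed values.
observed = [row["age"] for row in deduped if row["age"] is not None]
fill_value = median(observed)
for row in deduped:
    if row["age"] is None:
        row["age"] = fill_value

cleaned = deduped
print([row["age"] for row in cleaned])  # [34, 34, 29, 34, 41]
```

In a real project the imputation value should be computed on the training portion only.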

2. Data Transformation
Use when numeric features are strongly skewed or span several orders of magnitude. Applying functions such as logarithms or square roots reshapes a feature's distribution. This often helps linear and logistic regression, which tend to fit better on roughly symmetric inputs, although neither strictly requires normally distributed features.
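
As a rough illustration, the sketch below applies a log transform to a right-skewed feature and compares skewness before and after. The income values and the `skewness` helper are made up for the example.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical right-skewed feature: incomes with one very large value.
incomes = [20_000, 22_000, 25_000, 27_000, 30_000, 35_000, 60_000, 250_000]

# log1p(x) = log(1 + x), which stays defined when x is 0.
log_incomes = [math.log1p(v) for v in incomes]

# The transformed feature is still right-skewed, but less so.
print(round(skewness(incomes), 2), round(skewness(log_incomes), 2))
```

Box-Cox and Yeo-Johnson work on the same idea but estimate the transform's shape from the data.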

3. Encoding Categorical Data
Use whenever a model needs numeric input. Encoding converts non-numerical labels (like cities or product types) into numbers so machine learning algorithms can process them.
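
A small sketch of one-hot and ordinal encoding in plain Python. The city and size values are invented, and the ordinal mapping is supplied by hand on the assumption that small < medium < large.

```python
cities = ["Paris", "Tokyo", "Paris", "Lima"]    # nominal: no natural order
sizes = ["small", "large", "medium", "small"]   # ordinal: order carries meaning

# One-hot encoding: one 0/1 column per category, columns in sorted order.
categories = sorted(set(cities))                # ['Lima', 'Paris', 'Tokyo']
one_hot = [[1 if city == c else 0 for c in categories] for city in cities]

# Ordinal encoding: integers that preserve the stated order.
size_order = {"small": 0, "medium": 1, "large": 2}
ordinal = [size_order[s] for s in sizes]

print(one_hot)   # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(ordinal)   # [0, 2, 1, 0]
```

One-hot encoding avoids implying an order between cities. An integer code would suggest that one city is "greater" than another.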

4. Feature Scaling
Strongly recommended for distance-based, variance-based, and gradient-based methods such as KNN, SVM, PCA, and neural networks. Tree-based models are largely insensitive to feature scale.
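
The two most common scalers can be sketched with the standard library alone. The height values are illustrative.

```python
from statistics import mean, pstdev

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

# Min-max scaling maps values into the range [0, 1].
lo, hi = min(heights_cm), max(heights_cm)
min_max = [(x - lo) / (hi - lo) for x in heights_cm]

# Standard scaling (z-score) gives mean 0 and standard deviation 1.
mu, sigma = mean(heights_cm), pstdev(heights_cm)
standardized = [(x - mu) / sigma for x in heights_cm]

print(min_max)   # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range. Robust scaling uses the median and interquartile range instead.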

5. Feature Engineering
Use when raw columns do not expose the signal directly. Feature engineering turns raw data into more predictive inputs and is often one of the most effective ways to improve model accuracy.
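
A sketch of date-time extraction, feature combination, and binning. The order records, field names, and bin edges (50 and 100) are hypothetical.

```python
from datetime import datetime

# Hypothetical raw order records.
orders = [
    {"placed_at": "2024-03-09 14:30:00", "total": 120.0, "items": 4},
    {"placed_at": "2024-03-11 08:05:00", "total": 18.0, "items": 1},
]

def engineer(order):
    ts = datetime.strptime(order["placed_at"], "%Y-%m-%d %H:%M:%S")
    total = order["total"]
    return {
        "hour": ts.hour,                              # date-time extraction
        "is_weekend": ts.weekday() >= 5,              # Monday is 0, so 5 and 6 are Sat/Sun
        "price_per_item": total / order["items"],     # combining two columns
        "total_bin": "low" if total < 50 else "mid" if total < 100 else "high",  # binning
    }

features = [engineer(o) for o in orders]
print(features[0])
# {'hour': 14, 'is_weekend': True, 'price_per_item': 30.0, 'total_bin': 'high'}
```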

6. Feature Selection
Use when there are too many columns, correlated variables, or a risk of overfitting and poor interpretability. Fewer, more relevant features usually give faster and simpler models.
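
One simple filter is to keep only features whose correlation with the target exceeds a threshold. In the sketch below, the data, the feature names, and the 0.5 cutoff are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "size": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],    # tracks the target closely
    "noise": [3.0, 1.0, 3.0, 1.0, 2.0, 2.0],   # only weakly related
}

threshold = 0.5  # illustrative cutoff, not a recommended default
scores = {name: pearson(col, target) for name, col in features.items()}
selected = [name for name, r in scores.items() if abs(r) >= threshold]

print(selected)   # ['size']
```

Correlation only captures linear relationships, which is why tests such as chi-square or ANOVA and model-based methods such as recursive feature elimination are also used.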

7. Dimensionality Reduction
Use for high-dimensional data, noise reduction, faster training, and compact visual summaries.
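
To make the idea behind PCA concrete, the sketch below reduces two correlated columns to one by projecting onto the first principal component. It uses power iteration on a 2x2 covariance matrix, which is a teaching simplification. Library implementations typically use SVD. The points are invented.

```python
import math

# Hypothetical 2-D points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

# Center the data.
n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Entries of the 2x2 sample covariance matrix.
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# Power iteration converges to the top eigenvector (first principal component).
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(w[0], w[1])
    v = (w[0] / norm, w[1] / norm)

# Project every point onto that direction: two columns become one.
projected = [x * v[0] + y * v[1] for x, y in centered]

# Share of total variance kept by the single component.
explained = (sum(t * t for t in projected) / (n - 1)) / (cxx + cyy)
print(round(explained, 3))
```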

8. Handling Imbalanced Data
Use when the minority class is the one that matters, as in fraud, default, defect, disease, or churn prediction. Without correction, models tend to favor the majority class and miss the rare cases.
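
Two of the simpler strategies, class weighting and random oversampling, can be sketched as follows. The 95/5 label split is hypothetical, and the weighting formula shown is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
import random
from collections import Counter

labels = ["ok"] * 95 + ["fraud"] * 5   # hypothetical 95/5 imbalance
counts = Counter(labels)

# Class weighting: rare classes get larger weights, so their errors cost more.
weights = {c: len(labels) / (len(counts) * k) for c, k in counts.items()}

# Random oversampling: resample minority rows with replacement until classes match.
rng = random.Random(0)
minority = [i for i, y in enumerate(labels) if y == "fraud"]
extra = rng.choices(minority, k=counts["ok"] - counts["fraud"])
balanced_idx = list(range(len(labels))) + extra
balanced = Counter(labels[i] for i in balanced_idx)

print(weights["fraud"])                    # 10.0
print(balanced["ok"], balanced["fraud"])   # 95 95
```

Resampling should be applied to the training portion only, so that evaluation still reflects the real class distribution.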

9. Text Preprocessing
Use in NLP tasks like sentiment analysis, topic classification, intent detection, and search ranking, where raw text must be converted into structured numeric features.
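
A compact sketch of lowercasing, punctuation removal, stopword removal, and TF-IDF weighting. The stopword list and documents are toy examples, and the unsmoothed formula tf * log(N / df) is one of several TF-IDF variants. Libraries often use smoothed versions.

```python
import math
import string
from collections import Counter

STOPWORDS = {"the", "is", "a", "and", "was"}   # tiny illustrative list

def tokenize(text):
    # Lowercase, strip punctuation, split on whitespace, drop stopwords.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

docs = ["The movie was great!", "The plot is weak.", "Great cast, great plot."]
tokens = [tokenize(d) for d in docs]

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokens for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log(len(tokens) / df[t]) for t, c in tf.items()}

vectors = [tfidf(doc) for doc in tokens]
print(tokens[0])   # ['movie', 'great']
```

In the first document, "movie" gets a higher weight than "great" because "great" also appears in another document.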

10. Data Splitting
Use in all supervised ML tasks to estimate generalization and tune models safely. Held-out data exposes overfitting that training scores hide.
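
The sketch below shows a shuffled train-test split followed by k-fold cross-validation on the training indices. The helper names `split_indices` and `k_fold` are hypothetical, and the 80/20 split and 5 folds are common but arbitrary choices.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train, test) index lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_rows * test_fraction))
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each fold is validation once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

train_idx, test_idx = split_indices(100)
splits = list(k_fold(train_idx, k=5))

print(len(train_idx), len(test_idx))          # 80 20
print(len(splits), len(splits[0][1]))         # 5 16
```

The test indices are set aside once and never enter cross-validation. For time-ordered data, a chronological split replaces the shuffle.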

11. Data Leakage Prevention
Apply throughout the project, not as a final step. It ensures that information from the test dataset never seeps into training, which would otherwise produce artificially high accuracy. Split the data first, fit transformations (like scaling or encoding) on the training set only, then apply them to the validation and test sets. For time-ordered data, never use future data points to predict past events.
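
The difference between a leak-free and a leaky scaler can be sketched as follows. `StandardScalerSketch` is a hypothetical stand-in written for this example, not a library class, and the values are invented.

```python
from statistics import mean, pstdev

class StandardScalerSketch:
    """Minimal stand-in for a library scaler (hypothetical helper)."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train = [10.0, 12.0, 14.0, 16.0]
test = [100.0]   # extreme value that must not influence the fitted statistics

# Correct: fit on training data only, then reuse the fitted statistics on test data.
scaler = StandardScalerSketch().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Leaky: fitting on train + test lets the test value shift the mean and std.
leaky = StandardScalerSketch().fit(train + test)

print(scaler.mean_, leaky.mean_)   # 13.0 30.4
```

Wrapping preprocessing and model in a single pipeline object makes the fit-on-train-only rule automatic, including inside cross-validation.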

Strong ML pipelines combine cleaning, transformation, encoding, scaling, feature work, and evaluation discipline in a repeatable sequence.

Method                      Usage / Applicability
Data Cleaning               Fix missing, duplicate, invalid, and outlier records before modeling.
Data Transformation         Adjust skewed or non-ideal distributions to better fit model assumptions.
Encoding Categorical Data   Convert text categories into numeric form for ML algorithms.
Feature Scaling             Align numeric ranges for distance-based and gradient-based models.
Feature Engineering         Create or combine variables to improve predictive signal.
Feature Selection           Keep relevant variables and reduce noise, redundancy, and overfitting risk.
Dimensionality Reduction    Compress high-dimensional data for speed, stability, and visualization.
Handling Imbalanced Data    Improve minority-class learning with resampling or class-weight strategies.
Text Preprocessing          Standardize and vectorize text for NLP tasks.
Data Splitting              Measure generalization safely with train/test/validation design.
Data Leakage Prevention     Prevent future/test information from contaminating model training.

Good preprocessing improves data quality, aligns inputs with model assumptions, reduces leakage and overfitting risk, and improves model reliability.