Methods List.
1. Data Cleaning: Used to correct or remove poor-quality data.
Handling missing values, removing duplicates, correcting wrong values, and treating outliers.
2. Data Transformation: Used to convert data into a more useful format for ML models.
Normalization, standardization, log transformation, and Box-Cox or Yeo-Johnson transformation.
3. Encoding Categorical Data: Used to convert text/category values into numbers.
Label encoding, one-hot encoding, ordinal encoding, and target encoding.
4. Feature Scaling: Important for models affected by distance or magnitude.
Min-max scaling, standard scaling, and robust scaling.
5. Feature Engineering: Used to create better input variables.
Creating new features, combining features, date-time extraction, and binning.
6. Feature Selection: Used to keep only useful variables.
Correlation analysis, chi-square test, ANOVA, recursive feature elimination, and lasso-based selection.
7. Dimensionality Reduction: Used when there are too many columns/features.
PCA, LDA, and visualization-oriented methods like t-SNE or UMAP.
8. Handling Imbalanced Data: Used when one class is much larger than another.
Oversampling, undersampling, SMOTE, and class weighting.
9. Text Preprocessing: Used for Natural Language Processing.
Lowercasing, punctuation removal, stopword removal, stemming, lemmatization, and TF-IDF conversion.
10. Data Splitting: Used to prepare data for training and testing.
Train-test split, validation split, and cross-validation.
11. Data Leakage Prevention: Used to keep information from the test set or from the future out of model training.
Fitting preprocessors on training data only, time-based split, and pipeline usage.
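As a minimal sketch of "fitting preprocessors on training data only", the example below learns standard-scaling statistics from a training portion and reuses them, unchanged, on a test portion. The function names and the toy numbers are hypothetical and chosen only for illustration. A real project would more likely rely on a library scaler.

```python
from statistics import mean, pstdev

def fit_standard_scaler(train_values):
    """Learn the mean and standard deviation from training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid dividing by zero
    return mu, sigma

def apply_standard_scaler(values, params):
    """Scale values with previously learned statistics."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical toy feature, already split into train and test portions.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 8.0]

params = fit_standard_scaler(train)                 # statistics come from train only
train_scaled = apply_standard_scaler(train, params)
test_scaled = apply_standard_scaler(test, params)   # test reuses the train statistics
```

Computing the mean and standard deviation over train and test together would let test values influence the training inputs, which is the leakage this item warns about.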
Key principle.
Preprocessing is not a single step. It is a sequence of careful choices based on data type,
model family, and project objective.
Important caution.
Wrong preprocessing can reduce model quality even when the model algorithm is correct.
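One common instance of this caution, sketched under assumptions: label-encoding a nominal feature (one with no natural order) gives the categories an artificial ranking, while one-hot encoding keeps them equally far apart. The helper names and the city list below are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Map each value to a 0/1 indicator row, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

cities = ["Paris", "Tokyo", "Lima", "Tokyo"]  # hypothetical nominal feature

labels, mapping = label_encode(cities)
one_hot, categories = one_hot_encode(cities)

# Label codes: Lima-Tokyo looks twice as far apart as Lima-Paris.
label_gaps = (abs(mapping["Lima"] - mapping["Paris"]),
              abs(mapping["Lima"] - mapping["Tokyo"]))

# One-hot rows: every pair of distinct cities is equally far apart.
one_hot_gaps = (squared_distance(one_hot[2], one_hot[0]),   # Lima vs Paris
                squared_distance(one_hot[2], one_hot[1]))   # Lima vs Tokyo
```

A distance-based or linear model would treat the unequal label gaps as real structure, so the model can be correct and still learn from a misleading input.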
Usage / Applicability.
1. Data Cleaning
Use at the beginning of every project.
Essential when datasets come from multiple sources or manual entries.
2. Data Transformation
Use when features are heavily skewed or span several orders of magnitude. Mathematical functions such as logarithms or square roots reshape a feature's distribution.
Helpful for linear and logistic regression, which are sensitive to extreme values. These models do not require normally distributed features, but reducing skew often stabilizes the fit.
3. Encoding Categorical Data
Use when features hold non-numerical labels (such as cities or product types), which must be converted to numbers before most ML algorithms can process them.
4. Feature Scaling
Strongly recommended for distance-based, variance-based, and gradient-trained methods such as KNN, SVM, PCA, and neural networks. Tree-based models are largely insensitive to feature scale.
5. Feature Engineering
Use to turn complex, raw data into more predictive inputs. Well-designed features can improve a model's accuracy.
6. Feature Selection
Use when there are too many columns, strongly correlated variables, or a risk of overfitting and poor interpretability. Fewer features usually mean faster training and simpler models.
7. Dimensionality Reduction
Use for high-dimensional data, noise reduction, faster training, and compact visual summaries. It can expose structure in the data while reducing complexity.
8. Handling Imbalanced Data
Use when the minority class is the one that matters, as in fraud, default, defect, disease, or churn prediction. Without correction, models tend to favor the majority class and miss the rare cases.
9. Text Preprocessing
Use in NLP tasks such as sentiment analysis, topic classification, intent detection, and search ranking, where raw text must be converted into structured numeric features.
10. Data Splitting
Use in all supervised ML tasks to estimate generalization and tune models safely. Evaluating on held-out data exposes overfitting that training scores hide.
11. Data Leakage Prevention
Apply throughout the project, not only at the end. It ensures that information from the test dataset never seeps into the training process, which would otherwise produce artificially high accuracy.
To guard against leakage, split the data first, fit transformations (such as scaling or encoding) on the training portion only, then apply them unchanged to the validation and test portions.
For time-ordered data, never use future data points to predict past events.
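The rule about not using future data to predict past events can be sketched as a time-based split. This is a minimal illustration, assuming each row carries a `date` field. The function name, field names, and sales figures are hypothetical.

```python
from datetime import date

def time_based_split(rows, cutoff):
    """Put rows dated before `cutoff` in train and all later rows in test."""
    ordered = sorted(rows, key=lambda r: r["date"])
    train = [r for r in ordered if r["date"] < cutoff]
    test = [r for r in ordered if r["date"] >= cutoff]
    return train, test

# Hypothetical monthly records, deliberately out of order.
rows = [
    {"date": date(2023, 3, 1), "sales": 120},
    {"date": date(2023, 1, 1), "sales": 100},
    {"date": date(2023, 4, 1), "sales": 130},
    {"date": date(2023, 2, 1), "sales": 110},
]

train, test = time_based_split(rows, cutoff=date(2023, 3, 1))
# Every training row is older than every test row, so the model is
# evaluated only on data from after the period it was trained on.
```

A random shuffle would mix later months into the training set, which is why a cutoff date is used here in place of a random train-test split.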
Strong ML pipelines combine cleaning, transformation, encoding, scaling, feature work, and evaluation discipline in a repeatable sequence.
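A "repeatable sequence" can be sketched as a small pipeline object that fits each step on training data in order and then replays the same fitted steps on new data. The classes below (`ToyMeanImputer`, `ToyMinMaxScaler`, `ToyPipeline`) are hypothetical stand-ins written for this illustration, not a library API.

```python
class ToyMeanImputer:
    """Replace None with the mean of the non-missing training values."""
    def fit(self, values):
        present = [v for v in values if v is not None]
        self.fill_ = sum(present) / len(present)
        return self

    def transform(self, values):
        return [self.fill_ if v is None else v for v in values]

class ToyMinMaxScaler:
    """Rescale values using the training minimum and maximum."""
    def fit(self, values):
        self.lo_, self.hi_ = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi_ - self.lo_) or 1.0  # constant feature: avoid dividing by zero
        return [(v - self.lo_) / span for v in values]

class ToyPipeline:
    """Run steps in order; each step is fitted on the previous step's output."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values

train = [2.0, None, 4.0, 6.0]   # hypothetical feature with a missing value
test = [None, 8.0]

pipe = ToyPipeline([ToyMeanImputer(), ToyMinMaxScaler()]).fit(train)
train_out = pipe.transform(train)
test_out = pipe.transform(test)  # may fall outside [0, 1]: test exceeds the train range
```

Because the fitted state lives inside the pipeline, the same cleaning and scaling decisions are applied identically to training, validation, and test data.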