Data Preprocessing

mlcli includes a comprehensive data preprocessing module to prepare your data for training. Preprocess your data with a single command or configure preprocessing in your YAML files.

Quick Start

Preprocess a CSV file, choosing the scaler, encoder, and imputer explicitly:

Terminal
mlcli preprocess data/raw.csv \
  --output data/processed.csv \
  --scaler standard \
  --encoder onehot \
  --imputer mean

Available Preprocessors

Scalers

Normalize feature values to a common scale.

standard, minmax, robust, maxabs

Encoders

Convert categorical variables to numeric.

label, onehot, ordinal, target

Imputers

Handle missing values in your data.

mean, median, most_frequent, constant

Feature Selectors

Select the most important features.

variance, kbest, mutual_info, rfe

Scalers

Scalers normalize feature values to a common scale, which is important for algorithms that are sensitive to feature magnitudes.

Standard Scaler

Standardizes features by removing the mean and scaling to unit variance. Best for normally distributed data.

Terminal
mlcli preprocess data.csv --scaler standard
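
To make the transformation concrete, here is a minimal Python sketch of standardization as described above. It is an illustration only and not mlcli's implementation; the function name `standard_scale` and the choice of population standard deviation are assumptions.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        # A constant column has no spread; map every value to zero.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


ages = [20.0, 30.0, 40.0]
scaled = standard_scale(ages)
# The scaled column has mean 0 and unit variance.
```

After scaling, the middle value sits at 0 and the two outer values are symmetric around it.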

MinMax Scaler

Scales features to a given range (default 0 to 1). Good for neural networks and algorithms that expect bounded inputs.

Terminal
mlcli preprocess data.csv --scaler minmax
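
The rescaling to a fixed range can be sketched in a few lines of Python. This is an illustration of the general technique, not mlcli's code; `minmax_scale` and its `low`/`high` parameters are hypothetical names.

```python
def minmax_scale(values, low=0.0, high=1.0):
    """Linearly map values so the minimum becomes `low` and the maximum `high`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column cannot be stretched; pin it to the lower bound.
        return [low for _ in values]
    span = hi - lo
    return [low + (x - lo) / span * (high - low) for x in values]


incomes = [20000.0, 50000.0, 80000.0]
scaled = minmax_scale(incomes)  # [0.0, 0.5, 1.0]
```

Because the minimum and maximum define the range, a single extreme value compresses every other value toward one end of the interval.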

Robust Scaler

Centers each feature on its median and scales it by the interquartile range (IQR), which makes it robust to outliers.

Terminal
mlcli preprocess data.csv --scaler robust
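
The sketch below shows median/IQR scaling in plain Python, with one outlier in the sample to show why the method is robust. It is not mlcli's implementation; `robust_scale` is a hypothetical name, and the "inclusive" quartile method is an assumption (other tools may compute quartiles slightly differently).

```python
from statistics import median, quantiles


def robust_scale(values):
    """Return (x - median) / IQR for each value."""
    med = median(values)
    # Quartiles via the "inclusive" method (an assumption for this sketch).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # No spread in the middle half of the data; return zeros.
        return [0.0 for _ in values]
    return [(x - med) / iqr for x in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
scaled = robust_scale(data)
```

The outlier stays large after scaling, but it does not affect how the other four values are scaled, because the median and IQR ignore the extremes.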

Encoders

Encoders convert categorical variables to numeric values that ML algorithms can process.

Label Encoder

Converts each unique category to an integer. It is simple, but it introduces an artificial ordinal relationship between categories that some models may treat as meaningful.

Terminal
mlcli preprocess data.csv --encoder label --categorical-cols category1,category2
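
A small Python sketch of label encoding follows. It is illustrative only and not mlcli's code; `label_encode` is a hypothetical name, and assigning integers in sorted category order is an assumption.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted category order."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


codes, mapping = label_encode(["red", "green", "blue", "green"])
# codes   -> [2, 1, 0, 1]
# mapping -> {"blue": 0, "green": 1, "red": 2}
```

The integers imply blue < green < red, an ordering the original colors do not have, so this encoding is a better fit for tree-based models than for linear ones.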

One-Hot Encoder

Creates one binary column per category. Best for nominal categorical variables, which have no inherent order.

Terminal
mlcli preprocess data.csv --encoder onehot --categorical-cols category1,category2
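
The following Python sketch shows what one-hot encoding does to a single column. It is not mlcli's implementation; `one_hot_encode` and the `prefix_category` column naming are assumptions made for the example.

```python
def one_hot_encode(values, prefix):
    """Return one 0/1 column per distinct category, keyed by column name."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


encoded = one_hot_encode(["red", "green", "blue", "green"], prefix="color")
# encoded["color_green"] -> [0, 1, 0, 1]
```

Each row has exactly one 1 across the generated columns, so no ordering between categories is implied. The cost is one extra column per distinct category.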

Feature Selection

Select the most relevant features to improve model performance and reduce training time.

Select K Best

Select the top K features based on statistical tests.

Terminal
mlcli preprocess data.csv --selector kbest --k 10
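
The idea behind top-K selection is to score every feature against the target and keep the K highest scores. The Python sketch below uses the absolute Pearson correlation as a stand-in score. The source does not say which statistical test mlcli applies, so the scoring function and the names `abs_correlation` and `select_k_best` are assumptions.

```python
from math import sqrt
from statistics import fmean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between two equal-length columns."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no signal
    return abs(cov / sqrt(vx * vy))


def select_k_best(columns, target, k):
    """Return the names of the k features with the highest score."""
    ranked = sorted(
        columns,
        key=lambda name: abs_correlation(columns[name], target),
        reverse=True,
    )
    return ranked[:k]


features = {
    "signal": [1.0, 2.0, 3.0, 4.0],
    "noise": [2.0, 1.0, 2.0, 1.0],
    "inverse": [8.0, 6.0, 4.0, 1.0],
}
target = [10.0, 20.0, 30.0, 40.0]
best = select_k_best(features, target, k=2)
```

Here `signal` and `inverse` are kept because both track the target closely (one positively, one negatively), while `noise` is dropped.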

Variance Threshold

Remove features whose variance falls below a threshold. Such features are nearly constant and carry little information.

Terminal
mlcli preprocess data.csv --selector variance --threshold 0.01
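
As a rough illustration, the Python sketch below drops columns whose variance is at or below a threshold. It is not mlcli's implementation; `variance_filter` is a hypothetical name, and using the population variance with a strict comparison is an assumption.

```python
from statistics import pvariance


def variance_filter(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }


features = {
    "constant": [1.0, 1.0, 1.0, 1.0],
    "almost_constant": [0.0, 0.0, 0.0, 0.1],
    "varied": [1.0, 5.0, 9.0, 13.0],
}
kept = variance_filter(features, threshold=0.01)
```

Only `varied` survives here. Variance depends on a feature's units, so a threshold such as 0.01 means different things before and after scaling.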

YAML Configuration

Configure preprocessing in your experiment configuration file:

YAML
# config.yaml
preprocessing:
  scaler: standard
  encoder: onehot
  imputer:
    strategy: mean
    columns: [age, income]
  feature_selection:
    method: kbest
    k: 20
  categorical_columns:
    - category
    - status
    - type
  drop_columns:
    - id
    - timestamp

Pipeline Example

Run preprocessing as part of a training pipeline:

Terminal
# Full training pipeline with preprocessing
mlcli train data/raw.csv \
  --model random_forest \
  --preprocess \
  --scaler standard \
  --encoder onehot \
  --imputer median \
  --output models/rf_model.pkl