Data Preprocessing
mlcli includes a comprehensive data preprocessing module to prepare your data for training. Preprocess your data with a single command or configure preprocessing in your YAML files.
Quick Start
Preprocess a CSV file by choosing a scaler, an encoder, and an imputer:
mlcli preprocess data/raw.csv \
  --output data/processed.csv \
  --scaler standard \
  --encoder onehot \
  --imputer mean
Available Preprocessors
Scalers
Normalize feature values to a common scale.
standard, minmax, robust, maxabs
Encoders
Convert categorical variables to numeric values.
label, onehot, ordinal, target
Imputers
Handle missing values in your data.
mean, median, most_frequent, constant
Feature Selectors
Select the most important features.
variance, kbest, mutual_info, rfe
Scalers
Scalers normalize feature values to a common scale, which is important for algorithms that are sensitive to feature magnitudes.
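To make the difference between the scalers concrete, here is a minimal Python sketch of the textbook formulas behind the standard, minmax, and robust options, applied to one column that contains an outlier. It illustrates the math only and is not mlcli's implementation; the function names and the sample `incomes` column are hypothetical.

```python
import statistics

def standard_scale(values):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero on constant columns
    return [(v - mean) / std for v in values]

def minmax_scale(values, lo=0.0, hi=1.0):
    """Map the smallest value to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    span = (v_max - v_min) or 1.0
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

def robust_scale(values):
    """Subtract the median, then divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0
    return [(v - median) / iqr for v in values]

# Hypothetical column: four ordinary values and one extreme outlier.
incomes = [30.0, 40.0, 50.0, 60.0, 1000.0]

for name, scale in [("standard", standard_scale),
                    ("minmax", minmax_scale),
                    ("robust", robust_scale)]:
    print(name, [round(v, 3) for v in scale(incomes)])
```

In this sketch the outlier squeezes the four ordinary values into a narrow band under min-max scaling, while the robust variant keeps them evenly spaced because the median and IQR ignore the extreme value.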
Standard Scaler
Standardizes features by removing the mean and scaling to unit variance. Best for normally distributed data.
mlcli preprocess data.csv --scaler standard
MinMax Scaler
Scales features to a given range (default 0 to 1). Good for neural networks and algorithms that expect bounded inputs.
mlcli preprocess data.csv --scaler minmax
Robust Scaler
Uses median and IQR for scaling, making it robust to outliers.
mlcli preprocess data.csv --scaler robust
Encoders
Encoders convert categorical variables to numeric values that ML algorithms can process.
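As a rough illustration of the two most common encodings, the sketch below applies label and one-hot encoding to a small column in plain Python. It is not mlcli's implementation: the helper names, the alphabetical category ordering, and the sample `status` values are assumptions made for the example.

```python
def label_encode(values):
    """Map each distinct category to an integer (alphabetical order in this sketch)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def onehot_encode(values):
    """Create one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories

# Hypothetical categorical column.
status = ["active", "trial", "churned", "active"]

codes, mapping = label_encode(status)
rows, categories = onehot_encode(status)

print(mapping)     # integer code assigned to each category
print(codes)       # one integer per input row
print(categories)  # column order of the one-hot matrix
print(rows)        # one 0/1 row per input row
```

Label encoding yields a single integer column, so a model may read the codes as ordered; one-hot encoding avoids that at the cost of one extra column per category.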
Label Encoder
Converts each unique category to an integer. Simple, but it introduces an artificial ordinal relationship between categories that some models will treat as meaningful.
mlcli preprocess data.csv --encoder label --categorical-cols category1,category2
One-Hot Encoder
Creates binary columns for each category. Best for nominal categorical variables.
mlcli preprocess data.csv --encoder onehot --categorical-cols category1,category2
Feature Selection
Select the most relevant features to improve model performance and reduce training time.
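The sketch below shows the general idea behind two selection strategies in plain Python: dropping near-constant columns by variance, and keeping the top K columns by a relevance score. The score used here is absolute Pearson correlation with the target, chosen as a simple stand-in; the statistical test mlcli actually applies is not specified in this document. All function names and sample data are hypothetical.

```python
import statistics

def variance_filter(columns, threshold):
    """Keep only columns whose population variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items()
            if statistics.pvariance(vals) > threshold}

def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return abs(cov / (sx * sy))

def select_k_best(columns, target, k):
    """Return the names of the k columns with the highest score against the target."""
    ranked = sorted(columns,
                    key=lambda name: abs_correlation(columns[name], target),
                    reverse=True)
    return ranked[:k]

# Hypothetical feature table: one informative, one constant, one noisy column.
features = {
    "age": [25.0, 32.0, 47.0, 51.0],
    "flag": [1.0, 1.0, 1.0, 1.0],
    "noise": [3.0, 1.0, 4.0, 1.0],
}
target = [10.0, 14.0, 22.0, 25.0]

kept = variance_filter(features, threshold=0.01)
print(sorted(kept))                      # columns that survive the variance filter
print(select_k_best(kept, target, k=1))  # single highest-scoring column
```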
Select K Best
Select the top K features based on statistical tests.
mlcli preprocess data.csv --selector kbest --k 10
Variance Threshold
Remove features whose variance falls below a threshold. Near-constant features carry little information for the model.
mlcli preprocess data.csv --selector variance --threshold 0.01
YAML Configuration
Configure preprocessing in your experiment configuration file:
# config.yaml
preprocessing:
  scaler: standard
  encoder: onehot
  imputer:
    strategy: mean
    columns: [age, income]
  feature_selection:
    method: kbest
    k: 20
  categorical_columns:
    - category
    - status
    - type
  drop_columns:
    - id
    - timestamp
Pipeline Example
Run preprocessing as part of a training pipeline:
# Full training pipeline with preprocessing
mlcli train data/raw.csv \
  --model random_forest \
  --preprocess \
  --scaler standard \
  --encoder onehot \
  --imputer median \
  --output models/rf_model.pkl
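For intuition about what the imputer and scaler flags do to a single numeric column, here is a minimal Python sketch that fills missing values with the median and then standardizes the result. The ordering (impute first, then scale) is an assumption made for this example, since the order mlcli applies is not stated here; the function names and the `ages` data are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def standard_scale(values):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero on constant columns
    return [(v - mean) / std for v in values]

# Hypothetical column with two missing entries.
ages = [22.0, None, 35.0, 41.0, None, 58.0]

imputed = impute_median(ages)      # missing entries become the median of the rest
scaled = standard_scale(imputed)   # zero mean, unit variance

print(imputed)
print([round(v, 3) for v in scaled])
```

Imputing first matters in this sketch because the mean and standard deviation cannot be computed while the column still contains missing entries.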