Data Preprocessing Module

Comprehensive data preprocessing with automated analysis, intelligent recommendations, and AI-powered optimization.

Overview

The Preprocessing module combines automated dataset analysis (DataAnalyzer), automated preprocessing (AutoPreprocessor), and AI-powered recommendations (PreprocessingRecommender) for data preparation. It handles missing values, categorical encoding, feature scaling, and feature engineering automatically.

Basic Usage

```python
from ai_ml_framework.preprocessing import AutoPreprocessor

# df is a pandas DataFrame that includes the 'target' column
preprocessor = AutoPreprocessor(target_column='target')
X_processed, y_processed = preprocessor.fit_transform(df)
```

Advanced Configuration

```python
preprocessor = AutoPreprocessor(
    target_column='target',
    config={
        'numeric_features': {'scaling': 'robust'},
        'categorical_features': {'encoding': 'target'}
    }
)
```
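To make the two options above more concrete, here is a minimal pandas sketch of what "robust" scaling and "target" encoding conventionally mean. It is an illustration under assumptions, not the framework's implementation: the column names and the toy data are made up, and AutoPreprocessor may compute these differently.

```python
import pandas as pd

# Toy data; column names are hypothetical
df = pd.DataFrame({
    "income": [30.0, 40.0, 50.0, 60.0, 1000.0],
    "city": ["A", "A", "B", "B", "B"],
    "target": [1, 0, 1, 1, 0],
})

# Robust scaling: centre on the median and divide by the interquartile range,
# so a single extreme value (1000.0) does not dominate the scale.
median = df["income"].median()
iqr = df["income"].quantile(0.75) - df["income"].quantile(0.25)
df["income_scaled"] = (df["income"] - median) / iqr

# Target encoding: replace each category with the mean target value
# observed for that category.
city_means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(city_means)

print(df[["income_scaled", "city_encoded"]])
```

In practice the category means should be learned on training data only and then reused on new data, otherwise the encoding leaks target information.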

DataAnalyzer

class DataAnalyzer()

Comprehensive dataset analysis and quality assessment.

Key Features

  • Automatic data type detection
  • Missing value analysis
  • Outlier detection
  • Data quality scoring
  • Statistical summary generation
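
As a rough idea of what missing value analysis and outlier detection can involve, the sketch below computes per-column missing ratios and counts outliers with the common 1.5 × IQR (Tukey) rule, using plain pandas. The rule and the toy data are assumptions chosen for illustration; the documentation does not state which method DataAnalyzer uses.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical
df = pd.DataFrame({
    "age": [22.0, 25.0, 27.0, 30.0, np.nan, 95.0],
    "city": ["A", None, "B", "B", "A", "A"],
})

# Missing value analysis: fraction of missing entries per column
missing_ratio = df.isna().mean()

# Outlier detection on numeric columns with the 1.5 * IQR rule
numeric = df.select_dtypes(include="number")
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(missing_ratio)
print(outlier_counts)
```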

Methods

analyze_dataset(df: pd.DataFrame, target_column: str = None) -> Dict[str, Any]

Analyze dataset characteristics and quality.

Parameters:
  • df: Input DataFrame
  • target_column: Target column name (optional)
Returns: Dictionary containing analysis results

get_data_quality_score(df: pd.DataFrame) -> float

Calculate data quality score (0-100).

Parameters:
  • df: Input DataFrame
Returns: Quality score
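
The formula behind the quality score is not documented here. Purely as an illustration of how a 0-100 score could be built, the sketch below multiplies cell completeness by the share of non-duplicate rows. The function name `sketch_quality_score` and the formula itself are hypothetical and should not be read as what get_data_quality_score does.

```python
import numpy as np
import pandas as pd


def sketch_quality_score(df: pd.DataFrame) -> float:
    """Hypothetical score: 100 * completeness * share of non-duplicate rows."""
    if df.empty:
        return 0.0
    # Share of cells that are not missing
    completeness = 1.0 - df.isna().to_numpy().mean()
    # Share of rows that are not exact duplicates of an earlier row
    uniqueness = 1.0 - df.duplicated().mean()
    return round(100.0 * completeness * uniqueness, 2)


df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": ["x", "y", "y", "z"],
})
score = sketch_quality_score(df)
print(f"Sketch quality score: {score}/100")
```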

AutoPreprocessor

class AutoPreprocessor(target_column: str, config: Dict[str, Any] = None)

Automated data preprocessing with AI recommendations.

Key Features

  • Automated missing value handling
  • Intelligent categorical encoding
  • Automatic feature scaling
  • Feature selection
  • Outlier detection and handling
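
For readers who want a mental model of these steps, the following sketch wires up a comparable pipeline by hand with scikit-learn: median imputation and robust scaling for numeric columns, one-hot encoding for categorical columns. This is an analogy built on scikit-learn's public API, not a description of AutoPreprocessor's internals, and the toy data and strategy choices are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data; column names are hypothetical
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 58.0],
    "income": [40_000.0, 52_000.0, 61_000.0, np.nan, 250_000.0],
    "city": ["Oslo", "Lima", "Oslo", "Pune", "Lima"],
    "target": [0, 1, 0, 1, 1],
})

# Separate features from the target
X = df.drop(columns="target")
y = df["target"]

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Numeric columns: fill gaps with the median, then scale by median and IQR
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])

transformer = ColumnTransformer(
    [
        ("num", numeric_pipeline, numeric_cols),
        # Categories unseen during fit are encoded as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_processed = transformer.fit_transform(X)
print(X_processed.shape)
```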

Methods

fit_transform(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]

Fit preprocessor and transform data.

Parameters:
  • df: Input DataFrame
Returns: Tuple of (X_processed, y_processed)

transform(df: pd.DataFrame) -> pd.DataFrame

Transform data using fitted preprocessor.

Parameters:
  • df: Input DataFrame
Returns: Transformed DataFrame
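
The split between fit_transform and transform follows the usual pattern: statistics are learned once on training data and then reused unchanged on new data. The sketch below shows that pattern with a deliberately tiny stand-in class; `MeanStdScaler` is hypothetical and exists only for this illustration.

```python
import pandas as pd


class MeanStdScaler:
    """Hypothetical stand-in used only to illustrate the fit/transform split."""

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Learn statistics from the data passed in here (the training set)
        self.mean_ = df.mean()
        self.std_ = df.std(ddof=0)
        return self.transform(df)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Reuse the stored training statistics; nothing is re-estimated
        return (df - self.mean_) / self.std_


train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [4.0]})

scaler = MeanStdScaler()
train_scaled = scaler.fit_transform(train)  # fit on training data only
test_scaled = scaler.transform(test)        # apply the same statistics

print(test_scaled)
```

Calling fit_transform on the test set instead would re-estimate the statistics from test data, so training and test features would no longer be on the same scale.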

PreprocessingRecommender

class PreprocessingRecommender()

AI-powered preprocessing recommendations.

Key Features

  • Data-driven preprocessing recommendations
  • Imputation strategy suggestions
  • Encoding method recommendations
  • Feature engineering suggestions

Methods

get_recommendations(df: pd.DataFrame, target_column: str) -> List[Recommendation]

Get preprocessing recommendations.

Parameters:
  • df: Input DataFrame
  • target_column: Target column name
Returns: List of recommendations
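
The fields of Recommendation are not documented in this section. To give a feel for what data-driven recommendations could look like, here is a small rule-based sketch in plain pandas. `SketchRecommendation`, `sketch_recommendations`, and the threshold of 10 unique values are all hypothetical and are not part of PreprocessingRecommender.

```python
from dataclasses import dataclass
from typing import List

import numpy as np
import pandas as pd


@dataclass
class SketchRecommendation:
    """Hypothetical container; the real Recommendation type may differ."""
    column: str
    action: str
    reason: str


def sketch_recommendations(df: pd.DataFrame, target_column: str) -> List[SketchRecommendation]:
    recs = []
    for col in df.columns.drop(target_column):
        series = df[col]
        is_numeric = pd.api.types.is_numeric_dtype(series)
        missing = series.isna().mean()
        if missing > 0:
            # Suggest an imputation strategy based on the column type
            action = "median imputation" if is_numeric else "most-frequent imputation"
            recs.append(SketchRecommendation(col, action, f"{missing:.0%} missing"))
        if not is_numeric:
            # Suggest an encoding based on cardinality (threshold is arbitrary)
            n_unique = series.nunique()
            action = "one-hot encoding" if n_unique <= 10 else "target encoding"
            recs.append(SketchRecommendation(col, action, f"{n_unique} unique values"))
    return recs


df = pd.DataFrame({
    "age": [25.0, np.nan, 41.0, 58.0],
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
    "target": [0, 1, 0, 1],
})
recs = sketch_recommendations(df, target_column="target")
for rec in recs:
    print(rec)
```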

Examples

Basic Preprocessing

```python
from ai_ml_framework.preprocessing import AutoPreprocessor
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')

# Create preprocessor
preprocessor = AutoPreprocessor(target_column='target')

# Fit and transform data
X_processed, y_processed = preprocessor.fit_transform(df)

# df still contains the target column, so subtract it from the feature count
print(f"Original features: {df.shape[1] - 1}")
print(f"Processed features: {X_processed.shape[1]}")
print(f"Preprocessing steps: {len(preprocessor.get_preprocessing_steps())}")
```

Advanced Configuration

```python
from ai_ml_framework.preprocessing import AutoPreprocessor

# Reuses `df` loaded in the Basic Preprocessing example

# Custom preprocessing configuration
config = {
    'numeric_features': {
        'imputation': 'knn',
        'scaling': 'robust',
        'outlier_detection': True
    },
    'categorical_features': {
        'encoding': 'target',
        'handle_unknown': 'ignore'
    },
    'feature_engineering': {
        'polynomial_features': True,
        'interaction_features': True
    }
}

preprocessor = AutoPreprocessor(
    target_column='target',
    config=config
)

X_processed, y_processed = preprocessor.fit_transform(df)
```

Data Analysis

```python
from ai_ml_framework.preprocessing import DataAnalyzer

# Reuses `df` loaded in the Basic Preprocessing example

# Analyze dataset
analyzer = DataAnalyzer()
analysis = analyzer.analyze_dataset(df, target_column='target')

print(f"Data quality score: {analysis['data_quality_score']:.2f}")
print(f"Missing values: {analysis['missing_values_summary']}")
print(f"Feature types: {analysis['feature_types']}")
print(f"Outliers detected: {analysis['outlier_summary']}")

# Get quality score
quality_score = analyzer.get_data_quality_score(df)
print(f"Overall data quality: {quality_score}/100")
```