Data Preprocessing Module
Comprehensive data preprocessing with automated analysis, intelligent recommendations, and AI-powered optimization.
Overview
The Preprocessing module provides automated data analysis, intelligent preprocessing, and AI-powered recommendations for optimal data preparation. It handles missing values, categorical encoding, feature scaling, and feature engineering automatically.
DataAnalyzer
class DataAnalyzer()
Comprehensive dataset analysis and quality assessment.
Key Features
- Automatic data type detection
- Missing value analysis
- Outlier detection
- Data quality scoring
- Statistical summary generation
Methods
analyze_dataset(df: pd.DataFrame, target_column: str = None) -> Dict[str, Any]
Analyze dataset characteristics and quality.
Parameters:
df: Input DataFrame
target_column: Target column name (optional)
Returns: Dictionary containing analysis results
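The structure of the returned dictionary is not specified above. As a rough illustration of the kind of checks such an analysis involves, the sketch below computes per-column missing-value fractions and IQR-based outlier counts with plain pandas. The function name `summarize_columns`, the output keys, and the 1.5 x IQR rule are assumptions made for this example, not the module's actual implementation.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> dict:
    """Hypothetical helper: missing-value fractions and IQR outlier counts."""
    # Fraction of missing values per column (all dtypes)
    missing = df.isna().mean().to_dict()

    # Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for numeric columns only
    outliers = {}
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers[col] = int(((values < lower) | (values > upper)).sum())

    return {"missing_fraction": missing, "outlier_count": outliers}


sample = pd.DataFrame({
    "age": [25, 30, 28, 27, 120, None],
    "city": ["A", "B", None, "A", "B", "A"],
})
summary = summarize_columns(sample)
print(summary)
```

In this toy sample, the value 120 in `age` is the only point outside the IQR fences, and each column has one missing entry out of six.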
get_data_quality_score(df: pd.DataFrame) -> float
Calculate data quality score (0-100).
Parameters:
df: Input DataFrame
Returns: Quality score
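How the score is computed is not documented here. One plausible way to build a 0-100 score, shown purely as an assumption, is to average the share of non-missing cells and the share of non-duplicate rows. The function `simple_quality_score` and its equal weighting are hypothetical and may differ from what `get_data_quality_score` does.

```python
import pandas as pd


def simple_quality_score(df: pd.DataFrame) -> float:
    """Hypothetical 0-100 score: mean of completeness and row uniqueness."""
    if df.empty:
        return 0.0
    # Share of cells that are not missing
    completeness = 1.0 - df.isna().to_numpy().mean()
    # Share of rows that are not exact duplicates of an earlier row
    uniqueness = 1.0 - df.duplicated().mean()
    return round(100.0 * (completeness + uniqueness) / 2.0, 2)


sample = pd.DataFrame({
    "a": [1, 2, 2, None],
    "b": ["x", "y", "y", "z"],
})
score = simple_quality_score(sample)
print(score)
```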
AutoPreprocessor
class AutoPreprocessor(target_column: str, config: Dict[str, Any] = None)
Automated data preprocessing with AI recommendations.
Key Features
- Automated missing value handling
- Intelligent categorical encoding
- Automatic feature scaling
- Feature selection
- Outlier detection and handling
Methods
fit_transform(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]
Fit preprocessor and transform data.
Parameters:
df: Input DataFrame
Returns: Tuple of (X_processed, y_processed)
transform(df: pd.DataFrame) -> pd.DataFrame
Transform data using fitted preprocessor.
Parameters:
df: Input DataFrame
Returns: Transformed DataFrame
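The split between `fit_transform` and `transform` matters because statistics such as imputation values and scaling parameters should be learned from the training data once and then reused unchanged on new data. The sketch below illustrates that pattern with a hypothetical `MiniPreprocessor` class (median imputation plus standardization of numeric columns only). It is a simplified stand-in, not the `AutoPreprocessor` implementation.

```python
from typing import Optional, Tuple

import pandas as pd


class MiniPreprocessor:
    """Hypothetical illustration of the fit/transform pattern."""

    def __init__(self, target_column: str):
        self.target_column = target_column
        self.medians_: Optional[pd.Series] = None
        self.means_: Optional[pd.Series] = None
        self.stds_: Optional[pd.Series] = None

    def fit_transform(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
        X = df.drop(columns=[self.target_column])
        y = df[self.target_column]
        numeric = X.select_dtypes(include="number")
        # Learn imputation and scaling statistics from the training data only
        self.medians_ = numeric.median()
        filled = numeric.fillna(self.medians_)
        self.means_ = filled.mean()
        self.stds_ = filled.std(ddof=0).replace(0, 1.0)  # avoid division by zero
        return self.transform(X), y

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        if self.medians_ is None:
            raise RuntimeError("Call fit_transform before transform.")
        X = df.drop(columns=[self.target_column], errors="ignore")
        # Reuse the stored statistics; nothing is re-estimated here
        numeric = X[self.medians_.index].fillna(self.medians_)
        return (numeric - self.means_) / self.stds_


train = pd.DataFrame({"x": [1.0, 2.0, None, 3.0], "target": [0, 1, 0, 1]})
pre = MiniPreprocessor(target_column="target")
X_train, y_train = pre.fit_transform(train)

new = pd.DataFrame({"x": [None, 4.0]})
X_new = pre.transform(new)
print(X_new)
```

The missing value in `new` is filled with the training median and scaled with the training mean and standard deviation, so it maps to 0 regardless of what else is in the new batch.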
PreprocessingRecommender
class PreprocessingRecommender()
AI-powered preprocessing recommendations.
Key Features
- Data-driven preprocessing recommendations
- Imputation strategy suggestions
- Encoding method recommendations
- Feature engineering suggestions
Methods
get_recommendations(df: pd.DataFrame, target_column: str) -> List[Recommendation]
Get preprocessing recommendations.
Parameters:
df: Input DataFrame
target_column: Target column name
Returns: List of recommendations
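The fields of `Recommendation` are not documented here. To make the idea of data-driven recommendations concrete, the sketch below uses a hypothetical `SimpleRecommendation` dataclass and a few hand-written rules (median or most-frequent imputation for missing values, one-hot or target encoding depending on cardinality). The rules and the cardinality threshold of 10 are assumptions for illustration, not the logic of `PreprocessingRecommender`.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class SimpleRecommendation:
    """Hypothetical stand-in for the module's Recommendation type."""
    column: str
    action: str
    reason: str


def recommend(df: pd.DataFrame, target_column: str) -> List[SimpleRecommendation]:
    """Apply simple rule-based checks to every non-target column."""
    recs: List[SimpleRecommendation] = []
    for col in df.columns:
        if col == target_column:
            continue
        series = df[col]
        missing = series.isna().mean()
        is_numeric = pd.api.types.is_numeric_dtype(series)
        if missing > 0:
            strategy = "median imputation" if is_numeric else "most-frequent imputation"
            recs.append(SimpleRecommendation(col, strategy, f"{missing:.0%} of values are missing"))
        if not is_numeric:
            n_unique = series.nunique()
            # Assumed threshold: low-cardinality columns get one-hot encoding
            encoding = "one-hot encoding" if n_unique <= 10 else "target encoding"
            recs.append(SimpleRecommendation(col, encoding, f"{n_unique} distinct categories"))
    return recs


sample = pd.DataFrame({
    "income": [50000, None, 62000, 58000],
    "city": ["A", "B", "A", None],
    "target": [0, 1, 0, 1],
})
recs = recommend(sample, target_column="target")
for rec in recs:
    print(f"{rec.column}: {rec.action} ({rec.reason})")
```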
Examples
Basic Preprocessing
```python
import pandas as pd

from ai_ml_framework.preprocessing import AutoPreprocessor

# Load your data
df = pd.read_csv('your_data.csv')

# Create preprocessor
preprocessor = AutoPreprocessor(target_column='target')

# Fit and transform data
X_processed, y_processed = preprocessor.fit_transform(df)

# df still contains the target column, so subtract it from the feature count
print(f"Original features: {df.shape[1] - 1}")
print(f"Processed features: {X_processed.shape[1]}")
print(f"Preprocessing steps: {len(preprocessor.get_preprocessing_steps())}")
```
Advanced Configuration
```python
from ai_ml_framework.preprocessing import AutoPreprocessor

# Assumes `df` has been loaded as in the basic example

# Custom preprocessing configuration
config = {
    'numeric_features': {
        'imputation': 'knn',
        'scaling': 'robust',
        'outlier_detection': True
    },
    'categorical_features': {
        'encoding': 'target',
        'handle_unknown': 'ignore'
    },
    'feature_engineering': {
        'polynomial_features': True,
        'interaction_features': True
    }
}

preprocessor = AutoPreprocessor(
    target_column='target',
    config=config
)

X_processed, y_processed = preprocessor.fit_transform(df)
```
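The `'scaling': 'robust'` setting is not defined in this document. Assuming it follows the common meaning of robust scaling (centering on the median and dividing by the interquartile range, as scikit-learn's `RobustScaler` does by default), the effect can be sketched with a hypothetical `robust_scale` helper:

```python
import pandas as pd


def robust_scale(series: pd.Series) -> pd.Series:
    """Hypothetical helper: (x - median) / IQR, which is less sensitive to outliers."""
    median = series.median()
    iqr = series.quantile(0.75) - series.quantile(0.25)
    if iqr == 0:
        # Constant-like column: only center it
        return series - median
    return (series - median) / iqr


values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
scaled = robust_scale(values)
print(scaled.tolist())
```

Because the median and IQR ignore the extreme value 100, the four typical points stay within [-1, 0.5] while the outlier remains clearly separated instead of compressing the rest of the column.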
Data Analysis
```python
from ai_ml_framework.preprocessing import DataAnalyzer

# Assumes `df` has been loaded as in the basic example

# Analyze dataset
analyzer = DataAnalyzer()
analysis = analyzer.analyze_dataset(df, target_column='target')

print(f"Data quality score: {analysis['data_quality_score']:.2f}")
print(f"Missing values: {analysis['missing_values_summary']}")
print(f"Feature types: {analysis['feature_types']}")
print(f"Outliers detected: {analysis['outlier_summary']}")

# Get quality score
quality_score = analyzer.get_data_quality_score(df)
print(f"Overall data quality: {quality_score}/100")
```