Feature Engineering
data: https://www.kaggle.com/c/forest-cover-type-prediction/data
notebook: https://www.kaggle.com/sharmasanthosh/exploratory-study-on-feature-selection
Data statistics
Shape
Datatypes
Description
Use print(dataset.describe()) to inspect the data, focusing on:
- count: whether any values are missing and need to be imputed.
- min: whether negative values exist, and how categorical features are encoded; one-hot columns can be converted back to their original encoding for statistics.
- std: whether constant columns exist; if so, they can be dropped.
- mean: whether the numeric scales of the features are comparable; if not, the data should be normalized.
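The checks above can be scripted directly off the `describe()` table. A minimal sketch, using a small toy DataFrame with hypothetical values in place of the Kaggle training CSV:

```python
import pandas as pd

# Toy stand-in for the Kaggle training data (hypothetical values).
dataset = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785],
    "Soil_Type7": [0, 0, 0, 0],          # constant column: std == 0
    "Hillshade_9am": [221, 220, 234, 238],
})

desc = dataset.describe()

# count: a column with fewer entries than the frame has missing values.
missing = desc.loc["count"] < len(dataset)

# std: constant columns carry no information and can be dropped.
constant = desc.loc["std"] == 0

print(list(dataset.columns[constant]))  # → ['Soil_Type7']
```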
Skew
print(dataset.skew())
Skewness of the data: measures whether each feature's distribution has a longer left or right tail.
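A quick sketch of flagging right-skewed features from `dataset.skew()` (toy values, hypothetical column data; a common rule of thumb treats |skew| > 1 as strongly skewed):

```python
import pandas as pd

# Toy numeric frame (hypothetical values).
dataset = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [0, 0, 30, 60, 900],  # long right tail
    "Elevation": [2596, 2600, 2604, 2608, 2612],              # symmetric
})

skew = dataset.skew()
# Positive skew = long right tail; negative = long left tail.
right_skewed = skew[skew > 1].index.tolist()
print(right_skewed)  # → ['Horizontal_Distance_To_Hydrology']
```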
Class distribution
Compute the class distribution:
dataset.groupby('Cover_Type').size()
Data Interaction
Correlation
Correlation coefficients require continuous data, so one-hot encoded categorical columns cannot be used directly.
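One way to honor this is to drop the indicator columns before calling `corr()`. A sketch with toy values, assuming the Kaggle naming convention (`Soil_Type*`, `Wilderness_Area*`) for the one-hot columns:

```python
import pandas as pd

# Toy frame (hypothetical values); Soil_Type*/Wilderness_Area* are 0/1
# indicator columns, so Pearson correlation on them is not meaningful.
dataset = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785, 2595],
    "Slope": [3, 2, 9, 18, 2],
    "Soil_Type1": [1, 0, 0, 1, 0],
    "Wilderness_Area1": [0, 1, 1, 0, 1],
})

# Keep only the continuous columns before computing correlations.
onehot = [c for c in dataset.columns
          if c.startswith(("Soil_Type", "Wilderness_Area"))]
corr = dataset.drop(columns=onehot).corr()
print(corr.loc["Elevation", "Slope"])
```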
Scatter plot
Data Visualization
- Box and density plots
- Grouping of one hot encoded attributes
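Grouping the one-hot attributes back into a single categorical column makes the box/density plots and group counts possible. A minimal sketch with a toy one-hot block (hypothetical values), using `idxmax` to recover the active category per row:

```python
import pandas as pd

# Toy one-hot block (hypothetical values).
dataset = pd.DataFrame({
    "Soil_Type1": [1, 0, 0],
    "Soil_Type2": [0, 1, 0],
    "Soil_Type3": [0, 0, 1],
})

soil_cols = [c for c in dataset.columns if c.startswith("Soil_Type")]
# idxmax over the indicator columns returns the column name of the 1 per row,
# i.e. the original category.
dataset["Soil_Type"] = dataset[soil_cols].idxmax(axis=1)
print(dataset["Soil_Type"].tolist())  # → ['Soil_Type1', 'Soil_Type2', 'Soil_Type3']
```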
Data Cleaning
- Remove unnecessary columns
Data Preparation
- Original
- Delete rows or impute values in case of missing
- StandardScaler
- MinMaxScaler
- Normalizer
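The three scalers above differ in what they guarantee: StandardScaler and MinMaxScaler work per column, Normalizer per row. A sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# Toy feature matrix (hypothetical values).
X = np.array([[2596.0, 51.0], [2590.0, 56.0], [2804.0, 139.0]])

# StandardScaler: each COLUMN gets zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
# MinMaxScaler: each COLUMN is rescaled into [0, 1].
X_mm = MinMaxScaler().fit_transform(X)
# Normalizer: each ROW is rescaled to unit L2 norm (a different axis!).
X_norm = Normalizer().fit_transform(X)
```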
Feature selection
- ExtraTreesClassifier
- GradientBoostingClassifier
- RandomForestClassifier
- XGBClassifier
- RFE
- SelectPercentile
- PCA
- PCA + SelectPercentile
- Feature Engineering
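Two of the selection strategies above can be sketched together: a tree ensemble (here ExtraTreesClassifier) ranks features by importance, and RFE recursively eliminates the weakest ones. Synthetic data stands in for the Kaggle features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the Kaggle feature matrix (hypothetical data).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Tree ensembles expose per-feature importances after fitting...
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(model.feature_importances_)[::-1]  # best feature first

# ...and RFE uses those importances to recursively drop the weakest features.
rfe = RFE(model, n_features_to_select=4).fit(X, y)
print(rfe.support_.sum())  # → 4
```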
Evaluation, prediction, and analysis
- LDA (Linear algo)
- LR (Linear algo)
- KNN (Non-linear algo)
- CART (Non-linear algo)
- Naive Bayes (Non-linear algo)
- SVC (Non-linear algo)
- Bagged Decision Trees (Bagging)
- Random Forest (Bagging)
- Extra Trees (Bagging)
- AdaBoost (Boosting)
- Stochastic Gradient Boosting (Boosting)
- Voting Classifier (Voting)
- MLP (Deep Learning)
- XGBoost
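A sketch of the evaluation step for a few of the algorithms above, comparing cross-validated accuracy on synthetic stand-in data (the full study would run every model listed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data (hypothetical).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),       # linear
    "KNN": KNeighborsClassifier(),                 # non-linear
    "CART": DecisionTreeClassifier(random_state=0),  # non-linear
}

# 5-fold cross-validated mean accuracy per model.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```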