dxzmpk

endless hard working


Prediction - Feature Selection

Feature Engineering

data: https://www.kaggle.com/c/forest-cover-type-prediction/data

notebook: https://www.kaggle.com/sharmasanthosh/exploratory-study-on-feature-selection

Data statistics

  • Shape

  • Datatypes

  • Description

    Run print(dataset.describe()) and inspect the summary statistics, mainly:

    count: whether any values are missing and need to be imputed;

    min: whether negative values are present;

    how the attributes are encoded: if they are one-hot encoded, they can be converted back to the original encoding for statistics;

    std: whether any column is constant; constant columns carry no information and can be dropped;

    mean: whether the value ranges are comparable across columns; if not, normalization is needed.

  • Skew

    print(dataset.skew())

    The skewness of the data: whether each distribution is left-skewed or right-skewed.


  • Class distribution

    Compute the class distribution:

    dataset.groupby('Cover_Type').size()
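The statistics above can be sketched end to end in pandas. This is a minimal example on a made-up frame (the column names follow the forest-cover data, but the values and the `Constant_Col` column are illustrative); the real notebook runs the same calls on train.csv:

```python
import pandas as pd

# Illustrative stand-in for the competition's train.csv; values are made up.
dataset = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785, 2595, 2579],
    "Horizontal_Distance_To_Hydrology": [258, 212, 268, 242, 153, 1200],
    "Constant_Col": [1, 1, 1, 1, 1, 1],   # std == 0 -> droppable
    "Cover_Type": [5, 5, 2, 2, 5, 2],
})

stats = dataset.describe()
print(stats)

# count below len(dataset) would mean missing values;
# std == 0 flags constant columns that carry no information:
constant_cols = stats.columns[stats.loc["std"] == 0]
print(list(constant_cols))  # ['Constant_Col']

# Skewness: positive -> right-skewed, negative -> left-skewed.
print(dataset.skew())

# Class distribution:
print(dataset.groupby("Cover_Type").size())
```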

Data Interaction

  • Correlation

    Correlation coefficients require continuous data, so the one-hot encoded categorical attributes cannot be used here.

  • Scatter plot
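A minimal sketch of the correlation step, restricting the computation to continuous columns as the note above requires (column names and values are illustrative):

```python
import pandas as pd

# Illustrative frame mixing continuous and one-hot encoded columns.
df = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785, 2595],
    "Slope": [3, 2, 9, 18, 2],
    "Soil_Type1": [0, 0, 1, 1, 0],   # one-hot: excluded from correlation
})

# Pearson correlation only makes sense for the continuous attributes:
continuous = ["Elevation", "Slope"]
corr = df[continuous].corr(method="pearson")
print(corr)
```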

Data Visualization

  • Box and density plots
  • Grouping of one hot encoded attributes

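Grouping one-hot encoded attributes back into a single categorical column can be done with `idxmax` over the encoded columns. A minimal sketch, assuming three hypothetical `Soil_TypeN` columns like those in the competition data:

```python
import pandas as pd

# Hypothetical one-hot encoded soil-type columns (the real data has 40).
df = pd.DataFrame({
    "Soil_Type1": [1, 0, 0],
    "Soil_Type2": [0, 1, 0],
    "Soil_Type3": [0, 0, 1],
})

soil_cols = ["Soil_Type1", "Soil_Type2", "Soil_Type3"]
# idxmax over the columns recovers the single active category per row:
df["Soil_Type"] = (
    df[soil_cols].idxmax(axis=1).str.replace("Soil_Type", "").astype(int)
)
print(df["Soil_Type"].tolist())  # [1, 2, 3]
```

The recovered column can then be plotted or tabulated directly, which is what makes the grouped view useful for visualization.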

Data Cleaning

  • Remove unnecessary columns

Data Preparation

  • Original
  • Delete rows or impute values in case of missing
  • StandardScaler
  • MinMaxScaler
  • Normalizer
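The three scalers above do different things, which a small example makes concrete: the first two operate per column, while Normalizer operates per row. A sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[2596.0, 3.0],
              [2590.0, 2.0],
              [2804.0, 9.0]])

# StandardScaler: zero mean, unit variance per column
X_std = StandardScaler().fit_transform(X)
# MinMaxScaler: rescales each column to [0, 1]
X_mm = MinMaxScaler().fit_transform(X)
# Normalizer: rescales each ROW to unit L2 norm (a different purpose)
X_norm = Normalizer().fit_transform(X)

print(X_std.mean(axis=0))                   # ~[0, 0]
print(X_mm.min(axis=0), X_mm.max(axis=0))   # [0, 0] [1, 1]
print(np.linalg.norm(X_norm, axis=1))       # [1, 1, 1]
```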

Feature selection

  • ExtraTreesClassifier
  • GradientBoostingClassifier
  • RandomForestClassifier
  • XGBClassifier
  • RFE
  • SelectPercentile
  • PCA
  • PCA + SelectPercentile
  • Feature Engineering
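Two of the selection approaches listed above can be sketched briefly: tree-based importance ranking (here via ExtraTreesClassifier) and univariate SelectPercentile. The data is a synthetic stand-in generated with make_classification, not the competition data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in for the forest-cover features.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Tree-based ranking: importances sum to 1, higher = more useful.
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.feature_importances_)

# Univariate ranking: keep the top 30% of features by ANOVA F-score.
selector = SelectPercentile(f_classif, percentile=30).fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape)  # (200, 3)
```

RFE, PCA, and the boosted-tree importances follow the same fit/transform pattern with different ranking criteria.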

Evaluation, prediction, and analysis

  • LDA (Linear algo)
  • LR (Linear algo)
  • KNN (Non-linear algo)
  • CART (Non-linear algo)
  • Naive Bayes (Non-linear algo)
  • SVC (Non-linear algo)
  • Bagged Decision Trees (Bagging)
  • Random Forest (Bagging)
  • Extra Trees (Bagging)
  • AdaBoost (Boosting)
  • Stochastic Gradient Boosting (Boosting)
  • Voting Classifier (Voting)
  • MLP (Deep Learning)
  • XGBoost
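The evaluation step can be sketched by comparing a few of the listed models under the same cross-validation split, then combining them with a voting classifier. Again a synthetic dataset stands in for the competition data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Evaluate each model with 5-fold cross-validated accuracy:
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")

# Hard voting combines the individual predictions by majority:
voting = VotingClassifier(estimators=list(models.items()), voting="hard")
print(f"Voting: {cross_val_score(voting, X, y, cv=5).mean():.3f}")
```

The same loop extends to the other classifiers in the list (KNN, CART, SVC, boosting variants, XGBoost) by adding them to the `models` dict.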