dxzmpk

endless hard working

House_Prices - Data Analysis

data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

notebooks: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

  1. Understand the problem. We’ll look at each variable and do a philosophical analysis about their meaning and importance for this problem.
  2. Univariate study. We’ll focus on just the dependent variable (‘SalePrice’) and try to learn a little more about it.
  3. Multivariate study. We’ll try to understand how the dependent variable and independent variables relate.
  4. Basic cleaning. We’ll clean the dataset and handle the missing data, outliers and categorical variables.
  5. Test assumptions. We’ll check if our data meets the assumptions required by most multivariate techniques.

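Handling missing values

Missing values do not have to be handled only by filling in a default value or the median. Depending on whether a feature is important, it can also be deleted outright, but that decision needs to be supported by evidence.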

Let’s analyse this to understand how to handle the missing data.

We’ll consider that when more than 15% of the data is missing, we should delete the corresponding variable and pretend it never existed. This means that we will not try any trick to fill the missing data in these cases. According to this, there is a set of variables (e.g. ‘PoolQC’, ‘MiscFeature’, ‘Alley’, etc.) that we should delete. The point is: will we miss this data? I don’t think so. None of these variables seems to be very important, since most of them are not aspects we think about when buying a house (maybe that’s the reason why the data is missing?). Moreover, looking closer at the variables, we could say that variables like ‘PoolQC’, ‘MiscFeature’ and ‘FireplaceQu’ are strong candidates for outliers, so we’ll be happy to delete them.

As for the remaining cases, we can see that the ‘GarageX’ variables all have the same number of missing values. I bet the missing data refers to the same set of observations (although I will not check it; it’s just 5% and we should not spend $20 on $5 problems). Since the most important information regarding garages is expressed by ‘GarageCars’, and considering that we are just talking about 5% of missing data, I’ll delete the mentioned ‘GarageX’ variables. The same logic applies to the ‘BsmtX’ variables.

Regarding ‘MasVnrArea’ and ‘MasVnrType’, we can consider that these variables are not essential. Furthermore, they have a strong correlation with ‘YearBuilt’ and ‘OverallQual’ which are already considered. Thus, we will not lose information if we delete ‘MasVnrArea’ and ‘MasVnrType’.

Finally, we have one missing observation in ‘Electrical’. Since it is just one observation, we’ll delete this observation and keep the variable.

In summary, to handle missing data, we’ll delete all the variables with missing data, except the variable ‘Electrical’. In ‘Electrical’ we’ll just delete the observation with missing data.
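
The summary above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper names `missing_summary` and `drop_missing` are hypothetical, and the tiny `toy` frame is made-up data standing in for the Kaggle `train.csv`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and fraction of missing values per column, most-missing first."""
    total = df.isnull().sum()
    percent = total / len(df)
    summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return summary.sort_values("Total", ascending=False)


def drop_missing(df, keep="Electrical"):
    """Drop every column that has missing data, except `keep`.

    For the `keep` column, the rows with missing values are dropped instead.
    """
    summary = missing_summary(df)
    has_missing = summary.index[summary["Total"] > 0]
    cols_to_drop = [col for col in has_missing if col != keep]
    cleaned = df.drop(columns=cols_to_drop)
    if keep in cleaned.columns:
        cleaned = cleaned.dropna(subset=[keep])
    return cleaned


# Made-up stand-in for the real training data (assumption, not real values).
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 300000, 120000],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "GarageType": ["Attchd", np.nan, "Attchd", "Detchd"],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})

cleaned = drop_missing(toy)
```

On the real data, `missing_summary(df_train).head(20)` would be the table to inspect before deciding what to drop, and `cleaned.isnull().sum().max()` should come out as 0 afterwards.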

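Handling outliers

Draw scatter plots and look for outlying points. Delete the ones that are clearly anomalous, but keep points that may turn out to be legitimate special cases.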
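
A minimal sketch of that deletion step, under stated assumptions: the column pair (‘GrLivArea’ against ‘SalePrice’), the thresholds, and the helper name `drop_flagged_outliers` are all chosen for illustration, and the numbers in `toy` are made up. In practice the thresholds would be read off the scatter plot by eye.

```python
import pandas as pd


def drop_flagged_outliers(df, x, y, x_min, y_max):
    """Remove rows where `x` is unusually large while `y` stays low.

    Returns the filtered frame and the rows that were flagged, so the
    flagged points can be reviewed before they are discarded for good.
    """
    flagged = df[(df[x] > x_min) & (df[y] < y_max)]
    return df.drop(index=flagged.index), flagged


# Made-up data: rows 3 and 4 are very large but cheap (clear anomalies);
# row 5 is very large and very expensive, so it still follows the trend.
toy = pd.DataFrame({
    "GrLivArea": [1200, 1500, 2000, 4700, 5600, 4300],
    "SalePrice": [150000, 180000, 250000, 185000, 160000, 745000],
})

kept, flagged = drop_flagged_outliers(
    toy, "GrLivArea", "SalePrice", x_min=4000, y_max=300000
)
```

Row 5 is kept on purpose: it is extreme but consistent with the overall trend, which is the kind of special case worth retaining. With matplotlib installed, `toy.plot.scatter(x="GrLivArea", y="SalePrice")` would draw the plot used to pick the thresholds.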

Validating the data distribution

  • Normality - When we talk about normality what we mean is that the data should look like a normal distribution. This is important because several statistical tests rely on this (e.g. t-statistics). In this exercise we’ll just check univariate normality for ‘SalePrice’ (which is a limited approach). Remember that univariate normality doesn’t ensure multivariate normality (which is what we would like to have), but it helps. Another detail to take into account is that in big samples (>200 observations) normality is not such an issue. However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedasticity), so that’s the main reason why we are doing this analysis.

  • Homoscedasticity - I just hope I wrote it right. Homoscedasticity refers to the ‘assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)’ (Hair et al., 2013). Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.

  • Linearity - The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations. However, we’ll not get into this because most of the scatter plots we’ve seen appear to have linear relationships.

  • Absence of correlated errors - Correlated errors, as the name suggests, happen when one error is correlated with another. For instance, if a positive error is systematically followed by a negative error, it means that there’s a relationship between them. This occurs often in time series, where some patterns are time related. We’ll also not get into this. However, if you detect something, try to add a variable that can explain the effect you’re getting. That’s the most common solution for correlated errors.
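
For the univariate normality check on ‘SalePrice’, one rough numerical approach is to look at skewness and excess kurtosis, both of which are near 0 for normally distributed data. The sketch below makes several assumptions: the helper `normality_summary` is hypothetical, the prices are synthetic right-skewed values rather than the real column, and the log transform is shown only as one common remedy for positive skew, not as a step stated in the text above.

```python
import numpy as np
import pandas as pd


def normality_summary(values):
    """Sample skewness and excess kurtosis; both are near 0 for normal data."""
    s = pd.Series(values, dtype=float)
    return {"skewness": s.skew(), "kurtosis": s.kurt()}


# Synthetic, right-skewed "prices" (assumption: stands in for SalePrice).
rng = np.random.default_rng(0)
prices = np.exp(rng.normal(loc=12.0, scale=0.4, size=1000))

raw = normality_summary(prices)
logged = normality_summary(np.log(prices))

print("raw:   ", raw)
print("logged:", logged)
```

If the skewness of the raw values is clearly positive and drops toward 0 after the transform, the transformed variable is the better candidate for techniques that assume normality. A histogram or a normal probability plot would be the visual counterpart of this check.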

What do you think Elvis would say about this long explanation? ‘A little less conversation, a little more action please’? Probably… By the way, do you know what was Elvis’s last great hit?