Objective: Zillow is trying to predict the log-error between their estimate and the actual sales price using features of a home. My goal was to determine which features, given their correlation coefficients, should be included in a predictive model.
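For reference, here is a minimal sketch of the target quantity, assuming the competition's definition of log-error as log(estimate) minus log(sale price). The function name and the prices are made up for illustration.

```python
import numpy as np

def log_error(estimate, sale_price):
    """Log-error, assumed here to be log(estimate) - log(sale price)."""
    estimate = np.asarray(estimate, dtype=float)
    sale_price = np.asarray(sale_price, dtype=float)
    return np.log(estimate) - np.log(sale_price)

# Hypothetical prices, for illustration only.
errors = log_error([510_000, 300_000], [500_000, 320_000])
print(errors)  # positive when the estimate is too high, negative when too low
```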
Data: I was provided with a full list of 2016 real estate properties in three California counties (Los Angeles, Orange, and Ventura).
Analysis: The provided dataset has ~3M rows and 58 columns, but, yikes, ~49.2% of the cells are empty. Let's visualize the missing data to see which few columns might actually be helpful for making predictions.
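One way to get at these numbers is sketched below with pandas. This is an illustration rather than my exact code: the helper name `missing_summary` and the toy frame (with stand-in column names) are assumptions, and the real data would be loaded from the provided file instead.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return (overall fraction of empty cells, per-column fraction, sorted ascending)."""
    per_column = df.isna().mean().sort_values()
    overall = df.isna().to_numpy().mean()
    return overall, per_column

# Toy frame standing in for the real dataset (column names are illustrative).
toy = pd.DataFrame({
    "bedroomcnt": [3.0, 2.0, 4.0, 3.0],
    "poolcnt": [np.nan, np.nan, 1.0, np.nan],
    "yearbuilt": [1975.0, np.nan, 1990.0, 2001.0],
})

overall, per_column = missing_summary(toy)
print(f"{overall:.1%} of cells are empty")
print(per_column)
```

With matplotlib installed, `per_column.plot.barh()` would turn the per-column fractions into a bar chart.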
In particular, 23 columns have less than 10% missing values. For these columns, which have the float datatype, I'll replace the missing values with the column average. Furthermore, 53 (or 83%) of 60 columns are float datatypes, so I don’t really need to do any categorical encoding before feeding the data into my model. Now I can also plot the variable correlation coefficients against each other in the form of a heat map.
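The imputation and correlation steps can be sketched roughly as follows. The function name `impute_sparse_gaps`, the 10% threshold argument, and the toy columns `a`, `b`, `c` are assumptions for illustration. The correlation here is pandas' default (Pearson).

```python
import numpy as np
import pandas as pd

def impute_sparse_gaps(df, max_missing=0.10):
    """Keep float64 columns with fewer than max_missing empty cells, then mean-fill them."""
    floats = df.select_dtypes(include="float64")
    keep = floats.columns[floats.isna().mean() < max_missing]
    kept = floats[keep]
    # fillna with a Series fills each column with its own mean.
    return kept.fillna(kept.mean())

# Toy data: "a" has 1 of 12 values missing (kept), "c" is mostly empty (dropped).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.nan, 12.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0],
    "c": [np.nan] * 10 + [1.0, 2.0],
})

filled = impute_sparse_gaps(toy)
corr = filled.corr()  # pairwise Pearson correlation coefficients
print(corr)
```

If seaborn is available, `seaborn.heatmap(corr)` is one way to draw the matrix as a heat map. Note that mean-filling pulls the correlations down slightly, since the filled-in value ignores the other columns.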
Conclusion: The structuretaxvaluedollarcnt field seems highly correlated with most of the potentially useful fields. For predictive modeling, I would definitely include that one!
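Reading a heat map by eye is subjective, so one way to back up a conclusion like this is to rank each field by its average absolute correlation with all the others. The sketch below uses synthetic data with made-up column names (`hub`, `f1`, `f2`, `f3`), not the Zillow fields, and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd

def rank_by_mean_abs_corr(df):
    """Rank columns by their mean absolute correlation with every other column."""
    corr = df.corr().abs()
    # Blank out the diagonal so a column's correlation with itself (always 1) is ignored.
    corr = corr.mask(np.eye(len(corr), dtype=bool))
    return corr.mean().sort_values(ascending=False)

# Synthetic data: three noisy copies of one underlying "hub" column.
rng = np.random.default_rng(0)
hub = rng.normal(size=500)
toy = pd.DataFrame({
    "f1": hub + rng.normal(size=500),
    "hub": hub,
    "f2": hub + rng.normal(size=500),
    "f3": hub + rng.normal(size=500),
})

ranking = rank_by_mean_abs_corr(toy)
print(ranking)
```

A field that tops this ranking is correlated with many others, which also means it may be partly redundant with them, so it is worth checking its correlation with the target separately.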