This report presents a descriptive analysis and regression model built as part of a housing market study. The main goal is to extract insights about the association between sale prices, home values, and other property features, and how each contributes to the overall value of a property.
Data Ingestion
The ingested data describes properties in California districts, with summary statistics drawn from the 1990 census. The data has already been curated and is available on Kaggle.
The dataset contains the following attributes:
longitude: Longitude of the house location.
latitude: Latitude of the house location.
housing_median_age: Median age of the houses.
total_rooms: Total number of rooms in a block.
total_bedrooms: Total number of bedrooms in a block.
population: Population of a block.
households: Number of households.
median_income: Median income of households in the block.
median_house_value: Median house value for households in the block (target variable).
ocean_proximity: Proximity of the block to the ocean.
Broad overview of the data frame using skim(). Notice the 207 missing values in the total_bedrooms variable.
skim(data, where(is.numeric))
Data summary

Name: data
Number of rows: 20640
Number of columns: 10
Column type frequency: numeric (9)
Group variables: None

Variable type: numeric

skim_variable        n_missing  complete_rate       mean         sd        p0       p25       p50       p75       p100  hist
longitude                    0           1.00    -119.57       2.00   -124.35   -121.80   -118.49   -118.01    -114.31  ▂▆▃▇▁
latitude                     0           1.00      35.63       2.14     32.54     33.93     34.26     37.71      41.95  ▇▁▅▂▁
housing_median_age           0           1.00      28.64      12.59      1.00     18.00     29.00     37.00      52.00  ▃▇▇▇▅
total_rooms                  0           1.00    2635.76    2181.62      2.00   1447.75   2127.00   3148.00   39320.00  ▇▁▁▁▁
total_bedrooms             207           0.99     537.87     421.39      1.00    296.00    435.00    647.00    6445.00  ▇▁▁▁▁
population                   0           1.00    1425.48    1132.46      3.00    787.00   1166.00   1725.00   35682.00  ▇▁▁▁▁
households                   0           1.00     499.54     382.33      1.00    280.00    409.00    605.00    6082.00  ▇▁▁▁▁
median_income                0           1.00       3.87       1.90      0.50      2.56      3.53      4.74      15.00  ▇▇▁▁▁
median_house_value           0           1.00  206855.82  115395.62  14999.00 119600.00 179700.00 264725.00  500001.00  ▅▇▅▂▂
Exploratory Data Analysis
Median House Values Distribution
Frequency distribution of the main dependent variable, median_house_value, as provided by the data steward's website. The number of bins is limited to 50 for easier interpretation.
# Visual
ggplot(data, aes(x = median_house_value)) +
  geom_histogram(bins = 50, fill = "darkblue", col = "grey") +
  labs(title = "Distribution of Median House Values", x = "Median House Value", y = "Frequency") +
  theme_minimal()
As illustrated, the distribution of median house values is right-skewed: most observations are concentrated below the mean, with a long upper tail. The spike at the maximum of 500,001 suggests the values were top-coded at that ceiling, so the sample contains clusters of highly priced properties rather than a uniform spread.
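The skew can be quantified with the moment coefficient of skewness, where a positive value indicates a long right tail. A minimal base-R sketch (the `skewness` helper is my own, not from the report):

```r
# Moment coefficient of skewness: E[(X - mu)^3] / sigma^3
# Positive => right-skewed, negative => left-skewed, ~0 => symmetric.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # population standard deviation
  mean((x - m)^3) / s^3
}

# A right-skewed toy vector yields a positive coefficient
skewness(c(rep(1, 20), rep(2, 10), 50))
```

In practice this would be applied as `skewness(data$median_house_value)`.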
Correlation Matrix
Notice the potential correlation between median_house_value and the location of the property, as well as with median_income and total_rooms.
numeric_cols <- data %>% select(-ocean_proximity)
corr_matrix <- cor(numeric_cols, use = "complete.obs")  # total_bedrooms still has NAs at this point
corrplot(corr_matrix, method = "circle", type = "upper", tl.cex = 0.9)
Handling of Missing Values and Outliers
As previously mentioned, the total_bedrooms variable has 207 missing values. To preserve the remaining observations, I replaced the missing entries with the median of the attribute. Additionally, I capped the numeric variables at +/- 3 standard deviations from the mean to obtain a more symmetrical data set. Lastly, given that I'm using randomForest, there was no need to normalize (scale/center) the data.
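The conditioning code itself is not shown in the report; a base-R sketch of the two steps on a toy data frame (the `cap_3sd` helper name is mine):

```r
# Toy stand-in for the report's data frame
df <- data.frame(
  total_bedrooms = c(300, NA, 500, 6000),
  population     = c(800, 1200, 1100, 35000)
)

# Step 1: replace missing total_bedrooms with the attribute's median
df$total_bedrooms[is.na(df$total_bedrooms)] <-
  median(df$total_bedrooms, na.rm = TRUE)

# Step 2: cap each numeric column at +/- 3 standard deviations from its mean
cap_3sd <- function(x) {
  lo <- mean(x) - 3 * sd(x)
  hi <- mean(x) + 3 * sd(x)
  pmin(pmax(x, lo), hi)
}
capped <- as.data.frame(lapply(df, cap_3sd))
```

Note that imputation must run first, since mean() and sd() would otherwise propagate the NAs into the cap bounds.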
longitude latitude housing_median_age total_rooms
Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.:1448
Median :-118.5 Median :34.26 Median :29.00 Median :2127
Mean :-119.6 Mean :35.63 Mean :28.64 Mean :2560
3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.:3148
Max. :-114.3 Max. :41.95 Max. :52.00 Max. :9181
total_bedrooms population households median_income
Min. : 1.0 Min. : 3 Min. : 1.0 Min. :0.4999
1st Qu.: 297.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.:2.5634
Median : 435.0 Median :1166 Median : 409.0 Median :3.5348
Mean : 523.4 Mean :1391 Mean : 487.7 Mean :3.8362
3rd Qu.: 643.2 3rd Qu.:1725 3rd Qu.: 605.0 3rd Qu.:4.7432
Max. :1795.0 Max. :4823 Max. :1646.5 Max. :9.5701
median_house_value ocean_proximity
Min. : 14999 Length:20640
1st Qu.:119600 Class :character
Median :179700 Mode :character
Mean :206856
3rd Qu.:264725
Max. :500001
Encoding the Categorical Values
My last conditioning step prior to fitting the model was one-hot encoding the categorical variable ocean_proximity, so the model can make full use of this attribute.
library(stats)
clean_data <- capped_data %>%
  dplyr::mutate(across(ocean_proximity, as.factor)) %>%
  model.matrix(~ ocean_proximity - 1, data = .) %>%
  as.data.frame() %>%
  bind_cols(capped_data)

# extra step to ensure clear naming convention
clean_data <- clean_data %>%
  dplyr::rename(
    "ocean_proximity.1H.OCEAN"   = "ocean_proximity<1H OCEAN",
    "ocean_proximity.INLAND"     = "ocean_proximityINLAND",
    "ocean_proximity.ISLAND"     = "ocean_proximityISLAND",
    "ocean_proximity.NEAR.BAY"   = "ocean_proximityNEAR BAY",
    "ocean_proximity.NEAR.OCEAN" = "ocean_proximityNEAR OCEAN"
  )

# Remove the original categorical column
clean_data <- clean_data %>% select(-ocean_proximity)
Model Building and Diagnostics
With a cleaner data set, I was ready to fit the model and study its quality. As recorded below, I used caret::createDataPartition to split the data into training and test subsets at an 80:20 ratio.
Train - Test Split
# data splitting
set.seed(1212)
trainIndex <- createDataPartition(clean_data$median_house_value, p = 0.8, list = FALSE)
trainData <- clean_data[trainIndex, ]
testData <- clean_data[-trainIndex, ]
Fit the Model
Fit Random Forest model to predict median house value.
library(randomForest)

# Train a Random Forest model
model <- randomForest(median_house_value ~ ., data = trainData,
                      na.action = na.omit, importance = TRUE, mtry = 4)

# Model summary
print(model)
Call:
randomForest(formula = median_house_value ~ ., data = trainData, importance = TRUE, mtry = 4, na.action = na.omit)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 2420723357
% Var explained: 81.88
Observe Model Characteristics
varImpPlot(model, main ="Variable Importance Plot", col ="darkblue")
Model Assessment
# model prediction
predicted_values <- predict(model, testData)
Actual vs. Predicted Viz
library(plotly)
library(hrbrthemes)  # provides theme_ipsum()

pp <- testData %>%
  ggplot(aes(median_house_value, predicted_values)) +
  geom_point(alpha = 0.4, col = "darkblue") +
  geom_abline(intercept = 0, slope = 1, color = "green", linetype = "dashed") +
  labs(title = "California House Value Predicted vs. Actual",
       subtitle = "Median Housing Price at 95% CI") +
  xlab("Actual Value") +
  ylab("Predicted Value") +
  theme_ipsum()
ggplotly(pp)
The Root Mean Square Error (RMSE) came out to roughly 47,500, which could be improved further through feature engineering and retraining the model. Keep in mind that, as the standard deviation of the residuals, the RMSE represents the typical distance between the data points and the model's predictions.
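The RMSE calculation itself is not shown above; it is the square root of the mean squared residual, which in practice would be `rmse(testData$median_house_value, predicted_values)`. A sketch with illustrative toy values:

```r
# RMSE: square root of the mean squared residual
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Illustrative toy values only
actual    <- c(150000, 200000, 350000)
predicted <- c(140000, 215000, 330000)
rmse(actual, predicted)
```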
Suggestions for Improving the Model
Feature Engineering: Consider creating new features like room-to-population ratio or income categories.
Hyperparameter Tuning: Optimize the model’s parameters using grid search or cross-validation.
Different Models: Experiment with other models like Gradient Boosting or XGBoost.
Ensemble Methods: Combine predictions from multiple models to improve accuracy.
Handling Outliers: Investigate and address potential outliers in the data.
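As an example of the first suggestion, ratio features can be derived from the counts already in the data set. A sketch on a toy data frame (the feature names are my own choices):

```r
# Toy block-level counts mirroring the dataset's columns
toy <- data.frame(
  total_rooms    = c(2000, 3500),
  total_bedrooms = c(400, 700),
  population     = c(1200, 2100),
  households     = c(450, 800)
)

# Candidate engineered ratios
toy$rooms_per_household      <- toy$total_rooms / toy$households
toy$bedrooms_per_room        <- toy$total_bedrooms / toy$total_rooms
toy$population_per_household <- toy$population / toy$households
```

Ratios like these normalize the raw block-level counts by block size, which often carries more signal than the counts themselves.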
Extra Visualization
The following map shows population by latitude/longitude along with the median price of a property.
# Map depiction
library(maps)
ca_df <- map_data("state") %>% filter(region == "california")
counties <- map_data("county")
ca_county <- subset(counties, region == "california")

ca_base <- ggplot(data = ca_df, mapping = aes(x = long, y = lat, group = group)) +
  coord_fixed(1.0) +
  geom_polygon(color = "black", fill = "gray")

ca_map <- ca_base +
  geom_polygon(data = ca_county, fill = NA, color = "gray") +
  geom_polygon(color = "darkblue", fill = NA)

bb <- ca_map +
  geom_jitter(data = clean_data,
              aes(x = longitude, y = latitude, size = population,
                  col = median_house_value, group = population),
              alpha = 0.4) +
  theme_minimal() +
  labs(title = "Population Distribution", subtitle = "Median Housing Price",
       col = "Median Price", size = "Population",
       x = "Longitude", y = "Latitude")
ggplotly(bb)