This report presents a descriptive analysis and regression model built as part of a housing market study. The main goal is to extract insights about the association between sale prices, home values, and other property features, and how each contributes to the overall value of a property.
Data Ingestion
The ingested data describes properties in California districts, with summary statistics drawn from the 1990 census. The data has already been curated and is available on Kaggle.
The dataset contains the following attributes:
longitude: Longitude of the house location.
latitude: Latitude of the house location.
housing_median_age: Median age of the houses.
total_rooms: Total number of rooms in a block.
total_bedrooms: Total number of bedrooms in a block.
population: Population of a block.
households: Number of households.
median_income: Median income of households in the block.
median_house_value: Median house value for households in the block (target variable).
ocean_proximity: Proximity of the block to the ocean.
Broad overview of the data frame using skim(). Notice the 207 missing values in the total_bedrooms variable.
skim(data, where(is.numeric))
Data summary

Name: data
Number of rows: 20640
Number of columns: 10
Column type frequency: numeric (9)
Group variables: None

Variable type: numeric

skim_variable        n_missing  complete_rate       mean         sd        p0       p25       p50       p75       p100  hist
longitude                    0           1.00    -119.57       2.00   -124.35   -121.80   -118.49   -118.01    -114.31  ▂▆▃▇▁
latitude                     0           1.00      35.63       2.14     32.54     33.93     34.26     37.71      41.95  ▇▁▅▂▁
housing_median_age           0           1.00      28.64      12.59      1.00     18.00     29.00     37.00      52.00  ▃▇▇▇▅
total_rooms                  0           1.00    2635.76    2181.62      2.00   1447.75   2127.00   3148.00   39320.00  ▇▁▁▁▁
total_bedrooms             207           0.99     537.87     421.39      1.00    296.00    435.00    647.00    6445.00  ▇▁▁▁▁
population                   0           1.00    1425.48    1132.46      3.00    787.00   1166.00   1725.00   35682.00  ▇▁▁▁▁
households                   0           1.00     499.54     382.33      1.00    280.00    409.00    605.00    6082.00  ▇▁▁▁▁
median_income                0           1.00       3.87       1.90      0.50      2.56      3.53      4.74      15.00  ▇▇▁▁▁
median_house_value           0           1.00  206855.82  115395.62  14999.00 119600.00 179700.00 264725.00  500001.00  ▅▇▅▂▂
Exploratory Data Analysis
Median House Values Distribution
Frequency distribution of the main dependent variable, median_house_value, as provided by the data steward's website. The number of bins is limited to 50 for easier interpretation.
# Visual
ggplot(data, aes(x = median_house_value)) +
  geom_histogram(bins = 50, fill = "darkblue", col = "grey") +
  labs(title = "Distribution of Median House Values", x = "Median House Value", y = "Frequency") +
  theme_minimal()
As illustrated, the distribution of median house values is right-skewed: most observations are concentrated below the mean, with a long upper tail. The spike at the maximum of 500,001 suggests the values were top-coded at that ceiling, so the sample contains clusters of highly priced properties rather than a uniform spread.
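The skew can be quantified with the moment coefficient of skewness, where a positive value indicates a long right tail. A minimal base-R sketch (the `skewness` helper is my own, not from the report):

```r
# Moment coefficient of skewness: E[(X - mu)^3] / sigma^3
# Positive => right-skewed, negative => left-skewed, ~0 => symmetric.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # population standard deviation
  mean((x - m)^3) / s^3
}

# A right-skewed toy vector yields a positive coefficient
skewness(c(rep(1, 20), rep(2, 10), 50))
```

In practice this would be applied as `skewness(data$median_house_value)`.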
Correlation Matrix
Notice the potential correlation between median_house_value and the location of the property, as well as with median_income and total_rooms.
numeric_cols <- data %>% select(-ocean_proximity)
corr_matrix <- cor(numeric_cols, use = "complete.obs")  # total_bedrooms still has NAs at this point
corrplot(corr_matrix, method = "circle", type = "upper", tl.cex = 0.9)
Handling of Missing Values and Outliers
As previously mentioned, the total_bedrooms variable has 207 missing values. To preserve the remaining observations, I replaced the missing entries with the median of the attribute. Additionally, I capped the numeric variables at +/- 3 standard deviations from the mean to obtain a more symmetrical data set. Lastly, given that I'm using randomForest, there was no need to normalize (scale/center) the data.
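The conditioning code itself is not shown in the report; a base-R sketch of the two steps on a toy data frame (the `cap_3sd` helper name is mine):

```r
# Toy stand-in for the report's data frame
df <- data.frame(
  total_bedrooms = c(300, NA, 500, 6000),
  population     = c(800, 1200, 1100, 35000)
)

# Step 1: replace missing total_bedrooms with the attribute's median
df$total_bedrooms[is.na(df$total_bedrooms)] <-
  median(df$total_bedrooms, na.rm = TRUE)

# Step 2: cap each numeric column at +/- 3 standard deviations from its mean
cap_3sd <- function(x) {
  lo <- mean(x) - 3 * sd(x)
  hi <- mean(x) + 3 * sd(x)
  pmin(pmax(x, lo), hi)
}
capped <- as.data.frame(lapply(df, cap_3sd))
```

Note that imputation must run first, since mean() and sd() would otherwise propagate the NAs into the cap bounds.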
longitude latitude housing_median_age total_rooms
Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.:1448
Median :-118.5 Median :34.26 Median :29.00 Median :2127
Mean :-119.6 Mean :35.63 Mean :28.64 Mean :2560
3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.:3148
Max. :-114.3 Max. :41.95 Max. :52.00 Max. :9181
total_bedrooms population households median_income
Min. : 1.0 Min. : 3 Min. : 1.0 Min. :0.4999
1st Qu.: 297.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.:2.5634
Median : 435.0 Median :1166 Median : 409.0 Median :3.5348
Mean : 523.4 Mean :1391 Mean : 487.7 Mean :3.8362
3rd Qu.: 643.2 3rd Qu.:1725 3rd Qu.: 605.0 3rd Qu.:4.7432
Max. :1795.0 Max. :4823 Max. :1646.5 Max. :9.5701
median_house_value ocean_proximity
Min. : 14999 Length:20640
1st Qu.:119600 Class :character
Median :179700 Mode :character
Mean :206856
3rd Qu.:264725
Max. :500001
Encoding the Categorical Values
My last conditioning step prior to fitting the model was one-hot encoding the categorical variable ocean_proximity, so the model can make full use of this attribute.
library(stats)
clean_data <- capped_data %>%
  dplyr::mutate(across(ocean_proximity, as.factor)) %>%
  model.matrix(~ ocean_proximity - 1, data = .) %>%
  as.data.frame() %>%
  bind_cols(capped_data)

# extra step to ensure clear naming convention
clean_data <- clean_data %>%
  dplyr::rename(
    "ocean_proximity.1H.OCEAN"   = "ocean_proximity<1H OCEAN",
    "ocean_proximity.INLAND"     = "ocean_proximityINLAND",
    "ocean_proximity.ISLAND"     = "ocean_proximityISLAND",
    "ocean_proximity.NEAR.BAY"   = "ocean_proximityNEAR BAY",
    "ocean_proximity.NEAR.OCEAN" = "ocean_proximityNEAR OCEAN"
  )

# Remove the original categorical column
clean_data <- clean_data %>% select(-ocean_proximity)
Model Building and Diagnostics
With a cleaner data set, I was ready to fit the model and study its quality. As recorded below, I used caret::createDataPartition to split the data into training and test subsets at an 80:20 ratio.
Train - Test Split
# data splitting
set.seed(1212)
trainIndex <- createDataPartition(clean_data$median_house_value, p = 0.8, list = FALSE)
trainData <- clean_data[trainIndex, ]
testData <- clean_data[-trainIndex, ]
Fit the Model
Fit Random Forest model to predict median house value.
library(randomForest)

# Train a Random Forest model
model <- randomForest(median_house_value ~ ., data = trainData,
                      na.action = na.omit, importance = TRUE, mtry = 4)

# Model summary
print(model)
Call:
randomForest(formula = median_house_value ~ ., data = trainData, importance = TRUE, mtry = 4, na.action = na.omit)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 2420723357
% Var explained: 81.88
Observe Model Characteristics
varImpPlot(model, main ="Variable Importance Plot", col ="darkblue")
Model Assessment
# model prediction
predicted_values <- predict(model, testData)
Actual vs. Predicted Viz
library(plotly)
library(hrbrthemes)  # provides theme_ipsum()

pp <- testData %>%
  ggplot(aes(median_house_value, predicted_values)) +
  geom_point(alpha = 0.4, col = "darkblue") +
  geom_abline(intercept = 0, slope = 1, color = "green", linetype = "dashed") +
  labs(title = "California House Value Predicted vs. Actual",
       subtitle = "Median Housing Price at 95% CI") +
  xlab("Actual Value") +
  ylab("Predicted Value") +
  theme_ipsum()
ggplotly(pp)
The Root Mean Square Error (RMSE) came out to roughly 47,500, which could be improved further through feature engineering and retraining the model. Keep in mind that, as the standard deviation of the residuals, the RMSE represents the typical distance between the data points and the model's predictions.
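The RMSE calculation itself is not shown above; it is the square root of the mean squared residual, which in practice would be `rmse(testData$median_house_value, predicted_values)`. A sketch with illustrative toy values:

```r
# RMSE: square root of the mean squared residual
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Illustrative toy values only
actual    <- c(150000, 200000, 350000)
predicted <- c(140000, 215000, 330000)
rmse(actual, predicted)
```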
Suggestions for Improving the Model
Feature Engineering: Consider creating new features like room-to-population ratio or income categories.
Hyperparameter Tuning: Optimize the model’s parameters using grid search or cross-validation.
Different Models: Experiment with other models like Gradient Boosting or XGBoost.
Ensemble Methods: Combine predictions from multiple models to improve accuracy.
Handling Outliers: Investigate and address potential outliers in the data.
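As an example of the first suggestion, ratio features can be derived from the counts already in the data set. A sketch on a toy data frame (the feature names are my own choices):

```r
# Toy block-level counts mirroring the dataset's columns
toy <- data.frame(
  total_rooms    = c(2000, 3500),
  total_bedrooms = c(400, 700),
  population     = c(1200, 2100),
  households     = c(450, 800)
)

# Candidate engineered ratios
toy$rooms_per_household      <- toy$total_rooms / toy$households
toy$bedrooms_per_room        <- toy$total_bedrooms / toy$total_rooms
toy$population_per_household <- toy$population / toy$households
```

Ratios like these normalize the raw block-level counts by block size, which often carries more signal than the counts themselves.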
Extra Visualization
The following map shows population by latitude/longitude along with the median price of a property.
# Map depiction
library(maps)
ca_df <- map_data("state") %>% filter(region == "california")
counties <- map_data("county")
ca_county <- subset(counties, region == "california")

ca_base <- ggplot(data = ca_df, mapping = aes(x = long, y = lat, group = group)) +
  coord_fixed(1.0) +
  geom_polygon(color = "black", fill = "gray")

ca_map <- ca_base +
  geom_polygon(data = ca_county, fill = NA, color = "gray") +
  geom_polygon(color = "darkblue", fill = NA)

bb <- ca_map +
  geom_jitter(data = clean_data,
              aes(x = longitude, y = latitude, size = population,
                  col = median_house_value, group = population),
              alpha = 0.4) +
  theme_minimal() +
  labs(title = "Population Distribution", subtitle = "Median Housing Price",
       col = "Median Price", size = "Population",
       x = "Longitude", y = "Latitude")
ggplotly(bb)