Kaggle Competition : House Prices — Advanced Regression Techniques

Sultan Ardiansyah
10 min read · Sep 26, 2021

Kaggle link : https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Competition description : Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Goal : It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.

Submission file format : The file should contain a header and have the following format:

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.

LET’S GET STARTED!

Before continuing, I assume that the reader already understands at least the basics of Python programming.

What needs to be prepared when following this practice?

  1. Datasets
  2. Google Colab
  3. Basics of the Python programming language for data science
  4. Basic knowledge of statistics and analytics
  5. A lot of coffee :)

Here’s a brief version of what you’ll find in the data description file; you can see the full version at the link above.

  • SalePrice — the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: $Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale

CODE TIME!!

A. Import Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from scipy.stats import norm

We will use the usual core libraries: pandas to handle the dataset, numpy for mathematical operations, seaborn and matplotlib for data visualization, and scipy for statistical functions.

B. Read Data

df = pd.read_csv('/content/drive/MyDrive/Dataset/Kaggle 1/train.csv')
df.head() # Show the top 5 rows

The resulting output is shown below. The DataFrame actually has 81 columns, but because of the limited display width only a few of them are visible here.

df.describe()

Displays basic descriptive statistics for the numeric columns.

df.info()

Displays the column names, data types, and non-null counts.

C. Learn About the Data

corr = df.corr()
plt.subplots(figsize = (20, 20))
sns.heatmap(corr, annot = True)

Displays data correlation between columns

The full heatmap shows so many correlations that it is hard to read, so we will take the features with the strongest correlations to SalePrice for further analysis.

corr_cols = corr['SalePrice'].sort_values(ascending = False).head(10).index
corr_SalePrice = df[corr_cols].corr()
plt.subplots(figsize = (20, 20))
sns.heatmap(corr_SalePrice, annot = True)

Next, we will check how linearly each of these features relates to SalePrice using scatter plots.

fig, axs = plt.subplots(2, 5, figsize = (20, 15))
x, y = 0, 0
for col in corr_cols:
    axs[x, y].scatter(x = df[col], y = df['SalePrice'])
    axs[x, y].set_xlabel(col)
    axs[x, y].set_ylabel('SalePrice')
    x += 1
    if x == 2:
        x = 0
        y += 1

This is really awesome, so let’s analyze it. I hope readers feel the adrenaline when they see graphs like this.

  1. It can be seen that OverallQual, GarageCars, FullBath, and TotRmsAbvGrd take discrete values rather than continuous ones
  2. GrLivArea, TotalBsmtSF, and 1stFlrSF show a roughly linear relationship with SalePrice. Next, we will examine GrLivArea and TotalBsmtSF

D. Missing Value Handling

As we know, missing data can strongly affect the modelling later on, so missing values must be handled first.

total = df.isnull().sum().sort_values(ascending = False)
pcg = (total / df.isnull().count()).sort_values(ascending = False)
miss_val = pd.concat([total, pcg], axis = 1, keys = ['Total', 'Percentage'])
miss_val.head(20)
  1. The PoolQC, MiscFeature, Alley, Fence, FireplaceQu and LotFrontage columns have a dominant percentage of missing values, likely because buyers rarely pay attention to these aspects, so these columns are less important. They will be deleted.
  2. The GarageCond, GarageType, GarageYrBlt, GarageFinish and GarageQual columns have the same percentage of missing values, so it can be assumed they are missing for the same rows (houses without a garage). They will also be deleted, since they are already represented by GarageArea and GarageCars, which correlate well with SalePrice.
  3. The MasVnrArea and MasVnrType columns can be deleted, because they have a positive correlation with the OverallQual and YearBuilt columns.
  4. The Electrical column can be kept, because it has only one missing observation; that single row can be deleted instead.

We will now process the data according to this analysis.

drop_col = miss_val[miss_val['Total'] > 1].index
df.drop(drop_col, axis = 1, inplace = True)
df.dropna(subset = ['Electrical'], how = 'any', axis = 0, inplace = True)

E. Outliers

Outliers can also affect the training process later. They arise from unusual, “naughty” observations in the data. There are two kinds of outlier analysis: univariate analysis and multivariate (bivariate) analysis.

  • Univariate analysis

Univariate analysis looks at a single column at a time. It is useful for analyzing the target, which here is SalePrice. We can use the Interquartile Range (IQR) rule to detect outliers in this univariate setting.

(Q1, Q3) = np.log(df['SalePrice']).quantile([.25, .75])
IQR = Q3 - Q1
outlier = df[(np.log(df['SalePrice']) < Q1 - (1.5 * IQR)) | (np.log(df['SalePrice']) > Q3 + (1.5 * IQR))]
df = df.drop(outlier.index).reset_index(drop = True)
  • Multivariate analysis

Multivariate (or bivariate) analysis looks at two or more columns together. It is useful for analyzing the feature data, here the features whose correlation with SalePrice we analyzed above. We use the Mahalanobis distance together with the chi-square distribution to detect outliers.

Create the function first

from scipy.stats import chi2

def mahalanobis(data):
    x_mu = data - np.mean(data)
    inv_cov = np.linalg.inv(np.cov(data.values.T))
    mah = np.dot(np.dot(x_mu, inv_cov), x_mu.T)
    return mah.diagonal()

For more details about Mahalanobis Distance, readers can read existing references.
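For reference, the quantity the function above returns (on its diagonal) is the squared Mahalanobis distance of each row, a standard definition stated here with μ as the vector of feature means and Σ as the feature covariance matrix:

D²(x) = (x − μ)ᵀ Σ⁻¹ (x − μ)

Under approximate normality, these squared distances follow a chi-square distribution with degrees of freedom equal to the number of features, which is what the next step relies on.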

df['mahalanobis'] = mahalanobis(df[['OverallQual', 'GarageCars', 'FullBath', 'TotRmsAbvGrd', 'GrLivArea', 'TotalBsmtSF']])
df['chi-square'] = 1 - chi2.cdf(df['mahalanobis'], 6)  # p-value against a chi-square with 6 degrees of freedom (one per feature)
df[['OverallQual', 'GarageCars', 'FullBath', 'TotRmsAbvGrd', 'GrLivArea', 'TotalBsmtSF', 'mahalanobis', 'chi-square']].head()

A row is considered an outlier if its chi-square p-value is below .001, and fortunately there are no outliers here.
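If any rows did fall below that threshold, they could be removed in the same way as the univariate outliers above (a small sketch reusing the 'chi-square' column created earlier):

# Sketch: drop rows flagged as multivariate outliers (p-value below 0.001)
multi_outliers = df[df['chi-square'] < 0.001]
df = df.drop(multi_outliers.index).reset_index(drop = True)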

F. Normality Analysis

The data is considered close to normal when its distribution is not skewed. If a variable does not reach normality, a logarithmic transformation often fixes it.

from scipy.stats import norm
from scipy import stats

We start with GrLivArea

sns.distplot(df['GrLivArea'], fit = norm)
fig = plt.figure()
res = stats.probplot(df['GrLivArea'], plot = plt)

It seems normality has not been reached. The distribution is skewed to the right (positive skew) and the probability plot deviates from the straight line.

df['GrLivArea'] = np.log(df['GrLivArea'])
sns.distplot(df['GrLivArea'], fit = norm)
fig = plt.figure()
res = stats.probplot(df['GrLivArea'], plot = plt)

Now the distribution looks much closer to normal. The reader can apply the same check and transformation to the other skewed features; a quick way to decide which ones need it is sketched below.
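A small sketch (the columns chosen here are just examples; np.log1p is used instead of np.log so that zero values, such as houses without a basement, do not break the transform):

# Sketch: compare skewness before and after a log1p transform
for col in ['TotalBsmtSF', '1stFlrSF', 'SalePrice']:
    print(col, 'skew:', round(df[col].skew(), 2),
          '-> log1p skew:', round(np.log1p(df[col]).skew(), 2))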

G. Encoding the Categorical Data

There are many ways to apply encoding; this time we will use the dummy-variable method. It transforms each categorical variable into a set of binary indicator variables (also known as dummy variables).

df_final = pd.get_dummies(df)
df_final.sort_index(axis = 1, inplace = True)
df_final.head()

H. Execute Time!

Now that the data is ready, we enter the model training stage.

We will split the data into training and testing sets with an 80:20 ratio, and separate the feature data from the target data.

from sklearn.model_selection import train_test_split

X = df_final.drop(['Id', 'mahalanobis', 'chi-square', 'SalePrice'], axis = 1)
Y = df_final['SalePrice']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 10)

We will use regression algorithms such as Linear Regression, Logistic Regression, RidgeCV, and LassoCV, evaluated with several metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-squared. (Note that logistic regression is strictly a classification algorithm; it is included here only for comparison.) A combined comparison of the four models is sketched after the individual blocks below.

  • Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
lr = LinearRegression()
lr.fit(X_train, Y_train)
Y_pred_lr = lr.predict(X_test)
print('Linear Regression')
print('RMSE :', np.sqrt(mean_squared_error(Y_test, Y_pred_lr)))
print('MAE :', mean_absolute_error(Y_test, Y_pred_lr))
print('R2 :', r2_score(Y_test, Y_pred_lr) * 100)
  • Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
log = LogisticRegression()
log.fit(X_train, Y_train)
Y_pred_log = log.predict(X_test)
print('Logistic Regression')
print('RMSE :', np.sqrt(mean_squared_error(Y_test, Y_pred_log)))
print('MAE :', mean_absolute_error(Y_test, Y_pred_log))
print('R2 :', r2_score(Y_test, Y_pred_log) * 100)
  • RidgeCV
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
rid = RidgeCV()
rid.fit(X_train, Y_train)
Y_pred_rid = rid.predict(X_test)
print('Ridge')
print('RMSE :', np.sqrt(mean_squared_error(Y_test, Y_pred_rid)))
print('MAE :', mean_absolute_error(Y_test, Y_pred_rid))
print('R2 :', r2_score(Y_test, Y_pred_rid) * 100)
  • LassoCV
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
las = LassoCV()
las.fit(X_train, Y_train)
Y_pred_las = las.predict(X_test)
print('Lasso')
print('RMSE :', np.sqrt(mean_squared_error(Y_test, Y_pred_las)))
print('MAE :', mean_absolute_error(Y_test, Y_pred_las))
print('R2 :', r2_score(Y_test, Y_pred_las) * 100)
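As mentioned above, the four fitted models can also be compared side by side in a single sketch, reusing the variables already defined:

# Sketch: evaluate all four fitted models on the same test split
models = {'Linear' : lr, 'Logistic' : log, 'RidgeCV' : rid, 'LassoCV' : las}
results = []
for name, model in models.items():
    pred = model.predict(X_test)
    results.append({'Model' : name,
                    'RMSE' : np.sqrt(mean_squared_error(Y_test, pred)),
                    'MAE' : mean_absolute_error(Y_test, pred),
                    'R2 (%)' : r2_score(Y_test, pred) * 100})
print(pd.DataFrame(results).sort_values('RMSE'))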

From the evaluation results above, the RidgeCV model gives the best scores, so we will use it in the implementation step.

I. Implementation

In the implementation phase, the data used must have the same structure as the model’s training data (same shape, column set, and column order).

df_test = pd.read_csv('/content/drive/MyDrive/Dataset/Kaggle 1/test.csv')
df_test.drop(drop_col, axis = 1, inplace = True)
df_test.dropna(axis = 0, inplace = True)
Id = df_test['Id']
df_test.drop('Id', axis = 1, inplace = True)
df_test_dummy = pd.get_dummies(df_test)
missing_col = set(X_train.columns) - set(df_test_dummy)
for c in missing_col:
    df_test_dummy[c] = 0
extra_col = set(df_test_dummy) - set(X_train.columns)
df_test_dummy.drop(extra_col, axis = 1, inplace = True)
df_test_dummy.sort_index(axis = 1, inplace = True)

When it is ready, we will predict using the RidgeCV model that has been trained.

df_test_predict = rid.predict(df_test_dummy)
pd.DataFrame({
    'Id' : Id,
    'Predict' : df_test_predict
})
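To turn these predictions into a file in the Id,SalePrice format required by the competition (see the submission format near the top of this article), they can be written out as follows (a sketch; the file name is just an example):

# Sketch: save the predictions in the required Id,SalePrice submission format
submission = pd.DataFrame({'Id' : Id, 'SalePrice' : df_test_predict})
submission.to_csv('submission.csv', index = False)
# Note: the rows removed by dropna() on the test set are missing here; a complete
# submission would impute those rows rather than drop them.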

Voila!!!

It’s quite tiring but it’s really fun. I hope this article can be useful. Let’s discuss!!

GitHub link : https://github.com/sultanardia/House-Prices---Advanced-Regression-Techniques/blob/main/House_Prices_Advanced_Regression_Techniques.ipynb
