Exploratory Data Analysis and a Machine Learning Model for Predicting Apartment Prices in Colombia and Argentina

This notebook is the work of Tomás Ertola, who took second place in the data science competition "Predicción De Precios De Apartamentos En Argentina y Colombia" (Apartment Price Prediction in Argentina and Colombia). The competitor has published a GitHub repository, linked below, for anyone who wants to study the work in detail.

The goal of this notebook is to be useful to the data science community and to walk through the process the competitor followed to achieve this excellent result.

Publication of this notebook has been authorized by its author.

If you have any questions, please leave them in the comments and we will pass them on to the author.

In [1]:
%%html
<a href="https://www.datasource.ai/es/home/data-science-competitions-for-startups/prediccion-de-precios-de-apartamentos-en-argentina-y-colombia" target="blank">Competición en Data Science</a>
In [2]:
%%html
<a href="https://www.datasource.ai/es/data-science-articles/entrevista-a-los-ganadores-de-la-competencia-de-data-science-prediccion-de-precios-de-apartamentos-en-argentina-y-colombia" target="blank">Entrevista Tomás Ertola y a otros de los ganadores</a>
In [3]:
%%html
<a href="https://github.com/rubzk/datasourceai-pice-competition" target="blank">Repositorio de la Competición</a>

Exploratory Data Analysis and Machine Learning Model

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
In [4]:
train.shape
Out[4]:
(25660, 12)
In [5]:
test.shape
Out[5]:
(6416, 11)
In [6]:
X = train
In [7]:
X.head()
Out[7]:
Id pais provincia_departamento ciudad property_type operation_type rooms bedrooms bathrooms surface_total currency price
0 0 Argentina Capital Federal Villa Crespo Departamento Venta 2.0 1.0 1.0 37.0 USD 85000.0
1 1 Argentina Capital Federal Palermo Departamento Venta 6.0 4.0 4.0 300.0 USD 1590000.0
2 2 Colombia Atlántico Barranquilla Apartamento Venta 3.0 3.0 3.0 95.0 USD 85329.0
3 3 Colombia Valle del Cauca Cali Apartamento Venta 3.0 3.0 1.0 60.0 USD 22846.0
4 4 Argentina Capital Federal Balvanera Departamento Venta 3.0 2.0 1.0 45.0 USD 80000.0

EDA

A very quick look shows that the operation_type, property_type and currency columns add no information: they are effectively constant or redundant, since the listings are sales priced in USD and the property type simply mirrors the country (Departamento in Argentina, Apartamento in Colombia). These features can be dropped.
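One quick way to verify this (a sketch, not a cell from the original notebook) is to count the distinct values of those three columns and confirm that property_type simply mirrors the country:

# operation_type and currency are expected to be constant; property_type should just mirror pais
print(train[['operation_type', 'property_type', 'currency']].nunique())
print(train.groupby('pais')['property_type'].value_counts())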

Let's check how the numerical features correlate with our target, price.

In [9]:
num = train.select_dtypes(exclude='object')

num_corr = num.corr()

#Plot the correlations

f, ax = plt.subplots(figsize=(17,1))

sns.heatmap(num_corr.sort_values(by=['price'], ascending=False).head(1), cmap='Reds')
plt.title('Numerical Features Corr with price', weight='bold', fontsize=18)
plt.xticks(weight='bold')
plt.yticks(weight='bold', color='red', rotation=0)
plt.show()

Unsurprisingly, price is most strongly correlated with the total surface of the property, followed by the number of rooms, bathrooms and bedrooms.

In [10]:
num_corr = num_corr['price'].sort_values(ascending=False).head(5).to_frame()

num_corr
Out[10]:
price
price 1.000000
surface_total 0.664942
rooms 0.430070
bathrooms 0.405673
bedrooms 0.300356

Let's dive into the categorical features

In [11]:
sns.set(style="darkgrid")
sns.countplot(x='pais', data=X)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f220a8a7af0>
In [12]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
columns = ['Argentina','Colombia']

for column,axes in zip(columns, ax.flatten()):
    plt.setp(axes.get_xticklabels(), rotation=90)
    sns.countplot(x='provincia_departamento', data=X[X.pais == column],ax=axes)

     
plt.show()
In [13]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
columns = ['Argentina','Colombia']

for column,axes in zip(columns, ax.flatten()):
    axes.title.set_text('Distplot Total Surface {}'.format(column))
    sns.distplot(X[X.pais == column]['surface_total'],ax=axes)

     
plt.show()

Spoiler alert: the target price looks heavily right-skewed (positive skew). Let's check the other numerical features as well.
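To put a number on that impression (a quick check that is not part of the original notebook), pandas can report the sample skewness directly; large positive values indicate a long right tail:

# Sample skewness of the target and the main numeric features
print(train[['price', 'rooms', 'surface_total']].skew())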

In [14]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,5))
columns = ['price','rooms', 'surface_total']

for column,axes in zip(columns, ax.flatten()):
    sns.distplot(train[column],ax=axes)
      
plt.show()
In [15]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,5))
for column,axes in zip(columns, ax.flatten()):
    sns.boxplot(train[column],ax=axes)
      
plt.show()

Okay, so not only are the features strongly right-skewed, they also contain a lot of outliers.
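A rough way to quantify the outliers, as a sketch using the usual 1.5×IQR boxplot rule (not part of the original notebook):

# Count values outside the 1.5*IQR whiskers for each plotted column
def iqr_outliers(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

for col in ['price', 'rooms', 'surface_total']:
    print(col, iqr_outliers(train[col]), 'outliers')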

In [16]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
columns = ['Argentina','Colombia']
color = ['red']

for column,axes in zip(columns, ax.flatten()):
    axes.title.set_text('Total Surface vs Price {}'.format(column))
    sns.scatterplot(y=train[train.pais == column]['surface_total'],x=train[train.pais == column]['price'],ax=axes, alpha=0.5)

     
plt.show()

Argentina appears to have the most outliers.
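Reusing the hypothetical iqr_outliers helper from the previous sketch, a per-country count makes that comparison concrete:

# Per-country count of values beyond the 1.5*IQR whiskers
print(train.groupby('pais')['surface_total'].apply(iqr_outliers))
print(train.groupby('pais')['price'].apply(iqr_outliers))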

In [17]:
from sklearn.linear_model import LinearRegression

# Fit a simple linear regression of price on surface_total to visualize the trend
lr = LinearRegression()

lr.fit(train['surface_total'].values.reshape(-1,1), train['price'].values)

pred = lr.predict(train['surface_total'].values.reshape(-1,1))

plt.figure(figsize=(10,5))
sns.scatterplot(train['price'], train['surface_total'], alpha=0.5)
plt.plot(pred, train['surface_total'], color='black', linewidth=2)
plt.show()

Feature Engineering

Log transformation towards a normal distribution

In [6]:
y = X['price']
X = pd.concat([X.loc[:, X.columns != 'price'], test])
In [7]:
numeric_features = ['surface_total', 'rooms', 'bedrooms', 'bathrooms']
cat_features = ['pais','provincia_departamento', 'ciudad']
features = cat_features + numeric_features
In [8]:
from scipy.stats import skew


# Measure the skewness of each numeric feature
skewed_feats = X[numeric_features].apply(lambda x: skew(x))
#skewed_feats = skewed_feats[skewed_feats > 0.75]
#skewed_feats
skewed_feats = skewed_feats.index

# log1p every numeric feature (the skew > 0.75 filter above was left commented out) and the target
X[skewed_feats] = np.log1p(X[skewed_feats])
y = np.log1p(y)
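As a quick sanity check (not a cell from the original notebook), the skewness before and after the transformation can be compared: train still holds the raw values at this point, while X has already been log-transformed. Pandas' .skew() is used here because it skips missing values:

# Skewness before (raw train frame) vs after (log1p-transformed combined frame)
print(pd.DataFrame({'before': train[numeric_features].skew(),
                    'after': X[numeric_features].skew()}))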
In [9]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
cat_features = ['pais', 'ciudad', 'provincia_departamento']

# Integer-encode the categorical columns; fitting on the combined train+test frame covers every category
X['pais'] = le.fit_transform(X['pais'])
X['ciudad'] = le.fit_transform(X['ciudad'])
X['provincia_departamento'] = le.fit_transform(X['provincia_departamento'])
In [10]:
# Note: the categorical columns are already integer-encoded, so get_dummies passes them through unchanged here
final_features = pd.get_dummies(X[cat_features + numeric_features]).reset_index(drop=True)
In [11]:
final_features
Out[11]:
pais ciudad provincia_departamento surface_total rooms bedrooms bathrooms
0 0 235 9 3.637586 1.098612 0.693147 0.693147
1 0 148 9 5.707110 1.945910 1.609438 1.609438
2 1 20 1 4.564348 1.386294 1.386294 1.386294
3 1 35 35 4.110874 1.386294 1.386294 0.693147
4 0 16 9 3.828641 1.386294 1.098612 0.693147
... ... ... ... ... ... ... ...
32071 0 59 15 4.290459 1.386294 0.693147 0.693147
32072 0 173 9 3.688879 0.693147 0.693147 0.693147
32073 0 31 9 4.465908 1.609438 1.098612 0.693147
32074 1 20 1 4.343805 1.386294 1.386294 1.098612
32075 1 28 14 4.465908 1.386294 1.386294 1.386294

32076 rows × 7 columns

In [12]:
test = final_features[len(y):]
X_train = final_features[:len(y)]
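A small sanity check, not in the original notebook, that the split back into train and test has the expected sizes (25660 and 6416 rows, matching the shapes printed earlier):

# The first len(y) rows are the training set, the remaining rows are the competition test set
assert X_train.shape[0] == len(y) == 25660
assert test.shape[0] == 6416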

Machine Learning

Models

Regularized linear regression

In [13]:
from sklearn.linear_model import Ridge, RidgeCV, ElasticNetCV, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

def rmse_cv(model):
    # 5-fold cross-validated RMSE on the log-transformed target
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring='neg_mean_squared_error', cv=5))
    return rmse
In [14]:
m_ridge = make_pipeline(RobustScaler(),Ridge())
In [15]:
alphas_ridge = [0.05,0.1,0.3,1,3,5,10]

cv_ridge = [rmse_cv(Ridge(alpha=alpha)).mean() for alpha in alphas_ridge]
In [16]:
cv_ridge = pd.Series(cv_ridge, index=alphas_ridge)
cv_ridge.plot(title='Validation')
plt.xlabel('alpha')
plt.ylabel('rmse')
Out[16]:
Text(0, 0.5, 'rmse')
In [17]:
from sklearn.model_selection import KFold

kfolds = KFold(n_splits=10,shuffle=True)

def cv_rmse(model, X_train):
    # 10-fold shuffled CV RMSE on the log-transformed target (used for all of the models below)
    rmse = np.sqrt(-cross_val_score(model, X_train, y,
                                    scoring="neg_mean_squared_error",
                                    cv=kfolds))
    return (rmse)

Lasso

In [18]:
lasso = make_pipeline(RobustScaler(), LassoCV(max_iter=1e7,alphas=[5e-05, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008]))
In [19]:
score = cv_rmse(lasso, X_train)
print('Lasso Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
Lasso Score 0.3756 (0.0065)

Ridge

In [20]:
ridge = make_pipeline(RobustScaler(), RidgeCV(alphas=[14.5, 14.6, 14.7, 14.8, 14.9, 15, 15.1, 15.2, 15.3, 15.4, 15.5]))

score = cv_rmse(ridge, X_train)
print('Ridge Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
Ridge Score 0.3756 (0.0084)

ElasticNet

In [21]:
elastic = make_pipeline(RobustScaler(), ElasticNetCV(alphas=[0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007], 
                                                     l1_ratio=[0.8, 0.85, 0.9, 0.95, 0.99, 1]))

score = cv_rmse(elastic, X_train)
print('Elastic Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
Elastic Score 0.3756 (0.0098)

SVR

In [22]:
from sklearn.svm import SVR

svr = make_pipeline(RobustScaler(), SVR(C=20, epsilon=0.008, gamma=0.0003))

score = cv_rmse(svr, X_train)
print('SVR Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
SVR Score 0.3581 (0.0137)

GradientBoostingRegressor

In [23]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4, loss='huber', random_state=42)

score = cv_rmse(gbr, X_train)
print('GBR Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
GBR Score 0.2551 (0.0096)

LightGBM

In [24]:
from lightgbm import LGBMRegressor
lgb = LGBMRegressor(objective='regression', learning_rate=0.08, n_estimators=4000, num_leaves=32,max_depth=7, reg_alpha=0.0001)

score = cv_rmse(lgb, X_train)
print('LGBM Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
LGBM Score 0.2589 (0.0101)

Catboost

In [25]:
from catboost import CatBoostRegressor


catboost = CatBoostRegressor(learning_rate=0.05, n_estimators=10000, max_depth=5, verbose=False, random_seed=42,
                            max_leaves=32, min_data_in_leaf=1)

score = cv_rmse(catboost, X_train)
print('Catboost Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
Catboost Score 0.2567 (0.0084)

XGboost

In [26]:
from xgboost import XGBRegressor
xgb = XGBRegressor(learning_rate=0.08, n_estimators=3000, max_depth=7, reg_alpha=0.0001, objective='reg:squarederror', max_leaves=10)
score = cv_rmse(xgb, X_train)
print('XGB Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
XGB Score 0.2703 (0.0089)

RandomForest

In [27]:
from sklearn.ensemble import RandomForestRegressor

rnd = RandomForestRegressor(n_estimators=4000, criterion='mse', max_depth=7, random_state=42, n_jobs=4, min_samples_leaf=10)

score = cv_rmse(rnd, X_train)
print('Random_forest Score {:.4f} ({:.4f})'.format(score.mean(),score.std()))
Random_forest Score 0.3107 (0.0094)
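Before stacking, it can help to see all the cross-validated scores side by side. This sketch (not part of the original notebook) simply re-runs cv_rmse for every model defined above, so it is slow, but it yields a single comparison table:

# Collect the 10-fold CV RMSE of every candidate model in one frame (refits everything)
models = {'lasso': lasso, 'ridge': ridge, 'elastic': elastic, 'svr': svr,
          'gbr': gbr, 'lgb': lgb, 'catboost': catboost, 'xgb': xgb, 'rnd': rnd}
cv_summary = pd.DataFrame({name: cv_rmse(model, X_train) for name, model in models.items()})
print(cv_summary.agg(['mean', 'std']).T.sort_values('mean'))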

Stack Pred

In [28]:
from sklearn.ensemble import StackingRegressor

est = [('lasso', lasso), ('ridge', ridge), ('elastic', elastic), ('randomforest', rnd), ('gradient', gbr), ('xgboost', xgb), ('lightgbm', lgb), ('cat', catboost)]
stack_gen = StackingRegressor(estimators=est,
                                final_estimator=gbr,
                                n_jobs=4)

Fit

In [ ]:
stack_gen.fit(X_train, y)
lasso.fit(X_train,y)
ridge.fit(X_train,y)
elastic.fit(X_train,y)
svr.fit(X_train,y)
gbr.fit(X_train,y)
xgb.fit(X_train,y)
lgb.fit(X_train,y)
rnd.fit(X_train,y)
catboost.fit(X_train, y)
/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:691: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning

Test Prediction

In [30]:
def blend_models_prediction(X):
    # Weighted blend of the strongest models plus the stacked ensemble (weights sum to 1)
    return ((0.4 * gbr.predict(X)) +
            (0.15 * xgb.predict(X)) +
            (0.3 * catboost.predict(X)) +
            (0.15 * stack_gen.predict(X)))
In [31]:
from sklearn.metrics import mean_squared_error
def rmsle(y, y_pred):
    # Plain RMSE; because the target was log1p-transformed up front, this equals the RMSLE of the original prices
    return np.sqrt(mean_squared_error(y, y_pred))
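To make the name explicit: RMSLE on the original price scale is sqrt(mean((log1p(pred) - log1p(actual))^2)), and since the target was transformed with log1p up front, plain RMSE on the transformed values is exactly that quantity. A tiny numeric sketch with hypothetical predictions (the actual prices are taken from the head() sample above):

# RMSLE computed directly on prices equals RMSE computed on log1p-transformed prices
actual = np.array([85000.0, 1590000.0])
predicted = np.array([90000.0, 1500000.0])  # hypothetical predictions
rmsle_direct = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))
rmse_on_logs = rmsle(np.log1p(actual), np.log1p(predicted))
print(rmsle_direct, rmse_on_logs)  # identical values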
In [32]:
score = rmsle(y,blend_models_prediction(X_train))
print('RMSLE Score on train data: {}'.format(score))
RMSLE Score on train data: 0.20494364569622692
In [57]:
prediction = blend_models_prediction(test)
In [58]:
# expm1 inverts the earlier log1p, bringing the predictions back to the original USD price scale
submission = pd.DataFrame({'id': test.index,
                           'price': np.expm1(prediction)})
In [59]:
submission.to_csv('blend.csv', index=False)
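A final check one might add before uploading (not in the original notebook): confirm the submission has one row per test listing and strictly positive prices.

# Hypothetical pre-submission check: 6416 rows, prices back on the positive USD scale
assert submission.shape[0] == 6416
assert (submission['price'] > 0).all()
print(submission.head())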