Interview with the winners of the data science competition "Real Estate Price Forecast"

Daniel Morales
Oct 16, 2020



Learn how they built their machine learning models and what tools they used in this interview with the top 10 of the competition leaderboard.

A few days ago we finished the data science competition called "Real Estate Price Forecast". 139 data scientists joined, 51 of them submitted at least one machine learning model to the platform, and we received and evaluated a total of 831 models, an average of 16 models per active participant. One conclusion stands out: you need to build many different models, evaluate their effectiveness, and iterate toward the best result.

Since the metric measured error, lower was better: the minimum, and winning, score was 0.248616099466774. Only two people managed a score in the 0.24 range; let's look at these ranges:

Score ranges:
  • 0.24 = 2 competitors
  • 0.25 = 7 competitors
  • 0.26 = 9 competitors

Given these good results, we wanted to know in detail what the top-ranked competitors did. So here are the questions and answers from our winners.


Tomás Ertola - Argentina - Second Place


Q: In general terms, how did you address the problem raised in the competition?
A: The pipeline I followed was very basic: EDA, distribution transformations, and trying out different models.

Q: For this particular competition, did you have any previous experience in this field? 
A: When I did a DS bootcamp in August 2019, the Properati dataset was used to practice data cleaning. I had never built a similar model, so I took it as a challenge.

Q: What important results/conclusions did you find in your exploration of the data? What challenges did you have to deal with?
A: The most important fact, the one that makes your model stand out, is that the distributions of the three numerical variables are shifted to the left, so they don't follow a normal distribution. Once you realize this, you want to check which distribution they most resemble and apply a correction. In my case I applied a logarithmic transformation.
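
For readers who want to try this step, here is a minimal sketch of checking skewness and applying a log transform to numeric columns. The column names and values below are invented placeholders, not the competition's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for the competition's variables
df = pd.DataFrame({
    "price": [95_000, 120_000, 80_000, 450_000, 60_000],
    "surface_total": [45.0, 60.0, 38.0, 210.0, 30.0],
    "rooms": [2, 3, 1, 5, 1],
})

# Heavily skewed columns are candidates for a distribution correction
print(df.skew(numeric_only=True))

# log1p handles zeros safely and pulls long right tails toward a more normal shape
for col in ["price", "surface_total", "rooms"]:
    df[f"log_{col}"] = np.log1p(df[col])
```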

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I took only 3 categorical variables, Country, City and Department, applied one-hot encoding to them, and left it at that. The hard work was correcting the distribution of the numerical variables.
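
As a quick illustration of that encoding step, a hedged sketch with pandas is shown below; the column names follow the answer, but the sample rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Argentina", "Colombia", "Argentina"],
    "city": ["Buenos Aires", "Bogotá", "Córdoba"],
    "department": ["Palermo", "Chapinero", "Centro"],
})

# get_dummies expands each categorical column into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["country", "city", "department"])
print(encoded.head())
```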

Q: What Machine Learning algorithms did you use for the competition? 
A: First I built a stack of regressors with L1 and L2 regularization (Lasso and Ridge), and then an ensemble with Gradient Boosting, XGBoost, CatBoost, LightGBM, and Random Forest.
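
The exact pipeline was not shared, but a rough sketch of this kind of setup with scikit-learn could look like the following: a stack of L1/L2-regularized linear models plus an averaging ensemble of tree models, on synthetic data. XGBoost, CatBoost and LightGBM regressors could be appended to the ensemble in the same way if installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor, VotingRegressor)
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Stack of L1 (Lasso) and L2 (Ridge) regularized regressors, blended by a Ridge meta-model
linear_stack = StackingRegressor(
    estimators=[("lasso", Lasso(alpha=0.01)), ("ridge", Ridge(alpha=1.0))],
    final_estimator=Ridge(),
)

# Simple averaging ensemble of tree-based regressors
tree_ensemble = VotingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ]
)

for name, model in [("linear stack", linear_stack), ("tree ensemble", tree_ensemble)]:
    model.fit(X, y)
    print(name, "R^2:", round(model.score(X, y), 3))
```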

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: Within the ensemble, the gradient boosting models performed best, and on top of that the ensembling corrected the weaknesses of each individual model, which made the combination an optimal solution. I don't believe it is by chance that they worked so well: gradient boosting methods have been winning competitions for years, and the principles these algorithms are built on already make them stand out among the most common choices.

Q: What libraries did you use for this particular competition?
A: Sklearn, scipy, numpy, catboost, xgboost, lightgbm

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have 1 year of self-taught experience in data science, and I am currently working as a Data Analyst for the Government of the City of Buenos Aires.

Q: What advice would you give to those who did not have such good scores in the competition? 
A: The important thing is not the score, but understanding what you are doing. When you start to understand more about the models and what you are doing, the results improve, but the truth is that understanding is worth more than the score.

Pablo Neira Vergara - Chile - Third Place


Q: In general terms, how did you address the problem raised in the competition?
A: The main thing is to understand the problem well and think about what new variables could be created. I don't mean just transformations to normalize or standardize, or making dummies; I'm talking about things like realizing that the number of times a city is repeated in the training set can tell us something about its population density, or that some kinds of clustering over a subset of variables can give us more information than those same variables taken separately.
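
The "how often a city repeats" idea is essentially frequency (count) encoding. A small sketch, with invented column names and rows, might look like this:

```python
import pandas as pd

train = pd.DataFrame({"city": ["Bogotá", "Bogotá", "Medellín", "Buenos Aires", "Bogotá"]})
test = pd.DataFrame({"city": ["Medellín", "Cali"]})

# Counts are learned on the training set only, then mapped onto both sets;
# cities unseen in training fall back to 0
city_counts = train["city"].value_counts()
train["city_count"] = train["city"].map(city_counts)
test["city_count"] = test["city"].map(city_counts).fillna(0)
print(test)
```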

Q: For this particular competition, did you have any previous experience in this field? 
A: None

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The city seemed to be an excellent categorical variable; however, some cities in the test set were not present in the training set, so I had to think about how to compensate for that lack of information. Also, some properties in the training set varied considerably in price even though they belonged to the same city and had the same number of rooms and square meters.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I created dummies for virtually all variables that looked suspiciously categorical, then applied a StandardScaler, and also ran k-means together with a grid search to determine which groupings could provide more information to a base model. Among other things, of course.

Q: What Machine Learning algorithms did you use for the competition? 
A: I tried many algorithms and ensembles, but it turned out that the best results were obtained using k-means together with a grid-search-optimized GBR (gradient boosting regressor).
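
We don't have Pablo's code, but a minimal sketch of combining k-means cluster labels with a grid-searched GradientBoostingRegressor, on synthetic data, could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=0)

# Add the k-means cluster assignment as an extra feature
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X).reshape(-1, 1)
X_aug = np.hstack([X, clusters])

# Small illustrative grid; in practice the grid (and the number of clusters) would be wider
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [200, 400], "learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_aug, y)
print(grid.best_params_, grid.best_score_)
```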

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: I think I made good use of the available information, plus I ended up ensembling the model's predictions with the data that was exactly the same between training and testing. If it looks like a duck, quacks like a duck, and flies like a duck, the most sensible thing is to assume it's a duck; of course sometimes it turns out to be a goose, hence the need for the ensemble.

Q: What libraries did you use for this particular competition?
A: The usual: pandas, numpy, sklearn, seaborn, matplotlib and lightgbm

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have been in the area for a little over 4 years. I currently work for the Logistics Observatory of the Ministry of Transport and Telecommunications of the Chilean Government

Q: What advice would you give to those who did not score as well in the competition? 
A: Think about the problem to be solved before going to sleep, look for information on possible suitable algorithms, and if possible on the state of the art for the particular problem itself.


Cesar Gustavo Seminario Calle - Perú - Sixth Place


Q: In general terms, how did you address the problem raised in the competition?
A: I focused on testing a variety of features based on target statistics by province, city, and country, and on setting up my validation scheme with the training data so I could compare results before making submissions.

Q: For this particular competition, did you have any previous experience in this field? 
A: Yes, I have worked on time series models.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The total area was a very important variable in predicting the price. Some cities had very high prices, probably because they were condominiums, and for prices above 100,000 the relationship was almost linear, while for lower values it seemed to be exponential.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I applied target encoding using the price, number of rooms, and total surface area, separately for each country, and then clustered the apartments using the built features.
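
Target encoding here means replacing a category with a statistic of the target computed on the training data. A hedged sketch of mean-price encoding per (country, city), with a fallback for unseen cities, is shown below; the column names are assumptions, not the competition's actual schema.

```python
import pandas as pd

train = pd.DataFrame({
    "country": ["AR", "AR", "CO", "CO"],
    "city": ["Buenos Aires", "Córdoba", "Bogotá", "Bogotá"],
    "price": [120_000, 80_000, 95_000, 110_000],
})
test = pd.DataFrame({"country": ["AR", "CO"], "city": ["Córdoba", "Cali"]})

# Mean target per (country, city), computed on the training set only
city_means = train.groupby(["country", "city"])["price"].mean().to_dict()
country_means = train.groupby("country")["price"].mean().to_dict()

# Unseen cities fall back to the mean of their country
test["city_price_te"] = [
    city_means.get((c, ct), country_means[c]) for c, ct in zip(test["country"], test["city"])
]
print(test)
```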

Q: What Machine Learning algorithms did you use for the competition? 
A: Simple neural network models and tree based models. Model Stacking.

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: Extreme gradient boosting, because of the regularization parameters (depth, learning rate, column sampling) and the boosting technique

Q: What libraries did you use for this particular competition?
A: scikit-learn, keras, pandas, seaborn, mlflow

Q: How many years of experience do you have in Data Science and where do you currently work?
A: 2 years of experience, currently working at Voxiva.ai

Q: What advice would you give to those who did not score as well in the competition? 
A: Maintaining an orderly and simple validation scheme allows you to test many hypotheses and obtain results without repeating work.
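
One simple way to keep a validation scheme orderly is to fix the cross-validation folds once and reuse them for every experiment, so scores stay comparable. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Fixing the folds (shuffle + seed) keeps every experiment on the same split
cv = KFold(n_splits=5, shuffle=True, random_state=42)

for name, model in [("ridge", Ridge()), ("random forest", RandomForestRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.3f}")
```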


Federico Gutiérrez - Colombia - Seventh Place


Q: In general terms, how did you address the problem raised in the competition?
A: Since we were processing information on two different countries, I decided that a good strategy would be to divide all the information by country. The real estate markets in Colombia and Argentina are very different and each has its own particular way of behaving, so I don't think it is a good idea to create a model for both markets simultaneously.
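
A bare-bones sketch of that split-by-country strategy, training one independent model per country, might look like this (the columns and rows are placeholders):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Argentina", "Colombia", "Colombia", "Colombia"],
    "surface_total": [50, 120, 75, 60, 200, 90],
    "rooms": [2, 4, 3, 3, 5, 3],
    "price": [90_000, 260_000, 140_000, 110_000, 400_000, 160_000],
})

# One model per country, each trained only on that country's rows
models = {}
for country, group in df.groupby("country"):
    X = group[["surface_total", "rooms"]]
    y = group["price"]
    models[country] = GradientBoostingRegressor(random_state=0).fit(X, y)

# At prediction time each row is routed to its country's model
new_row = pd.DataFrame({"surface_total": [80], "rooms": [3]})
print(models["Argentina"].predict(new_row))
```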

Q: For this particular competition, did you have any previous experience in this field? 
A: Although I had never created models to make this type of forecast, I did have experience and general understanding of the real estate market and property evaluation. I gained this experience by working for an insurance company.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: This competition presented several challenges, including cleaning and debugging the dataset. To do a good cleanup I had to make several assumptions; among other things, I manually corrected several prices that were far from realistic values. For this I simply used common sense and basic knowledge of the industry.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: One of the factors that helped me the most was dividing the dataset by country; this made it easier for the models to get closer to reality. For the Argentine market, I discovered that the presence of an extra guest bathroom directly impacts the final price of the property, so I decided to compute this new variable and include it in my analysis.

Q: What Machine Learning algorithms did you use for the competition? 
A: I used different algorithms, including linear regression, random forest regression, gradient boosting regression, and XGBoost regression.

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: The best algorithm was XGBoost. I think this is because it includes internal regularization, which helps reduce overfitting.
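
For reference, these are the kinds of knobs being referred to in an XGBoost regressor; the values below are purely illustrative, not Federico's actual configuration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(
    n_estimators=500,
    max_depth=4,           # shallower trees limit overfitting
    learning_rate=0.05,    # smaller steps make boosting more robust
    subsample=0.8,         # row sampling per tree
    colsample_bytree=0.8,  # column sampling per tree
    reg_alpha=0.1,         # L1 regularization on leaf weights
    reg_lambda=1.0,        # L2 regularization on leaf weights
    random_state=0,
)
model.fit(X_tr, y_tr)
print("validation R^2:", round(model.score(X_val, y_val), 3))
```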

Q: What libraries did you use for this particular competition?
A: Pandas, Numpy, Seaborn, Matplotlib, Scipy, Xgboost and Scikit Learn.

Q: How many years of experience do you have in Data Science and where do you currently work?
A: 2 years, I work at The Clay Project

Q: What advice would you give to those who did not score as well in the competition? 
A: I think that for this competition it is worth focusing on understanding very well how the real estate sector works and which variables are critical. I think that if you understand the sector and the context of the problem well, you will be able to include only the relevant variables in your algorithms and thus obtain better results.


Germán Goñi - Chile - Eighth Place


Q: In general terms, how did you address the problem raised in the competition?
A: In my experience the real estate market tends to be different in each country. It is important to have a solution that is not generic, but that also does not overfit.
It is also relevant to apply feature transformations when dealing with variables with asymmetric distributions.

Q: For this particular competition, did you have any previous experience in this field? 
A: Yes, but with data from another country.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The test set contained provinces/cities not observed in the training set: watch out for "naive" models.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: Box-Cox transformations, and different types of encoding for the qualitative variables.
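
A small sketch of a Box-Cox transform on a skewed, strictly positive variable (synthetic data, using scipy) is shown below; Box-Cox estimates the lambda parameter by maximum likelihood and only accepts positive values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=11, sigma=0.6, size=1000)  # right-skewed, strictly positive

# boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(prices)
print("fitted lambda:", round(fitted_lambda, 3))
print("skew before:", round(stats.skew(prices), 2), "after:", round(stats.skew(transformed), 2))
```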

Q: What Machine Learning algorithms did you use for the competition? 
A: Random Forest, Catboost

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: Catboost

Q: What libraries did you use for this particular competition?
A: Pandas, Numpy, Matplotlib, Seaborn

Q: How many years of experience do you have in Data Science and where do you currently work?
A: 5

Q: What advice would you give to those who did not score as well in the competition? 
A: Consult domain experts, in this case in real estate management. Review literature and tutorials on feature engineering.


Alejandro Anachuri - Argentina - Ninth Place


Q: In general terms, how did you address the problem raised in the competition?
A: I started by analyzing the dataset: seeing what types of data it had, verifying whether there were any null values, and then using graphs to look for correlations between the variables. Basically, an EDA process.

Q: For this particular competition, did you have any previous experience in this field? 
A: Only with similar problems posed in some book or course I was following as a learning exercise.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The most important was finding the relationship between price and total surface area; by taking the log of the price, this relationship became much clearer, which helped me improve the models' results.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: Label encoding, one-hot encoding, and data transformations.

Q: What Machine Learning algorithms did you use for the competition? 
A: Random forest

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: I only used random forest, which is the model I was studying; I wanted to use this competition to try to understand it more deeply.
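
For anyone taking the same single-model route, a minimal sketch of fitting a random forest and inspecting its feature importances (on synthetic data) could look like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_tr, y_tr)

print("validation R^2:", round(model.score(X_val, y_val), 3))
# Feature importances show which inputs the forest relies on most
print("importances:", model.feature_importances_.round(3))
```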

Q: What libraries did you use for this particular competition?
A: I mainly used pandas, matplotlib, numpy, sklearn, seaborn

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have no real work experience in data science; I have been teaching myself for just a couple of months. I am a systems engineer and I work in software development at a multinational company.

Q: What advice would you give to those who did not score as well in the competition? 
A: Keep trying and testing different models, or tuning the same ones, and consult with the people in the group; that way this community can also begin to grow and be nourished by those who know the most.



Conclusions

As we can read, each competitor followed their own methods and models, but one thing stands out: the need to try different approaches, questions, answers, and models. We hope you have drawn your own conclusions; you can share them with us in the comments. We look forward to seeing you in the currently active competition, and perhaps you could be a TOP 10 interviewee for the next one!


PS: we did not get the answers from the competitor who got first place. :( 
Join our private community in Discord

Keep up to date by participating in our global community of data scientists and AI enthusiasts. We discuss the latest developments in data science competitions, new techniques for solving complex challenges, AI and machine learning models, and much more!