Data Science Study Case - Real Estate Price Forecast

Daniel Morales
Apr 17, 2021

Contents Outline

Data Science Study Case - Real Estate Price Forecast

Apr 17, 2021 6 minutes read

Predicting or estimating the selling price of a property can be of great help when making important decisions such as the purchase of a home or real estate as an investment vehicle. It can also be an important tool for a real estate sales agency, since it will allow them to estimate the sale value of the real estate that for them in this case are assets. 

Having an estimate of the value of the property allows to increase the negotiation capacity, both for the buyer and for the seller. In addition, having this knowledge serves as a comparative tool to evaluate growth projections in different residential sectors.

In we have conducted a competition in data science where participants could download a dataset with information about sales that have been made in the past, and try to estimate the value of a new property for sale.

The objective of this competition was to create a machine learning model to predict the price of apartments for Argentina and Colombia, but these same models can be applied to other countries or even cities, given the main variables that describe these properties, such as: the area, the number of bathrooms, the location, etc.

In this case study we will go through the following points:

  1. Defining the problem
  2. Acquiring the data (dataset)
  3. Configuration and requirements of the competition
  4. Choosing a winning model
  5. Deploying an API in production 
  6. Deploying the model in a visual app (Streamlit)

Definition of the problem
After the previous introduction, where we talked about the importance of being able to predict the prices of a property, and the competitive advantages that this gives us, the problem we want to face in this case is a Regression task, where we have a list of properties that have been sold in the past, with the characteristics that define it, such as: the area, the number of bathrooms, the location, etc. So we must predict a continuous value

Gathering datasets
The main data set contains information about the departments/apartments for sale in Argentina and Colombia during the period 2019 - 2020. "data provided by properati"

If you want to see the datasets that this platform provides you can find them here:, these data are open, so we thank Properati for taking the trouble to open their data! cleaned and pre-prepared the data needed for the competition.

We defined the following features:

ID = Unique property identifier
country = Country where the property is located
city = City where the property is located
province_department = province or department where the property is located
property_type = Type of property (In our case it is only an apartment)
operation_type = Type of business (sale)
rooms = Number of rooms
bedrooms = Number of rooms
surface_total = Total area in m2
currency = currency in dollars
price = price of the property

Note: For Colombia, a conversion rate of $3,633 COP was applied to calculate the price in USD

The dataset was divided as follows:

train_apartmentos.csv corresponds to 80% of the data, with this set you will train the machine learning model.

test_apartments.csv corresponds to 20% of the data, with this set of data you will predict the price column. Unlike the set Train, the dataset Test does not contain the data in the column price, these must predict with your model.

Competition configuration and requirements
  • In order to set up the competitions, we will need the following guidelines:
  • Competition start date.
  • End date of the public phase: in this period we evaluate the models based on a first subdivision of the data.
  • End date of the private phase: in this period we evaluate the models based on a second and last subdivision of the data to avoid overfitting.
  • The description of the details of the competition
  • The rules of the competition
  • The evaluation metrics of the models. This metric is subject to the type of problem to be solved.
  • The prizes to be awarded to the competitors who achieve the best scores.
  • The data in csv format and with the corresponding subdivisions.

Selection of a winning model
At the end of the competition, these were the top 5 winners

Each of them has sent us the source code with their respective solutions, and we have chosen one of them to deploy in production. 

Deploying an API in production 

This is a screenshot of the code we used to deploy an API on our own servers, and be able to attend HTTP requests.

You probably don't understand much about it, don't worry, we have the engineers to do this whole process of deployment to production of the winning models. The important thing is what comes next. 

Let's say hypothetically your company has a web platform, to which you want to connect these predictions, and every time you need to make a prediction you only have to enter the characteristics of the property by filling out a form, and you get these predictions. Well, that is the intention of this API, that the form you fill in your web page communicates automatically and via API with our server, and immediately returns the predicted sale price. 

Let's perform the simulation using Postman

Here we can see that at the top we are sending the data to a URL with the following structure:

This is the endpoint where we have uploaded the code shown above. Then in the Body we are sending some characteristics about the property to which we want to predict the final value, or sale price. Which is located in Argentina, has a total of 4 rooms, 2 bedrooms and 3 bathrooms, and has an area of 137 mts2. Given these characteristics and using one of the winning models, we obtain the result at the bottom, with an estimated sales value of $402,473 usd. 

This is where the power of Machine Learning lies with a model deployed to production, we can make predictions on the fly!

Deploying the model in a visual app (Streamlit)

As a final and additional step, we can simulate a visual platform, where by means of a form the prediction results are obtained. In this case we are using Streamlit. Let's see the result: You can access the following link and play a little with the prediction parameters

In the left bar we have the model parameters, in order to make the predictions. Such as country, rooms, bethrooms, etc. And in the central part we have the Score of the model, the parameters specified on the left and the result of the price prediction. For these characteristics, for example, the final price is $314,011 usd (they are different parameters from the previous one, that is why the price changed).


As you can see, with the competitions we cover the whole machine learning process, and we do it hand in hand with you trying to solve a specific problem with data science.

Join our private community in Discord

Keep up to date by participating in our global community of data scientists and AI enthusiasts. We discuss the latest developments in data science competitions, new techniques for solving complex challenges, AI and machine learning models, and much more!