Blog Post Visitors Prediction



Tournament brackets


How do the quarterfinals work?

This is where the excitement begins! The system will randomly assign pairs of competitors: 4 groups of 2 people. At this point you compete directly with your opponent, not with everyone else. The adrenaline rises every time you submit a new solution, because you want to prove that you are the best in your bracket! A new dataset called TestQuarterfinals.csv will be enabled immediately, with the samples you must now predict, alongside TrainQuarterfinals.csv, which contains the original training data plus the true labels from the Regular Season.

This is where we "reset" the competition: you must re-train your model with more data and make predictions on new data, all within the same data science problem. Our system will highlight in yellow the better score of the two. Keep an eye on the timing, as the quarterfinals last only 1 week! At this point no new participants or contributions will be admitted; this is when the final prize pool to be distributed among the winners is fixed.
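The random pairing described above can be sketched in a few lines. This is only an illustration of the idea, not the platform's actual code, and the competitor names are hypothetical stand-ins:

```python
import random

# Hypothetical names for the 8 quarterfinalists.
competitors = ["Ana", "Ben", "Carla", "Dan", "Eva", "Finn", "Gia", "Hugo"]

random.shuffle(competitors)  # random assignment of opponents
# Slice the shuffled list into 4 groups of 2.
brackets = [competitors[i:i + 2] for i in range(0, len(competitors), 2)]
print(brackets)
```

Each inner list is one bracket: you compete only against the other person in it.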

Machine Learning Problem

Let's say you work as a senior Data Scientist at a marketing agency that offers Search Engine Optimization (SEO) services for corporate clients. As part of its SEO services, the agency recommends that all its clients publish a greater number of articles on their blogs, and as a result have a bigger impact, because Google rewards the creation of unique and frequent content.

So far, the agency has historical data on its clients' posts, including a column that records the number of unique and total visitors for each blog post. The agency believes that with this information it can predict whether a client's new post will be successful, as measured by the number of visits.

This will allow the client to focus on creating relevant posts that are expected to have a high number of visitors (and impact), and avoid creating posts that don't have that same impact. This prediction will save valuable resources (like time and money) for the client and the agency.


The evaluation of the model will be done using the RMSLE (Root Mean Squared Logarithmic Error): the square root of the MSLE metric implemented in scikit-learn. If you want to know more about the MSLE metric that scikit-learn calculates, you can find it here:


RMSLE = sqrt( (1/N) * Σ_{i=1}^{N} ( log(y_i + 1) − log(ŷ_i + 1) )^2 )

where:

N = number of rows in your submission file

y_i = true values

ŷ_i = predicted values
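The metric can be written out directly with NumPy. This is a sketch of the definition above (equivalent to taking the square root of scikit-learn's `mean_squared_log_error`), not the platform's scoring code:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error over paired true/predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(x + 1), matching the formula above.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

print(rmsle([100, 250, 30], [110, 200, 30]))
```

Because the metric works on log(1 + x), it penalizes relative errors rather than absolute ones, which suits visitor counts that span several orders of magnitude.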


  • The code should not be shared privately. Any code that is shared must be available to all participants of the competition through the platform.
  • The solution should use only publicly available open source libraries
  • If two solutions get identical scores in the ranking table, the tie-breaker will be the date and time of the submission (the first solution submitted will win).
  • We reserve the right to request any user's code at any time during a tournament. You will have 48 hours to submit your code following the code review rules.
  • We reserve the right to update these rules at any time.
  • Your solution must not infringe the rights of any third party, and you must be legally authorized to assign ownership of all copyrights in and to the winning solution code.
  • Competitors may register and submit solutions as individuals (not as teams, at least for now).
  • Apart from the rules in the Terms of Use, no other particular rules apply.
  • Maximum 50 solutions submitted per day.

If you reach the Final Stage, at the end of that stage you must submit the complete model in .ipynb (Jupyter Notebook) format through the form (as an attachment) that we'll display for you inside the platform; no other file formats or submission channels will be accepted. Normally, you'll have 3 days (the Final Shot Stage) to send it through our "Submit Modal" button. This final machine learning model will help us run the final evaluations, so the winners will be determined on the basis of the final score and this Notebook.

Within our tournaments everybody wins!

Our tournaments are community-funded: the money that will be distributed is the total amount collected by the community within the established deadlines. In order to participate in the tournament, each competitor must contribute a sum ranging from $USD 10 to $USD 300.

What do you receive with your contribution?
  1. The Machine Learning models of the winners in a Jupyter Notebook format. 
    1. At the end of the tournament we will share with you the winning Machine Learning models in a Notebook format, for educational purposes, as you will be able to study them and learn from the best!
    2. This provides the transparency of the competition, as well as the proof-of-work of the winners!
  2. Learn competitive and applied Machine Learning in a real-world environment. You will learn about the process of participating in a tournament, do your best to advance to the different stages, and for sure you will get an adrenaline rush once you are competing in the playoffs!
  3. Show off your skills to recruiters: we will award certificates of achievement to those who reach the quarterfinals or higher. You can also share your public profile with recruiters, where they will see your achievements in the tournament.
  4. Measure your level of learning, and your mastery of the skills, through the scores you achieve.
  5. And of course, the chance to win money!

The total amount raised will be distributed as follows:
  • First Place: 50% of the total amount, and 20,000 points
  • Second Place: 30% of the total amount, and 15,000 points
  • Third Place: 20% of the total amount, and 10,000 points
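As a worked example of the split, using the current pool of $USD 105 stated below:

```python
# Prize split: 50% / 30% / 20% of the total pool.
total = 105.0  # current pool in USD
shares = {"First": 0.50, "Second": 0.30, "Third": 0.20}
payouts = {place: round(total * pct, 2) for place, pct in shares.items()}
print(payouts)  # {'First': 52.5, 'Second': 31.5, 'Third': 21.0}
```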

For more info, please go to our tournament FAQs:

Total Pool Prize: $USD 105.0

The dataset contains historical information about the number of visitors to different blog posts, which are hosted on different websites. Each blog post has certain features, which will be used to make the predictions. 

The dataset is public, but to keep the tournament transparent and to avoid possible cheating, we will not provide the original column names or further information about the data. The columns have been anonymized as C2, C3, ..., C60, and the only column kept in its original form is the "target" column.

PS: remember to check the tournament rules, because if there is suspicion of cheating, you could be disqualified from the tournament and the platform.

Submission File

For each "id" in the test set, you must predict a number for the "target" variable. The file should contain a header and have the following format:


The total number of predicted rows you need to send in your submission file at this stage is 3,985.
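A minimal sketch of writing the submission file with only the standard library; the ids and target values here are hypothetical stand-ins for your model's predictions on the test set:

```python
import csv

# Hypothetical (id, predicted target) pairs from your model.
predictions = [(1, 1250.0), (2, 87.5), (3, 432.0)]

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "target"])  # header row, as the format requires
    writer.writerows(predictions)
```

A real submission would of course contain one row per id in the test set (3,985 at this stage), not three.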

Datasets: one of the peculiarities of our tournaments is that each stage has new datasets (new observations). This means that the data science problem stays the same, but we release new observations, so competitors must re-train their models on them. This faithfully simulates the arrival of new real data and the improvement of the model based on it. It also keeps the excitement high, as no one's position is completely safe!

Release datasets: just as we will have a TrainQuarterfinals.csv or a TrainSemifinal.csv depending on the stage of the tournament, we will also publish a TestRelease.csv with the true labels of the immediately preceding stage. This keeps each stage transparent, since every competitor can test their model against the true labels in that file.

Quarterfinals datasets: if you advance to the quarterfinals, at this stage you will notice (when you're logged in) a dataset named RegularSeasonRelease.csv (blue button). This file contains the true values from the previous stage (Regular Season), so you can check that the scores you got on the platform are correct (following the evaluation metric). The second dataset to take into account is the new TrainQuarterfinals.csv: it contains the Train.csv of the Regular Season combined with the observations of the Regular Season Test.csv, now carrying their true labels. This means you have more data to re-train your model and improve its fit. The next dataset, TestQuarterfinals.csv, now serves as the test set; it tells you the number of observations to send in your submission file. You will also see SampleSubmission.csv, a sample of the format for sending your results to the platform.
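The score check described above can be sketched as follows. This assumes the release file and your saved submission both have "id" and "target" columns, as in the submission format; the file names in the usage comment are the ones mentioned above:

```python
import csv
import math

def rmsle_from_files(release_path, submission_path):
    """Recompute RMSLE between released true labels and a past submission."""
    with open(release_path) as f:
        truth = {row["id"]: float(row["target"]) for row in csv.DictReader(f)}
    with open(submission_path) as f:
        preds = {row["id"]: float(row["target"]) for row in csv.DictReader(f)}
    # Match predictions to true labels by id, then apply the metric.
    errors = [
        (math.log1p(preds[i]) - math.log1p(t)) ** 2 for i, t in truth.items()
    ]
    return math.sqrt(sum(errors) / len(errors))

# e.g. rmsle_from_files("RegularSeasonRelease.csv", "my_submission.csv")
```

If the value you compute matches the score shown on the platform, your Regular Season evaluation checks out.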




Join our private community in Slack

Keep up to date by participating in our global community of data scientists and AI enthusiasts. We discuss the latest developments in data science competitions, new techniques for solving complex challenges, AI and machine learning models, and much more!

We'll send an invitation link to your email immediately.