This is perhaps the most important news of all: we have a new CEO, who has spent the last 3 months at DataSource.ai getting to know our internal processes, our competitions, our community, and the technology we work with. This is a huge milestone, since he brings more than 20 years of experience in tech companies and startups, working with Fortune 500 companies such as IBM, Cisco, and AT&T, and is based in San Francisco, CA, in the heart of Silicon Valley. His name is Dimitry Kushelevsky; you can contact him on LinkedIn or by email at [email protected]
Together with Dimitry, our main goal is to fulfill our mission of democratizing Artificial Intelligence for small and medium businesses. We also want to build a great culture, with a non-technical leader to grow the team, and to attract sponsors for our data science competitions, bringing those companies value through the results of the Machine Learning models submitted by our community and, as a result, offering prize money constantly and consistently to the whole community. Welcome, Dimitry!
So far we have held 6 competitions, and we are in the middle of the seventh. Over a journey of more than a year we have learned a great deal about competitions: how they work in detail, how to host them, how to evaluate them, how to automate tasks, and much more. At the same time we have learned from you, from those who have won competitions and those who have filled out our feedback forms. We would like to take this opportunity to thank you for that!
Based on this knowledge we have made a number of changes that are worth sharing with you.
Discussions within competitions
Maximum of 50 submissions per day
If a competitor is sending that many models per day, they are probably doing it automatically, trying to overfit Test.csv, which is good neither for that competitor nor for the others.
Remember to always choose your best models to submit, so you don't have to wait until the next day. As an additional tip, we recommend making different splits of the data in Train.csv to serve as hold-out test sets, so you can simulate new unseen-data scenarios and run the same evaluation metrics the competition uses on them. This way you will be more confident about the likely result before you send the csv file to our platform.
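The hold-out tip above can be sketched as follows. This is only an illustration: the synthetic DataFrame stands in for Train.csv, and the "target" column name, the model, and the accuracy metric are assumptions you should replace with the actual competition's data and metric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for pd.read_csv("Train.csv"): a small synthetic dataset.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
train["target"] = (train["a"] + train["b"] > 0).astype(int)

X, y = train.drop(columns=["target"]), train["target"]

# Several random splits act as independent hold-out test sets,
# simulating unseen data before you submit to the platform.
scores = []
for seed in (0, 1, 2):
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_ho, model.predict(X_ho)))

print("hold-out scores:", [round(s, 3) for s in scores])
```

If the hold-out scores vary widely across splits, that is a warning that your public leaderboard score may not generalize.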
Competition completion process
This is perhaps the most important change we have made within the platform, so pay close attention.
The normal process of participation within the competition is as follows:
- You download the Train.csv dataset.
- You perform exploratory data analysis (EDA) and build a baseline model.
- You run .predict on the Test.csv dataset.
- You create a csv following the guidelines of the SampleSubmission.csv file.
- You upload the csv to our platform to obtain a score.
- You appear on the public leaderboard.
- You keep improving the model with advanced techniques and test different models.
- You repeat the submission process.
- You get different scores (and improve them).
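As a rough sketch, one pass through the loop above might look like this. The file layout, the feature names, and the "id"/"target" column names are assumptions for illustration; always follow the column format of the competition's actual SampleSubmission.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-ins for the competition files; in practice you would use
# pd.read_csv("Train.csv") and pd.read_csv("Test.csv").
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
train["target"] = (train["f1"] > 0).astype(int)
test = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
test.insert(0, "id", range(len(test)))

# Fit a baseline model and predict on the test set.
model = LogisticRegression().fit(train[["f1", "f2", "f3"]], train["target"])
preds = model.predict(test[["f1", "f2", "f3"]])

# Build the submission with the same columns as the sample file
# (assumed here to be "id" and "target"), then upload the csv.
submission = pd.DataFrame({"id": test["id"], "target": preds})
submission.to_csv("submission.csv", index=False)
```

From here you iterate: improve the model, regenerate submission.csv, and upload again to see your new score.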
This is the normal process, but it has an overfitting problem: the model with the best score has likely been overfitted to the data in Test.csv. That is why we have decided to introduce a new dataset, released only at the end of the competition, which acts as a "real life" dataset the model has not been overfitted to. We call this dataset FinalTest.csv. The process to send the model is as follows:
- Once the date is reached (see the competition timeline), the FinalTest.csv dataset is enabled.
- You download it to your environment.
- You choose your best model (the one that has given you the best score so far on Test.csv).
- You run .predict on FinalTest.csv.
- Be careful: you will have only ONE chance to send this last model, so choose well.
- You create the csv following the guidelines of the SampleSubmission.csv file. In the final form (via the Submit Final Model button) you must include:
- The csv to obtain the score
- The .ipynb (Notebook)
- You will no longer need to send the Notebook to our email address.
- You will see your final score on the screen, but it will not be reflected immediately on the private leaderboard.
- You will have a period of one week to make this submission.
- At the end of that week, all scores and the private leaderboard will be revealed.
- The private leaderboard is the one we will use for points, gifts and/or cash prizes.
Timeline of completion
- Until April 14 you can use the Test.csv dataset to make your predictions and be on the public leaderboard.
- The final submission window opens the following day and closes definitively on April 21. During this one-week period you must send your predictions on the FinalTest.csv dataset.
Even so, you can still download the datasets, play with them, have fun, learn, submit your results, get scores, and appear on the public leaderboard. This is a good way to keep practicing and demonstrating your data science skills!
Certificate of participation in the competitions
Here is an example of the certificate in PDF
Your public profile
We have also made some changes to public profiles: they now show your participation in competitions, so you can share it with recruiters and the community.