Keyword Recency Prediction

Prizes
There are TWO winners for this competition, awarded on the basis of private leaderboard rank. 1st place: USD $1,500. 2nd place: USD $500.
Competitors
  • MockTurtle
  • satyacode
  • Nikos_DataSource
  • sanial
  • madara
  • roberto.holgado-en
175 competitors. Published: 07/01/2021
Total Prize
$2,000

Public Leaderboard

| Rank | Data Scientist | Country | # Submissions | Last Submission | Best Score |
|------|----------------|---------|---------------|-----------------|------------|
| 1 | ottob | Mexico | 47 | almost 2 years ago | 0.129871068157291 |
| 2 | sammy786 | India | 19 | almost 2 years ago | 0.1298826714927 |
| 3 | margeperez | Colombia | 25 | almost 2 years ago | 0.12997125018146 |
| 4 | Diegonov-en | Chile | 33 | over 1 year ago | 0.130553676070521 |
| 5 | Amanda | Colombia | 4 | almost 2 years ago | 0.132681725419753 |
| 6 | Hans Hidalgo Alta | Peru | 36 | almost 2 years ago | 0.133900643307075 |
| 7 | edcalderin | Argentina | 25 | over 1 year ago | 0.133927686141685 |
| 8 | c.olate-en | Chile | 92 | over 1 year ago | 0.134179987541261 |
| 9 | Valentino | Colombia | 3 | almost 2 years ago | 0.134381296411218 |
| 10 | Felipe Nunez B | Chile | 10 | almost 2 years ago | 0.135085005321184 |
| 11 | Santiago Serna | Colombia | 17 | almost 2 years ago | 0.135417679209297 |
| 12 | jayantsogikar | India | 21 | almost 2 years ago | 0.135717616428963 |
| 13 | Elizabeth Dominguez | Colombia | 2 | almost 2 years ago | 0.135961431282156 |
| 14 | Pablo Lucero | Ecuador | 22 | almost 2 years ago | 0.13684530739476 |
| 15 | M33ssi | Peru | 2 | almost 2 years ago | 0.136863276800597 |
| 16 | Sidereus | Colombia | 3 | over 1 year ago | 0.137948248431841 |
| 17 | kudasov.dm | Russian Federation | 5 | over 1 year ago | 0.140518083813079 |
| 18 | romazepa | Russian Federation | 8 | over 1 year ago | 0.141083277373482 |
| 19 | Emmy | Uganda | 9 | over 1 year ago | 0.141575625713195 |
| 20 | Sean Robinson | United Kingdom | 3 | almost 2 years ago | 0.142655453393989 |
| 21 | María Paula | Canada | 6 | almost 2 years ago | 0.142901359138741 |
| 22 | simoncerda-en | Chile | 2 | almost 2 years ago | 0.142914490431918 |
| 23 | Victor Andres De La Puente Ancco-en | Peru | 1 | almost 2 years ago | 0.143261959115854 |
| 24 | Luis Salazar-en | Colombia | 62 | over 1 year ago | 0.143327399881598 |
| 25 | ESTHER PINILLA | Spain | 37 | over 1 year ago | 0.143911537956907 |
| 26 | Johan David Erazo Avila-en | Colombia | 14 | almost 2 years ago | 0.144053885012541 |
| 27 | diegoethi-en | Chile | 1 | almost 2 years ago | 0.144615069713798 |
| 28 | Pablo Neira Vergara | Chile | 4 | almost 2 years ago | 0.147770188750051 |
| 29 | Lautaro Pacella-en | Argentina | 1 | almost 2 years ago | 0.147794791763907 |
| 30 | Guillermo Ruiz | Peru | 2 | almost 2 years ago | 0.14783199403615 |
| 31 | Adrian Monsalve | Guatemala | 3 | almost 2 years ago | 0.149116721132589 |
| 32 | Anurag Maji | India | 54 | almost 2 years ago | 0.149716692736577 |
| 33 | Diego Fernando Rua-en | Mexico | 2 | almost 2 years ago | 0.167378124601123 |
| 34 | Bharathi | India | 11 | almost 2 years ago | 0.217223446743259 |



Timeline

Competition start: 2021/07/03 00:01:00
Competition closes: 2021/09/03 23:59:00
Final submission limit: 2021/09/10 23:59:00

This competition runs for a total of two months, during which you can make submissions and receive results automatically. Once the first part of the competition is over, you will have one week to choose your best model and submit it to be scored and considered for the cash prize.


Description

Welcome to our next exciting competition!

For this challenge, we have teamed up with Battelle Memorial Institute - one of the most respected names in the global scientific & research community - to launch a Data Science competition that can help to dramatically accelerate the pace of global innovation. The goal of this project is to break down several barriers that currently stand in the way of advanced research publications getting noticed, and receiving prompt recognition from the world's brightest minds. This competition will also offer cash prizes to the authors of the top two ML models, as determined by our platform's evaluation algorithm. Please read on for more details, and good luck!

About Battelle (battelle.org)


Battelle is solving the world’s most pressing challenges. We deliver when others can’t. We conduct research and development, manage laboratories, design and manufacture products, and deliver critical services for our clients – whether you are a multi-national corporation, a small start-up organization or a government agency. We are valued for our independence and ability to innovate.

We are part of a community working to encourage the discovery of new and interesting research in Artificial Intelligence and Machine Learning, especially in languages other than English. Much of the research being done in these fields is easily available on the web through sites like Arxiv.org, but many interesting discoveries are happening every day in different corners of the internet that may take time to identify and bring to the attention of the rest of the community. 

This is especially true of research in a language other than English, which may easily be missed by much of the community. We are passionate about finding the best current research and identifying trends so that the cutting edge can continue to be pushed. To push in that direction, we devised a problem that attempts to measure when new ideas are being discussed, in any language. Based on a metric for the recency of keywords, how can we identify when a research paper is bringing forth new ideas, so that we can better isolate them?
 
The Problem
The data is a collection of 42,912 abstracts from recent publications, along with the language and year of publication. The abstracts have author-given keywords associated with them, and they have been given scores based on the average number of years that those keywords show up in our database. The goal of this competition is to build a model that takes in the abstract, the language, and the publication year, and predicts the recency score. These models will be scored based on the accuracy of their predictions.
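As a rough illustration of the task setup, a minimal baseline could combine a TF-IDF representation of the abstracts with a ridge regression. This is only a sketch under assumptions: the toy texts and scores below are made up, and the column names simply follow the data description on this page, not the actual competition files.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the Abstract and total_rel_score columns described below.
abstracts = [
    "deep learning for computer vision",
    "classical statistical methods revisited",
    "transformer models for multilingual text",
    "a survey of older regression techniques",
]
scores = [0.9, 0.2, 0.95, 0.1]

# TF-IDF over the abstract text feeding a ridge regressor.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(abstracts, scores)

# Clip to [0, 1]: the target is in that range, and RMSLE rejects negatives.
preds = np.clip(model.predict(abstracts), 0.0, 1.0)
print(np.sqrt(mean_squared_log_error(scores, preds)))
```

Language and year could be appended as additional features (e.g. one-hot language plus a scaled year column) once the text pipeline works.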


Evaluation

The model is evaluated using RMSLE (Root Mean Squared Logarithmic Error): we take the square root of the MSLE (Mean Squared Logarithmic Error) metric implemented in scikit-learn.

If you want to know more about the MSLE metric that scikit-learn computes, you can find it here: https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-log-error

RMSLE = sqrt( (1/N) * Σᵢ ( log(ŷᵢ + 1) − log(yᵢ + 1) )² )

Where:

N = number of rows of the dataset Test.csv
yᵢ = true value
ŷᵢ = predicted value
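As a sketch, the metric can be reproduced directly with scikit-learn's `mean_squared_log_error`; the true and predicted values below are made up for illustration. Note that this function raises an error for negative inputs, so predictions should be kept nonnegative.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up true and predicted recency scores.
y_true = np.array([0.50, 0.30, 0.80])
y_pred = np.array([0.45, 0.35, 0.70])

# RMSLE is the square root of scikit-learn's MSLE.
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(rmsle)
```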


Rules

Competition Rules

  1. The code should not be shared privately. Any code that is shared must be available to all participants of the competition through the platform.
  2. The solution should use only publicly available open source libraries
  3. If two solutions get identical scores in the ranking table, the tie-breaker will be the date and time of the submission (the first solution submitted will win).
  4. We reserve the right to request any user's code at any time during a challenge. You will have 72 hours to submit your code following the code review rules.
  5. We reserve the right to update these rules at any time.
  6. Your solution must not infringe the rights of any third party and you must be legally authorized to assign ownership of all copyrights in and to the winning solution code to the competition host/sponsor.
  7. Competitors may register and submit solutions as individuals (not as teams, at least for now).
  8. Maximum 50 solutions submitted per day.

At the end of the competition you must submit the complete model in .ipynb (Jupyter Notebook) format; no other formats will be accepted. Normally, you'll have one week after the end of the competition to send it through our "Submit Final Model" button. This model will be used to compute the real final evaluations, so the Private Leaderboard could change when the final private evaluation is shown.


There are TWO winners for this competition, awarded on the basis of private leaderboard rank.

  • 1st place: USD $1,500
  • 2nd place: USD $500

For this competition we also want to give a very special gift to the 3rd and 4th places!

We will ship this prize to any country or city in the world! (made by https://www.devwear.co/)


*The hoodie is available for men and women (unisex)


Total Score Scale

These will be the awards in platform points once the competition is over:

  • 1st Place: 30,000 pts + USD $1,500
  • 2nd Place: 29,000 pts + USD $500
  • 3rd Place: 28,000 pts + Python Hoodie (delivery to any city around the world)
  • 4th Place: 27,000 pts + Python Hoodie (delivery to any city around the world)
  • 5th Place: 26,000 pts
  • 6th Place: 25,000 pts
  • 7th Place: 24,000 pts
  • 8th Place: 23,000 pts
  • 9th Place: 22,000 pts
  • 10th Place: 21,000 pts

Total Prize: $2,000


The data is a collection of 32,184 abstracts from recent publications, along with the language and year of publication. The abstracts have author-given keywords associated with them, and they have been given scores based on the average number of years that those keywords show up in our database. The goal of this competition is to build a model that takes in the abstract, the language, and the publication year, and predicts the recency score.

Data fields
  • Language: language in which the papers are written
  • Year: year of paper publication
  • Abstract: paper abstract
  • Title: paper title
Target var
  • total_rel_score: metric calculating recency

The total_rel_score was calculated using the year of publication of the paper and the year in which the paper's keyword first appeared in another document. Essentially a value close to 1 means that it is a recent paper (given its keywords), and a value close to 0 means that it is an older paper. The task is to predict this value for a given set of features (Language, Year, Abstract and Title).

Submission file
For each "id" in the test set, you must predict a label for the "total_rel_score" variable. The file should contain a header and have the following format:

id,total_rel_score
1,0.545714
2,0.635714
3,0.532713
4,0.335710
5,0.135714
6,0.535710
....
10725,0.187
10726,0.525
10727,0.014
10728,0.690

For this competition stage, you need to send your submission file with these details:

# of columns: 2
Column names: id,total_rel_score
# of rows: 10729
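A conforming file can be written with pandas along these lines; the predictions below are random placeholders. Passing `float_format` keeps values in fixed-point form and avoids the scientific-notation values (e.g. 1e-04) that the platform's submission checker rejects.

```python
import numpy as np
import pandas as pd

# Placeholder predictions: 10,729 rows, ids 1..10729, per the spec above.
preds = np.random.default_rng(0).uniform(0.0, 1.0, 10729)
sub = pd.DataFrame({"id": np.arange(1, 10730), "total_rel_score": preds})

# float_format="%.6f" forces fixed-point output (no 1e-04 style values).
csv_text = sub.to_csv(index=False, float_format="%.6f")
with open("submission.csv", "w") as f:
    f.write(csv_text)
```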

This competition is finished


13 Comments
  1. Daniel Morales
    Daniel Morales
    over 1 year ago
Hi SDG. Once we have the final results, a private leaderboard will be shown. Regards
  2. SDG
    SDG
    over 1 year ago
Hi. Will a private leaderboard be shown, or are the public scores the final results?
  3. Daniel Morales
    Daniel Morales
    over 1 year ago
Hi Hydroinfmtk, yes, you can send a different model; however, keep in mind that you only get one final submission for the final model, and you must send the Notebook that supports that model. Please review the competition rules. Regards
  4. Daniel Morales
    Daniel Morales
    over 1 year ago
    Hi jayantsogikar, thanks for letting us know! we have already fixed and uploaded the proper dataset. Please check and let us know if everything is ok now
  5. jayantsogikar
    jayantsogikar
    over 1 year ago
    Could you inform us about what we should do about the 'FinalTest.csv' file as it is similar to the 'SampleSubmission.csv' file present before
  6. Hydroinfmtk-en
    Hydroinfmtk-en
    over 1 year ago
A question: can the final model submitted be different from those evaluated in the initial stage of the competition? That is, can a new model be sent, or is this just a formality of submitting one of the models already ranked on the public leaderboard?
  7. Bharathi
    Bharathi
    almost 2 years ago
    I'm just curious. Is it impossible to get best results without using advanced models like BERT? I haven't built many NLP models, so just want to know if any of you have got an error less than 0.15 without using BERT or similar transformer based methods? 
  8. Daniel Morales
    Daniel Morales
    almost 2 years ago
    Hi Bharathi

Thanks for reaching out to us. We inspected the file named "submission_df (5).csv" and found a number in scientific notation on line 6,579 (we sent you an email with the evidence). Please be aware of this kind of notation: it contains letters or dashes (e.g. 1e-04), so those values are not parsed as numeric and the evaluation metric cannot compute a result.

    If you have any other questions, please let us know 

    Regards!
  9. Bharathi
    Bharathi
    almost 2 years ago
    Hi, I am unable to make a submission since I keep getting this error: 
    Error: You have the following error in your submission file:
    
    * Scientific notation: The system does not allow scientific notation values similar or equal to this syntax: '5.54538E+11'
    
    Please make sure your file is correct and run the submission again.
    But there are no scientific notations in my CSV file at all. I have checked it. And have also rounded my decimal values to 7 in pandas. why does this happen? Please help me
  10. ottob
    ottob
    almost 2 years ago
    Personally, I'm trying to use Keras for Bag-of-words. I'll see if I can use other NLP advanced methods like BERT
  11. 5hr3ya5h
    5hr3ya5h
    almost 2 years ago
    Which algos are you guys using?
  12. Daniel Morales
    Daniel Morales
    almost 2 years ago
Hi Santiago. Thanks for letting us know. The problem has now been fixed; the system should validate negative and null values automatically. The file you had submitted contained a single negative value; we changed it to positive and ran the metric manually, which gave a result of 0.13573531711075593 for that file. Keep going, we hope to see you in the top spots when the competition ends!
  13. Santiago Serna
    Santiago Serna
    almost 2 years ago
Hi, there is a problem with the metric evaluation: if there is any negative value, the result comes out as 0.
