Machine Learning Dataset #1:

predicting engagement in video ads


We are releasing a first dataset containing 3 million labeled lines (advertising auctions). This dataset can be freely used as a resource for Machine Learning courses.

It was originally created by Cyrille Dubarry and previously used as competition material for a Machine Learning class he gives at the École Polytechnique. We thought it would be nice to share it with a wider audience!

This dataset can be used to predict the time a user will spend watching a video ad. Each line is identified by its auction_id and describes one impression in a given context, with user, publisher, and advertiser information.
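Since the target is a duration, the task can be framed as a regression problem. The sketch below shows one possible reference point: a predictor that always returns the mean training target, scored with root mean squared error. The choice of RMSE is an assumption (this page does not name an evaluation metric), and the values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mean_baseline(train_targets):
    """Return a predictor that always outputs the mean training target."""
    mean = sum(train_targets) / len(train_targets)
    return lambda n: [mean] * n


# Made-up seconds_played values, not taken from the dataset.
train = [0.0, 5.0, 10.0, 15.0]
test = [5.0, 10.0]

predict = mean_baseline(train)
baseline_error = rmse(test, predict(len(test)))
print(baseline_error)
```

Any model trained on the real data should beat this kind of constant baseline to be worth considering.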


Columns description:

  • auction_id – unique id for identifying each line
  • timestamp – the timestamp (in seconds) of the ad impression
  • creative_duration – the total duration of the video that has been played
  • campaign_id – the advertising campaign id
  • advertiser_id – the advertiser id
  • placement_id – the id of a zone in the web page where the video was played
  • placement_language – the language of this zone
  • website_id – the corresponding website id
  • referer_deep_three – the URL of the page where the video was played, truncated at its 3rd level
  • ua_country – the country of the user who saw the video
  • ua_os – the user Operating System
  • ua_browser – the user internet browser
  • ua_browser_version – the user browser version
  • ua_device – the user device
  • user_average_seconds_played – the average time the user spent watching video ads in the past. It can be null if the user never watched any ad.
  • seconds_played – the observed time the video was watched. This is the quantity we are trying to predict.
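The columns above can be turned into typed records with a few lines of standard-library Python. This is a minimal sketch under several assumptions that this page does not confirm: the file has a header row using the column names listed above, fields are comma-separated, the three duration columns are numeric, and a missing user_average_seconds_played is encoded as an empty field or a marker such as "null". The sample rows are made up.

```python
import csv
import io

# Assumption: these three columns hold numeric durations in seconds.
FLOAT_COLUMNS = ("creative_duration", "user_average_seconds_played", "seconds_played")
# Assumption: how missing values are written in the file.
NULL_MARKERS = {"", "null", "NULL", "NaN"}


def parse_row(raw):
    """Copy one csv.DictReader row, converting durations to float or None."""
    row = dict(raw)
    for name in FLOAT_COLUMNS:
        value = row.get(name)
        if value is None or value.strip() in NULL_MARKERS:
            row[name] = None
        else:
            row[name] = float(value)
    return row


def read_rows(text_stream):
    """Yield parsed rows from any text stream that starts with a header line."""
    for raw in csv.DictReader(text_stream):
        yield parse_row(raw)


# Tiny made-up sample with only a few of the columns.
SAMPLE = (
    "auction_id,creative_duration,user_average_seconds_played,seconds_played\n"
    "a1,30,,4.5\n"
    "a2,15,7.25,15\n"
)
rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0])
```

Keeping the null as None rather than 0 preserves the distinction between a user with no history and a user who skipped every ad immediately.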


CC0 1.0 Universal (CC0 1.0) – Public Domain Dedication

Get the dataset

File description:

  • dataset.csv.gz – a gzip .csv file containing 3 million labeled lines (147.16 MB)
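A gzip-compressed CSV can be streamed row by row without unpacking it to disk first. The sketch below uses only the Python standard library; the helper name iter_dataset is hypothetical, and it assumes the file is UTF-8 encoded with a header row. The demo runs on a tiny in-memory sample rather than the real file.

```python
import csv
import gzip
import io


def iter_dataset(source):
    """Stream rows as dicts from a gzip-compressed CSV.

    `source` may be a file path (e.g. "dataset.csv.gz") or a binary
    file object, both of which gzip.open accepts.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


# Made-up sample compressed in memory, standing in for the real file.
sample = gzip.compress(b"auction_id,seconds_played\na1,4.5\na2,15\n")
demo_rows = list(iter_dataset(io.BytesIO(sample)))
print(len(demo_rows))
```

With the downloaded file, the same generator would be called as iter_dataset("dataset.csv.gz"), which keeps memory use flat even for 3 million lines.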

Machine Learning at Teads

Digital advertising is an astonishing Machine Learning playground: it combines data-rich activities, scaling challenges, and a lot of automation, especially since the rise of programmatic buying and selling of ads in real time.

If you want to know more about our Machine Learning stack and use cases, you can have a look at our blog articles on the subject and watch the talk Cyrille Dubarry and Han Ju gave at Spark Summit Europe 2018: Machine Learning for AdTech in action.
