PhD position Unsupervised Deep Representation Learning for Video

Deadline: 31 May 2021

In recent years, with the advent of deep learning, video understanding, be it action or activity classification, video object recognition or object tracking, has benefited significantly. Interestingly, the majority of progress has focused on stylized, short video segments of at most a few seconds, which can be, and are, strongly supervised. However, strong supervision is a significant constraint. It is already a constraint for learning representations from images, where much research is currently undertaken on contrastive learning and similar methodologies. More importantly, video carries a time signal, which is presumably a strong signal for supervision, provided it is exploited correctly, for instance by taking into account the arrow of time, spatiotemporal continuity, or causality. In fact, for video, and for spatiotemporal data in general, unsupervised learning is even more critical, given that:

  • the longer the video, the less relevant it is to have a single manually provided label that corresponds to the whole duration of the video;
  • be it for short or long videos, human annotations bias the model towards specific semantics, which are not necessarily the appropriate ones when training video deep networks. For one, motion or dynamic changes often have little to do with the attached label, and it is virtually impossible to manually annotate motion or dynamic changes in complex, in-the-wild videos;
  • current strong supervision does not exploit any of the spatiotemporal redundancy and continuity present in videos. As a result, training a standard video deep network nowadays requires an exorbitant amount of GPU compute, and, most importantly, there is no sign that this will become more manageable with current approaches. It should be simpler than that;
  • strong supervision gives little information in the way of causality. However, taking cause and effect into account is important for robust models and good generalization. Thankfully, videos and recordings over time offer great opportunities for exploiting cause and effect, also for learning models without manually provided labels;
  • it is very likely that strong and generalizable learning, even at an image level, can only be attained with unsupervised learning on spatiotemporal data and videos. What moves is what acts, and what acts is what influences the world and must be modelled;
  • current unsupervised learning methods from core machine learning are not designed to take the time dimension into account or to exploit it optimally;
  • besides the aforementioned arguments, the practical argument that unannotated video data vastly outnumber annotated data is also relevant and important.

These are only a few basic points on why unsupervised learning in videos is of critical value, presumably even more so than for images. These differences alone, however, already show that the next generation of deep neural networks for video needs fundamentally different and more nuanced learning objectives.
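To make concrete how temporal continuity can serve as a free supervisory signal, here is a minimal sketch of a contrastive (InfoNCE-style) objective in which clip embeddings that are temporally adjacent in the same video are treated as positive pairs, and all other clips in the batch act as negatives. The embedding dimensions, temperature, and synthetic data below are illustrative assumptions, not part of the project description:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Contrastive loss: each anchor should match its own positive
    (same video, next time step) against all other clips in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature          # similarity of every anchor to every candidate
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal; off-diagonal clips serve as negatives
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 16))                       # stand-in embeddings of 8 clips at time t
next_clips = clips + 0.05 * rng.normal(size=(8, 16))   # embeddings at t+1: close, by continuity
loss = info_nce(clips, next_clips)
```

Because temporally adjacent clips yield similar embeddings, the matched loss is low, whereas mismatched (shuffled) pairs yield a high loss; minimizing this objective therefore pulls a network towards temporally consistent representations without any manual labels.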

In this PhD position, we will research unsupervised deep representation learning for videos and complex spatiotemporal sequences. Our lab has significant experience in long-term video action classification and tracking, showing that the decomposition of long spatiotemporal convolutions (Hussein, Gavves, Smeulders, 2019a), spatiotemporal graphs (Hussein, Gavves, Smeulders, 2019b), and Siamese networks (Tao, Gavves, Smeulders, 2015) are key to scalable learning in video. Inspired by this prior work, and given the fundamental challenges outlined above, the research includes overarching questions such as:

  • What are optimal unsupervised learning objectives for deep neural networks for short videos?
  • What are optimal unsupervised learning objectives for deep neural networks for long and complex videos?
  • Can unsupervised learning objectives exploit the spatiotemporal redundancy, continuity and causality to pretrain deep spatiotemporal neural networks?
  • What are optimal architectures to facilitate unsupervised representation learning in videos?
  • Is it possible to rely on unsupervised representation learning while still maintaining a reasonable computational budget?

You will be supervised by Dr. E. Gavves, Associate Professor at the University of Amsterdam (UvA). This project is financed by the H2020 ERC Starting Grant ‘EVA: Expectational Visual Artificial Intelligence’ and the NWO VIDI Grant ‘TIMING: Learning Time in Videos’.

What are you going to do

You will carry out research and development in the area of Deep Machine Learning and Vision. The research is embedded in the VISlab group at the UvA. Your tasks will be to:

  • develop new deep machine learning and/or computer vision methods for unsupervised deep representation learning for video;
  • collaborate with other researchers within the lab;
  • regularly present internally on your progress;
  • regularly present intermediate research results at international conferences and workshops, and publish them in proceedings (CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR) and journals (PAMI, IJCV, CVIU);
  • assist in relevant teaching activities;
  • complete and defend a PhD thesis within the official appointment duration of four years.
