In this project we investigate the performance of the SetCon model on realistic video footage. Therefore, we implemented the model in PyTorch and tested the model on two example videos.

Overview

Contrastive Learning of Object Representations

Supervisor:

Institutions:

Project Description

Contrastive Learning is an unsupervised method for learning similarities or differences in a dataset, without the need of labels. The main idea is to provide the machine with similar (so called positive samples) and with very different data (negative or corrupted samples). The task of the machine then is to leverage this information and to pull the positive examples in the embedded space together, while pushing the negative examples further apart. Next to being unsupervised, another major advantage is that the loss is applied on the latent space rather than being pixel-base. This saves computation and memory, because there is no need for a decoder and also delivers more accurate results.

eval_3_obj

In this work, we will investigate the SetCon model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. [1] (Paper) The SetCon model has been published in November 2020 by the Google Brain Team and introduces an attention-based object extraction in combination with contrastive learning. It incorporates a novel slot-attention module [2](Paper), which is an iterative attention mechanism to map the feature maps from the CNN-Encoder to a predefined number of object slots and has been inspired by the transformer models from the NLP world.

We investigate the utility of this architecture when used together with realistic video footage. Therefore, we implemented the SetCon with PyTorch according to its description and build upon it to meet our requirements. We then created two different datasets, in which we film given objects from different angles and distances, similar to Pirk [3] (Github, Paper). However, they relied on a faster-RCNN for the object detection, whereas the goal of the SetCon is to extract the objects solely by leveraging the contrastive loss and the slot attention module. By training a decoder on top of the learned representations, we found that in many cases the model can successfully extract objects from a scene.

This repository contains our PyTorch-implementation of the SetCon-Model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. Implementation is based on the description in the article. Note, this is not the official implementation. If you have questions, feel free to reach out to me.

Results

For our work, we have taken two videos, a Three-Object video and a Seven-Object video. In these videos we interacted with the given objects and moved them to different places and constantly changed the view perspective. Both are 30mins long, such that each contains about 54.000 frames.

eval_3_obj
Figure 1: An example of the object extraction on the test set of the Three-Object dataset.

We trained the contrastive pretext model (SetCon) on the first 80% and then evaluated the learned representations on the remaining 20%. Therefore, we trained a decoder, similar to the evaluation within the SetCon paper and looked into the specialisation of each slot. Figures 1 and 2 display two evaluation examples, from the test-set of the Three-Object Dataset and the Seven-Object Dataset. Bot figures start with the ground truth for three timestamps. During evaluation only the ground truth at t will be used to obtain the reconstructed object slots as well as their alpha masks. The Seven-Object video is itended to be more complex and one can perceive in figure 2 that the model struggles more than on the Three-Obejct dataset to route the objects to slots. On the Three-Object dataset, we achieved 0.0043 ± 0.0029 MSE and on the Seven-Object dataset 0.0154 ± 0.0043 MSE.

eval_7_obj
Figure 2: An example of the object extraction on the test set of the Seven-Object dataset.

How to use

For our work, we have taken two videos, a Three-Object video and Seven-Object video. Both datasets are saved as frames and are then encoded in a h5-files. To use a different dataset, we further provide a python routine process frames.py, which converts frames to h5 files.

For the contrastive pretext-task, the training can be started by:

python3 train_pretext.py --end 300000 --num-slots 7
        --name pretext_model_1 --batch-size 512
        --hidden-dim=1024 --learning-rate 1e-5
        --feature-dim 512 --data-path ’path/to/h5file’

Further arguments, like the size of the encoder or for an augmentation pipeline, use the flag -h for help. Afterwards, we froze the weights from the encoder and the slot-attention-module and trained a downstream decoder on top of it. The following command will train the decoder upon the checkpoint file from the pretext task:

python3 train_decoder.py --end 250000 --num-slots 7
        --name downstream_model_1 --batch-size 64
        --hidden-dim=1024 --feature-dim 512
        --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --learning-rate 1e-5

For MSE evaluation on the test-set, use both checkpoints, from the pretext- model for the encoder- and slot-attention-weights and from the downstream- model for the decoder-weights and run:

python3 eval.py --num-slots 7 --name evaluation_1
        --batch-size 64 --hidden-dim=1024
        --feature-dim 512 --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --decoder-path "path/to/decoder.pth.tar"

Implementation Adjustments

Instead of many small sequences of artificially created frames, we need to deal with a long video-sequence. Therefore, each element in our batch mirrors a single frame at a given time t, not a sequence. For this single frame at time t, we load its two predecessors, which are then used to predict the frame at t, and thereby create a positive example. Further, we found, that the infoNCE-loss to be numerically unstable in our case, hence we opted for the almost identical but more stable NT-Xent in our implementation.

References

[1] Löwe, Sindy et al. (2020). Learning object-centric video models by contrasting sets. Google Brain team.

[2] Locatello, Francesco et al. Object-centric learning with slot attention.

[3] Pirk, Sören et al. (2019). Online object representations with contrastive learning. Google Brain team.

Owner
Dirk Neuhäuser
Dirk Neuhäuser
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.

Jittor: a Just-in-time(JIT) deep learning framework Quickstart | Install | Tutorial | Chinese Jittor is a high-performance deep learning framework bas

2.7k Jan 03, 2023
Code for the paper BERT might be Overkill: A Tiny but Effective Biomedical Entity Linker based on Residual Convolutional Neural Networks

Biomedical Entity Linking This repo provides the code for the paper BERT might be Overkill: A Tiny but Effective Biomedical Entity Linker based on Res

Tuan Manh Lai 24 Oct 24, 2022
Scalable training for dense retrieval models.

Scalable implementation of dense retrieval. Training on cluster By default it trains locally: PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py traine

Facebook Research 90 Dec 28, 2022
Disentangled Face Attribute Editing via Instance-Aware Latent Space Search, accepted by IJCAI 2021.

Instance-Aware Latent-Space Search This is a PyTorch implementation of the following paper: Disentangled Face Attribute Editing via Instance-Aware Lat

67 Dec 21, 2022
Mini-hmc-jax - A simple implementation of Hamiltonian Monte Carlo in JAX

mini-hmc-jax This is a simple implementation of Hamiltonian Monte Carlo in JAX t

Martin Marek 6 Mar 03, 2022
Paddle pit - Rethinking Spatial Dimensions of Vision Transformers

基于Paddle实现PiT ——Rethinking Spatial Dimensions of Vision Transformers,arxiv 官方原版代

Hongtao Wen 4 Jan 15, 2022
Self-supervised Product Quantization for Deep Unsupervised Image Retrieval - ICCV2021

Self-supervised Product Quantization for Deep Unsupervised Image Retrieval Pytorch implementation of SPQ Accepted to ICCV 2021 - paper Young Kyun Jang

Young Kyun Jang 71 Dec 27, 2022
Prometheus exporter for Cisco Unified Computing System (UCS) Manager

prometheus-ucs-exporter Overview Use metrics from the UCS API to export relevant metrics to Prometheus This repository is a fork of Drew Stinnett's or

Marshall Wace 6 Nov 07, 2022
ELSED: Enhanced Line SEgment Drawing

ELSED: Enhanced Line SEgment Drawing This repository contains the source code of ELSED: Enhanced Line SEgment Drawing the fastest line segment detecto

Iago Suárez 125 Dec 31, 2022
Segcache: a memory-efficient and scalable in-memory key-value cache for small objects

Segcache: a memory-efficient and scalable in-memory key-value cache for small objects This repo contains the code of Segcache described in the followi

TheSys Group @ CMU CS 78 Jan 07, 2023
Official implementation of "Motif-based Graph Self-Supervised Learning forMolecular Property Prediction"

Motif-based Graph Self-Supervised Learning for Molecular Property Prediction Official Pytorch implementation of NeurIPS'21 paper "Motif-based Graph Se

zaixi 71 Dec 20, 2022
Controlling a game using mediapipe hand tracking

These scripts use the Google mediapipe hand tracking solution in combination with a webcam in order to send game instructions to a racing game. It features 2 methods of control

3 May 17, 2022
A hybrid SOTA solution of LiDAR panoptic segmentation with C++ implementations of point cloud clustering algorithms. ICCV21, Workshop on Traditional Computer Vision in the Age of Deep Learning

ICCVW21-TradiCV-Survey-of-LiDAR-Cluster Motivation In contrast to popular end-to-end deep learning LiDAR panoptic segmentation solutions, we propose a

YimingZhao 103 Nov 22, 2022
FedTorch is an open-source Python package for distributed and federated training of machine learning models using PyTorch distributed API

FedTorch is a generic repository for benchmarking different federated and distributed learning algorithms using PyTorch Distributed API.

Machine Learning and Optimization Lab @PennState 136 Dec 23, 2022
Deep-Learning-Book-Chapter-Summaries - Attempting to make the Deep Learning Book easier to understand.

Deep-Learning-Book-Chapter-Summaries This repository provides a summary for each chapter of the Deep Learning book by Ian Goodfellow, Yoshua Bengio an

Aman Dalmia 1k Dec 27, 2022
A library that allows for inference on probabilistic models

Bean Machine Overview Bean Machine is a probabilistic programming language for inference over statistical models written in the Python language using

Meta Research 234 Dec 29, 2022
Multiple-Object Tracking with Transformer

TransTrack: Multiple-Object Tracking with Transformer Introduction TransTrack: Multiple-Object Tracking with Transformer Models Training data Training

Peize Sun 537 Jan 04, 2023
PyTorch implementation for Partially View-aligned Representation Learning with Noise-robust Contrastive Loss (CVPR 2021)

2021-CVPR-MvCLN This repo contains the code and data of the following paper accepted by CVPR 2021 Partially View-aligned Representation Learning with

XLearning Group 33 Nov 01, 2022
An self sufficient AI that crawls the web to learn how to generate art from keywords

Roxx-IO - The Smart Artist AI! TO DO / IDEAS Implement Web-Scraping Functionality Figure out a less annoying (and an off button for it) text to speech

Tatz 5 Mar 21, 2022
Trainable PyTorch reproduction of AlphaFold 2

OpenFold A faithful PyTorch reproduction of DeepMind's AlphaFold 2. Features OpenFold carefully reproduces (almost) all of the features of the origina

AQ Laboratory 1.7k Dec 29, 2022