Making Structure-from-Motion (COLMAP) more robust to symmetries and duplicated structures

Overview

SfM disambiguation with COLMAP

About

Structure-from-Motion generally fails when the scene exhibits symmetries and duplicated structures. In this repository, we implement several state-of-the-art algorithms that aim at addressing this problem, we integrate them into COLMAP, and we extensively analyze their performance. We hope that this effort can ease further research on this problem.

We focus on filtering out incorrect matches between images prior to SfM. The filtering process is done by reimplementing the ideas from Distinguishing the indistinguishable: Exploring structural ambiguities via geodesic context by Yan et al. (CVPR 2017) and Global Structure-from-Motion by Similarity Averaging by Cui et al. (ICCV 2015). We also include the experiment results from Improving Structure from Motion with Reliable Resectioning by Kataria et al. (3DV 2020) based on their implementation. We refer to these three papers as Yan's method, Cui's method, and Kataria's method respectively. This repository uses COLMAP and hloc for feature extraction and matching, and COLMAP for geometric verification and sparse reconstruction.

TLDR: No method consistently works well over all datasets with a single set of hyperparameters. Tuning the parameters for large scenes is difficult and time-consuming for all three methods.

Drop an email to Lixin Xue if you are interested in this problem and want to chat!

teaser
Results on the Alexander Nevsky Cathedral Dataset

Conclusions

Based on our experiments, we have the following observations:

  • Duplicate structures in images often lead to an excessive number of image matches and to bundle adjustment that fails to converge, both of which lengthen the reconstruction time considerably. Removing the wrong matches or initializing the poses correctly can speed up the reconstruction process significantly.
  • With a correct initial image pair and a perfect next-view selection, colmap may still output a reconstruction with many wrongly registered images. Even Kataria's method of initializing poses from reliable matches was insufficient to disambiguate some datasets. Therefore, a less noisy pose graph is necessary for a correct reconstruction.
  • Both Yan's method and Cui's method need scene-specific parameter tuning. Kataria's method generalizes better, but it still fails on several datasets even after we tune its parameters. No method consistently works well over all datasets with a single set of parameters. Tuning the parameters for certain scenes (especially large-scale ones) is difficult and time-consuming for all three methods.

Installation

# python 3.7 is required for the 'capture_output' keyword in `subprocess.run`
# which is only used in the notebooks
conda create -n sfm python=3.7 -y

# install [colmap](https://colmap.github.io/install.html).
# install [hloc](https://github.com/cvg/Hierarchical-Localization#installation)

# library for the plot of match graphs
sudo apt-get install graphviz graphviz-dev
pip install pygraphviz
conda install -c anaconda networkx -y
# for interactive control of parameters in the notebooks
conda install -c conda-forge jupyterlab ipywidgets -y
# can skip this if using colmap for visualization of 3D models
conda install -c open3d-admin -c conda-forge open3d -y

# install this library in development mode for further modifications
pip install -e .

We also provide the datasets we use via Google Drive. Please download the archive and unzip it under the folder datasets to obtain the following layout:

|---datasets
    |---heinly2014
        |---...
    |---yan2017
        |---...

Pipelines

We provide two Jupyter notebooks as examples of the complete pipelines for Yan's method and Cui's method. The two methods share a similar pipeline: first compute a score for each image pair, then remove wrong matches based on that score. After that, the filtered matches are passed to the incremental reconstruction stage. Yan's method computes tracks from the raw matches and uses the percentage of unique tracks shared between two images as the score for an image pair. Cui's method instead performs a local reconstruction around every image and derives the score for each image pair from the missing-correspondence cue.

pipeline yan pipeline cui
Similar Pipelines for Yan's method and Cui's method

1. Correspondence

First, we can extract features from images using colmap or hloc. We provide the following features, with their corresponding keywords in parentheses:

  1. colmap SIFT with default parameters (sift_default)
  2. colmap sparser SIFT features with the first octave set to be 0 (sift_sparse)
  3. SuperPoint (superpoint)
  4. D2-Net (d2net)
  5. R2D2 (r2d2)
  6. DISK (disk, still a pull request in hloc)

For SIFT features extracted by colmap, we use the exhaustive nearest neighbor matching provided by colmap.

For learned features extracted by hloc, we use the exhaustive nearest neighbor matching provided by hloc. Specifically for the SuperPoint feature, we can also use the trained SuperGlue model for matching.

Then, we use colmap matches_importer to perform geometric verification (compute two-view geometries from the matches) with different RANSAC parameters (check colmap_matching_options in options/matching_options.py).
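
For reference, the notebooks drive this step by calling colmap through subprocess.run; below is a hedged sketch of such a call. All paths are placeholders, and the RANSAC-related options should be checked against colmap_matching_options in options/matching_options.py and your COLMAP version.

import subprocess

# Hypothetical paths; adjust to your dataset layout.
database_path = "results/books/sift_default_nn_default/books.db"
match_list_path = "results/books/sift_default_nn_default/pairs.txt"

# Run colmap matches_importer to compute two-view geometries from the
# imported matches. The RANSAC options below are illustrative values only.
subprocess.run(
    [
        "colmap", "matches_importer",
        "--database_path", database_path,
        "--match_list_path", match_list_path,
        "--match_type", "pairs",
        "--SiftMatching.max_error", "4",
        "--SiftMatching.min_num_inliers", "15",
    ],
    check=True,
    capture_output=True,  # requires Python >= 3.7
)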

2. Disambiguation

Next, we can use Yan's method or Cui's method to compute the scores for all the matches. After that, we can choose to use a threshold filter, a top k filter, or a percentile filter to remove suspicious matches. We create a new database with the filtered matches and recompute two-view geometries.
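
Conceptually, the three filters are simple operations on the pairwise score matrix; here is a minimal sketch of the idea (function and argument names are illustrative, not the actual interface of scripts/filter_matches.py):

import numpy as np

def filter_matches(scores, mode="threshold", thres=0.1, k=10, percentile=50.0):
    """Return a boolean mask of image pairs to keep.

    scores: (n, n) symmetric matrix, scores[i, j] is the score of match_ij
            (higher means the pair is more likely to be correct).
    """
    n = scores.shape[0]
    keep = np.zeros_like(scores, dtype=bool)
    if mode == "threshold":
        # Keep every pair whose score exceeds a global threshold.
        keep = scores > thres
    elif mode == "topk":
        # For each image, keep only its k highest-scoring partners.
        for i in range(n):
            best = np.argsort(scores[i])[::-1][:k]
            keep[i, best] = True
        keep = keep & keep.T  # optionally require mutual top-k agreement
    elif mode == "percentile":
        # Keep pairs whose score is above the given percentile of non-zero scores.
        cut = np.percentile(scores[scores > 0], percentile)
        keep = scores > cut
    return keep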

We choose to pre-filter matches rather than post-process the reconstructed model as done in Correcting for Duplicate Scene Structure in Sparse 3D Reconstruction by Heinly et al., based on the observations stated in the Conclusions section.

3. Reconstruction

Lastly, we use colmap mapper to reconstruct the whole scene incrementally. Depending on the dataset, you can choose to fix the intrinsics from EXIF or not (check options/mapper_options.py).
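
Below is a hedged sketch of this step, again via subprocess.run; the paths are placeholders, and the exact mapper options used to fix the intrinsics from EXIF should be checked against options/mapper_options.py and your COLMAP version.

import subprocess

database_path = "results/books/sift_default_nn_default/db_yan_filtered.db"  # filtered matches
image_path = "datasets/yan2017/books/images"
output_path = "results/books/sift_default_nn_default/sparse_yan"

# Incremental reconstruction with the intrinsics kept fixed during bundle
# adjustment (illustrative option values).
subprocess.run(
    [
        "colmap", "mapper",
        "--database_path", database_path,
        "--image_path", image_path,
        "--output_path", output_path,
        "--Mapper.ba_refine_focal_length", "0",
        "--Mapper.ba_refine_extra_params", "0",
    ],
    check=True,
)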

Yan's Method

Distinguishing the Indistinguishable: Exploring Structural Ambiguities via Geodesic Context. CVPR 2017.

By Qingan Yan, Long Yang, Ling Zhang, Chunxia Xiao.

Basic Steps

The key idea is that geodesic neighbors capturing the same instance of duplicate structures usually share more matches than images of different instances of duplicate structures. With this in mind, they:

  1. generate tracks from raw matches;
  2. select the iconic set of images to summarize the whole scene by maximizing an objective function that favors completeness (#observed tracks) and penalizes repetitiveness (#tracks occurring in more than one image);
  3. split the tracks covered by the images in the iconic set into two parts: those appearing only in one image of the iconic set are defined as unique tracks, while the other tracks appearing more than once in the images of the iconic set are defined as confusing tracks.
  4. define a score for match_ij as len(unique_ij) / max(len(unique_i), len(unique_j)), i.e., the fraction of unique tracks shared by the two images (see the sketch below).
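
In code, this score is just a set-overlap ratio over the unique tracks; a simplified sketch follows (the track bookkeeping in the actual implementation is more involved):

def yan_pair_score(unique_tracks_i, unique_tracks_j):
    """Score of match_ij from step 4 above.

    unique_tracks_i / unique_tracks_j: sets of unique-track ids observed
    in image i and image j, respectively.
    """
    shared = unique_tracks_i & unique_tracks_j
    denom = max(len(unique_tracks_i), len(unique_tracks_j))
    return len(shared) / denom if denom > 0 else 0.0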

Differences between the Original Implementation and the Paper

Our implementation follows the original implementation shared by the author. However, there are some differences between the author's code and the paper:

  1. In the actual implementation, the non-iconic images are also used as the backbone of the path network. Therefore, the match graph is not "a bipartite graph with nodes respectively are iconic images and non-iconic images". The construction of the iconic set is only for the construction of the unique tracks and confusing tracks.
  2. In the original paper, the match will be kept as long as two images share enough unique tracks. However, the percentage of unique tracks is also used to filter out matches in the implementation.
  3. The $\alpha$ in formula (3) is set to 0 in the code, while the paper requires it to be larger than 0. As a consequence, the stopping criterion for constructing the iconic set differs: in the paper, the iconic set is fixed once adding any further image no longer increases the objective function (3), whereas the author's code uses a coverage threshold and stops expanding the iconic set once the objective exceeds it. This change is necessary since the objective function is monotonically non-decreasing with $\alpha = 0$.

Parameter Tuning

The original implementation has two parameters (coverage_thres and score_thres) to tune. Here is the comment from the author:

The coverage controls how many iconic images will be selected. As for small-scale indoor scenes, a large value between 0.7 and 0.9 is recommended; otherwise, for large-scale unstructured datasets, the value around 0.6 would be enough.

Parameter score_thres defines whether an image pair is acceptable. As for small-scale scenes, similarly, a large threshold (around 0.3) is recommended; otherwise, for large-scale outdoor scenes, score_thres between 0.04 and 0.1 would be a good choice.

For instance, for the Alexander Nevsky Cathedral dataset, we use coverage = 0.6 and score_thres = 0.1 to achieve well-registered 3D point clouds.

In our implementation, we expose four more parameters to tune:

  • track_degree: the minimal length of a track to be taken into account. Increasing it will discard more short tracks.
  • alpha: the $\alpha$ in formula (3) of the paper. Increasing it will require the images in the iconic set to be more distinctive.
  • minimal_views: the minimal number of shared tracks for a match to be valid. Increasing it means fewer matches will be valid.
  • ds: the data structure used to store the list of matches. You can leave it unchanged (default largearray), as the default is a good tradeoff between speed and memory. For large datasets with thousands of images, such as berliner_dom (1618 images), it is necessary to use the smallarray data structure or to limit the maximum number of keypoints per image; in that case the computation becomes extremely slow (more than 8 hours for berliner_dom) due to the large number of images and the less efficient data structure.

Cui's Method

Global Structure-from-Motion by Similarity Averaging. ICCV 2015.

By Zhaopeng Cui and Ping Tan.

Basic Steps

In this paper, the authors utilize the missing correspondences cue to remove wrong matches.

  1. For each image and its direct neighbors, compute a two-view reconstruction by triangulation with the baseline set to 1.
  2. Merge these two-view reconstructions into one local reconstruction by solving a linear system based on the scale consistency in the stellar graph.
  3. Project the merged 3D points to each neighbor and calculate its missing correspondence score.

For more details, please check Section 4 of the paper.
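
One way to picture the merging in step 2: every two-view reconstruction has its baseline fixed to 1, so the reconstructions differ from the merged local one only by per-pair scales, which can be recovered from depth ratios of shared tracks. The sketch below solves such scales in log space by least squares; it illustrates the idea and is not the exact linear system from the paper.

import numpy as np

def solve_relative_scales(scale_ratios):
    """Estimate one global scale per two-view reconstruction (in log space).

    scale_ratios: dict mapping (a, b) -> observed depth ratio of a shared
    track between two-view reconstructions a and b. We solve
    log(s_a) - log(s_b) = log(ratio) in least squares, anchoring s_0 = 1.
    """
    nodes = sorted({n for pair in scale_ratios for n in pair})
    index = {n: i for i, n in enumerate(nodes)}
    rows, rhs = [], []
    for (a, b), ratio in scale_ratios.items():
        row = np.zeros(len(nodes))
        row[index[a]], row[index[b]] = 1.0, -1.0
        rows.append(row)
        rhs.append(np.log(ratio))
    # Gauge fixing: anchor the first reconstruction's scale to 1 (log s = 0).
    anchor = np.zeros(len(nodes))
    anchor[0] = 1.0
    rows.append(anchor)
    rhs.append(0.0)
    log_s, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    return {n: float(np.exp(log_s[i])) for n, i in index.items()}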

Missing Correspondence Score

In the paper, there are two types of projected 3D points in the field of view of the image: the observed keypoints (matched features) in this image and the ones observed in other images but not in this image (missing features).

The authors take the bounding box of the matched features and calculate the percentage of missing points inside it as the missing correspondence score. For example, the two images below display the 3D points projected from image 0 into image 1 and image 17, respectively.

projected points from image 0 to image 1 projected points from image 0 to image 17

However, this formulation is not stable since the bounding box is extremely sensitive to outliers: one outlier might enlarge the box to the entire image frame. In addition, the bounding box doesn't take into account the perspective transformation between images, where a square in one image would not be a square in another image.

Therefore, we propose a more fine-grained score by using a small square box around every keypoint instead of one big bounding box around all the keypoints. Below is the visualization of the new masks for the images shown above. With this modification, the score is more reliable and distinctive for disambiguation.

projected points from image 0 to image 1 projected points from image 0 to image 17
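
In code, this per-keypoint masking boils down to rasterizing small squares around the matched keypoints and counting how many missing (projected-but-unmatched) points fall inside; below is a minimal sketch of score version 2 (the function name and signature are hypothetical, and the actual implementation handles all three score versions):

import numpy as np

def missing_correspondence_score(matched_xy, missing_xy, image_hw, square_radius=10):
    """Fraction of missing points that fall inside small squares centred
    on the matched keypoints of the target image.

    matched_xy, missing_xy: (N, 2) / (M, 2) arrays of pixel coordinates.
    image_hw: (height, width) of the target image.
    """
    h, w = image_hw
    mask = np.zeros((h, w), dtype=bool)
    for x, y in matched_xy:
        x0, x1 = max(int(x) - square_radius, 0), min(int(x) + square_radius + 1, w)
        y0, y1 = max(int(y) - square_radius, 0), min(int(y) + square_radius + 1, h)
        mask[y0:y1, x0:x1] = True
    # Count the missing points that land inside the matched-keypoint masks.
    inside = [mask[int(y), int(x)] for x, y in missing_xy
              if 0 <= int(x) < w and 0 <= int(y) < h]
    return float(np.mean(inside)) if inside else 0.0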

We further try to take the distance between matched features and missing features into account by using a Gaussian mask around the keypoints:

projected points from image 0 to image 1 projected points from image 0 to image 17

This scoring method requires a smaller threshold since the score is typically around 0.1. We did not experiment much with it, so we are not sure whether it is better than the previous version.

Parameters

  • score_version: we provide the three score formulations described above. Version 1 uses a bounding box around the matched keypoints, as in the paper. Version 2 uses a uniform square around every keypoint. Version 3 uses a Gaussian square around every keypoint.
  • min_num_valid_depths: minimal number of correctly reconstructed depths for a 3D point to be valid (the depth consistency check in the paper).
  • max_num_neighbors: maximal number of neighbors used for the reconstruction. It is used to avoid the inefficiency caused by dense clusters of images.
  • square_radius: the size of the square around each keypoint for the score versions 2 and 3.
  • parallel: whether to use multiple threads for computing the scores. The runtime statistics are not accurate in parallel mode.
  • plot: whether to plot the masks and the images with projected 3D points. For large datasets, it is advisable to disable plotting since it adds significant overhead.

Visualizations

Fine-tuned Parameters

Now we show some results of the implemented methods. The gif in the upper left displays the images in each dataset. The gif in the upper right shows the reconstruction from colmap with default parameters. The bottom-left and bottom-right gifs display the reconstructions after disambiguation with Yan's method and Cui's method, respectively.

One thing worth noticing is that the parameters differ across datasets: we effectively cheat by tuning the parameters based on the pose graphs.

Books

Books Dataset Books COLMAP

Books Yan's Method Books Cui's Method

Cereal

Cereal Dataset Cereal COLMAP

Cereal Yan's Method Cereal Cui's Method

Cup

Cup Dataset Cup COLMAP

Cup Yan's Method Cup Cui's Method

Desk

Desk Dataset Desk COLMAP

Desk Yan's Method Desk Cui's Method

(the leftmost image is misregistered by colmap, while it is corrected by either of the two methods)

Oats

Oats Dataset Oats COLMAP

Oats Yan's Method Oats Cui's Method

(both methods fail here, as the ground truth should be a single sequence rather than two parallel sequences)

Street

Street Dataset Street COLMAP

Street Yan's Method Street Cui's Method

Temple of Heaven

ToH Dataset ToH COLMAP

ToH Yan's Method ToH Cui's Method

Alexander Nevsky Cathedral

Alexander Nevsky Cathedral Dataset Alexander Nevsky Cathedral COLMAP

Alexander Nevsky Cathedral Yan's Method Alexander Nevsky Cathedral Cui's Method

Same Parameters

To investigate to what extent a set of parameters would be applicable for all datasets, we apply the parameters tuned for the Alexander Nevsky Cathedral dataset on other Internet collections of images.

Arc de Triomphe

Berliner Dom

Big Ben

Brandenburg Gate

(With a proper choice of the threshold, we can disambiguate the model into several parts.)

Church of Savior on the Spilled Blood

(With a proper choice of the threshold, we can disambiguate the model into several parts.)

Radcliffe Camera

(The correct reconstruction is split into two parts due to the lack of transitional camera views)

Kataria's Method

Here we would like to also display the results from the paper Improving Structure from Motion with Reliable Resectioning by Rajbir Kataria, Joseph DeGol, Derek Hoiem. For more details, please refer to the repository provided by the authors. We refer to this method as Kataria's method hereafter.

Based on the observation that longer tracks are more likely to contain wrong matches, the authors propose a track-length-adjusted number of matches as the criterion for next-view selection. More importantly, the initial pose of the image to be registered relies only on 3D points from reliable images instead of all triangulated points. This matters because our experiments show that a correct registration order does not necessarily lead to a correct reconstruction. The method has only two parameters: the track-length discount factor λ and the reliable-image threshold τ. Notably, the same set of parameters can work across many different scenes, greatly reducing the parameter-tuning burden compared to the two methods above.
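
As a rough illustration of the first idea, a candidate view's match count can be discounted by track length before next-view selection. The exponential weighting below is only an assumed form for illustration, not necessarily the exact formula used in the paper or in the reliable_resectioning code.

def adjusted_match_score(track_lengths, lam=0.5):
    """Discounted match count for next-view selection (illustrative only).

    track_lengths: lengths of the tracks behind the candidate image's
    2D-3D correspondences. Longer tracks are discounted because they are
    more likely to contain wrong matches.
    """
    return sum(lam ** (length - 2) for length in track_lengths)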

For a fair comparison, we inspect the changed files in the original repository and integrate them into the current version of colmap with small modifications. We run the exhaustive_matcher instead of the vocab_tree_matcher, as done for the previous methods. Since the parameters provided by the authors are tuned for OpenSfM, we also tuned the parameters (λ changed from 0.5 to 0.3, τ changed from 2.0 to 1.3) for colmap on the cup and the oats datasets. The results are shown below:

Cup

(The reconstruction on the left is with the parameters provided by the authors, while the one on the right is with the parameters tuned by us)

Oats

(The reconstruction on the left is with the parameters provided by the authors, while the one on the right is with the parameters tuned by us. Note that we did not find a set of suitable parameters for Yan's or Cui's method to disambiguate this scene)

Results on Large Scale Datasets

However, when we apply these two sets of parameters to the large-scale Internet datasets provided by Heinly et al., both give us similar reconstructions, which are somewhat inferior to what we can get from Yan's or Cui's method:

Alexander Nevsky Cathedral

(In the left reconstruction, some of the misregistered cameras should be placed in the blue circle to create a correct reconstruction like the one on the right)

Big Ben

(Note the suspicious wall in the blue circle in the left reconstruction, which should be an empty street as in the right reconstruction)

Radcliffe Camera

(This set of parameters for Kataria's method cannot distinguish the two sides of Radcliffe Camera, while Yan's method and Cui's method work)

Reproduction

For the reproduction of the above results for Kataria's method, we put the changed/added files in the reliable_resectioning folder. You can merge all the files in this directory with colmap's source code and then compile it. We also provide a bash script example for generating sparse reconstruction with the newly compiled colmap.

Codebase Walkthrough

|---datasets
    |---heinly2014
        |---...
    |---yan2017
        |---...
|---disambiguation
    |---geodesic_consistency        # code for Yan's method
    |---missing_correspondences     # code for Cui's method
    |---options     # parameters for features/matching/mapper
    |---utils       # some helper functions
    |---calculate_geodesic_consistency_scores.py        # interface for calculating match scores based on Yan's method
    |---calculate_missing_correspondences_scores.py     # interface for calculating match scores based on Cui's method
    |---extract_match_features.py                       # interface for extracting and matching features
|---reliable_resectioning
    |---src         # modified colmap source files for Kataria's method
|---results
    |---${dataset_name}
        |---${feature_type}_${matching_type}_${geometric_verification_type}
            |---plots_${parameters}             # plots for missing correspondences in Cui's method
            |---sparse                          # reconstruction using colmap without disambiguation
            |---sparse_yan_${parameters}        # reconstruction using Yan's method
            |---sparse_cui_${parameters}        # reconstruction using Cui's method
            |---db_yan_${parameters}.db         # storing matches filtered with Yan's method
            |---db_cui_${parameters}.db         # storing matches filtered with Cui's method
            |---${dataset_name}.db              # storing unfiltered matches
            |---scores_yan_${parameters}.npy    # scores for matches using Yan's method
            |---scores_cui_${parameters}.npy    # scores for matches using Cui's method
            |---...
|---notebooks
    |---street_${method_name}.ipynb   # example of running the codebase and tuning the parameters on the street dataset
|---scripts
    |---disambiguate_yan.py     # example of using Yan's method for scores
    |---disambiguate_cui.py     # example of using Cui's method for scores
    |---filter_matches.py       # example of filtering matches based on scores
    |---match_features.py       # example of extracting and matching features

Datasets

We mainly use the datasets from Yan's repo and Heinly's website, which include some datasets from Roberts et al. and Jiang et al. We packed a cleaned-up version of these datasets (with images renamed and features removed) into a zip file for download.

To experiment with other datasets, you can place new datasets under yan2017 or heinly2014 with the following structure:

|---datasets
    |---heinly2014
        |---${your_dataset_name}
            |---images
                |--- *.[jpg/png/...]

Then you can pass ${your_dataset_name} as the dataset_name argument to run the code on the new dataset.

Literature Review

Here are some relevant papers and their summaries:

  • Pose Graph: nodes are images, edges are epipolar geometries

  • Visibility Graph: nodes are images and tracks (3D points), edges are visibility

  • Some other papers

    • [Roberts CVPR 2011]: missing correspondences + timestamp cues
    • [Cohen CVPR 2012]: detect symmetries and enforce them as constraints in optimization
    • [Ceylan TOG 2013]: user-specified pattern to detect repeated patterns for facade
    • [Heinly 3DV 2014]: similar idea but faster compared to [Heinly ECCV 2014]
    • [Kataria 3DV 2020]: use track length to adjust match score for next view selection; only use reliable 3D points for image registration
  • Common techniques: Minimum Spanning Tree

Acknowledgement

This work was done by Lixin Xue, a Master's student at ETH Zurich, under the supervision of Paul‑Edouard Sarlin and Mihai Dusmanu. We would like to thank Qingan Yan for sharing the original implementation and datasets, Jared Heinly for sharing the datasets, and Rajbir Kataria for helping out with the setup of his work.
