• Publications

Werner, Christoph; Kuijper, Arjan [1. Reviewer]; Pöllabauer, Thomas Jürgen [2. Reviewer]

Neural Radiance Fields for Real-time Object Pose Estimation on Unseen Object Appearances


Darmstadt, TU, Master Thesis, 2021

I present an extension to the iNeRF (inverting Neural Radiance Fields) framework that allows mesh-free 6DoF pose estimation, even on previously unseen appearance variations of known objects, from RGB-only inputs in near real time. The iterative analysis-by-synthesis approach of iNeRF uses Neural Radiance Fields (NeRFs) to render photorealistic views from freely chosen directions. An object pose is then estimated by matching the camera pose of a synthesized view with the input image, thus "inverting" the NeRF formulation. While NeRF allows the volumetric rendering of complex scenes without 3D models, it can only represent a single static scene. Variation in scenes, due to illumination changes or different object textures, requires a model explicitly trained for these conditions. Since lighting can change quickly in any natural environment, there is no feasible way to prepare NeRF models for every expected object appearance beforehand, especially when no additional synthetic data can be generated because 3D meshes are unavailable. This, in turn, limits iNeRF use cases in realistic scenarios. Additionally, the iterative iNeRF algorithm is computationally expensive, which further reduces its range of applications. The goal of this thesis is to enhance the iNeRF method so that it can predict poses for various object appearances under changing illumination conditions, even when the specific appearance of these objects was not seen before. Furthermore, performance enhancements are intended to reduce the computation time and enable a wider variety of use cases. By introducing a variational autoencoder (VAE) into the NeRF formulation, my method allows NeRF to represent object appearances in a latent space. Instead of a single fixed scene, my NeRF model is trained to reproduce objects under a wide distribution of differing textures and illuminations, based on a single input image.
A single trained NeRF-VAE model, capable of recreating various known and unseen textures, thus replaces multiple individually trained NeRFs, which can only render already-seen objects. With this NeRF-VAE upgrade to the rendering process, iNeRF-VAE can adapt the once-static neural radiance field to reconstruct the appearance from any given input image of the object. In my experiments, I show the expressiveness of this latent vector in representing unseen object appearances on a custom dataset. I also establish my model's capability to capture fine geometric details. Additionally, I demonstrate that the introduced changes allow pose estimation on various object appearances. My approach also proves capable of predicting poses of partially occluded objects by inpainting missing parts. Finally, to move closer to real-time pose estimation, the iterative process is sped up by predicting a closer initial starting pose. I also reduce the number of rendered pixels by sampling more informative rays and introduce an early-stopping regime that prioritizes convergence speed over minor accuracy gains.
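The analysis-by-synthesis loop with early stopping described in this abstract can be sketched in a few lines. The `render` function below is only a toy stand-in for a NeRF renderer (a Gaussian blob moved by a 2-DoF "pose"), and the finite-difference gradient replaces the automatic differentiation a real iNeRF pipeline would use; all names and parameters here are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def render(pose, grid):
    # Toy stand-in for a NeRF renderer: a Gaussian blob whose
    # position is controlled by the 2-DoF "pose" (tx, ty).
    tx, ty = pose
    return np.exp(-((grid[0] - tx) ** 2 + (grid[1] - ty) ** 2))

def estimate_pose(observed, init_pose, grid, lr=0.005, max_iters=500, tol=1e-8):
    """Analysis-by-synthesis: iteratively adjust the pose until the
    rendering matches the observed image, stopping early once the
    loss improvement falls below `tol` (convergence speed is
    prioritized over minor accuracy gains)."""
    pose = np.asarray(init_pose, dtype=float)
    prev_loss = np.inf
    eps = 1e-4
    for _ in range(max_iters):
        # Photometric L2 loss and its finite-difference gradient.
        base = np.sum((render(pose, grid) - observed) ** 2)
        grad = np.zeros_like(pose)
        for i in range(len(pose)):
            p = pose.copy()
            p[i] += eps
            grad[i] = (np.sum((render(p, grid) - observed) ** 2) - base) / eps
        pose -= lr * grad
        if prev_loss - base < tol:  # early-stopping regime
            break
        prev_loss = base
    return pose

grid = np.meshgrid(np.linspace(-2, 2, 32), np.linspace(-2, 2, 32))
true_pose = np.array([0.4, -0.3])
observed = render(true_pose, grid)           # "input image"
est = estimate_pose(observed, init_pose=[0.0, 0.0], grid=grid)
```

A closer initial starting pose, as proposed in the thesis, would simply mean passing a better `init_pose`, which cuts the number of iterations before the early stop triggers.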


Rojtberg, Pavel; Pöllabauer, Thomas Jürgen; Kuijper, Arjan

Style-transfer GANs for Bridging the Domain Gap in Synthetic Pose Estimator Training


2020 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR). Proceedings

IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR) <2020, online>

Given the dependency of current CNN architectures on a large training set, the possibility of using synthetic data is alluring, as it allows generating a virtually infinite amount of labeled training data. However, producing such data is a nontrivial task, as current CNN architectures are sensitive to the domain gap between real and synthetic data. We propose to adopt general-purpose GAN models for pixel-level image translation, allowing us to formulate the domain gap itself as a learning problem. The obtained models are then used either during training or inference to bridge the domain gap. Here, we focus on training the single-stage YOLO6D [20] object pose estimator on synthetic CAD geometry only, where not even approximate surface information is available. When employing paired GAN models, we use an edge-based intermediate domain and introduce different mappings to represent the unknown surface properties. Our evaluation shows a considerable improvement in model performance compared to a model trained with the same degree of domain randomization, while requiring only very little additional effort.
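The edge-based intermediate domain mentioned in this abstract can be illustrated with a minimal sketch: mapping an image to its Sobel gradient magnitude, so that synthetic renderings and real photographs land in a shared, appearance-agnostic representation. This hand-coded mapping is purely illustrative; the paper's actual pipeline learns the translation with paired GAN models.

```python
import numpy as np

def to_edge_domain(img):
    """Map a grayscale image into a simple edge-based intermediate
    domain: Sobel gradient magnitude, normalized to [0, 1]. A
    hand-coded stand-in for illustration only."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)  # horizontal Sobel kernel
    ky = kx.T                                  # vertical Sobel kernel
    pad = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(win * kx)
            gy[i, j] = np.sum(win * ky)
    mag = np.hypot(gx, gy)
    return mag / mag.max() if mag.max() > 0 else mag

# A synthetic "render" with a hard vertical boundary: the edge map
# keeps only the contour, discarding the unknown surface appearance.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = to_edge_domain(img)
```

Because both domains are reduced to contours, the unknown surface properties of the CAD-only objects no longer dominate the representation, which is the motivation the paper gives for this intermediate step.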


Pöllabauer, Thomas Jürgen; Rojtberg, Pavel [1. Examiner]; Kuijper, Arjan [2. Examiner]

STYLE: Style Transfer for Synthetic Training of a YOLO6D Pose Estimator


Darmstadt, TU, Master Thesis, 2020

Supervised training of deep neural networks requires a large amount of training data. Since labeling is time-consuming and error-prone, and many applications lack data sets of adequate size, research soon became interested in generating this data synthetically, e.g. by rendering images, which makes annotation free and allows utilizing other sources of available data, for example CAD models. However, unless much effort is invested, synthetically generated data usually does not exhibit exactly the same properties as real-world data. In the context of images, there is a difference in the distribution of image features between synthetic and real imagery: a domain gap. This domain gap reduces the transferability of synthetically trained models, hurting their real-world inference performance. Current state-of-the-art approaches trying to mitigate this problem concentrate on domain randomization: overwhelming the model's feature extractor with enough variation to force it to learn more meaningful features, effectively rendering real-world images nothing more than one additional variation. The main problem with most domain randomization approaches is that they require the practitioner to decide on the amount of randomization, a limitation research calls "blind" randomization. Domain adaptation, in contrast, directly tackles the domain gap without the assistance of the practitioner, which makes this approach seem superior. This work deals with training a DNN-based object pose estimator in three scenarios: first, a small amount of real-world images of the objects of interest is available; second, no images are available, but an object-specific texture is given; and third, neither images nor textures are available. Instead of copying successful randomization techniques, these three problems are tackled mainly with domain adaptation techniques.
The main proposition is the adaptation of general-purpose, widely available, pixel-level style transfer to directly tackle the differences in features found in images from different domains. To that end, several approaches are introduced and tested, corresponding to the three scenarios. It is demonstrated that in scenarios one and two, conventional conditional GANs can drastically reduce the domain gap, improving performance by a large margin compared to non-photorealistic renderings. More importantly, ready-to-use style transfer solutions improve performance significantly compared to a model trained with the same degree of randomization, even when no real-world data of the target objects is available (scenario three), thereby reducing the reliance on domain randomization.
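The domain gap as a mismatch of feature distributions, and pixel-level adaptation as a way to shrink it, can be made concrete with a toy NumPy sketch. Per-image mean and standard deviation stand in for deep CNN features, and two-moment matching stands in for the learned GAN translation; every name and number here is an illustrative assumption, not the thesis implementation.

```python
import numpy as np

def feature_stats(images):
    """Crude 'feature' summary of a domain: mean and std of per-image
    intensity statistics. Real domain-gap measures compare deep CNN
    feature distributions; this is a toy stand-in."""
    feats = np.array([[img.mean(), img.std()] for img in images])
    return feats.mean(axis=0), feats.std(axis=0)

def domain_gap(src_images, tgt_images):
    """Distance between the feature statistics of two image domains."""
    (mu_s, sd_s) = feature_stats(src_images)
    (mu_t, sd_t) = feature_stats(tgt_images)
    return float(np.linalg.norm(mu_s - mu_t) + np.linalg.norm(sd_s - sd_t))

def adapt(img, tgt_mean, tgt_std):
    """Pixel-level 'style transfer' in its simplest possible form:
    match the first two moments of the target domain. The conditional
    GANs in the thesis learn a far richer, content-preserving mapping."""
    z = (img - img.mean()) / (img.std() + 1e-8)
    return z * tgt_std + tgt_mean

rng = np.random.default_rng(0)
synthetic = [rng.uniform(0, 1, (16, 16)) for _ in range(10)]   # bright, flat renders
real = [rng.normal(0.3, 0.05, (16, 16)) for _ in range(10)]    # darker "photos"

gap_before = domain_gap(synthetic, real)
adapted = [adapt(img, tgt_mean=0.3, tgt_std=0.05) for img in synthetic]
gap_after = domain_gap(adapted, real)
```

Training the pose estimator on `adapted` rather than `synthetic` images is, in this simplified picture, what lets the model see features closer to the real-world distribution at inference time.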