Scene Understanding Meets Realistic Scene Synthesis: Novel Learning-based Technologies for Scene Digitization, Analysis and Editing

Doctorate of Saptarshi Neil Sinha


Congratulations! Saptarshi Neil Sinha, a researcher in the “Virtual and Augmented Reality” department in Darmstadt, successfully defended his dissertation, “Scene Understanding Meets Realistic Scene Synthesis: Novel Learning-based Technologies for Scene Digitization, Analysis and Editing,” on November 24, 2025.

Abstract

Understanding a scene from acquired visual data is a primary objective of computer vision. It serves as the foundation for critical tasks such as semantic segmentation, extrapolation and interpolation of sparse scene observations in real-time systems like autonomous driving, anomaly and defect detection, object tracking, material-based segmentation, and the estimation of physical properties such as lighting and material characteristics. This process involves the detection, classification, realistic reconstruction, and interpretation of physical objects and their relationships in a visual environment to enable meaningful analysis and informed decision-making. While humans can effortlessly extract insights from visual data, machine vision systems face challenges in integrating information from multiple sensory sources, such as audio, acceleration, and 3D depth sensors like LiDAR, RADAR, or Kinect.

Achieving multi-modal scene understanding, which identifies semantic connections between different sensors, is essential for creating a comprehensive representation and understanding of the scene. This task is complicated by inherent ambiguities in the data, often arising from the physical properties of the scene, such as varying material characteristics and lighting conditions. Accurately representing this data is crucial for enhanced scene understanding in tasks such as scene digitization (the inference of geometry, material properties, and lighting characteristics), scene analysis, and scene editing. These capabilities are particularly important in fields such as virtual prototyping, advertisement, digital preservation of artifacts, autonomous driving, surveillance, architectural design, creation of digital twins, immersive media development, interactive gaming, and product evaluation.

This thesis presents technologies that improve scene understanding by leveraging learning-based approaches for scene digitization, analysis, and editing. We begin by introducing scene digitization in terms of the inference of geometry, material properties, and lighting characteristics from RGB and sparse spectral data.

Our novel learning-based spectral scene digitization approach leverages 3D Gaussian Splatting (3DGS) to create a comprehensive multi-spectral explicit scene representation framework. This framework enhances the accuracy and realism of rendered outputs through improved physically-based rendering techniques that estimate reflectance and lighting for each spectrum. Additionally, it facilitates enhanced scene analysis by enabling semantic segmentation of the scene per spectrum. We also present technologies for scene digitization from sparse observations, for applications such as visualizing fragile historical artifacts in Virtual Reality. Furthermore, by employing a calibrated measurement-arm-camera (MAC) setup, we improve the accuracy and alignment of 3DGS models reconstructed from a limited number of views. Finally, to improve the management of inferred digitized materials, we introduce a learning-based framework for generating digital material assets and making them available in standardized formats.
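The core idea of such a multi-spectral 3DGS representation can be pictured as each Gaussian carrying one reflectance value per spectral band, with every band composited separately during rendering. The following minimal numpy sketch illustrates only that idea; the splat layout, the band count, and the simple front-to-back alpha compositing are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

# Hypothetical multi-spectral splat set: each Gaussian carries a depth,
# an opacity, and one reflectance value per spectral band
# (e.g. three visible bands plus near-infrared and thermal).
NUM_BANDS = 5   # assumed band count for illustration

rng = np.random.default_rng(0)
num_splats = 4
depths = rng.uniform(1.0, 5.0, size=num_splats)        # camera-space depth per splat
opacities = rng.uniform(0.2, 0.9, size=num_splats)      # per-splat alpha
reflectance = rng.uniform(0.0, 1.0, size=(num_splats, NUM_BANDS))

def composite_pixel(depths, opacities, reflectance):
    """Front-to-back alpha compositing of the splats covering one pixel,
    carried out independently for every spectral band."""
    order = np.argsort(depths)                  # nearest splat first
    color = np.zeros(reflectance.shape[1])      # accumulated value per band
    transmittance = 1.0
    for i in order:
        weight = opacities[i] * transmittance   # contribution of this splat
        color += weight * reflectance[i]
        transmittance *= (1.0 - opacities[i])   # light blocked so far
    return color

print("per-band radiance:", np.round(composite_pixel(depths, opacities, reflectance), 3))
```

The per-band loop is implicit in the vectorized arithmetic: each band is blended with the same geometric weights, which is what allows segmentation or relighting to be performed per spectrum on the shared splat geometry.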

The thesis further investigates scene editing in terms of segment-wise style and material appearance transfer, introducing methods for 3D semantic style transfer. By incorporating semantic information into the style transfer process, the research achieves superior fidelity and multi-view consistency in stylization. Furthermore, a novel hybrid pipeline for scene editing is proposed that allows learning-based scene analysis to be performed on scenes digitized with learning-based methods or high-quality scanning devices; it also includes a use case for controllable style transfer between portraits and busts. Finally, the thesis addresses scene editing in terms of data restoration based on purely synthetic data. It presents a method for synthesizing defects in visual arts and training deep learning models to restore degraded artworks. This technique effectively addresses the scarcity of ground-truth data in restoration, showcasing the potential of synthetic data to enhance restoration practice. The effectiveness of the proposed solutions is validated through extensive evaluations, showing notable improvements in the accuracy and realism of reconstructed scenes, as well as better user experiences in interactive platforms like virtual reality.
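As a rough sketch of how such synthetic training data might be produced, clean artwork scans can be degraded with procedurally generated defects so that each degraded image is paired with its clean original. The helper below is a hypothetical illustration in numpy: the name `synthesize_cracks` and the random-walk crack model are assumptions for demonstration, not the defect synthesis used in the thesis.

```python
import numpy as np

def synthesize_cracks(image, num_cracks=8, seed=0):
    """Draw random-walk crack strokes onto a copy of a clean artwork image.
    Returns the degraded image and the binary defect mask, forming one
    (degraded, clean) training pair for a restoration network.
    `image` is an HxWx3 float array with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for _ in range(num_cracks):
        y, x = rng.integers(0, h), rng.integers(0, w)   # crack start point
        for _ in range(rng.integers(50, 200)):           # crack length in pixels
            mask[y, x] = True
            y = int(np.clip(y + rng.integers(-1, 2), 0, h - 1))
            x = int(np.clip(x + rng.integers(-1, 2), 0, w - 1))
    degraded = image.copy()
    degraded[mask] *= 0.15                               # darken cracked pixels
    return degraded, mask

# Usage: turn any clean scan into a supervised training pair.
clean = np.random.default_rng(1).random((128, 128, 3))
degraded, defect_mask = synthesize_cracks(clean)
print("defect pixels:", int(defect_mask.sum()))
```

Because the clean original is known for every synthesized defect, a restoration model can be trained with full supervision even though real before/after restoration pairs are scarce.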

In conclusion, the techniques developed in this thesis — including multi-spectral learning-based scene digitization, scene digitization from sparse observations, advanced stylization and material transfer methods, and data restoration based on purely synthetic data — provide a strong foundation for future applications. These contributions are especially beneficial for projects that aim to utilize learning-based approaches in scene digitization, analysis and editing, effectively tackling the complexities of diverse datasets and enhancing the quality of visual representations across various fields. The methods proposed not only have aesthetic applications but also serve functional purposes in industries such as automotive design, visual inspection, medical applications, and smart farming, where precise material representation and scene understanding are essential.