Robust Visual Representation across Modalities in Semantic Scene Understanding / Camuffo, Elena. - (2025 Mar 24).
Robust Visual Representation across Modalities in Semantic Scene Understanding
CAMUFFO, ELENA
2025
Abstract
In recent years, advances in deep learning have greatly improved our ability to understand and interpret three-dimensional scenes, especially in fields such as autonomous driving, robotics, and virtual reality. However, achieving a consistent, high-quality representation of visual data remains a significant challenge. The issue is particularly evident with three-dimensional data, which presents intrinsic problems such as sparsity and uneven data and label distributions. Tasks like semantic scene understanding are hindered by uneven label distributions in the training data, and label shifts during training can lead to catastrophic forgetting. Uneven data distributions, in turn, cause domain shifts or misalignment in models that handle different types of sensory information. Furthermore, 3D data is complex and challenging to label accurately; consequently, tasks such as scene understanding and 3D model reconstruction often suffer from data scarcity or struggle to achieve optimal visual representations.

This thesis tackles these challenges by proposing transfer learning-based techniques for obtaining robust visual representations for semantic scene understanding across multiple modalities, with a particular focus on three-dimensional scenes. The dissertation begins with an overview of 3D scene representations and semantic understanding methods. It then addresses class imbalance in 3D semantic segmentation, introducing coarse-to-fine self-regularizing strategies that improve the representation of infrequent classes in point cloud data. Next, it focuses on continual learning, proposing novel techniques that allow models to evolve over time by incorporating new information without retraining from scratch or forgetting previously learned tasks. Multimodal learning is another key focus, with methods that integrate sensory inputs such as LiDAR and RGB images to enhance scene interpretation in complex, real-world environments. The research further investigates domain adaptation to improve model robustness on corrupted or degraded input data, ensuring more reliable performance across varying conditions. Finally, the thesis explores the intersection between semantic scene understanding and 3D reconstruction, proposing few-shot learning techniques to improve the understanding of 3D models from scarce data, particularly for applications in architectural modelling. Through these contributions, this thesis advances the state of the art in visual representation for semantic scene understanding, offering new insights into the integration of transfer learning techniques into standard pipelines.
| File | Description | Type | Access | Size | Format |
|---|---|---|---|---|---|
| PhD_Thesis-final-pdfA.pdf | Final thesis | Doctoral thesis | Open access | 21.52 MB | Adobe PDF |