Abstract

With the emergence of Gaussian Splats, recent efforts have focused on large-scale scene geometric reconstruction. However, most of these efforts either concentrate on memory reduction or spatial partitioning, neglecting information in the semantic space. In this paper, we propose a novel method, named SA-GS, for fine-grained 3D geometry reconstruction using semantic-aware 3D Gaussian Splats. Specifically, we leverage prior information stored in large vision models such as SAM and DINO to generate semantic masks. We then introduce a geometric complexity measurement function that serves as a soft regularization, guiding the shape of each Gaussian Splat within specific semantic areas. Additionally, we present a method that estimates the expected number of Gaussian Splats in different semantic areas, effectively providing a lower bound on the number of Gaussian Splats in these areas. Subsequently, we extract the point cloud using a novel probability density-based extraction method, transforming Gaussian Splats into a point cloud crucial for downstream tasks. Our method also offers the potential for detailed semantic inquiries while maintaining high-quality image-based reconstruction results. We provide extensive experiments on publicly available large-scale scene reconstruction datasets with highly accurate point clouds as ground truth, as well as on our novel dataset. Our results demonstrate that our method outperforms current state-of-the-art Gaussian Splats reconstruction methods by a significant margin in terms of geometry-based metrics.


Method

Point Cloud Comparison
Figure 1: Qualitative comparison between our method and other 3DGS-based methods. We propose the shape constraint, alpha constraint, and point cloud extraction in the current study. A quantitative ablation is shown on the right-hand side of the figure.
NIPS Teaser
Figure 2: The blue section of the figure illustrates common methods for reconstructing geometrically aligned Gaussian Splats. The input for all Gaussian Splatting methods includes a COLMAP initialization consisting of images, camera positions, and SfM sparse point clouds. The output will be a traditional representation such as a mesh or point cloud, as shown in the right blue box. During training, in addition to the common image rendering loss, most methods encourage all 3D Gaussians to form a disk-like shape. After several training iterations, or at the end of the training process, other methods select a hard threshold for the alpha value and use the remaining Gaussians for geometric reconstruction. However, these hard constraints often result in poorer reconstruction, as demonstrated in our experiments. Instead of encouraging all Gaussians to adopt the same shape, our method uses semantic information to control the shape in detail. We first produce semantic masks for each input image, then extract shape information for each semantic group, and use this information to locally control the shape of each Gaussian. Additionally, we provide an opacity field sampling method that can dynamically allocate the desired number of points and ignore defective reconstruction parts.
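Instead of a single hard flattening constraint, the per-Gaussian shape control described above can be sketched as a soft regularizer whose strength depends on the geometric complexity of each Gaussian's semantic group. The function name, the `complexity` input, and the exact weighting are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def soft_shape_loss(scales, complexity, eps=1e-8):
    """Sketch of a semantics-aware soft flattening regularizer.

    scales:     (N, 3) scale of each 3D Gaussian along its principal axes.
    complexity: (N,) assumed geometric complexity of the semantic group each
                Gaussian belongs to, in [0, 1]; 0 = flat region (e.g. road),
                1 = highly detailed region (e.g. vegetation).

    Gaussians in flat regions are pushed toward disk-like shapes (thinnest
    axis driven toward zero), while Gaussians in complex regions are only
    weakly regularized, instead of forcing every splat into the same shape.
    """
    s_min = np.min(scales, axis=1)          # thinnest axis of each splat
    s_max = np.max(scales, axis=1)
    flatness = s_min / (s_max + eps)        # 0 = perfect disk, 1 = sphere
    weights = 1.0 - complexity              # soft, semantics-dependent weight
    return float(np.mean(weights * flatness))
```

In this sketch, a spherical Gaussian in a flat semantic region incurs the maximum penalty, while the same Gaussian in a high-complexity region incurs none, which is the soft counterpart of the hard disk constraint used by prior methods.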
Fantasy Surface
Figure 3: Explanation of the fantasy-surface problem. In the first row of this figure, we display the results of using SuGaR to reconstruct the Campus and College scenes from GauUsceneV2. Many surfaces incorrectly model the lighting conditions due to complex effects, such as glass reflecting sunlight at different angles and clouds blocking sunlight. These imaginary surfaces, which do not represent the true surface, are regarded as fantasy surfaces. Our method, shown in the bottom rows, largely alleviates this problem, as evident in the figure. Another major source of geometric error occurs at the edges of unbounded scenes. However, this issue is common to all methods due to the sparsity of images at the edges and is not the focus of our current work.
Inconsistency
Figure 4: Explanation of Inconsistency problem. The semantic segmentation results are sometimes inconsistent with previous judgments. As shown in Figures (a) and (b), two tunnels are regarded as ground using GroundingSAM. However, in the images captured from a camera position immediately adjacent to them (Figures (c) and (d)), the left tunnel is not regarded as ground. This inconsistency between consecutive images is the primary cause of failure in naive reconstruction methods.
Methods
Figure 5: Method Overview: Our method pipeline consists of three main stages. Initially, we utilize the same input as vanilla Gaussian Splatting, but enhance it with semantic information extracted via Grounding SAM. Next, we assess the geometric complexity of each semantic group by calculating high-frequency power. Our geometric constraint is implemented through a soft regularization, facilitated by a semantic loss function. This guides the Gaussian shapes to match the expected shapes determined earlier. The rendering loss further refines the shape and attributes of the 3DGS, while the shape constraint, indicated by a negative sign, ensures alignment between rendered and real images. Controlling the shapes of different 3DGS is achieved by mapping their projected pixels onto the semantic map obtained earlier. Additionally, by reducing the number of low-opacity Gaussian splats to the expected count, we minimize GPU memory consumption during training. Finally, we offer a user-friendly point cloud extraction method via hierarchical probability density sampling. First, we create a multinomial distribution using the opacity values stored in each 3DGS. Then, based on user inputs and the multinomial distribution, we determine the number of points to sample from each Gaussian distribution. Detailed experimental results demonstrate significant improvements at each step, showcasing superior geometric reconstruction compared to current state-of-the-art methods.
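The hierarchical probability-density sampling described in the caption can be sketched as follows: opacities define a multinomial distribution over splats, the user-requested point budget is allocated across splats by drawing from that multinomial, and each splat then emits points from its own 3D Gaussian. The function name and signature are illustrative assumptions:

```python
import numpy as np

def sample_point_cloud(means, covs, opacities, n_points, rng=None):
    """Sketch of hierarchical probability-density point extraction.

    means:     (N, 3) Gaussian centers.
    covs:      (N, 3, 3) covariance matrices of the 3D Gaussians.
    opacities: (N,) opacity of each splat.
    n_points:  total number of points requested by the user.

    Splats with higher opacity receive proportionally more samples, so
    low-opacity (defective) splats contribute few or no points.
    """
    rng = np.random.default_rng(rng)
    probs = opacities / opacities.sum()        # multinomial over splats
    counts = rng.multinomial(n_points, probs)  # points allocated per splat
    points = [rng.multivariate_normal(means[i], covs[i], size=c)
              for i, c in enumerate(counts) if c > 0]
    return np.concatenate(points, axis=0)
```

Because the allocation is drawn once for the whole budget, the user can request any number of output points without a fixed opacity threshold, matching the dynamic allocation described above.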
Rendering Comparison
Figure 6: Comparison between our method and vanilla Gaussian Splatting. As one can see from the figure shown above, our method largely sharpens image edges. The tower in figure (a), which merges together under vanilla Gaussian Splatting, is sharpened by our method, while in the (b) figures we eliminate the noise around the tall building. The last group of pictures shows that our steady alpha-decreasing strategy is successful.

Comparison


Location: Campus (left-hand side vs. right-hand side)