Abstract
3D Gaussian Splatting (3DGS) is increasingly popular for 3D reconstruction due to its superior visual quality and rendering speed. However, 3DGS training currently occurs on a single GPU, limiting its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints. We introduce Grendel, a distributed system designed to partition 3DGS parameters and parallelize computation across multiple GPUs. As each Gaussian affects a small, dynamic subset of rendered pixels, Grendel employs sparse all-to-all communication to transfer the necessary Gaussians to pixel partitions and performs dynamic load balancing. Unlike existing 3DGS systems that train using one camera view image at a time, Grendel supports batched training with multiple views. We explore various optimization hyperparameter scaling strategies and find that a simple sqrt(batch_size) scaling rule is highly effective. Evaluations using large-scale, high-resolution scenes show that Grendel enhances rendering quality by scaling up 3DGS parameters across multiple GPUs. On the “Rubble” dataset, we achieve a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU.
H. Weng and D. Lu—Contributed equally to this work.
References
Barron, J., Mildenhall, B., Verbin, D., Srinivasan, P., Hedman, P.: Mip-nerf 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022). https://doi.org/10.1109/CVPR52688.2022.00539
Busbridge, D., et al.: How to scale your EMA. In: NeurIPS (2023). https://openreview.net/forum?id=DkeeXVdQyu
Ginsburg, B., Gitman, I., You, Y.: Large batch training of convolutional networks with layer-wise adaptive rate scaling (2018). https://openreview.net/forum?id=rJ4uaX2aW
Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Granziol, D., Zohren, S., Roberts, S.: Learning rates as a function of batch size: a random matrix theory approach to neural network training. J. Mach. Learn. Res. 23(173), 1–65 (2022)
Hedman, P., Philip, J., Price, T., Frahm, J.M., Drettakis, G., Brostow, G.: Deep blending for free-viewpoint image-based rendering. ACM Trans. Graph. (Proc. SIGGRAPH Asia) (2018)
Huang, Y., et al.: Gpipe: efficient training of giant neural networks using pipeline parallelism. In: NeurIPS (2019). https://proceedings.neurips.cc/paper_files/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4) (2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
Kerbl, B., Meuleman, A., Kopanas, G., Wimmer, M., Lanvin, A., Drettakis, G.: A hierarchical 3d gaussian representation for real-time rendering of very large datasets. ACM Trans. Graph. (2024). https://repo-sam.inria.fr/fungraph/hierarchical-3d-gaussians/
Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. In: ICLR (2017). https://openreview.net/forum?id=H1oyRlYgg
Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph. (2017)
Li, R., Fidler, S., Kanazawa, A., Williams, F.: Nerf-xl: scaling nerfs with multiple gpus (2024)
Li, S., et al.: Pytorch distributed: experiences on accelerating data parallel training. In: VLDB (2020)
Li, S., et al.: Surge phenomenon in optimal learning rate and batch size scaling. arXiv preprint arXiv:2405.14578 (2024)
Li, Y., et al.: Matrixcity: a large-scale city dataset for city-scale neural rendering and beyond. In: ICCV (2023)
Li, Z., Malladi, S., Arora, S.: On the validity of modeling SGD with stochastic differential equations (SDEs). In: NeurIPS (2021). https://openreview.net/forum?id=goEdyJ_nVQI
Lin, J., et al.: Vastgaussian: vast 3d gaussians for large scene reconstruction. In: CVPR (2024)
Liu, Y., Guan, H., Luo, C., Fan, L., Peng, J., Zhang, Z.: Citygaussian: real-time high-quality large-scale scene rendering with gaussians. In: CVPR (2024)
Lu, T., et al.: Scaffold-gs: structured 3d gaussians for view-adaptive rendering. In: CVPR (2024)
Malladi, S., Lyu, K., Panigrahi, A., Arora, S.: On the SDEs and scaling rules for adaptive gradient algorithms. In: NeurIPS (2022). https://openreview.net/forum?id=F2mhzjHkQP
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
Narayanan, D., et al.: Pipedream: generalized pipeline parallelism for dnn training. In: SOSP (2019)
Narayanan, D., et al.: Efficient large-scale language model training on gpu clusters using megatron-lm. In: SOSP (2021)
NERSC: Perlmutter architecture. https://docs.nersc.gov/systems/perlmutter/architecture/, Accessed 22 May 2024
Qiao, A., et al.: Pollux: co-adaptive cluster scheduling for goodput-optimized deep learning. In: OSDI (2021)
Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: memory optimizations toward training trillion parameter models. In: SC (2020)
Ren, K., et al.: Octree-gs: towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898 (2024)
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: training multi-billion parameter language models using model parallelism. In: SC (2020)
Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: scalable construction of large-scale nerfs for virtual fly-throughs. In: CVPR (2022)
Wang, M., Huang, C.c., Li, J.: Supporting very large models using automatic dataflow graph partitioning. In: EuroSys (2019)
Xu, Y., et al.: Gspmd: general and scalable parallelization for ml computation graphs. In: arXiv:2105.04663 (2021)
You, Y., et al.: Large batch optimization for deep learning: training bert in 76 minutes. In: ICLR (2020). https://openreview.net/forum?id=Syx4wnEtvH
Xiangli, Y., et al.: Bungeenerf: progressive neural radiance field for extreme multi-scale scene rendering. In: ECCV (2022)
Zhao, Y., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel (2023)
Zheng, L., et al.: Alpa: automating inter-and intra-operator parallelism for distributed deep learning. In: OSDI (2022)
Acknowledgements
We thank Xichen Pan and Youming Deng for their help on paper writing. We thank Matthias Niessner for his insightful and constructive feedback on our manuscript. We thank Yixuan Li and Lihan Jiang from the MatrixCity team for their assistance in providing initial data points of their dataset. We thank Kaifeng Lyu for discussions on Adam training dynamics analysis. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231.
Appendices
A Additional Preliminaries and Observations Details
This appendix provides additional information about 3DGS, beyond what was covered in Sect. 2.
A.1 Densification Process
Densification is the process by which 3DGS adds more Gaussians to improve detail in a particular region. A Gaussian that shows significant position variance across training steps is either cloned or split; the decision between cloning and splitting depends on whether its scale exceeds a threshold. Hyperparameters determine the start and stop iterations for densification, its frequency, the gradient threshold for initiating densification, and the scale threshold that decides whether to split or clone. To create more Gaussians, we increase the stop iteration and the frequency, and decrease the gradient threshold for densification. If we aim to capture more detail using smaller Gaussians, we lower the scale threshold so that more Gaussians are split. The training process also includes pruning strategies, such as eliminating Gaussians with low opacity and using opacity reset to remove redundant Gaussians.
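To make the roles of these hyperparameters concrete, the following minimal Python sketch (with hypothetical names and values, not the exact configuration of any particular codebase) illustrates how the gradient and scale thresholds could drive the clone/split decision:

```python
# Illustrative sketch of the densification hyperparameters discussed above.
# Names and values are hypothetical and do not necessarily match any codebase.
densify_cfg = {
    "start_iter": 500,        # iteration at which densification begins
    "stop_iter": 15_000,      # iteration at which densification ends
    "interval": 100,          # run densification every N iterations
    "grad_threshold": 2e-4,   # gradient threshold that triggers densification
    "scale_threshold": 0.01,  # scale threshold deciding between clone and split
}

def densify_decision(avg_pos_grad: float, scale: float, cfg: dict) -> str:
    """Return the densification action for a single Gaussian."""
    if avg_pos_grad < cfg["grad_threshold"]:
        return "keep"            # not enough reconstruction error in this region
    # large Gaussians are split into smaller ones; small ones are cloned
    return "split" if scale > cfg["scale_threshold"] else "clone"
```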
A.2 Z-Buffer
The indices of the Gaussians intersecting each pixel are stored in a Z-buffer, which is used in both the forward and backward passes. This Z-buffer is the link between the view-dependent Gaussian transformation and pixel rendering. Since a single Gaussian can project onto multiple pixels within its footprint, the total size of all pixels' Z-buffers exceeds both the number of Gaussians and the number of pixels. The Z-buffer itself, along with the auxiliary buffers needed to sort it, consumes significant activation memory. This can also lead to out-of-memory (OOM) errors when the resolution, scene size, or batch size is increased.
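As a rough illustration with made-up numbers (not measurements from our experiments), the sketch below estimates the Z-buffer footprint from per-pixel intersection counts; the total number of entries grows with both resolution and Gaussian density:

```python
import torch

# Rough, illustrative estimate of Z-buffer activation memory (hypothetical sizes).
# num_intersections[p] = number of Gaussians whose 2D footprint covers pixel p.
H, W = 1080, 1920
num_intersections = torch.randint(low=20, high=300, size=(H * W,))  # placeholder data

total_entries = int(num_intersections.sum())   # one (gaussian_id, depth) pair per entry
bytes_per_entry = 8                            # e.g., int32 index + float32 depth key
zbuffer_bytes = total_entries * bytes_per_entry

print(f"pixels: {H * W:,}, z-buffer entries: {total_entries:,}, "
      f"approx {zbuffer_bytes / 2**30:.2f} GiB before sorting scratch space")
```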
A.3 Mixed Parallelism
Some steps of 3DGS are not discussed in the main text, but they can also be parallelized. The Gaussian transformation backward pass and the optimizer's gradient updates are Gaussian-wise computations and are distributed the same way as the Gaussian transformation forward pass. Similarly, the render backward and loss backward computations are pixel-wise and are distributed just like the render forward pass.
Regarding memory, each Gaussian has independent transformed states (per camera view), gradients, optimizer states, and parameters. We therefore keep these states together on the GPU that stores the corresponding parameters. Activation state, such as the large Z-buffers, the auxiliary buffers for sorting and other functions, and the loss intermediate activations, is managed pixel-wise along with the image partition.
Regarding the densification mechanism, since Gaussians are cloned, split, or pruned independently based on their variance, we perform this process locally on the GPU that stores them.
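The sparse all-to-all step that moves transformed Gaussians from the GPUs that store them to the GPUs whose pixel partitions need them can be sketched as below. This is a simplified stand-in for the actual implementation (one process per GPU, launched with torchrun; each Gaussian is routed to a single destination rank here, whereas in practice one Gaussian may be needed by several ranks):

```python
import torch
import torch.distributed as dist

def exchange_gaussians(local_gaussians: torch.Tensor, dest_rank: torch.Tensor) -> torch.Tensor:
    """local_gaussians: [n_local, d] rows owned by this rank.
    dest_rank[i]: rank whose pixel partition needs row i."""
    world_size = dist.get_world_size()

    # Group rows by destination rank and record how many go to each peer.
    order = torch.argsort(dest_rank)
    send_buf = local_gaussians[order].contiguous()
    send_counts = torch.bincount(dest_rank, minlength=world_size)

    # Exchange counts first so every rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Sparse all-to-all of the actual Gaussian rows.
    recv_buf = send_buf.new_empty((int(recv_counts.sum()), local_gaussians.shape[1]))
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv_buf  # Gaussians needed by this rank's pixel blocks

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    g = torch.randn(1000, 59, device="cuda")                         # toy Gaussian parameters
    dest = torch.randint(0, dist.get_world_size(), (1000,), device="cuda")
    needed = exchange_gaussians(g, dest)
    print(dist.get_rank(), "received", needed.shape[0], "gaussians")
    dist.destroy_process_group()
```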
A.4 Dynamic Unbalanced Workloads
Physical scenes are naturally sparse on a global scale: different areas have different densities of 3D Gaussians (e.g., the sky versus a tree). Thus, rendering intensity not only varies from pixel to pixel within an image, but also differs between images, leading to workload imbalance.
Moreover, Gaussian parameters change continuously during training. More precisely, changes in the 3D position and covariance parameters affect each Gaussian's pixel coverage on the screen, while changes in the opacity parameters affect the number of Gaussians that contribute to each pixel. Both lead to changes in rendering intensity. The densification process targets areas that are still being reconstructed: simpler scene elements are completed first, allowing more complex parts to be progressively densified, so Gaussians in different regions densify at different rates. The dynamic nature of the workload is more pronounced at the beginning of training, which initially focuses on constructing the global structure before filling in local details.
The different computational steps also have distinct workload characteristics. While the rendering computation is dynamic and unbalanced, the computational intensity of the loss calculation is consistent across pixels, and the view-dependent transformation has uniform computational intensity across Gaussians. Furthermore, render forward and backward exhibit different patterns of imbalance and dynamism: the computational complexity of the forward pass scales with the number of Gaussians intersecting each pixel's ray, whereas the complexity of the backward pass depends on the Gaussians that actually contributed to color and loss before opacity saturation, typically those on the first visible surface. As a result, the running times of render forward, render backward, loss forward, and loss backward are dominated by different factors, and each step takes a significant amount of time.
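A simple linear cost model captures this distinction; the coefficients and the linear form below are hypothetical simplifications used only to illustrate which quantity dominates each step:

```python
# Illustrative per-pixel-block cost model (coefficients are hypothetical; they are
# not measured constants and this is not how the system estimates runtime).
def estimate_block_times(n_pixels, n_intersected, n_contributed,
                         c_fwd=1.0, c_bwd=1.5, c_loss=0.2):
    """n_intersected: Gaussians whose footprint overlaps the block (drives render forward);
    n_contributed: Gaussians that affect pixel colors before opacity saturation
    (drives render backward); n_pixels drives the loss computation."""
    t_render_fwd = c_fwd * n_intersected
    t_render_bwd = c_bwd * n_contributed
    t_loss = c_loss * n_pixels   # loss forward + backward scale with pixel count
    return t_render_fwd, t_render_bwd, t_loss
```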
B Additional Design Details
B.1 Scheduling Granularity: Pixel Block Size
In our design, we flatten the pixels of all images in a batch into a single row, divide this row into contiguous parts, and assign each part to one GPU. However, scheduling individual pixels would make the strategy scheduler's computation overhead very large, so we group pixels into blocks of 16 by 16 pixels, arrange these blocks in a row, and allocate blocks instead. The block size is essentially the scheduling granularity, a trade-off between scheduler overhead and the uneven workloads caused by coarser blocks. After scheduling, we obtain a 2D boolean array, compute_locally[i][j], indicating whether the pixel block in the i-th row and j-th column should be computed by the local GPU. We then render only the pixels within the blocks where compute_locally is true.
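The sketch below shows this block-granularity assignment for a single image (in a batch, the blocks of all images are concatenated first); for simplicity it splits blocks evenly, whereas the scheduler chooses division points that balance runtime across GPUs:

```python
import torch

BLOCK = 16  # pixel block size; the scheduling granularity

def make_compute_locally(image_h, image_w, rank, world_size):
    n_rows = (image_h + BLOCK - 1) // BLOCK
    n_cols = (image_w + BLOCK - 1) // BLOCK
    n_blocks = n_rows * n_cols

    # Contiguous slice of the flattened block row owned by this GPU
    # (even split here; the real division points are workload-aware).
    start = rank * n_blocks // world_size
    end = (rank + 1) * n_blocks // world_size

    compute_locally = torch.zeros(n_rows, n_cols, dtype=torch.bool)
    compute_locally.view(-1)[start:end] = True
    return compute_locally   # compute_locally[i][j]: render block (i, j) on this GPU

# Example: 4 GPUs, rank 1, a 1080 x 1920 image.
mask = make_compute_locally(1080, 1920, rank=1, world_size=4)
print(mask.shape, int(mask.sum()), "blocks assigned to this rank")
```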
B.2 Gaussian Distribution Rebalance
An important observation is that distributing pixels to balance runtime does not necessarily balance the number of Gaussians each GPU touches during rendering. To minimize total communication volume, GPUs may therefore need to store varying numbers of Gaussians, as dictated by the formula above. Specifically, only the forward runtime correlates directly with the number of touched Gaussians; the time for pixel-wise loss calculation and for the rendering backward pass depends on the number of pixels and on the number of Gaussians that actually contribute to the rendered pixel colors, respectively. In our experiments, random redistribution leads to the fastest training, even though its overall communication volume is not minimal. This is because our setting uses NCCL all-to-all as the underlying communication primitive, which favors uniform send and receive volumes across GPUs. If we switched to a communication primitive whose cost depends only on the total communication volume, a different redistribution strategy might be preferable.
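As a sketch of that strategy (ignoring the actual movement of parameters and optimizer states), a random permutation of Gaussian ownership keeps per-GPU counts, and hence all-to-all send and receive volumes, nearly uniform:

```python
import torch

# Illustrative random-redistribution plan: reassign Gaussian ownership by a random
# permutation so every GPU stores (nearly) the same number of Gaussians.
def random_owner_assignment(num_gaussians: int, world_size: int, seed: int = 0) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)        # same seed on every rank -> same plan
    perm = torch.randperm(num_gaussians, generator=gen)
    owner = torch.empty(num_gaussians, dtype=torch.long)
    # Deal out permuted Gaussians round-robin so per-GPU counts differ by at most one.
    owner[perm] = torch.arange(num_gaussians) % world_size
    return owner                                     # owner[g] = rank that stores Gaussian g

owners = random_owner_assignment(num_gaussians=1_000_000, world_size=16)
print(torch.bincount(owners, minlength=16))          # ~62,500 Gaussians per GPU
```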
B.3 Empirical Evidence of Independent Gradients
To see whether the Independent Gradients Hypothesis holds in practice, we analyze the average per-parameter variance of the gradients in real-world settings. In Fig. 13, we plot the sparsity and variance of the gradients of the diffuse color parameters against the batch size, starting from pre-trained checkpoints on the “Rubble” dataset [29]. We find that the inverse of the variance increases roughly linearly and then transitions into a plateau. This behavior appears at all three checkpoint iterations, representing early, middle, and late training stages. The initial linear increase of the precision suggests that gradients are roughly uncorrelated at the batch sizes used in this work (up to 32), supporting the independent gradients hypothesis. It is worth noting, however, that even though a single image has sparse gradients, with many images in a batch the gradients overlap and become less sparse. They also become more correlated, because we expect images with similar poses to produce similar gradients.
Gradients are roughly uncorrelated in practice. On the “Rubble” dataset [29], the inverse of the average parameter variance increases linearly, then rises to a plateau, suggesting that the gradients are roughly uncorrelated initially but become less so as the batch size becomes large. Averaged over 32 random trials.
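The measurement behind Fig. 13 can be sketched as follows; `sample_gradient` below is a toy stand-in for a real 3DGS backward pass at a fixed checkpoint, used only to show how the per-parameter variance and its inverse are computed:

```python
import torch

def sample_gradient(batch_size: int, dim: int = 1000) -> torch.Tensor:
    # Toy model: the gradient of a batch is the mean of `batch_size` i.i.d. per-image
    # gradients, so its variance should shrink like 1 / batch_size.
    per_image = torch.randn(batch_size, dim)
    return per_image.mean(dim=0)

def avg_param_variance(batch_size: int, num_trials: int = 32) -> float:
    grads = torch.stack([sample_gradient(batch_size) for _ in range(num_trials)])
    return grads.var(dim=0).mean().item()   # per-parameter variance, averaged over params

for bs in [1, 2, 4, 8, 16, 32]:
    v = avg_param_variance(bs)
    print(f"batch size {bs:2d}: 1/variance = {1.0 / v:8.1f}")   # ~linear if uncorrelated
```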
We plot the training trajectories of the diffuse color parameters on “Rubble” when training with batch size \(\in [4, 16, 32]\) using different learning rate and momentum scaling strategies. Cumulative weight updates using the square-root learning rate scaling rule (a) and the exponential momentum scaling rule (b) maintain high cosine similarity to batch-size-1 updates and have norms that are roughly invariant to the batch size. (Color figure online)
B.4 Empirical Testing of Proposed Scaling Rules
To empirically test whether the proposed learning rate and momentum scaling rules work well, we train the “Rubble” scene to iteration 15,000 with a batch size of 1. Then, we reset the Adam optimizer states and continue training with different batch sizes. Figure 14 compares how well different learning-rate and momentum scaling rules maintain a similar training trajectory when switching to larger batch sizes. Since different parameter groups of 3DGS have vastly different magnitudes, we focus on one specific group, namely the diffuse color, to make the comparisons meaningful. Figure 14a compares three learning rate scaling rules \(\in \) [constant, sqrt, linear]; only our proposed “sqrt” rule maintains both a high update cosine similarity and a similar update magnitude across training batch sizes. Similarly, Fig. 14b shows that our proposed exponential momentum scaling rule keeps the update cosine similarity higher than the alternative of leaving the momentum coefficients unchanged (Table 2).
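Applying both rules when moving from batch size 1 to batch size B can be sketched as follows; here the exponential momentum scaling is interpreted as \(\beta \rightarrow \beta ^{B}\), and the base values are illustrative rather than the exact per-parameter-group settings used in training:

```python
import math
import torch

def scale_adam_hyperparams(base_lr: float, base_betas=(0.9, 0.999), batch_size: int = 1):
    lr = base_lr * math.sqrt(batch_size)                # sqrt learning-rate scaling rule
    betas = tuple(b ** batch_size for b in base_betas)  # exponential momentum scaling rule
    return lr, betas

# Example: illustrative base values for a single parameter group at batch size 16.
params = [torch.nn.Parameter(torch.randn(10))]
lr, betas = scale_adam_hyperparams(base_lr=2.5e-3, batch_size=16)
optimizer = torch.optim.Adam(params, lr=lr, betas=betas)
print(f"lr={lr:.4e}, betas={betas}")
```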
C Additional Experiments Setting and Statistics
C.1 Statistics for the Mip-NeRF 360, Tanks & Temples, and DeepBlending Datasets
C.2 Scalability
Tables 3, 4 and 5 show that reconstruction quality increases with more Gaussians. While many hyperparameters influence the number of Gaussians created by densification, we focused on adjusting three key parameters: (1) the stop iteration for densification, (2) the gradient threshold for initiating densification, and (3) the scale threshold for deciding whether to split or clone a Gaussian. Initially, we gradually increased the densification stop iteration to 5,000 iterations. However, due to the pruning mechanism, this adjustment alone proved insufficient, so we also lowered the two thresholds to generate more Gaussians. For a fair comparison, all other densification parameters, such as the interval, start iteration, and opacity reset interval, were kept constant. For the Rubble scene, each experiment ran for the same 125 epochs, exposing the model to 200,000 images, to ensure consistency. Although training larger models for longer and lowering the positional learning rate improved results in our observations, we kept the training steps and learning rates consistent across all experiments to ensure fairness.
Tables 6 and 7 show throughput scalability as the batch size increases and more GPUs are leveraged, for the Rubble and Train scenes, respectively. Essentially, more GPUs and a larger batch size yield higher throughput.
Table 8 demonstrates that additional GPUs provide memory for more Gaussians; it is evaluated on the Rubble scene with various batch sizes, reflecting different levels of activation memory usage. Essentially, more GPUs provide additional memory to store Gaussians, while a larger batch size increases activation memory usage, leaving less memory available for Gaussians.