Abstract
3D Gaussian Splatting (3DGS) is increasingly popular for 3D reconstruction due to its superior visual quality and rendering speed. However, 3DGS training currently occurs on a single GPU, limiting its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints. We introduce Grendel, a distributed system designed to partition 3DGS parameters and parallelize computation across multiple GPUs. As each Gaussian affects a small, dynamic subset of rendered pixels, Grendel employs sparse all-to-all communication to transfer the necessary Gaussians to pixel partitions and performs dynamic load balancing. Unlike existing 3DGS systems that train using one camera view image at a time, Grendel supports batched training with multiple views. We explore various optimization hyperparameter scaling strategies and find that a simple sqrt(batch_size) scaling rule is highly effective. Evaluations using large-scale, high-resolution scenes show that Grendel enhances rendering quality by scaling up 3DGS parameters across multiple GPUs. On the “Rubble” dataset, we achieve a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU.
H. Weng and D. Lu—Contributed equally to this work.
References
Barron, J., Mildenhall, B., Verbin, D., Srinivasan, P., Hedman, P.: Mip-nerf 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022). https://doi.org/10.1109/CVPR52688.2022.00539
Busbridge, D., et al.: How to scale your EMA. In: NeurIPS (2023). https://openreview.net/forum?id=DkeeXVdQyu
Ginsburg, B., Gitman, I., You, Y.: Large batch training of convolutional networks with layer-wise adaptive rate scaling (2018). https://openreview.net/forum?id=rJ4uaX2aW
Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Granziol, D., Zohren, S., Roberts, S.: Learning rates as a function of batch size: a random matrix theory approach to neural network training. J. Mach. Learn. Res. 23(173), 1–65 (2022)
Hedman, P., Philip, J., Price, T., Frahm, J.M., Drettakis, G., Brostow, G.: Deep blending for free-viewpoint image-based rendering. ACM Trans. Graph. (Proc. SIGGRAPH Asia) (2018)
Huang, Y., et al.: Gpipe: efficient training of giant neural networks using pipeline parallelism. In: NeurIPS (2019). https://proceedings.neurips.cc/paper_files/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4) (2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
Kerbl, B., Meuleman, A., Kopanas, G., Wimmer, M., Lanvin, A., Drettakis, G.: A hierarchical 3d gaussian representation for real-time rendering of very large datasets. ACM Trans. Graph. (2024). https://repo-sam.inria.fr/fungraph/hierarchical-3d-gaussians/
Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. In: ICLR (2017). https://openreview.net/forum?id=H1oyRlYgg
Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph. (2017)
Li, R., Fidler, S., Kanazawa, A., Williams, F.: Nerf-xl: scaling nerfs with multiple gpus (2024)
Li, S., et al.: Pytorch distributed: experiences on accelerating data parallel training. In: VLDB (2020)
Li, S., et al.: Surge phenomenon in optimal learning rate and batch size scaling. arXiv preprint arXiv:2405.14578 (2024)
Li, Y., et al.: Matrixcity: a large-scale city dataset for city-scale neural rendering and beyond. In: ICCV (2023)
Li, Z., Malladi, S., Arora, S.: On the validity of modeling SGD with stochastic differential equations (SDEs). In: NeurIPS (2021). https://openreview.net/forum?id=goEdyJ_nVQI
Lin, J., et al.: Vastgaussian: vast 3d gaussians for large scene reconstruction. In: CVPR (2024)
Liu, Y., Guan, H., Luo, C., Fan, L., Peng, J., Zhang, Z.: Citygaussian: real-time high-quality large-scale scene rendering with gaussians. In: CVPR (2024)
Lu, T., et al.: Scaffold-gs: structured 3d gaussians for view-adaptive rendering. In: CVPR (2024)
Malladi, S., Lyu, K., Panigrahi, A., Arora, S.: On the SDEs and scaling rules for adaptive gradient algorithms. In: NeurIPS (2022). https://openreview.net/forum?id=F2mhzjHkQP
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
Narayanan, D., et al.: Pipedream: generalized pipeline parallelism for dnn training. In: SOSP (2019)
Narayanan, D., et al.: Efficient large-scale language model training on gpu clusters using megatron-lm. In: SOSP (2021)
NERSC: Perlmutter architecture. https://docs.nersc.gov/systems/perlmutter/architecture/, Accessed 22 May 2024
Qiao, A., et al.: Pollux: co-adaptive cluster scheduling for goodput-optimized deep learning. In: OSDI (2021)
Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: memory optimizations toward training trillion parameter models. In: SC (2020)
Ren, K., et al.: Octree-gs: towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898 (2024)
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: training multi-billion parameter language models using model parallelism. In: SC (2020)
Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: scalable construction of large-scale nerfs for virtual fly-throughs. In: CVPR (2022)
Wang, M., Huang, C.c., Li, J.: Supporting very large models using automatic dataflow graph partitioning. In: EuroSys (2019)
Xu, Y., et al.: Gspmd: general and scalable parallelization for ml computation graphs. In: arXiv:2105.04663 (2021)
You, Y., et al.: Large batch optimization for deep learning: training bert in 76 minutes. In: ICLR (2020). https://openreview.net/forum?id=Syx4wnEtvH
Xiangli, Y., et al.: Bungeenerf: progressive neural radiance field for extreme multi-scale scene rendering. In: ECCV (2022)
Zhao, Y., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel (2023)
Zheng, L., et al.: Alpa: automating inter-and intra-operator parallelism for distributed deep learning. In: OSDI (2022)
Acknowledgements
We thank Xichen Pan and Youming Deng for their help on paper writing. We thank Matthias Niessner for his insightful and constructive feedback on our manuscript. We thank Yixuan Li and Lihan Jiang from the MatrixCity team for their assistance in providing initial data points of their dataset. We thank Kaifeng Lyu for discussions on Adam training dynamics analysis. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231.
Appendices
A Additional Preliminaries and Observations Details
This appendix provides additional information about 3DGS, beyond what was covered in Sect. 2.
A.1 Densification Process
Densification is the process by which 3DGS adds more Gaussians to improve detail in a particular region. A Gaussian that shows significant position variance across training steps is either cloned or split; the decision between cloning and splitting depends on whether its scale exceeds a threshold. Hyperparameters determine the start and stop iterations for densification, its frequency, the gradient threshold for initiating densification, and the scale threshold that decides whether to split or clone. To create more Gaussians, we increase the stop iteration and the frequency, and decrease the gradient threshold for densification. If we aim to capture more detail using smaller Gaussians, we lower the scale threshold so that more Gaussians are split. The training process also includes pruning strategies, such as eliminating Gaussians with low opacity and using opacity reset to remove redundant Gaussians.
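To make the roles of these hyperparameters concrete, the following minimal Python sketch (with hypothetical names and values, not the exact configuration of any particular codebase) illustrates how the gradient and scale thresholds could drive the clone/split decision:

```python
# Illustrative sketch of the densification hyperparameters discussed above.
# Names and values are hypothetical and do not necessarily match any codebase.
densify_cfg = {
    "start_iter": 500,        # iteration at which densification begins
    "stop_iter": 15_000,      # iteration at which densification ends
    "interval": 100,          # run densification every N iterations
    "grad_threshold": 2e-4,   # gradient threshold that triggers densification
    "scale_threshold": 0.01,  # scale threshold deciding between clone and split
}

def densify_decision(avg_pos_grad: float, scale: float, cfg: dict) -> str:
    """Return the densification action for a single Gaussian."""
    if avg_pos_grad < cfg["grad_threshold"]:
        return "keep"            # not enough reconstruction error in this region
    # large Gaussians are split into smaller ones; small ones are cloned
    return "split" if scale > cfg["scale_threshold"] else "clone"
```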
A.2 Z-Buffer
The indices of the Gaussians intersecting each pixel are stored in a Z-buffer, which is used in both the forward and backward passes. This Z-buffer is the link between the view-dependent Gaussian transformation and pixel rendering. Since a single Gaussian can project onto multiple pixels within its footprint, the total size of all pixels' Z-buffers exceeds both the number of Gaussians and the number of pixels. The Z-buffer itself, along with the auxiliary buffers needed to sort it, consumes significant activation memory. This can also lead to out-of-memory (OOM) errors when the resolution, scene size, or batch size is increased.
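As a rough illustration with made-up numbers (not measurements from our experiments), the sketch below estimates the Z-buffer footprint from per-pixel intersection counts; the total number of entries grows with both resolution and Gaussian density:

```python
import torch

# Rough, illustrative estimate of Z-buffer activation memory (hypothetical sizes).
# num_intersections[p] = number of Gaussians whose 2D footprint covers pixel p.
H, W = 1080, 1920
num_intersections = torch.randint(low=20, high=300, size=(H * W,))  # placeholder data

total_entries = int(num_intersections.sum())   # one (gaussian_id, depth) pair per entry
bytes_per_entry = 8                            # e.g., int32 index + float32 depth key
zbuffer_bytes = total_entries * bytes_per_entry

print(f"pixels: {H * W:,}, z-buffer entries: {total_entries:,}, "
      f"approx {zbuffer_bytes / 2**30:.2f} GiB before sorting scratch space")
```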
A.3 Mixed Parallelism
Some steps of 3DGS are not discussed in the main text, but they can also be parallelized. The Gaussian transformation backward pass and the optimizer's gradient updates are Gaussian-wise computations and are distributed the same way as the Gaussian transformation forward pass. Similarly, the render backward and loss backward computations are pixel-wise and are distributed just like the render forward pass.
Regarding memory, each Gaussian has independent transformed states (per camera view), gradients, optimizer states, and parameters. We therefore keep these states together on the GPU that stores the corresponding parameters. Activation state, such as the large Z-buffers, the auxiliary buffers for sorting and other functions, and the loss intermediate activations, is managed pixel-wise along with the image partition.
Regarding the densification mechanism, since Gaussians are cloned, split, or pruned independently based on their variance, we perform this process locally on the GPU that stores them.
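The sparse all-to-all step that moves transformed Gaussians from the GPUs that store them to the GPUs whose pixel partitions need them can be sketched as below. This is a simplified stand-in for the actual implementation (one process per GPU, launched with torchrun; each Gaussian is routed to a single destination rank here, whereas in practice one Gaussian may be needed by several ranks):

```python
import torch
import torch.distributed as dist

def exchange_gaussians(local_gaussians: torch.Tensor, dest_rank: torch.Tensor) -> torch.Tensor:
    """local_gaussians: [n_local, d] rows owned by this rank.
    dest_rank[i]: rank whose pixel partition needs row i."""
    world_size = dist.get_world_size()

    # Group rows by destination rank and record how many go to each peer.
    order = torch.argsort(dest_rank)
    send_buf = local_gaussians[order].contiguous()
    send_counts = torch.bincount(dest_rank, minlength=world_size)

    # Exchange counts first so every rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Sparse all-to-all of the actual Gaussian rows.
    recv_buf = send_buf.new_empty((int(recv_counts.sum()), local_gaussians.shape[1]))
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv_buf  # Gaussians needed by this rank's pixel blocks

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    g = torch.randn(1000, 59, device="cuda")                         # toy Gaussian parameters
    dest = torch.randint(0, dist.get_world_size(), (1000,), device="cuda")
    needed = exchange_gaussians(g, dest)
    print(dist.get_rank(), "received", needed.shape[0], "gaussians")
    dist.destroy_process_group()
```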
A.4 Dynamic Unbalanced Workloads
Physical scenes are naturally sparse on a global scale: different areas have different densities of 3D Gaussians (e.g., the sky versus a tree). Thus, rendering intensity not only varies from pixel to pixel within an image, but also differs between images, leading to workload imbalance.
Moreover, Gaussian parameters change continuously during training. More precisely, changes in the 3D position and covariance parameters affect each Gaussian's pixel coverage on the screen, while changes in the opacity parameters affect the number of Gaussians that contribute to each pixel. Both lead to changes in rendering intensity. The densification process targets areas that are still being reconstructed: simpler scene elements are completed first, allowing more complex parts to be progressively densified, so Gaussians in different regions densify at different rates. The dynamic nature of the workload is more pronounced at the beginning of training, which initially focuses on constructing the global structure before filling in local details.
The different computational steps also have distinct workload characteristics. While the rendering computation is dynamic and unbalanced, the computational intensity of the loss calculation is consistent across pixels, and the view-dependent transformation has uniform computational intensity across Gaussians. Furthermore, render forward and backward exhibit different patterns of imbalance and dynamism: the computational complexity of the forward pass scales with the number of Gaussians intersecting each pixel's ray, whereas the complexity of the backward pass depends on the Gaussians that actually contributed to color and loss before opacity saturation, typically those on the first visible surface. As a result, the running times of render forward, render backward, loss forward, and loss backward are dominated by different factors, and each step takes a significant amount of time.
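A simple linear cost model captures this distinction; the coefficients and the linear form below are hypothetical simplifications used only to illustrate which quantity dominates each step:

```python
# Illustrative per-pixel-block cost model (coefficients are hypothetical; they are
# not measured constants and this is not how the system estimates runtime).
def estimate_block_times(n_pixels, n_intersected, n_contributed,
                         c_fwd=1.0, c_bwd=1.5, c_loss=0.2):
    """n_intersected: Gaussians whose footprint overlaps the block (drives render forward);
    n_contributed: Gaussians that affect pixel colors before opacity saturation
    (drives render backward); n_pixels drives the loss computation."""
    t_render_fwd = c_fwd * n_intersected
    t_render_bwd = c_bwd * n_contributed
    t_loss = c_loss * n_pixels   # loss forward + backward scale with pixel count
    return t_render_fwd, t_render_bwd, t_loss
```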
B Additional Design Details
B.1 Scheduling Granularity: Pixel Block Size
In our design, we flatten the pixels of all images in a batch into a single row, divide this row into contiguous parts, and assign each part to one GPU. However, scheduling individual pixels would make the strategy scheduler's computation overhead very large, so we group pixels into blocks of 16 by 16 pixels, arrange these blocks in a row, and allocate blocks instead. The block size is essentially the scheduling granularity, a trade-off between scheduler overhead and the uneven workloads caused by coarser blocks. After scheduling, we obtain a 2D boolean array, compute_locally[i][j], indicating whether the pixel block in the i-th row and j-th column should be computed by the local GPU. We then render only the pixels within the blocks where compute_locally is true.
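The sketch below shows this block-granularity assignment for a single image (in a batch, the blocks of all images are concatenated first); for simplicity it splits blocks evenly, whereas the scheduler chooses division points that balance runtime across GPUs:

```python
import torch

BLOCK = 16  # pixel block size; the scheduling granularity

def make_compute_locally(image_h, image_w, rank, world_size):
    n_rows = (image_h + BLOCK - 1) // BLOCK
    n_cols = (image_w + BLOCK - 1) // BLOCK
    n_blocks = n_rows * n_cols

    # Contiguous slice of the flattened block row owned by this GPU
    # (even split here; the real division points are workload-aware).
    start = rank * n_blocks // world_size
    end = (rank + 1) * n_blocks // world_size

    compute_locally = torch.zeros(n_rows, n_cols, dtype=torch.bool)
    compute_locally.view(-1)[start:end] = True
    return compute_locally   # compute_locally[i][j]: render block (i, j) on this GPU

# Example: 4 GPUs, rank 1, a 1080 x 1920 image.
mask = make_compute_locally(1080, 1920, rank=1, world_size=4)
print(mask.shape, int(mask.sum()), "blocks assigned to this rank")
```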
B.2 Gaussian Distribution Rebalance
An important observation is that distributing pixels to balance runtime does not necessarily balance the number of Gaussians each GPU touches during rendering. To minimize total communication volume, GPUs may therefore need to store varying numbers of Gaussians, as dictated by the formula above. Specifically, only the forward runtime correlates directly with the number of touched Gaussians; the time for pixel-wise loss calculation and for the rendering backward pass depends on the number of pixels and on the number of Gaussians that actually contribute to the rendered pixel colors, respectively. In our experiments, random redistribution leads to the fastest training, even though its overall communication volume is not minimal. This is because our setting uses NCCL all-to-all as the underlying communication primitive, which favors uniform send and receive volumes across GPUs. If we switched to a communication primitive whose cost depends only on the total communication volume, a different redistribution strategy might be preferable.
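As a sketch of that strategy (ignoring the actual movement of parameters and optimizer states), a random permutation of Gaussian ownership keeps per-GPU counts, and hence all-to-all send and receive volumes, nearly uniform:

```python
import torch

# Illustrative random-redistribution plan: reassign Gaussian ownership by a random
# permutation so every GPU stores (nearly) the same number of Gaussians.
def random_owner_assignment(num_gaussians: int, world_size: int, seed: int = 0) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)        # same seed on every rank -> same plan
    perm = torch.randperm(num_gaussians, generator=gen)
    owner = torch.empty(num_gaussians, dtype=torch.long)
    # Deal out permuted Gaussians round-robin so per-GPU counts differ by at most one.
    owner[perm] = torch.arange(num_gaussians) % world_size
    return owner                                     # owner[g] = rank that stores Gaussian g

owners = random_owner_assignment(num_gaussians=1_000_000, world_size=16)
print(torch.bincount(owners, minlength=16))          # ~62,500 Gaussians per GPU
```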
B.3 Empirical Evidence of Independent Gradients
To see whether the Independent Gradients Hypothesis holds in practice, we analyze the average per-parameter variance of the gradients in real-world settings. In Fig. 13, we plot the sparsity and variance of the gradients of the diffuse color parameters against the batch size, starting from pre-trained checkpoints on the “Rubble” dataset [29]. We find that the inverse of the variance increases roughly linearly and then transitions into a plateau. This behavior appears at all three checkpoint iterations, representing early, middle, and late training stages. The initial linear increase of the precision suggests that gradients are roughly uncorrelated at the batch sizes used in this work (up to 32), supporting the independent gradients hypothesis. It is worth noting, however, that even though a single image has sparse gradients, with many images in a batch the gradients overlap and become less sparse. They also become more correlated, because we expect images with similar poses to produce similar gradients.
Gradients are roughly uncorrelated in practice. On the “Rubble” dataset [29], the inverse of the average parameter variance increases linearly, then rises to a plateau, suggesting that the gradients are roughly uncorrelated initially but become less so as the batch size becomes large. Averaged over 32 random trials.
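The measurement behind Fig. 13 can be sketched as follows; `sample_gradient` below is a toy stand-in for a real 3DGS backward pass at a fixed checkpoint, used only to show how the per-parameter variance and its inverse are computed:

```python
import torch

def sample_gradient(batch_size: int, dim: int = 1000) -> torch.Tensor:
    # Toy model: the gradient of a batch is the mean of `batch_size` i.i.d. per-image
    # gradients, so its variance should shrink like 1 / batch_size.
    per_image = torch.randn(batch_size, dim)
    return per_image.mean(dim=0)

def avg_param_variance(batch_size: int, num_trials: int = 32) -> float:
    grads = torch.stack([sample_gradient(batch_size) for _ in range(num_trials)])
    return grads.var(dim=0).mean().item()   # per-parameter variance, averaged over params

for bs in [1, 2, 4, 8, 16, 32]:
    v = avg_param_variance(bs)
    print(f"batch size {bs:2d}: 1/variance = {1.0 / v:8.1f}")   # ~linear if uncorrelated
```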
We plot the training trajectories of the diffuse color parameters on “Rubble” when training with batch size \(\in [4, 16, 32]\) using different learning rate and momentum scaling strategies. Cumulative weight updates using the square-root learning rate scaling rule (a) and the exponential momentum scaling rule (b) maintain high cosine similarity to batch-size-1 updates and have norms that are roughly invariant to the batch size. (Color figure online)
B.4 Empirical Testing of Proposed Scaling Rules
To empirically test whether the proposed learning rate and momentum scaling rules work well, we train the “Rubble” scene to iteration 15,000 with a batch size of 1. Then, we reset the Adam optimizer states and continue training with different batch sizes. Figure 14 compares how well different learning-rate and momentum scaling rules maintain a similar training trajectory when switching to larger batch sizes. Since different parameter groups of 3DGS have vastly different magnitudes, we focus on one specific group, namely the diffuse color, to make the comparisons meaningful. Figure 14a compares three learning rate scaling rules \(\in \) [constant, sqrt, linear]; only our proposed “sqrt” rule maintains both a high update cosine similarity and a similar update magnitude across training batch sizes. Similarly, Fig. 14b shows that our proposed exponential momentum scaling rule keeps the update cosine similarity higher than the alternative of leaving the momentum coefficients unchanged (Table 2).
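Applying both rules when moving from batch size 1 to batch size B can be sketched as follows; here the exponential momentum scaling is interpreted as \(\beta \rightarrow \beta ^{B}\), and the base values are illustrative rather than the exact per-parameter-group settings used in training:

```python
import math
import torch

def scale_adam_hyperparams(base_lr: float, base_betas=(0.9, 0.999), batch_size: int = 1):
    lr = base_lr * math.sqrt(batch_size)                # sqrt learning-rate scaling rule
    betas = tuple(b ** batch_size for b in base_betas)  # exponential momentum scaling rule
    return lr, betas

# Example: illustrative base values for a single parameter group at batch size 16.
params = [torch.nn.Parameter(torch.randn(10))]
lr, betas = scale_adam_hyperparams(base_lr=2.5e-3, batch_size=16)
optimizer = torch.optim.Adam(params, lr=lr, betas=betas)
print(f"lr={lr:.4e}, betas={betas}")
```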
C Additional Experiments Setting and Statistics
C.1 Statistics for the Mip-NeRF 360, Tanks & Temples, and DeepBlending Datasets
C.2 Scalability
Tables 3, 4 and 5 show that reconstruction quality increases with more Gaussians. While many hyperparameters influence the number of Gaussians created by densification, we focused on adjusting three key parameters: (1) the stop iteration for densification, (2) the gradient threshold for initiating densification, and (3) the scale threshold for deciding whether to split or clone a Gaussian. Initially, we gradually increased the densification stop iteration to 5,000 iterations. However, due to the pruning mechanism, this adjustment alone proved insufficient, so we also lowered the two thresholds to generate more Gaussians. For a fair comparison, all other densification parameters, such as the interval, start iteration, and opacity reset interval, were kept constant. For the Rubble scene, each experiment ran for the same 125 epochs, exposing the model to 200,000 images, to ensure consistency. Although training larger models for longer and lowering the positional learning rate improved results in our observations, we kept the training steps and learning rates consistent across all experiments to ensure fairness.
Tables 6 and 7 show throughput scalability as the batch size increases and more GPUs are leveraged, for the Rubble and Train scenes, respectively. Essentially, more GPUs and a larger batch size yield higher throughput.
Table 8 demonstrates that additional GPUs provide memory for more Gaussians; it is evaluated on the Rubble scene with various batch sizes, reflecting different levels of activation memory usage. Essentially, more GPUs provide additional memory to store Gaussians, while a larger batch size increases activation memory usage, leaving less memory available for Gaussians.