
On Scaling Up 3D Gaussian Splatting Training

Conference paper in Computer Vision – ECCV 2024 Workshops (ECCV 2024), part of the book series Lecture Notes in Computer Science (LNCS, volume 15645).


Abstract

3D Gaussian Splatting (3DGS) is increasingly popular for 3D reconstruction due to its superior visual quality and rendering speed. However, 3DGS training currently occurs on a single GPU, limiting its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints. We introduce Grendel, a distributed system designed to partition 3DGS parameters and parallelize computation across multiple GPUs. As each Gaussian affects a small, dynamic subset of rendered pixels, Grendel employs sparse all-to-all communication to transfer the necessary Gaussians to pixel partitions and performs dynamic load balancing. Unlike existing 3DGS systems that train using one camera view image at a time, Grendel supports batched training with multiple views. We explore various optimization hyperparameter scaling strategies and find that a simple sqrt(batch_size) scaling rule is highly effective. Evaluations using large-scale, high-resolution scenes show that Grendel enhances rendering quality by scaling up 3DGS parameters across multiple GPUs. On the “Rubble” dataset, we achieve a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU.

H. Weng and D. Lu—Contributed equally to this work.



References

  1. Barron, J., Mildenhall, B., Verbin, D., Srinivasan, P., Hedman, P.: Mip-nerf 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022). https://doi.org/10.1109/CVPR52688.2022.00539

  2. Busbridge, D., et al.: How to scale your EMA. In: NeurIPS (2023). https://openreview.net/forum?id=DkeeXVdQyu

  3. Ginsburg, B., Gitman, I., You, Y.: Large batch training of convolutional networks with layer-wise adaptive rate scaling (2018). https://openreview.net/forum?id=rJ4uaX2aW

  4. Goyal, P., et al.: Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)

  5. Granziol, D., Zohren, S., Roberts, S.: Learning rates as a function of batch size: a random matrix theory approach to neural network training. J. Mach. Learn. Res. 23(173), 1–65 (2022)


  6. Hedman, P., Philip, J., Price, T., Frahm, J.M., Drettakis, G., Brostow, G.: Deep blending for free-viewpoint image-based rendering. ACM Trans. Graph. (Proc. SIGGRAPH Asia) (2018)


  7. Huang, Y., et al.: Gpipe: efficient training of giant neural networks using pipeline parallelism. In: NeurIPS (2019). https://proceedings.neurips.cc/paper_files/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf

  8. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4) (2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  9. Kerbl, B., Meuleman, A., Kopanas, G., Wimmer, M., Lanvin, A., Drettakis, G.: A hierarchical 3d gaussian representation for real-time rendering of very large datasets. ACM Trans. Graph. (2024). https://repo-sam.inria.fr/fungraph/hierarchical-3d-gaussians/

  10. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. In: ICLR (2017). https://openreview.net/forum?id=H1oyRlYgg

  11. Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph. (2017)


  12. Li, R., Fidler, S., Kanazawa, A., Williams, F.: Nerf-xl: scaling nerfs with multiple gpus (2024)


  13. Li, S., et al.: Pytorch distributed: experiences on accelerating data parallel training. In: VLDB (2020)


  14. Li, S., et al.: Surge phenomenon in optimal learning rate and batch size scaling. arXiv preprint arXiv:2405.14578 (2024)

  15. Li, Y., et al.: Matrixcity: a large-scale city dataset for city-scale neural rendering and beyond. In: ICCV (2023)


  16. Li, Z., Malladi, S., Arora, S.: On the validity of modeling SGD with stochastic differential equations (SDEs). In: NeurIPS (2021). https://openreview.net/forum?id=goEdyJ_nVQI

  17. Lin, J., et al.: Vastgaussian: vast 3d gaussians for large scene reconstruction. In: CVPR (2024)


  18. Liu, Y., Guan, H., Luo, C., Fan, L., Peng, J., Zhang, Z.: Citygaussian: real-time high-quality large-scale scene rendering with gaussians. In: CVPR (2024)


  19. Lu, T., et al.: Scaffold-gs: structured 3d gaussians for view-adaptive rendering. In: CVPR (2024)


  20. Malladi, S., Lyu, K., Panigrahi, A., Arora, S.: On the SDEs and scaling rules for adaptive gradient algorithms. In: NeurIPS (2022). https://openreview.net/forum?id=F2mhzjHkQP

  21. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)


  22. Narayanan, D., et al.: Pipedream: generalized pipeline parallelism for dnn training. In: SOSP (2019)


  23. Narayanan, D., et al.: Efficient large-scale language model training on gpu clusters using megatron-lm. In: SOSP (2021)


  24. NERSC: Perlmutter architecture. https://docs.nersc.gov/systems/perlmutter/architecture/, Accessed 22 May 2024

  25. Qiao, A., et al.: Pollux: co-adaptive cluster scheduling for goodput-optimized deep learning. In: OSDI (2021)


  26. Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: memory optimizations toward training trillion parameter models. In: SC (2020)


  27. Ren, K., et al.: Octree-gs: towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898 (2024)

  28. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: training multi-billion parameter language models using model parallelism. In: SC (2020)


  29. Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: scalable construction of large-scale nerfs for virtual fly-throughs. In: CVPR (2022)


  30. Wang, M., Huang, C.c., Li, J.: Supporting very large models using automatic dataflow graph partitioning. In: EuroSys (2019)


  31. Xu, Y., et al.: Gspmd: general and scalable parallelization for ml computation graphs. In: arXiv:2105.04663 (2021)

  32. You, Y., et al.: Large batch optimization for deep learning: training bert in 76 minutes. In: ICLR (2020). https://openreview.net/forum?id=Syx4wnEtvH

  33. Yuanbo, X., et al.: Bungeenerf: progressive neural radiance field for extreme multi-scale scene rendering. In: ECCV (2022)


  34. Zhao, Y., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel (2023)


  35. Zheng, L., et al.: Alpa: automating inter-and intra-operator parallelism for distributed deep learning. In: OSDI (2022)



Acknowledgements

We thank Xichen Pan and Youming Deng for their help on paper writing. We thank Matthias Niessner for his insightful and constructive feedback on our manuscript. We thank Yixuan Li and Lihan Jiang from the MatrixCity team for their assistance in providing initial data points of their dataset. We thank Kaifeng Lyu for discussions on Adam training dynamics analysis. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231.

Author information


Corresponding author

Correspondence to Hexu Zhao .


Appendices

A Additional Preliminaries and Observations Details

This appendix provides additional information about 3DGS, beyond what was covered in Sect. 2.

A.1 Densification Process

Densification is the process by which 3DGS adds Gaussians to improve detail in a particular region. A Gaussian that shows significant positional gradient variance across training steps may be either cloned or split; the decision depends on whether its scale exceeds a threshold. Hyperparameters determine the start and stop iterations for densification, its frequency, the gradient threshold for initiating densification, and the scale threshold that decides between splitting and cloning. To create more Gaussians, we increase the stop iteration and frequency and decrease the gradient threshold. To capture finer details with smaller Gaussians, we lower the scale threshold so that more Gaussians are split. The training process also includes pruning strategies, such as eliminating Gaussians with low opacity and periodically resetting opacity to remove redundant Gaussians.
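As a toy illustration of the clone-versus-split decision described above, the following sketch applies a gradient threshold and a scale threshold to a set of 1D Gaussians. The function name, threshold values, and the 1.6 shrink factor are illustrative assumptions, not the exact values or geometry used by 3DGS:

```python
import numpy as np

def densify(positions, scales, grad_norms, grad_thresh=0.0002, scale_thresh=0.01):
    """Toy densification pass: Gaussians with a large positional gradient are
    split if they are large (two smaller children) or cloned if they are small."""
    out_pos, out_scale = [], []
    for p, s, g in zip(positions, scales, grad_norms):
        out_pos.append(p)
        out_scale.append(s)
        if g <= grad_thresh:
            continue  # gradient too small: leave the Gaussian unchanged
        if s > scale_thresh:
            # split: shrink the original and add a displaced child
            out_scale[-1] = s / 1.6
            out_pos.append(p + s)
            out_scale.append(s / 1.6)
        else:
            # clone: duplicate the small Gaussian in place
            out_pos.append(p)
            out_scale.append(s)
    return np.array(out_pos), np.array(out_scale)
```

With three Gaussians where the first qualifies for cloning, the second for splitting, and the third for neither, the output contains five Gaussians.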

A.2 Z-Buffer

The indices of the Gaussians intersecting each pixel are stored in a Z-buffer, which is used in both the forward and backward passes and serves as the bridge between the view-dependent Gaussian transformation and pixel rendering. Since a single Gaussian can project onto many pixels within its footprint, the total size of all pixels' Z-buffers exceeds both the number of Gaussians and the number of pixels. The Z-buffer, together with the auxiliary buffers needed to sort it, consumes significant activation memory and can cause out-of-memory (OOM) errors when the resolution, scene size, or batch size increases.
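The memory pressure can be sketched by counting per-pixel Z-buffer entries. For simplicity this sketch assumes square screen-space footprints on integer pixel coordinates (real 3DGS uses elliptical footprints and tile-based rasterization); even so, the total entry count easily exceeds both the Gaussian count and the pixel count:

```python
def zbuffer_entries(centers, radii, width, height):
    """Count total per-pixel Z-buffer entries: each Gaussian adds one entry
    for every pixel inside its (square, clipped-to-screen) footprint."""
    total = 0
    for (cx, cy), r in zip(centers, radii):
        x0, x1 = max(0, cx - r), min(width - 1, cx + r)
        y0, y1 = max(0, cy - r), min(height - 1, cy + r)
        total += (x1 - x0 + 1) * (y1 - y0 + 1)
    return total
```

A single Gaussian of radius 2 centered inside a 16x16 image contributes 25 entries; near the image border its footprint is clipped.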

A.3 Mixed Parallelism

Some steps of 3DGS are not mentioned in the main text, but they can also be parallelized. The Gaussian-transformation backward pass and the optimizer's gradient updates are Gaussian-wise computations and are distributed in the same way as the Gaussian-transformation forward pass. Similarly, the render backward and loss backward computations are pixel-wise and are distributed like the render forward pass.

Regarding memory, each Gaussian has independent transformed states, gradients, optimizer states, and parameters for each camera view, so we store these states on the GPU that holds the corresponding parameters. Activation states, such as the large Z-buffers, the auxiliary buffers used for sorting, and the intermediate activations of the loss, are managed pixel-wise along with the image distribution.

Regarding the densification mechanism, since Gaussians are cloned, split, or pruned independently based on their variance, we perform this process locally on the GPU that stores them.

A.4 Dynamic Unbalanced Workloads

Physical scenes are naturally sparse at a global scale, and different areas have different densities of 3D Gaussians (e.g., the sky versus a tree). The rendering intensity therefore varies not only from pixel to pixel within an image but also across images, leading to workload imbalance.

Moreover, Gaussian parameters change continuously during training. Changes to the 3D position and covariance parameters affect each Gaussian's pixel coverage on the screen, and changes to the opacity parameters affect the number of Gaussians contributing to each pixel; both change the rendering intensity. The densification process targets areas still under construction: simpler scene elements are completed first, allowing more complex parts to be progressively densified, so Gaussians in different regions densify at different rates. The dynamic nature of the workload is more pronounced at the beginning of training, which initially focuses on constructing the global structure before filling in local details.

The different computational steps also have distinct workload characteristics. While the rendering computation is dynamic and unbalanced, the loss calculation has consistent intensity across pixels, and the view-dependent transformation has uniform intensity across Gaussians. Render forward and backward even exhibit different patterns of imbalance and dynamicity: the forward pass's complexity scales with the number of Gaussians intersecting the ray, whereas the backward pass's complexity depends on the Gaussians that contributed to the color and loss before opacity saturation, typically those on the first surface. Consequently, the running times of render forward and backward and of loss forward and backward have different dominating factors, and every step takes a significant amount of time.

B Additional Design Details

B.1 Scheduling Granularity: Pixel Block Size

In our design, we organize the pixels of all images in a batch into a single row, divide this row into contiguous parts, and assign one part to each GPU. If pixels were scheduled individually, the scheduler's computation overhead would be very large, so we instead group pixels into blocks of 16 by 16 pixels, lay the blocks out in a row, and allocate blocks. The block size is the scheduling granularity, a trade-off between scheduler overhead and the uneven workloads caused by coarser blocks. After scheduling, each GPU holds a 2D boolean array, compute_locally[i][j], indicating whether the pixel block in the i-th row and j-th column should be computed locally; it then renders only the pixels within blocks where compute_locally is true.
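A minimal sketch of this block-level scheduling, under the assumption that per-block costs are known (uniform here) and that each GPU receives a contiguous run of blocks with roughly equal total cost; the function name and cost model are illustrative:

```python
import numpy as np

def assign_blocks(img_h, img_w, n_gpus, block=16, block_costs=None):
    """Lay out the 16x16 pixel blocks of an image in a row and give each GPU
    a contiguous run of blocks with roughly equal total cost. Returns one
    boolean mask per GPU: mask[i][j] says whether that GPU renders block (i, j)."""
    rows = (img_h + block - 1) // block
    cols = (img_w + block - 1) // block
    costs = (np.ones(rows * cols) if block_costs is None
             else np.asarray(block_costs, dtype=float).ravel())
    cum = np.cumsum(costs)
    # Place cost-balanced division points: each block goes to the GPU whose
    # equal-cost slice of the total contains the block's midpoint prefix sum.
    owner = np.minimum((cum - costs / 2) / cum[-1] * n_gpus,
                       n_gpus - 1).astype(int)
    return [(owner == g).reshape(rows, cols) for g in range(n_gpus)]
```

With uniform costs, a 64x64 image (4x4 blocks) split across 4 GPUs gives each GPU one row of four blocks.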

B.2 Gaussian Distribution Rebalance

An important observation is that distributing pixels to balance runtime does not necessarily balance the number of Gaussians each GPU touches during rendering, so to minimize total communication volume, GPUs may need to store different numbers of Gaussians according to the formula above. Specifically, only the forward-pass runtime correlates directly with the number of touched Gaussians; the pixel-wise loss calculation depends on the number of pixels, and the render backward pass depends on the number of Gaussians that actually contribute to the rendered pixel colors. In our experiments, random redistribution leads to the fastest training, even though its overall communication volume is not minimal: we use NCCL all-to-all as the underlying communication primitive, which favors uniform send and receive volumes across GPUs. With a communication primitive that cares only about total communication volume, a different redistribution strategy might be preferable.
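A minimal sketch of the random redistribution strategy, assuming Gaussians are identified by integer ids (the function name and single-process setup are illustrative; the real system moves parameters between GPUs): shuffling all ids globally and dealing them back out keeps per-GPU counts, and hence all-to-all send/receive volumes, nearly uniform.

```python
import numpy as np

def rebalance_random(gaussian_ids_per_gpu, seed=0):
    """Shuffle all Gaussian ids globally and deal them back out in
    near-equal shares, so every GPU ends up with an almost identical count."""
    rng = np.random.default_rng(seed)
    all_ids = np.concatenate(gaussian_ids_per_gpu)
    rng.shuffle(all_ids)
    return np.array_split(all_ids, len(gaussian_ids_per_gpu))
```

Starting from a skewed layout of 10, 2, and 8 Gaussians on three GPUs, the rebalanced shares have sizes 7, 7, and 6 while preserving the full id set.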

Algorithm 1. Calculation of Division Points

B.3 Empirical Evidence of Independent Gradients

To test whether the Independent Gradients Hypothesis holds in practice, we analyze the average per-parameter variance of the gradients in real-world settings. In Fig. 13, we plot the sparsity and variance of the gradients of the diffuse color parameters against the batch size, starting from pre-trained checkpoints on the “Rubble” dataset [29]. We find that the inverse of the variance increases roughly linearly before transitioning into a plateau, and we observe this behavior at all three checkpoint iterations, representing the early, middle, and late training stages. The initial linear increase of the precision suggests that gradients are roughly uncorrelated at the batch sizes used in this work (up to 32), supporting the hypothesis. It is worth noting, however, that even though a single image has sparse gradients, with many images in a batch the gradients overlap and become less sparse; they also become more correlated, because images with similar poses are expected to produce similar gradients.
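The linear growth of gradient precision under independence can be illustrated with a small Monte Carlo sketch using synthetic i.i.d. per-sample gradients (not the real 3DGS gradients): for uncorrelated samples, the variance of the batch-mean gradient shrinks as 1/batch_size, so its inverse grows linearly.

```python
import numpy as np

def grad_precision(per_sample_grads, batch_size, rng, trials=2000):
    """Inverse variance of the batch-mean gradient, estimated over random
    batches drawn with replacement. If per-sample gradients are independent,
    this precision grows roughly linearly with the batch size."""
    means = []
    for _ in range(trials):
        idx = rng.integers(0, len(per_sample_grads), size=batch_size)
        means.append(per_sample_grads[idx].mean())
    return 1.0 / np.var(means)
```

For i.i.d. synthetic gradients, the precision at batch size 8 is about eight times the precision at batch size 1, up to Monte Carlo noise; correlated gradients would fall short of this ratio.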

Fig. 13. Batch size vs. gradient sparsity, variance, and precision (inverse variance) at iterations 7,000, 15,000, and 30,000. Gradients are roughly uncorrelated in practice: on the “Rubble” dataset [29], the inverse of the average parameter variance first increases linearly, then rises to a plateau, suggesting that the gradients are roughly uncorrelated initially but become less so as the batch size grows. Averaged over 32 random trials.

Fig. 14. Cumulative update direction (cosine similarity) and cumulative update magnitude (norm ratio) versus the number of training images, for different batch sizes and scaling rules. We plot the training trajectories of the diffuse color parameters on “Rubble” when training with batch size \(\in [4, 16, 32]\) under different learning rate and momentum scaling strategies. Cumulative weight updates using the square-root learning rate scaling rule (a) and the exponential momentum scaling rule (b) maintain high cosine similarity to batch-size-1 updates and have norms that are roughly invariant to the batch size.

B.4 Empirical Testing of Proposed Scaling Rules

To empirically test the proposed learning rate and momentum scaling rules, we train the “Rubble” scene to iteration 15,000 with a batch size of 1, then reset the Adam optimizer states and continue training with different batch sizes. Figure 14 compares how well different learning-rate and momentum scaling rules maintain a similar training trajectory when switching to larger batch sizes. Since different parameter groups of 3DGS have vastly different magnitudes, we focus on one specific group, the diffuse color, to make the comparisons meaningful. Figure 14a compares three learning rate scaling rules \(\in \) [constant, sqrt, linear]; only our proposed “sqrt” rule maintains a high update cosine similarity and a similar update magnitude across batch sizes. Similarly, Fig. 14b shows that our proposed exponential momentum scaling rule keeps the update cosine similarity higher than the alternative of leaving the momentum coefficients unchanged (Table 2).
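As we read the two rules, they can be summarized in a few lines; the function name is illustrative, and the exact exponential form of the momentum scaling is our interpretation of the rule described above, so treat it as a sketch rather than the paper's precise recipe:

```python
import math

def scale_hyperparams(lr, betas, batch_size):
    """Sketch of the scaling rules: the learning rate grows as sqrt(batch_size),
    and Adam's momentum coefficients are scaled exponentially (raised to the
    batch size) so their effective averaging horizon, measured in training
    images rather than steps, stays roughly fixed."""
    scaled_lr = lr * math.sqrt(batch_size)
    scaled_betas = tuple(b ** batch_size for b in betas)
    return scaled_lr, scaled_betas
```

For example, moving from batch size 1 to 4 doubles the learning rate and lowers beta1 from 0.9 to 0.9^4, shortening the per-step momentum horizon to compensate for the fourfold reduction in steps per epoch.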

C Additional Experiments Setting and Statistics

C.1 Statistics for the Mip-NeRF 360, Tanks & Temples, and DeepBlending Datasets

Table 2. Performance Comparison Between Non-Distribution and 4 GPU Distribution

C.2 Scalability

Tables 3, 4 and 5 show the increased reconstruction quality obtained with more Gaussians. While many hyperparameters influence the number of Gaussians created by densification, we focus on adjusting three: (1) the stop iteration for densification, (2) the gradient threshold for initiating densification, and (3) the scale threshold for deciding whether to split or clone a Gaussian. Initially, we gradually increased the densification stop iteration to 5,000 iterations; however, due to the pruning mechanism, this adjustment alone proved insufficient, so we also lowered the two thresholds to generate more Gaussians. For a fair comparison, all other densification parameters, such as the interval, start iteration, and opacity reset interval, were kept constant. For the Rubble scene, each experiment ran for the same 125 epochs, exposing the model to 200,000 images. Although training larger models for longer durations and lowering the positional learning rate improved results in our observations, we kept training steps and learning rates consistent across all experiments to ensure fairness.

Tables 6 and 7 show the throughput scalability from increasing the batch size and leveraging more GPUs, for the Rubble and Train scenes, respectively. Essentially, more GPUs and a larger batch size yield higher throughput.

Table 8 demonstrates that additional GPUs increase the memory available for Gaussians, evaluated on the Rubble scene with various batch sizes reflecting different levels of activation memory usage. Essentially, more GPUs provide additional memory to store Gaussians, while a larger batch size increases activation memory usage, leaving less memory available for Gaussians.

Table 3. Scalability on Rubble: Gaussian Quantity, Results and Hyperparameter Settings
Table 4. MatrixCity Block_All Statistics: Gaussian Quantity, Results and Hyperparameter Settings
Table 5. Bicycle Statistics: Gaussian Quantity, Results and Hyperparameter Settings
Table 6. Scalability on Rubble: Speedup from More GPUs and Larger Batch Sizes
Table 7. Scalability on Train: Speedup from More GPUs and Larger Batch Sizes
Table 8. Scalability on Rubble: More Available Memory with More GPUs


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Zhao, H. et al. (2025). On Scaling Up 3D Gaussian Splatting Training. In: Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T. (eds) Computer Vision – ECCV 2024 Workshops. ECCV 2024. Lecture Notes in Computer Science, vol 15645. Springer, Cham. https://doi.org/10.1007/978-3-031-91989-3_2


