Deployment Considerations documentation (#9933)
Conversation
I really like the way this has been laid out. It talks through a lot of important factors that advanced users will want to think about, especially when rolling Dask out within orgs. As you say, much of this is out of scope for core Dask.
I like how it starts with the challenges and then neatly directs users off to other projects and commercial offerings that solve those problems.
> Thanks to the efforts of the open-source community, there are tools to deploy Dask :ref:`pretty much anywhere <deployment-options>`—if you can get computers to talk to each other, you can probably turn them into a Dask cluster.
> **However, getting Dask running is often not the last step, but the first step.** This document attempts to cover some of the things *outside of Dask* you may have to think about when managing a Dask deployment.
I like how this is clearly setting the stage that these things are out of scope or on the periphery for Dask.
> Additional challenges can include getting local packages or scripts onto the cluster (and ensuring they're up to date), as well as packages installed from private Git or PyPI repos.
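To make the "local scripts onto the cluster" point concrete, here's a minimal sketch using `Client.upload_file` — the module name and single-worker local cluster are stand-ins for illustration, not a production setup:

```python
import pathlib
from distributed import Client, LocalCluster

# Stand-in for your real local code: write a tiny module to disk.
pathlib.Path("my_helpers.py").write_text("def double(x):\n    return 2 * x\n")

# Stand-in cluster; in a real deployment you'd connect to an existing scheduler.
cluster = LocalCluster(n_workers=1, processes=False)
client = Client(cluster)

# Ship the module to every current worker so tasks can import it.
client.upload_file("my_helpers.py")

def use_helper(x):
    import my_helpers  # resolves on the worker thanks to upload_file
    return my_helpers.double(x)

result = client.submit(use_helper, 21).result()
client.close()
cluster.close()
```

Note the "ensuring they're up to date" caveat from the quoted line still applies: `upload_file` is a one-shot push, so re-uploading (or a worker plugin) is needed after every code change.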
> Observability
This is super important, but I feel like this is one of the last things folks think about. I would probably move this below other sections like cost and credentials.
That was actually why I moved it up here; I know people don't usually think about it up front, but I wanted to make it more prominent since it's so important.
Also note that I mixed log retention in with metrics. Maybe those are worth splitting; I think log retention should be quite high (you're really not going to have a good time if you don't even keep logs around), but metrics usually come a bit later in your deployment journey.
> - What are we spending it on? (machines, machines that should have been turned off, network egress that shouldn't have happened, etc.)
> - Who/what is responsible?
> Non-commercial deployment tools generally don't build in this sort of monitoring. Organizations that need it either end up building their own tools, or turning to commercial deployment offerings.
Maybe soften this a bit.
```suggestion
Many deployment tools generally don't build in this sort of monitoring. Organizations that need it either end up building their own tools, or turning to commercial deployment offerings.
```
I'd originally written that, but then couldn't think of any non-commercial tools that actually did have built-in capabilities for cost monitoring. Is there something I'm not thinking of?
Arguably, Coiled doesn't even do what I've described here. Coiled can tell you how you spent your Coiled bill, but for your AWS bill, you still have to look for yourself in Cost Explorer (though this is facilitated by tags Coiled adds to all your Dask infrastructure).
> You may also have other systems on restricted networks that workers need to access to read and write data, or call APIs. Connecting to those networks could add additional complexity.
> Some organizations may have additional network security policies, such as requiring all traffic to be encrypted. Dask supports this with :doc:`TLS <tls>`, which requires additional configuration and certificate management.
Side note, but Dask Cloud Provider turns this on by default. I wonder if we should do that in more deployment tooling and make it more of an opt-out.
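For context, the "additional configuration" the quoted line alludes to looks roughly like this sketch — the certificate paths are placeholders, and these same keys can equivalently live in a Dask YAML config file:

```python
import dask

# Placeholder certificate paths; a real deployment must also provision,
# distribute, and rotate these certs, which is often the harder part.
dask.config.set({
    "distributed.comm.require-encryption": True,
    "distributed.comm.tls.ca-file": "/certs/ca.pem",
    "distributed.comm.tls.scheduler.cert": "/certs/scheduler.pem",
    "distributed.comm.tls.scheduler.key": "/certs/scheduler-key.pem",
    "distributed.comm.tls.worker.cert": "/certs/worker.pem",
    "distributed.comm.tls.worker.key": "/certs/worker-key.pem",
})
```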
Co-authored-by: Jacob Tomlinson <jacobtomlinson@users.noreply.github.com>
@jacobtomlinson I think I've addressed your comments!
jacobtomlinson left a comment:
Thanks for writing this up @gjoseph92
This document tries to cover some of the infrastructure challenges outside of Dask that people commonly run into when setting up serious (production, multi-tenant) Dask deployments. The goal is to give people who might be thinking of setting one up a more realistic picture of what it takes to run a production-grade Dask deployment.
This is spun out from https://github.com/dask/dask/pull/9912/files#r1096578227, and based loosely on @mrocklin's PyData NYC talk: https://www.youtube.com/watch?v=5hUkUj1VYW4.
cc @scharlottej13 @jacobtomlinson