[rllib] Fix torch TD error, IMPALA LR updates#9477
Can one of the admins verify this patch?
Test PASSed.
trainer = impala.ImpalaTrainer(config=local_cfg, env="CartPole-v0")

def get_lr(result):
    return result["info"]["learner"]["default_policy"]["cur_lr"]
This is a bit janky... is there a better way to access cur_lr?
Fine with me, but yeah: trainer.get_policy().cur_lr would be shorter.
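The nested lookup from the test can be wrapped in a small helper that fails softly when a key is missing. This is a minimal sketch, assuming the result dict has the nesting shown in the test (`info`, `learner`, `default_policy`, `cur_lr`); `get_cur_lr` and `sample_result` are hypothetical names used for illustration and are not part of RLlib.

```python
from typing import Any, Optional


def get_cur_lr(result: dict, policy_id: str = "default_policy") -> Optional[float]:
    """Return cur_lr from a nested training result, or None if absent.

    Hypothetical helper: the key path mirrors the one used in the test,
    but the function itself is not part of RLlib.
    """
    node: Any = result
    for key in ("info", "learner", policy_id, "cur_lr"):
        # Stop early if the expected level is missing or not a dict.
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Illustrative result dict (made-up value), shaped like the one in the test.
sample_result = {"info": {"learner": {"default_policy": {"cur_lr": 0.0005}}}}
print(get_cur_lr(sample_result))
print(get_cur_lr({"info": {}}))
```

Returning None instead of raising a KeyError keeps a failing test focused on the learning-rate assertion itself. The `trainer.get_policy().cur_lr` form suggested in the review remains the shorter option when a trainer object is at hand.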
sven1977 left a comment
All good, just the tests are still failing with e.g.:
AttributeError: 'PPOTorchPolicy' object has no attribute '_optimizers'
"mean_q": torch.mean(q_t_selected),
"min_q": torch.min(q_t_selected),
"max_q": torch.max(q_t_selected),
"td_error": self.td_error,
yeah, this was annoying. Thanks for removing it.
sven1977 left a comment
Looks good. Just please make sure the tests are ok.
Test FAILed.
Test FAILed.
[rllib] Fix torch TD error, IMPALA LR updates (ray-project#9477)
* update
* add test
* lint
* fix super call
* speed es test up
* Set up CI with Azure Pipelines
Specifically, we are setting up a
Travis-like ADO pipeline that follows
what is already present in the .travis.yml
file in the root of the repo.
* Separating travis like pipeline from main pipeline
* Adding Jenkins jobs equivalent
* Making some improvements
* Adding validation of the upstream CI
* Disabling Tune and large memory tests
* Changing threshold for simple reservoir sampling test
* Addressing comments
* Updating Azure Pipelines with travis updates
* Updating Azure Pipelines with more travis updates
* Updating CI with new cpp worker tests
* Setting code owners
* Fixing the version number generation
* Making main pipeline also our release pipeline
* Updating Azure Pipelines with travis updates
* Fixing wheels test
* Fixing codeowners
* Updating Azure Pipelines with travis updates
* Bumping up MACOSX_DEPLOYMENT_TARGET
* Updating Azure Pipelines with travis updates
* Updating Azure Pipelines with travis updates
* Updating Azure Pipelines with travis updates
* Disabling Serve tests
* Making explicit which branches GitHubActions workflows should watch
* Disabling Ray Serve tests
* Installing numpy explicitly
* consolidating Ray test steps in one yml
* Syncing with upstream master 2020-07-30 (#21)
* [Core] Enhance common client connection (#9367)
* enhance client connection
* add write buffer async
* read message
* add test
* Bazel move more shell to native rules (#9314)
Co-authored-by: Mehrdad <noreply@github.com>
* [tune] Fix github readme (#9365)
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
* Combine different severities into the same log files (#9230)
* Combine different severities into the same log files
Co-authored-by: Mehrdad <noreply@github.com>
* [core] Pass owner address from the workers to the raylet (#9299)
* Add intended worker ID to GetObjectStatus, tests
* Remove TaskID owner_id
* lint
* Add owner address to task args
* Make TaskArg a virtual class, remove multi args
* Set owner address for task args
* merge
* Fix tests
* Add ObjectRefs to task dependency manager, pass from task spec args
* tmp
* tmp
* Fix
* Add ownership info for task arguments
* Convert WaitForDirectActorCallArgs
* lint
* build
* update
* build
* java
* Move code
* build
* Revert "Fix Google log directory again (#9063)"
This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.
* Fix free
* fix tests
* Fix tests
* build
* build
* fix
* Change assertion to warning to fix java
* [Core] Add placement group scheduler and some api in resource scheduler (#9039)
* Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).
* change the bundle id and delete unit count in bundle
change vector<bundle_spec> to vector<shared_ptr<bundle_spec>>
Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).
change the bundle id and delete unit count in bundle
remove CheckIfSchedulable()
add comments and fix the bug in resource
* fix placement group schedule
* add placement group scheduler and change some api in resource scheduler
* fix by the comments
* fix conflict
* fix lint
* fix lint
* fix bug in merge
* fix lint
Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>
* [Core] New scheduler fixes (#9186)
* .
* test_args passes
* .
* test_basic.py::test_many_fractional_resources causes ray to hang
* test_basic.py::test_many_fractional_resources causes ray to hang
* .
* .
* useful
* test_many_fractional_resources fails instead of hanging now :)
* Passes test_fractional_resources
* .
* .
* Some cleanup
* git is hard
* cleanup
* Fixed scheduling tests
* .
* .
* [Core] put small objects in memory store (#8972)
* remove the put in memory store
* put small objects directly in memory store
* cast data type
* fix another place that uses Put to spill to plasma store
* fix multiple tests related to memory limits
* partially fix test_metrics
* remove not functioning codes
* fix core_worker_test
* refactor put to plasma codes
* add a flag for the new feature
* add flag to more places
* do a warmup round for the plasma store
* lint
* lint again
* fix warmup store
* Update _raylet.pyx
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* [autoscaler] Move command runners into separate file and clean up interface. (#9340)
* cleanup
* wip
* fix imports
* fix lint
* [docs][rllib] Recommended workflow for training, saving, and testing (#9319)
* [autoscaler] Allow users to disable the cluster config cache (#8117)
* [autoscaler] Remove autoscaler config cache.
* [autoscaler] Add flag allowing users to explicitly disable the config cache.
* Update hiredis and remove Windows patches (#9289)
Co-authored-by: Mehrdad <noreply@github.com>
* Fix flaky test_dynres.py (#9310)
* Fix gcs_table_storage testcase bug (#9393)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
* [HOTFIX] Fix compile direct_actor_transport_test on mac (#9403)
* Change Python's `ObjectID` to `ObjectRef` (#9353)
* [Java] Improve JNI performance when submitting and executing tasks (#9032)
* Remove the RAY_CHECK in Worker::Port() (#9348)
* [RLlib] Issue #9366 (DQN w/o dueling produces invalid actions). (#9386)
* Fix macos compilation bug (#9391)
* Fix.
* [Core] Plasma RAII support (#9370)
* [Serve] Merge router with HTTPProxy (#9225)
* Pass run args to DockerCommandRunner (#9411)
* Fix copy to workspace (#9400)
* [RLlib] Tf2.x native. (#8752)
* Update conda and ray wheel on GCP images (#9388)
* [Core] Simplify Raylet Client (#9420)
* Masking error. With t*valid_mask, we get the error np.inf*0 = np.inf (#9407)
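To illustrate the masking problem named in the entry above, here is a minimal sketch in plain Python. It is not the RLlib code, and the function names are hypothetical. Under IEEE 754 arithmetic, infinity times zero evaluates to NaN, so multiplying by a 0/1 mask does not neutralize infinite entries, whereas selecting by the mask does.

```python
import math


def masked_sum_multiply(values, valid_mask):
    """Sum after multiplying by a 0/1 mask (hypothetical helper).

    An infinite value at a masked-out position yields inf * 0 = nan,
    which then poisons the whole sum.
    """
    return sum(v * m for v, m in zip(values, valid_mask))


def masked_sum_select(values, valid_mask):
    """Sum only the entries the mask marks as valid (hypothetical helper).

    Masked-out entries are never touched, so an infinite value there
    cannot leak into the result.
    """
    return sum(v for v, m in zip(values, valid_mask) if m)


values = [1.0, 2.0, math.inf]
valid_mask = [1, 1, 0]  # the infinite entry is marked invalid

poisoned = masked_sum_multiply(values, valid_mask)
clean = masked_sum_select(values, valid_mask)
print(math.isnan(poisoned), clean)
```

Array libraries offer the same choice between multiplying by a mask and selecting with it; which form the actual fix used is not stated in this log.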
* [RLLib] WindowStat bug fix (#9213)
* WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue #7910.
https://github.com/ray-project/ray/issues/7910
* [tune] handling nan values (#9381)
* TRAVIS_PULL_REQUEST is false for non-PRs, not empty (#9439)
Co-authored-by: Mehrdad <noreply@github.com>
* [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (#9422)
* [Tune] Trainable documentation fix (#9448)
* Allow --lru-evict to be passed into `ray start` (#8959)
* GCP authentication using oauth tokens (#9279)
* Bazel selects compiler flags based on compiler (#9313)
Co-authored-by: Mehrdad <noreply@github.com>
* [Core] Build raylet client as an independent component (#9434)
* [tune] sklearn comment out (#9454)
* Add ability to specify SOCKS proxy for SSH connections (#8833)
* [docs] Render ActorPool documentation, etc (#9433)
* [tune] Put examples under proper version control (#9427)
Co-authored-by: krfricke <krfricke@users.noreply.github.com>
* Fix test-multi-node (#9453)
* Machine View Sorting / Grouping (#9214)
* Convert NodeInfo.tsx to a functional component
* Update NodeRowGroup to be a functional component
* lint
* Convert TotalRow to functional component.
* lint
* move node info over to using the sortable table head component. spacing is still a little wonky.
* Factor a NodeWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping
* Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer
* Add sort accessors for CPU
* Add sort accessors for Disk
* Add sort accessors for RAM
* add a table sort util for function based accessors (rather than flat attribute-based accessor)
* wip refactor node info features
* wip
* Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic
* wip
* wip
* wip
* Finish adding sorting and grouping of machine view
* lint
* fix bug in filtration of logs and errors by worker from recent refactor.
* Add export of Cluster Disk feature
* fix some merge issues
Co-authored-by: Max Fitton <max@semprehealth.com>
* [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (#9269)
* [RLlib] Issue 9402 MARWIL producing nan rewards. (#9429)
* Fix gcs_pubsub_test bug (#9438)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
* change error code name of boost timer (#9417)
* [tune] PyTorch CIFAR10 example (#9338)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
* Remove legacy C++ code (#9459)
* Fix ObjectRef and ActorHandle serialization (#9462)
* [Stats] metrics agent exporter (#9361)
* [Core] Support GCS server port assignment. (#8962)
* Add scripts symlink back (#9219) (#9475)
(cherry picked from commit 77933c922d5136c5c2e2f0ac2edb4da67111d690)
Co-authored-by: Simon Mo <xmo@berkeley.edu>
* [tune] Issue 8821: ExperimentAnalysis doesn't expand user (#9461)
* [docker] Include base-deps image in rayproject Docker Hub (#9458)
* [Core] remove create_and_seal and create_and_seal_batch (#9457)
* Speedups for GitHub Actions (#9343)
Co-authored-by: Mehrdad <noreply@github.com>
* Fix flaky test_object_manager.py (#9472)
* [Java] fix redis-server binary path (#9398)
* [core] Handle out-of-order actor table notifications (#9449)
* Drop stale actor table notifications
* build
* Add num_restarts to disconnect handler
* Unit test and increment num_restarts on ALIVE, not RESTARTING
* Wait for pid to exit
* Fix name clash on Windows (#9412)
Co-authored-by: Mehrdad <noreply@github.com>
* Add job configs to gcs (#9374)
* Make pip install verbose (#9496)
Co-authored-by: Mehrdad <noreply@github.com>
* Make more tests compatible with Windows (#9303)
* [tune] extend PTL template (GPU, typing fixes, tensorboard) (#9451)
Co-authored-by: Kai Fricke <kai@anyscale.com>
* [core] Replace task resubmission in raylet with ownership protocol (#9394)
* Add intended worker ID to GetObjectStatus, tests
* Remove TaskID owner_id
* lint
* Add owner address to task args
* Make TaskArg a virtual class, remove multi args
* Set owner address for task args
* merge
* Fix tests
* Add ObjectRefs to task dependency manager, pass from task spec args
* tmp
* tmp
* Fix
* Add ownership info for task arguments
* Convert WaitForDirectActorCallArgs
* lint
* build
* update
* build
* java
* Move code
* build
* Revert "Fix Google log directory again (#9063)"
This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.
* Fix free
* Regression tests - shorten timeouts in reconstruction unit tests
* Remove timeout for non-actor tasks
* Modify tests using ray.internal.free
* Clean up future resolution code
* Raylet polls the owner
* todo
* comment
* Update src/ray/core_worker/core_worker.cc
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* Drop stale actor table notifications
* Fix bug where actor restart hangs
* Revert buggy code for duplicate tasks
* build
* Fix errors for lru_evict and internal.free
* Revert "Drop stale actor table notifications"
This reverts commit 193c5d20e5577befd43f166e16c972e2f9247c91.
* Revert "build"
This reverts commit 5644edbac906ff6ef98feb40b6f62c9e63698c29.
* Fix free test
* Fixes for freed objects
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* release gil in global state accessor (#9357)
* [Java] Named java actor (#9037)
* Fix clang-cl build (#9494)
Co-authored-by: Mehrdad <noreply@github.com>
* [GCS Actor Management] Gcs actor management broken detached actor (#9473)
* [RLlib] Issue #9437 (PyTorch converts to CPU tensor, even if on GPU). (#9497)
* Get rid of build shell scripts and move them to Python (#6082)
* Fix broken test_raylet_info_endpoint (#9511)
* Fix. (#9464)
* [Autoscaler] Making bootstrap config part of the node provider interface (#9443)
* supporting custom bootstrap config for external node providers
* bootstrap config
* renamed config to cluster_config
* lint
* remove 2 args from importer
* complete move of bootstrap to node_provider
* renamed provider_cls
* move imports outside functions
* lint
* Update python/ray/autoscaler/node_provider.py
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* final fixes
* keeping lines to reduce diff
* lint
* lambda config
* filling in -> adding for lint
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Fix flaky test_actor_failures::test_actor_restart (#9509)
* Fix flaky test
* os exit
* [rllib] MAML Transform (#9463)
* MAML Transform
* Moved Inner Adapt to Method in Execution Plan
* Cleanup Plasma Store (hash utilities) (#9524)
* [Serve] Improve buffering for simple cases (#9485)
* [Serve] Use pickle instead of cloudpickle (#9479)
* Fix pip and Bazel interaction messing up CI (#9506)
Co-authored-by: Mehrdad <noreply@github.com>
* [Core] Fix Java detached error (#9526)
* fix java createActor NPE bug (#9532)
* [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (#9516)
* [Stats] Fix metric exporter test (#9376)
* Hotfix Lint for Serve (#9535)
* Windows cleanup (#9508)
* Remove unneeded code for Windows
* Get rid of usleep()
* Make platform_shims includes non-transitive
Co-authored-by: Mehrdad <noreply@github.com>
* [RLlib] Issue 8384: QMIX doesn't learn anything. (#9527)
* Add placement group manager and some code in core_worker (#9120)
Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>
* [core] Add flag to enable object reconstruction during ray start (#9488)
* Add flag
* doc
* Fix tests
* Pipelining task submission to workers (#9363)
* first step of pipelining
* pipelining tests & default configs
- added pipelining unit tests in direct_task_transport_test.cc
- added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker
- consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_
* post-review revisions
* linting, following naming/style convention
* linting
* [New scheduler] Queueing refactor (#9491)
* .
* test_args passes
* .
* test_basic.py::test_many_fractional_resources causes ray to hang
* test_basic.py::test_many_fractional_resources causes ray to hang
* .
* .
* useful
* test_many_fractional_resources fails instead of hanging now :)
* Passes test_fractional_resources
* .
* .
* Some cleanup
* git is hard
* cleanup
* .
* .
* .
* .
* .
* .
* .
* cleanup
* address reviews
* address reviews
* more refactor
* :)
* travis pls
* .
* travis pls
* .
* [Serve] Add internal instruction for running benchmarks (#9531)
* MADDPG learning confirmation test. (#9538)
* Fix Bazel in Docker (#9530)
Co-authored-by: Mehrdad <noreply@github.com>
* Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (#9539)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
* [tune] Unflattened lookup for ProgressReporter (#9525)
Co-authored-by: Kai Fricke <kai@anyscale.com>
* Add plasma store benchmark for small objects (#9549)
* [Tune] Copy default_columns in new ProgressReporter instances (#9537)
* quickfix (#9552)
* [tune] pin tune-sklearn (#9498)
* [cli] ray memory: added redis_password (#9492)
* [GCS]Fix lease worker leak bug when gcs server restarts (#9315)
* add part code
* fix compile bug
* fix review comments
* fix review comments
* fix review comments
* fix review comments
* fix review comment
* fix ut bug
* fix lint error
* fix review comment
* fix review comments
* add testcase
* add testcase
* fix bug
* fix review comments
* fix review comment
* fix review comment
* refine comments
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
* [tune] fix pbt checkpoint_freq (#9517)
* Only delete old checkpoint if it is not the same as the new one
* Return early if old checkpoint value coincides with new checkpoint value
Co-authored-by: Kai Fricke <kai@anyscale.com>
* [Core] Remove socket pair exchange in Plasma Store (#9565)
* try use boost::asio for notification processing
* [Metric] new cython interface for python worker metric (#9469)
* Bazel fixes (#9519)
* GCS client add fetch operation before subscribe (#9564)
* [RLlib] Fix combination of lockstep and multiple agents controlled by the same policy. (#9521)
* Change aggregation when lockstep is activated.
Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy.
fix ray-project/ray#9295
* Line too long.
* [Core] Replace the Plasma eventloop with boost::asio (#9431)
* Fix Java named actor bug (#9580)
* Fix setup.py bug (#9581)
Co-authored-by: Mehrdad <noreply@github.com>
* [Serve] Serialize Query object directly (#9490)
* Add dashboard dependencies to default ray installation (#9447)
* Dashboard next-version API support in backend (#9345)
* Fix log losses (#9559)
* Close log on shutdown
* Disable log buffering
Co-authored-by: Mehrdad <noreply@github.com>
* [docker] run Ubuntu 20.04 as base image (#9556)
* Add PTL to README.rst (#9594)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Skip unneeded steps on CI (#9582)
Co-authored-by: Mehrdad <noreply@github.com>
* Fix Windows CI (#9588)
Co-authored-by: Mehrdad <noreply@github.com>
* [serve] Rename to `Controller` (#9566)
* Handle warnings in core (#9575)
* [New scheduler] Fix new scheduler bug (#9467)
* fix new scheduler bug
* add testcase for soft resource allocation
* modify RemoveNode
* Ensure unique log file names across same-node raylets. (#9561)
* fix tag key typo (#9606)
* Rename path variable due to zsh conflict (#9610)
* [doc] [minor] Make API docs easier to find. (#9604)
* Issue 9568: `rllib train` framework in config gets overridden with tf. (#9572)
* Use UTF-8 for encoding of python code for collision hashing (#9586)
Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
* Add bazel to the PATH in setup.py (#9590)
Co-authored-by: Mehrdad <noreply@github.com>
* Fix Lint in setup.py (#9618)
Co-authored-by: Mehrdad <noreply@github.com>
* Shellcheck comments (#9595)
* [Serve] Document Metric Infrastructure (#9389)
* [CI] Do not run jenkins test on GHA (#9621)
* Support ray task type checking (#9574)
* [Metrics] Java metric API (#9377)
* [GCS] fix the fault tolerance about gcs node manager (#9380)
* Shellcheck quoting (#9596)
* Fix SC2006: Use $(...) notation instead of legacy backticked `...`.
* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that.
* Fix SC2046: Quote this to prevent word splitting.
* Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching.
* Fix SC2068: Double quote array expansions to avoid re-splitting elements.
* Fix SC2086: Double quote to prevent globbing and word splitting.
* Fix SC2102: Ranges can only match single chars (mentioned due to duplicates).
* Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"?
* Fix SC2145: Argument mixes string and array. Use * or separate argument.
* Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string).
Co-authored-by: Mehrdad <noreply@github.com>
* Fix bug in Bazel version check (#9626)
Co-authored-by: Mehrdad <noreply@github.com>
* [Java] Avoid data copy from C++ to Java for ByteBuffer type (#9033)
* Revert "Dashboard next-version API support in backend (#9345)" (#9639)
This reverts commit fca1fb18f366ebff6016978cb6440dd1ed8637fe.
* [Autoscaler] Command Line Interface improvements (#9322)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [Core] GCS Actor management on by default. (#8845)
* GCS Actor management on by default.
* Fix travis config.
* Change condition.
* Remove unnecessary CI.
* [Core] Fix concurrency issues in plasma store runner (#9642)
* fix window jni unhappy compiler (#9635)
* Fix TestObjectTableResubscribe testcase bug (#9650)
* fix named actor single process mode bug (#9652)
* [core] Fix Ray service startup when logging redirection is disabled. (#9547)
* Fix TorchDeterministic (#9241)
* [RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661)
Co-authored-by: Kai Fricke <kai@anyscale.com>
* [rllib] Fix torch TD error, IMPALA LR updates (#9477)
* update
* add test
* lint
* fix super call
* speed es test up
* Auto-cancel build when a new commit is pushed (#8043)
Co-authored-by: Mehrdad <noreply@github.com>
* Fix lint in remote-watch.py (#9668)
* [Core] Remove unnecessary windows syscall in plasma store (#9602)
* Remove unused windows shims (#9583)
* Temporarily disable remote watcher (#9669)
* Drop support for Python 3.5. (#9622)
* Drop support for Python 3.5.
* Update setup.py
* [Core] WorkerInterface refactor (#9655)
* .
* .
* refactor WorkerInterface
* .
* Basic unit test structure complete?
* .
* .
* .
* .
* Fixed tests
* Fixed tests
* .
* [core] Enable object reconstruction for retryable actor tasks (#9557)
* Test actor plasma reconstruction
* Allow resubmission of actor tasks
* doc
* Test for actor constructor
* Kill PID before removing node
* Kill pid before node
* fix java coreworker crash (#9674)
* use help proto-init-macro for streaming config (#9272)
* Update release information from 0.8.6. (#9124)
* [BRING BACK TO MASTER] Update release information.
* [MERGE TO MASTER] Add microbenchmark result.
* Update asan tests to the doc.
* Refinements to the Serve documentation (#9587)
Co-authored-by: Dean Wampler <dean@concurrentthought.com>
* [tune] survey (#9670)
* Fix ERROR logging not being printed to standard error (#9633)
Co-authored-by: Mehrdad <noreply@github.com>
* [Tune Docs] Logging doc fix (#9691)
* [rllib] Type annotations for model classes (#9646)
* [Serve] Allow multiple HTTP servers. (#9523)
* Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. (#9681)
* [Serve] Fix Formatting, stale docs (#9617)
* fixed simplex initialisation seeding bug (#9660)
Co-authored-by: Petros Christodoulou <petrochr@amazon.com>
* Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (#9697)
Co-authored-by: Mehrdad <noreply@github.com>
* Add Ray Serve to README.rst (#9688)
* Shellcheck rewrites (#9597)
* Fix SC2001: See if you can use ${variable//search/replace} instead.
* Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.
* Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames.
* Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true.
* Fix SC2028: echo may not expand escape sequences. Use printf.
* Fix SC2034: variable appears unused. Verify use (or export if used externally).
* Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.
* Fix SC2071: > is for string comparisons. Use -gt instead.
* Fix SC2154: variable is referenced but not assigned
* Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
* Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).
* Fix SC2236: Use -n instead of ! -z.
* Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr.
* Fix SC2086: Double quote to prevent globbing and word splitting.
Co-authored-by: Mehrdad <noreply@github.com>
* [Autoscaler] CLI Logger docs (#9690)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update rllib-algorithms.rst (#9640)
* [tune] move jenkins tests to travis (#9609)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
* [RLlib] Implement DQN PyTorch distributional head. (#9589)
* Add placement group java api (#9611)
* add part code
* add part code
* add part code
* fix code style
* fix review comment
* fix review comment
* add part code
* add part code
* add part code
* add part code
* fix review comment
* fix review comment
* fix code style
* fix review comment
* fix lint error
* fix lint error
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
* [Stats] Improve Stats::Init & Add it to GCS server (#9563)
* [Core] Try remove all windows compat shims (#9671)
* try remove compat for arrow
* remove unistd.h
* remove socket compat
* delete arrow windows patch
* Fix a few flaky tests (#9709)
Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency
* [GCS]Open test_gcs_fault_tolerance testcase (#9677)
* enable test_gcs_fault_tolerance
* fix lint error
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
* [Tests]lock vector to avoid potential flaky test (#9656)
* [tune] distributed torch wrapper (#9550)
* changes
* add-working
* checkpoint
* ccleanu
* fix
* ok
* formatting
* ok
* tests
* some-good-stuff
* fix-torch
* ddp-torch
* torch-test
* sessions
* add-small-test
* fix
* remove
* gpu-working
* update-tests
* ok
* try-test
* formgat
* ok
* ok
* [GCS] Fix actor task hang when its owner exits before local dependencies resolved (#8045)
* Only update raylet map when autoscaler configured (#9435)
* [Dashboard] New dashboard skeleton (#9099)
* Fixing multiple building issues
* Make wait_for_condition raise exception when timing out. (#9710)
* [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (#9718)
* Package and upload ray cross-platform jar (#9540)
* Revert "Package and upload ray cross-platform jar (#9540)" (#9730)
This reverts commit 881032593d3c1b9360ea641c24d50a022677a25e.
* Only build docker wheels in LINUX_WHEELS env (#9729)
* Keep build-autoscaler-images.sh alive in CI (#9720)
* [core] Removes Error when Internal Config is not set (#9700)
* [Cluster Launcher] Re Org the cluster launcher pages. (#9687)
* [RLlib] Offline Type Annotations (#9676)
* Offline Annotations
* Modifications
* Fixed circular dependencies
* Linter fix
* Python api of placement group (#9243)
* Include open-ssh-client for transparency (#9693)
* Fix remote-watch.py (#9625)
Co-authored-by: Mehrdad <noreply@github.com>
* [docker] Uses Latest Conda & Py 3.7 (#9732)
* Fix broken actor failure tests. (#9737)
* [Stats] fix stats shutdown crash if opencensus exporter not initialized (#9727)
* Fix package and upload ray jar (#9742)
* Introduce file_mounts_sync_continuously cluster option (#9544)
* Separate out file_mounts contents hashing into its own separate hash
Add an option to continuously sync file_mounts from head node to worker nodes:
monitor.py will re-sync file mounts whenever contents change but will only run setup_commands if the config also changes
* add test and default value for file_mounts_sync_continuously
* format code
* Update comments
* Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick
Fixed so setup commands run when ray up is run and file_mounts content changes
* Refactor so that runtime_hash retains previous behavior
runtime_hash is almost identical to what it was before this PR. It is used to determine if setup_commands need to run
file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur.
Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization
* fix issue with hashing a hash
* fix bug where trying to set contents hash when it wasn't generated
* Fix lint error
Fix bug in command_runner where check_output was no longer returning the output of the command
* clear out provider between tests to get rid of flakiness
* reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call
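The two-hash scheme described in the file_mounts_sync_continuously entry above can be sketched roughly as follows. This is an illustrative sketch only, not the autoscaler's implementation: the helper names, the `files` mapping, and the returned action strings are all assumptions, and it does not model how the real runtime_hash is derived or cached.

```python
import hashlib


def file_mounts_contents_hash(files):
    """Hash mounted file contents.

    `files` maps a path to its bytes; this is a hypothetical stand-in for
    reading the file_mounts directories from disk.
    """
    hasher = hashlib.sha1()
    for path in sorted(files):  # sort so the hash is order-independent
        hasher.update(path.encode("utf-8"))
        hasher.update(files[path])
    return hasher.hexdigest()


def plan_update(old_runtime, new_runtime, old_contents, new_contents):
    """Decide what an update tick should do, given stored and fresh hashes."""
    if old_runtime != new_runtime:
        # Runtime hash differs: re-sync files and rerun setup_commands.
        return "sync_and_setup"
    if old_contents != new_contents:
        # Only file contents changed: re-sync but skip setup_commands.
        return "sync_only"
    return "noop"


before = file_mounts_contents_hash({"app.py": b"print(1)"})
after = file_mounts_contents_hash({"app.py": b"print(2)"})
print(plan_update("cfg-v1", "cfg-v1", before, after))
```

Keeping the contents hash separate is what lets a tick tell "files changed" apart from "configuration changed", which is the distinction the entry draws.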
* [dist] swap mac/linux wheel build order (#9746)
* [RLlib] Enhance reward clipping test; add action_clipping tests. (#9684)
* [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (#9680)
* [Metrics]Ray java worker metric registry (#9636)
* ray worker metrics gauge init
* ray java metric mapping
* add jni source files for gauge and tagkey
* mapping all metric classes to stats object
* check non-null for tags and name
* lint
* add symbol for native metric JNI
* extern c for symbol
* add tests for all metrics
* Update Metric.java
use metricNativePointer instead.
* unify metric native stuff to one class
* fix jni file
* add comments for metric transform function in jni utils
* move metric function to native metric file
* remove unused disconnect jni
* Add a metric registry for java metrics
* Restore install-bazel.sh
* Add some comments for metric registry
* Fix thread safe problem of metrics
* Fix metric tests and remove sleep code from tests
* Fix comments of metrics
Co-authored-by: lingxuan.zlx <skyzlxuan@gmail.com>
* fix windows compile bug (#9741)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
* Run _with_interactive in Docker (#9747)
* [New scheduler] First unit test for task manager (#9696)
* .
* .
* refactor WorkerInterface
* .
* Basic unit test structure complete?
* .
* bad git >:-(
* small clean up
* CR
* .
* .
* One more fixture
* One more fixture
* .
* .
* bazel-format
* .
* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607)
* [Release] Fix release tests (#9733)
* Register function race (#9346)
* Revert "[dist] swap mac/linux wheel build order (#9746)" and "Fix package and upload ray jar (#9742)" (#9758)
* Revert "[dist] swap mac/linux wheel build order (#9746)"
This reverts commit a9340565ff46626b18fd36f22a37d0380ae18d85.
* Revert "Fix package and upload ray jar (#9742)"
This reverts commit c290c308fe1e496480db5c37489df619cff6168f.
* Fix some Windows CI issues (#9708)
Co-authored-by: Mehrdad <noreply@github.com>
* Pin pytest version (#9767)
* [Java] Use test groups to filter tests of different run modes (#9703)
* [Java] Fix MetricTest.java due to incomplete changes from #9703 (#9770)
* Fix leased worker leak bug when lease worker requests are still waiting to be scheduled as GCS restarts (#9719)
* [Stats] enable core worker stats (#9355)
* [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416)
* use a sole thread to handle heartbeat
* separate signal thread
* use work to avoid exiting when task is underway
* protect shared data structure to avoid deadlock
* add comments
* decrease io service num
* minor changes
* fix test
* per stephanie's comments
* use single io service instead of 1-size io service pool
* typo
* [GCS Actor Management] Fix flaky test_dead_actors. (#9715)
* Fix.
* Add logs.
* Add a unit test.
* [TUNE] Tune Docs re-organization (#9600)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [RLlib] Trajectory View API (preparatory cleanup and enhancements). (#9678)
* [Core] Socket creation race condition bug fixes (#9764)
* fix issues
* hot fixes
* test
* test
* Always info log
* Fixed stderr logging (#9765)
* [Core] Custom socket name (#9766)
* fix issues
* hot fixes
* test
* test
* socket name change only
* Fix src/ray/core_worker/common.h deleted constructor (#9785)
Co-authored-by: Mehrdad <noreply@github.com>
* [Stats] Fix harvester threads + Fix flaky stats shutdown. (#9745)
* More fixes
* Applying latest changes in travis.yml
* Fixing fixture data exclusions
* Disable some java tests
* Fix some CI errors
* Update hash
* Fixing more build issues
* Fixing more build issues
* Fix pipeline cache path
* More fixes
* Fix bazel test command
* Fix bazel test
* Fix general info steps
* Custom env var for docker build
* Trying a different way to install bazel
* Bazel fix
* Updating hash
Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com>
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alisa <wuminyan0607@gmail.com>
Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@vip.qq.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Stefan Schneider <stefan.schneider@upb.de>
Co-authored-by: Patrick Ames <pdames@amazon.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Tao Wang <dooku.wt@antfin.com>
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Henk Tillman <henktillman@gmail.com>
Co-authored-by: Tanay Wakhare <twakhare@gmail.com>
Co-authored-by: Nicolaus93 <nicolo.campolongo@unimi.it>
Co-authored-by: Vasily Litvinov <45396231+vnlitvinov@users.noreply.github.com>
Co-authored-by: krfricke <krfricke@users.noreply.github.com>
Co-authored-by: Max Fitton <maxfitton@gmail.com>
Co-authored-by: Max Fitton <max@semprehealth.com>
Co-authored-by: kisuke95 <2522134184@qq.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
Co-authored-by: Michael Mui <68102089+heyitsmui@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: Michael Luo <michael.luo123456789@gmail.com>
Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu>
Co-authored-by: Tom <veniat.tom@gmail.com>
Co-authored-by: jerrylee.io <JerryDeKo@gmail.com>
Co-authored-by: Raphael Avalos <raphael@avalos.fr>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Arne Sachtler <arne.sachtler@gmail.com>
Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: ZhuSenlin <wumuzi520@126.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
Co-authored-by: Dean Wampler <dean@polyglotprogramming.com>
Co-authored-by: Dean Wampler <dean@concurrentthought.com>
Co-authored-by: Bill Chambers <bill@anyscale.com>
Co-authored-by: Petros Christodoulou <p.christodoulou2@gmail.com>
Co-authored-by: Petros Christodoulou <petrochr@amazon.com>
Co-authored-by: Justin Terry <justinkterry@gmail.com>
Co-authored-by: Tao Wang <wangtaothetonic@163.com>
Co-authored-by: fyrestone <fyrestone@outlook.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: bermaker <495571751@qq.com>
* Sync Upstream master (#50)
* [core] Pull Manager exponential backoff (#13024)
* [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793)
* [release tests] test_many_tasks fix (#12984)
* Add "beta" documentation for enabling object spilling manually (#13047)
* [Serve] Handle Bug Fixes (#12971)
* [Dashboard] Add GET /logical/actors API (#12913)
* [GCS]Decouple gcs resource manager and gcs node manager (#13012)
* [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031)
* [GCS] Delete redis gcs client and redis_xxx_accessor (#12996)
* [RLlib] Fix broken unity3d_env import in example server script. (#13040)
* [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039)
* [joblib] Fix flaky joblib test. (#13046)
* [Tune]Add integer loguniform support (#12994)
* Add integer quantization and loguniform support
* Fix hyperopt qloguniform not being np.log'd first
* Add tests, __init__
* Try to fix tests, better exceptions
* Tweak docstrings
* Type checks in SearchSpaceTest
* Update docs
* Lint, tests
* Update doc/source/tune/api_docs/search_space.rst
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
* [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)
* Add index for tasks to dispatch
* Task dependency manager interface
* Unsubscribe dependencies and tests
* NodeManager
* Revert "Add index for tasks to dispatch"
This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.
* tmp
* Move back to waiting if args not ready
* update
* Update to new form of brew cask install command
* [Autoscaler] New output log format (#12772)
* Fix typo RMSProp -> RMSprop (#13063)
* [serve] Centralize HTTP-related logic in HTTPState (#13020)
* Remove suppress output to see why wheel is not building
* Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)
* New dependency manager
* Switch raylet to new DependencyManager
* PullManager accepts bundles
* Cleanup, remove old task dependency manager
* x
* PullManager unit tests
* lint
* Unit tests
* Rename
* lint
* test
* Update src/ray/raylet/dependency_manager.cc
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* Update src/ray/raylet/dependency_manager.cc
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* x
* lint
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* [docs] Fix args + kwargs instead of docstrings (#13068)
* functools wraps
* Fix typo (functoools -> functools)
* Fix OS X Wheel Build - Update brew cask install (#13062)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* speed up local mode object store get (#13052)
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
* [RLlib] Execution Annotation (#13036)
* [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)
* [C++ API] Added reference counting to ObjectRef (#13058)
* Added reference counting to ObjectRef
* Addressed the comments
* [Core] Remove cuda support in plasma store (#13070)
* remove cuda support in plasma store
* [Core] Remove outdated external store (#13080)
* remove outdated external store
* [GCS] Move resource usage info to gcs resource manager (#13059)
* [RLlib] JAXPolicy prep. PR #1. (#13077)
* [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083)
* [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. (#13064)
* [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935)
* other collectives all work
* auto-linting
* manual linting #1
* manual linting 2
* bugfix
* add send/recv point-to-point calls
* add some initial code for communicator caching
* auto linting
* optimize imports
* minor fix
* fix unpassed tests
* support more dtypes
* rerun some distributed tests for send/recv
* linting
* [Serve] [Doc] Front page update (#13032)
* Deprecate experimental / dynamic resources (#13019)
* [docs] fix wandb url (#13094)
* [Serve] Implement Graceful Shutdown (#13028)
* [Serve] Use ServeHandle in HTTP proxy (#12523)
* [Java] Format ray java code (#13056)
* [docker] Fix restart behavior with Docker (#12898)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: ijrsvt <ilr@anyscale.com>
* Disable broken streaming tests (#13095)
* [autoscaler] Make placement groups bypass max launch limit (#13089)
* Serve metrics docs (#13096)
* [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097)
* [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035)
* [Doc] Fix Sphinx.add_stylesheet deprecation (#13067)
* Fix streaming ci failure (#12830)
* [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118)
* [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113)
* [RLlib] Deflake test case: 2-step game MADDPG. (#13121)
* [RLlib] Trajectory view API docs. (#12718)
* Job module without submission (#13081)
Co-authored-by: 刘宝 <po.lb@antfin.com>
* [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091)
* [Java] Avoid failure of serializing a user-defined unserializable exception. (#13119)
* [Tune] Update URL to fix 403 not found error in PBT transformers test case (#13131)
* [serve] Async controller (#13111)
* [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)
* [Serve] Use a small object to track requests (#13125)
* [docs][kubernetes][minor] Update K8s examples in docs (#13129)
* [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698)
* [docs] Documentation + example for the C++ language API (#13138)
* [Java] Support `wasCurrentActorRestarted` in actor task. (#13120)
* Remove check.
* Add test
* fix lint
* lint
* Fix spotless lint
* Address comments.
* Fix lint
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
* [docs] Minor change to formatting C++ docs. (#13151)
* Deprecate setResource java api (#13117)
* [docs] Small fix in C++ documentation. (#13154)
* prepare for head node
* move command runner interface outside _private
* remove space
* Eric
* flake
* min_workers in multi node type
* fixing edge cases
* eric not idle
* fix target_workers to consider min_workers of node types
* idle timeout
* minor
* minor fix
* test
* lint
* eric v2
* eric 3
* min_workers constraint before bin packing
* Update resource_demand_scheduler.py
* Revert "Update resource_demand_scheduler.py"
This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.
* reducing diff
* make get_nodes_to_launch return a dict
* merge
* weird merge fix
* auto fill instance types for AWS
* Alex/Eric
* Update doc/source/cluster/autoscaling.rst
* merge autofill and input from user
* logger.exception
* make the yaml use the default autofill
* docs Eric
* remove test_autoscaler_yaml from windows tests
* lets try changing the test a bit
* return test
* lets see
* edward
* Limit max launch concurrency
* commenting frac TODO
* move to resource demand scheduler
* use STATUS UP TO DATE
* Eric
* make logger of gc freed refs debug instead of info
* add cluster name to docker mount prefix directory
* grrR
* fix tests
* moving docker directory to sdk
* move the import to prevent circular dependency
* small fix
* ian
* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running
* small fix
* deflake test_joblib
* lint
* placement groups bypass
* remove space
* Eric
* first commit
* lint
* example
* documentation
* hmm
* file path fix
* fix test
* some format issue in docs
* modified docs
Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
* [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127)
* [kubernetes][docs][minor] Kubernetes version warning (#13161)
* [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)
* Locality-aware leasing for owned refs (pinned locations).
* LessorPicker --> LeasePolicy.
* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.
* Update comments.
* Turn on locality-aware leasing feature flag by default.
* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.
* Add lease policy consulting assertions to the direct task submitter tests.
* Add lease policy tests.
* LocalityLeasePolicy --> LocalityAwareLeasePolicy.
* Add missing const declarations.
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* Add RAY_CHECK for raylet address nullptr when creating lease client.
* Make the fact that LocalLeasePolicy always returns the local node more explicit.
* Flatten GetLocalityData conditionals to make it more readable.
* Add ReferenceCounter::GetLocalityData() unit test.
* Add data-intensive microbenchmarks for single-node perf testing.
* Add data-intensive microbenchmarks for simulated cluster perf testing.
* Remove redundant comment.
* Remove data-intensive benchmarks.
* Add locality-aware leasing Python test.
* Formatting changes in ray_perf.py.
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* Enabling the cancellation of non-actor tasks in a worker's queue (#12117)
* wrote code to enable cancellation of queued non-actor tasks
* minor changes
* bug fixes
* added comments
* rev1
* linting
* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error
* bug fix
* added two unit tests
* linting
* iterating through pending_normal_tasks starting from end
* fixup! iterating through pending_normal_tasks starting from end
* fixup! fixup! iterating through pending_normal_tasks starting from end
* post merge fixes
* added debugging instructions, pulled Accept() out of guarded loop
* removed debugging instructions, linting
* [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061)
* [Release] Update Release Process Documentation (#13123)
* [Core] Remove Arrow dependencies (#13157)
* remove arrow ubsan
* remove arrow build depend
* remove arrow buffer
* [XGboost] Update Documentation (#13017)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [SGD] Fix Docstring for `as_trainable` (#13173)
* Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)
This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2.
* Surface object store spilling statistics in `ray memory` (#13124)
* [ray_client]: Move from experimental to util (#13176)
Change-Id: I9f054881f0429092d265cd6944d89804cce9d946
* Remove unused file(object_manager_integration_test.cc) (#12989)
* Notify listeners after registered node stored (#13069)
* [build]Update description and add some keywords (#13163)
* [Collective][PR 2/6] Driver program declarative interfaces (#12874)
* scaffold of the code
* some scratch and options change
* NCCL mostly done, supporting API#1
* interface 2.1 2.2 scratch
* put code into ray and fix some importing issues
* add an additional Rendezvous class to safely meet at named actor
* fix some small bugs in nccl_util
* some small fix
* add a Backend class to make Backend string more robust
* add several useful APIs
* add some tests
* added allreduce test
* fix typos
* fix several bugs found via unittests
* fix and update torch test
* changed back actor
* rearrange a bit before importing distributed test
* add distributed test
* remove scratch code
* auto-linting
* linting 2
* linting 2
* linting 3
* linting 4
* linting 5
* linting 6
* 2.1 2.2
* fix small bugs
* minor updates
* linting again
* auto linting
* linting 2
* final linting
* Update python/ray/util/collective_utils.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update python/ray/util/collective_utils.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update python/ray/util/collective_utils.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* added actor test
* lint
* remove local sh
* address most of richard's comments
* minor update
* remove the actor.option() interface to avoid changes in ray core
* minor updates
Co-authored-by: YLJALDC <dal177@ucsd.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [serve] Merge ActorReconciler and BackendState (#13139)
* [tune] better signature check for `tune.sample_from` (#13171)
* [tune] better signature check for `tune.sample_from`
* Update python/ray/tune/sample.py
Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
* Disable atexit test on windows (#13207)
* [serve] Move controller state into separate files (#13204)
* Update multi_agent_independent_learning.py (#13196)
pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead
* [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162)
* [Tune] Fix PBT Transformers Example (#13174)
* [Serve] HTTPOptions for deployment modes (#13142)
* [tests] Fix Autoscaler Test failure on Windows (#13211)
* skip create_or_update tests
* Update python/ray/tests/test_autoscaler.py
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
* [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158)
* [GCS]Fix TestActorSubscribeAll bug (#13193)
* [Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage
* Add comments.
* Addressed code review.
* [Tune] Fix tune serve integration example (#13233)
* [Redis] Note that each Redis Connect retry takes two minutes (#12183)
* Slightly alter error message so it's the same in both cases.
* Each retry takes about two minutes.
* [Log] fix spdlog init race (#12973)
* fix spdlog init race
* use global logger
* refine logger name and constructor
* [Release] Add 1.1.0 release test logs (#13054)
* Add microbenchmark to release logs
* check in many_tasks stress test result
* Add results of placement group stress test for 1.1.0
* Add result for test_dead_actors test and correct the name of test_many_tasks.txt
* Add rllib regression test result
* Add pytorch test results for rllib
* remove extraneous log entries
* [Core] Fix incorrect comment (#13228)
* [Serialization] Fix cloudpickle (#13242)
* [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195)
* Start ray client server with 'ray start' (#13217)
* [GCS]Add gcs actor schedule strategy (#13156)
* Publish job/worker info with Hex format instead of Binary (#13235)
* [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126)
* [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)
Now that `HeadOnly` is the new default HTTP location, we can
re-enable the long running tests to use local multi-clusters.
(Also updated the controller API calls to match the current version; we should
have caught these earlier, and I will open issues for this.)
* Update autoscaler-cluster yaml files for release tests (#13114)
* [Release] Use ray-ml image for long running test (#13267)
* [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. (#13237)
* [Tune] Improve error message for Session Detection (#13255)
* Improve error message
* log once
* [Tune] Pin Tune Dependencies (#13027)
Co-authored-by: Ian <ian.rodney@gmail.com>
* [Dependabot] Add Dependabot (#13278)
Co-authored-by: Ian <ian.rodney@gmail.com>
* [docker] Pull if image is not present (#13136)
* [GCS] Remove old lightweight resource usage report code path (#13192)
* [Dashboard] Add GET /log_proxy API (#13165)
* Fix a crash problem caused by GetActorHandle in ActorManager (#13164)
* [ray_client] Add metadata to gRPC requests (#13167)
* [RLlib] Preparatory PR for: Documentation on Model Building. (#13260)
* [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286)
* [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287)
* Remove top-level ray.connect() and ray.disconnect() APIs (#13273)
* [Pull manager] Only pull once per retry period (#13245)
* .
* docs
* cleanup
* .
* .
* .
* .
Co-authored-by: Alex <alex@anyscale.com>
* [Cancellation] Make Test Cancel Easier to Debug (#13243)
* first commit
* lint-fix
* [ray_client]: first draft of documentation (#13216)
* Do not give an error if both `RAY_ADDRESS` and `address` are specified on initialization (#13305)
* Finalize handling of RAY_ADDRESS
* lint
* [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215)
* [RLlib] SlateQ Documentation (#13266)
* [RLlib] Add more detailed Documentation on Model building API (#13261)
* [tune] convert search spaces: parse spec before flattening (#12785)
* Parse spec before flattening
* flatten after parse
* Test for ValueError if grid search is passed to search algorithms
* remove empty extras streaming deps (#12933)
* add the method annotation and a comment explaining what's happening (#13306)
Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a
* Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210)
* [RLlib] Issue 13330: No TF installed causes crash in `ModelCatalog.get_action_shape()` (#13332)
* [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298)
* fix removal of task dependencies (#13333)
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
* [Serve] Support Starlette streaming response (#13328)
* [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)
* [client] Report number of currently active clients on connect (#13326)
* wip
* update
* update
* reset worker
* fix conn
* fix
* disable pycodestyle
* Implement internal kv in ray client (#13344)
* kv internal
* fix
* [Tune] Rename MLFlow to MLflow (#13301)
* Forgot overwrite parameter in Ray client internal kv
* Fix typo in Tune Docs (Checkpointing) (#13348)
See issue #13299
* [Kubernetes][Docs] GPU usage (#13325)
* gpu-note
* gpu-note
* More info
* lint?
* Update doc/source/cluster/kubernetes.rst
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update doc/source/cluster/kubernetes.rst
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update doc/source/cluster/kubernetes.rst
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update doc/source/cluster/kubernetes.rst
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* GKE->Kubernetes
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361)
This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419.
* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359)
* [tune] buffer trainable results (#13236)
* Working prototype
* Pass buffer length, fix tests
* Don't buffer per default
* Dispatch and process save in one go, added tests
* Fix tests
* Pass adaptive seconds to train_buffered, stop result processing after STOP decision
* Fix tests, add release test
* Update tests
* Added detailed logs for slow operations
* Update python/ray/tune/trial_runner.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Apply suggestions from code review
* Revert tests and go back to old tuning loop
* nit
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [Serve] Add dependency management support for driver not running in a conda env (#13269)
* [RLlib] Add `__len__()` method to SampleBatch (#13371)
* [Serve] Backend state unit tests (#13319)
* trigger doc build for serve updates (#13373)
* [Object Spilling] Long running object spilling test (#13331)
* done.
* formatting.
* Remove unimplemented GetAll method in actor info accessor (#13362)
* [Doc] Remove trailing whitespaces (#13390)
* Enable Ray client server by default (#13350)
* update
* fix
* fix test
* update
* [RLlib] Trajectory View API: Atari framestacking. (#13315)
* [ray_client]: Wait for ready and retry on ray.connect() (#13376)
* [ray_client]: wait until connection ready
Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6
* lint
Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0
* docs and retry minimum
Change-Id: I43f5378322029267ddd69f518ce8206876e2129d
* [Dashboard] Fix missing actor pid (#13229)
* [ray_client]: Fix multiple attempts at checking connection (#13422)
* Plumb retries update (#13411)
* [Serve] [Doc] Improve batching doc (#13389)
* [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514)
* Fix Serve release test (#13385)
* Add bazel logs upload to GHA (#13251)
* [tune] Fix f-string in error message (#13423)
* [serve] Pull out goal management logic into AsyncGoalManager class (#13341)
* Make request_resources() use internal kv instead of redis pub sub (#13410)
* Remove unused handler methods (#13394)
* [Tune] Pin Transitive Dependencies (#13358)
* Split out the part of get_node_ip_address for which the docstring is correct (#12796)
* Fix raylet::MockWorker::GetProcess crashes (#13440)
Co-authored-by: 刘宝 <po.lb@antfin.com>
* Revert "Enable Ray client server by default (#13350)" (#13429)
This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d.
* Fix linter error (#13451)
* [GCS]Add gcs resource scheduler (#13072)
* [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). (#13363)
* [Core]Fix raylet scheduling bug (#13452)
* [Core]Fix raylet scheduling bug
* fix lint error
* fix lint error
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
* [joblib] joblib strikes again but this time on windows (#13212)
* [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424)
* [kubernetes][minor] Operator garbage collection fix (#13392)
* [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391)
* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()
* Modify ray status cli so that it doesn't start a new job via ray.init()
* Remove local test file
* Make status and error args required in commands.py#debug.status
* Remove unnecessary imports
* Job 38482.1 should now pass
* Resolve merge conflict
* [RLlib] Deflake 2x remote & local inference tests (external env). (#13459)
* [docs] Add more guideline on using ray in slurm cluster (#12819)
Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [Dashboard] Fix GPU resource rendering issue (#13388)
* [Release] Fix Serve release test (#13303)
The Docker image we were using now runs as the `ray` user, so we have to call
sudo.
* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460)
* Fix getting runtime context dict in driver (#13417)
* [xgb] re-enable xgboost_ray tests (#13416)
* re-enable
* fix
* update xgb_ray version
* [Serialization] New custom serialization API (#13291)
* new serialization API with doc & test
* add more notes
* refine notes
* doc
* [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)
* Added owned object reference before Plasma put on Create() + Seal() path.
* Consolidated location table and reference table in reference counter.
* Restore type in definition.
* Clean up owned reference on failed Seal().
* Added RemoveOwnedObject test for reference counter.
* Guard against ref going out of scope before location RPCs.
* Add 'owner must have ref in scope' precondition to documentation for object location methods.
* Move to separate Create() + Seal() methods for existing objects.
* Clearer distinction between Create() and Seal() methods.
* Make it clear that references will normally be cleaned up by reference counting.
* [ray_client]: Support runtime_context as metadata (#13428)
* [GCS]Remove unused class variable (#13454)
* [Object Spilling] Dedup restore objects (#13470)
* done.
* Addressed code review.
* [CI] Enable Dashboard tests for master (#13425)
* [docker/dashboard] Fix ray dashboard (#12899)
* [CI] Fix Windows Bazel Upload (#13436)
* Return version info from Ray client connect, to allow for discovering version mismatches
* Update ID specification doc (#13356)
* [ray_client]: fix wrong reference in server_pickler (#13474)
Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf
* Bump dev branch to 2.0 to avoid endless version bump toil (#13497)
* wip
* fix
* fix
* Remove an unnecessary file (#13499)
* [Tests] Skip failing windows tests (#13495)
* skip failing windows tests
* skip more
* remove
* updates
* [tune] fix small docs typo (#13355)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
* move message to debug (#13472)
* Minimal version of piping autoscaler events to driver logs (#13434)
* sync write internal config in gcs (#13197)
* Refactor node manager to eliminate `new_scheduler_enabled_` (#12936)
* [GCS]Only publish changed field when node dead (#13364)
* Only update changed field when node dead
* node_id missed
* [CI] Buildkite PR Environment for Simple Tests (#13130)
* [GCS] Remove task info publish as nowhere uses it (#13509)
* Remove task info publish as nowhere uses it
* simplify right publish channel
* [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467)
* [tune] placement group support (#13370)
* [Serve] Allow ObjectRef for Composition (#12592)
* Add Dashboard Python Test to Buildkite (#13530)
* Add ability to not start Monitor when calling `ray start` (#13505)
* [tune] support experiment checkpointing for grid search (#13357)
* Fix typo (#13098)
* Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544)
* [RLlib] MARWI…
Why are these changes needed?
Closes #9413
Closes #9428