
Conversation

@Avi-Robusta (Contributor)

No description provided.


coderabbitai bot commented Oct 28, 2025

Walkthrough

Implements batched, paginated Kubernetes Job discovery with continue-token and 410-retry handling; adds LightweightJobInfo for grouped-job processing; exposes discovery_job_batch_size (5000) and discovery_job_max_batches (100) in Config and CLI; updates Prometheus pod-owner extraction for grouped jobs; and adds/updates tests for batching and grouped-job metrics logic.
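
For illustration, here is a minimal sketch of the batched listing pattern described above, assuming the official kubernetes Python client; the helper name list_jobs_batched and its parameters stand in for the PR's actual helper and the new discovery_job_batch_size / discovery_job_max_batches settings. The 410 handling reflects Kubernetes list semantics: an expired continue token yields 410 Gone, and the error body may carry a fresh token for continuing with a possibly inconsistent list.

import json
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException

def list_jobs_batched(batch_size: int = 5000, max_batches: int = 100) -> list:
    """Page through all Jobs, refreshing the continue token on 410 Gone."""
    config.load_kube_config()
    api = client.BatchV1Api()
    jobs: list = []
    continue_token = None
    for _ in range(max_batches):
        try:
            resp = api.list_job_for_all_namespaces(limit=batch_size, _continue=continue_token)
        except ApiException as e:
            if e.status == 410 and e.body:
                # Expired token: the 410 body may carry a fresh token in metadata.continue
                try:
                    continue_token = json.loads(e.body).get("metadata", {}).get("continue")
                except ValueError:
                    continue_token = None
                if continue_token:
                    continue  # retry with the refreshed token (counts against the batch budget)
            raise
        jobs.extend(resp.items)
        continue_token = resp.metadata._continue  # None once the list is exhausted
        if not continue_token:
            break
    return jobs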

Changes

• Core Batching & Grouping (robusta_krr/core/integrations/kubernetes/__init__.py): Added LightweightJobInfo; added _is_job_owned_by_cronjob() and _is_job_grouped() helpers; implemented async batched listing _list_namespaced_or_global_objects_batched() with continue-token and 410 retry logic; refactored _list_all_jobs() into async batched form to skip CronJob-owned and grouped jobs and build K8sObjectData; reworked _list_all_groupedjobs() to collect grouped jobs into groups using per-group template jobs for container introspection and enforce group size limits.
• Metrics Service (robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py): In load_pods() for GroupedJob, when _grouped_jobs is present, pod owners derive from job.name (LightweightJobInfo) rather than job.metadata.name.
• Configuration (robusta_krr/core/models/config.py): Added discovery_job_batch_size: int = 5000 and discovery_job_max_batches: int = 100 to Config with validation and descriptions.
• CLI (robusta_krr/main.py): Added Typer options --discovery-job-batch-size (default 5000) and --discovery-job-max-batches (default 100) to run_strategy, forwarded into Config and strategy invocation (a combined Config/CLI sketch follows this list).
• Tests — grouped jobs (tests/test_grouped_jobs.py): Updated tests to use _list_namespaced_or_global_objects_batched() returning (items, continue_token); adapted mocks to return batched shapes and futures; use lightweight grouping objects exposing .name; updated internal mock calls and assertions; added tests for multi-label jobs appearing in multiple groups and for global vs per-namespace batching behavior.
• Tests — grouped jobs metrics logic (tests/test_grouped_jobs_metrics_logic.py): New test module validating grouped-job metrics extraction: LightweightJobInfo shape, pod-owner extraction from _grouped_jobs, batched query construction/handling, multi-namespace behavior, batch-splitting of large grouped-job sets, fallback cases, and cross-group inclusion semantics.
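
A rough sketch of how the Config fields and Typer options above could be wired together, assuming pydantic v1 (as krr uses) and typer; the exact validators, help text, and forwarding in the PR may differ.

import typer
from pydantic import BaseSettings, Field

class Config(BaseSettings):
    discovery_job_batch_size: int = Field(5000, ge=1, description="Jobs fetched per API page during discovery")
    discovery_job_max_batches: int = Field(100, ge=1, description="Maximum number of pages fetched per discovery run")

app = typer.Typer()

@app.command()
def run_strategy(
    discovery_job_batch_size: int = typer.Option(5000, "--discovery-job-batch-size", help="Jobs fetched per API page"),
    discovery_job_max_batches: int = typer.Option(100, "--discovery-job-max-batches", help="Max pages per discovery run"),
) -> None:
    # Forward the CLI values into the Config object, mirroring the pattern described above.
    config = Config(
        discovery_job_batch_size=discovery_job_batch_size,
        discovery_job_max_batches=discovery_job_max_batches,
    )
    typer.echo(f"batch_size={config.discovery_job_batch_size}, max_batches={config.discovery_job_max_batches}")

if __name__ == "__main__":
    app()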

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant ClusterLoader
    participant K8sAPI as "K8s API"
    participant GroupCollector as "GroupedJob Collector"

    User->>ClusterLoader: request job discovery
    activate ClusterLoader

    ClusterLoader->>K8sAPI: _list_namespaced_or_global_objects_batched(limit, continue)
    activate K8sAPI
    K8sAPI-->>ClusterLoader: { items, continue_token }
    deactivate K8sAPI

    rect rgba(220,235,255,0.6)
        ClusterLoader->>ClusterLoader: process items
        alt CronJob-owned
            ClusterLoader->>ClusterLoader: skip
        else Grouped (labels)
            ClusterLoader->>GroupCollector: register LightweightJobInfo into groups
        else Regular Job
            ClusterLoader->>ClusterLoader: build K8sObjectData for containers
        end
    end

    alt continue_token and batch_count < max
        ClusterLoader->>K8sAPI: request next batch (continue_token)
    else
        ClusterLoader->>ClusterLoader: finish batching
    end

    GroupCollector->>ClusterLoader: emit grouped_jobs with templates
    ClusterLoader-->>User: discovery result (jobs + grouped_jobs)

    rect rgba(245,245,245,0.6)
        note right of K8sAPI: On 410 Gone -> refresh continue token and retry
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas needing extra attention:

  • Batching/pagination and 410-retry logic in _list_namespaced_or_global_objects_batched().
  • Correct async/await usage and call-site updates for _list_all_jobs() / _list_all_groupedjobs().
  • Grouping semantics: template selection for container introspection and multi-label inclusion (job appearing in multiple groups).
  • Metrics change: consistent usage of job.name (LightweightJobInfo) across callers and tests.
  • Tests: verify mocks/futures and batched return shapes align with new implementations (a hypothetical sketch follows this list).
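
As a hypothetical illustration of that last point, the batched mock shape the tests rely on might look like this (assuming pytest-asyncio; the fixture and parameter names here are invented, not the PR's actual test code):

from unittest.mock import AsyncMock

import pytest

@pytest.mark.asyncio
async def test_batched_lister_shape():
    # The patched lister yields (items, continue_token); None ends pagination.
    lister = AsyncMock(side_effect=[
        (["job-a", "job-b"], "token-1"),  # first page
        (["job-c"], None),                # last page
    ])
    items, token = [], None
    while True:
        page, token = await lister(limit=2, continue_ref=token)
        items.extend(page)
        if token is None:
            break
    assert items == ["job-a", "job-b", "job-c"]
    assert lister.await_count == 2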

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 2 inconclusive)
• Docstring Coverage (⚠️ Warning): Docstring coverage is 51.16%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
• Title check (❓ Inconclusive): The title "[ROB-2447] Oomkill job fix" is vague and generic, referring to an issue ticket without describing what was changed to fix the oomkill job problem. Provide a more descriptive title, such as "Add batched job discovery with CronJob filtering" or "Implement batched Kubernetes job listing with continue token support".
• Description check (❓ Inconclusive): No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Add a description explaining the changes, the problem being solved, and how the batched job discovery and grouped-job improvements address the oomkill issue.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch oomkill_job_fix

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc76032 and 64a7041.

📒 Files selected for processing (1)
  • robusta_krr/core/models/config.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • robusta_krr/core/models/config.py

coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)

609-711: Use bare raise to preserve exception context.

Line 672 re-raises the exception with raise e, which can modify the traceback. Use a bare raise statement to preserve the original exception context.

Based on static analysis.

Apply this diff:

     except Exception as e:
         logger.error(
             "Failed to run grouped jobs discovery",
             exc_info=True,
         )
-        raise e
+        raise
🧹 Nitpick comments (5)
robusta_krr/core/integrations/kubernetes/__init__.py (2)

27-31: Consider adding type hints to LightweightJobInfo.

The LightweightJobInfo class would benefit from type hints on the __init__ method parameters for better IDE support and type checking.

Apply this diff to add type hints:

 class LightweightJobInfo:
     """Lightweight job object containing only the fields needed for GroupedJob processing."""
-    def __init__(self, name: str, namespace: str):
+    def __init__(self, name: str, namespace: str) -> None:
         self.name = name
         self.namespace = namespace

552-599: Refactor exception handling for clarity.

Line 594 captures the exception in variable e but never uses it. Either include it in the log message or remove it. Also consider moving the return statement to an else block for clarity.

Apply this diff to improve exception handling:

     try:
         # ... batching logic ...
         
         logger.debug("Found %d regular jobs", len(all_jobs))
-        return all_jobs
         
-    except Exception as e:
+    except Exception:
         logger.error(
             "Failed to run jobs discovery",
             exc_info=True,
         )
-        return all_jobs
+    
+    return all_jobs
tests/test_grouped_jobs_metrics_logic.py (3)

1-4: Clean up unused imports.

Lines 1-4 import pytest, patch, and timedelta but never use them. Remove these to keep the imports clean.

Based on static analysis.

Apply this diff:

-import pytest
-from unittest.mock import MagicMock, patch
-from datetime import timedelta
-
+from unittest.mock import MagicMock

 from robusta_krr.core.integrations.kubernetes import LightweightJobInfo

44-83: Remove unused variable assignments in test.

Line 74 assigns pod_owner_kind but never uses it in the namespace-specific assertions. Consider removing this assignment or adding assertions for it.

Based on static analysis.


133-169: Remove unused variable assignment in test.

Line 151 assigns pod_owner_kind but never uses it afterward. Consider removing this assignment.

Based on static analysis.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7afd7d0 and 14379a1.

📒 Files selected for processing (6)
  • robusta_krr/core/integrations/kubernetes/__init__.py (4 hunks)
  • robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py (1 hunks)
  • robusta_krr/core/models/config.py (1 hunks)
  • robusta_krr/main.py (2 hunks)
  • tests/test_grouped_jobs.py (9 hunks)
  • tests/test_grouped_jobs_metrics_logic.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_grouped_jobs_metrics_logic.py (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)
  • LightweightJobInfo (27-31)
tests/test_grouped_jobs.py (1)
robusta_krr/core/integrations/kubernetes/__init__.py (3)
  • namespaces (62-95)
  • _list_namespaced_or_global_objects_batched (282-352)
  • _list_all_groupedjobs (609-711)
robusta_krr/core/integrations/kubernetes/__init__.py (1)
robusta_krr/core/models/objects.py (2)
  • selector (76-83)
  • K8sObjectData (38-107)
🪛 Flake8 (7.3.0)
tests/test_grouped_jobs_metrics_logic.py

[error] 1-1: 'pytest' imported but unused

(F401)


[error] 2-2: 'unittest.mock.patch' imported but unused

(F401)


[error] 3-3: 'datetime.timedelta' imported but unused

(F401)


[error] 74-74: local variable 'pod_owner_kind' is assigned to but never used

(F841)


[error] 151-151: local variable 'pod_owner_kind' is assigned to but never used

(F841)

🪛 Ruff (0.14.1)
tests/test_grouped_jobs_metrics_logic.py

74-74: Local variable pod_owner_kind is assigned to but never used

Remove assignment to unused variable pod_owner_kind

(F841)


151-151: Local variable pod_owner_kind is assigned to but never used

Remove assignment to unused variable pod_owner_kind

(F841)

tests/test_grouped_jobs.py

75-75: Unused function argument: args

(ARG001)


75-75: Unused function argument: kwargs

(ARG001)


85-85: Unused function argument: item

(ARG001)


85-85: Unused function argument: kind

(ARG001)


141-141: Unused function argument: args

(ARG001)


141-141: Unused function argument: kwargs

(ARG001)


150-150: Unused function argument: item

(ARG001)


150-150: Unused function argument: kind

(ARG001)


196-196: Unused function argument: args

(ARG001)


196-196: Unused function argument: kwargs

(ARG001)


205-205: Unused function argument: item

(ARG001)


205-205: Unused function argument: kind

(ARG001)


250-250: Unused function argument: args

(ARG001)


250-250: Unused function argument: kwargs

(ARG001)


259-259: Unused function argument: item

(ARG001)


259-259: Unused function argument: kind

(ARG001)


294-294: Unused function argument: args

(ARG001)


294-294: Unused function argument: kwargs

(ARG001)


303-303: Unused function argument: item

(ARG001)


303-303: Unused function argument: kind

(ARG001)


316-316: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)

robusta_krr/core/integrations/kubernetes/__init__.py

332-332: Consider moving this statement to an else block

(TRY300)


592-592: Consider moving this statement to an else block

(TRY300)


594-594: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


672-672: Use raise without specifying exception name

Remove exception name

(TRY201)

🔇 Additional comments (5)
robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py (1)

352-358: LGTM! Correctly updated for LightweightJobInfo.

The change from job.metadata.name to job.name is correct and aligns with the new LightweightJobInfo structure that stores the name directly as an attribute.

robusta_krr/core/integrations/kubernetes/__init__.py (1)

272-280: LGTM! Helper methods are well-implemented.

The filtering logic correctly checks for CronJob ownership and grouping labels with appropriate null safety.

robusta_krr/core/models/config.py (1)

56-59: LGTM! Configuration fields are well-defined.

The new discovery settings have appropriate defaults, validation constraints, and clear descriptions.

tests/test_grouped_jobs.py (2)

7-20: LGTM! Mock config properly includes new fields.

The mock configuration correctly includes the new discovery_job_batch_size, discovery_job_max_batches, selector, and namespaces settings needed for the batched loading tests.


284-343: LGTM! Excellent test coverage for multi-group membership.

The new test test_list_all_groupedjobs_job_in_multiple_groups provides valuable coverage for the scenario where a single job belongs to multiple groups based on different labels. The assertions correctly verify that each group contains the appropriate jobs.

Note: Line 316's comment contains a multiplication sign (×) instead of 'x', but this is cosmetic and doesn't affect functionality.

Avi-Robusta marked this pull request as ready for review on October 28, 2025, 13:03
coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)

282-353: Critical: Continue token logic breaks pagination for multi-namespace scenarios.

When self.namespaces != "*", the method creates multiple parallel requests (one per namespace), but all requests receive the same continue_ref (Line 315), and only the first result's continue token is extracted (Line 331). This is incorrect because:

  1. Continue tokens in the Kubernetes API are request-specific and cannot be shared across different namespace queries
  2. Each namespace has its own independent result set requiring separate continue tokens
  3. To properly paginate across multiple namespaces, you would need to track one continue token per namespace

The current implementation only works correctly when self.namespaces == "*" (single aggregated request). If pagination with continue tokens is attempted in multi-namespace mode, results will be incomplete or incorrect.

Recommended fix:

Either:

  1. Option A (simpler): Only support pagination when namespaces == "*" and raise an error or warning if continue_ref is passed with multiple namespaces
  2. Option B (complete): Track continue tokens per namespace using a dict structure

Apply this diff for Option A:

     async def _list_namespaced_or_global_objects_batched(
         self,
         kind: KindLiteral,
         all_namespaces_request: Callable,
         namespaced_request: Callable,
         limit: Optional[int] = None,
         continue_ref: Optional[str] = None,
     ) -> tuple[list[Any], Optional[str]]:
         logger.debug("Listing %s in %s with batching (limit=%d)", kind, self.cluster, limit)
         loop = asyncio.get_running_loop()
 
         try:
+            # Pagination with continue tokens only works for cluster-wide requests
+            if self.namespaces != "*" and continue_ref is not None:
+                raise ValueError(
+                    "Continue token pagination is only supported for cluster-wide requests (namespaces='*'). "
+                    "Per-namespace pagination would require tracking separate continue tokens."
+                )
+            
             if self.namespaces == "*":
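
For Option B, a rough sketch of per-namespace token tracking (hypothetical helper names, not the PR's code); list_page stands in for the per-namespace request and is assumed to return (items, next_token):

from typing import Any, Callable, Optional

def list_all_namespaces_batched(
    namespaces: list[str],
    list_page: Callable[[str, int, Optional[str]], tuple[list[Any], Optional[str]]],
    limit: int,
    max_batches: int,
) -> list[Any]:
    """Paginate each namespace independently with its own continue token."""
    items: list[Any] = []
    for ns in namespaces:
        token: Optional[str] = None  # each namespace starts its own cursor
        for _ in range(max_batches):
            page, token = list_page(ns, limit, token)
            items.extend(page)
            if not token:
                break  # this namespace is exhausted; move on to the next one
    return items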
🧹 Nitpick comments (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)

27-32: Consider using dataclass for better type safety and immutability.

The LightweightJobInfo class could benefit from using Python's dataclass decorator for cleaner syntax, automatic type hints, and optional immutability.

Apply this diff:

+from dataclasses import dataclass
+
-class LightweightJobInfo:
-    """Lightweight job object containing only the fields needed for GroupedJob processing."""
-    def __init__(self, name: str, namespace: str):
-        self.name = name
-        self.namespace = namespace
+@dataclass(frozen=True)
+class LightweightJobInfo:
+    """Lightweight job object containing only the fields needed for GroupedJob processing."""
+    name: str
+    namespace: str
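
As a side note, a frozen dataclass would also give value equality and hashability for free, which helps if grouped jobs are ever deduplicated in sets; a minimal usage sketch:

from dataclasses import dataclass

@dataclass(frozen=True)
class LightweightJobInfo:
    name: str
    namespace: str

a = LightweightJobInfo("job-1", "default")
b = LightweightJobInfo("job-1", "default")
assert a == b            # value equality generated by dataclass
assert len({a, b}) == 1  # hashable because frozen=True, so usable in sets/dict keys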
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14379a1 and 1a87982.

📒 Files selected for processing (1)
  • robusta_krr/core/integrations/kubernetes/__init__.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
robusta_krr/core/integrations/kubernetes/__init__.py (2)
robusta_krr/core/models/objects.py (2)
  • selector (76-83)
  • K8sObjectData (38-107)
robusta_krr/utils/object_like_dict.py (2)
  • items (28-29)
  • get (25-26)
🪛 Ruff (0.14.1)
robusta_krr/core/integrations/kubernetes/__init__.py

333-333: Consider moving this statement to an else block

(TRY300)


593-593: Consider moving this statement to an else block

(TRY300)


595-595: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


673-673: Use raise without specifying exception name

Remove exception name

(TRY201)

🔇 Additional comments (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)

272-280: LGTM!

The helper methods correctly handle edge cases with proper None-checking and defensive logic.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
robusta_krr/core/integrations/kubernetes/__init__.py (2)

547-594: Clean up exception handling per previous review.

The exception handling has minor issues that were flagged in the previous review:

  1. Line 589: Variable e is assigned but never used (the exc_info=True parameter already logs the exception)
  2. Line 587: The success-path return should be in a try...else block for cleaner structure

Apply this diff:

             logger.debug("Found %d regular jobs", len(all_jobs))
-            return all_jobs
             
-        except Exception as e:
+        except Exception:
             logger.error(
                 "Failed to run jobs discovery",
                 exc_info=True,
             )
-            return all_jobs
+        
+        return all_jobs

This aligns with Python best practices for try/except/else structure.


662-667: Simplify exception re-raise per previous review.

Line 667 uses raise e which resets the traceback. Use bare raise to preserve the original traceback for better debugging.

Apply this diff:

         except Exception as e:
             logger.error(
                 "Failed to run grouped jobs discovery",
                 exc_info=True,
             )
-            raise e
+            raise

This preserves the full exception context when propagating the error.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a87982 and 8e11b77.

📒 Files selected for processing (1)
  • robusta_krr/core/integrations/kubernetes/__init__.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)
robusta_krr/core/models/objects.py (2)
  • selector (76-83)
  • K8sObjectData (38-107)
🪛 Ruff (0.14.2)
robusta_krr/core/integrations/kubernetes/__init__.py

333-333: Consider moving this statement to an else block

(TRY300)


587-587: Consider moving this statement to an else block

(TRY300)


589-589: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


667-667: Use raise without specifying exception name

Remove exception name

(TRY201)

🔇 Additional comments (4)
robusta_krr/core/integrations/kubernetes/__init__.py (4)

27-32: LGTM! Clean lightweight data structure.

The LightweightJobInfo class is well-designed for its purpose—storing only the essential fields needed for grouped job processing.


272-280: LGTM! Helper methods are correctly implemented.

Both helper methods properly handle edge cases (missing owner references, missing labels) and implement their filtering logic correctly.


614-661: LGTM! Template-based grouping is well-designed.

The approach of maintaining a single template job per group (line 617, 657) is efficient: it provides the container specifications needed for K8sObjectData construction while storing only lightweight job info for the group members. This balances memory usage with the data requirements (a sketch of this pattern follows these comments).


669-706: LGTM! Grouped job construction is correct.

The result-building logic properly:

  • Uses the template job for container specifications (lines 671, 681-689)
  • Applies the grouping limit per namespace (line 678)
  • Stores LightweightJobInfo instances in _grouped_jobs (line 700)
  • Sets the _label_filter for pod selection (line 701)

This integrates well with the list_pods method's GroupedJob handling (lines 158-172).
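
An illustrative sketch (not the PR's code) of that template-per-group pattern; the grouping-by-label rule and the grouping_labels parameter are assumptions. A job carrying two grouping labels lands in two groups, matching the multi-group membership the tests cover:

from collections import defaultdict
from typing import Any

class LightweightJobInfo:
    """Only the fields needed for grouped-job processing."""
    def __init__(self, name: str, namespace: str) -> None:
        self.name = name
        self.namespace = namespace

def group_jobs(jobs: list[Any], grouping_labels: set[str]) -> tuple[dict, dict]:
    grouped: dict[str, list[LightweightJobInfo]] = defaultdict(list)
    templates: dict[str, Any] = {}
    for job in jobs:
        labels = job.metadata.labels or {}
        for key, value in labels.items():
            if key not in grouping_labels:
                continue
            group = f"{key}={value}"
            templates.setdefault(group, job)  # first full job kept for container introspection
            grouped[group].append(LightweightJobInfo(job.metadata.name, job.metadata.namespace))
    return grouped, templates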

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
robusta_krr/core/integrations/kubernetes/__init__.py (3)

282-347: Critical: Multi-namespace pagination is broken.

This issue was previously identified but remains unresolved. The continue token handling doesn't work correctly when listing specific namespaces (when self.namespaces != "*"):

  1. Lines 307-318 create one request per namespace, but all requests receive the same continue_ref parameter (lines 315-316)
  2. Line 330 extracts the continue token only from the first namespace's result
  3. On the next batch iteration, this single continue token is passed to all namespace requests

This is incorrect because each namespace has its own independent pagination state. A continue token from namespace A cannot be used to paginate namespace B's results. This will cause:

  • Only the first namespace to paginate correctly
  • Other namespaces to error, ignore the token, or return incorrect results
  • Job discovery to be incomplete in multi-namespace deployments

Solution: Either restrict batched listing to cluster-wide mode (namespaces == "*") only, or track separate continue tokens per namespace (requires restructuring the batching logic).


546-594: Clean up exception handling.

Line 589 assigns the exception to variable e but never uses it (just logs with exc_info=True). Additionally, Line 587 should be moved to an else block for cleaner structure.

Apply this diff:

         
-            logger.debug("Found %d regular jobs", len(all_jobs))
-            return all_jobs
             
-        except Exception as e:
+        except Exception:
             logger.error(
                 "Failed to run jobs discovery",
                 exc_info=True,
             )
-            return all_jobs
+        else:
+            logger.debug("Found %d regular jobs", len(all_jobs))
+        
+        return all_jobs

666-671: Simplify exception re-raise.

Line 671 unnecessarily specifies the exception name when re-raising. Use bare raise to preserve the original traceback.

Apply this diff:

         except Exception as e:
             logger.error(
                 "Failed to run grouped jobs discovery",
                 exc_info=True,
             )
-            raise e
+            raise
🧹 Nitpick comments (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)

27-32: Consider using dataclass for cleaner implementation.

The LightweightJobInfo class could benefit from the @dataclass decorator, which would automatically generate __init__, __repr__, __eq__, and other methods.

Apply this diff:

+from dataclasses import dataclass
+
+@dataclass
-class LightweightJobInfo:
-    """Lightweight job object containing only the fields needed for GroupedJob processing."""
-    def __init__(self, name: str, namespace: str):
-        self.name = name
-        self.namespace = namespace
+class LightweightJobInfo:
+    """Lightweight job object containing only the fields needed for GroupedJob processing."""
+    name: str
+    namespace: str
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e11b77 and c41800c.

📒 Files selected for processing (1)
  • robusta_krr/core/integrations/kubernetes/__init__.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)
robusta_krr/core/models/objects.py (2)
  • selector (76-83)
  • K8sObjectData (38-107)
🪛 Ruff (0.14.2)
robusta_krr/core/integrations/kubernetes/__init__.py

332-332: Consider moving this statement to an else block

(TRY300)


587-587: Consider moving this statement to an else block

(TRY300)


589-589: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


671-671: Use raise without specifying exception name

Remove exception name

(TRY201)

🔇 Additional comments (2)
robusta_krr/core/integrations/kubernetes/__init__.py (2)

272-280: LGTM!

The helper methods are well-implemented with clear logic, proper None handling, and good docstrings.


673-710: LGTM!

The result building logic correctly:

  • Uses the template job for container structure
  • Groups jobs by namespace
  • Applies the job_grouping_limit
  • Attaches the necessary metadata (_grouped_jobs, _label_filter) for downstream processing

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
robusta_krr/core/integrations/kubernetes/__init__.py (3)

588-595: Clean up exception handling.

Line 590 assigns the exception to variable e but never uses it (just logs with exc_info=True). Additionally, line 588 could be moved to an else block for cleaner control flow structure.

Apply this diff:

             logger.debug("Found %d regular jobs", len(all_jobs))
-            return all_jobs
             
-        except Exception as e:
+        except Exception:
             logger.error(
                 "Failed to run jobs discovery",
                 exc_info=True,
             )
-            return all_jobs
+        
+        return all_jobs

620-626: Batch limit is cumulative across all namespaces, and duplicate initialization.

Two issues here:

  1. Duplicate initialization: batch_count is initialized twice (lines 620 and 623), with the second one being the effective initialization.

  2. Global batch limit: Similar to _list_all_jobs, the batch_count limit is cumulative across all namespaces. If the first namespace consumes all batches, subsequent namespaces won't be processed, resulting in incomplete grouped job discovery.

Solution: Remove the duplicate initialization and move batch_count = 0 inside the namespace loop:

         grouped_jobs = defaultdict(list)
         grouped_jobs_template = {}  # Store only ONE full job as template per group - needed for class K8sObjectData
         continue_ref: Optional[str] = None
-        batch_count = 0
         namespaces = self.namespaces if self.namespaces != "*" else ["*"]
         try:
-            batch_count = 0
             for namespace in namespaces:
+                batch_count = 0
                 continue_ref = None
                 while batch_count < settings.discovery_job_max_batches:

554-557: Batch limit is cumulative across all namespaces.

The batch_count variable is initialized once before the namespace loop (line 554) and checked against settings.discovery_job_max_batches (line 557). This means if namespace A consumes the entire batch budget, subsequent namespaces B, C, etc. will not be processed at all, leading to incomplete job discovery in multi-namespace deployments.

Solution: Move batch_count = 0 inside the namespace loop to apply the limit per-namespace:

         namespaces = self.namespaces if self.namespaces != "*" else ["*"]
         all_jobs = []
         try:
-            batch_count = 0
             for namespace in namespaces:
+                batch_count = 0
                 continue_ref: Optional[str] = None
                 while batch_count < settings.discovery_job_max_batches:
🧹 Nitpick comments (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)

337-337: Consider moving the JSON import to module level.

Importing json inside the exception handler works but is unconventional. For consistency with other imports and minor performance benefit, consider moving it to the top of the file.

Apply this diff:

 import asyncio
 import logging
+import json
 import re
 from collections import defaultdict

Then remove the import from line 337:

         except ApiException as e:
             if e.status == 410 and e.body:
                 # Continue token expired
-                import json
                 try:
                     error_body = json.loads(e.body)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d880c4 and cc76032.

📒 Files selected for processing (1)
  • robusta_krr/core/integrations/kubernetes/__init__.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
robusta_krr/core/integrations/kubernetes/__init__.py (1)
robusta_krr/core/models/objects.py (2)
  • selector (76-83)
  • K8sObjectData (38-107)
🪛 Ruff (0.14.2)
robusta_krr/core/integrations/kubernetes/__init__.py

332-332: Consider moving this statement to an else block

(TRY300)


588-588: Consider moving this statement to an else block

(TRY300)


590-590: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


668-668: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🔇 Additional comments (3)
robusta_krr/core/integrations/kubernetes/__init__.py (3)

27-32: LGTM: Clean lightweight container for grouped job metadata.

The class appropriately holds only the minimal fields (name, namespace) needed for grouped job processing, avoiding the overhead of storing full V1Job objects.


272-281: LGTM: Well-structured predicate helpers.

Both helpers correctly filter jobs with appropriate null-safety checks and clear intent.


675-712: LGTM: Solid implementation of grouped job processing.

The template pattern (lines 677, 687-695) efficiently reuses container definitions from a single representative job rather than storing full job objects for each grouped job. The per-namespace grouping and limit enforcement (lines 679-684) correctly handle multi-namespace scenarios.

arikalon1 (Contributor) left a comment

nice work

arikalon1 merged commit 9ce49cd into main (Nov 20, 2025)
3 checks passed
arikalon1 deleted the oomkill_job_fix branch (November 20, 2025, 00:12)