Reorganize Celery beat schedule for better maintainability#765
Conversation
|
Hey @SupRaKoshti ! Thanks for your work! Before I review I have two questions:
|
|
Hi @regulartim , Thanks for your questions! About "Breaking Change": Regarding local testing: I plan to revisit the environment setup to run a full local test soon. |
|
Hi @SupRaKoshti , |
|
Hi @regulartim , Thanks for the clarification! I understand that it’s not considered a breaking change, and I’ll leave the check mark as is. I plan to get my development environment fully up and running so I can test the changes locally and verify the functionality. Once I confirm everything works, I’ll reach out to you again for review. Thanks for your patience! |
|
Hi @regulartim , Great news! I've successfully got the full GreedyBear environment running locally(I am using Windows). 🎉 Setup completed:
Testing results for the schedule changes: I've verified the Celery Beat schedule is loading correctly with the new configuration: $ docker exec greedybear_celery_beat celery -A greedybear.celery inspect scheduled
$ docker exec greedybear_celery_beat celery -A greedybear.celery reportConfirmed schedule (from celery report): Validation summary: Local testing complete! The schedule reorganization is working as designed. Ready for final review! 🙂 Note: I did encounter a separate issue with Celery Beat container stability (exits with code 0 and restarts), but this appears to be an existing issue unrelated to my configuration changes. The schedule itself loads and validates correctly. |
regulartim
left a comment
There was a problem hiding this comment.
Very nice, thank you! :)
For all the jobs you timed at minute=5 I would suggest to change them to minute=7. This is not that important, however it avoids an etch case: If the user sets the EXTRACTION_INTERVAL=5 the time critical extractions run at minute 5. EXTRACTION_INTERVAL=7 however is not supported, because EXTRACTION_INTERVAL has to be a divisor of 60. Does that make sense?
Thanks for the suggestion! 🙂 I can make the change for the jobs currently scheduled at minute=5 to minute=7 to avoid the edge case you mentioned. However, I noticed that in the current codebase, there doesn’t seem to be any enforced validation or condition that EXTRACTION_INTERVAL must be a divisor of 60. So technically, if we want to align everything, EXTRACTION_INTERVAL could also be set to 7 minutes. I’m just flagging this for awareness — could you clarify if there’s an existing rule or check for this somewhere in the code? If you want, I can also add validation to ensure EXTRACTION_INTERVAL is a divisor of 60 to prevent misconfigurations. |
You are right. However in the In my opinion, that is enough. No validation needed. |
|
Hi @regulartim , I’ve pushed the changes for the Celery tasks as discussed: Added train_and_update_after_midnight to ensure training runs after the midnight extraction using a Celery chain. Updated jobs previously scheduled at minute=5 to minute=7 to avoid the edge case with EXTRACTION_INTERVAL. Scheduled train_and_update_after_midnight exactly at midnight. Note: I wasn’t able to fully test train_and_update_after_midnight because the Celery tasks currently can’t connect to RabbitMQ (ConnectionRefusedError). Everything else (task definitions, schedules) has been updated and should work once RabbitMQ is available. Thanks! 🙂 |
regulartim
left a comment
There was a problem hiding this comment.
Hey @SupRaKoshti ! Thanks for your work. Generally please only request a review from us when you ran your code successfully at least once. If that means that it takes a day longer, that's fine.
greedybear/celery.py
Outdated
| # COMMANDS | ||
| # =========================================== | ||
| # DAILY/WEEKLY: Maintenance Tasks | ||
| # All bundled at 1:05 AM |
greedybear/celery.py
Outdated
| }, | ||
| # =========================================== | ||
| # HOURLY: Monitoring Tasks | ||
| # Run at :05 and :15 (avoiding extraction slots) |
greedybear/celery.py
Outdated
| "train_and_update_after_midnight": { | ||
| "task": "greedybear.tasks.train_and_update_after_midnight", | ||
| "schedule": crontab(hour=0, minute=0), | ||
| "options": {"queue": "default"}, |
There was a problem hiding this comment.
But now this runs in addition to the "extract_all" job. So at midnight two jobs run doing the same thing. This would lead to invalid data.
|
Hi @regulartim 👋 I've addressed the duplicate extraction issue you identified! Here's what I changed and verified: ✅ Problem Fixed: Duplicate Midnight ExtractionPrevious behavior:
This caused duplicate data extraction as you correctly pointed out! ❌ New behavior: 🔧 Solution: Chain Flag ParameterChanges Made:1. Added @shared_task()
def extract_all(is_midnight_chain=False):
now = timezone.now()
if (now.hour == 0 and now.minute == 0 and not is_midnight_chain):
return "Skipped regular midnight extraction"
from greedybear.cronjobs.extract import ExtractionJob
ExtractionJob().execute()2. Midnight chain passes the flag: @shared_task()
def train_and_update_after_midnight():
chain(extract_all.s(is_midnight_chain=True), chain_train_and_update.s())()3. Added 📊 How It Works NowAt Midnight (00:00):
Result: Only ONE extraction at midnight, training guaranteed to run after extraction completes! ✅ 🎯 Why Chain-Based Approach?I used Celery's built-in ✅ Sequential execution guaranteed - Training can't start before extraction finishes ❌ Why NOT Inline Approach?I considered calling ❌ Tight coupling - Extraction shouldn't know about training The chain approach keeps tasks decoupled and follows Celery architecture principles. ✅ Testing & VerificationLocal Testing Completed:1. Fixed Docker/Celery environment issues:
2. Verified midnight skip logic: Instead of waiting until midnight to test, I simulated the behavior by adjusting the Celery Beat schedules temporarily: Temporary Test Configuration
What I Verified
3. Checked task statistics: docker exec greedybear_celery_beat celery -A greedybear.celery inspect stats
# Confirmed tasks executing without errors🎊 SummaryWhat changed:
What's fixed:
Testing:
Ready for review! Let me know if you'd like any changes to the approach. 🙂 |
|
Hi @regulartim , I’ve updated the implementation to use datetime.utcnow() instead of timezone.now() to align with the project’s UTC-only approach. I also verified the behavior by running the tasks locally after adjusting the relevant USE_TZ and TIME_ZONE settings to ensure there are no side effects. I tested the scheduled execution and confirmed that the midnight logic and chaining behavior work as expected. Let me know if anything else should be adjusted. |
regulartim
left a comment
There was a problem hiding this comment.
Hey @SupRaKoshti ! :) Sorry, but there are still some problems:
datetime.utcnow()is deprecated, just usedatetime.now()- There's still the danger of a duplicate extraction: if
train_and_update_after_midnightruns first and takes longer than a minute, the check inextract_allwill fail. If you don't have a better idea to solve this, just return to the old way of scheduling the training task. We might migrate away from celery soon and thus will maybe find another way to solve this.
|
Hi @regulartim , Thanks for the detailed feedback. You're right about datetime.utcnow(). I’ll remove it entirely — it was introduced as part of the new scheduling logic I implemented, so reverting the scheduling changes also makes that datetime adjustment unnecessary. Regarding the race condition, I understand the concern about train_and_update_after_midnight potentially running longer and allowing extract_all to execute in parallel. Since this still leaves a duplication risk and we may move away from Celery in the future, I agree that reverting to the previous scheduling approach is the safer and cleaner solution for now. I’ll restore the original scheduling logic to eliminate the concurrency risk and remove the additional datetime-based changes introduced with it. Please let me know if you'd prefer a different direction. |
- Move hourly monitoring tasks to :05 and :15 - Bundle all daily/weekly maintenance tasks at 1:05 AM - Add clear section headers for improved readability - Ensure no collisions with extraction slots (:00, :10, :20, :30, :40, :50) - Simplify from scattered times (1:03, 2:03, 4:03, 4:15, 4:30) to single time (1:05) Fixes intelowlproject#747
- Added `train_and_update_after_midnight` task to ensure training runs only after midnight extraction via Celery chain(). - Updated jobs previously scheduled at minute=5 to minute=7 to avoid edge cases with EXTRACTION_INTERVAL.
Previously, both the regular `extract_all` schedule (every 10 min) and `train_and_update_after_midnight` were running extraction at 00:00, causing duplicate data extraction. Changes: - Added `is_midnight_chain` parameter to `extract_all` task - Regular midnight(12:00 AM) extraction now skips when `is_midnight_chain=False` - Midnight(12:00 AM) chain explicitly passes `is_midnight_chain=True` - Added `*args, **kwargs` to `chain_train_and_update` for chain compatibility This ensures only ONE extraction runs at midnight (via the chain), and training runs sequentially after extraction completes. Fixes duplicate extraction issue raised in previous review.
Restore the original training task scheduling as discussed in review. Remove datetime.utcnow() which was introduced with the new scheduling logic.
615901e to
d2dfea0
Compare
|
Hi @regulartim, Thanks again for the clarification. I’ve now reverted the scheduling logic to the original implementation to eliminate the potential race condition between extract_all and the training task. This removes the duplication risk and restores the previous stable behavior. I’ve also removed the datetime.utcnow() change, since it was introduced as part of the modified scheduling logic and is no longer needed. Additionally, I realized this branch was originally created from my local main instead of develop, which caused unrelated differences to appear in the PR. I’ve now rebased the branch properly onto develop to ensure only the intended changes are included. Please let me know if anything else needs adjustment. |
regulartim
left a comment
There was a problem hiding this comment.
Thanks for your work! :)
Reorganize Celery beat schedule for better maintainability. Closes #747
Description
Reorganizes Celery beat schedule to be more systematic and avoid task collisions with timing-critical extraction tasks.
Changes:
Related issues
Closes #747
Type of change
Checklist
develop.Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.Changes Made
Hourly Monitoring Tasks
monitor_honeypotsfrom :18 to :05monitor_logsfrom :33 to :15Daily/Weekly Maintenance Tasks
Bundled all 6 maintenance tasks at 1:05 AM (as requested by maintainer @regulartim):
command_clustering: moved from 01:03 to 01:05clean_up: moved from 02:03 to 01:05get_mass_scanners: moved from 04:03 (Sunday) to 01:05 (Sunday)get_whatsmyip: moved from 04:03 (Saturday) to 01:05 (Saturday)extract_firehol_lists: moved from 04:15 (Sunday) to 01:05 (Sunday)get_tor_exit_nodes: moved from 04:30 (Sunday) to 01:05 (Sunday)Documentation Improvements
Schedule Overview
Benefits
✅ Systematic scheduling - Easy to remember and reason about
✅ No collisions - All tasks avoid extraction time slots
✅ Bundled maintenance - Follows maintainer's preference to group non-critical tasks
✅ Improved readability - Clear section headers and consistent formatting
✅ Simplified timing - From scattered times to single time (1:05)
✅ Better maintainability - Future developers can easily understand the schedule logic
Testing
Additional Notes
This is a configuration-only change that reorganizes task schedules without modifying any business logic. The timing-critical tasks (extraction and training) remain unchanged. All non-critical tasks have been systematically rescheduled to avoid collisions and improve maintainability as discussed in #747.