Conversation
Its type is random, so we may end up setting a TZ hash for the main checksum or SHA256 for the homomorphic one. This can make tests break randomly as the code evolves and relies more on _correct_ objects for various functions. Signed-off-by: Roman Khimov <roman@nspcc.ru>
Inhume() sets both graveyard and garbage marks for deleted objects, and when checking for status the graveyard takes precedence, so "tombstoned" is returned for an object instead of "gcmarked". Now, if graveyard marks are omitted, we still find a tombstone for simple objects, but children only have a GC mark, so unless we check for their parent we can't return a better status. Signed-off-by: Roman Khimov <roman@nspcc.ru>
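The precedence described above can be sketched as a tiny status-resolution function. This is a minimal illustration with hypothetical names, not the metabase's real API: the graveyard mark wins over the GC mark, which is why dropping graveyard marks degrades the reported status for child objects that only carry a GC mark.

```go
package main

import "fmt"

// objectStatus is a hypothetical sketch of the status precedence: a
// graveyard entry used to take priority over a GC mark, so tombstoned
// objects reported "tombstoned" rather than "gcmarked".
func objectStatus(inGraveyard, hasGCMark bool) string {
	if inGraveyard {
		return "tombstoned" // graveyard takes precedence
	}
	if hasGCMark {
		return "gcmarked" // all a child object has without graveyard marks
	}
	return "available"
}

func main() {
	fmt.Println(objectStatus(true, true))
	fmt.Println(objectStatus(false, true))
}
```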
We no longer need this microoptimization after 6afabe2. Signed-off-by: Roman Khimov <roman@nspcc.ru>
And drop the tombstone if it's present, since that's the only way to resurrect the object. Signed-off-by: Roman Khimov <roman@nspcc.ru>
It doesn't need db. Signed-off-by: Roman Khimov <roman@nspcc.ru>
Codecov Report

❌ Patch coverage check failed.

```
@@            Coverage Diff             @@
##           master    #3744      +/-   ##
==========================================
- Coverage   26.71%   26.56%   -0.15%
==========================================
  Files         656      655       -1
  Lines       41615    41505     -110
==========================================
- Hits        11116    11027      -89
+ Misses      29487    29470      -17
+ Partials     1012     1008       -4
```
Force-pushed ef33b11 to 76733a5.
Pull request overview
This PR removes the graveyard bucket from the metabase, which fundamentally changes how tombstone-marked objects are tracked. Instead of maintaining a separate graveyard bucket with tombstone-to-object mappings, tombstone objects are now stored as regular objects and directly mark their targets with GC marks in the garbage bucket.
Key changes:
- Tombstone objects are now broadcast to all shards and stored like other system objects
- Objects marked by tombstones are immediately added to the garbage bucket
- Metabase version upgraded to 9 with migration to delete obsolete graveyard bucket
- Enhanced Put operation with rollback on metabase failure and broadcast rollback on fatal errors
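The key change can be sketched with plain maps standing in for metabase buckets (the type and method names below are illustrative, not the real metabase API): a tombstone Put stores the tombstone as a regular object and immediately GC-marks its targets, so no separate graveyard bucket is needed.

```go
package main

import "fmt"

// metabase is a toy stand-in for the real metabase: objects holds stored
// objects by address, garbage holds GC-marked addresses (the garbage bucket).
type metabase struct {
	objects map[string]bool
	garbage map[string]bool
}

// putTombstone sketches the new scheme: the tombstone becomes a regular
// stored object and each target goes straight into the garbage set, with
// no tombstone->target graveyard mapping kept anywhere.
func (db *metabase) putTombstone(tombAddr string, targets []string) {
	db.objects[tombAddr] = true // stored like any other system object
	for _, t := range targets {
		db.garbage[t] = true // target is garbage right away
	}
}

func main() {
	db := &metabase{objects: map[string]bool{}, garbage: map[string]bool{}}
	db.putTombstone("cnr/ts1", []string{"cnr/obj1", "cnr/obj2"})
	fmt.Println(len(db.objects), len(db.garbage))
}
```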
Reviewed changes
Copilot reviewed 49 out of 49 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| pkg/local_object_storage/metabase/util.go | Renamed graveyard prefix to unusedGraveyardPrefix and removed graveyard bucket name |
| pkg/local_object_storage/metabase/version.go | Added migration from version 8 to 9 to delete graveyard and locked buckets |
| pkg/local_object_storage/metabase/put.go | Modified tombstone handling to directly mark targets as garbage and update counters |
| pkg/local_object_storage/metabase/inhume.go | Removed Inhume method, simplified MarkGarbage to no longer support tombstone parameter |
| pkg/local_object_storage/metabase/graveyard.go | Removed graveyard iteration functions, kept only garbage iteration |
| pkg/local_object_storage/metabase/exists.go | Replaced graveyard checks with garbage checks, simplified status logic |
| pkg/local_object_storage/metabase/revive.go | Modified to delete tombstone object when reviving instead of removing graveyard entry |
| pkg/local_object_storage/metabase/counter.go | Updated syncCounter to only count Regular objects as logical, made updateCounter non-method |
| pkg/local_object_storage/metabase/control.go | Fixed reset to collect bucket names before deletion to avoid iteration during modification |
| pkg/local_object_storage/shard/put.go | Added rollback logic to delete from blobstor/writecache on metabase Put failure |
| pkg/local_object_storage/shard/inhume.go | Removed Inhume method, kept only MarkGarbage |
| pkg/local_object_storage/shard/gc.go | Removed call to DropExpiredTSMarks |
| pkg/local_object_storage/engine/put.go | Removed special tombstone inhume logic, added broadcast rollback on fatal errors |
| pkg/local_object_storage/engine/inhume.go | Removed inhume and inhumeAddr methods, fixed deadlock in processExpiredObjects |
| cmd/neofs-lens/internal/meta/list-graveyard.go | Deleted entire file (graveyard listing command removed) |
| cmd/neofs-lens/internal/meta/root.go | Removed list-graveyard command registration |
| Various test files | Updated tests to use Put for tombstones instead of Inhume, adjusted expectations |
Force-pushed 7891017 to 4bcef0d.
```go
tombAddress.SetObject(tombOID)
_, _, _, err := db.delete(tx, tombAddress)
if err != nil {
	return err
```
Revive is so special that I don't think it's a problem. It's almost the only thing the whole revival call does.
As you wish; I try to always add any unexpected internal details to the error context: this operation is expected to make an object available, yet it can return an error about removing a different object (and I guess the TS's ID will not be findable in logs/error messages either).
```go
if err == nil {
	if inGraveyardWithKey(metaCursor, garbageKey, graveyardBKT, garbageObjectsBKT) == statusAvailable {
		// object is available, decrement the
		// logical counter
```
haven't we dropped any "logic" mentions?
It's still in the metabase. It still means nothing. But this waits for #3740.
This requires metabase Put to process children and update counters the same way Inhume() does, but ultimately the behavior is the same except that no graveyard marks are created. All of the current code can work with TS meta only, so it's not a problem. Signed-off-by: Roman Khimov <roman@nspcc.ru>
Dropping buckets while iterating over them is undefined behavior in BoltDB (much like dropping keys while iterating over them). This was detected when some completely unrelated DB scheme changes somehow triggered permanent Reset test failures. Signed-off-by: Roman Khimov <roman@nspcc.ru>
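The safe pattern is a two-phase reset: collect names while iterating, then delete after the cursor is finished. A plain map stands in for the bucket namespace in this sketch; with bbolt the first loop would be `tx.ForEach` and the second `tx.DeleteBucket`.

```go
package main

import "fmt"

// reset demonstrates the collect-then-delete pattern: never mutate the
// set you are iterating over. Phase 1 only records names; phase 2 deletes
// them once iteration is over.
func reset(buckets map[string][]byte) {
	var names []string
	for name := range buckets { // phase 1: iterate, collecting names only
		names = append(names, name)
	}
	for _, name := range names { // phase 2: delete outside of iteration
		delete(buckets, name)
	}
	fmt.Println(len(buckets))
}

func main() {
	reset(map[string][]byte{"graveyard": nil, "garbage": nil, "locked": nil})
}
```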
Double check for addr makes zero sense. Signed-off-by: Roman Khimov <roman@nspcc.ru>
It's not used in production code now. This makes some tests totally obsolete, so they're removed. There is also some disagreement over the logical counter's meaning in the code (most of the Inhume() calls are replaced with TS Put()), so for now we make it more consistent by counting Regular objects only. Signed-off-by: Roman Khimov <roman@nspcc.ru>
Nothing writes to it now, so it can be safely removed. This finally fixes #3476. Signed-off-by: Roman Khimov <roman@nspcc.ru>
This check is useless since d6c8cff. Signed-off-by: Roman Khimov <roman@nspcc.ru>
Clean up unsuccessful puts as should've always been done. The problem is seen in tests added by fa9347b (which just followed the implementation) and also in some runs of tests.object.test_object_lock.TestObjectLockWithGrpc#test_the_object_lock_should_be_kept_after_metabase_deletion after the engine tombstone put rework. The scenario is:

1. An object is put to the node (with two shards).
2. A lock is put to the node (to both shards):

```
2025-12-18T22:41:37.008Z info log/log.go:12 local object storage operation {"shard_id": "NtAo5RCDEtxFqk3zdEyNFV", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/GAthAEwvYhCf7wdmBLgzQLB2fTYRxjRDt1fQuSx985P9", "op": "PUT"}
2025-12-18T22:41:37.019Z info log/log.go:12 local object storage operation {"shard_id": "NtAo5RCDEtxFqk3zdEyNFV", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/GAthAEwvYhCf7wdmBLgzQLB2fTYRxjRDt1fQuSx985P9", "op": "metabase PUT"}
2025-12-18T22:41:37.019Z debug engine/put.go:261 successfully put object on shard during broadcast {"type": "LOCK", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "shard": "NtAo5RCDEtxFqk3zdEyNFV", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/GAthAEwvYhCf7wdmBLgzQLB2fTYRxjRDt1fQuSx985P9"}
2025-12-18T22:41:37.029Z info log/log.go:12 local object storage operation {"shard_id": "7UtCzpPxM42RX299xYajRh", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/GAthAEwvYhCf7wdmBLgzQLB2fTYRxjRDt1fQuSx985P9", "op": "PUT"}
2025-12-18T22:41:37.041Z info log/log.go:12 local object storage operation {"shard_id": "7UtCzpPxM42RX299xYajRh", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/GAthAEwvYhCf7wdmBLgzQLB2fTYRxjRDt1fQuSx985P9", "op": "metabase PUT"}
2025-12-18T22:41:37.041Z debug engine/put.go:261 successfully put object on shard during broadcast {"type": "LOCK", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "shard": "7UtCzpPxM42RX299xYajRh", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/GAthAEwvYhCf7wdmBLgzQLB2fTYRxjRDt1fQuSx985P9"}
2025-12-18T22:41:37.041Z debug engine/put.go:284 object broadcast completed {"type": "LOCK", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/GAthAEwvYhCf7wdmBLgzQLB2fTYRxjRDt1fQuSx985P9", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "success_count": 2, "total_shards": 2}
```

3. An attempt to put a TS on the object tries to put it to some shard and fails because the object is locked; however, the failure happens at the metabase level, and the TS is stored in the FSTree:

```
2025-12-18T22:42:07.261Z debug engine/put.go:235 broadcasting object to all shards {"type": "TOMBSTONE", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/ASz59T9GzUWUmPQ7umakq16YAeK96RtZRzoFNySTPsr6", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "shard_count": 2}
...
2025-12-18T22:42:07.272Z info log/log.go:12 local object storage operation {"shard_id": "NtAo5RCDEtxFqk3zdEyNFV", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/ASz59T9GzUWUmPQ7umakq16YAeK96RtZRzoFNySTPsr6", "op": "PUT"}
2025-12-18T22:42:07.283Z warn engine/engine.go:180 could not put object to shard {"shard_id": "NtAo5RCDEtxFqk3zdEyNFV", "error count": 1, "error": "could not put object to metabase: status: code = 2050 message = object is locked"}
2025-12-18T22:42:07.283Z debug engine/put.go:284 object broadcast completed {"type": "TOMBSTONE", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/ASz59T9GzUWUmPQ7umakq16YAeK96RtZRzoFNySTPsr6", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "success_count": 0, "total_shards": 2}
2025-12-18T22:42:07.283Z error util/log.go:10 object service error {"oid": "ASz59T9GzUWUmPQ7umakq16YAeK96RtZRzoFNySTPsr6", "node": "", "request": "PUT", "error": "write object locally: could not put object to local storage: failed to broadcast TOMBSTONE object to any shard, last error: could not put object to metabase: status: code = 2050 message = object is locked"}
```

4. Then the metabase gets deleted and resynchronized; obviously, there is an OID race between the lock and the TS on one shard, and sometimes the TS wins:

```
2025-12-18T22:42:22.516Z info log/log.go:12 local object storage operation {"shard_id": "GmF1GLYsMNrMLdab56rGGm", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/ASz59T9GzUWUmPQ7umakq16YAeK96RtZRzoFNySTPsr6", "op": "metabase PUT"}
```

5. Then another attempt to delete the object is performed, which succeeds on one shard but fails on the other; since the TS has been saved successfully to at least one shard, the operation succeeds:

```
2025-12-18T22:42:37.834Z debug engine/put.go:235 broadcasting object to all shards {"type": "TOMBSTONE", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/DHHMTzTkpqrJZTkSJYd4ZFvoTGkve9WxAMWHqSE2zYUb", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "shard_count": 2}
...
2025-12-18T22:42:37.857Z debug engine/put.go:261 successfully put object on shard during broadcast {"type": "TOMBSTONE", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "shard": "GmF1GLYsMNrMLdab56rGGm", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/DHHMTzTkpqrJZTkSJYd4ZFvoTGkve9WxAMWHqSE2zYUb"}
2025-12-18T22:42:37.868Z info log/log.go:12 local object storage operation {"shard_id": "5Yo6cnAvYbDF8ypSS14Pcx", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/DHHMTzTkpqrJZTkSJYd4ZFvoTGkve9WxAMWHqSE2zYUb", "op": "PUT"}
2025-12-18T22:42:37.879Z warn engine/engine.go:180 could not put object to shard {"shard_id": "5Yo6cnAvYbDF8ypSS14Pcx", "error count": 1, "error": "could not put object to metabase: status: code = 2050 message = object is locked"}
2025-12-18T22:42:37.879Z debug engine/put.go:284 object broadcast completed {"type": "TOMBSTONE", "addr": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/DHHMTzTkpqrJZTkSJYd4ZFvoTGkve9WxAMWHqSE2zYUb", "associated": "DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "success_count": 1, "total_shards": 2}
2025-12-18T22:42:37.884Z debug delete/delete.go:50 operation finished successfully {"component": "Object.Delete service", "request": "DELETE", "address": "G4Be2VXRvieu4QH74CSYWq6NMaSQ7rZCxmHQBCYrmKS1/DN1moJr1Y6Qj1yoKbNvxumdfB81EFKFdJynEyWNCd8ve", "local": false, "with session": true, "with bearer": false}
```

Signed-off-by: Roman Khimov <roman@nspcc.ru>
Otherwise lock/TS conflicts or other problems get completely hidden on resync, leading to debugging surprises (a missing message for an expected object). Signed-off-by: Roman Khimov <roman@nspcc.ru>
Force-pushed 4bcef0d to e63e54c.
Limit the possible damage here and make the node more consistent; currently we can have one shard with a TS and another with a lock for the same object. Signed-off-by: Roman Khimov <roman@nspcc.ru>
This callback is used by shard GC. All of its logic implies the engine
is active and this status shouldn't change during the whole execution.
Also, its locking is inconsistent: IsLocked() takes a lock, but
processAddrDelete() is internal, so it doesn't and shouldn't; it relies
on the caller for that.
But just locking isn't enough. There is a deadlock scenario between
engine Close() and shard GC doing its job:
```
goroutine 784 [sync.WaitGroup.Wait, 9 minutes]:
sync.runtime_SemacquireWaitGroup(0xc000625700?, 0x0?)
	/opt/hostedtoolcache/go/1.25.5/x64/src/runtime/sema.go:114 +0x2e
sync.(*WaitGroup).Wait(0xc000625700)
	/opt/hostedtoolcache/go/1.25.5/x64/src/sync/waitgroup.go:206 +0xd3
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*gc).stop(0xc0006256e0)
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/shard/gc.go:132 +0xe8
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).Close(0xc000619200)
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/shard/control.go:216 +0x616
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).close(0xc00022d7a0, 0x1)
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/engine/control.go:91 +0x316
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).setBlockExecErr(0xc00022d7a0, {0x15360c0, 0x1c012c0})
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/engine/control.go:125 +0x1bd
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).Close(0xc00022d7a0)
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/engine/control.go:76 +0xcc
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.TestChildrenExpiration.func1()
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/engine/gc_test.go:55 +0x2f

goroutine 835 [sync.RWMutex.RLock, 9 minutes]:
sync.runtime_SemacquireRWMutexR(0xc00022d7f0?, 0x1?, 0x100de39?)
	/opt/hostedtoolcache/go/1.25.5/x64/src/runtime/sema.go:100 +0x25
sync.(*RWMutex).RLock(0xc00022d7e0)
	/opt/hostedtoolcache/go/1.25.5/x64/src/sync/rwmutex.go:74 +0x5b
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).IsLocked(0xc00022d7a0, {{0xc9, 0x25, 0xbd, 0xb9, 0xa9, 0xae, 0x28, 0xe4, 0xb, ...}, ...})
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/engine/inhume.go:223 +0x5b
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).processExpiredObjects(0xc00022d7a0, {0xc0002a0f00, 0x1, 0x1326c73?})
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/engine/inhume.go:247 +0xf9
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).collectExpiredObjects(0xc000619200)
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/shard/gc.go:238 +0xb2e
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).removeGarbage(0xc000619200)
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/shard/gc.go:149 +0x125
github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*gc).tickRemover(0xc0006256e0)
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/shard/gc.go:120 +0x110
created by github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*gc).init in goroutine 784
	/home/runner/work/neofs-node/neofs-node/pkg/local_object_storage/shard/gc.go:81 +0xa5
```
To solve it we can TryRLock() and return if we fail to obtain the lock.
That's not a problem for GC, which will come back to the same object list
on subsequent iterations.
Signed-off-by: Roman Khimov <roman@nspcc.ru>
Force-pushed e63e54c to 60d0d3a.