fix: lock MeasurementFields while validating#25998
fix: lock MeasurementFields while validating#25998davidby-influx merged 4 commits intomaster-1.xfrom
Conversation
There was a window where a race between writes with differing types for the same field were being validated. Lock the MeasurementFields struct during field validation to avoid this. closes #23756
tsdb/shard.go
Outdated
| case PartialWriteError: | ||
| if reason == "" { | ||
| reason = err.Reason | ||
| cont, err := func(p models.Point, iter models.FieldIterator) (cont bool, err error) { |
There was a problem hiding this comment.
It looks like the only time we don't have cont == true is where there is an error, and then the continue is at the bottom of the loop. Is cont really doing anything?
tsdb/shard.go
Outdated
| fieldsToCreate = append(fieldsToCreate, &FieldCreate{ | ||
| Measurement: name, | ||
| Field: &Field{ | ||
| Name: string(fieldKey), | ||
| Type: dataType, | ||
| }, | ||
| }) |
There was a problem hiding this comment.
If a write has two points that both have a new field (newField), but the two points have different types for newField, they would both end up in fieldsToCreate, correct? Would that create issues later?
There was a problem hiding this comment.
The old code had that same pattern. The first gets created, the second rejected.
tsdb/shard.go
Outdated
| mf.mu.Lock() | ||
| defer mf.mu.Unlock() | ||
| // Check with the field validator. | ||
| if err := ValidateFields(mf, p, s.options.Config.SkipFieldSizeValidation); err != nil { |
There was a problem hiding this comment.
ValidateFields knows which fields are not currently in the measurement fields, and this is the only place in the code it is called from. We could avoid iterating over the fields again below looking for unknown fields if ValidateFields collected them for us.
There was a problem hiding this comment.
I agree. My question was how wide-ranging the changes should be. This PR is the minimal set of changes I could find that fixed the bug, but you are correct that there is much room for improvement in the code as it is in the product and in this PR. Let's discuss how radical to get.
There was a problem hiding this comment.
I'd like to replace the atomic.Value storing the fields map with a sync.Map with a generic wrapper for type safety, for instance. Lots of locking goes away if we do that, but it's a big, scary change.
tsdb/shard.go
Outdated
| mf := engine.MeasurementFields(name) | ||
| mf.mu.Lock() |
There was a problem hiding this comment.
It feels like there is still a potential race condition. We lock the mf while looking up fields here, but then unlock it while we continue to the next point. Another incoming write in a different goroutine could then look up fields in mf before this goroutine can create the new fields. Or am I missing something?
There was a problem hiding this comment.
The field creation has its own locking and checks yet again for field type conflicts. So either of the go routines may win, and the other will report an error.
So, the race you describe is real, but gets sequenced in field creation.
There was a problem hiding this comment.
I changed the MeasurementFields lock to a RWMutex and used an RLock in Shard.validateSeriesAndFields to allow greater parallelism.
The lock has to be taken on a per point basis, because points can differ in which measurement they pertain to, and thus which MeasurementFields object has to be locked.
MeasurementFields.CreateFieldIfNotExists still uses the full lock, of course.
stuartcarnie
left a comment
There was a problem hiding this comment.
Looks good to me – I believe I understand the data race, as per my comments. If I am wrong, please do correct me!
| // If the fields is not present, there cannot be a conflict. | ||
| f := mf.FieldBytes(iter.FieldKey()) | ||
| if f == nil { | ||
| if bytes.Equal(fieldKey, timeBytes) { |
There was a problem hiding this comment.
My understanding is this is the first part of the data race, where to concurrent writes would both return nil here, resulting in this continuing, as "there cannot be a conflict" according to the comment.
| if mf.FieldBytes(fieldKey) != nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
And the second part is here, where the thread one has now created the field, of (for example) data type float, and thread two reaches this point and continues as the field key now exists, but thread two wants to create a field of type integer.
I believe the other part of the race is that a snapshot has started for the data from thread one, so a new cache accepts the writes for thread two of the new data type, as the cache will accept the value, as it does not have any previous data to validate.
* fix: lock MeasurementFields while validating (#25998) There was a window where a race between writes with differing types for the same field were being validated. Lock the MeasurementFields struct during field validation to avoid this. closes #23756 (cherry picked from commit 5a20a83) helps #26001 * fix: switch MeasurementFields from atomic.Value to sync.Map (#26022) Simplify and speed up synchronization for MeasurementFields structures by switching from a mutex and atomic.Value to a sync.Map (cherry picked from commit b617eb2) closes #26001
|
some follow up in pr: #26028 |
* fix: lock MeasurementFields while validating (#25998) There was a window where a race between writes with differing types for the same field were being validated. Lock the MeasurementFields struct during field validation to avoid this. closes #23756 (cherry picked from commit 5a20a83) helps #26001 * fix: switch MeasurementFields from atomic.Value to sync.Map (#26022) Simplify and speed up synchronization for MeasurementFields structures by switching from a mutex and atomic.Value to a sync.Map (cherry picked from commit b617eb2) closes #26001 (cherry picked from commit 8711e2d)
* fix: lock MeasurementFields while validating (#25998) There was a window where a race between writes with differing types for the same field were being validated. Lock the MeasurementFields struct during field validation to avoid this. closes #23756 (cherry picked from commit 5a20a83) helps #26001 * fix: switch MeasurementFields from atomic.Value to sync.Map (#26022) Simplify and speed up synchronization for MeasurementFields structures by switching from a mutex and atomic.Value to a sync.Map (cherry picked from commit b617eb2) closes #26001 (cherry picked from commit 8711e2d)
There was a window where a race between writes with
differing types for the same field were being validated.
Lock the MeasurementFields struct during field
validation to avoid this.
closes #23756