Skip to content

fix: lock MeasurementFields while validating#25998

Merged
davidby-influx merged 4 commits intomaster-1.xfrom
DSB_field_creation_test
Feb 13, 2025
Merged

fix: lock MeasurementFields while validating#25998
davidby-influx merged 4 commits intomaster-1.xfrom
DSB_field_creation_test

Conversation

@davidby-influx
Copy link
Copy Markdown
Contributor

There was a window where a race between writes with
differing types for the same field were being validated.
Lock the MeasurementFields struct during field
validation to avoid this.

closes #23756

There was a window where a race between writes with
differing types for the same field were being validated.
Lock the  MeasurementFields struct during field
validation to avoid this.

closes #23756
tsdb/shard.go Outdated
case PartialWriteError:
if reason == "" {
reason = err.Reason
cont, err := func(p models.Point, iter models.FieldIterator) (cont bool, err error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the only time we don't have cont == true is where there is an error, and then the continue is at the bottom of the loop. Is cont really doing anything?

tsdb/shard.go Outdated
Comment on lines +747 to +753
fieldsToCreate = append(fieldsToCreate, &FieldCreate{
Measurement: name,
Field: &Field{
Name: string(fieldKey),
Type: dataType,
},
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a write has two points that both have a new field (newField), but the two points have different types for newField, they would both end up in fieldsToCreate, correct? Would that create issues later?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code had that same pattern. The first gets created, the second rejected.

tsdb/shard.go Outdated
mf.mu.Lock()
defer mf.mu.Unlock()
// Check with the field validator.
if err := ValidateFields(mf, p, s.options.Config.SkipFieldSizeValidation); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValidateFields knows which fields are not currently in the measurement fields, and this is the only place in the code it is called from. We could avoid iterating over the fields again below looking for unknown fields if ValidateFields collected them for us.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. My question was how wide-ranging the changes should be. This PR is the minimal set of changes I could find that fixed the bug, but you are correct that there is much room for improvement in the code as it is in the product and in this PR. Let's discuss how radical to get.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to replace the atomic.Value storing the fields map with a sync.Map with a generic wrapper for type safety, for instance. Lots of locking goes away if we do that, but it's a big, scary change.

tsdb/shard.go Outdated
Comment on lines +707 to +708
mf := engine.MeasurementFields(name)
mf.mu.Lock()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like there is still a potential race condition. We lock the mf while looking up fields here, but then unlock it while we continue to the next point. Another incoming write in a different goroutine could then look up fields in mf before this goroutine can create the new fields. Or am I missing something?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field creation has its own locking and checks yet again for field type conflicts. So either of the go routines may win, and the other will report an error.

So, the race you describe is real, but gets sequenced in field creation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the MeasurementFields lock to a RWMutex and used an RLock in Shard.validateSeriesAndFields to allow greater parallelism.

The lock has to be taken on a per point basis, because points can differ in which measurement they pertain to, and thus which MeasurementFields object has to be locked.

MeasurementFields.CreateFieldIfNotExists still uses the full lock, of course.

Copy link
Copy Markdown
Contributor

@stuartcarnie stuartcarnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me – I believe I understand the data race, as per my comments. If I am wrong, please do correct me!

Comment on lines -41 to +40
// If the fields is not present, there cannot be a conflict.
f := mf.FieldBytes(iter.FieldKey())
if f == nil {
if bytes.Equal(fieldKey, timeBytes) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is this is the first part of the data race, where to concurrent writes would both return nil here, resulting in this continuing, as "there cannot be a conflict" according to the comment.

Comment on lines -736 to -738
if mf.FieldBytes(fieldKey) != nil {
continue
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the second part is here, where the thread one has now created the field, of (for example) data type float, and thread two reaches this point and continues as the field key now exists, but thread two wants to create a field of type integer.

I believe the other part of the race is that a snapshot has started for the data from thread one, so a new cache accepts the writes for thread two of the new data type, as the cache will accept the value, as it does not have any previous data to validate.

@davidby-influx davidby-influx merged commit 5a20a83 into master-1.x Feb 13, 2025
9 checks passed
@davidby-influx davidby-influx deleted the DSB_field_creation_test branch February 13, 2025 19:34
davidby-influx added a commit that referenced this pull request Feb 14, 2025
There was a window where a race between writes with
differing types for the same field were being validated.
Lock the  MeasurementFields struct during field
validation to avoid this.

closes #23756

(cherry picked from commit 5a20a83)

helps #26001
davidby-influx added a commit that referenced this pull request Feb 14, 2025
* fix: lock MeasurementFields while validating (#25998)

There was a window where a race between writes with
differing types for the same field were being validated.
Lock the  MeasurementFields struct during field
validation to avoid this.

closes #23756

(cherry picked from commit 5a20a83)

helps #26001

* fix: switch MeasurementFields from atomic.Value to sync.Map (#26022)

Simplify and speed up synchronization for
MeasurementFields structures by switching
from a mutex and atomic.Value to a sync.Map

(cherry picked from commit b617eb2)

closes #26001
@philjb
Copy link
Copy Markdown
Contributor

philjb commented Feb 18, 2025

some follow up in pr: #26028

devanbenz pushed a commit that referenced this pull request May 9, 2025
* fix: lock MeasurementFields while validating (#25998)

There was a window where a race between writes with
differing types for the same field were being validated.
Lock the  MeasurementFields struct during field
validation to avoid this.

closes #23756

(cherry picked from commit 5a20a83)

helps #26001

* fix: switch MeasurementFields from atomic.Value to sync.Map (#26022)

Simplify and speed up synchronization for
MeasurementFields structures by switching
from a mutex and atomic.Value to a sync.Map

(cherry picked from commit b617eb2)

closes #26001

(cherry picked from commit 8711e2d)
devanbenz pushed a commit that referenced this pull request May 9, 2025
* fix: lock MeasurementFields while validating (#25998)

There was a window where a race between writes with
differing types for the same field were being validated.
Lock the  MeasurementFields struct during field
validation to avoid this.

closes #23756

(cherry picked from commit 5a20a83)

helps #26001

* fix: switch MeasurementFields from atomic.Value to sync.Map (#26022)

Simplify and speed up synchronization for
MeasurementFields structures by switching
from a mutex and atomic.Value to a sync.Map

(cherry picked from commit b617eb2)

closes #26001

(cherry picked from commit 8711e2d)
devanbenz added a commit that referenced this pull request May 12, 2025
Co-authored-by: davidby-influx <72418212+davidby-influx@users.noreply.github.com>
fix: lock MeasurementFields while validating (#25998)
closes #23756
fix: switch MeasurementFields from atomic.Value to sync.Map (#26022)
closes #26001
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants