[WIP] [ENH] Export EDF files using pyedflib #12085
withmywoessner wants to merge 8 commits into mne-tools:main from
Conversation
for more information, see https://pre-commit.ci
wmvanvliet
left a comment
Happy to see you working on this @withmywoessner!
# DeprecationWarning: `sample_rate` is deprecated and
# will be removed in a future release.
# Please use `sample_frequency` instead
# This causes test to fail, so we catch it here
Couldn't we update the test instead?
I'll look into it!
    f"trying to write {desc} at {onset} "
    f"for {duration} seconds."
)
del hdl
I'm not sure if it is. I just included it because the documentation for the pyedflib high-level functions includes it:
with pyedflib.EdfWriter(edf_file, n_channels=n_channels, file_type=file_type) as f:
f.setDatarecordDuration(int(100000 * block_size))
f.setSignalHeaders(signal_headers)
f.setHeader(header)
f.writeSamples(signals, digital=digital)
for annotation in annotations:
f.writeAnnotation(*annotation)
del f
@wmvanvliet Thank you for the suggestions! The thing I really wanted help/opinions on is getting the tolerances lower and getting the setDatarecordDuration function to match the original implementation, so that the sampling rate comes out the same.
Regarding del f, I really don't think it's necessary, since we're at the end of the function already: when the function returns, f should be cleaned up automatically.
Could you explain a bit more about the tolerances and the need to set setDatarecordDuration? Sorry, I'm not completely up to date on the details of the EDF/BDF file format.
> Could you explain a bit more about the tolerances and the need to set setDatarecordDuration? Sorry, I'm not completely up to date on the details of the EDF/BDF file format.
Sure @wmvanvliet, basically when I tested my first code, the tolerances were way too high:

The current implementation expects tolerances of 1e-5. Also, the setDatarecordDuration function basically lets you have a non-integer sampling rate. There are some problems with the documentation (see this issue), but I think I figured that out. The function is supposed to give the same sampling rate as the original file, but instead it is slightly lower.
Original: [screenshot of the original sampling rate]
EDF export: [screenshot of the exported sampling rate]
I suspect this is one reason why the tolerances are different, but it can't entirely explain the problem as the tolerances are too high even with an integer sampling rate.
The current precision is 1.110142833033462e-08 V, or 0.011101428330334619 µV – so we should be good?
From my testing I thought the test expected 1e-5 µV, but I could have been wrong. Also, the test expects a tolerance of 1e-5 for the timepoints as well:
# Due to the data record duration limitations of EDF files, one
# cannot store arbitrary float sampling rate exactly. Usually this
# results in two sampling rates that are off by very low number of
# decimal points. This for practical purposes does not matter
# but will result in an error when say the number of time points
# is very very large.
assert_allclose(raw.times, raw_read.times[:orig_raw_len], rtol=0, atol=1e-5)
This is why the sampling rate is a problem: it changes the number of time points and therefore their values. Maybe we just need to change the tests?
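To make this concrete, here is a small stdlib-only sketch (not MNE code; the rate value is the one from the sample dataset discussed in this thread) of how forcing an integer number of samples into a 1-second data record shifts the effective rate, and how the time axis then drifts as the recording grows:

```python
# Sketch: EDF stores an integer number of samples per data record plus a
# record duration; the effective rate is samples_per_record / duration.
# With a 1-second record, a non-integer rate must be rounded.
true_sfreq = 600.614990234375           # non-integer rate of the sample data
samples_per_record = round(true_sfreq)  # 601 samples in a 1 s record
effective_sfreq = samples_per_record / 1.0

n = 1_000_000  # roughly 28 minutes of data at this rate
drift = abs(n / effective_sfreq - n / true_sfreq)
print(f"time-axis drift after {n} samples: {drift:.3f} s")
```

Even a tiny rate mismatch accumulates into a drift far above a 1e-5 s tolerance on the time axis.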
Ah, I see now the problem with tolerances. For reference, here is a test script that demonstrates the problem:

import mne

raw_fname = mne.datasets.sample.data_path() / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(raw_fname)
raw.pick("eeg")
raw.load_data()

# Save to EDF and load again
raw.export("/tmp/test.edf", overwrite=True)
raw2 = mne.io.read_raw_edf("/tmp/test.edf", preload=True)

# Print the first 5 samples and timepoints
print(raw[0, :5])
print(raw2[0, :5])

Current output: [output omitted]
I guess there's not much we can do. With 16-bit precision, the maximum precision we can obtain on the EEG part of the sample dataset is:

digital_min = -32767
digital_max = 32767
physical_min = raw.get_data().min()
physical_max = raw.get_data().max()
tolerance = (physical_max - physical_min) / (digital_max - digital_min)
print(tolerance * 1E6, "microvolts")

0.011101428330334619 microvolts
And it's a similar story for the sample rate. Given that EDF doesn't support non-integer sampling rates at all, I think we do rather well.
Okay, I see. Thanks @wmvanvliet! Why does this not occur in the current implementation? Could this be improved in the pyedflib code? Also, since this may be a problem for others, should we allow users to choose which library to use?
There are two (unrelated) problems:
Regarding 1, why did the issue not occur previously? Regarding 2, did you see Q9 in https://www.edfplus.info/specs/edffaq.html? I guess we have to live with some inaccuracies; I'm just not sure how to best define the record duration and the number of samples per record.
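One possible approach to the record-duration question (a sketch of the idea in Q9, not what either library actually does): treat the rate as a rational number and pick the data-record duration as the denominator, so that an integer number of samples per record reproduces the rate exactly. The 100-second cap below is a made-up bound; a real implementation would also have to respect the 8-character record-duration field in the EDF header, which this sketch ignores.

```python
from fractions import Fraction

sfreq = 512.5  # hypothetical non-integer but rational sampling rate, in Hz

# Find integers n, d with n / d == sfreq and d bounded (d is the record
# duration in seconds, n the number of samples per record).
frac = Fraction(sfreq).limit_denominator(100)
record_duration = frac.denominator    # 2 s
samples_per_record = frac.numerator   # 1025 samples per record

assert samples_per_record / record_duration == sfreq
print(samples_per_record, "samples per", record_duration, "s record")
```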
I am also wondering this. I looked in the pyedflib documentation, and the digital_min there is -32768 (vs. -32767 in the current implementation). Would this make a slight difference, @wmvanvliet @cbrnr?
for key, val in [
    ("PatientCode", subj_info.get("his_id", "")),
    ("PatientName", name),
    ("PatientGender", sex),
    ("AdditionalPatientInfo", additional_patient_info),
    ("Gender", sex),
    ("PatientAdditional", additional_patient_info),
According to https://www.edfplus.info/specs/edfplus.html#additionalspecs, should the order not be: Code, Gender, Birthdate, Name?
I don't think this matters for the function of the code, because the code just uses those keys to call the corresponding setter (e.g., setGender, setPatientAdditional). I can change it, though, to be more consistent. @cbrnr
You are right. Could you still change the order to be consistent with the specs?
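As an aside, the key-to-setter dispatch the diff above suggests can be illustrated with a hypothetical stub (FakeWriter and every value here are invented for illustration; this is not pyedflib's API):

```python
# Stub standing in for pyedflib.EdfWriter, to show the dispatch pattern.
class FakeWriter:
    def __init__(self):
        self.header = {}

    def setPatientCode(self, v):
        self.header["code"] = v

    def setGender(self, v):
        self.header["sex"] = v

    def setPatientName(self, v):
        self.header["name"] = v

    def setPatientAdditional(self, v):
        self.header["additional"] = v


hdl = FakeWriter()
# Order follows the EDF+ 'patient' subfields: code, sex, name, extras.
for key, val in [
    ("PatientCode", "X123"),
    ("Gender", 1),
    ("PatientName", "Doe_John"),
    ("PatientAdditional", ""),
]:
    # Each key resolves to the matching set<Key> method on the writer.
    getattr(hdl, f"set{key}")(val)
```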
No, -32768 is actually correct. You have 16 bits, so you go from -32768 to 32767. Regarding the difference (I've also replied in the other thread): the current precision is 1.110142833033462e-08 V, or 0.011101428330334619 µV, so we should be good?
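For illustration, here is a minimal stdlib sketch of that 16-bit quantization (the physical range is hypothetical; real code would take it from the data): values are mapped onto -32768..32767 and back, so the worst-case round-trip error is half a quantization step.

```python
digital_min, digital_max = -32768, 32767
physical_min, physical_max = -1e-4, 1e-4  # hypothetical EEG range in volts

# One quantization step in physical units (the "precision" discussed above).
step = (physical_max - physical_min) / (digital_max - digital_min)


def to_digital(x):
    # Map a physical value onto the signed 16-bit digital range.
    return round((x - physical_min) / step) + digital_min


def to_physical(d):
    # Map a digital value back into physical units.
    return (d - digital_min) * step + physical_min


x = 3.3e-5
err = abs(to_physical(to_digital(x)) - x)
assert err <= step / 2  # round-trip error bounded by half a step
```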
Looking at the difference between original and exported files for the current implementation:

>>> np.abs(raw.get_data() - raw_read.get_data()).max()
6.749146666805992e-09

In this PR (using pyedflib):

>>> np.abs(raw.get_data() - raw_read.get_data()).max()
0.00016842048660163812

That's a huge difference, and we should not loosen the tolerance beyond what we currently have. If it worked with …
How did the previous edflib implementation achieve such good tolerances? Did it save in 32 bit by any chance? |
@wmvanvliet @cbrnr I did some testing, and the tolerances can be much higher depending on the size of the data:

[screenshot of the tolerance output]

As you can see, the tolerance for both implementations is about 0.01 and the values are the same. This is very far from what the testing data expects (1e-5). This is for a file that is about an hour in length:

[screenshot of the tolerance output for the hour-long file]

So for integer sampling rates, I think we can definitively say both implementations work the same, so I do not think it is a problem with the sample size (16-bit vs. 32-bit). I think it is entirely due to the data records being different sizes for non-integer sampling rates.
So you are saying that the differences are almost entirely caused by the non-integer sampling frequency? Do we have round-trip tests that use data with an integer sampling frequency?
Yes
I am not sure what a round-trip test is, but all the EDF tests I see either generate a non-integer sampling rate file from scratch or use 'test_raw.fif', which has a non-integer sampling rate.
A round-trip test in this case means you take some data, write it out, read what you wrote back in, and check the read-in data against the original data (you have gone "full circle", i.e., a write-read round trip). We have a bunch of EDF test files in the testing dataset: some or all of these probably have integer sampling rates.
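As a stdlib-only illustration of the round-trip idea (not the actual MNE test, which would use raw.export and mne.io.read_raw_edf; struct stands in for the EDF writer/reader here):

```python
import io
import math
import struct

# A fake "recording": a 10 Hz sine sampled at 256 Hz for one second.
original = [math.sin(2 * math.pi * 10 * t / 256) for t in range(256)]

# "Write": quantize to signed 16-bit integers, as EDF does.
step = 2.0 / 65535  # physical range [-1, 1] over the 16-bit digital range
buf = io.BytesIO()
for x in original:
    buf.write(struct.pack("<h", round((x + 1.0) / step) - 32768))

# "Read": decode back to physical values.
buf.seek(0)
restored = [
    (struct.unpack("<h", buf.read(2))[0] + 32768) * step - 1.0
    for _ in original
]

# Round trip: restored data must match the original within quantization error.
assert max(abs(a - b) for a, b in zip(original, restored)) <= step
```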
Co-authored-by: Marijn van Vliet <w.m.vanvliet@gmail.com>
Can I suggest that we give https://github.com/the-siesta-group/edfio a try? I haven't tested it yet, but I have a strong feeling that this package should be exactly what we want. I'll be playing around with it in the next couple of days, so I can also try to integrate it for our export.
Okay, sounds good! Maybe a new pull request should be made for that, then?
Yes, I'll definitely do that in a new PR. However, since I don't know how well this package will work, it's best to keep this one alive for now.

Reference issue
Addresses #9883
What does this implement/fix?
I modify the _export_raw() function for EDF files to use pyedflib instead of EDFlib, as pyedflib is significantly faster.
Additional information
My basic implementation works with files that have an integer sampling rate. However, I run into problems with non-integer sampling rates. I think it has something to do with the setDatarecordDuration function being implemented differently in each library. I also noticed that in my implementation, tolerances for the data are around 0.01 to 0.1 compared to the original file, which may be too high.
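Based on the int(100000 * block_size) call in the pyedflib example quoted earlier in this thread, the record duration appears to be expressed in units of 10 µs; if so, rounding to that grid is one place where the exported rate can deviate slightly from the original. A stdlib-only sketch of that effect (an assumption about the units, not verified against pyedflib internals):

```python
true_sfreq = 600.614990234375  # non-integer rate of the sample data, in Hz
samples_per_record = 601

# Record duration that would reproduce true_sfreq exactly, in seconds.
exact_duration = samples_per_record / true_sfreq

# Rounded to the 10-microsecond grid the API appears to use.
stored_units = round(exact_duration * 100000)
effective_sfreq = samples_per_record * 100000 / stored_units

print(f"true: {true_sfreq} Hz, effective: {effective_sfreq} Hz")
```

The effective rate lands close to, but not exactly on, the original rate, which matches the small sampling-rate mismatch observed above.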