Skip to content

[WIP] [ENH] Export EDF files using pyedflib #12085

Closed
withmywoessner wants to merge 8 commits intomne-tools:mainfrom
withmywoessner:edflibpytest
Closed

[WIP] [ENH] Export EDF files using pyedflib #12085
withmywoessner wants to merge 8 commits intomne-tools:mainfrom
withmywoessner:edflibpytest

Conversation

@withmywoessner
Copy link
Copy Markdown
Contributor

@withmywoessner withmywoessner commented Oct 6, 2023

Reference issue
Addresses #9883

What does this implement/fix?
I modify the _export_raw() function for edf files to use pyedflib instead of EDFlib as pyedflib is significantly faster.

Additional information
My basic implementation works with files that have an integer sampling rate. However, I run into problems with non int sampling rates. I think it has something to do with the setDatarecordDuration function being implemented differently in each library. I also noticed that in my implementation tolerances for the data are around .01 to .1 compared to the original file which may be too high.

@withmywoessner withmywoessner changed the title convert to pyedflib Read EDF files using pyedflib Oct 6, 2023
@withmywoessner withmywoessner marked this pull request as draft October 8, 2023 06:52
Copy link
Copy Markdown
Contributor

@wmvanvliet wmvanvliet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see you working on this @withmywoessner!

# DeprecationWarning: `sample_rate` is deprecated and
# will be removed in a future release.
# Please use `sample_frequency` instead
# This causes test to fail, so we catch it here
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we update the test instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into it!

f"trying to write {desc} at {onset} "
f"for {duration} seconds."
)
del hdl
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is. I just included it because in the documentation for the pyedflib high level functions they include it:

    with pyedflib.EdfWriter(edf_file, n_channels=n_channels, file_type=file_type) as f:
        f.setDatarecordDuration(int(100000 * block_size))
        f.setSignalHeaders(signal_headers)
        f.setHeader(header)
        f.writeSamples(signals, digital=digital)
        for annotation in annotations:
            f.writeAnnotation(*annotation)
    del f

Copy link
Copy Markdown
Contributor Author

@withmywoessner withmywoessner Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wmvanvliet Thank you for the suggestions! The thing I really wanted help/opinions on is getting the tolerances to be lower and getting the setDatarecordDuration function to match the original implementation so that the sampling rate matches the original implementation

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding del f, I really don't think it's necessary since we're at the end of the function already: when the function returns, f should be cleaned up automatically.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain a bit more about the tolerances and need to set setDatarecordDuration? Sorry, I'm note completely up to date about the details of the EDF/BDF file format.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain a bit more about the tolerances and need to set setDatarecordDuration? Sorry, I'm note completely up to date about the details of the EDF/BDF file format.

Sure @wmvanvliet, basically when I tested my first code, the tolerances were way too high:
image
The current implementation expects tolerances of 1e-5. Also, the setDatarecordDuration function basically lets you have a non-int sampling rate. There are some problems with the documentation (see this issue), but I think I figured that out. The function is supposed to give the same sampling rate as the original file but instead, it is slightly lower.
Original:
image
EDF export:
image

I suspect this is one reason why the tolerances are different, but it can't entirely explain the problem as the tolerances are too high even with an integer sampling rate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current precision is 1.110142833033462e-08 V, or 0.011101428330334619 µV – so we should be good?

Copy link
Copy Markdown
Contributor Author

@withmywoessner withmywoessner Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my testing I thought the test expected 1e-5 µV, but I could have been wrong. Also, the test expects a tolerance of 1e-5 for the timepoints as well:

    # Due to the data record duration limitations of EDF files, one
    # cannot store arbitrary float sampling rate exactly. Usually this
    # results in two sampling rates that are off by very low number of
    # decimal points. This for practical purposes does not matter
    # but will result in an error when say the number of time points
    # is very very large.
    assert_allclose(raw.times, raw_read.times[:orig_raw_len], rtol=0, atol=1e-5)

This is why the sampling rate is a problem because it changes the number of time points and therefore the value. Maybe we just need to change the tests?

@wmvanvliet
Copy link
Copy Markdown
Contributor

wmvanvliet commented Oct 9, 2023

Ah, I see now the problem with tolerances. For reference, here is a test script that demonstrates the problem:

import mne

raw_fname = mne.datasets.sample.data_path() / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(raw_fname)
raw.pick("eeg")
raw.load_data()

# Save to EDF and load again
raw.export("/tmp/test.edf", overwrite=True)
raw2 = mne.io.read_raw_edf("/tmp/test.edf", preload=True)

# Print the first 5 samples and timepoints
print(raw[0, :5])
print(raw2[0, :5])

current output:

(array([[1.13989260e-05, 9.85015885e-06, 7.68188489e-06, 5.82336435e-06,
        6.81457530e-07]]), array([0.        , 0.00166496, 0.00332992, 0.00499488, 0.00665984]))
(array([[1.14042070e-05, 9.86110885e-06, 7.68522951e-06, 5.83129150e-06,
        6.91331626e-07]]), array([0.        , 0.00166773, 0.00333546, 0.00500319, 0.00667092]))

@wmvanvliet
Copy link
Copy Markdown
Contributor

I guess there's not much we can do. With 16-bit precision, the maximum precision we can obtain on the EEG part of the sample dataset is:

digital_min = -32767
digital_max = 32767
physical_min = raw.get_data().min()
physical_max = raw.get_data().max()
tolerance = (physical_max - physical_min) / (digital_max - digital_min)
print(tolerance * 1E6, "microvolts")

0.011101428330334619 microvolts

@wmvanvliet
Copy link
Copy Markdown
Contributor

wmvanvliet commented Oct 9, 2023

And it's a similar story for the sample rate. Given that EDF doesn't support non-integer sampling rates at all, I think we do rather well.

@withmywoessner
Copy link
Copy Markdown
Contributor Author

Okay, I see. Thanks @wmvanvliet! Why does this not occur in the current implementation? Could this be improved in the pyedflib code? Also, since this may be a problem for others should we allow users to choose either library to use?

@cbrnr
Copy link
Copy Markdown
Contributor

cbrnr commented Oct 9, 2023

There are two (unrelated) problems:

  1. Precision
  2. Non-integer sampling frequencies

Regarding 1, why did the issue not occur previously?

Regarding 2, did you see Q9 in https://www.edfplus.info/specs/edffaq.html? I guess we have to live with some inaccuracies, I'm just not sure how to best define the record duration and the number of samples per record.

@withmywoessner
Copy link
Copy Markdown
Contributor Author

withmywoessner commented Oct 9, 2023

Regarding 1, why did the issue not occur previously?

I also am wondering this. I looked in the pyedflib documentation and the digital_min is -32768 (vs -32767 in the current implementation) would this make a slight difference @wmvanvliet @cbrnr?

Comment on lines 244 to +248
for key, val in [
("PatientCode", subj_info.get("his_id", "")),
("PatientName", name),
("PatientGender", sex),
("AdditionalPatientInfo", additional_patient_info),
("Gender", sex),
("PatientAdditional", additional_patient_info),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://www.edfplus.info/specs/edfplus.html#additionalspecs, should the order not be: Code, Gender, Birthdate, Name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this matters for the function of the code, because the code is just using those to call a function (ex setGender, setPatientAdditional). I can change it though to be more consistent. @cbrnr

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Could you still change the order to be consistent with the specs?

@cbrnr
Copy link
Copy Markdown
Contributor

cbrnr commented Oct 9, 2023

I also am wondering this. I looked in the pyedflib documentation and the digital_min is -32768 (vs -32767 in the current implementation) would this make a slight difference @wmvanvliet @cbrnr?

No, -32768 is actually correct. You have 16bit, so you go from -2**15 to 2**15 - 1 (the - 1 accounts for zero).

Regarding the difference (I've also replied in the other thread), the current precision is 1.110142833033462e-08 V, or 0.011101428330334619 µV – so we should be good?

@cbrnr
Copy link
Copy Markdown
Contributor

cbrnr commented Oct 9, 2023

Looking at the difference between original and exported files for the current edflib-python based export yields:

>>> np.abs(raw.get_data() - raw_read.get_data()).max()
6.749146666805992e-09

In this PR (using pyedflib):

>>> np.abs(raw.get_data() - raw_read.get_data()).max()
0.00016842048660163812

That's a huge difference, and we should not lower the tolerance than what we currently have. If it worked with edflib-python, we should be able to achieve the same results with pyedflib I think.

@wmvanvliet
Copy link
Copy Markdown
Contributor

wmvanvliet commented Oct 9, 2023

How did the previous edflib implementation achieve such good tolerances? Did it save in 32 bit by any chance?

@withmywoessner
Copy link
Copy Markdown
Contributor Author

withmywoessner commented Oct 10, 2023

@wmvanvliet @cbrnr. I did some testing and the tolerances can be much higher depending on the size of the data:

CURRENT IMP:
filename: /home/woess/workspace/current_assignment/read_brain/Raw Files/BRAIN_VIS.eeg
n_times: 3278420
sfreq: 1000.0
first 10 samples: [-10377.2896087 -10378.2173426 -10373.6763293 -10373.3345326
 -10380.0239823 -10383.1001526 -10380.9517162 -10376.8013277
 -10375.5806252 -10377.924374 ]
first 10 times: [0.    0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009]
last 10 times: [3278.41  3278.411 3278.412 3278.413 3278.414 3278.415 3278.416 3278.417
 3278.418 3278.419]

filename: /home/woess/workspace/current_assignment/read_brain/export/test_10.edf
n_times: 3279000
sfreq: 1000.0
first 10 samples: [-10377.28695257 -10378.30553301 -10373.72192106 -10373.72192106
 -10380.34269387 -10383.39843516 -10381.3612743  -10376.77766236
 -10375.75908193 -10378.30553301]
first 10 times: [0.    0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009]
last 10 times: [3278.99  3278.991 3278.992 3278.993 3278.994 3278.995 3278.996 3278.997
 3278.998 3278.999]


PYEDFLIB IMP:
filename: /home/woess/workspace/current_assignment/read_brain/Raw Files/BRAIN_VIS.eeg
n_times: 3278420
sfreq: 1000.0
first 10 samples: [-10377.2896087 -10378.2173426 -10373.6763293 -10373.3345326
 -10380.0239823 -10383.1001526 -10380.9517162 -10376.8013277
 -10375.5806252 -10377.924374 ]
first 10 times: [0.    0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009]
last 10 times: [3278.41  3278.411 3278.412 3278.413 3278.414 3278.415 3278.416 3278.417
 3278.418 3278.419]

filename: /home/woess/workspace/current_assignment/read_brain/export/test_10.edf
n_times: 3279000
sfreq: 1000.0
first 10 samples: [-10377.28695257 -10378.30553301 -10373.72192106 -10373.72192106
 -10380.34269387 -10383.39843516 -10381.3612743  -10376.77766236
 -10375.75908193 -10378.30553301]
first 10 times: [0.    0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009]
last 10 times: [3278.99  3278.991 3278.992 3278.993 3278.994 3278.995 3278.996 3278.997
 3278.998 3278.999]

As you can see the tolerance for both implementations is about .01 and the values are the same. This is very far from what the testing data expects (1e-5) This is for a file that is about an hour in length:
image

So for int sampling rates, I think we can definitively say it works the same so I do not think it is a problem with the size (16 bit vs 32 bit). I think it is entirely due to the datarecords being different sizes for non int sampling rates

@cbrnr
Copy link
Copy Markdown
Contributor

cbrnr commented Oct 10, 2023

So you are saying that the differences are almost entirely caused by to the non-integer sampling frequency? Do we have round-trip tests that use data with an integer sampling frequency?

@withmywoessner
Copy link
Copy Markdown
Contributor Author

withmywoessner commented Oct 10, 2023

So you are saying that the differences are almost entirely caused by to the non-integer sampling frequency?

Yes

Do we have round-trip tests that use data with an integer sampling frequency?

I am not sure what a round-trip test is, but all the edf tests I see either generate a non-int sampling rate file from scratch or use 'test_raw.fif', which has a non-int sampling rate.

@larsoner
Copy link
Copy Markdown
Member

A round-trip test in this case means you take some data, write it out, read what you wrote back in, and check the read-in data against the original data (have gone in a "full circle", i.e., a read-write round trip). We have a bunch of EDF test files in the testing dataset:

$ ls MNE-testing-data/EDF/
SC4001EC-Hypnogram.edf			subsecond_starttime.edf			test_edf_stim_resamp.edf		test_reduced.edf
chtypes_edf.edf				test_edf_overlapping_annotations.edf	test_generator_2.edf			test_utf8_annotations.edf

Some or all of these probably have integer sampling rates.

@cbrnr cbrnr changed the title Read EDF files using pyedflib Export EDF files using pyedflib Oct 24, 2023
@withmywoessner withmywoessner changed the title Export EDF files using pyedflib [WIP] [ENH] Export EDF files using pyedflib Nov 9, 2023
withmywoessner and others added 5 commits November 9, 2023 03:48
Co-authored-by: Marijn van Vliet <w.m.vanvliet@gmail.com>
Co-authored-by: Marijn van Vliet <w.m.vanvliet@gmail.com>
Co-authored-by: Marijn van Vliet <w.m.vanvliet@gmail.com>
@cbrnr
Copy link
Copy Markdown
Contributor

cbrnr commented Nov 9, 2023

Can I suggest that we give https://github.com/the-siesta-group/edfio a try? I haven't tested it yet, but I have a strong feeling that this package should be exactly what we want. I'll be playing around with it in the next couple of days, so I can also try to integrate it for our export.

@withmywoessner
Copy link
Copy Markdown
Contributor Author

Can I suggest that we give https://github.com/the-siesta-group/edfio a try? I haven't tested it yet, but I have a strong feeling that this package should be exactly what we want. I'll be playing around with it in the next couple of days, so I can also try to integrate it for our export.

Okay sounds good! Maybe a new pull request should be made for that then?

@cbrnr
Copy link
Copy Markdown
Contributor

cbrnr commented Nov 10, 2023

Okay sounds good! Maybe a new pull request should be made for that then?

Yes, I'll definitely do that in a new PR. However, since I don't know how well this package will work, it's best to keep this one alive for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants