[MRG] Fused type makedataset#9040
Conversation
ping: @raghavrv
The idea was to import the [...]. However, fused types cannot be used as class attributes, and we have several of them. A possible workaround is to add:

    from cython cimport floating

    cdef class Printer:
        # We wish num to be a fused type, so we declare it as void*
        cdef void *num
        cdef bint is_float

        # Used as fused type arguments
        cdef float float_sample
        cdef double double_sample

        def __init__(self):
            cdef float num = float(5)
            self.num = &num
            if type(num) == float:
                self.is_float = True
            else:
                self.is_float = False
            if self.is_float:
                self._print(self.float_sample)
            else:
                self._print(self.double_sample)

        # Underlying function
        def _print(self, floating sample):
            # Typecast it when we want to access its value
            cdef floating *num = <floating*>self.num
            print(num[0])

@jnothman's concern is that [...]
Talking to @ogrisel, @amueller and @raghavrv, we could do something like what has been done in [...]. Can you guys give us any hints on how to do that?
@arthurmensch, can you take a look at all this?
Example of how pandas uses Cython templates to have versions of the same function for different dtypes: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby_helper.pxi.in
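As a rough illustration of that approach (a hypothetical template file, not the actual pandas or scikit-learn code): the `{{py: ...}}` block holds Python that runs at generation time, and the `{{for}}` loop emits one copy of the code per dtype, so the `.tp` file expands into an ordinary `.pyx` file at build time:

```
# hypothetical example.pyx.tp -- a Tempita template, not valid Cython itself

{{py:

# name_suffix, c_type
dtypes = [('64', 'double'),
          ('32', 'float')]

}}

{{for name_suffix, c_type in dtypes}}

cdef {{c_type}} add{{name_suffix}}({{c_type}} a, {{c_type}} b):
    return a + b

{{endfor}}
```

After expansion this yields two functions, `add64` working on `double` and `add32` working on `float`, without duplicating the source by hand.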
Thanks @jorisvandenbossche. I think this is what @ogrisel suggested yesterday IRL... We need a [...]. I'm trying to see if we can avoid that... Most of the class attributes are not data themselves but pointers. Pointers can actually be [...]. BTW, why does this PR diff have changes unrelated to [...]?
Force-pushed from b80cd39 to de9189e
@raghavrv, sorry for the changes unrelated to [...].
Force-pushed from 9e13ead to 1bcee61
Can you guys give us feedback on this?
sklearn/linear_model/base.py (Outdated)

    # seed should never be 0 in SequentialDataset
    seed = rng.randint(1, np.iinfo(np.int32).max)

    if isinstance(X, np.float32):

Shouldn't you be checking the dtype, rather than X? Something like X.dtype == np.float32.
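For reference, a minimal sketch of the difference (the array `X` here is just illustrative): `isinstance(X, np.float32)` asks whether `X` itself is a NumPy float32 scalar, which is never true for an array, while `X.dtype == np.float32` checks the element type:

```python
import numpy as np

X = np.zeros((3, 2), dtype=np.float32)

# An ndarray is never an instance of the scalar type np.float32 ...
print(isinstance(X, np.float32))   # False, even though the data is float32

# ... so the element type must be checked through the dtype attribute.
print(X.dtype == np.float32)       # True
```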
sklearn/utils/seq_dataset.pxd.tp (Outdated)

    @@ -0,0 +1,69 @@
    """Dataset abstractions for sequential data access. """

It would be useful here to point out what kind of file this is (a Cython template), and give a URL where it is described. Indeed, such files are uncommon, and people might be surprised.
    from libc.limits cimport INT_MAX
    cimport numpy as np
    import numpy as np
    {{py:

I don't think it is needed to put those strings inside the {{py part (so I would leave the author information and the note you added at the top before the {{py stuff).
arthurmensch left a comment:

Looks fine to me, apart from minor comments. I would have preferred to go with void * pointers, as I think this does not hinder performance much, but I guess you discussed this. Let's wait for CI.
| """ | ||
|
|
||
| # name, c_type | ||
| dtypes = [('', 'double'), |
There was a problem hiding this comment.
['64', 'double'] sounds more consistent
There was a problem hiding this comment.
I agree. But doing that we would need to recode sag_fast.pyx for example... we could put it off to another PR (#9020) or just rename SequentialDataset in sag_fast.pyx as SequentialDataset64.
sklearn/utils/seq_dataset.pyx.tp (Outdated)

    seed[0] ^= <np.uint32_t>(seed[0] << 5)

    return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)
    return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)  # No newline at end of file
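For context, the quoted lines are the tail of a xorshift-style pseudo-random generator. A rough pure-Python equivalent (the 13/17/5 shift constants are the usual xorshift triple and the RAND_R_MAX value is assumed, not copied from the file) would be:

```python
RAND_R_MAX = 0x7FFFFFFF  # assumed 31-bit output range
MASK32 = 0xFFFFFFFF      # emulate np.uint32_t wrap-around in pure Python

def our_rand_r(seed):
    """Pure-Python sketch of a 32-bit xorshift step followed by the
    modulo reduction shown in the quoted diff; returns (new_seed, draw)."""
    seed ^= (seed << 13) & MASK32
    seed ^= seed >> 17
    seed ^= (seed << 5) & MASK32
    return seed, seed % (RAND_R_MAX + 1)

s = 42
for _ in range(3):
    s, r = our_rand_r(s)
    # each draw lies in [0, RAND_R_MAX]
    assert 0 <= r <= RAND_R_MAX
```

The `& MASK32` operations stand in for the `<np.uint32_t>` casts: in Cython the left shifts wrap around automatically in 32-bit unsigned arithmetic, while Python integers would grow without bound.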
    for i in range(5):
        # next sample
        xi_32, yi32, swi32, idx32 = dataset32._next_py()
        xi_, yi, swi, idx = dataset64._next_py()

Check name consistency.

    xi_data32, _, _ = xi_32
    xi_data, _, _ = xi_
    assert_equal(xi_data32.dtype, np.float32)
    assert_equal(xi_data.dtype, np.float64)

assert_array_almost_equal(xi_data64.astype('double'), xi_data32)?
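As an aside, a minimal sketch of why comparing the two variants needs a tolerance rather than exact equality (the array values here are illustrative, not from the PR): casting float64 data down to float32 rounds it, so the variants only agree approximately:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# Illustrative data: a float64 array and its float32 counterpart.
data64 = np.array([0.1, 0.2, 0.3], dtype=np.float64)
data32 = data64.astype(np.float32)

# Exact equality can fail: float32 cannot represent 0.1 etc. exactly,
# so the round trip through float32 changes the values slightly.
print(np.array_equal(data64, data32.astype(np.float64)))  # False

# Comparing up to ~6 decimals succeeds.
assert_array_almost_equal(data32.astype('double'), data64, decimal=6)
```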
sklearn/utils/setup.py (Outdated)

    'sklearn/utils/seq_dataset.pxd.tp']

    for pyxfiles in pyx_templates:
        assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))

sklearn/utils/setup.py (Outdated)

    for pyxfiles in pyx_templates:
        assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))
        outfile = pyxfiles[:-3]
        # if .pxi.in is not updated, no need to output .pxi

Edit the comment to match the code. Maybe mention that this is a good idea for cythonization?
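The idea behind that comment can be sketched in plain Python (the `render` callable is a placeholder for whatever expands the template, e.g. Tempita; this is not the actual setup.py code): regenerate the output only when the template is newer, so the generated file's mtime is untouched and Cython is not needlessly retriggered:

```python
import os

def maybe_regenerate(template_path, render):
    """Regenerate the generated file only if the template changed.

    `render` is a placeholder for whatever expands the template
    and returns the generated source as a string.
    """
    assert template_path.endswith(('.pyx.tp', '.pxd.tp'))
    outfile = template_path[:-3]  # strip the trailing '.tp'

    # Skip the write when the output is already up to date, so its
    # mtime is untouched and cythonization is not retriggered.
    if (os.path.exists(outfile)
            and os.path.getmtime(outfile) >= os.path.getmtime(template_path)):
        return False

    with open(outfile, 'w') as f:
        f.write(render(template_path))
    return True
```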
    xi_data32, _, _ = xi_32
    xi_data, _, _ = xi_
    assert_equal(xi_data32.dtype, np.float32)
    assert_equal(xi_data.dtype, np.float64)

Be careful to add the generated files to .gitignore.
| """Dataset abstractions for sequential data access. """ | ||
|
|
||
| cimport numpy as np | ||
| {{py: |
There was a problem hiding this comment.
Shouldn't you put the header outside the {{py ? It would be cleaner.
There was a problem hiding this comment.
I put it just below actually. First the code defining the template (in the braces {{py blablabla}}) and then the template itself. Only the second part is generated.
sklearn/utils/seq_dataset.pyx.tp (Outdated)

    #
    # Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
    #         Arthur Imbert <arthurimbert05@gmail.com>
    # License: BSD 3 clause

You should move the author and license outside the generation loop.

If I do that, they won't appear in the generated file. Is that what you want?

Yes. They won't be in the repo anyway.
The Windows test failure looks like a numerical accuracy issue; the tolerance should be relaxed when testing with 32-bit floats (but kept precise when run with 64-bit floats).
Also, please use an informative commit message.
@ogrisel, I changed the numerical accuracy in the test; is it better?
    assert_equal(idx1, idx2)

    def test_consistency_check_fused_types():

Address Olivier's request for changing the test name.
ogrisel left a comment:

I am not sure what the scope of this PR is exactly, but I think we should check that the change is actually useful for the user-facing estimators such as SGDClassifier. See my comment below.
    if X.dtype == np.float32:
        CSRData = CSRDataset32
        ArrayData = ArrayDataset32

The codecov chrome/firefox extension tells me that those lines are not covered. Please add a test for that case (and install the codecov browser extension ;).
This test should probably fit an SGDClassifier on 32-bit float iris and check that the coef_ attribute is 32-bit float as well (and that the output of decision_function is also a float32 array).
Currently, there isn't any specific test for make_dataset. It's supposed to be covered by the tests for the sag solver. So we can either put the test off to a next PR about the sag solver (#9020) or do the test you are proposing in this PR. What do you prefer?

Yes, this test is good now.
> doing the test you are proposing in this PR. What do you prefer?

A dedicated test seems useful.
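A dedicated test along the lines @ogrisel describes could look roughly like this (a sketch, not the test that was merged; the hyperparameters are illustrative and the final dtypes depend on how far the fused-type support reaches):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

# Sketch of the suggested test: fit on 32-bit float iris and inspect
# the dtype of the fitted coefficients and of decision_function output.
iris = load_iris()
X32 = iris.data.astype(np.float32)

clf = SGDClassifier(random_state=0, max_iter=20, tol=None)
clf.fit(X32, iris.target)

# With full fused-type support, these would be float32 as well;
# without it, the estimator falls back to float64.
print(clf.coef_.dtype)
print(clf.decision_function(X32).dtype)
```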
I added a test for [...].

I'm stuck at line 383. dataset is of the type SequentialDataset (see line 156). This comes from the templated code in scikit-learn#9040, so dataset should be able to be either float or double, but it does not pick it up when instantiated.
sklearn/linear_model/base.py (Outdated)

    from ..utils.sparsefuncs import mean_variance_axis, inplace_column_scale
    from ..utils.fixes import sparse_lsqr
    from ..utils.seq_dataset import ArrayDataset, CSRDataset
    from ..utils.seq_dataset import ArrayDataset32, CSRDataset32, ArrayDataset, \

We prefer to avoid the line continuation with a backslash, and rather start a new line, repeating the "from ..utils".
    # name, c_type
    dtypes = [('', 'double'),
              ('32', 'float')]

It is much clearer if we use:

    dtypes = [('64', 'double'),
              ('32', 'float')]

I found an error complaining about SequentialDataset, and I had a hard time figuring out that it was referring to the 64-bit version. @Henley13, do you remember why we chose '' over '64'?

To avoid refactoring. But I don't think it is a good idea.
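The effect of the suffix choice can be sketched with plain string substitution (a toy stand-in for the Tempita expansion, not the real build step): with `('', 'double')` the double variant keeps the bare legacy name, while `('64', 'double')` would name both variants explicitly:

```python
template = "cdef class SequentialDataset{name_suffix}:  # {c_type}"

# With the suffix '' chosen in the PR, the double variant keeps the old
# name, which avoids refactoring callers such as sag_fast.pyx ...
for name_suffix, c_type in [('', 'double'), ('32', 'float')]:
    print(template.format(name_suffix=name_suffix, c_type=c_type))
# -> cdef class SequentialDataset:  # double
# -> cdef class SequentialDataset32:  # float

# ... while ('64', 'double') would make the dtype explicit everywhere:
for name_suffix, c_type in [('64', 'double'), ('32', 'float')]:
    print(template.format(name_suffix=name_suffix, c_type=c_type))
# -> cdef class SequentialDataset64:  # double
# -> cdef class SequentialDataset32:  # float
```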
I've rebased this PR and fixed the tiny remaining element that caused the tests to fail. I've opened a new PR (#11155). Thanks a lot for all the work!
Reference Issue
Works on #8769.
What does this implement/fix? Explain your changes.
Make make_dataset from linear_model/base.py support fused types.
Any other comments?
Intermediate step to make the sag solver efficiently support fused types (PR #9020).