[MRG] Fused type makedataset#9040
Conversation
ping: @raghavrv
The idea was to import the [...]. However, fused types cannot be used as class attributes, and we have several of them. A possible workaround is to add:

    from cython cimport floating

    cdef class Printer:
        # We wish num to be a fused type, so we declare it as void*
        cdef void *num
        cdef bint is_float

        # Used as fused type arguments
        cdef float float_sample
        cdef double double_sample

        def __init__(self):
            cdef float num = float(5)
            self.num = &num
            if type(num) == float:
                self.is_float = True
            else:
                self.is_float = False
            if self.is_float:
                self._print(self.float_sample)
            else:
                self._print(self.double_sample)

        # Underlying function
        def _print(self, floating sample):
            # Typecast it when we want to access its value
            cdef floating *num = <floating*>self.num
            print(num[0])

@jnothman's concern is that [...]
Talking to @ogrisel, @amueller and @raghavrv, we could do something like what has been done in [...]. Can you guys give us any hints on how to do that?
@arthurmensch, can you take a look at all this?
Example of how pandas uses Cython templates to have versions of the same function for different dtypes: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby_helper.pxi.in
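As a rough illustration of that approach (a hypothetical template file, not the actual pandas or scikit-learn code): the `{{py: ...}}` block holds Python that runs at generation time, and the `{{for}}` loop emits one copy of the code per dtype, so the `.tp` file expands into an ordinary `.pyx` file at build time:

```
# hypothetical example.pyx.tp -- a Tempita template, not valid Cython itself

{{py:

# name_suffix, c_type
dtypes = [('64', 'double'),
          ('32', 'float')]

}}

{{for name_suffix, c_type in dtypes}}

cdef {{c_type}} add{{name_suffix}}({{c_type}} a, {{c_type}} b):
    return a + b

{{endfor}}
```

After expansion this yields two functions, `add64` working on `double` and `add32` working on `float`, without duplicating the source by hand.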
Thanks @jorisvandenbossche. I think this is what @ogrisel suggested yesterday IRL... We need a [...]. I'm trying to see if we can avoid that... Most of the class attributes are not data themselves but pointers. Pointers can actually be [...]. BTW, why does this PR diff have changes unrelated to [...]?
Force-pushed from b80cd39 to de9189e
@raghavrv, sorry for the changes unrelated to [...].
Force-pushed from 9e13ead to 1bcee61
Can you guys give us feedback on this?
sklearn/linear_model/base.py (Outdated)

    # seed should never be 0 in SequentialDataset
    seed = rng.randint(1, np.iinfo(np.int32).max)

    if isinstance(X, np.float32):

Shouldn't you be checking the dtype, rather than X? Something like X.dtype == np.float32.
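For reference, a minimal sketch of the difference (the array `X` here is just illustrative): `isinstance(X, np.float32)` asks whether `X` itself is a NumPy float32 scalar, which is never true for an array, while `X.dtype == np.float32` checks the element type:

```python
import numpy as np

X = np.zeros((3, 2), dtype=np.float32)

# An ndarray is never an instance of the scalar type np.float32 ...
print(isinstance(X, np.float32))   # False, even though the data is float32

# ... so the element type must be checked through the dtype attribute.
print(X.dtype == np.float32)       # True
```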
sklearn/utils/seq_dataset.pxd.tp (Outdated)

    @@ -0,0 +1,69 @@
    """Dataset abstractions for sequential data access. """

It would be useful here to point out what kind of file this is (a Cython template), and give a URL where it is described. Indeed, such files are uncommon, and people might be surprised.
    from libc.limits cimport INT_MAX
    cimport numpy as np
    import numpy as np
    {{py:

I don't think it is needed to put those strings inside the {{py part (so I would leave the author information and the note you added at the top before the {{py stuff).
arthurmensch left a comment:

Looks fine to me, apart from minor comments. I would have preferred to go with void * pointers, as I think this does not hinder performance much, but I guess you discussed this. Let's wait for CI.
| """ | ||
|
|
||
| # name, c_type | ||
| dtypes = [('', 'double'), |
There was a problem hiding this comment.
['64', 'double'] sounds more consistent
There was a problem hiding this comment.
I agree. But doing that we would need to recode sag_fast.pyx for example... we could put it off to another PR (#9020) or just rename SequentialDataset in sag_fast.pyx as SequentialDataset64.
sklearn/utils/seq_dataset.pyx.tp (Outdated)

    seed[0] ^= <np.uint32_t>(seed[0] << 5)

    return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)
    return seed[0] % (<np.uint32_t>RAND_R_MAX + 1)  # No newline at end of file
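For context, the quoted lines are the tail of a xorshift-style pseudo-random generator. A rough pure-Python equivalent (the 13/17/5 shift constants are the usual xorshift triple and the RAND_R_MAX value is assumed, not copied from the file) would be:

```python
RAND_R_MAX = 0x7FFFFFFF  # assumed 31-bit output range
MASK32 = 0xFFFFFFFF      # emulate np.uint32_t wrap-around in pure Python

def our_rand_r(seed):
    """Pure-Python sketch of a 32-bit xorshift step followed by the
    modulo reduction shown in the quoted diff; returns (new_seed, draw)."""
    seed ^= (seed << 13) & MASK32
    seed ^= seed >> 17
    seed ^= (seed << 5) & MASK32
    return seed, seed % (RAND_R_MAX + 1)

s = 42
for _ in range(3):
    s, r = our_rand_r(s)
    # each draw lies in [0, RAND_R_MAX]
    assert 0 <= r <= RAND_R_MAX
```

The `& MASK32` operations stand in for the `<np.uint32_t>` casts: in Cython the left shifts wrap around automatically in 32-bit unsigned arithmetic, while Python integers would grow without bound.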
    for i in range(5):
        # next sample
        xi_32, yi32, swi32, idx32 = dataset32._next_py()
        xi_, yi, swi, idx = dataset64._next_py()

Check name consistency.

    xi_data32, _, _ = xi_32
    xi_data, _, _ = xi_
    assert_equal(xi_data32.dtype, np.float32)
    assert_equal(xi_data.dtype, np.float64)

assert_array_almost_equal(xi_data64.astype('double'), xi_data32)?
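As an aside, a minimal sketch of why comparing the two variants needs a tolerance rather than exact equality (the array values here are illustrative, not from the PR): casting float64 data down to float32 rounds it, so the variants only agree approximately:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# Illustrative data: a float64 array and its float32 counterpart.
data64 = np.array([0.1, 0.2, 0.3], dtype=np.float64)
data32 = data64.astype(np.float32)

# Exact equality can fail: float32 cannot represent 0.1 etc. exactly,
# so the round trip through float32 changes the values slightly.
print(np.array_equal(data64, data32.astype(np.float64)))  # False

# Comparing up to ~6 decimals succeeds.
assert_array_almost_equal(data32.astype('double'), data64, decimal=6)
```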
sklearn/utils/setup.py (Outdated)

    'sklearn/utils/seq_dataset.pxd.tp']

    for pyxfiles in pyx_templates:
        assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))

sklearn/utils/setup.py (Outdated)

    for pyxfiles in pyx_templates:
        assert pyxfiles.endswith(('.pyx.tp', '.pxd.tp'))
        outfile = pyxfiles[:-3]
        # if .pxi.in is not updated, no need to output .pxi

Edit the comment to match the code. Maybe mention that this is a good idea for cythonization?
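The idea behind that comment can be sketched in plain Python (the `render` callable is a placeholder for whatever expands the template, e.g. Tempita; this is not the actual setup.py code): regenerate the output only when the template is newer, so the generated file's mtime is untouched and Cython is not needlessly retriggered:

```python
import os

def maybe_regenerate(template_path, render):
    """Regenerate the generated file only if the template changed.

    `render` is a placeholder for whatever expands the template
    and returns the generated source as a string.
    """
    assert template_path.endswith(('.pyx.tp', '.pxd.tp'))
    outfile = template_path[:-3]  # strip the trailing '.tp'

    # Skip the write when the output is already up to date, so its
    # mtime is untouched and cythonization is not retriggered.
    if (os.path.exists(outfile)
            and os.path.getmtime(outfile) >= os.path.getmtime(template_path)):
        return False

    with open(outfile, 'w') as f:
        f.write(render(template_path))
    return True
```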
    xi_data32, _, _ = xi_32
    xi_data, _, _ = xi_
    assert_equal(xi_data32.dtype, np.float32)
    assert_equal(xi_data.dtype, np.float64)

Be careful to add the generated files to .gitignore.
| """Dataset abstractions for sequential data access. """ | ||
|
|
||
| cimport numpy as np | ||
| {{py: |
There was a problem hiding this comment.
Shouldn't you put the header outside the {{py ? It would be cleaner.
There was a problem hiding this comment.
I put it just below actually. First the code defining the template (in the braces {{py blablabla}}) and then the template itself. Only the second part is generated.
sklearn/utils/seq_dataset.pyx.tp (Outdated)

    #
    # Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
    #         Arthur Imbert <arthurimbert05@gmail.com>
    # License: BSD 3 clause

You should move the author and license outside the generation loop.

If I do that, they won't appear in the generated file. Is that what you want?

Yes. They won't be in the repo anyway.
The Windows test failure looks like a numerical accuracy issue; the tolerance should be relaxed when testing with 32-bit floats (but kept precise when run with 64-bit floats).
Also, please use an informative commit message.
@ogrisel, I changed the numerical accuracy in the test; is it better?
    assert_equal(idx1, idx2)

    def test_consistency_check_fused_types():

Address Olivier's request for changing the test name.
ogrisel left a comment:

I am not sure what the scope of this PR is exactly, but I think we should check that the change is actually useful for the user-facing estimators such as SGDClassifier. See my comment below.
    if X.dtype == np.float32:
        CSRData = CSRDataset32
        ArrayData = ArrayDataset32

The codecov chrome/firefox extension tells me that those lines are not covered. Please add a test for that case (and install the codecov browser extension ;).
This test should probably fit an SGDClassifier on 32-bit float iris and check that the coef_ attribute is 32-bit float as well (and that the output of decision_function is also a float32 array).
Currently, there isn't any specific test for make_dataset. It's supposed to be covered by the tests for the sag solver. So we can either put the test off to a next PR about the sag solver (#9020) or do the test you are proposing in this PR. What do you prefer?

Yes, this test is good now.
> doing the test you are proposing in this PR. What do you prefer?

A dedicated test seems useful.
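A dedicated test along the lines @ogrisel describes could look roughly like this (a sketch, not the test that was merged; the hyperparameters are illustrative and the final dtypes depend on how far the fused-type support reaches):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

# Sketch of the suggested test: fit on 32-bit float iris and inspect
# the dtype of the fitted coefficients and of decision_function output.
iris = load_iris()
X32 = iris.data.astype(np.float32)

clf = SGDClassifier(random_state=0, max_iter=20, tol=None)
clf.fit(X32, iris.target)

# With full fused-type support, these would be float32 as well;
# without it, the estimator falls back to float64.
print(clf.coef_.dtype)
print(clf.decision_function(X32).dtype)
```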
I added a test for [...].

I'm stuck at line 383. dataset is of the type SequentialDataset (see line 156). This comes from the templated code in scikit-learn#9040, so dataset should be able to be either float or double, but it does not pick it up when instantiated.
sklearn/linear_model/base.py (Outdated)

    from ..utils.sparsefuncs import mean_variance_axis, inplace_column_scale
    from ..utils.fixes import sparse_lsqr
    from ..utils.seq_dataset import ArrayDataset, CSRDataset
    from ..utils.seq_dataset import ArrayDataset32, CSRDataset32, ArrayDataset, \

We prefer to avoid the line continuation with a backslash, and rather start a new line, repeating the "from ..utils".
    # name, c_type
    dtypes = [('', 'double'),
              ('32', 'float')]

It is much clearer if we use:

    dtypes = [('64', 'double'),
              ('32', 'float')]

I found an error complaining about SequentialDataset, and I had a hard time figuring out that it was referring to the 64-bit version. @Henley13, do you remember why we chose '' over '64'?

To avoid refactoring. But I don't think it is a good idea.
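The effect of the suffix choice can be sketched with plain string substitution (a toy stand-in for the Tempita expansion, not the real build step): with `('', 'double')` the double variant keeps the bare legacy name, while `('64', 'double')` would name both variants explicitly:

```python
template = "cdef class SequentialDataset{name_suffix}:  # {c_type}"

# With the suffix '' chosen in the PR, the double variant keeps the old
# name, which avoids refactoring callers such as sag_fast.pyx ...
for name_suffix, c_type in [('', 'double'), ('32', 'float')]:
    print(template.format(name_suffix=name_suffix, c_type=c_type))
# -> cdef class SequentialDataset:  # double
# -> cdef class SequentialDataset32:  # float

# ... while ('64', 'double') would make the dtype explicit everywhere:
for name_suffix, c_type in [('64', 'double'), ('32', 'float')]:
    print(template.format(name_suffix=name_suffix, c_type=c_type))
# -> cdef class SequentialDataset64:  # double
# -> cdef class SequentialDataset32:  # float
```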
I've rebased this PR and fixed the tiny remaining element that caused the tests to fail. I've opened a new PR (#11155). Thanks a lot for all the work!
Reference Issue
Works on #8769.
What does this implement/fix? Explain your changes.
Make make_dataset from linear_model/base.py support fused types.
Any other comments?
Intermediate step to make the sag solver efficiently support fused types (PR #9020).