[MRG] LogisticRegression convert to float64 (sag) #9020
Henley13 wants to merge 26 commits into scikit-learn:master
Conversation
Do you plan to handle all solvers?
@arthurmensch, you check the cython scripts and I take care of the python ones?
I'll do a separate PR for sag and saga.

…On Wed, Jun 7, 2017, 10:39 AM Arthur Imbert ***@***.***> wrote:

> @arthurmensch <https://github.com/arthurmensch>, you check the cython scripts and I take care of the python ones?
> Maybe we could do sag and saga together in this PR, but for the rest it would be one solver per PR.
Hi, I also want to work on that.
This PR is actually blocked by #9040. cc: @arthurmensch
I'm working on it here.
Commits:
- initial PR commit
- seq_dataset.pyx generated from template
- seq_dataset.pyx generated from template #2
- rename variables
- fused types consistency
- test for seq_dataset a sklearn/utils/tests/test_seq_dataset.py
- new if statement
- add doc sklearn/utils/seq_dataset.pyx.tp
- minor changes
- minor changes
- typo fix
- check numeric accuracy only up 5th decimal
- Address oliver's request for changing test name
- add test for make_dataset and rename a variable in test_seq_dataset
TODO:
cc: @raghavrv, @ogrisel, @arthurmensch
.gitignore (Outdated)

```
.cache

# files generated from a template
sklearn/utils/seq_dataset.pyx
```
sklearn/linear_model/setup.py (Outdated)

```python
from sklearn._build_utils import get_blas_info
from Cython import Tempita as tempita
```
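For context, here is a minimal sketch of what the Tempita hook in setup.py can look like (the helper name `process_tempita` is hypothetical; the actual build code in this PR may differ):

```python
from Cython import Tempita as tempita


def process_tempita(fromfile):
    """Render a .pyx.tp Tempita template into the matching .pyx file.

    Hypothetical helper for illustration, e.g.
    sklearn/utils/seq_dataset.pyx.tp -> sklearn/utils/seq_dataset.pyx
    """
    assert fromfile.endswith('.tp')
    tofile = fromfile[:-len('.tp')]
    with open(fromfile) as f:
        template = tempita.Template(f.read())
    with open(tofile, 'w') as f:
        f.write(template.substitute())
```

The generated .pyx file is a build artifact, which is why it is added to .gitignore above.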
```c
#ifdef _MSC_VER
# include <float.h>
# define skl_isfinite _finite
# define skl_isfinite32 _finite
```
Could you add a comment saying why this change is needed?
# When re-declaring the functions in the template for each parameter
# input type, there need to be 2 different functions, as Cython
# doesn't support function overloading.

@jnothman, can you help in deciding whether this comment is needed or whether it is obvious?
```python
# array
dataset_32, _ = make_dataset(X_32, y_32, sample_weight_32)
dataset_64, _ = make_dataset(X_64, y_64, sample_weight_64)
xi_32, _, _, _ = dataset_32._next_py()
```
```python
xi32, yi32, swi32, idx32 = dataset32._next_py()
xi64, yi64, swi64, idx64 = dataset64._next_py()

xi_data32, _, _ = xi32
```
Same comment as above. Why don't we check for y?
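Putting the two snippets above together, here is a hedged sketch of the dtype checks under discussion (it assumes the `_next_py` testing helper added in this PR, and the private `make_dataset` helper at its `sklearn.linear_model.base` location of that era):

```python
import numpy as np
from sklearn.linear_model.base import make_dataset  # private helper

rng = np.random.RandomState(42)
X_64 = rng.rand(10, 3)
y_64 = rng.rand(10)
sw_64 = np.ones(10)

# Same data in 32-bit.
X_32 = X_64.astype(np.float32)
y_32 = y_64.astype(np.float32)
sw_32 = sw_64.astype(np.float32)

dataset_32, _ = make_dataset(X_32, y_32, sw_32)
dataset_64, _ = make_dataset(X_64, y_64, sw_64)

xi_32, yi_32, swi_32, idx_32 = dataset_32._next_py()
xi_64, yi_64, swi_64, idx_64 = dataset_64._next_py()

# xi is a (data, indices, indptr) triple; the data buffer is the
# part that carries a dtype worth checking.
xi_data_32, _, _ = xi_32
xi_data_64, _, _ = xi_64
assert xi_data_32.dtype == np.float32
assert xi_data_64.dtype == np.float64
```

Checking `y` the same way is exactly what the rest of the thread shows to be problematic, since the scalar comes back as a built-in `float`.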
```python
from ..utils.fixes import sparse_lsqr
from ..utils.seq_dataset import ArrayDataset, CSRDataset
from ..utils.seq_dataset import ArrayDataset32, CSRDataset32
from ..utils.seq_dataset import ArrayDataset64 as ArrayDataset
```
Should we keep the name "ArrayDataset"? Why not write explicitly the number of bits?
IMHO that's how I would do it for the moment. This way you ensure that we don't break anything.
+1 for preserving backward compatibility. The ArrayDataset class of scikit-learn 0.18.2 is 64-bit-float based. Let's keep the same name as an alias. We could even set up the alias in sklearn/utils/seq_dataset even if the Cython code is not an official part of the project's public API.
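A hedged sketch of the aliasing @ogrisel suggests (illustrative only; where exactly the alias lives in the PR may differ):

```python
# Keep the historical names working: in scikit-learn 0.18.2 these
# classes were 64-bit-float based, so alias them to the new 64-bit
# variants generated from the template.
from sklearn.utils.seq_dataset import ArrayDataset64, CSRDataset64

ArrayDataset = ArrayDataset64  # backward-compatible alias
CSRDataset = CSRDataset64      # backward-compatible alias
```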
this also needs to be addressed
Force-pushed from d1634e7 to 8655f4a.
This looks good to me. But I guess you want to do the benchmark. Ping me once the CIs pass and you are satisfied with this.
```python
for i in range(5):
    # next sample
    xi32, yi32, _, _ = dataset32._next_py()
```
@Henley13 this test cannot work. yi32 and yi64 are single elements, not arrays, so you cannot compare their dtypes to np.float32 or np.float64. Moreover, Python's built-in float type is always 64-bit. For a single value this is not a problem; it is only a problem for vectors.
I think we should revert the test asked for by @raghavrv.
Ok, I need some help here. That's what we generate from the template:

```cython
cdef class SequentialDataset64:
    # ..... Some functions

    def _next_py(self):
        """python function used for easy testing"""
        cdef int current_index = self._get_next_index()
        return self._sample_py(current_index)

    # ...... more functions

    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef double* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef double y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)

        # .... some stuff when X or y are sparse
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx


cdef class ArrayDataset32(SequentialDataset32):
    # ..... Similar things

    def _sample_py(self, int current_index):
        """python function used for easy testing"""
        cdef float* x_data_ptr
        cdef int* x_indices_ptr
        cdef int nnz, j
        cdef float y, sample_weight

        # call _sample in cython
        self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
                     current_index)

        # ............................
        return (x_data, x_indices, x_indptr), y, sample_weight, sample_idx
```

However, this is not working:

```python
xi_32, yi_32, _, _ = dataset_32._next_py()
xi_64, yi_64, _, _ = dataset_64._next_py()
assert_equal(yi_32.dtype, np.float32)
assert_equal(yi_64.dtype, np.float64)
```

It is not working because neither of them is a NumPy type:

```python
>>> type(yi_32)
<class 'float'>
>>> type(yi_64)
<class 'float'>
```

This might have to do with the fact that …
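The behaviour can be reproduced without Cython at all; here is a minimal standalone illustration (names invented for the demo):

```python
import numpy as np

y32 = np.float32(1.0)
y64 = np.float64(1.0)

# NumPy scalars carry a dtype...
print(y32.dtype, y64.dtype)   # float32 float64

# ...but a C double/float returned through a Python signature becomes a
# built-in float, which is always 64-bit and has no dtype attribute.
y = float(y32)
print(type(y))                # <class 'float'>
print(hasattr(y, 'dtype'))    # False
```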
I agree, let's not test the type of …
I think we can close in favor of #11155.
Reference Issue
Works on #8769 (for the SAG solver), Fixes #9040

What does this implement/fix? Explain your changes.
Avoids logistic regression aggressively casting the data to np.float64 when np.float32 is supplied.

Any other comments?
This PR follows up on PR #8835.
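To see the behaviour this PR targets, a quick hedged check (using the dtype of `coef_` as a proxy for the internal working dtype):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 5).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)

clf = LogisticRegression(solver='sag').fit(X, y)
# Without this fix the float32 input is upcast internally and the fitted
# coefficients come back as float64; the goal is float32 in, float32 out.
print(clf.coef_.dtype)
```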