Label propagation emnb by clayw · Pull Request #4 · larsmans/scikit-learn

clayw · 2012-01-31T06:41:09Z

I opened this as a pull request, but it is mainly for your information and mine.

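I ran your "examples/semisupervised_document_classification.py" script with both my label propagation implementation and your semi-supervised EMNB. On this problem EMNB does considerably better: f1-score 0.859 (MultinomialNB) and 0.856 (BernoulliNB) with EM, versus 0.745 for LabelSpreading (output below).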

Here is the output of the program:

$ python examples/semisupervised_document_classification.py
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data loaded
2034 documents (training set)
1353 documents (testing set)
4 categories

Extracting features from the training dataset using a sparse vectorizer
done in 2.360747s
n_samples: 2034, n_features: 32395

Extracting features from the test dataset using the same vectorizer
done in 1.481269s
n_samples: 1353, n_features: 32395

Removing labels of 1831 random training documents

Baseline: fully supervised Naive Bayes

Training:
MultinomialNB(alpha=0.01, fit_prior=True)
train time: 0.006s
test time: 0.003s
f1-score: 0.824
dimensionality: 32395

Training:
BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True)
train time: 0.006s
test time: 0.015s
f1-score: 0.816
dimensionality: 32395

Naive Bayes trained with Expectation Maximization

Training:
SemisupervisedNB(estimator=MultinomialNB(alpha=0.01, fit_prior=True),
estimator__alpha=0.01, estimator__fit_prior=True, n_iter=10,
relabel_all=True, tol=1e-05, verbose=False)
train time: 0.171s
test time: 0.003s
f1-score: 0.859
dimensionality: 32395

Training:
SemisupervisedNB(estimator=BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True),
estimator__alpha=0.01, estimator__binarize=0.0,
estimator__fit_prior=True, n_iter=10, relabel_all=True, tol=1e-05,
verbose=False)
train time: 0.412s
test time: 0.015s
f1-score: 0.856
dimensionality: 32395

Training:
LabelSpreading(alpha=1, gamma=25, kernel=rbf, max_iters=30, n_neighbors=7,
tol=0.001)
train time: 1.118s
test time: 0.452s
f1-score: 0.745
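
For anyone reading along, here is a minimal sketch of the idea behind the "Naive Bayes trained with Expectation Maximization" numbers above: fit a multinomial Naive Bayes model on the labeled documents, use it to assign soft class weights to the unlabeled ones, then refit on everything and repeat. It is an illustration only, not the SemisupervisedNB code from this branch: the function names (`fit_nb`, `predict_proba`, `em_nb`), the toy documents, and the choice to keep the original labels fixed (rather than anything like `relabel_all`) are my own assumptions.

```python
import math
from collections import Counter


def fit_nb(docs, weights, vocab, alpha=0.01):
    """M-step: fit multinomial NB from per-document class weights."""
    n_classes = len(weights[0])
    class_mass = [0.0] * n_classes
    word_mass = [dict.fromkeys(vocab, 0.0) for _ in range(n_classes)]
    for doc, row in zip(docs, weights):
        for c, p in enumerate(row):
            class_mass[c] += p
            for word, count in doc.items():
                word_mass[c][word] += p * count
    # Additive (Lidstone) smoothing on both priors and word likelihoods.
    total = sum(class_mass) + alpha * n_classes
    log_prior = [math.log((m + alpha) / total) for m in class_mass]
    log_lik = []
    for c in range(n_classes):
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_lik.append({w: math.log((m + alpha) / denom)
                        for w, m in word_mass[c].items()})
    return log_prior, log_lik


def predict_proba(model, doc):
    """E-step for one document: class posteriors via log-sum-exp."""
    log_prior, log_lik = model
    scores = [lp + sum(n * ll[w] for w, n in doc.items() if w in ll)
              for lp, ll in zip(log_prior, log_lik)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def em_nb(labeled, unlabeled, n_classes, n_iter=10, alpha=0.01):
    """Semi-supervised NB: labeled docs keep their labels, unlabeled get soft ones."""
    docs = [d for d, _ in labeled] + list(unlabeled)
    vocab = sorted({w for d in docs for w in d})
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)]
            for _, y in labeled]
    # Start from the purely supervised model.
    model = fit_nb([d for d, _ in labeled], hard, vocab, alpha)
    for _ in range(n_iter):
        soft = [predict_proba(model, d) for d in unlabeled]  # E-step
        model = fit_nb(docs, hard + soft, vocab, alpha)      # M-step
    return model


# Toy data (made up): class 0 ~ space, class 1 ~ graphics.
labeled = [(Counter("rocket orbit launch".split()), 0),
           (Counter("pixel render shader".split()), 1)]
unlabeled = [Counter("orbit launch satellite".split()),
             Counter("satellite moon rocket".split()),
             Counter("shader texture pixel".split()),
             Counter("texture polygon render".split())]

query = Counter("moon satellite".split())
baseline_probs = predict_proba(em_nb(labeled, [], 2, n_iter=0), query)
em_probs = predict_proba(em_nb(labeled, unlabeled, 2), query)
print("supervised only:", baseline_probs)
print("with EM:", em_probs)
```

The query shares no words with the two labeled documents, so the supervised-only model cannot separate the classes, while the EM model picks up "satellite" and "moon" from the unlabeled documents. That is the same effect, at toy scale, as the gap between the baseline and the EM scores reported above.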

larsmans and others added 30 commits December 24, 2011 22:03

Revert "BUG Disallow negative tf-idf weight"

f116314

This reverts commit 617d731. Breaks other tests; let's live with the negative tf-idf weights for now.

PY3K fix in datasets.samples_generator

0e74917

fix scaling, more tests and docstrings

f85e7be

Merge branch 'master' into sparse-scaler

b0f221a

wording

2e0857a

FIX: py3k integer division in robust covariance estimation

708b61a

FIX: py3k integer division in samples generator

cd27709

FIX: in py3k svmlight files must be explicitly opened in binary mode

8232d49

ENH: update joblib

826443e

This release brings zipped storage

FIX: py3k bytes split in svmlight format parser

68b7ca3

Merge branch 'master' of github.com:scikit-learn/scikit-learn

9fe13b3

FIX: py3k need explicit bytes buffers for svmlight format serialization

bc18af3

FIX: py3k need output file in binary mode for svmlight format serialization

29c092d

FIX: py3k: string formatting is not supported on byte strings

8226a98

FIX: fix test: integers are valid file descriptors in py3k

0f8b3c2

Add missing reference.

03a4e3b

Break down fit_transform into parts.

9ad004f

Merge branch 'master' into sparse-scaler

59cdcc7

FIX: unused cython variable

b2dac63

More checks when transforming sparse matrices with centering scalers + typo

671f304

DOC: update narrative documentation

c4f38fa

DOC: Added reference for feature importance

cd09d87

ENH: Revisited importances API

6707fec

Merge remote-tracking branch 'upstream/master' into tree

ab23e22

Conflicts: sklearn/ensemble/forest.py sklearn/tree/tree.py

EXAMPLE: Fixed API changes

83f693f

ENH: Missing default value for feature_importances_

df8adf0

FIX removed duplicate explicit linke for Vlad

7a8b4f2

FIX: RST indentation and blank lines

600f685

FIX RST and references

b623aa0

FIX minor rst

f57d254

Fabian Pedregosa and others added 29 commits January 9, 2012 14:58

FIX: doctest

2108f11

Still some tweaks for the sklearn.test() example

a072187

Remove pylab code from docstring and +SKIP those that requie PIL

dfc7c5c

BUG don't use deprecated attributes in GaussianNB.predict

f4eced3

FIX: explicit conversion to float64 in ElasticNet

0372ba7

as_float_array is not enough because cd_fast does not accept float32.

FIX: bug in elasticnet with precompute not being updated correctly.

f977346

And added a remark on updating the Gram matrix. This was triggered by tests inside dict_learning, probably the last commit made this visible. Thus I'm not adding more tests.

finalized KNN work, all tests pass properly

68444c0

Merge branch 'larsmans-label-propagation'

fdfb531

Conflicts: doc/modules/classes.rst doc/modules/label_propagation.rst sklearn/label_propagation.py sklearn/tests/test_label_propagation.py

removed extra semisupervised folder

bcd5e33

polished the lp & test code

04a0ee2

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into label-propagation

691ac41

variable name changes, using premade functions, doc fixes as per

e778f7b

@larsman

variable name changes, doc corrections

58433a7

(@amueller's requests)

removed unlabeled_identifier, updated tests and examples to reflect this

794a31c

corrected example that still refered to unlabeled_identifier

3fb76f3

optimization that stores the spatial index when using knn graphs

4fa5e29

updated rst docs with kernel information

2d6fde0

shuffled digits example, added sensible point colors to plot chart,

213f0a2

removed fix for unsupported numpy api

docs describe the different kernels available in techniques

7e837b0

TL directory change to push label propagation code into semi_supervised

268aed6

added __init__.py file to semi_supervised folder

c5aed1d

Updated docs for label propagation, added more technical details about

ca8334a

regularization procedure

specific fine tuning to the label propagation docs

e585485

doc updates & tweaks

900c97f

fixed typo in test code

729a4c2

added AISTAT ref to docs

c07a041

added AISTAT ref to rst doc

a8a4948

Merge branch 'emnb' of https://github.com/larsmans/scikit-learn into label-propagation

8d0f6f2

combination of semisupervised algos

3777667

larsmans closed this Mar 8, 2016

clayw wants to merge 430 commits into larsmans:emnb from clayw:label-propagation-emnb

13 participants
