@@ -351,89 +351,6 @@ features::
351351
352352 _`Faster API-compatible implementation`: https://github.com/mblondel/svmlight-loader
353353
354- ..
355- For doctests:
356-
357- >>> import numpy as np
358- >>> import os
359- >>> import tempfile
360- >>> # Create a temporary folder for the data fetcher
361- >>> custom_data_home = tempfile.mkdtemp()
362- >>> os.makedirs(os.path.join(custom_data_home, 'mldata'))
363-
364-
365- .. _mldata:
366-
367- Downloading datasets from the mldata.org repository
368- ---------------------------------------------------
369-
370- `mldata.org <http://mldata.org>`_ is a public repository for machine learning
371- data, supported by the `PASCAL network <http://www.pascal-network.org>`_.
372-
373- The ``sklearn.datasets`` package is able to directly download data
374- sets from the repository using the function
375- :func:`sklearn.datasets.fetch_mldata`.
376-
377- For example, to download the MNIST digit recognition database::
378-
379- >>> from sklearn.datasets import fetch_mldata
380- >>> mnist = fetch_mldata('MNIST original', data_home=custom_data_home)
381-
382- The MNIST database contains a total of 70000 examples of handwritten digits
383- of size 28x28 pixels, labeled from 0 to 9::
384-
385- >>> mnist.data.shape
386- (70000, 784)
387- >>> mnist.target.shape
388- (70000,)
389- >>> np.unique(mnist.target)
390- array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
391-
392- After the first download, the dataset is cached locally in the path
393- specified by the ``data_home`` keyword argument, which defaults to
394- ``~/scikit_learn_data/``::
395-
396- >>> os.listdir(os.path.join(custom_data_home, 'mldata'))
397- ['mnist-original.mat']
398-
399- Data sets in `mldata.org <http://mldata.org>`_ do not adhere to a strict
400- naming or formatting convention. :func:`sklearn.datasets.fetch_mldata` is
401- able to make sense of the most common cases, but allows to tailor the
402- defaults to individual datasets:
403-
404- * The data arrays in `mldata.org <http://mldata.org>`_ are most often
405- shaped as ``(n_features, n_samples)``. This is the opposite of the
406- ``scikit-learn`` convention, so :func:`sklearn.datasets.fetch_mldata`
407- transposes the matrix by default. The ``transpose_data`` keyword controls
408- this behavior::
409-
410- >>> iris = fetch_mldata('iris', data_home=custom_data_home)
411- >>> iris.data.shape
412- (150, 4)
413- >>> iris = fetch_mldata('iris', transpose_data=False,
414- ... data_home=custom_data_home)
415- >>> iris.data.shape
416- (4, 150)
417-
418- * For datasets with multiple columns, :func:`sklearn.datasets.fetch_mldata`
419- tries to identify the target and data columns and rename them to ``target``
420- and ``data``. This is done by looking for arrays named ``label`` and
421- ``data`` in the dataset, and failing that by choosing the first array to be
422- ``target`` and the second to be ``data``. This behavior can be changed with
423- the ``target_name`` and ``data_name`` keywords, setting them to a specific
424- name or index number (the name and order of the columns in the datasets
425- can be found at its `mldata.org <http://mldata.org>`_ under the tab "Data"::
426-
427- >>> iris2 = fetch_mldata('datasets-UCI iris', target_name=1, data_name=0,
428- ... data_home=custom_data_home)
429- >>> iris3 = fetch_mldata('datasets-UCI iris', target_name='class',
430- ... data_name='double0', data_home=custom_data_home)
431-
432-
433- ..
434- >>> import shutil
435- >>> shutil.rmtree(custom_data_home)
436-
437354.. _external_datasets:
438355
439356Loading from external datasets