Example: scikit-learn

Introduction to scikit-Learn

scikit-learn is a python based open-source machine learning library. It provides implementations of popular machine learning algorithms together with evaluation and preprocessing utilities. For more information, refer to the extensive documentation.

Installing scikit-learn

VisTrails should be able to automatically install scikit-learn via pip. If this fails, try to install it manually:

pip install --user scikit-learn

It this also fails, consult the installation instructions.

Using scikit-learn via VisTrails

Datasets

For testing purposes and the examples, there are two multi-class classification datasets made available as VisTrails modules, the Digits and the Iris datasets. These have as output ports training data and classification targets, to quickly test a pipeline. Both datasets are very simple toy datasets and should not be used as benchmarks.

The Digits dataset consists of 1797 handwritten digits as represented as 8x8 grey scale images, resulting in 64 features. The digits belong to the classes 0 to 9. The Iris dataset consists of 150 data points with four features, belonging to one of three classes.

Splitting data into training and test set

As machine learning is inherently about generalization from training to test data, it is essential to separate a data set into training and test parts. The TrainTestSplit module is a convenient way to do this:

_images/train_test_split.png

The round output ports of the Iris dataset correspond to training data and training targets, which are fed into TrainTestSplit. The outputs of TrainTestSplit are data and labels for two subsets of the data, corresponding to the four round output ports. First are training data and training labels, then test data and test labels. The split is done randomly with 25% test data by default.

Basic usage

The VisTrails sklearn package contains most algorithms provided in scikit-learn. In machine learning, applying an algorithm to a dataset usually means training it on one part of the data, the training set, and then applying it to another part, the test set.

Each algorithm in scikit-learn has a corresponding VisTrails module, which has input ports for training data, and outputs the model that was learned (with the exception of the Manifold learning module). To apply the model to new data, connect it to a Predict module (for classification and regression) or to a Transform module (for data transformations like feature selection and dimensionality reduction).

_images/transform.png

It is also possible to directly compute performance metrics like the accuracy or mean squared error using the Score module.

_images/score.png

The resulting Scores can be output with a GenericOutput, or more advanced string formatting.

To make connecting the ports easier, ports that represent models are diamond shaped, while ports that represent data or label arrays are round. The remaining square ports are either parameters of the models or additional information provided as output.

Manifold learning

Manifold learning algorithms are algorithms that embed high-dimensional data into a lower-dimensional space, often for visualization purposes. Most manifold learning algorithms in scikit-learn embed data, but cannot transform new data using a previously learned model. Therefore, manifold learning modules will directly output the transformed data.

_images/manifold.png

The left hand side of the pipeline uses PCA, which can use Transform to be applied to new data. The right hand side uses the manifold learning method TSNE, which cannot be applied to new data, and therefore directly produces the transformed input data (in contrast to PCA, which produces a model).

Pipelines of scikit-learn models

To perform cross validation or grid search over a chain of estimators, such as preprocessing followed by classification, scikit-learn provides a Pipeline module. Each step of a pipeline is defined by an input port specifying a model. All but the last model in the pipeline must be transformers, the last can be arbitrary. Currently the VisTrails scikit-learn package only supports up to four steps in a pipeline.

_images/pipeline1.png

As any other model, a pipeline can either be fit on data and then evaluated using Predict, Transform or Score modules, or can serve as the input model to CrossValScore or GridSearchCV.

_images/pipeline_gridsearch.png