Fooling Around with KNIME cont’d: Deep Learning

05/08/2018 ~ Matthias Groncki ~ 1 Comment

In my previous post I wrote about my first experiences with KNIME, and we implemented three classical supervised machine learning models to detect credit card fraud. In the meantime I found out that the newest version of KNIME (3.6 at the time of writing) also supports the deep learning frameworks TensorFlow and Keras. So I thought, let’s revisit our deep learning model for fraud detection and try to implement it in KNIME using Keras, without writing a single line of Python code.

Install the required packages

You need Python with TensorFlow and Keras on your machine (you can install both with pip, or with conda if you are using the Anaconda distribution). Then you need to install the Keras integration extensions in KNIME; you can follow the official tutorial at https://www.knime.com/deeplearning/keras.

Workflow in KNIME

The first part of the workflow is quite similar to the previous workflow

We load the data, remove the time column, split the data into training, validation and test sets, and normalise the features.

The configuration of the column filter node is quite straightforward: we specify the columns we want to include and exclude (no big surprise in the configuration).

Design of the deep network

To build our very simple three-layer network we need three new nodes: the Keras Input-Layer-Node, the Dense-Layer-Node and the DropOut-Node:

We start with the input layer, where we have to specify the dimensionality of our input (in our case 29 features). We can also specify the batch size here.

The next layer is a dense (fully connected) layer. We can specify the number of nodes and the activation function.

After the dense layer we apply dropout with a dropout rate of 20%. The configuration is again quite straightforward.

We then add another dense layer with 50 nodes, another dropout and the final layer with one node and the sigmoid activation function (binary classification: fraud or non-fraud).

The last layer of the network, the training data and the validation set are the inputs to the Keras-Network-Learner node.

We set the input and the target variable, and choose the loss function, the optimisation method and the number of epochs.

We can specify our own loss function if we want or need to.

We can select an early stopping strategy as well:

With the settings above, training stops if the validation loss does not decrease by more than 0.001 for at least 5 epochs.
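The stopping rule itself is easy to express in code. The following is a minimal sketch in plain Python of how we read that rule; it is not the KNIME or Keras implementation, and the function name early_stopping_epoch and the example loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=5):
    """Return the 1-based epoch at which training would stop, or None.

    Assumption: an epoch only counts as an improvement if the validation
    loss drops by more than min_delta below the best loss seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: fast progress, then a plateau
val_losses = [0.50, 0.40, 0.35, 0.3495, 0.3493, 0.3496, 0.3492, 0.3494, 0.30]
stop_epoch = early_stopping_epoch(val_losses)
print(stop_epoch)  # 8: five epochs in a row without a sufficient improvement
```

Note that the last value (0.30) is never reached: with a patience of 5 the training ends before the loss starts to drop again, which is the usual trade-off of this strategy.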

During the training we can monitor the training process:

The trained model and the test data are the inputs to the DL-Network-Executor node, which uses the trained network to classify the test set.

The results are fed into a ROC-Curve node to assess the model quality.

And here is the complete workflow:

Conclusion

It was very easy and fast to implement our previous model in KNIME without writing a single line of code. Nevertheless, the user still needs to understand the concepts of deep learning in order to build the network and understand the node configurations. I really liked the real-time training monitor. In my view KNIME is a tool which can help to democratize data science within an organisation. Analysts who cannot code in Python or R get access to very good deep learning libraries for their data analytics without the burden of learning a new programming language. They can focus on understanding the underlying concepts of deep learning, understanding their data, choosing the right model for the data and understanding the drawbacks and limitations of their approach, instead of spending hours learning to program in a new language. Of course you need to spend time learning KNIME, which may be easier for some people than learning to program.

I plan to spend some more time with KNIME in the future. I want to find out how to reuse parts of a workflow in a new workflow (like the data preparation, which is almost exactly the same as in the previous example) and how to move a model into production. I will report on it in some later posts.

So long…

Fooling around with KNIME

19/07/2018 ~ Matthias Groncki ~ 1 Comment

At the moment I am quite busy preparing two training courses, ‘Programming and Quantitative Finance in Python’ and ‘Programming and Machine Learning in Python’, for internal trainings at my work, so I haven’t had much free time for my blog. But I spent the last two nights fooling around with KNIME, an open-source tool for data analytics and data mining (Peter, I took inspiration from your blog to name today’s post), and I want to share my experience. In the beginning I was quite sceptical and my first thought was ‘I can write code faster than I can drag-and-drop a model’ (and I still believe it). But I wanted to give it a try, so I migrated my logistic regression fraud detection sample from my previous blog posts into a graphical workflow.

In the beginning it was a bit frustrating since I didn’t know which node to use and where to find all the settings. But the interface and the node names are quite self-explanatory, so after exploring some examples and watching one or two YouTube videos I was able to build my first fraud detection model in KNIME.

To work with a classifier we need to convert the numerical variable Class (1=Fraud, 0=NonFraud) into a string variable. This was not obvious to me, and coming from Python and scikit-learn it felt a bit weird and unnecessary. After fixing that, I split the data into a training and a test set with the Partitioning node. With a right-click on the node we can open the configuration and change it to the common 80-20 split.

In the next step the data is standardized (using a Normalizer node). We can select which features / variables we want to scale and how. There are plenty of settings to choose from; a nice feature is the online help (node description) on the right side of the UI, which describes the different parameters.

I have encapsulated the model fitting, prediction and scoring in a meta-node to make the workflow look cleaner and more understandable. With a double-click on the meta-node we can open the sub-workflow.

I am fitting three different models (again encapsulated in meta-nodes), combine the results (the AUC scores) into one table and export it as a CSV file.

Let’s have a detailed look at the logistic regression meta-node.

The first node is the so-called learner node. It fits the model on the training data. In the configuration we can select the input features, the target column, the solver algorithm and advanced settings like regularization.

To make predictions we use the predictor node. Its inputs are the fitted model (square input port) and the standardized test set. To standardize the test set we feed it, together with the fitted normalizer, into the Normalizer (Apply) node (as far as I understand, this is equivalent to calling the transform method of a scaler in scikit-learn after fitting it; please correct me if I am wrong). The prediction output is used to calculate the ROC curve and the AUC (in the configuration of the predictor node we have to add the predicted probabilities as an additional output; the default setting outputs only the predicted class). We export the plot as an SVG file, and the AUC score (as a table with an extra column for the model name) is the output of our meta-node.
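To make the fit-then-apply idea concrete, here is a small sketch in plain Python of what I assume both the normalizer pair in KNIME and a scikit-learn scaler do: the mean and standard deviation are learned from the training data only and then reused unchanged on the test data. The helper names fit_standardizer and apply_standardizer and the numbers are purely illustrative.

```python
from statistics import mean, pstdev


def fit_standardizer(column):
    """Learn mean and (population) standard deviation from training data only."""
    return mean(column), pstdev(column)


def apply_standardizer(column, params):
    """Scale any data set with the parameters learned on the training set."""
    mu, sigma = params
    return [(value - mu) / sigma for value in column]


train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [14.0, 20.0]

params = fit_standardizer(train)                  # corresponds to fit
train_scaled = apply_standardizer(train, params)  # corresponds to transform
test_scaled = apply_standardizer(test, params)    # same parameters, no refit

print(train_scaled)
print(test_scaled)
```

The scaled test set does not have zero mean and unit variance itself, and that is intended: refitting on the test data would leak information about it into the preprocessing.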

We can always investigate the output/result of one step, e.g. of the last node:

Or the interactive plot of the AUC node:

Or the model parameter output of the learner node:

The workflows for the other two models are quite similar. After copying and pasting the logistic regression meta-node, it was just a matter of replacing the learner and predictor nodes and adjusting the configurations.

To execute the complete workflow we just need to press the run/play button in the menu.

There is still much to discover, explore and try. For example, there are nodes for cross-validation and feature selection which I haven’t tried yet, and so many other nodes (e.g. plotting, descriptive statistics and the Python and R integration nodes). I also haven’t tried to move a model into production, but I read that it should not be that difficult with KNIME (they promote it as a platform to create data science applications). I spent just a couple of hours with it, so please forgive me if I didn’t use the right names for some of the nodes, settings, menus or features in KNIME.

What is my impression after playing with it for a couple of hours?

I still believe that writing code is the faster option for me, but I have to admit that I like KNIME more and more. And it’s not really a fair comparison (years of Python programming vs a couple of hours experimenting with a new tool). It’s a nice tool for prototyping models without writing a line of code. If you are not familiar with an ML library yet, it’s a good and fast way to build models. But there is no free lunch here either: instead of learning a syntax and the architecture of a library, you have to learn to use the UI and find all the settings.

It is open-source software and so far I haven’t encountered any limitations (other tools, for example, limit the number of rows you can use in the free version), but I’ve just scratched the surface.

In my opinion one big advantage is the visualization of the model. The model is easy to understand and can easily be handed over to other developers or engineers. Everyone knows that working with other people’s code can sometimes be a pain, and a visual workflow can ease that pain. But I believe workflows can become messy as well. It’s a tool for analysts and business users who want to explore and analyse their data, generate insights and use the power of standard machine learning and data mining algorithms without being forced to learn programming first.

The first impression is surprisingly good. I will continue playing with it, and I want to figure out how to run my own Python script in a node and, maybe even more important, how to move a model into production.

I will report on it in a later post.

So long…

From Logistic Regression in SciKit-Learn to Deep Learning with TensorFlow – A fraud detection case study – Part II

18/05/2018 ~ Matthias Groncki ~ 5 Comments

We will continue to build our credit card fraud detection model. In the previous post we used scikit-learn to detect fraudulent transactions with a logistic regression model. This time we will build a logistic regression in TensorFlow from scratch. We will start with some TensorFlow basics and then see how to minimize a loss function with (stochastic) gradient descent.

We will fit our model to our training set by minimizing the cross entropy. For logistic regression, minimizing the cross entropy is equivalent to maximizing the likelihood (see the previous part for details). In the next part we will extend this model with some hidden layers and build a deep neural network to detect fraud. We will also see how to use the high-level API to build the same model much more easily and quickly.

As usual you will find the corresponding notebook also on GitHub and kaggle.

First we will load the data, split it into a training and test set and normalize the features as in our previous model.


import numpy as np
import pandas as pd
import tensorflow as tf
import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

credit_card = pd.read_csv('../input/creditcard.csv')

X = credit_card.drop(columns='Class', axis=1)
y = credit_card.Class.values

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Short TensorFlow introduction

TensorFlow uses a dataflow graph to represent the computation in terms of the dependencies between individual operations.

We first define the dataflow graph, then we create a TensorFlow session to run or calculate it.

Let’s start with a very simple graph. Say we have a two-dimensional vector

$x =(x_1,x_2)$

and want to compute $a x^2 = (ax_1^2, ax_2^2)$, where $a$ is a constant (e.g. 5).

First we define the input, a tf.placeholder object. We will use tf.placeholder to feed data into our models (like our features and the values we try to predict). We can also use tf.placeholder to feed hyperparameters into our model (e.g. a learning rate). We have to define the type of the input (in this case a floating point number) and can optionally define its shape (here a two-dimensional vector; the code below leaves the shape unspecified).

Another building block is tf.constant; as the name indicates, it represents a constant in our computational graph.

input_x = tf.placeholder(tf.float32)
a = tf.constant(5., tf.float32, name='a', shape=(2,))

Now we define our operator y.

y = a * tf.pow(input_x,2)

We can create a TensorFlow session and run our operator with the run() method of the session. We have to feed our input vector x into the graph as a dictionary. Let’s say $x=(1,2)$.

x = np.array([1,2])
with tf.Session() as sess:
    result = sess.run(y, {input_x: x})
    print(result)

[ 5. 20.]

Automatic differentiation in TensorFlow

Most machine learning and deep learning problems are, in the end, optimization problems. We either have to minimize a loss function or, in the case of reinforcement learning, maximize a reward function. A bread-and-butter method for solving this kind of problem is gradient descent (or ascent).

The power of TensorFlow is the automatic (or algorithmic) differentiation of our computational graph. We get the analytical derivatives (or gradients) almost for ‘free’. We don’t need to derive the formula for the gradient ourselves or implement it.

The gradient $\nabla f(x)$ is the multidimensional generalization of the derivative of a function. It is the vector of the partial derivatives of the function and points in the direction of the steepest increase. If we have a real-valued function $f(x)$ with $x$ being an n-dimensional vector, then $f$ decreases fastest when we go from point $x$ in the direction of the negative gradient.

To get the gradient we use the tf.gradients function. Let’s say we want to differentiate y with respect to our input x. The function call looks like this:

g = tf.gradients(y, [input_x])
grad_y_x = 0
with tf.Session() as sess:
    grad_y_x = sess.run(g,{input_x: np.array([1,2])})
    print(grad_y_x)

[array([10., 20.], dtype=float32)]
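A quick way to convince ourselves that this result is right is a finite-difference check. The sketch below uses plain Python rather than TensorFlow, so it is only a cross-check of the numbers, not of TensorFlow itself. Because y is a vector, tf.gradients differentiates the sum of its components, which is why the helper f sums up the terms; the names f and numerical_gradient are our own.

```python
def f(x, a=5.0):
    """Sum of the components of y = a * x**2 (what tf.gradients differentiates)."""
    return sum(a * xi ** 2 for xi in x)


def numerical_gradient(func, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2 * h))
    return grad


grad = numerical_gradient(f, [1.0, 2.0])
print(grad)  # approximately [10.0, 20.0], matching the analytical 10 * x_i
```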

Chain Rule in TensorFlow

Let’s check a simple example of the chain rule. We will come back to the chain rule later when we talk about back propagation in the coming part. Back propagation is one of the key concepts in deep learning. Let’s recall the chain rule, which we all learned in high school:

$\frac{d}{dx}f(g(x)) = f'(g(x))g'(x)$

For our example we use our function y and chain it to a new function z=log(y).

The ‘inner’ partial derivative of $y_i = 5x_i^2$ with respect to $x_i$ is

$\frac{\partial y_i}{\partial x_i} = 10x_i$

and the outer one with respect to $y_i$ is

$\frac{\partial z_i}{\partial y_i} = \frac{1}{y_i} = \frac{1}{5x_i^2}.$

By the chain rule, the partial derivative of $z_i$ with respect to $x_i$ is therefore $\frac{\partial z_i}{\partial x_i} = \frac{10x_i}{5x_i^2} = \frac{2}{x_i}$.

With TensorFlow, we can calculate the outer and inner derivatives separately or in one step.

In our example we will calculate two gradients of z, one with respect to y and one with respect to x. Multiplying the gradient of z with respect to y elementwise with the gradient of y with respect to x (the inner derivative) yields the gradient of z with respect to x.

z = tf.log(y)
with tf.Session() as sess:
    result_z = sess.run(z,  {input_x: np.array([1,2])})
    print('z =', result_z)
    delta_z = tf.gradients(z, [y, input_x])
    grad_z_y, grad_z_x = sess.run(delta_z,  {input_x: np.array([1,2])})
    print('Gradient with respect to y', grad_z_y)
    print('Gradient with respect to x', grad_z_x)
    print('Manual chain rule', grad_z_y * grad_y_x)

z = [1.609438  2.9957323]
Gradient with respect to y [0.2  0.05]
Gradient with respect to x [2. 1.]
Manual chain rule [[2. 1.]]

Gradient descent method

As mentioned before the gradient is very useful if we need to minimize a loss function.

It’s like hiking down a hill: we walk step by step in the direction of the steepest descent and finally we reach the valley. The gradient tells us in which direction we need to walk.

So if we want to minimize a function $f$ (e.g. root mean squared error, negative log-likelihood, …) we can apply the iterative algorithm

$x_n = x_{n-1} - \gamma \nabla f(x_{n-1}),$

with a starting point $x_0$ and a step size $\gamma$. Methods of this kind are called gradient descent methods.

Under particular circumstances (for example a convex function and a suitable step size) we can be sure to reach the global minimum, but in general this is not true. Sometimes we end up in a local minimum or on a plateau.
To avoid getting stuck in local minima there are plenty of extensions to plain vanilla gradient descent (e.g. simulated annealing). In the machine learning literature the gradient descent method is often called batch gradient descent, because all data points are used to calculate the gradient.

We usually multiply the gradient by a factor, the so-called learning rate ($\gamma$ above), before we subtract it from our previous value. If the learning rate is too large, we make large steps in the right direction but may step over the minimum and miss it. If the learning rate is too small, the algorithm takes longer to converge. There are extensions which adapt the learning rate to the parameters (e.g. Adam, RMSProp or AdaGrad) to achieve faster and better convergence (see for example http://ruder.io/optimizing-gradient-descent/index.html).
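The effect of the learning rate is easy to see on a toy problem. The following sketch (plain Python, with the function $f(x) = x^2$ and the three learning rates chosen purely for illustration) runs the update rule for 20 steps from the same starting point.

```python
def gradient_descent(learning_rate, x0=1.0, n_steps=20):
    """Minimise f(x) = x**2, whose gradient is 2 * x, with a fixed step size."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * 2 * x
    return x


good = gradient_descent(0.1)    # shrinks steadily towards the minimum at 0
slow = gradient_descent(0.01)   # moves in the right direction, but slowly
diverged = gradient_descent(1.1)  # overshoots: every step lands further away

print(good, slow, diverged)
```

With a learning rate of 1.1 each step jumps over the minimum and ends up further away than before, so the iteration diverges instead of converging.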

Example Batch Gradient descent

Let’s see how to use it on a linear regression problem. We generate 1000 random observations $y = 2 x_1 + 3x_2 + \epsilon$, with $\epsilon$ normally distributed with zero mean and a standard deviation of 0.2.

#Generate data
np.random.seed(42)
eps = 0.2 * np.random.randn(1000)
x = np.random.randn(2,1000)
y = 2 * x[0,:] + 3 * x[1,:] + eps

We use a simple linear model to predict y. Our model is

$\hat{y_i} = w_1 x_{i,1} + w_2 x_{i,2},$

for an observation $x_i$ and we want to minimize the mean squared error of our predictions

$\frac{1}{1000} \sum (y_i-\hat{y_i})^2.$

Clearly we could use the well-known least squares estimator for the weights, but we want to minimize the error with a gradient descent method in TensorFlow.
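For reference, this is roughly what the closed-form route looks like. The sketch below solves the normal equations $(X^T X) w = X^T y$ for two features without intercept, using only the Python standard library. Note the assumptions: the data is generated with the random module instead of numpy, so the numbers differ slightly from the ones used in the rest of this post, and the variable names are our own.

```python
import random

random.seed(42)
n = 1000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a + 3 * b + random.gauss(0, 0.2) for a, b in zip(x1, x2)]

# Entries of X^T X (symmetric 2x2 matrix) and of X^T y
s11 = sum(a * a for a in x1)
s12 = sum(a * b for a, b in zip(x1, x2))
s22 = sum(b * b for b in x2)
t1 = sum(a * c for a, c in zip(x1, y))
t2 = sum(b * c for b, c in zip(x2, y))

# Solve the 2x2 linear system with Cramer's rule
det = s11 * s22 - s12 * s12
w1 = (t1 * s22 - s12 * t2) / det
w2 = (s11 * t2 - s12 * t1) / det

print(w1, w2)  # should be close to the true weights 2 and 3
```

Gradient descent should end up near the same weights; the closed form is simply not available any more once we move on to models like neural networks.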

We use the tf.Variable class to store the parameters $w$ which we want to learn (estimate) from the data. We specify the shape of the tensor through the initial values. The initial values are the starting point of our minimization.

Since we have a linear model, we can represent it with a single matrix multiplication of our weight (parameter) matrix w with our observation matrix (in the code below: rows are features, columns are observations).

# Setup the computational graph with loss function
input_x = tf.placeholder(tf.float32, shape=(2,None))
y_true = tf.placeholder(tf.float32, shape=(None,))
w = tf.Variable(initial_value=np.ones((1,2)), dtype=tf.float32)
y_hat = tf.matmul(w, input_x)
loss = tf.reduce_mean(tf.square(y_hat - y_true))

In the next step we are going to apply our batch gradient descent algorithm.

We define the gradient of the loss with respect to our weights $w$ as grad_loss_w.

We also need to initialize our weights with the initial values (the starting point of our optimization). TensorFlow has an operator for this, tf.global_variables_initializer(). In our session we run the initialization operator first, and then we can apply our algorithm.

We calculate the gradient and apply it to our weights with the function assign().

grad_loss_w = tf.gradients(loss, [w])
init = tf.global_variables_initializer()
losses = np.zeros(10)
with tf.Session() as sess:
    # Initialize the variables
    sess.run(init)
    # Gradient descent
    for i in range(0,10):
        # Calculate gradient
        dloss_dw = sess.run(grad_loss_w, {input_x:x,
                                          y_true:y})
        # Apply gradient to weights with learning rate
        sess.run(w.assign(w - 0.1 * dloss_dw[0]))
        # Output the loss
        losses[i] =  sess.run(loss, {input_x:x,
                                     y_true:y})
        print(i+1, 'th Step, current loss: ', losses[i])
    print('Found minimum', sess.run(w))
plt.plot(range(10), losses)
plt.title('Loss')
plt.xlabel('Iteration')
_ = plt.ylabel('MSE')

Luckily we don’t need to program the same algorithm ourselves every time. TensorFlow provides many gradient descent algorithms, e.g. tf.train.GradientDescentOptimizer, tf.train.AdagradDAOptimizer or tf.train.RMSPropOptimizer (to mention a few). They compute the gradient and apply it to the weights automatically.

In the case of the GradientDescentOptimizer we only need to specify the learning rate and tell the optimizer which loss function we want to minimize.

We call the method minimize, which returns our training (optimization) operator. In our loop we just need to run this operator.

optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
losses = np.zeros(10)
with tf.Session() as sess:
    # Initialize the variables
    sess.run(init)
    # Gradient descent
    for i in range(0,10):
        _, losses[i] =  sess.run([train, loss], {input_x:x,
                                     y_true:y})
    print('Found minimum', sess.run(w))
plt.plot(range(10), losses)
plt.title('Loss')
plt.xlabel('Iteration')
_ = plt.ylabel('MSE')

Stochastic Gradient Descent and Mini-Batch Gradient

One extension of batch gradient descent is stochastic gradient descent. Instead of calculating the gradient over all observations, we randomly pick one observation (without replacement) and evaluate the gradient at this point. We repeat this until we have used all data points; one such pass is called an epoch. We repeat that process for several epochs.

Another variant uses more than one random data point per gradient. This is the so-called mini-batch gradient descent. Please feel free to play with batch_size and the learning rate to see their effect on the optimization. One advantage is that we don’t need to keep all the data in memory for the optimization, which matters especially for big data. We just need to load one small batch at a time to calculate the gradient.
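The batching itself is independent of TensorFlow. Here is a minimal sketch in plain Python of how the indices for one epoch can be shuffled and cut into batches; the generator name minibatch_indices is ours and not part of any library. It also shows what happens when the batch size does not divide the number of samples: the last batch is simply smaller.

```python
import random


def minibatch_indices(n_samples, batch_size, rng):
    """Yield shuffled index batches that cover every sample exactly once."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(42)

# One epoch over 1000 samples in batches of 25, as in the example of this post
batches = list(minibatch_indices(n_samples=1000, batch_size=25, rng=rng))
print(len(batches))  # 40

# Batch size that does not divide the sample count: last batch is smaller
odd_batches = list(minibatch_indices(n_samples=10, batch_size=4, rng=rng))
print([len(b) for b in odd_batches])  # [4, 4, 2]
```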

np.random.seed(42)
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
n_epochs = 10
batch_size = 25
losses = np.zeros(n_epochs)
with tf.Session() as sess:
    # Initialize the variables
    sess.run(init)
    # Gradient descent
    indices = np.arange(x.shape[1])
    for epoch in range(0,n_epochs):
        np.random.shuffle(indices)
        for i in range(int(np.ceil(x.shape[1]/batch_size))):
            idx = indices[i*batch_size:(i+1)*batch_size]
            # Slicing already yields shape (2, len(idx)), so no reshape is
            # needed (a fixed reshape would fail on a smaller last batch)
            x_i = x[:,idx]
            y_i = y[idx]
            sess.run(train, {input_x: x_i, 
                             y_true:y_i})
        
        # Evaluate the loss on the full data set once per epoch
        loss_i = sess.run(loss, {input_x: x,
                                 y_true: y})
        print(epoch, 'th Epoch Loss: ', loss_i)
        losses[epoch] = loss_i
    print('Found minimum', sess.run(w))
plt.plot(range(n_epochs), losses)
plt.title('Loss')
plt.xlabel('Epoch')
_ = plt.ylabel('MSE')

Found minimum [[1.9929324 2.9882016]]

Our minimisation algorithm found a solution very close to the true values (2 and 3).

Logistic Regression in TensorFlow

Now we have all the tools to build our logistic regression model in TensorFlow.
It’s quite similar to our previous toy example. Logistic regression is also a kind of linear model; it belongs to the class of generalized linear models, with the logit as link function. As we have seen in the previous part, we assume in logistic regression that the logits (logarithm of the odds) are linear in the parameters/weights.

Our data set has 30 features, so we adjust the placeholders and the weights accordingly. We have seen that minimizing the cross entropy is equivalent to maximizing the likelihood function. TensorFlow provides us with the loss function tf.nn.sigmoid_cross_entropy_with_logits, so we don’t need to implement the loss function ourselves (let us use this little shortcut; the cross entropy or negative log-likelihood is quite easy to implement). The loss function takes the logits and the true labels (response) as inputs. It computes the cross entropy elementwise, so we have to take the mean or sum of its output.
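To show what the shortcut saves us, here is a sketch of the elementwise loss in plain Python. The first function is the textbook cross entropy computed from the probability; the second uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) for a logit x and label z, which is the formulation given in the TensorFlow documentation for this loss. The function names and the example values are our own.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def naive_cross_entropy(logit, label):
    """Textbook formula; breaks down when the probability rounds to 0 or 1."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def stable_cross_entropy(logit, label):
    """Same quantity computed directly from the logit, without log(0)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


examples = [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]  # (logit, true label)
losses = [stable_cross_entropy(x, z) for x, z in examples]
mean_loss = sum(losses) / len(losses)  # reduce to a single number, as in the post
print(mean_loss)

# A confident but wrong prediction: the loss grows linearly with the logit
print(stable_cross_entropy(100.0, 0.0))
```

For the last example the naive version would fail, because sigmoid(100.0) is rounded to exactly 1.0 and the logarithm of 1 - p is then undefined.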

# Setup the computational graph with loss function
input_x = tf.placeholder(tf.float32, shape=(None, 30))
y_true = tf.placeholder(tf.float32, shape=(None,1))
w = tf.Variable(initial_value=tf.random_normal((30,1), 0, 0.1, seed=42), dtype=tf.float32)
logit = tf.matmul(input_x, w)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=logit))

To get the prediction we have to use the sigmoid function on the logits.

y_prob = tf.sigmoid(logit)

For the training we can almost reuse the code of the gradient descent example. We just need to adjust the number of iterations (100; feel free to play with this parameter) and the function call. In each iteration we run our training operator, calculate the current loss and the current probabilities, and store this information to visualize the training.

Every ten epochs we print the current loss and AUC score.

optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
n_epochs = 100
losses = np.zeros(n_epochs)
aucs = np.zeros(n_epochs)
with tf.Session() as sess:
    # Initialize the variables
    sess.run(init)
    # Gradient descent
    for i in range(0,n_epochs):
        _, iloss, y_hat =  sess.run([train, loss, y_prob], {input_x: X_train,
                                                           y_true: y_train.reshape(y_train.shape[0],1)})
        losses[i] = iloss
        aucs[i] = roc_auc_score(y_train, y_hat)
        if i%10==0:
            print('%i th Epoch Train AUC: %.4f Loss: %.4f' % (i, aucs[i], losses[i]))
    
    # Calculate test auc
    y_test_hat =  sess.run(y_prob, {input_x: X_test,
                                             y_true: y_test.reshape(y_test.shape[0],1)})
    weights = sess.run(w)

0 th Epoch Train AUC: 0.1518 Loss: 0.7446
10 th Epoch Train AUC: 0.8105 Loss: 0.6960
20 th Epoch Train AUC: 0.8659 Loss: 0.6906
30 th Epoch Train AUC: 0.9640 Loss: 0.6893
40 th Epoch Train AUC: 0.9798 Loss: 0.6884
50 th Epoch Train AUC: 0.9816 Loss: 0.6876
60 th Epoch Train AUC: 0.9818 Loss: 0.6868
70 th Epoch Train AUC: 0.9818 Loss: 0.6861
80 th Epoch Train AUC: 0.9819 Loss: 0.6853
90 th Epoch Train AUC: 0.9820 Loss: 0.6845

The AUC score of the test data is around 98%.

Since we have only 30 features we can easily visualize the influence of each feature in our model.

So that’s it for today. I hope you enjoyed reading the post. Please download or fork the notebook on GitHub or on kaggle, play with the code and change the parameters.

In the next post we will add a hidden layer to our model and build a neural network. We will also see how to use the high-level API to build the same model much more easily. If you want to learn more about gradient descent and optimization, have a look at the following links:

Lecture notes from the University of Toronto: https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Wikipedia

So long.

A note: I will separate the data science related notebooks from the quant related notebooks on GitHub. In future you will find all machine learning and deep learning related notebooks in their own repository, while the quant notebooks will stay in the old repo.

From Logistic Regression in SciKit-Learn to Deep Learning with TensorFlow – A fraud detection case study – Part I

08/05/2018 ~ Matthias Groncki ~ 7 Comments

It’s been quite a long time since my last post. It has been a busy time for me and a lot of things have changed. One obvious change was the overdue change of my blog title from IPython to Jupyter notebooks. My work life has also shifted over time from a Quant to a Data Scientist role. The most recent change is that I relocated to Bangkok this year. I am still working in the finance industry and now lead a small Data Science team in the Operational Risk area.

After getting used to the new life here in South East Asia I’ve decided to continue my blog. But there will be a slight change in topics: I plan to write more about Machine Learning and Deep Learning in Python and R and less about pricing and XVAs. I will also have the chance to look into some pricing model validation in my new role, so there may be some quant related postings coming as well.

In the next three posts we will see how to build a fraud detection (classification) system with TensorFlow. We will start by building a logistic regression classifier in scikit-learn (sklearn). In the next step we will build a logistic regression classifier in TensorFlow from scratch. In the third post we will add a hidden layer to our logistic regression and build a neural network.

You can find the complete source code on GitHub or on kaggle.

For this example we use a publicly available real-world data set. You can find the data on kaggle. The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, no further background information about the data is available. Features V1, V2, … V28 are the principal components obtained with PCA.

Install requirements

The easiest way is to install the Anaconda Python distribution from Anaconda Cloud. Windows, Mac and Linux are supported. I use the 64-bit version of Python 3.6. Most of the required packages already come with the basic installation. To install Keras and TensorFlow, open the Anaconda prompt (shell) and install the missing packages via conda (a package manager, similar to apt-get in Ubuntu):

conda install -c conda-forge keras tensorflow

Fraud detection with logistic regression in Scikit-Learn

First we load the required libraries and functions.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix 
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

pandas is used for loading the data and is a powerful library for data wrangling. If you are not familiar with pandas, check out the tutorials on the pandas project website. numpy is the underlying numerical library for pandas and scikit-learn. seaborn and matplotlib are used for visualisation.

Load and visualize the data

First we load the data and try to get an overview of the data.

We load the CSV file with the command read_csv and store it as a data frame in memory.

credit_card = pd.read_csv('../input/creditcard.csv')

Next we try to get an overview of the fraud vs. non-fraud distribution. We are going to use the seaborn countplot function to produce a bar chart.

f, ax = plt.subplots(figsize=(7, 5))
sns.countplot(x='Class', data=credit_card)
_ = plt.title('# Fraud vs NonFraud')
_ = plt.xlabel('Class (1==Fraud)')

The bar for the fraud cases is not even visible in the chart: almost all transactions are non-fraudulent. Such a problem is called an imbalanced class problem.

99.8% of all transactions are non-fraudulent. The simplest classifier would always predict no fraud and would be correct in almost all cases. Such a classifier would have a very high accuracy but would be quite useless.
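The numbers behind this statement can be reproduced with a few lines of arithmetic. This is only a sketch based on the class counts quoted in the dataset description; the variable names are made up for illustration.

```python
# Class counts quoted in the dataset description.
n_total = 284_807
n_fraud = 492

fraud_share = n_fraud / n_total

# A classifier that always predicts "no fraud" is right for every
# genuine transaction and wrong for every fraud.
baseline_accuracy = (n_total - n_fraud) / n_total
baseline_recall = 0 / n_fraud  # it never flags a single fraud

print(f"Fraud share:       {fraud_share:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

The accuracy looks impressive, but the recall shows that not a single fraud would be caught.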

For such imbalanced classes we could use over- or undersampling methods to try to balance the classes (see imbalanced-learn for example: https://github.com/scikit-learn-contrib/imbalanced-learn), but this is out of the scope of today's post. We will come back to this in a later post.

As accuracy is not very informative in this case, the AUC (area under the curve) is a better metric to assess the model quality. In a two-class classification problem the AUC score is equal to the probability that our classifier assigns a higher fraud score to a randomly chosen fraudulent transaction than to a randomly chosen genuine one. Random guessing would yield a probability of 50%.
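The probabilistic reading of the AUC can be checked numerically. The following is a minimal sketch on synthetic scores (an assumption for illustration, not the credit card data): it compares roc_auc_score with the fraction of positive/negative pairs in which the positive example gets the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores (illustration only): positives tend to score higher.
neg_scores = rng.normal(loc=0.0, scale=1.0, size=2000)
pos_scores = rng.normal(loc=1.5, scale=1.0, size=50)

y_true = np.concatenate([np.zeros_like(neg_scores), np.ones_like(pos_scores)])
y_score = np.concatenate([neg_scores, pos_scores])

auc = roc_auc_score(y_true, y_score)

# Fraction of (positive, negative) pairs in which the positive is ranked
# higher; ties would count as one half.
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

print(auc, pairwise)
```

Both numbers should agree up to floating point error.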

We now create the feature matrix X and the target vector y. We drop the column Class from the data frame and store the result in a new data frame X, and we select the column Class as our vector y.

X = credit_card.drop(columns='Class', axis=1)
y = credit_card.Class.values

Due to the construction of the dataset (PCA-transformed features, which are uncorrelated with each other by construction), we don't have any highly correlated features. Multicollinearity could cause problems in a logistic regression.

To test for multicollinearity one could look at the correlation matrix (which works only for non-categorical features, the case we have today), run partial regressions and compare the standard errors, or use the R^2 values of such regressions to calculate variance inflation factors (VIF).


corr = X.corr()

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
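
The variance inflation factors mentioned above can be sketched with plain numpy. The helper variance_inflation_factors is a hypothetical name introduced here, and the data is synthetic; the sketch relies on the fact that the VIF of feature i, 1 / (1 - R_i^2), equals the i-th diagonal entry of the inverse correlation matrix.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of the 2-D array X.

    VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i
    on all other columns. This equals the i-th diagonal entry of the
    inverse correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)  # almost a copy of column a

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

A VIF close to 1 indicates an unproblematic feature, while the nearly duplicated columns a and c get very large values. A common rule of thumb treats values above 5 or 10 as a warning sign.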

Short reminder of Logistic Regression

In logistic regression the logits (the logs of the odds) are assumed to be a linear function of the features

$L=\log(\frac{P(Y=1)}{1-P(Y=1)}) = \beta_0 + \sum_{i=1}^n \beta_i X_i.$

Solving this equation for $p=P(Y=1)$ yields

$p = \frac{\exp(L)}{1+\exp(L)}.$
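
As a quick numerical check, the logit and its inverse can be written as two small functions. This is a sketch with made-up function names (logit, sigmoid), not code from the original notebook.

```python
import numpy as np

def logit(p):
    """Log of the odds p / (1 - p)."""
    return np.log(p / (1.0 - p))

def sigmoid(L):
    """Inverse of the logit: exp(L) / (1 + exp(L))."""
    return np.exp(L) / (1.0 + np.exp(L))

p = np.array([0.01, 0.25, 0.5, 0.9])
L = logit(p)
p_back = sigmoid(L)

print(L)
print(p_back)
```

Applying sigmoid to the logits recovers the original probabilities, and a logit of zero corresponds to a probability of 50%.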

The parameters $\beta_i$ can be estimated by maximum likelihood estimation (MLE). For $m$ given observations $Y_j$ with predicted probabilities $p_j = P(Y_j=1)$, the likelihood is

$lkl = \prod_{j=1}^m p_j^{Y_j}(1-p_j)^{1-Y_j}.$

Finding the maximum of the likelihood is equivalent to minimizing the negative logarithm of the likelihood (negative log-likelihood),

$-\log(lkl) = -\sum_{j=1}^m \left[ Y_j \log(p_j) + (1-Y_j) \log(1-p_j) \right],$

which is numerically more stable. The negative log-likelihood has the same form as the cross-entropy error function for the discrete case.

So finding the maximum likelihood estimator is the same problem as minimizing the average cross entropy error function.
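
This equivalence can be illustrated with a handful of toy numbers. The labels and probabilities below are made up; the sketch compares the hand-computed average negative log-likelihood with sklearn's log_loss.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities (made up for illustration).
y = np.array([0, 0, 1, 0, 1])
p = np.array([0.1, 0.3, 0.8, 0.05, 0.6])

# Negative log-likelihood, summed over all observations.
neg_log_likelihood = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Dividing by the number of observations gives the average cross entropy,
# which is what sklearn's log_loss reports by default.
avg_cross_entropy = neg_log_likelihood / len(y)

print(avg_cross_entropy, log_loss(y, p))
```

Since dividing by the number of observations does not move the minimum, both objectives lead to the same parameters.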

By default, scikit-learn (in the version used for this post) uses a coordinate descent algorithm (the liblinear solver) to find the minimum of the L2-regularized version of the loss function (see http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression). Releases from 0.22 onwards default to the lbfgs solver instead.

The main difference between L1 (Lasso) and L2 (Ridge) regularization is that L1 prefers a sparse solution (the higher the regularization parameter, the more parameters will be zero), while L2 enforces small parameter values.
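
The sparsity effect can be made visible on synthetic data. This is a sketch under assumptions: the data comes from make_classification, and the value C=0.05 (a small C means strong regularization) is picked only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features and 17 pure noise features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# Small C means strong regularization. liblinear supports both penalties.
l1 = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.05, solver='liblinear').fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print('Coefficients exactly zero with L1:', n_zero_l1)
print('Coefficients exactly zero with L2:', n_zero_l2)
```

With the L1 penalty most noise features should end up with a coefficient of exactly zero, while the L2 penalty only shrinks them towards zero.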

Train the model

Training and test set

First we split our data set into a train and a test set by using the function train_test_split.

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y)
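
A small sketch of what the default split does: without a test_size argument, train_test_split puts a quarter of the rows into the test set. The example splits plain row indices instead of the real data, and it passes random_state directly, which is an alternative way to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 284_807          # number of transactions in the data set
indices = np.arange(n_samples)

# With no test_size given, a quarter of the rows goes into the test set.
train_idx, test_idx = train_test_split(indices, random_state=42)

print(len(train_idx), len(test_idx))
```

The resulting sizes (213,605 and 71,202 rows) match the totals of the confusion matrices reported in the section on training and test scores.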

Model definition

scaler = StandardScaler()
lr = LogisticRegression()
model1 = Pipeline([('standardize', scaler),
                    ('log_reg', lr)])
model1.fit(X_train, y_train)

As preparation we standardize our features to have zero mean and unit standard deviation, which improves the convergence of gradient descent algorithms. We use the class StandardScaler. It has the method fit_transform(), which learns the mean $\mu_i$ and standard deviation $\sigma_i$ of each feature $i$ and returns a standardized version $\frac{x_i - \mu_i}{\sigma_i}$. We learn the mean and standard deviation on the training data and can apply the same standardization to the test set with the method transform().
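
What the scaler does can be reproduced by hand. The following sketch uses random toy data (an assumption for illustration) and checks that transform() on the test set applies the mean and standard deviation learned on the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 2))  # toy data
X_test = rng.normal(loc=5.0, scale=3.0, size=(50, 2))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # learns mu and sigma, then scales
X_test_std = scaler.transform(X_test)         # reuses the training mu and sigma

# The same computation by hand (StandardScaler uses the population std, ddof=0).
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_test_manual = (X_test - mu) / sigma

print(np.allclose(X_test_std, X_test_manual))
```

Note that the standardized test set will in general not have exactly zero mean, because the statistics come from the training data only.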

The logistic regression is implemented in the class LogisticRegression; for now we use the default parameterization. The model can be fitted using the method fit(). After fitting, the model can be used to make predictions with predict() or to return the estimated class probabilities with predict_proba().

We combine both steps into a Pipeline. The pipeline performs both steps automatically. When we call the method fit() of the pipeline, it invokes the method fit_transform() for all but the last step and the method fit() of the last step, which is equivalent to lr.fit(scaler.fit_transform(X_train), y_train).

If we invoke the method predict() of the pipeline, it is equivalent to lr.predict(scaler.transform(X_train)).
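
The equivalence can be verified on a small synthetic data set. This is only a sketch: the data comes from make_classification, and X_new is a made-up name for rows the model has not seen during fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_new = X[:300], X[300:]
y_train = y[:300]

# Pipeline version.
pipe = Pipeline([('standardize', StandardScaler()),
                 ('log_reg', LogisticRegression())])
pipe.fit(X_train, y_train)

# The same two steps spelled out by hand.
scaler = StandardScaler()
lr = LogisticRegression()
lr.fit(scaler.fit_transform(X_train), y_train)

same = np.array_equal(pipe.predict(X_new),
                      lr.predict(scaler.transform(X_new)))
print('Identical predictions:', same)
```

The main benefit of the pipeline is that the scaling can never be forgotten or accidentally fitted on the test data.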

Training score and Test score

confusion_matrix() returns the confusion matrix $C$, where $C_{0,0}$ are the true negatives (TN) and $C_{0,1}$ the false positives (FP); the second row holds the false negatives (FN, $C_{1,0}$) and the true positives (TP, $C_{1,1}$). We use the function accuracy_score() to calculate the accuracy of our model on the train and test data. We see that the accuracy is quite high (99.9%), which is expected in such an imbalanced class problem. With the function roc_auc_score() we can get the area under the receiver operating characteristic curve (AUC) for our simple model.

y_train_hat = model1.predict(X_train)
y_train_hat_probs = model1.predict_proba(X_train)[:,1]
train_accuracy = accuracy_score(y_train, y_train_hat)*100
train_auc_roc = roc_auc_score(y_train, y_train_hat_probs)*100
print('Confusion matrix:\n', confusion_matrix(y_train, y_train_hat))
print('Training accuracy: %.4f %%' % train_accuracy)
print('Training AUC: %.4f %%' % train_auc_roc)

Confusion matrix:
 [[213200     26]
 [   137    242]]
Training accuracy: 99.9237 %
Training AUC: 98.0664 %

y_test_hat = model1.predict(X_test)
y_test_hat_probs = model1.predict_proba(X_test)[:,1]
test_accuracy = accuracy_score(y_test, y_test_hat)*100
test_auc_roc = roc_auc_score(y_test, y_test_hat_probs)*100
print('Confusion matrix:\n', confusion_matrix(y_test, y_test_hat))
print('Test accuracy: %.4f %%' % test_accuracy)
print('Test AUC: %.4f %%' % test_auc_roc)

Confusion matrix:
 [[71077    12]
 [   45    68]]
Test accuracy: 99.9199 %
Test AUC: 97.4810 %

Our model is able to detect 68 fraudulent transactions out of 113 (a recall of 60%) and produces 12 false positives (<0.02%) on the test data.
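
These figures follow directly from the entries of the test confusion matrix. A short sketch that recomputes them (the variable names are chosen for illustration):

```python
import numpy as np

# Test confusion matrix reported above: rows are the true classes,
# columns the predicted classes.
C = np.array([[71077, 12],
              [45, 68]])
tn, fp, fn, tp = C.ravel()

recall = tp / (tp + fn)               # share of frauds that were detected
precision = tp / (tp + fp)            # share of alarms that were real frauds
false_positive_rate = fp / (fp + tn)  # share of genuine transactions flagged
accuracy = (tp + tn) / C.sum()

print(f"recall={recall:.4f} precision={precision:.4f} "
      f"fpr={false_positive_rate:.6f} accuracy={accuracy:.6f}")
```

The precision is not mentioned in the text above: 68 of the 80 alarms raised on the test set are real frauds.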

To visualize the receiver operating characteristic (ROC) curve we use the function roc_curve. It returns the false positive rate (the probability of a false alarm) and the true positive rate (recall) for a range of different thresholds. The curve shows the trade-off between recall (detecting fraud) and the false alarm probability.

If we classify all transactions as fraud, we would have a recall of 100% but also the highest possible false alarm rate (100%). The naive way to minimize the false alarm probability is to classify all transactions as genuine, which in turn gives a recall of 0%.


fpr, tpr, thresholds = roc_curve(y_test, y_test_hat_probs, drop_intermediate=True)

sns.set()  # apply the seaborn style before the figure is created
f, ax = plt.subplots(figsize=(9, 6))
_ = plt.plot(fpr, tpr, [0,1], [0, 1])
_ = plt.title('AUC ROC')
_ = plt.xlabel('False positive rate')
_ = plt.ylabel('True positive rate')

plt.savefig('auc_roc.png', dpi=600)

Our model classifies all transactions with a fraud probability >= 50% as fraud. If we choose a higher threshold, we could reach a lower false positive rate but would also miss more fraudulent transactions. If we choose a lower threshold, we can catch more fraud but need to investigate more false positives.

Depending on the costs of each type of error, it can make sense to select another threshold.
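
One way to make this concrete is to assign a cost to each type of error and scan a grid of thresholds. The sketch below uses synthetic probabilities and hypothetical costs (a missed fraud is assumed to be 50 times as expensive as a false alarm); in practice both would come from the model and from the business side.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic labels and fraud probabilities (illustration only).
y_true = (rng.random(5000) < 0.02).astype(int)
probs = np.clip(rng.normal(0.1 + 0.6 * y_true, 0.15), 0.0, 1.0)

# Hypothetical costs: a missed fraud is 50 times as expensive as a false alarm.
COST_FALSE_NEGATIVE = 50.0
COST_FALSE_POSITIVE = 1.0

def total_cost(threshold):
    """Total cost of all errors when flagging probabilities >= threshold."""
    y_hat = (probs >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_hat == 0))
    fp = np.sum((y_true == 0) & (y_hat == 1))
    return COST_FALSE_NEGATIVE * fn + COST_FALSE_POSITIVE * fp

thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]

print('Cheapest threshold:', best_threshold)
print('Cost at 0.5:', total_cost(0.5), 'cost at best:', costs.min())
```

The more expensive a missed fraud is relative to a false alarm, the lower the cheapest threshold tends to be.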

If we set the threshold to 90%, the recall decreases from 60% to 45%, while the false positive rate stays the same. We can see that our model assigns some non-fraudulent transactions a very high probability of being fraud.

y_hat_90 = (y_test_hat_probs > 0.90 )*1
print('Confusion matrix:\n', confusion_matrix(y_test, y_hat_90))
print(classification_report(y_test, y_hat_90, digits=6))

If we lower the threshold to 10%, we can detect around 78% of all fraud cases but roughly double our false positives (now 25 false alarms):

Confusion matrix:
 [[71064    25]
 [   25    88]]
             precision    recall  f1-score   support

          0     0.9996    0.9996    0.9996     71089
          1     0.7788    0.7788    0.7788       113

avg / total     0.9993    0.9993    0.9993     71202

Where to go from here?

We just scratched the surface of sklearn and logistic regression. For example, we could spend much more time on

feature selection / engineering (which is a bit hard without any background information about the features),
we could try techniques to counter the data imbalance and
we could use cross-validation to fine-tune the hyperparameters or
try a different regularization (Lasso/Elastic Net) or
try a different optimizer (stochastic gradient descent or mini-batch SGD),
adjust class weights to move the decision boundary (make missed frauds more expensive in the loss function)
and finally we could try different classifier models in sklearn like decision trees, random forests, kNN, naive Bayes or support vector machines.
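
As a taste of one of these ideas, here is a minimal sketch of the class weight option. It uses synthetic imbalanced data rather than the credit card data, and class_weight='balanced' is just one possible choice of weights.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with roughly 2% positives (illustration only).
X, y = make_classification(n_samples=20000, n_features=10, n_informative=4,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print('Recall without class weights:', recall_plain)
print('Recall with balanced weights:', recall_weighted)
```

Weighting the rare class more heavily typically raises the recall at the price of more false alarms, similar to lowering the decision threshold.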

But for now we will stop here. In the next part we will implement the logistic regression model with stochastic gradient descent in TensorFlow and then extend it to a neural net; we will come back to these points at a later time. In the meantime, feel free to play with the notebook, change the parameters and see how the model changes.

So long…

Jupyter notebooks – a Swiss Army Knife for Quants

A blog about quantitative finance, data science in fraud detection, machine and deep learning by Matthias Groncki
