Building a Deployable Bug Reports Classifier using Deep Neural Network

AICamp
11 min read · Jun 5, 2020
Under the Eiffel Tower's deep network (credit: author's own)

by Chris Foo, a student of class C20042810 (Deep Learning for Developers) at AICamp. Original post

Introduction

One of the professional aspirations I've had for a while is to actually get down and dirty writing some ML code and finally gain a decent grasp of DNNs (Deep Neural Networks) and TensorFlow. I've done some courses on Coursera and Udemy, but nothing beats working on a real project. On the side I've been working on a project I named Octopus (it must have been all the Tako-pachi, or Japanese octopus balls, I've been consuming), a side project at work to classify Jira bug tickets to engineering teams using traditional ML. I wanted to see if a DNN could be applied to a similar problem, in this case classifying Jira bug tickets by priority.

I got my chance to do just that when I enrolled in AICamp's online course Deep Learning for Developers, an intensive 4-week course that introduces the fundamentals of DNNs in a very hands-on manner. As part of the course I got to work on a capstone project, and voilà, there was my chance to actually dig in!

There were just 2 problems:

  1. As this project would use supervised learning, I would need labelled training data. I couldn't use the Octopus data as it is internal, and that data set was also way too small (hundreds of rows versus the tens of thousands needed for DNN training).
  2. All the examples in the AICamp course focused mainly on image classification, e.g. MNIST and CIFAR. There was nothing on text classification.

I also wanted to get hands-on experience with the following:

  1. Train the model using GPUs to get a feel for the speedup in training
  2. Package up the model in some deployable form (perhaps using pickle)
  3. Package up the predictor as an HTTP POST API using Flask
  4. Create a Docker image of the predictor package
  5. Create a CI pipeline to build the package and deploy it to the Docker registry

I was able to accomplish all of these; you can find the Jupyter notebooks, data, Dockerfile and GitLab YAML here:

https://gitlab.com/foohm71/octopus2

As one of my objectives was to train the model using GPUs, Google Colab was the natural choice. Google Colab also has TensorFlow 2 built in, along with some nifty enhancements to Jupyter Notebook that make working with Python and data science easier. Most of the notebooks in the repo are meant to be run on Google Colab.

The Data

It turns out someone had previously open-sourced Jira ticket data from several open source projects and imported it into a PostgreSQL DB. You can find it here: [1]. The team who extracted this data set also wrote a paper on it [2].

The next step was to install PostgreSQL and import the data.

Once that was done, I wanted a decently sized data set just to see how traditional ML algorithms would hold up. Unlike Octopus, we would be classifying the tickets by bug priority instead of by team, since there is no team data for open source projects.

I chose the Zookeeper project bug tickets for this:

select * from jira_issue_report where status = 'Closed' and type = 'Bug' and project = 'ZOOKEEPER' and not (description is null or description = '') and priority is not null

You can find the CSV for this in the repository, named JIRA_OPEN_DATA_ZOOKEEPER.csv. It has ~400 rows.

I also extracted ALL the tickets into a CSV, JIRA_OPEN_DATA_ALL.csv. This has about 200k rows.

However, I soon realized that this data set was way too large to process on Google Colab (which was a pity), so I created a subset of about 40k rows, JIRA_OPEN_DATA_LARGESET.csv, using the SQL query:

select * from jira_issue_report where status = 'Closed' and type = 'Bug' and (project = 'FLEX' or project = 'JBIDE' or project = 'RF' or project = 'SPR' or project = 'HBASE') and not (description is null or description = '') and priority is not null
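As a quick sanity check, the extracted CSV can be loaded and inspected with pandas. This is a minimal sketch, not from the original notebooks; the 'priority' column name is an assumption based on the queries above.

import pandas as pd

# Load the extracted CSV and inspect the target's class balance.
df = pd.read_csv('JIRA_OPEN_DATA_LARGESET.csv')
print(df.shape)                       # expect roughly 40k rows
print(df['priority'].value_counts())  # how skewed are the priorities?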

Text Classification using Traditional ML

Note: I only briefly cover this section as it is not the “meat” of this article.

In ExploratoryDataAnalysis.ipynb a very basic exploratory data analysis was conducted. A simple TextBlob Naive Bayes classifier was also used to baseline the classification, following the example found in [3].

A few things to note:

  1. Accuracy was ~49% with the TextBlob Naive Bayes classifier
  2. The classification targets are very imbalanced. Instead of balancing them, we wanted to see how good or bad the results would be as-is
  3. To obtain a TF-IDF matrix to use as features for traditional ML classification, I performed some word analysis to strip out redundant data, e.g. IDs, emails and URLs. Tokenization into words and bigrams was also performed (a rough sketch of this cleanup follows the list)
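Here is a rough sketch of that kind of cleanup; the exact regular expressions used in the notebook may differ.

import re

def clean_text(text):
    text = re.sub(r'\S+@\S+', ' ', text)              # strip emails
    text = re.sub(r'https?://\S+', ' ', text)         # strip URLs
    text = re.sub(r'\b[A-Z][A-Z]+-\d+\b', ' ', text)  # strip Jira-style IDs, e.g. ZOOKEEPER-123
    return re.sub(r'\s+', ' ', text).strip()          # collapse leftover whitespace

print(clean_text('See ZOOKEEPER-123 at https://example.org or mail dev@zookeeper.apache.org'))
# -> 'See at or mail'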

All of this was done using the NumPy, Pandas, scikit-learn and Matplotlib libraries.

The guide on how to do this can be found in [4] and [5].

Next I took the processed data set and ran it through Naive Bayes, Logistic Regression, SVM and Random Forest classifiers, and also performed some hyperparameter tuning on the Random Forest classifier, following the guidance in [6].
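A minimal sketch of that kind of workflow, assuming the cleaned text and labels from the earlier snippets (the actual features and parameters live in ModelAnalysis.ipynb and may differ):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# df is assumed to hold the cleaned text and the priority labels (see above)
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_description'], df['priority'], test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=20000)),  # words + bigrams
    ('clf', RandomForestClassifier(n_estimators=200)),  # swap in NB, LogReg, SVM, ...
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))  # mean accuracy on the held-out split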

You can find all of this in ModelAnalysis.ipynb in the repo. The resulting accuracy was only ~33%. (I did not attempt XGBoost at the time; perhaps later!)

Text Classification using DNN and TensorFlow

If you look at the literature, approaches to applying DNNs to text classification usually fall into the following categories:

  1. Using a Recurrent Neural Network (RNN), in particular an LSTM (Long Short-Term Memory) network
  2. Using a Convolutional Neural Network (CNN)
  3. Using word embeddings such as word2vec or GloVe
  4. Using a pre-trained model, e.g. BERT, as a base model and freezing/unfreezing layers

I decided to go with a CNN as I had some familiarity with it from the AICamp course. The key differences between a standard CNN for image classification and one for text classification are the following:

  1. Using an Embedding layer to map each word into a vector space
  2. Using 1D Convolution Layers instead of 2D ones

The rest of the network is a pretty standard CNN for image classification.

To get a good understanding of how CNNs can be applied to text classification, I recommend either watching Lukas Biewald's YouTube video "Text Classification Using Convolutional Neural Networks (2019)" (see [7]) or taking the Lazy Programmer's course on Udemy, "Tensorflow 2.0: Deep Learning and Artificial Intelligence" [8]. There is a section on applying both an LSTM and a CNN to binary text classification.

The key idea behind the Embedding layer is this: if we map each word into an N-dimensional vector space, then the "distance" between words (i.e. how each word relates to the others) can be "learned" by adjusting the location of each word in the vector space.

As for 1D convolution layers, the idea is the same as for 2D convolution layers, but instead of operating on an n x n kernel over pixels, we operate on a window of n consecutive word vectors. The key intuition behind this is that words have relationships with the words next to them (the same idea as using n-grams in traditional NLP).
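A tiny shape demo (not from the repo) makes this concrete: the Embedding layer turns a batch of padded integer sequences into sequences of D-dimensional word vectors, and Conv1D slides a window of 3 consecutive word vectors over them.

import numpy as np
import tensorflow as tf

T, V, D = 10, 1000, 20                        # sequence length, vocab size, embedding dim
x = np.random.randint(1, V, size=(2, T))      # a batch of 2 tokenized, padded "sentences"
emb = tf.keras.layers.Embedding(V + 1, D)(x)  # (2, 10, 20): one 20-d vector per word
out = tf.keras.layers.Conv1D(32, 3, activation='relu')(emb)
print(emb.shape, out.shape)                   # (2, 10, 20) (2, 8, 32)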

Data Preparation

Similar to traditional ML, where we convert the raw text to a TF-IDF matrix, we have to do something analogous here: convert each document into a sequence of word indices and pad the sequences with zeros to a common length.

# Convert sentences to sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_VOCAB_SIZE = 20000
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(df_train)
sequences_train = tokenizer.texts_to_sequences(df_train)
sequences_test = tokenizer.texts_to_sequences(df_test)

# get word -> integer mapping
word2idx = tokenizer.word_index
V = len(word2idx)

# pad sequences so that we get an N x T matrix
data_train = pad_sequences(sequences_train)

# get sequence length
T = data_train.shape[1]
data_test = pad_sequences(sequences_test, maxlen=T)

Model Exploration/Tuning

In DNNModelAnalysis.ipynb you will find the model exploration:

  1. I started off with the basic CNN model proposed for binary classification and modified it to handle multi-category classification
  2. I gradually added more convolutional and dense layers to help the model learn better
  3. I added Batch Normalization and Dropout to reduce over-fitting
  4. Finally, I added early stopping

The final model had the following:

  • Embedding layer
  • 6 1D conv layers with kernel size 3
  • MaxPooling layer
  • 4 1D conv layers with kernel size 3
  • MaxPooling layer
  • 2 1D conv layers with kernel size 3
  • Global Max Pooling
  • 4 Dense layers with dropout
  • All layers with Batch Normalization
  • Early stopping

The code:

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import (Input, Embedding, Conv1D, BatchNormalization,
                                     MaxPooling1D, GlobalMaxPooling1D, Dense, Dropout)
from tensorflow.keras.models import Sequential

D = 20  # embedding dimension

early_stopper = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

model = Sequential()
model.add(Input(shape=(T,)))
model.add(Embedding(V + 1, D))
model.add(Conv1D(32, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(32, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(32, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(32, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(32, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(32, 3, activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling1D(3))
model.add(Conv1D(64, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(64, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(64, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(64, 3, activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(BatchNormalization())
model.add(Conv1D(128, 3, activation='relu'))
model.add(BatchNormalization())
model.add(GlobalMaxPooling1D())
model.add(Dense(units=800, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(units=400, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(units=200, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(units=100, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(units=num_classes, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# data_train1 / training_labels1: the padded sequences and one-hot labels from the notebook
history = model.fit(data_train1, training_labels1, epochs=40, batch_size=8,
                    verbose=True, validation_split=.2, callbacks=[early_stopper])
plot_training_history(history, model, data_test1, test_labels1)

The accuracy achieved is ~54%. Comparing this with the traditional ML models is somewhat apples to oranges, though: the LARGESET data set would have been too large for the ML models, and the ZOOKEEPER data set too small for the DNN.

Building the Deployable Model

There are 3 components to this:

  1. Training and saving the model
  2. Building a simple HTTP POST API wrapper
  3. Building a simple CI pipeline

Note: there are several ways to build a deployable model for cloud deployment. Google Cloud (GCP) has a few well-documented options, as do Amazon (AWS), e.g. SageMaker, and Microsoft Azure.

Training and saving the model

DNNProductionTraining.ipynb contains the steps I took to train the model and save it to a .h5 file. It is designed to run on Google Colab using a GPU and a high-RAM runtime: without a GPU training takes far longer, and a normal-RAM runtime does not have enough memory to load the word sequence matrix.
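Before training, it is worth confirming that Colab actually assigned a GPU to the runtime; a standard TF2 check:

import tensorflow as tf

# An empty list here means the runtime is CPU-only; switch the Colab
# runtime type to GPU before training.
print(tf.config.list_physical_devices('GPU'))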

The model is the same as the final model described in the last section. The only difference is that this time it is trained on the whole LARGESET data set (no train/test split).

Once trained it is saved:

model.save("/content/gdrive/My Drive/Colab Notebooks/Octopus2/jira_open_data_classifier.h5", save_format='tf')

Then we run a quick prediction to test the model:

test_sentence = ['The hard coded host of the client can only let it run on the same host as the thrift server.']
test_seq = tokenizer.texts_to_sequences(test_sentence)
test_padded = pad_sequences(test_seq, maxlen=T)
p = model.predict_classes(test_padded)
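predict_classes returns a class index, so to report a priority name you need the label ordering used at training time. A hypothetical decoding step (the actual label encoding lives in the training notebook):

# Hypothetical: if the labels were one-hot encoded with pd.get_dummies, the
# column order of that frame defines the class order. The list below is
# illustrative only, not the ordering actually used.
priority_names = ['Blocker', 'Critical', 'Major', 'Minor', 'Trivial']
print(priority_names[p[0]])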

Building a simple HTTP POST API Wrapper

This is essentially the app.py in the repo.

This is just a Flask application that:

  1. Loads the model from jira_open_data_classifier.h5
  2. Extracts 'title' and 'description' from the JSON POST payload
  3. Runs the prediction and outputs the result (this code is similar to the test code in the previous section; a minimal sketch follows)
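A minimal sketch of such an app (the real app.py in the repo may differ; the fitted tokenizer and the sequence length T are assumed to be available, e.g. unpickled alongside the model):

from flask import Flask, request, jsonify
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)
model = tf.keras.models.load_model('jira_open_data_classifier.h5')

@app.route('/api_predict', methods=['POST'])
def api_predict():
    payload = request.get_json()
    # Combine title and description into one document, as at training time
    text = [payload['title'] + ' ' + payload['description']]
    seq = tokenizer.texts_to_sequences(text)  # tokenizer: the one fitted during training
    padded = pad_sequences(seq, maxlen=T)     # T: the training sequence length
    pred = model.predict(padded)
    return jsonify({'prediction': int(pred.argmax(axis=1)[0])})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)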

To run app.py, just run "python3 app.py".

The script test.sh contains the curl command to perform the POST:

curl -X POST -H "Content-Type: application/json" \
-d '{"title": "TestSizeBasedThrottler fails occasionally", "description": "Every now and then TestSizeBasedThrottler fails reaching the test internal timeouts. I think the timeouts (200ms) are too short for the Jenkins machines. On my (reasonably fast) machine I get timeout reliably when I set the timeout to 50ms, and occasionally at 100ms."}' \
"http://localhost:5000/api_predict"

Building a simple CI pipeline

One nice thing I discovered about GitLab is that it provides templates for .gitlab-ci.yml (for CI) and the Dockerfile. To create a CI pipeline, just click the "Set Up CI/CD" button on the main project page. It will create a basic blank .gitlab-ci.yml, and a dropdown labelled "Apply a Template" lets you choose from a variety of build pipeline types, e.g. Android, C++, etc. For this I chose "Docker" as I wanted to create a Docker container.

The “Set up CI/CD” button

How to apply the "Docker" template

Note: if you have never created a Docker container and published a Hello World container to the Docker registry, I suggest heading over to docker.com and following the quick start tutorial to create a simple Hello World container and publish it to the Docker registry.

Coming back to the .gitlab-ci.yml, you will notice the following environment variables:

  • $CI_REGISTRY_USER: the user ID you use to sign in to the registry (in this case docker.com)
  • $CI_REGISTRY_PASSWORD: the password you use to sign in to the registry (in this case docker.com)
  • $CI_REGISTRY: this can be removed as we're using the default Docker registry. It is needed if you use a separate Docker registry, e.g. on AWS

These environment variables need to be configured under “Settings -> CI / CD -> Variables”. Once there you can add those variables as key-value pairs.

Where to configure your CI/CD variables

If you have set everything up correctly (in fact, as soon as there is a .gitlab-ci.yml in the repo), GitLab will kick off a build. Go to "CI / CD -> Pipelines" and you should see some pipelines running.

The CI/CD Pipeline View

If you click on the pipeline hash and then on the running job, you'll see the console log for the job.

Console log of a successfully completed Job

Unfortunately the pipeline will most likely fail at this point, since the Dockerfile has not yet been created. To create it, go to the "Project Overview" and click the "New File" button (if you do not see the "New File" button, you can also click the "Changelog" button).

How to add a Dockerfile

You will then be able to select a template. I chose the Python one as a base.

The Dockerfile I created was:

FROM python:3.6

WORKDIR /usr/src/app

COPY requirements.txt /usr/src/app/
RUN pip install --no-cache-dir -r requirements.txt
COPY . /usr/src/app

# Inform Docker that the container is listening on the specified port at runtime.
EXPOSE 5000

# Run command
CMD ["python", "app.py"]

Once that's in place, the build should succeed, i.e. the Docker image is built and published to the registry.

References

[1] Jira Social Repository, Marco Ortu et al https://github.com/marcoortu/jira-social-repository

[2] The JIRA Repository Dataset, Marco Ortu et al, Sept 2015 https://www.researchgate.net/publication/301370380_The_JIRA_Repository_Dataset

[3] Tutorial: Building a Text Classification System, TextBlob https://textblob.readthedocs.io/en/dev/classifiers.html

[4] Text Analytics for Beginners using NLTK, Avinash Navlani, Dec 2019 https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk

[5] Understanding Random Forests Classifiers in Python, Avinash Navlani, May 2019 https://www.datacamp.com/community/tutorials/random-forests-classifier-python

[6] Hyperparameter Tuning the Random Forest in Python, Will Koehrsen, Jan 2018 https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

[7] Text Classification Using Convolutional Neural Networks (2019), Lukas Biewald https://youtu.be/8YsZXTpFRO0

[8] Tensorflow 2.0: Deep Learning and Artificial Intelligence, Lazy Programmer https://www.udemy.com/course/deep-learning-tensorflow-2/
