Artificial Intelligence ZA Launch
Thursday 19 May 2016 marked the kickoff of the first AI hackathon meetup in Johannesburg, South Africa. The meetup is planned to take place monthly, with a focus on growing a community around learning and practicing concepts in artificial intelligence.
The evening started with drinks and pizza where everyone got to socialize and network with their peers. Some interesting conversations around AI and innovation sprouted.
The first concept that the meetup group will be tackling is machine learning. The Machine Learning for Beginners GitHub repository includes a getting started guide for Python and R. The repository demonstrates a simple example for learning the difference between an apple and an orange based on weight and texture. It also includes an example for learning classifications of the Iris flower species; the Iris dataset is a popular dataset used for testing and learning in data science. The two examples serve as a quick getting started guide for beginners in machine learning and for people not familiar with the scikit-learn library for Python. scikit-learn provides a number of built-in classifiers and prediction algorithms for machine learning, which makes getting up and running straightforward.
The hackathon session included an exercise and dataset where the group was challenged to create an algorithm to classify password strength. The dataset consists of 50,000 randomly generated passwords for use as training data and 25,000 randomly generated passwords for use as testing data. The password strength is measured by detecting the use of uppercase characters, numbers, and special characters, together with the password's length. More information on the exercise can be found here.
The key aspect of the hackathon was to think about which properties of a password are most likely to be useful for machine learning, and then to prepare the data so that it can be consumed by a machine learning classifier. Data, as it exists, is often not suitable for machine learning. Removing redundant features, removing unnecessary features, and choosing the correct types of features are all important parts of the process.
The group split up into small teams of two or three where they discussed solutions and hacked away at some code. The outcome was interesting as different teams had different approaches to the problem, and conducted various experiments to learn more about the performance of the machine learning classifier.
Some teams chose Boolean features for the occurrence of numbers, special characters, and uppercase characters, whilst other teams counted the occurrences of each. Some teams represented the output as a percentage of accuracy, and other teams visualized the output in good-looking charts. One of the most interesting outcomes was the set of experiments that teams conducted to test the threshold for learning: they reduced the size of the training set and observed how the classifier's results changed.
Here’s some insight from a few of the participants:
The approach I used was to extract the data into a list of inputs and a list of targets, then to map a feature extraction function over each password to get a list of feature sets that lines up with the target array. The feature extraction function extracted four features from the data:
An integer value representing the length of the password minus 8,
Three bit/Boolean values representing whether the password contains an uppercase letter, a digit, and a special character.
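A minimal sketch of a feature extraction function along the lines described above. The function name extract_features is hypothetical, and it assumes that "special character" means any character in Python's string.punctuation; the participant's actual code may have differed.

```python
import string

def extract_features(password):
    """Return [length - 8, has_upper, has_digit, has_special] for one password."""
    # Assumption: a "special character" is anything in string.punctuation.
    return [
        len(password) - 8,
        int(any(c in string.ascii_uppercase for c in password)),
        int(any(c in string.digits for c in password)),
        int(any(c in string.punctuation for c in password)),
    ]

# Map the function over a list of passwords to build the input matrix.
passwords = ["password", "Passw0rd!", "abc"]
features = [extract_features(p) for p in passwords]
print(features)
```

Each row of features lines up with the password at the same index, so the list can be passed to a classifier alongside a target list of the same length.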
Three learning algorithms were trained on the input data set, and the resulting classifiers were then used to predict the strengths of the passwords in the testing set. The predictions were compared to the target values, with the error calculated as the difference between the prediction and the target. The accuracy (the share of predictions with error = 0) of each method is as follows:
Stochastic Gradient Descent: 62.552% accuracy (15638 / 25000)
K-nearest neighbours: 99.988% accuracy (24997 / 25000)
Random Forests: 100.000% accuracy (25000 / 25000)
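The accuracy calculation described above can be sketched as follows. The function name accuracy and the sample labels are made up for illustration; they are not the hackathon data.

```python
def accuracy(predictions, targets):
    """Fraction of predictions whose error (prediction - target) is zero."""
    errors = [p - t for p, t in zip(predictions, targets)]
    return sum(1 for e in errors if e == 0) / len(errors)

# Hypothetical strength labels, for illustration only.
predicted = [0, 1, 2, 2, 1]
actual = [0, 1, 2, 1, 1]
print(f"{accuracy(predicted, actual):.3%}")
```

The same ratio of correct predictions to test-set size gives the percentages listed above, for example 15638 / 25000 = 62.552%.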
We decided on the following stack: NumPy, pandas, re and obviously sklearn.
Since we needed to generate features, we decided that the apply function of a pandas Series would work out really well. Calling apply on a series runs a given function on each element of that series.
Given this, we wrote two functions. The first function parses the file by just opening it and iterating over each line. Each line is split using regex to strip out the password and the score. The second function, called count_chars, counts the number of characters in the password. It takes a password and a set of characters from the string module and returns the count of characters. We can then apply the function using the syntax below:
training.password.apply(lambda x: count_chars(x, string.ascii_uppercase))
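The write-up does not show the body of count_chars, so the version below is only an assumed implementation that matches the description: it takes a password and a set of characters and returns how many characters of the password fall in that set. A plain Python list stands in for the pandas Series so the sketch runs without third-party packages.

```python
import string

def count_chars(password, charset):
    """Count how many characters of password belong to charset."""
    return sum(1 for c in password if c in charset)

# Hypothetical sample passwords, for illustration only.
passwords = ["Passw0rd!", "HELLOworld", "abc123"]

# Equivalent in spirit to Series.apply: call the function on each element.
upper_counts = [count_chars(p, string.ascii_uppercase) for p in passwords]
digit_counts = [count_chars(p, string.digits) for p in passwords]
print(upper_counts, digit_counts)
```

With pandas, the same function would be passed to apply through a lambda, exactly as in the line above.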
From count_chars and the built-in len function we created our features.
We then used cross_val_score to evaluate how well the model generalizes and got an average score of 99%. Checking this against the test set, we got approximately the same accuracy. We ran some performance checks on the code and found the following:
Parsing either file takes about 5.72 µs
Creating all the features takes about 6.91 µs
-Bradley, Stuart, Kirton