Classification

Introduction

Classification predicts categorical values and is one of the most widely used analytics in machine learning. An example of classification is classifying credit card applications into those with good credit, those with bad credit, or those that fall into a gray area, based on annual salary, outstanding debt, age, etc.

Actable AI uses the entire table as the source data and automatically splits it into three parts:

  • Train data: Rows in the table where both the predictors and the target are filled.
  • Prediction data: Rows where the target column is missing.
  • Validation data: Actable AI samples a part of the data to verify the reliability of the trained model. This part of the data is also used in the performance tuning stage if performance optimisation is selected. A minimal sketch of this split is shown below.
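
As a rough illustration of how such a split can look, the following sketch separates a pandas table on the target column. This is not Actable AI's internal code; the column names (with location as the target), the file name properties.csv, and the 20% validation fraction are assumptions for the example.

    import pandas as pd

    # Assumed example table: "location" is the predicted target column.
    df = pd.read_csv("properties.csv")

    # Prediction data: rows where the target column is missing.
    prediction_data = df[df["location"].isna()]

    # The remaining labelled rows are split into train and validation parts.
    labelled = df[df["location"].notna()]
    validation_data = labelled.sample(frac=0.2, random_state=0)  # assumed 20% validation
    train_data = labelled.drop(validation_data.index)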

Parameters

  • Predicted target: Choose one column whose missing values will be predicted.
  • Predictors: Columns that are used to predict the target.
  • Optimize for performance: If selected, Actable AI tries to achieve the best possible performance. However, the process takes more time to finish.
  • Explain predictions: If selected, an additional column is appended to the result stating the reason behind the prediction for each row.
  • Cross-Validation: If selected, Actable AI trains and tests the model on different portions of the data across several iterations, which gives a more reliable evaluation of the trained model (see the sketch after this list).
  • Validation percentage: The model is evaluated at the end on the validation dataset. By sliding this value, you can control the percentage of rows with a non-empty predicted target that is used for validation.
  • Biased Groups: Groups whose biased data distribution creates bias in other features of the dataset.
  • Debiased Features: Features that need to be debiased from the biased groups.
  • Extra Columns (optional): Columns that are not among the predictors but are displayed along with the returned results.
  • Filters (optional): Set conditions on columns to filter the original dataset. If set, only the matching subset of the original data is used in the analytics.
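
As a rough sketch of what cross-validation does, the example below scores a generic scikit-learn classifier on five different train/test portions of the labelled rows and averages the results. The estimator, the five folds, and the column names are assumptions for illustration; Actable AI selects and tunes its models automatically.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Labelled rows only (target present); "location" is the assumed target column.
    df = pd.read_csv("properties.csv")
    labelled = df[df["location"].notna()]

    # Numeric predictors only, to keep the sketch short (the categorical
    # neighborhood column would need encoding first).
    features = ["number_of_rooms", "number_of_bathrooms", "days_on_market", "rental_price", "sqft"]
    X, y = labelled[features], labelled["location"]

    # 5-fold cross-validation: the model is trained and tested on
    # different portions of the data, and the scores are averaged.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print("Mean validation accuracy:", scores.mean())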

Actable AI provides the debiasing feature to handle columns with a biased data distribution. Please refer to Debias to understand our debiasing algorithm.

Case Study

Imagine we are a real estate company and would like to evaluate the locations of our properties. Locations are categorised as great, good, or poor.

An example of the dataset could be:

number_of_rooms  number_of_bathrooms  days_on_market  neighborhood  rental_price  sqft  location
0                1                    10              south_side    2271          4848  great
1                1                    1               downtown      2167          674   good
1                1                    19              westbrae      1883          554   poor
0                1                    3               south_side    2431          529   great

Now we have added some new properties and would like to rate their locations automatically:

number_of_rooms  number_of_bathrooms  days_on_market  neighborhood  rental_price  sqft  location
2                1                    3               downtown      3359          818
2                1                    8               downtown      3305          771
1                1                    6               south_side    2284          333

We set our parameters as:

_images/setup2.png

Review Result

The result view contains a Prediction tab, a Performance tab, a Leaderboard tab, and a Table tab.

Prediction

The Prediction tab shows the prediction results for the rows that are missing the target value. The category with the highest confidence is shown in red, and the probability of each category is displayed. If Explain predictions is set, an Explanation column is appended. Both are covered in more detail when we explain the Details table.

Performance

In the Performance tab, we see that our model performed with 100% accuracy.

_images/performance_overall.png

To understand the performance in detail, Actable AI provides a confusion matrix as one of the breakdown evaluation metrics. The confusion matrix is computed from the held-out validation dataset. It shows what percentage of the data points with each actual category in the validation set are classified into each predicted category. The table below shows the confusion matrix for our example location-rating classification task. Take the actual rating Good as an example: on the held-out validation dataset, the model trained by Actable AI predicts all rows that originally have a good location as good.

_images/performance_confusion_matrix.png
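
For reference, a row-normalised confusion matrix like the one above can be computed from held-out predictions as in the sketch below. The toy labels and predictions are made up for illustration; in practice y_true and y_pred come from the validation set.

    import pandas as pd
    from sklearn.metrics import confusion_matrix

    labels = ["great", "good", "poor"]

    # Toy example: actual categories in the validation set and the model's predictions.
    y_true = ["great", "good", "poor", "good", "great", "poor"]
    y_pred = ["great", "good", "poor", "good", "great", "poor"]

    # normalize="true" divides each row by the number of actual examples in that
    # category, so each cell is the fraction of an actual category predicted as
    # each category (100% on the diagonal for a perfect model).
    cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
    print(pd.DataFrame(cm, index=labels, columns=labels))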

Actable AI does not only act as a model-training tool but also tries to provide the rationale behind the classification. Two more tables are provided: the Important Features table and the Details table.

The Important Features table shows that, according to the training, some features (columns in the data) are more important than others. In the display, a value of 0 means minimal importance and a value of 1 means maximum importance. For example, in our case study, days_on_market carries the most weight in the prediction (0.455).

_images/performance_important_features.png
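
A comparable 0-to-1 importance score can be approximated with permutation importance rescaled so that the top feature gets 1, as in the sketch below. This is only an approximation of the idea, assuming the column names from the case study; Actable AI's exact importance computation may differ.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    # Labelled rows with numeric predictors only (assumed column names).
    df = pd.read_csv("properties.csv").dropna(subset=["location"])
    features = ["number_of_rooms", "number_of_bathrooms", "days_on_market", "rental_price", "sqft"]
    X_train, X_val, y_train, y_val = train_test_split(
        df[features], df["location"], test_size=0.2, random_state=0
    )

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

    # Rescale so the most important feature has value 1, matching the display.
    importances = result.importances_mean / result.importances_mean.max()
    for name, score in sorted(zip(features, importances), key=lambda t: -t[1]):
        print(f"{name}: {score:.3f}")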

The Details table gives the rationale behind the classification decision made for each row. Several new columns are appended to the original table:

  • Probability columns: The N “target_value probability” columns indicate the probability of each category being predicted. In the screenshot below, for example, the model predicts that the second row of the table has a 99.67% chance of being a Good location. A sketch of how such columns can be produced follows the explanation example below.
  • Prediction result: The column target_prediction shows the predicted class.
  • Explanation: This column is only displayed when the Explain predictions option is enabled. It tells you why the prediction was made. The full explanation is shown when you hover your mouse over the cell.

The following example indicates that the model thinks the fourth property (4th row) should be categorised as a good location because 98% of similar examples are categorised as such. The similarity is computed mainly from days_on_market and neighborhood.

_images/explaination.png
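
The probability and prediction columns can be reproduced in spirit with predict_proba, as in the minimal sketch below. The column naming mirrors the “target_value probability” convention, but the model, features, and file name are assumptions, not Actable AI's actual pipeline.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("properties.csv")
    features = ["number_of_rooms", "number_of_bathrooms", "days_on_market", "rental_price", "sqft"]

    labelled = df[df["location"].notna()]
    to_predict = df[df["location"].isna()].copy()

    model = RandomForestClassifier(random_state=0).fit(labelled[features], labelled["location"])

    # One "<target_value> probability" column per class, plus the predicted class.
    probabilities = model.predict_proba(to_predict[features])
    for i, cls in enumerate(model.classes_):
        to_predict[f"{cls} probability"] = probabilities[:, i]
    to_predict["location_prediction"] = model.predict(to_predict[features])
    print(to_predict)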

Leaderboard

The Leaderboard tab shows the underlying models that were trained. Actable AI trains several state-of-the-art machine learning algorithms and uses the best-performing one to make the final predictions.

In the table below, we can see the following information:
  • The name of the model trained.
  • The validation score of the model.
  • The training time of the model.
  • The prediction time of the model.
_images/leaderboard.png
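
The idea behind the leaderboard can be sketched by training a few candidate models and recording their validation score, training time, and prediction time, as below. The candidate models shown here are placeholders for illustration; they are not necessarily the algorithms Actable AI trains.

    import time
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("properties.csv").dropna(subset=["location"])
    features = ["number_of_rooms", "number_of_bathrooms", "days_on_market", "rental_price", "sqft"]
    X_train, X_val, y_train, y_val = train_test_split(
        df[features], df["location"], test_size=0.2, random_state=0
    )

    rows = []
    for name, model in [
        ("RandomForest", RandomForestClassifier(random_state=0)),
        ("GradientBoosting", GradientBoostingClassifier(random_state=0)),
        ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ]:
        start = time.time()
        model.fit(X_train, y_train)            # training time
        fit_time = time.time() - start

        start = time.time()
        score = model.score(X_val, y_val)      # validation score (accuracy)
        predict_time = time.time() - start

        rows.append({"model": name, "score_val": score,
                     "fit_time": fit_time, "pred_time": predict_time})

    # Best validation score first, as on the leaderboard.
    print(pd.DataFrame(rows).sort_values("score_val", ascending=False))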

Table

The Table tab displays the original dataset.

Notes

In addition, for binary classification (when the target has only 2 distinct values), we also display a plot that shows the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR), known as the ROC curve. The True Positive Rate is the percentage of data points in the validation set with positive labels (1) that are correctly classified as positive. The False Positive Rate is the percentage of data points in the validation set with negative labels (0) that are incorrectly classified as positive. As we can choose a different confidence threshold to classify a data point as positive or negative (the default is 0.5), we get different TPR-FPR pairs depending on the chosen threshold. One might find the trade-off between TPR and FPR useful in different use cases.

_images/roc.png
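
For a binary target, the TPR-FPR pairs behind such a curve can be computed from validation-set probabilities with scikit-learn's roc_curve, as in the sketch below. The 0/1 target column high_value and the model are assumptions for illustration.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    # Assumed binary example: a 0/1 target column named "high_value".
    df = pd.read_csv("properties.csv").dropna(subset=["high_value"])
    features = ["number_of_rooms", "number_of_bathrooms", "days_on_market", "rental_price", "sqft"]
    X_train, X_val, y_train, y_val = train_test_split(
        df[features], df["high_value"], test_size=0.2, random_state=0
    )

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    scores = model.predict_proba(X_val)[:, 1]  # confidence for the positive class

    # Each confidence threshold gives one (FPR, TPR) pair; 0.5 is only the default cut-off.
    fpr, tpr, thresholds = roc_curve(y_val, scores)
    print("AUC:", roc_auc_score(y_val, scores))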