Association Rules

Introduction

Association rules help find associations among items in a dataset. For example, based on historical data, one can find out which items are usually bought together.

First, we generate the most frequent itemsets from the dataset. Then we generate the association rules from those frequent itemsets.
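The two steps above can be sketched in plain Python. This is a minimal illustration on made-up transactions, not the implementation behind this analysis: the function names and the toy data are assumptions, and the itemsets are enumerated by brute force rather than with fp-growth, apriori, or fpmax.

```python
from itertools import combinations

# Hypothetical toy transactions; each set is one group (e.g. one order).
transactions = [
    {"whole milk", "yoghurt"},
    {"whole milk", "yoghurt", "buns"},
    {"whole milk", "buns"},
    {"tropical fruit"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: brute-force search for itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def rules_by_confidence(itemsets, min_confidence):
    """Step 2: split each frequent itemset into antecedent -> consequent."""
    rules = []
    for itemset, sup in itemsets.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is always available here.
                confidence = sup / itemsets[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, confidence))
    return rules


itemsets = frequent_itemsets(transactions, min_support=0.5)
for a, c, conf in rules_by_confidence(itemsets, min_confidence=0.7):
    print(sorted(a), "->", sorted(c), round(conf, 2))
```

Brute-force enumeration grows exponentially with the number of distinct items, which is why dedicated algorithms are used on real datasets.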

Parameters

  • Group by: Columns used to group associated items together (e.g. order id for online orders)

  • Items: Items whose association rules are to be found (e.g. product id for products)

  • Frequent Method: Method used to generate frequent itemsets: fp-growth, apriori, or fpmax. Note that fp-growth produces the same frequent itemsets as apriori but is faster on average. fpmax generates only the maximal itemsets, which means that frequent patterns contained in larger frequent patterns are discarded.

  • Minimum value frequent itemsets: Minimum value for the support of a frequent itemset. The support is the fraction of groups (transactions) in the dataset that contain the itemset. Lowering this value generates more frequent itemsets but makes the analysis slower, and the associations found are weaker. Should be a value between 0 and 1. Default is 0.1.

  • Association metric: Metric used to generate the association rules: support, confidence, lift, leverage, or conviction. Default is confidence. An association rule has an antecedent (A) and a consequent (C). The consequent is an itemset associated with the antecedent. An association rule between A and C is written as A -> C.

    • The support metric computes the support of the combined itemset A ∪ C, which is the fraction of transactions in the whole dataset that contain both the antecedent and the consequent.
    • The confidence of a rule A -> C is the probability of seeing the consequent in a transaction given that it also contains the antecedent.
    • The lift metric measures the statistical dependency between the antecedent and the consequent. If A and C are independent, the lift will be exactly 1. A lift above 1 means that A and C occur together more often than expected under independence; the higher the score, the more dependent they are.
    • The conviction metric compares how often A would appear without C if the two were independent with how often the rule A -> C actually fails. The higher the conviction, the more likely the consequent is to be associated with the antecedent. If A and C are completely independent, the conviction will be 1.
    • The leverage is the difference between support(A ∪ C) and the frequency expected for A and C together if they were independent. If A and C are independent, the leverage will be exactly 0.

If you are unsure which association metric to use, use confidence, as it is the most natural way to measure the association between two items.

See https://en.wikipedia.org/wiki/Association_rule_learning#Useful_Concepts for more information.
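As a rough illustration of the metrics described above, the sketch below computes them from three support values using their standard textbook definitions. The function name `rule_metrics` and its arguments are hypothetical; the exact formulas used by this analysis are assumed to match these definitions.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Metrics of a rule A -> C.

    support_a  : support of the antecedent A
    support_c  : support of the consequent C
    support_ac : support of the combined itemset A ∪ C
    All three values are fractions between 0 and 1.
    """
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "support": support_ac,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# A and C independent: support(A ∪ C) == support(A) * support(C).
independent = rule_metrics(support_a=0.5, support_c=0.4, support_ac=0.2)

# A and C positively associated: they co-occur more often than expected.
associated = rule_metrics(support_a=0.5, support_c=0.4, support_ac=0.35)
```

In the independent case, lift and conviction come out as 1 and leverage as 0, which matches the reference values given in the list above.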

Note: When using the fpmax method, there is sometimes not enough data to compute association metrics other than support. In that case, the association metric automatically falls back to support.
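One likely reason for this fallback can be sketched as follows: once only maximal itemsets are kept, the support of their subsets is no longer available, and metrics such as confidence need the support of the antecedent on its own. The helper `maximal_only` and the support values below are hypothetical and only illustrate the idea.

```python
def maximal_only(itemsets):
    """Keep itemsets that are not a proper subset of another frequent itemset."""
    return {
        itemset: sup
        for itemset, sup in itemsets.items()
        if not any(itemset < other for other in itemsets)
    }


# Hypothetical frequent itemsets with their supports.
frequent = {
    frozenset({"whole milk"}): 0.75,
    frozenset({"yoghurt"}): 0.5,
    frozenset({"whole milk", "yoghurt"}): 0.5,
}

maximal = maximal_only(frequent)

# confidence(yoghurt -> whole milk) needs support({"yoghurt"}),
# but that itemset was discarded because it is contained in the pair.
antecedent = frozenset({"yoghurt"})
can_compute_confidence = antecedent in maximal
```

Here `maximal` contains only the pair, so its support is the only metric that can still be reported for it.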

  • Minimum association metric: Minimum value for the association metric. The lower this value, the more association rules are shown. Default is 0.001.
  • Filters (optional): Conditions on columns used to filter the original dataset. If set, only the matching subset of the original data is used in the analysis.

Result View

  • Association rules: Table showing the antecedents, consequents and the respective association metric for each association rule.
  • Frequent itemset: Shows the most frequent itemsets and a histogram of them.
  • List association: Shows a list of the associated items for each group (each distinct value of the Group by columns).
  • Association graph: Shows a graph of the association rules. The thicker the line, the higher the association metric. The graph displays only the first 100 association rules.
  • Chord diagram: Shows a chord diagram of the association rules.

Case Study

Imagine we are a marketing company and want to find out which products are most likely to be bought together in a shopping cart. We have a dataset of orders that could look like this:

member_number   item_description
1032            banana
655             whole milk
2               apples
1280            buns
1280            tropical fruit
1032            tropical fruit
655             other vegetables

As you can see, the orders can be grouped by member_number, which lets us find out which products are most likely to be bought together.
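The grouping step can be illustrated with a few lines of Python applied to the sample rows above. This is only a sketch of what grouping by member_number means; the helper name `group_items` is hypothetical and not part of the tool.

```python
from collections import defaultdict

# The sample rows from the table above: (member_number, item_description).
rows = [
    (1032, "banana"),
    (655, "whole milk"),
    (2, "apples"),
    (1280, "buns"),
    (1280, "tropical fruit"),
    (1032, "tropical fruit"),
    (655, "other vegetables"),
]


def group_items(rows):
    """Collect the set of items bought by each member_number."""
    baskets = defaultdict(set)
    for member_number, item in rows:
        baskets[member_number].add(item)
    return dict(baskets)


baskets = group_items(rows)
for member_number, items in baskets.items():
    print(member_number, sorted(items))
```

Each resulting set plays the role of one transaction when frequent itemsets are searched.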

To generate the association rules, we need to specify the following parameters:

  • The Group by columns should contain only member_number.
  • The Items parameter is set to item_description.
  • We leave all the other parameters as default.
_images/parameters.png

After pressing the Run button, the following will be displayed:

Association rules

A table containing all the generated association rules. Here we can see that people buying yoghurt will most likely also buy whole milk.

_images/association_rules.png

This table can be downloaded by clicking on the Download button.

_images/case_study_download.png

Frequent itemset

A histogram and a table showing the generated frequent itemsets. Here, for example, whole milk is the most frequent itemset. Its support value means that it appears in more than 40% of the orders.

_images/frequent_itemset.png

List association

A list of the associated items for each group. In this table, we can see all the items that each member_number bought.

_images/list_association.png

Association graph

A graph representing the association rules. The thicker the line, the higher the association metric. The graph displays only the first 100 association rules.

Just as we saw in the first table, people buying yoghurt will most likely also buy whole milk. We can confirm this by looking at the graph and seeing that there is a strong edge between yoghurt and whole milk.

_images/association_graph.png

Chord diagram

A chord diagram showing the association rules.

_images/chord.png