Apriori Association Analysis in Python

Problem Description

This kind of problem typically arises when looking for cross-selling relationships between products. The classic example: Walmart placed beer and diapers together and saw higher sales of both. The story behind it is that wives asked their husbands to buy diapers for the baby, and the husbands would pick up the beer they liked while they were there.
Data usually holds many more such associations waiting to be discovered, and association rule learning exists to find them.

Association rule learning is a method for discovering interesting relationships between variables in large databases. Its goal is to identify strong rules found in a database using measures of interestingness.
These relationships come in two forms: frequent item sets and association rules.

A frequent item set (frequent item sets) is a set of items that often appear together.
An association rule (association rules) suggests that a strong relationship may exist between two items.

Basic Concepts of Association Rule Learning

Support and Confidence
Support
$$
support(X,Y)=P(X,Y)=\frac{number(X,Y)}{number(\text{all samples})}
$$
Confidence

$$
confidence(X⇒Y)=P(Y∣X)=\frac{P(X,Y)}{P(X)}=\frac{number(X,Y)}{number(X)}
$$

An example

| Transaction ID | Items purchased |
| --- | --- |
| 0 | 牛奶 (milk), 洋葱 (onion), 肉豆蔻 (nutmeg), 芸豆 (kidney beans), 鸡蛋 (egg), 酸奶 (yogurt) |
| 1 | 莳萝 (dill), 洋葱 (onion), 肉豆蔻 (nutmeg), 芸豆 (kidney beans), 鸡蛋 (egg), 酸奶 (yogurt) |
| 2 | 牛奶 (milk), 苹果 (apple), 芸豆 (kidney beans), 鸡蛋 (egg) |
| 3 | 牛奶 (milk), 独角兽 (unicorn), 玉米 (corn), 芸豆 (kidney beans), 酸奶 (yogurt) |
| 4 | 玉米 (corn), 洋葱 (onion), 芸豆 (kidney beans), 冰淇淋 (ice cream), 鸡蛋 (egg) |

(The Chinese item names are kept because they are the literal strings used in the code and output below.)

Support of 牛奶 (milk): it appears in 3 of the 5 transactions.
$$
support(\text{牛奶})=P(\text{牛奶})=\frac{number(\text{牛奶})}{number(\text{all samples})}=\frac{3}{5}=0.6
$$

Confidence of the rule (牛奶 -> 鸡蛋), i.e. milk implies egg:
$$
confidence(X⇒Y)=P(Y∣X)=\frac{P(X,Y)}{P(X)}=\frac{number(X,Y)}{number(X)}
$$

牛奶 and 鸡蛋 appear together in 2 of the 3 transactions that contain 牛奶, so:

$$
confidence(\text{牛奶}⇒\text{鸡蛋})=\frac{number(\text{牛奶},\text{鸡蛋})}{number(\text{牛奶})}=\frac{2}{3}\approx0.67
$$

Beyond these two measures there is a third, lift: how much selling one item raises the sale rate of another.
$$
lift(X⇒Y)=\frac{confidence(X⇒Y)}{support(Y)}
$$

$$
lift(\text{牛奶}⇒\text{鸡蛋})=\frac{2/3}{4/5}\approx0.83
$$

A lift above 1 means buying X has a positive effect on sales of Y; a lift below 1 means buying X actually depresses sales of Y.
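The numbers above are easy to verify with a few lines of plain Python. The `support` helper below is not part of the article's implementation, just a hypothetical checker over the five transactions from the table:

```python
# Verify the support/confidence/lift numbers from the worked example.
dataset = [['牛奶', '洋葱', '肉豆蔻', '芸豆', '鸡蛋', '酸奶'],
           ['莳萝', '洋葱', '肉豆蔻', '芸豆', '鸡蛋', '酸奶'],
           ['牛奶', '苹果', '芸豆', '鸡蛋'],
           ['牛奶', '独角兽', '玉米', '芸豆', '酸奶'],
           ['玉米', '洋葱', '芸豆', '冰淇淋', '鸡蛋']]

def support(items, transactions):
    # fraction of transactions containing every item in `items`
    return sum(set(items) <= set(t) for t in transactions) / len(transactions)

sup_milk = support(['牛奶'], dataset)                 # 3/5 = 0.6
conf = support(['牛奶', '鸡蛋'], dataset) / sup_milk  # (2/5)/(3/5) = 2/3
lift = conf / support(['鸡蛋'], dataset)              # (2/3)/(4/5) ≈ 0.83
print(sup_milk, conf, lift)
```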

The Apriori Algorithm

In principle, computing support and confidence is simple: enumerate every combination and count. But there are $2^N-1$ possible item sets over $N$ items, so brute-force enumeration quickly becomes intractable.
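To see the blow-up concretely, even the five items of the toy example already yield $2^5-1=31$ candidate item sets to scan. This is illustrative counting only, not part of the algorithm:

```python
from itertools import combinations

# Every non-empty subset of N items is a candidate itemset: 2**N - 1 of them.
items = ['牛奶', '洋葱', '芸豆', '鸡蛋', '酸奶']
candidates = [c for k in range(1, len(items) + 1)
              for c in combinations(items, k)]
print(len(candidates))  # 31 == 2**5 - 1
```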

The Apriori algorithm is an efficient way of finding the frequent item sets. It rests on two facts:

If an item set is frequent, then all of its subsets are frequent.

If an item set is infrequent, then all of its supersets are infrequent.

As an example of the second point: if no frequent item set exists between A and B, then none exists between them and C either.

So if the item set {A,B} is infrequent, every superset of {A,B}, such as {A,B,C} and {A,B,D}, is infrequent too and can be skipped without ever being counted.
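These two facts are what make level-wise pruning possible. The sketch below (a hypothetical `prune` helper, not from the article's code) discards a size-k candidate as soon as any of its (k-1)-subsets is known to be infrequent:

```python
from itertools import combinations

def prune(candidates, frequent_prev):
    # Apriori property: if any (k-1)-subset of a candidate is infrequent,
    # the candidate itself cannot be frequent, so drop it unscanned.
    return [c for c in candidates
            if all(frozenset(s) in frequent_prev
                   for s in combinations(c, len(c) - 1))]

frequent_2 = {frozenset({'A', 'B'}), frozenset({'A', 'C'})}  # {'B','C'} failed min support
print(prune([frozenset({'A', 'B', 'C'})], frequent_2))       # [] : pruned without counting
```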

Implementation

First, using the third-party library apyori:
from apyori import apriori
dataset = [['牛奶','洋葱','肉豆蔻','芸豆','鸡蛋','酸奶'],
        ['莳萝','洋葱','肉豆蔻','芸豆','鸡蛋','酸奶'],
        ['牛奶','苹果','芸豆','鸡蛋'],
        ['牛奶','独角兽','玉米','芸豆','酸奶'],
        ['玉米','洋葱','芸豆','冰淇淋','鸡蛋']]
rules = apriori(dataset, min_support=0.5)
for i in rules:
    print(i)

Output:
RelationRecord(items=frozenset({'洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'洋葱'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'牛奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'牛奶'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'芸豆'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'酸奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'酸奶'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'鸡蛋'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'鸡蛋'}), confidence=0.8, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'洋葱'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'洋葱'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'鸡蛋', '洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'鸡蛋', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'洋葱'}), items_add=frozenset({'鸡蛋'}), confidence=1.0, lift=1.25), OrderedStatistic(items_base=frozenset({'鸡蛋'}), items_add=frozenset({'洋葱'}), confidence=0.7499999999999999, lift=1.2499999999999998)])
RelationRecord(items=frozenset({'芸豆', '牛奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '牛奶'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'牛奶'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'牛奶'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '酸奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '酸奶'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'酸奶'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'酸奶'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '鸡蛋'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '鸡蛋'}), confidence=0.8, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'鸡蛋'}), confidence=0.8, lift=1.0), OrderedStatistic(items_base=frozenset({'鸡蛋'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '鸡蛋', '洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '鸡蛋', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'洋葱'}), items_add=frozenset({'芸豆', '鸡蛋'}), confidence=1.0, lift=1.25), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'鸡蛋', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'鸡蛋'}), items_add=frozenset({'芸豆', '洋葱'}), confidence=0.7499999999999999, lift=1.2499999999999998), OrderedStatistic(items_base=frozenset({'芸豆', '洋葱'}), items_add=frozenset({'鸡蛋'}), confidence=1.0, lift=1.25), OrderedStatistic(items_base=frozenset({'鸡蛋', '洋葱'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆', '鸡蛋'}), items_add=frozenset({'洋葱'}), confidence=0.7499999999999999, lift=1.2499999999999998)])
Next, a from-scratch implementation.

The plan:

  1. Generate all single-item item sets.
  2. Scan the transactions and drop every item set that falls below the minimum support.
  3. Combine the surviving items into two-element item sets.
  4. Scan the transactions again and drop the item sets below minimum support.
  5. Repeat, growing the item sets, until no candidates survive.
import pandas as pd
from itertools import combinations,permutations

Computing the final confidence and lift:

def calcflf(astring, support_series, minconfidence):
    """For one frequent item set (passed as 'a,b,c'), emit every rule
    items_base => items_add whose confidence clears minconfidence."""
    support_series[frozenset()] = 1   # empty base: confidence(∅ => X) = support(X)
    items = frozenset(astring.split(','))
    t = len(items)
    res = []
    ordered_statistics = []
    # every proper subset can serve as a rule base; note that permutations
    # yields each subset several times, so duplicate rules show up in the
    # output (combinations(items, num) would avoid that)
    for num in range(t):
        res += list((frozenset(items - set(l)), frozenset(l)) for l in permutations(items, num))

    for items_add, items_base in reversed(res):
        confidence = support_series[items] / support_series[items_base]
        lift = confidence / support_series[items_add]
        if confidence >= minconfidence:
            ordered_statistics.append({'items_base': items_base, 'items_add': items_add,
                                       'support': support_series[items],
                                       'confidence': confidence, 'lift': lift})

    return ordered_statistics

The main routine computes the supports, using frozenset as the hash key:

def Apriori(dataset, minconfidence=0.5, minsupport=0.5):
    # one-hot encode the transactions: rows = transactions, columns = items
    func = lambda x: pd.Series(1, index=x)
    data = pd.DataFrame(list(map(func, dataset))).fillna(0)

    # single-item supports; frozenset keys make lookups order-insensitive
    support_series = 1.0 * data.sum() / len(data)
    column = list(support_series[support_series >= minsupport].index)
    support_series = {frozenset([i]): v for i, v in dict(support_series).items() if v >= minsupport}
    # k is the size of the candidate item sets in the current round
    k = 1
    # all frequent item sets found so far (as 'a,b,c' strings)
    cols = list(column)
    res = []
    while len(column) > 1:
        k = k + 1
        # candidates: k-combinations of the items appearing in last round's survivors
        temp = set.union(*(set(i.split(',')) for i in column))
        temp = [frozenset(i) for i in combinations(temp, k)]

        # row-wise product of 0/1 columns is 1 only when all items co-occur
        secfunc = lambda i: data[list(i)].prod(axis=1, numeric_only=True)
        datatemp = pd.DataFrame(list(map(secfunc, temp)), index=[','.join(i) for i in temp]).T

        support_series_temp = 1.0 * datatemp.sum() / len(data)

        # keep only the candidates that meet the minimum support
        column = list(support_series_temp[support_series_temp >= minsupport].index)
        cols += column
        support_series_temp = {frozenset(i.split(',')): v for i, v in dict(support_series_temp).items() if v >= minsupport}

        support_series.update(support_series_temp)

    # with all frequent item sets in hand, derive confidence and lift
    for i in cols:
        res_temp = calcflf(i, support_series, minconfidence)
        if len(res_temp) > 0:
            res.append({i: res_temp})

    return res
minsupport = 0.5
minconfidence = 0.5

rules = Apriori(dataset, minconfidence, minsupport)
for i in rules:
    print(i)

Output:
{'牛奶': [{'items_base': frozenset(), 'items_add': frozenset({'牛奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'洋葱': [{'items_base': frozenset(), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆': [{'items_base': frozenset(), 'items_add': frozenset({'芸豆'}), 'support': 1.0, 'confidence': 1.0, 'lift': 1.0}]}
{'鸡蛋': [{'items_base': frozenset(), 'items_add': frozenset({'鸡蛋'}), 'support': 0.8, 'confidence': 0.8, 'lift': 1.0}]}
{'酸奶': [{'items_base': frozenset(), 'items_add': frozenset({'酸奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,牛奶': [{'items_base': frozenset({'牛奶'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'牛奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '牛奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,洋葱': [{'items_base': frozenset({'洋葱'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'鸡蛋,洋葱': [{'items_base': frozenset({'洋葱'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'鸡蛋'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset(), 'items_add': frozenset({'鸡蛋', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,鸡蛋': [{'items_base': frozenset({'鸡蛋'}), 'items_add': frozenset({'芸豆'}), 'support': 0.8, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.8, 'confidence': 0.8, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '鸡蛋'}), 'support': 0.8, 'confidence': 0.8, 'lift': 1.0}]}
{'芸豆,酸奶': [{'items_base': frozenset({'酸奶'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'酸奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '酸奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,鸡蛋,洋葱': [{'items_base': frozenset({'鸡蛋', '洋葱'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆', '洋葱'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'鸡蛋', '洋葱'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆', '鸡蛋'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset({'芸豆', '洋葱'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'芸豆', '鸡蛋'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset({'洋葱'}), 'items_add': frozenset({'芸豆', '鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'鸡蛋'}), 'items_add': frozenset({'芸豆', '洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'鸡蛋', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '鸡蛋', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}

Takeaways

  • map() for turning the transaction lists into a one-hot DataFrame
  • frozenset() as a dictionary key
  • combinations and permutations from itertools
  • DataFrame.prod() to flag rows where several items appear together
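A minimal illustration of the last two points, under the same pandas assumption as the code above:

```python
import pandas as pd

# frozenset keys are hashable and order-insensitive, so {milk, egg}
# and {egg, milk} hit the same dictionary entry.
support = {frozenset(['牛奶', '鸡蛋']): 0.4}
assert support[frozenset(['鸡蛋', '牛奶'])] == 0.4

# In a 0/1 transaction table, a row-wise product is 1 exactly when
# every column of the item set is present in that transaction.
data = pd.DataFrame({'牛奶': [1, 1, 0], '鸡蛋': [1, 0, 1]})
together = data[['牛奶', '鸡蛋']].prod(axis=1)
print(together.tolist())  # [1, 0, 0]
```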
