Naive Bayes Classification

Author: Administrator  Published: 2021-02-07 16:25

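scikit-learn provides three commonly used naive Bayes classifiers: GaussianNB (Gaussian naive Bayes), MultinomialNB (multinomial naive Bayes), and BernoulliNB (Bernoulli naive Bayes).
Brief introduction:
Gaussian naive Bayes: suited to continuous-valued features. Binning a feature such as height into classes (below 160 cm, 160-170 cm, and so on) is too coarse, so the model instead assumes that each feature follows a normal distribution within each class.
Multinomial naive Bayes: commonly used for text classification, where the features are words and the values are word counts.
Bernoulli naive Bayes: the features are again words, but each value records only whether the word occurs (1) or not (0) instead of how many times it occurs, so every occurrence carries equal weight.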
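
To make the distinction above concrete, here is a minimal sketch that pairs each classifier with the kind of data it expects. The height values, the three-word vocabulary, and the labels are made up purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements (hypothetical heights in cm) -> GaussianNB
heights = np.array([[158.0], [162.5], [175.0], [181.2]])
gnb = GaussianNB().fit(heights, y)

# Word counts per document (hypothetical 3-word vocabulary) -> MultinomialNB
counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]])
mnb = MultinomialNB().fit(counts, y)

# Word presence/absence derived from the same counts -> BernoulliNB
presence = (counts > 0).astype(int)
bnb = BernoulliNB().fit(presence, y)

print(gnb.predict([[160.0]]), mnb.predict([[1, 0, 0]]), bnb.predict([[1, 0, 0]]))
```

The same documents feed both MultinomialNB and BernoulliNB here; only the encoding of the feature values (counts versus 0/1) differs.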

Gaussian naive Bayes: sklearn.naive_bayes.GaussianNB(priors=None)

Building a simple model with the GaussianNB class

In [1]: import numpy as np
   ...: from sklearn.naive_bayes import GaussianNB
   ...: X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5],
   ...:               [1, 1], [2, 2], [3, 3]])
   ...: y = np.array([1, 1, 1, 1, 1, 2, 2, 2])
   ...: clf = GaussianNB()  # priors=None by default
   ...: clf.fit(X,y)
 
Out[1]: GaussianNB(priors=None)


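After fitting on the training set, inspect the attributes

priors parameter: the prior probability of each class label, as supplied by the user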

In [2]: clf.priors  # no output, because priors=None
In [3]: clf.set_params(priors=[0.625, 0.375])  # set the priors parameter
Out[3]: GaussianNB(priors=[0.625, 0.375])
In [4]: clf.priors  # returns the list of class priors that was set
Out[4]: [0.625, 0.375]
In [5]: clf.class_prior_
Out[5]: array([ 0.625, 0.375])
In [6]: type(clf.class_prior_)
Out[6]: numpy.ndarray

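class_prior_ attribute: like priors, it gives the prior probability of each class label. The difference is that priors returns whatever was passed in (here a list, or None), whereas class_prior_ is a NumPy array set by fit; when priors is None it holds the class frequencies observed in the training data.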
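
The relationship between priors and class_prior_ can be checked with a small sketch. It reuses the training data from above; the 0.5/0.5 priors in the second model are an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5],
              [1, 1], [2, 2], [3, 3]])
y = np.array([1, 1, 1, 1, 1, 2, 2, 2])

# No priors supplied: fit estimates them from the class frequencies
empirical = GaussianNB().fit(X, y)
print(empirical.priors, empirical.class_prior_)

# Priors supplied (arbitrary values): fit uses them as given
fixed = GaussianNB(priors=[0.5, 0.5]).fit(X, y)
print(fixed.priors, fixed.class_prior_)
```

In the first model class_prior_ equals class_count_ divided by its sum, which is why it matches the 5/8 and 3/8 split of the training labels.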

In [7]: clf.class_count_
Out[7]: array([ 5., 3.])
class_count_ attribute: the number of training samples observed for each class label
In [8]: clf.theta_
Out[8]:
array([[-3., -3.],
[ 2., 2.]])

theta_ attribute: the mean of each feature for each class label

In [9]: clf.sigma_
Out[9]:
array([[ 2.00000001, 2.00000001],
[ 0.66666667, 0.66666667]])


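sigma_ attribute: the variance of each feature for each class label (renamed var_ in scikit-learn 1.0; sigma_ was removed in 1.2)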
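
As a cross-check of theta_ and the variance attribute, the per-class statistics can be recomputed directly with NumPy. This is a sketch on the same training data; the hasattr fallback is only there so that it also runs on older releases that expose sigma_ instead of var_.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5],
              [1, 1], [2, 2], [3, 3]])
y = np.array([1, 1, 1, 1, 1, 2, 2, 2])
clf = GaussianNB().fit(X, y)

# Per-class mean and population variance, in the order of clf.classes_
means = np.array([X[y == c].mean(axis=0) for c in clf.classes_])
variances = np.array([X[y == c].var(axis=0) for c in clf.classes_])

# The fitted variance is var_ in newer releases and sigma_ in older ones
fitted_var = clf.var_ if hasattr(clf, "var_") else clf.sigma_

print(np.allclose(clf.theta_, means))
print(np.allclose(fitted_var, variances))
```

The comparison uses np.allclose rather than exact equality because the fitted variances include a tiny smoothing term.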


Methods

get_params(deep=True): returns a dictionary mapping parameter names (here priors) to their values

In [10]: clf.get_params(deep=True)
Out[10]: {'priors': [0.625, 0.375]}
In [11]: clf.get_params()
Out[11]: {'priors': [0.625, 0.375]}

set_params(**params): sets the estimator's parameters, here priors

In [3]: clf.set_params(priors=[ 0.625, 0.375])
Out[3]: GaussianNB(priors=[0.625, 0.375])

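fit(X, y, sample_weight=None): fits the model; X holds the feature vectors, y the class labels, and sample_weight an array of per-sample weights

In [12]: clf.fit(X, y, sample_weight=np.array([0.05, 0.05, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2]))  # give the samples different weights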
Out[12]: GaussianNB(priors=[0.625, 0.375])
In [13]: clf.theta_
Out[13]:
array([[-3.375, -3.375],
[ 2. , 2. ]])
In [14]: clf.sigma_
Out[14]:
array([[ 1.73437501, 1.73437501],
[ 0.66666667, 0.66666667]])

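With unequal sample weights, the mean and variance of feature 1 for class label 1 are computed as follows:

mean = ((-1*0.05)+(-2*0.05)+(-3*0.1)+(-4*0.1)+(-5*0.1))/(0.05+0.05+0.1+0.1+0.1) = -3.375

variance = ((-1+3.375)**2*0.05+(-2+3.375)**2*0.05+(-3+3.375)**2*0.1+(-4+3.375)**2*0.1+(-5+3.375)**2*0.1)/(0.05+0.05+0.1+0.1+0.1) = 1.734375

The reported 1.73437501 is slightly larger because GaussianNB adds a small smoothing term to every variance for numerical stability.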
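
The weighted mean and variance for class 1 can also be reproduced with np.average, which avoids writing out every term by hand. This sketch only covers feature 1 of the five class-1 samples and does not include the small smoothing term that GaussianNB adds.

```python
import numpy as np

x = np.array([-1.0, -2.0, -3.0, -4.0, -5.0])  # feature 1 of the class-1 samples
w = np.array([0.05, 0.05, 0.1, 0.1, 0.1])     # their sample weights

# Weighted mean, then weighted mean of squared deviations from it
mean = np.average(x, weights=w)
variance = np.average((x - mean) ** 2, weights=w)
print(mean, variance)
```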

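partial_fit(X, y, classes=None, sample_weight=None): incremental training. When the training set is too large to load into memory at once, split it into chunks and call partial_fit repeatedly to learn the model parameters online. The classes argument must be specified on the first call to partial_fit and can be omitted on later calls.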

In [18]: import numpy as np
...: from sklearn.naive_bayes import GaussianNB
...: X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5],
...:               [1, 1], [2, 2], [3, 3]])
...: y = np.array([1, 1, 1, 1, 1, 2, 2, 2])
...: clf = GaussianNB()  # priors=None by default
...: clf.partial_fit(X, y, classes=[1, 2],
...:                 sample_weight=np.array([0.05, 0.05, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2]))
...:
Out[18]: GaussianNB(priors=None)
In [19]: clf.class_prior_
Out[19]: array([ 0.4, 0.6])
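
The session above passes the whole data set in a single partial_fit call. A sketch closer to the chunked use case follows; splitting into even-indexed and odd-indexed rows is an arbitrary stand-in for reading the data piece by piece.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5],
              [1, 1], [2, 2], [3, 3]])
y = np.array([1, 1, 1, 1, 1, 2, 2, 2])

# Reference model trained on everything at once
full = GaussianNB().fit(X, y)

# Incremental model: classes is required on the first call only
inc = GaussianNB()
inc.partial_fit(X[0::2], y[0::2], classes=[1, 2])
inc.partial_fit(X[1::2], y[1::2])

print(np.allclose(inc.theta_, full.theta_))
print(inc.class_count_, full.class_count_)
```

The class means and counts agree with the one-shot fit, which is what makes chunked training a drop-in replacement when memory is the constraint.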

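predict(X): returns the predicted class label for each test sample
predict_proba(X): returns the predicted probability of each class label for each test sample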

In [21]: clf.predict_proba([[-6,-6],[4,5]])
Out[21]:
array([[ 1.00000000e+00, 4.21207358e-40],
[ 1.12585521e-12, 1.00000000e+00]])


predict_log_proba(X): returns the logarithm of the predicted probability of each class label for each test sample

In [22]: clf.predict_log_proba([[-6,-6],[4,5]])
Out[22]:
array([[ 0.00000000e+00, -9.06654487e+01],
[ -2.75124782e+01, -1.12621024e-12]])
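
The three prediction methods are closely related, which a short sketch can confirm. It refits an unweighted model on the training data, so the exact probabilities differ slightly from the session above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5],
              [1, 1], [2, 2], [3, 3]])
y = np.array([1, 1, 1, 1, 1, 2, 2, 2])
clf = GaussianNB().fit(X, y)

X_test = np.array([[-6, -6], [4, 5]])
proba = clf.predict_proba(X_test)
log_proba = clf.predict_log_proba(X_test)

# Each row of proba is a distribution over the classes
print(np.allclose(proba.sum(axis=1), 1.0))
# predict_log_proba is the elementwise log of predict_proba
print(np.allclose(np.exp(log_proba), proba))
# predict picks the class with the highest probability
print(clf.classes_[proba.argmax(axis=1)], clf.predict(X_test))
```

Working with predict_log_proba is preferable when probabilities are as small as the 1e-40 values shown above, since logs do not underflow.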

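score(X, y, sample_weight=None): returns the mean accuracy of the predictions for test samples X against the true labels y, optionally weighted by sample_weight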

In [23]: clf.score([[-6,-6],[-4,-2],[-3,-4],[4,5]],[1,1,2,2])
Out[23]: 0.75
In [24]: clf.score([[-6,-6],[-4,-2],[-3,-4],[4,5]],[1,1,2,2],sample_weight=[0.3
...: ,0.2,0.4,0.1])
Out[24]: 0.59999999999999998
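
The two scores can be reproduced by hand, which shows what sample_weight does: each test sample contributes its weight instead of 1 to the accuracy. This sketch refits an unweighted model on the training data and reuses the test points from the session.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5],
              [1, 1], [2, 2], [3, 3]])
y = np.array([1, 1, 1, 1, 1, 2, 2, 2])
clf = GaussianNB().fit(X, y)

X_test = np.array([[-6, -6], [-4, -2], [-3, -4], [4, 5]])
y_test = np.array([1, 1, 2, 2])
w = np.array([0.3, 0.2, 0.4, 0.1])

# True where the prediction matches the label
correct = clf.predict(X_test) == y_test

# Unweighted accuracy, then accuracy weighted per sample
print(correct.mean(), clf.score(X_test, y_test))
print(np.average(correct, weights=w), clf.score(X_test, y_test, sample_weight=w))
```

The third test point is the only misclassified one, and it carries weight 0.4, so the weighted accuracy drops from 0.75 to 0.6.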


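Multinomial naive Bayes: sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

Mainly used for classification with discrete features, for example word counts in text classification, where the number of occurrences is the feature value.
Parameters:
alpha: float, optional, default 1.0; the additive (Laplace/Lidstone) smoothing parameter
fit_prior: bool, optional, default True; whether to learn the class prior probabilities. If False, all class labels get the same prior probability
class_prior: array-like of shape (n_classes,), default None; the prior probabilities of the classes
① Build a simple model with MultinomialNB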

In [2]: import numpy as np
   ...: from sklearn.naive_bayes import MultinomialNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,
   ...: 6]])
   ...: y = np.array([1,1,4,2,3,3])
   ...: clf = MultinomialNB(alpha=2.0)
   ...: clf.fit(X,y)
   ...:
Out[2]: MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True)
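
To see what the alpha smoothing parameter does in this model, the fitted feature_log_prob_ attribute can be recomputed by hand. The sketch below assumes the usual additive smoothing: alpha is added to every per-class feature count before each class row is normalised.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1, 2, 3, 4], [1, 3, 4, 4], [2, 4, 5, 5],
              [2, 5, 6, 5], [3, 4, 5, 6], [3, 5, 6, 6]])
y = np.array([1, 1, 4, 2, 3, 3])
alpha = 2.0
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Total count of each feature within each class, in the order of clf.classes_
feature_count = np.array([X[y == c].sum(axis=0) for c in clf.classes_])

# Add alpha to every count, then normalise each class row to sum to 1
smoothed = feature_count + alpha
manual_log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

print(np.allclose(manual_log_prob, clf.feature_log_prob_))
```

A larger alpha pulls every row toward the uniform distribution, which keeps features unseen in a class from getting probability zero.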

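② After training, inspect the attributes

class_log_prior_: the log of the smoothed prior probability of each class label; its value depends on the fit_prior and class_prior parameters
a. If class_prior is specified, class_log_prior_ is the log of class_prior, regardless of whether fit_prior is True or False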

In [4]: import numpy as np
...: from sklearn.naive_bayes import MultinomialNB
...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,
...: 6]])
...: y = np.array([1,1,4,2,3,3])
...: clf = MultinomialNB(alpha=2.0,fit_prior=True,class_prior=[0.3,0.1,0.3,0
...: .2])
...: clf.fit(X,y)
...: print(clf.class_log_prior_)
...: print(np.log(0.3),np.log(0.1),np.log(0.3),np.log(0.2))
...: clf1 = MultinomialNB(alpha=2.0,fit_prior=False,class_prior=[0.3,0.1,0.3
...: ,0.2])
...: clf1.fit(X,y)
...: print(clf1.class_log_prior_)
...:
[-1.2039728 -2.30258509 -1.2039728 -1.60943791]
-1.20397280433 -2.30258509299 -1.20397280433 -1.60943791243
[-1.2039728 -2.30258509 -1.2039728 -1.60943791]

b. If fit_prior is False and class_prior=None, every class label gets the same prior probability 1/N, where N is the number of class labels

In [5]: import numpy as np
   ...: from sklearn.naive_bayes import MultinomialNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,
   ...: 6]])
   ...: y = np.array([1,1,4,2,3,3])
   ...: clf = MultinomialNB(alpha=2.0,fit_prior=False)
   ...: clf.fit(X,y)
   ...: print(clf.class_log_prior_)
   ...: print(np.log(1/4))
   ...:
[-1.38629436 -1.38629436 -1.38629436 -1.38629436]
-1.38629436112

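c. If fit_prior is True and class_prior=None, the prior probability of each class label is the number of training samples with that label divided by the total number of training samples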

In [6]: import numpy as np
...: from sklearn.naive_bayes import MultinomialNB
...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,
...: 6]])
...: y = np.array([1,1,4,2,3,3])
...: clf = MultinomialNB(alpha=2.0,fit_prior=True)
...: clf.fit(X,y)
...: print(clf.class_log_prior_)  # printed in order of class labels 1, 2, 3, 4
...: print(np.log(2/6),np.log(1/6),np.log(2/6),np.log(1/6))
...:
[-1.09861229 -1.79175947 -1.09861229 -1.79175947]
-1.09861228867 -1.79175946923 -1.09861228867 -1.79175946923

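Bernoulli naive Bayes: sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)

Like multinomial naive Bayes, it is mainly used for classification with discrete features. The difference is that MultinomialNB uses occurrence counts as feature values, whereas BernoulliNB expects binary (boolean) features.
Parameters:
binarize: threshold for binarizing the features; values greater than the threshold become 1 and all others become 0

① Build a simple model with BernoulliNB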

In [5]: import numpy as np
...: from sklearn.naive_bayes import BernoulliNB
...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5]])
...: y = np.array([1,1,2])
...: clf = BernoulliNB(alpha=2.0,binarize = 3.0,fit_prior=True)
...: clf.fit(X,y)
...:
Out[5]: BernoulliNB(alpha=2.0, binarize=3.0, class_prior=None, fit_prior=True)

After binarization with binarize = 3.0, the input X is equivalent to the array

In [7]: X = np.array([[0,0,0,1],[0,0,1,1],[0,1,1,1]])
In [8]: X
Out[8]:
array([[0, 0, 0, 1],
[0, 0, 1, 1],
[0, 1, 1, 1]])
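
The equivalence can be verified rather than typed in by hand. The sketch below binarizes X with a plain NumPy comparison and fits a second model with binarize=None, which tells BernoulliNB to treat the input as already binary.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 2, 3, 4], [1, 3, 4, 4], [2, 4, 5, 5]])
y = np.array([1, 1, 2])

# Values strictly greater than the threshold become 1, the rest 0
X_bin = (X > 3.0).astype(int)
print(X_bin)

# Binarization done by the estimator versus done beforehand
auto = BernoulliNB(alpha=2.0, binarize=3.0).fit(X, y)
manual = BernoulliNB(alpha=2.0, binarize=None).fit(X_bin, y)
print(np.allclose(auto.feature_log_prob_, manual.feature_log_prob_))
```

Note that a value equal to the threshold (the 3 in the first row) maps to 0, not 1.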

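② After training, inspect the attributes

class_log_prior_: the log of the class prior probabilities; each prior equals the number of samples in the class divided by the total number of samples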
In [9]: clf.class_log_prior_
Out[9]: array([-0.40546511, -1.09861229])

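feature_log_prob_: the log of the conditional probability of each feature given the class, returned as an array of shape (n_classes, n_features)

In [10]: clf.feature_log_prob_
Out[10]: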
array([[-1.09861229, -1.09861229, -0.69314718, -0.40546511],
[-0.91629073, -0.51082562, -0.51082562, -0.51082562]])

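The result above is computed as follows:

Let the four features of X be A1, A2, A3, A4 and the classes be y1, y2. Each entry is the log of the smoothed probability that a feature equals 1 given the class: P(Ai=1|y=yk) = (number of class-yk samples with Ai=1 + alpha)/(number of class-yk samples + 2*alpha). For class y1 and feature A1 this gives (0+2)/(2+2*2) = 1/3, whose log is -1.0986. In the check below, the terms multiplied by 0 drop out and only the log P(Ai=1|y=y1) terms remain.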

In [11]: import numpy as np
...: from sklearn.naive_bayes import BernoulliNB
...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5]])
...: y = np.array([1,1,2])
...: clf = BernoulliNB(alpha=2.0,binarize = 3.0,fit_prior=True)
...: clf.fit(X,y)
...: print(clf.feature_log_prob_)
...: print([np.log((2+2)/(2+2*2))*0+np.log((0+2)/(2+2*2))*1,np.log((2+2)/(2
...: +2*2))*0+np.log((0+2)/(2+2*2))*1,np.log((1+2)/(2+2*2))*0+np.log((1+2)/
...: (2+2*2))*1,np.log((0+2)/(2+2*2))*0+np.log((2+2)/(2+2*2))*1])
...:
[[-1.09861229 -1.09861229 -0.69314718 -0.40546511]
[-0.91629073 -0.51082562 -0.51082562 -0.51082562]]
[-1.0986122886681098, -1.0986122886681098, -0.69314718055994529, -0.405465108108
16444]
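
The hand calculation above covers only class y1. A vectorized sketch of the same smoothing formula reproduces both rows of feature_log_prob_ at once; the variable names ones_count and class_count are chosen here for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 2, 3, 4], [1, 3, 4, 4], [2, 4, 5, 5]])
y = np.array([1, 1, 2])
alpha = 2.0
clf = BernoulliNB(alpha=alpha, binarize=3.0).fit(X, y)

# Binarize the same way the estimator does (strictly greater than the threshold)
X_bin = (X > 3.0).astype(int)

# Per class: samples with each feature equal to 1, and total samples
ones_count = np.array([X_bin[y == c].sum(axis=0) for c in clf.classes_])
class_count = np.array([(y == c).sum() for c in clf.classes_])

# Smoothed P(Ai=1 | class); 2 * alpha accounts for the two outcomes 0 and 1
manual = np.log((ones_count + alpha) / (class_count[:, None] + 2 * alpha))
print(np.allclose(manual, clf.feature_log_prob_))
```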


 

