中文自然语言处理最首要的就是要中文分词了,现在而言效果最好的还是要算crf了,具体可以查看Stanford NLP,不过鉴于crf速度比较慢,而且咱对其还没有完全的理解,所以这里就没有比较crf算法了。这里主要比较的是最大匹配算法,隐马尔可夫,uni-gram,和一种character based generative model这四种进行比较。
在52nlp上的这篇文章介绍到了Bakeoff 2005的数据。SIGHAN是国际计算语言学会(ACL)中文语言处理小组的简称,其英文全称为“Special Interest Group for Chinese Language Processing of the Association for Computational Linguistics”,而Bakeoff则是SIGHAN所主办的国际中文语言处理竞赛,然后在Bakeoff 2005的主页上,我们可以下载到用于测试的数据,数据主要是由北京大学,微软亚洲研究院,香港城市大学,台湾中央研究院提供。
下载后可以看到包括doc,training,gold,scripts,testing几个文件夹,其中training中有训练用的分词好的数据,testing中有测试用的未分词的数据,gold中有用于检验的测试的分词好的数据,scripts里是包括一个最大匹配分词的脚本和一个计算分数的脚本,而且数据都提供了gbk和utf8版本。
scripts下的mwseg.pl是自带的最大匹配分词脚本,可以作为baseline来比较
./mwseg.pl ../gold/pku_training_words.txt < ../testing/pku_test.txt > pku_test_seg.txt
可以生成分好词的文件
./score ../gold/pku_training_words.txt ../gold/pku_test_gold.txt pku_test_seg.txt > score.txt
可以生成评分文件,文件中主要内容入下
=== SUMMARY: === TOTAL INSERTIONS: 9274 === TOTAL DELETIONS: 1365 === TOTAL SUBSTITUTIONS: 8377 === TOTAL NCHANGE: 19016 === TOTAL TRUE WORD COUNT: 104372 === TOTAL TEST WORD COUNT: 112281 === TOTAL TRUE WORDS RECALL: 0.907 === TOTAL TEST WORDS PRECISION: 0.843 === F MEASURE: 0.874
可以看到最大匹配算法在pku数据集上的效果精确率是0.843
然后我们再考虑HMM算法,这里用了这篇论文里写的TnT算法,uni-gram写法主要是我这篇文章实现的算法,还有前面提到的这篇论文里的character based generative model。然后跑了北京大学和微软亚洲研究院的数据,下面是算法在两个数据集上的准确率。
--------+----------+--------+----------+------------+ | 最大匹配 | TnT | uni-gram | generative | --------+----------+--------+----------+------------+ PKU | 0.843 | 0.801 | 0.853 | 0.871 | --------+----------+--------+----------+------------+ MSR | 0.917 | 0.759 | 0.925 | 0.944 | --------+----------+--------+----------+------------+
可以看出character based generative model的表现都很不错,而TnT的表现却不怎么好,不过TnT在词性标注的表现却是异常的好,下面我们可以看看这几种算法的主要思路,首先是uni-gram
argmax(ΠP(wordi))
TnT的公式是
argmax(ΠP(wordi|tagi)∗P(tagi|tagi−1,tagi−2))
而character based generative model是
argmax(ΠP((wordi,tagi)|(wordi−1,tagi−1),(wordi−2,tagi−2)))
看上去character based generative model像是n-gram和TnT的综合,这也可能是他表现比较好的原因吧,话说大概还有很多细节的优化或者训练测试集的不同导致了和论文里的评测值不太一样。
最后我就把这个分词算法扔到我新完成的项目SnowNLP里了,代码可以看下面这个。
- # -*- coding: utf-8 -*-
- from __future__ import unicode_literals
-
- import sys
- import gzip
- import marshal
- from math import log
-
- from ..utils import frequency
-
-
- class CharacterBasedGenerativeModel(object):
-
- def __init__(self):
- self.l1 = 0.0
- self.l2 = 0.0
- self.l3 = 0.0
- self.status = ('b', 'm', 'e', 's')
- self.uni = frequency.NormalProb()
- self.bi = frequency.NormalProb()
- self.tri = frequency.NormalProb()
-
- def save(self, fname, iszip=True):
- d = {}
- for k, v in self.__dict__.items():
- if hasattr(v, '__dict__'):
- d[k] = v.__dict__
- else:
- d[k] = v
- if sys.version_info[0] == 3:
- fname = fname + '.3'
- if not iszip:
- marshal.dump(d, open(fname, 'wb'))
- else:
- f = gzip.open(fname, 'wb')
- f.write(marshal.dumps(d))
- f.close()
-
- def load(self, fname, iszip=True):
- if sys.version_info[0] == 3:
- fname = fname + '.3'
- if not iszip:
- d = marshal.load(open(fname, 'rb'))
- else:
- try:
- f = gzip.open(fname, 'rb')
- d = marshal.loads(f.read())
- except IOError:
- f = open(fname, 'rb')
- d = marshal.loads(f.read())
- f.close()
- for k, v in d.items():
- if hasattr(self.__dict__[k], '__dict__'):
- self.__dict__[k].__dict__ = v
- else:
- self.__dict__[k] = v
-
- def div(self, v1, v2):
- if v2 == 0:
- return 0
- return float(v1)/v2
-
- def train(self, data):
- for sentence in data:
- now = [('', 'BOS'), ('', 'BOS')]
- self.bi.add((('', 'BOS'), ('', 'BOS')), 1)
- self.uni.add(('', 'BOS'), 2)
- for word, tag in sentence:
- now.append((word, tag))
- self.uni.add((word, tag), 1)
- self.bi.add(tuple(now[1:]), 1)
- self.tri.add(tuple(now), 1)
- now.pop(0)
- tl1 = 0.0
- tl2 = 0.0
- tl3 = 0.0
- samples = sorted(self.tri.samples(), key=lambda x: self.tri.get(x)[1])
- for now in samples:
- c3 = self.div(self.tri.get(now)[1]-1, self.bi.get(now[:2])[1]-1)
- c2 = self.div(self.bi.get(now[1:])[1]-1, self.uni.get(now[1])[1]-1)
- c1 = self.div(self.uni.get(now[2])[1]-1, self.uni.getsum()-1)
- if c3 >= c1 and c3 >= c2:
- tl3 += self.tri.get(now)[1]
- elif c2 >= c1 and c2 >= c3:
- tl2 += self.tri.get(now)[1]
- elif c1 >= c2 and c1 >= c3:
- tl1 += self.tri.get(now)[1]
- self.l1 = self.div(tl1, tl1+tl2+tl3)
- self.l2 = self.div(tl2, tl1+tl2+tl3)
- self.l3 = self.div(tl3, tl1+tl2+tl3)
-
- def log_prob(self, s1, s2, s3):
- uni = self.l1*self.uni.freq(s3)
- bi = self.div(self.l2*self.bi.get((s2, s3))[1], self.uni.get(s2)[1])
- tri = self.div(self.l3*self.tri.get((s1, s2, s3))[1],
- self.bi.get((s1, s2))[1])
- if uni+bi+tri == 0:
- return float('-inf')
- return log(uni+bi+tri)
-
- def tag(self, data):
- now = [((('', 'BOS'), ('', 'BOS')), 0.0, [])]
- for w in data:
- stage = {}
- not_found = True
- for s in self.status:
- if self.uni.freq((w, s)) != 0:
- not_found = False
- break
- if not_found:
- for s in self.status:
- for pre in now:
- stage[(pre[0][1], (w, s))] = (pre[1], pre[2]+[s])
- now = list(map(lambda x: (x[0], x[1][0], x[1][1]),
- stage.items()))
- continue
- for s in self.status:
- for pre in now:
- p = pre[1]+self.log_prob(pre[0][0], pre[0][1], (w, s))
- if (not (pre[0][1],
- (w, s)) in stage) or p > stage[(pre[0][1],
- (w, s))][0]:
- stage[(pre[0][1], (w, s))] = (p, pre[2]+[s])
- now = list(map(lambda x: (x[0], x[1][0], x[1][1]), stage.items()))
- return zip(data, max(now, key=lambda x: x[1])[2])