Spell Correction



The vocabulary

#membership tests on a list take linear time (a scan over every element)
#a set uses hashing, so lookups are O(1) on average (not O(log n))
vocab=set([line.rstrip() for line in open('vocab.txt')])
print(vocab)
{'rts', 'tactual', 'unavoidably', 'Interstate', 'Compared', 'vulcanized', 'Shakya', 'lowliest', 'Weber', 'Vocational', 'destroyer', 'Flying', 'showmanship', 'spurred', 'Devout', 'Amstutz', 'totality', 'Sides', 'Dietetic', 'JYJ', 'struggled', 'translated', 'myth', 'alienate', 'corona', 'exemplifies', 'showcase', 'drawers', 'stringently', 'companion', 'hopping', 'Reference', 'sauces', 'requiring', 'strove', 'volens', 'complex', 'confronted', 'playbacks', 'confinement', 'sorption', 'black', 'Churches', 'Erie', 'manila', 'Rumanians', 'ancestry', 'sank', 'reagents', 'favorable', 'thorny', 'lifts', 'rhododendron', 'abduction', 'accenting', 'platinum', 'carries', 'barrel', 'detract', 'resisted', '102', 'Magnum', 'attained', 'engineer', 'naturalness', 'Wellesley', 'fiche', 'multistage', 'Sandalwood', 'stress', 'emotion', 'concluded', 'justitia', 'tumbled', 'activation', 'instance', 'harsher', 'phantom', 'sanctions', 'abortion', 'Gianicolo', 'Ching', 'limply', 'Stronghold', 'draftees', 'campagna', 'badinage', 'Couperin', 'Seminar', 'choked', 'involvement', 'unveiled', 'excite', 'jabbed', 'rotations', 'arteriosclerosis', 'geology', 'Redland', 'Mill', 'Schramm', 'candle', 'brave', 'Lovelace', 'Compagnie'}
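
A quick way to see the difference (a sketch; absolute timings depend on the machine and on the size of vocab.txt, and word_list/word_set are throwaway names for this demo):

import time

word_list=[line.rstrip() for line in open('vocab.txt')]
word_set=set(word_list)

start=time.time()
for _ in range(10000):
    'apple' in word_list   #linear scan each time
print('list lookups:',time.time()-start)

start=time.time()
for _ in range(10000):
    'apple' in word_set    #constant-time hash lookup on average
print('set lookups :',time.time()-start)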

Generating all candidate corrections

#generate all (valid) candidate corrections
def generate_candidates(word):
    '''
    word: the given input (a misspelled word)
    returns all valid candidates found in the vocabulary
    '''
    #generate every string at edit distance 1
    #1.insert 2.delete 3.replace
    #assume an alphabet of the 26 lowercase letters
    letters='abcdefghijklmnopqrstuvwxyz'
    splits=[(word[:i],word[i:]) for i in range(len(word)+1)]
    #print(splits)
    #[('', 'apple'), ('a', 'pple'), ('ap', 'ple'), ('app', 'le'), ('appl', 'e'), ('apple', '')]

    #insert a letter
    inserts= [L+c+R for L,R in splits for c in letters]
    #delete a letter
    deletes= [L+R[1:] for L,R in splits if R]
    #replace a letter
    replaces=[L+c+R[1:] for L,R in splits if R for c in letters]
    candidates = set(inserts+deletes+replaces)
    return [w for w in candidates if w in vocab]
print(generate_candidates('apple'))
['apply', 'apple', 'ample', 'apples']
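
If a misspelling has no edit-distance-1 candidate in the vocabulary, one common extension (a sketch, not part of the original code; edits1 and generate_candidates_ed2 are hypothetical helpers) is to apply the same three edits twice, covering edit distance 2:

def edits1(word):
    #all strings at edit distance 1, without the vocabulary filter
    letters='abcdefghijklmnopqrstuvwxyz'
    splits=[(word[:i],word[i:]) for i in range(len(word)+1)]
    inserts=[L+c+R for L,R in splits for c in letters]
    deletes=[L+R[1:] for L,R in splits if R]
    replaces=[L+c+R[1:] for L,R in splits if R for c in letters]
    return set(inserts+deletes+replaces)

def generate_candidates_ed2(word):
    #apply the edit-distance-1 generator twice, then keep only real words
    candidates=set()
    for e1 in edits1(word):
        candidates|=edits1(e1)
    return [w for w in candidates if w in vocab]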

Loading the corpus

import nltk
nltk.download('punkt')
nltk.download('reuters')  #the corpus data itself must be downloaded too
from nltk.corpus import reuters
#load the Reuters corpus that ships with NLTK
categories=reuters.categories()
corpus=reuters.sents(categories=categories)
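
A quick sanity check on what reuters.sents returns (each item is an already-tokenized sentence; the exact counts depend on your NLTK data):

print(len(corpus))   #number of sentences, on the order of tens of thousands
print(corpus[0])     #a list of tokens, e.g. ['ASIAN', 'EXPORTERS', ...]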

Building the language model: bigram

#build the language model: bigram counts
term_count={}
bigram_count={}
for doc in corpus:
    #prepend <s> so P(I|<s>) can be computed for "<s> I like basketball ."
    doc=['<s>']+doc
    #the last word has no successor, so stop one short of the end
    for i in range(0,len(doc)-1):
        #bigram: words [i, i+1]
        term=doc[i]
        bigram=doc[i:i+2]

        if term in term_count:
            term_count[term]+=1
        else:
            term_count[term]=1
        #join with a space so word boundaries stay unambiguous
        bigram=' '.join(bigram)
        if bigram in bigram_count:
            bigram_count[bigram]+=1
        else:
            bigram_count[bigram]=1
print(term_count)
print(bigram_count)
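
The correction step below combines these counts with add-one (Laplace) smoothing: P(w_i | w_{i-1}) = (count(w_{i-1} w_i) + 1) / (count(w_{i-1}) + V), where V is the vocabulary size. As a sketch, the same computation wrapped in a helper (the function name is ours, not from the original post):

import numpy as np

V=len(term_count)   #vocabulary size for add-one smoothing

def bigram_log_prob(prev_word,word):
    #log P(word|prev_word) with add-one (Laplace) smoothing
    key=prev_word+' '+word
    if prev_word in term_count:
        return np.log((bigram_count.get(key,0)+1.0)/(term_count[prev_word]+V))
    #unseen history: fall back to a uniform 1/V
    return np.log(1.0/V)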

Estimating the channel probability (how users mistype)

# estimating the probability of user typos - channel probability
channel_prob = {}

for line in open('spell-errors.txt'):
    items = line.split(":")
    correct = items[0].strip()
    mistakes = [item.strip() for item in items[1].strip().split(",")]
    channel_prob[correct] = {}
    for mis in mistakes:
        channel_prob[correct][mis] = 1.0/len(mistakes)

print(channel_prob)
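
The parsing above implies one correct word per line, followed by a colon and its comma-separated observed misspellings, e.g. (illustrative lines, not necessarily verbatim from spell-errors.txt):

raining: rainning, raning
writings: writtings
disparagingly: disparingly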

 

Correction

#total number of distinct words, used as V in add-one smoothing
V=len(term_count.keys())

import numpy as np
file = open('testdata.txt','r')
for line in file:
    items = line.rstrip().split('\t')
    line = items[2].split()
    # line = ["I", "like", "playing"]
    for word in line:
        if word not in vocab:
            # replace word with the correct spelling
            # Step 1: generate all (valid) candidates
            candidates = generate_candidates(word)
            # one option: if candidates == [], generate more candidates
            # TODO: e.g. widen to edit distance 2, as sketched earlier
            if not candidates:
                continue
            probs = []
            # for each candidate, compute its score:
            # score = p(correct)*p(mistake|correct)
            #       = log p(correct) + log p(mistake|correct)
            # return the candidate with the highest score
            for candi in candidates:
                prob = 0
                # a. channel probability
                # if the pair is missing from the map, assign a small probability
                if candi in channel_prob and word in channel_prob[candi]:
                    prob += np.log(channel_prob[candi][word])
                else:
                    prob += np.log(0.001)

                # b. language-model probability
                idx = line.index(word)
                # the sentence implicitly starts with <s>
                prev_word = line[idx - 1] if idx > 0 else '<s>'
                bigram = prev_word + ' ' + candi
                if prev_word in term_count and bigram in bigram_count:
                    prob += np.log((bigram_count[bigram] + 1.0) /
                                   (term_count[prev_word] + V))
                # TODO: also consider the forward bigram [candi, next_word]
                #   prob += np.log(its smoothed bigram probability)
                else:
                    prob += np.log(1.0 / V)

                probs.append(prob)

            # pick the best-scoring candidate (outside the candidate loop)
            max_idx = probs.index(max(probs))
            print(word, candidates[max_idx])
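
The TODO above suggests also scoring the bigram formed with the next word. A sketch of that extra term, to be placed inside the candidate loop right after the backward-bigram score (it assumes the same space-joined keys in bigram_count):

                # also score [candi, next_word] when word is not sentence-final
                if idx + 1 < len(line):
                    next_word = line[idx + 1]
                    fwd = candi + ' ' + next_word
                    if candi in term_count and fwd in bigram_count:
                        prob += np.log((bigram_count[fwd] + 1.0) /
                                       (term_count[candi] + V))
                    else:
                        prob += np.log(1.0 / V)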

Output:

protectionst protectionist
products. products
long-run, long-run
gain. gain
gain. gain
17, 17e
17, 17e
retaiation retaliation
cost. cost
cost. cost
cost. cost
busines, business
ltMC.T. ltMC.T
U.S., U.S.
Murtha, Murtha
worried. worried
seriousnyss seriousness
aganst against
.............

 

Source: https://blog.csdn.net/weixin_44953364/article/details/100772108
