[NLP] Chinese Text Correction: A Code Walkthrough (pycorrector)

2020-01-30 22:00:53 · Views: 1969 · Source: Internet

Tags: NLP, word, pycorrector, output, scores, print, new, error correction, data

0. Installing pycorrector on Windows 10

https://github.com/shibing624/pycorrector
1. pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pycorrector fails with: No module named 'pypinyin'
2. pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pypinyin then fails with: No module named 'kenlm'
3. pip install https://github.com/kpu/kenlm/archive/master.zip fails because Microsoft Visual C++ is missing
4. Microsoft Visual C++ download: https://pan.baidu.com/s/1toZQAaJXa3xnflhjDMx6lg (extraction code: ky7w). After installing it, repeat step 3, then step 1, then run pip install jieba

1. Training a language model on Ubuntu:

wget -O - http://kheafield.com/code/kenlm.tar.gz |tar xz
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4
If cmake is not installed: sudo apt install cmake
If Boost is missing: sudo apt-get install libboost-all-dev
If Eigen3 is missing: the fix was shown in a screenshot that is not available here
build/bin/lmplz -o 3 --verbose_header --text people2014corpus_words.txt --arpa result/people2014corpus_words.arps

build/bin/build_binary ./result/people2014corpus_words.arps ./result/people2014corpus_words.klm
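If the shell reports that lmplz or build_binary does not exist, add the kenlm folder to the PATH environment variable:
gedit .profile
(screenshot of the edited .profile not available)
source .profile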

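2. Using kenlm

2.1 Scoring with kenlm

pycorrector ships a file that contains a large set of characters. Each character is substituted in turn into the slot produced by the edit-distance step, every resulting candidate is scored with the language model, and the candidate with the lowest perplexity is taken as the correct one.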
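The selection step described above can be sketched without kenlm. This is only an illustration under stated assumptions: the tiny corpus, the smoothed character-bigram model `toy_perplexity`, the helper `best_candidate`, and the candidate character list are all hypothetical stand-ins, not pycorrector's data or API.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for a trained language model.
CORPUS = ["中国工商银行", "中国农业银行", "工商银行", "银行业务"]

# Character-bigram counts; add-one smoothing plays the role of kenlm here.
unigrams = Counter()
bigrams = Counter()
for line in CORPUS:
    chars = ["<s>"] + list(line) + ["</s>"]
    unigrams.update(chars[:-1])            # context counts
    bigrams.update(zip(chars, chars[1:]))  # (previous, current) counts
VOCAB_SIZE = len(set(unigrams) | {"</s>"})

def toy_perplexity(sentence):
    """Perplexity of a sentence under the smoothed toy bigram model."""
    chars = ["<s>"] + list(sentence) + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + VOCAB_SIZE)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))

def best_candidate(sentence, position, candidate_chars):
    """Replace the character at `position` with each candidate and keep the lowest perplexity."""
    candidates = [sentence[:position] + c + sentence[position + 1:] for c in candidate_chars]
    return min(candidates, key=toy_perplexity)

CANDIDATE_CHARS = "工二土干"  # hypothetical candidate characters
print(best_candidate("中国二商银行", 2, CANDIDATE_CHARS))  # → 中国工商银行
```

With a real model, `toy_perplexity` would be replaced by the perplexity of the loaded kenlm model; the selection logic stays the same.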

import kenlm
lm = kenlm.Model('C:/Users/1/Anaconda3/Lib/site-packages/pycorrector/data/kenlm/people_chars_lm.klm')
print(lm.score('银行', bos = True, eos = True)) # bos/eos: add begin- and end-of-sentence markers

Output: (image not available)

chars = ['中国工商银行',
        '往来账业务']
print(lm.score(' '.join(chars), bos = True, eos = True))

Output: (image not available)

' '.join(chars)  # join with a space as the delimiter; kenlm splits its input on whitespace

Output: '中国工商银行 往来账业务'

lm.perplexity('中国工商银行')

Output: (image not available)
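The score and the perplexity are two views of the same quantity. The sketch below assumes that the score is a log10 probability and that the perplexity is normalized by the number of whitespace-separated tokens plus one for the end-of-sentence marker, which is how the kenlm Python wrapper appears to compute it; the function name is hypothetical.

```python
def perplexity_from_log10_score(log10_score, num_tokens):
    """Convert a log10 sentence score into a perplexity.

    Assumption: one extra token is counted for the end-of-sentence marker </s>.
    """
    return 10.0 ** (-log10_score / (num_tokens + 1))

# A two-token sentence with total log10 probability -6.0:
# -(-6.0) / (2 + 1) = 2.0, and 10 ** 2.0 = 100.0
print(perplexity_from_log10_score(-6.0, 2))  # → 100.0
```

A higher (less negative) score therefore means a lower perplexity, which is why the candidate with the lowest perplexity is preferred.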

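2.2 Word segmentation

Word segmentation methods are mainly dictionary-based (forward maximum matching, backward maximum matching, bidirectional matching, and so on) or statistical (HMM, CRF, and deep learning). Mainstream toolkits include NLPIR from the Institute of Computing Technology of the Chinese Academy of Sciences, LTP from Harbin Institute of Technology, THULAC from Tsinghua University, the HanLP segmenter, and the Python jieba library. For more methods and toolkits, see this Zhihu thread: https://www.zhihu.com/question/19578687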
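As an illustration of the dictionary-based family, here is a minimal sketch of forward maximum matching. The dictionary `TOY_DICT`, the function name, and the `max_len` parameter are assumptions made for this example; real toolkits use much larger dictionaries and more refined strategies.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation that always takes the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy dictionary.
TOY_DICT = {"我", "在", "课堂", "学习", "自然语言", "自然", "语言", "处理"}
print(forward_max_match("我在课堂学习自然语言处理", TOY_DICT))
# → ['我', '在', '课堂', '学习', '自然语言', '处理']
```

Backward maximum matching works the same way from the end of the string, and the bidirectional variant compares the two results.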

import jieba
s="我在课堂学习自然语言1000处理"  # note: writing 1=... is not allowed (a name cannot start with a digit)
b=jieba.cut(s)
print("/ ".join(b))

Output: 我/ 在/ 课堂/ 学习/ 自然语言/ 1000/ 处理

b=jieba.cut(s)
print(b)  # cut returns a generator, not a list

Output: <generator object Tokenizer.cut at 0x000001DDD9CFB728>

b=jieba.lcut(s)  # the l prefix means the result is a list
print(b)

Output: ['我', '在', '课堂', '学习', '自然语言', '1000', '处理']

b= jieba.cut(s, cut_all=True)
print("Full Mode: " + "/ ".join(b))  # full mode

Output: Full Mode: 我/ 在/ 课堂/ 学习/ 自然/ 自然语言/ 语言/ 1000/ 处理
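1. jieba.cut accepts three parameters:
• the string to segment
• cut_all, which controls whether full mode is used
• HMM, which controls whether the HMM model is used
2. jieba.cut_for_search accepts two parameters; it produces finer-grained segments and is intended for building a search engine's inverted index:
• the string to segment
• whether to use the HMM model.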

import jieba

seg_list = jieba.cut("我在课堂学习自然语言1000处理", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我在课堂学习自然语言处理", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他毕业于北京航空航天大学，在百度深度学习研究院进行研究")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在斯坦福大学深造")  # search engine mode
print(", ".join(seg_list))

Output:
Full Mode: 我/ 在/ 课堂/ 学习/ 自然/ 自然语言/ 语言/ 1000/ 处理
Default Mode: 我/ 在/ 课堂/ 学习/ 自然语言/ 处理
他, 毕业, 于, 北京航空航天大学, ，, 在, 百度, 深度, 学习, 研究院, 进行, 研究
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ，, 后, 在, 福大, 大学, 斯坦福, 斯坦福大学, 深造

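Adding a user-defined dictionary: we often need to segment text for our own scenario, which involves domain-specific terms.
1. jieba.load_userdict(file_name) loads a user dictionary.
2. A small number of words can be added manually as follows:
Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in a program.
Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be split out.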

print('/'.join(jieba.cut('如果放到旧字典中将出错。', HMM=False)))

Output: 如果/放到/旧/字典/中将/出错/。

jieba.suggest_freq(('中', '将'), True)
print('/'.join(jieba.cut('如果放到旧字典中将出错。', HMM=False)))

Output: 如果/放到/旧/字典/中/将/出错/。

import jieba_fast as jieba
jieba.lcut('浙江萧山农村商业银行对公取款凭条客户联')

Output: (image not available)

from pycorrector.tokenizer import segment as seg
seg('浙江萧山农村商业银行对公取款凭条客户联')

Output: (image not available)

import thulac
thu1 = thulac.thulac()  # default mode
text = thu1.cut("福州运恒出租车服务有限公司通用机打发票出租汽车专用")
print(text)

Output: (image not available)

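2.3 Scoring with 2-grams and 3-grams

import kenlm
# raw string so the backslashes in the Windows path are not treated as escapes
lm = kenlm.Model(r'C:\ProgramData\Anaconda3\Lib\site-packages\pycorrector\data\kenlm/people_chars_lm.klm')
sentence = '中国二商银行'

# 2-gram
ngram_avg_scores = []
n = 2
scores = []
for i in range(len(sentence) - n + 1):
    word = sentence[i:i + n]
    # kenlm splits on whitespace, so separate the characters for the character-level model
    score = lm.score(' '.join(word), bos=False, eos=False)
    scores.append(score)
print(scores)
# pad n - 1 copies of the edge scores on both sides
for _ in range(n - 1):
    scores.insert(0,scores[0])
    scores.append(scores[-1])
print(scores)
# one averaged score per character
avg_scores = [sum(scores[i:i + n]) / len(scores[i:i + n]) for i in range(len(sentence))]
ngram_avg_scores.append(avg_scores)
print(ngram_avg_scores)

Output: (image not available)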

# 3-gram
ngram_avg_scores = []
n = 3
scores = []
for i in range(len(sentence) - n + 1):
    word = sentence[i:i + n]
    # kenlm splits on whitespace, so separate the characters for the character-level model
    score = lm.score(' '.join(word), bos=False, eos=False)
    scores.append(score)
print(scores)
# pad n - 1 copies of the edge scores on both sides
for _ in range(n - 1):
    scores.insert(0,scores[0])
    scores.append(scores[-1])
print(scores)
# one averaged score per character
avg_scores = [sum(scores[i:i + n]) / len(scores[i:i + n]) for i in range(len(sentence))]
ngram_avg_scores.append(avg_scores)
print(ngram_avg_scores)

Output: (image not available)

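# 2-gram and 3-gram combined
ngram_avg_scores = []

for n in [2,3]:
    scores = []
    for i in range(len(sentence) - n + 1):
        word = sentence[i:i + n]
        # kenlm splits on whitespace, so separate the characters for the character-level model
        score = lm.score(' '.join(word), bos=False, eos=False)
        scores.append(score)
    # skip this n when the sentence is shorter than n characters
    if not scores:
        continue
    # pad n - 1 copies of the edge scores on both sides
    for _ in range(n - 1):
        scores.insert(0,scores[0])
        scores.append(scores[-1])
    # one averaged score per character
    avg_scores = [sum(scores[i:i + n]) / len(scores[i:i + n]) for i in range(len(sentence))]
    ngram_avg_scores.append(avg_scores)
print(ngram_avg_scores)

Output: (image not available)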
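The padding and averaging can be checked without a trained model. In the sketch below, `toy_score` is a hypothetical stand-in for lm.score that simply penalizes any window containing '二', and `ngram_position_scores` is an illustrative helper name; only the windowing logic mirrors the code above.

```python
def ngram_position_scores(sentence, n, score_fn):
    """Return one averaged n-gram score per character of the sentence."""
    # Score every window of n characters.
    scores = [score_fn(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]
    if not scores:
        return []
    # Pad n - 1 copies of the edge scores on each side ...
    padded = [scores[0]] * (n - 1) + scores + [scores[-1]] * (n - 1)
    # ... so that each character gets the mean of n consecutive window scores.
    return [sum(padded[i:i + n]) / n for i in range(len(sentence))]

def toy_score(ngram):
    # Hypothetical scorer: windows containing '二' look unlikely.
    return -5.0 if "二" in ngram else -1.0

print(ngram_position_scores("中国二商银行", 2, toy_score))
# → [-1.0, -3.0, -5.0, -3.0, -1.0, -1.0]
```

The lowest averaged score falls on index 2, the position of the wrong character, which is the property the later outlier detection relies on.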

2.4 Matrix processing with numpy

import numpy as np
# take the average of the stacked n-gram scores
# sent_scores = list(np.average(np.array(ngram_avg_scores), axis=0))
np.array(ngram_avg_scores)

Output: (image not available)

np.average(np.array(ngram_avg_scores), axis=0)

Output: (image not available)

sent_scores = list(np.average(np.array(ngram_avg_scores), axis=0))
sent_scores

Output: (image not available)

scoress = sent_scores
scoress = np.array(scoress)
scoress

Output: (image not available)

len(scoress.shape)

Output: (image not available)

scores2 = scoress[:, None]  # turn the 1-D array into a column vector
scores2

Output: (image not available)

median = np.median(scores2 , axis = 0)  # the median sorts the values first: an odd count takes the middle value, an even count averages the two middle values. This is not np.mean
median

Output: (image not available)

np.sqrt(np.sum((scores2 - median) ** 2 , axis = -1))

Output: (image not available)

#margin_median = np.sqrt(np.sum((scores2 - median) ** 2, axis=-1))
margin_median = np.sqrt(np.sum((scores2 - median) ** 2 , axis = 1))
margin_median

Output: (image not available)

# median absolute deviation (MAD)
med_abs_deviation = np.median(margin_median)
med_abs_deviation

Output: (image not available)

ratio=0.6745
y_score = ratio * margin_median / med_abs_deviation
y_score

Output: (image not available)

# scores = scores.flatten()
# maybe_error_indices = np.where((y_score > threshold) & (scores < median))
scores2 = scores2.flatten()
scores2

Output: (image not available)

print('scores2 :' ,scores2)
print('median :' ,median)
print('y_score :' ,y_score)

Output: (image not available)

np.where(y_score > 1.4)

Output: (image not available)

list(np.where(scores2 < median)[0])

Output: (image not available)
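Taken together, the numpy steps above compute a MAD-based outlier score and keep the positions that are both far from the median and below it. The following standard-library sketch shows the same idea under stated assumptions: the function name `suspicious_indices` is hypothetical, the threshold 1.4 and ratio 0.6745 are taken from the snippets above, and the input scores are made-up values.

```python
from statistics import median

def suspicious_indices(scores, threshold=1.4, ratio=0.6745):
    """Indices whose score is an outlier (by MAD) and lies below the median."""
    med = median(scores)
    # For a column vector, sqrt(sum((x - med) ** 2)) per row is just |x - med|.
    margins = [abs(s - med) for s in scores]
    mad = median(margins)  # median absolute deviation
    if mad == 0:
        return []  # all scores are (nearly) identical, nothing stands out
    y_scores = [ratio * m / mad for m in margins]
    return [i for i, (s, y) in enumerate(zip(scores, y_scores))
            if y > threshold and s < med]

# Hypothetical per-character scores; index 2 is much lower than the rest.
print(suspicious_indices([-1.0, -3.0, -5.0, -3.0, -1.0, -1.0]))  # → [2]
```

The guard for a zero MAD is an addition of this sketch; the snippets above would divide by zero in that case.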

3. Edit distance

import re
import os
from collections import Counter

def candidates(word):
    """
    generate possible spelling corrections for word.
    :param word:
    :return:
    """
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def known(words):
    """
    the subset of 'words' that appear in the dictionary of WORDS
    :param words:
    :return:
    """
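    # WORDS is the word-frequency dictionary; it is not defined in this excerpt and must be loaded first
    return set(w for w in words if w in WORDS)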

def edits1(word):
    """
    all edits that are one edit away from 'word'
    :param word:
    :return:
    """
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """
    all edits that are two edits away from 'word'
    :param word:
    :return:
    """
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

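word = '中国工商银行'
for i in range(len(word) + 1):
    print(word[:i], word[i:])  # every way to split the word into a left and a right part

Output: (image not available)

edits1('中国工商银行')  # generate every string that is one edit away

Output: (image not available)
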
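edits1 enumerates strings that are one deletion, transposition, replacement, or insertion away from a word. The complementary operation is measuring how many such edits separate two given strings. Below is a minimal sketch of that measurement using the optimal string alignment variant, which counts an adjacent transposition as one edit; the function name `osa_distance` is an assumption of this example and is not part of pycorrector.

```python
def osa_distance(a, b):
    """Edit distance with deletions, insertions, substitutions, and adjacent transpositions."""
    # d[i][j] is the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]

print(osa_distance("中国工商银行", "中国二商银行"))  # one substitution → 1
print(osa_distance("中国工商银行", "中国商工银行"))  # one transposition → 1
print(osa_distance("中国工商银行", "中国银行"))      # two deletions → 2
```

Every string returned by edits1(word) is at distance at most 1 from word under this measure.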

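sentence  = '##我爱##/中国###//'
sentence.strip('#/')  # removes leading and trailing '#' and '/' only; the original argument '#''/' is the same string '#/'

Output: '我爱##/中国'

from pycorrector.tokenizer import Tokenizer
tokenize = Tokenizer()  # create an instance of the class
sentence = '中国是联合国第五大常任理事国'
token = tokenize.tokenize(sentence)
token

Output: (image not available)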

4. Using pycorrector with pandas

Dataset: https://pan.baidu.com/s/1c1EGc_tY4K7rfoS-NbGhMg (extraction code: kp4h)

import pandas as pd
data = pd.read_csv('data.txt',sep = '\t',header = None)  # tab-separated file without a header row
data

Output: (image not available)

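data.info()  # column 1 has (3978 - 3754) null values

Output: (image not available)

new_data = data.dropna()  # drop rows that contain NULL, such as row 24; the index labels are not renumbered
new_data

Output: (image not available)

new_data.info()  # the index labels are still unchanged

Output: (image not available)

new_data = new_data.reset_index(drop=True)  # renumber the 3754 remaining rows from 0
new_data

Output: (image not available)

new_data.columns = ['Right','Wrong']  # do not forget the = sign
new_data

Output: (image not available)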

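a = new_data['Right'] == new_data['Wrong']  # True where the sentence is already correct
a

Output: (image not available)

new_data_1 = pd.concat([new_data,a],axis=1)  # add a column
new_data_1

Output: (image not available)

new_data_1[0].value_counts()  # 2956 sentences need correction (column 0 is False for 2956 rows)

Output: (image not available)

error_sentences = new_data_1['Wrong']
error_sentences

Output: (image not available)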

import pycorrector
corrector = []
for error_sentence in error_sentences:
    corrected_sent,detail = pycorrector.correct(error_sentence)  # pass the variable, not the quoted string 'error_sentence'
    corrector.append(corrected_sent)
    print(corrected_sent)

Output: (image not available)

corrector

Output: (image not available)

new_data_2 = pd.concat([new_data_1,pd.DataFrame(corrector)],axis=1)  # the class is DataFrame, not dateframe, and it takes no quotes
new_data_2.columns = ['Right','Wrong','t/f','correct']
new_data_2

Output: (image not available)

b = new_data_2['Right'] == new_data_2['correct']
new_data_3 = pd.concat([new_data_2,b],axis=1)  # add a column
new_data_3.columns = ['Right','Wrong','t/f','correct','T/F']
new_data_3

Output: (image not available)

new_data_3['T/F'].value_counts()

Output: (image not available)

# count the sentences that went from correct to wrong and those that went from wrong to correct
data_change = new_data_3[new_data_3['t/f'] != new_data_3['T/F']]
data_change

Output: (image not available)

data_change['T/F'].value_counts()

Output: (image not available)
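The comparison of the 't/f' and 'T/F' columns separates four outcomes. The sketch below restates that bookkeeping in plain Python so the categories are explicit; the function name, the category labels, and the four example rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def correction_outcomes(rows):
    """Count outcomes for (right, wrong, corrected) sentence triples."""
    outcomes = Counter()
    for right, wrong, corrected in rows:
        was_correct = right == wrong      # corresponds to the 't/f' column
        is_correct = right == corrected   # corresponds to the 'T/F' column
        if was_correct and not is_correct:
            outcomes["broken"] += 1        # a correct sentence was made wrong
        elif not was_correct and is_correct:
            outcomes["fixed"] += 1         # a wrong sentence was made right
        elif was_correct:
            outcomes["kept_correct"] += 1
        else:
            outcomes["still_wrong"] += 1
    return outcomes

# Made-up rows, one per outcome.
rows = [
    ("中国工商银行", "中国二商银行", "中国工商银行"),  # fixed
    ("往来账业务", "往来账业务", "往来账业务"),        # kept correct
    ("取款凭条", "取款凭条", "取款平条"),              # broken
    ("客户联", "客户连", "客户连"),                    # still wrong
]
print(correction_outcomes(rows))
```

In the pandas version, data_change contains exactly the "broken" and "fixed" rows, and its 'T/F' column tells them apart (False for broken, True for fixed).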

Author: weixin_43435675

Source: https://blog.csdn.net/weixin_43435675/article/details/88137709
