Natural Language Processing (NLP): Named Entity Recognition

2021-05-05 22:31:10

Tags: NLP, tagged, document, sentence, GPE, named, recognition, sentences, natural language

This article gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).

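Named entity recognition (NER) is an important foundational tool for applications such as information extraction, question answering, syntactic parsing, and machine translation, and it plays a key role in making natural language processing practical. Generally speaking, the task of NER is to identify the named entities in the text being processed, which fall into three broad classes (entity, time, and number) and seven subtypes (person names, organization names, place names, times, dates, currency amounts, and percentages).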

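As a simple example, running named entity recognition on the sentence "小明早上8点去学校上课。" ("Xiao Ming goes to school for class at 8 in the morning.") should extract the following information: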

Person: 小明 (Xiao Ming); Time: 早上8点 (8 in the morning); Location: 学校 (school).
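To make the taxonomy concrete, here is a minimal Python sketch of the three classes and seven subtypes, applied to the example sentence. The grouping of subtypes under each class follows the usual MUC-style convention and is an assumption here, since the text lists the classes and subtypes without mapping them explicitly. The names `NER_TAXONOMY`, `broad_class`, and `example_entities` are made up for illustration, and the annotations are written by hand rather than produced by a model.

```python
# Three broad classes and seven subtypes, as listed in the text.
# The class-to-subtype grouping is an assumption (MUC-style convention).
NER_TAXONOMY = {
    "entity": ["person", "organization", "location"],
    "time": ["time", "date"],
    "number": ["currency", "percentage"],
}


def broad_class(subtype):
    """Return the broad class that a subtype belongs to."""
    for cls, subtypes in NER_TAXONOMY.items():
        if subtype in subtypes:
            return cls
    raise KeyError(f"unknown subtype: {subtype}")


# Hand-written annotations for the example sentence (not model output).
example_entities = [
    ("小明", "person"),       # Xiao Ming
    ("早上8点", "time"),      # 8 in the morning
    ("学校", "location"),     # school
]

for text, subtype in example_entities:
    print(f"{text}: {subtype} ({broad_class(subtype)} class)")
```

A real NER system would produce pairs like those in `example_entities` automatically; the lookup only shows how the subtypes roll up into the three classes.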

First, let's look at how NLTK and Stanford NLP categorize named entities, as shown in the figure below:

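In the figure above, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents. LOCATION covers these as well, and can also denote features such as famous mountains and rivers. FACILITY usually denotes well-known monuments or man-made artifacts.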
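In NLTK, category names such as GPE appear as labels on subtrees of a chunk tree, while ordinary words stay as plain `(word, tag)` tuples. The sketch below builds such a tree by hand (it is a stand-in, not real chunker output, so no model download is needed) and shows two ways to read the labels: walking the subtrees, and converting to IOB triples with `tree2conlltags`. The helper name `entities_from_tree` is made up for illustration.

```python
from nltk.tree import Tree
from nltk.chunk import tree2conlltags

# Hand-built stand-in for a chunk tree: one GPE subtree plus untagged words.
sentence_tree = Tree('S', [
    ('Headquartered', 'VBN'),
    ('in', 'IN'),
    Tree('GPE', [('Zürich', 'NNP')]),
])


def entities_from_tree(tree):
    """Collect (entity text, label) pairs from the labelled subtrees."""
    return [
        (' '.join(word for word, _tag in subtree.leaves()), subtree.label())
        for subtree in tree
        if isinstance(subtree, Tree)
    ]


print(entities_from_tree(sentence_tree))

# IOB view: each token becomes (word, POS tag, B-/I-/O chunk tag).
print(tree2conlltags(sentence_tree))
```

The IOB form is convenient when entity labels need to be aligned token by token, for example when comparing against a reference annotation.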

Next, we turn to NLTK. Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows:

FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.

The Python code that implements NER is as follows:

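import re
import pandas as pd
import nltk

# Note: the NLTK tokenizer, POS tagger, and NE chunker models must already be
# installed (via nltk.download); the exact resource names depend on the NLTK version.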


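def parse_document(document):
    """Split a raw document into a list of stripped sentences."""
    #  validate the input first; re.sub would otherwise raise a TypeError
    #  on non-string input before this check is reached
    if not isinstance(document, str):
        raise ValueError('Document is not string!')

    #  join hard-wrapped lines with a space so words at line breaks stay separate
    document = re.sub(r'\s*\n\s*', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]

    return sentences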


#  sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. Member countries must each also be members of one of 
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America 
and the Caribbean, Oceania, and South America.
"""

#  tokenize sentence
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

#  tag sentences and use nltk's Named Entity Chunk
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]

#  extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        #  only subtrees (NE chunks) have a label; plain (word, tag) tuples do not
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))

#  get unique named entities (deduplicate once, after the loop)
named_entities = list(set(named_entities))

#  store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
#  display results
print(entity_frame)

Output:

        Entity Name   Entity Type
0            Zürich           GPE
1       Netherlands           GPE
2       Switzerland           GPE
3           Germany           GPE
4         Caribbean      LOCATION
5            France           GPE
6           Denmark           GPE
7           Belgium           GPE
8            Sweden           GPE
9           Oceania           GPE
10    South America           GPE
11           Africa        PERSON
12            Spain           GPE
13           Europe           GPE
14             FIFA  ORGANIZATION
15            North           GPE
16             Asia           GPE
17  Central America  ORGANIZATION

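As the output shows, NLTK handles the NER task reasonably well overall: it recognizes FIFA as an organization (ORGANIZATION) and Belgium and Asia as GPE. Some results are less satisfactory, however. For example, it labels Central America as ORGANIZATION when it should be GPE, and it labels Africa as PERSON when it should be GPE.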
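One common way to patch mislabels like these is a small post-processing step that overrides the predicted label for names found in a trusted list (a gazetteer). The sketch below is not part of the original script. The gazetteer contents, the function name `correct_with_gazetteer`, and the choice of GPE as the override label are all assumptions made for illustration, and the input list is a hand-copied subset of the output above.

```python
# Hypothetical gazetteer of geographic names whose label should be GPE.
GEO_GAZETTEER = {"Africa", "Central America"}


def correct_with_gazetteer(entities, gazetteer=GEO_GAZETTEER, label="GPE"):
    """Override the predicted type for any entity name found in the gazetteer."""
    return [
        (name, label if name in gazetteer else entity_type)
        for name, entity_type in entities
    ]


# Subset of the (name, type) pairs from the output above, copied by hand.
nltk_output = [
    ("FIFA", "ORGANIZATION"),
    ("Africa", "PERSON"),
    ("Central America", "ORGANIZATION"),
    ("Asia", "GPE"),
]

corrected = correct_with_gazetteer(nltk_output)
for name, entity_type in corrected:
    print(f"{name}: {entity_type}")
```

A lookup like this only fixes names that are already in the list, so it complements the chunker rather than replacing it.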

Source: https://blog.csdn.net/weixin_44799217/article/details/116430845
