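Tags: python regex tokenize nlp nltk
I am using the following regular expression, which is supposed to find the string 'U.S.A.', but it only returns 'A.'. Does anyone know what is wrong?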
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print(re.findall(r'([A-Z]\.)+', text))
#OUTPUT
['A.']
Expected output:
['U.S.A.']
I am following the NLTK Book, chapter 3.7 (here), which gives a set of regular expressions, but they just do not work. I have tried this in both Python 2.7 and 3.4.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
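nltk.regexp_tokenize() works the same way as re.findall(), so I suspect my Python somehow does not interpret the regular expression as expected. The regular expression listed above outputs: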
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]
Solution:
This is possibly related to the fact that regexes were previously compiled with nltk.internals.compile_regexp_to_noncapturing(), which was removed in v3.1 (see here).
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it does not work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-@&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
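The tuples of empty strings are consistent with how Python's re.findall() treats capturing groups: when a pattern contains groups, it returns the groups rather than the whole match, and a repeated group keeps only its last repetition. A minimal sketch using only the abbreviation rule on the sentence from the question (the variable names are illustrative and not part of NLTK):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

capturing = re.compile(r'([A-Z]\.)+')
non_capturing = re.compile(r'(?:[A-Z]\.)+')

# Number of capturing groups in each compiled pattern.
print(capturing.groups, non_capturing.groups)

# With one capturing group, findall() returns that group's text, and a
# repeated group keeps only its last repetition.
captured = capturing.findall(text)

# With no capturing groups, findall() returns each whole match.
whole = non_capturing.findall(text)

print(captured)
print(whole)
```

The first pattern yields ['A.'] and the second ['U.S.A.'], which matches the behaviour reported in the question.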
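By slightly changing how your regex groups are defined (making them non-capturing), you can get the same pattern to work in NLTK v3.1 with this regex: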
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-@&*]) # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
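The manual rewrite above could in principle be automated. The sketch below is a deliberately naive illustration: to_noncapturing is a hypothetical helper written for this example, not NLTK's removed compile_regexp_to_noncapturing(), and it assumes the pattern has no "(" inside a character class and no escaped backslash directly before a group.

```python
import re

def to_noncapturing(pattern):
    # Hypothetical helper: turn every plain "(" into "(?:".
    # The lookbehind skips an escaped parenthesis, and the lookahead skips
    # constructs that already start with "(?", such as "(?x)" or "(?:".
    # Naive by design: it does not understand character classes.
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

# The pattern that worked in NLTK v3.0.5, with its capturing groups.
old_pattern = r"""(?x)      # set flag to allow verbose regexps
      ([A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(\.\d+)?%?      # numbers, incl. currency and percentages
    | \w+([-']\w+)*         # words w/ optional internal hyphens/apostrophe
    | [+/\-@&*]             # special characters with meanings
"""

new_pattern = to_noncapturing(old_pattern)
line = "My weight is about 68 kg, +/- 10 grams."
tokens = re.findall(new_pattern, line)
print(tokens)
```

For anything beyond simple patterns like this one, writing the non-capturing groups by hand is likely the safer choice.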
Without NLTK, using Python's re module directly, we can see that the old regex pattern is not natively supported:
>>> import re
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-@&*] # special characters with meanings
... |\S\w* # any sequence of word characters
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
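If rewriting the groups is not an option, another approach with plain re is to iterate over match objects and take group(0), which is the whole match regardless of how many capturing groups the pattern has. A sketch, using the NLTK Book pattern and sentence from the question unchanged:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# The NLTK Book pattern from the question, capturing groups left as they are.
pattern = r'''(?x)          # set flag to allow verbose regexps
      ([A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
'''

# group(0) is always the whole match, whatever groups the pattern defines.
tokens = [m.group(0) for m in re.finditer(pattern, text)]
print(tokens)
```

This prints the token list the NLTK Book shows for this sentence, without touching the pattern itself.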
Note: the change in how NLTK's RegexpTokenizer compiles regexes makes the examples on NLTK's Regular Expression Tokenizer obsolete.
Source: https://codeday.me/bug/20190727/1556139.html