python - NLTK-based text processing with pandas

2019-09-17 22:06:02 Views: 200 Source: Internet

Tags: python pandas dataframe string nltk

When using nltk, removing punctuation and numbers and lowercasing the text are not working.

My code

import string
import nltk
from nltk.tokenize import word_tokenize

stopwords=nltk.corpus.stopwords.words('english')+ list(string.punctuation)
user_defined_stop_words=['st','rd','hong','kong']                    
new_stop_words=stopwords+user_defined_stop_words

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)

Sample input

23FLOOR 9 DES VOEUX RD WEST     HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG

Expected output

 floor des voeux west
 pag consulting flat aia central connaught central
 co city lost studios flat f hillier sheung

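Solution:

Your function is slow and incomplete. First, the issues:

>You are not lowercasing your data.
>You are not getting rid of digits and punctuation properly.
>You are not returning a string (you should join the list with str.join and return it).
>Furthermore, a list comprehension that does text processing is a prime way to introduce readability issues, not to mention possible redundancy (you may call a function several times, once for each condition it appears in).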
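To see why the digit and punctuation checks in the question miss most of the noise, here is a minimal sketch. The token list is hand-written for illustration (an assumption, not actual word_tokenize output): str.isdigit is only true for tokens made up entirely of digits, and membership in string.punctuation only matches single punctuation characters.

```python
import string

# Hand-written tokens for illustration (an assumption, not real tokenizer output).
tokens = ["23FLOOR", "9", "13-15", "4F", "C/O", ","]
punctuation = set(string.punctuation)

for tok in tokens:
    print(f"{tok!r:<10} isdigit={tok.isdigit()!s:<5} punctuation={tok in punctuation}")

# The same kind of filter as in the question: only "9" and "," are dropped.
kept = [t for t in tokens if not t.isdigit() and t.lower() not in punctuation]
print(kept)
```

Mixed tokens such as "23FLOOR" or "C/O" pass both checks untouched, which is why noise is better stripped at the character level before splitting.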

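Next, your function has a couple of glaring inefficiencies, especially in the stopword removal code.

>Your stopwords structure is a list, and membership checks on lists are slow. The first thing to do is convert it to a set, which makes the `not in` check constant time.
>You are using nltk.word_tokenize, which is unnecessarily slow.
>Lastly, you should not always rely on apply, even when you are working with NLTK, where a vectorized solution is rarely available. There are almost always other ways to do the exact same thing. Often even a plain Python loop is faster. But this is not set in stone.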
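The list-versus-set point can be checked with a rough timing sketch. The 2000-entry word list and the probe word are made up for illustration, and the absolute numbers will vary by machine.

```python
import timeit

# Hypothetical stopword collection, stored both ways.
words = [f"word{i}" for i in range(2000)]
stop_list = words
stop_set = set(words)

probe = "not-a-stopword"  # worst case for the list: every element is scanned

list_time = timeit.timeit(lambda: probe in stop_list, number=2000)
set_time = timeit.timeit(lambda: probe in stop_set, number=2000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The list lookup grows with the number of stopwords, while the set lookup is a hash lookup whose cost stays roughly flat.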

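First, create your enhanced stopwords as a set:

import string
import nltk

user_defined_stop_words = ['st','rd','hong','kong'] 

i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words

stopwords = set(i).union(j)  # a set gives constant-time membership checks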

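The next fix is to get rid of the list comprehension and turn it into a multi-line function. This makes things much easier to work with. Each line of your function should be dedicated to one particular task (for example, getting rid of digits/punctuation, removing stopwords, or lowercasing):

import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # lowercase, then drop digits and punctuation
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (stopwords is already a set)
    return ' '.join(x)                                # join the list back into a string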
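For a quick check without downloading the NLTK corpus, the same steps can be run on the sample addresses from the question. This is only a sketch: STOPWORDS here is a small hand-picked stand-in (an assumption) for the full NLTK English list plus the user-defined words.

```python
import re

# Assumption: a tiny stand-in for the NLTK English stopwords plus the
# user-defined words; the real set is much larger.
STOPWORDS = {"and", "st", "rd", "hong", "kong"}

def preprocess(text):
    text = re.sub(r"[^a-z\s]", "", text.lower())           # lowercase, strip noise
    words = [w for w in text.split() if w not in STOPWORDS]  # remove stopwords
    return " ".join(words)                                   # back to a string

samples = [
    "23FLOOR 9 DES VOEUX RD WEST     HONG KONG",
    "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
]
cleaned = [preprocess(s) for s in samples]
for line in cleaned:
    print(line)
```

On these three inputs the stand-in set is enough to reproduce the expected output from the question.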

As an example, this would then be applied to your column:

df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)

As an alternative, here is an approach that does not rely on apply. This should work well for short sentences.

Load your data into a series:

v = miss_data['Adj_Addr']
v

0            23FLOOR 9 DES VOEUX RD WEST     HONG KONG
1    PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2    C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object

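Now comes the heavy lifting.

>Lowercase with str.lower
>Remove noise with str.replace
>Split words into separate cells with str.split
>Apply stopword removal with pd.DataFrame.isin + pd.DataFrame.where
>Finally, join the dataframe back together with agg.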

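# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

v.where(~v.isin(stopwords) & v.notnull(), '')\
 .agg(' '.join, axis=1)\
 .str.replace(r'\s+', ' ', regex=True)\
 .str.strip()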

0                                 floor des voeux west
1    pag consulting flat aia central connaught central
2           co city lost studios flat f hillier sheung
dtype: object
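The pipeline above can be tried end to end with a self-contained sketch. It assumes pandas is installed, uses a small stand-in stopword list (an assumption) instead of the NLTK one, and passes regex=True explicitly because pandas 2.0 changed the default of str.replace to literal matching.

```python
import pandas as pd

# Assumption: small stand-in for the NLTK stopwords plus user-defined words.
STOPWORDS = ["and", "st", "rd", "hong", "kong"]

s = pd.Series([
    "23FLOOR 9 DES VOEUX RD WEST     HONG KONG",
    "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
])

# One word per cell; shorter rows are padded with missing values.
words = (
    s.str.lower()
     .str.replace(r"[^a-z\s]", "", regex=True)
     .str.split(expand=True)
)

# Blank out stopwords and padding, then join each row back into a string.
cleaned = (
    words.where(~words.isin(STOPWORDS) & words.notna(), "")
         .agg(" ".join, axis=1)
         .str.replace(r"\s+", " ", regex=True)
         .str.strip()
)
print(cleaned.tolist())
```

The intermediate frame has as many columns as the longest row has words, which is why this approach suits short sentences better than long documents.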

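To use this on multiple columns, place this code in a function preprocess2 and call apply:

def preprocess2(v):
    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
    v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ', regex=True)\
            .str.strip()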

c = ['Col1', 'Col2', ...] # columns to operate
df[c] = df[c].apply(preprocess2, axis=0)

You still need an apply call, but with a small number of columns it should not scale too badly. If you dislike apply, here is a loopy variant:

for _c in c:
    df[_c] = preprocess2(df[_c])

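Let's see the difference between our non-loopy version and the original:

# start from the original three-row column, repeated 100000 times
s = pd.concat([miss_data['Adj_Addr']] * 100000, ignore_index=True)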

s.size
300000

First, a sanity check:

preprocess2(s).eq(s.apply(preprocess)).all()
True

Now come the timings.

%timeit preprocess2(s)   
1 loop, best of 3: 13.8 s per loop

%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop

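This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense in this case, because we have optimized preprocess quite a bit, and string operations in pandas are seldom truly vectorized (they usually are, but the performance gain is not as large as you would expect).

Let's see if we can do better by bypassing apply and using np.vectorize: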

preprocess3 = np.vectorize(preprocess)

%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop

This is equivalent to apply, but happens to be a bit faster because of the reduced overhead around the "hidden" loop.
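Since apply, a plain Python loop and np.vectorize all call the same function once per element, they should agree on the result and differ only in overhead. The sketch below checks the agreement on two of the sample rows; it assumes numpy and pandas are installed, and STOPWORDS is again a small stand-in (an assumption) for the real set. otypes=[object] is passed so that np.vectorize does not need a trial call to infer the output type.

```python
import re

import numpy as np
import pandas as pd

# Assumption: small stand-in for the NLTK stopwords plus user-defined words.
STOPWORDS = {"and", "st", "rd", "hong", "kong"}

def preprocess(text):
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in STOPWORDS)

s = pd.Series([
    "23FLOOR 9 DES VOEUX RD WEST     HONG KONG",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
])

via_apply = s.apply(preprocess).tolist()
via_loop = [preprocess(x) for x in s]
via_vectorize = np.vectorize(preprocess, otypes=[object])(s.to_numpy()).tolist()

print(via_apply == via_loop == via_vectorize)
```

To compare speed on your own data, each of the three expressions can be wrapped in %timeit as in the timings above.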

Source: https://codeday.me/bug/20190917/1810084.html
