python-UnicodeDecodeError：“ ascii”编解码器无法解码位置40的字节0xc3：序数不在范围内(128)

2019-10-28 16:58:15 阅读：230 来源： 互联网

我试图将字典的具体内容保存到文件中,但是当我尝试编写它时,出现以下错误：

Traceback (most recent call last):
  File "P4.py", line 83, in <module>
    outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

这是代码：

from collections import Counter

with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)

with open("lexic.txt", "w") as outf:
    outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
    for word,count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
"""
2) TAGGING USING THE MODEL
Dados los ficheros de test, para cada palabra, asignarle el tag mas
probable segun el modelo. Guardar el resultado en ficheros que tengan
este formato para cada linea: Palabra  Prediccion
"""
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
diccionario = {}

"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t')  # Here we separate the String in a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): #We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])): #Here we check if the word was included before in the dictionary
            aux_list = diccionario.get(sintagma[0]) #We know the name already exists in the dic, so we create a List for every value
            aux_list.append([sintagma[1], sintagma[2]]) #We add to the list the tag and th ocurrences for this concrete word
            diccionario.update({sintagma[0]:aux_list}) #Update the value with the new list (new list = previous list + new appended element to the list)
        else: #If in the dic do not exist the key, que add the values to the empty list (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else})

"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])

For retrieve the information from diccionario, we have to keep in mind:

In case we have more than 1 Tag associated to a word (keyword ), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:

diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8')) #tagSugerido is the tag with more ocurrences for a concrete keyword
        maximo = float(diccionario.get(keyword)[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
        if ((len(diccionario.get(keyword))) > 2): #in case we have > 2 tags for a concrete word
            suma = float(diccionario.get(keyword)[1])
            for i in range (2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (diccionario.get(keyword)[i][1] > maximo):
                    tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8'))
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo/suma);
            diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})

        else:
            diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})

        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))

所需的输出将如下所示：

keyword(String)  tagSugerido(String):
Hello    NC
Friend   N
Run      V
...etc

冲突线是：

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

谢谢.

解决方法:

由于您没有提供简单明了的代码来说明您的问题,因此,我将向您提供一般性建议,以说明错误应该是什么：

如果遇到解码错误,那就是tagSugerido被读为ASCII而不是Unicode.要解决此问题,您应该执行以下操作：

tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))

将其存储为unicode.

然后,您可能会在write()阶段遇到编码错误,并且应通过以下方式解决您的写入问题：

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

应该：

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))

我随便回答了一个非常类似的问题moments ago.使用unicode字符串时,切换到python3,它将使您的生活更轻松！

如果您仍不能切换到python3,则可以使用python-future import语句使python2的行为几乎像python3：

from __future__ import absolute_import, division, print_function, unicode_literals

N.B .：而不是：

file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()

在读取行失败时将无法正确关闭文件描述符,您最好这样做：

with open("lexic.txt", "r") as f:
    data=f.readlines()

这将确保即使出现故障也始终关闭文件.

N.B.2：避免使用文件,因为这是您要隐藏的python类型,但请使用f或lexic_file…

标签：io,file-io,output,python
来源： https://codeday.me/bug/20191028/1953669.html

本站声明： 1. iCode9 技术分享网（下文简称本站）提供的所有内容，仅供技术学习、探讨和分享；
2. 关于本站的所有留言、评论、转载及引用，纯属内容发起人的个人观点，与本站观点和立场无关；
3. 关于本站的所有言论和文字，纯属内容发起人的个人观点，与本站观点和立场无关；
4. 本站文章均是网友提供，不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属；如您发现该文章侵犯了您的权益，可联系我们第一时间进行删除；
5. 本站为非盈利性的个人网站，所有内容不会用来进行牟利，也不会利用任何形式的广告来间接获益，纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

ICode9

python-UnicodeDecodeError：“ ascii”编解码器无法解码位置40的字节0xc3：序数不在范围内(128)