Why do all my decoded Strings end with a "?"? Java String decoding

2019-11-08 04:03:44

Tags: decoding tweepy string python java


I am using the Tweepy library (Python) and Kafka to retrieve tweets from Twitter. The text is encoded as UTF-8 like this:

self.producer.send('my-topic', data.encode('UTF-8'))

where "data" is a string. This data is then stored in key-value format in an Oracle NoSQL database, so the tweet itself is stored encoded. I do this in Java:

Value myValue = Value.createValue(msg.value().getBytes("UTF-8"));

Finally, the tweets are retrieved by a Formatter written in Java. To store a tweet in a relational schema I have to parse it, so I retrieve it as a String:

String data = new String(value.toByteArray(),StandardCharsets.UTF_8);
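
As a sanity check, the two Java steps above can be reproduced in isolation. The sketch below is only an illustration: the class name Utf8RoundTrip and the sample text are made up here, and it assumes nothing between encoding and decoding alters the bytes. Under that assumption a UTF-8 round trip returns the input unchanged, including non-ASCII characters such as the ellipsis U+2026.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encode to UTF-8 bytes and decode again, mirroring the two Java steps of the pipeline.
    public static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample ending in the ellipsis character U+2026.
        String sample = "no matter how mu\u2026";
        System.out.println(sample.equals(roundTrip(sample))); // true
    }
}
```

If this check passes on the real data as well, the encode and decode calls themselves are unlikely to be where the text changes.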

As you can see, I keep the UTF-8 encoding in every step. However, when I look at a tweet in the database, it is always cut off. For example:

RT @briIIohead: the hardest pill i had to swallow this year was learning that no matter how good you could be to somebody, no matter how mu?

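Note that it ends with a "?" symbol and has clearly been cut. This happens with every long tweet: a text of about 30 characters displays fine, but anything longer than roughly 100 characters is cut off.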

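At first I thought it might be my table definition, but the "Text" column is declared as VARCHAR2(400 CHAR), which is enough for the maximum number of characters a tweet can contain.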

Any ideas on how to find out where the text gets cut and where the "?" symbol is added at the end?

This is what "data" looks like:

{"created_at":"Tue May 28 09:23:36 +0000 2019","id":1133302792129351681,"id_str":"1133302792129351681","text":"RT @AppleEDU: Learn, create, and do more with iPad in your classroom. Get the free Everyone Can Create curriculum and bring projects to lif\u2026","source":"Twitter for iPhone<\/a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1060510851889750022,"id_str":"1060510851889750022","name":"Rem.0112","screen_name":"0112Rem","location":"Mawson Lakes, Adelaide","url":null,"description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":739,"friends_count":1853,"listed_count":10,"favourites_count":33406,"statuses_count":36936,"created_at":"Thu Nov 08 12:34:25 +0000 2018","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1093157842163355649\/6oAdJTCs_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1093157842163355649\/6oAdJTCs_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1060510851889750022\/1546155144","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Thu May 23 15:15:16 +0000 2019","id":1131579354964725760,"id_str":"1131579354964725760","text":"Learn, create, and do more with iPad in your classroom. 
Get the free Everyone Can Create curriculum and bring proje\u2026 https:\/\/t.co\/aeeSPTXtFx","source":"Twitter Ads Composer<\/a>","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":468741166,"id_str":"468741166","name":"Apple Education","screen_name":"AppleEDU","location":"Cupertino, CA","url":null,"description":"Spark new ideas, create more aha moments, and teach in ways you\u2019ve always imagined. Follow @AppleEDU for tips, updates, and inspiration.","translator_type":"none","protected":false,"verified":true,"followers_count":728781,"friends_count":273,"listed_count":2594,"favourites_count":13189,"statuses_count":2766,"created_at":"Thu Jan 19 21:26:14 +0000 2012","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"F0F0F0","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0088CC","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/892429342046691328\/2SOlm_09_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/892429342046691328\/2SOlm_09_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/468741166\/1530123538","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"Learn, create, and do more with iPad in your classroom. 
Get the free Everyone Can Create curriculum and bring projects to life through music, drawing, video and photography.","display_text_range":[0,173],"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]}},"quote_count":0,"reply_count":3,"retweet_count":3,"favorite_count":58,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/aeeSPTXtFx","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/1131579354964725760","display_url":"twitter.com\/i\/web\/status\/1\u2026","indices":[117,140]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"scopes":{"followers":false},"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"AppleEDU","name":"Apple Education","id":468741166,"id_str":"468741166","indices":[3,12]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1559035416048"}

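I should also mention that this whole block is encoded, then decoded, and finally parsed for insertion into the database. Every field is decoded and parsed correctly except "text", which is cut off.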
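
One detail worth noting in the sample: both "text" fields already end in \u2026 (the ellipsis character), and the complete text appears under extended_tweet.full_text inside retweeted_status. Below is a minimal sketch of picking the longest available text. It assumes the JSON has already been parsed into nested Maps by whatever JSON library the Formatter uses, and it needs Java 9 or later for Map.of. The class TweetText, its methods, and the shortened sample values are hypothetical names and data for illustration, not part of any library.

```java
import java.util.Map;

public class TweetText {
    // Returns the nested object stored under key, or null if it is absent or not a map.
    @SuppressWarnings("unchecked")
    private static Map<String, Object> child(Map<String, Object> node, String key) {
        Object value = node.get(key);
        return (value instanceof Map) ? (Map<String, Object>) value : null;
    }

    // Text of a single status: extended_tweet.full_text when present, otherwise text.
    private static String ownText(Map<String, Object> status) {
        Map<String, Object> extended = child(status, "extended_tweet");
        if (extended != null && extended.get("full_text") instanceof String) {
            return (String) extended.get("full_text");
        }
        return (String) status.get("text");
    }

    // For a retweet, use the original status under retweeted_status.
    // Note: this drops the "RT @user:" prefix of the retweet.
    public static String fullText(Map<String, Object> tweet) {
        Map<String, Object> original = child(tweet, "retweeted_status");
        return ownText(original != null ? original : tweet);
    }

    // Hypothetical, heavily shortened stand-in for the sample shown above.
    public static Map<String, Object> sampleRetweet() {
        Map<String, Object> extended = Map.of("full_text", "bring projects to life through music");
        Map<String, Object> original = Map.of("text", "bring proje\u2026", "extended_tweet", extended);
        return Map.of("text", "RT @AppleEDU: bring projects to lif\u2026", "retweeted_status", original);
    }

    public static void main(String[] args) {
        System.out.println(fullText(sampleRetweet())); // bring projects to life through music
    }
}
```

Whether to keep the "RT @user:" prefix is a design choice; the sketch returns only the original author's text.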

Solution:

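According to the official documentation, a tweet can contain at most 140 "characters" (loosely defined), but this limit was more recently raised to 280.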

The documentation says:

Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text.

So they first normalize the text (I will let you figure out how Java does that). They go on to say:

Twitter also counts the number of codepoints in the text rather than UTF-8 bytes.

Thus:

String test = "RT @briIIohead: the hardest pill i had to swallow this year was learning that no matter how good you could be to somebody, no matter how mu";
System.out.println(test.codePoints().count()); // 139
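
The normalization step that was left open can be done with java.text.Normalizer from the standard library. A small sketch follows, with TweetLength and nfcLength as illustrative names. It only mirrors the two quoted sentences (normalize to NFC, then count code points) and is not meant as a reimplementation of Twitter's full counting rules.

```java
import java.text.Normalizer;

public class TweetLength {
    // Counts the code points of the NFC-normalized text.
    public static long nfcLength(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePoints().count();
    }

    public static void main(String[] args) {
        // 'e' followed by the combining acute accent U+0301; NFC composes them into U+00E9.
        String decomposed = "cafe\u0301";
        System.out.println(decomposed.length());   // 5
        System.out.println(nfcLength(decomposed)); // 4
    }
}
```

For plain ASCII text such as the string counted above, normalization changes nothing and the result is the same as codePoints().count().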

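The original tweet appears to be 280 "characters" long, and the library you are using does not seem to know about that, so it only takes the first 140. Because some chunking is involved in doing so, the chunking also seems to be wrong and drops some "partial" bytes at the end. When you try to print those characters, Java does not know what the trailing bytes actually mean (because of the faulty chunking) and simply prints "?", which is the default strategy when something cannot be interpreted at all.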
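
One way a trailing "?" can arise in Java is sketched below, offered as a possibility rather than a diagnosis of this pipeline. The shortened text in the sample data ends with the ellipsis character U+2026, and when a String is encoded into a charset that has no mapping for a character, String.getBytes(Charset) substitutes the charset's replacement byte, which is "?" for ISO-8859-1. The sketch assumes that some step (for example a database or client character set) performs such a conversion; the class name EllipsisReplacement and the choice of ISO-8859-1 are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EllipsisReplacement {
    // Simulates writing text in a given charset and reading it back in the same charset.
    public static String through(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "no matter how mu\u2026"; // ends with the ellipsis U+2026
        // UTF-8 can represent the ellipsis, so the text survives unchanged.
        System.out.println(through(text, StandardCharsets.UTF_8).equals(text)); // true
        // ISO-8859-1 cannot, so the encoder substitutes '?'.
        System.out.println(through(text, StandardCharsets.ISO_8859_1)); // no matter how mu?
    }
}
```

Appending one ellipsis to the 139 code points counted earlier gives exactly 140, which would be consistent with a text shortened to the old limit and then passed through a charset that cannot represent the final character.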

Source: https://codeday.me/bug/20191108/2005233.html
