character

2022-05-07 21:02:54

Tags: Chinese, character, GBK, also, characters, bits


In computer and machine-based telecommunications terminology, a character is a unit of information that roughly corresponds to a grapheme [书写单位], grapheme-like unit, or symbol, such as in an alphabet or syllabary [音节文字] in the written form of a natural language.

Examples of characters include letters, numerical digits, common punctuation marks (such as "." or "-"), and whitespace. The concept also includes control characters, which do not correspond to visible symbols but rather to instructions to format or process the text. Examples of control characters include carriage return or tab, as well as instructions to printers or other devices that display or otherwise process text. White spaces can be created by pressing the Return key, spacebar key, or the Tab key, and can also be created by setting the document's margins.
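The kinds of characters listed above can be told apart programmatically. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` and the sample table are illustrative assumptions, not part of the original text.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the Unicode general category of a single character."""
    return unicodedata.category(ch)


# Hypothetical sample set: one character for each kind mentioned in the text.
samples = {
    "letter": "a",
    "digit": "7",
    "punctuation": ".",
    "space": " ",
    "tab": "\t",
    "carriage return": "\r",
}

for name, ch in samples.items():
    # repr() keeps control characters readable instead of printing them raw.
    print(f"{name:>16} {ch!r:>6} U+{ord(ch):04X} {describe(ch)}")
```

Tab and carriage return both report the category `Cc` (control), while the ordinary space reports `Zs` (space separator), which matches the distinction drawn between control characters and visible or spacing symbols.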

Characters are typically combined into strings.

Historically, the term character was also used to just denote a specific number of contiguous bits. While a character is most commonly assumed to refer to 8 bits (one byte) today, other definitions, like 4 bits or 6 bits, have been used in the past as well.

Computers and communication equipment represent characters using a character encoding that assigns each character to something that can be stored or transmitted through a network, typically an integer quantity represented by a sequence of digits. Two common encodings are ASCII and the UTF-8 encoding for Unicode.
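As a small illustration of this mapping, the sketch below uses Python's built-in `ord`, `chr`, and `str.encode`. The helper names `code_point` and `encoded_bytes` are made up for this example.

```python
def code_point(ch: str) -> int:
    """Return the integer that Unicode assigns to a single character."""
    return ord(ch)


def encoded_bytes(text: str, encoding: str) -> bytes:
    """Return the byte sequence that the given encoding stores or transmits."""
    return text.encode(encoding)


print(code_point("A"))               # 65
print(chr(65))                       # A
print(encoded_bytes("A", "ascii"))   # b'A'
print(encoded_bytes("A", "utf-8"))   # b'A'

# ASCII only covers code points 0 to 127, so 'ï' cannot be encoded with it.
try:
    encoded_bytes("ï", "ascii")
except UnicodeEncodeError as exc:
    print("ascii cannot encode:", exc.object[exc.start:exc.end])

print(encoded_bytes("ï", "utf-8").hex())  # c3af
```

For characters in the ASCII range the two encodings produce the same bytes, which is one reason UTF-8 is backward compatible with ASCII.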

A char in the C programming language is a data type with the size of exactly one byte, which in turn is defined to be large enough to contain any member of the "basic execution character set". The exact number of bits can be checked via the CHAR_BIT macro. By far the most common size is 8 bits, and the POSIX standard requires it to be 8 bits. In newer C standards, char is required to hold UTF-8 code units, which requires a minimum size of 8 bits.

A Unicode code point may require up to 21 bits. This will not fit in a char on most systems, so some code points are stored in more than one char, as in the variable-length encoding UTF-8, where each code point takes 1 to 4 bytes. Furthermore, a "character" may require more than one code point (for instance with combining characters), depending on what is meant by the word "character".
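The 1-to-4-byte range can be checked directly. This is a rough sketch in Python; the function name `utf8_length` and the choice of sample characters are assumptions made for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    return len(ch.encode("utf-8"))


# The largest Unicode code point, U+10FFFF, needs 21 bits.
print((0x10FFFF).bit_length())  # 21

# One sample per UTF-8 length class.
for ch in ["A", "ï", "汉", "😀"]:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")
```

The four samples take 1, 2, 3, and 4 bytes respectively, so a string's byte length in UTF-8 generally differs from its number of code points.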

It's possible to code the middle character of the word 'naïve' either as the single character 'ï' (U+00EF LATIN SMALL LETTER I WITH DIAERESIS) or as the character 'i' followed by the combining diaeresis [分音符号] (U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS). The Unicode standard considers the two forms canonically equivalent.
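Canonical equivalence can be observed with Unicode normalization. The sketch below relies on Python's standard `unicodedata.normalize`; the function name `canonically_equal` is hypothetical, and comparing NFC forms is just one way to test equivalence.

```python
import unicodedata

composed = "na\u00efve"     # 'ï' as the single code point U+00EF
decomposed = "nai\u0308ve"  # 'i' followed by U+0308 COMBINING DIAERESIS


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))           # 5 6
print(composed == decomposed)                   # False
print(canonically_equal(composed, decomposed))  # True
```

A plain `==` compares code points, so the two spellings differ until both are normalized to the same form (NFC composes, NFD decomposes).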

Chinese characters, also called Hanzi (simplified Chinese: 汉字; traditional Chinese: 漢字; pinyin: Hànzì; lit. 'Han characters'), are logograms [标记] developed for the writing of Chinese. They have been adapted to write other Asian languages, and remain a key component of the Japanese writing system where they are known as kanji. Chinese characters are the oldest continuously used system of writing in the world. By virtue of their widespread current use in East Asia, and historic use throughout the Sinosphere, Chinese characters are among the most widely adopted writing systems in the world by number of users.

Pictograms [象形字] are highly stylized and simplified pictures of material objects. Examples of pictograms include 日 rì for "sun", 月 yuè for "moon", and 木 mù for "tree" or "wood". Xu Shen placed approximately 4% of characters in this category. Though few in number and expressing literal objects, pictograms and ideograms are nonetheless the basis on which all the more complex characters such as associative idea characters (会意字) and pictophonetic characters (形声字) are formed.

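The Shuowen Jiezi (《说文解字》), often shortened to Shuowen (《说文》), is a language reference work compiled by Xu Shen (许慎), a scholar of the classics and philologist of the Eastern Han dynasty. It is the earliest Chinese dictionary to systematically analyze the forms of Chinese characters and investigate their origins, and it is also one of the earliest dictionaries in the world. The work has fifteen volumes, of which the first fourteen explain characters, with the headwords written in small seal script.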

Pictograms are primary characters in the sense that they, along with ideograms (indicative characters, i.e. symbols), are the building blocks of compound characters (会意字) and picto-phonetic characters (形声字).

Simple ideograms [指事字 zhǐshìzì] are also called simple indicatives. This small category contains characters that are direct iconic illustrations. Examples include 上 shàng "up" and 下 xià "down", originally a dot above and below a line. Indicative characters are symbols for abstract concepts which could not be depicted literally but nonetheless can be expressed as a visual symbol e.g. convex 凸, concave 凹, flat-and-level 平.

Associative idea characters (compound conceptual characters) [会意字 / 會意字 huìyìzì] are also translated as logical aggregates or associative compounds. These characters have been interpreted as combining two or more pictographic or ideographic characters to suggest a third meaning. The canonical example is 明 "bright": it associates the two brightest objects in the sky, the sun 日 and the moon 月, to express the idea of "bright".

Rebus characters [假借字 jiǎjièzì] are also called borrowings or phonetic loan characters. The rebus category covers cases where an existing character is used to represent an unrelated word with similar or identical pronunciation. Sometimes the old meaning is then lost entirely, as with 自 zì, which no longer carries its original meaning of "nose" and exclusively means "oneself", or 萬 wàn, which originally meant "scorpion" but is now used only in the sense of "ten thousand".

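The shift from "nose" to "oneself" presumably comes from pointing at one's own nose to indicate oneself. :-) 萬 first appears in the oracle bone script of the Shang dynasty, where its shape resembles a scorpion. Later the pictophonetic character 虿 (chài) was created to carry the original meaning of 萬, while 万 was borrowed as a numeral.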

GBK is an extension of the GB2312 character set for Simplified Chinese characters, used in the People's Republic of China. Since its initial release in 1993, GBK has been extended by Microsoft in Code page 936/1386, which was then extended into GBK 1.0. GBK is also the internet name registered with IANA (the Internet Assigned Numbers Authority) for the Microsoft mapping, which differs from other implementations primarily by the single-byte euro sign at 0x80.

GB abbreviates Guojia Biaozhun, which means national standard in Chinese, while K stands for Extension (扩展 kuòzhǎn). GBK not only extended the old standard GB2312 with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB2312 in 1981.
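The superset relationship can be probed with Python's built-in `gb2312` and `gbk` codecs. This is only a sketch: the helper `can_encode` is hypothetical, and the pair 汉 / 漢 is chosen here as an example of a simplified character and its traditional form.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


simplified = "汉"   # present in GB2312
traditional = "漢"  # traditional form, outside GB2312

print(can_encode(simplified, "gb2312"))   # True
print(can_encode(traditional, "gb2312"))  # False
print(can_encode(traditional, "gbk"))     # True

# Characters shared by both sets keep the same two-byte code in GBK.
print(simplified.encode("gb2312") == simplified.encode("gbk"))  # True
```

Because GBK keeps the GB2312 byte values unchanged, text that is valid GB2312 is also valid GBK, but not the other way round.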

As of March 2022, GBK is the second-most popular Chinese encoding (after its subset GB2312): 2.3% of web pages served from China and territories declare it, as do 0.1% of all web pages globally. These figures count only pages marked as GBK. However, all major web browsers decode documents marked as e.g. "GB 2312" or "GB2312" as if they were marked "gbk" (while not all do so for pages marked "GB_2312"), and GBK and its subset GB 2312 have a combined share of 7.7% (or less than 0.2% globally).

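These figures do not mean that there are few Chinese web pages, only that few pages use the GBK and GB2312 encodings. Many Chinese web pages probably use UTF-8.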
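The browser behaviour of treating a "GB2312" label as "gbk" can be imitated in a few lines. The sketch below is an assumption-laden illustration, not how any particular browser is implemented: the alias table `LABEL_ALIASES` and the function `decode_like_browser` are hypothetical, and only the labels named in the text are listed.

```python
# Hypothetical alias table: labels that get decoded with the GBK decoder.
LABEL_ALIASES = {
    "gbk": "gbk",
    "gb2312": "gbk",
    "gb 2312": "gbk",
}


def decode_like_browser(data: bytes, label: str) -> str:
    """Decode data, mapping GB2312-style labels to GBK before decoding."""
    key = label.strip().lower()
    codec = LABEL_ALIASES.get(key, key)
    return data.decode(codec)


# A page whose bytes contain a GBK-only character but which is labeled GB2312.
page_bytes = "漢字".encode("gbk")

try:
    page_bytes.decode("gb2312")  # strict GB2312 decoding rejects these bytes
except UnicodeDecodeError:
    print("strict gb2312 decoding failed")

print(decode_like_browser(page_bytes, "GB2312"))  # 漢字
```

Decoding with the superset is safe for correctly labeled GB2312 pages and also recovers pages that use GBK-only characters under a GB2312 label.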

CET-6 / postgraduate entrance exam vocabulary: compute, telecommunications, correspond, alphabet, numerical, digit, instruct, denote, equip, assign, transmit, data, execute, equivalent, component, literal, nonetheless, compound, abstract, depict, translate, logic, aggregate, interpret, tertiary, pronounce, oneself, differ, euro, abbreviation, nationwide, march, web, territory

Source: https://www.cnblogs.com/funwithwords/p/16244099.html
