Redis 6.0.5 HyperLogLog reading notes, part 1: translation of the cardinality-estimation preface

2021-03-11 17:35:09

Tags: registers, redis6.0, HyperLogLog, HLL, set, sparse, representation, preface


/* The Redis HyperLogLog implementation is based on the following ideas:
Redis's HyperLogLog implementation rests on the following ideas:
 * * The use of a 64 bit hash function as proposed in [1], in order to don't
 *   limited to cardinalities up to 10^9, at the cost of just 1 additional
 *   bit per register.
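A 64-bit hash function, as proposed in [1], lifts the limit of roughly 10^9 on the cardinalities that can be estimated, at the cost of only 1 extra bit per register.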
 * * The use of 16384 6-bit registers for a great level of accuracy, using
 *   a total of 12k per key.
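16384 registers of 6 bits each give a high level of accuracy, for a total of 12 KB per key (2^14 registers * 6 bits = 98304 bits = 12288 bytes). The standard error of HyperLogLog with m registers is about 1.04/sqrt(m), which is about 0.81% for m = 16384.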
 * * The use of the Redis string data type. No new type is introduced.
 * * No attempt is made to compress the data structure as in [1]. Also the
 *   algorithm used is the original HyperLogLog Algorithm as in [2], with
 *   the only difference that a 64 bit hash function is used, so no correction
 *   is performed for values near 2^32 as in [1].
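The data is stored in an ordinary Redis string; no new data type is introduced.
The compression of the data structure described in [1] is not used. The algorithm is the original HyperLogLog algorithm from [2]; the only difference is the 64-bit hash function,
which makes the correction that [1] applies to values near 2^32 unnecessary.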
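The arithmetic behind these choices can be sketched in a few lines of C. This is a minimal illustration, not the Redis code itself: it assumes the 64-bit hash is split into 14 index bits (2^14 = 16384 registers) and 50 remaining bits, and the names `HLL_P`, `HLL_Q` and `hll_pattern_len` are chosen for this sketch. The hash function itself is left out; any 64-bit hash value can be passed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HLL_P 14                       /* hash bits used as the register index */
#define HLL_Q (64 - HLL_P)             /* remaining 50 bits feed the run count */
#define HLL_REGISTERS (1 << HLL_P)     /* 16384 registers */
#define HLL_BITS 6                     /* bits per register */
#define HLL_DENSE_BYTES (HLL_REGISTERS * HLL_BITS / 8)   /* 12288 bytes */

/* Split a 64-bit hash into a register index (stored in *regp) and a count.
 * The count is the 1-based position of the first set bit in the upper 50
 * bits, scanning from the least significant side. A guard bit placed just
 * above those 50 bits caps the count at HLL_Q + 1 = 51, which fits in a
 * 6-bit register. */
int hll_pattern_len(uint64_t hash, long *regp) {
    assert(regp != NULL);
    *regp = (long)(hash & (HLL_REGISTERS - 1));
    uint64_t bits = hash >> HLL_P;
    bits |= (uint64_t)1 << HLL_Q;      /* guard bit: the loop always stops */
    int count = 1;
    while ((bits & 1) == 0) {
        count++;
        bits >>= 1;
    }
    return count;
}
```

With a 32-bit hash and the same 14 index bits only 18 bits would remain, so a 5-bit register would be enough; moving to 64 bits is what costs the one extra bit per register mentioned above.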
 * [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic
 *     Engineering of a State of The Art Cardinality Estimation Algorithm.
 *
 * [2] P. Flajolet, Eric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The
 *     analysis of a near-optimal cardinality estimation algorithm.
(Translations of these two papers are left for a later post.)
 * Redis uses two representations:
Redis uses two representations:
 * 1) A "dense" representation where every entry is represented by
 *    a 6-bit integer.
Dense representation: every register is stored as a 6-bit integer.
 * 2) A "sparse" representation using run length compression suitable
 *    for representing HyperLogLogs with many registers set to 0 in
 *    a memory efficient way.
Sparse representation: run-length encoding, which stores a HyperLogLog whose registers are mostly 0 in a memory-efficient way.
 *
 * HLL header
 * ===
 *
 * Both the dense and sparse representation have a 16 byte header as follows:
Both the dense and the sparse representation start with the following 16-byte header:
 * +------+---+-----+----------+
 * | HYLL | E | N/U | Cardin.  |
 * +------+---+-----+----------+
 *
 * The first 4 bytes are a magic string set to the bytes "HYLL".
 * "E" is one byte encoding, currently set to HLL_DENSE or
 * HLL_SPARSE. N/U are three not used bytes.
The first 4 bytes are a magic string holding the bytes "HYLL".
"E" is a 1-byte encoding field, currently either HLL_DENSE or HLL_SPARSE. N/U are 3 bytes that are not used yet.
 * The "Cardin." field is a 64 bit integer stored in little endian format
 * with the latest cardinality computed that can be reused if the data
 * structure was not modified since the last computation (this is useful
 * because there are high probabilities that HLLADD operations don't
 * modify the actual data structure and hence the approximated cardinality).
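The "Cardin." field is a 64-bit little-endian integer that caches the most recently computed cardinality.
The cached value can be reused as long as the data structure has not been modified since it was computed.
(This is useful because an add operation very often leaves the registers, and therefore the approximated cardinality, unchanged. The comment calls the operation HLLADD; the user-facing command is PFADD.)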

 * When the most significant bit in the most significant byte of the cached
 * cardinality is set, it means that the data structure was modified and
 * we can't reuse the cached value that must be recomputed.
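When the most significant bit of the most significant byte of the cached cardinality is set,
the data structure has been modified since the value was cached, so the cached value cannot be reused and must be recomputed.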
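The header layout described above can be read with a short C routine. This is a sketch under stated assumptions: the struct `hll_header_view` and the function `hll_parse_header` are hypothetical names for this illustration, and the numeric values 0 and 1 for `HLL_DENSE` and `HLL_SPARSE` are assumed here rather than taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HLL_HDR_SIZE 16
#define HLL_DENSE 0     /* assumed value */
#define HLL_SPARSE 1    /* assumed value */

/* Decoded view of the 16-byte header (hypothetical helper type). */
struct hll_header_view {
    uint8_t encoding;       /* byte 4: HLL_DENSE or HLL_SPARSE */
    int cache_valid;        /* 1 if the cached cardinality may be reused */
    uint64_t cardinality;   /* cached value, meaningful only if cache_valid */
};

/* Returns 0 on success, -1 if the buffer is too short or the magic is wrong. */
int hll_parse_header(const uint8_t *buf, size_t len, struct hll_header_view *out) {
    assert(buf != NULL && out != NULL);
    if (len < HLL_HDR_SIZE || memcmp(buf, "HYLL", 4) != 0) return -1;
    out->encoding = buf[4];             /* bytes 5..7 are unused */
    const uint8_t *card = buf + 8;      /* 8 bytes, little endian */
    /* The top bit of the most significant byte (card[7]) flags a stale cache. */
    out->cache_valid = (card[7] & 0x80) == 0;
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | card[i];
    out->cardinality = out->cache_valid ? v : 0;
    return 0;
}
```

Because the flag lives in the top bit of the last byte, invalidating the cache is a single-byte write, and a valid cached value never has that bit set.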
 * Dense representation
 * ===
 *
 * The dense representation used by Redis is the following:
The dense representation used by Redis is laid out as follows:
 * +--------+--------+--------+------//      //--+
 * |11000000|22221111|33333322|55444444 ....     |
 * +--------+--------+--------+------//      //--+
 *
 * The 6 bits counters are encoded one after the other starting from the
 * LSB to the MSB, and using the next bytes as needed.
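The 6-bit counters are packed one after another, from the least significant bit (LSB) to the most significant bit (MSB) of each byte, spilling into the next byte as needed.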
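The packing in the diagram can be reproduced with a pair of accessors. This is a minimal sketch of LSB-first 6-bit packing, not the macros Redis uses; `dense_get` and `dense_set` are names chosen for this illustration, and the code only touches the following byte when a register actually straddles a byte boundary.

```c
#include <assert.h>
#include <stdint.h>

#define HLL_REGISTERS 16384
#define HLL_BITS 6
#define HLL_REGISTER_MAX ((1 << HLL_BITS) - 1)            /* 63 */
#define HLL_DENSE_BYTES (HLL_REGISTERS * HLL_BITS / 8)    /* 12288 */

/* Read register regnum from an LSB-first packed array of 6-bit counters. */
uint8_t dense_get(const uint8_t *regs, unsigned regnum) {
    assert(regnum < HLL_REGISTERS);
    unsigned bit = regnum * HLL_BITS;
    unsigned byte = bit / 8, fb = bit % 8;     /* fb: first bit inside the byte */
    unsigned v = regs[byte] >> fb;
    if (fb > 8 - HLL_BITS)                     /* register spills into next byte */
        v |= (unsigned)regs[byte + 1] << (8 - fb);
    return (uint8_t)(v & HLL_REGISTER_MAX);
}

/* Write val (0..63) into register regnum without disturbing its neighbours. */
void dense_set(uint8_t *regs, unsigned regnum, uint8_t val) {
    assert(regnum < HLL_REGISTERS && val <= HLL_REGISTER_MAX);
    unsigned bit = regnum * HLL_BITS;
    unsigned byte = bit / 8, fb = bit % 8;
    regs[byte] = (uint8_t)((regs[byte] & ~(HLL_REGISTER_MAX << fb)) | (val << fb));
    if (fb > 8 - HLL_BITS) {
        unsigned low = 8 - fb;                 /* bits already stored in regs[byte] */
        regs[byte + 1] = (uint8_t)((regs[byte + 1] & ~(HLL_REGISTER_MAX >> low))
                                   | (val >> low));
    }
}
```

Register 1, for example, starts at bit 6: its low 2 bits occupy the top of byte 0 and its high 4 bits the bottom of byte 1, which is the `11000000|22221111` pattern in the diagram.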

 * Sparse representation
 * ===
 *
 * The sparse representation encodes registers using a run length
 * encoding composed of three opcodes, two using one byte, and one using
 * of two bytes. The opcodes are called ZERO, XZERO and VAL.
The sparse representation run-length encodes the registers with three opcodes: two take one byte and one takes two bytes.
The three opcodes are ZERO, XZERO and VAL.
 * ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented
 * by the six bits 'xxxxxx', plus 1, means that there are N registers set
 * to 0. This opcode can represent from 1 to 64 contiguous registers set
 * to the value of 0.
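The ZERO opcode is encoded as 00xxxxxx. The 6-bit integer 'xxxxxx', plus 1, is the number N of consecutive registers set to 0.
This opcode can therefore represent from 1 to 64 contiguous zero registers.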
 * XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit
 * integer represented by the bits 'xxxxxx' as most significant bits and
 * 'yyyyyyyy' as least significant bits, plus 1, means that there are N
 * registers set to 0. This opcode can represent from 0 to 16384 contiguous
 * registers set to the value of 0.
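The XZERO opcode is encoded in two bytes, 01xxxxxx yyyyyyyy. The 14-bit integer formed by 'xxxxxx' (the 6 most significant bits) and 'yyyyyyyy' (the 8 least significant bits),
plus 1, is the number N of consecutive registers set to 0. The comment gives the range as 0 to 16384, but because of the plus 1 the lower bound is actually 1: the opcode represents from 1 to 16384 contiguous zero registers.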
 * VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer
 * representing the value of a register, and a 2-bit integer representing
 * the number of contiguous registers set to that value 'vvvvv'.
 * To obtain the value and run length, the integers vvvvv and xx must be
 * incremented by one. This opcode can represent values from 1 to 32,
 * repeated from 1 to 4 times.
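The VAL opcode is encoded as 1vvvvvxx: the 5-bit integer 'vvvvv' encodes the register value and the 2-bit integer 'xx' encodes how many contiguous registers hold that value.
Both vvvvv and xx must be incremented by 1 to obtain the value and the run length, so this opcode represents values from 1 to 32 repeated from 1 to 4 times.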
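The three opcode formats can be decoded with simple bit masks. The following is a sketch written directly from the bit layouts described above; the type `sparse_run` and the function `sparse_decode` are hypothetical names for this illustration, and the caller is assumed to guarantee that two bytes are readable when the opcode is XZERO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sparse_op { OP_ZERO, OP_XZERO, OP_VAL };

/* One decoded opcode (hypothetical helper type). */
struct sparse_run {
    enum sparse_op op;
    unsigned value;    /* register value: 0 for ZERO/XZERO, 1..32 for VAL */
    unsigned runlen;   /* number of contiguous registers covered */
    unsigned size;     /* bytes occupied by the opcode: 1 or 2 */
};

/* Decode the opcode that starts at p. */
struct sparse_run sparse_decode(const uint8_t *p) {
    assert(p != NULL);
    struct sparse_run r;
    if ((p[0] & 0x80) != 0) {              /* 1vvvvvxx: VAL */
        r.op = OP_VAL;
        r.value = ((p[0] >> 2) & 0x1F) + 1;
        r.runlen = (p[0] & 0x03) + 1;
        r.size = 1;
    } else if ((p[0] & 0x40) != 0) {       /* 01xxxxxx yyyyyyyy: XZERO */
        r.op = OP_XZERO;
        r.value = 0;
        r.runlen = (((unsigned)(p[0] & 0x3F) << 8) | p[1]) + 1;
        r.size = 2;
    } else {                               /* 00xxxxxx: ZERO */
        r.op = OP_ZERO;
        r.value = 0;
        r.runlen = (p[0] & 0x3F) + 1;
        r.size = 1;
    }
    return r;
}
```

The "plus 1" on every field is what lets 6 bits cover runs of 1 to 64, 14 bits cover runs of 1 to 16384, and 5 bits cover values 1 to 32: a run or value of 0 would be meaningless, so the encoding does not waste a code point on it.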
 * The sparse representation can't represent registers with a value greater
 * than 32, however it is very unlikely that we find such a register in an
 * HLL with a cardinality where the sparse representation is still more
 * memory efficient than the dense representation. When this happens the
 * HLL is converted to the dense representation.
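The sparse representation cannot represent a register whose value is greater than 32. Such a register is very unlikely to appear while the cardinality is still low enough for the sparse representation to be more memory efficient than the dense one.
If it does appear, the HLL is converted to the dense representation.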
 * The sparse representation is purely positional. For example a sparse
 * representation of an empty HLL is just: XZERO:16384.
The sparse representation is purely positional. For example, an empty HLL is represented by the single opcode XZERO:16384.
 * An HLL having only 3 non-zero registers at position 1000, 1020, 1021
 * respectively set to 2, 3, 3, is represented by the following three
 * opcodes:
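An HLL with only 3 non-zero registers, at positions 1000, 1020 and 1021 and set to 2, 3 and 3 respectively, is represented by the following opcodes. (The comment says "three opcodes", but the list has five.)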
 * XZERO:1000 (Registers 0-999 are set to 0)
 * VAL:2,1    (1 register set to value 2, that is register 1000)
 * ZERO:19    (Registers 1001-1019 set to 0)
 * VAL:3,2    (2 registers set to value 3, that is registers 1020,1021)
 * XZERO:15362 (Registers 1022-16383 set to 0)
 *
 * In the example the sparse representation used just 7 bytes instead
 * of 12k in order to represent the HLL registers. In general for low
 * cardinality there is a big win in terms of space efficiency, traded
 * with CPU time since the sparse representation is slower to access:
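In this example the sparse representation needs just 7 bytes, instead of 12 KB, to represent the HLL registers. In general, for low cardinalities there is a big gain in space efficiency,
paid for with CPU time, because the sparse representation is slower to access.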
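The 7-byte figure can be checked by encoding the five opcodes of the example and expanding them back into registers. This is a sketch that follows the bit layouts given earlier; `put_zero`, `put_xzero`, `put_val`, `build_example` and `sparse_expand` are hypothetical helpers written for this illustration. The expansion also shows why access is slower: finding register i means walking the opcodes from the start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HLL_REGISTERS 16384

/* Each encoder writes one opcode at p and returns the bytes written. */
size_t put_zero(uint8_t *p, unsigned len) {             /* 00xxxxxx */
    assert(len >= 1 && len <= 64);
    p[0] = (uint8_t)(len - 1);
    return 1;
}

size_t put_xzero(uint8_t *p, unsigned len) {            /* 01xxxxxx yyyyyyyy */
    assert(len >= 1 && len <= 16384);
    unsigned v = len - 1;
    p[0] = (uint8_t)(0x40 | (v >> 8));
    p[1] = (uint8_t)(v & 0xFF);
    return 2;
}

size_t put_val(uint8_t *p, unsigned value, unsigned len) {   /* 1vvvvvxx */
    assert(value >= 1 && value <= 32 && len >= 1 && len <= 4);
    p[0] = (uint8_t)(0x80 | ((value - 1) << 2) | (len - 1));
    return 1;
}

/* Encode the example: registers 1000, 1020, 1021 set to 2, 3, 3. */
size_t build_example(uint8_t *buf) {
    size_t n = 0;
    n += put_xzero(buf + n, 1000);    /* registers 0-999 */
    n += put_val(buf + n, 2, 1);      /* register 1000 */
    n += put_zero(buf + n, 19);       /* registers 1001-1019 */
    n += put_val(buf + n, 3, 2);      /* registers 1020-1021 */
    n += put_xzero(buf + n, 15362);   /* registers 1022-16383 */
    return n;
}

/* Expand a sparse byte sequence into one byte per register. Returns the
 * number of registers written; anything other than HLL_REGISTERS means
 * the input was truncated or malformed. */
unsigned sparse_expand(const uint8_t *p, size_t len, uint8_t *regs) {
    const uint8_t *end = p + len;
    unsigned idx = 0;
    while (p < end) {
        unsigned value = 0, run;
        if ((*p & 0x80) != 0) {                    /* VAL */
            value = ((*p >> 2) & 0x1F) + 1;
            run = (*p & 0x03) + 1;
            p += 1;
        } else if ((*p & 0x40) != 0) {             /* XZERO */
            if (p + 1 >= end) break;               /* second byte missing */
            run = (((unsigned)(*p & 0x3F) << 8) | p[1]) + 1;
            p += 2;
        } else {                                   /* ZERO */
            run = (*p & 0x3F) + 1;
            p += 1;
        }
        if (idx + run > HLL_REGISTERS) break;      /* would overflow the set */
        for (unsigned i = 0; i < run; i++) regs[idx++] = (uint8_t)value;
    }
    return idx;
}
```

The byte count is 2 + 1 + 1 + 1 + 2 = 7, and the run lengths 1000 + 1 + 19 + 2 + 15362 add up to exactly 16384 registers, which is what "purely positional" requires.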

 * The following table shows average cardinality vs bytes used, 100
 * samples per cardinality (when the set was not representable because
 * of registers with too big value, the dense representation size was used
 * as a sample).
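The table below shows cardinality against the average number of bytes used,
with 100 samples per cardinality (when a set could not be represented sparsely because a register value was too large, the dense representation size was used as the sample).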

 * 100 267
 * 200 485
 * 300 678
 * 400 859
 * 500 1033
 * 600 1205
 * 700 1375
 * 800 1544
 * 900 1713
 * 1000 1882
 * 2000 3480
 * 3000 4879
 * 4000 6089
 * 5000 7138
 * 6000 8042
 * 7000 8823
 * 8000 9500
 * 9000 10088
 * 10000 10591
 *
 * The dense representation uses 12288 bytes, so there is a big win up to
 * a cardinality of ~2000-3000. For bigger cardinalities the constant times
 * involved in updating the sparse representation is not justified by the
 * memory savings. The exact maximum length of the sparse representation
 * when this implementation switches to the dense representation is
 * configured via the define server.hll_sparse_max_bytes.
 */
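The dense representation uses 12288 bytes, so the sparse representation is a big win up to a cardinality of roughly 2000-3000.
For larger cardinalities, the constant-factor cost of updating the sparse representation is not justified by the memory saved.
The exact maximum length of the sparse representation, at which the implementation switches to the dense representation, is configured through server.hll_sparse_max_bytes.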

In config.c the default value is 3000:
createSizeTConfig("hll-sparse-max-bytes", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.hll_sparse_max_bytes, 3000, MEMORY_CONFIG, NULL, NULL)
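The two conditions for leaving the sparse representation (a register value above 32, or a sparse encoding longer than the configured limit) can be combined with the table above to see where the default of 3000 bytes falls. This is a sketch only: `sparse_must_promote` and `first_sample_over` are hypothetical helpers, the table rows are copied from the comment, and the real promotion logic in Redis may check these conditions at different points.

```c
#include <assert.h>
#include <stddef.h>

#define HLL_SPARSE_VAL_MAX_VALUE 32   /* largest value a VAL opcode can hold */
#define HLL_REGISTER_MAX 63           /* largest value of a 6-bit register */

/* Average sparse size per cardinality, copied from the table above. */
static const struct { unsigned cardinality; unsigned avg_bytes; } samples[] = {
    {100, 267},   {200, 485},   {300, 678},   {400, 859},   {500, 1033},
    {600, 1205},  {700, 1375},  {800, 1544},  {900, 1713},  {1000, 1882},
    {2000, 3480}, {3000, 4879}, {4000, 6089}, {5000, 7138}, {6000, 8042},
    {7000, 8823}, {8000, 9500}, {9000, 10088}, {10000, 10591},
};

/* Nonzero when the sparse form can no longer be used: either the register
 * value does not fit in a VAL opcode, or the encoding exceeds max_bytes. */
int sparse_must_promote(size_t sparse_len, size_t max_bytes, unsigned reg_value) {
    assert(reg_value <= HLL_REGISTER_MAX);
    return reg_value > HLL_SPARSE_VAL_MAX_VALUE || sparse_len > max_bytes;
}

/* First sampled cardinality whose average sparse size exceeds max_bytes,
 * or 0 if no sampled row does. */
unsigned first_sample_over(size_t max_bytes) {
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (samples[i].avg_bytes > max_bytes) return samples[i].cardinality;
    return 0;
}
```

With the default limit of 3000 bytes, the first sampled cardinality whose average size exceeds the limit is 2000 (3480 bytes), so on average a key switches to the dense representation somewhere between 1000 and 2000 distinct elements.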

Source: https://www.cnblogs.com/cquccy/p/14519280.html
