ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

Clickhouse join优化之分桶优化

2022-09-09 11:30:59  阅读:252  来源: 互联网

标签:之分桶 Int32 join String pid test 优化 ds


1、背景

ck在单表查询能够做到极致,但是在join上性能就相对尬尴,

A JOIN B 特别是当两张表的数据都不小的时候,经常就会有内存溢出,超时等等情况

特别是当AB都为分布表的时候

就拿常用的事件表(events_all)和用户表(users_all)做JOIN为例,都是分布表;

以下为clickhouse分桶测试(针对账号pid进行hash分桶,相同的pid会落到同一个分片上,增加join效率,减少网络IO)

2、表结构创建

    事件本地表
 CREATE TABLE test.join_event_local on cluster cluster_3shards_2replicas
(
    `ds` String,
    `gn` String,
    `log_type` String,
    `server_id` Int32,
    `pid` Int32,
    `plosgn` Int8,
    `role_uid` Int64,
    `log_time` Int32,
    `gold_num` Int32,
    `gold_type` Int32,
    `function_key` Int32,
    `direction` Int8,
    `item_id` Int32,
    `item_num` Int64,
    `uuid` String,
    `level` Int32,
    `vip` Int32,
    `remain` Int32,
    `salt` Int32,
    `plosgn_lo` Int8,
    `kv` String,
    `state` String,
    `flume2hdfs` String,
    `updatetime` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/join_event_local', '{replica}')
PARTITION BY toYYYYMM(toDate(ds))
ORDER BY (gn, server_id, log_time, item_id, uuid, plosgn,intHash64(pid))
SAMPLE BY intHash64(pid)
SETTINGS index_granularity = 8192; 

  事件分布式表

CREATE TABLE test.join_event_all on cluster cluster_3shards_2replicas
as test.join_event_local
ENGINE = Distributed('cluster_3shards_2replicas', 'test', 'join_event_local',intHash64(pid)) ;

   

        用户本地表

CREATE TABLE test.join_user_local on cluster cluster_3shards_2replicas
(
    `ds` String,
    `gn` String,
    `pid` Int32,
    `qid` String,
    `active_time` Int64,
    `login_ip` Int64,
    `source` String,
    `bind` String,
    `os` String,
    `create_time` Int64,
    `last_login_time` Int64,
    `isvalid` Int32,
    `groupid` Int32,
    `status` Int8,
    `uuid_hash_low` String,
    `uuid_hash_high` String,
    `plosgn` Int32,
    `groupbase` Int32,
    `state` String,
    `timezone` String,
    `a` String,
    `b` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/join_user_local', '{replica}')
PARTITION BY toYYYYMM(toDate(ds))
ORDER BY (gn, pid, intHash64(pid), active_time, login_ip, source, bind, os, create_time, last_login_time, isvalid, groupid, status, uuid_hash_low, uuid_hash_high, plosgn, groupbase)
SAMPLE BY intHash64(pid)
SETTINGS index_granularity = 8192 ;

  用户分布式表

CREATE TABLE test.join_user_all on cluster cluster_3shards_2replicas
as test.join_user_local 
ENGINE = Distributed('cluster_3shards_2replicas', 'test', 'join_user_local', intHash64(pid)) 

查询结果

***************非分桶join查询1*******************
测试sql1:账号关联,日期关联计数(256秒)
select count(a.pid) from (select pid,ds from test.event_all) a
join (select pid,ds from test.user_all) b on a.pid=b.pid and a.ds=b.ds group by ds order by ds limit 10;

***************分桶join查询1*******************
测试sql1:账号关联,日期关联计数 (分桶表常规写法查询,依然很慢,309秒)
select ds,count(a.pid) from (select pid,ds from test.join_event_all) a
join (select pid,ds from test.join_user_all) b on a.pid=b.pid and a.ds=b.ds group by ds order by ds limit 10;

(分桶表关联查询ck写法,62秒)
set distributed_product_mode = 'global';
select a.ds,count(a.pid) from test.join_event_all a
global join test.join_user_all b using(pid,ds) group by ds order by ds limit 10;

 

***************非分桶join查询2*******************
测试sql2:大表关联小表 账号关联:
select count(a.pid) from (select pid from test.event_all) a join (select pid from test.user_all) b on a.pid=b.pid; (非分桶 30秒)

***************分桶join查询2**********************
select count(a.pid) from test.join_event_all a
global join test.join_user_all b using(pid) limit 10; (分桶查询46秒)

注意:查询的时候不要B表括号起来select *from (select form A) a join (select from B) b ...这样  否则优化无效,这跟ck的解析有关

总结:当大表和大表进行关联的时候,效率提升明显(选定好常用的字段进行hash分桶,比如账号。。)(大小表关联条件单一的时候,可能效果不是太明显)

参考:https://blog.csdn.net/a495679822/article/details/118548564

标签:之分桶,Int32,join,String,pid,test,优化,ds
来源: https://www.cnblogs.com/-courage/p/16672200.html

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有