ORC file
The ORC format can greatly reduce the amount of data read from HDFS. The catalog_sales table stored as ORC occupies 151,644,639 bytes on HDFS:
hive> SHOW CREATE TABLE tpcds_bin_partitioned_orc_2.catalog_sales;
OK
CREATE TABLE `tpcds_bin_partitioned_orc_2.catalog_sales`(
`cs_sold_time_sk` bigint,
`cs_ship_date_sk` bigint,
`cs_bill_customer_sk` bigint,
`cs_bill_cdemo_sk` bigint,
`cs_bill_hdemo_sk` bigint,
`cs_bill_addr_sk` bigint,
`cs_ship_customer_sk` bigint,
`cs_ship_cdemo_sk` bigint,
`cs_ship_hdemo_sk` bigint,
`cs_ship_addr_sk` bigint,
`cs_call_center_sk` bigint,
`cs_catalog_page_sk` bigint,
`cs_ship_mode_sk` bigint,
`cs_warehouse_sk` bigint,
`cs_item_sk` bigint,
`cs_promo_sk` bigint,
`cs_order_number` bigint,
`cs_quantity` int,
`cs_wholesale_cost` decimal(7,2),
`cs_list_price` decimal(7,2),
`cs_sales_price` decimal(7,2),
`cs_ext_discount_amt` decimal(7,2),
`cs_ext_sales_price` decimal(7,2),
`cs_ext_wholesale_cost` decimal(7,2),
`cs_ext_list_price` decimal(7,2),
`cs_ext_tax` decimal(7,2),
`cs_coupon_amt` decimal(7,2),
`cs_ext_ship_cost` decimal(7,2),
`cs_net_paid` decimal(7,2),
`cs_net_paid_inc_tax` decimal(7,2),
`cs_net_paid_inc_ship` decimal(7,2),
`cs_net_paid_inc_ship_tax` decimal(7,2),
`cs_net_profit` decimal(7,2))
PARTITIONED BY (
`cs_sold_date_sk` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1628754485')
Time taken: 0.051 seconds, Fetched: 47 row(s)
hive> dfs -du -s hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales;
151644639 151644639 hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales
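As a side note, the first column of `dfs -du -s` is a raw byte count. A small Python sketch (the helper name is mine) converts it to the human-readable form that `dfs -du -s -h` would print:

```python
# Convert a raw HDFS byte count to a human-readable size,
# mirroring what `dfs -du -s -h` prints.
def human_readable(n: float) -> str:
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PiB"

print(human_readable(151_644_639))  # ORC catalog_sales -> "144.6 MiB"
```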
The number of bytes actually read from HDFS is much smaller than the file size. First disable answering aggregate queries from table statistics, so that the count really scans the data:
set hive.compute.query.using.stats=false;
select count(1) from tpcds_bin_partitioned_orc_2.catalog_sales;
VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
Map 1 10082.00 38,630 868 2,880,058 4
Reducer 2 0.00 460 0 4 0
TEZ Counters: HDFS_BYTES_READ 30863152
Only 30,863,152 of the 151,644,639 bytes (about 20%) are read, since count(1) does not need to decode any column values.
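A quick back-of-the-envelope check of the counter against the file size (both numbers taken from the output above):

```python
# Fraction of the ORC file that the count(1) query actually read.
orc_file_bytes = 151_644_639  # from `dfs -du -s` above
orc_bytes_read = 30_863_152   # Tez HDFS_BYTES_READ above

fraction = orc_bytes_read / orc_file_bytes
print(f"{fraction:.1%} of the ORC file read")  # -> "20.4% of the ORC file read"
```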
Parquet file
For comparison, copy the same table into Parquet format and run the same count:
create table parquet.catalog_sales stored as parquet as select * from tpcds_bin_partitioned_orc_2.catalog_sales;
select count(1) from parquet.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
Map 1 4020.00 12,620 601 2,880,058 12
Reducer 2 87.00 680 0 12 0
hive> dfs -du -s hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales;
243755174 243755174 hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales
TEZ Counters: HDFS_BYTES_READ 243795493
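Putting the two runs side by side (all numbers from the outputs above):

```python
# Bytes read from HDFS by count(1) on each format.
orc_bytes_read = 30_863_152       # ORC Tez counter
parquet_bytes_read = 243_795_493  # Parquet Tez counter
parquet_file_bytes = 243_755_174  # Parquet `dfs -du -s`

# Parquet scans essentially the whole file; ORC reads ~8x fewer bytes.
print(f"Parquet read/file ratio: {parquet_bytes_read / parquet_file_bytes:.4f}")
print(f"Parquet/ORC bytes read: {parquet_bytes_read / orc_bytes_read:.1f}x")
```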
The HDFS_BYTES_READ counter shows that the Parquet count(1) reads 243,795,493 bytes, essentially the whole 243,755,174-byte file, while the same query on the ORC table read only 30,863,152 bytes. For this query, ORC reads roughly 8x less data from HDFS than Parquet, and the Parquet file itself is also about 60% larger than the ORC file.
Source: https://blog.csdn.net/houzhizhen/article/details/121743808