Small-File Governance in Hive: Three Ways to Merge Small Files in Hive


Contents
Preface
1. The concatenate method
2. The insert overwrite method
3. Using insert overwrite with select *
Summary

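Preface

A Hive partition can contain many small files: for example, a single partition may hold 1,000 files of only 10 KB each, and a data warehouse tends to accumulate a large number of them. Too many small files waste HDFS resources (every file adds metadata that the NameNode has to track) and inflate the number of tasks that MapReduce and Spark jobs launch. To deal with this, the small files need to be merged.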

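1. The concatenate method

-- For a non-partitioned table
alter table tablename concatenate;
-- For a partitioned table
alter table tablename partition(dt=20201224) concatenate;

Pros: easy to use.

Cons: the concatenate command only supports the RCFile and ORC file formats, and it may have to be run several times before the files are merged into a single one.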

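2. The insert overwrite method

insert overwrite table tableName partition(dt=2022031100)
select column1, column2 from tableName where dt=2022031100;

Cons: the columns in the select list have to be spelled out by hand. With select *, the result also contains the dt partition column, so it cannot be written back into the partition.

Pros: works with every storage format, not only RCFile and ORC.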
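Whether the rewrite actually produces fewer files depends on how many tasks write the output and on Hive's merge settings. The following is a minimal sketch, assuming the MapReduce execution engine and the standard `hive.merge.*` configuration properties. The size values are illustrative assumptions, not recommendations from the original article.

```sql
-- Merge small output files at the end of map-only jobs
set hive.merge.mapfiles=true;
-- Merge small output files at the end of map-reduce jobs
set hive.merge.mapredfiles=true;
-- Target size in bytes of each merged file (256 MB here, illustrative)
set hive.merge.size.per.task=256000000;
-- Run the extra merge step when the average output file size
-- falls below this many bytes (128 MB here, illustrative)
set hive.merge.smallfiles.avgsize=128000000;

-- Rewrite the partition onto itself; the merge step compacts the output
insert overwrite table tableName partition(dt=2022031100)
select column1, column2
from tableName
where dt=2022031100;
```

Comparing the number of files in the partition directory before and after the statement is a quick way to confirm that the merge took effect.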

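3. Using insert overwrite with select *

To drop one column from select *, use a backtick-quoted regular expression as the column specification:

insert overwrite table tableA select `(name)?+.+` from test;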

hive> set hive.cli.print.header=true;
hive> select * from test;
hook status=true,operation=QUERY
OK
name	friends	children	address
songsong	["bingbing","lili"]	{"xiao song":18,"xiaoxiao song":19}	{"street":"hui long guan","city":"beijing"}
yangyang	["caicai","susu"]	{"xiao yang":18,"xiaoxiao yang":19}	{"street":"chao yang","city":"beijing"}
Time taken: 0.14 seconds, Fetched: 2 row(s)

Dropping the address column from select *:

hive> select `(address)?+.+` from test;
hook status=true,operation=QUERY
OK
name	friends	children
songsong	["bingbing","lili"]	{"xiao song":18,"xiaoxiao song":19}
yangyang	["caicai","susu"]	{"xiao yang":18,"xiaoxiao yang":19}
Time taken: 0.144 seconds, Fetched: 2 row(s)

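The same trick removes the date partition column of a partitioned table.

Note that this syntax only takes effect after the following setting: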

hive> set hive.support.quoted.identifiers=none;
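Putting the pieces together, merging a partition onto itself without listing every column might look like the sketch below. It assumes the partition column is named dt and reuses the tableName and partition value from section 2. The pattern `(dt)?+.+` matches every column name except one that is exactly dt, so a column such as dtx would still be selected.

```sql
-- Let backtick-quoted names in the select list be treated as regular expressions
set hive.support.quoted.identifiers=none;

-- Select every column except the partition column dt,
-- then overwrite the same partition with the result
insert overwrite table tableName partition(dt=2022031100)
select `(dt)?+.+`
from tableName
where dt=2022031100;
```

Because the setting changes how backtick-quoted identifiers are parsed for the whole session, it may be worth restoring the previous value once the merge has finished.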

For more usage, see: https://blog.csdn.net/spark_dev/article/details/123692018

Summary

Both concatenate and insert overwrite can be used to merge small files in Hive.