Importing Hive Data into an External Table, by 奔向大数据的凡小王

Web submission, 02-07

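1. Overview

I recently received a requirement to export Hive data to ES. After some investigation, it turned out that an external table can be created in Hive and the data exported to ES by writing SQL. One caveat: data written to ES this way cannot be overwritten. Below is a brief summary based on my experience.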

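2. Environment

Hive: 2.1.1
ElasticSearch: 7.17.0
Hadoop: 3.0.0
Spark: 3.1.2

3. Setting up ES

3.1 Unpack

    # Unpack the archive
    tar -zxvf elasticsearch-7.17.0-linux-x86_64.tar.gz

3.2 Change ownership

    chown -R elasticsearch:elasticsearch elasticsearch-7.17.0

3.3 Configuration

elasticsearch.yml

    # Cluster name
    cluster.name: elasticsearch-cluster
    # Node name, must differ on each node
    node.name: "node-1"
    # Whether the node is master-eligible
    node.master: true
    # Whether the node is a data node
    node.data: true
    # Custom attribute, taken from the official documentation
    node.attr.rack: r1
    # Default number of shards and replicas. Since ES 5.x, index-level
    # settings are rejected in elasticsearch.yml and the node will not start,
    # so set them per index or in an index template instead.
    # index.number_of_shards: 5
    # index.number_of_replicas: 3
    # Data path
    path.data: /opt/data1/es_data
    # Log path
    path.logs: /opt/data1/es_logs
    # Whether to lock the process memory
    bootstrap.memory_lock: true
    # network.host: 172.16.0.20
    # Address to bind to (which IPs may connect)
    network.bind_host: 0.0.0.0
    # Address used to communicate with other nodes
    # network.publish_host: 172.16.0.20
    # HTTP port for client traffic
    http.port: 9500
    # Whether to compress TCP transport traffic
    transport.tcp.compress: true
    # TCP port for node-to-node transport
    transport.tcp.port: 9600
    # Seed nodes for discovery
    discovery.seed_hosts: ["172.16.0.20:9600", "172.16.0.22:9600", "172.16.0.23:9600"]
    # Master-eligible nodes used to bootstrap the cluster
    cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
    # Start recovering data once this many nodes have joined
    gateway.recover_after_nodes: 2

jvm.options

3.4 System parameters

    # Comment out the lines containing swap in /etc/fstab
    sed -i '/swap/s/^/#/' /etc/fstab
    swapoff -a
    # Maximum number of open files per user; use the officially recommended 65536 or larger
    echo "* - nofile 655360" >> /etc/security/limits.conf
    # Raise the per-user thread limit
    echo "* - nproc 131072" >> /etc/security/limits.conf
    # Maximum number of memory map areas a single process may use
    echo "vm.max_map_count = 655360" >> /etc/sysctl.conf
    # Apply the change immediately
    sysctl -p

3.5 Start

    ./elasticsearch -d

4. Hive configuration

    hive> add jar /home/elasticsearch-hadoop-7.17.0.jar;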

Or set it in hive-site.xml:

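    <property>
      <name>hive.aux.jars.path</name>
      <value>/path/elasticsearch-hadoop.jar</value>
      <description>A comma separated list (with no spaces) of the jar files</description>
    </property>

5. Prepare the data

    cat > /mnt/test.txt << EOF
    111,aaa
    222,bbb
    333,ccc
    EOF

6. Create the external table backed by ES

es.nodes and es.port must point at the HTTP address of the target cluster. Note that the values below differ from the cluster configured in section 3.3, where http.port is 9500.

    CREATE EXTERNAL TABLE test(key string, value string)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES(
      'es.resource' = 'test',
      'es.nodes' = '192.168.200.100',
      'es.port' = '9200',
      'es.nodes.wan.only' = 'true'
    );

7. Create the test table

    -- Create the source table
    CREATE TABLE test1(key string, value string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Load the data
    LOAD DATA LOCAL INPATH '/mnt/test.txt' INTO TABLE test1;

8. Load the data into the ES table (runs as a MapReduce job)

    INSERT INTO TABLE test SELECT * FROM test1;

9. Query the data in ES

10. Optimization points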

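Tune the Linux system parameters

Elasticsearch JVM memory

Configure multiple disks for ES

Use daily indices in ES

ES shard size

Set the number of replicas to 0 while loading data into ES, and adjust it afterwards

Number of nodes in the ES cluster

Adjust the index refresh interval and size

The maximum number of shards per node is 1000 by default and can be adjusted

When exporting data from Hive to the ES external table, if the fields of the two tables do not match, load the data into an intermediate table first and then into the ES external table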
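As a rough illustration of the replica, refresh-interval and shards-per-node items above, the following shell sketch wraps the index settings and cluster settings APIs with curl. The ES_URL and INDEX defaults are simply taken from the sample table definition, and the function names and the restore values (1 replica, 1s refresh, a limit of 2000) are assumptions for illustration, not values recommended by the article.

```shell
#!/usr/bin/env bash
# Sketch only: ES_URL, INDEX and all helper names are illustrative assumptions.
ES_URL="${ES_URL:-http://192.168.200.100:9200}"
INDEX="${INDEX:-test}"

# JSON body for PUT /<index>/_settings.
index_settings_body() {
  printf '{"index":{"number_of_replicas":%s,"refresh_interval":"%s"}}' "$1" "$2"
}

# JSON body for PUT /_cluster/settings raising the per-node shard limit.
shard_limit_body() {
  printf '{"persistent":{"cluster.max_shards_per_node":%s}}' "$1"
}

# Send a JSON body to an ES endpoint with PUT.
es_put() {
  curl -s -X PUT "$ES_URL/$1" -H 'Content-Type: application/json' -d "$2"
}

# Before the Hive insert: no replicas, periodic refresh disabled (-1).
prepare_for_load() { es_put "$INDEX/_settings" "$(index_settings_body 0 -1)"; }

# After the insert: restore replicas and refresh interval (example values).
restore_after_load() { es_put "$INDEX/_settings" "$(index_settings_body 1 1s)"; }

# Raise the shard limit above the default of 1000 (2000 is an example value).
raise_shard_limit() { es_put "_cluster/settings" "$(shard_limit_body 2000)"; }
```

With the file sourced, prepare_for_load would be called before running the insert statement and restore_after_load once it finishes. The index has to exist already for the settings calls to succeed.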