Hive数据分析案例_whvcse_hlzhang_hive数据分析

irpas 02-07 1652

第9章汽车销售数据分析

1 创建数据库cardb hive>create database if not exists cardb; hive>use cardb;

2 创建数据表car hive>create external table cars(province string, month int, city string, country string, year int, cartype string, productor string, brand string, mold string, owner string, nature string, number int, ftype string, outv int, power double, fuel string, length int, width int, height int, xlength int, xwidth int, xheight int, count int, base int, front int, norm string, tnumber int, total int, curb int, hcurb int, passenger string, zhcurb int, business string, dtype string, fmold string, fbusiness string, name string, age int, sex string) row format delimited fields terminated by ',' location '/cars';

hive> desc cars;

province ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? month ? ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? city ? ? ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? country ? ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? year ? ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? cartype ? ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? productor ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? brand ? ? ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? mold ? ? ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? owner ? ? ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? nature ? ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? number ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? ftype ? ? ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? outv ? ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? power ? ? ? ? ? ? ? ?? ?double ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? fuel ? ? ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? length ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? width ? ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? height ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? xlength ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? xwidth ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? xheight ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? count ? ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? base ? ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? front ? ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? norm ? ? ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? tnumber ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? total ? ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? curb ? ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? hcurb ? ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? passenger ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? zhcurb ? ? ? ? ? ? ??? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? business ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? dtype ? ? ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? fmold ? ? ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? fbusiness ? ? ? ? ? ?? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? name ? ? ? ? ? ? ? ??? ?string ? ? ? ? ? ? ??? ? ? ? ? ? ? ? ? ? ? ? age ? ? ? ? ? ? ? ? ?? ?int ? ? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? ? ? ? ? sex ? ? ? ? ? ? ? ? ?? ?string

3 加载数据文件到数据表cars 1）使用XShell的XFtp文件传输工具，将Win7系统D:\Hive教学\教学课件\Hive离线计算-配套视频代码数据资料\\汽车销售系统\data\car.txt数据文件上传到虚拟机Linux系统的/root目录 2) hive> load data local inpath '/root/car.txt' overwrite into table cars; 加载数据文件 3) hive> select * from cars limit 10;?查询前10条数据

下面开始进行数据分析：

4 统计cars表的数据行数 hive> select count(*) from cars; OK70640 说明：cars表总共有70640行数据

hive>select count(*) from cars where nature is not null and nature != ''; OK 70362 说明：car表中nature列值不为null且不为空字符串的数据行共有70362行，70640-70362=278行数据的nature列存在缺失值，即车辆使用性质未知

重要提醒：缺失值和重复列会影响数据分析的准确性，最好在数据分析前先通过数据清洗将有缺失值和重复列的数据行给剔除掉！数据清洗过程： 1）建立一个表结构相同的新表cars_clean，用于存放数据清洗后的数据 hive>create external table cars_clean(province string, month int, city string, country string, year int, cartype string, productor string, brand string, mold string, owner string, nature string, number int, ftype string, outv int, power double, fuel string, length int, width int, height int, xlength int, xwidth int, xheight int, count int, base int, front int, norm string, tnumber int, total int, curb int, hcurb int, passenger string, zhcurb int, business string, dtype string, fmold string, fbusiness string, name string, age int, sex string) row format delimited fields terminated by ',' location '/cars_clean';

2）选取后面的数据分析会用到的重要列nature，number, province, year, month, sex, owner, cartype, mold, brand, ftype来剔除缺失值，再对所有列进行distinct去重，并逐行插入到新表cars_clean中 hive>insert into cars_clean select distinct * from cars where nature is not null and nature != '' and number is not null and province is not null and province != '' and year is not null and month is not null and sex is not null and sex != '' and owner is not null and ?owner ?!='' and cartype is not null and cartype ?!= '' and mold is not null and mold ?!= '' and brand is not null and brand ?!= '' and ftype is not null and ftype ?!= ''; 说明：int类型的列只需判断不为null，string类型的列要判断不为null且不为空字符串；后面数据分析的SQL语句中就不用写类似where nature is not null and !=''的语句了

3）统计经过清洗后的数据表cars_clean的行数 hive> select count(*) from cars_clean; OK43079 说明：数据清洗后的cars_clean表共有43079行数据，而数据清洗前的cars表总共有70640行数据，有27561行数据被清洗掉了，原始数据文件car.txt的数据质量堪忧！

下面开始进行数据分析，建议在经过数据清洗后的car_clean表中执行进行数据分析！

以下SQL语句都改成针对car_clean表进行操作，并可以去掉where判缺失值的语句

5 统计车辆的使用性质：商用车、乘用车的数量

hive> select nature,count(*) as cnt from cars ?group by nature having nature is not null and nature != '';

中小学生校车?? ?119 公交客运?? ?1742 公路客运?? ?1448 出租客运?? ?2 初中生校车?? ?2 小学生校车?? ?111 工程救险?? ?1 幼儿校车?? ?17 救护?? ?1 教练?? ?26 旅游客运?? ?219 消防?? ?7 租赁?? ?24 警用?? ?165 非营运?? ?66478

说明：car表中nature列值不为null且不为空字符串的数据行共有119+1742+1448+2+2+111+1+17+1+26+219+7+24+165+66478=70362行

hive>select '非营运', sum(if(a.nature='非营运',a.cnt,0)), '营运', sum(if(a.nature!='非营运',a.cnt,0)) from (select nature,count(*) as cnt from cars? group by nature having nature is not null and nature != '') a; OK 非营运?? ?66478?? ?营运?? ?3884 Time taken: 47.985 seconds, Fetched: 1 row(s) 说明：car表的nature列值为“非营运”的数据行有66478行，nature列值为“中小学生校车”、“公交客运”等属于营运性质的数据行有3884行，总共66478+3884=70362行6 按月统计山西省2013年的汽车销售比例 hive> select? sum(number)?as total from cars where province='山西省' and year='2013'; OK 70362 说明：山西省2013年按年份统计的汽车销售总量是70362；执行select * from cars where number !=1; 查询不到任何结果，说明cars表的所有行number列值都是1，因此sum(number)恰好等于总行数70362

hive> select month, sum(number) as ss from cars where province='山西省' and year='2013' group by month;

1?? ?10413 2?? ?4103 3?? ?6548 4?? ?4635 5?? ?5151 6?? ?3903 7?? ?4449 8?? ?4488 9?? ?4889 10?? ?7352 11?? ?7074 12?? ?7357

说明：山西省2013年按月份统计的汽车销售数量

hive> select ?month, c1.ss/c.total from? (select month, sum(number) as ss from cars where province='山西省' and year='2013' group by month) c1, (select sum(number) as total from cars where province='山西省' and year='2013') c; OK

1?? ?0.14799181376311077 2?? ?0.05831272561894204 3?? ?0.09306159574770473 4?? ?0.06587362496802251 5?? ?0.0732071288479577 6?? ?0.05547028225462608 7?? ?0.06323015263920867 8?? ?0.06378442909525028 9?? ?0.06948352804070379 10?? ?0.1044882180722549 11?? ?0.10053722179585571 12?? ?0.1045592791563628

说明：山西省2013年每个月的汽车销售比例，即每月的销售数量/一年的销售数量

7 按性别统计男性、女性买车的比例，以及对品牌的选择

hive>select sex, count(*) as cnt from cars group by sex having sex is not null and sex !='';

1913?? ?37 1921?? ?21 1925?? ?18 1926?? ?37 1928?? ?36 1929?? ?13 1930?? ?18 1931?? ?19 1932?? ?21 1933?? ?13 1934?? ?10 1935?? ?10 1936?? ?4 1937?? ?4 1938?? ?2 1939?? ?3 1940?? ?1 1941?? ?5 1942?? ?1 1943?? ?2 1944?? ?1 1947?? ?2 1952?? ?1 1954?? ?1 1956?? ?1 东风汽车有限公司商用车发动机?? ?23 女性?? ?17696 客车?? ?1238 广西玉柴机器股份有限公司?? ?2 男性?? ?41501 轻型客车?? ?5 重庆渝安淮海动力有限公司?? ?4

说明：按性别sex分组统计每种性别的买车数量，并筛选出性别不为null和空字符串的分组；但该SQL语句先分组再筛选有点别扭，建议执行select sex, count(*) as cnt from cars ?where sex is not null and sex !='' group by sex;先筛选再分组，效率更好

hive>select '男性', sum(if(b.sex='男性', b.cnt, 0)) as man, '女性', sum(if(b.sex='女性' ,b.cnt, 0)) as woman?from (select sex, count(*) as cnt from cars ?where sex is not null and sex !='' group by sex) b;男性?? ?41501?? ?女性?? ?17696 说明：if语句对sex列值等于'男性'或'女性'的b.cnt列进行sum求和，目的是将多行变成一行

hive>select '男性', a.man*1.0/(a.man+a.woman), '女性 ', a.woman*1.0/(a.man+a.woman)? from (select '男性', sum(if(b.sex='男性', b.cnt, 0)) as man, '女性', sum(if(b.sex='女性' ,b.cnt, 0)) as woman?from (select sex, count(*) as cnt from cars ?where sex is not null and sex !='' group by sex) b)a;男性?? ?0.7010659323952227?? ?女性 ?? ?0.29893406760477725 说明：计算男性买车数量占男女总和的比例，女性买车的数量占男女总和比例，相当于执行：

hive> select ?'男性', 41501/(41501+17696), '女性', 17696/(41501+17696); OK 男性?? ?0.7010659323952227?? ?女性?? ?0.29893406760477725

评论：上面的三重嵌套select语句把问题搞复杂了，直接执行以下两个语句即可执行hive>select sex, count(*) as cnt from cars group by sex having sex is not null and sex !='' and (sex = '男性' or sex = '女性');女性?? ?17696 男性?? ?41501 或执行hive>select sex, count(*) as cnt from cars where sex is not null and sex !='' and (sex = '男性' or sex = '女性') group by sex;女性?? ?17696 男性?? ?41501

hive> select sex, brand, count(*) from cars where sex is not null and sex !='' and age is not null group by sex, brand having brand is not null and brand != '';

女性?? ?一汽佳星?? ?1 女性?? ?东南?? ?1 女性?? ?东风?? ?1367 女性?? ?中誉?? ?4 女性?? ?五菱?? ?12004 女性?? ?五菱宏光?? ?1057 女性?? ?众泰?? ?2 女性?? ?依维柯?? ?32 女性?? ?俊风?? ?1 女性?? ?力帆?? ?27 女性?? ?北京?? ?741 女性?? ?吉奥?? ?12 女性?? ?哈飞?? ?4 女性?? ?大通?? ?7 女性?? ?大马?? ?3 女性?? ?奥路卡?? ?125 女性?? ?宇通?? ?6 女性?? ?少林?? ?28 女性?? ?尼桑?? ?2 女性?? ?开瑞?? ?89 女性?? ?恒通客车?? ?2 女性?? ?昌河?? ?20 女性?? ?昌河铃木?? ?3 女性?? ?松花江?? ?25 女性?? ?欧诺?? ?121 女性?? ?江淮?? ?7 女性?? ?江铃全顺?? ?84 女性?? ?海格?? ?1 女性?? ?神剑?? ?6 女性?? ?福田?? ?17 女性?? ?航天?? ?31 女性?? ?解放?? ?96 女性?? ?通家福?? ?5 女性?? ?野马?? ?8 女性?? ?金旅?? ?7 女性?? ?金杯?? ?102 女性?? ?金龙?? ?16 女性?? ?长城?? ?2 女性?? ?长安?? ?1628 男性?? ?一汽佳星?? ?2 男性?? ?东南?? ?12 男性?? ?东风?? ?3214 男性?? ?中誉?? ?2 男性?? ?中通?? ?1 男性?? ?五菱?? ?28208 男性?? ?五菱宏光?? ?2331 男性?? ?众泰?? ?6 男性?? ?依维柯?? ?64 男性?? ?俊风?? ?4 男性?? ?力帆?? ?84 男性?? ?北京?? ?1836 男性?? ?合客?? ?2 男性?? ?吉奥?? ?30 男性?? ?同心?? ?1 男性?? ?哈飞?? ?7 男性?? ?大通?? ?31 男性?? ?大马?? ?7 男性?? ?奥路卡?? ?277 男性?? ?宇通?? ?7 男性?? ?少林?? ?72 男性?? ?尼桑?? ?2 男性?? ?开瑞?? ?231 男性?? ?恒通客车?? ?2 男性?? ?昌河?? ?75 男性?? ?昌河铃木?? ?1 男性?? ?松花江?? ?86 男性?? ?柯斯达?? ?6 男性?? ?梅赛德斯-奔驰?? ?1 男性?? ?欧诺?? ?239 男性?? ?汇众?? ?3 男性?? ?江淮?? ?13 男性?? ?江铃全顺?? ?200 男性?? ?海格?? ?1 男性?? ?神剑?? ?16 男性?? ?福田?? ?49 男性?? ?航天?? ?93 男性?? ?解放?? ?242 男性?? ?通家福?? ?19 男性?? ?野马?? ?20 男性?? ?金旅?? ?6 男性?? ?金杯?? ?265 男性?? ?金龙?? ?26 男性?? ?长城?? ?18 男性?? ?长安?? ?3679 男性?? ?飞碟?? ?3 男性?? ?黄海?? ?2

说明：先按性别不为null和空字符串且年龄不为null进行筛选，再按性别、车牌品牌进行分组统计

8 统计车辆的拥有者、车辆型号和车辆类型

hive>select owner, count(*) as cnt from cars group by owner order by cnt desc; 个人?? ?60745 单位?? ?9617NULL?? ?273?? ?5 说明：按车辆拥有者owner列，分组统计各类拥有者的车辆数量；有的数据行的owner列存在NULL和空字符串

hive> select owner, count(*) as cnt from cars where owner is not null and owner != '' group by owner order by cnt desc; 个人?? ?60745 单位?? ?9617 说明：按车辆拥有者owner列，分组统计各类拥有者的车辆数量；先过滤掉owner列存在NULL和空字符串的数据行

hive> select cartype, count(*) as cnt from cars where cartype is not null and cartype != '' group by cartype order by cnt desc; 说明：按照车辆型号cartype列，分组统计每种车辆型号的车辆数量；先过滤掉cartype列存在NULL和空字符串的数据行

hive> select mold, count(*) as cnt from cars where mold is not null and mold != '' group by mold order by cnt desc;

小型普通客车 62156 大型普通客车 3275 中型普通客车 1398 大型专用校车 221 中型专用校车 29 小型专用客车 5 大型铰接客车 3 微型普通客车 2 大型双层客车 1 中型越野客车 1

说明：按照车辆类型mold列，分组统计每种车辆类型的车辆数量；先过滤掉mold列存在NULL和空字符串的数据行

hive> select a.con, count(*) from ( select concat(owner, cartype, mold) as con from cars) a ?group by a.con; 说明：将车辆拥有者owner列，车辆型号cartype列，车辆类型mold列进行字符串拼接，按照这三列的拼接组合进行分组统计；也可以执行select owner, cartype, mold, count(*) from cars ?where owner is not null and owner !='' and cartype is not null and cartype !='' and ? mold is not null and ?mold !=''group by owner, cartype, mold;

9 统计月销售量、年销售总量、以及月销售量占年销售总量的比例

hive>select sum(number) as yearsum from cars where province="山西省" and year="2013" 说明：统计年销售总量

hive>select month,sum(number) as monthsum from cars where province="山西省" and year="2013" group by month 说明：按月分组统计月销售量

hive> select month, c1.yearsum/c2.monthsum from (select sum(number) as yearsum from cars where province="山西省" and year="2013") c1, (select month,sum(number) as monthsum from cars where province="山西省" and year="2013" group by month) c2;

1 0.14799181376311077 2 0.05831272561894204 3 0.09306159574770473 4 0.06587362496802251 5 0.0732071288479577 6 0.05547028225462608 7 0.06323015263920867 8 0.06378442909525028 9 0.06948352804070379 10 0.1044882180722549 11 0.10053722179585571 12 0.1045592791563628

说明：统计每个月的月销售量占年销售总量的比例?

第10章新浪微博数据分析JSON格式的新浪微博历史数据： [ ?? ?{ ?? ??? ?"beCommentWeiboId":"", ?? ??? ?"beForwardWeiboId":""," ?? ??? ?catchTime":"1387165692"," ?? ??? ?commentCount":"61748", ?? ??? ?"content":"感谢微博上朋友们对我的肯定，也感谢朋友们对我的批评。", ?? ??? ?"createTime":"1350008370", ?? ??? ?"info1":"", ?? ??? ?"info2":"", ?? ??? ?"info3":"", ?? ??? ?"mlevel":"", ?? ??? ?"musicurl":[], ?? ??? ?"pic_list":[], ?? ??? ?"praiseCount":"3688", ?? ??? ?"reportCount":"55723", ?? ??? ?"source":"HTC 新渴望VC", ?? ??? ?"userId":"1672272373", ?? ??? ?"videourl":[], ?? ??? ?"weiboId":"3500251157908958", ?? ??? ?"weiboUrl":"http://weibo.com/1672272373/z06x5xbtQ" ?? ?} ]

1 创建数据库weibodb hive>create database if not exists weibodb; hive>use weibodb;

2 创建数据表weibo hive> create external table weibo(json string) comment 'This is the quova ADN source json table'; hive> desc weibo; OK json ? ? ? ? ? ? ? ??? ?string? 说明：weibo表只包含一个string类型的列json

3 加载数据文件到数据表weibo 1）将Win7系统D:\Hive教学\教学课件\Hive离线计算-配套视频代码数据资料\\\微博数据分析系统数据\weibo.zip压缩文件上传到虚拟机Linux系统的/root目录；在Linux终端执行unzip weibo.zip命令进行解压缩，如果提示“未找到命令”说明unzip命令未安装，可以执行yum install zip命令来安装unzip命令；执行unzip weibo.zip命令解压缩后，会产生一个名字为619893的目录，可以执行mv 619893 weibo命令给这个目录更名；如果在Linux环境解压嫌麻烦，可以直接在win7下解压缩后将整个weibo目录上传到虚拟机Linux系统的/root目录；在Linux终端执行命令ls /root/weibo/ | wc -l 统计weibo目录下的数据文件数目是1206 2) hive> ?load data local inpath '/root/weibo/*' overwrite into table weibo; Loading data to table weibodb.weibo Table weibodb.weibo stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0] OK Time taken: 192.44 seconds 说明：加载weibo目录下的所有数据文件到微博表中，数据文件数量很多，加载时间很长，用时192秒 3) hive> select * from weibo limit 10;?查询weibo表的前10行数据 4) hive> select * from weibo;?查询weibo表的全部行数据，数据行数太多，滚屏许久不完 5) hive>select count(*) from weibo; 统计weibo表的总行数，共1451868行 OK1451868 Time taken: 110.42 seconds, Fetched: 1 row(s)

下面开始进行数据分析：

4 统计微博总数和独立用户数

hive> select count(*) ?from weibo; OK1451868 说明：统计微博总数，即weibo表的总行数

hive> select count(distinct get_json_object(a.j , '$.userId') ) from (select substring(json, 2, length(json) - 2) as j from weibo) a; OK 78540 说明：统计独立的用户数，共有78540个独立用户（不重复的userId）

5 统计被转发次数最多的前10条微博 hive> select get_json_object(a.j,'$.userId') as id, cast(get_json_object(a.j,'$.reportCount') as int) as cnt from (select substring(json,2,length(json)-2) as j from weibo) a order by cnt desc limit 10;

2202387347?? ?2692012 2202387347?? ?2692012 2202387347?? ?2692012 2202387347?? ?2692012 2202387347?? ?2692012 2202387347?? ?2692012 2202387347?? ?2692012 2202387347?? ?2692009 2202387347?? ?2692009 2202387347?? ?2692008

说明：统计被转发次数reportCount最多的前10条微博

6 统计用户所有微博被转发总次数的前10名 hive> select b.id, sum(b.cnt) as bsum from ? ( select get_json_object(a.j , '$.userId') ?as id, get_json_object(a.j , '$.reportCount') as cnt from ?( select substring(json, 2, length(json) - 2) as j from weibo ) a ) b group by b.id order by bsum desc limit 10;

1793285524?? ?7.6454805E7 1629810574?? ?7.3656898E7 2803301701?? ?6.8176008E7 1266286555?? ?5.5111054E7 1191258123?? ?5.4808042E7 2202387347?? ?5.1597226E7 2656274875?? ?5.0289184E7 1197161814?? ?4.6236208E7 1188552450?? ?4.1578163E7 1195242865?? ?4.0537386E7

说明：按userid进行分组，累加计算微博转发次数reportCount的总次数，根据总次数倒序排序并取前10条数据，7.6454805E7是科学计数法表示的7.6454805*10000000=76454805

7 统计每个用户发布的微博总数，并存储到临时表 hive> create table weibo_uid_wbcnt(userid string, wbcnt int) row format delimited fields terminated by '\t'; 说明：创建一个临时表weibo_uid_wbcnt，用于存放每个用户发布的微博总数，userid列是用户id，wbcnt列是微博总数

hive>insert overwrite table weibo_uid_wbcnt select get_json_object(a.j,'$.userId'), count(1) from (select substring(json,2,length(json)-2) as j from weibo) a group by get_json_object(a.j,'$.userId') ; 说明：按userId进行分组统计每个用户的微博总数，并将结果逐行插入到临时表weibo_uid_wbcnt( hive>select * from weibo_uid_wbcnt limit 10; hive>select * from weibo_uid_wbcnt;

8 统计带图片的微博数 hive> select count(1) from (select substring(json,2,length(json)-2) as j from weibo) a ?where get_json_object(a.j, '$.pic_list') like '%http%'; OK750512 说明：统计图片列表pic_list包含'http'子字符串的总行数，即带图片URL地址的总行数；统计结果总共750512条微博含图片

9 统计使用iphone发微博的独立用户数量 hive> select count(distinct get_json_object(a.j,'$.userId')) from ?(select substring(json,2,length(json)-2) as j from weibo) a ?where lower(get_json_object(a.j,'$.source')) like '%iphone%'; OK936 说明：统计数据来源source包含子字符串'iphone'的不重复用户userId的总数量，总共936个用户

10 创建视图进行数据分析 hive> create view weibo_view as select get_json_object(a.j,'$.userId') as uid, get_json_object(a.j,'$.source') as src from (select substring(json,2,length(json)-2) as j from weibo) a where get_json_object(a.j,'$.commentCount') <1000; 说明：创建视图weibo_view，包括微博评论次数小于1000的用户userId列和数据来源source列；创建视图的SQL语句执行速度很快，因为视图并不真正存储数据，而只是定义存储数据的逻辑规则

hive>select * from weibo_view limit 10;?

2989711735 1087770692 iPad客户端 1390470392 1390470392 1498502972 1087770692 iPad客户端 1589706710 1087770692 iPad客户端 1087770692 iPad客户端 1589706710

说明：视图weibo_view的source列的值存在很多空字符串，会影响数据分析

hive>select * from weibo_view; 说明：查询视图时，才会真正执行视图中定义的逻辑规则

提示：由于以上的数据分析SQL语句非常复杂，分析结果中还存在空字符串，会影响分析结果的准确性；因此最好先进行数据转换和数据清洗后再进行数据分析，优点是可以提高数据分析结果的准确性，简化SQL语句的复杂性

数据清洗过程：

hive>create table weibo_data(becommentweiboid string, beforwardweiboid string, catchtime string, commentcount string, content string, createtime string, info1 string, info2 string, info3 string, mlevel string, musicurl string, pic_list string, praisecount string, reportcount string, source string, userid string, videourl string, weiboid string, weibourl string); 说明：创建一张新表weibo_data用于保存从JSON提取出来的数据，特别注意hive列名中不允许有大写字母，如果列名有大写字母建表后desc weibo_data查看列名会发现会自动转为小写字母

hive>insert into weibo_data ? ? (becommentweiboid, beforwardweiboid, catchtime, commentcount, content, createtime, info1, info2, info3, mlevel,musicurl, pic_list, praisecount, reportcount, source, userid, videourl, weiboid, weibourl)? ? ? select? ? ? ? ? get_json_object(a.j, '$.beCommentWeiboId'), get_json_object(a.j, '$.beForwardWeiboId'), get_json_object(a.j, '$.catchTime'), get_json_object(a.j, '$.commentCount'),get_json_object(a.j, '$.content'), get_json_object(a.j,'$.createTime'), ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? get_json_object(a.j, '$.info1'), get_json_object(a.j, '$.info2'), get_json_object(a.j, '$.info3'), get_json_object(a.j, '$.mlevel'), get_json_object(a.j, '$.musicurl'), get_json_object(a.j, '$.pic_list'), get_json_object(a.j, '$.praiseCount'), get_json_object ? ? ? ? ? ? ? ? ? ? ? ? ?(a.j,'$.reportCount'), get_json_object(a.j, '$.source'), get_json_object(a.j, '$.userId'), get_json_object(a.j, '$.videourl'), get_json_object(a.j, '$.weiboId'), get_json_object(a.j, '$.weiboUrl')? ? ? from? ? ?(select substr(json, 2, length(json)-2) as j from weibo) a; 说明：将weibo表中的JSON格式对象通过get_json_object函数依次提取出来，再逐行insert到新表weixin_data中

hive>select * from weibo_data limit 3;? 显示weibo_data表的前3行 ? hive> create table weibo_clean(becommentweiboid string, beforwardweiboid string, catchtime string, commentcount string, content string, createtime string, info1 string, info2 string, info3 string, mlevel string, musicurl string, pic_list string, praisecount string, reportcount string, source string, userid string, videourl string, weiboid string, weibourl string); 说明：创建一张新表weibo_clean用于存放经过数据清洗过的数据

hive> insert into weibo_clean select distinct * from weibo_data ?where userid is not null and userid !="" and reportcount is not null and reportcount !="" and pic_list is not null and pic_list !="" and source is not null and source !="" and commentcount is not null and commentcount !="" and praisecount is not null and praisecount !="" and content is not null and content !=""; 说明：将原数据表先按需要进行数据分析的关键列进行where判缺失值过滤，再对所有列进行distinct去重，然后再插入到新的数据表weibo_clean中，这样weibo_clean表中的数据行就没有重复也没有缺失值；重复的数据行和关键列的缺失值会影响数据分析的正确结果

在经过数据清洗后的weibo_clean表中执行进行数据分析，可以保证数据分析的准确性，也可以简化SQL语句：

hive> select count(*) from weibo_clean; OK1008511 说明：统计经过数据清洗的表weibo_clean总行数 hive> select count(distinct userid) from weibo_clean; OK1196 说明：统计不重复的userid的数量

hive> select userid, reportcount from weibo_clean order by reportCount desc limit 10; 说明：统计被转发次数最多的前10条微博

hive> select b.id, sum(b.cnt) as bsum from (select userid as id, reportCount as cnt from weibo_clean) b group by b.id order by bsum desc limit 10; 说明：统计用户所有微博被转发总次数的前10名，相当于执行select userid, sum(reportCount) ?as reportsum from ?weibo_clean? group by userid order by reportsum desc limit 10;

hive> select count(1) from weibo_clean where pic_list like '%http%';s 说明：统计带图片的微博数

hive> select count(distinct userid) from weibo_clean where lower(source) like '%iphone%'; 说明：统计使用iphone发微博的独立用户数量

hive> create view weibo_clean_view as select userId as uid, source as src from weibo_clean where commentcount<1000; 说明：创建视图进行数据分析

hive>select * from weibo_view limit 10;

UDF自定义函数的应用：

应用1：两个整数列的列值求和

package myUDF; import org.apache.hadoop.hive.ql.exec.UDF; public class UDFDemo extends UDF { public int evaluate(int a, int b) { try { return a + b; } catch(Exception e) { return 0; } } }

hive> add jar /root/UDFDemo.jar; ? 载入jar包到hive中

hive> create temporary function add2 as 'myUDF.UDFDemo'; 创建临时自定义函数，命名为add2

hive> show functions; 发现add2函数赫然在列

hive> create table test(c1 int, c2 int); ?创建一个测试表test，包含两个整型列

hive> insert into test values(1,1); ?插入一列数据 hive> insert into test values(2,2); ?再插入一列数据

hive> select * from test; 查看test表数据 OK1?? ?1 2?? ?2

hive> select add2(c1, c2) from test; ?测试sum2函数，将c1和c2列的值相加 OK2 4

hive> select cast(praisecount as int), cast(reportcount as int) from weibo_clean limit 10; 说明：使用cast函数，将praisecount列和reportcount列的列值强制转换为int整型

hive> select add2(cast(praisecount as int), cast(reportcount as int)) as s, content from weibo_clean order by s desc limit 10;内容

2692013 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692013 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692013 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692013 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692013 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692013 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692013 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692010 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692010 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X 2692009 【#小米手机2微博开卖#转发狂送20台】新浪微博社会化网购首单，12月21日中午12点，5万台#小米手机2#微博专场销售。全球首款28nm四核手机，2G大内存，新一代背照式相机，1999元性价比之王。现在起至20日，关注@小米手机，转发@好友，每天10点20点每2小时送出2台米2。微博预约http://t.cn/zj9m34X

说明：统计最热门的top10热门微博帖子，即微博点赞人数praisecount和转发人数reportcount之和降序排序并取前10行

经测试发现：其实可以没必要适用两列相加的自定义函数add2，因为Hive SQL的select语句本来就可以把两列直接相加

hive> select c1+c2 from test;

hive> select praisecount + reportcount from weibo_data limit 10;

hive> select cast(praisecount as int) + cast(reportcount as int) from weibo_clean limit 10;

hive> select cast(praisecount as int)+ cast(reportcount as int) as s, content from weibo_data order by s desc limit 10;

应用2：统计微博内容中“微博”出现次数最多的用户ID和出现次数

说明：相当于求一个字符串中的某个子字符串的出现次数

算法1：循环遍历

package myUDF; import org.apache.hadoop.hive.ql.exec.UDF; public class UDFDemo1 extends UDF { public int evaluate(String content, String word) { int count = 0; // if(content == null|| content.equals("") || word == null|| word.equals("")) if(content == null|| content.length() == 0 || word == null|| word.length() == 0) { return 0; } int i = 0; while(true) { if(content.indexOf(word, i) == -1) { break;//找不到，退出循环 } else { count ++;//找到，count加1 //i = content.indexOf(word, i) + 1;//更新查找的索引值i，继续插座 i = content.indexOf(word, i) + word.length();//更新查找的索引值i，继续插座 } } return count; } }

算法2：使用正则表达式

package myUDF; import org.apache.hadoop.hive.ql.exec.UDF; import java.util.regex.Pattern; import java.util.regex.Matcher; public class UDFDemo2 extends UDF { public int evaluate(String content, String word) { int count = 0; if(content == null|| content.length() <= 0 || word == null|| word.length() <= 0) { return 0; } Pattern p = Pattern.compile(word); Matcher m = p.matcher(content); while (m.find()) { count++; } return count; } }

注意书上P181的算法是错误的：使用split方法有bug，当word是content的前缀或后缀时，使用split方法得到的字符数组长度是错误的

String[] array = content.split(word);//用word将content拆分为字符串数组 count = array.length - 1; if(content.endsWith(word))//如果content以word结尾 { count ++; }

测试wcount函数：

在Eclipse开发环境进行测试：编写main函数，用debug模式断点跟踪

public static void main(String[] args) { int count = evaluate("abc", "abc"); System.out.println(count); }

在Hive终端环境进行测试：

hive> add jar /root/UDFDemo1.jar; ? 载入jar包到hive中

hive> create temporary function wordcount as 'myUDF.UDFDemo1';创建临时自定义函数，命名为wcount

执行测试：

hive> select wordcount("", ""); hive> select wordcount("", "a"); hive> select wordcount("a", ""); hive> select wordcount("a", "a"); hive> select wordcount("a", "aa"); hive> select wordcount("ba", "a"); hive> select wordcount("ab", "a"); hive> select wordcount("aba", "a"); hive> select wordcount("babab", "a"); hive> select wordcount("aaabaaabaabababaaabaa", "aa");

hive> select userid, wordcount(content, "微博") as wc , content?from weibo_data order by wc desc limit 10;

1629313452 17 吃饭拍照发微博，逛街拍照发微博，睡觉拍照发微博，相亲拍照发微博，集体散步拍照发微博，K歌拍照发微博，洗澡拍照发微博，约会拍照微博，上班拍照发微博，坐车拍照发微博，旅游拍照发微博，胸大拍照发微博，宠物拍照发微博，生孩子拍照发微博，喂奶拍照发微博，结婚拍照发微博，追悼会拍照发微博..... 1642051345 13 一看微博就想要绝食、一看微博就想戴口罩、一看微博就想当宅男、一看微博就想去移民、一看微博就想去创业、一看微博就想当天使、一看微博就想干革命、一看微博就想要自由、一看微博就想装公知、一看微博就想当导师、一看微博就想吹牛逼、一看微博就想要装逼……一看微博就被各种控。 2115264790 13 #微博营销#【个人微博与企业微博的区别】1、个人微博是代表自己，企业微博是代表品牌；2、个人微博是随时发，企业微博是定时发；3、个人微博发的是感悟，企业微博发的是感情；4、个人微博舆论勿逃避，企业微博舆论需澄清；5、个人微博玩的是魅力营销、圈子营销，企业微博玩的是情感营销、口碑营销. 1432482114 12 《论语》古微博政论；《老子》古微博哲学；《说苑》古微博史料；《世说新语》古微博杂史；《大唐新语》古微博野史；《太平广记》古微博故事；《笑林》古微博笑话；《宋人轶事汇编》古微博传记；《菜根谭》古微博随笔；《阅微草堂笔记》古微博小说；《苦瓜和尚画语录》古微博艺术论……古微博笔记海量。 2718351604 12 【12星座现在正在干什么】白羊看这条微博金牛看这条微博双子看这条微博巨蟹看这条微博狮子看这条微博处女看这条微博天秤看这条微博天蝎看这条微博射手看这条微博摩羯看这条微博水瓶看这条微博双鱼看这条微博 1646724250 12 【12星座现在正在干什么】白羊看这条微博金牛看这条微博双子看这条微博巨蟹看这条微博狮子看这条微博处女看这条微博天秤看这条微博天蝎看这条微博射手看这条微博摩羯看这条微博水瓶看这条微博双鱼看这条微博 2016713117 11 #微博帮助#微博内容可以禁止转载吗目前微博分为公开微博和权限微博。公开微博内容任何人都可转发；如果发布的微博不想让别人转发或看到可选择发布权限微博：定向微博仅设置的特定分组中与您互相关注的人可见；密友微博密友分组可见；私密微博仅自己可见，详情http://t.cn/zjwTUn9。 2408095697 11 @尚道微营销：【企业微博10大矩阵】1、企业官方微博；2、企业领导人微博；3、员工微博；4、品牌官方微博；5、产品官方微博；6、客服官方微博；7、招聘官方微博；8、企业促销、活动官方微博；9、企业吉祥物官方微博；10、企业大学官方微博.@袁岳@王利芬@彭曙光V 2016713117 10 #微博帮助#微博字数限制，话没说完就超了，这种感觉灰常不好其实吧，如果微博字数超过140字可使用长微博1、发布框下方长微博按钮，点击使用长微博工具；2、微博桌面右上角“笔”形标识，发布框下方"长微博"；3、进入微吧长微博发布按钮，输入内容也可发布长微博：http://t.cn/8kzRiVH 2654839837 10 【微营销推广十大要点】?微博基本设置；?微博推广内容；?如何提高微博的粉丝；?微博营销的语法使用；?微博营销也是监控平台；?微博营销通过多手段进行更新；?微博营销的直播；?微博营销与网站的关系；?微博营销中的排名算法；?微博营销中的相关性。

说明：列出内容content列包含子字符串"微博"的个数排名前10的数据行

hive> hive> select b.id, max(b.wc) as m from (select userid as id, wordcount(content,"微博") as wc, content from weibo_data) b group by b.id order by desc limit 10;

1629313452 17 2115264790 13 1642051345 13 2718351604 12 1646724250 12 1432482114 12 2408095697 11 2016713117 11 2242991583 10 1764222885 10

说明：按userid分组统计微博内容content列包含字符串"微博"的最大次数，取最大次数最多的前10行，列出用户id和次数

hive>?select b.id, sum(b.wc) as m from (select userid as id, wordcount(content,"微博") as wc from weibo_data) b group by b.id order by m desc limit 10;

1652867473 3351 1642909335 3219 2671109275 1803 1287708222 1670 1658688240 1588 2016713117 1559 2654839837 1494 1585341321 1416 1404287794 1403 1781387491 1226

说明：按userid分组统计微博内容content列包含字符串"微博"的总次数，取总次数最多的前10行，列出用户id和总次数

第8章电商推荐系统分析

1 创建电商数据表并加载数据到数据仓库

hive> create database if not exists tmall; hive> use tmall; 说明：创建电商数据库tmall

hive> create external table if not exists tmall_201412(uid string,time string,pname string,price double,number int,pid string) row format delimited fields terminated by '\t'; 说明：创建电商数据表tmall_201412，是按月份建表，存放2014年12月的电商数据

使用XShell的XFtp文件传输工具，将Win7系统D:\Hive教学\教学课件\Hive离线计算-配套视频代码数据资料\电商推荐系统\tmall-201412.csv数据文件上传到虚拟机Linux系统的/root目录 Linux终端执行cd切换到/root目录，再执行命令 wc -l tmall-201412.csv 统计数据文件行数：? 242425 tmall-201412.csv hive> load data local inpath '/root/tmall-201412.csv' into table tmall_201412; 说明：将数据文件tmall-201412.csv加载到数据表tmall_201412

hive> select * from tmall_201412 limit 10; hive> select uid from tmall_201412 limit 10; 说明:查询tmall_201412表的前10行数据 hive> select count(*) from tmall_201412; OK242425 说明：统计tmall_201412表的总行数，共242425行

2 对原数据表进行数据清洗

将后面执行数据分析（推荐算法）涉及的两个关键列uid和pid提取出来，重定向到一个本地数据文件保存，实现Hadoop集群到本地的数据转移，在生产环境经常用这种方法进行数据清洗在Linux终端执行命令 hive -e "select uid, pid from tmall.tmall_201412" > tmall_201412_uid_pid.txt

本案例采用创建临时表（通常是内部表）的方式进行数据清洗 hive> create table if not exists tmall_201412_uid_pid(uid string,pid string) row format delimited fields terminated by '\t'; 说明：创建临时表tmall_201412_uid_pid，只存放关键列uid和pid，作为后面推荐算法的input数据

执行初步的数据加载，检查数据表的数据完整性 hive> insert overwrite table tmall_201412_uid_pid select uid, pid from tmall_201412;

hive> select uid, pid from tmall_201412 limit 1000; hive> select uid, pid from tmall_201412 limit 2000;注意：由于数据量很大，XShell终端无法显示所有数据行，查看前1000条和2000条会发现存在数据缺失NULL、空字符串""和数据格式错误uid列存在数据缺失：null?? ?22138924179uid存在列空字符串：?? ?38660306914uid列存在数据格式错误：2577000?? ?17525034357结论：需要对原数据表tmall_201412的uid和pid列进行数据清洗！

下面对原数据表tmall_201412的uid和pid列的进行数据清洗：

利用Hive 正则表达式过滤函数 regexp_extract 进行数据清洗函数格式: ? ?string regexp_extract(string subject, ?string pattern, ?int index) 第1个参数：subject ? 要被过滤的主字符串第2个参数：pattern 正则表达式过滤的模式第3个参数：index ?

index=0 ?显示与pattern的模式完全匹配的字符串subject中的内容? index=1 ?显示pattern的第1个括号里的模式匹配的字符串subject中的内容 index=2 ?显示pattern的第2个括号里的模式匹配的字符串subject中的内容 index=3 ? 以此类推返回值: 主字符串subject经过正则表达式pattern 过滤后的结果

正则匹配字符解释： ^ 表示匹配字符串开头 $ 表示匹配结尾 . 表示任意单个字符 * 表示匹配零次到多次 + 表示匹配一次到多次

举例说明： hive>select regexp_extract('abc123def456ghi789jkl', '([0-9]+)([a-z]+)', 0);123def hive>select regexp_extract('abc123def456ghi789jkl', '([0-9]+)([a-z]+)', 1);def hive>select regexp_extract('abc123def456ghi789jkl', '([0-9]+)([a-z]+)', 2);123 hive>select regexp_extract('http://a.m.taobao.com/i41915173660abc.htm','i([0-9]+)([a-z]+)', 0);i41915173660abc hive>select regexp_extract('http://a.m.taobao.com/i41915173660abc.htm','i([0-9]+)([a-z]+)', 1);41915173660 hive> select regexp_extract('http://a.m.taobao.com/i41915173660abc.htm','i([0-9]+)([a-z]+)', 2); abc hive>select regexp_extract('13764633023', '^[0-9]*$', 0);13764633023 hive>select regexp_extract('137646 33023', '^[0-9]*$', 0); hive>select regexp_extract('', '^[0-9]*$', 0); hive>select regexp_extract(' ', '^[0-9]*$', 0); hive>select regexp_extract(NULL, '^[0-9]*$', 0); hive>select regexp_extract(null, '^[0-9]*$', 0); hive>select regexp_extract('NULL', '^[0-9]*$', 0); hive>select regexp_extract('null', '^[0-9]*$', 0); 说明：只有pid列是正确格式的值(例如'13764633023')，regexp_extract函数过滤的结果才是正确的

下面利用regexp_extract 函数对tmall_201412表进行数据清洗，清洗的结果保存到新表tmall_201412_uid_pid中：

hive> insert overwrite table tmall_201412_uid_pid select regexp_extract(uid,'^[0-9]*$', 0), regexp_extract(pid, '^[0-9]*$', 0) from tmall_201412 where regexp_extract(uid, '^[0-9]*$', 0) is not null and regexp_extract(uid, '^[0-9]*$', 0) !='' and regexp_extract(uid,'^[0-9]*$', 0) !=' ' and regexp_extract(uid, '^[0-9]*$',0) != 'NULL' and regexp_extract(uid, '^[0-9]*$', 0) != 'null' and regexp_extract(pid, '^[0-9]*$', 0) is not null and regexp_extract(pid, '^[0-9]*$', 0) !='' and regexp_extract(pid, '^[0-9]*$', 0) != ' ' and regexp_extract(pid,'^[0-9]*$', 0) != 'NULL' and regexp_extract(pid, '^[0-9]*$', 0) != 'null'; 说明：利用regexp_extract函数，过滤掉数据列uid和pid存在缺失值NULL、空字符'',空格字符串' ',、列值为'null'和'NULL'的数据行 hive> select uid, pid from tmall_201412_uid_pid limit 1000; hive> select uid, pid from tmall_201412_uid_pid limit 2000; hive> select uid, pid from tmall_201412_uid_pid; 说明：执行数据清洗再次查询临时表tmall_201412_uid_pid，发现查到的都是清洗过的数据，tmall_201412_uid_pid表是经过清洗的干净表

下面进行电商推荐算法的实现：? ? ?Mahout是Apache下面的开源项目，提供可扩展的经典机器学习算法。本项目的电商推荐系统需要用到Mahout提供的协同过滤推荐算法。1 Mahout的安装步骤 1）使用XShell的XFtp文件传输工具，将Win7系统D:\Hive教学\教学课件\Hive离线计算-配套视频代码数据资料\电商推荐系统\apache-mahout-distribution-0.10.1.tar.gz安装包上传到虚拟机Linux系统的/root目录

2)在Linux终端执行cd命令切换到/root目录，执行命令tar -zxvf apache-mahout-distribution-0.10.1.tar.gz -C /usr/local/ ?将Mahout安装包解压缩到 /usr/local/目录

3)在Linux终端执行命令 cd /usr/local/apache-mahout-distribution-0.10.1/bin/ ?切换到Mahout安装目录的bin子目录执行ls命令查看，有一个shell脚本文件mahout

4）执行mahout脚本文件?./mahout，如果Linux终端有打印输出，则说明Mahout安装成功

2 利用Mahout运行协同推荐算法在Linux终端执行命令 cd /usr/local/apache-mahout-distribution-0.10.1/bin/ ?切换到Mahout安装目录的bin子目录在bin子目录带参数运行mahout脚本文件 ?./mahout recommenditembased --similarityClassname SIMILARITY_COSINE --input /user/hive/warehouse/tmall.db/tmall_201412_uid_pid/ --output /output_tmall --numRecommendations 10 --booleanData 参数含义详见：教材第151页

Mahout调用MapReduce完成推荐算法，耐心等待协同推荐算法执行完成（不明觉厉@_@）！21/10/26 21:27:30 INFO MahoutDriver: Program took 412881 ms (Minutes: 6.88135)

由于数据量比较大，运行协同推荐算法用时大约7分钟！

在Linux终端执行命令hadoop fs -cat /output_tmall 查看运行协同推荐算法生成的结果文件在Linux终端执行命令 hadoop fs -cat /output_tmall/part-r-00000 ?查看协同推荐算法的计算结果，例如：

85366778010?? ?[16596842686:1.0,37960797318:1.0,38330727924:1.0,38192360483:1.0,41379268069:1.0,42864581492:1.0,37347600538:1.0,38482099856:1.0,41340342132:1.0,38332450786:1.0]

说明：第1个字段85366778010表示用户uid，后面中括号中的是推荐商品列表，列表中共有10个元素，说明有10件推荐商品，例如16596842686:1.0是一件商品，16596842686表示商品pid,1.0是对推荐商品的打分；详细解释参考教材第152页