Hive SQL needs no comments; it explains itself
1. Table Operations
1.1 Creating a Hive Table
CREATE TABLE [DATABASE].[TABLE_NAME] (
`id` BIGINT COMMENT 'student id',
name STRING COMMENT 'name',
ttl_score DOUBLE COMMENT 'total score'
)
COMMENT 'Zhijiang Middle School grade-5 student score detail table'
PARTITIONED BY (day STRING COMMENT 'day', hour STRING COMMENT 'hour')
ROW FORMAT [row_format]
STORED AS [file_format];
Note: for more details, see CREATE TABLE with Hive format
1.2 Dropping a Hive Table
DROP TABLE [DATABASE].[TABLE_NAME];
DROP TABLE IF EXISTS [DATABASE].[TABLE_NAME];
ALTER TABLE [DATABASE].[TABLE_NAME] DROP IF EXISTS PARTITION (dt='2024-09-01', country='US');
1.3 Inserting Rows
Insert from a query result:
INSERT INTO TABLE [DATABASE].[TABLE_A]
SELECT id, name, department
FROM [DATABASE].[TABLE_B]
WHERE date = '2024-05-30';
Insert literal values directly:
INSERT INTO TABLE [DATABASE].[TABLE_A]
VALUES (1001, 'Alice Smith', 'HR'),
(1002, 'Bob Johnson', 'Finance');
Insert into a partitioned table:
INSERT INTO TABLE [DATABASE].[TABLE_A] PARTITION (date='2024-05-30')
SELECT id, name, department
FROM [DATABASE].[TABLE_B]
WHERE date = '2024-05-30';
1.4 Overwriting Table Data
Delete the old data and insert the new:
INSERT OVERWRITE TABLE [DATABASE].[TABLE_A]
SELECT id, name, department
FROM [DATABASE].[TABLE_B]
WHERE date = '2024-05-30';
Note: for a partitioned table, only the data in the written partition is overwritten
1.5 Adding New Columns
ALTER TABLE [DATABASE].[TABLE_NAME] ADD COLUMNS (
sub_score ARRAY<DOUBLE> COMMENT 'per-subject scores',
sub_score_map MAP<STRING,DOUBLE> COMMENT 'per-subject score map',
`rank` BIGINT COMMENT 'grade-level rank',
update_time STRING COMMENT 'update time'
);
1.6 Inspecting a Table Schema
DESC [DATABASE].[TABLE_NAME];
1.7 Listing Partitions
SHOW PARTITIONS [DATABASE].[TABLE_NAME];
1.8 Creating a Temporary Table
CREATE TEMPORARY TABLE std.std_temp_high_score AS
SELECT
name,
ttl_score
FROM [DATABASE].[TABLE_NAME]
WHERE ttl_score >= 10
AND LENGTH(name) > 1;
1.9 Loading a Local CSV File into Hive
Note that the field and line delimiters belong in the target table's DDL (ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'), not in the LOAD DATA statement; the load itself is just:
LOAD DATA LOCAL INPATH '/home/user/data.csv' INTO TABLE my_table;
2. Building Test Cases
2.1 Building a Test Case with UNION ALL
SELECT
`id`,
name,
ttl_score
FROM
(
SELECT 1 AS `id`, '小岁' AS name, 90.5 AS ttl_score
UNION ALL
SELECT 2 AS `id`, '大深' AS name, 91.5 AS ttl_score
UNION ALL
SELECT 3 AS `id`, '阿弥' AS name, 93 AS ttl_score
) t;
2.2 Test Cases Used Later in This Document
For ease of exposition, build two reusable test cases _(┐「ε:)_
1) Student score table:
CREATE TABLE std.std_student_score AS
SELECT
`id`,
name,
ttl_score
FROM
(
SELECT 1 AS `id`, '小岁' AS name, 89.5 AS ttl_score
UNION ALL
SELECT 2 AS `id`, '大深' AS name, 91.5 AS ttl_score
UNION ALL
SELECT 3 AS `id`, '阿弥' AS name, 93 AS ttl_score
) t;
2) Student basic info table:
CREATE TABLE std.std_student_info AS
SELECT
`id`,
gender,
age
FROM
(
SELECT 1 AS `id`, 'P' AS gender, 13 AS age
UNION ALL
SELECT 2 AS `id`, 'P' AS gender, 15 AS age
UNION ALL
SELECT 3 AS `id`, 'T' AS gender, 18 AS age
) t;
3. Data Types
3.1 Common Data Types
Type | Example | Description |
---|---|---|
BIGINT | 3 | integer |
DOUBLE | 3.5 | double-precision float |
STRING | apple | string |
ARRAY<BIGINT> | [1,2,3] | list; elements may be BIGINT, DOUBLE, STRING, etc. |
MAP<BIGINT,STRING> | {1:'A+',2:'B-'} | map; keys and values may be BIGINT, DOUBLE, STRING, etc. |
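These complex types can be constructed inline with the ARRAY() and MAP() constructor functions; a minimal sketch (the aliases are illustrative):

```sql
SELECT
    ARRAY(1, 2, 3)           AS my_list,       -- ARRAY literal
    ARRAY(1, 2, 3)[0]        AS first_element, -- arrays are 0-indexed
    MAP(1, 'A+', 2, 'B-')    AS my_map,        -- MAP literal
    MAP(1, 'A+', 2, 'B-')[1] AS grade_of_1;    -- lookup by key
```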
3.2 Type Conversion
1) DOUBLE to STRING
SELECT
`id`,
CAST(ttl_score AS STRING) AS ttl_score
FROM std.std_student_score;
2) Divide the score by 10, then cast to BIGINT; casting to BIGINT truncates
SELECT
`id`,
CAST(ttl_score/10 AS BIGINT) AS ttl_score
FROM std.std_student_score;
4. Common STRING Functions
4.1 Replacing Substrings
SELECT
REGEXP_REPLACE("abcdeffecmn", "ef|cm", "%");
output: abcd%fe%n
4.2 Replacing Symbols: Escape with \\
SELECT
REGEXP_REPLACE("abcd:effe(eee)cmn", "\\:|\\(|\\)", " ");
output: abcd effe eee cmn
4.3 Taking a Substring
SELECT
SUBSTR('abcedf', 3) AS sub_str;
output: cedf
SELECT
SUBSTR('abcedf', 3, 2) AS sub_str;
output: ce
SELECT
SUBSTR('abcedf', -3, 2) AS sub_str;
output: ed
SELECT
SUBSTR('abcedf', -3, 5) AS sub_str;
output: edf
4.4 Finding a String's Position in a Comma-Separated List
SELECT
FIND_IN_SET('a', 'c,b,a,d');
output: 3
SELECT
FIND_IN_SET('m', 'c,b,a,d');
output: 0
When nothing is found it returns 0, so a non-zero check doubles as a membership test.
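For example, a minimal sketch that uses FIND_IN_SET(...) > 0 as a membership filter (the rows are illustrative):

```sql
SELECT name
FROM
(
    SELECT 'a' AS name
    UNION ALL
    SELECT 'm' AS name
) t
WHERE FIND_IN_SET(name, 'c,b,a,d') > 0; -- keeps only 'a'
```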
5. Common DOUBLE Functions
5.1 The ROUND Function
SELECT
ROUND(ratio, 3) -- round to 3 decimal places
FROM
(
SELECT 1.232812 AS ratio
) t;
output: 1.233
6. Row-Level Functions
HQL functions come in two kinds: row-level functions take one row in and return one row out; aggregate functions take many rows in and return one row out.
This section lists some common row-level functions.
6.1 The IF Function
SELECT
`id`,
if(ttl_score > 90, 1, 0) AS is_high
FROM std.std_student_score;
6.2 The CASE Expression
SELECT
`id`,
CASE WHEN ttl_score >= 90 THEN 'A+'
WHEN ttl_score >= 80 THEN 'A'
WHEN ttl_score >= 70 THEN 'B'
WHEN ttl_score >= 60 THEN 'C'
ELSE 'C-'
END AS `level`
FROM std.std_student_score;
6.3 The CONCAT Function
SELECT
CONCAT('a', '-', 'b')
output: a-b
6.4 The CONCAT_WS Function
Both of the following forms work:
SELECT
CONCAT_WS(',',
ARRAY('1','2','3')
);
SELECT
CONCAT_WS(',', '1', '2', '3');
output: 1,2,3
6.5 The SPLIT Function
SELECT
SPLIT('1,2,3', ',');
output: ["1","2","3"]
7. GROUP BY Aggregation
7.1 Deduplicating with GROUP BY
Used on its own, GROUP BY deduplicates:
SELECT gender
FROM std.std_student_info
GROUP BY gender;
7.2 Counting Records per gender Value
SELECT
gender,
COUNT(*) AS cnt
FROM std.std_student_info
GROUP BY gender;
8. Aggregate Functions
8.1 Counting Records: the COUNT Function
SELECT COUNT(*) AS row_num
FROM std.std_student_info;
8.2 Deduplicating Values During Aggregation: DISTINCT
SELECT COUNT(DISTINCT gender) AS gender_cnt
FROM std.std_student_info;
8.3 Built-in Functions: SUM, MAX, MIN, AVG, VARIANCE
SELECT
gender,
SUM(age) AS sum_age, -- sum
MAX(age) AS max_age, -- maximum
MIN(age) AS min_age, -- minimum
AVG(age) AS avg_age, -- mean
VARIANCE(age) AS var_age -- variance
FROM std.std_student_info
GROUP BY gender;
9. JOIN Methods
9.1 Common JOIN Types
Method | Description |
---|---|
LEFT JOIN | keeps every row of the left table plus the right-table rows whose join key matches; duplicate key values produce a Cartesian product |
INNER JOIN | keeps only rows whose join key exists in both tables; duplicate key values produce a Cartesian product |
FULL OUTER JOIN | keeps every row of both tables, joined where key values match; duplicate key values produce a Cartesian product |
LEFT SEMI JOIN | returns a left-table row if it has a match in the right table; it returns no right-table columns, so the result contains only left-table columns |
MAP JOIN | for joining a large table with a small one: the small table is broadcast to every map task and cached locally, so the join completes in the map phase with no shuffle or reduce, greatly reducing resource consumption |
9.2 LEFT JOIN Example
SELECT
a.id,
ttl_score,
age
FROM
(
SELECT
id,
ttl_score
FROM std.std_student_score
) a
LEFT JOIN
(
SELECT
id,
age
FROM std.std_student_info
) b ON a.id = b.id;
9.3 LEFT SEMI JOIN Example
For the demo below, first build two temporary tables. Note that the temporary tables must be run together with the LEFT SEMI JOIN query, in the same session
-- create temporary table orders
CREATE TEMPORARY TABLE orders AS
SELECT *
FROM
(
SELECT 1001 AS order_id, 1 AS product_id
UNION ALL
SELECT 1002 AS order_id, 2 AS product_id
UNION ALL
SELECT 1003 AS order_id, 6 AS product_id
UNION ALL
SELECT 1004 AS order_id, 1 AS product_id
) t;
-- create temporary table products
CREATE TEMPORARY TABLE products AS
SELECT *
FROM
(
SELECT 1 AS product_id, 'A' AS product_name
UNION ALL
SELECT 2 AS product_id, 'B' AS product_name
UNION ALL
SELECT 3 AS product_id, 'C' AS product_name
UNION ALL
SELECT 4 AS product_id, 'D' AS product_name
) t;
SELECT p.product_name
FROM products p
LEFT SEMI JOIN orders o
ON p.product_id = o.product_id;
output:
product_name |
---|
A |
B |
9.4 MAP JOIN Example
Reuse the two temporary tables above for the MAP JOIN
-- create temporary table orders
CREATE TEMPORARY TABLE orders AS
SELECT *
FROM
(
SELECT 1001 AS order_id, 1 AS product_id
UNION ALL
SELECT 1002 AS order_id, 2 AS product_id
UNION ALL
SELECT 1003 AS order_id, 6 AS product_id
UNION ALL
SELECT 1004 AS order_id, 1 AS product_id
) t;
-- create temporary table products
CREATE TEMPORARY TABLE products AS
SELECT *
FROM
(
SELECT 1 AS product_id, 'A' AS product_name
UNION ALL
SELECT 2 AS product_id, 'B' AS product_name
UNION ALL
SELECT 3 AS product_id, 'C' AS product_name
UNION ALL
SELECT 4 AS product_id, 'D' AS product_name
) t;
-- auto-detect and convert eligible joins to MAP JOIN where possible
SET hive.auto.convert.join=true;
-- size threshold: if the small table exceeds it, the MAP JOIN is not applied automatically
SET hive.mapjoin.smalltable.filesize=25000000;
SELECT /*+ MAPJOIN(p) */ o.order_id, o.product_id, p.product_name
FROM orders o
INNER JOIN products p
ON o.product_id = p.product_id;
output:
order_id | product_id | product_name |
---|---|---|
1001 | 1 | A |
1002 | 2 | B |
1004 | 1 | A |
Note: if you are not sure whether the MAP JOIN actually took effect, use the EXPLAIN keyword to inspect the query plan:
EXPLAIN [query]; -- query plan
EXPLAIN EXTENDED [query]; -- more detailed query plan
10. ARRAY Functions
10.1 The COLLECT_LIST Function
COLLECT_LIST aggregates a column's values across rows into an ARRAY
SELECT
gender,
COLLECT_LIST(age) AS age_list
FROM std.std_student_info
GROUP BY gender;
10.2 The COLLECT_SET Function
COLLECT_SET aggregates a column's values across rows into an ARRAY, removing duplicates
SELECT
gender,
COLLECT_SET(age) AS age_set
FROM std.std_student_info
GROUP BY gender;
10.3 Joining an ARRAY with a Delimiter
SELECT
gender,
CONCAT_WS(',',
COLLECT_SET(CAST(age AS STRING))
) AS age_set
FROM std.std_student_info
GROUP BY gender;
10.4 The ARRAY_CONTAINS Function
ARRAY_CONTAINS tests whether an ARRAY contains a given element
SELECT
id,
age_list
FROM
(
SELECT 1 AS id, array(1,2,3) AS age_list
UNION ALL
SELECT 2 AS id, array(6,7,8) AS age_list
) t
WHERE ARRAY_CONTAINS(age_list, 3);
10.5 Sorting Elements: the SORT_ARRAY Function
SELECT
SORT_ARRAY(lst) AS sorted_lst
FROM
(
SELECT ARRAY(3,2,4,1) AS lst
) t;
11. MAP Functions
11.1 The STR_TO_MAP Function
STR_TO_MAP converts a suitably delimited STRING into a MAP
It can be written as
SELECT
STR_TO_MAP(
CONCAT_WS(',',
CONCAT('key_1', ':', 'value_1'),
CONCAT('key_2', ':', 'value_2')
), ',', ':'
) AS my_map;
or
SELECT
STR_TO_MAP(
CONCAT_WS(',', ARRAY(
CONCAT('key_1', ':', 'value_1'),
CONCAT('key_2', ':', 'value_2')
)
), ',', ':'
) AS my_map;
output: {"key_1":"value_1","key_2":"value_2"}
11.2 Reading a Value by Key
SELECT
my_map['A']
FROM
(
SELECT
STR_TO_MAP(
CONCAT_WS(',', ARRAY(
CONCAT('A', ':', 'A1'),
CONCAT('B', ':', 'B1')
)
), ',', ':'
) AS my_map
) t;
output: A1
11.3 The MAP_KEYS Function
MAP_KEYS extracts a map's keys into an array of keys
📖 For later demos, create a shared example:
CREATE TABLE std.std_level_map AS
SELECT
`id`,
sub_score
FROM
(
SELECT
1 AS `id`,
STR_TO_MAP(
CONCAT_WS(',',
CONCAT('语文:', '80'),
CONCAT('数学:', '85'),
CONCAT('英语:', '95'),
CONCAT('政治:', '90')
), ',', ':'
) AS sub_score
UNION ALL
SELECT
2 AS `id`,
STR_TO_MAP(
CONCAT_WS(',',
CONCAT('法律:', '90'),
CONCAT('英语:', '75'),
CONCAT('政治:', '92')
), ',', ':'
) AS sub_score
) t;
Use MAP_KEYS to turn the map's keys into an ARRAY of keys
SELECT
`id`,
MAP_KEYS(sub_score) AS sub_list
FROM std.std_level_map;
11.4 The MAP_VALUES Function
Analogous to MAP_KEYS, MAP_VALUES turns a map's values into an ARRAY of values
SELECT
`id`,
MAP_VALUES(sub_score) AS score_list
FROM std.std_level_map;
12. Building JSON
12.1 Building Simple JSON
SELECT
CONCAT('{"',
CONCAT_WS('","',
CONCAT('语文:', '80'),
CONCAT('数学:', '90')
), '"}'
) AS my_map;
output: {"语文:80","数学:90"}
12.2 Converting a MAP into a JSON String
SELECT
`id`,
CONCAT('{"',
CONCAT_WS('","',
COLLECT_SET(
CONCAT(k, '":"', v)
)
), '"}'
) AS my_map
FROM std.std_level_map
LATERAL VIEW OUTER EXPLODE(sub_score) e AS k, v
GROUP BY `id`;
output:
id | my_map |
---|---|
1 | {"语文":"80","数学":"85","英语":"95","政治":"90"} |
2 | {"法律":"90","英语":"75","政治":"92"} |
12.3 Building a String Shaped Like ARRAY<MAP<STRING,STRING>>
SELECT
CONCAT(
'[',
CONCAT_WS(',',
CONCAT('{"',
CONCAT_WS('","',
CONCAT('语文:', '80'),
CONCAT('数学:', '90')
), '"}'
)
), ']'
) AS my_map;
output: [{"语文:80","数学:90"}]
12.4 Building an ARRAY<MAP<STRING,STRING>>-Shaped String from a MAP
SELECT
IF(SUM(IF(ARRAY_CONTAINS(MAP_KEYS(sub_score), '英语') AND ARRAY_CONTAINS(MAP_KEYS(sub_score), '政治'), 1, 0)) = 0, '',
CONCAT('[',
CONCAT_WS(',',
COLLECT_SET(
CONCAT('{"',
CONCAT_WS('","',
CONCAT('英语', '":"', CAST(sub_score['英语'] AS STRING)),
CONCAT('政治', '":"', CAST(sub_score['政治'] AS STRING))
), '"}'
)
)
), ']'
)
) AS my_map
FROM std.std_level_map;
output: [{"英语":"95","政治":"90"},{"英语":"75","政治":"92"}]
Note: if the source MAP contains none of the selected keys, an empty string should be returned; to get this behavior, wrap the query in an outer IF. Common conditions include:
ARRAY_CONTAINS(MAP_KEYS(sub_score), '[YOUR_KEY_HERE]')
SIZE(MAP_KEYS(sub_score)) > 0
12.5 利用跨行数据构造 JSON
Note: 凡是有
CONCAT_WS
的地方,里面都可以套一层COLLECT_SET
或者COLLECT_LIST
做跨行处理。可以参考
2)
和4)
,这两条代码实际都有跨行处理能力
13. Parsing JSON
13.1 Parsing a Single Field with get_json_object
SELECT get_json_object(json_column, '$.name')
FROM your_table;
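get_json_object also accepts nested paths and array indices in its JSONPath-style second argument; a minimal sketch (the JSON literal is illustrative):

```sql
SELECT
    get_json_object(j, '$.user.name') AS user_name, -- nested object field
    get_json_object(j, '$.tags[0]')   AS first_tag  -- array element
FROM
(
    SELECT '{"user":{"name":"Alice"},"tags":["a","b"]}' AS j
) t;
```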
13.2 Parsing Multiple Fields with json_tuple
When many fields need parsing, json_tuple is much faster.
SELECT json_tuple(json_column, 'key1', 'key2') as (key1, key2)
FROM my_table;
14. The LATERAL VIEW OUTER EXPLODE Function
LATERAL VIEW OUTER EXPLODE() expands the elements of an ARRAY or a MAP into multiple rows. The clause generally goes after FROM and before WHERE, as in the examples below 👇
14.1 Expanding an ARRAY into Rows
SELECT
`id`,
gid
FROM
(
SELECT
1 AS `id`,
ARRAY(1,2,3,5,5) AS group_id
UNION ALL
SELECT
2 AS `id`,
ARRAY(1,2) AS group_id
) t
LATERAL VIEW OUTER EXPLODE(group_id) g AS gid
WHERE id IN (1,3);
output:
id | gid |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
1 | 5 |
1 | 5 |
14.2 Expanding an ARRAY into Rows: a More Complex Example
SELECT
gid,
CONCAT_WS(',', COLLECT_SET(CAST(id AS STRING))) AS id_list
FROM
(
SELECT
1 AS `id`,
ARRAY(1,2,3,4,5,5) AS group_id
UNION ALL
SELECT
2 AS `id`,
ARRAY(2,4,4,3) AS group_id
) t
LATERAL VIEW OUTER EXPLODE(group_id) g AS gid
WHERE gid > 1
GROUP BY gid;
output:
gid | id_list |
---|---|
2 | 1,2 |
3 | 1,2 |
4 | 1,2 |
5 | 1 |
14.3 Expanding Multiple ARRAYs: group_id and level Form a Cartesian Product
SELECT
`id`,
gid,
lvl
FROM
(
SELECT
1 AS `id`,
ARRAY(1,2) AS group_id,
ARRAY('A','B') AS level
) t
LATERAL VIEW OUTER EXPLODE(group_id) g AS gid
LATERAL VIEW OUTER EXPLODE(level) l AS lvl;
output:
id | gid | lvl |
---|---|---|
1 | 1 | A |
1 | 1 | B |
1 | 2 | A |
1 | 2 | B |
14.4 MAPs Can Be Expanded Too
Note: see 12.2, Converting a MAP into a JSON String. Likewise, multiple MAPs expand just like multiple ARRAYs, so no further demo here.
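For completeness, a minimal sketch of exploding a MAP: each entry becomes one row, and the AS clause names two output columns, key and value:

```sql
SELECT
    `id`,
    k,
    v
FROM
(
    SELECT 1 AS `id`, MAP('a', 10, 'b', 20) AS m
) t
LATERAL VIEW OUTER EXPLODE(m) e AS k, v;
```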
15. Window Functions
A window function takes n rows in and returns n rows out. That does not make it a row-level function: it can draw on other rows' information when computing the current row.
15.1 Finding Classmates: PARTITION BY
- Input: id (student id) and class (class)
- Output: id (student id), class (class), and id_list (list of classmates' ids)
SELECT
id,
class,
COLLECT_SET(id) OVER(PARTITION BY class) AS id_list
FROM
(
SELECT 1 AS id, 'A' AS class
UNION ALL
SELECT 2 AS id, 'A' AS class
UNION ALL
SELECT 3 AS id, 'B' AS class
) t;
output:
id | class | id_list |
---|---|---|
1 | A | [1,2] |
2 | A | [1,2] |
3 | B | [3] |
Note: any common aggregate function can replace the COLLECT_SET(id) in this example, e.g. SUM(), COUNT(), MAX(), and so on…
15.2 Numbering Rows: the ROW_NUMBER Function
SELECT
id,
class,
score,
ROW_NUMBER() OVER(PARTITION BY class ORDER BY score DESC) AS class_no
FROM
(
SELECT 1 AS id, 'A' AS class, 90 AS score
UNION ALL
SELECT 2 AS id, 'A' AS class, 91 AS score
UNION ALL
SELECT 3 AS id, 'A' AS class, 91 AS score
UNION ALL
SELECT 4 AS id, 'B' AS class, 100 AS score
) t;
output:
id | class | score | class_no |
---|---|---|---|
1 | A | 90 | 3 |
2 | A | 91 | 1 |
3 | A | 91 | 2 |
4 | B | 100 | 1 |
Note: although students 2 and 3 both scored 91, their class sequence numbers (class_no) do not overlap: student 2 gets 1 and student 3 gets 2
15.3 Ranking: the RANK Function
SELECT
id,
class,
score,
RANK() OVER(PARTITION BY class ORDER BY score DESC) AS class_rk
FROM
(
SELECT 1 AS id, 'A' AS class, 90 AS score
UNION ALL
SELECT 2 AS id, 'A' AS class, 91 AS score
UNION ALL
SELECT 3 AS id, 'A' AS class, 91 AS score
UNION ALL
SELECT 4 AS id, 'B' AS class, 100 AS score
) t;
output:
id | class | score | class_rk |
---|---|---|---|
1 | A | 90 | 3 |
2 | A | 91 | 1 |
3 | A | 91 | 1 |
4 | B | 100 | 1 |
Note: students 2 and 3 both scored 91, so their class ranks (class_rk) are equal: they tie for first. Student 1, with two students ahead of him, drops straight to third.
15.4 Ranking by Score, Not Head Count: the DENSE_RANK Function
DENSE_RANK() ranks purely by score, ignoring head count.
Student 1 above scored 90: counting heads, two students rank ahead of him, making him third. By score alone, though, his 90 is second only to 91 in the class. To rank a score purely by its position among the distinct scores, ignoring head count, use the DENSE_RANK function.
SELECT
id,
class,
score,
DENSE_RANK() OVER(PARTITION BY class ORDER BY score DESC) AS class_rk
FROM
(
SELECT 1 AS id, 'A' AS class, 90 AS score
UNION ALL
SELECT 2 AS id, 'A' AS class, 91 AS score
UNION ALL
SELECT 3 AS id, 'A' AS class, 91 AS score
UNION ALL
SELECT 4 AS id, 'B' AS class, 100 AS score
) t;
output:
id | class | score | class_rk |
---|---|---|---|
1 | A | 90 | 2 |
2 | A | 91 | 1 |
3 | A | 91 | 1 |
4 | B | 100 | 1 |
16. Trailing Clauses
16.1 Sorting Results: ORDER BY
Find the two students with the top-2 scores:
SELECT
id,
score
FROM
(
SELECT 1 AS id, 70 AS score
UNION ALL
SELECT 2 AS id, 60 AS score
UNION ALL
SELECT 3 AS id, 90 AS score
) t
ORDER BY score DESC
LIMIT 2;
ORDER BY is also commonly used for random sampling:
ORDER BY RAND();
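For instance, a minimal sketch that samples 2 random rows from the test case in section 2.2:

```sql
SELECT
    `id`,
    name
FROM std.std_student_score
ORDER BY RAND() -- random order, so LIMIT yields a random sample
LIMIT 2;
```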
16.2 Filtering Groups: HAVING
HAVING filters the result of GROUP BY. The following HQL selects classes with more than one student:
SELECT
class,
COUNT(DISTINCT id) AS id_cnt
FROM
(
SELECT 1 AS id, 'A' AS class
UNION ALL
SELECT 2 AS id, 'A' AS class
UNION ALL
SELECT 3 AS id, 'B' AS class
) t
GROUP BY class
HAVING id_cnt > 1;
16.3 CLUSTER BY, DISTRIBUTE BY and SORT BY
These three clauses control the reducers. DISTRIBUTE BY [YOUR_COLUMN_HERE] controls how rows are routed to reducers, for example:
SELECT
class,
id
FROM
(
SELECT 1 AS id, 'A' AS class
UNION ALL
SELECT 2 AS id, 'A' AS class
UNION ALL
SELECT 3 AS id, 'B' AS class
) t
DISTRIBUTE BY class;
As above, rows sharing the same class value go to the same reducer.
SORT BY [YOUR_COLUMN_HERE] controls the row ordering within each reducer, for example:
SELECT
class,
id
FROM
(
SELECT 1 AS id, 'A' AS class
UNION ALL
SELECT 2 AS id, 'A' AS class
UNION ALL
SELECT 3 AS id, 'B' AS class
) t
DISTRIBUTE BY class
SORT BY id;
As above, each reducer sorts its rows by the id field.
Applying CLUSTER BY to a column is equivalent to applying both DISTRIBUTE BY and SORT BY to it; syntactic sugar, in other words.
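A minimal sketch: the query below behaves like DISTRIBUTE BY class followed by SORT BY class:

```sql
SELECT
    class,
    id
FROM
(
    SELECT 1 AS id, 'A' AS class
    UNION ALL
    SELECT 2 AS id, 'A' AS class
    UNION ALL
    SELECT 3 AS id, 'B' AS class
) t
CLUSTER BY class;
```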
Note: these three clauses are rare in day-to-day work, so when are they useful?
One example: a user's (uid) shopping-cart order (oid) is sometimes split into several packages, each corresponding to a sub-order (sub_oid), so one order maps to several sub-orders. Suppose we only know the uid - oid relation and the uid - sub_oid relation. If we also know that sub_oid is oid followed by an underscore and a sub-order number, and we want the oid - sub_oid relation, this is a case for DISTRIBUTE BY uid, which can save a great deal of compute.
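Under the naming assumption above (sub_oid = oid + '_' + sub-order number), one hedged sketch strips the suffix to recover oid while distributing by uid, so each user's rows land on a single reducer (table and column names are placeholders):

```sql
SELECT
    uid,
    regexp_extract(sub_oid, '^(.*)_[0-9]+$', 1) AS oid, -- drop the trailing _<number>
    sub_oid
FROM [DB_NAME].[SUB_ORDER_TABLE]
DISTRIBUTE BY uid;
```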
17. The TRANSFORM Keyword
The HQL part:
SELECT TRANSFORM
-- input to the Python script
(
COL_1,
COL_2
)
USING '[PYTHON_ENV_PATH_HERE] [YOUR_PYTHON_FILE_HERE]' AS
-- output of the Python script; the number of rows and their content depend on what the Python code writes
(
OUTPUT_1,
OUTPUT_2,
OUTPUT_3
)
FROM
-- data source, fed to the Python script as input
(
SELECT
COL_1,
COL_2
FROM [DB_NAME].[SOURCE_TABLE]
WHERE [CONDITIONS]
DISTRIBUTE BY [YOUR_KEY]
) t;
The Python part:
import sys

class Processor:
    @staticmethod
    def add(lst):
        nlst = list()
        for col_1, col_2 in lst:
            nlst.append((col_1, col_2, col_1 + '-' + col_2))
        return nlst

    def main(self):
        lst = list()
        for line in sys.stdin:
            col_1, col_2 = line.strip().split('\t')[:2]
            lst.append((col_1, col_2))
        for col_1, col_2, col_3 in self.add(lst):
            print('{}\t{}\t{}'.format(col_1, col_2, col_3))

if __name__ == '__main__':
    p = Processor()
    p.main()
18. Handling NULLs
18.1 The COALESCE Function
COALESCE takes any number of arguments, scans them left to right, and returns the first non-NULL one
SELECT
COALESCE(col_1, col_2, 0) AS res
FROM
(
SELECT NULL AS col_1, 10 AS col_2
UNION ALL
SELECT NULL AS col_1, NULL AS col_2
) t;
output:
res |
---|
10 |
0 |
18.2 The NVL Function
NVL takes two arguments: if the first is NULL it returns the second, otherwise it returns the first
SELECT
NVL(col_1, col_2) AS res
FROM
(
SELECT NULL AS col_1, 10 AS col_2
UNION ALL
SELECT NULL AS col_1, NULL AS col_2
UNION ALL
SELECT 100 AS col_1, 200 AS col_2
) t;
output:
res |
---|
10 |
NULL |
100 |
19. Useful Snippets
19.1 Row-Wise Max / Min Across Fields
Basic usage:
SELECT
GREATEST(3,1,2) AS greatest_num,
LEAST(6,5,7) AS least_num;
output:
greatest_num | least_num |
---|---|
3 | 5 |
Scenario: compute the cheapest / most expensive meal of the day
SELECT
LEAST(breakfast, lunch, dinner) AS cheapest_meal,
GREATEST(breakfast, lunch, dinner) AS expensive_meal
FROM
(
SELECT 3 AS breakfast, 25 AS lunch, 15 AS dinner
UNION ALL
SELECT 8 AS breakfast, 0 AS lunch, 20 AS dinner
UNION ALL
SELECT 0 AS breakfast, 35 AS lunch, 22 AS dinner
) t;
output:
cheapest_meal | expensive_meal |
---|---|
3 | 25 |
0 | 20 |
0 | 35 |
19.2 Computing Quantiles
A simple example: compute the 95th percentile of student scores
SELECT
ttl_score,
quantile,
distance,
ROUND(quantile, 3) AS quantile_round, -- rounded to 3 decimal places for display
ROUND(distance, 3) AS distance_round
FROM
(
SELECT
ttl_score,
id_cnt_accumulation / all_cnt AS quantile, -- the actual quantile value
ABS((id_cnt_accumulation / all_cnt) - 0.05) AS distance -- distance from the exact 95th percentile (0.05 cumulated from the top)
FROM
(
SELECT
ttl_score,
SUM(id_cnt) OVER(ORDER BY ttl_score DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_cnt_accumulation,
SUM(id_cnt) OVER() AS all_cnt
FROM
(
SELECT
ttl_score,
COUNT(DISTINCT `id`) AS id_cnt
FROM std.std_student_score
GROUP BY ttl_score
) t1
) t2
) t3
ORDER BY distance
LIMIT 1;
A more involved example: compute the 25th/50th/75th percentiles of the student scores
SELECT
ttl_score,
quantile,
ROUND(quantile, 3) AS quantile_round, -- rounded to 3 decimal places for display
CONCAT_WS(',',
IF(rk_25 = 1, '25', null),
IF(rk_50 = 1, '50', null),
IF(rk_75 = 1, '75', null)
) AS prompt, -- marks which quantile(s) this row is; one ttl_score value may serve as several quantiles
distance_25,
distance_50,
distance_75
FROM
(
SELECT
ttl_score,
quantile,
ABS(quantile - 0.25) AS distance_25,
ABS(quantile - 0.5) AS distance_50,
ABS(quantile - 0.75) AS distance_75,
ROW_NUMBER() OVER(ORDER BY ABS(quantile - 0.25)) AS rk_25,
ROW_NUMBER() OVER(ORDER BY ABS(quantile - 0.5)) AS rk_50,
ROW_NUMBER() OVER(ORDER BY ABS(quantile - 0.75)) AS rk_75
FROM
(
SELECT
ttl_score,
id_cnt_accumulation / all_cnt AS quantile -- the actual quantile value
FROM
(
SELECT
ttl_score,
SUM(id_cnt) OVER(ORDER BY ttl_score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_cnt_accumulation,
SUM(id_cnt) OVER() AS all_cnt
FROM
(
SELECT
ttl_score,
COUNT(DISTINCT `id`) AS id_cnt
FROM std.std_student_score
GROUP BY ttl_score
) t1
) t2
) t3
) t4
WHERE rk_25 = 1
OR rk_50 = 1
OR rk_75 = 1;
A shortcut: the NTILE() function
SELECT
ttl_score,
NTILE(10) OVER (ORDER BY ttl_score) AS level
FROM std.std_student_score;
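Hive also ships built-in aggregates for this; a minimal sketch using PERCENTILE_APPROX, which accepts either a single percentile or an ARRAY of them:

```sql
SELECT
    PERCENTILE_APPROX(ttl_score, 0.95)                   AS p95,      -- 95th percentile
    PERCENTILE_APPROX(ttl_score, ARRAY(0.25, 0.5, 0.75)) AS quartiles -- 25/50/75
FROM std.std_student_score;
```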
20. User-Defined Functions
20.1 UDF
Hive lets users write UDFs (User-Defined Functions) to extend Hive beyond what its built-in functions can do.
Let's write a simple UDF with this contract:
- input: a comma-separated sequence of elements:
"apple,orange,banana,apple"
- output: the number of distinct elements; for the input above:
3
1) Create a Java project
- IDE: IntelliJ IDEA Community Edition
- Build tool: Maven
After initializing from IntelliJ IDEA's default project template, create two files:
pom.xml
UniqueCountUDF.java
located as follows:
.
├── pom.xml
└── src
└── main
└── java
└── com
└── bigdata
└── hive
└── udf
└── UniqueCountUDF.java
2) Write the UDF and the Maven config
The UniqueCountUDF.java file:
package com.bigdata.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import java.util.Arrays;
@SuppressWarnings("deprecation")
public class UniqueCountUDF extends UDF {
/**
 * Counts the distinct elements in the input's comma-separated list.
 *
 * @param input the input string, with elements separated by commas
 * @return the number of distinct elements
 */
public IntWritable evaluate(Text input) {
if (input == null || input.toString().isEmpty()) {
return new IntWritable(0);
}
String[] arr = input.toString().split(",");
return new IntWritable((int) Arrays.stream(arr).distinct().count());
}
}
The pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bigdata.hive.udf</groupId>
<artifactId>hive-udf</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>3.1.3</version>
</dependency>
</dependencies>
</project>
If your Hive version is not 3.1.3, change this to the actual version. You can check it with hive --version
3) Package the project
In IntelliJ IDEA's right-hand sidebar, under [hive-udf] -> [Lifecycle], click clean and then package, to clean and then build.
This generates a file named hive-udf-1.0-SNAPSHOT.jar under the target directory.
.
├── pom.xml
├── src
│ └── ...
└── target
├── ...
└── hive-udf-1.0-SNAPSHOT.jar
Note: you can also package from the command line
Open a terminal and run the following in the project root to clean, compile, and package:
mvn clean package
4) Upload the jar
Upload the jar to HDFS:
# check that the parent path exists
hdfs dfs -ls hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]
# create a new directory for the hive-udf jar
hdfs dfs -mkdir hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive
hdfs dfs -mkdir hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf
# upload the jar
hdfs dfs -put [PATH/TO/JAR/FILE] \
hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf
# delete the jar
hdfs dfs -rm \
hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf/[JAR_FILE]
# delete the whole directory and its contents (recursive)
hdfs dfs -rm -r hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]
5) Test
Write a small demo to check that the function behaves as expected.
The unique_count.hql file:
ADD JAR hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf/hive-udf-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION uniq_count AS 'com.bigdata.hive.udf.UniqueCountUDF';
select uniq_count("apple,orange,banana,apple") as fruit_cnt;
Open a terminal and run:
hive -f unique_count.hql
If everything works, it should print: 3
Note: UniqueCountUDF.java uses the legacy API (it still runs). For newer Hive versions, the recommended import is:
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
Create a class that extends GenericUDF and override its parent methods.
20.2 UDAF
If your logic involves aggregation (roughly, cross-row processing), a UDF is no longer enough; you need a UDAF (User-Defined Aggregate Function):
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
Extend this abstract class and override its methods to fit your needs.
21. Command-Line Operations
- hive -e "[HQL]": run an HQL statement from the command line
- hive -f [PATH/TO/HQL/FILE]: run an HQL file from the command line
- hive -f [PATH/TO/HQL/FILE] > [LOCAL_FILE].csv: write the result to a csv file
- hive -e "SET hive.metastore.uris;": show the Hive service's address and port
- hive -e "SET hive.cli.print.header=true;select * from [YOUR_TABLE]": print the header row when querying
Appendix
- Hive CLI tool: Beeline
- Online practice tool: hue
- Hive UDF library: facebook-hive-udfs