Hive SQL Notes

Hive SQL needs no comments; it explains itself

1. Table Operations

1.1 Create a Hive table

CREATE TABLE [DATABASE].[TABLE_NAME] (
    `id`    BIGINT    COMMENT '学号',
    name    STRING    COMMENT '姓名',
    ttl_score    DOUBLE    COMMENT '总分'
)
COMMENT '枝江中学五年级学生成绩明细表'
PARTITIONED BY (day STRING COMMENT 'day', hour STRING COMMENT 'hour')
ROW FORMAT [row_format]
STORED AS [file_format];

Note: for more details, see CREATE TABLE with Hive format

1.2 Drop a Hive table

DROP TABLE [DATABASE].[TABLE_NAME];
DROP TABLE IF EXISTS [DATABASE].[TABLE_NAME];
ALTER TABLE [DATABASE].[TABLE_NAME] DROP IF EXISTS PARTITION (dt='2024-09-01', country='US');

1.3 Insert new rows

Insert from a query result:

INSERT INTO TABLE [DATABASE].[TABLE_A]
SELECT id, name, department
FROM [DATABASE].[TABLE_B]
WHERE date = '2024-05-30';

Insert literal values directly:

INSERT INTO TABLE [DATABASE].[TABLE_A]
VALUES (1001, 'Alice Smith', 'HR'),
       (1002, 'Bob Johnson', 'Finance');

Insert into a partitioned table:

INSERT INTO TABLE [DATABASE].[TABLE_A] PARTITION (date='2024-05-30')
SELECT id, name, department
FROM [DATABASE].[TABLE_B]
WHERE date = '2024-05-30';

1.4 Overwrite table data

Replace the old data with new data:

INSERT OVERWRITE TABLE [DATABASE].[TABLE_A]
SELECT id, name, department
FROM [DATABASE].[TABLE_B]
WHERE date = '2024-05-30';

PS: for partitioned tables, only the targeted partitions are overwritten

1.5 Add new columns

ALTER TABLE [DATABASE].[TABLE_NAME] ADD COLUMNS (
    sub_score    ARRAY<DOUBLE>    COMMENT '分科成绩',
    sub_score_map    MAP<STRING,DOUBLE>    COMMENT '分科成绩map',
    `rank`    BIGINT    COMMENT '年段排名',
    update_time    STRING    COMMENT '更新时间'
);

1.6 Show table schema

DESC [DATABASE].[TABLE_NAME];

1.7 Show partitions

SHOW PARTITIONS [DATABASE].[TABLE_NAME];

1.8 Create a temporary table

CREATE TEMPORARY TABLE std.std_temp_high_score AS
SELECT
    name,
    ttl_score
FROM [DATABASE].[TABLE_NAME]
WHERE ttl_score >= 10
    AND LENGTH(name) > 1;

1.9 Load a local csv file into Hive

-- note: the delimiters belong to the table definition (ROW FORMAT DELIMITED
-- FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'), not to LOAD DATA itself
LOAD DATA LOCAL INPATH '/home/user/data.csv' INTO TABLE my_table;

2. Building Test Cases

2.1 Build test cases with UNION ALL

SELECT
    `id`,
    name,
    ttl_score
FROM
(
    SELECT 1 AS `id`, '小岁' AS name, 90.5 AS ttl_score
    UNION ALL
    SELECT 2 AS `id`, '大深' AS name, 91.5 AS ttl_score
    UNION ALL
    SELECT 3 AS `id`, '阿弥' AS name, 93 AS ttl_score
) t

2.2 Examples reused later in this document

For convenience, build two reusable test tables _(┐「ε:)_

1) Student score table:

CREATE TABLE std.std_student_score AS
SELECT
    `id`,
    name,
    ttl_score
FROM
(
    SELECT 1 AS `id`, '小岁' AS name, 89.5 AS ttl_score
    UNION ALL
    SELECT 2 AS `id`, '大深' AS name, 91.5 AS ttl_score
    UNION ALL
    SELECT 3 AS `id`, '阿弥' AS name, 93 AS ttl_score
) t;

2) Student basic info table:

CREATE TABLE std.std_student_info AS
SELECT
    `id`,
    gender,
    age
FROM
(
    SELECT 1 AS `id`, 'P' AS gender, 13 AS age
    UNION ALL
    SELECT 2 AS `id`, 'P' AS gender, 15 AS age
    UNION ALL
    SELECT 3 AS `id`, 'T' AS gender, 18 AS age
) t;

3. Data Types

3.1 Common data types

Data type    Example    Notes
BIGINT    3    integer
DOUBLE    3.5    double-precision floating point
STRING    apple    string
ARRAY<BIGINT>    [1,2,3]    list; elements can be BIGINT, DOUBLE, STRING, etc.
MAP<BIGINT,STRING>    {1:'A+',2:'B-'}    map; keys must be a primitive type (BIGINT, DOUBLE, STRING, etc.), values can be any type

3.2 Type conversion

1) DOUBLE to STRING

SELECT
    `id`,
    CAST(ttl_score AS STRING) AS ttl_score
FROM std.std_student_score

2) Divide the score by 10, then cast to BIGINT. Casting to BIGINT truncates the decimal part.

SELECT
    `id`,
    CAST(ttl_score/10 AS BIGINT) AS ttl_score
FROM std.std_student_score;
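To see the truncation outside Hive, here is a minimal Python sketch (illustrative Python, not HQL) mirroring `CAST(x AS BIGINT)`: both truncate toward zero.

```python
def cast_to_bigint(x: float) -> int:
    # Hive's CAST(DOUBLE AS BIGINT) truncates toward zero, like Python's int()
    return int(x)

# the ttl_score values from std.std_student_score, divided by 10
scores = [89.5, 91.5, 93.0]
print([cast_to_bigint(s / 10) for s in scores])  # [8, 9, 9]
```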

4. Common STRING Functions

4.1 Replace substrings

SELECT
    REGEXP_REPLACE("abcdeffecmn", "ef|cm", "%");

output: abcd%fe%n

4.2 Replace symbols: metacharacters such as ( and ) must be escaped with \\

SELECT
    REGEXP_REPLACE("abcd:effe(eee)cmn", "\\:|\\(|\\)", " ");

output: abcd effe eee cmn
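Both replacements can be reproduced with Python's re.sub, which follows the same alternation and escaping rules (illustrative Python, not HQL):

```python
import re

# same pattern semantics as REGEXP_REPLACE: alternation with |, metachars escaped
print(re.sub(r"ef|cm", "%", "abcdeffecmn"))          # abcd%fe%n
print(re.sub(r":|\(|\)", " ", "abcd:effe(eee)cmn"))  # abcd effe eee cmn
```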

4.3 Extract substrings

SELECT
    SUBSTR('abcedf', 3) AS sub_str;

output: cedf

SELECT
    SUBSTR('abcedf', 3, 2) AS sub_str;

output: ce

SELECT
    SUBSTR('abcedf', -3, 2) AS sub_str;

output: ed

SELECT
    SUBSTR('abcedf', -3, 5) AS sub_str;

output: edf
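The four cases above follow from two rules: SUBSTR is 1-based, and a negative start counts from the end of the string. A small Python sketch of those rules (illustrative, not HQL; it ignores edge cases such as start = 0):

```python
def substr(s, start, length=None):
    # Hive SUBSTR is 1-based; a negative start counts from the end of the string
    idx = start - 1 if start > 0 else len(s) + start
    end = len(s) if length is None else idx + length
    return s[idx:end]

print(substr('abcedf', 3))      # cedf
print(substr('abcedf', 3, 2))   # ce
print(substr('abcedf', -3, 2))  # ed
print(substr('abcedf', -3, 5))  # edf
```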

4.4 Find the position of a string in a comma-separated list

SELECT
    FIND_IN_SET('a', 'c,b,a,d');

output: 3

SELECT
    FIND_IN_SET('m', 'c,b,a,d');

output: 0

It returns 0 when the string is not found, so a zero/non-zero check also tells you whether an element exists in the list.
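The semantics are easy to pin down in Python (illustrative, not HQL): split on commas, return the 1-based position, or 0 when absent.

```python
def find_in_set(needle, haystack):
    # FIND_IN_SET: 1-based position in a comma-separated list, 0 if absent
    items = haystack.split(",")
    return items.index(needle) + 1 if needle in items else 0

print(find_in_set('a', 'c,b,a,d'))  # 3
print(find_in_set('m', 'c,b,a,d'))  # 0
```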

5. Common DOUBLE Functions

5.1 The ROUND function

SELECT
    ROUND(ratio, 3)  -- round to 3 decimal places
FROM
(
    SELECT 1.232812 AS ratio
) t;

output: 1.233

6. Row-Level Functions

HQL functions fall into two groups: row-level functions take one row in and return one row; aggregate functions take multiple rows in and return one row.

This section lists some common row-level functions.

6.1 The IF function

SELECT
    `id`,
    if(ttl_score > 90, 1, 0) AS is_high
FROM std.std_student_score;

6.2 The CASE expression

SELECT
    `id`,
    CASE WHEN ttl_score >= 90 THEN 'A+'
        WHEN ttl_score >= 80 THEN 'A'
        WHEN ttl_score >= 70 THEN 'B'
        WHEN ttl_score >= 60 THEN 'C'
        ELSE 'C-'
    END AS `level`
FROM std.std_student_score;

6.3 The CONCAT function

SELECT
    CONCAT('a', '-', 'b')

output: a-b

6.4 The CONCAT_WS function

Both of the following forms work:

SELECT
    CONCAT_WS(',',
        ARRAY('1','2','3')
    );
SELECT
    CONCAT_WS(',', '1', '2', '3');

output: 1,2,3

6.5 The SPLIT function

SELECT
    SPLIT('1,2,3', ',');

output: ["1","2","3"]

7. Aggregation with GROUP BY

7.1 Deduplicate with GROUP BY

Used on its own, GROUP BY deduplicates the grouped columns:

SELECT gender
FROM std.std_student_info
GROUP BY gender;

7.2 Count records per gender value

SELECT
    gender,
    COUNT(*) AS cnt
FROM std.std_student_info
GROUP BY gender;

8. Aggregate Functions

8.1 Count rows: the COUNT function

SELECT COUNT(*) AS row_num
FROM std.std_student_info;

8.2 Deduplicate values inside an aggregate: DISTINCT

SELECT COUNT(DISTINCT gender) AS gender_cnt
FROM std.std_student_info;

8.3 Built-in aggregates: SUM, MAX, MIN, AVG, VARIANCE

SELECT
    gender,
    SUM(age) AS sum_age,    -- sum
    MAX(age) AS max_age,    -- maximum
    MIN(age) AS min_age,    -- minimum
    AVG(age) AS avg_age,    -- mean
    VARIANCE(age) AS var_age    -- variance
FROM std.std_student_info
GROUP BY gender;

9. JOIN Types

9.1 Common JOIN types

Method    Description
LEFT JOIN    keeps every row of the left table, plus right-table rows whose join key matches; duplicate join keys produce a Cartesian product
INNER JOIN    keeps rows whose join key exists in both tables; duplicate join keys produce a Cartesian product
FULL OUTER JOIN    keeps all rows from both tables, joined where the keys match; duplicate join keys produce a Cartesian product
LEFT SEMI JOIN    returns a left-table row if it has a match in the right table; no right-table columns are returned, the result contains only left-table columns
MAP JOIN    for joining a large table with a small one: the small table is broadcast to every map task and cached locally, so the join completes in the map phase with no shuffle or reduce, greatly reducing compute cost

9.2 LEFT JOIN example

SELECT
    a.id,
    ttl_score,
    age
FROM
(
    SELECT
        id,
        ttl_score
    FROM std.std_student_score
) a
LEFT JOIN
(
    SELECT
        id,
        age
    FROM std.std_student_info
) b ON a.id = b.id;

9.3 LEFT SEMI JOIN example

First create two temporary tables for the demo. Note that the temporary tables must be created in the same session that runs the LEFT SEMI JOIN query.

-- create temporary table orders
CREATE TEMPORARY TABLE orders AS
SELECT *
FROM
(
    SELECT 1001 AS order_id, 1 AS product_id
    UNION ALL
    SELECT 1002 AS order_id, 2 AS product_id
    UNION ALL
    SELECT 1003 AS order_id, 6 AS product_id
    UNION ALL
    SELECT 1004 AS order_id, 1 AS product_id
) t;

-- create temporary table products
CREATE TEMPORARY TABLE products AS
SELECT *
FROM
(
    SELECT 1 AS product_id, 'A' AS product_name
    UNION ALL
    SELECT 2 AS product_id, 'B' AS product_name
    UNION ALL
    SELECT 3 AS product_id, 'C' AS product_name
    UNION ALL
    SELECT 4 AS product_id, 'D' AS product_name
) t;

SELECT p.product_name
FROM products p
LEFT SEMI JOIN orders o
ON p.product_id = o.product_id;

output:

product_name
A
B
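The semantics can be sketched in a few lines of Python (illustrative, not HQL): keep a left-table row iff its join key appears in the right table, and emit only left-table columns.

```python
orders = [(1001, 1), (1002, 2), (1003, 6), (1004, 1)]  # (order_id, product_id)
products = [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]    # (product_id, product_name)

# LEFT SEMI JOIN: an existence test on the join key; right-table columns never appear
ordered_ids = {pid for _, pid in orders}
result = [name for pid, name in products if pid in ordered_ids]
print(result)  # ['A', 'B']
```

Note that each matching left row appears once, no matter how many right rows match it.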

9.4 MAP JOIN example

Reuse the two temporary tables above to run a MAP JOIN.

-- create temporary table orders
CREATE TEMPORARY TABLE orders AS
SELECT *
FROM
(
    SELECT 1001 AS order_id, 1 AS product_id
    UNION ALL
    SELECT 1002 AS order_id, 2 AS product_id
    UNION ALL
    SELECT 1003 AS order_id, 6 AS product_id
    UNION ALL
    SELECT 1004 AS order_id, 1 AS product_id
) t;

-- create temporary table products
CREATE TEMPORARY TABLE products AS
SELECT *
FROM
(
    SELECT 1 AS product_id, 'A' AS product_name
    UNION ALL
    SELECT 2 AS product_id, 'B' AS product_name
    UNION ALL
    SELECT 3 AS product_id, 'C' AS product_name
    UNION ALL
    SELECT 4 AS product_id, 'D' AS product_name
) t;

-- auto-detect and convert eligible joins to MAP JOIN where possible
SET hive.auto.convert.join=true;
-- size threshold: small tables above it will not be auto-converted to MAP JOIN
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT /*+ MAPJOIN(p) */ o.order_id, o.product_id, p.product_name
FROM orders o
INNER JOIN products p
ON o.product_id = p.product_id;

output:

order_id product_id product_name
1001 1 A
1002 2 B
1004 1 A

Note: if you are unsure whether the MAP JOIN actually took effect, use EXPLAIN to inspect the query plan

EXPLAIN [query];  -- show the query plan
EXPLAIN EXTENDED [query];  -- show a more detailed query plan

10. ARRAY Functions

10.1 The COLLECT_LIST function

COLLECT_LIST aggregates a column's values across rows into an ARRAY.

SELECT
    gender,
    COLLECT_LIST(age) AS age_list
FROM std.std_student_info
GROUP BY gender;

10.2 The COLLECT_SET function

COLLECT_SET aggregates a column's values across rows into an ARRAY with duplicates removed.

SELECT
    gender,
    COLLECT_SET(age) AS age_set
FROM std.std_student_info
GROUP BY gender;

10.3 Join an ARRAY with a delimiter

SELECT
    gender,
    CONCAT_WS(',',
        COLLECT_SET(CAST(age AS STRING))
    ) AS age_set
FROM std.std_student_info
GROUP BY gender;

10.4 The ARRAY_CONTAINS function

ARRAY_CONTAINS checks whether an ARRAY contains a given element.

SELECT
    id,
    age_list
FROM
(
    SELECT 1 AS id, array(1,2,3) AS age_list
    UNION ALL
    SELECT 2 AS id, array(6,7,8) AS age_list
) t
WHERE ARRAY_CONTAINS(age_list, 3);

10.5 Sort elements: the SORT_ARRAY function

SELECT
    SORT_ARRAY(lst) AS sorted_lst
FROM
(
    SELECT ARRAY(3,2,4,1) AS lst
) t;

11. MAP Functions

11.1 The STR_TO_MAP function

STR_TO_MAP converts a STRING in a fixed format into a MAP.

It can be written as

SELECT
    STR_TO_MAP(
        CONCAT_WS(',',
            CONCAT('key_1', ':', 'value_1'),
            CONCAT('key_2', ':', 'value_2')
        ), ',', ':'
    ) AS my_map;

or

SELECT
    STR_TO_MAP(
        CONCAT_WS(',', ARRAY(
            CONCAT('key_1', ':', 'value_1'),
            CONCAT('key_2', ':', 'value_2')
            )
        ), ',', ':'
    ) AS my_map;

output: {"key_1":"value_1","key_2":"value_2"}
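The conversion itself is simple to model in Python (illustrative, not HQL): split into pairs on the pair delimiter, then split each pair once on the key/value delimiter.

```python
def str_to_map(s, pair_delim=',', kv_delim=':'):
    # split into pairs, then split each pair once on the key/value delimiter
    return dict(pair.split(kv_delim, 1) for pair in s.split(pair_delim))

print(str_to_map('key_1:value_1,key_2:value_2'))
# {'key_1': 'value_1', 'key_2': 'value_2'}
```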

11.2 Read a value by key

SELECT
    my_map['A']
FROM
(
    SELECT
        STR_TO_MAP(
            CONCAT_WS(',', ARRAY(
                CONCAT('A', ':', 'A1'),
                CONCAT('B', ':', 'B1')
                )
            ), ',', ':'
        ) AS my_map
) t;

output: A1

11.3 The MAP_KEYS function

MAP_KEYS extracts the keys of a map into an ARRAY.

📖 For later demos, create a shared example table:

CREATE TABLE std.std_level_map AS
SELECT
    `id`,
    sub_score
FROM
(
    SELECT
        1 AS `id`,
        STR_TO_MAP(
            CONCAT_WS(',',
                CONCAT('语文:', '80'),
                CONCAT('数学:', '85'),
                CONCAT('英语:', '95'),
                CONCAT('政治:', '90')
            ), ',', ':'
        ) AS sub_score
    UNION ALL
    SELECT
        2 AS `id`,
        STR_TO_MAP(
            CONCAT_WS(',',
                CONCAT('法律:', '90'),
                CONCAT('英语:', '75'),
                CONCAT('政治:', '92')
            ), ',', ':'
        ) AS sub_score
) t;

Use MAP_KEYS to turn the map's keys into an ARRAY of keys:

SELECT
    `id`,
    MAP_KEYS(sub_score) AS sub_list
FROM std.std_level_map;

11.4 The MAP_VALUES function

Like MAP_KEYS, MAP_VALUES turns a map's values into an ARRAY of values:

SELECT
    `id`,
    MAP_VALUES(sub_score) AS score_list
FROM std.std_level_map;

12. Building JSON

12.1 Build a simple JSON string

SELECT
    CONCAT('{"',
        CONCAT_WS('","',
            CONCAT('语文:', '80'),
            CONCAT('数学:', '90')
        ), '"}'
    ) AS my_map;

output: {"语文:80","数学:90"}

12.2 Convert a MAP to a JSON-formatted string

SELECT
    `id`,
    CONCAT('{"',
        CONCAT_WS('","',
            COLLECT_SET(
                CONCAT(k, '":"', v)
            )
        ), '"}'
    ) AS my_map
FROM std.std_level_map
LATERAL VIEW OUTER EXPLODE(sub_score) e AS k, v
GROUP BY `id`;

output:

id my_map
1 {"语文":"80","数学":"85","英语":"95","政治":"90"}
2 {"法律":"90","英语":"75","政治":"92"}

12.3 Build a string shaped like ARRAY<MAP<STRING,STRING>>

SELECT
    CONCAT(
        '[',
        CONCAT_WS(',',
            CONCAT('{"',
                CONCAT_WS('","',
                    CONCAT('语文:', '80'),
                    CONCAT('数学:', '90')
                ), '"}'
            )
        ), ']'
    ) AS my_map;

output: [{"语文:80","数学:90"}]

12.4 Build an ARRAY<MAP<STRING,STRING>>-shaped string from a MAP

SELECT
    IF(SUM(IF(ARRAY_CONTAINS(MAP_KEYS(sub_score), '英语') AND ARRAY_CONTAINS(MAP_KEYS(sub_score), '政治'), 1, 0)) = 0, '',
        CONCAT('[',
            CONCAT_WS(',',
                COLLECT_SET(
                    CONCAT('{"',
                        CONCAT_WS('","',
                            CONCAT('英语', '":"', CAST(sub_score['英语'] AS STRING)),
                            CONCAT('政治', '":"', CAST(sub_score['政治'] AS STRING))
                        ), '"}'
                    )
                )
            ), ']'
        )
    ) AS my_map
FROM std.std_level_map;

output: [{"英语":"95","政治":"90"},{"英语":"75","政治":"92"}]

Note: if the source MAP contains none of the selected keys, the query should return an empty string. To get that behavior, wrap the expression in an outer IF. Common conditions include:

  • ARRAY_CONTAINS(MAP_KEYS(sub_score), '[YOUR_KEY_HERE]')
  • SIZE(MAP_KEYS(sub_score)) > 0

12.5 Build JSON across rows

Note: wherever CONCAT_WS appears, you can nest a COLLECT_SET or COLLECT_LIST inside it to process values across rows.

See 12.2 and 12.4 above; both of those queries already work across rows.

13. Parsing JSON

13.1 Parse a single field with get_json_object

SELECT get_json_object(json_column, '$.name')
FROM your_table;

13.2 Parse multiple fields with json_tuple

If many fields need parsing, json_tuple is much faster.

SELECT json_tuple(json_column, 'key1', 'key2') as (key1, key2)
FROM my_table;

14. LATERAL VIEW OUTER EXPLODE

LATERAL VIEW OUTER EXPLODE() expands the elements of an ARRAY or MAP into multiple rows. The clause usually goes after FROM and before WHERE, as shown below 👇

14.1 Explode an ARRAY into rows

SELECT
    `id`,
    gid
FROM
(
    SELECT
        1 AS `id`,
        ARRAY(1,2,3,5,5) AS group_id
    UNION ALL
    SELECT
        2 AS `id`,
        ARRAY(1,2) AS group_id
) t
LATERAL VIEW OUTER EXPLODE(group_id) g AS gid
WHERE id IN (1,3);

output:

id gid
1 1
1 2
1 3
1 5
1 5
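The explode-then-filter logic above can be sketched in Python (illustrative, not HQL): emit one output row per array element, repeating the id, then apply the WHERE filter.

```python
rows = [(1, [1, 2, 3, 5, 5]), (2, [1, 2])]   # (id, group_id)

# EXPLODE: one output row per array element, with the id repeated
exploded = [(i, gid) for i, arr in rows for gid in arr]
# the WHERE id IN (1, 3) filter from the query above
result = [(i, gid) for i, gid in exploded if i in (1, 3)]
print(result)  # [(1, 1), (1, 2), (1, 3), (1, 5), (1, 5)]
```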

14.2 Explode an ARRAY into rows: a more complex example

SELECT
    gid,
    CONCAT_WS(',', COLLECT_SET(CAST(id AS STRING))) AS id_list
FROM
(
    SELECT
        1 AS `id`,
        ARRAY(1,2,3,4,5,5) AS group_id
    UNION ALL
    SELECT
        2 AS `id`,
        ARRAY(2,4,4,3) AS group_id
) t
LATERAL VIEW OUTER EXPLODE(group_id) g AS gid
WHERE gid > 1
GROUP BY gid;

output:

gid id_list
2 1,2
3 1,2
4 1,2
5 1

14.3 Explode multiple ARRAYs: a Cartesian product between group_id and level

SELECT
    `id`,
    gid,
    lvl
FROM
(
    SELECT
        1 AS `id`,
        ARRAY(1,2) AS group_id,
        ARRAY('A','B') AS level
) t
LATERAL VIEW OUTER EXPLODE(group_id) g AS gid
LATERAL VIEW OUTER EXPLODE(level) l AS lvl;

output:

id gid lvl
1 1 A
1 1 B
1 2 A
1 2 B

14.4 MAPs can be exploded too

Note: see 12.2 "Convert a MAP to a JSON-formatted string". Likewise, multiple MAPs can be exploded just like multiple ARRAYs; no further demo here.

15. Window Functions

A window function takes n rows in and returns n rows out. That does not make it a row-level function: it can draw on other rows' information when computing a value for the current row.

15.1 Find classmates: PARTITION BY

  • Input: id (student number) and class
  • Output: id, class, and id_list (the student numbers of classmates)

SELECT
    id,
    class,
    COLLECT_SET(id) OVER(PARTITION BY class) AS id_list
FROM
(
    SELECT 1 AS id, 'A' AS class
    UNION ALL
    SELECT 2 AS id, 'A' AS class
    UNION ALL
    SELECT 3 AS id, 'B' AS class
) t;

output:

id class id_list
1 A [1,2]
2 A [1,2]
3 B [3]

Note: any common aggregate function can replace COLLECT_SET(id) in this example, e.g.:

  • SUM()
  • COUNT()
  • MAX()

and so on.

15.2 Number each row: the ROW_NUMBER function

SELECT
    id,
    class,
    score,
    ROW_NUMBER() OVER(PARTITION BY class ORDER BY score DESC) AS class_no
FROM
(
    SELECT 1 AS id, 'A' AS class, 90 AS score
    UNION ALL
    SELECT 2 AS id, 'A' AS class, 91 AS score
    UNION ALL
    SELECT 3 AS id, 'A' AS class, 91 AS score
    UNION ALL
    SELECT 4 AS id, 'B' AS class, 100 AS score
) t;

output:

id class score class_no
1 A 90 3
2 A 91 1
3 A 91 2
4 B 100 1

Note: although students 2 and 3 both scored 91, their class numbers (class_no) do not collide: student 2 gets number 1 and student 3 gets number 2.

15.3 Ranking: the RANK function

SELECT
    id,
    class,
    score,
    RANK() OVER(PARTITION BY class ORDER BY score DESC) AS class_rk
FROM
(
    SELECT 1 AS id, 'A' AS class, 90 AS score
    UNION ALL
    SELECT 2 AS id, 'A' AS class, 91 AS score
    UNION ALL
    SELECT 3 AS id, 'A' AS class, 91 AS score
    UNION ALL
    SELECT 4 AS id, 'B' AS class, 100 AS score
) t;

output:

id class score class_rk
1 A 90 3
2 A 91 1
3 A 91 1
4 B 100 1

Note: students 2 and 3 both scored 91 and share the same class rank (class_rk): they tie for first. Student 1, with two students ahead of him, drops straight to third.

15.4 Rank by score, not by headcount: the DENSE_RANK function

DENSE_RANK() ranks purely by score value, ignoring headcount.

Student 1 in the previous example scored 90. Counting heads, two students rank ahead of him, so he comes third. Counting distinct scores, however, his 90 is second only to 91. If you want ranks determined by a score's position among distinct values rather than by headcount, use DENSE_RANK.

SELECT
    id,
    class,
    score,
    DENSE_RANK() OVER(PARTITION BY class ORDER BY score DESC) AS class_rk
FROM
(
    SELECT 1 AS id, 'A' AS class, 90 AS score
    UNION ALL
    SELECT 2 AS id, 'A' AS class, 91 AS score
    UNION ALL
    SELECT 3 AS id, 'A' AS class, 91 AS score
    UNION ALL
    SELECT 4 AS id, 'B' AS class, 100 AS score
) t;

output:

id class score class_rk
1 A 90 2
2 A 91 1
3 A 91 1
4 B 100 1
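The three numbering functions can be contrasted in one Python sketch over class A's scores (illustrative, not HQL):

```python
def window_numbers(scores):
    # returns (score, row_number, rank, dense_rank) with scores ordered DESC
    ordered = sorted(scores, reverse=True)
    out, prev, rank, dense = [], None, 0, 0
    for i, s in enumerate(ordered, start=1):
        if s != prev:
            # new score value: rank jumps to the row position, dense_rank steps by 1
            rank, dense, prev = i, dense + 1, s
        out.append((s, i, rank, dense))
    return out

print(window_numbers([90, 91, 91]))
# [(91, 1, 1, 1), (91, 2, 1, 1), (90, 3, 3, 2)]
```

Ties share rank and dense_rank but never share row_number, matching the three output tables above.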

16. Trailing Clauses

16.1 Sort results: ORDER BY

Find the two students with the top-2 scores:

SELECT
    id,
    score
FROM
(
    SELECT 1 AS id, 70 AS score
    UNION ALL 
    SELECT 2 AS id, 60 AS score
    UNION ALL 
    SELECT 3 AS id, 90 AS score
) t
ORDER BY score DESC
LIMIT 2;

ORDER BY is also commonly used for random sampling:

ORDER BY RAND();

16.2 Filter groups: HAVING

HAVING filters the result of GROUP BY. The following HQL selects classes with more than one student:

SELECT
    class,
    COUNT(DISTINCT id) AS id_cnt
FROM
(
    SELECT 1 AS id, 'A' AS class
    UNION ALL
    SELECT 2 AS id, 'A' AS class
    UNION ALL
    SELECT 3 AS id, 'B' AS class
) t
GROUP BY class
HAVING id_cnt > 1;

16.3 CLUSTER BY, DISTRIBUTE BY and SORT BY

These three clauses control reducers. DISTRIBUTE BY [YOUR_COLUMN_HERE] controls how rows are routed to reducers, for example:

SELECT
    class,
    id
FROM
(
    SELECT 1 AS id, 'A' AS class
    UNION ALL
    SELECT 2 AS id, 'A' AS class
    UNION ALL
    SELECT 3 AS id, 'B' AS class
) t
DISTRIBUTE BY class;

As above, rows with the same class value go to the same reducer.

SORT BY [YOUR_COLUMN_HERE] controls how rows are sorted within each reducer, for example:

SELECT
    class,
    id
FROM
(
    SELECT 1 AS id, 'A' AS class
    UNION ALL
    SELECT 2 AS id, 'A' AS class
    UNION ALL
    SELECT 3 AS id, 'B' AS class
) t
DISTRIBUTE BY class
SORT BY id;

As above, each reducer sorts its rows by the id column.

CLUSTER BY on a column is equivalent to both DISTRIBUTE BY and SORT BY on that column: syntactic sugar, really.

Note: these clauses are rarely needed day to day, so when are they useful?

One example: a user's (uid) shopping-cart order (oid) is sometimes split into multiple packages, each with its own sub-order (sub_oid), so one order maps to several sub-orders. Suppose we only know the uid - oid relation and the uid - sub_oid relation, and that each sub_oid is its oid followed by an underscore and a package number. To recover the oid - sub_oid relation, DISTRIBUTE BY uid routes each user's rows to the same reducer, which can save a lot of compute.

17. The TRANSFORM Keyword

The HQL part:

SELECT TRANSFORM
-- input columns passed to the Python script
(
    COL_1,
    COL_2
)
USING '[PYTHON_ENV_PATH_HERE] [YOUR_PYTHON_FILE_HERE]' AS
-- output columns of the Python script; row count and content depend on the script
(
    OUTPUT_1,
    OUTPUT_2,
    OUTPUT_3
)
FROM
-- data source fed to the Python script as input
(
    SELECT
        COL_1,
        COL_2
    FROM [DB_NAME].[SOURCE_TABLE]
    WHERE [CONDITIONS]
    DISTRIBUTE BY [YOUR_KEY]
) t;

The Python part:

import sys


class Processor:

    @staticmethod
    def add(lst):
        nlst = list()
        for col_1, col_2 in lst:
            nlst.append((col_1, col_2, col_1+'-'+col_2))

        return nlst

    def main(self):
        lst = list()
        for line in sys.stdin:
            # Hive feeds the TRANSFORM input columns as tab-separated lines on stdin
            col_1, col_2 = line.strip().split('\t')[:2]
            lst.append((col_1, col_2))

        # rows printed to stdout become the TRANSFORM output columns
        for col_1, col_2, col_3 in self.add(lst):
            print('{}\t{}\t{}'.format(col_1, col_2, col_3))


if __name__ == '__main__':
    p = Processor()
    p.main()

18. NULL Handling

18.1 The COALESCE function

COALESCE takes multiple arguments, scans them left to right, and returns the first non-NULL one.

SELECT
    COALESCE(col_1, col_2, 0) AS res
FROM
(
    SELECT NULL AS col_1, 10 AS col_2
    UNION ALL
    SELECT NULL AS col_1, NULL AS col_2
) t;

output:

res
10
0
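The same scan-left-to-right logic in Python (illustrative, not HQL), with None standing in for NULL:

```python
def coalesce(*args):
    # return the first non-None argument, else None (Hive would return NULL)
    return next((a for a in args if a is not None), None)

rows = [(None, 10), (None, None)]
print([coalesce(c1, c2, 0) for c1, c2 in rows])  # [10, 0]
```

NVL(a, b), covered next, behaves like the two-argument case coalesce(a, b).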

18.2 The NVL function

NVL takes two arguments: if the first is NULL it returns the second, otherwise it returns the first.

SELECT
    NVL(col_1, col_2) AS res
FROM
(
    SELECT NULL AS col_1, 10 AS col_2
    UNION ALL
    SELECT NULL AS col_1, NULL AS col_2
    UNION ALL
    SELECT 100 AS col_1, 200 AS col_2
) t;

output:

res
10
NULL
100

19. Useful Snippets

19.1 Row-wise max/min across columns

Basic usage:

SELECT
    GREATEST(3,1,2) AS greatest_num,
    LEAST(6,5,7) AS least_num;

output:

greatest_num least_num
3 5

Scenario: find the cheapest / most expensive meal of the day

SELECT
    LEAST(breakfast, lunch, dinner) AS cheapest_meal,
    GREATEST(breakfast, lunch, dinner) AS expensive_meal
FROM
(
    SELECT 3 AS breakfast, 25 AS lunch, 15 AS dinner
    UNION ALL
    SELECT 8 AS breakfast, 0 AS lunch, 20 AS dinner
    UNION ALL
    SELECT 0 AS breakfast, 35 AS lunch, 22 AS dinner
) t;

output:

cheapest_meal expensive_meal
3 25
0 20
0 35

19.2 Compute quantiles

A simple example: compute the 95th percentile of student scores

SELECT
    ttl_score,
    quantile,
    distance,
    ROUND(quantile, 3) AS quantile_round, -- rounded to 3 decimals for display
    ROUND(distance, 3) AS distance_round
FROM
(
    SELECT
        ttl_score,
        id_cnt_accumulation / all_cnt AS quantile,    -- the actual quantile value
        ABS((id_cnt_accumulation / all_cnt) - 0.05) AS distance    -- distance from the 0.05 mark (scores are ordered DESC, so this targets the 95th percentile)
    FROM
    (
        SELECT
            ttl_score,
            SUM(id_cnt) OVER(ORDER BY ttl_score DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_cnt_accumulation,
            SUM(id_cnt) OVER() AS all_cnt
        FROM
        (
            SELECT
                ttl_score,
                COUNT(DISTINCT `id`) AS id_cnt
            FROM std.std_student_score
            GROUP BY ttl_score
        ) t1
    ) t2
) t3
ORDER BY distance
LIMIT 1;
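The core of the query — a cumulative count per score divided by the total, then the score closest to the target mark — can be sketched in Python (illustrative, not HQL; assumes one row per student id as in std.std_student_score):

```python
from collections import Counter

def quantiles_desc(scores):
    # cumulative share of students at or above each score, scores ordered DESC;
    # mirrors SUM(id_cnt) OVER(ORDER BY ttl_score DESC ...) / all_cnt
    counts = Counter(scores)
    total, acc, out = len(scores), 0, {}
    for score in sorted(counts, reverse=True):
        acc += counts[score]
        out[score] = acc / total
    return out

q = quantiles_desc([89.5, 91.5, 93])
# the score whose cumulative share is closest to 0.05 approximates the 95th percentile
p95 = min(q, key=lambda s: abs(q[s] - 0.05))
print(p95)  # 93
```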

A more involved example: compute several quantiles at once, the 25th/50th/75th percentiles

SELECT
    ttl_score,
    quantile,
    ROUND(quantile, 3) AS quantile_round, -- rounded to 3 decimals for display
    CONCAT_WS(',',
        IF(rk_25 = 1, '25', null),
        IF(rk_50 = 1, '50', null),
        IF(rk_75 = 1, '75', null)
    ) AS prompt, -- indicates which percentile(s); one ttl_score value may serve as several percentiles
    distance_25,
    distance_50,
    distance_75
FROM
(
    SELECT
        ttl_score,
        quantile,
        ABS(quantile - 0.25) AS distance_25,
        ABS(quantile - 0.5) AS distance_50,
        ABS(quantile - 0.75) AS distance_75,
        row_number() over(order by ABS(quantile - 0.25)) AS rk_25,
        row_number() over(order by ABS(quantile - 0.5)) AS rk_50,
        row_number() over(order by ABS(quantile - 0.75)) AS rk_75
    FROM
    (
        SELECT
            ttl_score,
            id_cnt_accumulation / all_cnt AS quantile    -- the actual quantile value
        FROM
        (
            SELECT
                ttl_score,
                SUM(id_cnt) OVER(ORDER BY ttl_score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_cnt_accumulation,
                SUM(id_cnt) OVER() AS all_cnt
            FROM
            (
                SELECT
                    ttl_score,
                    COUNT(DISTINCT `id`) AS id_cnt
                FROM std.std_student_score
                GROUP BY ttl_score
            ) t1
        ) t2
    ) t3
) t4
WHERE rk_25 = 1
    OR rk_50 = 1
    OR rk_75 = 1;

A simpler approach: the NTILE() function

SELECT
    ttl_score,
    NTILE(10) OVER (ORDER BY ttl_score) AS level
FROM std.std_student_score

20. User-Defined Functions

20.1 UDF

Hive lets you write UDFs (User-Defined Functions) to extend Hive with logic the built-in functions cannot express.

Let's write a simple UDF with the following contract:

  • Input: a comma-separated sequence of elements, e.g. "apple,orange,banana,apple"
  • Output: the number of distinct elements; for this example, 3
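Before writing any Java, the contract is easy to pin down in a few lines of Python, which is handy as a reference when testing the UDF later (illustrative only; the actual implementation below is Java):

```python
def unique_count(s):
    # comma-separated list in, number of distinct elements out; empty/None -> 0
    return len(set(s.split(','))) if s else 0

print(unique_count('apple,orange,banana,apple'))  # 3
```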

1) Create a Java project

After initializing with the default IntelliJ IDEA project template, create two files:

  • pom.xml
  • UniqueCountUDF.java

File layout:

.
├── pom.xml
└── src
    └── main
        └── java
            └── com
                └── bigdata
                    └── hive
                        └── udf
                            └── UniqueCountUDF.java

2) Write the UDF and the Maven config

UniqueCountUDF.java:

package com.bigdata.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import java.util.Arrays;

@SuppressWarnings("deprecation")
public class UniqueCountUDF extends UDF {

    /**
     * Count the distinct elements in a comma-separated list.
     *
     * @param input the input string, elements separated by commas
     * @return the number of distinct elements
     */
    public IntWritable evaluate(Text input) {
        if (input == null || input.toString().isEmpty()) {
            return new IntWritable(0);
        }
        String[] arr = input.toString().split(",");
        return new IntWritable((int) Arrays.stream(arr).distinct().count());
    }
}

pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.bigdata.hive.udf</groupId>
    <artifactId>hive-udf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>3.1.3</version>
        </dependency>
    </dependencies>
</project>

If your Hive version is not 3.1.3, change this to your actual version. Check it with hive --version

3) Package the project

In the IntelliJ IDEA right-hand sidebar, under [hive-udf] -> [Lifecycle], run clean and then package to clean and build.

This generates a file named hive-udf-1.0-SNAPSHOT.jar under the target directory.

.
├── pom.xml
├── src
│   └── ...
└── target
    ├── ...
    └── hive-udf-1.0-SNAPSHOT.jar

Note: you can also package from the command line

Open a terminal at the project root and run the following to clean, compile, and package the project:

mvn clean package

4) Upload the jar

Upload the jar to HDFS:

# check that the parent path exists
hdfs dfs -ls hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]

# create a new directory for hive-udf
hdfs dfs -mkdir hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive
hdfs dfs -mkdir hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf

# upload the jar
hdfs dfs -put [PATH/TO/JAR/FILE] \
    hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf

# delete the jar
hdfs dfs -rm \
    hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf/[JAR_FILE]

# delete the whole directory and its contents (recursive)
hdfs dfs -rm -r hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]

5) Test

Write a quick demo to check that the function behaves as expected.

unique_count.hql:

ADD JAR hdfs://[HOSTNAME]:[PORT]/[PATH/TO/USER/DIR]/hive/udf/hive-udf-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION uniq_count AS 'com.bigdata.hive.udf.UniqueCountUDF';

select uniq_count("apple,orange,banana,apple") as fruit_cnt;

From the command line, run:

hive -f unique_count.hql

If everything works, the output should be: 3

Note: UniqueCountUDF.java uses the legacy UDF API (it still runs). For newer Hive versions, prefer:

import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;

Create a class that extends GenericUDF and override its methods.

20.2 UDAF

If your logic involves aggregation (roughly, processing across rows), a UDF is no longer enough: you need a UDAF (User-Defined Aggregate Function)

import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;

Extend this abstract class and override its methods to fit your needs.

21. Command-Line Operations

  • hive -e "[HQL]": run an HQL statement from the command line
  • hive -f [PATH/TO/HQL/FILE]: run an HQL file
  • hive -f [PATH/TO/HQL/FILE] > [LOCAL_FILE].csv: save the result to a csv file
  • hive -e "SET hive.metastore.uris;": show the Hive metastore URI (host and port)
  • hive -e "SET hive.cli.print.header=true;select * from [YOUR_TABLE]": print column headers in query results

Appendix