OVER ([PARTITION BY <…>] [ORDER BY <….>] 其中包括以下可选项:
PARTITION BY 表示将数据先按 字段 进行分区 ORDER BY 表示将各个分区内的数据按 排序字段 进行排序
c1jWq8
window_expression 用于确定窗边界:
名词
含义
preceding
往前
following
往后
current row
当前行
unbounded
起点
unbounded preceding
从前面的起点
unbounded following
到后面的终点
窗口边界使用详解
如果不指定 PARTITION BY,则不对数据进行分区,换句话说,所有数据看作同一个分区;
如果不指定 ORDER BY,则不对各分区做排序,通常用于那些顺序无关的窗口函数,例如 SUM()
如果不指定窗口子句,则默认采用以下的窗口定义: a、若不指定 ORDER BY,默认使用分区内所有行 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. b、若指定了 ORDER BY,默认使用分区内第一行到当前值 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
select emp_name, emp_mgr, dealer_id, sales, sum(sales) over () as sample1, -- 所有sales和 sum(sales) over (partition by dealer_id) as sample2, -- 按dealer_id分组,组内数据累加 sum(sales) over (partition by dealer_id ORDER BY stat_date) as sample3, -- 按dealer_id分组,时间排序,组内数据逐个相加 sum(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sample4, -- 按dealer_id分组,时间排序,组内由起点到当前行的聚合 sum(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING and CURRENT ROW) as sample5, -- 按dealer_id分组,时间排序,组内当前行和前面一行做聚合 sum(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as sample6, -- 按dealer_id分组,时间排序,组内当前行和前一行和后一行聚合 sum(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN CURRENT ROW and UNBOUNDED FOLLOWING) as sample7 -- 按dealer_id分组,时间排序,组内当前行和后面所有行 from q1_sales;
hive sum窗口函数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
select emp_name, emp_mgr, dealer_id, sales, count(sales) over () as sample1, -- 所有条数 count(sales) over (partition by dealer_id) as sample2, -- 按dealer_id分组,组内数据数量 count(sales) over (partition by dealer_id ORDER BY stat_date) as sample3, -- 按dealer_id分组,时间排序,组内数据条数逐个相加 count(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sample4, -- 按dealer_id分组,时间排序,组内由起点到当前行的聚合 count(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING and CURRENT ROW) as sample5, -- 按dealer_id分组,时间排序,组内当前行和前面一行做聚合 count(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as sample6, -- 按dealer_id分组,时间排序,组内当前行和前一行和后一行聚合 count(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN CURRENT ROW and UNBOUNDED FOLLOWING) as sample7 -- 按dealer_id分组,时间排序,组内当前行和后面所有行 from q1_sales;
hive count窗口函数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
select emp_name, emp_mgr, dealer_id, sales, avg(sales) over () as sample1, -- 所有sales聚合 avg(sales) over (partition by dealer_id) as sample2, -- 按dealer_id分组,组内数据累加 avg(sales) over (partition by dealer_id ORDER BY stat_date) as sample3, -- 按dealer_id分组,时间排序,组内数据逐个相加 avg(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sample4, -- 按dealer_id分组,时间排序,组内由起点到当前行的聚合 avg(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING and CURRENT ROW) as sample5, -- 按dealer_id分组,时间排序,组内当前行和前面一行做聚合 avg(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as sample6, -- 按dealer_id分组,时间排序,组内当前行和前一行和后一行聚合 avg(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN CURRENT ROW and UNBOUNDED FOLLOWING) as sample7 -- 按dealer_id分组,时间排序,组内当前行和后面所有行 from q1_sales;
hive avg窗口函数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
select emp_name, emp_mgr, dealer_id, sales, max(sales) over () as sample1, -- 所有sales聚合 max(sales) over (partition by dealer_id) as sample2, -- 按dealer_id分组,组内数据累加 max(sales) over (partition by dealer_id ORDER BY stat_date) as sample3, -- 按dealer_id分组,时间排序,组内数据逐个相加 max(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sample4, -- 按dealer_id分组,时间排序,组内由起点到当前行的聚合 max(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING and CURRENT ROW) as sample5, -- 按dealer_id分组,时间排序,组内当前行和前面一行做聚合 max(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as sample6, -- 按dealer_id分组,时间排序,组内当前行和前一行和后一行聚合 max(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN CURRENT ROW and UNBOUNDED FOLLOWING) as sample7 -- 按dealer_id分组,时间排序,组内当前行和后面所有行 from q1_sales;
max
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
select emp_name, emp_mgr, dealer_id, sales, min(sales) over () as sample1, -- 所有sales聚合 min(sales) over (partition by dealer_id) as sample2, -- 按dealer_id分组,组内数据累加 min(sales) over (partition by dealer_id ORDER BY stat_date) as sample3, -- 按dealer_id分组,时间排序,组内数据逐个相加 min(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sample4, -- 按dealer_id分组,时间排序,组内由起点到当前行的聚合 min(sales) OVER (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING and CURRENT ROW) as sample5, -- 按dealer_id分组,时间排序,组内当前行和前面一行做聚合 min(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as sample6, -- 按dealer_id分组,时间排序,组内当前行和前一行和后一行聚合 min(sales) over (PARTITION BY dealer_id ORDER BY stat_date ROWS BETWEEN CURRENT ROW and UNBOUNDED FOLLOWING) as sample7 -- 按dealer_id分组,时间排序,组内当前行和后面所有行 from q1_sales;
计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值
NTILE()
INT
已排序的行划分为大小尽可能相等的指定数量的排名的组,并返回给定行所在的组的排名。如果切片不均匀,默认增加第一个切片的分布,不支持ROWS BETWEEN
1 2 3 4 5 6
select *, ROW_NUMBER() over(partition by dealer_id order by sales desc) rk01, RANK() over(partition by dealer_id order by sales desc) rk02, DENSE_RANK() over(partition by dealer_id order by sales desc) rk03, PERCENT_RANK() over(partition by dealer_id order by sales desc) rk04 from q1_sales;
开窗排名函数
1 2 3 4
select *, CUME_DIST() over(partition by dealer_id order by sales ) rk05, CUME_DIST() over(partition by dealer_id order by sales desc) rk06 from q1_sales;
开窗函数CUME_DIST
1 2 3 4 5
select *, NTILE(2) over(partition by dealer_id order by sales ) rk07, NTILE(3) over(partition by dealer_id order by sales ) rk08, NTILE(4) over(partition by dealer_id order by sales ) rk09 from q1_sales;
实现这个需求我们需要先使用row_number()over按商铺分组,按月份排序得出这样一个结果: SELECT month ,shop ,money ,ROW_NUMBER() OVER ( PARTITION BY shop ORDER BY month ) AS rn FROM temp_test12;
结果: month shop money rn 2019-01 a 1 1 2019-02 a 2 2 2019-03 a 3 3 2019-04 a 4 4 2019-05 a 5 5 2019-06 a 6 6 2019-01 b 2 1 2019-02 b 4 2 2019-03 b 6 3 2019-04 b 8 4 2019-05 b 10 5 2019-06 b 12 6
然后进行偏移自关联,将每个商铺的每个月的营业额和上个月的关联在一起:
WITH a AS ( SELECT month ,shop ,MONEY ,ROW_NUMBER() OVER ( PARTITION BY shop ORDER BY month ) AS rn FROM temp_test12 ) SELECT a1.month ,a1.shop ,a1.MONEY ,nvl(a2.month, '2018-12') before_month --为了便于理解,这里加入上月的月份。如果上月没有的月份取为2018-12 ,nvl(a2.MONEY, 1) before_money --上月没有的营业额取为1 FROM a a1 --代表本月 LEFT JOIN a a2 --代表上月 ON a1.shop = a2.shop AND a1.month = substr(add_months(CONCAT ( a2.month ,'-01' ), 1), 1, 7) --增加月份的函数add_months中至少要传入年月日 GROUP BY a1.month ,a1.shop ,a1.MONEY ,nvl(a2.month, '2018-12') ,nvl(a2.MONEY, 1);
结果: a1.month a1.shop a1.money before_month before_money 2019-01 a 1 2018-12 1 2019-02 a 2 2019-01 1 2019-03 a 3 2019-02 2 2019-04 a 4 2019-03 3 2019-05 a 5 2019-04 4 2019-06 a 6 2019-05 5 2019-01 b 2 2018-12 1 2019-02 b 4 2019-01 2 2019-03 b 6 2019-02 4 2019-04 b 8 2019-03 6 2019-05 b 10 2019-04 8 2019-06 b 12 2019-05 10
lag 开窗函数实现环比
1 2 3 4 5 6
SELECTmonth ,shop ,MONEY ,LAG(MONEY, 1, 1) OVER ( --取分组内上一行的营业额,如果没有上一行则取1 PARTITION BY shop ORDER BY month --按商铺分组,按月份排序 ) AS before_money FROM temp_test12; -- 结果集如下month shop money before_money 2019-01 a 112019-02 a 212019-03 a 322019-04 a 432019-05 a 542019-06 a 652019-01 b 212019-02 b 422019-03 b 642019-04 b 862019-05 b 1082019-06 b 1210
lag 其他用法演示
1 2 3 4 5 6 7 8 9 10
SELECTmonth ,shop ,MONEY ,LAG(MONEY, 1, 1) OVER ( PARTITIONBY shop ORDERBYmonth ) AS before_money ,LAG(MONEY, 1) OVER ( PARTITIONBY shop ORDERBYmonth ) AS before_money --第三个参数不写的话,如果没有上一行值,默认取null ,LAG(MONEY) OVER ( PARTITIONBY shop ORDERBYmonth ) AS before_money --第二个参数不写默认为1,第三个参数不写的话,如果没有上一行值,默认取null,结果与上一列相同 ,LAG(MONEY, 2, 1) OVER ( PARTITIONBY shop ORDERBYmonth ) AS before_2month_money --取两个月前的营业额FROM temp_test12; -- 结果集month shop money before_money before_money before_money before_2month_money 2019-01 a 11NULLNULL12019-02 a 211112019-03 a 322212019-04 a 433322019-05 a 544432019-06 a 655542019-01 b 21NULLNULL12019-02 b 422212019-03 b 644422019-04 b 866642019-05 b 1088862019-06 b 121010108-- 解释说明:-- shop为a时,before_money指定了往上第1行的值,如果没有上一行值,默认取null,这里指定为1。-- a的第1行,往上1行值为NULL,指定第三个参数取1,不指定取null 。-- a的第2行,往上1行值为第1行营业额值,1。-- a的第6行,往上1行值为为第5行营业额值,5
SELECT month ,shop ,MONEY ,LEAD(MONEY, 1, 7) OVER ( PARTITION BY shop ORDER BY month ) AS after_money ,LEAD(MONEY, 1) OVER ( PARTITION BY shop ORDER BY month ) AS after_money --第三个参数不写的话,如果没有下一行值,默认取null ,LEAD(MONEY, 2, 7) OVER ( PARTITION BY shop ORDER BY month ) AS after_2month_money --取两个月后的营业额 FROM temp_test12;
结果: month shop money after_money after_money after_2month_money 2019-01 a 1 2 2 3 2019-02 a 2 3 3 4 2019-03 a 3 4 4 5 2019-04 a 4 5 5 6 2019-05 a 5 6 6 7 2019-06 a 6 7 NULL 7 2019-01 b 2 4 4 6 2019-02 b 4 6 6 8 2019-03 b 6 8 8 10 2019-04 b 8 10 10 12 2019-05 b 10 12 12 7 2019-06 b 12 7 NULL 7
ELECT month ,shop ,MONEY ,first_value(MONEY) OVER ( PARTITION BY shop ORDER BY month ) AS first_money FROM temp_test12;
结果: month shop money first_money 2019-01 a 1 1 2019-02 a 2 1 2019-03 a 3 1 2019-04 a 4 1 2019-05 a 5 1 2019-06 a 6 1 2019-01 b 2 2 2019-02 b 4 2 2019-03 b 6 2 2019-04 b 8 2 2019-05 b 10 2 2019-06 b 12 2
SELECT month ,shop ,MONEY ,last_value(MONEY) OVER ( PARTITION BY shop ORDER BY month ) AS last_money FROM temp_test12;
结果: month shop money last_money 2019-01 a 1 1 2019-02 a 2 2 2019-03 a 3 3 2019-04 a 4 4 2019-05 a 5 5 2019-06 a 6 6 2019-01 b 2 2 2019-02 b 4 4 2019-03 b 6 6 2019-04 b 8 8 2019-05 b 10 10 2019-06 b 12 12
select * from ( select id , date, lead(date,1,-1) over(partition by id order by date desc ) as date1 -- 按照用户分组,登录时间降序排序,获取上一次登录日期 from tb_use a group by id,date -- 去重当日重复登录, ) as b where date_sub(cast(b.date as date),1)=cast(b.date1 as date); -- 判定当前登录日期的上一天是否与上一次登录日期一致,一致则判定为连续登录
结果: b.id b.date b.date1 A 2018-09-06 2018-09-05 A 2018-09-05 2018-09-04 C 2018-09-06 2018-09-05 C 2018-09-05 2018-09-04
统计连续登陆两天的用户个数
(n天就只需要把lead(date,2,-1)中的2改成n-1并且把date_sub(cast(b.date as date),2)中的2改成n-1)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
select count(distinct b.id) as c1 from ( select id ,date, lead(date,1,-1) over(partition by id order by date desc ) as date1 from tb_use a group by id,date ) as b where date_sub(cast(b.date as date),1)=cast(b.date1 as date);
结果: c1 2
特说说明:上文指出了连续登录2天的场景,针对其他连续登录场景,假设连续登录n天,可将lead(date,1,-1)中的1改成n-1,date_sub(cast(b.date as date),1)中的1改成n-1。
SELECT order_month, num, -- 月销量 total, -- 年销量 round( num / total, 2 ) AS ratio -- 月销量占年销量比 FROM ( select substr(order_time, 1, 7) as order_month, --查询月份 sum(order_num) over (partition by substr(order_time, 1, 7)) as num, --根据月份分组,统计月销量 sum( order_num ) over ( PARTITION BY substr( order_time, 1, 4 ) ) total, --根据年分组,统计年销量 row_number() over (partition by substr(order_time, 1, 7)) as rk from saleorder ) temp where rk = 1;
SELECT name, orderdate, cost, --当前window内,当前行的前一行到后一行 金额总和 sum(cast(cost AS INT)) over(PARTITION BY name ORDER BY orderdate DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS precedingFollow, --当前window内,当前行到最后行的金额总和 sum(cast(cost AS INT)) over(PARTITION BY name ORDER BY orderdate DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS currentFollow, --当前window内,按照时间进行排序 row_number() OVER(PARTITION BY name ORDER BY orderdate DESC) AS rank,--用户上次购买的时间 lag(orderdate,1,'查无结果') over(PARTITION BY name ORDER BY orderdate) AS lastTime,--用户下一次购买的时间 lead(orderdate,1,'查无结果') over(PARTITION BY name ORDER BY orderdate)AS nextTime,--用户上次购物金额 lag(cost,1,'查无结果')over(PARTITION BY name ORDER BY orderdate) AS lastCost,--用户下次购物金额 lead(cost,1,'查无结果') OVER (PARTITION BY name ORDER BY orderdate) AS nextCost,--用户上一次+这次的购物金额 sum(cast(cost AS INT)) over(PARTITION BY name ORDER BY orderdate ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS lastCurrentCost,--用户每月购物金额 sum(cast(cost AS INT)) over(PARTITION BY name,month(orderdate) ORDER BY month(orderdate)) AS monthCost,--用户当月单词消费最大值 max(cast(cost AS INT)) over(PARTITION BY name,month(orderdate) ORDER BY orderdate) AS monthMaxCost,--用户当月单词消费最小值 min(cast(cost AS INT)) over(PARTITION BY name,month(orderdate) ORDER BY orderdate) as monthMinCost FROM TEST.COSTITEM
间隔,最近两次间隔,登录间隔,出院间隔等等
1 2 3 4 5 6 7
select user_name, age, in_hosp, out_hosp, datediff(in_hosp,LAG(out_hosp,1,in_hosp) OVER(PARTITION BY user_name ORDER BY out_hosp asc)) as days from t_hosp;
从聚合这个意义上出发,似乎窗口函数和 Group By 聚合函数都能做到同样的事情。但是,它们之间的相似点也仅限于此了!这其中的关键区别在于: 窗口函数仅仅只会将结果附加到当前的结果上,它不会对已有的行或列做任何修改。而 Group By 的做法完全不同:对于各个 Group 它仅仅会保留一行聚合结果。
有的读者可能会问,加了窗口函数之后返回结果的顺序明显发生了变化,这不算一种修改吗?因为 SQL 及关系代数都是以 multi-set 为基础定义的,结果集本身并没有顺序可言,ORDER BY 仅仅是最终呈现结果的顺序。
另一方面,从逻辑语义上说,SELECT 语句的各个部分可以看作是按以下顺序“执行”的:
窗口函数执行
注意到窗口函数的求值仅仅位于 ORDER BY 之前,而位于 SQL 的绝大部分之后。这也和窗口函数只附加、不修改的语义是呼应的,结果集在此时已经确定好了,再依次计算窗口函数。