大厂面试SQL题(三) ——连续登陆(含间隔)(字节)
承继上篇文章 大厂面试SQL题(一) ——连续登陆,阐述了最为简易的连续登录题目。然而,倘若在面试题目中增添难度,将 隔天统计问题 纳入考量,该作何处理呢?
沿用第一篇题目的场景实例:
(1)存在一张用户视频投稿行为表,如下所示,旨在找出连续 7 天发布视频、连续 30 天发布视频的用户,间隔一天也算连续。(解释:如果一个用户在 2024年1月1、2、4、6 号发布视频,则视为连续 6 天投稿)
-- 测试表生成SQL实例如下:
WITH author_publish_table AS (
select 802264760 AS author_id,'2024-01-01' AS publish_date
union all
select 802264760 AS author_id,'2024-01-02' AS publish_date
union all
select 802264760 AS author_id,'2024-01-04' AS publish_date
union all
select 802264760 AS author_id,'2024-01-06' AS publish_date
union all
select 802264760 AS author_id,'2024-01-07' AS publish_date
union all
select 970014504 AS author_id,'2024-01-01' AS publish_date
union all
select 970014504 AS author_id,'2024-01-02' AS publish_date
union all
select 970014504 AS author_id,'2024-01-05' AS publish_date
union all
select 802264760 AS author_id,'2024-01-10' AS publish_date
union all
select 802264760 AS author_id,'2024-01-12' AS publish_date
union all
select 802264760 AS author_id,'2024-01-13' AS publish_date
union all
select 802264760 AS author_id,'2024-01-14' AS publish_date
)
-- ----------------------------- 测试表生成 end -----------------------
select * from author_publish_table
针对时间连续类题目,必然无法回避日期差的计算。我们从这方面入手:
首先需要解决是,获取每次行为记录与前一条行为记录的日期差值,查看其是否符合“连续行为”的标准(例如两次行为的日期差<xx 天)。
其次要解决的是,能够清晰明了每一次的连续行为,即能够区分出不同的连续片段。
而针对第一点:前后两条行为的日期差值,就有必要深入研讨几类行处理类窗口函数的应用场景,下面用一个简单小实例,辅助理解这几个函数,以便更顺畅向下阅读理解最终SQL方案。
lag(): 返回分区中当前行之前行(可以指定第几行)的值。 如果没有行,则返回null。
lead():返回分区中当前行后面行(可以指定第几行)的值。 如果没有行,则返回null。
first_value():返回相对于窗口中第一行的指定列的值。
last_value():返回相对于窗口中最后一行的指定列的值。
-- 简单小实例辅助更顺畅理解最终SQL:
WITH author_publish_table AS (
select 802264760 AS author_id,'2024-01-01' AS publish_date
union all
select 802264760 AS author_id,'2024-01-02' AS publish_date
union all
select 802264760 AS author_id,'2024-01-04' AS publish_date
union all
select 970014504 AS author_id,'2024-01-06' AS publish_date
union all
select 970014504 AS author_id,'2024-01-01' AS publish_date
)
-- ----------------------------- 测试表生成 end -----------------------
SELECT author_id,
publish_date, -- 本次投稿日期
lag(publish_date, 1, NULL) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS pre_publish_date, -- 上次投稿日期
lead(publish_date,1, NULL) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS next_publish_date, -- 后次投稿日期
first_value(publish_date) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS first_publish_date, -- 首次投稿日期
last_value(publish_date) OVER (PARTITION BY author_id ORDER BY publish_date ASC rows between current row and unbounded following) AS last_publish_date -- 末次投稿日期
FROM author_publish_table
ORDER BY author_id ASC, publish_date ASC
以上小实例查询的呈现结果:
回归正文,针对此题目的解法实现:
- 首先,可以通过 lag()函数获取前后两次投稿的日期差,倘若日期差值>3,则代表“非连续投稿”,也意味着新一轮连续行为的开端。
- 其次,可以依据差值>3 的断层进行分组标识,以区分每次连续投稿的片段。
- 而后,能够通过获取每个连续投稿片段内的最大与最小投稿日期,从而计算得出本次连续投稿的天数。
最终SQL如下:
SELECT author_id,
publish_date,
last_publish_date,
publish_datediff,
refer_group,
max_publish_date,
min_publish_date,
datediff(max_publish_date,min_publish_date)+1 as continue_publish_days -- 连续投稿的天数
from (SELECT author_id,
publish_date,
last_publish_date,
publish_datediff,
refer_group, -- 连续投稿片段分组
max(publish_date) OVER (PARTITION by author_id,refer_group) max_publish_date, -- 连续投稿的结束日期
min(publish_date) OVER (PARTITION by author_id,refer_group) min_publish_date -- 连续投稿的开始日期
from (SELECT author_id,
publish_date,
last_publish_date,
publish_datediff,
sum(if(datediff(publish_date, last_publish_date)<=2,0,1) ) OVER (PARTITION BY author_id ORDER BY publish_date ASC) refer_group -- 连续投稿片段分组
FROM
(SELECT author_id,
publish_date,
last_publish_date,
datediff(publish_date, last_publish_date) publish_datediff -- 两次投稿日期差
FROM (
SELECT author_id,
publish_date, -- 本次投稿日期
lag(publish_date, 1, NULL) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS last_publish_date -- 上次投稿日期
FROM author_publish_table
ORDER BY author_id ASC, publish_date ASC
)
)
)
)
以上SQL查询数据结果呈现如下:
欢迎探讨,其他类似衍生题目补充中...