大厂面试SQL题(三) ——连续登陆(含间隔)(字节)

yumo6661个月前 (03-30)技术文章15

承继上篇文章 大厂面试SQL题(一) ——连续登陆,阐述了最为简易的连续登录题目。然而,倘若在面试题目中增添难度,将 隔天统计问题 纳入考量,该作何处理呢?

沿用第一篇题目的场景实例:

(1)存在一张用户视频投稿行为表,如下所示,旨在找出连续 7 天发布视频、连续 30 天发布视频的用户,间隔一天也算连续(解释:如果一个用户在 2024年1月1、2、4、6 号发布视频,则视为连续 6 天投稿)

-- 测试表生成SQL实例如下:
WITH author_publish_table AS (
    select  802264760 AS author_id,'2024-01-01' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-02' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-04' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-06' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-07' AS publish_date
    union all
    select  970014504 AS author_id,'2024-01-01' AS publish_date
    union all
    select  970014504 AS author_id,'2024-01-02' AS publish_date
    union all
    select  970014504 AS author_id,'2024-01-05' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-10' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-12' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-13' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-14' AS publish_date
)
-- ----------------------------- 测试表生成 end -----------------------
select * from author_publish_table


针对时间连续类题目,必然无法回避日期差的计算。我们从这方面入手:

首先需要解决是,获取每次行为记录与前一条行为记录的日期差值,查看其是否符合“连续行为”的标准(例如两次行为的日期差<xx 天)。

其次要解决的是,能够清晰明了每一次的连续行为,即能够区分出不同的连续片段。

而针对第一点:前后两条行为的日期差值,就有必要深入研讨几类行处理类窗口函数的应用场景,下面用一个简单小实例,辅助理解这几个函数,以便更顺畅向下阅读理解最终SQL方案。

lag(): 返回分区中当前行之前行(可以指定第几行)的值。 如果没有行,则返回null。

lead():返回分区中当前行后面行(可以指定第几行)的值。 如果没有行,则返回null。

first_value():返回相对于窗口中第一行的指定列的值。
last_value():返回相对于窗口中最后一行的指定列的值。

-- 简单小实例辅助更顺畅理解最终SQL:
WITH author_publish_table AS (
    select  802264760 AS author_id,'2024-01-01' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-02' AS publish_date
    union all
    select  802264760 AS author_id,'2024-01-04' AS publish_date
    union all
    select  970014504 AS author_id,'2024-01-06' AS publish_date
    union all
    select  970014504 AS author_id,'2024-01-01' AS publish_date
)
-- ----------------------------- 测试表生成 end -----------------------
SELECT  author_id,
        publish_date, -- 本次投稿日期
        lag(publish_date, 1, NULL) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS pre_publish_date,  -- 上次投稿日期
        lead(publish_date,1, NULL) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS next_publish_date, -- 后次投稿日期
        first_value(publish_date) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS first_publish_date, -- 首次投稿日期
        last_value(publish_date) OVER (PARTITION BY author_id ORDER BY publish_date ASC rows between current row and unbounded following)  AS last_publish_date   -- 末次投稿日期
FROM    author_publish_table
ORDER BY author_id ASC, publish_date ASC

以上小实例查询的呈现结果:

回归正文,针对此题目的解法实现:

  • 首先,可以通过 lag()函数获取前后两次投稿的日期差,倘若日期差值>3,则代表“非连续投稿”,也意味着新一轮连续行为的开端。
  • 其次,可以依据差值>3 的断层进行分组标识,以区分每次连续投稿的片段。
  • 而后,能够通过获取每个连续投稿片段内的最大与最小投稿日期,从而计算得出本次连续投稿的天数。

最终SQL如下:

SELECT author_id,
        publish_date,
        last_publish_date,
        publish_datediff,
        refer_group,
        max_publish_date,
        min_publish_date,
        datediff(max_publish_date,min_publish_date)+1 as continue_publish_days -- 连续投稿的天数
from (SELECT author_id,
        publish_date,
        last_publish_date,
        publish_datediff,
        refer_group,  -- 连续投稿片段分组
        max(publish_date) OVER (PARTITION by author_id,refer_group) max_publish_date, -- 连续投稿的结束日期
        min(publish_date) OVER (PARTITION by author_id,refer_group) min_publish_date  -- 连续投稿的开始日期
from (SELECT  author_id,
        publish_date,
        last_publish_date,
        publish_datediff,
        sum(if(datediff(publish_date, last_publish_date)<=2,0,1) ) OVER (PARTITION BY author_id ORDER BY publish_date ASC) refer_group -- 连续投稿片段分组
FROM  
   (SELECT author_id,
            publish_date,
            last_publish_date,
            datediff(publish_date, last_publish_date) publish_datediff -- 两次投稿日期差
    FROM  (
            SELECT  author_id,
                    publish_date, -- 本次投稿日期
                    lag(publish_date, 1, NULL) OVER (PARTITION BY author_id ORDER BY publish_date ASC) AS last_publish_date -- 上次投稿日期
            FROM    author_publish_table
            ORDER BY author_id ASC, publish_date ASC
        )
    )
  )
)

以上SQL查询数据结果呈现如下:

欢迎探讨,其他类似衍生题目补充中...

相关文章

70道SQL经典面试题大全,看这一篇就够了

吐血整理了一周SQL面试题,包括了选择题、问答题、实战题。每道题都给大家加了详细的独家解析。几乎每个知识点都涉及到了。刷完这套题,关于SQL的面试妥妥的。题目来源:各大厂面试题、牛客网选择题(1)基础...

总结面试:SQL常见面试题-1

刚刚走过了金三银四,相信许多小伙伴已经得到了自己心仪的offer,在这里先恭喜大家啦!但可能还有一些小伙伴刚刚开始找工作,作为一名数据分析师,还是想总结一些面试的经验及SQL的笔试面试题给各位小伙伴,...

初学者最热门和棘手的10个T-SQL问题

T-SQL有助于在物理或虚拟实例中创建和管理对象,以及插入、检索、修改和删除数据表。国际标准化组织(ISO)和美国国家标准协会(ANSI)发布的SQL标准定义了一种软件语言,Transact-SQL就...

网络安全面试题-SQL注入问题总结

一、知识储备类1.SQL与NoSQL的区别?SQL:关系型数据库NoSQL:非关系型数据库存储方式:SQL具有特定的结构表,NoSQL存储方式灵活性能:NoSQL较优于SQL数据类型:SQL适用结构化...

MYSQL经典面试题汇总

1.非聚簇索引一定会回表查询吗?不一定,这涉及到查询语句所要求的字段是否全部命中了索引,如果全部命中了索引,那么就不必再进行回表查询。举个简单的例子:假设我们在员工表的年龄上建立了索引,那么当进行的查...

SQL面试经典问题-开窗函数

分组取每组最大值案例:按课程号分组取成绩最大值所在行的数据select 学号, 课程号, max(成绩) as 最高分, from a group by 课程号分组取每...