Question: Assume you have been given the below tables on sessions that users have generated, and a users table. Write a query to get the active user fraction of daily cohorts (acquisition date).
Explanation & Definition: The date in the users table is the user acquisition date, a user acquired on a given date is called ‘active’ if he/she has generated at least 3 sessions since the date of acquisition.
Also, all of these sessions must have happened after the acquisition date. In other words, sessions generated on the acquisition date should not be considered when deciding the active state of a user.
Active User Fraction (of a given acquisition date) = total active users /total users acquired that day
Difficulty Level: Medium, Source: CrackingProduct, Tag: Snapchat SQL Interview Question
Assumptions: No NULLS present; DBMS: Spark SQL
Output Columns: acquisition_date, active_user_frac
Approach:
Basic Explanation: date column in the sessions table is the session date and the date column from the users table is the user acquisition date.
for each date in the users table, we will have a bunch of users who were acquired on that date and their generated sessions are recorded in the sessions table.
Note that, not all users from the users table will be there in the sessions table since it’s possible that after being acquired (say installing the app), the user didn’t open the app or left the app on the login screen.
Also, only the sessions generated after the user acquisition date have to be taken into account.
Step-1: we will first join both tables in such a way that we don’t lose any rows corresponding to the users table since we have to calculate the total acquired users of a given acquisition date (irrespective of the fact whether the user is active or not).
Step-2: Once we have the data in the above step, for each user of a given acquisition date, we will count the number of valid sessions (here, a valid session is one generated after the user acquisition date), and based on which, we will tag the user as active or inactive
Solution:
with session_acquisition as ( select u.date as acquisition_date, u.user_id, s.session_id, s.date as session_date from users u left join sessions s on u.user_id = s.user_id and s.date > u.date) select acquisition_date, avg(is_active) as active_user_frac from( select acquisition_date, user_id, count(session_id) as num_sessions, case when count(session_id) >=3 then 1 else 0 end as is_active from session_acquisition group by acquisition_date, user_id) group by acquisition_date order by acquisition_date