event_date - has a detailed time of emails sent or actions taken. While email send times are quite systematic, purchase times are directly from user behavior, therefore will vary. Also, it's important to note that these are all in the same month.
event_type - delivered vs. halted will tell me if the user is BC or Control group; purchase actions will be key to defining metrics and exploration for this exercise.
action - will be a good feature to use for exploration
total - the total value has some missing values, which will be covered and cleaned.
action are categorical variables.
Below, I will make a new dataframe with only the features I want to keep, and dummifying the categorical variables. This will allow pandas to start identifying this information as ints and floats, allowing me to make visualizations and clean the data further for the next part of the assignment.
Above is a visualization of how each events fluctuated during the time period. I used the
zscore of these values, which will allow me to display the events' change relative to itself, and put these three events on the same measuring board.
Dotted Line at y = 0 is where the mean is, therefore falling below that line means the events were happening less than average, and vice versa.
As we can see from the visualization, emails were much more frequently sent out during the beginning of the month. Interestingly, purchases have sharper spikes during the very beginning and near the 11th but still somewhat follow the email trend.
Note: Delivered (Red) and Halted (Blue) follow very similar patterns, by nature
This is the same standardized sum of events like above, but this time by the hour of the day.
Yes! At 15:00 (3:00 PM) we see an immense spike!
Above displays which day of the week people purchased the most. Tuesday and Wednesday look significantly higher while Saturday is especially low.
Part 2 of the assignment is as follows:
A table that breaks down the following metrics, per action:
purchase-conversion rate for BC cohort and control cohort, % lift in purchase-conversion rate
customer-conversion rate for BC cohort and control cohort, % lift in customer-conversion rate
email-conversion rate for BC cohort and control cohort, % lift in email-conversion rate
attributed revenue per email for BC cohort and control cohort, % lift in attributed revenue per email
This whole part is possible by identifying credited emails (or, emails that were sent within 5 days before a purchase). My idea is to create 2 new columns
credit_event_type_halted, where it is either 0 (False) or 1 (True) if an email was sent(halted) and a purchase was made within 5 days.
1) make new dataframe called
converted_df which has all the actions of only the email ids that made a purchase at one point.
2) Make a function (with nested for-loop) that:
Iteration through unique ids that made a purchase -> the days that they were sent an email -> checking if the purchase day was within the 5-day period -> give credit if it is
credit values to define which emails were sent within 5 days before a purchase, build metrics table with defined measurements
credit features made above identifies emails that were sent 5 days before a purchase was made.
purchase-conversion rate = sum of purchases made after
credit emails / number of unique ids that had a deliver(halt)
customer-conversions rate = unique ids that had a credited email / number of unique ids with a deliver(halt)
email-conversion rate = sum of
credit / total sum of delivered(halt)
attributed revenue per email = sum of total per customer / sum of total
credit per customer
Above is the metrics table, where different measurement is divided up by experimental and control groups to distinguish lift in the two groups.
The conversion rate metrics actually do not see a positive lift. However, the attributed revenue per email is significantly higher than the control group, which I can assume means that the emails do help customers spend more.