
Why is my data changing in Mixpanel? Common data discrepancies explained

  • 9 April 2020
  • Mixpanel Employee

Every week, we will release tips to help you get the most out of Mixpanel. Want to see more? Look for the other #mpknowledgedrop articles.

Having trust in your data is essential to relying on your analyses to make business decisions. In this Community Post, we will highlight some of the ways your data may change over time if you are pulling the same report within Mixpanel (even when the date range, filters, etc. do not change). The goal here is to explain common discrepancies in reporting so you can feel fully confident as you navigate the different Mixpanel reports. You’ll be walked through this exercise by Ryan, one of our Professional Services Managers who has 5 years of experience working with various Mixpanel customers. Hi! :wave:

 

User Properties versus Event Properties

User properties are mutable, meaning they can change over time. In contrast, event properties are immutable, meaning they will not change over time. Since user properties change, if I pull a report today that filters, breaks down, or otherwise involves user properties (even in a Cohort definition), that same report may not show the same numbers next week. Let’s take an example:

Last week: I added a new user, with user property “User Type” = “Free”. When I pull a report for this date range, I would see all of the data charted as “Free” for this user.

This week: My new user converts, with user property “User Type” = “Paid” now. When I pull the same report as last week for the old date range, I would now see all of the data charted as “Paid” for this user.

Essentially I am applying a present filter (the user property) on past event data - this is an extremely powerful tool for my analysis, but can be quite confusing when I am unsure what type of report I am pulling. Just keep in mind that any analysis with user properties can be expected to change over time as the status of users changes.
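To make the "present filter on past data" idea concrete, here is a minimal sketch in Python. It is an illustrative model only, not how Mixpanel is implemented: the event list, the profile dictionaries, and the function name `breakdown_by_user_property` are all made up for this example. The events never change; only the profile that gets looked up at report time does.

```python
# Illustrative model only; names and data are hypothetical.
# Events are immutable: these two rows are the same last week and this week.
events = [
    {"user_id": "u1", "event": "App Open", "date": "2020-03-30"},
    {"user_id": "u1", "event": "Song Played", "date": "2020-03-31"},
]


def breakdown_by_user_property(events, profiles, prop):
    """Count events per value of a user property, using the CURRENT profile."""
    counts = {}
    for e in events:
        value = profiles.get(e["user_id"], {}).get(prop, "undefined")
        counts[value] = counts.get(value, 0) + 1
    return counts


# The user profile is mutable: it holds only the latest value.
profiles_last_week = {"u1": {"User Type": "Free"}}
profiles_this_week = {"u1": {"User Type": "Paid"}}

report_last_week = breakdown_by_user_property(events, profiles_last_week, "User Type")
report_this_week = breakdown_by_user_property(events, profiles_this_week, "User Type")

print(report_last_week)  # {'Free': 2}
print(report_this_week)  # {'Paid': 2}
```

Same events, same date range, different answer, purely because the profile lookup happens when the report is pulled.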

 

Rolling Cohort Definitions

Similar to user properties, you have the ability in Mixpanel to create Cohorts which have definitions that may change over time. For example, I may have a Cohort of users who have signed up in the past 7 days, or set a Cohort with a rolling date range to show users who have been active 5 out of the past 7 days. With these Cohorts, the date range of analysis is constantly changing which means the users in the Cohort are changing as well. Let’s take an example:

Last week: A new user signs up. When I pull a Cohort for this date range, I would see this user included for having performed an event in the past 7 days.

This week: 7+ days have passed since the user signed up. If I viewed this Cohort now, or applied it as a filter within reporting, the user would no longer be included in the analysis since they no longer meet the Cohort criteria.

Cohorts with dynamic date ranges will constantly add and remove members over time, and act similarly to user properties in that they apply a present filter (the current Cohort members) to past data.
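Here is a small Python sketch of why a rolling Cohort drifts. It assumes a Cohort defined as "signed up in the past 7 days" and evaluates it on two different days; the user IDs, dates, and the function `signed_up_in_last_n_days` are hypothetical, and the exact boundary handling is my assumption rather than Mixpanel's documented behavior.

```python
from datetime import date, timedelta

# Hypothetical signup dates for two users.
signups = {
    "u1": date(2020, 3, 30),
    "u2": date(2020, 4, 6),
}


def signed_up_in_last_n_days(signups, today, n=7):
    """Return the users whose signup falls in the n days ending on `today`."""
    window_start = today - timedelta(days=n)
    return {user for user, day in signups.items() if window_start < day <= today}


cohort_last_week = signed_up_in_last_n_days(signups, today=date(2020, 4, 2))
cohort_this_week = signed_up_in_last_n_days(signups, today=date(2020, 4, 9))

print(cohort_last_week)  # {'u1'}
print(cohort_this_week)  # {'u2'}
```

The definition never changed, but the membership did, so any report filtered by this Cohort changes with it.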

Pro Tip: Plot a Cohort instead of an Event in Insights to see how Cohort size is trending up or down over time and analyze complex growth KPIs.

 

Incomplete Reports

Since Mixpanel reports on user behavior are generated in real time, a report whose date range includes the current date shows data that is not yet complete, because results are still being collected. For example, let’s say I look at the Flows report for the current date. Since users are actively using my product and sending events, I should expect the common paths to change as their data is sent into Mixpanel. This same concept applies to any other report where the dates of analysis are still accepting new data.

Some examples of this include:

Funnels: When the date range of analysis includes today OR the date range set by the conversion window includes today (for example, if I am analyzing yesterday’s data with a 7 day conversion window, there are still 6 more days for users to convert!)

Retention: The final two buckets are always still in progress within the report, as indicated within the reporting itself and refreshed as each period ends

Flows: When the date range of analysis includes today, users are mid-session and still need their data to be sent to Mixpanel to be counted
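For the Funnels case above, the date arithmetic can be sketched in a few lines of Python. This is a rough check under my own assumptions: a user who enters on the last day of the date range can convert up to and including that day plus the conversion window, and the function names are made up for illustration.

```python
from datetime import date, timedelta


def last_conversion_day(range_end, conversion_window_days):
    """Last day on which a user who entered on `range_end` can still convert."""
    return range_end + timedelta(days=conversion_window_days)


def funnel_is_complete(range_end, conversion_window_days, today):
    """True only once the last possible conversion day is fully in the past."""
    return last_conversion_day(range_end, conversion_window_days) < today


def days_left_after_today(range_end, conversion_window_days, today):
    """How many more days (after today) users can still convert."""
    remaining = (last_conversion_day(range_end, conversion_window_days) - today).days
    return max(remaining, 0)


today = date(2020, 4, 9)
yesterday = today - timedelta(days=1)

print(funnel_is_complete(yesterday, 7, today))     # False
print(days_left_after_today(yesterday, 7, today))  # 6
```

If the check comes back `False`, treat the conversion numbers as provisional and re-pull the report once the window has closed.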

 

5 Day Ingestion Window from Mobile

In our mobile libraries, by default, Mixpanel accepts data up to 5 days old. This covers cases of mobile connectivity issues where not all data is flushed to Mixpanel before a user goes offline or loses their strong connection. Take for example the tube rider who is using your app offline whilst commuting to work - this data won’t be flushed until the app is opened again with an internet connection.

For these cases, if I pull a report today for yesterday’s data, and then pull it again on each of the next 5 days, I will continue to see discrepancies between the pulls. As an example, my data might look like:

Thursday, April 2 (pulled Friday, April 3): 1,000 events

Thursday, April 2 (pulled Saturday, April 4): 1,021 events (+2.1% vs. the first pull)

Thursday, April 2 (pulled Sunday, April 5): 1,029 events (+2.9%)

Thursday, April 2 (pulled Monday, April 6): 1,033 events (+3.3%)

Thursday, April 2 (pulled Tuesday, April 7): 1,035 events (+3.5%)

As you can see, the data continues to fill in for April 2 for up to 5 days after the day’s data has been recorded. This is especially important for analyses you are doing outside of Mixpanel, where a pulled report is a static snapshot: you’ll want to factor in the common delay % as an error bound in your reporting.
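If you are tracking this outside of Mixpanel, the delay % is simple to compute. The short Python sketch below uses the example counts from this post; the variable names are my own.

```python
# Event counts for Thursday, April 2, as seen on each later pull date.
pulls = [
    ("Fri Apr 3", 1000),
    ("Sat Apr 4", 1021),
    ("Sun Apr 5", 1029),
    ("Mon Apr 6", 1033),
    ("Tue Apr 7", 1035),
]

baseline = pulls[0][1]  # the first pull is the reference point

# Percent growth of each pull relative to the first pull.
growth_pct = [round((count - baseline) * 100 / baseline, 1) for _, count in pulls]

for (label, count), pct in zip(pulls, growth_pct):
    print(f"{label}: {count} events (+{pct}%)")

# The last value is a reasonable error bound for a next-day static export.
error_bound_pct = growth_pct[-1]
print(error_bound_pct)  # 3.5
```

Running this over a few weeks of your own data would give you a typical delay % to quote alongside any next-day export.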

For an added bonus trick, you can utilize the “Time Stamp” property in Mixpanel to calculate potential delays. This property records the ingestion time of data into Mixpanel, and can be used to figure out how long data collection is delayed (if at all). As a rule of thumb, discrepancies here less than 5% are totally normal; anything higher than that and we have something else on our hands that needs further evaluation.
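As a rough illustration of that delay calculation, here is a Python sketch that works on exported events. The field names `event_time` and `ingested_at` are placeholders I made up for this example, not Mixpanel property names, so map them to whatever your export actually contains. The 24-hour threshold is also just an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical exported events: when each happened vs. when it was ingested.
exported = [
    {"event_time": datetime(2020, 4, 2, 8, 0), "ingested_at": datetime(2020, 4, 2, 8, 0, 5)},
    {"event_time": datetime(2020, 4, 2, 9, 0), "ingested_at": datetime(2020, 4, 2, 9, 1)},
    {"event_time": datetime(2020, 4, 2, 18, 0), "ingested_at": datetime(2020, 4, 4, 7, 30)},
    {"event_time": datetime(2020, 4, 2, 20, 0), "ingested_at": datetime(2020, 4, 2, 20, 0, 2)},
]


def late_share(events, threshold=timedelta(hours=24)):
    """Fraction of events ingested more than `threshold` after they occurred."""
    if not events:
        return 0.0
    late = sum(1 for e in events if e["ingested_at"] - e["event_time"] > threshold)
    return late / len(events)


share = late_share(exported)
print(f"{share:.0%} of events arrived more than a day late")  # 25%
```

With real data you would compare this share against the 5% rule of thumb above.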

 

Identity Management

In our new identity management system, Mixpanel stitches together the activity of users across devices. This stitching applies retroactively, so old data is modified when a user is later associated with multiple devices. Let’s imagine I have the below:

Last week: User visits the home page on their computer, then again on their mobile phone (all anonymously)

If I were looking at a report about users visiting my homepage during this time, I would see 2 unique users - one for each device. This is expected since we cannot tell who a user is until they authenticate. But, let’s fast forward to this week:

This week: User signs up on their computer, then logs in on their mobile device

With identity merge, we retroactively update the data so it is all tied to the same user. If I look at the same report of homepage views for last week, I now see 1 user! This is great since the data is more accurate this way and stitched across any device or authentication; however, it’s important to understand how this works so you aren’t caught by surprise when reports that count unique users - Funnels, Flows, Retention - change slightly.
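One way to picture the effect of a merge on unique counts is a small union-find structure in Python. To be clear, this is a conceptual sketch and not Mixpanel's actual identity system; the class `IdentityGraph`, the IDs, and the function `unique_users` are all invented for illustration.

```python
class IdentityGraph:
    """Toy union-find: maps every known ID to one canonical user ID."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)  # unknown IDs are their own user
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def merge(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def unique_users(events, graph):
    """Count distinct canonical users behind a list of events."""
    return len({graph.find(e["distinct_id"]) for e in events})


# Last week's homepage views: two anonymous devices.
homepage_views = [{"distinct_id": "anon-laptop"}, {"distinct_id": "anon-phone"}]

graph = IdentityGraph()
before = unique_users(homepage_views, graph)

# This week: the user signs up on the laptop, then logs in on the phone.
graph.merge("user-123", "anon-laptop")
graph.merge("user-123", "anon-phone")
after = unique_users(homepage_views, graph)

print(before, after)  # 2 1
```

The stored events for last week are untouched; the count changes because both device IDs now resolve to the same user.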

 

Event Deduplication

Events from our default SDKs are automatically deduplicated at the end of each day. When Mixpanel finds events that are exactly the same (determined by a property, “insert_id”), it removes the duplicate entries. This prevents bugs in your code from adversely affecting your analytics, and allows for clean analysis of all unique event data. Note that this process is performed automatically on the Mixpanel end if you are using our default SDKs, which send a unique “insert_id” property with each event.
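Conceptually, the deduplication step looks something like the Python sketch below. It is a simplified model that keys only on an `insert_id` field, as described above, and keeps the first copy of each event; the real process on Mixpanel's side may differ, and the function name and sample data are my own.

```python
def deduplicate(events):
    """Keep the first event seen for each insert_id, preserving order."""
    seen = set()
    unique = []
    for e in events:
        key = e["insert_id"]
        if key in seen:
            continue  # a later copy of an event we already kept
        seen.add(key)
        unique.append(e)
    return unique


# A retry bug sent the same purchase twice with the same insert_id.
raw = [
    {"event": "Purchase", "insert_id": "a1"},
    {"event": "Purchase", "insert_id": "a1"},
    {"event": "Page View", "insert_id": "b2"},
]

clean = deduplicate(raw)
print(len(raw), len(clean))  # 3 2
```

This is why a same-day report can show slightly more events than the same report pulled a day later.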

 

Data Imported via ETL

As we learned earlier, user properties are mutable. This means they can be edited in an ETL (Extract Transform Load) process that imports data from another source - even if the user in question was not active in your product. Typically these processes would be set up by someone on your technical team who is enriching the data on some regular basis to ensure the data about your users is up to date.

In addition to user properties, you can also import event data for prior dates if you so desire. This data is sent to a specific endpoint in Mixpanel (/import) and will always show a property in Mixpanel where “Import” = True so you can distinguish if the data was sent live to Mixpanel or imported after the fact.

 

Data Removed for GDPR

Under GDPR, companies are legally required to let end consumers exercise their right to erasure. Depending on your data setup, honoring a request may range from removing some sensitive data about a user all the way to removing that user from the dataset entirely. Either way, historical data will change as GDPR requests are processed by your team.

 

And that’s that! While there are certainly other reasons you may see discrepancies in reporting, the above are what we see often with our customers. Hope you learned something through this post, and please feel free to add on if you find new common scenarios for this! If you have any discrepancies like the above and aren’t sure what to do next, let us know and we can help further with that as well :heart_eyes:

