
Data Pipelines FAQ

  • 30 July 2020

Mixpanel's Data Pipelines team works with all kinds of users to supply event and user data to various warehouse and storage platforms. Through these conversations, we've seen a set of common questions on topics such as timestamps, data transformations, and warehouse behaviors.

This article highlights some frequent scenarios our users encounter, along with explanations and troubleshooting.

 

I tried to create a pipeline, but encountered an error

Errors during pipeline creation most often relate to trial pipeline expiration, authorization, or Google Groups. The Data Pipelines API documentation is the reference for the parameters accepted during pipeline creation. Below are some common errors.
 

{"Error":"the account associated with this project has not purchased the Data Warehouse Export package. You can still use the one-time trial to test the pipeline."}

The above error occurs when creating a non-trial pipeline for a Mixpanel organization that does not have the Data Pipelines package enabled. Users may create one trial pipeline per Mixpanel project; after that, the Data Pipelines package is required to continue using the service.
 

{"Error":"sharing bigquery with share group failed: failed updating metadata with access: googleapi: Error 400: IAM setPolicy failed for Dataset <dataset>: Account <email> is of type \"user\". Please set the type prefix to be \"user:\"., invalid"}

The above error occurs when you create a BigQuery pipeline and pass an invalid email to the bq_share_with_group parameter. That parameter requires a Google Group email and will return an error if passed another kind of account, such as a user email or a service account email.
 

{"Error":"authorization failed"}

The Data Pipelines API uses HTTP Basic authentication to process requests, so if you see this error, check how the credentials are passed. Your project's api_secret is the username and the password is left empty. For example, if your api_secret were 123456789abcdef, the request would look as follows (this example creates a trial BigQuery pipeline):

curl https://data.mixpanel.com/api/2.0/nessie/pipeline/create \
-u 123456789abcdef: \
-d type="bigquery" \
-d bq_region="US_EAST_4" \
-d trial=true \
-d bq_share_with_group="bq-access-alias@somecompany.com"

Note that the -u value ends with a colon. The colon separates the username from the password, which is left empty.
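For anyone calling the API from a script rather than curl, here is a minimal Python sketch of the same request. It only builds the request so you can inspect the Authorization header. The helper name build_create_request is made up for illustration, and the secret and group email are the placeholder values from the curl example above.

```python
import base64
from urllib import parse, request

# Placeholder secret from the curl example above; substitute your own.
API_SECRET = "123456789abcdef"
CREATE_URL = "https://data.mixpanel.com/api/2.0/nessie/pipeline/create"


def build_create_request(api_secret: str, params: dict) -> request.Request:
    """Build (but do not send) a pipeline-create request using Basic auth."""
    # Basic auth encodes "username:password"; the password is empty here,
    # which is why the credential string ends with a colon.
    token = base64.b64encode(f"{api_secret}:".encode("ascii")).decode("ascii")
    body = parse.urlencode(params).encode("ascii")
    return request.Request(
        CREATE_URL,
        data=body,
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


req = build_create_request(
    API_SECRET,
    {
        "type": "bigquery",
        "bq_region": "US_EAST_4",
        "trial": "true",
        "bq_share_with_group": "bq-access-alias@somecompany.com",
    },
)
print(req.get_header("Authorization"))
# Calling request.urlopen(req) would actually send it; omitted on purpose.
```

If the header you send does not decode to your api_secret followed by a single colon, that is a likely cause of the "authorization failed" response.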

 

What is the bq_share_with_group parameter? How do I add users to view my Mixpanel data in BigQuery?

A BigQuery Data Pipeline will write your Mixpanel data to a shared BigQuery instance. Mixpanel uses Google Groups to provision access to the shared dataset.

You must pass a Google Group as the bq_share_with_group parameter. You can learn how to create a new Google Group here. The Google Group passed as a parameter will have access to the dataset.

To allow additional users to view the data, simply add them to the Google Group.

 

Can I create multiple pipelines?

The Data Pipelines package allows you to create as many pipelines as you like, provided the pipelines' names are unique. Pipeline names are a combination of the following parameters provided during initialization:

  1. project_id

  2. data_source

  3. frequency

  4. type

  5. schema_type

  6. trial

The pipeline name takes the form <project_id-data_source-frequency-type-schema_type>, with trial appended if trial=true. If you attempt to create a pipeline with all six parameters set to the same values as a pipeline you already have running, you will encounter the error {"Error":"a pipeline with the same configuration already exists"}.
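To make the uniqueness rule concrete, here is a rough Python sketch that derives a name from the six parameters and checks a new configuration against existing ones. The exact separator before the trial suffix is an assumption, as are the sample parameter values; Mixpanel's real naming may differ in detail.

```python
def pipeline_name(project_id, data_source, frequency, type_, schema_type, trial=False):
    """Join the identifying parameters into a single name.

    Assumption: parts are joined with "-" and a trial pipeline gets a
    "-trial" suffix. The source only says trial is appended.
    """
    base = "-".join(
        str(part) for part in (project_id, data_source, frequency, type_, schema_type)
    )
    return f"{base}-trial" if trial else base


# Hypothetical pipelines already running in a project.
existing = {
    pipeline_name(12345, "events", "daily", "bigquery", "monoschema"),
}

# Same six values -> same name -> creation would be rejected as a duplicate.
duplicate = pipeline_name(12345, "events", "daily", "bigquery", "monoschema")
# Changing any one parameter (here, frequency) yields a distinct name.
distinct = pipeline_name(12345, "events", "hourly", "bigquery", "monoschema")

print(duplicate in existing)
print(distinct in existing)
```

In other words, varying any one of the six values is enough to create a second pipeline alongside the first.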

 

Why can't I access my Mixpanel data in BigQuery?

If you see the error "User does not have bigquery.jobs.create permission in project" when attempting to query your BigQuery dataset, the usual cause is that the wrong BigQuery project is selected. Check the project selector in the top navigation bar and make sure it is set to a project where you are allowed to run queries.

If you still do not have permission to query the dataset after selecting the correct project, confirm that you are a member of the Google Group that has BigQuery Data Viewer permissions on the dataset.

 

How do I combine my Mixpanel data in BigQuery with other BigQuery datasets?

The following steps outline how to move data from the Mixpanel-shared BigQuery view to your own instance:

  1. Verify that your Mixpanel BigQuery dataset is in the same region as your own BigQuery instance.

  2. Create a Scheduled Query on your Mixpanel BigQuery table that selects the data you want to bring over. There are a few steps and considerations here:

  • Query the data you want to transfer. For instance, to transfer the entire table, use SELECT * FROM dataset.tablename.
  • Choose a dataset in your own BigQuery account as the destination.
  • Overwrite the destination table on each run rather than appending new rows as they arrive. Overwriting avoids duplicate data and keeps the copy consistent when events are deleted or when queued mobile events are ingested late.
  • The query can run hourly, daily, weekly, or at longer intervals. It's a good idea to schedule it outside your team's working hours so it does not interfere with normal operations.

Once your queries complete, you'll be able to combine Mixpanel data with any of your other data in BigQuery.
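As an illustration of the overwrite approach described above, the Python sketch below builds a query string that replaces the destination table with a fresh copy of the source on every run. The table names are hypothetical placeholders. If you configure a destination table and an overwrite write preference in the Scheduled Query settings instead, the plain SELECT on its own should be enough.

```python
def overwrite_query(source_table: str, destination_table: str) -> str:
    """Return SQL that replaces the destination with a full copy of the source.

    CREATE OR REPLACE TABLE rebuilds the destination on each run, so late
    arrivals and deletions in the source are reflected without duplicates.
    """
    return (
        f"CREATE OR REPLACE TABLE `{destination_table}` AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# Hypothetical names: replace with your Mixpanel-shared table and your own dataset.
sql = overwrite_query(
    source_table="mixpanel-shared-project.mixpanel_dataset.mp_master_event",
    destination_table="my-project.my_dataset.mixpanel_events",
)
print(sql)
```

A full overwrite trades some query cost for simplicity; for very large tables you may prefer to restrict the SELECT to a recent window and replace only that range.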

 

Why is my exported data offset from UTC? How does Mixpanel export timestamps?

Data Pipelines exports event timestamps as Unix epoch time in UTC. This timestamp does not reflect your project's timezone.

While the timestamp on the event is UTC, Mixpanel partitions data using your project's timezone. As a result, an event's UTC timestamp can fall on a different calendar date than its partition date. If you convert the timestamp to your project's timezone, the date will match the partition.

This means you can use built-in tools that expect UTC without having to alter the timestamp. Alternatively, you can write queries against the partition to reflect the data as it appears in Mixpanel.
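Here is a small Python sketch of how a UTC timestamp can land on a different date than its partition. It assumes a project timezone fixed at UTC-7 purely for illustration; a real project would use its configured timezone, including any daylight saving rules.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration only: the project timezone is a fixed UTC-7 offset.
PROJECT_TZ = timezone(timedelta(hours=-7))

# An event that occurred at 02:30 UTC on 30 July 2020, stored as epoch seconds.
epoch_seconds = int(datetime(2020, 7, 30, 2, 30, tzinfo=timezone.utc).timestamp())

# Reading the exported value as UTC gives one calendar date...
utc_dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
# ...while the same instant in project time falls on the previous day,
# which is the date the partition would use.
project_dt = utc_dt.astimezone(PROJECT_TZ)

print("UTC date:      ", utc_dt.date())
print("Partition date:", project_dt.date())
```

Both values describe the same instant; only the calendar date used for grouping differs.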

 

Why don't I see a certain property or event name in my table?

A missing column name is often the result of the transformations Data Pipelines applies to your raw data to create warehouse-compatible names. In these cases, the data may have a reformatted name in the warehouse schema (for example, "Song Played" becomes "song_played"). You can read the full set of transformation rules here.
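If you want to guess what a property or event name might look like in the warehouse, a rough Python approximation is sketched below. It is not the official rule set (see the transformation rules linked above); it simply lowercases the name and collapses non-alphanumeric runs into underscores, which reproduces the "Song Played" example.

```python
import re


def to_column_name(name: str) -> str:
    """Approximate a warehouse-friendly name: lowercase with underscores.

    Assumption: this is a simplification for illustration and does not
    cover every rule Data Pipelines applies.
    """
    # Replace each run of characters outside [0-9a-zA-Z] with one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop any underscores left at the ends, then lowercase.
    return cleaned.strip("_").lower()


print(to_column_name("Song Played"))
```

When a column seems to be missing, searching the schema for the output of a helper like this is a quick first check before digging further.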

 

Why is data missing from my warehouse compared to my Mixpanel project?

There are multiple Data Pipelines behaviors that can lead to discrepancies between your warehouse and the data seen in your Mixpanel project. Below are a set of common scenarios that contribute to this difference.

Data Pipelines do not export modeling layer data

In addition to raw data sent through SDKs and HTTP calls, Mixpanel allows users to create several types of computed data within the product.

These computed "modeling layer" values are separate from the raw data exported via Data Pipelines, and as a result will not propagate to your warehouse. Lexicon is an effective way to see which events and properties are custom transformations and which are raw data.

The pipeline hasn't run its most recent sync operation yet

In addition to running exports for the most recent day or hour of your Mixpanel data, Data Pipelines may also be configured to perform syncing operations to patch in latent data.

For a variety of reasons, data in Mixpanel may end up differing from your recurring exports; examples can be found here. When those situations alter your data history, pipelines created with the sync=true parameter will perform the necessary exports and deletions to bring the warehouse back in line with your Mixpanel data set.

The Data Pipelines API has a timeline endpoint that shows when the most recent sync operations ran. Often a pending sync operation will resolve the discrepancy once it completes.

Your Data Pipeline is running the trial version

We offer a free trial of Data Pipelines so users can sample the product before opting into the full add-on. Trial pipelines allow only an explicit set of parameters, which you can find here. Some of these restrictions affect the data seen in the warehouse: for example, sync functionality is disabled, and the pipeline does not backfill historical windows.


2 replies


Hi @StevenBaum,

Thanks for the article, 

I have a small issue, and thought to ask here,

 

using Data pipeline API, as described here: https://developer.mixpanel.com/reference/data-pipelines-api-overview

I have created a pipeline to export data to BigQuery, it returned that pipeline is created successfully,

and I can even use the “pipeline/status” & “pipeline/jobs” end-points, and can confirm that the pipeline is running,

 

But still under Mixpanel’s integrations tab, it lists the following:

Google BigQuery [Status: Not Connected]

 

Also, from BigQuery itself I can’t find the project itself, nor the dataset

also using the link provided from the create pipeline api, it just says the following on Google:

The Classic UI has been decomissioned as of October 1st.

 

Thanks & Kind regards


Hey @mohheader, thanks for reaching out! I’ve messaged you with some updates we’ve made to address some of the points you brought up. 🙂
