Because our platform works with various data providers, CDPs, and many different companies with very different data lakes and schemas, messy data is what we encounter most often.

Here's our approach to cleaning up this data so it can be usable and actionable:

What is "messy" data?

"Messy data" can come in various forms. Here's what we see most commonly:

  • Empty data values
  • Missing data points
  • Data in unstructured form
  • Duplicate rows
  • Wrongly formatted columns
  • Incorrect data types
  • and more.
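
To make these concrete, here is a minimal sketch of how such issues can be detected and handled with pandas. The frame, column names, and values are purely illustrative, not an actual customer schema:

```python
import pandas as pd

# Hypothetical raw events; columns and values are illustrative only.
df = pd.DataFrame({
    "user_id": ["u1", "u1", None, "u3"],
    "event_name": ["app_opened", "app_opened", "added_to_cart", ""],
    "timestamp": ["2024-01-01 10:00", "2024-01-01 10:00", "not-a-date", "2024-01-02 09:30"],
})

# Empty values and missing data points both surface as missing once
# empty strings are normalised.
df = df.replace("", pd.NA)
print(df.isna().sum())          # missing values per column

# Duplicate rows are dropped outright.
df = df.drop_duplicates()

# Wrongly formatted or wrongly typed columns are coerced; unparseable
# rows become NaT instead of breaking the pipeline.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)
```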

Also, this data can come in any of these formats (a format-dispatch sketch follows the list):

  • CSV
  • JSON
  • Parquet
  • Plain text (.txt)
  • Data tables
  • Avro
  • etc.
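
One way to normalise ingestion across these formats is to dispatch on file type and hand everything downstream as a single tabular structure. The sketch below uses pandas with hypothetical paths and extensions; Avro would need an extra library such as fastavro, and database tables would come in over a SQL connection rather than a file reader:

```python
import pandas as pd
from pathlib import Path

# Illustrative readers keyed by file extension; the dispatch table itself
# is an assumption, not our production ingestion code.
READERS = {
    ".csv": pd.read_csv,
    ".json": lambda p: pd.read_json(p, lines=True),   # newline-delimited JSON
    ".parquet": pd.read_parquet,                       # needs pyarrow or fastparquet
    ".txt": lambda p: pd.read_csv(p, sep="\t"),        # tab-separated text dump
}

def load_any(path: str) -> pd.DataFrame:
    """Dispatch on extension so downstream cleaning always sees a DataFrame."""
    return READERS[Path(path).suffix.lower()](path)
```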

How we clean messy data

We're aware of all of these possible scenarios, but we don't have the liberty to drastically change our customers' data formats, so we apply a structured, minimal data model along with a data cleaner and translator to all the data we ingest.

This allows us to ingest data in any format and at any level of consistency into our system, making it easy for customers to get onboarded with minimal effort.
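
Conceptually, the cleaner and translator sit between a format-agnostic reader and the minimal model described in the next section. A rough sketch of that flow, with placeholder clean and translate passes, might look like this:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder cleaning pass: normalise empty strings and drop duplicate rows."""
    return df.replace("", pd.NA).drop_duplicates()

def translate(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder translation pass; the actual field mapping is sketched below."""
    return df

def ingest(path: str) -> pd.DataFrame:
    raw = pd.read_csv(path)        # or any format-specific reader, as sketched above
    return translate(clean(raw))
```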

Use as little as possible

The minimum data that we require for our learning infrastructure consists of the following fields (a translation sketch follows the list):

  • user_id: A unique identifier for the user. (If it is PII, we hash it.)
  • event_name: This can be anything, like app_opened, added_to_cart, order_completed, etc.
  • timestamp: The timestamp of the event, converted to UTC.
  • metadata: Any kind of metadata included with the event, in whatever form. We convert this to a JSON string, so it can be unstructured and can contain any information present in the customer database.
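
To illustrate what the translator does with these fields, here is a minimal sketch. The mapping argument and column names are hypothetical, and SHA-256 simply stands in for whatever hashing is applied to PII identifiers:

```python
import hashlib
import json
import pandas as pd

def translate(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Map a customer frame onto the four minimal fields. `mapping` names the
    customer columns holding the id, event, and time; everything else is
    folded into the metadata JSON string."""
    out = pd.DataFrame()

    # user_id: hashed if it is PII (e.g. an email address).
    out["user_id"] = df[mapping["user_id"]].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()
    )

    # event_name: e.g. app_opened, added_to_cart, order_completed.
    out["event_name"] = df[mapping["event_name"]]

    # timestamp: always converted to UTC.
    out["timestamp"] = pd.to_datetime(df[mapping["timestamp"]], utc=True)

    # metadata: every remaining column serialised into a JSON string.
    extra_cols = [c for c in df.columns if c not in mapping.values()]
    out["metadata"] = df[extra_cols].apply(
        lambda row: json.dumps(row.to_dict(), default=str), axis=1
    )
    return out
```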

Parallel processing

We employ a parallel-compute data processing pipeline when we ingest data from a customer into our system. This pipeline performs all of the necessary data cleaning steps and then the translation into the minimal model, as outlined below.

This allows us to handle huge volumes of data, as it distributes data cleaning and translation across several compute machines and scales out horizontally whenever needed. Any other custom data-formatting requirements are also baked into the parallel pipeline.
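
The shape of that fan-out can be sketched with Python's standard process pool. The real pipeline runs on distributed compute rather than a single machine, so treat this only as an illustration of the idea, reusing the hypothetical ingest() from the earlier sketch:

```python
from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def process_file(path: str) -> pd.DataFrame:
    # One unit of work: read, clean, and translate a single input file,
    # using the hypothetical ingest() sketched earlier.
    return ingest(path)

def process_batch(paths: list[str], workers: int = 8) -> pd.DataFrame:
    # Fan the files out across worker processes and stitch the results together.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(process_file, paths))
    return pd.concat(frames, ignore_index=True)
```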

Conclusion

By employing a minimal data model, Aampe can work with any format of data a customer gives us and translate it into the Aampe model with the required minimal fields. Any extra information lets us add use cases, but the minimum data is what our AI model uses to learn each user's timing, copy, frequency, and channel preferences so we can target efficiently.
