Time travel: What did they do to my data?

Unlocking insights and resolving data anomalies: Harnessing the power of Starburst, Iceberg, and Tableau for effective data analysis

Share

It was a cold, blustery night…

You’re tired. You glance at a dashboard that has looked the same way for years, and years. Big Air Domestic should be sitting there at #1, as always. It’s a comforting thought: 

…but something has changed. This is not the familiar dashboard you know and love. It smells funny. It didn’t sound right when it loaded. That bar chart is different. It’s wrong. The data is not what it should be.

You make sure your filters are OK. OMG. The data has changed somehow. It shouldn’t. It can’t. Panic.

Ladies and gentlemen, welcome to the Thunderdome. At this point you have some choices: 

  • Weep
  • Go to bed and hope it fixes itself overnight
  • Weep, then go to bed and hope it fixes itself overnight
  • Other

You choose “Other”. That’s because you understand the trifacta of Starburst, Iceberg, and Tableau. That combination makes you a Time Traveller.

You open up a different version of the same chart and travel back in time a day.

No change: 

So you try thirty-six hours: 

Nothing. You keep sliding, and you spot a change to the data:

Clicks look correct now, but you still have a bad company name. Try once more. 

…and the world is once again as it should be. Someone changed up the data during a couple-hour time period. Now that you know the time period in question, you can take a look at Query History inside Starburst Galaxy and actually find the culprit:

Lesson learned, don’t give accounts to people with an aol.com domain…

Mystery solved. You can report this problem to your admin, or if you have permissions, fix the problem yourself: We’ll show you how in a minute.

What just happened?

Here’s a really brief summary of what happened under the covers:

  • Tableau is connecting to an Iceberg-based data lake using Starburst
  • Iceberg has the ability to time travel within your data, seeing what it looks like at any point in time
  • We implemented a little custom SQL magic in Tableau to decorate base queries with additional, parameterized Iceberg syntax to do our time travelling

The details

The workbook in question has three data sources: 

The “Star Schema by Tableau” data source is just a garden variety data source that connects to the lake and joins tables the way you always do. Nothing to see here, frankly. The information returned by this data source always reflects the most current data as far as Iceberg is concerned.

 

The data source “Custom SQL for Time Travel” is where things start to get fun. Its a query that returns the output of “Star Schema by Tableau“, but does so with Custom SQL. Unfortunately, there is no way to append my own SQL to the data source I created originally, so I had to create a new data source to get to where I wanted to go.

And because we’re forcing Tableau to execute the entire, reasonably complex query “as-is and no matter what,” you’re going to lose some of the performance-enhancing benefits Tableau gives you, like Join Culling. As a result, you shouldn’t use this data source as the primary mechanism of querying the lake. Use it for trouble-shooing and investigation only. 

Here’s where the magic happens:

FOR TIMESTAMP AS OF (date_add('hour', <Parameters.Hours in the past>, current_timestamp)) 

FOR TIMESTAMP AS OF allows us to see data from a table on Iceberg as it existed at a specific point in time. If I wanted make this a bit less fancy, I might get into a query editor and execute something like: 

SELECT 1, 2, 3 FROM "glue"."wwreach_analytics"."factdailyresponse" FOR TIMESTAMP AS OF TIMESTAMP '2023-05-12 08:44:08.502 America/New_York' LIMIT 10

To make this work in Tableau, we’re leveraging date_add() and a parameter from Tableau in order to subtract X hours from the current date/time.

But wait, there’s more…

View Table History” connects to the same lake, and uses Custom SQL to return interesing internal Iceberg information about the various tables we care about. We use a simple UNION ALL to get information about each of the tables we care about:

SELECT 'factdailyresponse' as tablename, * FROM "glue"."wwreach_analytics"."factdailyresponse$history" UNION ALL SELECT 'dimad' as tablename, * FROM "glue"."wwreach_analytics"."dimad$history" UNION ALL SELECT 'dimcampaign' as tablename, * FROM "glue"."wwreach_analytics"."dimcampaign$history" UNION ALL . . .

The two bits of interest in this query are: 

  • I manually added the table name to display in the tablename column to make it easier to understand what I’m looking at
  • We use the Iceberg $history notation at the end of each table to tell Iceberg we’d like to see the snapshot history of each table 

Using this query, I can create a worksheet that shows me everything I need to know about the changes to my data. Here’s an example of the output before any changes were made: 

…and then the idiot from aol.com came along and made a change: 

* I apologize if you have AOL account. I’m sure you are very smart, and many other people with AOL accounts are also amazing people. 

Note a new snapshot has been generated for the dimclient table, telling me some sort of write occurred.

Here comes the next change, this time to the fact table: 

With this information, I could have just as easily (and more efficiently, frankly) identified where my potential problems were. So if you’re willing to abandon the amaze-and-impress-your-friends “Time Travel Chart,” you’ll probably get your work done faster using this technique. 

Let’s Roll Back Time

Did I mention you can roll back time, too? I didn’t? Whoops! Shame on me. You can. 

Find and use the snapshot _id value which represents the point in time that you’d like to roll your data back to. In this case, it’s 2434539040574761686 for dimclient and 2179438286382406895 for factdailyresponse. 

Then, you can call a procedure to roll back changes: 

CALL glue.system.rollback_to_snapshot('wwreach_analytics', 'factdailyresponse', 2179438286382406895); CALL glue.system.rollback_to_snapshot('wwreach_analytics', 'dimclient', 2434539040574761686);

Here’s the original “real time” dashboard being refreshed after I execute the functions above. Note how the data “reverts” to my original values:

And of course, we can do the reverse. This time, I’m using Tableau’s Initial SQL dialog to execute the two functions (one at a time – an odd quirk of Initial SQL and Trino) to roll the data forward in time. I’m going to “break” it again.

(in some browsers, you’ll need to press play)

What’s next? Create your own Iceberg data lake using Starburst

So now you know about time travel and how it can impact your Tableau experience. Want to know more? 

Want the workbook this blog is based on? Here you are. Originally published on Russell Christopher.

Try Starburst Galaxy

The analytics platform for your data.

Start free