Data Lifecycles
I have worked on large-scale data systems since 2008.
In the early days of "data science", it became common practice to store everything forever[1]. The idea was that with enough data points, we could make a complicated enough model to get the result we wanted.
Scale of Data
It is hard to fathom some of the quantities of data that are generated each day of our lives.
In the US, Axon (the company most famous for developing Tasers) produces body cameras for police forces and also hosts the resulting footage for police departments.
The Seattle Police Department has about 900 current police officers. Axon recommends that its cameras be set to record at 720p High. That means each uncompressed frame holds 1280 * 720 * 3 bytes of data, and this format typically stores either 30 or 60 frames per second. So each hour of uncompressed footage requires roughly 300 to 600 gigabytes of storage. Multiply that by the 900 officers and their daily shifts and you have a tremendous amount of data being generated in just one city.
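The back-of-the-envelope arithmetic is easy to check in a few lines (this treats every pixel as 3 bytes of raw RGB and ignores any compression the camera might apply):

```python
# Rough storage math for one hour of uncompressed 720p footage.
width, height = 1280, 720
bytes_per_pixel = 3                        # 8-bit RGB
frame = width * height * bytes_per_pixel   # ~2.76 MB per frame

for fps in (30, 60):
    per_hour = frame * fps * 3600          # frames in an hour of footage
    print(f"{fps} fps: {per_hour / 1e9:.0f} GB per hour")
# 30 fps: 299 GB per hour
# 60 fps: 597 GB per hour
```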
Is it useful?
Who among us hasn't said something we regret in the heat of the moment?
Maybe that can be juicy tabloid fodder. But is it useful? And by "useful", I mean, is it predictive?
If a piece of data does not help predictions, then by definition, it is noise.
So how often is data purged in most systems? And what data should we expect to be purged? What kind of mechanisms are fair to people? Most drivers in the United States are familiar with the window of time during which insurers can factor traffic violations or accidents into their pricing.
My brother used to joke about accidentally liking a song from an artist and that ruining his music recommendations for months.
Maybe that was an acceptable user experience a few years ago, but it won't cut it today.
Digital Twins
The concept of a "digital twin" refers to a digital representation of an object or process that is supposed to capture all the important data and dyanmics of the original. During my time in academia, I focused my research on dimension reduction, which describes techniques to determine the structures and most important data of a dataset. Despite a ball being made of a huge amount of atoms, we can accurately model how long a ball will take to fall from a certain height with just one variable - the height. Certainly other variables (like the size of the ball, if it is solid, wind, etc) can effect the time, but we can likely get quite close with just one variable.
So how many data points do you need to predict what the user will click next?
How many data points do you need to predict how a user will spend their money?
Predicting People
I have worked with huge text data sets in a few domains. I have personally witnessed how predictable customer conversations can be. One of my favorite questions to ask programming candidates is to describe and/or implement an autocomplete for search. With a good dataset and some helpful data structures, one can accomplish quite a bit with just a small amount of code. We can all see how good our phones are at predicting the next word we want to write once we start typing a sentence. Modern techniques built on Large Language Models (LLMs) can take us much further.
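To give a flavor of what I mean by "a small amount of code", here is a minimal autocomplete sketch. The query log is hypothetical, and a sorted list plus binary search stands in for the "helpful data structure" (a trie would work just as well):

```python
import bisect
from collections import Counter

# Hypothetical log of past search queries; in practice this comes from real traffic.
query_log = ["seattle weather", "seattle police", "seahawks score",
             "seattle weather", "settlers of catan"]

counts = Counter(query_log)
queries = sorted(counts)  # sorted list of unique queries doubles as a prefix index

def autocomplete(prefix, k=3):
    """Return up to k past queries starting with prefix, most frequent first."""
    lo = bisect.bisect_left(queries, prefix)
    hi = bisect.bisect_right(queries, prefix + "\uffff")  # end of the prefix range
    return sorted(queries[lo:hi], key=lambda q: -counts[q])[:k]

print(autocomplete("sea"))  # ['seattle weather', 'seahawks score', 'seattle police']
```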
Anyone can now use ChatGPT or similar tools to generate content as if it were written by someone else. Some techniques generate candidate words and then check whether they match the most likely word or pattern of the author being emulated.
So if we can simulate the text output from someone, how much of the person can we simulate?
An adult may consider many dimensions when making decisions. How many dimensions do younger people work with? Now that even 1-year-olds are regularly using tablets, what do those companies observe in that data?
And if this data is retained indefinitely, what will the data profile be like for people who have spent nearly their entire lives on tablets? And who will have access to that data? What kind of uses of that data are acceptable or necessary[2]?
Programming People
How much data do you need to show a user in order to change their actions? How many dimensions do we consider when we make decisions and which of those are most important? When I am deciding where to take a vacation, the fact that I studied math may factor very little into that decision. However, factors like budget and how much time I have available are likely to have a much bigger impact.
When most people are shown content depicting injustice, they are often filled with a desire to see it addressed.
Companies that distribute content can use our generally predictable responses to stimuli to impart to us different thoughts. Essentially, they may be programming us.
Facebook famously experimented with their users' emotions in 2014. Cambridge Analytica took this a few steps further in the following two years.
So where are we 6 years later? Have you felt programmed recently?
I will never forget the panic I witnessed when I was deployed to a client and had to explain to the C-levels that the previous team had only stored data for the past year. ↩︎
Perhaps identifying prodigies and getting them proper support is one such case. Most people support efforts of intervention when children face danger. ↩︎