LLMs in Data Analytics: An Arbitrage Opportunity
What do the best data analytics teams look like in 5 years?
Prediction markets are often hailed for their accuracy, stemming from the amalgamation of diverse opinions that filter out noise to reveal the signal[1]. In data analytics, imagine replicating this effect. Traditionally, redundancy in analysis is frowned upon in organizations due to perceived resource wastage. However, what if you could simulate having multiple analysts or teams working on the same problem without the associated costs?
LLMs provide this opportunity, allowing for varied analytical approaches to coalesce into a comprehensive understanding.
Case Study: Analyzing the Holiday Dataset
Over the past few weeks, I’ve participated in R4DS’s weekly #tidytuesday hackathon. The goal is to take the dataset(s) provided and create some kind of insight or visualization, then post it on social media. To ensure it wasn’t distracting me from growing Storied, I time-boxed my efforts to 1 hour. Last week[2], the dataset captured characteristics of holiday movies listed on IMDb dating all the way back to 1929. There were over 4,000 movies in the database.
What do you expect I could even do in just 1 hour? You’ll be surprised.
By employing ChatGPT, I was able to quickly dissect the data through multiple lenses, uncovering significant inflection points in trends.
I knew I wanted to analyze the movies over time. I wanted to see if I could find inflection points through history based on the characteristics of these movies. It only took a few minutes to create both a k-means clustering analysis and a univariate changepoint analysis. Each analysis hinted at slightly different inflection points, but there was overlap. The general timing was circa 1965, 1988, and 2010.
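A minimal sketch of the two approaches, using synthetic stand-in data (the real #tidytuesday holiday-movie columns and values differ): a hand-rolled k-means over yearly feature vectors, plus a single mean-shift changepoint search on one feature, both in plain NumPy so no clustering or changepoint library is required.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical yearly summary of the holiday-movie data (1929 onward):
# one feature vector per year, e.g. [movie count, mean runtime, mean rating].
# The actual #tidytuesday column names and values differ.
years = np.arange(1929, 2024)
X = np.column_stack([
    np.linspace(5, 120, years.size) + rng.normal(0, 5, years.size),  # count
    np.linspace(70, 95, years.size) + rng.normal(0, 3, years.size),  # runtime
    6 + rng.normal(0, 0.4, years.size),                              # rating
])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature

def kmeans(X, k=3, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, recompute."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # keep the old centroid if a cluster goes empty
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels

def one_changepoint(series):
    """Single mean-shift changepoint: the split that minimizes
    total within-segment variance."""
    costs = [series[:t].var() * t + series[t:].var() * (len(series) - t)
             for t in range(2, len(series) - 2)]
    return 2 + int(np.argmin(costs))

labels = kmeans(X, k=3)
cp = one_changepoint(X[:, 0])  # changepoint in the (standardized) count
print("cluster of first year:", labels[0])
print("detected changepoint year:", years[cp])
```

In practice I'd reach for sklearn's `KMeans` and a changepoint package rather than hand-rolling either, but the logic above is all that's happening under the hood.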
Does anyone know why these time periods might have been inflection points for the movie industry?
I didn’t. But, I had a partner with some ideas. The ones that impressed me most were:
1965? The 1960s marked significant technological changes in filmmaking. The introduction of new camera and sound technology, along with color film becoming more common, dramatically changed how movies were made and experienced. Also, the growing popularity of television during this era posed significant competition to movies, leading filmmakers to explore more novel and boundary-pushing content to attract audiences back to theaters.
1988? The late 1980s saw significant advancements in special effects technology, particularly with the introduction of computer-generated imagery (CGI). Movies like "Who Framed Roger Rabbit" (1988) and "The Abyss" (1989) showcased groundbreaking uses of CGI.
2010? Social media. The rise of social media platforms has had a profound impact on how movies are marketed and reviewed. It has also provided a platform for more diverse voices in cinema and influenced the types of stories that gain popularity.
It’s incredible that I was able to do all that in 1 hour. And by using multiple methods, I got an ensemble effect similar to multiple individuals working on the same problem. If I had only done the clustering analysis, I would have missed 1965…
Getting to Know Each Other
Adapting to LLMs in data analytics can be likened to a cyclist mastering a new mountain bike. Initially, there's a period of adjustment. But once the rider and bike are in sync, the potential for speed and agility is unparalleled. Similarly, as data analysts become attuned to the capabilities of their LLMs, their analytical prowess accelerates dramatically.
You can call it prompt engineering, I suppose. It feels bigger than that though.
Interactive Data Visualization: The Story of Life Expectancy
Hans Rosling famously used dynamic data visualizations to narrate compelling stories. In a similar vein, I utilized ChatGPT (in a previous #tidytuesday session) to create an interactive visualization about life expectancy trends. The LLM's assistance in sifting through data and scripting the narrative allowed for a rich, engaging presentation (at least the potential start of one) that would have been labor-intensive otherwise.
I’ll reiterate here that perhaps the most amazing part of this is being able to quickly formulate hypotheses on what the data is showing us. I have a hunch that organizations with great historical documentation are going to thrive because analysts can now learn about the business at a rapid rate.
SQL Analysis: Leveraging LLMs for Enhanced Thinking
I’ve long been ashamed of my SQL acumen. But not ashamed enough to divert my attention to making it better. Instead, I focused on statistical methods, critical thinking practice, and learning about the domains I was supporting (supply chain, transportation, communication, etc.). By accident, it may have worked strongly in my favor.
LLMs are really good at writing SQL. But they are only helpful if you know what you are looking for and why. In the past decade, I’ve focused on the latter.
At Wayfair (circa 2015), I remember constructing a complex SQL query for a critical report. It was a meticulous, albeit unrefined, process. I knew exactly what the data meant and what we needed to understand, but it took a lot of iterations on the SQL writing to get there. That gap is now closing for me thanks to AI. I know that lots of SQL experts (not all, of course) won’t love this change, because they’ve invested the other way around.
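To make the division of labor concrete, here is a toy illustration with a hypothetical schema (the actual Wayfair tables and report are not public). The hard part is stating the question precisely, e.g. "monthly revenue and distinct buyers"; the SQL itself is the part an LLM now drafts in seconds.

```python
import sqlite3

# Hypothetical orders table -- invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                     order_date TEXT, total REAL);
INSERT INTO orders VALUES
  (1, 10, '2015-01-05', 120.0),
  (2, 10, '2015-02-11',  80.0),
  (3, 11, '2015-01-20',  45.0),
  (4, 12, '2015-03-02', 200.0);
""")

# The kind of query an LLM can write once the question is well posed:
# revenue and distinct buyers, rolled up by month.
query = """
SELECT strftime('%m', order_date)  AS month,
       SUM(total)                  AS revenue,
       COUNT(DISTINCT customer_id) AS buyers
FROM orders
GROUP BY month
ORDER BY month;
"""
rows = list(conn.execute(query))
for row in rows:
    print(row)  # e.g. ('01', 165.0, 2)
```

Knowing that revenue should be summed while buyers must be counted distinctly, and why the business cares, is the judgment the analyst still supplies.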
The Current Arbitrage Opportunity with LLMs
Reflecting on the early 2000s when I was in college, the emergence of the Internet created an arbitrage opportunity in academic paper writing. I always felt that writing papers during this period was relatively easy for my cohort. It wasn’t because we had the Internet. The real arbitrage happened because the professors grading the papers still had the old-school vision of paper writing in their mental model bank (not all of them, of course, but many). That meant with relatively less effort, students - like me - could get the A. Data analysts using LLMs today have this same arbitrage experience[3]. Use it, and use it wisely.
Are You Using LLMs for Data Analytics?
The best data leaders today are those who recognize and embrace the transformative potential of LLMs. No, try again[4]. I think the best data leaders today are those who recognize and embrace the transformative potential of LLMs.
I have a limited perspective here and would love your help in broadening it. If you are a data analyst using LLMs to make you better, or a data leader enabling your teams with this power, please let me buy you a coffee. I’d love to hear about it.
[1] Not always, of course, but prediction markets have a rich history of being more right than individuals.
[2] If you saw my original post on LinkedIn, I used 4 clusters, but here I updated it to 3 to match the number of changepoints found.
[3] I am slightly concerned about “Effort Justification Bias.” This bias is a mental shortcut that leads us to value an outcome more highly if we put a lot of effort into achieving it, regardless of the actual value of the outcome. It's like thinking, "I spent so much time on this, it must be good!" Being a more effective data analyst might backfire. Stakeholders’ minds might discount the insights you create if you can generate them much faster than their historical experience predicts you should be able to…
[4] In middle school, I had a German teacher who would lose his mind if you made a subjective statement without first saying “I think.”