The Story Behind WSJ’s New Data Pipeline for Audience Analytics

Louise Story
WSJ Digital Experience & Strategy
7 min read · Apr 6, 2021


How we moved from batch, asynchronous processing to a real-time, full-circle process

By Dion Bailey and Louise Story

[Illustration of blue fibre cables with ones and zeros passing through. Photo: Getty Images]

It was election night, audiences were flocking to our platforms, and, beyond all the great journalism going on, we had a big technology breakthrough.

We were collecting on the order of a million data points every few minutes and processing them all in real time. It was a remarkable feat compared with what we had experienced just months earlier, during the coronavirus traffic spikes, before our internal systems were built.

Behind that breakthrough was our new data pipeline, which collects audience data from our platforms and sends it to our central collection pool.

As one of our tech leaders put it, it was “a good David versus Goliath comparison.”

This post is about our new data pipeline: why it stands out in helping us answer questions about our audience and our content in real time, and how we went about handling large amounts of audience data, even during traffic spikes. (For anyone who doesn’t know it: capturing and processing large amounts of data when use of a product is really high is something many companies struggle with.)

“WSJ’s new data pipeline is a game changer,” said Ross Fadely, WSJ’s chief data scientist who has been an important leader on this project. “By giving us the ability to measure what we want and collect data how we want it, the pipeline now enables us to unlock the power of data science and give our newsroom insights in real-time. These capabilities have set WSJ up for long-term success, allowing us to adapt with changing audiences at the pace of news.”

Stepping back: for years, websites and mobile apps at many media companies have relied on outside third parties to gather data on the actions of people visiting their sites. These third parties put their JavaScript code, so-called “pixels” or software development kits, on websites and mobile apps. These pixels drop cookies every time people come to those platforms. Cookies are small files that stay on people’s computers or phones and allow websites to recognize returning visitors when they come back. That way, sites can see what visitors do over time and across multiple visits on their platforms. You’ve probably noticed that cookies are back in the news, as Google has provided more details on its plans to reduce the use of third-party cookies.

Data on audience actions matters in most newsrooms today: it lets newsrooms figure out what content and experiences their audiences are most interested in. But newsrooms have not always been data-driven. At the start of digital news, data was in many places seen as a corporate thing, to be kept away from the people assigning stories. With those roots, many news companies did not prioritize the engineering needed to fully understand their audiences. They relied, instead, on outside vendors who could put pixels on their sites.

In the summer of 2019, as we began overhauling our approach to audience data, one thing we found quickly was that the metrics available from outside vendors were inflexible. We wanted to really understand how our users behaved on our platforms: what is the level of engagement, the scroll depth, the attention to a particular area on the page, and so on. But outside vendors may or may not record those things the way we think they should be recorded. One common issue was dwell time, how long someone stays on a page. It often gets recorded only when someone takes another action on the page, so dwell time can miss the time spent by people who left the website right after reading that article.
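To make that dwell-time gap concrete, here is a minimal sketch in Python, with made-up timestamps and event names rather than our production code, comparing dwell time measured from a reader’s last action with dwell time measured from periodic heartbeat pings:

```python
from datetime import datetime, timedelta

# Hypothetical event records for one page view: a page load, periodic
# "heartbeat" pings while the page is open, and (maybe) a final click.
events = [
    {"type": "page_view", "ts": datetime(2021, 4, 6, 9, 0, 0)},
    {"type": "heartbeat", "ts": datetime(2021, 4, 6, 9, 0, 10)},
    {"type": "heartbeat", "ts": datetime(2021, 4, 6, 9, 0, 20)},
    {"type": "heartbeat", "ts": datetime(2021, 4, 6, 9, 0, 30)},
    # No further click: the reader finished the article and closed the tab.
]

def dwell_from_last_action(events):
    """Vendor-style dwell time: counted only up to the last action (a click).
    With no follow-up action, the visit can register as zero time."""
    clicks = [e["ts"] for e in events if e["type"] == "click"]
    if not clicks:
        return timedelta(0)
    return max(clicks) - events[0]["ts"]

def dwell_from_heartbeats(events):
    """First-party approach: periodic pings while the page is open let us
    credit time spent even when the reader leaves without another action."""
    return max(e["ts"] for e in events) - events[0]["ts"]

print(dwell_from_last_action(events))   # 0:00:00  -- the reading time is lost
print(dwell_from_heartbeats(events))    # 0:00:30  -- closer to the real dwell time
```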

To provide the most useful experience to our audiences, there are many questions we are trying to answer. What impacts whether people choose to consume our content? Is a story ranking well in search? How much does its homepage placement affect its traffic? How does story length affect read time? And so on.

As we have built up a world-class data science team, we have refined our questions more and more. And we have found more and more shortcomings in outside solutions.

And then there was the cookies issue. We knew back in early 2019 that some large technology companies were taking a harsher view of cookies, and we hear more on that every day from Google and Apple. Our products must function well in the ecosystems built by the large tech companies: their browsers, hardware, app store policies and search algorithms. So when those companies move to limit cookies, we have to pay attention, because our products live within that technology landscape and we want them to function well on, for example, Google’s Chrome browser (which has a 63% global market share). In addition, pixels weigh our products down and slow the speed of our pages. Readers don’t like waiting for stories to load.

So we got building.

Our goal? To bring smart data collection practices into our product development process in the service of learning more about what our visitors find most useful when they come to our news experience. And, just as important, to make that audience information useful to our content teams in real-time.

How did we do this? First, we looked at the battle-tested services and libraries already out there, maintained by large companies or open-source communities. We don’t ever build for building’s sake, so we always look for existing solutions we can incorporate.

Central to what we built was Amazon Kinesis, a data-streaming service that serves as the base at many companies ingesting large amounts of data. We built the complementary parts of our data pipeline around it, and we used a third-party open-source collector called Snowplow. We customized Snowplow with additional JavaScript to target the actions we wanted and control how they stream back to the platform. Once we have collected all the scrolls and mouse clicks we want over a fixed interval of time, we send them back to be cleansed, formatted and processed into the model we use for our metrics.
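As a rough illustration of the “collect over an interval, then stream” idea, here is a sketch of sending one batch of events to a Kinesis stream with the AWS SDK for Python. The stream name, event fields and partitioning choice are assumptions made for the example; our actual collection runs through Snowplow and custom JavaScript.

```python
import json
import boto3  # AWS SDK for Python

kinesis = boto3.client("kinesis", region_name="us-east-1")

def flush_batch(events, stream_name="audience-events"):
    """Send one interval's worth of scrolls and clicks to the stream.

    Kinesis accepts up to 500 records per PutRecords call, so a large
    interval is split into chunks.
    """
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Partition by anonymous visitor id so one visitor's events
            # land on the same shard, preserving per-visitor ordering.
            "PartitionKey": event["visitor_id"],
        }
        for event in events
    ]
    for i in range(0, len(records), 500):
        kinesis.put_records(StreamName=stream_name, Records=records[i : i + 500])

# Example: a tiny batch collected over one interval.
flush_batch([
    {"visitor_id": "abc123", "type": "scroll", "depth": 0.6, "page": "/articles/example"},
    {"visitor_id": "abc123", "type": "click", "target": "related-link"},
])
```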

This may sound simple, but we have a heavily trafficked website, so the volume of data is large. How large? In big traffic periods, we approach a million actions collected every few minutes. Not all systems can handle that.

Making sure we could handle huge data spikes coming into our pipeline was key to what we decided to build. But we wanted to take that a step further. We didn’t want to just gather the data; we wanted to clean it, sort it, and run it through our data science models in real time, so we could deliver metrics back to the newsroom in real time. Our teams took on this challenge, building a system that moves the event stream through AWS with a combination of Lambda functions (small, single-purpose programs) that take bad events out of the mix and put them into S3 buckets, which store data. In partnership with the Dow Jones data team, we worked out a system where we take the good data that’s left, put it into a map-reduce cluster and run it through our data science models. For much of this we’ve used Spark, an open-source framework that processes large amounts of data in parallel across different machines.
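To give a sense of what one of those small, single-purpose Lambda programs might look like, here is a simplified sketch that separates well-formed events from malformed ones and quarantines the bad ones in S3. The bucket name and required fields are hypothetical, and the real pipeline does more than this.

```python
import base64
import json
import boto3

s3 = boto3.client("s3")
QUARANTINE_BUCKET = "example-bad-events"                 # hypothetical bucket name
REQUIRED_FIELDS = {"visitor_id", "type", "page"}         # hypothetical event schema

def handler(event, context):
    """Triggered by Kinesis: separate well-formed events from bad ones.

    Bad events (unparseable JSON or missing fields) are written to S3 for
    later inspection; good events continue down the pipeline.
    """
    good, bad = [], []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys():
                good.append(parsed)
            else:
                bad.append(raw.decode("utf-8", errors="replace"))
        except json.JSONDecodeError:
            bad.append(raw.decode("utf-8", errors="replace"))

    if bad:
        key = f"quarantine/{context.aws_request_id}.jsonl"
        s3.put_object(Bucket=QUARANTINE_BUCKET, Key=key,
                      Body="\n".join(bad).encode("utf-8"))

    # In the real pipeline the good events continue toward the map-reduce
    # cluster and the data science models; here we just report what we kept.
    return {"good": len(good), "bad": len(bad)}
```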

The long and short of all this is that instead of having batches of new data come in and take 48 to 72 hours to sort before analysis, we can now get the data in, cleaned up and run through our data science models within five minutes. We have taken what were batches of data running asynchronously and turned them into a real-time, full-circle process.
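For a flavor of what that aggregation layer can look like, here is a minimal Spark Structured Streaming sketch that counts page views per article in five-minute windows. The event schema, the S3 path and the console output are stand-ins for this example, not a description of our production jobs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("audience-metrics-sketch").getOrCreate()

# Hypothetical schema for the cleaned events coming out of the pipeline.
schema = StructType([
    StructField("visitor_id", StringType()),
    StructField("type", StringType()),
    StructField("page", StringType()),
    StructField("ts", TimestampType()),
])

# Read cleaned events as a stream (here from files in S3; Kinesis and Kafka
# connectors are also common choices).
events = spark.readStream.schema(schema).json("s3a://example-clean-events/")

# Page views per article in five-minute windows: the kind of near-real-time
# metric a pipeline like this can hand back to the newsroom.
views = (
    events.filter(F.col("type") == "page_view")
    .withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "5 minutes"), F.col("page"))
    .count()
)

query = views.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```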

Along the way, our data engineers creating our new data pipeline have been working closely with our data science team. After all, collecting the wrong data or collecting it at the wrong intervals might get in the way of the audience insights we are trying to gain.

Why does our new data pipeline matter? Because it allows our editors to react to audience actions in real time and make great decisions about how to serve audiences. And those decisions are informed by our own data science models.

And all of this is done without third-party cookies. That’s increasingly important given the tides in the technology industry.

Dion Bailey is the WSJ’s VP, Head of Technology and Architecture. Louise Story is the WSJ’s Chief Product and Technology Officer and its Chief News Strategist.

Thanks to Ross Fadely, the WSJ’s Chief Data Scientist, for his assistance on this post and on this project. Special thanks to the team members and colleagues who have worked on this project: Tess Jeffers, Rye Zupancis, Ming-Yuan Lu, Jeff Parkinson, Abraham Alcantara, Alberto Leal, Estefania Jacobo, Florencia Silva, Cesar Manrique, Gerson Acuña Flores, Edwin Moedano, Oscar Sosa, Hermes Espinola, Eduardo Avendaño, Hrusikesh Panda, Zach Taher, Tania Feliz, Andrea Kebalo, Edgar Magaña Mercado, Guthrie Collin.
