Financial storytelling using time-series classification

Let’s teach a machine to describe how markets moved.

When it comes to narratives, financial markets are particularly appealing because they only ever move on one axis over time. A person can be hungry yet happy, but in finance things go up or down; get stronger or weaker; expand or contract.

As humans, when we look at a chart of market movements, we can easily arrive at an essential story of what markets did over a period:

But markets make up for this simplicity with noise. For our above example, three of the images below would describe a similar scenario, with the fourth describing a different one:

To us, all this is clear. But could you teach a computer to turn these inputs into appropriate narratives?

One way to do it is to decide, for a given time series of financial performance, which of a given set of narratives best describes it.

Notice that, though this is a natural language generation (NLG) problem, the actual text the machine spits out is not entirely generated by a machine; much of it is pre-written by a human.

Some NLG practitioners now consider this approach rather unfashionable. This is partly because committing to a fixed set of narratives feels like a lose-lose: either the text won’t be rich enough, or there’ll be too many narratives to be worth the effort. (If you had to write 1000 different narratives, did you actually save any bother?)

We flatter ourselves. We may believe humans crave infinite narrative diversity, but even in fiction, we actually gravitate towards very few types of plots (seven, according to Booker; six, according to Vonnegut).

In finance, the simplest narrative forms take account only of the starting point and the end point, discarding everything in between. For example: “Gold prices rose over the month”. Let’s call this a 2-point narrative.

This form of narration is okay, but when discussing markets much depends on the time frame and context. For stocks, say, you might want to go into finer detail over the ups and downs over a year; over an afternoon, not so much. Below are examples of slightly richer narratives:

So now, have a look at these well-written summary bullet points, published by asset manager Schroders:

Notice that the first and third bullets are 3-point narratives, and the second and fourth are 2-point narratives.

We don’t mind the narrative simplicity because the latter parts of the sentence enrich the text. The first part is the narrative of WHAT happened, and the second part — very loosely — talks about WHY.

Getting back to our own problem, we’ll limit ourselves to fewer than 20 narrative types, covering 2- and 3-point narratives. Writing the text components won’t take long, and we can easily offer alternative phrasings for the same scenario to mix things up.
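To make the "pre-written by a human" part concrete, here is a minimal sketch of how narrative types and their alternative phrasings could be stored and rendered. The narrative names, the sentences and the render function are all invented for illustration; they are not the actual texts used in this project.

```python
import random

# Hypothetical narrative types and phrasings, invented for illustration only.
NARRATIVES = {
    "up": [
        "{asset} rose over the {period}.",
        "{asset} gained ground over the {period}.",
    ],
    "down_then_up": [
        "{asset} fell at first, but recovered to end the {period} higher.",
        "After an early dip, {asset} bounced back and finished the {period} up.",
    ],
}


def render(narrative_type, asset, period, rng=random):
    """Pick one pre-written phrasing for the narrative type and fill it in."""
    template = rng.choice(NARRATIVES[narrative_type])
    return template.format(asset=asset, period=period)


print(render("down_then_up", "Gold prices", "month"))
```

The classifier only has to output the key (say, "down_then_up"); picking a phrasing at random is what mixes things up.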

Our Task

Clearly, this is a classification problem, since we ask: given a new series, which of our selected narrative types does it most resemble? Therefore, we’ll be building some form of classifier.

But what shall we train it with?

We haven’t got any real-world labelled data (meaning, a ton of past performance curves, each labelled with the most suitable narrative type). Obtaining such data would be time-consuming and laborious. So instead, we’ll manufacture labelled examples that’ll look pretty close to the real thing.

For the remainder of this article, here’s the plan:

1. We’ll generate the basis for our data: lots of samples per narrative.

2. We’ll add noise to each data sample, so it looks the way financial markets behave.

3. We’ll use the now-noisy data to train a classifier.

4. We’ll test it on both manufactured and real data and see what we get.

1. Generating narrative samples [This is the easy part]

Say that over a month, an index starts at 100, goes down, and then goes up again to end higher than where it started. By now we know that this is a 3-point narrative.

But there are all sorts of things that can vary, while still keeping with the story: WHEN did the index reach its low point? HOW LOW was this low-point? And what does it mean to PROPERLY RISE?

Our first point is always fixed: at time 0, we normalize the value to 100. But the rest can jiggle around, and the above questions illustrate parameters that take us from the general description of a narrative to a specific instance.

Therefore, every narrative type has several parameters that can vary over a range. To generate lots and lots of samples, we randomly pick values out of these ranges. For the down-ending-up narrative, say, we may end up with a bunch of samples that look like this:

But these look nothing like how financial markets move in real life. In real life markets are volatile, so they bounce around. What we need to do is take our neat narrative samples, and add some noise.
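The sampling step described in this section can be sketched in a few lines. This is only an illustration for the down-ending-up narrative: the function names, the 21-step length and every parameter range below are my own assumptions, not the values used in the project.

```python
import random


def interpolate(anchors):
    """Join (time, value) anchors with straight lines, one value per time step."""
    path = []
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        for t in range(t0, t1):
            path.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    path.append(anchors[-1][1])
    return path


def sample_down_ending_up(rng, n_steps=21):
    """One noise-free instance: start at 100, dip, then end above 100."""
    t_low = rng.randint(3, n_steps - 4)    # WHEN the low point occurs (assumed range)
    low = 100.0 - rng.uniform(2.0, 10.0)   # HOW LOW the low point is (assumed range)
    end = 100.0 + rng.uniform(1.0, 8.0)    # what counts as a proper rise (assumed range)
    anchors = [(0, 100.0), (t_low, low), (n_steps - 1, end)]
    return interpolate(anchors), anchors


rng = random.Random(0)
samples = [sample_down_ending_up(rng)[0] for _ in range(1000)]
```

Each narrative type would get its own sampler of this kind, and the narrative's name becomes the label for every sample it produces.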

2. Adding Noise [This is the statistics part]

My problem was that I needed to add just enough noise.

If I added too little noise, the lines would look unnaturally straight, not at all the way real markets behave. But if I added too much, or didn’t control the randomness, then my gorgeous narrative points might drown in the noise.

Personally, I didn’t know the answer to this one.

So I turned to my Swiss-Army-knife of a husband, who suggested I use something called a Brownian Bridge.

This will be a pain-free explanation of what it is and why it works. For this purpose, let’s say you are seriously drunk.

The pub just closed, your home is a mile away, and you need to get there by sunrise. However hard you try to walk straight, you wobble. When you try to move, sometimes you take a step forward, sometimes a step back. Being this drunk, you have zero memory of the steps you’ve previously taken. 
 
After an hour, have you made any progress? Well, you’ve traced a random path, so we can’t say for sure. But because of the way you’ve traced that random path, we know that your distance from your starting point is distributed like a bell-curve, with a variance that depends on the size of your step and the number of steps you’ve taken.

Another hour passes. And since you’ve gotten no soberer, your distance from your starting point is again distributed like a bell curve, but the variance is now twice as big. And so it goes on.

My friend, you are experiencing a Brownian motion.
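If you'd rather check the bell-curve claim than take it on trust, here is a small simulation of many drunks, assuming steps of size 1 that go forward or back with equal chance. The step counts (100 steps for one hour, 200 for two) and the number of walkers are arbitrary choices for illustration.

```python
import random
import statistics


def final_positions(n_steps, n_walkers, rng):
    """Distance from the pub after n_steps memoryless +1/-1 steps, per walker."""
    return [
        sum(rng.choice((-1, 1)) for _ in range(n_steps))
        for _ in range(n_walkers)
    ]


rng = random.Random(0)
var_1h = statistics.pvariance(final_positions(100, 2000, rng))  # after "one hour"
var_2h = statistics.pvariance(final_positions(200, 2000, rng))  # after "two hours"

# The ratio should come out close to 2: twice the steps, twice the variance.
print(var_1h, var_2h, var_2h / var_1h)
```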

A Brownian Bridge is a Brownian motion that forces an end point: It looks only at the paths traced by a Brownian motion that end at a specific point (say, your home) at a specific time (sunrise). So again, it’s a statistical process — we can’t know for sure where you are at any given point in time between the start and the finish. But it’s a process that is conditional on being of a certain value at a certain time.

In between the start and finish, you’re still walking like a total drunk. But because your future has been foretold (you will end up at home at sunrise), you’re not quite as free to roam as you were before. You’re freest around the halfway mark, and your freedom shrinks the closer you get to either pinned end: the pub you left, or the home you’re headed for.
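For those who like to see it written down, this is the standard textbook form of a Brownian bridge. The symbols are my own notation rather than anything defined earlier: the walk starts at value a at time 0, is pinned to value b at time T, W_t is a standard Brownian motion, and sigma sets the size of the wobble.

```latex
\[
\begin{aligned}
X_t &= a + \frac{t}{T}\,(b - a) + \sigma\left(W_t - \frac{t}{T}\,W_T\right),
  \qquad 0 \le t \le T, \\
\operatorname{Var}(X_t) &= \sigma^2\,\frac{t\,(T - t)}{T}.
\end{aligned}
\]
```

The variance is zero at t = 0 and t = T, and peaks at t = T/2 with the value sigma^2 T / 4, which is the "freest in the middle" behaviour described above.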

OK, enough with the drunk metaphor, back to our case.

We want zero noise in our fixed narrative points so that they don’t move. In between, we want it to get noisy, so that our stock market movements jump around. A Brownian bridge does exactly that:
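As a minimal sketch of this idea, the code below draws a separate bridge for each stretch between two consecutive narrative points and adds it to the straight line joining them. Applying one bridge per segment, the sigma value and the example anchors are all assumptions made for illustration, not necessarily the exact recipe used here.

```python
import random


def brownian_bridge(n_steps, sigma, rng):
    """Noise path of length n_steps + 1 that starts and ends at exactly 0."""
    walk = [0.0]
    for _ in range(n_steps):
        walk.append(walk[-1] + rng.gauss(0.0, sigma))
    # Subtract the straight line from 0 to the walk's end point,
    # which pins both ends of the noise to 0.
    return [w - (i / n_steps) * walk[-1] for i, w in enumerate(walk)]


def add_bridge_noise(anchors, sigma, rng):
    """Join (time, value) anchors with straight lines plus one bridge per segment."""
    path = []
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        n = t1 - t0
        noise = brownian_bridge(n, sigma, rng)
        for i in range(n):  # the segment's last point opens the next segment
            path.append(v0 + (v1 - v0) * i / n + noise[i])
    path.append(anchors[-1][1])
    return path


rng = random.Random(0)
anchors = [(0, 100.0), (8, 94.0), (20, 103.0)]  # hypothetical down-ending-up instance
noisy = add_bridge_noise(anchors, sigma=0.8, rng=rng)
```

The series passes through 100, 94 and 103 at times 0, 8 and 20 no matter what the random draws are; only the stretches in between wobble, and sigma controls by how much.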

3. Classification [This is the machine learning part]

Because our data is a time-series, we squarely reside in the world of time-series classification.

But a big world it is. In order to choose a classification method, we’ll use a machete to carve off its irrelevant parts.

First off, we chop out methods for predictive purposes (for example, guessing whether stocks will go up or down in future based on past data). That’s not our task.

Then, there’s a whole class of algorithms that tries to learn explicit features for the curves; a 2016 review of this school of thought is available in [1].

Explicit features are useful to people who don’t assume much about their data, and want to identify key attributes so they could, for example, generate similar samples, or describe which qualities “make” something representative of its kind. Again, also not us:

First, we already assume quite a lot about our data. Plus, in both samples and real data, we consider most of the curve to be just random noise.

Moving on to neural-network approaches. Here, we classify, but can’t describe fully what differentiates one class from another. Most conveniently, an article published this March reviews the different approaches to deep learning for time series classification [2].

It has a handy chart that helps us chop further: as before, we’re not interested in a solution that learns how to create new samples, and we’re not interested in feature engineering:

That leaves us with End-to-End methods. Also, the name carries the promise of making life easy, which I find highly appealing.

The results of the above paper suggest that, for the purposes of our one-dimensional problem (= everything moves on a single axis, which is time), the end-to-end approach that tends to perform best on most datasets is ResNet.

With ResNet being a state-of-the-art image classifier, this hardly comes as a shock. (Here’s FastAi explaining it, and here’s the PyTorch source code. I ended up getting sufficiently good results with ResNet-34.)

4. Results [This is the fun part]

I’m going to almost make it sound as if everything instantly worked. Because in the real world, all anyone will ever care about is: DOES IT WORK?
 
And at first, of course it didn’t. But for very interesting reasons.

a. On fabricated data

When I looked at the initial outcomes, the success rate was not great. But it soon became clear that, for many curves, though the classification differed from what was intended, its interpretation was just as valid:

Once I had removed the noise and re-trained a model, sure, the success rate surged; but this data no longer looked the way markets actually move. This meant that, once real noise was added, the distinction between certain narrative types tended to blur.

I found this rather beautiful: all too often the classifier wasn’t wrong per se; it was picking up the salient meaning I had hoped for, just not in a way I could regard as trustworthy. But this did raise the question: what exactly should I aim to improve?

One thing that stood out almost immediately was that, with just the time series as input, the model struggled to infer relationships between the beginning of the series and the end of it. That meant losing important information whenever the end of a narrative was defined in relation to the starting point.

This was both unsurprising, given how ResNet works, and easily addressed, by changing the input to a doubled-up series:

Further testing revealed that one of the toughest things for it to learn was when to ignore noise and see when something stayed “about the same”. Thus, the way to address performance was not to fiddle with the network, but look more carefully at the parameters used to define narratives. 
 
I was never going to be completely rid of the ambiguity arising from being able to tell more than one story about a curve. Thus, I was aiming to get it to do a pretty good job on fabricated data, but not shoot for near 100% classification accuracy. Ultimately, it reached well over 85% accuracy on noisy test data, and the mistakes were likely to involve closely related narratives.

b. On real past data

With real data, some normalization was in order, to account for how stocks in Japan, say, may behave differently to oil prices. For example, some markets have inherently more variance than others. These adjustments were done using historical data, which I got from Investing.com.
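To give a flavour of what such an adjustment can look like, here is a minimal sketch that estimates a market's typical daily volatility from its history and rescales a series so that this volatility maps onto a common reference level. This is one plausible approach rather than the exact adjustment used here; the function names and the 1% reference volatility are assumptions.

```python
import statistics


def daily_returns(prices):
    """Simple day-on-day returns of a price series."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def rescale_to_reference(prices, historical_prices, reference_vol=0.01):
    """Rebuild the series from 100 with returns scaled to a common volatility.

    The market's historical daily volatility is mapped onto reference_vol
    (an assumed target), so calm and jumpy markets become comparable.
    """
    market_vol = statistics.pstdev(daily_returns(historical_prices))
    scale = reference_vol / market_vol
    out = [100.0]
    for r in daily_returns(prices):
        out.append(out[-1] * (1.0 + r * scale))
    return out


history = [100.0, 102.0, 100.0, 102.0, 100.0]  # made-up prices for illustration
normalized = rescale_to_reference(history, history)
```

The same volatility estimate could instead be used the other way round, to set the amount of noise added to the fabricated training samples for a given market.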

But even when attempting to be rigorous, such tweaking is not without problems. For example, stocks in developed markets have been supported by a never-ending, über-lax monetary policy, which led to record-low volatility for long stretches. Meaning that, even when the classifier’s noise parameters were adjusted based on “real market data”, they were still skewed towards a particular economic environment.

All the same, the result works well.

Naturally, I would’ve liked to pin a number to this statement, but this gets us right back to where we started: the oh-so-common problem of acquiring labelled data.

So instead, I’ll end with a little sample of outputs for real market data from recent months: