This year, I have attempted to delve into the world of quantitative finance through independent research and coding, as well as speaking with people actually working in this industry. If you have followed my blog, you would know that I’ve been posting a lot this year, but the great majority of these posts have been speculative or analytical pieces. There’s certainly nothing wrong with this, but I feel like I haven’t given enough attention to my actual code. If we were on Wall Street right now, the code would be all that really matters, so I feel like I should exercise some due diligence by showing you all what my trading algorithm has been up to.
Let’s begin with some background: The whole concept behind my project this year has been to use natural language processing (NLP) machine learning algorithms to analyze historical stock market prices, and to eventually find some new market patterns through this analysis. Specifically, I’ve been using Word2Vec, an ML model which works by reading through a corpus of text, creating vectors for each unique word, and adjusting these vectors by observing which words appear in which contexts. These vectors are referred to as embeddings, and by the end of Word2Vec training, words which appear in similar contexts will have embeddings which appear close to one another in a vector space.
On the left, you will see an example of a Word2Vec model which was trained on Estonian and English. Since the word “cat” will likely appear in the same textual contexts across different languages (a cat in Estonia is no different from a cat in California, it’s just a different word), the word embeddings for “cat” and “kass” appear very close to each other in the vector space. On the right, you will see a 2-D representation of word embeddings, where each dot represents a different word embedding. As you can see, words with similar definitions like “country” and “union” appear close to one another. What’s interesting is that Word2Vec was able to group words which are similar not by definition, but by assumption and inference. For instance, you can see that the word “island” is right next to the word “sea”, and “together” is next to “energy”. How can a computer know that island is related to sea? This is the type of inference you would expect from a small child, not a computer! That’s the truly amazing part about NLP machine learning, and this is a large part of the reason why I chose to focus on this project. If Word2Vec can infer which words are similar just from a simple skip-gram model, I imagine that it might be able to infer something from reading through stock prices.
But how could natural language processing work for stock prices? These are numbers, not words.
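For readers who haven’t seen skip-gram before, here is a minimal sketch of the idea of “learning from context”. It only shows how (target, context) training pairs are generated from a sentence, not the neural network that learns from them. The function name `skipgram_pairs` and the `window` parameter are my own illustrative choices, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours.

    Illustrative helper only: real Word2Vec implementations also do
    subsampling, negative sampling, and the actual vector updates.
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=1)
for target, context in pairs[:3]:
    print(target, "->", context)
```

Words that keep showing up with the same context words end up with similar training pairs, which is why their embeddings drift toward each other.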
You are right to ask, and this was one of the biggest challenges I faced in my project. I tried to come up with clever ways to convert price shifts to words as input vectors, but in the end, I realized that I was overthinking.
I decided to treat a series of historical prices (e.g., daily AAPL closing prices) as my text corpus. So, rather than having Word2Vec read through a Wikipedia page, it would read through a CSV file of closing prices for a particular stock or ETF. Each closing price (like ‘$124.2’, ‘$0.25’, ‘$6.73’) would be a word in this corpus, and Word2Vec would create embeddings for each unique closing price much like how it would make embeddings for each unique word in a piece of text.
But to me, a closing price isn’t significant enough. A closing price is not an action, but a result of a day of trading. So, I decided to use daily price shifts as my metric (price shift = closing price – opening price).
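As a rough sketch of what “prices as a corpus” can look like, the snippet below turns rows of open/close prices into price-shift “words”. The column names, the sample rows, the percentage form, and the one-decimal rounding are all assumptions made for illustration (rounding keeps the vocabulary small enough that the same “word” appears many times).

```python
import csv
import io

# Made-up sample rows; the real data would come from a CSV file on disk.
SAMPLE_CSV = """date,open,close
2017-01-03,100.00,100.40
2017-01-04,100.40,100.10
2017-01-05,100.10,100.10
"""


def shift_tokens(csv_text, decimals=1):
    """Convert each trading day into a token such as '+0.4%'."""
    tokens = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        open_price = float(row["open"])
        close_price = float(row["close"])
        pct = (close_price - open_price) / open_price * 100
        # Adding 0.0 turns a rounded -0.0 into 0.0 so we never emit '-0.0%'.
        pct = round(pct, decimals) + 0.0
        tokens.append(f"{pct:+.{decimals}f}%")
    return tokens


tokens = shift_tokens(SAMPLE_CSV)
print(tokens)
```

The resulting list of tokens plays the role of one long “sentence” that a Word2Vec implementation can be trained on.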
Take a look at the chart above. These are historical prices for the Dow Jones dating back 100 years. If you look at the chart, nearly all of the sharpest price drops are followed by sudden growth. This is purely psychological, as people rush to buy when prices reach record lows. Yet, if Word2Vec could detect this pattern, it would mean that the neural network has made an inference about stock behavior which integrates psychology, chart analysis, and cycle interpretation! And all of this is done indirectly, simply because we are using neural networks which learn from context. The whole point of Word2Vec is to understand words based on their contexts. If it is true that the stock market behaves in a contextual manner, meaning future prices are influenced by recent price changes, then Word2Vec is an ideal ML model for quant finance.
So, we have trained Word2Vec on stock prices to create embeddings for unique price shifts. There’s now a file full of embeddings for each price shift, and we can query through this file to find similar embeddings. This is no different from the example above where we compared the embeddings of “cat” and “kass”, only now we are trying to find contextually similar price shifts in the markets. This is an incredibly powerful capability, and this is the basis of the entire Word2Vec trading algorithm.
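To make “querying for similar embeddings” concrete, here is a small sketch that ranks embeddings by cosine similarity, which is a common way to measure closeness between vectors. The three-dimensional vectors below are made up for the example (real embeddings have many more dimensions), and `top_k` is a hypothetical helper name, not the function from my actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=3):
    """Return the k (token, similarity) pairs closest to the query vector."""
    scored = [(token, cosine_similarity(query, vec))
              for token, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented purely for illustration.
embeddings = {
    "+0.4%": [0.9, 0.1, 0.0],
    "+0.5%": [0.8, 0.2, 0.1],
    "-1.5%": [-0.7, 0.6, 0.2],
}

matches = top_k(embeddings["+0.4%"], embeddings, k=2)
for token, score in matches:
    print(token, round(score, 3))
```

Note that a token always matches itself with a similarity of about 1.0, so the interesting results are the ones that come after it.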
A traditional trading algorithm takes real-time market prices as inputs and runs these inputs through some algorithm. My Word2Vec trading algorithm is no different, as it follows a general schema to return an output of whether or not to invest. As with all algorithms, there are a lot of steps in between input and output. Here are the steps my Word2Vec trading algorithm takes:
The first 3.5 steps should be clear by now, or else I’ve done a poor job explaining (or you’ve done a poor job reading). The fourth and fifth steps are crucial as well, and these involve what we do when we find an embedding which is similar to our query input.
When you query an input, this input will likely be a sequence of stock price shifts over a certain period of time. Then, the algorithm will return which trading day(s) in history match this input query most closely. If this search yields a very closely matching embedding (similarity of >95%) with a promising expected outcome, then that might seem great at first. However, each search doesn’t only yield the single closest-matching embedding; it also returns a list of the top 5 or 10 or 15 closest-matching embeddings (we call this a topK list). This is pretty much the same system that Google uses when you search a query.
A Google search for “Damian Lillard” yields a series of matches (links) ordered by relevance. Note that Google’s algorithm determines which matches are most relevant, much like how our Word2Vec network determines which embeddings are closest.
In our algorithm, querying “+0.4%” yields the above list. Each item is a different historical embedding within our data set. The first match has an expected outcome of +0.8%, which means that based off this first result, the algorithm thinks our query will be followed by a gain of +0.8% (which would be great!). However, the second match has an expected outcome of only +0.2%, while the third has an expected outcome of -1.5%. After noticing this inconsistency in the topK list, you would be less confident about making an investment than you were after seeing the first match.
With this in mind, the algorithm needs to be able to read through a topK list and gauge its volatility and consistency. If a topK list shows volatile expected outcomes which vary greatly, then the confidence in the investment goes down. My approach to this problem was rather simple, as I run a series of calculations to find a confidence factor. So, as soon as our algorithm matches the query to a series of historical embeddings, it reads through the series of matches (the topK list) and calculates a confidence factor. If the confidence factor is high and the expected outcome is in our favor, then the algorithm proceeds to suggest an investment.
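I haven’t spelled out my exact calculations here, so treat the following as one possible way to score a topK list rather than the formula my code uses. It is a minimal sketch that assumes the confidence factor shrinks as the spread (standard deviation) of the expected outcomes grows. The `1 / (1 + spread)` form and the `min_confidence` threshold are arbitrary illustrative choices.

```python
import statistics


def summarize_top_k(expected_outcomes):
    """Return (mean expected outcome, confidence in the range (0, 1])."""
    mean_outcome = statistics.mean(expected_outcomes)
    spread = statistics.pstdev(expected_outcomes)
    # Assumed scoring rule: identical outcomes give 1.0, wide spreads approach 0.
    confidence = 1.0 / (1.0 + spread)
    return mean_outcome, confidence


def should_invest(expected_outcomes, min_confidence=0.6):
    """Suggest investing only if the outlook is positive and consistent."""
    mean_outcome, confidence = summarize_top_k(expected_outcomes)
    return mean_outcome > 0 and confidence >= min_confidence


mixed = [0.8, 0.2, -1.5]       # the inconsistent topK list described above (in %)
consistent = [0.8, 0.7, 0.9]   # a made-up list whose matches agree

print(should_invest(mixed))
print(should_invest(consistent))
```

With this rule, the mixed list is rejected both because its average outcome is negative and because its matches disagree, while the consistent list passes.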
This algorithm currently works, and I’ve trained it on multiple stocks and ETFs. Here are some sample results from querying various price movements in AAPL stock:
As you can see, my code returns an embedding which matches the query. In these cases, it appears that the matching embedding is always a price shift of the same magnitude (querying +0.3% will most likely return the embedding of a price shift of +0.3%). This is happening because I am only querying a single day rather than a sequence of days (I explain this better later on in the post). In addition to finding a match, my code uses topK list analysis + other predictive factors to return a prediction for what will happen to the stock’s price over the next 3 days.
It would be nice to see what happens when you train on stocks other than AAPL (or on ETFs or entire sectors or even the entire market), so that is precisely what I am working on right now as I’m making the final adjustments to my code.
These are preliminary results for several reasons.
First, the queries are single-day price shifts instead of sequences of price shifts. Ideally, such a program could analyze week-long, month-long, even yearlong patterns in stock prices and return similar patterns in stock price behavior from the past. For some odd reason, my program cannot make a query_embedding for arrays with multiple elements (i.e., a sequence). Right now, I’m working to solve this problem with Universal Sentence Encoder so that I can match longer sequences of price changes. Once this is done, my algorithm can be used to figure out what stage of a growth cycle we are currently in, to deduce whether or not we are approaching a relative maximum/minimum for stock prices, and to interpret how a stock will react to a particularly good/bad week of trading. It’s nice to see that the program works with single-day price shifts right now, but this is ultimately meaningless. At the same time though, I know that I am just one step away from a working market analysis tool.
Second, I clearly need a nicer interface. I’m thinking of creating a user interface for this tool where users can select a stock/ETF, choose a time window as a query, and have my code return a visual representation of its prediction (through something like a mock price graph or a table of values). Also, once I have a nice way of packaging these predictions, I can implement some of the portfolio management functions that I learned through Quantopian. This would allow for automated investing and portfolio optimization, which is what a trading algorithm would do in an ideal setting.
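For comparison, a much simpler baseline for embedding a sequence is to average the embeddings of its individual price shifts. To be clear, this is not the Universal Sentence Encoder approach I’m pursuing, and averaging throws away the order of the days, so it is only a sketch of the general idea of turning many vectors into one. The helper name `mean_embedding` and the toy vectors are assumptions for illustration.

```python
def mean_embedding(tokens, embeddings):
    """Average the vectors of all tokens that have a known embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("none of the query tokens have an embedding")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings, invented for the example.
embeddings = {"+0.4%": [1.0, 0.0], "-0.3%": [0.0, 1.0]}

# '+9.9%' has no embedding here, so it is skipped.
query_embedding = mean_embedding(["+0.4%", "-0.3%", "+9.9%"], embeddings)
print(query_embedding)
```

The averaged vector can then be compared against historical windows that were averaged the same way.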
Third, my preliminary results raise some interesting questions which should be addressed. For instance, why does my algorithm return embeddings so far in the past? In each of the four queries, the top result is an embedding from pre-2000, when the data I am using spans from 1980 to 2017. Is my algorithm ignoring more recent embeddings because it stops searching after finding a match early on in the data set, or do older price shifts truly match more closely to our queries?
Fourth, I need to backtest this algorithm to see how it stacks up against established trading algorithms. Once I have backtest results like loss % and Sharpe ratios, I can market my product more easily and I can also fine-tune it to achieve better and better results.
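As a sketch of what those backtest metrics involve, the snippet below computes an annualized Sharpe ratio and a maximum drawdown from a list of daily returns. The 252 trading days per year and the zero risk-free rate are conventional assumptions, maximum drawdown is just one way to interpret “loss %”, and the sample returns are made up rather than taken from my algorithm.

```python
import math
import statistics


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its volatility."""
    excess = [r - risk_free_daily for r in daily_returns]
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss, as a fraction of the peak."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# Made-up daily returns (0.01 means +1%).
returns = [0.01, -0.02, 0.015, 0.005]
print(round(sharpe_ratio(returns), 2))
print(round(max_drawdown(returns), 4))
```

A higher Sharpe ratio means more return per unit of volatility, and a smaller drawdown means shallower losses along the way.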
This is where my algorithm stands right now, and every day I am getting closer to a working product which can give users meaningful insights into market patterns. I will continue to post updates as I continue with my work, and I’m still open to advice, suggestions, and criticism as I do so (so feel free to reach out if you want to get involved!).
These slides were taken from my final presentation to my compsci research class. You can view the presentation in its entirety below: