How can anyone possibly use Word2Vec on a single parameter of financial data, such as opening and close prices of stocks, to make any sort of meaningful prediction? Are we simply going to ignore the heap of fundamental data behind every stock, such as P/E ratios, return on assets, etc…?
I’ve pondered these questions for a while, and I’ve finally found an answer: Word2Vec doesn’t know anything about the fundamental data behind words (grammar, syntax, or even definitions of individual words), yet it still does remarkably well at matching similar words, as we have seen.
Going off of this observation, is it safe to say that Word2Vec can match stock price shifts of similar significance without knowing anything about the nature of these stocks? My intuition is yes, because of how well Word2Vec does for mapping similar words/texts, but also because of how Word2Vec models are designed on a technical level. There are two variations of Word2Vec, continuous bag of words (CBOW) and skip-gram models, with skip-gram models being better-suited for larger datasets (meaning I’ll probably have to use skip-grams). However, both models work in similar ways: You read through a corpus of text, generate vectors for each unique word within the text, and then create new vector embeddings (what we’re actually looking for) based on the context these words appear in.
So, how are these initial vectors created? It’s quite simple, actually. For the first word in the corpus, you start by creating a one-element array, which looks like: . If the word is ‘The’, then the word ‘the’ corresponds to the first element in the array. For the next unique word you encounter, you create a new array: [0, 1], but you must also add a zero to the end of the first array as well: [1,0]. If the second word is ‘mighty’, then the second element in the array corresponds to mighty. Whenever an array’s second element == 1, then you know the array corresponds to the word ‘mighty’.
You go on like this until you’ve accounted for every unique word. You will end up with an array whose size == the number of unique words, and each word is represented by a huge array full of zeroes with a single 1 at the location which corresponds to that word. For instance, if our data set had 20 unique words, the array for the word ‘mighty’ would be [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
One observation you can make right off the bat is that the more unique words you have in your dataset, the longer these individual arrays will be, and more computational energy (i.e: time, money) will be required. So, how do we reduce the size of these vectors?
Recall that Word2Vec neural networks have three layers, with one hidden layer. The first layer, the input, are the arrays we generated above. The second layer, the hidden, is where array compression happens. This array compression works because the hidden layer is simply a matrix of weights (a matrix of dimensions: Number of unique words X desired vector size, between 50 and 300 dimensions). We multiply the original array for a word by the learned weight matrix for that word, and voila, we have a vector (still an array) of between 50 and 300 dimensions (elements).
How these weights get adjusted is a story for another time, but in general, they work by calculating probabilities of target words appearing in a certain context:
You use nearby words to generate training samples, and you adjust weights for the target word (blue) by seeing how often it appears, throughout the rest of the dataset, in proximity to those samples (how likely ‘quick’ is to appear next to ‘fox’ would be used to adjust weights for ‘quick’).
By the end of training, we are not even interested in the output layer of the neural network, as all we need are the vectors created by multiplying the input layer by the weight matrix. These embedded* vectors can then be used to find words (or events in the stock market) of similar significance.
* Embedded is just another way of saying compressed, so embeddings are essentially compressed vectors.
What Does a Falling Tree Teach Us?
So why I am bringing technical definitions into this? To show that Word2Vec learning is entirely based on context. To me, the stock market is a contextual thing, where future activity is predicated on what has happened before (cause/effect relationships). Many would call me wrong to say this, but no one is ever completely right about the markets, so it is worth taking a look at (Bridgewater Associates should be on my side for this one). Therefore, Word2Vec-generated embeddings do not need to include any fundamental background data about stocks, because we don’t care about these fundamentals when looking at context. If a tree falls over in a forest and hits a nearby house, we don’t care about the exact sedimentary composition beneath the tree’s roots at n seconds before the tree fell, we care that we have learned a new thing: Trees falling can lead to houses being destroyed, and we can use this knowledge to make predictions about what will happen when more trees fall in the future. Sure, the sedimentary composition might have caused the tree to fall, but we don’t know for certain, and if it did, it is reflected through the fact that the tree fell, and it is therefore irrelevant.
This reminds me of a point brought up by HFT trader Tom Conerly when I spoke with him, which is that high frequency traders don’t look at fundamentals of stocks, because they believe all fundamentals are proportionally reflected through market price. This saves space, time, and energy, as they can avoid processing mountains of other data which they would otherwise have to look at.
So, now that I’ve decided to ignore fundamental data when creating my neural network, the next big question I need to answer before actually uploading data is which data I want to train my net on. I’m still thinking of starting with basic open/close prices, and making further judgments based on how well this performs.
Areas of Interest, More Potential Datasets
Another interesting financial correlation I’ve read about, and which can be potentially learned through Word2Vec-type A.I, is the relationship between Federal Reserve Funds Rate (interest prices) and general stock market behavior. As interest rates go up, the stock market tightens and starts to go down.
This is super interesting and incredibly useful information. Looking at the graph, you notice there is a delay period between a shift in FFR and subsequent changes in the market. You can definitely use Word2Vec, or some other neural network, to approximate how long this delay period is, and then use that information to make predictions about shifts in the market caused by changes in FFR.
A second interesting correlation I’ve looked at is the relationship between U.S GDP growth and stock market performance. In essence, the idea is that when GDP is projected to grow, there will be a bull market with inflated prices. The opposite is true as well, with poor GDP outlook reflected through bear markets. So, the goal is to learn what indicators anticipate a growing GDP. There’s a really good post on Medium about this, which claims that the best indicator for GDP performance is ISM’s monthly index score. Again, you could train a Word2Vec model to predict GDP growth, and use that information to make stock price projections.
In addition to doing research on advances in A.I for finance, I am going to spend a lot of time researching known correlation relationships in the market, and seeing if these can be applied as datasets for my neural network. Speaking of personal research, I just finished The Black Swan by Nassim Taleb, and will be writing a reflection post soon which should capture the main points I learned from Taleb.