How Word2Vec Accounts for Fundamental Stock Data, and Other Cool Things

How can anyone possibly use Word2Vec on a single parameter of financial data, such as the opening and closing prices of stocks, to make any sort of meaningful prediction? Are we simply going to ignore the heap of fundamental data behind every stock, such as P/E ratios, return on assets, etc.?
I’ve pondered these questions for a while, and I’ve finally found an answer: Word2Vec doesn’t know anything about the fundamental data behind words (grammar, syntax, or even definitions of individual words), yet it still does remarkably well at matching similar words, as we have seen.
Going off of this observation, is it safe to say that Word2Vec can match stock price shifts of similar significance without knowing anything about the nature of these stocks? My intuition is yes, because of how well Word2Vec does at mapping similar words/texts, but also because of how Word2Vec models are designed on a technical level. There are two variations of Word2Vec, continuous bag of words (CBOW) and skip-gram, with skip-gram being better suited to larger datasets (meaning I’ll probably have to use skip-grams). However, both models work in similar ways: you read through a corpus of text, generate a vector for each unique word within the text, and then create new vector embeddings (what we’re actually looking for) based on the contexts these words appear in.
So, how are these initial vectors created? It’s quite simple, actually. For the first word in the corpus, you start by creating a one-element array, which looks like: [1]. If the word is ‘the’, then ‘the’ corresponds to the first element in the array. For the next unique word you encounter, you create a new array: [0, 1], but you must also add a zero to the end of the first array: [1, 0]. If the second word is ‘mighty’, then the second element in the array corresponds to ‘mighty’. Whenever an array’s second element == 1, you know the array corresponds to the word ‘mighty’.
You go on like this until you’ve accounted for every unique word. Every array ends up with a size == the number of unique words, and each word is represented by a huge array full of zeroes with a single 1 at the location which corresponds to that word. For instance, if our data set had 20 unique words, the array for the word ‘mighty’ would be [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
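The array-building steps above can be sketched in a few lines of Python. This is only a minimal illustration, not part of any Word2Vec library: the helper names build_vocab and one_hot and the toy corpus are my own assumptions. Rather than padding earlier arrays as new words show up, it collects the vocabulary first and then builds each array at full length, which gives the same result.

```python
def build_vocab(tokens):
    """Map each unique token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a list of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# Toy corpus (hypothetical) with five unique words.
corpus = "the mighty fox and the mighty dog".split()
vocab = build_vocab(corpus)

print(vocab)                     # {'the': 0, 'mighty': 1, 'fox': 2, 'and': 3, 'dog': 4}
print(one_hot("mighty", vocab))  # [0, 1, 0, 0, 0]
```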
One observation you can make right off the bat is that the more unique words you have in your dataset, the longer these individual arrays will be, and the more computational energy (i.e., time and money) will be required. So, how do we reduce the size of these vectors?
[Image: a neural network with one hidden layer]
Recall that Word2Vec neural networks have three layers, one of which is hidden. The first layer, the input, consists of the arrays we generated above. The second layer, the hidden layer, is where array compression happens. This compression works because the hidden layer is simply a matrix of weights, with dimensions (number of unique words) x (desired vector size, typically between 50 and 300). We multiply the original array for a word by this learned weight matrix, which picks out the single row of weights belonging to that word, and voila, we have a vector (still an array) of between 50 and 300 dimensions (elements).
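To make the multiplication concrete, here is a small sketch in plain Python. The sizes (5 words, 3 dimensions) and the random weights are assumptions chosen for readability, and the function name embed is hypothetical; in a real model the weights would come out of training.

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

vocab_size = 5      # number of unique words (toy value)
embedding_dim = 3   # real models typically use 50 to 300

# Hidden-layer weights: one row per word, one column per embedding dimension.
# Random values stand in for weights that training would normally learn.
W = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
     for _ in range(vocab_size)]


def embed(one_hot_vec, weights):
    """Multiply a one-hot row vector by the weight matrix."""
    dim = len(weights[0])
    return [sum(one_hot_vec[i] * weights[i][j] for i in range(len(weights)))
            for j in range(dim)]


mighty = [0, 1, 0, 0, 0]          # one-hot array for the word at index 1
vector = embed(mighty, W)

print(len(vector))       # 3
print(vector == W[1])    # True: the product is just row 1 of W
```

Because every element of the input except one is zero, the multiplication never needs to be carried out in full; looking up the row directly gives the same vector.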
Hidden Layer Weight Matrix
How these weights get adjusted is a story for another time, but in general, they work by calculating probabilities of target words appearing in a certain context:
Training Data
You use nearby words to generate training samples, and you adjust weights for the target word (blue) by seeing how often it appears, throughout the rest of the dataset, in proximity to those samples (how likely ‘quick’ is to appear next to ‘fox’ would be used to adjust weights for ‘quick’).
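As a rough sketch of how those training samples are produced, the snippet below slides a window over a sentence and pairs each target word with its neighbors. The function name skipgram_pairs and the window size of 2 are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=2)

print(len(pairs))   # 14
print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```

Each pair becomes one training sample: the network is shown the target word and asked to assign a high probability to the context word.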
By the end of training, we are not even interested in the output layer of the neural network, as all we need are the vectors created by multiplying the input layer by the weight matrix. These embedded* vectors can then be used to find words (or events in the stock market) of similar significance.
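One common way to compare embedded vectors is cosine similarity, sketched below. The three-dimensional vectors are made-up values rather than the output of a trained model, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings):
    """Rank every other key by cosine similarity to `query`, best first."""
    scores = [(other, cosine_similarity(embeddings[query], vec))
              for other, vec in embeddings.items() if other != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, hand-picked so 'quick' and 'fast' point the same way.
embeddings = {
    "quick": [0.9, 0.1, 0.0],
    "fast":  [0.8, 0.2, 0.1],
    "tree":  [0.0, 0.1, 0.9],
}

print(most_similar("quick", embeddings)[0][0])  # fast
```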
* Embedded is just another way of saying compressed, so embeddings are essentially compressed vectors.

What Does a Falling Tree Teach Us?

So why am I bringing technical definitions into this? To show that Word2Vec learning is entirely based on context. To me, the stock market is a contextual thing, where future activity is predicated on what has happened before (cause/effect relationships). Many would call me wrong to say this, but no one is ever completely right about the markets, so it is worth taking a look at (Bridgewater Associates should be on my side for this one). Therefore, Word2Vec-generated embeddings do not need to include any fundamental background data about stocks, because we don’t care about these fundamentals when looking at context. If a tree falls over in a forest and hits a nearby house, we don’t care about the exact sedimentary composition beneath the tree’s roots at n seconds before the tree fell. We care that we have learned a new thing: trees falling can lead to houses being destroyed, and we can use this knowledge to make predictions about what will happen when more trees fall in the future. Sure, the sedimentary composition might have caused the tree to fall, but we don’t know for certain, and if it did, that is already reflected in the fact that the tree fell, so it is irrelevant.
This reminds me of a point brought up by high-frequency trader Tom Conerly when I spoke with him, which is that high-frequency traders don’t look at the fundamentals of stocks, because they believe all fundamentals are proportionally reflected in the market price. This saves space, time, and energy, as they can avoid processing mountains of other data which they would otherwise have to look at.
So, now that I’ve decided to ignore fundamental data when creating my neural network, the next big question I need to answer before actually uploading data is which data I want to train my net on. I’m still thinking of starting with basic open/close prices, and making further judgments based on how well this performs.
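Word2Vec needs a sequence of discrete tokens, so open/close prices would have to be turned into ‘words’ first. The sketch below shows one possible way to do that, which is purely my assumption and not a settled design: label each day UP, DOWN, or FLAT according to its open-to-close percentage change. The function name price_tokens, the 1% threshold, and the prices are all hypothetical.

```python
def price_tokens(opens, closes, threshold=0.01):
    """Label each day by its open-to-close move relative to `threshold`."""
    tokens = []
    for open_price, close_price in zip(opens, closes):
        change = (close_price - open_price) / open_price
        if change > threshold:
            tokens.append("UP")
        elif change < -threshold:
            tokens.append("DOWN")
        else:
            tokens.append("FLAT")
    return tokens


# Made-up prices for four trading days.
opens = [100.0, 102.0, 101.0, 99.0]
closes = [102.5, 101.8, 98.0, 99.2]

print(price_tokens(opens, closes))  # ['UP', 'FLAT', 'DOWN', 'FLAT']
```

A finer-grained vocabulary (more buckets, or tokens that combine several days) would give the model more to work with, at the cost of longer input arrays.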

Areas of Interest, More Potential Datasets

Another interesting financial correlation I’ve read about, and which can potentially be learned through Word2Vec-type A.I, is the relationship between the Federal Funds Rate (interest rates) and general stock market behavior. As interest rates go up, the stock market tends to tighten and start to go down.
This is super interesting and incredibly useful information. Looking at the graph, you notice there is a delay period between a shift in FFR and subsequent changes in the market. You can definitely use Word2Vec, or some other neural network, to approximate how long this delay period is, and then use that information to make predictions about shifts in the market caused by changes in FFR.
A second interesting correlation I’ve looked at is the relationship between U.S. GDP growth and stock market performance. In essence, the idea is that when GDP is projected to grow, there will be a bull market with inflated prices. The opposite is true as well, with a poor GDP outlook reflected in bear markets. So, the goal is to learn which indicators anticipate a growing GDP. There’s a really good post on Medium about this, which claims that the best indicator for GDP performance is ISM’s monthly index score. Again, you could train a Word2Vec model to predict GDP growth, and use that information to make stock price projections.
In addition to doing research on advances in A.I for finance, I am going to spend a lot of time researching known correlations in the market, and seeing if these can be applied as datasets for my neural network. Speaking of personal research, I just finished The Black Swan by Nassim Taleb, and will be writing a reflection post soon which should capture the main points I learned from Taleb.

Most of the visuals and definitions I used for explaining Word2Vec skip-grams came from Chris McCormick’s tutorial, which I highly suggest reading.

The Shortcomings of Neural Networks for Trading Predictions

As someone who is devoting a large portion of their senior year (and very likely time beyond that) to researching potential applications of deep learning in trading, I wasn’t thrilled to learn about the recent shortcomings of quantitative traders. Let’s begin with Marcos López de Prado, a frequently cited algorithmic trader who recently published Advances in Financial Machine Learning. One thing that López de Prado talks about is the idea of ‘red-herring patterns’ that are extrapolated by machine learning algorithms. These types of algorithms are, by design, created to analyze large bodies of data and identify patterns within this data. In fact, this idea of noticing patterns is one of the main assumptions I am basing my work on (using Word2Vec embeddings to identify past financial patterns and apply them to real-time data for more accurate predictions). But what happens when these algorithms identify patterns that aren’t real? An aggressive neural network (in my case, one which adjusts vector weights heavily while learning from data) is prone to making these types of mistakes. Think of this example: a stock happens to go up a couple of percentage points every Thursday for three weeks in a row. A (poorly written) neural network would deduce that every Thursday in the future, this stock would go up by at least a percentage point or two. Now, this is easily avoidable by training a trading algorithm on larger sets of data, but even large data sets are prone to these types of red herrings. Once a trading algorithm latches onto a pattern, it could backfire horribly when that pattern eventually breaks.
This brings the idea of Black Swans to light. The theory of Black Swans was popularized by Nassim Taleb in his accurately titled book The Black Swan: The Impact of the Highly Improbable. The general gist of this theory is that the most profoundly impactful events are oftentimes the ones we least expect, due to our fallacious tendencies in analyzing statistics (I will go into more detail on these topics and more in a future blog post, once I am done reading the whole book). Taleb argues that one of our biggest shortcomings in analyzing data is creating ‘false narratives’, which are more convenient and easier to sell to clients. These false narratives oftentimes omit crucial data (silent data), which backfires once the narrative breaks.
But, on the other end, a more passive neural network (one which adjusts vector weights only slightly) can sometimes come to no meaningful conclusions, which means wasted time and computational energy. I want to create a Word2Vec model which can detect patterns, but I also don’t want it to actively follow patterns with no longevity.
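One way to picture the aggressive/passive trade-off is through the learning rate, meaning the size of each weight adjustment. The sketch below is not a Word2Vec model; it is a single weight chasing a made-up signal, under my assumption that ‘aggressive’ corresponds to a large learning rate and ‘passive’ to a small one. The function name track and the values 0.9 and 0.05 are hypothetical.

```python
def track(observations, learning_rate, start=0.0):
    """Nudge one weight toward each observation in turn.

    Each step is w += learning_rate * (x - w), which is one gradient
    step on the squared error 0.5 * (x - w) ** 2.
    """
    w = start
    for x in observations:
        w += learning_rate * (x - w)
    return w


# Hypothetical signal: 1.0 means "the stock rose this Thursday".
three_thursdays = [1.0, 1.0, 1.0]

aggressive = track(three_thursdays, learning_rate=0.9)
passive = track(three_thursdays, learning_rate=0.05)

print(round(aggressive, 3))  # 0.999
print(round(passive, 3))     # 0.143
```

After only three observations the aggressive weight is almost fully committed to the Thursday pattern, while the passive one has barely moved. Neither setting is right on its own, which is exactly the dilemma described above.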
So, what does one do? How aggressive/passive should I make my Word2Vec neural network? 

Another idea which I encountered over the weekend is survivorship bias. In training neural networks, how do we treat data from companies which have failed? If we are analyzing the stock price data for various important stocks over time, what do we do with data from once-important stocks which are now defunct, such as Lehman Brothers? I initially thought it would be best to throw this data out, since it is no longer applicable, but it turns out this strategy can have negative consequences. If we only train our network on stocks which have survived, then we will miss out on crucial data about what happens when companies go bankrupt. So, how do we properly treat this type of data?
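A toy calculation shows how much dropping failed companies can distort the picture. The tickers and returns below are invented for illustration only, with the defunct stock assumed to have lost its entire value.

```python
def average_return(returns):
    """Plain arithmetic mean of a list of returns."""
    return sum(returns) / len(returns)


# Invented yearly returns; -1.00 means the stock lost all of its value.
survivors = {"AAA": 0.10, "BBB": 0.08, "CCC": 0.12}
defunct = {"DDD": -1.00}

survivors_only = average_return(list(survivors.values()))
full_universe = average_return(list(survivors.values()) + list(defunct.values()))

print(f"{survivors_only:.1%}")  # 10.0%
print(f"{full_universe:.1%}")   # -17.5%
```

A model trained only on the first group would see a market that never punishes anyone, which is the opposite of the lesson a stock like Lehman Brothers has to teach.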


All of these seemingly insignificant flaws in trading algorithms can lead to catastrophic mistakes. This concept is summed up by quantitative investment officer Nigol Koulajian: “You can have one little pindrop that can basically make you lose over 20 years of returns.” This ‘little pindrop’ which Koulajian mentions is the eventual divergence from the false patterns identified by neural networks. I personally think it would take more than a little pindrop to erase 20 years of returns, but the idea still stands. So, this raises the question: how do we avoid the little pindrop? My (far-fetched?) theory is that you can use neural networks to estimate worst-case scenarios in the same way they are designed to estimate best-case scenarios, and then work to avoid them.
In broader terms, Bloomberg reports that the Eurekahedge AI Hedge Fund Index, which tracks the returns of hedge funds known for using machine learning, has underperformed the S&P 500 on a yearly basis. The harsh truth (right now) is that simply investing in the S&P 500 has returned ~13% yearly, while machine-learning-based hedge funds have returned ~9% yearly.
Eurekahedge AI Hedge Fund Index
(The keen observer will notice that despite all the noise, the index has been steadily going up over the past 7 years)
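To get a feel for what a four-point gap means, here is a quick compounding sketch. It assumes both rates hold constant for each of 7 years, which is my simplification and not something the Bloomberg figures claim; the function name compound is hypothetical.

```python
def compound(rate, years, start=1.0):
    """Value of `start` after growing at `rate` per year for `years` years."""
    return start * (1 + rate) ** years


years = 7  # assumed horizon, matching the span of the index chart

sp500 = compound(0.13, years)
ml_funds = compound(0.09, years)

print(f"{sp500:.2f}")     # 2.35
print(f"{ml_funds:.2f}")  # 1.83
```

Under those assumptions, a dollar in the S&P 500 grows to about $2.35 while a dollar in the machine learning funds grows to about $1.83.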
These are some of the questions I ask those few who read what I am writing, and are the types of questions I will ask through my personal research interviews (Good News! I have my first interview scheduled this upcoming Tuesday, and, interviewee permitting, I will post a summary of our talk later in the week).
In my personal opinion, the recent underperformance of trading algorithms in general is not a bad sign. This is still a relatively new field, meaning that more research needs to be done and new discoveries need to be made. I think of it this way: if trading algorithms were working perfectly, then what would be the point of a newcomer (like me) coming in and doing research on them? If it ain’t broke, don’t fix it.