New Applications for BERT in Solving Sequencing & TopK List Problems

This year, I have been researching whether context-learning A.I (Word2Vec in my case) can be used to match patterns in the stock market. The premise is this: since Word2Vec can learn which words are similar from the contexts in which they appear, it makes sense that Word2Vec could learn which events in the stock market are similar through the same process. At this point, I’ve defined a ‘stock market event’ as a % shift in the closing price of a stock over a day (from close to close). This is subject to change as I experiment with results, but I chose this because % shifts capture both the magnitude and the direction of the price movement, which are both crucial in defining an event in the stock market. Another key advantage of this system is that % shifts will never have the problem of homonyms: words such as ‘store’ have multiple meanings (“I went to the store”, “I need to store this item”, “I have something in store for you”), and Word2Vec doesn’t know this. Instead, Word2Vec and similar embedding systems assign each word a single embedding and assume that the word ‘store’ has the same meaning in every context. With % price shifts, we do not have this problem, as there is no ambiguity as to what +2.4% means.
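To make the ‘event’ definition concrete, here is a minimal sketch of how closing prices could be turned into % shift tokens. The function names, the made-up prices, and the choice to round each shift to the nearest 0.1% (so that the vocabulary stays finite) are my own assumptions for illustration, not a settled part of the pipeline.

```python
def pct_shifts(closes):
    """Close-to-close % change for every day after the first."""
    return [(today - prev) / prev * 100.0 for prev, today in zip(closes, closes[1:])]

def to_token(shift):
    """Bucket a % shift to the nearest 0.1% and format it as a signed 'word'.

    The 0.1% bucket size is an assumption; any bucket size would work.
    """
    bucket = round(shift * 10)  # integer number of 0.1% steps
    return f"{bucket / 10:+.1f}%"

closes = [100.0, 102.4, 101.0, 101.0]  # made-up closing prices
tokens = [to_token(s) for s in pct_shifts(closes)]
print(tokens)
```

Each token then plays the role of a word, and a run of consecutive days plays the role of a sentence.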
Interestingly enough, Google AI’s new BERT system, which I referenced a couple posts ago and which I am continuing to learn about, handles the problem of homonyms by creating ‘layered’, context-dependent embeddings. Homonyms are dealt with because BERT is built from transformer layers, which look at the whole sentence at once, whereas Word2Vec is a shallow network that learns one fixed embedding per word from small context windows. Transformers start by creating an embedding for each token in a sentence, much like Word2Vec does. But the same words can appear in different sentences (with different contexts), so to address this problem, the transformer then computes an attention weight for each word pair in the sentence, which reflects how relevant the two words are to one another and takes their positions in the sentence into account. For example, in the sentence “The mighty USS Nimitz blazed through the Pacific, leaving nothing but lead and destruction in its wake.”, one word pair is (mighty, Pacific). These words appear in the same sentence, so they are related, but if the model judges that they depend only weakly on one another, the weight for this pair is low and its influence is rather weak. These pairwise weights are then factored into the embedding for each word, so the embedding for (mighty) is adjusted based on the embedding for (Pacific), in proportion to the weight of (mighty, Pacific). This offers a more in-depth learning approach for each word, which could be more insightful than the window-based Word2Vec model, which only learns from the small windows surrounding words rather than the entire sentences they appear in.
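Here is a toy sketch of that pairwise weighting idea, written as one pass of scaled dot-product self-attention in plain Python. It is heavily simplified and is not how BERT is actually implemented: there are no learned query/key/value projections, no positional information, and only a single layer. The two-dimensional vectors are made up. It only shows that the same static vector for ‘store’ comes out differently depending on its neighbors.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(vectors):
    """One simplified self-attention pass: each output vector is a weighted
    average of all input vectors, weighted by scaled dot-product similarity."""
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return out

# Made-up static embeddings: one fixed vector per word, as in Word2Vec.
static = {"store": [1.0, 1.0], "shop": [2.0, 0.0], "save": [0.0, 2.0]}

store_a = contextualize([static[w] for w in ["shop", "store"]])[1]
store_b = contextualize([static[w] for w in ["save", "store"]])[1]
print(store_a, store_b)  # same word, two different context-dependent vectors
```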
There’s still a lot to unpack within BERT and NLP Transformers, which is to be expected considering how new this model is, but there are already people claiming BERT is paving the way for the future of NLP because of how versatile it is, and how it can extract more information from smaller datasets thanks to masking and transformers. Here’s a really great article which covers some of BERT’s technical components in more detail, and if you have any similar articles which might be helpful, please send them my way.

The Sequencing Problem (and Solutions)

So, back to my main point, which is that once I train my Word2Vec model on the stock price data, I will then have a data set of embeddings. This data set is static for training purposes, as we don’t have to keep updating embeddings for the historical stock data we have since these numbers aren’t changing. Another plus is that with every passing day, we acquire more and more data which can be factored into our embeddings, meaning a more experienced neural network.
Once I have this data set at my disposal, I can begin using real-time data to search for matches within it. First off, there’s the problem of sequencing, which is: Are we trying to find matches over 3-day periods? Over 1-day periods? Over 10-day periods? If we’re looking for similar % shifts in a stock’s price, finding two single-day shifts which are contextually similar to one another is likely not significant enough, because this is too small of a time window to extrapolate a pattern from. At the same time, if we are only looking for strings of 15 consecutive embeddings which match one another (this would mean 15 days in this case, because each embedding represents a % shift over one day), a match would be much more significant, but we will also likely never find two fifteen-day periods which match one another. That time period is just too large.
Ideally, we would want to loop through all the plausible possibilities so that we don’t miss any significant patterns. I addressed this problem of sequencing in one of my earlier posts, and there are now three clear solutions:

  1. Follow the concept of Fingerprints used by Shazam: Once you find two embeddings that match (call them Q and D), check to see if their neighbors match as well. That is, compare Q+1 to D+1, Q+2 to D+2, Q+3 to D+3, …, Q+n to D+n, where ‘n’ is the max window size you’re looking at (Note: We would also want to compare Q-1 to D-1, Q-2 to D-2, …). This way, we can gauge how significant of a match there is between the real-time data we are observing and the historical stock price embeddings, as a match spanning over 4 days would be more significant and unique than a match spanning over 2 days. The magnitude of the match will then factor into our Confidence Score, which our algorithm uses to decide whether or not we have found an alpha (opportunity for profit). I am working to implement this system into my algorithm because it is the most flexible, as you can easily change your parameters (n) and you aren’t making any changes to the embeddings as you go along. Also, all of my stock price shifts are linked to one another by date, so it is easy to access previous_embedding and next_embedding.
  2. Create different classes of embeddings: Each class represents a different time period, so you will have 1-day embeddings, 2-day embeddings, 3-day embeddings, 4-day embeddings, and so on. You can still compare these embeddings to look for matches, but you will have much more data to search through. With this approach, we also don’t have to worry about sequencing, since all potential sequences are covered. We can accomplish this using Universal Sentence Encoder to convert sequences of price shifts (‘sentences’) into a single input, or even BERT’s sentence-embedding capabilities. This is a really interesting potential approach, which I will consider implementing, mainly because you can generate 10+ times more embeddings from the same amount of data. One problem with this approach, which I can see right away, is what happens when a 2-day embedding matches with a 5-day embedding. What does this mean? It essentially means a certain sequence of stock price changes over 2 days is similar to a sequence of stock price changes over 5 days. This is certainly interesting, but I don’t know how to treat this information (are these sequences really similar if they occupy different time windows?). To avoid unnecessary complication, I will avoid this approach in my algorithm’s first trials, but it is still definitely worth looking into.
  3. BERT Next_Sentence predictor: I’ve discussed this superficially in previous blog posts, but the BERT model offers a function which takes two sentences as inputs, and then decides whether or not the second sentence could logically follow the first one. This approaches the sequencing problem in reverse, by taking the real-time data as an input and looping through possible outcomes to figure out which one is most likely to happen.

These are the three solutions to the sequencing problem I’ve identified, and I’m sure there are other solutions out there, so I will keep researching this in the upcoming weeks. In the meantime though, I think solution #1 is most viable.
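Since solution #1 is the one I’m leaning towards, here is a rough sketch of the neighbor check. It assumes embeddings are plain lists of floats ordered by date, that ‘matching’ means cosine similarity above a threshold, and that the walk stops at the first mismatch in each direction. The function names, the threshold of 0.9, and the toy vectors are all placeholders of mine, not final choices.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def match_length(query, history, q, d, n=5, threshold=0.9):
    """Number of consecutive matching days around the seed pair (Q=q, D=d).

    Compares Q+1 to D+1, Q+2 to D+2, ... and Q-1 to D-1, Q-2 to D-2, ...,
    up to n steps each way, stopping at the first mismatch in each direction.
    """
    def matches(i, j):
        return (0 <= i < len(query) and 0 <= j < len(history)
                and cosine(query[i], history[j]) >= threshold)

    if not matches(q, d):
        return 0
    length = 1
    for direction in (1, -1):
        for step in range(1, n + 1):
            if not matches(q + direction * step, d + direction * step):
                break
            length += 1
    return length

# Toy data: 3 days of real-time embeddings vs. 4 days of historical ones.
query = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(match_length(query, history, q=1, d=2))
```

The returned length is the ‘magnitude of the match’ that would feed into the Confidence Score, and changing n only changes how far the walk is allowed to go.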

The TopK List Problem

Another problem to consider is how to interpret TopK lists.
When you input a query into my Word2Vec code, it returns a list of the top k (k is just a number which you can adjust, so it can be the top 10 list or just the top 1) embeddings which most closely match the query.
If you zoom into this image (sorry about the sizing; for some reason, Linux has terrible screenshot software. I’ll get better pictures up soon), you can see the top 3 passages which most closely match the word ‘debt’. My intuition was that when I input real-time stock data, it is only worth looking at the closest match (#1 in the topK list), since we’re trying to find similar patterns in the stock market’s behavior.
However, Andrew Merrill brought up a great point which I unfortunately neglected, which is to consider the whole topK list, rather than only the expected outcome of the closest match. See, if our program returns a topK list where the number 1 result suggests that the market will drop, the number 2 result suggests the market will go up immensely, and the number 3 result suggests the market will stay flat, then this is an overall uncertain topK list (even if the query embedding matches the data embedding with 95%+ similarity). We need to assign more value to a topK list with more consistent results, because this inspires more confidence in the overall projection (if all ten of the top ten closest matches suggest the market will go up, then this is a stronger signal that the market will actually go up).
One way to solve this issue is to calculate the volatility of the topK list, and factor this into our Confidence Score. The Confidence Score is the final value assigned to a potential investment, which currently takes into account the volatility of a topK list, the strength of the match (what % similarity), and the magnitude of the match (how many days in the sequence match one another).
Right now, my priority is figuring out what else should factor into the Confidence Score, and what’s a fair way to factor all of these parameters into a single score. I’m thinking of using Z-Scores, but I’m still learning.
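As a starting point, here is a small sketch of how the consistency of a topK list could be measured. It assumes each match in the list comes with the % move that followed it historically, and it uses the population standard deviation as the ‘volatility’ plus a simple direction-agreement ratio. Both measures and all the names are my own illustrative choices, and the sketch deliberately stops short of combining them into a single Confidence Score.

```python
from statistics import mean, pstdev

def topk_summary(outcomes):
    """Summarize the % moves that followed each of the top-k historical matches."""
    ups = sum(1 for o in outcomes if o > 0)
    downs = sum(1 for o in outcomes if o < 0)
    return {
        "mean": mean(outcomes),                        # average projected move
        "volatility": pstdev(outcomes),                # spread of the projections
        "agreement": max(ups, downs) / len(outcomes),  # share pointing the same way
    }

# Made-up outcomes: one list that disagrees with itself, one that doesn't.
mixed = topk_summary([-3.0, 6.0, 0.0])
consistent = topk_summary([1.0, 2.0, 3.0])
print(mixed)
print(consistent)
```

A list like the second one (low volatility, full agreement) is the kind that should earn a higher Confidence Score than the first.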

A.I + Finance Startups on the Rise

Finally, I came across a very interesting research startup called, which, from my understanding, uses AI to actively search for new investment opportunities and to optimize their portfolios. Here’s an excerpt from their website: “Solving Intelligence for Investment Management entails designing new Machine Learning methodologies to automate the alpha exploration process, so as to give an irrevocable and unfair advantage to (our) machines over the best human experts (Quants), in finding and exploiting new alphas, in making markets more efficient.” They don’t explicitly mention which AI models and strategies they use for their ‘alpha exploration process’, but it is interesting to see new startups focused on finding investment opportunities strictly through AI. I’m hoping to build my algorithm into a more general startup-esque venture similar to what is doing, so I’m curious to see how things play out for them and which steps they take moving forward. If you’re curious to read more about some of their research, click here.
