Preparing my first neural net, ft. (considering viable data sets)


It has been a pretty busy stretch of days for seniors all across America, with early applications being due four days ago for most colleges. Upon submitting my application, I took a second to step back from the madness and think about my research into quantitative trading, among other things. At a time when I am so invested in this project, it is difficult for me to transition away from my research and move into coursework, knowing that time spent doing readings for a class I’m not so interested in could be exchanged for building my neural network. This makes me wonder about being a well-rounded scholar, and whether it’s really better to be generally educated on various topics than to be an expert on a single, specific one. Through my research, I am beginning to see how many research papers, data sets, and A.I strategies have been used for quant trading, and am feeling overwhelmed by all of it. Yet, one thing I like to keep in mind is a quote from Nassim Taleb’s The Black Swan (which I am still reading, and would definitely recommend you read):

“The more information you give someone, the more hypotheses they will formulate along the way, and the worse off they will be. They see random noise and mistake it for information.” – Taleb, 144

So, according to Taleb, information should be assessed by quality rather than quantity: it is not about how much data you have, but what data you have.

How’s the coding?

Speaking of data, I am now at the point where I need to sit down and seriously consider which data sets I can use for my Word2Vec (or Market2Vec, or Stock2Vec, or Vec2Money) trading algorithm. Right now, on the programming end of things, my neural network is structured and ready to be customized for more mathematical data rather than the natural language I have been using up until now. Essentially, I have Python code (running on a Linux computer where my Tensorflow, Universal Sentence Encoder, and Word2Vec models are installed) which takes a data set as an input and creates embeddings of this data. In past cases, I have trained my neural net on Q&A data, and this has yielded very interesting results. I will go into more technical detail in my next blog post, but the general framework I have goes as follows:

  1. Upload data set.
  2. Read through data in effective manner (either organize the data well before uploading it or tailor your reader function to the data you have).
  3. Create embeddings for the input data.
  4. Come up with some query, and make an embedding for this query.
  5. Compare query_embedding to mainData_embeddings, see which vectors map closest to each other.
  6. See causes/effects of the data_embedding you find, and use this information to make a prediction about your query.
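The steps above can be sketched in a few lines. This is a minimal toy version, not my actual pipeline: the `embed` function below is a stand-in bag-of-words placeholder for a real model like the Universal Sentence Encoder, and the corpus and query are made-up examples.

```python
import numpy as np

# Step 1: "upload" a toy data set standing in for real training data.
data = ["gold price rises on inflation news",
        "tech stocks fall after earnings report"]

# Shared vocabulary so every embedding lives in the same vector space.
vocab = {w: i for i, w in enumerate(sorted({w for d in data for w in d.split()}))}

def embed(text):
    """Placeholder for a real embedding model: a unit-normalized
    bag-of-words count vector over the shared vocabulary."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

# Step 3: create embeddings for the input data.
data_embeddings = np.stack([embed(d) for d in data])

# Step 4: come up with a query and embed it the same way.
query_embedding = embed("inflation pushes gold price up")

# Step 5: cosine similarity (dot product of unit vectors) tells us
# which data embedding maps closest to the query.
sims = data_embeddings @ query_embedding
print(int(np.argmax(sims)))  # 0 -- the gold/inflation sentence wins
```

Swapping the placeholder `embed` for a real encoder leaves the rest of the comparison logic unchanged, which is why steps 4 and 5 feel easy relative to step 3.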

Right now, steps 3 and 6 appear to be the most challenging and thought-provoking. For step 3, what is the best way to make an embedding for stock data? If, for instance, I train my net on a dataset of historical open/close prices for stocks, which parts of this data should I use for creating my embeddings? More importantly, HOW do I create an embedding for open/close prices? I will go over this second question in more detail in, you guessed it, my next blog post.
Here is the problem with step 6: once you find a data_embedding mapping to the query_embedding, how do you use this information to make a useful prediction? In order to do this, you would need to understand the significance of this data_embedding. This is a whole different problem on its own, and the answer will depend on which data set I am using. Going back to the example of open/close prices, how would we find the causes/effects of a data_embedding in this case? My intuition is to simply look at the open/close prices of the next day, and if the close price is up, then there is a higher chance that the query_embedding will go up the next day as well. This would require a system which keeps track of ‘neighbors’, so that we know which data_embeddings precede/follow the data_embedding we are interested in.
The good news is that steps 1, 2, 4, and 5 are relatively simple, and can be done pretty easily.
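That ‘neighbors’ idea can be sketched directly: store the historical window embeddings in order, and alongside each one record whether the *following* day closed up. Finding the nearest historical window to the query then hands you its neighbor's outcome as the prediction. The data below is invented for illustration; the function name is mine.

```python
import numpy as np

def predict_next_day(query_vec, history_vecs, next_day_up):
    """Step 6 sketch: find the historical window closest to the query
    (cosine similarity on unit-scaled vectors), then return what its
    'neighbor' did -- whether the day after that window closed up."""
    sims = history_vecs @ query_vec
    best = int(np.argmax(sims))
    return next_day_up[best]

# Toy history: three window embeddings, each paired with whether the
# day after that window closed up (True) or down (False).
history_vecs = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.7, 0.7]])
next_day_up = [True, False, True]

print(predict_next_day(np.array([0.9, 0.1]), history_vecs, next_day_up))  # True
```

A real version would average over the k nearest neighbors rather than trusting a single match, but the bookkeeping (keeping embeddings aligned with their next-day outcomes) is the same.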

Who have I been talking to?

This past week, I have continued my outreach efforts, emailing financial engineering professors, capital fund managers, independent figures, and even some people involved in A.I startups. Though most of them left me on read, I am getting a first-class education in persistence and perseverance. I have been able to connect with one former financial engineering graduate student from Columbia University, and am planning on speaking with him soon (stay tuned!). I am going to ask questions similar to those from my last interview, but also want to add in some of the more specific questions which arose in my last two posts. If anyone has suggestions for good questions to ask, share them in the comments (always appreciated).
I have been reading a lot on Quora to see what people think about deep learning and finance. There seems to be an active debate between those who believe in the future of A.I for predicting market movements, and those who believe A.I is incompatible with the noisiness of stock data. There is also a lot of research being done on how to manage noisy and misleading data.
I brought this issue to my dad, who mentioned that precious metals markets might be better for modelling, as there are fewer external factors involved in changes in, say, gold prices. Fewer factors mean less noisy data, which should mean more insightful neural network predictions.
I have also been reading about Marcos Lopez de Prado’s new book, Advances in Financial Machine Learning, which I will definitely want to read. De Prado is another firm believer in the power of A.I who thinks that the future of trading will be dominated by competing trading algorithms.

More Abstract

Another interesting note is how involved China seems to be in the revolution of financial engineering. There are, of course, many American, Indian, and European researchers publishing in this field, but the great majority of papers seem to be written by Chinese scholars. One clear example of this is Two Sigma’s recent post about their ‘favorite’ papers on A.I from the International Conference on Machine Learning 2018. The article, written by three Asian researchers at Two Sigma, features a handful of cutting-edge pieces predominantly written by Asian researchers. This is very interesting to think about, especially considering that China has publicly stated it wants to be the world leader in A.I by 2030. Perhaps this surge in quantitative trading research by Asian authors is indicative of China’s plan, and says something about which side China takes in the debate over financial deep learning.
