Chengyi (Jeff) Chen’s Data Science Blog ΨΦ: Pursuit to discover Unknown Unknowns¶
Introduction¶
United States Secretary of Defense Donald Rumsfeld once stated:
Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult ones.
Learning about Rumsfeld’s categorization of information was an important point in my life. Although this quote is more often used in the context of risk management, I see it more as a guide on how to traverse to the right end of the Dunning-Kruger effect curve.
Obviously, in order to succeed in life (“succeed” could be in the context of anything – e.g. healthy relationships, career, finances, academics, …) without relying purely on luck, one needs at least the drive and resolve to work hard and actually execute. But what does it mean to work hard? What does it mean to really understand the material that you’re learning? Clearly, the first step is to start learning about what you know you don’t know. However, if that’s all you do, you might get the false sense of confidence that brings you up to the Peak of “Mount Stupid” as shown in the figure above. To get to the “Plateau of Sustainability”, one has to trudge forward and find out what one doesn’t know one doesn’t know, i.e. the unknown unknowns. I’ll be using this space to demonstrate times when I’ve tried to push past the area of known unknowns by asking questions that drive me toward the unknown unknowns.
Machine Learning Notes¶
The first section of my blog contains some notes for my own reference on machine learning. Overall, my interests lie in the realm of probabilistic machine learning because it provides a framework both for learning about unobserved variables and for obtaining a good measure of uncertainty in predictions. “Pattern Recognition and Machine Learning” by Christopher M. Bishop is the best textbook resource I could find for learning about probabilistic machine learning.
Personal Projects¶
This section features some of the projects I work on in my free time to better understand machine learning concepts and the pyro probabilistic programming language.
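For a flavour of what these projects involve, here is a minimal pyro sketch of the two ideas above: a latent (unobserved) variable and a posterior that quantifies uncertainty. The coin-fairness model and the data are hypothetical, purely for illustration:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

# Hypothetical data: 9 heads and 3 tails from a possibly biased coin.
data = torch.cat([torch.ones(9), torch.zeros(3)])

def model(data):
    # Latent (unobserved) fairness of the coin, with a uniform prior.
    fairness = pyro.sample("fairness", dist.Beta(1.0, 1.0))
    with pyro.plate("flips", len(data)):
        pyro.sample("obs", dist.Bernoulli(fairness), obs=data)

# Variational inference: fit an approximate posterior over the latent fairness.
guide = AutoNormal(model)
svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
for _ in range(1000):
    svi.step(data)

# Posterior quantiles: a point estimate *and* a measure of uncertainty.
print(guide.quantiles([0.05, 0.5, 0.95]))
```

The quantiles give both a point estimate and a credible range for the latent fairness – the uncertainty that point predictions alone don’t provide.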
Featured Course Work & Extra-curriculars¶
This last section contains some of the important course work that I enjoyed while completing my Bachelor’s in Computer Science and Business Administration and Master’s in Analytics from 2017 - 2021. If you’re looking for more traditional optimization-related course work, it’ll primarily be found inside ISE-530: Optimization Analytics, while some of the more quant-finance-focused course work will be found in ISE-537: Financial Analytics, though optimization problems show up in practically all the course work I’ve featured in this blog.
No Free Lunch Theorem¶
If you’re not familiar with the NFL Theorem, it states that no single machine-learning / optimization algorithm is the “best” for every problem. We often hear of ML practitioners who have a favourite algorithm they turn to, almost like a silver bullet, for any given machine learning problem – typically tree-based ensembling models like XGBoost or LightGBM. After building ML systems for trading strategies at Plutus Mazu, I’ve been truly humbled by the implications of this theorem. A tree-based ensembling model might work exceptionally well on a large dataset, but with very little data, even the most regularized tree-based ensembling model can still fail catastrophically during cross-validation compared to, say, a simple unscaled KNN. Speaking of which, I think this theorem also generalizes to data preprocessing techniques: applying scaling, for example, is not always beneficial, and the choice matters most when you’re using an algorithm that computes distances between samples, such as KNN or KMeans.
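Here’s a minimal sketch of the kind of comparison I mean, using scikit-learn on a hypothetical tiny synthetic dataset (the dataset, model choices, and hyperparameters are all illustrative – which model wins depends entirely on the data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical tiny dataset: 40 samples is far too few for a large
# ensemble to show its strengths.
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

for name, model in [
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
    ("unscaled KNN", KNeighborsClassifier(n_neighbors=5)),
]:
    # 5-fold cross-validated accuracy on the same small dataset.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```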
Because of this, my priors on AutoML have shifted much more positively. But then what’s the point of hiring a data scientist? From what I know so far, AutoML systems have yet to become “smarter”. There’s an unimaginably huge search space of models to be used in the machine learning pipeline – which missing-data imputer? which dimensionality reduction technique? which scaler? which feature selection algorithm? which machine learning algorithm? … There’s probably a way to “bayesianify” the AutoML process (use Bayesian optimization to search over ML models) once we figure out a way to measure similarity across ML techniques on different types of datasets.
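To make the size of that search space concrete, here’s a minimal sketch using scikit-learn, where every pipeline step – imputer, scaler, dimensionality reducer, model – is itself a hyperparameter being searched over. The dataset and candidate components are hypothetical, and this uses exhaustive grid search rather than the Bayesian optimization I’m imagining:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset, purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each pipeline step is itself a choice to be searched over.
pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("reducer", "passthrough"),
    ("model", KNeighborsClassifier()),
])

param_grid = {
    "imputer__strategy": ["mean", "median"],
    "scaler": [StandardScaler(), "passthrough"],  # scaling is a choice, not a given
    "reducer": ["passthrough", PCA(n_components=3)],
    "model": [KNeighborsClassifier(), RandomForestClassifier(random_state=0)],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Swapping GridSearchCV for a Bayesian optimizer over the same space is exactly the kind of “bayesianification” I have in mind; the hard part is defining a sensible similarity measure between pipeline configurations across different types of datasets.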