Leif Peterson  30 June 2019
In Warning to the West (McGraw-Hill, 1976), Alexander Solzhenitsyn stated that the Russian empire would one day establish supremacy over the West without risking a nuclear holocaust.  Interestingly, looking at the current event horizon, with international indictments against Eastern meddling to sway votes during the 2016 elections, it does seem that some of Solzhenitsyn's warnings have, in part, already come true.

The explosive bubble of election hacking and fake political ads on social media is not the only thing at risk in the next decade.  The implosion of data science may be the next axe that falls.  Never before in the history of technology has there been such explosive growth of jobs in a single field.  Already there are thousands of data science jobs, and hundreds if not thousands of web sites offering commercial online training courses in data science.  Alongside the onslaught of these do-it-yourself training sites, academic campuses are getting in on the act, offering master's degrees and "authentic" certificates of training in data science from established universities.

Ignorance is Bliss.
While there are many benefits from the explosive growth of data science, there is an equal if not greater disadvantage from its inflation and overheating.  The biggest problem facing data science is that it is technology-based, which means the "pendulum paradox" of science applies: large pendulum swings occur as new technologies appear, and counter-swings occur as too much uninterpretable information becomes available.  We have seen this before in molecular biology, where a doctoral dissertation is earned using a given technology for interrogating biological samples, yet within 5-7 years the technology is no longer used.  The classical numerical methods for data science are not like this, as the fundamental mathematics behind each method lives on for decades.  But the fundamental numerical methods and mathematical algorithms are not what most online data science trainees are learning.  Instead, they are learning how to slap interpretive-language code together and launch it into the cloud to obtain a solution.

Along these lines, I have run into several engineers over the last few years who have said that they have started to look into data science using R, Python, scikit-learn, Keras, Anaconda, Jupyter, etc., because of the explosive growth of the field.  Not that they need to do this for their jobs; rather, they are merely interested in jumping on the bandwagon.  My response has commonly been that that's fine, but when you assemble interpretive code, how much of what you are doing requires you to fully know the following:

- Numerical methods for obtaining pdfs and cdfs for 20+ probability distributions?
- Numerical methods surrounding the dozens of hypothesis tests used in data analyses?
- Numerical methods of matrix algebra?
- Numerical methods for the various ways of inverting a matrix (Gauss-Jordan elimination, Gauss-Seidel, Jacobi methods, SVD)? (A minimal sketch of one of these is given after this list.)
- Numerical methods related to gradient ascent and gradient descent: Jacobian and Hessian matrices, and first and second partial derivatives?
- Numerical methods involving neural network learning rates and momentum?
- Numerical methods for Latin hypercube sampling, and why LHS has to be used in some neural network problems?
- Numerical methods for various neural network architectures, and the partial derivatives of error with respect to tanh, logistic, linear, Hermite, and Laguerre activation functions?
- Numerical methods of batch vs. online learning?
- Numerical methods of bootstrapping, boosting, and cross-validation?
- Numerical methods for unsupervised non-linear distance-metric manifold learning and dimensionality reduction techniques?
- Numerical methods for supervised learning involving 15-20 techniques?
- Numerical methods for performance assessment of supervised methods?
- Numerical methods for text mining, n-gram analysis, and sentiment mining?
- Hidden numerical tricks required for implementing the techniques listed above?
- How the techniques above can be misused, break down, and fail?
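As a small illustration of the gap between calling a library routine and knowing the numerical method behind it, here is a minimal sketch (in Python, assuming NumPy is available) of the Jacobi iteration mentioned in the list above for solving a linear system.  The matrix, tolerance, and iteration limit are hypothetical examples for illustration only, not taken from any particular course or dataset.

```python
import numpy as np

def jacobi_solve(A, b, tol=1e-10, max_iter=500):
    """Solve Ax = b by Jacobi iteration (assumes A is diagonally dominant)."""
    x = np.zeros_like(b, dtype=float)
    D = np.diag(A)                           # diagonal entries of A
    R = A - np.diagflat(D)                   # off-diagonal remainder
    for _ in range(max_iter):
        x_new = (b - R @ x) / D              # x_{k+1} = D^{-1} (b - R x_k)
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            return x_new                     # converged
        x = x_new
    return x                                 # best iterate if not converged

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
print(jacobi_solve(A, b))                    # compare with np.linalg.solve(A, b)
```

Knowing, for example, that Jacobi iteration converges only for diagonally dominant (or otherwise well-conditioned) systems is exactly the kind of breakdown condition that a slapped-together script never surfaces.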

So, in short, the majority of these engineers wanted to learn data science, but knew little about solving problems from the underlying mathematical algorithms and numerical methods, or about the many ways these methods can break down.  This, unfortunately, is not data science.

False Messiahs.
There are already advertisements by companies and universities claiming that "our way is better."  However, this doesn't really matter in a free-market, internet- and IoT-based world where there are hundreds of apps and companies that provide the same services.  (Have you searched for e-mail apps on your Android or Apple phone lately and realized how many apps for e-mail alone there are?  How about the ever-increasing list of SEO companies?)  Data science training web sites and university-based programs are growing at the same explosive and unregulated rate as internet and phone apps.  However, most of these training programs don't have the time to educate their students on the fundamental mathematics and numerical methods required to implement native compilable code for any of the methods covered.  Instead, students learn by example to slap together snippets of interpretive language that can be launched in the cloud or on AWS for HPC-based multiple-CPU runs.  Unfortunately, the same is true for many master's graduates in statistics who only needed 36 semester credit hours and a thesis or capstone project to graduate.  I have seen and worked with many who don't know that much about logistic regression, and most know very little about matrix algebra, simply because it's not covered during training -- again, as with data science training, there just isn't enough time.
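To make concrete the kind of fundamentals meant here, below is a minimal, hypothetical sketch of fitting logistic regression by Newton-Raphson iteration, showing the matrix algebra (score vector, weight matrix, Hessian) that sits underneath a one-line fit() call.  It is an illustrative example with simulated data only, not a production implementation.

```python
import numpy as np

def logistic_newton(x, y, n_iter=25):
    """Fit a simple logistic regression by Newton-Raphson (IRLS)."""
    X = np.column_stack([np.ones(len(y)), x])      # design matrix with intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        W = np.diag(p * (1.0 - p))                 # weight matrix
        grad = X.T @ (y - p)                       # score (gradient of log-likelihood)
        hess = X.T @ W @ X                         # observed information matrix
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))).astype(float)
print(logistic_newton(x, y))                       # roughly recovers [0.5, 2.0]
```

Seeing where the Hessian can become singular, for instance under complete separation of the classes, is precisely the kind of breakdown that a graduate who has only ever called a packaged routine will never have encountered.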

Bursting of the Data Science Bubble.
With the explosive growth of data science, there will be an accompanying growth of rogue training groups ("online cupcake stores") and improperly trained students who initially become satiated by the feeding frenzy but eventually succumb to local and global avalanches caused by man-made, artificial sectors of the economy.  While humans are currently needed to fill all the available data science jobs, there is nothing preventing these jobs from being lost to robotics and advanced AI in the near future.  So the risk won't come from a change in the technology used for interrogating data, but rather from the replacement of human intervention for specific monotonous tasks.

The hidden truth is that with improperly trained practitioners there is a real risk of navigating down the wrong pathway of knowledge through erroneous interpretation of results, ending up in a space-time continuum where the body of knowledge is assumed to be valid but is wholly biased, a path that will be impossible for mankind to back out of.  We see this all the time in molecular biology, where researchers run t-tests on "everything" without knowing the assumptions surrounding normality, heteroscedasticity (unequal variances), and outlier effects.  The same occurs with correlation, where lab researchers run Pearson correlation on "everything" without ever constructing X-Y scatter plots to look for outliers or clustering of the data.

In one particular case, a researcher found a significant correlation between expression of a protein in serum plasma and in tissue.  I obtained the data and generated an X-Y plot of plasma expression vs. tissue expression.  While there was a cluster of samples near the center of the plot, half of the samples did not express the protein in plasma, so the points for these samples varied along the X-axis (variable tissue expression) but had Y-values of zero.  In other words, there were two clusters of samples: one group expressing with variation in both plasma and tissue, and another expressing in tissue but not in plasma.  Having two clusters of data makes the histogram of plasma expression bimodal, with two spikes or "humps" in the distribution.  Multimodality violates the normality assumption and renders the means and standard deviations used in Pearson correlation biased and unusable.  Instead, the researcher should have run the non-parametric Spearman rank correlation, which does not require normality and is robust to outliers.
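A small, hypothetical re-creation of this situation (simulated data only, not the researcher's actual measurements) illustrates how one would compare Pearson and Spearman on such zero-inflated, bimodal data:

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the case above: tissue expression varies across samples,
# but half the samples do not express the protein in plasma (Y = 0), giving a
# bimodal plasma distribution that violates Pearson's normality assumption.
rng = np.random.default_rng(1)
tissue = rng.lognormal(mean=1.0, sigma=0.5, size=100)
plasma = 0.8 * tissue + rng.normal(scale=0.5, size=100)
plasma[:50] = 0.0                                  # non-expressing subgroup

r_pearson, p_pearson = stats.pearsonr(tissue, plasma)
rho_spearman, p_spearman = stats.spearmanr(tissue, plasma)
print(f"Pearson  r   = {r_pearson:.2f}  (p = {p_pearson:.3g})")
print(f"Spearman rho = {rho_spearman:.2f}  (p = {p_spearman:.3g})")
# An X-Y scatter plot of plasma vs. tissue (e.g., with matplotlib) would
# immediately reveal the two clusters before any test is run.
```

The point is not the particular numbers but the habit: plot first, check the assumptions, then pick the test.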

Warning to Data Science.
The abuse of statistical hypothesis-test assumptions, and the lack of knowledge about how tests and methods break down, that I have witnessed will also naturally occur in the over-inflated field of data science, since it is human nature to make (erroneous) assumptions.  In another blog, I conjectured that data science will be renamed "data services," since in the current realm of data science only services are provided and nothing is manufactured.  What lies ahead in the future of data science depends on how new plans for certification, standardization, and evaluation are introduced, handled, and implemented.  As long as data science is internet-based and unregulated, it will be impossible to thwart the rogue training groups that are growing at the rate of online cupcake stores and phone apps.  The only advantage of a free-market economy vs. a fixed economy is that, hopefully, these rogue groups will fall by the wayside.