Everyone wants data scientists. But something entirely different is missing.


Everyone wants data scientists. Universities offer degree programs. Coursera and others are overflowing with data science offerings. Apparently, you can learn data science in a month. And that’s important! Because data is the new oil. Without data and the data scientists who turn it into gold, the future looks bleak; everyone agrees on that. Even if you don’t have exciting data, a data scientist can perhaps still conjure up gold dust from the little you have. And usually you have no idea what you could do with the data anyway, but once you have data scientists, everything will be fine. On the hype cycle, we haven’t quite reached the peak yet, but it won’t be long before we head down into the trough of disillusionment (and then on to the plateau of productivity). Several misunderstandings are to blame for this.

There is no universally valid definition of Data Science

So anyone can call themselves a data scientist if they wish. And anyone can put that label on a course or a degree program because it’s currently trendy. That’s exactly what’s happening far too often right now.

Data Science is the interplay of data mining, statistics, and machine learning. And that’s exactly what I offer in my courses. And so that we understand each other right away: one semester is far too little for that. That’s why we don’t even call it Data Science, but rather Data Analytics or something similar. We just dip into Data Science. In the 60 hours of a semester, I won’t turn anyone into a data scientist.

In principle, you should first teach statistics for at least one semester before moving on. Then students should properly learn a programming language, whether it’s R or Python. Only then would you start with machine learning. Along the way, you would explain how to work with Linux/Unix, databases, and cloud technology. That could certainly fill an entire degree program.

Often, though, what is offered is just an introduction to Python with a bit of scikit-learn. But, as described above, that doesn’t matter, because the term isn’t protected anyway. And hardly anyone notices, because who is supposed to judge that?

There is still no adequate training

Recently, I took a peek at a Data Science course on Udemy (which, by the way, costs only a few euros for just a few hours of material). The young man in his gaming chair could talk well, but he couldn’t go into depth. Although that depends on how one defines depth. For me, the low point in terms of content was reached when he said that you don’t necessarily need to understand certain things mathematically, such as whether you divide by n or by n-1. Wow.
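To make the n versus n-1 point concrete, here is a minimal sketch using only the Python standard library. The data values are made up for illustration. Dividing the summed squared deviations by n gives the variance of a complete population, while dividing by n-1 (Bessel’s correction) gives the usual estimate when the values are only a sample from a larger population.

```python
import statistics

# Made-up illustrative values, not real measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
mean = sum(sample) / n
squared_deviations = sum((x - mean) ** 2 for x in sample)

# Divide by n: variance if these eight values ARE the whole population.
population_variance = squared_deviations / n

# Divide by n - 1: estimate of the population variance
# if the values are only a sample (Bessel's correction).
sample_variance = squared_deviations / (n - 1)

print(population_variance)  # 4.0
print(sample_variance)      # about 4.57

# The standard library offers both variants under different names.
print(statistics.pvariance(sample))  # divides by n
print(statistics.variance(sample))   # divides by n - 1
```

With only eight values the two results differ by about 14 percent, and the gap shrinks as n grows. Which one is correct depends on whether you hold the whole population or a sample, and that is exactly the kind of question the mathematics answers.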

I have also had several computer science students working with me, from the University of Hamburg among others. Apart from the fact that they lack basic knowledge (“What is a CSV file?”), they have learned a few techniques that they dutifully list in their job applications (“Experience in ML”), but they haven’t really understood what they’re doing. So k-means is happily applied to everything, even to non-numerical data (after all, it can simply be converted, and then it’s numerical). That this rarely makes sense when you are computing Euclidean distances, well. If all you have is a hammer, everything looks like a nail.
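Why converting categories to numbers does not fix the problem can be shown with a small sketch. The color categories and both encodings below are hypothetical examples, not data from an actual project. Numbering the categories imposes an order and a spacing that the data never had, and Euclidean distance, which k-means relies on, takes that spacing at face value.

```python
import math
from itertools import combinations

# Hypothetical categorical feature with no natural order.
colors = ["red", "green", "blue"]

# Naive conversion: number the categories 0, 1, 2.
label_encoded = {color: [float(i)] for i, color in enumerate(colors)}

# One-hot encoding: one 0/1 coordinate per category.
one_hot_encoded = {
    color: [1.0 if i == j else 0.0 for j in range(len(colors))]
    for i, color in enumerate(colors)
}

def pairwise_distances(encoding):
    """Euclidean distance between every pair of categories."""
    return {
        (a, b): math.dist(encoding[a], encoding[b])
        for a, b in combinations(colors, 2)
    }

label_distances = pairwise_distances(label_encoded)
one_hot_distances = pairwise_distances(one_hot_encoded)

# With integer labels, red is "twice as far" from blue as from green,
# purely because of the order in which the categories were numbered.
print(label_distances)

# With one-hot encoding, all categories are equally far apart.
print(one_hot_distances)
```

One-hot encoding at least removes the artificial order, but that alone does not make k-means a good choice for categorical data. The point is only that the result depends on an encoding decision that has nothing to do with the data itself.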

But if the training is suboptimal, how are data scientists supposed to turn data into gold? For really hard problems, such training won’t be enough. Then either rubbish will be delivered or the project will never be finished. This reminds me a bit of the New Economy, when suddenly everyone could build HTML pages. Only those who knew more than HTML had a chance of getting a job after the crash. And too many shops went bankrupt because they had simply hired poorly trained people.

Not every problem needs a data scientist

Many problems can be solved without a data scientist. In fact, many methods have long been well covered in statistics, from regression analysis to Bayesian inference. Classification and clustering also existed long before the era of data science. Support Vector Machines are somewhat older, too (1960s!). The only new thing is that there are many more libraries that anyone can apply. But you don’t have to think of data science right away when it comes to these topics, because then you immediately pay a hype premium.

And before applying such methods, data analysis comes first. This is the competence that is most lacking. We don’t need more data scientists at first; we need more people who don’t run away from a column of numbers and who can draw the right conclusions from it. And if you then don’t know how to get to a solution, you can always ask a specialist. The most frequent problems I see are not data science problems; they are, to begin with, data analysis tasks. And ideally, these tasks should not be performed by dedicated data analysts, but by the colleagues who are themselves the experts in the subject.

What, if not data science, will be important?

Of course, working with data will not become less important in the future. On the contrary. But it is to be feared that the current hype is not doing this new field any good. Since there’s a lot of money to be made, it is also attracting people whose previous focus was not necessarily on math-related subjects. Anyone can somehow complete a Udemy course. But the quality is not equally good in every course. Accordingly, neither this type of training nor the mechanical learning of methods at university helps to advance data science. As a result, data science is more likely to disappoint and slide into the trough of disillusionment, because not all expectations can be met.

Working with data should be in the foreground, not data science: the analysis, the acquisition. Data scientists get bored when they are only used as better-paid data analysts. And users who cannot articulate their needs and problems (if a problem even exists and they are not just asking for the “cool stuff”) will be baffled when the data scientists leave to look for more exciting tasks. We need users and data scientists who first understand the problem to be solved and who have also analyzed the relevant data. We must give more people the competence to analyze data themselves.