Everyone wants data scientists. But something entirely different is missing.



Everyone wants data scientists. Universities offer degree programs. Coursera and others are overflowing with data science offerings. Apparently, you can learn data science in a month. And that’s important! Because data is the new oil. Without data and the data scientists who turn it into gold, the future looks bleak; everyone agrees on that. Even if you don’t have exciting data, a data scientist can perhaps still conjure up gold dust from the little you have. And usually you have no idea what you can do with the data anyway, but once you have data scientists, everything will be fine. On the hype cycle, we haven’t quite reached the peak yet, but it won’t take long before we head down into the trough of disillusionment (and then on to the plateau of productivity). Several misunderstandings are to blame for this.

There is no universally valid definition of Data Science

Thus anyone can call themselves a data scientist if they wish to do so. And you can also title a course or a degree program as such because it’s currently trendy. That’s exactly what’s happening far too often right now.

Data Science is the interplay of data mining, statistics, and machine learning. And that’s exactly what I offer in my courses. And so that we understand each other right away: one semester is far too little for that. That’s why we don’t even call it Data Science, but rather Data Analytics or something similar. We just dip into Data Science. In the 60 hours of a semester, I won’t turn anyone into a data scientist.

In principle, you should first teach statistics for at least one semester before proceeding. Then properly learn a programming language, whether it’s R or Python. And then you would start with machine learning. In between, occasionally explain how to handle Linux/Unix, databases, and cloud technology. That could certainly fill an entire course of study.

Often, what is sold as data science is just an introduction to Python with a bit of scikit-learn. But, as already described above, that doesn’t matter, because the term isn’t protected anyway. And hardly anyone notices, because who’s supposed to judge that?

There is still no adequate training

Recently, I took a peek at a Data Science course on Udemy (which, by the way, only costs a few euros for just a few hours). The young man in his gaming chair could talk well, but he couldn’t go into depth. Although, it depends on how one defines depth. For me, the low point in terms of content was reached when he said that you don’t necessarily need to understand certain things mathematically, such as whether you divide by n or by n-1. Wow.
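To make the n versus n-1 point concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration. Dividing by n gives the variance of the data treated as a complete population, while dividing by n-1 (Bessel's correction) gives the unbiased estimate from a sample; the gap is largest for small samples.

```python
import statistics

# Hypothetical sample of five measurements (illustrative values only).
sample = [2, 4, 4, 5, 10]

n = len(sample)
mean = sum(sample) / n
squared_deviations = sum((x - mean) ** 2 for x in sample)

# Divide by n: variance of the data treated as the whole population.
var_n = squared_deviations / n
# Divide by n - 1: sample variance with Bessel's correction.
var_n_minus_1 = squared_deviations / (n - 1)

# The standard library exposes both variants under separate names.
assert abs(var_n - statistics.pvariance(sample)) < 1e-12
assert abs(var_n_minus_1 - statistics.variance(sample)) < 1e-12

print(var_n, var_n_minus_1)
```

With only five values, the two results differ by a quarter of the smaller one, which is hardly a detail you can wave away.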

I have also had several computer science students from the University of Hamburg and elsewhere working with me. Apart from the fact that they lack basic knowledge (“What is a CSV file?”), they have learned a few techniques that they dutifully list in their applications (“Experience in ML”), but they haven’t really understood what they’re doing. So k-means is happily applied to everything, even to non-numerical data (which can simply be converted, and then it’s numerical). That this rarely makes sense when calculating Euclidean distances, well. If you only have a hammer, everything looks like a nail.
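Why the "just convert it to numbers" trick is a problem for distance-based methods such as k-means can be shown in a few lines. The sketch below uses a hypothetical color feature and plain Python. It compares arbitrary integer codes with one-hot vectors; both encodings are assumptions for illustration, not something taken from the students' actual work.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical categorical feature with no natural order.
colors = ["red", "green", "blue"]

# Naive "conversion": assign arbitrary integer codes (red=0, green=1, blue=2).
label_code = {color: [float(i)] for i, color in enumerate(colors)}

# One-hot encoding: one indicator dimension per category.
one_hot = {
    color: [1.0 if i == j else 0.0 for j in range(len(colors))]
    for i, color in enumerate(colors)
}

# With integer codes, "red" ends up twice as far from "blue" as from
# "green", purely because of the order in which the codes were assigned.
d_label_red_green = euclidean(label_code["red"], label_code["green"])
d_label_red_blue = euclidean(label_code["red"], label_code["blue"])

# With one-hot vectors, every pair of distinct categories is equally far
# apart, so no artificial order is introduced.
d_onehot_red_green = euclidean(one_hot["red"], one_hot["green"])
d_onehot_red_blue = euclidean(one_hot["red"], one_hot["blue"])

print(d_label_red_green, d_label_red_blue)
print(d_onehot_red_green, d_onehot_red_blue)
```

One-hot encoding removes the invented order, but it does not make k-means a good fit: the cluster centers it computes are averages that no longer correspond to any real category. Methods designed for categorical data, such as k-modes, are usually the more defensible choice.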

But if the training is suboptimal, how are data scientists supposed to generate gold from data? For really tough stuff, such training won’t be enough. And then either rubbish will be delivered or the project will never be finished. This reminds me a bit of the New Economy, when suddenly everyone could build HTML pages. Only those who knew more than HTML had chances of getting a job after the crash. And too many shops went bankrupt because they simply hired poorly trained people.

Not every problem needs a data scientist

Many problems can also be solved without a data scientist. In fact, many methods have long been well covered in statistics, from regression analysis to Bayesian inference. Classification and clustering also existed long before the era of data science. Support Vector Machines are also somewhat older (1960s!). The only new thing is that there are many more libraries that anyone can apply. But you don’t have to think of data science right away when it comes to these topics. Because then you’re immediately paying a hype bonus.

And before applying such methods, data analysis comes first. This is the competence that is most lacking. We don’t need more data scientists at first, we need more people who don’t run away from a column of numbers and can draw the right conclusions from it. And if you then don’t know how to get to a solution, you can always ask a specialist. The most frequent problems I see are not data science problems, they are initially data analysis tasks. And ideally, these tasks should not be performed by extra data analysts, but by the colleagues themselves who are experts in a subject.

What, if not data science, will be important?

Of course, working with data will not become less important in the future. On the contrary. But it is to be feared that the current hype does not do this new field of study any good. Since there’s a lot of money to be made there, it is also attracting talent whose previous focus was not necessarily on math-related subjects. Anyone can somehow complete a Udemy course. But the quality is not equally good in every course. Accordingly, this type of training, like the rote learning of methods at university, does not help to advance data science. As a result, data science is likely to disappoint and slide into the trough of disillusionment. Because not all expectations can be met.

Working with data should be in the foreground, not data science. The analysis. The acquisition. Data scientists are bored when they are only used as better-paid data analysts. And the user who cannot articulate their needs and problems (if a problem even exists and they are not just asking for the “cool stuff”) will be baffled when the data scientists leave to look for more exciting tasks. We need users and data scientists who first understand the problem to be solved and have also analyzed the relevant data. We must give more people the competence to analyze data themselves.

Sistrix traffic vs. Google AdWords keyword planner


If you read along here often, you know that Sistrix is one of my absolute favorite tools (which I’ll brazenly link as the best SEO tool), if only because of the lean API, the absolutely lovable Johannes with his really clever blog posts, and the calmness with which the toolbox convinces again and again. Of course, all other tools are great too, but Sistrix is something like my first great tool love, which you can’t or don’t want to banish from your SEO memory. And even if the following data might scratch the paint, it didn’t cause a real dent in my Sistrix preference.

What problem am I trying to solve?

But enough of the adulation. What is this about? As already described in the post about keywordtools.io and in the inaccuracies of the Google AdWords Keyword Planner data mentioned in the margin, it is a challenge to get reliable data about the search volume of keywords. And if you still believe that Google Trends provides absolute numbers, well… Sistrix offers a traffic index from 0 to 100 for this purpose, which is calculated from various data sources and is supposed to be more accurate as a result. But how accurate are the numbers here? Along the way, I also want to show why box plots are a wonderful way to visualize data.

The data set and first plots with data from Sistrix and Google

The data set here is a sample of 4,491 search queries for which I have both the Sistrix and the Google AdWords Keyword Planner data. By the way, it’s not the first sample I’ve pulled, and the data looks about the same every time, so the results are not an artifact of this particular sample. Let’s first look at the raw data:

As we can see, you could draw a curve into this plot, but the relation doesn’t seem to be linear. But maybe we only have a distorted picture here because of the outlier? Let’s take a look at the plot without the giant outlier:

Maybe we still have too many outliers here, so let’s take only the queries with a search volume under 100,000 per month:

In fact, we see a tendency to rise toward the right. It is not a clear line (I didn’t run a regression analysis), and we also see that at a traffic value of 5 there are search volumes exceeding those at the index values of 10, 15, 20, 25, and 30, and even at 50. So we see the curve again:

The median ignores the outliers within the smaller values:

So if we look at the medians, we see the expected trend at least for the higher values, with the exception of the Sistrix traffic values of 65 and 70. However, the spread around these medians differs widely, as plotting the standard deviation for each Sistrix traffic value shows:

We don’t see a pattern in the spread. It is not the case that the dispersion increases with a higher index value (which would be expected), in fact it is already higher with the index value of 5 than with 10 etc. We see the highest dispersion at the value of 60.
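The per-index summaries discussed above (mean, median, and standard deviation of the search volume for each Sistrix traffic value) can be computed roughly as follows. This is only a sketch with made-up `(traffic_index, monthly_search_volume)` pairs, not the real sample, and the variable names are assumptions. It shows how a single large outlier at index 5 inflates the mean and the standard deviation while the median barely moves.

```python
import statistics
from collections import defaultdict

# Hypothetical (traffic_index, monthly_search_volume) pairs; not the real sample.
rows = [
    (5, 90), (5, 110), (5, 140), (5, 40500),
    (10, 320), (10, 390), (10, 480),
    (15, 590), (15, 720), (15, 880),
]

# Group the search volumes by their traffic index value.
volumes_by_index = defaultdict(list)
for traffic_index, volume in rows:
    volumes_by_index[traffic_index].append(volume)

# Compute mean, median, and sample standard deviation per index value.
summary = {}
for traffic_index, volumes in sorted(volumes_by_index.items()):
    summary[traffic_index] = {
        "mean": statistics.mean(volumes),
        "median": statistics.median(volumes),
        "stdev": statistics.stdev(volumes),
    }

for traffic_index, stats in summary.items():
    print(traffic_index, stats)
```

In this toy data, the medians rise with the index value as one would hope, while the mean and the standard deviation at index 5 are dominated by the one extreme query.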

All-in-one: box plots

Because boxplots are simply a wonderful thing, I’ll add one right here:

Here the axes are swapped (the plot was hard to read with the Sistrix data on the x-axis). The box shows where the middle 50% of the data lies: at a search volume of 390, for example, the middle 50% of the data falls between the Sistrix values of 5 and 25, and the median, indicated by the line in the box, is 15. The boxes grow at first and then vary in size; the smaller boxes indicate lower dispersion. At some data points, we see small circles, which R has flagged as outliers. So we see outliers especially at the low search volumes. Almost everything we plotted above is visualized here in a single plot. Boxplots are simply wonderful.
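What a box plot actually draws can be reproduced with a few lines of arithmetic. The sketch below uses made-up Sistrix index values for a single search volume, chosen so that the box matches the 5 / 15 / 25 example above. It applies the common 1.5 × IQR rule, which is also the default in R's `boxplot`. R computes the box edges as Tukey hinges, which can differ slightly from the quartiles used here, so treat this as an approximation.

```python
import statistics

# Hypothetical Sistrix index values for one search volume (not the real data),
# chosen so that the box matches the 5 / 15 / 25 example in the text.
values = [5, 5, 5, 10, 15, 20, 25, 30, 90]

# Quartiles: the box spans q1..q3, the line inside the box is the median.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's rule: points more than 1.5 * IQR beyond the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Whiskers reach to the most extreme points that are still inside the fences.
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the value 90 lies above the upper fence of 55 and would be drawn as one of the small circles.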

What do I do with this data now?

Does this mean that the traffic data in Sistrix is unusable? No, it doesn’t. As described in the introduction, the Keyword Planner data is not always correct either, so nothing is known for sure. If you treat the Keyword Planner data as the ground truth, you won’t be satisfied with the Sistrix data. It would be helpful if there were more transparency about where exactly the data comes from. Obviously, linked Google Search Console (GSC) data would be very helpful, as it shows real impressions. My recommendation is to consult several data sources and to look at the overlaps and the deviations separately. This is unsatisfactory, since it doesn’t happen automatically. But “a fool with a tool is still a fool”.

Comments (since February 2020 the comment function has been removed from my blog):

Hanns says

  1. May 2018 at 21:18 Hello, thank you very much for the interesting analysis. Have you ever tried the new traffic numbers in the SISTRIX Toolbox? This also gives you absolute numbers and not index values. To do this, simply activate the new SERP view in the SISTRIX Labs. Information can be found here (https://www.sistrix.de/news/nur-6-prozent-aller-google-klicks-gehen-auf-adwords-anzeigen/) and here (https://www.sistrix.de/changelog/listen-funktion-jetzt-mit-traffic-und-organischen-klick-daten/)

Tom Alby says

  1. May 2018 at 10:58 I hadn’t actually seen that before. Thanks for the hint. But these are the ranges here, not the really absolute numbers. But still very cool.

Martin says

  1. April 2019 at 13:33 Moin, I read your post and tried to understand it. But I can’t figure it out. Sistrix is cool, yes, but unfortunately I’m not sure how reliable the data is.

I actually don’t understand how this is supposed to work technically. How is Sistrix supposed to get the search queries that run through Google for each keyword? It’s not as if Google informs Sistrix briefly with every request.

The only thing I can think of is that they pull the data for each keyword from AdsPlanner. But… to present this as “own search volume” without any indication of where the data comes from, I would find grossly negligent.

Where could they still get data from?

Tom says

  1. April 2019 at 20:39 Hello Martin,

The answer is not black and white, and that also comes out in the article. You can’t rely on AdPlanner data either. Sistrix also gets data from customers who have linked their Search Console data there, since that shows your page’s impressions for a keyword. But of course, this is not available for every keyword. And that’s where the inaccuracies come from.

Best regards

Tom