Category: Data Science

Statistics Basics

This course cannot substitute for a basic statistics course! You will learn some fundamentals, but for real data science projects, more statistics knowledge is recommended.


Basic Concepts

There are five phases in Data Science and data analysis:

  • Understanding the business problem: This is one of the most neglected steps in data science and analytics, although it is the most important one. The main question is: what exactly is the business problem that you are asked to solve, or that you want to solve?
  • Preparation Phase: Acquiring and checking the data
  • Analysis Phase: Building models
  • Reflection Phase: Reviewing results and looking at alternative models
  • Dissemination Phase: Reporting results

Another important concept is segmentation. The mean will be covered later, but by looking at the average weight of a whole population, you will not learn much. Men are often taller and consequently weigh more, children weigh less than women, and so on. Looking at a specific segment, such as the weight of men over 18, or segmenting even further into different age groups, provides much more insight than looking at the whole population.
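
As a toy illustration of segmentation, here is a short Python sketch with entirely made-up weight data and segment labels: the overall mean hides the differences that the segment means reveal.

from collections import defaultdict
from statistics import mean

# Hypothetical weights in kg, each with a segment label
people = [
    {"segment": "men 18+", "weight": 85},
    {"segment": "men 18+", "weight": 92},
    {"segment": "women 18+", "weight": 68},
    {"segment": "women 18+", "weight": 62},
    {"segment": "children", "weight": 34},
    {"segment": "children", "weight": 41},
]

weights_by_segment = defaultdict(list)
for person in people:
    weights_by_segment[person["segment"]].append(person["weight"])

print("overall mean:", round(mean(p["weight"] for p in people), 1))  # hides the structure
for segment, weights in weights_by_segment.items():
    print(segment, "mean:", round(mean(weights), 1))                  # reveals it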

The final important concept is the trinity of data, information, and action. Often, data is only reported because it is available but not because it is needed. Worse, it is a challenge to derive the right information from the data. And even if information has been derived, what have we learned that we can put into action? If there is no action behind a data point, the data is not needed.



Google Analytics and Piwik

Google Analytics and Piwik are both web analytics systems: the former is a product from Google provided as Software as a Service, the latter an open-source system that can be deployed on your own server. Google Analytics comes in two flavors: a free version that can be used for up to 10 million hits per month, and a premium version that starts at $150,000 (as of 2016), depending on the number of hits sent. The difference between the two versions is not only the traffic limit but also the feature set: the free version only offers aggregated data, whereas the premium version lets users download raw data; also, more sophisticated features such as data-driven attribution are only available in the premium version. Piwik does not offer all the features that Google Analytics has, but it has the huge advantage that you don't have to pay for the raw data.

Google did not invent Google Analytics; the product is the result of the acquisition of Urchin in 2005 (Urchin is still present in the so-called UTM tags, where UTM stands for Urchin Tracking Module).

The basic concept of (the free version of) Analytics is the session. A session times out after 30 minutes (this can be changed), and with every event a user triggers or every page they visit, the counter restarts. In other words, if a user stays 31 minutes on one page and then clicks on a link to another page on the same website, this counts as two sessions (or two visits, as some people would say, although logically it is one visit). A new session can also be started by re-entering the site via another channel. Advanced users with access to the premium version of Analytics often do not visit the Analytics site at all but perform their own analyses based on the raw data.
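
As a rough sketch of this session logic (not Google's actual implementation), the following Python snippet groups a single user's hits into sessions using a 30-minute timeout; the timestamps are made up and mirror the example above.

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_sessions(hit_times):
    """hit_times: sorted list of datetime objects (pageviews/events) for one user."""
    sessions = 0
    last_hit = None
    for hit in hit_times:
        if last_hit is None or hit - last_hit > SESSION_TIMEOUT:
            sessions += 1          # gap larger than the timeout: a new session starts
        last_hit = hit
    return sessions

# 31 minutes on the first page, then a click on a second page
hits = [datetime(2017, 1, 1, 8, 0), datetime(2017, 1, 1, 8, 31)]
print(count_sessions(hits))  # 2 sessions, although it was one visit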

By understanding how exactly time is measured, you will also identify a few constraints that most people are not aware of (although they are mentioned in the Google Analytics help), and these constraints apply to all web analytics systems based on JavaScript tags being fired. Since the script fires when a page loads, the time a user spends on a single page is measured as the time between two page views. You visit the first page at 8 a.m. and then click on a link to another page on the same website at 8:03 a.m.; you have spent 3 minutes on the site by now. If you spend 2 minutes on the second page and close the browser window after reading it, you have spent 5 minutes in total, but since you have not requested another page, only the first 3 minutes have been measured. As a consequence, time on site is basically the time spent on the website minus the time on the last page, because that cannot be measured (in fact, it could be measured, but most website owners don't do that, for good reasons).
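
A minimal sketch of this measurement constraint, with hypothetical timestamps: each page's duration is the gap to the next pageview, so the last page of the visit contributes nothing to the measured time.

from datetime import datetime

pageviews = [
    datetime(2017, 1, 1, 8, 0),   # first page
    datetime(2017, 1, 1, 8, 3),   # second page (user actually leaves 2 minutes later)
]

# duration of each page = gap to the next pageview; the last page has no gap
durations = [
    (nxt - cur).total_seconds()
    for cur, nxt in zip(pageviews, pageviews[1:])
]
print(sum(durations) / 60, "minutes measured")  # 3.0, not the actual 5 minutes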

Similarly, the bounce rate is not the rate of users who "immediately" leave the site after entering it, but the rate of users who come to your site and see only one page, no matter whether they stay 5 seconds or 30 minutes. This can be changed ("Adjusted Bounce Rate"), but it is rarely done, even though it provides valuable information.

Another important concept in Analytics is events, e.g. the DOM being completely loaded or a timer that fires a specific action after x seconds. This allows us, for example, to implement an Adjusted Bounce Rate, since such an event basically checks whether the user is still there after x seconds.
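
A simplified sketch of that idea in Python (not how Google Analytics computes it internally): a one-pageview session only counts as a bounce if no timer event was recorded for it. The hit labels below are hypothetical.

def is_adjusted_bounce(session_hits):
    """session_hits: list of hit types for one session, e.g. ['pageview'] or ['pageview', 'timer']."""
    pageviews = [h for h in session_hits if h == "pageview"]
    has_timer_event = "timer" in session_hits
    return len(pageviews) == 1 and not has_timer_event

print(is_adjusted_bounce(["pageview"]))           # True: classic bounce
print(is_adjusted_bounce(["pageview", "timer"]))  # False: the user stayed long enough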

Google offers access to the Analytics account of the Google Merchandise Store; go to this help page and click on the access link (a Google account is required; afterwards, you can access the store account directly via the Google Analytics interface).

The Google Analytics interface provides 5 sections:

  • Realtime: While users love to see what happens on their website right now, there is hardly any actionable insight to be derived here, unless webmasters need to debug events or other implementation details.
  • Audience: Information about the users, their interests, and the technology they use; there is also a newer feature that lets analysts explore the behavior of single users. This data cannot be connected to other reports out of the box, although it is possible to hack this.
  • Acquisition: Details about where users came from, including conversions; note, however, that this is a last-interaction view.
  • Behavior: Interaction with the website's content, website speed, site search, and events.
  • Conversions: Conversions from defined conversion goals or e-commerce; this section also offers an attribution module that allows viewing touchpoints in models other than last interaction.

Reports are built from dimensions (such as source or landing page) and metrics (such as sessions); in most reports, it is possible to add a second dimension.


Mean, Median and Mode

When talking about an average, most people refer to the mean, which is officially called the arithmetic mean. It is computed by summing up all values of a population and dividing the sum by the number of elements. Unfortunately, the mean can easily be skewed by outliers in the data.

Another perspective on the average is the median, the middle value of a list sorted by value. The advantage of the median is that it is less influenced by outliers.

There is a third perspective on the average, and this is called the mode. The mode is the most frequent value in a list, and its beauty lies in its flexibility: you can have more than one mode, and the mode also works with categorical data. If you have 15 students, 7 from Germany and 8 from France, the mode is France; with 7 students from each country, you would have two modes. You cannot ask "what is the arithmetic mean of countries?", but the mode works just fine with such data.
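
A short Python example (using the standard statistics module; multimode requires Python 3.8+) illustrating all three measures, including the mode on categorical data:

from statistics import mean, median, multimode

values = [1, 2, 2, 3, 100]                 # the outlier 100 skews the mean
print(mean(values))                        # 21.6
print(median(values))                      # 2

students = ["Germany"] * 7 + ["France"] * 8
print(multimode(students))                 # ['France'] - the most frequent value
print(multimode(students[:14]))            # ['Germany', 'France'] - two modes with 7 each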


Regression: What Should a Used DSLR Cost?

Now that the Canon 5d Mark IV has just been released, the 5d Mark III is becoming affordable too. I was advised to pay €1,500 for at most 30,000 shutter actuations, but looking at the cameras offered on eBay and in the relevant forums, the price seems to be much higher. So what is a fair price? With enough data, it can be determined by regression.
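
As a rough sketch of the approach (the full analysis is in the linked post and not reproduced here), a simple linear regression of price against shutter count could look like this in Python; the listing data below is entirely made up.

import numpy as np

# Hypothetical used-camera listings: shutter count vs. asking price in EUR
shutter_counts = np.array([5000, 12000, 25000, 40000, 60000, 90000])
prices_eur     = np.array([2100, 1950, 1800, 1650, 1500, 1300])

slope, intercept = np.polyfit(shutter_counts, prices_eur, 1)  # simple linear fit
fair_price = slope * 30000 + intercept
print(round(fair_price))   # estimated fair price at 30,000 shutter actuations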


UNIX Command-Line Tools for Data Scientists

The command line (the "shell") is accessed via a terminal window. On the Mac, the Terminal application can be found under Applications -> Utilities -> Terminal. Windows users will have to consult Google, since Windows is not based on UNIX and additional programs have to be installed.

Command line means exactly that: commands are entered on a single line; there is no graphical output and no menus to click through. There is a certain learning curve, but once you have mastered a minimal set of commands and options, you can work very quickly. Important: typing man <command> displays the manual (man stands for manual), for example man grep (a small tip: if you have opened a manual and want to get out again, just press "q" instead of closing the terminal window because you feel trapped in the manual :-)).

UNIX commands usually follow the format command options file, for example sort -r datei.txt.

Important commands, with examples:

  • pwd: Print Working Directory, shows where you currently are in the directory tree. After opening the Terminal application on the Mac, you are in the directory /Users/<USERNAME>
  • cd: cd (change directory) is used to navigate through the directory tree.
    • cd Desktop takes us to the subdirectory Desktop
    • cd .. takes us one directory up; with cd ../.. you can go up two levels.
  • grep: searches for patterns in files.
    • grep -i 'bowie' enwiki-latest-pages-articles.xml – searches a Wikipedia dump for the term 'bowie'; the -i option makes the search case-insensitive, so matches where the term is capitalized are found as well.
    • grep -iv 'bowie' enwiki-latest-pages-articles.xml – the -v option outputs all lines that do not contain the term 'bowie'.
  • sort: sorts a list
    • sort -r: sorts a list in reverse order
    • Especially with large files, sort can take a lot of computing time; an alternative is gsort (which has to be installed separately), which can use all cores of the processor.
  • uniq: removes duplicate occurrences of an element from a list. Note that uniq only removes adjacent duplicates, which is why the input is usually sorted first.

A big advantage of the command line is the ability to use pipes: several commands are chained together so that the output of one command-line program is used as the input of another. Example:

sort keywords.txt | uniq -c | sort -r | less

The | character (typed with alt-7 on a Mac with a German keyboard layout) is the pipe symbol. In this example, the file keywords.txt is first sorted; then uniq with the option -c (for count) counts how often each line occurs; the result is sorted again, this time with the option -r for reverse, because we want the most frequent lines first. The command less ensures that the result is not simply dumped to the screen but can be viewed page by page. If the result should be written to a file instead of the screen, the output is simply redirected, for example

sort keywords.txt | uniq -c | sort -r > keywords.uniq.txt

One more tip: the shell is a great time saver once you get used to not typing everything out but using the Tab key (on the left of the keyboard, marked with the ->| symbol) to complete a command. Example: I am in my home directory (/Users/tomalby) and want to go to the subdirectory Desktop. I simply type cd De and press the Tab key, and the command is completed to cd Desktop/.


Data Science & Data Analysis

This is the (d)english version; pure single-language versions will follow soon. Work in progress!

This is a growing collection of data science, data analysis and web analytics information and resources for my course at the HAW. Some parts of the script will be published here. Please find the EMIL room for the summer term 2017 here.

What is Data Science?

There is no official definition of Data Science (similar to "Big Data"); we will regard data science as the combination of different disciplines such as data mining, statistics, and machine learning, used to derive information from data automatically. While many of the approaches used in these fields have existed for a long time, free programming libraries and cheap computing time and storage from providers such as AWS have enabled more and more people to harness the power of working with huge amounts of data or complex data.

Data Analytics or Data Analysis can be regarded as a subset of Data Science that focuses on the analysis of data. Being very similar to statistics, the term "data analysis" is sometimes regarded as old wine in new bottles. Huge and complex data sets, often termed "big data", are not required for data analysis; most often, quality is more limiting than quantity. In fact, there is no official definition of "big data" either, and just because something is "a lot of data", it should not necessarily be called "big" data. Some people even say there is no such thing as big data.

Web Analytics is a subset of data analysis, although it also uses data that does not come from the website alone. Often enough, other marketing data is connected, which requires additional knowledge about the increasingly complex marketing technology landscape. Without such expertise, the analysis and interpretation of such data is difficult, if not impossible. And while the focus here has been on data mining and some basic statistics, we see more and more machine learning entering this space.

What we will cover


Using More Than One Core under Mac OS X

Today's processors usually have more than one core, but most programs use only one. Often this does not matter, as the machine is fast enough anyway. But sometimes you end up in situations where it is annoying to be limited to a single core. Especially with the UNIX commands, some of whose GNU versions can make use of several cores, part of my Mac's CPU sits idle while the other part runs at 100 percent. In my example, it is about a text file of 8.6 gigabytes (not megabytes :-)) that I have to sort and process. What if you could use more than one core?
