Since it was impossible to “follow” a user based on log files alone, Lou Montulli invented the HTTP cookie in 1994. It enabled website owners to place a small text file on a user’s hard disk and to recognize the user again when another page of the site is requested, which is the basis for building a web store, for example. To this day, cookies are a basic means of recognizing users, although they have several disadvantages:
- They can be deleted, and users do delete them.
- They do not represent a user but rather a browser (or user account) on a single machine.
Cookies are differentiated into 1st party and 3rd party cookies. The web server visited by a user is allowed to set a 1st party cookie; if the web page includes a reference that tries to set a cookie from a different web server, that is regarded as a 3rd party cookie, and 3rd party cookies are often blocked. As a consequence, advertising companies have a huge interest in setting 1st party cookies.
The web analytics era started with the analysis of log files. Every time a web browser requests a page from a web server, the server logs the request for every single file connected to that page, e.g. the images referenced on that page or a CSS file. The web server logs the following information:
- the IP address
- the requested resource
- date and time of the request
- the HTTP method (GET or POST)
- the size of the requested file in bytes
- the HTTP version
- the HTTP status code
- the referrer
- the user agent
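The fields above correspond to the widely used “combined” log format of servers such as Apache and Nginx. As a minimal sketch (the sample log line is made up), such a line can be parsed with a regular expression:

```python
import re

# Regex for one line in the Apache/Nginx "combined" log format; the named
# groups mirror the list above (IP address, method, resource, status, size,
# referrer, user agent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<http_version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# A fabricated example line for illustration
line = ('203.0.113.7 - - [10/Oct/2016:13:55:36 +0200] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start.html" "Mozilla/5.0"')

match = LOG_PATTERN.match(line)
print(match.group('ip'))        # 203.0.113.7
print(match.group('method'))    # GET
print(match.group('status'))    # 200
```

Real log analysis tools of that era (e.g. Urchin’s log-based predecessors) did essentially this, line by line, over gigabytes of logs.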
While in the very early days of the WWW most users had their own dedicated IP addresses, this soon changed as the web became more popular. Users who dialed in via CompuServe or AOL got dynamic IP addresses, so that a unique IP address did not represent a single user. Also, in some cases, several people or computers may hide behind one IP address.
Another disadvantage of log files is that they log every single request, whether it comes from a human or a bot. For some websites, the majority of requests is created by bots, and this does not only refer to search engine crawlers.
This course cannot substitute for a basic statistics course! You will learn some fundamentals, but for real data science projects, more statistics knowledge is recommended.
After the problem that we want to solve has been identified and properly defined, the next question is how to get the data that is needed to help solve the problem. Sometimes some data, or even tons of it, is already available; sometimes data has to be collected going forward. Experiments are another approach to acquiring data. We will look at different approaches and dive deeper into how data is collected online.
There are five phases in Data Science and data analysis:
- Understanding the business problem: This is one of the most neglected steps in data science and analytics, although it is the most important one. You need to gain a complete understanding of the problem. The main question is: what exactly is the business problem that you are asked to solve, or that you want to solve?
- Preparation Phase: Acquiring and checking the data
- Analysis Phase: Building models
- Reflection Phase: Reviewing results and looking at alternative models
- Dissemination Phase: Reporting results
Another important concept is segmentation. The mean will be covered later, but by looking only at the average weight of a whole population, you are not able to learn anything: men are often taller and consequently weigh more, children weigh less than adults, and so on. Looking at a specific segment, such as the weight of men over 18, or segmenting even further into different age groups, provides much more insight than looking at the whole population.
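As a small sketch of this idea with pandas (the sample data and the column names are made up for illustration), a group-by reveals what the overall mean hides:

```python
import pandas as pd

# Hypothetical sample: the overall mean weight hides the differences
# between the segments (men, women, children).
people = pd.DataFrame({
    'segment': ['m', 'm', 'f', 'f', 'child', 'child'],
    'weight':  [85, 92, 62, 58, 28, 31],
})

# One number for everyone: not very telling
print(people['weight'].mean())

# Per-segment means: much more insightful
print(people.groupby('segment')['weight'].mean())
```

The overall mean of roughly 59 kg describes nobody in this sample well; the per-segment means do.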
The final important concept is the trinity of data, information, and action. Often, data is only reported because it is available but not because it is needed. Worse, it is a challenge to derive the right information from the data. And even if information has been derived, what have we learned that we can put into action? If there is no action behind a data point, the data is not needed.
Google Analytics and Piwik are both web analytics systems, the first being a product from Google provided as software as a service, the latter an open source system that can be deployed on your own server. Google Analytics comes in two flavors: a free version that can be used for up to 10 million hits, and a premium version that starts at $150,000 (in 2016), depending on the hits sent to the database. The difference between the two versions is not only the traffic volume but also features: the free version only offers aggregated data, whereas the premium version lets users download raw data; also, the most sophisticated features, such as data-driven attribution, are only available in the premium version. Piwik does not offer all the features that Google Analytics has, but it has the huge advantage that you don’t have to pay for raw data.
Google did not invent Google Analytics; the product is the result of the acquisition of Urchin in 2005 (Urchin is still present in the so-called UTM tags, where UTM stands for Urchin Tracking Module).
The basic concept of (the free version of) Analytics is the session. A session is set to 30 minutes (which can be changed), and with every event that a user triggers or every page that they visit, the counter starts again. In other words, if a user stays 31 minutes on one page and then clicks a link to another page on the same website, this counts as two sessions (or two visits, as some people would say, although logically it is one visit). A new session can also be started by re-entering the site via another channel. Advanced users with access to the premium version of Analytics often do not visit the Analytics site at all but perform their own analysis based on raw data.
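The 30-minute timeout rule can be sketched in a few lines of Python (the hit timestamps are invented; real sessionization on raw data would also key by user and handle channel changes):

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_sessions(timestamps):
    """Count sessions for one user: a gap of more than 30 minutes between
    two consecutive hits starts a new session, i.e. the counter resets
    with every page view or event."""
    sessions = 0
    last = None
    for ts in sorted(timestamps):
        if last is None or ts - last > SESSION_TIMEOUT:
            sessions += 1
        last = ts
    return sessions

hits = [datetime(2016, 5, 1, 12, 0),   # first page view
        datetime(2016, 5, 1, 12, 31),  # 31 minutes later -> new session
        datetime(2016, 5, 1, 12, 40)]  # same session as the previous hit
print(count_sessions(hits))  # 2
```

This reproduces the example from the text: a 31-minute pause on one page followed by a click yields two sessions.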
Similarly, the bounce rate is not the rate of users who “immediately” leave the site after entering it, but the rate of users who come to your site and see only one page, no matter whether they stay 5 seconds or 30 minutes. Although this can be changed (“Adjusted Bounce Rate”), it is rarely done, even though it provides valuable information.
Another important concept of Analytics is the existence of events, e.g. the DOM being completely loaded, or a timer that fires a specific action after x seconds. This allows us, for example, to implement an Adjusted Bounce Rate, since the event basically checks whether the user is still there after x seconds.
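The difference between the two bounce definitions can be illustrated on raw session data; the tuples below are hypothetical and not a real Analytics schema:

```python
# Hypothetical raw sessions: (pages viewed, did a timer event fire after
# x seconds?). Field meanings are illustrative only.
sessions = [
    (1, False),  # one page, left before the timer fired -> bounce either way
    (1, True),   # one page, but stayed past x seconds -> "rescued" visit
    (3, True),   # multi-page visit -> never a bounce
]

# Classic bounce: every one-page session counts
classic_bounce_rate = sum(1 for pages, _ in sessions if pages == 1) / len(sessions)

# Adjusted bounce: one page AND no engagement event fired
adjusted_bounce_rate = sum(1 for pages, engaged in sessions
                           if pages == 1 and not engaged) / len(sessions)

print(classic_bounce_rate)   # counts both one-page visits
print(adjusted_bounce_rate)  # the timer event rescues the engaged visitor
```

In this toy example the classic bounce rate is 2/3, while the adjusted bounce rate drops to 1/3, which is exactly the extra information the timer event provides.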
Google offers access to the Analytics account of the Google Merchandising Store; go to this help page and click on the access link (a Google account is required; in the future, you can access the store account directly via the Google Analytics interface).
The Google Analytics interface provides 5 sections:
- Realtime: While users love to see what is happening on their website right now, there is little actionable insight to be derived here, unless webmasters need to debug events or other implementation details.
- Audience: Information about the users, their interests, and the technology they use; there is also a new feature that lets analysts explore the behavior of single users. This data cannot be connected to other reports out of the box, although it is possible to hack this.
- Acquisition: Details about where users came from, including conversions; note, however, that this is a last-interaction view.
- Behavior: Interaction with the website’s content, website speed, site search, and events
- Conversions: Conversions from defined conversion goals or e-commerce; this section also offers an attribution module that allows comparing alternative touchpoint views to the last-interaction view.
Reports are displayed by dimensions (e.g. channel or page), combined with metrics such as sessions; in most of the reports, it is possible to add a second dimension.
When talking about an average, most people refer to the mean, officially called the arithmetic mean. It is calculated by summing up all values of a population and dividing the sum by the number of elements. Unfortunately, the mean can easily be skewed by outliers in the data.
Another perspective on the average is the median, the middle value of a list sorted by value. The advantage of the median is that it is less influenced by outliers.
There is a third perspective on the average, and it is called the mode. The mode is the most frequent value in a list, and its beauty lies in its flexibility, because you can have more than one mode. Also, the mode works with categorical data. If you have 14 students, 7 from Germany and 7 from France, you have two modes. You cannot ask “what is the arithmetic mean of countries?”, but the mode works just fine with such data.
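All three perspectives are available in Python’s standard library (`multimode` requires Python 3.8+); the weights and the student list below are invented for illustration:

```python
from statistics import mean, median, multimode

# One outlier (200) pulls the mean up, while the median stays robust
weights = [70, 72, 68, 74, 70, 200]
print(mean(weights))    # skewed by the outlier
print(median(weights))  # robust middle value

# The mode also works with categorical data and can be multi-valued
countries = ['Germany'] * 7 + ['France'] * 7 + ['Spain']
print(multimode(countries))  # two modes: ['Germany', 'France']
```

Note how the single outlier pushes the mean above every typical value in the list, while the median still describes a typical element.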
Since the Canon 5D Mark IV has just been released, the 5D Mark III is becoming affordable. I was advised to pay €1,500 for a body with at most 30,000 shutter actuations, but looking at the cameras offered on eBay and in the relevant forums, prices seem to be much higher. So what is the fair price? With enough data, it can be determined via regression.
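As a minimal sketch of that regression idea, a simple linear fit of price against shutter actuations can be done with NumPy; the listings below are made up, real data would come from eBay and forum offers:

```python
import numpy as np

# Made-up listings: shutter actuations and asking price in EUR
actuations = np.array([5_000, 12_000, 20_000, 28_000, 40_000, 55_000])
prices     = np.array([2_100, 1_950, 1_800, 1_650, 1_450, 1_250])

# Simple linear regression: price = slope * actuations + intercept
slope, intercept = np.polyfit(actuations, prices, 1)

# Estimated fair price for a body with 30,000 actuations
fair_price = slope * 30_000 + intercept
print(round(fair_price))
```

A linear model is of course a strong assumption; with more data, one would check whether the price decay is really linear in shutter count or better modeled with additional predictors (condition, accessories, seller type).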
This course cannot replace a statistics seminar; only elementary fundamentals are taught, and mathematical formulas are avoided as far as possible. For a deeper study of statistics, the textbook Statistik: Der Weg zur Datenanalyse (affiliate link) is recommended.