Category: Data Science


Given that users delete cookies, the online marketing industry is searching for alternatives, and one of them is browser fingerprinting. As described in the section about Cookies and Pixels, JavaScript can collect more information from a user’s browser than a web server’s logging mechanism. You may want to test your browser using this tool; not only are details about your browser’s configuration available but also plugins etc. While most people think that their configuration is standard and shared by many other users, this is usually not the case. Combining several signals, such as the fingerprint, a cookie if available, etc., makes a much more robust tracking mechanism possible.
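Conceptually, a fingerprint condenses the collected browser signals into a single identifier, e.g. by hashing them. A minimal sketch in Python (the attribute names and values below are made-up examples of the kind of data JavaScript can read, not a real fingerprinting library):

```python
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Combine browser signals into one stable identifier by hashing
    a canonical (sorted-key) JSON representation of them."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical signals, for illustration only
signals = {
    "user_agent": "Mozilla/5.0 ...",
    "screen": "1920x1080x24",
    "timezone": "Europe/Berlin",
    "language": "de-DE",
    "plugins": ["PDF Viewer"],
}
print(fingerprint(signals))  # same input -> same 64-char hex digest
```

The same combination of signals always yields the same hash, so a returning browser can be recognized even without a cookie; change one signal and the hash changes completely.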

Filed under: Data Science

Cookies and Pixels

Since it was impossible to “follow” a user based on log files, Lou Montulli invented the HTTP cookie in 1994. It enabled website owners to place a small text file on a user’s hard disk and to recognize the user again when another page of the site is requested, which is the basis for building a web store, for example. To this day, cookies are a basic means of recognizing users, although they have several disadvantages:

  • They can be deleted, and users do delete them.
  • They do not represent a user but a browser on a single machine.

Cookies are differentiated into 1st party and 3rd party cookies. The web server visited by a user is allowed to set a 1st party cookie; if the web page includes a reference that tries to set a cookie from a different web server, that cookie is regarded as a 3rd party cookie, and 3rd party cookies are often blocked. As a consequence, advertising companies have a huge interest in setting 1st party cookies.

A year after the introduction of cookies, JavaScript entered the world. Initially called LiveScript, it was renamed as part of a collaboration with Java’s inventor, Sun Microsystems, and it was meant to add the ability to run small programs in a web browser, in contrast to the rather static HTML documents.

Before we dive further into the advantages of JavaScript, let’s have a look at pixels. A pixel is usually the smallest entity on a screen, and what makes pixels interesting for tracking is that they create very little traffic when sent over the web. In the early days, when every bit was expensive to push through the networks, including a tiny pixel image from another server that counted the requests was a cheap way to measure traffic on a website. However, only the information that is accessible in a typical log file can be logged this way; the main advantage is that site owners without access to their own log files can still get reports, although usually not raw data. This is called page-based tracking: the tracking still happens on a server, but it is triggered from a page.

With the introduction of JavaScript, tracking via a pixel became more sophisticated. Because the first trackers used an actual pixel graphic, these pieces of tracking JavaScript are still called pixels, even if no pixel image is involved anymore. The anatomy of a JavaScript pixel looks like this: a user’s browser requests a web page from a server, the server delivers the page, and the page contains some JavaScript that is executed after the page has loaded. The JavaScript then requests a pixel or another resource from a tracking server, which tracks the user. With JavaScript, however, more information is available than in a log file: JavaScript has access to the screen size, the screen color depth, and much more. This information can also be used for fingerprinting.
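The request that such a pixel triggers is simply a URL with the collected values encoded as query parameters. A sketch of how such a URL could be assembled (the tracker host and parameter names are invented for illustration; real pixels use their own parameter schemes):

```python
from urllib.parse import urlencode

def pixel_url(base: str, page: str, screen: tuple, referrer: str) -> str:
    """Build the tracking request a JavaScript pixel would fire:
    the base pixel URL plus the collected values as query parameters."""
    params = {
        "p": page,                          # page being viewed
        "sr": f"{screen[0]}x{screen[1]}",   # screen resolution
        "ref": referrer,                    # referring page
    }
    return f"{base}?{urlencode(params)}"

url = pixel_url("https://tracker.example/px.gif",
                "/article", (1920, 1080), "https://www.example.com/")
print(url)
```

The tracking server never needs to return anything meaningful; the information travels in the request itself, which the server logs and attributes to the user.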

Another huge advantage of JavaScript pixels is that they do not log requests by bots, since most bots do not interpret JavaScript. JavaScript also makes it possible for external domains to set a 1st party cookie: Google Analytics sets 1st party cookies this way, as do many marketing companies whose pixels are placed on publisher sites to collect as much data as possible about users without having to piggyback on a 3rd party cookie.

Next: Fingerprinting

Filed under: Data Science

Log Files

The web analytics era started with the analysis of log files. Every time a web browser requests a page from a web server, the server logs the request for every single file connected to that page, e.g. the images referenced on it or a CSS file. The web server logs the following information:

  • the IP address
  • the requested resource
  • the date and time of the request
  • the request method (GET or POST)
  • the size of the requested file in bytes
  • the HTTP version
  • the HTTP status code
  • the referrer
  • the user agent
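These fields correspond to the widely used combined log format, so a log line can be parsed with a regular expression. A sketch in Python, with a made-up example line:

```python
import re

# Combined Log Format: IP, identity, user, timestamp, request line,
# status code, response size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<http_version>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('192.0.2.1 - - [10/Oct/2016:13:55:36 +0200] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"https://www.example.com/start" "Mozilla/5.0 (X11; Linux x86_64)"')

hit = LOG_PATTERN.match(line)
print(hit.group("ip"), hit.group("resource"), hit.group("status"))
```

From parsed lines like this, early web analytics derived everything it knew about visitors, which is exactly why the limitations discussed below (dynamic IPs, bot traffic) hit so hard.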

While in the very early days of the WWW most users had their own dedicated IP addresses, this soon changed as the web became more popular. Users who dialed in via CompuServe or AOL got dynamic IP addresses, so a unique IP address did not represent a single user. Also, in some cases, several people or computers may hide behind one IP address.

Another disadvantage of log files is that they record every single request, whether it comes from a human or a bot. For some websites, the majority of requests is generated by bots, and not only by search engine crawlers.

Having said that, log file data is available in real time (with a delay of a few seconds), allowing for immediate analysis if something goes wrong that cannot be detected by JavaScript-based systems (covered in the Cookies and Pixels section). This approach is also called server-based tracking.

Filed under: Data Science

Statistics Basics

This course cannot substitute for a basic statistics course! You will learn some fundamentals, but for real data science projects, more statistics knowledge is recommended.

Filed under: Data Science

Basic Concepts

There are five phases in Data Science and data analysis:

  • Understanding the business problem: This is one of the most neglected steps in data science and analytics, although it is the most important one. The main question is: what exactly is the business problem that you are asked to solve, or that you want to solve? You need to gain a complete understanding of it.
  • Preparation Phase: Acquiring and checking the data
  • Analysis Phase: Building models
  • Reflection Phase: Reviewing results and looking at alternative models
  • Dissemination Phase: Reporting results

Another important concept is segmentation. The mean will be covered later, but by looking at the average weight of a whole population, you are not able to learn anything: men are often taller and consequently weigh more, children weigh less than women, and so on. Looking at a specific segment, such as the weight of men over 18, or segmenting even further into different age groups, provides much more insight than looking at the whole population.
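The weight example can be sketched in a few lines of Python (the sample and its numbers are invented for illustration):

```python
from statistics import mean

# Hypothetical sample: (segment, weight in kg)
people = [
    ("men", 85), ("men", 92), ("men", 78),
    ("women", 64), ("women", 70),
    ("children", 28), ("children", 35),
]

# The overall mean mixes incomparable groups together
overall = mean(w for _, w in people)
print(f"overall: {overall:.1f} kg")

# Grouping by segment gives interpretable numbers
by_segment = {}
for segment, weight in people:
    by_segment.setdefault(segment, []).append(weight)

for segment, weights in by_segment.items():
    print(f"{segment}: {mean(weights):.1f} kg")
```

The overall average of about 64.6 kg describes nobody in the sample well, while the per-segment means (85.0 kg for men, 67.0 kg for women, 31.5 kg for children) are actually informative.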

The final important concept is the trinity of data, information, and action. Often, data is only reported because it is available but not because it is needed. Worse, it is a challenge to derive the right information from the data. And even if information has been derived, what have we learned that we can put into action? If there is no action behind a data point, the data is not needed.


Filed under: Data Science

Google Analytics and Piwik

Google Analytics and Piwik are both web analytics systems; the first is a product from Google provided as software as a service, the latter an open-source system that can be deployed on your own server. Google Analytics comes in two flavors: a free version that can be used for up to 10 million hits, and a premium version that starts at $150,000 (as of 2016), depending on the hits sent to the database. The two versions differ not only in traffic limits but also in features: the free version only offers aggregated data, whereas the premium version lets users download raw data; also, the most sophisticated features, such as data-driven attribution, are only available in the premium version. Piwik does not offer all the features that Google Analytics has, but it has the huge advantage that you don’t have to pay for the raw data.

Google did not invent Google Analytics; the product is the result of the acquisition of Urchin in 2005 (Urchin is still present in the so-called UTM tags, where UTM stands for Urchin Tracking Module).

The basic concept of (the free version of) Analytics is the session. A session expires after 30 minutes of inactivity (this can be changed), and every event a user triggers and every page they visit resets the counter. In other words, if a user stays 31 minutes on one page and then clicks on a link to another page on the same website, this counts as two sessions (or two visits, as some would say, although logically it is one visit). A new session can also be started by re-entering the site via another channel. Advanced users with access to the premium version of Analytics often do not visit the Analytics site at all but perform their own analysis based on raw data.
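The session logic can be sketched as a small function: any gap of more than 30 minutes between two hits starts a new session (a simplification that ignores the channel-based session breaks mentioned above):

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity before a session ends

def count_sessions(hit_timestamps):
    """Count sessions in a sorted list of hit timestamps (in seconds):
    a gap longer than the timeout starts a new session."""
    sessions = 0
    last = None
    for t in hit_timestamps:
        if last is None or t - last > SESSION_TIMEOUT:
            sessions += 1
        last = t
    return sessions

# A page view, then the next one 31 minutes later: two sessions,
# even though the user never left the site.
print(count_sessions([0, 31 * 60]))            # 2
print(count_sessions([0, 10 * 60, 25 * 60]))   # 1
```

This is why a single long read of one page splits into two sessions: the timeout only sees hits, not attention.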

By understanding how exactly time is measured, you will also identify a few constraints that most people are not aware of (although they are mentioned in the Google Analytics help), and these constraints hold for all web analytics systems based on JavaScript tags being fired. Since the script fires when a page loads, the time a user spends on a single page is measured as the distance in time between two page views. Say you visit the first page at 8:00 a.m. and then click on a link to another page on the same website at 8:03 a.m.; you have spent 3 minutes on the site by now. If you spend 2 minutes on the second page and then close the browser window, you have actually spent 5 minutes, but since you have not requested another page, only the first 3 minutes have been measured. As a consequence, time on site is basically the average time spent on the website minus the last page, because the last page cannot be measured (in fact, it could be measured, but most website owners don’t do that, for good reasons).
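The measurement gap can be illustrated in a few lines of Python: the tool only sees the distance between page views, so the time spent on the last page is lost:

```python
def measured_time_on_site(page_view_times):
    """Time on site as a page-view-based tool measures it: the
    distance between the first and the last page view, in minutes.
    The time spent on the last page is invisible because no
    further request follows it."""
    if len(page_view_times) < 2:
        return 0  # a single page view yields a measured time of zero
    return page_view_times[-1] - page_view_times[0]

# Enter at 8:00, next page at 8:03, close the window at 8:05:
# 5 minutes actually spent, but only 3 minutes measured.
print(measured_time_on_site([0, 3]))  # 3
```

Note the corner case: a one-page visit has no second page view at all, so its measured time on site is zero regardless of how long the user actually read.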

Similarly, the bounce rate is not the rate of users who “immediately” leave the site after entering it, but the rate of users who come to your site and see only one page, no matter whether they stay 5 seconds or 30 minutes. This can be changed (“Adjusted Bounce Rate”), but it rarely is, even though it provides valuable information.

Another important concept in Analytics is events, e.g. the DOM being completely loaded or a timer that fires a specific action after x seconds. Events allow us, for example, to implement an Adjusted Bounce Rate, since a timer event basically checks whether the user is still there after x seconds.
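Both definitions can be sketched in Python (the session records below are invented; a real implementation would derive them from the tracking data):

```python
def bounce_rate(sessions):
    """Classic bounce rate: share of sessions with exactly one page view."""
    bounces = sum(1 for s in sessions if len(s["pages"]) == 1)
    return bounces / len(sessions)

def adjusted_bounce_rate(sessions, threshold=30):
    """Adjusted bounce rate: a one-page session does not count as a
    bounce if a timer event showed the user was still there after
    `threshold` seconds."""
    bounces = sum(
        1 for s in sessions
        if len(s["pages"]) == 1 and s["seconds_on_page"] < threshold
    )
    return bounces / len(sessions)

sessions = [
    {"pages": ["/a"], "seconds_on_page": 5},    # bounce either way
    {"pages": ["/a"], "seconds_on_page": 120},  # engaged one-page visit
    {"pages": ["/a", "/b"], "seconds_on_page": 60},
]
print(f"bounce rate:   {bounce_rate(sessions):.2f}")           # 0.67
print(f"adjusted rate: {adjusted_bounce_rate(sessions):.2f}")  # 0.33
```

The engaged one-page visit counts as a bounce under the classic definition but not under the adjusted one, which is exactly the information the timer event adds.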

Google offers access to the Analytics account of the Google Merchandising Store; go to this help page and click on the access link (a Google account is required; in the future, you can access the store account directly via the Google Analytics interface).

The Google Analytics interface provides 5 sections:

  • Realtime: While users love to see what is happening on their website right now, there is no actionable insight to be derived here unless webmasters need to debug events or other implementation details.
  • Audience: Information about the users, their interests, and the technology they use; there is also a newer feature that lets analysts explore the behavior of single users. This data cannot be connected to other reports out of the box, although it is possible to hack this.
  • Acquisition: Details about where users came from, including conversions; this, however, is a last-interaction view.
  • Behavior: Interaction with the website’s content, website speed, site search, and events.
  • Conversions: Conversions from defined conversion goals or e-commerce; this section also offers an attribution module that allows comparing alternative touchpoint views with the last-interaction view.

Reports are displayed with dimensions (e.g. source) and metrics (e.g. sessions); in most of the reports, it is possible to add a second dimension.

Filed under: Data Science

Mean, Median and Mode

When talking about an average, most people refer to the mean, which is officially called the arithmetic mean. It is computed by summing up all values of a population and dividing the sum by the number of elements. Unfortunately, the mean is easily skewed by outliers in the data.

Another perspective on the average is the median, the middle value of a list sorted by value. The advantage of the median is that it is less influenced by outliers.

There is a third perspective on the average, called the mode. The mode is the most frequent value in a list, and its beauty lies in its flexibility: a list can have more than one mode, and the mode also works with categorical data. If you have 15 students, 7 from Germany and 8 from France, the mode is France; if there were 7 from each country plus one student from elsewhere, there would be two modes. You cannot ask “what is the arithmetic mean of countries?”, but the mode works just fine with such data.
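Python’s standard library covers all three measures; a small example with an outlier and with categorical data:

```python
from statistics import mean, median, mode, multimode

values = [3, 5, 5, 7, 9, 5, 100]  # 100 is an outlier

print(mean(values))    # ~19.1, heavily skewed by the outlier
print(median(values))  # 5, barely affected
print(mode(values))    # 5, the most frequent value

# The mode also works with categorical data
countries = ["Germany"] * 7 + ["France"] * 8
print(mode(countries))  # France

# multimode returns all modes when there is a tie
print(multimode(["a", "a", "b", "b", "c"]))  # ['a', 'b']
```

Note that `multimode` requires Python 3.8 or newer; in older versions, `mode` raises an error on tied data instead of returning one of the modes.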

Filed under: Data Science

Regression: What Should a Used DSLR Cost?

Since the Canon 5d Mark IV has just been released, the 5d Mark III is becoming affordable. I was advised to pay €1,500 for at most 30,000 shutter actuations, but if you look at the cameras offered on eBay and in the relevant forums, the price seems to be much higher. So what is a fair price? With enough data, it can be determined by regression. Continue reading
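Given a table of listings (shutter actuations and asking prices), a fair price can be estimated with simple linear regression. A sketch with invented example data, using ordinary least squares for a single feature:

```python
def linear_regression(xs, ys):
    """Ordinary least squares for one feature:
    price = intercept + slope * shutter_count."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical listings: (shutter actuations, asking price in EUR)
shutter = [5000, 20000, 30000, 60000, 90000]
price = [2200, 1900, 1750, 1400, 1100]

intercept, slope = linear_regression(shutter, price)
print(f"fair price at 30,000 actuations: "
      f"{intercept + slope * 30000:.0f} EUR")
```

The slope is negative, as expected: each additional actuation lowers the estimated price, and evaluating the fitted line at 30,000 actuations gives the "fair" price implied by the listings.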

Filed under: Data Science