Online Algorithm - mean and median

Published: May 29, 2017

This post discusses so-called Online Algorithm. If fixed number of integers (or real numbers) are given, it’s easy to find a mean or median. Summing up all, then dividing by a number of given values gives us the mean. For a median, sorting the given values then finding center index (indices) would be the all.

What if input is not limited? say, a stream of data?

As for mean, still summing up and saving it to a value may work. But, what if the stream sends billions of 10? “Saving it to a value” will turn to a hard job. Nevertheless, the hard job would be hardly paid since still the answer is 10, just 10.

What about median? Still sorting may work if it uses a heap sort. The problem of heap is: I need to take out all values by the center position, say, half billion. Then, those values must get back to the heap, again half billion. Simple sorting would be very hard to say, “it works”.

As for the stream of data, it looks something called online algorithm works. The online algorithm finds an answer from previous state. Wikipedia uses the term, recurrence relation to describe the way to find the answer.

This unique way of solving a problem is definitely worth writing down a memo.

Problem Description

The problem is simple: Given a stream of data, answer the mean or median anytime. It may be everytime data is added, may be after some or many data are added.

The idea to find the mean

There are some math-y proofs in the Wikipedia, section 4: Online algorithm - Algorithms for calculating variance. The idea is to focus on the difference between a current value and mean up to a prevous value. If the difference divided by total number of values is added to the mean so far, a new mean will be calculate including a current value.

Aside of the mathematical proof, this is a pretty simple and working solution.

Java code for mean

Below is the Java code to find means anytime.

The result is:

8.0
6.4
6.5

The idea to find the median

I found a couple of websites which describe how to find the median from a stream of data. Among them, GeeksforGeeks: Median in a stream of integers (running integers) was the best to understand how to. As the entry says, there are a couple of options to solve the problem. Using two heaps is indeed a simple yet it works solution.

The idea is to maintain smaller half and greater half in two heaps. The smaller half will be sorted in descending order, while greater half will be in ascending order. If I peek the smaller half, I’ll get the biggest value in the smaller half. If I peek the greater half, I’ll get the smallest value in the greater half. These values, two or one depending on the sizes, are the median.

                                  
                                         the smallest value in the greater half
  the biggest value in the smaller half   /
                                     \   /
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
       |<----------------------------->|<----------------------------->|
              smaller half                      greater half

I should keep the size difference of two heaps in 0 or 1. So, when I add a new value, there’s an extra work to maintain the size. However, other than that, solution is pretty simple.

Java code for median

Below is the Java code to find median anytime.

The result is:

8.0
5.0
6.0

Resources

Latest Posts

Setting Up GCP Instance for Deep Learning

This post is going to be very different from what I write here. The content is a memo how I create a GCP (Google Cloud Platform) instance for Deep Learning. While I study algorithm stuff, I also have been studying Deep Learning. In my early days, I tried to train my Deep Learning model only on a laptop. My laptop is 2012 model MacBook Pro, so I would say it is reasonably fast. However, when it comes to Deep Learning, the training was quite painful on the such machine. Often, I ran the training over night to get a disappointed result. Still, I didn’t use any of pricey cloud environment since it was my personal study unrelated to my day job. I wanted to save money.

Complete Binary Tree

Problems which ask a binary tree traverse, add/delete nodes, etc. are popular in algorithm questions. The binary trees are often just a binary tree or binary search tree. Sometimes, the problem pinpoints a particular type of a binary tree, for example, a balanced binary tree or complete binary tree.

Catalan number

What is Catalan number ? The Catalan number belongs to the domain of combinatorial mathematics. It is a sequence of natural numbers such that: 1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, ... The sequence appears in counting problems. Wikipedia has the details about the sequence: Catalan Number.