Recent site activity


JLospinoso's github feed



@jalospinoso del.icio.us

Thinking Like a Statistician - new video available

posted Apr 4, 2012, 12:32 PM by Josh Lospinoso

New video available here!

In this video, Josh Lospinoso, Lead Technologist at Red Owl Analytics, provides an illustrative example of how using inferential statistics, rather than descriptive statistics, can provide clear and precise results that would not be possible with descriptive techniques. A demo dataset of income, age, gender, and years of education is generated for the example. All of these materials are available here: https://sites.google.com/a/lospi.net/home/downloads/

Hadoop World Presentation Video Available

posted Jan 5, 2012, 4:04 PM by Josh Lospinoso

In this session we will look at Reveal, a statistical network analysis library built on Hadoop that uses relational event history analysis to grapple with the complexity, temporal causality, and uncertainty associated with dynamically evolving, growing, and changing networks. There are a broad range of applications for this work, from finance to social 
network analysis to network security.

http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-building-relational-event-history-model-with-hadoop

Statistical Tricks for Lists in C#: Scaling, Standardizing, Normalizing, Ranking, and more with extension methods

posted Dec 12, 2011, 12:02 PM by Josh Lospinoso

In lots of situations where users will want to analyze continuously valued data, it can be fruitful to do some simple manipulations on a list of doubles. In this post, I provide a few techniques for transforming this data in some way using extremely convenient extension methods. The sorts of manipulations I will be presenting here involve various ways of shaping your data, whether you want to preserve the ordering, give a regular shape for scoring purposes, or shift/scale your data onto some new interval.

 

Extension Methods

Extension methods are a little piece of syntactic sugar that allow us to tack functionality onto classes without having to extend them through a subclass. So, for example, we could extend the Double class with a method which returns the objects squared value (as a double). The method is declared as a static method within some static class, and the first argument is what gives it away as an extension method:

The this keyword allows you to use x within the body of the function to yield some result. So, the following code:

public static class ExtensionMethods
{
    public static Double SquareMe(this Double x)
    {
        return x * x;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        Double z = 7;
        Console.WriteLine(z.SquareMe());
    }
}

will yield the answer 49 as we expect.

For more information, c.f. http://msdn.microsoft.com/en-us/library/bb383977.aspx.

Scaling

When we scale a vector of numbers, we seek to take the original list and transform it into another list that has some maximum value and some minimum value, while the scale between the numbers is preserved. For example, the list

List<Double> x = new List<Double> { 0.0, 2.0, 5.0, 10.0 };

could be scaled onto another interval, say 5 to 6. We would want the new list to (1) preserve the order and scale between numbers and (2) obviously to be on the interval 5 to 6. The following list uniquely achieves this goal:

{ 5.0, 5.2, 5.5, 6.0 }

 The following extension method can be used:

        /// <summary>
        ///  Scales a list onto a new interval
        /// </summary>
        /// <param name="list"></param>
        /// <param name="lower">Lower limit of new interval</param>
        /// <param name="upper">Upper limit of new interval</param>
        /// <returns></returns>
        public static IDictionary<Double, Double> Scale(this IList<Double> list, double lower = 0, double upper = 1)
        {
            Double max = list.Max();
            Double min = list.Min();
            Double avg = list.Sum() / list.Count;
            return list.ToDictionary(x => x, x => (upper - lower) * ((x - min) / (max - min)) + lower);
        }

Scaling is useful if you want to present a user with a score on a scale from, say, 0 to 100 given some arbitrary list of data. All of the scaling is preserved in the numbers, so given the original interval you could easily back out the original list (i.e. no information is lost).

In these examples, I'll show some charts to illustrate data looks like after transformation.

For this example, I will randomly generate 1000 points on the [0,10] interval and scale them onto the [1,7] interval using the function above. The black points represent some original data, and the red points represent points after transformation:

 

Here we can see that the points have been "squished" onto the [1,7] interval, but they maintain relatively similar dispersions. Another way to see this relationship is to sort the data in ascending order:

 

We can visually confirm that the linear nature of the data (which comes from the shape of the uniform distribution) is preserved during the scaling transformation. In short, the scaling function is useful for situations where we want to keep the relative shape of the distribution, but we just need it on a different interval. As another example, we could generate noisy sinusoidal data, yielding the following plot:

where the "smushing" onto a different interval is nicely apparent.

Standardization

While scaling can help us to put on an interval of interest, we may be interested in getting some indication of just how far away from central tendencies some observations are in out data. Suppose we have data which contains some really extreme outliers (I will use Cauchy distributed data, which has infinite variance!) Here's what scaling around [-10, 10] looks like for 500 points:

What happened? It turns out that the maximums and minimums are so large that the center of the scaled numbers is somewhere around 7, not zero! Clearly, the maximum and minimum have a large effect on the outcome of scaling. To help answer this, we can scale by mean and standard deviation in the following way:

        /// <summary>
        /// Standardizes a list by subtracting its mean and dividing by its standard deviation to yield
        /// a new list with mean zero and standard deviation one.
        /// </summary>
        /// <param name="list"></param>
        /// <returns></returns>
        public static IDictionary<Double, Double> Standardize(this IList<Double> list)
        {
            Double avg = list.Sum() / list.Count;
            Double var = list.Select(x => Math.Pow(x - avg, 2)).Sum() / list.Count;
            Double sd = Math.Pow(var, .5);
            return list.ToDictionary(x => x, x => (x - avg) / sd);
        }

This formula is much less sensitive to outliers in the data, but it will nonetheless preserve the scale between the numbers. Here's what standardization looks like applied to the Cauchy data given above:

 

Now, we have this nice centering around the mean, and the outliers have much less sway over the results. Additionally, the new values have an appealing interpretation of "the original point is X standard deviations away from the mean of the distribution," which has a rich meaning in many statistical settings.

Percentile

If the shapes of distributions are really funny, or if we are only concerned with ordering, it can make sense to calculate percentiles. This transformation is interpreted as "What percentage of other points in the list are less than or equal to this point?" An extension method implementing this calculation is the following:

        /// <summary>
        /// Calculates the percentile for each element
        /// </summary>
        /// <param name="list"></param>
        /// <returns></returns>
        public static IDictionary<Double, Double> Percentile(this IList<Double> list, bool midpoint = false)
        {
            double denominator;
            double numeratorAdjustment;
            if (midpoint)
            {
                denominator = list.Count;
                numeratorAdjustment = 0.5;
            }
            else
            {
                denominator = list.Count - 1;
                numeratorAdjustment = 0;
            }
            return list.ToDictionary(x => x, x => (list.Count(y => x > y) + numeratorAdjustment) / denominator);
        }

When we apply this to a so-called "Mixing Distribution"--where we essentially stick together random numbers from two different distributions, we get the following pattern:

 

It doesn't seem all that obvious why we might want to apply such a transformation to this data, but when we compare the sorted distribution, the elegance of the percentile transformation becomes apparent:

The great feature of calculating percentiles is that we (1) have a wonderfully simple interpretation of just what the value means, (2) we know that the numbers will lie on the [0,1] interval, and (3) it is extremely well behaved for virtually any kinds of distributions.

Normalization

Finally, in some situations, it makes sense to Normalize--literally take our data and make it look like it came from a normal distribution. This can be done through a neat little mathematical trick. First, import the

using MathNet.Numerics;

package available here: http://mathnetnumerics.codeplex.com/.

This allows you to access a number of useful mathematical functions. To normalize our data, we first construct percentiles (which we have already created an extension method for above). Next, we will put this percentile into the inverse Normal cumulative density function (CDF). For the mathematically inclined, we are simply using the Inverse CDF as a map from [0,1] onto the space of the distribution function (in this case, Normal). The following function gives us the transformation:

        /// <summary>
        /// Inverse standard normal cumulative distribution function
        /// </summary>
        /// <param name="p">percentile</param>
        /// <param name="mean">mean of target distribution</param>
        /// <param name="stdev">standard deviation of target distribution</param>
        /// <returns></returns>
        private static Double InverseNormalCDF(Double p, Double mean = 0, Double stdev = 1)
        {
            return stdev * Fn.ErfInverse(2*p -1) * Math.Sqrt(2) + mean ;
        }

Note that we could, for example, simulate random normal numbers through this little trick:

            Random rng = new Random();
            InverseNormalCDF(rng.NextDouble());

The extension method is then simply

        /// <summary>
        /// "Curves" the list by mapping it onto a standard normal distribution
        /// </summary>
        /// <param name="list"></param>
        /// <returns></returns>
        public static IDictionary<Double, Double> Normalize(this IList<Double> list, double mean=0, double stdev=1)
        {
            return list.Percentile(true).ToDictionary( x => x.Key, x => InverseNormalCDF(x.Value, mean, stdev) );
        }

So our mixing distribution data can be transformed:

and corresponding kernel densities look like the following:

 

where we see that we have achieved a nice shape from a strange looking distribution of numbers. In fact, this is what most teachers mean when they say "curving" test scores--although many do a simple scaling in practice.

Putting it all together

With a few simple methods to sort output and print them to console:

/// <summary>
        /// Prints a dictionary of double, double to Console.
        /// </summary>
        /// <param name="dictionary"></param>
        /// <returns></returns>
        public static void ToConsole(this IDictionary<Double, Double> dictionary, String prepend = "", String append = "", Boolean sort = false)
        {
            IList<Double> keys;
            if (sort)
            {
                keys = dictionary.SortOnKeys().Keys.ToList();
            }
            else
            {
                keys = dictionary.Keys.ToList();
            }
            Console.Write(prepend);
            foreach (Double i in keys)
            {
                Console.WriteLine(String.Format("{0:0.0000} {1:0.0000}", i, dictionary[i]));
            }
            Console.Write(append);
        }

        /// <summary>
        /// Sorts a dictionary on its double key values
        /// </summary>
        /// <typeparam name="T">Value type</typeparam>
        /// <param name="dictionary"></param>
        /// <returns>A SortedDictionary</returns>
        public static SortedDictionary<Double, T> SortOnKeys<T>(this IDictionary<Double, T> dictionary)
        {
            return new SortedDictionary<Double, T>(dictionary);
        }

it is possible to put all of these functions together into a little demo:

public class Program
    {
        public static void Main(string[] args)
        {
            Random randomNumberGenerator = new Random("Red Owl Consulting".GetHashCode());
            List<Double> x = new List<Double>();
            for (Double i = 0; i < 10; i++)
            {
                x.Add(randomNumberGenerator.NextDouble());
            }

            x.Percentile(midpoint: false).ToConsole(prepend: "Percentile(midpoint: false):\n", append: "-----\n", sort: true);
            x.Percentile(midpoint: true).ToConsole(prepend: "Percentile(midpoint: true):\n", append: "-----\n", sort: true);

            x.Scale(lower: 0, upper: 1).ToConsole(prepend: "Scale(lower: 0, upper: 1):\n", append: "-----\n", sort: true);
            x.Scale(lower: 10, upper: 100).ToConsole(prepend: "Scale(lower: 10, upper: 100):\n", append: "-----\n", sort: true);

            x.Normalize(mean: 0, stdev: 1).ToConsole(prepend: "Normalize(mean: 0, stdev: 1):\n", append: "-----\n", sort: true);
            x.Normalize(mean: Math.PI, stdev: Math.E).ToConsole(prepend: "Normalize(mean: Math.PI, stdev: Math.E):\n", append: "-----\n", sort: true);

            x.Standardize().ToConsole(prepend: "Standardize():\n", append: "-----\n", sort: true);

            Console.ReadLine();
        }
    }

I hope that this demo helped you to understand some of the interesting transformations that you can perform on your data. There are undoubtedly other ways that you might think of transforming data based on your particular situation, but this should provide an intuitive feel for what's possible.

I know my means, medians, modes, and standard deviations—why do I need a statistician?

posted Dec 12, 2011, 12:00 PM by Josh Lospinoso

Most developers have a solid understanding of basic statistics. After mastering means, medians, modes, ranges, standard deviations, percentiles, histograms, and pie charts, it’s tempting to wonder whether any further statistical knowledge is necessary. After all, if I can describe data in a few fundamental ways—central tendency, dispersion, modality—and visualize it to a user, why would I need to get a statistical consultant involved?


In many situations, simple summary of data is all that is needed by the end user, and a few simple descriptives will do the trick nicely. In lots of other situations, however, descriptives simply aren’t enough. When we want to try to draw generalizations about a broader population based on a sample, naïve use of statistics can lead us astray. We will need inferential statistics to do the trick. In this post, I will provide a gentle introduction to the world of inferential statistics, what makes them so elegant, and how you can use them to provide incredibly rich and powerful functionality into your software projects.


One of the most fundamental features of inferential statistics is the ability to make conclusions about broader populations based on a sample while automatically accounting for uncertainty. Suppose that we are writing a plug-in for blogging software. If we want to present a user with some information on how many hits she has received on her blog in relation to other users. In this situation, we can use simple descriptive statistics to satisfy these user requirements—simply show a histogram of frequency (y axis) of the number of hits (x axis) for each blog, and show the user where they lie on this distribution (in red): 

 


We are simply describing a large amount of data in a simple, intuitive way. The user can see here how many hits her blog has received and some idea of where she lies in the distribution of hits per blog. The art of descriptive statistics is to summarize the data in a way that a large amount of data can be easily digested by a user.


Now, consider that we are writing software for a car insurance company. They want to know how various customer attributes affect the amount that they can expect a customer to claim in order to help them quote for new customers. Imagine that the database returns a query with the following data:


customerId

isMale

age

claimsThisYear

1

TRUE

19

17230

2

TRUE

53

0

3

FALSE

37

5300

4

TRUE

41

0

 


We could think about simply displaying some more descriptive statistics to the end user, e.g. by finding all rows that match the basic characteristics of the new customer being quoted—a 47 year old male—and presenting the distribution of yearly claims for them. Consider, however, the following problem: even with a large dataset, it is totally possible that you will not have any observations to match the prospective customer. Or, if you do, it is also entirely possible that you will have very few observations. The end user wants to know what the average yearly claim is for a particular demographic. If you have only observed five 72 year-old female claims, how much uncertainty is there about the amount that you quote?


In summary, our problem is that we would like to know some general rule about the average yearly claim for all prospective customers, but we only have a sampling based on our limited database. This is a problem for inferential statistics, because we do not want to say something about our sample (old customers) but about a larger population (new customers).


One way to handle this problem is to suppose that the yearly claim made by any individual is random, in the same sense as the outcome of a coin flip is random, and that the claim amount has a probability distribution. The trick is that not all demographics have the same distribution. A linear regression is a special case of this set up. The idea with a linear regression is that each person’s yearly claim has a Normal distribution, which has this shape:

 

 


The mean of this distribution shifts around based on demographics. Are you an 18 year old Male driver? Maybe the mean of the distribution is $2,000. Are you 36 instead of 18? Perhaps now the mean decreases to $750. As the name suggests, we model this idea by supposing that the mean depends on a linear combination of the demographic variables, for example:


AverageYearlyClaim = Constant + B1 * age + B2 * isMale

The constant, B1, and B2, are called statistical parameters. These statistical parameters are the focus of most of our efforts in inferential statistics. Consider what they mean; B1 is the marginal contribution of another year of a driver’s life on the average yearly claim. If B1 is -$50, then the difference between the average claim for a 20 year-old driver and a 30 year old driver is -$500. Suppose we set isMale equal to 1 when the driver is male. Then B2 is the difference in average claims between males and females. (The constant is not too terribly interesting—it simply helps us to achieve scale for the overall average yearly claim for everyone.)


So what are B1 and B2? How do we know if we’ve picked the right numbers? This problem is called estimation in the statistical literature. Perhaps the most common way to do this is to make the “line” of the linear regression closest to the observation “points” as in the following figure:

 

Another interpretation is that we want to make the Normal curves implied by the linear model for the average the most probable set of Normal curves to have generated our data. This is called maximum likelihood estimation.


The estimation process achieves point estimates for the constant, B1, and B2. One of the wonderful things about maximum likelihood estimation is that we typically know a whole lot about how these estimations behave. Accordingly, we can assess uncertainty about the estimates B1. Say, for example, that we estimate the following model from our database:


AverageYearlyClaim = 651 – 27.2 * age + 206 * isMale


Remember that this is only an estimate of the true values of B1 and B2. The intuitive idea is that if we had a different set of customers from the same population of drivers, we would have gotten different estimates. A natural question to ask is, “how certain am I that the true value of B1 is close to my estimate?” The answer, fortunately, is that this all comes for free with statistical estimation. While the details of how we determine the error bounds of the parameter estimates can get pretty technical, the results are fairly straightforward.


A confidence interval can be constructed around B1 using these errors. For example, I could say “95% of the time, if I were to collect a same sized sample from the population of drivers implied by the model, my estimate for B1 would be between 23.2 and 31.2” – a pretty powerful tool for assessing how sure I am about my estimates. Using this approach, we could fit this model to the database and use these estimates to predict the risk of a new customer. This prediction gives me both an estimate I can use to quote and a confidence interval to estimate my uncertainty—all without having to have seen lots of identical customers before.


Linear regressions are perhaps the most simple kinds of statistical models that can be used for statistical inference. In Part 2, I will walk through some models for different kinds of data—what if my response variable is binary? Or more generally discrete? What about time-stamped data? I'll use the R environment (http://www.r-project.org/).

Scalable Descriptive Statistics Calculations for Streaming Data

posted Dec 12, 2011, 11:59 AM by Josh Lospinoso   [ updated Apr 14, 2012, 5:50 AM ]

In situations where we have some streaming data coming in at reasonably high rates, we may want to present descriptive statistics to the user in near-real time. For example, we might want to give the user a standard deviation or moving average. It can be very wasteful, however, to calculate e.g. the mean and standard deviation of the whole history of observations every time a new piece of data comes in. In this article, I'll show you how you can decompose the math behind these descriptive statistics so that you can incrementally update the answer. The result is a much more efficient way of calculating descriptive statistics on streaming data.

In this article, I will build up a StreamingDescriptives class:

public class StreamingDescriptives
    {
        public Double Count { get; private set; }

        public StreamingDescriptives()
        {
            Count = 0;
        }

        public void Ingest(Double value)
        {
            // TODO: Logic on how to incrementally update values here.
            Count++;
        }

        public void Ingest(IEnumerable<Double> values)
        {
            foreach (Double value in values)
            {
                Ingest(values);
            }
        }
    }

When we have new data, one of the Ingest methods will be called, and the statistical ingredients contained in the object will be updated.

Means (Averages)

The formula for the mean m of some vector of numbers x is

Note that we only need two ingredients: the number of elements in x, and the sum of x. What if we store these ingredients within the StreamingDescriptives object, update the values incrementally on ingest, then produce the Mean based on the formula? The updated class looks like this:

public class StreamingDescriptives
    {
        public Double Mean
        {
            get
            {
                return SumX / Count;
            }
        }

        public Double Count { get; private set; }

        Double SumX = 0;

        public StreamingDescriptives()
        {
            Count = 0;
        }

        public void Ingest(Double value)
        {
            Count++;
            SumX += value;
        }

        // ...
    }

This way, instead of having to perform the sum over X every time we add some elements to it, we can simply keep track of the sum and the count in real time!

 

Max, Min, and Range

We can likewise keep track of the maximum, minimum, and range. Since the range is simply Max – Min, we really only have to worry about these two values. Consider the updated class: 

public class StreamingDescriptives
    {
        public Double Mean
        {
            get
            {
                return SumX / Count;
            }
        }

        public Double Range
        {
            get
            {
                return Max - Min;
            }
        }

        public Double Max { get; private set; }
        public Double Min { get; private set; }
        public Double Count { get; private set; }

        Double SumX = 0;

        public StreamingDescriptives()
        {
            Max = Double.MinValue;
            Min = Double.MaxValue;
            Count = 0;
        }

        public void Ingest(Double value)
        {
            Count++;
            SumX += value;
            
            if (value > Max)
            {
                Max = value;
            }

            if ( value < Min)
            {
                Min = value;
            }
        }

        //...
    }

Quite simply, we just keep track of the Max and Min, and whenever a new value is added, we update the values through simple comparison. Since the Range is a function of these two values, we just have a getter defined as Max-Min.

 

Standard deviation

 

The standard deviation is a measure of dispersion in your data. The greater the standard deviation, the greater the dispersion. In fact, you can interpret the standard deviation as the average squared distance of your data from the mean (average) of the data. The formula for the standard deviation s is as follows:

Note that this is the standard deviation for the population. If you have a sample, you should divide by N-1 instead. Since I expect that you will be applying this in situations where there is a large amount of data, the results will be negligibly different either way.

 

It might seem at first that it is hopeless to try incrementally updating the standard deviation. Yes, it’s a sum, but the mean m is changing with every observation—so we would have to go back through all the x values to get the right number, right?

 

Math to the rescue—remember the FOIL method from 7th grade? We can expand the quadratic form under the sum in the following way:

Now that we are rid of the quadratic outside of the parenthesis, we can expand the sum:

Notice that

 

so

Still with us? What this means is that the standard deviation can be decomposed into the average of x squared minus the square of the average of x! We have made it so that now we can incrementally build up the standard deviation point-by-point in the following updated class:

public class StreamingDescriptives
    {
        public Double Mean
        {
            get
            {
                return SumX / Count;
            }
        }

        public Double StdDev
        {
            get 
            {
                return SumXSq / Count - Math.Pow(Mean, 2);
            }
        }

        public Double Range
        {
            get
            {
                return Max - Min;
            }
        }

        public Double Max { get; private set; }
        public Double Min { get; private set; }
        public Double Count { get; private set; }

        Double SumX = 0;
        Double SumXSq = 0;

        public StreamingDescriptives()
        {
            Max = Double.MinValue;
            Min = Double.MaxValue;
            Count = 0;
        }

        public void Ingest(Double value)
        {
            Count++;
            SumX += value;
            SumXSq += value * value;
            
            if (value > Max)
            {
                Max = value;
            }

            if ( value < Min)
            {
                Min = value;
            }
        }

        public void Ingest(IEnumerable<Double> values)
        {
            foreach (Double value in values)
            {
                Ingest(values);
            }
        }
    }

Testing

We can make sure that the StreamingDescriptives class works properly by first creating an extension method for the standard deviation:

public static class StatisticalExtensionMethods
    {
        public static Double StandardDeviation(this ICollection<Double> x)
        {
            Double mean = x.Average();
            return x.Select(val => Math.Pow(val - mean, 2)).Sum() / x.Count();
        }
    }

And then running the following main method:

public static void Main(string[] args)
        {
            // Initialize a random number generator
            Random randomNumberGenerator = new Random("Red Owl Consulting".GetHashCode());

            // Initialize the StreamingDescriptives object
            StreamingDescriptives desc = new StreamingDescriptives();

            // Make a collection to store the values for non-streaming calculations
            List<Double> x = new List<Double>();

            for (Double i = 0; i < 10; i++)
            {
                // Draw a random number
                Double z = randomNumberGenerator.NextDouble();

                // Ingest it into the StreamingDescriptives Object
                desc.Ingest(z);

                // Add it to the collection for non-streaming calculations
                x.Add(z);

                // Compare the streaming and non-sreaming calculations for correctness:
                Console.WriteLine("-- {0} --", i);
                Console.WriteLine("Mean:  {0:0.000} {1:0.000}", x.Average(), desc.Mean);
                Console.WriteLine("StDev: {0:0.000} {1:0.000}", x.StandardDeviation(), desc.StdDev);
                Console.WriteLine("Max:   {0:0.000} {1:0.000}", x.Max(), desc.Max);
                Console.WriteLine("Min:   {0:0.000} {1:0.000}", x.Min(), desc.Min);
                Console.WriteLine("Range: {0:0.000} {1:0.000}", x.Max() - x.Min(), desc.Range);
                Console.WriteLine("---------");
            }
        }

Where we can confirm that the descriptive statistics are identical for both the run of the mill functions and the streaming data based functions.

 

Computational complexity

For fun, I set up an experiment where we calculate the mean, standard deviation, max, min, and range using the StreamingDescriptives class and some naïve extension methods:

public static void Main(string[] args)
        {
            List<String> results = new List<string>();
            for (int i=0; i < 45000; i+= 1000)
            {
                double[] vals = StreamingExperiment(i);
                string res = String.Format("{0},{1},{2}", i, vals[0], vals[1]);
                results.Add(res);
                Console.WriteLine(res);
            }
            System.IO.File.WriteAllLines(@"c:/Users/jlospinoso/Desktop/experiment.csv", results.ToArray());
            Console.ReadLine();
        }

        static double[] StreamingExperiment(int Iterations)
        {
            double[] results = { 0.0, 0.0 };
            // Perform an experiment by timing the ingest for a streaming processor versus
            // computing the mean and standard devation at each point in the event history:
            Random randomNumberGeneratorA = new Random("Red Owl Consulting".GetHashCode());
            DateTime startStreamer = DateTime.Now;
            StreamingDescriptives desc = new StreamingDescriptives();
            Double calculatedValue;
            for (Double i = 0; i < Iterations; i++)
            {
                Double z = randomNumberGeneratorA.NextDouble();
                desc.Ingest(z);
                calculatedValue = desc.Mean;
                calculatedValue = desc.StdDev;
                calculatedValue = desc.Max;
                calculatedValue = desc.Min;
                calculatedValue = desc.Range;
            }
            DateTime endStreamer = DateTime.Now;
            results[1] = (endStreamer - startStreamer).TotalMilliseconds;

            Random randomNumberGeneratorB = new Random("Red Owl Consulting".GetHashCode());
            DateTime startNonStreamer = DateTime.Now;
            IList<Double> list = new List<Double>();
            for (Double i = 0; i < Iterations; i++)
            {
                Double z = randomNumberGeneratorA.NextDouble();
                list.Add(z);
                calculatedValue = list.Average();
                calculatedValue = list.StandardDeviation();
                calculatedValue = list.Max();
                calculatedValue = list.Min();
                calculatedValue = list.Max() - list.Min();
            }
            DateTime endNonStreamer = DateTime.Now;
            results[1] = (endNonStreamer - startNonStreamer).TotalMilliseconds;
            return results;
        }

 The results are as follows, where values below StreamingDescriptives and NiaveExtensionMethods are the number of milliseconds elapsed.

Datapoints

StreamingDescriptives

NaiveExtensionMethods

1000

15.6

93.6002

2000

15.6

296.4005

3000

15.6

670.8012

4000

15.6

1170.0021

5000

15.6

1840.8032

6000

15.6

2636.4046

7000

15.6

3603.6064

...

...

...

40000

15.6

124086.18

41000

15.6

131078.18

42000

15.6

139510.24

43000

15.6

149617.47

44000

15.6

152873.32

45000

15.6

158879.27

The overhead associated with re-running all of the Average() and StandardDeviation() calls are woefully inefficient. To approach 1 second of running time (on my laptop), we would need to run 3.8 million datapoints through the StreamingDescriptives object! Compare this to the capacity of the NaiveExtensions given a second (only 4000 datapoints) and you’ll get some idea of the efficiency gains here.

Moving Averages/Deviations/Etc.

It is possible to include a sliding window concept into the StreamingDescriptives class. By adding a Queue into the mix, we can toss values from the beginning of the view, and do the exact opposite operations (adding to SumX, etc.) that we did when we ingested it. The following class shows the basic idea:

public class StreamingDescriptivesWindow
    {
        public Double Mean
        {
            get
            {
                return SumX / Count;
            }
        }

        public Double StdDev
        {
            get
            {
                return SumXSq / Count - Math.Pow(Mean, 2);
            }
        }

        Queue<Double> Window = new Queue<Double>();
        public Double Count { get; private set; }
        public int WindowSize { get; private set; }
        Double SumX = 0;
        Double SumXSq = 0;

        public StreamingDescriptivesWindow(int WindowSize = 0)
        {
            Count = 0;
            this.WindowSize = WindowSize;
        }

        public void Ingest(Double value)
        {
            Count++;
            SumX += value;
            SumXSq += value * value;

            if ((WindowSize > 0) & (Window.Count > WindowSize))
            {
                // Decrement the statistics associated with the dequeued value:
                Double exitValue = Window.Dequeue();
                SumX -= exitValue;
                SumXSq -= exitValue * exitValue;
                Count--;
            }
        }

        public void Ingest(IEnumerable<Double> values)
        {
            foreach (Double value in values)
            {
                Ingest(values);
            }
        }
    }

As an exercise, try adding Max/Min/Range into the windows class. You could simply use one of the extension methods on the Queue Window, or you might come up with something more elaborate!

 

 

 

Predicting Human Movement with Microsoft Kinect

posted Dec 12, 2011, 11:47 AM by Josh Lospinoso   [ updated Apr 14, 2012, 5:56 AM ]

The Microsoft Kinect SDK provides some really interesting opportunities to program rich, interactive applications. Berico Technologies recently hosted a great "Hack Day" where we tested out some concepts on a few recently acquired Kinects. In this article, I'll present an extension I cobbled together for the "Skeleton Viewer" sample solution--given Kinect's surprisingly good "joint tracking" abilities, I thought--what if we could predict the movement of someone's body parts based on accumulated knowledge about their trajectories?

 The lines flowing out of each joint correspond to a prediction: the JointPrediction project implements a streamlined exponentially weighted moving average on average velocities to predict an X Y Z vector corresponding to where we think each joint is headed in the next few seconds. I'll go through all of the math and the implementation in this post.

You can download the source at Codeplex.

Special thanks to John Ruiz and Robert Levy for their help!

Exponentially Weighted Moving Averages (EWMAs)

EWMAs are essentially schemes that allow us to keep a moving average of some streaming data while giving us fairly fine grained control over how volatile the moving average. Through the use of a decay constant that represents the half-life of the impact of a particular observation, we can make adjustments to just how much weight old observations will have on the current value of the moving average.

With different weights on the EWMA, you can see that the moving average can be very "stable" or it can hug the actual points very closely. The idea here is to adopt the same formula to joint movement in three dimensions (X,Y,Z) -- plus time -- and display a moving average of the joint as a predictor for the future placement of a joint in space. We will do this by modeling the EWMA of velocity.

Modeling observation and velocity

We start with two simple classes. The first tracks an observation of a joint at a discrete moment in time:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace RedOwlConsulting.JointPrediction
{
    public class Observation
    {
        /// <summary>
        /// Horizontal component
        /// </summary>
        public double X { get; set; }
        /// <summary>
        /// Vertical component.
        /// </summary>
        public double Y { get; set; }
        /// <summary>
        /// Depth component.
        /// </summary>
        public double Z { get; set; }
        /// <summary>
        /// Time of the observation
        /// </summary>
        public DateTime DateTime { get; set; }

        /// <summary>
        /// Time elapsing since some past observation, in seconds
        /// </summary>
        /// <param name="pastObservation">A observation occurring in the past. Future observations will return negative numbers.</param>
        /// <returns>Elapsed time</returns>
        public double TimeElapsed(Observation pastObservation)
        {
            return (DateTime - pastObservation.DateTime).TotalSeconds;
        }

        /// <summary>
        /// Computes the approximate/average velocity between the past observation and the current observation.
        /// </summary>
        /// <param name="pastObservation"></param>
        /// <returns></returns>
        public Velocity ApproximateVelocity(Observation pastObservation)
        {
            double timeElapsed = TimeElapsed(pastObservation);
            return new Velocity { 
                X = (X - pastObservation.X) / timeElapsed, 
                Y = (Y - pastObservation.Y) / timeElapsed,
                Z = (Z - pastObservation.Z) / timeElapsed,
                DateTime = this.DateTime };
        }

        /// <summary>
        /// Used for debugging.
        /// </summary>
        /// <returns>A string printing the object</returns>
        public override string ToString()
        {
            return String.Format("{0:0.000} {1:0.000} {2:0.000} {3}", X, Y, Z, DateTime);
        }
    }
}

As you will note, the ApproximateVelocity function calculates average velocity in the usual way. Here's what the velocity class looks like:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace RedOwlConsulting.JointPrediction
{
    /// <summary>
    /// This class is responsible for storing rates of movement on X Y Z coordinates, as well as the date time of calculation.
    /// </summary>
    public class Velocity
    {
        /// <summary>
        /// Horizontal component.
        /// </summary>
        public double X { get; set; }
        /// <summary>
        /// Vertical component.
        /// </summary>
        public double Y { get; set; }
        /// <summary>
        /// Depth component
        /// </summary>
        public double Z { get; set; }

        /// <summary>
        /// Timestamp of the calculated velocity
        /// </summary>
        public DateTime DateTime { get; set; }

        /// <summary>
        /// Used for debugging
        /// </summary>
        /// <returns>String of X Y Z datetime</returns>
        public override string ToString()
        {
            return String.Format("{0:0.000} {1:0.000} {2:0.000} {3}", X, Y, Z, DateTime);
        }
    }
}

Predicting Future Points with Average Velocity

Now that we have two classes to deal with modeling points in space-time and their velocity, we can move on to calculating EWMAs on-line. We do this with the following class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace RedOwlConsulting.JointPrediction
{
    /// <summary>
    /// A class which is responsible for predicting future movement based on past movement using an online algorithm
    /// for calculating exponentially weighted moving averages of velocity along three dimensions in space.
    /// </summary>
    public class JointPredictor
    {
        /// <summary>
        /// Stores the last observation to be ingested by the online algorithm
        /// </summary>
        Observation lastObservation;
        /// <summary>
        /// Stores the current value of the exponentially weighted moving average velocity
        /// </summary>
        public Velocity EwmaVelocity { get; private set; }
        /// <summary>
        /// Stores the decay constant to be used by the algorithm. Larger (+) values mean that past observations have much
        /// less sway over the current EwmaVelocity value. Values close to zero mean that past observations will have a
        /// lingering effect.
        /// </summary>
        public Double DecayConstant { get; set; }

        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="decay">Value of the decay constant</param>
        public JointPredictor(double decay)
        {
            DecayConstant = decay;
        }

        /// <summary>
        /// Update ingests a new observation an is responsible for calculating, online, the update to the EwmaVelocity
        /// based on the new observation.
        /// </summary>
        /// <param name="newObservation">The new observation.</param>
        public void Update(Observation newObservation)
        {
            if (lastObservation == null)
            {
                lastObservation = newObservation;
                EwmaVelocity = new Velocity { X = 0, Y = 0, Z = 0, DateTime = lastObservation.DateTime };
            }
            else
            {
                Velocity currentVelocity = newObservation.ApproximateVelocity(lastObservation);
                double timeElapsed = newObservation.TimeElapsed(lastObservation);
                double decay = Math.Exp(-1 * DecayConstant * timeElapsed);
                EwmaVelocity.X = (1 - decay) * EwmaVelocity.X + decay * currentVelocity.X;
                EwmaVelocity.Y = (1 - decay) * EwmaVelocity.Y + decay * currentVelocity.Y;
                EwmaVelocity.Z = (1 - decay) * EwmaVelocity.Z + decay * currentVelocity.Z;
                EwmaVelocity.DateTime = currentVelocity.DateTime;
                lastObservation = newObservation;
            }
        }

    }
}

The Update function is responsible for calculating the EWMA pair-wise on the X, Y, and Z coordinates. You'll notice the familiar EWMA formula in the body!

Wiring JointPrediction into the Skeleton Viewer

The Skeleton Viewer sample solution ships with the Kinect SDK. Since it provides a great platform to build on for our purposes, we will simply hack the EwmaVelocity as a vector onto the "Skeleton Screen" whenever a joint's position is updated. In the MainWindow.xaml.cs, add the following dictionary:

Dictionary<JointID, JointPredictor> jointPredictor = new Dictionary<JointID, JointPredictor>();

Note that we are tracking the joints for only a single skeleton (strange things WILL happen if you try this out with another person in Kinect's vision!). An interesting addition to the project would be to add an index to this dictionary for another skeleton.

Within the function SkeletonFrameReady. You'll see essentially two loops. The first is over the SkeletonData, which corresponds to an object for an identified skeleton/person concept (e.g. if there are two people in the view of Kinect, there will be two SkeletonData objects available). For each SkeletonData, there is a collection Joints, which is also looped over. Within this loop, we add some logic to update our JointPredictors. Once the JointPredictors are updated, we then draw a line:

void nui_SkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
        {
            foreach (SkeletonData data in skeletonFrame.Skeletons)
            {
                if (SkeletonTrackingState.Tracked == data.TrackingState)
                {

                    // Draw bones ...
                    // Draw joints
                    foreach (Joint joint in data.Joints)
                    {
                        // ...

                        // If the jointPredictor entry is not yet initialized, do it
                        if (!jointPredictor.Keys.Contains(joint.ID))
                        {
                            jointPredictor.Add(joint.ID, new JointPredictor(20F));
                        }

                        // Ingest the new observation using the current position of the joint
                        Observation obs = new Observation
                        {
                            X = data.Joints[joint.ID].Position.X,
                            Y = data.Joints[joint.ID].Position.Y,
                            Z = data.Joints[joint.ID].Position.Z,
                            DateTime = DateTime.Now
                        };
                        jointPredictor[joint.ID].Update(obs);

                        // Create a line for the predicted direction of motion
                        Line direction = new Line();
                        Point displayPosition = getDisplayPosition(data.Joints[joint.ID]);

                        direction.Stroke = jointColors[joint.ID];
                        direction.StrokeThickness = 10;
                        
                        // These points correspond to the X Y planar projection of the velocity.
                        direction.X1 = displayPosition.X;
                        direction.Y1 = displayPosition.Y;
                        direction.X2 = direction.X1 + jointPredictor[joint.ID].EwmaVelocity.X * 25;
                        direction.Y2 = direction.Y1 - jointPredictor[joint.ID].EwmaVelocity.Y * 25;

                        // Add the line to the GUI
                        skeleton.Children.Add(direction);
                    }
                }
                iSkeleton++;
            } // for each skeleton
        }

The result is what we get in the demo video above. There are lots of great extensions to this general idea to think about--different weighting schemes, fitting probability distributions onto the joints that can take into account uncertainty/variance and of dependencies between the joints. It would be neat to take this technique and attempt to find "tells" for penalty kicks--which side does a person choose to kick based on some trends for their movements? Can this help a goalie to pick the right side?

Thats it! Kinect SDK is a great way to program for fun. Check out Code4fun for some more cool ideas.

Lospi.net Utils Update

posted Oct 27, 2011, 6:16 AM by Josh Lospinoso

New version out with additional features:
  • Sealed and non-sealed two key dictionaries (now implementing an interface for easy dependency injection). Sealed dictionaries do not allow you to add any more keys on the fly, but read/write can be considerably faster.
  • Added TryGet accessors for fast, safe lookups in situations where you are not sure that the dictionary contains a key pairing.
  • Added RandomEnumerable, which gives a random permutation of a range of consecutive integers (i.e. the standard Enumerable.From, Enumerable.Range)
I'm planning to add some interesting "User Choice" functionality into the library this winter--I want to have a lightweight, fast library for determining the likelihood that users will choose from a set of options (clicking on different menu items, browsing different topics) and allow developers to programmatically use these predictions to e.g. change coloration and layout in the user interface. Check out the milestone on github to track its progress.

Lospi.net Utils is now also on continuous integration, thanks to Kyle Baley and the kind folks at CodeBetter! You can download the latest version directly from their TeamCity server.


Lospi.net Utils now available!

posted Sep 26, 2011, 9:38 AM by Josh Lospinoso

This .NET utilities library represents a collection of functionality requirements that crop up from time to time in numerical and statistical contexts.

Some examples of current functionality:
  • Deep copies of collections of objects, e.g. for task local storage
  • Two key dictionaries that don't require you to declare extraneous Tuple keys
  • Symmetric two key dictionaries that don't care about the ordering of the keys
  • Sorting, scaling, standardizing, normalizing, percentile-izing of collections of doubles
  • Standard deviations and variances for collections of doubles
  • Drawing random elements from a dictionary representing a probability distribution
The operations on collections are implemented as extension methods for a fluent API.

Check out https://github.com/JLospinoso/lut to pull down the source.

You can also download the .dll in Downloads.

1-8 of 8

Comments