In many situations where you want to analyze continuously valued data, it can be fruitful to do some simple manipulations on a list of doubles. In this post, I provide a few techniques for transforming such data using convenient extension methods. The manipulations presented here are various ways of reshaping your data: preserving only the ordering, giving the data a regular shape for scoring purposes, or shifting and scaling it onto a new interval.
Extension Methods

Extension methods are a little piece of syntactic sugar that lets us tack functionality onto a type without having to extend it through a subclass. So, for example, we could extend the Double type with a method that returns the value squared (as a double). The method is declared as a static method within a static class, and the first parameter is what gives it away as an extension method: the this keyword marks x as the instance the method is called on, so x can be used within the body of the function to yield some result. So, the following code:

using System;

public static class ExtensionMethods
{
    // "this" on the first parameter makes SquareMe callable as z.SquareMe().
    public static Double SquareMe(this Double x)
    {
        return x * x;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        Double z = 7;
        Console.WriteLine(z.SquareMe());
    }
}
will yield the answer 49, as we expect. For more information, see http://msdn.microsoft.com/en-us/library/bb383977.aspx.

Scaling

When we scale a vector of numbers, we take the original list and transform it into another list with a chosen minimum and maximum, while preserving the relative spacing between the numbers. For example, the list

List<Double> x = new List<Double> { 0.0, 2.0, 5.0, 10.0 };

could be scaled onto another interval, say 5 to 6. We would want the new list to (1) preserve the order and relative spacing of the numbers and (2) span the interval from 5 to 6. The following list uniquely achieves this goal:

{ 5.0, 5.2, 5.5, 6.0 }

The following extension method can be used:

/// <summary>
/// Scales a list onto a new interval
/// </summary>
/// <param name="list">Values to scale</param>
/// <param name="lower">Lower limit of new interval</param>
/// <param name="upper">Upper limit of new interval</param>
/// <returns>A dictionary mapping each original value to its scaled value</returns>
public static IDictionary<Double, Double> Scale(this IList<Double> list, double lower = 0, double upper = 1)
{
    Double max = list.Max();
    Double min = list.Min();
    // Distinct() avoids the ArgumentException that ToDictionary throws on repeated keys.
    // If every value is identical, max - min is zero and the results are NaN.
    return list.Distinct().ToDictionary(x => x, x => (upper - lower) * ((x - min) / (max - min)) + lower);
}

Scaling is useful if you want to present a user with a score on a scale from, say, 0 to 100 given some arbitrary list of data. The transformation is linear, so given the original minimum and maximum you could easily back out the original list (i.e. no information is lost). In these examples, I'll show some charts to illustrate what the data looks like after transformation. For this example, I will randomly generate 1000 points on the [0,10] interval and scale them onto the [1,7] interval using the function above. The black points represent the original data, and the red points represent the points after transformation:
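As an aside, the calculation inside Scale can be written as a formula. In the sketch below, a and b are shorthand introduced here for the lower and upper arguments, x_min and x_max are the minimum and maximum of the original list, and the second expression is the inverse that backs out the original value:

```latex
y_i = a + (b - a)\,\frac{x_i - x_{\min}}{x_{\max} - x_{\min}},
\qquad
x_i = x_{\min} + (x_{\max} - x_{\min})\,\frac{y_i - a}{b - a}
```

For the list { 0.0, 2.0, 5.0, 10.0 } scaled onto 5 to 6, this reduces to y = 5 + x / 10, which gives { 5.0, 5.2, 5.5, 6.0 }.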
Here we can see that the points have been "squished" onto the [1,7] interval, but they keep the same relative dispersion. Another way to see this relationship is to sort the data in ascending order:
We can visually confirm that the linear nature of the data (which comes from the shape of the uniform distribution) is preserved during the scaling transformation. In short, the scaling function is useful for situations where we want to keep the relative shape of the distribution, but we just need it on a different interval. As another example, we could generate noisy sinusoidal data, yielding the following plot:
where the "smushing" onto a different interval is nicely apparent.

Standardization

While scaling can help us to put data on an interval of interest, we may be interested in getting some indication of just how far some observations in our data are from the center of the distribution. Suppose we have data that contains some really extreme outliers (I will use Cauchy-distributed data, which has no finite mean or variance!). Here's what scaling onto [-10, 10] looks like for 500 points:
What happened? It turns out that the maximum and minimum are so large in magnitude that the center of the scaled numbers is somewhere around 7, not zero! Clearly, the maximum and minimum have a large effect on the outcome of scaling. To address this, we can scale by the mean and standard deviation in the following way:

/// <summary>
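/// Standardizes a list by subtracting its mean and dividing by its standard deviation to yield
/// a new list with mean zero and standard deviation one.
/// </summary>
/// <param name="list">Values to standardize</param>
/// <returns>A dictionary mapping each original value to its standardized value</returns>
public static IDictionary<Double, Double> Standardize(this IList<Double> list)
{
    Double avg = list.Sum() / list.Count;
    // Population variance: the sum of squared deviations is divided by N, not N - 1.
    Double variance = list.Select(x => Math.Pow(x - avg, 2)).Sum() / list.Count;
    Double sd = Math.Sqrt(variance);
    // Distinct() avoids the ArgumentException that ToDictionary throws on repeated keys.
    return list.Distinct().ToDictionary(x => x, x => (x - avg) / sd);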
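}

This formula is less sensitive to a few extreme values than min-max scaling, because it uses the mean and standard deviation rather than only the two most extreme points. It is still a linear transformation, so the relative spacing between the numbers is preserved. Here's what standardization looks like applied to the Cauchy data given above: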
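As an aside, the calculation inside Standardize can be written as follows. The symbols are shorthand introduced here: N is the number of points, mu is the mean, sigma is the population standard deviation (dividing by N, as the code does), and z_i is the standardized value:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2},
\qquad
z_i = \frac{x_i - \mu}{\sigma}
```

For the list { 0.0, 2.0, 5.0, 10.0 } from the Scaling section, the mean is 4.25 and the standard deviation is about 3.77, so the value 10.0 standardizes to roughly 1.53.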
Now, we have this nice centering around the mean, and the outliers have much less sway over the results. Additionally, the new values have an appealing interpretation of "the original point is X standard deviations away from the mean of the distribution," which has a rich meaning in many statistical settings.

Percentile

If the shapes of distributions are really funny, or if we are only concerned with ordering, it can make sense to calculate percentiles. This transformation is interpreted as "What fraction of the other points in the list are strictly less than this point?" With the optional midpoint argument set to true, half a point is added to the count and the denominator becomes the full list size, which keeps every result strictly between 0 and 1. An extension method implementing this calculation is the following:

/// <summary>
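/// Calculates the percentile for each element
/// </summary>
/// <param name="list">Values to rank</param>
/// <param name="midpoint">If true, uses (count + 0.5) / N instead of count / (N - 1)</param>
/// <returns>A dictionary mapping each original value to its percentile on [0,1]</returns>
public static IDictionary<Double, Double> Percentile(this IList<Double> list, bool midpoint = false)
{
    double denominator;
    double numeratorAdjustment;
    if (midpoint)
    {
        denominator = list.Count;
        numeratorAdjustment = 0.5;
    }
    else
    {
        denominator = list.Count - 1;
        numeratorAdjustment = 0;
    }
    // For each value x, count the elements strictly less than x.
    // Distinct() avoids the ArgumentException that ToDictionary throws on repeated keys.
    return list.Distinct().ToDictionary(x => x, x => (list.Count(y => x > y) + numeratorAdjustment) / denominator);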
}

When we apply this to a so-called "mixing distribution" (more commonly called a mixture distribution), where we essentially stick together random numbers from two different distributions, we get the following pattern:
It doesn't seem all that obvious why we might want to apply such a transformation to this data, but when we compare the sorted distribution, the elegance of the percentile transformation becomes apparent:
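As an aside, the two variants computed by Percentile can be written as follows. The notation is shorthand introduced here: N is the number of points and #{...} counts the elements of the list that are strictly less than x_i:

```latex
p_i = \frac{\#\{\, j : x_j < x_i \,\}}{N - 1}
\quad \text{(midpoint = false)},
\qquad
p_i = \frac{\#\{\, j : x_j < x_i \,\} + 0.5}{N}
\quad \text{(midpoint = true)}
```

For the list { 0.0, 2.0, 5.0, 10.0 }, the first form gives 0, 1/3, 2/3, and 1, while the midpoint form gives 0.125, 0.375, 0.625, and 0.875.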
The great feature of calculating percentiles is that (1) we have a wonderfully simple interpretation of just what the value means, (2) we know that the numbers will lie on the [0,1] interval, and (3) the transformation is extremely well behaved for virtually any kind of distribution.

Normalization

Finally, in some situations, it makes sense to normalize: literally take our data and make it look like it came from a normal distribution. This can be done through a neat little mathematical trick. First, reference the Math.NET Numerics library, available here: http://mathnetnumerics.codeplex.com/, and add using MathNet.Numerics; to your file. This gives you access to a number of useful mathematical functions. To normalize our data, we first construct percentiles (for which we have already created an extension method above). Next, we put each percentile into the inverse normal cumulative distribution function (CDF). For the mathematically inclined, we are simply using the inverse CDF as a map from [0,1] onto the support of the distribution (in this case, normal). The following function gives us the transformation:

/// <summary>
/// Inverse normal cumulative distribution function (standard normal by default)
/// </summary>
/// <param name="p">percentile</param>
/// <param name="mean">mean of target distribution</param>
/// <param name="stdev">standard deviation of target distribution</param>
/// <returns>The value whose normal CDF equals p</returns>
private static Double InverseNormalCDF(Double p, Double mean = 0, Double stdev = 1)
{
    return stdev * Fn.ErfInverse(2 * p - 1) * Math.Sqrt(2) + mean;
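}

The formula follows from the standard normal CDF, which equals (1 + erf(z / sqrt(2))) / 2; solving for z at a given p gives z = sqrt(2) * erfinv(2p - 1), which is then multiplied by stdev and shifted by mean. Note that we could, for example, simulate random normal numbers through this little trick:

Random rng = new Random();
InverseNormalCDF(rng.NextDouble());

The extension method is then simply

/// <summary>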
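/// "Curves" the list by mapping it onto a normal distribution (standard normal by default)
/// </summary>
/// <param name="list">Values to normalize</param>
/// <param name="mean">mean of target distribution</param>
/// <param name="stdev">standard deviation of target distribution</param>
/// <returns>A dictionary mapping each original value to its normalized value</returns>
public static IDictionary<Double, Double> Normalize(this IList<Double> list, double mean = 0, double stdev = 1)
{
    // Midpoint percentiles lie strictly between 0 and 1, so the inverse CDF stays finite.
    return list.Percentile(true).ToDictionary(x => x.Key, x => InverseNormalCDF(x.Value, mean, stdev));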
}

So our mixing distribution data can be transformed:
and corresponding kernel densities look like the following:
where we see that we have achieved a nice shape from a strange-looking distribution of numbers. In fact, this is what most teachers mean when they say "curving" test scores, although many do a simple scaling in practice.

Putting it all together

With a few simple methods to sort the output and print it to the console:

/// <summary>
/// Prints a dictionary of double, double to Console.
/// </summary>
/// <param name="dictionary">Dictionary to print</param>
/// <param name="prepend">Text written before the entries</param>
/// <param name="append">Text written after the entries</param>
/// <param name="sort">If true, prints entries in ascending key order</param>
public static void ToConsole(this IDictionary<Double, Double> dictionary, String prepend = "", String append = "", Boolean sort = false)
{
    IList<Double> keys;
    if (sort)
    {
        keys = dictionary.SortOnKeys().Keys.ToList();
    }
    else
    {
        keys = dictionary.Keys.ToList();
    }
    Console.Write(prepend);
    foreach (Double i in keys)
    {
        Console.WriteLine(String.Format("{0:0.0000} {1:0.0000}", i, dictionary[i]));
    }
    Console.Write(append);
}

/// <summary>
/// Sorts a dictionary on its double key values
/// </summary>
/// <typeparam name="T">Value type</typeparam>
/// <param name="dictionary">Dictionary to sort</param>
/// <returns>A SortedDictionary</returns>
public static SortedDictionary<Double, T> SortOnKeys<T>(this IDictionary<Double, T> dictionary)
{
    return new SortedDictionary<Double, T>(dictionary);
}

it is possible to put all of these functions together into a little demo:

public class Program
{
    public static void Main(string[] args)
    {
        // String.GetHashCode is not guaranteed to be stable across runtimes,
        // so this seed (and the output) may differ between environments.
        Random randomNumberGenerator = new Random("Red Owl Consulting".GetHashCode());
        List<Double> x = new List<Double>();
        for (int i = 0; i < 10; i++)
        {
            x.Add(randomNumberGenerator.NextDouble());
        }
        x.Percentile(midpoint: false).ToConsole(prepend: "Percentile(midpoint: false):\n", append: "-----\n", sort: true);
        x.Percentile(midpoint: true).ToConsole(prepend: "Percentile(midpoint: true):\n", append: "-----\n", sort: true);
        x.Scale(lower: 0, upper: 1).ToConsole(prepend: "Scale(lower: 0, upper: 1):\n", append: "-----\n", sort: true);
        x.Scale(lower: 10, upper: 100).ToConsole(prepend: "Scale(lower: 10, upper: 100):\n", append: "-----\n", sort: true);
        x.Normalize(mean: 0, stdev: 1).ToConsole(prepend: "Normalize(mean: 0, stdev: 1):\n", append: "-----\n", sort: true);
        x.Normalize(mean: Math.PI, stdev: Math.E).ToConsole(prepend: "Normalize(mean: Math.PI, stdev: Math.E):\n", append: "-----\n", sort: true);
        x.Standardize().ToConsole(prepend: "Standardize():\n", append: "-----\n", sort: true);
        Console.ReadLine();
    }
}

I hope that this demo helped you to understand some of the interesting transformations that you can perform on your data. There are undoubtedly other ways that you might think of transforming data based on your particular situation, but this should provide an intuitive feel for what's possible.
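One possible variation on the methods above, offered only as a sketch: because each method returns a dictionary keyed by the original value, repeated values cannot keep separate entries and the input order is not retained. The names ListTransforms, ScaleToList, and Unscale below are hypothetical and do not appear in the code above. ScaleToList applies the same formula as Scale but returns the results as a list in input order, and Unscale inverts the formula for one value, assuming you kept the original minimum and maximum.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers for illustration; not part of the original extension class.
public static class ListTransforms
{
    // Same formula as Scale, but results come back in input order and duplicates are kept.
    public static List<double> ScaleToList(this IList<double> list, double lower = 0, double upper = 1)
    {
        double min = list.Min();
        double max = list.Max();
        return list.Select(x => (upper - lower) * ((x - min) / (max - min)) + lower).ToList();
    }

    // Inverts the scaling for one value, given the target interval and the original min and max.
    public static double Unscale(double scaled, double lower, double upper, double originalMin, double originalMax)
    {
        return (scaled - lower) / (upper - lower) * (originalMax - originalMin) + originalMin;
    }
}
```

For example, calling ScaleToList(5, 6) on a list holding 0.0, 2.0, 2.0, and 10.0 keeps both copies of 2.0, and passing any scaled value to Unscale with lower 5, upper 6, originalMin 0, and originalMax 10 recovers the original value up to floating-point rounding.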









