For our descriptive statistics class, we need to compute the standard deviation of a data series. We tested a half dozen or so algorithms for speed and accuracy and we settled on the “two pass method” [1].
public double Mean(IEnumerable<double> data) {
double mean = 0;
int m = 0;
foreach (var d in data){
mean += (d - mean)/++m;
}
return mean;
}
public double StdDeviation(IEnumerable<double> data) {
double mean = data.Mean();
double std = 0;
int m = 0;
foreach (double d in data){
double tmp = d - mean;
std += (tmp*tmp);
m++;
}
return Math.Sqrt(std/(m - 1));
}
NIST provides a set of univariate test data sets of varying difficulty with exact values of their mean and standard deviation. Running our code against these data sets, we get a the following log relative errors (LRE) [2].
Data Set LRE
Lottery 15
Lew 15
Mavro 13.1
Michelso 13.8
NumAcc1 15
NumAcc2 14.2
NumAcc3 9.5
NumAcc4 8.3
Those values put us right up there with most statistical software [3]. Notice that for the high difficulty data set NumAcc4, we (and most other software) are only correct to 8 significant digits.
To improve the accuracy we can use the decimal type for internal calculations.
public double StdDeviation(IEnumerable<double> data) {
decimal mean = data.Mean();
decimal std = 0;
int m = 0;
foreach (decimal d in data){
decimal tmp = d - mean;
std += (tmp*tmp);
m++;
}
return Math.Sqrt((double)std/(m - 1));
}
All data sets now return the maximum LRE of 15 - great! There are two problems though. First, the decimal version is ten time slower than the double version. But that probably isn’t too big of an issue since it only takes 20ms to calculate the NumAcc4 standard deviation using decimals. Second, there is a greater chance of an overflow since the decimal type has a smaller range than a double type.
We’ve added the decimal version to the dnAnalytics’ DescriptiveStatistics class as an option.
[1] See Wikipedia for a discussion on why calculating standard deviation is difficult and the various algorithms used to compute it.
[2] You can interpret the LRE as the number of correct significant digits.
[3] For an accuracy comparison of statistical software see: Kellie B. Keeling, Robert J. Pavur, A comparative study of the reliability of nine statistical software packages, Computational Statistics & Data Analysis, Volume 51, Issue 8, 1 May 2007, Pages 3811-3831.
(http://www.sciencedirect.com/science/article/B6V8V-4JHMGWJ-1/2/77a29a95c2071997f13fcca7267711d1)