1b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## Module statistics.py 2b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## 3b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## Copyright (c) 2013 Steven D'Aprano <steve+python@pearwood.info>. 4b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## 5b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## Licensed under the Apache License, Version 2.0 (the "License"); 6b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## you may not use this file except in compliance with the License. 7b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## You may obtain a copy of the License at 8b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## 9b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## http://www.apache.org/licenses/LICENSE-2.0 10b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## 11b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## Unless required by applicable law or agreed to in writing, software 12b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## distributed under the License is distributed on an "AS IS" BASIS, 13b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## See the License for the specific language governing permissions and 15b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes## limitations under the License. 16b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 17b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 18b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes""" 19b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesBasic statistics module. 20b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 21b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesThis module provides functions for calculating statistics of data, including 22b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesaverages, variance, and standard deviation. 23b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 24b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculating averages 25b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes-------------------- 26b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 27b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes================== ============================================= 28b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesFunction Description 29b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes================== ============================================= 30b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmean Arithmetic mean (average) of data. 31b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian Median (middle value) of data. 32b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian_low Low median of data. 33b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian_high High median of data. 34b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian_grouped Median, or 50th percentile, of grouped data. 35b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmode Mode (most common value) of data. 36b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes================== ============================================= 37b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 38b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the arithmetic mean ("the average") of data: 39b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 40b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> mean([-1.0, 2.5, 3.25, 5.75]) 41b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes2.625 42b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 43b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 44b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the standard median of discrete data: 45b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 46b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> median([2, 3, 4, 5]) 47b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes3.5 48b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 49b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 50b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the median, or 50th percentile, of data grouped into class intervals 51b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandescentred on the data values provided. E.g. if your data points are rounded to 52b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesthe nearest whole number: 53b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 54b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> median_grouped([2, 2, 3, 3, 3, 4]) #doctest: +ELLIPSIS 55b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes2.8333333333... 56b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 57b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesThis should be interpreted in this way: you have two data points in the class 58b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesinterval 1.5-2.5, three data points in the class interval 2.5-3.5, and one in 59b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesthe class interval 3.5-4.5. The median of these data points is 2.8333... 60b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 61b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 62b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculating variability or spread 63b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes--------------------------------- 64b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 65b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes================== ============================================= 66b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesFunction Description 67b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes================== ============================================= 68b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandespvariance Population variance of data. 69b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesvariance Sample variance of data. 70b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandespstdev Population standard deviation of data. 71b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesstdev Sample standard deviation of data. 72b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes================== ============================================= 73b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 74b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the standard deviation of sample data: 75b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 76b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> stdev([2.5, 3.25, 5.5, 11.25, 11.75]) #doctest: +ELLIPSIS 77b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes4.38961843444... 78b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 79b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesIf you have previously calculated the mean, you can pass it as the optional 80b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandessecond argument to the four "spread" functions to avoid recalculating it: 81b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 82b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> data = [1, 2, 2, 4, 4, 4, 5, 6] 83b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> mu = mean(data) 84b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> pvariance(data, mu) 85b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes2.5 86b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 87b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 88b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesExceptions 89b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes---------- 90b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 91b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesA single exception is defined: StatisticsError is a subclass of ValueError. 92b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 93b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes""" 94b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 95b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes__all__ = [ 'StatisticsError', 96b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 'pstdev', 'pvariance', 'stdev', 'variance', 97b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 'median', 'median_low', 'median_high', 'median_grouped', 98b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 'mean', 'mode', 99b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes ] 100b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 101b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 102b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesimport collections 103b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesimport math 104b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 105b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesfrom fractions import Fraction 106b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesfrom decimal import Decimal 107b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesfrom itertools import groupby 108b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 109b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 110b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 111b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Exceptions === 112b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 113b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesclass StatisticsError(ValueError): 114b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes pass 115b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 116b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 117b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Private utilities === 118b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 119b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _sum(data, start=0): 120b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """_sum(data [, start]) -> (type, sum, count) 121b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 122b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Return a high-precision sum of the given numeric data as a fraction, 123b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes together with the type to be converted to and the count of items. 124b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 125b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes If optional argument ``start`` is given, it is added to the total. 126b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes If ``data`` is empty, ``start`` (defaulting to 0) is returned. 127b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 128b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 129b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Examples 130b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes -------- 131b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 132b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75) 133b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes (<class 'float'>, Fraction(11, 1), 5) 134b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 135b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Some sources of round-off error will be avoided: 136b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 137b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> _sum([1e50, 1, -1e50] * 1000) # Built-in sum returns zero. 138b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes (<class 'float'>, Fraction(1000, 1), 3000) 139b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 140b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Fractions and Decimals are also supported: 141b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 142b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from fractions import Fraction as F 143b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)]) 144b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes (<class 'fractions.Fraction'>, Fraction(63, 20), 4) 145b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 146b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from decimal import Decimal as D 147b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")] 148b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> _sum(data) 149b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4) 150b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 151b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Mixed types are currently treated as an error, except that int is 152b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes allowed. 153b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 154b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes count = 0 155b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n, d = _exact_ratio(start) 156b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes partials = {d: n} 157b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes partials_get = partials.get 158b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes T = _coerce(int, type(start)) 159b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes for typ, values in groupby(data, type): 160b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes T = _coerce(T, typ) # or raise TypeError 161b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes for n,d in map(_exact_ratio, values): 162b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes count += 1 163b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes partials[d] = partials_get(d, 0) + n 164b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if None in partials: 165b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # The sum will be a NAN or INF. We can ignore all the finite 166b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # partials, and just look at this special one. 167b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes total = partials[None] 168b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes assert not _isfinite(total) 169b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes else: 170b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Sum all the partial sums using builtin sum. 171b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # FIXME is this faster if we sum them in order of the denominator? 172b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes total = sum(Fraction(n, d) for d, n in sorted(partials.items())) 173b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return (T, total, count) 174b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 175b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 176b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _isfinite(x): 177b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 178b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return x.is_finite() # Likely a Decimal. 179b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except AttributeError: 180b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return math.isfinite(x) # Coerces to float first. 181b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 182b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 183b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _coerce(T, S): 184b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Coerce types T and S to a common type, or raise TypeError. 185b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 186b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Coercion rules are currently an implementation detail. See the CoerceTest 187b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes test class in test_statistics for details. 188b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 189b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # See http://bugs.python.org/issue24068. 190b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes assert T is not bool, "initial type T is bool" 191b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # If the types are the same, no need to coerce anything. Put this 192b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # first, so that the usual case (no coercion needed) happens as soon 193b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # as possible. 194b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if T is S: return T 195b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Mixed int & other coerce to the other type. 196b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if S is int or S is bool: return T 197b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if T is int: return S 198b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # If one is a (strict) subclass of the other, coerce to the subclass. 199b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(S, T): return S 200b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(T, S): return T 201b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Ints coerce to the other type. 202b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(T, int): return S 203b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(S, int): return T 204b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Mixed fraction & float coerces to float (or float subclass). 205b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(T, Fraction) and issubclass(S, float): 206b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return S 207b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(T, float) and issubclass(S, Fraction): 208b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return T 209b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Any other combination is disallowed. 210b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes msg = "don't know how to coerce %s and %s" 211b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise TypeError(msg % (T.__name__, S.__name__)) 212b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 213b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 214b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _exact_ratio(x): 215b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return Real number x to exact (numerator, denominator) pair. 216b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 217b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> _exact_ratio(0.25) 218b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes (1, 4) 219b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 220b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes x is expected to be an int, Fraction, Decimal or float. 221b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 222b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 223b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Optimise the common case of floats. We expect that the most often 224b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # used numeric type will be builtin floats, so try to make this as 225b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # fast as possible. 226b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if type(x) is float: 227b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return x.as_integer_ratio() 228b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 229b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # x may be an int, Fraction, or Integral ABC. 230b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return (x.numerator, x.denominator) 231b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except AttributeError: 232b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 233b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # x may be a float subclass. 234b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return x.as_integer_ratio() 235b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except AttributeError: 236b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 237b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # x may be a Decimal. 238b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return _decimal_to_ratio(x) 239b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except AttributeError: 240b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Just give up? 241b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes pass 242b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except (OverflowError, ValueError): 243b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # float NAN or INF. 244b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes assert not math.isfinite(x) 245b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return (x, None) 246b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes msg = "can't convert type '{}' to numerator/denominator" 247b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise TypeError(msg.format(type(x).__name__)) 248b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 249b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 250b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# FIXME This is faster than Fraction.from_decimal, but still too slow. 251b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _decimal_to_ratio(d): 252b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Convert Decimal d to exact integer ratio (numerator, denominator). 253b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 254b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from decimal import Decimal 255b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> _decimal_to_ratio(Decimal("2.6")) 256b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes (26, 10) 257b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 258b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 259b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes sign, digits, exp = d.as_tuple() 260b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if exp in ('F', 'n', 'N'): # INF, NAN, sNAN 261b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes assert not d.is_finite() 262b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return (d, None) 263b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes num = 0 264b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes for digit in digits: 265b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes num = num*10 + digit 266b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if exp < 0: 267b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes den = 10**-exp 268b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes else: 269b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes num *= 10**exp 270b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes den = 1 271b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if sign: 272b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes num = -num 273b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return (num, den) 274b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 275b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 276b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _convert(value, T): 277b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Convert value to given numeric type T.""" 278b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if type(value) is T: 279b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # This covers the cases where T is Fraction, or where value is 280b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # a NAN or INF (Decimal or float). 281b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return value 282b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(T, int) and value.denominator != 1: 283b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes T = float 284b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 285b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # FIXME: what do we do if this overflows? 286b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return T(value) 287b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except TypeError: 288b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if issubclass(T, Decimal): 289b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return T(value.numerator)/T(value.denominator) 290b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes else: 291b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise 292b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 293b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 294b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _counts(data): 295b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Generate a table of sorted (value, frequency) pairs. 296b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes table = collections.Counter(iter(data)).most_common() 297b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if not table: 298b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return table 299b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Extract the values with the highest frequency. 300b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes maxfreq = table[0][1] 301b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes for i in range(1, len(table)): 302b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if table[i][1] != maxfreq: 303b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes table = table[:i] 304b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes break 305b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return table 306b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 307b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 308b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Measures of central tendency (averages) === 309b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 310b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef mean(data): 311b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the sample arithmetic mean of data. 312b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 313b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> mean([1, 2, 3, 4, 4]) 314b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 2.8 315b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 316b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from fractions import Fraction as F 317b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)]) 318b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Fraction(13, 21) 319b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 320b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from decimal import Decimal as D 321b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")]) 322b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Decimal('0.5625') 323b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 324b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes If ``data`` is empty, StatisticsError will be raised. 325b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 326b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if iter(data) is data: 327b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data = list(data) 328b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n = len(data) 329b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n < 1: 330b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError('mean requires at least one data point') 331b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes T, total, count = _sum(data) 332b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes assert count == n 333b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return _convert(total/n, T) 334b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 335b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 336b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# FIXME: investigate ways to calculate medians without sorting? Quickselect? 337b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median(data): 338b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the median (middle value) of numeric data. 339b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 340b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes When the number of data points is odd, return the middle data point. 341b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes When the number of data points is even, the median is interpolated by 342b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes taking the average of the two middle values: 343b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 344b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median([1, 3, 5]) 345b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3 346b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median([1, 3, 5, 7]) 347b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 4.0 348b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 349b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 350b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data = sorted(data) 351b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n = len(data) 352b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n == 0: 353b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError("no median for empty data") 354b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n%2 == 1: 355b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return data[n//2] 356b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes else: 357b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes i = n//2 358b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return (data[i - 1] + data[i])/2 359b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 360b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 361b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median_low(data): 362b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the low median of numeric data. 363b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 364b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes When the number of data points is odd, the middle value is returned. 365b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes When it is even, the smaller of the two middle values is returned. 366b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 367b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_low([1, 3, 5]) 368b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3 369b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_low([1, 3, 5, 7]) 370b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3 371b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 372b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 373b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data = sorted(data) 374b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n = len(data) 375b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n == 0: 376b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError("no median for empty data") 377b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n%2 == 1: 378b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return data[n//2] 379b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes else: 380b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return data[n//2 - 1] 381b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 382b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 383b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median_high(data): 384b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the high median of data. 385b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 386b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes When the number of data points is odd, the middle value is returned. 387b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes When it is even, the larger of the two middle values is returned. 388b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 389b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_high([1, 3, 5]) 390b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3 391b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_high([1, 3, 5, 7]) 392b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 5 393b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 394b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 395b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data = sorted(data) 396b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n = len(data) 397b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n == 0: 398b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError("no median for empty data") 399b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return data[n//2] 400b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 401b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 402b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median_grouped(data, interval=1): 403b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the 50th percentile (median) of grouped continuous data. 404b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 405b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]) 406b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3.7 407b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_grouped([52, 52, 53, 54]) 408b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 52.5 409b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 410b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes This calculates the median as the 50th percentile, and should be 411b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes used when your data is continuous and grouped. In the above example, 412b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes the values 1, 2, 3, etc. actually represent the midpoint of classes 413b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 0.5-1.5, 1.5-2.5, 2.5-3.5, etc. The middle value falls somewhere in 414b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes class 3.5-4.5, and interpolation is used to estimate it. 415b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 416b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Optional argument ``interval`` represents the class interval, and 417b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes defaults to 1. Changing the class interval naturally will change the 418b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes interpolated 50th percentile value: 419b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 420b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_grouped([1, 3, 3, 5, 7], interval=1) 421b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3.25 422b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> median_grouped([1, 3, 3, 5, 7], interval=2) 423b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3.5 424b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 425b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes This function does not check whether the data points are at least 426b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes ``interval`` apart. 427b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 428b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data = sorted(data) 429b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n = len(data) 430b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n == 0: 431b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError("no median for empty data") 432b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes elif n == 1: 433b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return data[0] 434b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Find the value at the midpoint. Remember this corresponds to the 435b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # centre of the class interval. 436b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes x = data[n//2] 437b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes for obj in (x, interval): 438b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if isinstance(obj, (str, bytes)): 439b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise TypeError('expected number but got %r' % obj) 440b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 441b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes L = x - interval/2 # The lower limit of the median interval. 442b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except TypeError: 443b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Mixed type. For now we just coerce to float. 444b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes L = float(x) - float(interval)/2 445b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes cf = data.index(x) # Number of values below the median interval. 446b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # FIXME The following line could be more efficient for big lists. 447b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes f = data.count(x) # Number of data points in the median interval. 448b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return L + interval*(n/2 - cf)/f 449b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 450b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 451b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef mode(data): 452b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the most common data point from discrete or nominal data. 453b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 454b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes ``mode`` assumes discrete data, and returns a single value. This is the 455b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes standard treatment of the mode as commonly taught in schools: 456b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 457b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> mode([1, 1, 2, 3, 3, 3, 3, 4]) 458b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 3 459b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 460b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes This also works with nominal (non-numeric) data: 461b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 462b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> mode(["red", "blue", "blue", "red", "green", "red", "red"]) 463b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 'red' 464b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 465b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes If there is not exactly one most common value, ``mode`` will raise 466b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes StatisticsError. 467b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 468b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # Generate a table of sorted (value, frequency) pairs. 469b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes table = _counts(data) 470b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if len(table) == 1: 471b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return table[0][0] 472b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes elif table: 473b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError( 474b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 'no unique mode; found %d equally common values' % len(table) 475b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes ) 476b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes else: 477b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError('no mode for empty data') 478b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 479b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 480b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Measures of spread === 481b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 482b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# See http://mathworld.wolfram.com/Variance.html 483b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# http://mathworld.wolfram.com/SampleVariance.html 484b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance 485b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# 486b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# Under no circumstances use the so-called "computational formula for 487b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# variance", as that is only suitable for hand calculations with a small 488b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# amount of low-precision data. It has terrible numeric properties. 489b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# 490b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# See a comparison of three computational methods here: 491b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/ 492b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 493b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _ss(data, c=None): 494b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return sum of square deviations of sequence data. 495b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 496b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes If ``c`` is None, the mean is calculated in one pass, and the deviations 497b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes from the mean are calculated in a second pass. Otherwise, deviations are 498b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes calculated from ``c`` as given. Use the second case with care, as it can 499b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes lead to garbage results. 500b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 501b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if c is None: 502b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes c = mean(data) 503b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes T, total, count = _sum((x-c)**2 for x in data) 504b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # The following sum should mathematically equal zero, but due to rounding 505b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes # error may not. 506b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes U, total2, count2 = _sum((x-c) for x in data) 507b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes assert T == U and count == count2 508b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes total -= total2**2/len(data) 509b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes assert not total < 0, 'negative sum of square deviations: %f' % total 510b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return (T, total) 511b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 512b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 513b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef variance(data, xbar=None): 514b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the sample variance of data. 515b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 516b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data should be an iterable of Real-valued numbers, with at least two 517b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes values. The optional argument xbar, if given, should be the mean of 518b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes the data. If it is missing or None, the mean is automatically calculated. 519b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 520b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Use this function when your data is a sample from a population. To 521b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes calculate the variance from the entire population, see ``pvariance``. 522b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 523b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Examples: 524b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 525b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5] 526b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> variance(data) 527b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 1.3720238095238095 528b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 529b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes If you have already calculated the mean of your data, you can pass it as 530b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes the optional second argument ``xbar`` to avoid recalculating it: 531b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 532b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> m = mean(data) 533b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> variance(data, m) 534b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 1.3720238095238095 535b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 536b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes This function does not check that ``xbar`` is actually the mean of 537b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes ``data``. Giving arbitrary values for ``xbar`` may lead to invalid or 538b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes impossible results. 539b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 540b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Decimals and Fractions are supported: 541b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 542b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from decimal import Decimal as D 543b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) 544b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Decimal('31.01875') 545b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 546b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from fractions import Fraction as F 547b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> variance([F(1, 6), F(1, 2), F(5, 3)]) 548b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Fraction(67, 108) 549b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 550b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 551b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if iter(data) is data: 552b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data = list(data) 553b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n = len(data) 554b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n < 2: 555b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError('variance requires at least two data points') 556b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes T, ss = _ss(data, xbar) 557b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return _convert(ss/(n-1), T) 558b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 559b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 560b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef pvariance(data, mu=None): 561b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the population variance of ``data``. 562b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 563b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data should be an iterable of Real-valued numbers, with at least one 564b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes value. The optional argument mu, if given, should be the mean of 565b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes the data. If it is missing or None, the mean is automatically calculated. 566b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 567b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Use this function to calculate the variance from the entire population. 568b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes To estimate the variance from a sample, the ``variance`` function is 569b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes usually a better choice. 570b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 571b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Examples: 572b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 573b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25] 574b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> pvariance(data) 575b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 1.25 576b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 577b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes If you have already calculated the mean of the data, you can pass it as 578b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes the optional second argument to avoid recalculating it: 579b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 580b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> mu = mean(data) 581b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> pvariance(data, mu) 582b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 1.25 583b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 584b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes This function does not check that ``mu`` is actually the mean of ``data``. 585b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Giving arbitrary values for ``mu`` may lead to invalid or impossible 586b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes results. 587b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 588b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Decimals and Fractions are supported: 589b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 590b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from decimal import Decimal as D 591b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) 592b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Decimal('24.815') 593b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 594b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> from fractions import Fraction as F 595b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> pvariance([F(1, 4), F(5, 4), F(1, 2)]) 596b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes Fraction(13, 72) 597b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 598b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 599b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if iter(data) is data: 600b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes data = list(data) 601b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes n = len(data) 602b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes if n < 1: 603b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes raise StatisticsError('pvariance requires at least one data point') 604b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes ss = _ss(data, mu) 605b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes T, ss = _ss(data, mu) 606b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return _convert(ss/n, T) 607b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 608b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 609b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef stdev(data, xbar=None): 610b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the square root of the sample variance. 611b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 612b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes See ``variance`` for arguments and other details. 613b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 614b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) 615b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 1.0810874155219827 616b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 617b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 618b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes var = variance(data, xbar) 619b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 620b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return var.sqrt() 621b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except AttributeError: 622b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return math.sqrt(var) 623b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 624b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 625b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef pstdev(data, mu=None): 626b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """Return the square root of the population variance. 627b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 628b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes See ``pvariance`` for arguments and other details. 629b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 630b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) 631b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 0.986893273527251 632b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes 633b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes """ 634b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes var = pvariance(data, mu) 635b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes try: 636b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return var.sqrt() 637b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes except AttributeError: 638b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes return math.sqrt(var) 639