1b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  Module statistics.py
2b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##
3b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  Copyright (c) 2013 Steven D'Aprano <steve+python@pearwood.info>.
4b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##
5b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  Licensed under the Apache License, Version 2.0 (the "License");
6b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  you may not use this file except in compliance with the License.
7b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  You may obtain a copy of the License at
8b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##
9b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  http://www.apache.org/licenses/LICENSE-2.0
10b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##
11b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  Unless required by applicable law or agreed to in writing, software
12b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  distributed under the License is distributed on an "AS IS" BASIS,
13b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  See the License for the specific language governing permissions and
15b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes##  limitations under the License.
16b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
17b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
18b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes"""
19b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesBasic statistics module.
20b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
21b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesThis module provides functions for calculating statistics of data, including
22b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesaverages, variance, and standard deviation.
23b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
24b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculating averages
25b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes--------------------
26b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
27b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes==================  =============================================
28b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesFunction            Description
29b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes==================  =============================================
30b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmean                Arithmetic mean (average) of data.
31b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian              Median (middle value) of data.
32b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian_low          Low median of data.
33b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian_high         High median of data.
34b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmedian_grouped      Median, or 50th percentile, of grouped data.
35b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesmode                Mode (most common value) of data.
36b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes==================  =============================================
37b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
38b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the arithmetic mean ("the average") of data:
39b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
40b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> mean([-1.0, 2.5, 3.25, 5.75])
41b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes2.625
42b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
43b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
44b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the standard median of discrete data:
45b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
46b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> median([2, 3, 4, 5])
47b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes3.5
48b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
49b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
50b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the median, or 50th percentile, of data grouped into class intervals
51b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandescentred on the data values provided. E.g. if your data points are rounded to
52b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesthe nearest whole number:
53b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
54b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> median_grouped([2, 2, 3, 3, 3, 4])  #doctest: +ELLIPSIS
55b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes2.8333333333...
56b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
57b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesThis should be interpreted in this way: you have two data points in the class
58b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesinterval 1.5-2.5, three data points in the class interval 2.5-3.5, and one in
59b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesthe class interval 3.5-4.5. The median of these data points is 2.8333...
60b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
61b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
62b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculating variability or spread
63b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes---------------------------------
64b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
65b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes==================  =============================================
66b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesFunction            Description
67b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes==================  =============================================
68b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandespvariance           Population variance of data.
69b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesvariance            Sample variance of data.
70b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandespstdev              Population standard deviation of data.
71b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesstdev               Sample standard deviation of data.
72b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes==================  =============================================
73b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
74b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesCalculate the standard deviation of sample data:
75b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
76b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> stdev([2.5, 3.25, 5.5, 11.25, 11.75])  #doctest: +ELLIPSIS
77b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes4.38961843444...
78b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
79b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesIf you have previously calculated the mean, you can pass it as the optional
80b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandessecond argument to the four "spread" functions to avoid recalculating it:
81b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
82b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> data = [1, 2, 2, 4, 4, 4, 5, 6]
83b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> mu = mean(data)
84b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes>>> pvariance(data, mu)
85b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes2.5
86b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
87b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
88b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesExceptions
89b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes----------
90b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
91b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel FernandesA single exception is defined: StatisticsError is a subclass of ValueError.
92b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
93b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes"""
94b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
95b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes__all__ = [ 'StatisticsError',
96b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            'pstdev', 'pvariance', 'stdev', 'variance',
97b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            'median',  'median_low', 'median_high', 'median_grouped',
98b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            'mean', 'mode',
99b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes          ]
100b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
101b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
102b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesimport collections
103b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesimport math
104b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
105b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesfrom fractions import Fraction
106b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesfrom decimal import Decimal
107b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesfrom itertools import groupby
108b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
109b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
110b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
111b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Exceptions ===
112b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
113b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesclass StatisticsError(ValueError):
114b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    pass
115b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
116b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
117b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Private utilities ===
118b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
119b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _sum(data, start=0):
120b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """_sum(data [, start]) -> (type, sum, count)
121b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
122b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Return a high-precision sum of the given numeric data as a fraction,
123b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    together with the type to be converted to and the count of items.
124b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
125b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    If optional argument ``start`` is given, it is added to the total.
126b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    If ``data`` is empty, ``start`` (defaulting to 0) is returned.
127b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
128b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
129b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Examples
130b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    --------
131b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
132b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
133b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    (<class 'float'>, Fraction(11, 1), 5)
134b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
135b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Some sources of round-off error will be avoided:
136b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
137b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
138b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    (<class 'float'>, Fraction(1000, 1), 3000)
139b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
140b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Fractions and Decimals are also supported:
141b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
142b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from fractions import Fraction as F
143b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
144b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)
145b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
146b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from decimal import Decimal as D
147b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
148b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> _sum(data)
149b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)
150b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
151b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Mixed types are currently treated as an error, except that int is
152b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    allowed.
153b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
154b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    count = 0
155b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n, d = _exact_ratio(start)
156b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    partials = {d: n}
157b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    partials_get = partials.get
158b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    T = _coerce(int, type(start))
159b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    for typ, values in groupby(data, type):
160b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        T = _coerce(T, typ)  # or raise TypeError
161b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        for n,d in map(_exact_ratio, values):
162b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            count += 1
163b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            partials[d] = partials_get(d, 0) + n
164b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if None in partials:
165b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # The sum will be a NAN or INF. We can ignore all the finite
166b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # partials, and just look at this special one.
167b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        total = partials[None]
168b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        assert not _isfinite(total)
169b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    else:
170b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # Sum all the partial sums using builtin sum.
171b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # FIXME is this faster if we sum them in order of the denominator?
172b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
173b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return (T, total, count)
174b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
175b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
176b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _isfinite(x):
177b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    try:
178b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return x.is_finite()  # Likely a Decimal.
179b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    except AttributeError:
180b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return math.isfinite(x)  # Coerces to float first.
181b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
182b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
183b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _coerce(T, S):
184b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Coerce types T and S to a common type, or raise TypeError.
185b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
186b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Coercion rules are currently an implementation detail. See the CoerceTest
187b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    test class in test_statistics for details.
188b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
189b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # See http://bugs.python.org/issue24068.
190b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    assert T is not bool, "initial type T is bool"
191b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # If the types are the same, no need to coerce anything. Put this
192b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # first, so that the usual case (no coercion needed) happens as soon
193b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # as possible.
194b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if T is S:  return T
195b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Mixed int & other coerce to the other type.
196b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if S is int or S is bool:  return T
197b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if T is int:  return S
198b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # If one is a (strict) subclass of the other, coerce to the subclass.
199b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if issubclass(S, T):  return S
200b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if issubclass(T, S):  return T
201b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Ints coerce to the other type.
202b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if issubclass(T, int):  return S
203b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if issubclass(S, int):  return T
204b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Mixed fraction & float coerces to float (or float subclass).
205b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if issubclass(T, Fraction) and issubclass(S, float):
206b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return S
207b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if issubclass(T, float) and issubclass(S, Fraction):
208b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return T
209b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Any other combination is disallowed.
210b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    msg = "don't know how to coerce %s and %s"
211b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    raise TypeError(msg % (T.__name__, S.__name__))
212b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
213b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
214b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _exact_ratio(x):
215b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return Real number x to exact (numerator, denominator) pair.
216b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
217b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> _exact_ratio(0.25)
218b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    (1, 4)
219b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
220b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    x is expected to be an int, Fraction, Decimal or float.
221b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
222b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    try:
223b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # Optimise the common case of floats. We expect that the most often
224b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # used numeric type will be builtin floats, so try to make this as
225b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # fast as possible.
226b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        if type(x) is float:
227b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            return x.as_integer_ratio()
228b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        try:
229b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            # x may be an int, Fraction, or Integral ABC.
230b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            return (x.numerator, x.denominator)
231b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        except AttributeError:
232b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            try:
233b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                # x may be a float subclass.
234b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                return x.as_integer_ratio()
235b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            except AttributeError:
236b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                try:
237b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                    # x may be a Decimal.
238b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                    return _decimal_to_ratio(x)
239b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                except AttributeError:
240b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                    # Just give up?
241b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                    pass
242b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    except (OverflowError, ValueError):
243b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # float NAN or INF.
244b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        assert not math.isfinite(x)
245b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return (x, None)
246b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    msg = "can't convert type '{}' to numerator/denominator"
247b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    raise TypeError(msg.format(type(x).__name__))
248b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
249b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
250b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# FIXME This is faster than Fraction.from_decimal, but still too slow.
251b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _decimal_to_ratio(d):
252b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Convert Decimal d to exact integer ratio (numerator, denominator).
253b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
254b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from decimal import Decimal
255b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> _decimal_to_ratio(Decimal("2.6"))
256b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    (26, 10)
257b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
258b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
259b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    sign, digits, exp = d.as_tuple()
260b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if exp in ('F', 'n', 'N'):  # INF, NAN, sNAN
261b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        assert not d.is_finite()
262b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return (d, None)
263b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    num = 0
264b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    for digit in digits:
265b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        num = num*10 + digit
266b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if exp < 0:
267b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        den = 10**-exp
268b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    else:
269b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        num *= 10**exp
270b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        den = 1
271b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if sign:
272b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        num = -num
273b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return (num, den)
274b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
275b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
276b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _convert(value, T):
277b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Convert value to given numeric type T."""
278b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if type(value) is T:
279b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # This covers the cases where T is Fraction, or where value is
280b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # a NAN or INF (Decimal or float).
281b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return value
282b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if issubclass(T, int) and value.denominator != 1:
283b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        T = float
284b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    try:
285b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # FIXME: what do we do if this overflows?
286b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return T(value)
287b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    except TypeError:
288b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        if issubclass(T, Decimal):
289b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            return T(value.numerator)/T(value.denominator)
290b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        else:
291b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            raise
292b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
293b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
294b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _counts(data):
295b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Generate a table of sorted (value, frequency) pairs.
296b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    table = collections.Counter(iter(data)).most_common()
297b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if not table:
298b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return table
299b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Extract the values with the highest frequency.
300b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    maxfreq = table[0][1]
301b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    for i in range(1, len(table)):
302b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        if table[i][1] != maxfreq:
303b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            table = table[:i]
304b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            break
305b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return table
306b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
307b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
308b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Measures of central tendency (averages) ===
309b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
310b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef mean(data):
311b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the sample arithmetic mean of data.
312b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
313b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> mean([1, 2, 3, 4, 4])
314b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    2.8
315b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
316b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from fractions import Fraction as F
317b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
318b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Fraction(13, 21)
319b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
320b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from decimal import Decimal as D
321b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
322b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Decimal('0.5625')
323b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
324b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    If ``data`` is empty, StatisticsError will be raised.
325b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
326b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if iter(data) is data:
327b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        data = list(data)
328b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n = len(data)
329b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n < 1:
330b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError('mean requires at least one data point')
331b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    T, total, count = _sum(data)
332b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    assert count == n
333b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return _convert(total/n, T)
334b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
335b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
336b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# FIXME: investigate ways to calculate medians without sorting? Quickselect?
337b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median(data):
338b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the median (middle value) of numeric data.
339b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
340b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    When the number of data points is odd, return the middle data point.
341b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    When the number of data points is even, the median is interpolated by
342b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    taking the average of the two middle values:
343b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
344b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median([1, 3, 5])
345b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3
346b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median([1, 3, 5, 7])
347b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    4.0
348b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
349b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
350b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    data = sorted(data)
351b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n = len(data)
352b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n == 0:
353b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError("no median for empty data")
354b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n%2 == 1:
355b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return data[n//2]
356b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    else:
357b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        i = n//2
358b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return (data[i - 1] + data[i])/2
359b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
360b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
361b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median_low(data):
362b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the low median of numeric data.
363b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
364b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    When the number of data points is odd, the middle value is returned.
365b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    When it is even, the smaller of the two middle values is returned.
366b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
367b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_low([1, 3, 5])
368b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3
369b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_low([1, 3, 5, 7])
370b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3
371b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
372b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
373b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    data = sorted(data)
374b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n = len(data)
375b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n == 0:
376b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError("no median for empty data")
377b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n%2 == 1:
378b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return data[n//2]
379b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    else:
380b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return data[n//2 - 1]
381b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
382b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
383b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median_high(data):
384b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the high median of data.
385b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
386b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    When the number of data points is odd, the middle value is returned.
387b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    When it is even, the larger of the two middle values is returned.
388b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
389b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_high([1, 3, 5])
390b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3
391b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_high([1, 3, 5, 7])
392b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    5
393b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
394b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
395b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    data = sorted(data)
396b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n = len(data)
397b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n == 0:
398b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError("no median for empty data")
399b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return data[n//2]
400b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
401b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
402b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef median_grouped(data, interval=1):
403b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the 50th percentile (median) of grouped continuous data.
404b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
405b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
406b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3.7
407b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_grouped([52, 52, 53, 54])
408b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    52.5
409b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
410b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    This calculates the median as the 50th percentile, and should be
411b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    used when your data is continuous and grouped. In the above example,
412b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    the values 1, 2, 3, etc. actually represent the midpoint of classes
413b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    0.5-1.5, 1.5-2.5, 2.5-3.5, etc. The middle value falls somewhere in
414b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    class 3.5-4.5, and interpolation is used to estimate it.
415b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
416b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Optional argument ``interval`` represents the class interval, and
417b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    defaults to 1. Changing the class interval naturally will change the
418b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    interpolated 50th percentile value:
419b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
420b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_grouped([1, 3, 3, 5, 7], interval=1)
421b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3.25
422b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> median_grouped([1, 3, 3, 5, 7], interval=2)
423b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3.5
424b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
425b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    This function does not check whether the data points are at least
426b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    ``interval`` apart.
427b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
428b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    data = sorted(data)
429b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n = len(data)
430b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n == 0:
431b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError("no median for empty data")
432b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    elif n == 1:
433b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return data[0]
434b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Find the value at the midpoint. Remember this corresponds to the
435b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # centre of the class interval.
436b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    x = data[n//2]
437b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    for obj in (x, interval):
438b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        if isinstance(obj, (str, bytes)):
439b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes            raise TypeError('expected number but got %r' % obj)
440b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    try:
441b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        L = x - interval/2  # The lower limit of the median interval.
442b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    except TypeError:
443b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        # Mixed type. For now we just coerce to float.
444b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        L = float(x) - float(interval)/2
445b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    cf = data.index(x)  # Number of values below the median interval.
446b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # FIXME The following line could be more efficient for big lists.
447b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    f = data.count(x)  # Number of data points in the median interval.
448b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return L + interval*(n/2 - cf)/f
449b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
450b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
451b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef mode(data):
452b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the most common data point from discrete or nominal data.
453b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
454b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    ``mode`` assumes discrete data, and returns a single value. This is the
455b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    standard treatment of the mode as commonly taught in schools:
456b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
457b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
458b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    3
459b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
460b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    This also works with nominal (non-numeric) data:
461b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
462b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
463b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    'red'
464b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
465b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    If there is not exactly one most common value, ``mode`` will raise
466b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    StatisticsError.
467b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
468b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # Generate a table of sorted (value, frequency) pairs.
469b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    table = _counts(data)
470b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if len(table) == 1:
471b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return table[0][0]
472b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    elif table:
473b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError(
474b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                'no unique mode; found %d equally common values' % len(table)
475b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes                )
476b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    else:
477b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError('no mode for empty data')
478b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
479b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
480b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# === Measures of spread ===
481b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
482b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# See http://mathworld.wolfram.com/Variance.html
483b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes#     http://mathworld.wolfram.com/SampleVariance.html
484b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes#     http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
485b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes#
486b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# Under no circumstances use the so-called "computational formula for
487b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# variance", as that is only suitable for hand calculations with a small
488b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# amount of low-precision data. It has terrible numeric properties.
489b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes#
490b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# See a comparison of three computational methods here:
491b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes# http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
492b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
493b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef _ss(data, c=None):
494b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return sum of square deviations of sequence data.
495b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
496b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    If ``c`` is None, the mean is calculated in one pass, and the deviations
497b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    from the mean are calculated in a second pass. Otherwise, deviations are
498b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    calculated from ``c`` as given. Use the second case with care, as it can
499b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    lead to garbage results.
500b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
501b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if c is None:
502b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        c = mean(data)
503b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    T, total, count = _sum((x-c)**2 for x in data)
504b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # The following sum should mathematically equal zero, but due to rounding
505b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    # error may not.
506b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    U, total2, count2 = _sum((x-c) for x in data)
507b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    assert T == U and count == count2
508b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    total -=  total2**2/len(data)
509b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    assert not total < 0, 'negative sum of square deviations: %f' % total
510b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return (T, total)
511b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
512b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
513b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef variance(data, xbar=None):
514b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the sample variance of data.
515b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
516b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    data should be an iterable of Real-valued numbers, with at least two
517b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    values. The optional argument xbar, if given, should be the mean of
518b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    the data. If it is missing or None, the mean is automatically calculated.
519b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
520b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Use this function when your data is a sample from a population. To
521b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    calculate the variance from the entire population, see ``pvariance``.
522b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
523b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Examples:
524b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
525b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
526b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> variance(data)
527b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    1.3720238095238095
528b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
529b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    If you have already calculated the mean of your data, you can pass it as
530b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    the optional second argument ``xbar`` to avoid recalculating it:
531b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
532b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> m = mean(data)
533b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> variance(data, m)
534b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    1.3720238095238095
535b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
536b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    This function does not check that ``xbar`` is actually the mean of
537b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    ``data``. Giving arbitrary values for ``xbar`` may lead to invalid or
538b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    impossible results.
539b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
540b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Decimals and Fractions are supported:
541b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
542b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from decimal import Decimal as D
543b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
544b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Decimal('31.01875')
545b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
546b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from fractions import Fraction as F
547b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> variance([F(1, 6), F(1, 2), F(5, 3)])
548b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Fraction(67, 108)
549b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
550b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
551b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if iter(data) is data:
552b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        data = list(data)
553b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n = len(data)
554b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n < 2:
555b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError('variance requires at least two data points')
556b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    T, ss = _ss(data, xbar)
557b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return _convert(ss/(n-1), T)
558b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
559b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
560b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef pvariance(data, mu=None):
561b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the population variance of ``data``.
562b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
563b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    data should be an iterable of Real-valued numbers, with at least one
564b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    value. The optional argument mu, if given, should be the mean of
565b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    the data. If it is missing or None, the mean is automatically calculated.
566b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
567b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Use this function to calculate the variance from the entire population.
568b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    To estimate the variance from a sample, the ``variance`` function is
569b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    usually a better choice.
570b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
571b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Examples:
572b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
573b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
574b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> pvariance(data)
575b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    1.25
576b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
577b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    If you have already calculated the mean of the data, you can pass it as
578b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    the optional second argument to avoid recalculating it:
579b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
580b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> mu = mean(data)
581b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> pvariance(data, mu)
582b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    1.25
583b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
584b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    This function does not check that ``mu`` is actually the mean of ``data``.
585b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Giving arbitrary values for ``mu`` may lead to invalid or impossible
586b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    results.
587b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
588b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Decimals and Fractions are supported:
589b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
590b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from decimal import Decimal as D
591b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
592b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Decimal('24.815')
593b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
594b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> from fractions import Fraction as F
595b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
596b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    Fraction(13, 72)
597b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
598b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
599b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if iter(data) is data:
600b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        data = list(data)
601b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    n = len(data)
602b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    if n < 1:
603b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        raise StatisticsError('pvariance requires at least one data point')
604b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    ss = _ss(data, mu)
605b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    T, ss = _ss(data, mu)
606b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    return _convert(ss/n, T)
607b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
608b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
609b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef stdev(data, xbar=None):
610b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the square root of the sample variance.
611b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
612b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    See ``variance`` for arguments and other details.
613b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
614b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
615b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    1.0810874155219827
616b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
617b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
618b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    var = variance(data, xbar)
619b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    try:
620b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return var.sqrt()
621b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    except AttributeError:
622b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return math.sqrt(var)
623b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
624b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
625b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandesdef pstdev(data, mu=None):
626b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """Return the square root of the population variance.
627b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
628b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    See ``pvariance`` for arguments and other details.
629b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
630b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
631b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    0.986893273527251
632b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes
633b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    """
634b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    var = pvariance(data, mu)
635b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    try:
636b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return var.sqrt()
637b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes    except AttributeError:
638b5fbd41b23bf309e6b420a3df4641603d55dcb68Joel Fernandes        return math.sqrt(var)
639