Thursday, January 30, 2014

Removing redundant points from a numpy array

Here is a snippet that removes redundant points from time series data using numpy. I recently needed to do this and had no luck finding any help via Google, so here is my solution:
import numpy

def remove_redundant_points(points):
    """
    Returns a point list with any redundant points (points where 
    the value didn't change from the previous point) removed.
    The resulting list has the points before and after any value 
    change.
    :param points: Array of points (time, value)
    :return: Trimmed Array of points with any points where value 
    doesn't change before or after removed.
    """
    changepoints = numpy.where(points[1:, 1] != points[:-1, 1])[0]
    keepindexes = numpy.unique(numpy.concatenate(
                     ([0, len(points) - 1], 
                       changepoints, changepoints + 1)))
    return points[keepindexes]
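The heart of the function is the shifted comparison: lining the value column up against itself, offset by one, flags the last index before each change. A small sketch of just that step (the points array here is made up for illustration):

```python
import numpy

# toy points array: column 0 is time, column 1 is value
points = numpy.array([[0, 1], [1, 1], [2, 2], [3, 2], [4, 3]])

# compare each value to its predecessor; True marks positions
# where the next point's value differs
changepoints = numpy.where(points[1:, 1] != points[:-1, 1])[0]
print(changepoints.tolist())  # [1, 3]: the value changes after rows 1 and 3
```

Keeping both `changepoints` and `changepoints + 1` is what preserves the points before *and* after each change.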
Examples...

>>> values = [1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 5, 5, 5, 5, 5, 5]
>>> pts = zip(range(len(values)), values)
>>> pts
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (7, 2), (8, 3), (9, 4),
 (10, 5), (11, 6), (12, 5), (13, 5), (14, 5), (15, 5), (16, 5), (17, 5)]
>>> trimmed_pts = remove_redundant_points(numpy.array(pts))
>>> trimmed_pts.tolist()
[[0, 1], [4, 1], [5, 2], [7, 2], [8, 3], [9, 4], [10, 5], [11, 6], [12, 5], [17,
 5]]

>>> import random, time
>>> pts = []
>>> v = 0
>>> t = time.time()
>>> for i in range(10000):
...    if random.random() > 0.95: v += 1
...    pts.append((t + i, v))
...
>>> pts = numpy.array(pts)
>>> pts
array([[  1.39111413e+09,   0.00000000e+00],
       [  1.39111413e+09,   0.00000000e+00],
       [  1.39111414e+09,   0.00000000e+00],
       ...,
       [  1.39112413e+09,   5.00000000e+02],
       [  1.39112413e+09,   5.00000000e+02],
       [  1.39112413e+09,   5.00000000e+02]])
>>> len(pts)
10000
>>> trimmed_pts = remove_redundant_points(pts)
>>> trimmed_pts
array([[  1.39111413e+09,   0.00000000e+00],
       [  1.39111414e+09,   0.00000000e+00],
       [  1.39111415e+09,   1.00000000e+00],
       ...,
       [  1.39112412e+09,   4.99000000e+02],
       [  1.39112412e+09,   5.00000000e+02],
       [  1.39112413e+09,   5.00000000e+02]])
>>> len(trimmed_pts)
968


Tuesday, May 7, 2013

Passing multidimensional numpy arrays to C with cffi

I've been playing with using numpy with the cffi library to let C functions manipulate numpy arrays directly. Doing this for one-dimensional arrays is fairly straightforward, but dealing with multidimensional arrays is a bit trickier. I created a simple C file with various calling options. Given a numpy array x, the x.ctypes.data attribute gives us a pointer to the memory buffer backing x. We can cast that pointer to the appropriate type using cffi, pass it directly to the single_test function, and manipulate the array there.
def single_test(x):
    C.single_test(ffi.cast("double *", x.ctypes.data), len(x))
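The zero-copy idea behind this can be illustrated without the C side at all: x.ctypes.data is a plain integer address, so the standard-library ctypes module can build a typed pointer to the same buffer and write through it, just as the C function does. This is a sketch of the mechanism, not part of the original example:

```python
import ctypes
import numpy

x = numpy.arange(5, dtype='float64')

# x.ctypes.data is the integer address of x's buffer; build a
# double* pointing at it (analogous to ffi.cast("double *", ...))
ptr = ctypes.cast(x.ctypes.data, ctypes.POINTER(ctypes.c_double))

ptr[2] = 99.0  # writes straight into x's buffer; x[2] is now 99.0
```

No copy is made in either direction, which is the whole point of handing the raw buffer pointer to C.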
The easiest way to do something similar with multidimensional arrays is just to use array math, as in the array_math_test example. Then the array can be passed down exactly as in single_test. If you want to work with a multidimensional array without doing array math, you need to pass an array of pointers down to your C function. To do this, first allocate memory for an array of pointers using the ffi.new function, then fill each index of the resulting array with a pointer into your numpy buffer. One approach is to break your multidimensional array into multiple numpy arrays and point to each buffer individually:
multi_array = [[1.1, 2.2, 1.3, 4.4, 5.5],
               [5.0, 3.0, 2.0, 1.0, 3],
               [6.0, 1.0, -3.2, -1, 2],
               [0.0, 1.0, 1.0, 2.0, 1]]
x = [numpy.array(v, dtype='float64') for v in multi_array]


def multi_test_a(x):
    ap = ffi.new("double* [%d]" % (len(x)))
    for i in range(len(x)):
        ap[i] = ffi.cast("double *", x[i].ctypes.data)
    C.multi_test(ap, len(x), len(x[0]))

multi_test_a(x)
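This works because each element of x is an independent one-dimensional numpy array with its own contiguous buffer, which is exactly why a separate pointer per row is needed. A quick sketch (the values are arbitrary) confirming that the rows don't share a buffer:

```python
import numpy

multi_array = [[1.1, 2.2, 1.3], [5.0, 3.0, 2.0]]
x = [numpy.array(v, dtype='float64') for v in multi_array]

# every row owns its own buffer, so each base address is distinct
addresses = [row.ctypes.data for row in x]
assert len(set(addresses)) == len(x)
```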
Perhaps a better approach is to use a regular multidimensional numpy array and calculate the offset from the original buffer pointer for each row.
def multi_test_b(x):
    ap = ffi.new("double* [%d]" % (x.shape[0]))
    ptr = ffi.cast("double *", x.ctypes.data)
    for i in range(x.shape[0]):
        # cffi pointer arithmetic is in units of the pointed-to
        # type, so no sizeof(double) factor is needed
        ap[i] = ptr + i*x.shape[1]
    C.multi_test(ap, x.shape[0], x.shape[1])
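The offset arithmetic works because a default (C-contiguous) numpy array stores its rows back to back, so row i begins exactly i * ncols elements past the base pointer. That can be checked from pure Python using numpy's own per-row addresses (a sketch, not part of the original script; here the byte size of a double does appear, because integer addresses are in bytes rather than elements):

```python
import ctypes
import numpy

x = numpy.arange(20, dtype='float64').reshape(4, 5)
dsize = ctypes.sizeof(ctypes.c_double)  # 8 bytes per element

for i in range(x.shape[0]):
    # row i starts i * ncols doubles past the buffer's base address
    assert x[i].ctypes.data == x.ctypes.data + i * x.shape[1] * dsize
```

Note this only holds for C-contiguous arrays; a transposed or sliced array would need its strides taken into account.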
The full Python script for this example produces the following output:
Before single [  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.]
After single [  1.   3.   5.   7.   9.  11.  13.  15.  17.  19.]
Before multi_a [array([ 1.1,  2.2,  1.3,  4.4,  5.5]), array([ 5.,  3.,  2.,  1.,  3.]), array([ 6. ,  1. , -3.2, -1. ,  2. ]), array([ 0.,  1.,  1.,  2.,  1.])]
After multi_a [array([ 1.1,  3.2,  3.3,  7.4,  9.5]), array([ 15.,  14.,  14.,  14.,  17.]), array([ 26. ,  22. ,  18.8,  22. ,  26. ]), array([ 30.,  32.,  33.,  35.,  35.])]
Before multi_b [[ 1.1  2.2  1.3  4.4  5.5]
 [ 5.   3.   2.   1.   3. ]
 [ 6.   1.  -3.2 -1.   2. ]
 [ 0.   1.   1.   2.   1. ]] (4, 5)
After multi_b [[  1.1   3.2   3.3   7.4   9.5]
 [ 15.   14.   14.   14.   17. ]
 [ 26.   22.   18.8  22.   26. ]
 [ 30.   32.   33.   35.   35. ]]

Wednesday, April 6, 2011

Sphinx figure numbers

Sphinx is pretty great for generating nice-looking documents in a variety of output formats, but one place it is pretty weak is in handling figures and references to figures. In HTML output, figures aren't automatically numbered, while in LaTeX output they are. And despite having nicely numbered figures in LaTeX, when you use :ref: to refer to one, Sphinx replaces the reference with the full text of the figure caption instead of referring to it by number.

My attempt at a solution is this Sphinx extension: numfig. It adds figure numbers to HTML figure captions and provides two new Sphinx roles, :num: and :page:. The :num: role can be used to refer to a figure by its number instead of its caption, while the :page: role can be used to refer to the page number the figure occurs on. More detailed documentation and example usage can be found in the project's Bitbucket wiki.
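As a rough sketch of how this is meant to read in a document (the label, image path, and role arguments here are hypothetical; the wiki has the actual syntax):

```rst
See :num:`figure #sample-figure` for details.

.. _sample-figure:

.. figure:: images/sample.png

   A sample figure caption.
```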