python

Cython: Tips and tricks

if you have more tricks comment below …

The last couple of days I started learning Cython …

What is Cython ?

Cython is compiled language that seamlesly integrates C with Python, plus more …

C, Cython, Python

The nice thing about is that the learning curve is very gradual. You can start w/o even changing your Python code.
Simply compiling it may speed up your script.

The next step is to start using Cython-the-language constructs.
Those include :

  • Declaring the type of the variables
  • Specifying who and how to call Functions/methods
  • Extension types : Classes which are implemented using Struct, instead of Dict allowing the dispatch resolution to happen at compile time, rather than runtime.

And finally you have the syntax to integrate directly C/C++ libraries and code.

Now on the

tips and tricks …

Creaing an array

Use 1D array instead of lists or numpy array whenever you can.
Red somewhere it is twice as fast than numpy.
In addition you can dynamically .resize() it in-place.

Here is the fastest way to create empty/zeroth array. First you need to have array templates prepared :

from cpython cimport array

cdef iARY = array.array('i') #integer
cdef IARY = array.array('I') #unsigned integer
cdef fARY = array.array('f') #float
cdef dARY = array.array('d') #double

then :

cdef ary = array.clone(fARY, size, 1)

Other options are :

cdef ary = array.array('f')
array.resize(ary, size)
array.zero(ary)

slower variant, but works on other types too :

cdef ary = array.array('f')
array.resize(ary, size)
ary[:] = 0

more on this here : Fast zero’ing

Accessing array elements

Here are several ways to access elements of array … from slower to faster.

ary[i] = value
ary._f[i] = value
#fastest, cause access the union struct directly
ary.data.as_floats[i] = value

the last one sped some portions of my code by ~30 times.

There are variations of the example above depending on the type :

  • _f, _i, _u …..
  • as_floats, as_ints, as_uints …..

A different len()

from cpython.object cimport Py_SIZE
#does not work on range()
cdef inline unsigned int clen(obj): return Py_SIZE(obj)

generates cleaner code, it should be faster. cpdef’d version is slower, which is expected.

type vs isinstance

if you have to do type checks use “type is …” instead of isinstance(), especially if you do several of them.

: x=type(5) 

: %timeit x is int
27.2 ns ± 4.12 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

: %timeit x is float
26.4 ns ± 0.731 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

: %timeit isinstance(5,int) 
52.6 ns ± 0.237 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

: %timeit isinstance(5,float) 
74.2 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

: %timeit isinstance(x,int)                                                                                                                                           
71.5 ns ± 0.357 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

: %timeit isinstance(x,float)                                                                                                                                         
81 ns ± 1.32 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

---

: %timeit type(5) is int                                                                                                                                              
55 ns ± 0.487 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

: %timeit type(5) == int                                                                                                                                              
57.6 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

: %timeit type(5) is float                                                                                                                                            
58 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Python : How to check if variable is X ?

Here is how you can check the type of a variable :

In [240]: isinstance(5, int)                                                                                                                                                 
Out[240]: True

In [241]: var = 5                                                                                                                                                            

In [242]: isinstance(var, int)                                                                                                                                               
Out[242]: True

In [243]: isinstance(var, str)                                                                                                                                               
Out[243]: False

In [244]: isinstance([1,2,3], list)                                                                                                                                          
Out[244]: True

#is it one of many types
In [245]: isinstance([1,2,3], (list,tuple))                                                                                                                                  
Out[245]: True

Here is a function you can use to check if variable is an iterator :

def is_iter(x):
  try:
    iter(x); return True
  except TypeError: return False

Numpy : min/max of integer types

Here is how to find the MIN or MAX values for the Integer types in numpy :

In [235]: np.iinfo(np.int8)                                                                                                                                                  
Out[235]: iinfo(min=-128, max=127, dtype=int8)

In [236]: np.iinfo(np.int16)                                                                                                                                                 
Out[236]: iinfo(min=-32768, max=32767, dtype=int16)

In [237]: np.iinfo(np.int32)                                                                                                                                                 
Out[237]: iinfo(min=-2147483648, max=2147483647, dtype=int32)

In [238]: np.iinfo(np.int64)                                                                                                                                                 
Out[238]: iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

In [239]: np.iinfo(np.int)                                                                                                                                                   
Out[239]: iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

Enums

Python does not have builtin Enumerable type, but we can easily resolve this problem :

def enum(args, start=0):
	class Enum(object):
		__slots__ = args.split()
		names = {}

		def __init__(self):
			for i, key in enumerate(Enum.__slots__, start):
				self.names[i] = key
				setattr(self, key, i)

	return Enum()


e = enum('ONE TWO THREE', start=1)

print(e.ONE)
print(e.TWO)
print(e.THREE)

-----

1
2
3

Stats functions

Quick list of statistics Functions :

import numpy as np

class stats():

	"""
		Calculate the deviations in a sample.
	"""
	@staticmethod
	def dev(xs): return xs - np.mean(xs)

	"""
		Calculate the covariance between two data sets.
			sample=True : sample covarince
			sample=False: population covariance
	"""
	@staticmethod
	def cov(xs,ys,sample=True):
		dec = 1 if sample else 0 #if sample-cov decrement len by 1
		#sum of products of deviations
		return np.dot( stats.dev(xs), stats.dev(ys) ) / (len(xs) - dec)

	@staticmethod
	def var(xs,sample=True):
		dec = 1 if sample else 0 #if sample-cov decrement len by 1
		return np.sum(stats.dev(xs)**2) / (len(xs) - dec)

	@staticmethod
	def std_dev(xs,sample=True):
		return np.sqrt( stats.var(xs, sample) )

	"""
		Calculate Pearson correlation.
	"""
	@staticmethod
	def corr(xs,ys,sample=True) :
		varx = stats.var(xs)
		vary = stats.var(ys)

		corr = stats.cov(xs,ys,sample)/ np.sqrt(varx * vary)
		return corr

	@staticmethod
	def rank(xs): return np.argsort(np.argsort(xs))
	@staticmethod
	def spearman_corr(xs,ys,sample=True):
		xranks = stats.rank(xs)
		yranks = stats.rank(ys)
		return stats.corr(xranks,yranks,sample)

	@staticmethod
	def r2(ys,residuals): #coef of determination
		return ( 1 - stats.var(residuals) ) / stats.var(ys)

	"""
		Calculate auto correlation coefficients for a time series or list of values.
			lag : specifies up to how many lagging coef will be calculated.
				!! The lag should be at most the lenght of the data minus 2, we skip lag-zero.
	"""
	@staticmethod
	def auto_corr(xs,lag=1,sample=True):
		if lag > len(xs) - 2 : raise Exception("Lag(%s) is bigger than the data-len(%s) - 2" % (lag,len(xs)))
		ac = np.zeros(lag)
		for i in range(1, lag+1) :
			ac[i-1] = stats.corr(xs[:-i], xs[i:], sample)
		return ac

Combinatorics: (n choose r)

Here is a quick function to calculate (n choose r) i.e. in how many ways you can combine “r” items out of total of “n” items.

def nCr(n,r):
	print(f"n,r: {n}, {r}")
	assert n >= r, "nCr : n < r !"
	r = min(r,n-r)
	if r == 0 : return 1
	numerator = reduce(op.mul, range(n,n-r,-1))
	denominator = reduce(op.mul, range(1, r+1) )
	return numerator//denominator

and here is in JavaScript :

N :
R :
(N R) :

This uses several interesting tricks which I should probably explain in a separate post : generators, range, splat operator.

function* range(start, end, step) {
	step = step || 1
	if (step > 0) { for (let i = start; i < end; i+= step) yield i; }
	else { for (let i = start; i > end; i+= step) yield i;}
}

Math.mul = (a,b) => a * b
  
function nCr(n,r) {
        if (r > n) return 'r > n ??'
	r = Math.min(r,n-r)
	if (r == 0) return 1
	numerator = [...range(n,n-r,-1)].reduce(Math.mul)
	denominator = [...range(1, r+1)].reduce(Math.mul)
	return Math.floor(numerator/denominator)
}

Distance measures

Examples of implementations of different distance measures. When I have some time I will elaborate on their usage.

#Simple matching coeficient
def smc(sdr1, sdr2):
	n11 = (sdr1 & sdr2).count()
	n00 = (~sdr1 & ~sdr2).count()
	return float(n11+n00)/len(sdr1)

def jiccard(sdr1,sdr2):
	n11 = (sdr1 & sdr2).count()
	n01_10 = (sdr1 ^ sdr2).count()
	return n11/float(n01_10+n11)


def cosine_similarity(vector,matrix):
	sim = ( np.sum(vector*matrix,axis=1) / ( np.sqrt(np.sum(matrix**2,axis=1)) * np.sqrt(np.sum(vector**2)) ) )[::-1]
	return sim

def euclidean_distance(vector, matrix):
	dist = np.sqrt(np.sum((vector - matrix)**2,axis=1))
	#dist = np.linalg.norm(vector - matrix, axis=1)
	return dist

Pretty printing 2D python/numpy array

Below I show you quick and dirty way to print 2D array Column and Row labels/indexes. It is often more convenient to have those available so you can easily track visually the results of operations.

First lets try with numpy array :

import numpy as np
import pandas as pd

a = np.random.randint(0,100,(5,5))

print(a)

print()
print(pd.DataFrame(a))


print()
print(pd.DataFrame(a,columns=['A','B','C','D','E']))
[[70 40 64 22 91]
 [82 41 35 42 19]
 [21  7 42 63 85]
 [26 43 23  1 34]
 [44 79 88 46 62]]

    0   1   2   3   4
0  70  40  64  22  91
1  82  41  35  42  19
2  21   7  42  63  85
3  26  43  23   1  34
4  44  79  88  46  62

    A   B   C   D   E
0  70  40  64  22  91
1  82  41  35  42  19
2  21   7  42  63  85
3  26  43  23   1  34
4  44  79  88  46  62

Of course it is similar for normal Python arrays :

import numpy as np
import pandas as pd

b = [[1,2],[3,4]]

print()
print(pd.DataFrame(b,columns=['A','B']))
   A  B
0  1  2
1  3  4

here if you are too lazy to type : https://onecompiler.com/python/3xm3ms6fb

Calculating Moving (Average) ++

If you have time series data and want to calculate Moving-function like Moving Average you can use a Rolling window like shown below … enjoy

import numpy as np

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(0,10)

print(rolling_window(a,5))

print(np.mean(rolling_window(a,5), axis=1))

-----
[[0 1 2 3 4]
 [1 2 3 4 5]
 [2 3 4 5 6]
 [3 4 5 6 7]
 [4 5 6 7 8]
 [5 6 7 8 9]]

[2. 3. 4. 5. 6. 7.]


b = np.random.randint(0,100,10)

print(rolling_window(b,5))

print(np.mean(rolling_window(b,5), axis=1))

-----
[[42 93 30 69 53]
 [93 30 69 53 93]
 [30 69 53 93 61]
 [69 53 93 61 22]
 [53 93 61 22 53]
 [93 61 22 53 71]]

[57.4 67.6 61.2 59.6 56.4 60. ]

or here is another way to do moving average :

import numpy as np

def moving_average(a, n=3) :
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

print(moving_average([1,2,3,4,5,4,3,2,1]))

-------

[2.0 3.0 4.0 4.33 4.0 3.0 2.0]

Get the size of data structures in Python

If you want to find or measure the size of memory a data structure or objects occupies, try the function below.

import sys
    
def get_size(obj, seen=None):
    """Recursively finds size of objects"""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    return size
    

print(get_size(list(range(100))))
print(get_size(list(range(1000))))
print(get_size([[ i for i in range(10)] for j in range(100)]))
print(get_size([[ i for i in range(100)] for j in range(10)]))


-----
3804
37108
20388
12108