Consider the following piece of code, which generates a (potentially) huge, multi-dimensional array and performs `numpy.tensordot` with it (whether we contract the array with itself or with a different array does not really matter here).

```python
import time
import numpy

L, N = 6, 4
shape = (2 * L) * [N]
A = numpy.arange(numpy.prod(shape)).reshape(shape)
A = A % 256 - 128  # values in [-128, +127]

# contract every odd axis of the first operand with every even axis of the second
axes = (range(1, 2 * L, 2), range(0, 2 * L, 2))

def run(dtype, repeat=1):
    A_ = A.astype(dtype)
    t = time.time()
    for i in range(repeat):
        numpy.tensordot(A_, A_, axes)
    t = time.time() - t
    print(dtype, '\t%8.2f sec\t%8.2f MB' % (t, A_.nbytes / 1e6))
```

Now we can compare the performance for different data types, e.g.:

```python
run(numpy.float64)
run(numpy.int64)
```

Since the array only contains small integers, I would like to save some memory by using `dtype=numpy.int8`. However, this slows down the contraction a lot.
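A likely explanation (worth verifying against your own NumPy build) is that float32/float64 products are dispatched to an optimized BLAS GEMM routine, while integer dtypes fall back to NumPy's generic inner loop. A minimal sketch to test this on a plain 2-D matrix product, independent of `tensordot`:

```python
import time
import numpy

n = 512
M_float = numpy.random.rand(n, n).astype(numpy.float32)
M_int = (M_float * 100).astype(numpy.int8)

# float32 should go through BLAS; int8 through a generic loop
for M in (M_float, M_int):
    t = time.time()
    M.dot(M)
    print(M.dtype, '%8.4f sec' % (time.time() - t))
```

If the int8 product is dramatically slower here as well, the dtype penalty comes from the matrix-multiplication kernel itself rather than from anything `tensordot` adds on top.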

## Here are some test cases

The first one is the important one for my use case; the others are just for reference. Using NumPy 1.13.1 and Python 3.4.2.

### Large array

```
L, N = 6, 4;  A.size = 4**12 = 16777216

<class 'numpy.float64'>    59.58 sec   134.22 MB
<class 'numpy.float32'>    44.19 sec    67.11 MB
<class 'numpy.int16'>     711.16 sec    33.55 MB
<class 'numpy.int8'>      647.40 sec    16.78 MB
```

Same array with different data types; the memory decreases as expected. But why the large differences in CPU time? If anything, I would expect int to be faster than float.

### Large array with different shape

```
L, N = 1, 4**6;  A.size = (4**6)**2 = 16777216

<class 'numpy.float64'>    57.95 sec   134.22 MB
<class 'numpy.float32'>    42.84 sec    67.11 MB
```

The shape doesn't seem to have a large effect.
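That matches how `tensordot` works internally: it transposes and reshapes both operands into 2-D matrices and then calls `numpy.dot`, so only the total size of the contraction matters, not the original shape. For the `L, N = 1` case this reduces to an ordinary matrix product (a small sanity check with an assumed smaller `N` for speed):

```python
import numpy

N = 4 ** 3
A = numpy.arange(N * N, dtype=numpy.float64).reshape(N, N) % 256 - 128

# with axes=([1], [0]) tensordot is exactly the ordinary matrix product
assert numpy.array_equal(numpy.tensordot(A, A, axes=([1], [0])), A.dot(A))
```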

### Not so large array

```
L, N = 5, 4;  A.size = 4**10 = 1048576

<class 'numpy.float128'>   10.91 sec    16.78 MB
<class 'numpy.float64'>     0.98 sec     8.39 MB
<class 'numpy.float32'>     0.90 sec     4.19 MB
<class 'numpy.float16'>     9.80 sec     2.10 MB
<class 'numpy.int64'>       8.84 sec     8.39 MB
<class 'numpy.int32'>       5.55 sec     4.19 MB
<class 'numpy.int16'>       2.23 sec     2.10 MB
<class 'numpy.int8'>        1.82 sec     1.05 MB
```

Smaller times overall, but the same weird trend.

### Small array, lots of repetitions

```
L, N = 2, 4;  A.size = 4**4 = 256;  repeat = 1000000

<class 'numpy.float128'>   17.92 sec     4.10 KB
<class 'numpy.float64'>    14.20 sec     2.05 KB
<class 'numpy.float32'>    12.21 sec     1.02 KB
<class 'numpy.float16'>    41.72 sec     0.51 KB
<class 'numpy.int64'>      14.21 sec     2.05 KB
<class 'numpy.int32'>      14.26 sec     1.02 KB
<class 'numpy.int16'>      13.88 sec     0.51 KB
<class 'numpy.int8'>       13.03 sec     0.26 KB
```

Other than float16 being much slower, everything is fine here.

## Question

Why is int8 so much slower for very large arrays? Is there any way around this? Saving memory becomes increasingly important for larger arrays!
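One workaround worth trying (a sketch, not a drop-in fix): keep the array in int8 for storage, but route the contraction through float64 so it hits the fast BLAS path, then round the result back to integers. For int8 inputs each product is at most 127**2, so the accumulated sums stay exactly representable in float64 as long as the number of summed terms times 127**2 stays below 2**53, which holds comfortably here. The helper name `tensordot_via_float` is my own invention:

```python
import numpy

def tensordot_via_float(A, B, axes):
    """Hypothetical helper: contract integer arrays through BLAS.

    Exact as long as every accumulated sum stays below 2**53 in
    magnitude (for int8 inputs: n_summed_terms * 127**2 < 2**53).
    """
    C = numpy.tensordot(A.astype(numpy.float64), B.astype(numpy.float64), axes)
    return numpy.rint(C).astype(numpy.int64)

# small sanity check against a direct integer contraction
L, N = 2, 4
shape = (2 * L) * [N]
A = (numpy.arange(numpy.prod(shape)).reshape(shape) % 256 - 128).astype(numpy.int8)
axes = (list(range(1, 2 * L, 2)), list(range(0, 2 * L, 2)))
# reference in int64 to avoid int8 overflow during accumulation
direct = numpy.tensordot(A.astype(numpy.int64), A.astype(numpy.int64), axes)
assert numpy.array_equal(tensordot_via_float(A, A, axes), direct)
```

The storage stays at int8 size; only the temporary float64 copies during the contraction cost extra memory, which may or may not be acceptable depending on how large the operands are relative to available RAM.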