Let's assume we want to add or subtract two 4x4 matrices of 32-bit floats. The first step is to load the arrays. We will assume that the arrays are 16-byte aligned: most SIMD engines require this for their aligned load and store instructions, and aligned access is usually faster as well. Let's assume we have the following typedef:

