16 is by no means a big number. How many MAC units would be required to compute a Convolutional Neural Network (CNN) to produce 16 output channels of a 16×16 tile from 16-deep 3×3 tensor convolution in, say, 64 clocks?

It would take at least 9,216 MAC units if not using a fast algorithm. 9,216 MAC units would commonly be used to construct a 96×96 systolic array, but it takes at least a latency of 96 clocks to calculate a 96×96 matrix multiplication (MM). A long train of 96 96×96 matrix multiplications would be required to keep the systolic array occupied.

Welcome to the world of tensors in AI. It is now time to get used to **the curse of dimensionality**.

The title of this article is inspired by and a counter-response to a quote from Professor Charles F. Van Loan in his lecture on tensor unfoldings in 2010.

All tensors secretly wish that they were matrices!

This statement demonstrates the desire of the tensor research community to study tensors by first unfolding them into matrices and then leveraging well-established matrix-theoretic frameworks to study them. It is also an industry standard practice to flatten tensors all the way to matrices to leverage the highly optimized libraries for matrix multiplication (MM) or MM accelerators (MMAs), even though tensors are considered to be the most fundamental data type in all major AI frameworks. Matrices are generally considered to be special cases of tensors by the AI community.

Here might be what conventional wisdom says:

- There are very nice parallelism and data-sharing patterns in MM to leverage.
- Matrices are a more natural or native fit for programmable hardware than tensor.
- There is a native hardware acceleration solution, the systolic array, for MM.

However,

- CNNs are structurally equivalent to MMs. It is not necessary to flatten tensors to leverage the MM-equivalent parallelism and data-sharing patterns.
- Matrices are recursively blocked for temporal locality and packed for spatial locality while climbing up the memory hierarchy. They finally become micro panels, which are small block rows or columns, to be consumed by a software microkernel or GPU shader kernel.
- In my review of Google’s TPU v1 with 256×256 systolic array, the issue of scalability from the curse of the square shape of a systolic array was addressed. Since then it seems to be mainstream to use multiple relatively smaller systolic arrays. For this reason, matrices must be similarly blocked and packed before they are in the square matrix shape that can be consumed by systolic arrays.

As a result, matrices from flattening out tensors are actually blocked and packed to be in the appropriate structure for high-performance execution as illustrated below. The former can be referred to as flat-unfolding, and the latter as block-unfolding. Matrices become the default data type for CNN and AI in general because of the efforts from decades of researching, building, and leveraging the block matrix framework for high-performance MM implementations.

Following the conventional wisdom, a feature map in CNN is coerced to be a column of some matrix and a convolution filter is flattened out to be some consecutive matrix elements in a column. The potential for spatial and temporal reuses of neighboring pixels in a feature map is lost as a result of flat-unfolding.

One key feature of recursive matrix block-unfolding in MM is that high-level temporal and spatial localities are preserved when the matrices are moving closer to the bare metal of hardware. It should be intriguing to see whether data localities in CNN can be preserved similarly in the tensor form.

It is a reasonable design choice to partition a feature map into tiles and keeping the tiling structure for re-using filters, leveraging fast algorithms for small tiles, and achieving fine-grain SIMD-like parallelism. The tensors should remain as tensors while approaching the bare metal to keep the tiling structure and data locality in a feature map intact.

Furthermore, locality patterns among the input feature maps and output feature maps must be addressed. The former allows data from multiple input feature maps to be shared for the calculation of multiple output feature maps. The latter enables a bigger audience to share the input feature maps. The twist is that you cannot share the whole, and all feature maps since producing an output pixel do not require the whole input feature maps, and it would be impractical to keep all the feature maps on-chip. To conclude, all dimensions need to be partitioned or blocked to some extent, and such a consideration leads to tensors being partitioned into smaller ones and becoming **block tensors**.

A block tensor is a tensor whose entries are tensors. It is a counterpart of the block matrix.The concept of block tensors can be used to address the curse of dimensionality and preserve the data locality in CNN, similar to how the high-performance computing (HPC) industry adopted the concept of block matrix for MM. **Tensor packets**, equivalent to micro panels or square submatrices in MM, are defined to be the basic tensor units that must be processed atomically to leverage spatial locality along all dimensions. A **tensor block**, consisting of tensor packets, is defined as the tensor units that must be transferred between computing units and external memory in their entirety to facilitate temporal locality among the tensor packets.

The atomic tensor packet operation is defined to be producing a minimally sufficient number of output channels of minimally sufficient-sized tiles from a minimally sufficient number of input channels. Due to the curse of dimensionality in tensors, processing such tensor packets could become a big effort even though each tensor packet is small in size in each dimension. It can be iterated or applied concurrently within a tensor block to solve bigger problems. The ideas are elaborated semi-formally in the rest of the post.

Input *A*** **and output

*C*

**of a CNN consist of multiple input feature maps (IFMs), and output feature maps (OFMs), respectively. They can be considered as 3-dimensional tensors, with dimensions**

*x*,

*y*

**for feature maps, input depth**

*w*

**for indexing IFMs, and output depth**

*z*for indexing OFMs. In order to achieve fine-grain SIMD parallelism and leverage fast algorithms with spatial locality, each feature map can be further partitioned along

*x*

**and**

*y*

**dimensions into tiles. The compound index**

**(**

*tx*,

*ty*) is used to identify an input tile and an output tile, respectively. For each unique pair of an IFM

*w*and an OFM

*z*, there is a filter

*B*(

*w*,

*z*), typically consisting of 3×3 filter parameters. The input and output tensors become block tensors of tiles as illustrated below

Tiled CNN can be represented more compactly and precisely in tensor-theoretic notations as follow: