
High-dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square-matrix computing unit designed on the basis of circulant matrices. It optimizes the data flow for any sequence of matrix operations, removing the need to move intermediate results, and it also optimizes the performance of each individual matrix operation in direct or transposed form (transposition only requires a change in data addressing). The supported operations are matrix-by-matrix addition, subtraction, dot product, and multiplication; matrix-by-vector multiplication; and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited by the available resources. In addition, a design environment has been developed that assists the user, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N x N matrices, the architecture requires N ALU-RAM blocks and operates in O(N*N) time, requiring N*N + 7 and N + 7 clock cycles for matrix-matrix and matrix-vector operations, respectively. On the tested Virtex-7 FPGA device, the implementation for 500 x 500 matrices reaches a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units.
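
The cycle counts and throughput quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that each of the N ALU-RAM blocks completes one operation per clock cycle, which is one reading consistent with the reported 173 GOPS (500 blocks at 346 MHz) rather than a statement of how the figure was derived.

```python
# Illustrative arithmetic only; function names are hypothetical.

def matrix_matrix_cycles(n: int) -> int:
    """Clock cycles for a matrix-matrix operation on N x N matrices (N*N + 7)."""
    return n * n + 7


def matrix_vector_cycles(n: int) -> int:
    """Clock cycles for a matrix-vector operation on an N x N matrix (N + 7)."""
    return n + 7


def peak_gops(n_blocks: int, clock_hz: float) -> float:
    """Peak throughput in GOPS.

    Assumption (not stated in the text): every ALU-RAM block completes
    one operation per clock cycle.
    """
    return n_blocks * clock_hz / 1e9


def latency_us(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count into microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6


if __name__ == "__main__":
    n = 500            # matrix dimension reported for the Virtex-7 test
    clock_hz = 346e6   # reported maximum clock frequency

    mm = matrix_matrix_cycles(n)
    mv = matrix_vector_cycles(n)
    print(f"matrix-matrix: {mm} cycles, {latency_us(mm, clock_hz):.2f} us")
    print(f"matrix-vector: {mv} cycles, {latency_us(mv, clock_hz):.2f} us")
    print(f"peak throughput: {peak_gops(n, clock_hz):.0f} GOPS")
```

Under that assumption, the constant 7 cycles of overhead is negligible at N = 500, so the latency of a matrix-matrix operation is dominated by the N*N term.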
