@Gene Boo, all right, that's the spirit!
If you're trying to derive & implement it yourself, make sure you have a solid theoretical basis & practical experience for this task. Unfortunately, this is not the kind of theory or practice you can rediscover on your own without any guidance (unless you happen to be a genius AND have a spare hundred years of time). Thus, in order to save you some time and accelerate your efforts, I'll try to provide you with some guidance.
Start from G.H. Golub and C.F. Van Loan "Matrix Computations", 4th Edition:
http://www.cs.cornell.edu/cv/GVL4/golubandvanloan.htm
Go through it cover-to-cover -- or, at minimum, make sure you have a complete understanding of at least the first five chapters (we're specializing solely for the task you're interested in here).
// Prerequisites: something along the lines of Strang, Gilbert "Introduction to Linear Algebra" should suffice. In terms of algorithms & data structures, let's say Chapter 28 of CLRS (including its dependencies, e.g., Chapter 4):
http://mitpress.mit.edu/books/introduction-algorithms
Review:
http://nickhigham.wordpress.com/2013/05/31/fourth-edition-of-matrix-computations/
Other related, relevant reading:
http://nickhigham.wordpress.com/2013/01/28/second-edition-2013-of-matrix-analysis/
http://nickhigham.wordpress.com/201...roximation-theory-and-approximation-practice/
Then, in order to know what you can rely on when working with floating-point numbers, acquaint yourself with "What Every Computer Scientist Should Know About Floating-Point Arithmetic":
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Now, regarding vectorization (SSE, AVX) -- go with Intel's Software Developer Manuals:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Somewhat more user-friendly are optimization manuals by Agner Fog -- so you may want to start with these as background reading:
http://www.agner.org/optimize/
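Before reaching for intrinsics, it helps to write out the scalar pattern that SSE/AVX instructions implement: several independent accumulators advanced per iteration. Here's a hedged sketch in plain C++ (no intrinsics, so it compiles anywhere); with -O2/-O3 and an appropriate -march flag, compilers will often turn such loops into SIMD code themselves, and the four-accumulator form also breaks the serial dependency chain on a single sum:

```cpp
#include <cstddef>

// Naive dot product: one accumulator, a serial chain of additions.
double dot_naive(const double* x, const double* y, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

// Strip-mined dot product: four independent accumulators, mirroring how a
// 4-wide SIMD register accumulates four partial sums in parallel.
double dot_unrolled(const double* x, const double* y, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) s0 += x[i] * y[i];  // scalar tail
    return (s0 + s1) + (s2 + s3);
}
```

(Per the floating-point reading above: the two versions sum in different orders, so they may differ in the last bits.)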
// Regarding the C++ part, I presume you may follow the accelerated learning path:
http://abstrusegoose.com/249
After you build an initial understanding of the aforementioned basics, you can take a stab at moving from naive algorithms to block matrix computation -- e.g., block decompositions and multiplication:
https://en.wikipedia.org/wiki/Block_matrix
https://en.wikipedia.org/wiki/Block_LU_decomposition
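To make the idea concrete, here's a hedged sketch of a blocked (tiled) multiply in C++. The tile size `bs` is a tunable assumption (pick it so the working tiles of A, B, and C fit in your target cache level), and `matmul_blocked` is just an illustrative name, not anything from BLAS:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C += A * B for n x n row-major matrices, processed in bs x bs tiles so
// that each tile is reused while it is still cache-resident.
void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n, std::size_t bs = 64) {
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs) {
                const std::size_t i_end = std::min(ii + bs, n);
                const std::size_t k_end = std::min(kk + bs, n);
                const std::size_t j_end = std::min(jj + bs, n);
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];  // reused across j
                        for (std::size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Note the i-k-j order of the inner loops: both B and C are then walked with stride 1, which matters for the memory-hierarchy reasons discussed below.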
Note that one classic reference for this is a 1969 paper by McKellar & Coffman:
http://dl.acm.org/citation.cfm?id=362879
This is one reason (of several) why an ordinary Fortran program from the 1970s will outperform your implementation (and does, as your timing results illustrate) -- the authors of the BLAS and LAPACK libraries used by that ordinary Fortran program were aware of this paper (and many others) and took this knowledge into account in their implementations.
To understand *why* and *how* to do that yourself, familiarize yourself (thoroughly!) with the following:
http://www.akkadia.org/drepper/cpumemory.pdf
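A self-contained way to convince yourself this matters (a sketch; the exact ratio depends on your machine): traverse the same row-major matrix row-by-row (stride 1, every byte of each fetched cache line used) and column-by-column (stride n, so for large n each access touches a fresh cache line). Both functions compute the same sum, yet the column version can be several times slower once the matrix no longer fits in cache:

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix traversed row by row: consecutive addresses,
// so every element of each fetched cache line is used.
double sum_by_rows(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}

// Same data traversed column by column: a stride of n doubles between
// consecutive accesses defeats spatial locality for large n.
double sum_by_cols(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}
```

Time both with std::chrono for, say, n = 4096 (128 MiB) and compare.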
See also references for cache-oblivious algorithms:
https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
// In general, Hennessy & Patterson "Computer Architecture: A Quantitative Approach" is a good, comprehensive background read:
http://booksite.mkp.com/9780123838728/
// More introductory -- you may find it more approachable if it's your first book in this area -- is Patterson & Hennessy "Computer Organization and Design: The Hardware/Software Interface":
http://booksite.elsevier.com/9780124077263/
In particular, you'll be interested in the following sections:
3.8 Going Faster: Subword Parallelism and Matrix Multiply
4.12 Going Faster: Instruction-Level Parallelism and Matrix Multiply
5.14 Going Faster: Cache Blocking and Matrix Multiply
// Perhaps not now, but later: 6.12 Going Faster: Multiple Processors and Matrix Multiply
After all of the above, I think you may be able to catch up with the performance of a Fortran matrix computation program from the 1970s.
// Note that this is solely a minimal, least-resistance / least-effort path, specialized for the particular exercise you're trying to solve, assuming a "standard" desktop PC and a classic single-threaded implementation. The more you try to generalize / the more your requirements deviate from this (e.g., using GPUs), the more research awaits you -- this is what makes it exciting!
Good luck!