@Gene Boo, all right, that's the spirit!
If you're trying to derive & implement it yourself, make sure you have a solid theoretical basis & practical experience for this task. Unfortunately, this is not the kind of theory or practice you can rediscover on your own without any guidance (unless you happen to be a genius AND have a spare hundred years of time). Thus, in order to save you some time and accelerate your efforts, I'll try to provide you with some guidance.
Start from G.H. Golub and C.F. Van Loan "Matrix Computations", 4th Edition:
http://www.cs.cornell.edu/cv/GVL4/golubandvanloan.htm
Go through it cover-to-cover -- or, at minimum, make sure you have a complete understanding of at least the first five chapters (we're specializing solely for the task you're interested in here).
// Prerequisites: something along the lines of Strang, Gilbert "Introduction to Linear Algebra" should suffice. In terms of algorithms & data structures, let's say Chapter 28 of CLRS (including its dependencies, e.g., Chapter 4):
http://mitpress.mit.edu/books/introduction-algorithms
Review:
http://nickhigham.wordpress.com/2013/05/31/fourth-edition-of-matrix-computations/
Other related, relevant reading:
http://nickhigham.wordpress.com/2013/01/28/second-edition-2013-of-matrix-analysis/
http://nickhigham.wordpress.com/201...roximation-theory-and-approximation-practice/
Then, in order to know what you can rely on when working with floating-point numbers, acquaint yourself with "What Every Computer Scientist Should Know About Floating-Point Arithmetic":
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Now, regarding vectorization (SSE, AVX) -- go with Intel's Software Developer Manuals:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Somewhat more user-friendly are optimization manuals by Agner Fog -- so you may want to start with these as background reading:
http://www.agner.org/optimize/
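Before reaching for intrinsics, it helps to write out the scalar pattern that SSE/AVX instructions implement: several independent accumulators advanced per iteration. Here's a hedged sketch in plain C++ (no intrinsics, so it compiles anywhere); with -O2/-O3 and an appropriate -march flag, compilers will often turn such loops into SIMD code themselves, and the four-accumulator form also breaks the serial dependency chain on a single sum:

```cpp
#include <cstddef>

// Naive dot product: one accumulator, a serial chain of additions.
double dot_naive(const double* x, const double* y, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

// Strip-mined dot product: four independent accumulators, mirroring how a
// 4-wide SIMD register accumulates four partial sums in parallel.
double dot_unrolled(const double* x, const double* y, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) s0 += x[i] * y[i];  // scalar tail
    return (s0 + s1) + (s2 + s3);
}
```

(Per the floating-point reading above: the two versions sum in different orders, so they may differ in the last bits.)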
// Regarding the C++ part, I presume you may follow the accelerated learning path:
http://abstrusegoose.com/249
After you build an initial understanding of the aforementioned basics, you can take a stab at moving from naive algorithms to block matrix computation -- e.g., block decompositions and multiplication:
https://en.wikipedia.org/wiki/Block_matrix
https://en.wikipedia.org/wiki/Block_LU_decomposition
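To make the idea concrete, here's a hedged sketch of a blocked (tiled) multiply in C++. The tile size `bs` is a tunable assumption (pick it so the working tiles of A, B, and C fit in your target cache level), and `matmul_blocked` is just an illustrative name, not anything from BLAS:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C += A * B for n x n row-major matrices, processed in bs x bs tiles so
// that each tile is reused while it is still cache-resident.
void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n, std::size_t bs = 64) {
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs) {
                const std::size_t i_end = std::min(ii + bs, n);
                const std::size_t k_end = std::min(kk + bs, n);
                const std::size_t j_end = std::min(jj + bs, n);
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];  // reused across j
                        for (std::size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Note the i-k-j order of the inner loops: both B and C are then walked with stride 1, which matters for the memory-hierarchy reasons discussed below.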
Note that one classic reference for this is a 1969 paper by McKellar & Coffman:
http://dl.acm.org/citation.cfm?id=362879
This is one reason (of several) why an ordinary Fortran program from the 1970s will outperform your implementation (and does, as your timing results illustrate) -- the authors of the BLAS and LAPACK libraries used by that ordinary Fortran program were aware of this paper (and many others) and took this knowledge into account in their implementations.
To understand *why* and *how* to do that yourself, familiarize yourself (thoroughly!) with the following:
http://www.akkadia.org/drepper/cpumemory.pdf
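A self-contained way to convince yourself this matters (a sketch; the exact ratio depends on your machine): traverse the same row-major matrix row-by-row (stride 1, every byte of each fetched cache line used) and column-by-column (stride n, so for large n each access touches a fresh cache line). Both functions compute the same sum, yet the column version can be several times slower once the matrix no longer fits in cache:

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix traversed row by row: consecutive addresses,
// so every element of each fetched cache line is used.
double sum_by_rows(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}

// Same data traversed column by column: a stride of n doubles between
// consecutive accesses defeats spatial locality for large n.
double sum_by_cols(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}
```

Time both with std::chrono for, say, n = 4096 (128 MiB) and compare.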
See also references for cache-oblivious algorithms:
https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
// In general, Hennessy & Patterson "Computer Architecture: A Quantitative Approach" is a good, comprehensive background read:
http://booksite.mkp.com/9780123838728/
// More introductory -- you may find it more approachable if it's your first book in this area -- is Patterson & Hennessy "Computer Organization and Design: The Hardware/Software Interface":
http://booksite.elsevier.com/9780124077263/
In particular, you'll be interested in the following sections:
3.8 Going Faster: Subword Parallelism and Matrix Multiply
4.12 Going Faster: Instruction-Level Parallelism and Matrix Multiply
5.14 Going Faster: Cache Blocking and Matrix Multiply
// Perhaps not now, but later: 6.12 Going Faster: Multiple Processors and Matrix Multiply
After all of the above, I think you may be able to catch up with the performance of a Fortran matrix computation program from the 1970s.
// Note that this is solely a minimal, least-resistance / least-effort path, specialized for the particular exercise you're trying to solve, assuming a "standard" desktop PC and a classic single-threaded implementation. The more you try to generalize / the more your requirements deviate from this (e.g., using GPUs), the more research awaits you -- this is what makes it exciting!
Good luck!