Gene H. Golub, Charles F. Van Loan


"An invaluable reference book that should be in every university library." -- Image: Bulletin of the International Linear Algebra Society

Mentioned in questions and answers.

I have a big matrix (1,000 rows and 50,000 columns). I know some columns are correlated (the rank is only 100) and I suspect some columns are even proportional. How can I find such proportional columns? One way would be looping over `corr(M(:,j),M(:,k))`, but is there anything more efficient?

If I am understanding your problem correctly, you wish to determine those columns in your matrix that are **linearly dependent**, which means that one column is a scalar multiple of another. There's a very basic algorithm for this based on the QR decomposition, which takes any matrix and decomposes it into a product of two matrices, `Q` and `R`. In other words:

```
A = Q*R
```

`Q` is an orthogonal matrix with orthonormal columns, so multiplying `Q` by its transpose gives the identity matrix (`Q^{T}*Q = I`). `R` is a right-triangular or upper-triangular matrix. One very useful result from Golub and Van Loan's 1996 book *Matrix Computations* is that a matrix is **full rank** if and only if all of the diagonal elements of `R` are non-zero. Because of the floating-point precision on computers, we have to threshold: take the absolute value of all of the diagonals of `R` and check which ones are greater than some tolerance; each entry above the tolerance corresponds to an independent column.

We can slightly modify this to instead search for values that are **less** than the tolerance, which means the corresponding column is **not independent**. You would invoke the QR factorization this way:

```
[Q,R] = qr(A, 0);
```

`Q` and `R` are what I just talked about, and you specify the matrix `A` as input. The second parameter `0` requests an **economy-size** version of `Q` and `R`: if the matrix is rectangular with more rows than columns, `Q` has the same number of rows and columns as `A`, and `R` is square with dimension equal to the smaller of the two sizes. For example, for an `8 x 5` matrix, the economy-size factorization gives an `8 x 5` matrix `Q` (and a `5 x 5` matrix `R`), whereas not specifying the `0` gives an `8 x 8` matrix `Q`.

Now, what we actually need is this style of invocation:

```
[Q,R,E] = qr(A, 0);
```

In this case, `E` would be a permutation vector, such that:

```
A(:,E) = Q*R;
```

The reason why this is useful is because it orders the columns of `Q` and `R` in such a way that the first column of the re-ordered version is the **most probable** independent column, followed by the remaining columns in decreasing order of "strength". As such, `E` tells you how likely each column is to be linearly independent, with those "strengths" in decreasing order. This "strength" is exactly captured in the diagonals of `R` under this re-ordering, measured relative to the first (largest) diagonal element. What you should do is check which diagonals of `R` in the re-arranged version are greater than this first element scaled by the tolerance, and use those to determine which of the corresponding columns are linearly independent.

However, I'm going to flip this around and find the position in the diagonal of `R` where the last possible independent column is located. Anything after this point is considered linearly **dependent**. This is essentially the same as checking whether any diagonals are less than the threshold, but we are using the re-ordering of the matrix to our advantage.

In any case, putting what I have mentioned into code, this is what you should do, assuming your matrix is stored in `A`:

```
%// Step #1 - Define tolerance
tol = 1e-10;
%// Step #2 - Do QR Factorization
[Q, R, E] = qr(A,0);
diag_R = abs(diag(R)); %// Extract diagonals of R
%// Step #3 -
%// Find the LAST column in the re-arranged result that
%// satisfies the linearly independent property
r = find(diag_R >= tol*diag_R(1), 1, 'last');
%// Step #4
%// Anything after r means that the columns are
%// linearly dependent, so let's output those columns to the
%// user
idx = sort(E(r+1:end));
```

Note that `E` will be a permutation vector, and I'm assuming you want the output sorted, which is why we sort the indices that come after the point where the columns stop being linearly independent. Let's test out this theory. Suppose I have this matrix:

```
A =
1 1 2 0
2 2 4 9
3 3 6 7
4 4 8 3
```

You can see that the first two columns are the same, and the third column is a multiple of the first or second. You would just have to multiply either one by 2 to get the result. If we run through the above code, this is what I get:

```
idx =
1 2
```

If you also take a look at `E`, this is what I get:

```
E =
4 3 2 1
```

This means that column 4 was the "best" linearly independent column, followed by column 3. Because we returned `[1,2]` as the linearly dependent columns, columns 1 and 2, which both equal `[1,2,3,4]`, must be scalar multiples of some other column. In this case, that is column 3, as columns 1 and 2 are each half of column 3.
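For reference, the same pivoted-QR rank check can be sketched in Python. This is a minimal transcription, assuming SciPy's `scipy.linalg.qr` with `pivoting=True`, whose permutation output is 0-based rather than MATLAB's 1-based `E`:

```python
import numpy as np
from scipy.linalg import qr

A = np.array([[1, 1, 2, 0],
              [2, 2, 4, 9],
              [3, 3, 6, 7],
              [4, 4, 8, 3]], dtype=float)

tol = 1e-10

# Column-pivoted QR: A[:, E] = Q @ R, with |diag(R)| non-increasing.
# Note E is 0-based here, unlike MATLAB's 1-based permutation vector.
Q, R, E = qr(A, mode='economic', pivoting=True)
diag_R = np.abs(np.diag(R))

# r = position of the last diagonal still above tol * (largest diagonal)
r = np.max(np.nonzero(diag_R >= tol * diag_R[0])[0])

# everything after position r in the permutation is linearly dependent
idx = np.sort(E[r + 1:])
print(idx)  # [0 1] -> columns 1 and 2 in MATLAB's 1-based numbering
```

The numbers match the MATLAB run above: two significant diagonals, and the first two columns flagged as dependent.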

Hope this helps!

If you don't want to do any `QR` factorization, then I can suggest reducing your matrix to its reduced row echelon form and determining the basis vectors that make up the column space of your matrix `A`. Essentially, the column space gives you the minimum set of columns that can generate all possible linear combinations of output vectors when you apply this matrix using matrix-vector multiplication. You can determine which columns form the column space by using the `rref` command: provide a second output to `rref` and it gives you a vector of elements telling you **which columns** are linearly independent, i.e. form a basis of the column space for that matrix. As such:

```
[B,RB] = rref(A);
```

`RB` gives you the locations of the columns that form the column space, and `B` is the reduced row echelon form of the matrix `A`. Because you want to find those columns that are linearly **dependent**, you would return the set of indices that excludes those locations. As such, define a linearly increasing vector from `1` to the number of columns, then use `RB` to remove those entries from this vector; the result is the linearly dependent columns you are seeking. In other words:

```
[B,RB] = rref(A);
idx = 1 : size(A,2);
idx(RB) = [];
```

By using the above code, this is what we get:

```
idx =
2 3
```

Bear in mind that we identified columns 2 and 3 as linearly dependent, which makes sense as both are multiples of column 1. The identification of linearly dependent columns differs from the QR factorization method, as QR orders the columns based on how likely each particular column is to be linearly independent. Because columns 1 to 3 are related to each other, it shouldn't matter which column is returned; any one of them forms a basis for the other two.

I haven't tested the efficiency of using `rref` in comparison to the QR method. I suspect that `rref` does Gaussian row elimination, which tends to be slower in practice than QR factorization, as that algorithm is highly optimized and efficient. Because your matrix is rather large, I would stick to the QR method, but try `rref` anyway and see what you get!
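MATLAB's `rref` isn't available in every environment; a minimal sketch of the same idea (Gauss-Jordan elimination that records the pivot columns, like `rref`'s second output) could look like this in Python with NumPy:

```python
import numpy as np

def rref_pivot_columns(A, tol=1e-10):
    """Gauss-Jordan elimination with partial pivoting; returns the pivot
    (linearly independent) column indices, like rref's second output."""
    B = A.astype(float).copy()
    m, n = B.shape
    pivots = []
    row = 0
    for col in range(n):
        if row >= m:
            break
        # choose the largest entry in this column at or below the current row
        p = row + np.argmax(np.abs(B[row:, col]))
        if np.abs(B[p, col]) <= tol:
            continue                     # no pivot here -> dependent column
        B[[row, p]] = B[[p, row]]        # swap rows
        B[row] /= B[row, col]            # normalize the pivot row
        mask = np.arange(m) != row
        B[mask] -= np.outer(B[mask, col], B[row])  # eliminate the column
        pivots.append(col)
        row += 1
    return pivots

A = np.array([[1, 1, 2, 0],
              [2, 2, 4, 9],
              [3, 3, 6, 7],
              [4, 4, 8, 3]], dtype=float)

piv = rref_pivot_columns(A)
dep = [j for j in range(A.shape[1]) if j not in piv]
print(piv, dep)  # [0, 3] [1, 2]
```

In 0-based indexing the dependent columns come out as `[1, 2]`, matching the `idx = 2 3` from MATLAB's `rref` above.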

I'm new to parallel programming using GPUs, so I apologize if the question is broad or vague. I'm aware there is a parallel SVD function in the CULA library, but what should the strategy be if I have a large number of relatively small matrices to factorize? For example, I have `n` matrices of dimension `d`, where `n` is large and `d` is small. How do I parallelize this process? Could anyone give me a hint?

You can take a look at the Batched Operations post of the CULA blog for a discussion of your problem.

**EDIT**

From what I understand from your comment below, you would like each thread to calculate a separate SVD. So, basically each thread should execute a standard, sequential SVD scheme. For that some possibly useful references:

Golub, Van Loan, Matrix Computations

If you use this approach, though, I'm afraid you will no longer be able to use cuBLAS, as those are `host` functions not callable from the `device` (unless you have a device of compute capability `3.5` or higher; see the `simpleDevLibCUBLAS` example). But basically, in this way you are somehow implementing the batch concept by yourself.
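To illustrate the batch concept itself (on the CPU, not the GPU), NumPy's `svd` broadcasts over leading axes, so a whole stack of small matrices is decomposed in one call; a GPU batched kernel would use the same data layout:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 8                     # many small matrices, as in the question
batch = rng.standard_normal((n, d, d))

# np.linalg.svd broadcasts over leading axes: one small SVD per matrix
U, S, Vh = np.linalg.svd(batch)
print(U.shape, S.shape, Vh.shape)  # (1000, 8, 8) (1000, 8) (1000, 8, 8)

# spot-check the reconstruction of the first matrix
A0 = U[0] @ np.diag(S[0]) @ Vh[0]
print(np.allclose(A0, batch[0]))   # True
```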

If you decide to go to a more standard parallel GPU implementation, the reference below could be of interest:

What are good resources to read up on for *implementing* numerical algorithms in pure Java?

I know nothing about the interactions of the JVM & GC, and would love to learn more.

I enjoyed Object Oriented Implementation of Numerical Methods by Didier Besset. To be honest I use C++ almost exclusively but found this book interesting and fun nonetheless. It doesn't always use the best algorithm for the job, however.

I would also point you towards Matrix Computations by Golub & Van Loan. Not about Java by any means, but a mandatory text for anybody working in this area.

I am looking for the fastest way to solve the following problem:

Given some volume of lattice points in a 3D grid, some points `b_i` (the boundary) satisfy `f(b_i) = 0`, while another point `a_0` satisfies `f(a_0) = 1`.

All other points (non-boundary) are some linear combination of the surrounding 26 points. For example, I could want

```
f(i, j, k)= .05*f(i+1, j, k)+.1*f(i, j+1, k)+.07*f(i, j, k+1)+...
```

The sum of the coefficients `.05+.1+.07+...` adds up to `1`. My objective is to solve for `f(x_i)` for all `x_i` in the volume.

Currently, I am using the successive over-relaxation (SOR) method, which basically initializes the boundary of the volume, assigns to each point the weighted average of the 26 surrounding points, and repeats. The SOR step then takes a combination of `f(x_i)` from the most recent iteration and `f(x_i)` from the iteration before.

I was wondering if anyone knows of any faster ways to solve this problem for a 3D grid around the size 102x102x48. SOR currently takes about 500-1000 iterations to converge to the level I want (varying depending on the coefficients used). I am most willing to use MATLAB, IDL, and C++. Does anyone have any idea of how fast SOR is compared to converting the problem into a linear system and using matrix methods (like BiCGSTAB)?

Also, which method would be most effectively (and easily) parallelized? I have access to a 250-core cluster, and have been trying to parallelize the SOR method using MPI and C++, but haven't seen as much speed increase as I would like (ideal would be on the order of 100-fold). I would greatly appreciate any ideas on speeding up the solution to this problem. Thanks.

If you're comfortable with multithreading, using a red-black scheme for SOR can give a decent speedup. For a 2D problem, imagine a checkerboard - the red squares only depend on the black squares (and possibly themselves), so you can update all the red squares in parallel, and then repeat for all the black ones. Note that this *does* converge more slowly than the simple ordering, but it lets you spread the work over multiple threads.
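A minimal sketch of the red-black idea for the standard 2D 5-point stencil (not your 26-neighbor stencil, which would need more colors, since diagonal neighbors share a checkerboard color). Each color sweep below is a single vectorized update, which is exactly the part you could hand out to separate threads:

```python
import numpy as np

def red_black_sor(u, omega=1.5, iters=500):
    """SOR for the 2D Laplace equation with red-black ordering.

    All points of one color are updated together (the step that can be
    handed to separate threads); boundary values are held fixed."""
    m, n = u.shape
    ii, jj = np.meshgrid(np.arange(m), np.arange(n), indexing='ij')
    color = (ii + jj) % 2
    interior = np.zeros((m, n), dtype=bool)
    interior[1:-1, 1:-1] = True
    for _ in range(iters):
        for parity in (0, 1):                       # red sweep, then black
            avg = np.zeros_like(u)
            avg[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                      + u[1:-1, :-2] + u[1:-1, 2:])
            mask = interior & (color == parity)
            u[mask] += omega * (avg[mask] - u[mask])
    return u

# small test problem: u = 1 on the left edge, 0 on the rest of the boundary
u = np.zeros((10, 10))
u[:, 0] = 1.0
u = red_black_sor(u)
print(u[5, 5])  # an interior value strictly between 0 and 1
```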

Conjugate gradient methods will generally converge faster than SOR (if I remember correctly, by about an order of magnitude). I've never used BCGSTAB, but I remember GMRES working well on non-symmetric problems, and they can probably both benefit from preconditioning.

As for opportunities for parallelization, most CG-type methods only need you to compute the matrix-vector product `A*x`, so you never need to form the full matrix. That's probably the biggest cost of each iteration, and so it's where you should look at multithreading.
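To illustrate the matrix-free idea, here is a sketch (for the simpler 3D 7-point Laplacian rather than your 26-point stencil) of applying `A*x` directly from the stencil, without ever forming the matrix:

```python
import numpy as np

def apply_A(x, shape):
    """Matrix-free product A*x for the 3D 7-point Laplacian on the interior
    of a grid with zero Dirichlet boundary values; the full matrix A is
    never formed."""
    interior_shape = tuple(s - 2 for s in shape)
    u = np.zeros(shape)
    u[1:-1, 1:-1, 1:-1] = x.reshape(interior_shape)
    y = (6.0 * u[1:-1, 1:-1, 1:-1]
         - u[:-2, 1:-1, 1:-1] - u[2:, 1:-1, 1:-1]
         - u[1:-1, :-2, 1:-1] - u[1:-1, 2:, 1:-1]
         - u[1:-1, 1:-1, :-2] - u[1:-1, 1:-1, 2:])
    return y.ravel()

# on a 4x4x4 grid the interior is 2x2x2; for x = all ones, each interior
# point has exactly 3 interior neighbors, so every entry of A*x is 6 - 3 = 3
y = apply_A(np.ones(8), (4, 4, 4))
print(y)
```

A routine like this is all a CG-type solver needs from you, and the stencil application itself parallelizes naturally over grid slabs.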

Two good references on this are Golub and Van Loan, and Trefethen and Bau. The latter is much more readable IMHO.

Hope that helps...

In order to compute the product of two matrices A and B (of dimension n x m) in parallel, I have the following restrictions: the server sends to each client a number of rows from matrix A and a number of rows from matrix B. This cannot be changed. Further, the clients may exchange information between each other so that the matrix product can be computed, but they cannot ask the server to send any other data.

This should be done as efficiently as possible, meaning by minimizing the number of messages sent between processes - considered an expensive operation - and by doing the small calculations in parallel, as much as possible.

From what I have researched, practically the highest number of messages exchanged between the clients is n^2, in the case where each process broadcasts its rows to all the others. Now, the problem is that if I minimize the number of messages sent - this would be around log(n) for distributing the input data - then the computation would be done by only one process, or a few, and in any case no longer in parallel, which was the main idea of the problem.

What could be a more efficient algorithm, that would compute this product?

(I am using MPI, if it makes any difference).

To compute the matrix product `C = A x B` element-by-element, you simply calculate `C(i,j) = dot_product(A(i,:),B(:,j))`. That is, the (i,j) element of C is the dot product of row i of A and column j of B.

If you insist on sending rows of A and rows of B around then you are going to have a tough time writing a parallel program whose performance exceeds a straightforward serial program. Rather, what you ought to do is send rows of A and columns of B to processors for computation of elements of C. If you are constrained to send rows of A and rows of B, then I suggest that you do that, but compute the product on the server. That is, ignore all the worker processors and just perform the calculation serially.

One alternative would be to compute partial dot-products on worker processors and to accumulate the partial results. This will require some tricky programming; it can be done but I will be very surprised if, at your first attempt, you can write a program which outperforms (in execution speed) a simple serial program.
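The partial-result idea can be sketched serially: split the inner dimension into blocks, as if each worker owned one block of B's rows (and the matching columns of A), and accumulate the partial products; in MPI the accumulation would become a reduction:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 8))
B = rng.standard_normal((8, 5))

# pretend worker p owns rows `rows` of B (and the matching columns of A):
# its partial product A[:, rows] @ B[rows, :] is a full-size contribution
# to C, and summing the contributions (an MPI reduction) gives C = A @ B
C = np.zeros((6, 5))
for rows in np.array_split(np.arange(8), 3):
    C += A[:, rows] @ B[rows, :]

print(np.allclose(C, A @ B))  # True
```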

(Yes, there are other approaches to decomposing matrix-matrix products for parallel execution, but they are more complicated than the foregoing. If you want to investigate these then Matrix Computations is the place to start reading.)

You need also to think hard about your proposed measures of efficiency -- the most efficient message-passing program will be the one which passes no messages. If the cost of message-passing far outweighs the cost of computation then the no-message-passing implementation will be the most efficient by both measures. Generally though, measures of the efficiency of parallel programs are ratios of speedup to number of processors: so 8 times speedup on 8 processors is perfectly efficient (and usually impossible to achieve).

As stated yours is not a sensible problem. Either the problem-setter has mis-specified it, or you have mis-stated (or mis-understood) a correct specification.

How can I improve the efficiency of standard matrix multiplication algorithm?

The main operation involved in this approach is `C[i][j] += A[i][p]*B[p][j]`. What can be done to improve the efficiency of the algorithm?

I would suggest reading Chapter 1 of Golub and Van Loan, which addresses this exact question.
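One of the techniques that chapter covers is blocking (tiling) the loops so that sub-blocks of the operands are reused while they are still in cache. A sketch of the idea in Python (NumPy does the per-tile products; the cache argument properly applies to a compiled implementation):

```python
import numpy as np

def matmul_blocked(A, B, bs=32):
    """Multiply C = A @ B one bs-by-bs tile at a time, so each tile of A
    and B is reused while it is still in fast memory (the blocking idea
    from Golub & Van Loan's Chapter 1; slicing clamps at ragged edges)."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must agree"
    C = np.zeros((n, p))
    for i0 in range(0, n, bs):
        for k0 in range(0, m, bs):
            for j0 in range(0, p, bs):
                C[i0:i0+bs, j0:j0+bs] += (A[i0:i0+bs, k0:k0+bs]
                                          @ B[k0:k0+bs, j0:j0+bs])
    return C

rng = np.random.default_rng(2)
A = rng.standard_normal((70, 50))
B = rng.standard_normal((50, 60))
print(np.allclose(matmul_blocked(A, B, bs=16), A @ B))  # True
```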

I have written a linear solver employing Householder reflections/transformations in ANSI C which solves Ax=b given A and b. I want to use it to find the eigenvector associated with an eigenvalue, like this:

```
(A-lambda*I)x = 0
```

The problem is that the 0 vector is always the solution that I get (before someone says it, yes I have the correct eigenvalue with 100% certainty).

**Here's an example which pretty accurately illustrates the issue:**

Given `A-lambda*I` (the example just happens to be Hermitian):

```
1 2 0 | 0
2 1 4 | 0
0 4 1 | 0
```

Householder reflections/transformations will yield something like this:

```
# # # | 0
0 # # | 0
0 0 # | 0
```

Back substitution will find that the solution is `{0,0,0}`, obviously.

It's been a while since I've written an eigensolver, but I seem to recall that the trick was to refactor it from `(A - lambda*I) * x = 0` to `A*x = lambda*x`. Then your Householder or Givens steps will give you something like:

```
# # # | #
0 # # | #
0 0 1 | 1
```

...from which you can back substitute without reaching the degenerate 0 vector. Usually you'll want to deliver x in normalized form as well.
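A sketch of that trick in Python/NumPy, using the Hermitian example above: fix the last component of `x` at 1 (the `0 0 1 | 1` row), solve the remaining system, and normalize. This assumes the true eigenvector has a non-zero last component:

```python
import numpy as np

# the Hermitian example from the question, with one of its true eigenvalues
A = np.array([[1., 2., 0.],
              [2., 1., 4.],
              [0., 4., 1.]])
lam = np.linalg.eigvalsh(A)[0]      # smallest eigenvalue

# fix x[2] = 1 (the "0 0 1 | 1" row) and solve the rest of
# (A - lambda*I) x = 0 for the remaining components, avoiding x = 0
M = A - lam * np.eye(3)
x = np.ones(3)
x[:2], *_ = np.linalg.lstsq(M[:, :2], -M[:, 2], rcond=None)
x /= np.linalg.norm(x)              # deliver x in normalized form

print(np.allclose(A @ x, lam * x))  # True: x is the eigenvector
```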

My memory is quite rusty here, so I'd recommend checking Golub & Van Loan for the definitive answer. There are quite a few tricks involved in getting this to work robustly, particularly for the non-symmetric case.