Contents

- Introduction
- Sylvester-Type Matrix Equations
- Triangular Matrix Equations
- Usage
- Download
- Prerequisites
- Building
- Installation
- Invoking
- SLICOT/LAPACK Wrappers
- Native Routines
- SMP Parallel Native Routines
- Performance Aspects
- Machine Characteristics
- Workspace
- Matrix-Matrix Multiply
- OpenMP
- QZ Factorizing Routine
- Other Sylvester-Type Projects from Umeå
- Revision History
- Contact
- References
- Acknowledgements and Copyright


RECSY is a library for solving
triangular Sylvester-type matrix equations. Its objectives are both speed
and reliability. To achieve these goals, RECSY is based on novel
recursive blocked algorithms, which call new high-performance
kernels for solving small-sized leaf problems of the recursion tree [Jonsson
and Kågström, 2002a, 2002b]. In contrast to explicit standard
blocking techniques, our recursive approach leads to an automatic variable
blocking that has the potential of matching the memory hierarchies of
today's HPC systems. The RECSY library comprises a set of Fortran 90
files, which use recursion and OpenMP for shared-memory parallelism to
solve eight different matrix equations, including continuous-time as well
as discrete-time standard and generalized Sylvester and Lyapunov
equations.
The Sylvester-type matrix
equations considered are a set of linear matrix equations that appear and
are commonly used in different control theory applications.
In the RECSY library, we
differentiate between one-sided and two-sided matrix equations. We use the
notation one-sided matrix equations when the solution is only
involved in matrix products of two matrices, e.g., op(A)X or
Xop(A), where op(A) can be A or
A^{T}. In
two-sided matrix equations, the solution is involved in matrix
products of three matrices, both to the left and to the right, e.g., op(A)Xop(B).
In the table below, we list the
cases of the eight different types of matrix equations
considered together with the acronyms used (one-sided equations in the top
and two-sided equations in the bottom part).
Name                          | Matrix equation                                  | Acronym
Standard Sylvester (CT)       | op(A)X ± Xop(B) = C                              | SYCT
Standard Lyapunov (CT)        | op(A)X + Xop(A^{T}) = C                          | LYCT
Generalized coupled Sylvester | ( op(A)X ± Yop(B) = C, op(D)X ± Yop(E) = F )     | GCSY
Standard Sylvester (DT)       | op(A)Xop(B) ± X = C                              | SYDT
Standard Lyapunov (DT)        | op(A)Xop(A^{T}) − X = C                          | LYDT
Generalized Sylvester         | op(A)Xop(B) ± op(C)Xop(D) = E                    | GSYL
Generalized Lyapunov (CT)     | op(A)Xop(E^{T}) + op(E)Xop(A^{T}) = C            | GLYCT
Generalized Lyapunov (DT)     | op(A)Xop(A^{T}) − op(E)Xop(E^{T}) = C            | GLYDT
The RECSY library provides the
solution to various "transpose" variants of these matrix equations with
plus or minus signs. For example, there are eight different variants of the
continuous-time Sylvester equation, which in condensed form can be
expressed as

op(A)X ± Xop(B) = scale · C,

where op(A) can be A
or A^{T} and op(B) can be B or B^{T}.
The scalar scale is an output scaling factor (0 ≤ scale ≤ 1),
set to avoid overflow in X during the computations. In general,
the solution overwrites the right-hand side (e.g., C ← X).
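To make the condensed form concrete, the following sketch (an illustration in Python/NumPy, not part of the Fortran library) solves a small instance of op(A)X ± Xop(B) = C by rewriting it as a linear system via Kronecker products; the transa/transb flags select among the eight transpose variants, and the output scaling factor is taken as 1 here for simplicity.

```python
import numpy as np

def solve_syct(A, B, C, transa=False, transb=False, sign=+1):
    """Solve op(A) X + sign * X op(B) = C via the Kronecker
    representation (I (x) op(A) + sign * op(B)^T (x) I) vec(X) = vec(C).
    Dense illustration only; RECSY itself uses recursive blocking."""
    opA = A.T if transa else A
    opB = B.T if transb else B
    m, n = C.shape
    Z = np.kron(np.eye(n), opA) + sign * np.kron(opB.T, np.eye(m))
    x = np.linalg.solve(Z, C.reshape(-1, order="F"))  # vec() is column-major
    return x.reshape(m, n, order="F")

# Small triangular example: A (2x2) and B (3x3) upper triangular
A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 2.0, 0.5], [0.0, 4.0, 1.0], [0.0, 0.0, 5.0]])
C = np.arange(6.0).reshape(2, 3)
X = solve_syct(A, B, C, transa=True, sign=-1)   # variant A^T X - X B = C
residual = np.linalg.norm(A.T @ X - X @ B - C)
print(residual < 1e-12)  # True
```

The unique solvability of this variant requires that no eigenvalue of op(A) equals an eigenvalue of ∓op(B), which holds for the triangular example above.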
The classical
method of solution of Sylvester-type matrix equations is based on the
Bartels–Stewart method, which includes three major steps. First, the
matrix (or matrix pair) is transformed to a Schur (or generalized Schur)
form. This leads to a reduced triangular matrix equation, which is solved
using the RECSY library. For example, the coefficient matrices A
and B in the Sylvester equation
AX − XB = C are in upper triangular or upper
quasi-triangular form. Finally, the solution of the reduced matrix
equation is transformed back to the original coordinate system.
Reliable and efficient algorithms
for the reduction step can be found in LAPACK for the standard case, and
in [Dackland and Kågström, 1999] for the generalized case, where a blocked
variant of the QZ method is presented.
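The three Bartels–Stewart steps can be illustrated as follows (again Python/NumPy, not RECSY code). For simplicity the sketch assumes symmetric A and B, whose Schur form is simply diagonal, so the reduced triangular equation solves element-wise.

```python
import numpy as np

def sylvester_symmetric(A, B, C):
    """Bartels-Stewart for AX - XB = C with symmetric A, B.
    Step 1: reduce A and B to (here diagonal) Schur form.
    Step 2: solve the reduced triangular equation.
    Step 3: transform the solution back."""
    lam, Q = np.linalg.eigh(A)              # A = Q diag(lam) Q^T
    mu, U = np.linalg.eigh(B)               # B = U diag(mu) U^T
    Ct = Q.T @ C @ U                        # reduced right-hand side
    Xt = Ct / (lam[:, None] - mu[None, :])  # element-wise reduced solve
    return Q @ Xt @ U.T                     # back-transformation

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)); A = A + A.T + 10 * np.eye(4)  # shift spectra apart
B = rng.normal(size=(3, 3)); B = B + B.T - 10 * np.eye(3)
C = rng.normal(size=(4, 3))
X = sylvester_symmetric(A, B, C)
print(np.linalg.norm(A @ X - X @ B - C) < 1e-10)  # True
```

In the general nonsymmetric case, step 1 produces upper (quasi-)triangular factors and step 2 is exactly the triangular problem that RECSY solves.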
Triangular matrix equations also
appear naturally in estimating the condition numbers of matrix equations
and in different eigenspace computations, including block-diagonalization of
matrices and matrix pairs and computation of functions of matrices.
Related applications include the direct reordering of eigenvalues in the
real (generalized) Schur form and the computation of additive
decompositions of a (generalized) transfer function (see [Jonsson and
Kågström, 2002a, 2002b] for more information and references).
This section describes how to use the library in
your applications.
New version (2009-12-28): The source code including makefiles is available in
.tar.gz format and in .zip format.
Old version (2003): The source code including makefiles is available in
.tar.gz format and in .zip format.
Prerequisites
In order to build the library, you need:
- A compiler. Included in the package are makefiles for IBM XL Fortran,
Portland Group PGHPF, and Compaq Visual Fortran. It should not be
difficult to modify the makefiles to match another compiler, though.
If explicit parallelization is required, the compiler must support OpenMP.
- A good implementation of the BLAS routines, e.g., your vendor's, or a
free implementation like ATLAS.
- An implementation of the LAPACK routines.
- The SLICOT library.
- GNU make 3.78 or higher, if you are building under Unix.
Building
- After downloading and unpacking, try make -f Makefile.xxx librecsy.a,
where xxx matches your compiler. Or, if you are using Visual Fortran,
open the recsy.dsw workspace, select the librecsy project and choose
Build. This will compile all the files necessary for the native routines
and build the library.
- However, if you want to use the SLICOT/LAPACK wrappers (recommended),
you will need to compile the wrapper files and archive them into the
SLICOT/LAPACK libraries. One way to do this is by running
make libslicot_lapack_recsy.a.
This will first compile the wrapper files, then create one large library
file containing SLICOT, LAPACK, and RECSY; see below. Before doing this,
you may have to modify Makefile and set the variables LIBSLICOT and
LIBLAPACK so they point to your original libraries. These libraries will
not be modified. If you are using Visual Fortran, select the
libslicot_lapack_recsy project and choose Build. Here, too, you will
probably need to modify libslicot_lapack_recsy.mak.
There is currently no installation routine. You may copy the library file
(libslicot_lapack_recsy.a or libslicot_lapack_recsy.lib) to an appropriate
library location, e.g., /usr/local/lib, in order for the linker to find
it. Then link your program using -lslicot_lapack_recsy, or, in Visual
Fortran, by adding the library to the Object/library modules in the Link
tab in project settings.
There are two ways of using the library. One way is to use the standard
SLICOT/LAPACK routines for solving Sylvester-type equations. This has
three advantages. First, you don't need to modify your already existing
application. Not a line. Plug and play. Second, the SLICOT library
includes routines for solving full equations, i.e., they include
the Schur factorization and back-transformation steps. These routines
benefit from the faster triangular solvers in this package. Also, they
include condition estimation. Third, LAPACK and SLICOT routines include
parameter checking.
The other way, if you choose to walk on the wild side, is to call
the recursive subroutines directly. Although they give a rough-cut
interface without parameter checking, this method has its advantages too.
First, it provides some transposition options not provided in
SLICOT/LAPACK. Second, it gives you more freedom in workspace allocation,
although this is not recommended.
This said, it is recommended that you start out with the SLICOT/LAPACK
routines.
The following routines are overloaded in the RECSY library.
As they are overloaded, performance for routines using these subroutines,
e.g., SB04PD, will improve as well.

Routine        | Native routine
SLICOT SB03MX  | RECLYDT
SLICOT SB03MY  | RECLYCT
SLICOT SB04PY  | RECSYDT
SLICOT SG03AX  | RECGLYDT
SLICOT SG03AY  | RECGLYCT
LAPACK DTRSYL  | RECSYCT
LAPACK DTGSYL  | RECGCSY *

* The RECGCSY routine is only used in the case where TRANS='N' and
IJOB=0. For the other cases, the original LAPACK code is used.
We remark that neither LAPACK nor
SLICOT provides a GSYL solver.
To get the largest performance
gain from overloading RECSY routines, it is important that the SLICOT and
LAPACK libraries are linked with highly optimized BLAS libraries.
By calling the native routines, you obtain access to
all cases available in the RECSY library. Note, however, that there is no
parameter checking. Also, these routines only solve quasi-triangular
problems. Directly calling the native routines requires explicit
interfaces to be declared in your (Fortran 90) source file, since some
arguments are optional. This is done by including the file RECSY_DECL.f90
(.F in Unix).
For example:
SUBROUTINE MYSUB(A)
DOUBLE PRECISION A(10,10)
INCLUDE 'RECSY_DECL.f90'
CALL RECSYCT(...)
END SUBROUTINE
The following native sequential routines are defined in the library.
Some parameters are declared OPTIONAL; they are shown in italics.
RECGCSY( UPLOSIGN, SCALE, M, N, A, LDA, B, LDB, C, LDC,
D, LDD, E, LDE, F, LDF, INFO, MACHINE )
Solves the generalized coupled triangular Sylvester equation.
A is upper quasi-triangular M×M,
D is upper triangular M×M,
B is upper quasi-triangular N×N,
E is upper triangular N×N,
C and F are rectangular M×N.
UPLOSIGN | GCSY equation
0 | ( AX + YB = scale C, DX + YE = scale F ), (C, F) ← (X, Y)
1 | ( AX − YB = scale C, DX − YE = scale F ), (C, F) ← (X, Y)
2 | ( AX + YB^{T} = scale C, DX + YE^{T} = scale F ), (C, F) ← (X, Y)
3 | ( AX − YB^{T} = scale C, DX − YE^{T} = scale F ), (C, F) ← (X, Y)
4 | ( A^{T}X + YB = scale C, D^{T}X + YE = scale F ), (C, F) ← (X, Y)
5 | ( A^{T}X − YB = scale C, D^{T}X − YE = scale F ), (C, F) ← (X, Y)
6 | ( A^{T}X + YB^{T} = scale C, D^{T}X + YE^{T} = scale F ), (C, F) ← (X, Y)
7 | ( A^{T}X − YB^{T} = scale C, D^{T}X − YE^{T} = scale F ), (C, F) ← (X, Y)
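The UPLOSIGN values above follow a simple bit pattern (our reading of the table, not an official RECSY API): bit 2 transposes A and D, bit 1 transposes B and E, and bit 0 selects the minus sign. A hypothetical decoder, for illustration:

```python
def decode_uplosign(uplosign):
    """Map a GCSY UPLOSIGN value (0-7) to (transpose A/D, transpose B/E, sign)."""
    if not 0 <= uplosign <= 7:
        raise ValueError("UPLOSIGN must be in 0..7")
    trans_ad = bool(uplosign & 4)   # A^T, D^T variants
    trans_be = bool(uplosign & 2)   # B^T, E^T variants
    sign = -1 if (uplosign & 1) else +1
    return trans_ad, trans_be, sign

print(decode_uplosign(0))  # (False, False, 1):  AX + YB = scale C
print(decode_uplosign(5))  # (True, False, -1):  A^T X - YB = scale C
```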
RECGSYL( UPLOSIGN, SCALE, M, N, A, LDA,
B, LDB, C, LDC, D, LDD, E, LDE, INFO, MACHINE, WORKSPACE,
WKSIZE )
Solves the triangular generalized Sylvester equation.
A is upper quasi-triangular M×M,
C is upper triangular M×M,
B is upper quasi-triangular N×N,
D is upper triangular N×N,
E is rectangular M×N.
UPLOSIGN | GSYL equation
0 | AXB + CXD = scale E, E ← X
1 | AXB − CXD = scale E, E ← X
2 | AXB^{T} + CXD^{T} = scale E, E ← X
3 | AXB^{T} − CXD^{T} = scale E, E ← X
4 | A^{T}XB + C^{T}XD = scale E, E ← X
5 | A^{T}XB − C^{T}XD = scale E, E ← X
6 | A^{T}XB^{T} + C^{T}XD^{T} = scale E, E ← X
7 | A^{T}XB^{T} − C^{T}XD^{T} = scale E, E ← X
10 | AXB^{T} + CXD^{T} = scale E, E ← X; B is upper triangular N×N, D is upper quasi-triangular N×N (used by RECGLYCT)
12 | A^{T}XB + C^{T}XD = scale E, E ← X; B is upper triangular N×N, D is upper quasi-triangular N×N (used by RECGLYCT)
RECGLYDT( UPLO, SCALE, M, A, LDA, E,
LDE, C, LDC, INFO, MACHINE, WORKSPACE, WKSIZE )
Solves the triangular generalized discrete-time Lyapunov equation.
A is upper quasi-triangular M×M,
E is upper triangular M×M,
C is symmetric M×M; only the upper part of C is accessed.

UPLO | GLYDT equation
0 | AXA^{T} − EXE^{T} = scale C, C ← X
1 | A^{T}XA − E^{T}XE = scale C, C ← X
RECGLYCT( UPLO, SCALE, M, A, LDA, E,
LDE, C, LDC, INFO, MACHINE, WORKSPACE, WKSIZE )
Solves the triangular generalized continuous-time Lyapunov equation.
A is upper quasi-triangular M×M,
E is upper triangular M×M,
C is symmetric M×M; only the upper part of C is accessed.

UPLO | GLYCT equation
0 | AXE^{T} + EXA^{T} = scale C, C ← X
1 | A^{T}XE + E^{T}XA = scale C, C ← X
Parameter types and meaning:

Parameter | Intent | Type | Meaning
UPLOSIGN, UPLO | IN | INTEGER | Defines different options (see above).
A, B, D | IN | DOUBLE PRECISION(*) | Coefficient matrices (see above).
C, E | IN, INOUT | DOUBLE PRECISION(*) | Coefficient/right-hand side matrices (see above).
F | INOUT | DOUBLE PRECISION(*) | Right-hand side matrix (see above).
SCALE | OUT | DOUBLE PRECISION | Scaling factor, to avoid overflow in the computed solution.
LDA, LDB, LDC, LDD, LDE, LDF | IN | INTEGER | Leading dimension of matrices A to F.
INFO | OUT | INTEGER | INFO=100: out of memory. INFO=0: no error. INFO>0: the equation is (nearly) singular to working precision; perturbed values were used to solve the equation.
MACHINE | IN | DOUBLE PRECISION(10) | Optional. Specify only if you want different parameters than the default parameters, which are set by RECSY_MACHINE.
WORKSPACE | INOUT | DOUBLE PRECISION(*) | Optional. Specify only if you provide your own workspace.
WKSIZE | INOUT | INTEGER | Optional. Specify only if you provide your own workspace, or if you want to know the required workspace size.
In addition to the routines listed above, a set of
parallelized routines for OpenMP platforms is provided. The signatures
for these routines mimic the sequential routines, except for the extension
"_P" to the subroutine name and an extra parameter
PROCS. PROCS, of type INTEGER,
specifies the number of processors to use (≥ 1).
Due to the nature of the recursion, this parallelization works best when
PROCS = 2^{d}. However, other values may also work well.
Regardless of whether the sequential or parallel routines
are used, SMP-optimized BLAS routines (DGEMM) should be used.
RECSYCT_P( PROCS, UPLOSIGN, SCALE, M,
N, A, LDA, B, LDB, C, LDC, INFO, MACHINE )
RECLYCT_P( PROCS, UPLO, SCALE, M, A, LDA, C, LDC,
INFO, MACHINE )
RECGCSY_P( PROCS, UPLOSIGN, SCALE, M, N, A, LDA, B, LDB, C, LDC,
D, LDD, E, LDE, F, LDF, INFO, MACHINE )
RECSYDT_P( PROCS, UPLOSIGN, SCALE, M,
N, A, LDA, B, LDB, C, LDC, INFO, MACHINE, WORKSPACE, WKSIZE )
RECLYDT_P( PROCS, UPLO, SCALE, M, A,
LDA, C, LDC, INFO, MACHINE, WORKSPACE, WKSIZE )
RECGSYL_P( PROCS, UPLOSIGN, SCALE, M,
N, A, LDA, B, LDB, C, LDC, D, LDD, E, LDE, INFO, MACHINE, WORKSPACE,
WKSIZE )
RECGLYDT_P( PROCS, UPLO, SCALE, M, A,
LDA, E, LDE, C, LDC, INFO, MACHINE, WORKSPACE, WKSIZE )
RECGLYCT_P( PROCS, UPLO, SCALE, M, A,
LDA, E, LDE, C, LDC, INFO, MACHINE, WORKSPACE, WKSIZE )
As with all high-performance routines, there are
tunable parameters. However, the number is not that great, since recursion
takes care of most of the blocking. All machine-characteristic parameters
are given in the vector MACHINE(0:9), which is initialized by the
subroutine RECSY_MACHINE. The elements of the vector are:
Element | Definition (default value)
0 (first) | Smallest number (default DLAMCH('S')/DLAMCH('P')).
1 | Largest number (default 1D0/MACHINE(0)).
2 | Machine precision (default DLAMCH('P')).
3 | Threshold for choosing rectangular matrix-matrix multiply instead of triangular matrix-matrix multiply (default 40). For updates of the form C = C + A×B, where A or B is triangular and all matrix dimensions are smaller than MACHINE(3), the built-in matrix-matrix multiply routine or DGEMM is used; see below. For larger problems, A×B is formed by a call to DTRMM using a temporary workspace, and C is then updated.
4 | Threshold for choosing the built-in matrix-matrix multiply routine (default 40). For matrix products where matrix dimensions are smaller than MACHINE(4), the built-in matrix-matrix multiply routine is used. For larger problems, BLAS DGEMM is used.
5 | Threshold for the sequential routine (default 3D6). For problems smaller than MACHINE(5), no more parallel splits are done. Instead, the (remaining) problem is solved by the sequential solver. The size of the problem is the number of flops required to solve it.
6 | Reserved
7 | Reserved
8 | Reserved
9 | Reserved
Since the routines are implemented in Fortran 90,
the subroutines can dynamically allocate their own workspace. If, for
performance or memory-management reasons, you would like to provide your
own buffer, specify WORKSPACE and set WKSIZE to
the size of your buffer. If it is
large enough, it will be used. If it is too small, the routine will
allocate its own buffer anyway. A special case is WKSIZE=-1.
Then WKSIZE is overwritten with the minimum workspace
necessary to solve the problem, but no problem is solved and the matrices
are never referenced. The required workspace differs between the routines,
but no routine requires more than M×N+4 elements. This number may change
in future releases.
Following the ideas from GEMM-based
level 3 BLAS (Kågström, Ling and
Van Loan, 1999), the recursive algorithms rely heavily on a fast
matrix-matrix multiply (MM) routine. For small problems, it is important
that the MM routine is lightweight. The authors have found that some MM
implementations do not perform optimally on small problems, because of
parameter checking and/or unnecessary data copying. Because of this, a
very lightweight MM implementation featuring 4×4
outer-loop unrolling is provided, showing good performance on different
types of processors. This MM is intended for matrices that fit in level 1
cache. However, if your DGEMM routine also performs well for small
problems, you might want to lower MACHINE(4), possibly to zero.
In order to get the explicit parallelization from
the _P subroutines, a compiler which supports OpenMP is necessary. If your
system has more than two processors, your compiler should also support
nested OpenMP directives. In principle, all OpenMP compilers should
support nested directives according to the specification, but some
compilers do not parallelize inner directives. Also, the overhead for
OpenMP parallelism varies between systems. For systems whose OpenMP
overhead is large, higher performance might be obtained by increasing
MACHINE(5).
While the RECSY library reduces
the time for solving the triangular equations, that is only one part of
solving the complete dense matrix equation problem. Looking only at the
flop counts, the factorization part should take substantially longer.
Previously, this was not the case, as most Sylvester solvers were not
based on true blocked algorithms. With the RECSY library, the pressure is
once again on the Schur factorization routines. One significant
improvement is to use the blocked algorithms for the generalized Schur
decomposition described in Dackland and Kågström, 1999.
Hopefully, there will not be much need for digging into the internals
of the library. Yet, if there is, you might want to know a bit about
how the subroutines work:
- The SMP parallel subroutines call the sequential routines whenever the
problem is too small, or when there is only one processor assigned to
the work. Notice that MACHINE(5) must not be too small for the parallel
routines to work.
- The sequential routines for solving Sylvester equations call a kernel
solver (ending with _TR0 to _TR7) for problems where M, N ≤ 4. If M or N
is greater than 4, a recursive split is done. If N ≥ 2M, the N dimension
is split. If M ≥ 2N, the M dimension is split. Otherwise, both dimensions
are split. For the SMP parallel subroutines, the split of both
dimensions leads to a recursion tree where two of the four branches are
done in parallel.
- The kernel solver (_TRx) solves several different small matrix
equations (leaves in the recursion tree). This is done by constructing
the Kronecker product representation of each matrix equation and solving
the resulting linear system Zx = c using an optimized kernel with partial
pivoting and overflow guarding. If the kernel solver detects a (near)
singularity or risk of overflow, the kernel aborts.
- If the kernel solver aborts, the recursive subroutine backtracks and
constructs a new Kronecker product representation, now leading to a
matrix Z of size 4×4 up to 32×32, which is solved using complete
pivoting.
- The sequential routines for solving Lyapunov equations do recursion
down to M ≤ 2. When this limit is reached, the small problem is solved
using partial pivoting, or complete pivoting for very ill-conditioned
cases.
- The SMP parallel subroutines for solving Lyapunov equations do not
utilize OpenMP themselves, but they call the SMP parallel routines for
solving Sylvester equations.
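The splitting strategy described above can be sketched as follows (an illustrative Python rendering of the rule, with a hypothetical leaf threshold of 4; the library's Fortran implementation differs in detail):

```python
def split_plan(m, n, leaf=4):
    """Return the recursion tree of (m, n) problem sizes:
    below the leaf threshold, solve with a kernel; otherwise split the
    dimension that is at least twice the other, or both when comparable."""
    if m <= leaf and n <= leaf:
        return ("kernel", m, n)
    if n >= 2 * m:                      # N much larger: split the N dimension
        return ("split-N", split_plan(m, n // 2), split_plan(m, n - n // 2))
    if m >= 2 * n:                      # M much larger: split the M dimension
        return ("split-M", split_plan(m // 2, n), split_plan(m - m // 2, n))
    # Comparable sizes: split both dimensions (four subproblems; the SMP
    # routines run two of the four branches in parallel).
    return ("split-both",
            split_plan(m // 2, n // 2), split_plan(m // 2, n - n // 2),
            split_plan(m - m // 2, n // 2), split_plan(m - m // 2, n - n // 2))

print(split_plan(8, 3)[0])   # split-M, since 8 >= 2 * 3
print(split_plan(4, 16)[0])  # split-N
```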
For more details we refer to
Jonsson and Kågström 2001a, 2001b, 2002a, 2002b. The
routine and call hierarchy is shown in this
PDF file.
Currently, new efficient
algorithms for ScaLAPACK-type Sylvester-type solvers and new
ScaLAPACK-type generalized Schur factorization routines are being
developed. For more information, contact Bo Kågström (bokg@cs.umu.se).
2003-02-12: First release, 0.01 alpha. Web page released.
The authors are glad to receive any comments, bug
reports, or suggestions regarding the RECSY library. Email address:
isak@cs.umu.se.
A small subset of relevant references applicable to the library. For
further references, see Jonsson and Kågström, 2002a and 2002b.
- Bartels, R. H. and Stewart, G. W. 1972. Algorithm 432: Solution of the
matrix equation AX+XB=C. Commun. ACM 15, 9, 820–826.
- Dackland, K. and Kågström, B. 1999. Blocked algorithms and software for
reduction of a regular matrix pair to generalized Schur form. ACM Trans.
Math. Softw. 25, 4, 425–454.
- Jonsson, I. and Kågström, B. 2001a. Recursive blocked algorithms for
solving triangular matrix equations — Part I: One-sided and coupled
Sylvester-type equations. SLICOT Working Note 2001-4.
- Jonsson, I. and Kågström, B. 2001b. Recursive blocked algorithms for
solving triangular matrix equations — Part II: Two-sided and generalized
Sylvester and Lyapunov equations. SLICOT Working Note 2001-5.
- Jonsson, I. and Kågström, B. 2002a. Recursive blocked algorithms for
solving triangular systems — Part I: One-sided and coupled
Sylvester-type matrix equations. ACM Trans. Math. Softw. 28, 4 (Dec.),
392–415.
- Jonsson, I. and Kågström, B. 2002b. Recursive blocked algorithms for
solving triangular systems — Part II: Two-sided and generalized
Sylvester and Lyapunov matrix equations. ACM Trans. Math. Softw. 28, 4
(Dec.), 416–435.
- Kressner, D., Mehrmann, V., and Penzl, T. 1999a. CTLEX — A collection
of benchmark examples for continuous-time Lyapunov equations. SLICOT
Working Note 1999-6.
- Kressner, D., Mehrmann, V., and Penzl, T. 1999b. DTLEX — A collection
of benchmark examples for discrete-time Lyapunov equations. SLICOT
Working Note 1999-7.
- Penzl, T. 1998. Numerical solution of generalized Lyapunov equations.
Adv. Comp. Math. 8, 33–48.
- SLICOT. Library and the Numerics in Control Network (NICONET) Web
site: http://www.win.tue.nl/niconet/index.html.
Authors are Isak Jonsson and Bo Kågström, Dept. of
Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden.
Financial support for this project has been provided by
the Swedish Research Council under grants TFR 98-604 and VR
621-2001-3284, and by the Swedish Foundation for Strategic Research
under grant A3 02:128.
The library is freely available for academic (non-commercial) use, and is
provided on an "as is" basis. Included in this package are parts of the
SLICOT and LAPACK libraries, which themselves may impose other terms.
(SLICOT terms are found here. LAPACK terms are found here.)
Any use of the RECSY library should be acknowledged by citing the papers
Jonsson and Kågström, 2002a and 2002b above.