Abstract: An algorithm for the reduction of a regular matrix pair (A, B) to block Hessenberg-triangular form is presented. This condensed form Q^T (A, B) Z = (H, T), where H and T are block upper Hessenberg and upper triangular, respectively, and Q and Z are orthogonal, may serve as a first step in the solution of the generalized eigenvalue problem Ax = \lambda Bx. It is shown how an elementwise algorithm can be reorganized in terms of blocked factorizations and higher-level BLAS operations. Several ways to annihilate elements are compared, specifically the use of Givens rotations, Householder transformations, and combinations of the two. Performance results for the different variants are presented and compared to the (unblocked) LAPACK implementation DGGHRD.
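As a point of reference, the following is a minimal sketch of driving the unblocked LAPACK routine DGGHRD that the blocked variants are compared against. It assumes LAPACK is linked and, as DGGHRD requires on entry, that B is already upper triangular (in practice obtained from a prior QR factorization of B); the program name EXGGHRD and the test data are purely illustrative.

      PROGRAM EXGGHRD
*     Reduce a pair (A,B), with B upper triangular, to
*     Hessenberg-triangular form using the unblocked LAPACK
*     routine DGGHRD.  Illustration only; error checks omitted.
      INTEGER          N, LDA
      PARAMETER        ( N = 4, LDA = N )
      DOUBLE PRECISION A( LDA, N ), B( LDA, N ),
     $                 Q( LDA, N ), Z( LDA, N )
      INTEGER          I, J, INFO
*     Set up a simple test pair; B is kept upper triangular,
*     as DGGHRD requires on entry.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = DBLE( I + J )
            IF( I.LE.J ) THEN
               B( I, J ) = DBLE( J - I + 1 )
            ELSE
               B( I, J ) = 0.0D0
            END IF
   10    CONTINUE
   20 CONTINUE
*     'I' initializes Q and Z to the identity and accumulates
*     the transformations, so that on exit Q**T * A * Z = H
*     (upper Hessenberg) and Q**T * B * Z = T (upper triangular).
      CALL DGGHRD( 'I', 'I', N, 1, N, A, LDA, B, LDA, Q, LDA,
     $             Z, LDA, INFO )
      WRITE( *, * ) 'DGGHRD INFO =', INFO
      END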
Abstract: The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. The development of optimal level 3 BLAS code is costly and time consuming, because it requires assembly-level programming/thinking. However, it is possible to develop a portable and high-performance level 3 BLAS library relying mainly on a highly optimized _GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of _GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold: first, model implementations in Fortran 77 of the GEMM-based level 3 BLAS, which are structured to effectively reduce data traffic in a memory hierarchy; second, the GEMM-based level 3 BLAS performance evaluation benchmark, a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
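To illustrate the partitioning idea (a sketch of the principle, not the model implementations themselves), one level of splitting suffices to express the bulk of a _SYRK update as a single _GEMM call; the routine name GBSYRK is illustrative.

      SUBROUTINE GBSYRK( N, K, ALPHA, A, LDA, BETA, C, LDC )
*     One level of the GEMM-based partitioning idea for
*     C := alpha*A*A**T + beta*C (lower triangular part of C),
*     i.e. the _SYRK operation.  The off-diagonal block C21 is
*     a plain GEMM; only the two diagonal blocks need SYRK.
      INTEGER          N, K, LDA, LDC
      DOUBLE PRECISION ALPHA, BETA
      DOUBLE PRECISION A( LDA, * ), C( LDC, * )
      INTEGER          N1, N2
      N1 = N / 2
      N2 = N - N1
*     C11 := alpha*A1*A1**T + beta*C11        (small SYRK)
      CALL DSYRK( 'L', 'N', N1, K, ALPHA, A, LDA, BETA, C, LDC )
*     C21 := alpha*A2*A1**T + beta*C21        (bulk of the work,
*     performed by the highly optimized DGEMM)
      CALL DGEMM( 'N', 'T', N2, N1, K, ALPHA, A( N1+1, 1 ), LDA,
     $            A, LDA, BETA, C( N1+1, 1 ), LDC )
*     C22 := alpha*A2*A2**T + beta*C22        (small SYRK)
      CALL DSYRK( 'L', 'N', N2, K, ALPHA, A( N1+1, 1 ), LDA,
     $            BETA, C( N1+1, 1 ), LDC )
      END

Recursing on, or further blocking, the two diagonal updates drives essentially all of the arithmetic into DGEMM, leaving only a small amount of non-GEMM work.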
Abstract: The GEMM-based level 3 BLAS model implementations, which are structured to effectively reduce data traffic in a memory hierarchy, and the performance evaluation benchmark, which is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations, are presented in [3]. Here, the installation and tuning of the Fortran 77 model implementations, as well as the use and installation of the performance evaluation benchmark, are described. All software comes in all four data precisions and is designed to be easy to implement and use on different platforms. Each of the GEMM-based routines has a few system-dependent parameters that specify internal block sizes, cache characteristics, and intersection points for alternative code sections.
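For orientation, here is a minimal sketch of the kind of measurement loop such a benchmark is built around, timing DGEMM and reporting Mflop/s. It assumes a CPU timer such as LAPACK's DSECND is available; actual tuning would repeat the measurement over a range of problem sizes and candidate block-size parameters.

      PROGRAM TIMEL3
*     Sketch of a level 3 timing measurement: time one DGEMM
*     call and report the attained Mflop/s rate.  Assumes the
*     LAPACK CPU timer DSECND is linked in.
      INTEGER          N, LDA
      PARAMETER        ( N = 200, LDA = N )
      DOUBLE PRECISION A( LDA, N ), B( LDA, N ), C( LDA, N )
      DOUBLE PRECISION T1, T2, FLOPS, DSECND
      INTEGER          I, J
      EXTERNAL         DSECND
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = 1.0D0
            B( I, J ) = 1.0D0
            C( I, J ) = 0.0D0
   10    CONTINUE
   20 CONTINUE
      T1 = DSECND( )
      CALL DGEMM( 'N', 'N', N, N, N, 1.0D0, A, LDA, B, LDA,
     $            0.0D0, C, LDA )
      T2 = DSECND( )
*     A square matrix multiply and add performs 2*N**3 flops.
      FLOPS = 2.0D0*DBLE( N )**3
      WRITE( *, * ) 'Mflop/s:', FLOPS / ( ( T2-T1 )*1.0D6 )
      END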
Abstract: A distributed algorithm with the same functionality as the single-processor level 3 BLAS operation GEMM, i.e., general matrix multiply and add, is presented. By the same functionality we mean the ability to perform GEMM operations on arbitrary subarrays of the matrices involved. The logical network is a 2D square mesh with torus connectivity. The matrices involved are distributed with a non-scattered blocked data distribution. The algorithm consists of two main parts: alignment and data movement of the subarrays involved in the operation, and a distributed blocked matrix multiplication algorithm on the (sub)matrices using only a square submesh. Our general approach makes it possible to perform GEMM operations on non-overlapping submeshes simultaneously.
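The paper's algorithm is not reproduced here, but the flavor of a blocked multiply on a square torus can be conveyed by a serial emulation of the classical Cannon scheme, a close relative, in which the alignment phase is folded into the index arithmetic. The routine name CANNON and the storage of block (I,J) as X(:,:,I,J) are choices made for this sketch only.

      SUBROUTINE CANNON( P, NB, A, B, C )
*     Serial emulation of a Cannon-style blocked multiply on a
*     P x P torus.  Block (I,J) of each matrix is the NB x NB
*     array X(:,:,I,J).  C is updated with GEMM semantics,
*     C := C + A*B, so initialize C before the call as needed.
      INTEGER          P, NB
      DOUBLE PRECISION A( NB, NB, P, P ), B( NB, NB, P, P ),
     $                 C( NB, NB, P, P )
      INTEGER          I, J, S, K
      DO 30 S = 0, P - 1
         DO 20 J = 1, P
            DO 10 I = 1, P
*              After alignment and S cyclic torus shifts, mesh
*              node (I,J) holds A-block (I,K) and B-block (K,J):
               K = MOD( I + J + S - 2, P ) + 1
*              Local update C(I,J) := C(I,J) + A(I,K)*B(K,J);
*              on a real mesh all P*P updates of one step run
*              concurrently, followed by the shifts.
               CALL DGEMM( 'N', 'N', NB, NB, NB, 1.0D0,
     $                     A( 1, 1, I, K ), NB,
     $                     B( 1, 1, K, J ), NB, 1.0D0,
     $                     C( 1, 1, I, J ), NB )
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
      END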
Abstract: CONLAB offers an interactive desktop environment for parallel algorithm design and performance prediction. It is a MATLAB-style environment for developing parallel algorithms and for simulating their performance on different parallel architectures. Each parallel architecture model is defined through a few parameters, including the cost of one floating point operation, the start-up penalty for messages, and a per-item cost in the message passing communication. The design of a non-trivial parallel algorithm in CONLAB is described, using the Hessenberg reduction algorithm from ScaLAPACK as illustration. We also show how the performance of the ScaLAPACK algorithm running on an IBM SP system can be simulated using the facilities offered by CONLAB. A toolbox of routines that are used as building blocks in ScaLAPACK is implemented in CONLAB, including several BLACS communication routines and PBLAS computational kernels. Since the simulations performed in CONLAB run on a Unix workstation, there are limitations on the problem sizes and processor grid sizes that can be used. The main limitation is the amount of memory, and in order to solve realistic problems in CONLAB we propose and illustrate a down-sizing and scaling approach for emulating the execution of large-scale problems. The problem is down-sized to fit the workstation environment, and the parameters of the CONLAB timing model, which is a user-defined function, are scaled in concert so that the real problem is simulated as a down-sized problem in CONLAB.
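The timing model is, in essence, a linear cost function of the architecture parameters named above. A minimal sketch of such a user-defined model, transcribed into Fortran 77 for consistency with the other examples here (the function name TMODEL and its argument list are illustrative, not CONLAB's actual interface):

      DOUBLE PRECISION FUNCTION TMODEL( NFLOP, NMSG, NITEM,
     $                                  TFLOP, ALPHA, BETA )
*     Sketch of a linear timing model: predicted time is the
*     arithmetic cost plus message start-up costs plus the
*     per-item transfer cost.  All names are illustrative.
      DOUBLE PRECISION NFLOP, NMSG, NITEM, TFLOP, ALPHA, BETA
      TMODEL = NFLOP*TFLOP + NMSG*ALPHA + NITEM*BETA
      END

Under such a model, down-sizing amounts to shrinking the problem dimensions while rescaling TFLOP, ALPHA, and BETA in concert, so that the predicted time of the down-sized run tracks that of the full-scale target problem.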