Circular Image

J. Thies

info

Please Note

33 records found

Journal article (2026) - J. Thies, H.A. Dijkstra
Both the Subpolar Gyre (SPG) and the Atlantic Meridional Overturning Circulation (AMOC) have been labeled as separate climate tipping systems, because characteristic transitions in flow patterns have been found in a hierarchy of models of each system. However, the precise connections between the two tipping systems are still unclear, in particular how they can affect their individual tipping behavior. Here, we present bifurcation diagrams of the combined SPG-AMOC system in a spatially three-dimensional North Atlantic Ocean model. We show that SPG tipping is associated with what we call ‘Welander-snaking’ behavior and the AMOC tipping with a Stommel back-to-back saddle-node bifurcation. The details of the bifurcation diagrams are shown to be sensitive to the formulation of the convective parameterization and the horizontal diffusivity of the model. ...
Journal article (2025) - Melven Röhrig-Zöllner, Manuel Becklas, Jonas Thies, Achim Basermann
Tensor networks are a class of algorithms aimed at reducing the computational complexity of high-dimensional problems. They are used in an increasing number of applications, from quantum simulations to machine learning. Exploiting data parallelism in these algorithms is key to using modern hardware. However, there are several ways to map required tensor operations onto linear algebra routines (“building blocks”). Optimizing this mapping impacts the numerical behavior, so computational and numerical aspects must be considered hand-in-hand. In this paper we discuss the performance of solvers for low-rank linear systems in the tensor-train format (also known as matrix-product states). We consider three popular algorithms: TT-GMRES, MALS, and AMEn. We illustrate their computational complexity based on the example of discretizing a simple high-dimensional PDE in, for example, 5010 grid points. This shows that the projection to smaller sub-problems for MALS and AMEn reduces the number of floating-point operations by orders of magnitude. We suggest optimizations regarding orthogonalization steps, singular value decompositions, and tensor contractions. In addition, we propose a generic preconditioner based on a TT-rank-1 approximation of the linear operator. Overall, we obtain roughly a 5× speedup over the reference algorithm for the fastest method (AMEn) on a current multicore CPU. ...
Journal article (2024) - Christie Alappat, Jonas Thies, Georg Hager, Holger Fehske, Gerhard Wellein
Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches has been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms allow to evaluate the polynomial using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix (Formula presented.) and a dense vector (Formula presented.), i.e., (Formula presented.), or a related operation. Recently we have shown that using the level-based formulation of sparse matrix-vector multiplications in the Recursive Algebraic Coloring Engine (RACE) framework we can perform temporal cache blocking of MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we achieve speedups of up to 3 (Formula presented.) on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain reduces significantly, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers. ...
Journal article (2023) - Martin J. Kühn, Johannes Holke, Annette Lutz, Jonas Thies, Melven Röhrig-Zöllner, Alexander Bleh, Jan Backhaus, Achim Basermann
Developments in numerical simulation of flows and high-performance computing influence one another. More detailed simulation methods create a permanent need for more computational power, while new hardware developments often require changes to the software to exploit new hardware features. This dependency is very pronounced in the case of vector-units which are featured by all modern processors to increase their numerical throughput but require vectorization of the software to be used efficiently. We study the vectorization of a simulation method that exhibits an inherent level of vector-parallelism. This is of particular interest as SIMD operations will hopefully be available with std::simd in a future C++ standard. The simulation method considered here results in the simultaneous solution of multiple sparse linear systems of equations which only differ by their main diagonal and right-hand sides. Such structure arises in the simulation of unsteady flow in turbomachinery by means of a frequency domain approach called harmonic balance. ...
Journal article (2022) - Jonas Thies, Moritz Travis Hof, Matthias Zimmermann, Maxim Efremov
We develop a computationally and numerically efficient method to calculate binding energies and corresponding wave functions of quantum mechanical three-body problems in low dimensions. Our approach exploits the tensor structure of the multidimensional stationary Schrödinger equation, being expressed as a discretized linear eigenvalue problem. In one spatial dimension, we solve the three-body problem with the help of iterative methods. Here the application of the Hamiltonian operator is represented by dense matrix-matrix products. In combination with a newly-designed preconditioner for the Jacobi-Davidson QR, our highly accurate tensor method offers a significantly faster computation of three-body energies and bound states than other existing approaches. For the two-dimensional case, we additionally make use of a hybrid distributed/shared memory parallel implementation to calculate the corresponding three-body energies. Our novel method is of high relevance for the analysis of few-body systems and their universal behavior, which is only governed by the particle masses, overall symmetries, and the spatial dimensionality. Our results have straightforward applications for ultracold atomic gases that are widespread and nowadays utilized in quantum sensors. ...
Journal article (2022) - Melven Röhrig-Zöllner, Jonas Thies, Achim Basermann
There are several factorizations of multidimensional tensors into lower-dimensional components, known as ``tensor networks."" We consider the popular ``tensor-train"" (TT) format and ask, How efficiently can we compute a low-rank approximation from a full tensor on current multicore CPUs? Compared to sparse and dense linear algebra, kernel libraries for multilinear algebra are rare and typically not as well optimized. Linear algebra libraries like BLAS and LAPACK may provide the required operations in principle but often at the cost of additional data movements for rearranging memory layouts. Furthermore, these libraries are typically optimized for the compute-bound case (e.g., square matrix operations), whereas low-rank tensor decompositions lead to memory bandwidth limited operations. We propose a ``TT singular value decomposition"" (TT-SVD) algorithm based on two building blocks: a ``Q-less tall-skinny QR"" factorization and a fused tall-skinny matrix-matrix multiplication and reshape operation. We analyze the performance of the resulting TT-SVD algorithm using the roofline performance model. In addition, we present performance results for different algorithmic variants for shared-memory as well as distributed-memory architectures. Our experiments show that commonly used TT-SVD implementations suffer severe performance penalties. We conclude that a dedicated library for tensor factorization kernels would benefit the community: Computing a low-rank approximation can be as cheap as reading the data twice from main memory. As a consequence, an implementation that achieves realistic performance will move the limit at which one has to resort to randomized methods that only process part of the data. ...
Journal article (2021) - Sven Baars, Mark van Der Klok, Jonas Thies, Fred W. Wubs
Journal article (2021) - Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein
General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model in this case. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the art CUBLAS results on an NVIDIA Volta GPGPU. ...
Conference paper (2021) - Jonas Thies, Michiel Wouters, Rebekka Sarah Hennig, Wim Vanroose
The Trilinos library LOCA (http://www.cs.sandia.gov/LOCA/ ) allows computing branches of steady states of large-scale dynamical systems like (discretized) nonlinear PDEs. The core algorithms typically are (pseudo-)arclength continuation, Newton–Krylov methods and (sparse) eigenvalue solvers. While LOCA includes some basic techniques for computing bifurcation points and switching branches, the exploration of a complete bifurcation diagram still takes a lot of programming effort and manual interference. On the other hand, recent developments in algorithms for fully automatic exploration are condensed in PyNCT (https://pypi.org/project/PyNCT/ ). The scope of this algorithmically versatile software is, however, limited to relatively small (e.g. 2D) problems because it relies on linear algebra from Python libraries like NumPy. Furthermore, PyNCT currently does not support problems with a non-Hermitian Jacobian matrix, which rules out interesting applications in chemistry and fluid dynamics. In this paper we aim to combine the best of both worlds: a high-level implementation of algorithms in PyNCT with parallel models and linear algebra implemented in Trilinos. PyNCT is extended to non-symmetric systems and its complete backend is replaced by the PHIST library (https://bitbucket.org/essex/phist ), which allows us to use the same underlying HPC libraries as LOCA does. We then apply the new code to a reaction-diffusion model to demonstrate its potential of enabling fully automatic bifurcation analysis on parallel computers. ...
Journal article (2020) - Christie Alappat, Achim Basermann, Alan R. Bishop, Holger Fehske, Georg Hager, Olaf Schenk, Jonas Thies, Gerhard Wellein
The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today's multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation that eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially. By means of a parameterized roofline model, we analyze the SymmSpMV performance in detail and discuss outliers. While we focus on SymmSpMV in this article, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring. ...

A Pipelined, Hybrid-Parallel Iterative Solver Toolkit

Journal article (2020) - Jonas Thies, Melven Röhrig-Zöllner, Nigel Overmars, Achim Basermann, Dominik Ernst, Georg Hager, Gerhard Wellein
The increasing complexity of hardware and software environments in high-performance computing poses big challenges on the development of sustainable and hardware-efficient numerical software. This article addresses these challenges in the context of sparse solvers. Existing solutions typically target sustainability, flexibility, or performance, but rarely all of them. Our new library PHIST provides implementations of solvers for sparse linear systems and eigenvalue problems. It is a productivity platform for performance-aware developers of algorithms and application software with abstractions that do not obscure the view on hardware-software interaction. The PHIST software architecture and the PHIST development process were designed to overcome shortcomings of existing packages. An interface layer for basic sparse linear algebra functionality that can be provided by multiple backends ensures sustainability, and PHIST supports common techniques for improving scalability and performance of algorithms such as blocking and kernel fusion. We showcase these concepts using the PHIST implementation of a block Jacobi-Davidson solver for non-Hermitian and generalized eigenproblems. We study its performance on a multi-core CPU, a GPU, and a large-scale many-core system. Furthermore, we show how an existing implementation of a block Krylov-Schur method in the Trilinos package Anasazi can benefit from the performance engineering techniques used in PHIST. ...

Equipping sparse solvers for exascale

Book chapter (2020) - Christie L. Alappat, Andreas Alvermann, Moritz Kreutzer, Bruno Lang, Kengo Nakajima, Melven Röhrig-Zöllner, Tetsuya Sakurai, Faisal Shahzad, Jonas Thies, Gerhard Wellein, Achim Basermann, Holger Fehske, Yasunori Futamura, Martin Galgon, Georg Hager, Sarah Huber, Akira Imakura, Masatoshi Kawai
The ESSEX project has investigated programming concepts, data structures, and numerical algorithms for scalable, efficient, and robust sparse eigenvalue solvers on future heterogeneous exascale systems. Starting without the burden of legacy code, a holistic performance engineering process could be deployed across the traditional software layers to identify efficient implementations and guide sustainable software development. At the basic building blocks level, a flexible MPI+X programming approach was implemented together with a new sparse data structure (SELL-C-σ) to support heterogeneous architectures by design. Furthermore, ESSEX focused on hardware-efficient kernels for all relevant architectures and efficient data structures for block vector formulations of the eigensolvers. The algorithm layer addressed standard, generalized, and nonlinear eigenvalue problems and provided some widely usable solver implementations including a block Jacobi–Davidson algorithm, contour-based integration schemes, and filter polynomial approaches. Adding to the highly efficient kernel implementations, algorithmic advances such as adaptive precision, optimized filtering coefficients, and preconditioning have further improved time to solution. These developments were guided by quantum physics applications, especially from the field of topological insulator- or graphene-based systems. For these, ScaMaC, a scalable matrix generation framework for a broad set of quantum physics problems, was developed. As the central software core of ESSEX, the PHIST library for sparse systems of linear equations and eigenvalue problems has been established. It abstracts algorithmic developments from low-level optimization. Finally, central ESSEX software components and solvers have demonstrated scalability and hardware efficiency on up to 256 K cores using million-way process/thread-level parallelism. ...
Conference paper (2020) - Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables a simultaneously flexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU. ...
Journal article (2019) - Andreas Alvermann, Achim Basermann, Thomas Huckle, Akihiro Ida, Akira Imakura, Masatoshi Kawai, Simone Koecher, Moritz Kreutzer, Pavel Kus, Bruno Lang, Hermann Lederer, Valeriy Manin, Hans-Joachim Bungartz, Andreas Marek, Kengo Nakajima, Lydia Nemec, Karsten Reuter, Michael Rippl, Melven Roehrig-Zoellner, Tetsuya Sakurai, Matthias Scheffler, Christoph Scheurer, Faisal Shahzad, Christian Carbogno, Danilo Simoes Brambila, Jonas Thies, Gerhard Wellein, Dominik Ernst, Holger Fehske, Yasunori Futamura, Martin Galgon, Georg Hager, Sarah Huber
We first briefly report on the status and recent achievements of the ELPA-AEO (Eigen value Solvers for Petaflop Applications—Algorithmic Extensions and Optimizations) and ESSEX II (Equipping Sparse Solvers for Exascale) projects. In both collaboratory efforts, scientists from the application areas, mathematicians, and computer scientists work together to develop and make available efficient highly parallel methods for the solution of eigenvalue problems. Then we focus on a topic addressed in both projects, the use of mixed precision computations to enhance efficiency. We give a more detailed description of our approaches for benefiting from either lower or higher precision in three selected contexts and of the results thus obtained. ...

A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

Journal article (2018) - Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein
In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. ...
Journal article (2018) - Weiyan Song, Fred Wubs, Jonas Thies, Sven Baars
We perform a numerical study of a two-component reaction–diffusion model. By using numerical continuation methods, combined with state-of-the-art sparse linear and eigenvalue solvers, we systematically compute steady state solutions and analyze their stability and relations in both two and three space dimensions. The approach gives a more reliable and complete picture than previous efforts based on time integration schemes and is also typically much more efficient in terms of computing time. We are therefore able to produce a rich bifurcation diagram showing a variety of solution patterns and transitions. ...

Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems

Journal article (2017) - Moritz Kreutzer, Jonas Thies, Melven Röhrig-Zöllner, Andreas Pieper, Faisal Shahzad, Martin Galgon, Achim Basermann, Holger Fehske, Georg Hager, Gerhard Wellein
While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring “standard” as well as “accelerated” resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The “General, Hybrid, and Optimized Sparse Toolkit” (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source. ...
Conference paper (2017) - Martin Galgon, Lukas Krämer, Achim Basermann, Melven Röhrig-Zöllner, Jonas Thies, Bruno Lang, Andreas Alvermann, Holger Fehske, Andreas Pieper, Georg Hager, Moritz Kreutzer, Faisal Shahzad, Gerhard Wellein
The ESSEX project is an ongoing effort to provide exascale-enabled sparse eigensolvers, especially for quantum physics and related application areas. In this paper we first briefly summarize some key achievements that have been made within this project. Then we focus on a projection-based eigensolver with polynomial approximation of the projector. This eigensolver can be used for computing hundreds of interior eigenvalues of large sparse matrices. We describe techniques that allow using lower-degree polynomials than possible with standard Chebyshev expansion of the window function and kernel smoothing. With these polynomials, the degree, and thus the number of matrix–vector multiplications, typically can be reduced by roughly one half, resulting in comparable savings in runtime. ...
Conference paper (2016) - Jonas Thies, Martin Galgon, Bruno Lang, Gerhard Wellein, Faisal Shahzad, Andreas Alvermann, Moritz Kreutzer, Andreas Pieper, Melven Röhrig-Zöllner, Achim Basermann, Holger Fehske, Georg Hager
As we approach the exascale computing era, disruptive changes in the software landscape are required to tackle the challenges posed by manycore CPUs and accelerators. We discuss the development of a new ‘exascale enabled’ sparse solver repository (the ESSR) that addresses these challenges—from fundamental design considerations and development processes to actual implementations of some prototypical iterative schemes for computing eigenvalues of sparse matrices. Key features of the ESSR include holistic performance engineering, tight integration between software layers and mechanisms to mitigate hardware failures. ...
Conference paper (2016) - Moritz Kreutzer, Jonas Thies, Georg Hager, Bruno Lang, Gerhard Wellein, Andreas Pieper, Andreas Alvermann, Martin Galgon, Melven Röhrig-Zöllner, Faisal Shahzad, Achim Basermann, Alan R. Bishop, Holger Fehske
Numerous challenges have to be mastered as applications in scientific computing are being developed for post-petascale parallel systems. While ample parallelism is usually available in the numerical problems at hand, the efficient use of supercomputer resources requires not only good scalability but also a verifiably effective use of resources on the core, the processor, and the accelerator level. Furthermore, power dissipation and energy consumption are becoming further optimization targets besides time-to-solution. Performance Engineering (PE) is the pivotal strategy for developing effective parallel code on all levels of modern architectures. In this paper we report on the development and use of low-level parallel building blocks in the GHOST library (“General, Hybrid, and Optimized Sparse Toolkit”). We demonstrate the use of PE in optimizing a density of states computation using the Kernel Polynomial Method, and show that reduction of runtime and reduction of energy are literally the same goal in this case. We also give a brief overview of the capabilities of GHOST and the applications in which it is being used successfully. ...