Develop memory management strategies for modern HPC+AI systems
A convergence of AI, HPC and Big Data analytics is being accelerated by the proliferation of modern compute workflows that combine different methodologies and techniques to solve complex problems. These domains now arguably run the same kinds of data- and compute-intensive workloads on HPC hardware, whether on niche supercomputers, small institutional clusters or in the cloud.
Distributed scaling, occupancy and bandwidth issues plague all of these domains as well. Four major trends currently shape this converged domain. First, the average size of application datasets is increasing rapidly. Read-only input matrices that used to be on the order of megabytes or a few gigabytes are growing into the double-digit gigabyte range and beyond. Second, applications are required to be ever more accurate. This leads to larger working set sizes in memory as the resolution of stored and computed data becomes finer. Third, no matter how close accelerators are to the CPU, their memory address spaces are still not coherent, and automated memory management systems do not yet match the performance of hand-crafted solutions for HPC/AI applications. Fourth, while the physical memory size of accelerators is growing, it is not growing at the same rate as the working set sizes of applications. Together, these trends suggest that future HPC systems will rely heavily on efficient memory management for accelerators in order to handle future working set sizes, and that considerable research will be essential in this field. A confluence of HW-SW co-design choices optimized for these converged scenarios will therefore be necessary. Memory management and mapping form a crucial part of this.
The requirement for dynamic memory mapping strategies (in unsupervised and semi-supervised online training, dynamic graph analytics, data analytics, sparse linear algebra and databases) only compounds the issues mentioned above. In conventional HPC systems, the memory management sub-system runs as a separate service, or as part of the runtime management sub-system, on a Service Node and controls memory allocation on the Computational Nodes. It deals with the following issues:
- choosing the most suitable memory according to the allocated processing elements;
- enabling concurrent, thread-safe memory allocation and deallocation while avoiding fragmentation;
- performing translation from virtual to physical addresses, and vice versa;
- performing runtime optimization.
Almost all accelerator/GPU-level memory managers offer the standard malloc/free interface and operate on a block of memory with a configurable size. They also follow a similar approach: they split the available memory into large blocks (mostly of fixed size) and use these to serve the individual allocation requests. These resources are managed with lists, queues or even hashing. Such schemes are far from optimal. A few approaches have been proposed over the last decade, and these need to be evaluated on a level playing field and on state-of-the-art hardware to answer the question of whether dynamic memory mapping and management is really as slow as commonly thought. This also involves thoroughly evaluating compute resource allocation (task/process-based, thread-based as well as warp/wave-based), performance scaling, fragmentation and real-world performance, using custom and synthetic workloads as well as standard benchmarks, if any exist.
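To make the common scheme described above concrete, here is a minimal host-side sketch in Python, not taken from any specific GPU memory manager. It assumes a pool of a configurable size split into fixed-size blocks, a queue of free block indices, and a malloc/free interface that returns byte offsets. The class name `BlockPool`, its methods and the fragmentation metric are all hypothetical and chosen purely for illustration.

```python
from collections import deque


class BlockPool:
    """Toy malloc/free manager over a memory region of configurable size.

    The region is split into equal-sized blocks and a queue tracks the
    free ones. Hypothetical illustration only, not a real library API.
    """

    def __init__(self, pool_bytes: int, block_bytes: int) -> None:
        if block_bytes <= 0 or pool_bytes < block_bytes:
            raise ValueError("pool must hold at least one block")
        self.block_bytes = block_bytes
        self.num_blocks = pool_bytes // block_bytes
        # Queue of free block indices; every block is free at start.
        self.free_blocks = deque(range(self.num_blocks))
        # Maps block offset -> requested size for each live allocation.
        self.live = {}

    def malloc(self, size: int):
        """Return a byte offset into the pool, or None if the request fails."""
        if size <= 0 or size > self.block_bytes or not self.free_blocks:
            return None
        offset = self.free_blocks.popleft() * self.block_bytes
        self.live[offset] = size
        return offset

    def free(self, offset: int) -> None:
        """Return the block at `offset` to the back of the free queue."""
        if offset not in self.live:
            raise ValueError("free of unknown or already freed offset")
        del self.live[offset]
        self.free_blocks.append(offset // self.block_bytes)

    def internal_fragmentation(self) -> float:
        """Fraction of bytes in live blocks that the caller did not request."""
        if not self.live:
            return 0.0
        reserved = len(self.live) * self.block_bytes
        return 1.0 - sum(self.live.values()) / reserved


pool = BlockPool(pool_bytes=1024, block_bytes=256)
small = pool.malloc(100)   # occupies a whole 256-byte block
full = pool.malloc(256)    # fits a block exactly
wasted = pool.internal_fragmentation()
```

Even this toy version shows two of the costs an evaluation would have to measure: requests larger than one block fail outright, and small requests waste the remainder of their block. A thread-safe or warp-aware variant would additionally need synchronization around the free queue, which this sketch leaves out.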
Following this, novel memory management strategies must be proposed for these converged domains (with a particular emphasis on mapping). This work must result in guidelines on the best usage scenario for each strategy. It should also provide insights into the infrastructure interfaces required to integrate any of the tested and proposed memory managers into an application and to switch between them for benchmarking purposes.
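One way to picture such an infrastructure interface is a shared malloc/free protocol that every candidate manager implements, so that a benchmark harness can swap managers by name. The Python sketch below rests on that assumption. `MemoryManager`, `BumpManager`, `FreeListManager`, `run_workload` and `benchmark` are hypothetical names, and the two managers are deliberately trivial stand-ins rather than any of the approaches to be evaluated.

```python
import time
from typing import Callable, Dict, Optional, Protocol


class MemoryManager(Protocol):
    """Common interface every manager under test is assumed to expose."""

    def malloc(self, size: int) -> Optional[int]: ...
    def free(self, handle: int) -> None: ...


class BumpManager:
    """Hands out consecutive offsets and never reuses freed memory."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.top = 0

    def malloc(self, size: int) -> Optional[int]:
        if size <= 0 or self.top + size > self.capacity:
            return None
        offset = self.top
        self.top += size
        return offset

    def free(self, handle: int) -> None:
        pass  # a bump allocator cannot reclaim individual allocations


class FreeListManager:
    """Fixed-size slots recycled through a LIFO free list."""

    def __init__(self, capacity: int, slot: int) -> None:
        self.slot = slot
        self.free_slots = list(range(capacity // slot))

    def malloc(self, size: int) -> Optional[int]:
        if size <= 0 or size > self.slot or not self.free_slots:
            return None
        return self.free_slots.pop() * self.slot

    def free(self, handle: int) -> None:
        self.free_slots.append(handle // self.slot)


def run_workload(manager: MemoryManager, rounds: int, size: int) -> dict:
    """Allocate and immediately free `rounds` times; count failures, time it."""
    failed = 0
    start = time.perf_counter()
    for _ in range(rounds):
        handle = manager.malloc(size)
        if handle is None:
            failed += 1
        else:
            manager.free(handle)
    return {"failed": failed, "seconds": time.perf_counter() - start}


# Registry of factories: switching managers is a matter of picking a key.
MANAGERS: Dict[str, Callable[[], MemoryManager]] = {
    "bump": lambda: BumpManager(capacity=1024),
    "free_list": lambda: FreeListManager(capacity=1024, slot=64),
}


def benchmark(rounds: int = 100, size: int = 64) -> dict:
    """Run the same synthetic workload against a fresh instance of each manager."""
    return {name: run_workload(make(), rounds, size) for name, make in MANAGERS.items()}


results = benchmark()
```

Because the workload only talks to the protocol, a new manager can be added to the registry without touching the harness. On this synthetic allocate-then-free pattern the bump manager runs out of capacity while the free-list manager keeps recycling one slot, which is the kind of behavioural difference usage guidelines would need to capture.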
This project is an initiative of the Compute Systems Architecture Unit (CSA). CSA researches emerging workloads and their performance on large-scale supercomputer architectures for next-generation Artificial Intelligence (AI) and high-performance computing (HPC) applications. The team is responsible for algorithm research, runtime management innovations, performance modeling, architecture simulation and prototyping for these future applications and the future systems that will execute them, with the aim of improving performance, energy efficiency and total cost of ownership by multiple orders of magnitude.