The Benchmark Suite
The purpose of AI benchmarking is to
- assess the merits and limitations of various AI solutions,
- rank multi-GPU systems and software platforms, and
- interpret the measurements.
AI benchmarking is currently experiencing intensive growth, characterised by the appearance of numerous benchmark suites. Variety is good, but it makes the selection of suitable benchmarks more challenging, since at present there is no generally accepted set of benchmarks, let alone a consensus within the AI research community or industry.
In BASE-II, we suggest a characterisation approach which describes the resource requirements of AI applications in terms of computations and data movements in order to predict runtime and scalability. The benchmarking work is an ongoing effort that aims to keep pace with the latest developments in GPU chip design.
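As a minimal illustration of this characterisation idea (the parameter names and numbers below are placeholders, not BASE-II's actual model), runtime can be estimated by combining workload parameters with machine parameters:

```python
def predicted_runtime(flops, bytes_moved, comm_bytes, n_gpus,
                      flop_rate, mem_bw, net_bw):
    """Toy runtime model: compute and memory traffic scale with the number
    of GPUs, while collective communication volume does not."""
    compute_time = flops / (flop_rate * n_gpus)
    memory_time = bytes_moved / (mem_bw * n_gpus)
    comm_time = comm_bytes / net_bw
    return compute_time + memory_time + comm_time

# Illustrative values only: 1 PFLOP of work, 10 TB of memory traffic and
# 100 GB of collectives on 8 GPUs rated at 19.5 TFLOP/s, 1.5 TB/s HBM, 200 GB/s network.
t = predicted_runtime(1e15, 1e13, 1e11, 8, 19.5e12, 1.5e12, 2e11)
print(f"predicted runtime: {t:.2f} s")
```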
The current BASE-II Benchmarks repository contains small, medium and large-scale benchmarks, where
- Small benchmarks measure the runtime of frequently used operations and key hardware parameters,
- Medium-sized benchmarks focus on convolutional networks (ResNet, VGG) and transformer models (GPT, BERT, T5), and
- Large-scale benchmarks represent complex applications drawn from various domains such as numerical weather prediction, material science and natural language processing.
Small Benchmarks
Deriving algebraic expressions for predicting the runtime, power usage and scalability of AI applications is an important step of the benchmarking activity. These algebraic expressions combine parameters of the workload and of the multi-GPU system. A collection of simple benchmarks has been developed for measuring key performance characteristics such as the FLOP rate of GPUs and the memory, communication and I/O bandwidths. The performance of frequently used operations is also measured, including matrix multiplication, vector operations and collective communications. The advantage of these benchmarks is that they are simple and portable, and their measurements are easy to interpret.
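As an illustration, a matrix-multiplication microbenchmark of this kind takes only a few lines; this is a sketch assuming PyTorch and a CUDA-capable GPU, not the BASE-II code itself:

```python
import time
import torch

def matmul_tflops(n: int = 8192, iters: int = 20) -> float:
    """Time an n x n FP16 matrix multiplication on the GPU and return TFLOP/s."""
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    for _ in range(3):          # warm-up to exclude kernel-launch and cuBLAS setup costs
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()    # wait for all queued kernels before stopping the clock
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters    # an n x n matmul needs roughly 2 * n^3 operations
    return flops / elapsed / 1e12

if __name__ == "__main__":
    print(f"Measured matmul rate: {matmul_tflops():.1f} TFLOP/s")
```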
Medium-sized Benchmarks
The medium-sized benchmarks represent convolutional (ResNet, VGG) and transformer networks (GPT-2, BERT, T5).
GPT-2 (Generative Pre-trained Transformer 2)
A decoder-only architecture developed by OpenAI and one of the dominant LLMs, used mainly for text generation, completion and understanding. The decoder architecture processes the input text in a unidirectional manner, and the Self-Attention Mechanism (SAM) uses multi-head self-attention to capture both short- and long-term relationships between tokens. GPT-2 is pre-trained with the Causal Language Modelling (CLM) objective, which predicts the next token based on all previous ones. This makes GPT-2 suitable for text generation, sentence completion and summarisation. Various versions of GPT-2 are available, ranging from 117 million to 1.5 billion parameters (https://huggingface.co/openai-community/gpt2).
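For example, the pre-trained checkpoint can be exercised through the Hugging Face pipeline API; the prompt and generation settings below are arbitrary illustrations:

```python
from transformers import pipeline

# Load the smallest (117M-parameter) GPT-2 checkpoint from the Hugging Face Hub.
generator = pipeline("text-generation", model="openai-community/gpt2")

# CLM in action: the model repeatedly predicts the next token given all previous ones.
output = generator("AI benchmarking is", max_new_tokens=30, num_return_sequences=1)
print(output[0]["generated_text"])
```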
BERT
A bidirectional transformer: an encoder-based model trained with the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives (https://huggingface.co/docs/transformers/en/model_doc/bert). BERT is mainly used for text classification, sentiment analysis, question answering and named entity recognition (NER). In contrast to GPT-2's unidirectional decoding, its attention mechanism analyses all tokens in the input sequence simultaneously.
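The MLM objective can be demonstrated with the fill-mask pipeline; a usage sketch, with bert-base-uncased as one standard checkpoint choice:

```python
from transformers import pipeline

# MLM in action: BERT predicts the token hidden behind [MASK] using context
# from both directions at once.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("Benchmarking measures the [MASK] of a system."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")
```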
T5
A versatile sequence-to-sequence model which can handle various NLP tasks such as text classification, summarisation and translation (https://huggingface.co/docs/transformers/en/model_doc/t5). T5 is typically pre-trained on large-scale datasets such as C4 (Colossal Clean Crawled Corpus, about 200B words, 750 GBytes) for tasks such as translation, summarisation and question answering. For fine-tuning T5, the af subset of C4 was used, which contains 2.15 million rows of text (https://huggingface.co/datasets/allenai/c4).
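T5 selects its task through a textual prefix, which the translation pipeline adds automatically; a small usage sketch (t5-small is an illustrative checkpoint choice):

```python
from transformers import pipeline

# T5 casts every task as text-to-text; this pipeline prepends the
# "translate English to German:" prefix before calling the model.
translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("The benchmark suite measures runtime and scalability.")
print(result[0]["translation_text"])
```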
Convolutional Networks
ResNet (Residual Network) is a typical representative of Convolutional Neural Networks (CNNs), frequently used for image processing tasks such as segmentation, classification and object detection. The main idea of ResNet is the introduction of shortcut connections which can skip individual layers or groups of layers, enabling the network to learn the residual mapping much faster. Different versions are available, ranging from ResNet-18 to ResNet-152, with 18 and 152 layers respectively. VGG networks have a simpler, modular architecture; they achieve a higher FLOP rate on GPUs and are often used for object recognition tasks. For benchmarking compute- and memory-intensive AI applications, the VGG16 and VGG19 variants were used.
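The shortcut idea can be sketched in a few lines of PyTorch; this is a simplified basic block that omits the stride and projection variants used in the full ResNet:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """A basic ResNet block: two 3x3 convolutions plus a shortcut connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The shortcut: the block only has to learn the residual F(x) = H(x) - x.
        return torch.relu(out + x)
```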
Large-scale Benchmarks
Numerical Weather Prediction (NWP) - NERSC Benchmark
The NWP benchmark is based on the NERSC application presented at the SC23 tutorial (https://github.com/NERSC/sc23-dl-tutorial). The importance of NWP has been outlined in the latest NOAA report (https://sab.noaa.gov/wp-content/uploads/4.0-DL4NWP_NOAAResponse_Nov2024.pdf). The full dataset (~1 TB) consists of 28 years of ERA5 reanalysis data containing hourly estimates of numerous variables on a 3D grid at 0.25-degree resolution, from the surface to 100 km altitude (https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5). The benchmark uses a smaller dataset, which can be found at https://portal.nersc.gov/project/dasrepo/pharring/sc23_data. The input data is in HDF5 format and split into three categories: training (728 GB), validation (56 GB) and testing (28 GB). This application provides better forecast accuracy for surface wind speed and precipitation than traditional weather prediction methods. Testing of the code has been performed on the Polaris (ANL), Summit (ORNL) and Crusher (ORNL) machines. The application scales well; 256 V100 GPUs were used on Summit. The best performance was achieved on Polaris, which uses NVIDIA A100 GPUs. The Crusher machine uses AMD MI250 GPUs; its results were similar to the runtimes achieved on Summit.
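The HDF5 files can be inspected with h5py before running the benchmark; the file name and dataset key below are assumptions about the archive layout, so verify them against the downloaded data:

```python
import h5py

# Hypothetical file name and dataset key: the actual layout of the
# sc23_data archive may differ, so adjust after listing the contents.
with h5py.File("train/1990.h5", "r") as f:
    f.visit(print)                      # list every group/dataset in the file
    fields = f["fields"]                # assumed key for the ERA5 variable stack
    print(fields.shape, fields.dtype)   # e.g. (hours, channels, lat, lon)
```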
DeepCam
A climate modelling application developed by Lawrence Berkeley National Laboratory; a modified version has been included in the set of MLPerf benchmarks (https://github.com/mlcommons/hpc/tree/main/deepcam). The application uses a convolutional network model, and the training dataset is generated using CAM5 simulations. The small training dataset is 650 MBytes and contains 1537 samples. The shape of each input image sample is (768, 1152, 16), and there are three output labels representing background, atmospheric river and tropical cyclone. The code has been profiled to determine the volume of data movement and the computational load. Using the application and multi-GPU system parameters, a prediction model was derived which indicates that this application is compute bound. Testing has been performed on an NVIDIA DGX-2 system with A100 GPUs.
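The compute-bound classification follows roofline-style reasoning: compare the application's arithmetic intensity with the machine balance. The sketch below uses illustrative placeholder numbers, not the profiled DeepCam values:

```python
# Roofline-style classification: an application is compute bound when its
# arithmetic intensity exceeds the machine balance (peak FLOP/s / peak bytes/s).
# All numbers below are illustrative placeholders, not profiled BASE-II values.
flops_per_step = 4.0e12          # floating point operations per training step
bytes_per_step = 2.0e11          # bytes moved to/from GPU memory per step

peak_flops = 19.5e12             # e.g. A100 FP32 peak, FLOP/s
peak_bandwidth = 1.55e12         # e.g. A100 HBM2 bandwidth, bytes/s

intensity = flops_per_step / bytes_per_step   # FLOP per byte the workload performs
balance = peak_flops / peak_bandwidth         # FLOP per byte the machine can sustain

runtime = max(flops_per_step / peak_flops, bytes_per_step / peak_bandwidth)
print(f"intensity = {intensity:.1f} FLOP/byte, balance = {balance:.1f} FLOP/byte")
print(f"{'compute' if intensity > balance else 'memory'} bound, "
      f"predicted step time = {runtime * 1e3:.1f} ms")
```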
OLMo
An open-source, transformer-based large language model widely used in academic research (https://huggingface.co/allenai/OLMo-7B). Models of various sizes are available, from 1 to 7 billion parameters. Training used the Dolma dataset, which contains 2.45 trillion tokens (https://huggingface.co/datasets/allenai/dolma). The 1 and 7 billion parameter OLMo versions were tested on an NVIDIA DGX-2 system.
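A usage sketch for loading the checkpoint; note that the packaging details depend on the transformers version, so treat the checkpoint name below as an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "allenai/OLMo-7B-hf" is the transformers-native packaging of the checkpoint
# referenced above; the original "allenai/OLMo-7B" repo instead requires the
# ai2-olmo package. Verify which variant matches your installed transformers.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B-hf")

inputs = tokenizer("Large-scale benchmarking requires", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```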
Megatron
Training LLMs presents many challenges with respect to data distribution, orchestration of computations and scalability. Megatron-LM, developed by NVIDIA, is an environment for large-scale training of LLMs ranging from 1B to 1T parameters (https://github.com/NVIDIA/Megatron-LM). Megatron supports three types of parallelism: Data, Tensor (intra-layer) and Pipeline (inter-layer). Data parallelism is used for small models which can fit into a single GPU's memory: the model is replicated on the GPUs, and each GPU processes a different portion of the data. During backpropagation the gradients are aggregated on the master node and averaged across GPUs. With tensor parallelism, a single layer of the model is split across GPUs. This is an example of horizontal slicing of the network (intra-layer parallelism), where GPUs collaborate to compute the outputs of a single layer. With pipeline (inter-layer) parallelism, the layers of the network are grouped and assigned to different GPUs, and the data flows across the GPUs. This is an example of vertical slicing of the network. For benchmarking purposes, Megatron-LM was used for training GPT-2 models of different sizes (345M and 1.5B parameters). The application was tested on an NVIDIA DGX-2 using 16 GPUs, and all three types of parallelism were exercised.
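A minimal NumPy sketch of the tensor-parallel idea, mimicking two GPUs that split one linear layer column-wise (Megatron's real implementation uses torch.distributed with NCCL collectives):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 6))        # weight matrix of one linear layer

# Tensor (intra-layer) parallelism: split W column-wise across two "GPUs".
W0, W1 = W[:, :3], W[:, 3:]
y0 = x @ W0                            # partial output computed on GPU 0
y1 = x @ W1                            # partial output computed on GPU 1

# An all-gather of the partial outputs reproduces the full layer output.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)
```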
EquiformerV2
A GNN (Graph Neural Network) used for modelling atomic structures, developed as part of the Open Catalyst challenge (https://github.com/atomicarchitects/equiformer_v2). The model has 153 million parameters. Two input datasets were used for training, OC20 and OC22, with 130 million data points representing 3D atomic coordinates (https://fair-chem.github.io/core/datasets/oc22.html). EquiformerV2 was ported to Polaris (ANL), and its scalability was tested on 256 A100 GPUs.
