Graphs, potentially vast webs of nodes connected by edges, can be used to express and probe relationships among many kinds of data, including molecular interactions, traffic patterns, financial transactions, and social connections. As more data are collected and these graphs are built out, researchers will need faster and more efficient techniques, as well as more computing power, to perform deep learning on them using graph neural networks (GNNs).
Researchers at MIT and IBM Research have now devised a technique dubbed SALIENT (SAmpling, sLIcing, and data movemeNT) that improves training and inference performance by addressing three key computational bottlenecks. This dramatically cuts the runtime of GNNs on large datasets, such as those containing 100 million nodes and 1 billion edges. The researchers also found that the method scales well as computational power grows from one to 16 graphics processing units (GPUs). The work was presented at the Fifth Conference on Machine Learning and Systems.
“We began to investigate the difficulties that existing systems encountered when attempting to scale cutting-edge machine learning for graphs techniques to extremely large datasets,” says Tim Kaler, the lead author and a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). “Many of the prior systems were getting good performance primarily on smaller datasets that fit into GPU memory.”
When experts talk about enormous datasets, they mean scales like the entire Bitcoin network, where certain patterns and data relationships could indicate trends or criminal activity. “There are around a billion Bitcoin transactions on the blockchain, and if we want to identify unlawful activities inside such a cooperative network, then we are facing a graph of such a scale,” says co-author Jie Chen, senior research scientist and manager at IBM Research and the MIT-IBM Watson AI Lab. In order to keep up with the rate at which new data are generated, “we want to build a system that is able to handle that kind of graph and allows processing to be as efficient as feasible.”
Co-authors also include Nickolas Stathas MEng ’21 of Jump Trading, who developed SALIENT as part of his graduate work; Anne Ouyang, a former MIT-IBM Watson AI Lab intern; Alexandros-Stavros Iliopoulos; Tao B. Schardl; and Charles E. Leiserson, the Edwin Sibley Webster Professor of Electrical Engineering at MIT and a researcher with the MIT-IBM Watson AI Lab.
According to Kaler, the team took a systems-oriented approach to the problem in developing SALIENT. To do so, the researchers implemented what they saw as crucial, fundamental optimizations of components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL), which are interfaces for building a machine-learning model. Stathas likens the process to swapping out engines to build a faster car. Their approach was designed to slot into existing GNN architectures, so that domain experts can readily apply it in their own fields to speed up model training and extract insights from inference more quickly. The trick, the team says, is to keep all of the hardware (CPUs, data links, and GPUs) busy at all times: while the CPU samples the graph and prepares mini-batches of data to send over the data link, the more critical GPU is working to train the machine-learning model or perform inference.
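That producer-consumer idea can be illustrated with a short sketch, not SALIENT's actual code: a CPU thread keeps sampling and preparing the next mini-batch while the GPU trains on the current one. Here `sample_minibatch`, `model`, and `loss_fn` are illustrative placeholders, and each batch is assumed to be a (features, labels) pair of CPU tensors.

```python
import queue
import threading
import torch

def cpu_producer(sample_minibatch, num_batches, out_queue):
    # Run graph sampling and mini-batch preparation on the CPU,
    # handing finished batches to the GPU loop as soon as they are ready.
    for step in range(num_batches):
        out_queue.put(sample_minibatch(step))  # blocks if the GPU falls behind
    out_queue.put(None)                        # sentinel: no more work

def gpu_consumer(model, optimizer, loss_fn, in_queue, device="cuda"):
    # Keep the GPU busy: train on each batch the moment it arrives.
    while True:
        batch = in_queue.get()
        if batch is None:
            break
        x, y = (t.to(device, non_blocking=True) for t in batch)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

def train_epoch(model, optimizer, loss_fn, sample_minibatch, num_batches):
    q = queue.Queue(maxsize=2)  # small buffer bounds memory while allowing overlap
    producer = threading.Thread(target=cpu_producer,
                                args=(sample_minibatch, num_batches, q))
    producer.start()
    gpu_consumer(model, optimizer, loss_fn, q)
    producer.join()
```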
The researchers began by examining the performance of PyTorch Geometric, a popular machine-learning framework for GNNs, and found strikingly low utilisation of available GPU resources. With straightforward modifications, they increased GPU utilisation from 10 to 30 percent, yielding a 1.4 to two times performance improvement over open-source benchmark codes. This fast baseline implementation could complete one full pass over a large training dataset (one epoch) in 50.4 seconds.
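As a rough illustration of that kind of measurement, one can time a full epoch and sample GPU utilisation with PyTorch's built-in helper; `train_one_epoch` below is a placeholder for whatever training loop is being profiled, and `torch.cuda.utilization()` requires the `pynvml` package.

```python
import time
import torch

def timed_epoch(train_one_epoch):
    # Measure wall-clock time for one epoch, forcing the GPU to finish all queued
    # work before and after so the timing is not skewed by in-flight kernels.
    torch.cuda.synchronize()
    start = time.perf_counter()
    train_one_epoch()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    util = torch.cuda.utilization()  # instantaneous GPU utilisation in percent
    print(f"epoch time: {elapsed:.1f} s, GPU utilisation: {util}%")
    return elapsed
```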
Seeking further performance gains, the researchers then looked at the bottlenecks at the start of the data pipeline: the algorithms for graph sampling and mini-batch preparation. Unlike conventional neural networks, GNNs perform a neighbourhood aggregation operation, which computes information about a node using data from other nearby nodes in the graph, such as data from friends of friends of a user in a social network graph. As the number of layers in the GNN grows, the number of nodes the network must reach out to for information can explode, exceeding the limits of a computer. Neighbourhood sampling algorithms help by selecting a smaller random subset of nodes to gather, but the researchers found that current implementations were too slow to keep up with the processing power of modern GPUs. In response, they identified a mix of data structures, algorithmic optimizations, and other techniques that improved sampling speed, ultimately making the sampling operation alone about three times faster and cutting the per-epoch runtime from 50.4 to 34.6 seconds. The team also found, a point they say had been overlooked in the literature, that sampling can be performed during inference at an appropriate pace as well, improving overall energy efficiency and performance.
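In PyTorch Geometric, the framework named above, this kind of fanout-limited neighbourhood sampling is exposed through `NeighborLoader`. The dataset and fanout values below are illustrative, not the settings used in the paper.

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

dataset = Planetoid(root="data/Cora", name="Cora")  # a small example graph
data = dataset[0]

train_loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],       # fanout: 15 neighbors at hop 1, 10 at hop 2
    batch_size=128,               # number of target ("seed") nodes per mini-batch
    input_nodes=data.train_mask,  # sample neighborhoods around training nodes
    shuffle=True,
)

for batch in train_loader:
    # Each batch contains the seed nodes plus their sampled multi-hop neighborhood;
    # the first batch.batch_size nodes are the seeds whose labels are predicted.
    print(batch.num_nodes, batch.batch_size)
    break
```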
Earlier systems used a multi-process approach for this sampling stage, which incurred extra overhead and unnecessary data movement between processes. The researchers made their SALIENT technique more flexible by creating a single process with lightweight threads that keep the data on the CPU in shared memory. Additionally, Stathas says, SALIENT exploits the caches of modern processors by parallelizing feature slicing, which extracts the relevant data from nodes of interest and their surrounding neighbours and edges, within the shared memory of the CPU core cache. This again reduced the overall per-epoch runtime, from 34.6 to 27.8 seconds.
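The feature-slicing step can be sketched in a few lines, assuming a large CPU-resident feature matrix: gather the rows for the sampled nodes into a pinned staging buffer so the later copy to the GPU can run asynchronously. This is an illustration of the idea, not SALIENT's implementation.

```python
import torch

num_nodes, feat_dim = 1_000_000, 128
features = torch.randn(num_nodes, feat_dim)
features.share_memory_()  # keep one copy visible to all CPU worker threads

def slice_features(node_ids: torch.Tensor) -> torch.Tensor:
    # Gather the feature rows of the sampled nodes into pinned (page-locked)
    # memory; index_select runs multi-threaded inside ATen across CPU cores.
    out = torch.empty(node_ids.numel(), feat_dim, pin_memory=True)
    torch.index_select(features, 0, node_ids, out=out)
    return out

batch_feats = slice_features(torch.randint(0, num_nodes, (10_000,)))
if torch.cuda.is_available():
    # Copies from pinned memory can overlap with other work on the GPU.
    gpu_feats = batch_feats.to("cuda", non_blocking=True)
```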
To address the final bottleneck, mini-batch data transfer between the CPU and GPU, the researchers pipelined the transfers with a prefetching step that prepares data just before it is needed. They predicted that this would use all of the available bandwidth on the data link and push the method to 100 percent utilisation, but they observed only about 90 percent. They traced the gap to a performance bug in a well-known PyTorch module that caused unnecessary round-trip communications between the CPU and GPU. With this bug fixed, the team achieved a per-epoch runtime of 16.5 seconds with SALIENT.
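The prefetching idea can be sketched with a side CUDA stream that copies the next mini-batch while the default stream trains on the current one. In this simplified illustration, which is not SALIENT's pipeline, `batches` is assumed to yield pinned CPU tensors.

```python
import torch

def prefetched(batches, device="cuda"):
    # Yield GPU-resident batches while the next batch is already being copied
    # on a separate stream, overlapping data transfer with training compute.
    copy_stream = torch.cuda.Stream(device=device)
    ready = None
    for cpu_batch in batches:
        with torch.cuda.stream(copy_stream):
            staged = cpu_batch.to(device, non_blocking=True)
        if ready is not None:
            yield ready  # caller trains on this batch while `staged` keeps copying
        # Work on the default stream must wait for the copy to finish, and the
        # allocator must know the tensor will be consumed on the default stream.
        torch.cuda.current_stream(device).wait_stream(copy_stream)
        staged.record_stream(torch.cuda.current_stream(device))
        ready = staged
    if ready is not None:
        yield ready
```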
“Our investigation demonstrated that the devil is in the details,” says Kaler. “When you pay close attention to the details that affect performance when training a graph neural network, you can resolve a huge number of performance problems. With our solutions, we ended up completely bottlenecked by GPU computation, which is the ideal outcome for such a system.”
SALIENT's speed was evaluated on three standard datasets, ogbn-arxiv, ogbn-products, and ogbn-papers100M, as well as in multi-machine settings, with different levels of fanout (the amount of data that the CPU would prepare for the GPU), and across several architectures, including the most recent state-of-the-art one, GraphSAGE-RI. In every configuration, SALIENT outperformed PyTorch Geometric, most notably on the large ogbn-papers100M dataset, which contains 100 million nodes and over a billion edges. On a single GPU it ran three times faster than the optimised baseline developed for this work; with 16 GPUs, SALIENT was an additional eight times faster.
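For reference, these benchmark graphs come from the Open Graph Benchmark and can be loaded directly in PyTorch Geometric; the fanout below is illustrative and not the exact configuration from the evaluation.

```python
from ogb.nodeproppred import PygNodePropPredDataset
from torch_geometric.loader import NeighborLoader

# ogbn-arxiv is the smallest of the three; ogbn-papers100M uses the same API
# but requires far more memory and disk space.
dataset = PygNodePropPredDataset(name="ogbn-arxiv", root="data/OGB")
data = dataset[0]
split = dataset.get_idx_split()

train_loader = NeighborLoader(
    data,
    num_neighbors=[15, 10, 5],   # per-layer fanout for a three-layer GNN
    input_nodes=split["train"],
    batch_size=1024,
    shuffle=True,
)
```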
Other systems had slightly different hardware and experimental setups, so they weren't always directly comparable, but SALIENT still outperformed them. Among systems that achieved similar accuracy, representative runtimes include 99 seconds using one GPU and 32 CPUs, and 13 seconds using 1,536 CPUs. By contrast, SALIENT ran in 16.5 seconds with one GPU and 20 CPUs, and in just two seconds with 16 GPUs and 320 CPUs. “Our 16-GPU runtime (two seconds) is an order of magnitude faster than other numbers that have been reported previously on this dataset, if you look at the bottom-line values that past work reports,” says Kaler. The researchers attributed their performance gains in part to their approach of optimising the code for a single machine before moving to the distributed setting. Stathas says the lesson is that, for your money, it makes more sense to use the hardware you already have as efficiently as possible before scaling out to additional machines, which can bring significant savings in cost and in the carbon emissions associated with model training.
With this new capability, researchers will be able to tackle and dig deeper into ever-larger graphs. The Bitcoin network, for instance, contained 100,000 nodes; the SALIENT system can capably handle a graph 1,000 times (or three orders of magnitude) larger.
In the future, Chen says, “we would be looking at not just using this graph neural network training system on the existing algorithms that we implemented for classifying or predicting the properties of each node, but also doing more in-depth tasks, such as identifying common patterns in a graph (subgraph patterns), [which] may be actually interesting for indicating financial crimes. We also want to find nodes in a graph that are similar to one another, in the sense that they might correspond to the same criminal in a financial crime. These tasks would require developing additional algorithms, and perhaps also new neural network architectures.”