Why do I have a heavy DeserializeSparse phase after EagerKernelExecutes in multi-GPU training?

0 votes

I'm trying to train a small TF 2.x model on 4 GPUs (AWS g4dn.12xlarge) that takes both dense and sparse tensors as its input. With only dense features, my distributed training code worked well without any performance degradation. After adding the sparse features, however, I found numerous unexpected chunks in the TensorBoard Profiler's trace_viewer. I have attached a profiler screenshot.

The main problem is that, although all the GPUs seem to compute their batches fine, there is a large gap between each pair of computation blocks on the host side. There are 17x4 EagerExecute:DeserializeSparse ops, each ending with a terminal op of _Send input 0 from /job:localhost/replica:0/task:0/device:GPU:{gpu_number} to /job:localhost/replica:0/task:0/device:CPU:0. Here, 17 is the number of sparse features the model receives, and 4 is the number of GPUs in use. In addition, a large number of MemcpyD2H ops (the small pink blocks in the screenshot) occupy each GPU and are not parallelized. That gap is about 6x as long as the actual forward pass.

Below is how the model treats sparse tensor inputs:

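def call(self, inputs: tf.sparse.SparseTensor):
  # Keep the sparse-specific ops (hashing and embedding lookup) on the host CPU.
  with tf.device("/cpu:0"):
    x = self.hash_inputs_from_static_hash_table(inputs)
    x = self.embedding_lookup_sparse(x)
  # The prediction head runs on whatever device the replica was assigned.
  return self.prediction_head(x)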

The data is never large (the batch size is 128 per replica, and the sparse feature embedding dimension is below 10). I tried moving all sparse-related operations to the CPU so as not to burden the GPUs, but the problem persists exactly as it did before I moved those ops manually.

I want to know why those chunks appear after the GPU computations, and hopefully remove them to fully benefit from distributed training with multiple GPUs.

It seems I'm still missing something that can be optimized, and this situation might not be that unique in distributed training, so I'm asking a broader audience for help.

Feb 16, 2023 in AWS by sarit
• 1,830 points

1 answer to this question.

0 votes

The heavy "DeserializeSparse" phase after the "EagerKernelExecutes" in multi-GPU training is likely caused by the serialization and deserialization of sparse tensor data as it moves between the GPUs and the CPU. In distributed training, data parallelism splits each batch across the GPUs, and each GPU computes its own part of the batch. Sparse tensors, however, travel in a serialized form, and that form has to be deserialized on the CPU before the sparse ops can use it. The _Send ops from each GPU to CPU:0 in your trace are consistent with this: the serialized data appears to be sitting on the device and has to be copied back to the host first.

The large gap between each pair of computation blocks on the host side is most likely this deserialization work. Each of the 17x4 "DeserializeSparse" calls is a separate op with its own device-to-host copy, so the fixed per-op overhead can add up and become a bottleneck even when the data itself is small. Moreover, the large number of "MemcpyD2H" operations suggests that the transfers between the GPUs and the CPU run one after another rather than in parallel, which makes the deserialization phase even longer.

To optimize the performance of your distributed training code, you can try the following:

  1. Use the TensorFlow Dataset API to create input pipelines that can preprocess the data and batch it efficiently before it is fed to the model. This can help reduce the amount of data that needs to be serialized and deserialized during training.

  2. Consider using an optimizer that applies sparse updates, such as "Adagrad". Gradients of embedding lookups are represented as "tf.IndexedSlices", so only the embedding rows that were actually used get updated. This can help reduce the memory footprint and improve performance.

  3. If you are not already doing so, use one of TensorFlow's distributed training strategies, such as the "MirroredStrategy" or "ParameterServerStrategy", which provide built-in support for data parallelism and manage the data transfer between the GPUs and the CPU.

  4. Consider using mixed precision training, which can help reduce the memory footprint and speed up training.

  5. Use profiling tools, such as TensorBoard Profiler, to identify performance bottlenecks and optimize your code accordingly.

By applying these optimizations, you should be able to improve the performance of your distributed training code and reduce the heavy "DeserializeSparse" phase after the "EagerKernelExecutes".
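
To make the profiling suggestion above more concrete, here is a minimal sketch for checking whether a change actually shrinks the "DeserializeSparse" phase. It assumes the trace has been exported in Chrome trace event format (a JSON file, optionally gzipped, with "X" events that carry a name and a duration in microseconds), which is what the trace viewer appears to consume. The function names and the synthetic events are made up for illustration; the real event names in your trace may differ.

```python
import gzip
import json
from collections import defaultdict


def load_trace_events(path):
    """Read a Chrome-trace-format file (.json or .json.gz) and return its event list."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # The format allows either {"traceEvents": [...]} or a bare list of events.
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize(events, keywords=("DeserializeSparse", "MemcpyD2H")):
    """Count complete ("X") events and sum their durations (microseconds) per keyword."""
    stats = defaultdict(lambda: {"count": 0, "total_us": 0.0})
    for ev in events:
        if ev.get("ph") != "X":
            continue  # skip metadata and other non-duration events
        name = ev.get("name", "")
        for kw in keywords:
            if kw in name:
                stats[kw]["count"] += 1
                stats[kw]["total_us"] += float(ev.get("dur", 0))
    return dict(stats)


# Synthetic events standing in for a real trace: 17 sparse features x 4 GPUs.
fake_events = [
    {"ph": "X", "name": "EagerExecute:DeserializeSparse", "ts": i * 100, "dur": 50}
    for i in range(17 * 4)
]
fake_events.append({"ph": "X", "name": "MemcpyD2H", "ts": 0, "dur": 5})
fake_events.append({"ph": "M", "name": "process_name"})  # metadata event, ignored

summary = summarize(fake_events)
for keyword, s in summary.items():
    print(keyword, s["count"], s["total_us"])
```

With a real trace you would call summarize(load_trace_events(path)) before and after each change and compare the counts and total durations. For the setup in the question, a count of 68 (17x4) "DeserializeSparse" events per step would match what the trace viewer shows.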


answered Feb 17, 2023 by anonymous
