As I understand it, the "backbone" is the feature-extraction network in the DeepLab architecture. It encodes the network's input into a feature representation. The rest of the DeepLab framework is "wrapped" around this feature extractor. Because of that separation, the feature extractor can be swapped out, and you can pick the model that best suits the task at hand in terms of accuracy, efficiency, and so on.
In the context of DeepLab, the term "backbone" refers to models such as ResNet, Xception, MobileNet, and so on.
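To make the "swappable feature extractor" idea concrete, here is a minimal, dependency-free sketch. It is not DeepLab's actual code: `small_backbone`, `large_backbone`, and `SegmentationModel` are hypothetical names invented for illustration, and the "head" is a trivial stand-in for the real decoder. It only shows the pattern of a wrapper that stays the same while the backbone changes.

```python
from dataclasses import dataclass
from typing import Callable, List

# In this toy sketch, a "backbone" is any function that maps an input
# to a feature representation (here just a list of floats).
Backbone = Callable[[List[float]], List[float]]


def small_backbone(x: List[float]) -> List[float]:
    # Hypothetical cheap extractor: two features (think "MobileNet-like").
    return [sum(x), max(x)]


def large_backbone(x: List[float]) -> List[float]:
    # Hypothetical richer extractor: four features (think "ResNet-like").
    return [sum(x), max(x), min(x), sum(v * v for v in x)]


@dataclass
class SegmentationModel:
    # The wrapper only depends on the backbone's interface, not on which one it is.
    backbone: Backbone

    def predict(self, x: List[float]) -> int:
        features = self.backbone(x)  # encode the input into features
        # Stand-in for the real head: threshold the mean feature value.
        score = sum(features) / len(features)
        return 1 if score > 0 else 0


# Same wrapper, different backbones.
fast_model = SegmentationModel(backbone=small_backbone)
accurate_model = SegmentationModel(backbone=large_backbone)
```

Choosing between `fast_model` and `accurate_model` here mirrors choosing between, say, a MobileNet and a ResNet backbone: the surrounding code does not change, only the cost and the quality of the features do.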