Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Quantization, Parallelism Techniques, and Mix-Match Runtime Communication