The student model weighed 48 MB. After training for a couple of weeks on a single P100 GPU, we got some promising results. We converted the model into CoreML format and reduced the precision to FP16 (bringing the size down to 24 MB), and found negligible change in its performance compared to the FP32 model.
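The drop from 48 MB to 24 MB is what storage width alone predicts: FP32 uses 4 bytes per weight and FP16 uses 2. Below is a rough sketch of that arithmetic. It assumes the file size is dominated by weights and that 1 MB means 10**6 bytes, neither of which the source states. The function names are illustrative only.

```python
# Storage width per weight, in bytes.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2}


def estimate_param_count(size_mb, dtype="fp32"):
    """Approximate number of weights from a file size in MB (assumes 1 MB = 10**6 bytes)."""
    return int(size_mb * 10**6 / BYTES_PER_WEIGHT[dtype])


def estimate_size_mb(param_count, dtype):
    """Approximate file size in MB for a given number of weights and precision."""
    return param_count * BYTES_PER_WEIGHT[dtype] / 10**6


# 48 MB of FP32 weights implies roughly this many parameters ...
params = estimate_param_count(48, "fp32")
# ... which take half the space when stored as FP16.
fp16_mb = estimate_size_mb(params, "fp16")
print(params, fp16_mb)
```

Under these assumptions the 48 MB model corresponds to about 12 million weights, which is consistent with the 24 MB figure reported for FP16.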
Average inference speed is surprisingly fast on our T4s, around 5 s for 50 tokens. Will be trying with a V100 and a Quadro 8000 (full precision model) tomorrow. To fit the model on GPUs with less than roughly 24 GB of memory, the model in the demo and notebook is loaded in half precision in torch.

And thus we end up with 6 bytes per model parameter for mixed precision inference, plus activation memory. Let’s look at the details. Model Weights: 4 bytes * number of parameters for fp32 training; 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory). Optimizer States:
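The per-parameter figures quoted for model weights can be turned into a quick estimate. The sketch below covers only the weight term (4 bytes per parameter for fp32, 6 for mixed precision). Optimizer states, gradients and activation memory are deliberately left out because the excerpt does not give their sizes. The function names and the one-billion-parameter example are hypothetical.

```python
# Bytes of weight storage per parameter, as quoted in the text above.
BYTES_PER_PARAM = {
    "fp32": 4,   # a single fp32 copy of the weights
    "mixed": 6,  # one fp32 copy plus one fp16 copy
}


def weight_memory_bytes(num_params, mode):
    """Memory taken by model weights only, in bytes."""
    return num_params * BYTES_PER_PARAM[mode]


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical model size, chosen only to illustrate the arithmetic.
num_params = 10**9
for mode in ("fp32", "mixed"):
    print(mode, round(to_gib(weight_memory_bytes(num_params, mode)), 2), "GiB")
```

Because mixed precision keeps both copies, its weight footprint is 1.5 times that of plain fp32, not smaller.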

April 4, 2020 · Deep learning neural network models are available in multiple floating point precisions. For the Intel® OpenVINO™ toolkit, both FP16 (half) and FP32 (single) are generally available for pre-trained and public models. This article explores these floating point representations in more detail, and answers questions such as which precisions are compatible.
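To make the difference between the two representations concrete, the following sketch round-trips a value through IEEE 754 half (FP16) and single (FP32) precision using Python's standard `struct` module. It is independent of OpenVINO and only illustrates the number formats themselves. The helper name `round_trip` is made up for this example.

```python
import struct


def round_trip(value, fmt):
    """Pack a Python float into the given IEEE 754 format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


x = 0.1
as_fp16 = round_trip(x, "<e")  # "e" is half precision, 2 bytes
as_fp32 = round_trip(x, "<f")  # "f" is single precision, 4 bytes

print("bytes:", struct.calcsize("<e"), "vs", struct.calcsize("<f"))
print("fp16 error:", abs(as_fp16 - x))
print("fp32 error:", abs(as_fp32 - x))

# 65504 is the largest finite FP16 value and survives the round trip exactly.
largest_fp16 = round_trip(65504.0, "<e")
```

FP16 halves the storage per value at the cost of a larger rounding error and a much narrower range, which is the trade-off behind choosing between the two precisions.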

