Custom FP4 CUDA Kernel – 129 TFLOPS on DGX Spark with Pre-Quantized Weight Cache

(forums.developer.nvidia.com)

1 point | by vkaufmann 6 hours ago

1 comment