Vector Floating-Point Unit (VFPU) Design Document - V2R2 (2025/01/20)

Glossary

- VFPU: Vector Floating-Point Unit
- IQ: Issue Queue

Design Specifications

- Supports vector floating-point multiplication, fused multiply-add (FMA), division, and square root
- Supports FP32, FP64, and FP16 computations
- Compatible with the RISC-V Vector (RVV) 1.0 vector floating-point instructions

Function Overview

The VFPU processes vector floating-point instructions issued from the Issue Queue, dispatching them by fuType and fuOpType. It comprises four main modules:

- VFAlu: handles floating-point add/sub and simple instructions such as comparisons, sign injection, and reduction sums (via micro-ops).
- VFMA: handles multiplication and FMA instructions.
- VFDivSqrt: handles division and square-root instructions.
- VFCvt: handles format conversions and reciprocal estimations.

Algorithm Design

Challenges

- Support multiple precisions (f16, f32, f64) simultaneously.
- Support the mixed-precision (widening) computations defined by the RISC-V Vector ISA, such as f64 = f64 + f32.
- Maintain high bandwidth utilization by performing multiple lower-width operations in parallel (e.g., 4 × f16 operations simultaneously).

Floating-Point Addition Algorithms

- Single-path addition: the traditional serial process; requires 2-3 signed additions and is slower.
- Dual-path addition: parallelizes the steps to shorten the critical path; swaps the operands to avoid negative intermediate results and to unify the rounding/conversion paths.
- Improved dual-path addition: further separates a far path (exponent difference d > 1, or effective addition) from a close path (d ≤ 1 and effective subtraction), with optimized shifting and rounding logic; a behavioral sketch of the path selection appears at the end of this section.

Vector Floating-Point Addition Support

Single-width (non-widening) formats:

- 1 × f64 addition
- 2 × f32 additions simultaneously
- 4 × f16 additions simultaneously

Mixed-precision (widening) formats:

- f64 = f64 + f32
- f64 = f32 + f32
- 2 × f32 = f32 + f16
- 2 × f32 = f16 + f16

Fast Format Conversion Algorithm (e.g., f16 → f32)

- For normalized numbers, bias adjustment and zero padding are used for fast exponent/significand conversion.
- For denormals, leading-zero detection (LZD) and a priority left shift are applied.
- Implements an efficient LZD using priority encoders, with the left shift and exponent generation integrated (see the conversion sketch at the end of this section).

Fused Multiply-Add (FMA) Algorithm

Computes fp_a × fp_b + fp_c with a single rounding at the end. Supported vector configurations:

- 1 × f64 FMA
- 2 × f32 FMA
- 4 × f16 FMA
- Mixed precision (e.g., 2 × f32 = f16 × f16 + f32)

Key implementation points:

- Uses radix-4 Booth encoding of the multiplier to reduce the number of partial products (see the recoding sketch at the end of this section).
- Employs carry-save adder (CSA) compression for the partial products.
- Uses efficient exponent processing and right-shift operand alignment.
- Integrates sign handling, leading-zero detection, rounding, and exception-flag generation.

Division Algorithm

- Uses digit-recurrence (SRT) division at radix 64 (3 radix-4 iterations per cycle), producing 6 quotient bits per cycle; a simplified recurrence model appears at the end of this section.
- The divisor is prescaled to approximately 1 so that quotient-digit selection is independent of the divisor value.
- Handles normalized, denormal, and exceptional inputs (e.g., NaN, infinity).
- Early termination is implemented in the scalar divider but disabled in the vector VFPU to keep the lanes' outputs synchronous.
- Latency varies with precision and input values; the vector divider's latency is fixed to the worst case.
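To make the far/close split concrete, the following is a minimal behavioral sketch (not the RTL) of the path-selection rule described for the improved dual-path adder. The function name and signature are illustrative; sub_op stands for the effective add/sub opcode bit.

```python
def select_add_path(sign_a: int, sign_b: int, exp_a: int, exp_b: int, sub_op: int) -> str:
    """Path selection for the improved dual-path adder (behavioral sketch).

    Effective subtraction occurs when the operand signs differ after folding
    in the add/sub opcode.  The close path covers d <= 1 with effective
    subtraction, where massive cancellation may require a large left
    normalization shift (LZD); the far path covers everything else, where the
    alignment shift may be large but the normalization shift is at most 1 bit.
    """
    effective_sub = (sign_a ^ sign_b ^ sub_op) == 1
    d = abs(exp_a - exp_b)                 # exponent difference
    if effective_sub and d <= 1:
        return "close"
    return "far"                           # d > 1, or effective addition
```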
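The fast f16 → f32 conversion can be modeled in a few lines. Below is a bit-level reference sketch, assuming standard IEEE-754 half/single encodings (f16: 5-bit exponent, bias 15; f32: 8-bit exponent, bias 127); NaN payload handling is simplified.

```python
import struct

def f16_to_f32_bits(h: int) -> int:
    """Bit-level reference sketch of the fast f16 -> f32 conversion.

    Normal inputs: re-bias the exponent (+112) and zero-pad the significand.
    Denormal inputs: leading-zero detection (LZD) plus a priority left shift,
    with the output exponent generated directly from the shift amount.
    NaN payloads are passed through unchanged (simplified).
    """
    sign = (h >> 15) & 0x1
    exp  = (h >> 10) & 0x1F
    frac = h & 0x3FF

    if exp == 0x1F:                         # Inf / NaN
        f32_exp, f32_frac = 0xFF, frac << 13
    elif exp != 0:                          # normal: bias adjust + zero pad
        f32_exp, f32_frac = exp + 112, frac << 13
    elif frac == 0:                         # signed zero
        f32_exp, f32_frac = 0, 0
    else:                                   # denormal: LZD + priority shift
        lz = 10 - frac.bit_length()         # leading zeros in the 10-bit field
        f32_exp  = 112 - lz                 # exponent derived from shift amount
        f32_frac = ((frac << (lz + 1)) & 0x3FF) << 13
    return (sign << 31) | (f32_exp << 23) | f32_frac

# Example: the smallest f16 denormal (2**-24) converts to a normal f32.
assert struct.unpack('>f', struct.pack('>I', f16_to_f32_bits(0x0001)))[0] == 2.0 ** -24
```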
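The partial-product reduction in the FMA multiplier relies on radix-4 (modified) Booth recoding. The sketch below is a software model rather than the RTL encoder: it recodes an unsigned significand into digits in {-2, -1, 0, +1, +2}, where each digit selects one partial product, roughly halving the partial-product count fed into the CSA tree.

```python
def booth_radix4_digits(y, width):
    """Radix-4 (modified) Booth recoding of an unsigned `width`-bit multiplier.

    Scans overlapping 3-bit windows of (y with a zero appended on the right)
    and maps each window to a digit in {-2, -1, 0, +1, +2}.  One extra digit
    group covers the unsigned MSB so the recoding is exact:
        y == sum(d * 4**i for i, d in enumerate(digits)).
    """
    table = {0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    y2 = y << 1                              # append the implicit 0 bit y_{-1}
    digits = []
    for i in range(0, width + 2, 2):         # one digit per 2 multiplier bits
        digits.append(table[(y2 >> i) & 0b111])
    return digits

# Example: recode an 11-bit f16 significand (implicit 1 + 10 fraction bits).
y = 0b11010011011
assert sum(d * 4 ** i for i, d in enumerate(booth_radix4_digits(y, 11))) == y
```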
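The division recurrence can be illustrated with a simplified software model. The sketch below keeps the structure described above (3 radix-4 iterations per cycle, i.e. 6 quotient bits per cycle) but, to stay short, selects exact non-redundant digits in 0..3 instead of the redundant {-2..2} digit set that prescaling enables in the real VFDivSqrt; the function name and the cycles parameter are illustrative.

```python
from fractions import Fraction

def radix4_recurrence_divide(x, d, cycles=9):
    """Digit-recurrence division sketch: radix 64 per cycle, realised as
    3 radix-4 iterations, retiring 6 quotient bits per cycle.

    x and d are significands in [1, 2).  Invariant: 0 <= w < d, so each
    quotient digit lies in 0..3.  The hardware instead prescales d toward 1
    and picks redundant digits in {-2..2} from a short residual estimate.
    """
    x, d = Fraction(x), Fraction(d)
    w = x / 4                                # initial residual, 0 <= w < d
    q = 0
    steps = 3 * cycles                       # 3 radix-4 iterations per cycle
    for _ in range(steps):
        digit = int(4 * w / d)               # quotient digit in 0..3
        w = 4 * w - digit * d                # residual recurrence
        q = 4 * q + digit                    # accumulate 2 bits per iteration
    return 4 * Fraction(q, 4 ** steps)       # approximates x / d

# Example: 1.5 / 1.25 = 1.2, recovered to 2 bits per iteration.
assert abs(radix4_recurrence_divide(1.5, 1.25) - Fraction(6, 5)) < Fraction(1, 4 ** 26)
```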
Hardware Design

Vector Floating-Point Adder (VFALU)

Composed of three main modules, selected by the output format width:

- FloatAdderF64Widen: handles 64-bit outputs (f64 and the widening-to-f64 variants)
- FloatAdderF32WidenF16: handles 32- and 16-bit outputs
- FloatAdderF16: dedicated to 16-bit outputs

- Two-stage pipeline giving roughly 1.5-cycle addition latency.
- Modules are shared for hardware efficiency.
- Supports scalar and vector operands, widening instructions, and masking for vector operations, with a 20-bit flag output.
- Operates with
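As a rough illustration of the decomposition, the hypothetical lookup below maps a result-element width to the sub-adders that can produce it, following the module assignment listed above; it is not part of the RTL, and the exact per-element mapping inside a 64-bit lane is not modeled.

```python
def vfalu_adders_for_output_width(dst_width_bits):
    """Hypothetical lookup: which VFALU sub-adders can produce a result
    element of the given width, per the module assignment described above."""
    table = {
        64: ["FloatAdderF64Widen"],                    # f64 and widening-to-f64 results
        32: ["FloatAdderF32WidenF16"],                 # f32 results
        16: ["FloatAdderF32WidenF16", "FloatAdderF16"],  # f16 results
    }
    return table[dst_width_bits]
```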