# [v0.16.0] Qwen3 30B A3B Q40 on 4 x Raspberry Pi 5 8GB

**Author:** b4rtaz (Maintainer) · **Date:** September 5, 2025 · **Category:** Results · **Comments:** 0 · **Participants:** 1

---

## Setup Details

- **Device:** 4 x Raspberry Pi 5 8GB
- **Distributed Llama version:** 0.16.0
- **Model:** qwen330ba3bq40

---

## Benchmark Results

| Device | Evaluation (tokens/sec) | Prediction (tokens/sec) |
|------------------------|-------------------------|-------------------------|
| 4 x Raspberry Pi 5 8GB | 14.33 | 13.04 |

---

## Inference Command & Output

The run's startup log reports the model configuration below (the launch command itself is not captured in this summary).

### Model Info

- **Architecture:** Qwen3 MoE
- **Hidden activation:** SiLU
- **Dimensions:** dim 2048, head dim 128, q dim 4096, kv dim 512, hidden dim 6144
- **Layers:** 48
- **Heads:** 32
- **Experts:** 128 (8 active per token)
- **Vocabulary size:** 151,936
- **Required memory:** 5513 MB
- **Float buffer type:** Q80

### Execution Notes

- All worker sockets connected successfully.
- The CPU path uses NEON with dotprod and fp16 support.
- Model weights loaded successfully.
- The network operates in non-blocking mode.

### Example Output Tokens (Prediction)

In response to the prompt "Please explain where Poland is as I have 1 year", the model begins its answer with tokens such as "Of course! Let me explain where Poland is...".

---

## Summary

This discussion showcases a benchmark and inference test of the Qwen3 30B A3B Q40 model on a distributed setup of four Raspberry Pi 5 boards with 8 GB of RAM each. The cluster sustains about 14.33 tokens/sec during prompt evaluation and 13.04 tokens/sec during prediction. A video demonstration (qwen330b.mov) showing the performance and output is included. No community comments have been posted yet.

---

## Additional Context

- **Repository:** b4rtaz/distributed-llama
- **Discussion ID:** #255
- **Category:** Results

---

This summary covers the key elements of running the Qwen3 30B model on a Raspberry Pi 5 cluster for distributed inference.
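Since the launch command is not preserved above, here is a rough sketch of what a Distributed Llama invocation of this kind typically looks like. This is an assumption based on the project's general CLI shape, not taken from the discussion: the binary name `dllama`, the flag names, the model/tokenizer file names, the thread counts, and the worker IP addresses are all placeholders.

```shell
# Hypothetical sketch, not the actual command from the discussion.
# On each of the three worker Raspberry Pis (port and thread count are placeholders):
./dllama worker --port 9999 --nthreads 4

# On the root Raspberry Pi (model/tokenizer paths and worker IPs are placeholders):
./dllama inference \
  --model dllama_model_qwen3_30b_a3b_q40.m \
  --tokenizer dllama_tokenizer_qwen3.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999 \
  --prompt "Please explain where Poland is as I have 1 year"
```

The `--buffer-float-type q80` value mirrors the "Floating buffer type: q80" line from the startup log; everything else should be adapted to your own cluster.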
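A quick sanity check on the figures reported in the discussion. The even four-way memory split is my assumption for illustration; Distributed Llama may not divide the weights exactly evenly across nodes.

```python
# Napkin math from the numbers reported in the discussion.
total_mb = 5513               # required memory for the Q40 model (reported)
nodes = 4                     # 4 x Raspberry Pi 5 8GB
per_node_mb = total_mb / nodes  # assumes an even split across nodes

pred_tps = 13.04              # reported prediction throughput (tokens/sec)
ms_per_token = 1000 / pred_tps

active_ratio = 8 / 128        # 8 of 128 experts active per token (MoE)

print(f"~{per_node_mb:.0f} MB per node")        # well within a Pi 5's 8 GB
print(f"~{ms_per_token:.1f} ms per token")
print(f"{active_ratio:.2%} of experts active per token")
```

The roughly 1.4 GB per node (under the even-split assumption) explains why a 30B-parameter model fits on 8 GB boards, and the 6.25% active-expert ratio is why an "A3B" MoE model can predict at Pi-class speeds at all.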