# GPU/CPU Monitoring Report - Twitch Highlight Detector

## System Information

- **GPU**: NVIDIA GeForce RTX 3050 (8192 MiB)
- **Driver**: 580.126.09
- **Device**: cuda (CUDA requested and available)

## Execution Summary

- **Total Runtime**: ~10.5 seconds
- **Process Completed**: Successfully
- **Highlights Found**: 1 (4819s - 4833s, duration: 14s)

## GPU Utilization Analysis

### Peak GPU Usage

- **Single Peak**: 100% GPU SM utilization (1 second only)
- **Location**: During RMS calculation phase
- **Memory Usage**: 0-4 MiB as sampled (negligible)

### Average GPU Utilization

- **Overall Average**: 3.23%
- **During Processing**: ~4% (excluding idle periods)
- **Memory Utilization**: ~0.05% (4 MiB / 8192 MiB)

### Timeline Breakdown

1. **Chat Analysis**: < 0.1s (CPU bound)
2. **FFmpeg Audio Extraction**: 8.5s (CPU bound - FFmpeg threads)
3. **Audio Decode**: 9.1s (CPU bound - soundfile library; overlaps with extraction)
4. **CPU->GPU Transfer**: 1.08s (PCIe transfer)
5. **GPU Processing**:
   - Window creation: 0.00s (GPU)
   - RMS calculation: 0.12s (GPU - **100% spike**)
   - Peak detection: 0.00s (GPU)

## CPU vs GPU Usage Breakdown

### CPU-Bound Operations (90%+ of runtime)

1. **FFmpeg audio extraction** (8.5s)
   - Process: ffmpeg
   - Type: Video/audio decoding
   - GPU usage: 0%
2. **Soundfile audio decoding** (9.1s, overlapping with extraction)
   - Process: Python soundfile
   - Type: WAV decoding
   - GPU usage: 0%
3. **Chat JSON parsing** (< 0.5s)
   - Process: Python json module
   - Type: File I/O + parsing
   - GPU usage: 0%

### GPU-Bound Operations (~1% of runtime)

1. **Audio tensor operations** (0.12s total)
   - Process: PyTorch CUDA kernels
   - Type: RMS calculation, window creation
   - GPU usage: 100% (brief spike)
   - Memory: Minimal tensor storage
2. **GPU Memory allocation**
   - Audio tensor: ~1.2 GB (308M samples × 4 bytes)
   - Chat tensor: < 1 MB
   - Calculation buffers: < 100 MB
   - Note: these sizes are not reflected in the 0-4 MiB memory usage sampled above

## Conclusion

### **FAIL: GPU not utilized**

**Reason**: Although the code successfully uses PyTorch CUDA for tensor operations, GPU utilization is minimal because:

1. **Bottleneck is CPU-bound operations**:
   - FFmpeg audio extraction (8.5s) - 0% GPU
   - Soundfile WAV decoding (9.1s) - 0% GPU
   - These operations cannot use the GPU without CUDA-accelerated libraries
2. **GPU processing is trivial**:
   - Only 0.12s of actual CUDA kernel execution
   - Operations are too simple to saturate the GPU
   - Memory bandwidth is underutilized
3. **Architecture mismatch**:
   - Audio processing on GPU is efficient for large batches
   - Single-file processing doesn't provide enough parallelism
   - The RTX 3050 is designed for larger workloads

## Recommendations

### To actually utilize GPU:

1. **Use GPU-accelerated decoding**:
   - Offload the video decode step to NVIDIA NVDEC (NVDEC accelerates video only, so audio decoding itself stays on the CPU)
   - Use torchaudio with CUDA backend
   - Implement custom CUDA audio kernels
2. **Batch processing**:
   - Process multiple videos simultaneously
   - Accumulate audio batches for GPU
   - Increase tensor operation complexity
3. **Alternative: Accept CPU-bound nature**:
   - Current implementation is already optimal for a single file
   - GPU overhead may exceed benefits for small workloads
   - Consider multi-threaded CPU processing instead

## Metrics Summary

- **GPU utilization**: 3.23% average (FAIL - below 10% threshold)
- **CPU usage**: High during FFmpeg/soundfile phases
- **Memory usage**: 4 MiB GPU / 347 MB system
- **Process efficiency**: 1 highlight / 10.5 seconds
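
### Cross-checking the figures

The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below is not part of the detector; it only recomputes ratios from numbers quoted in this report. All constant and function names are hypothetical. The 31-sample series is an assumption made for illustration: the report does not state how many utilization samples were taken, only that there was a single 100% reading and a 3.23% average.

```python
"""Recompute the report's ratios from its own quoted numbers (illustrative sketch)."""
from typing import Sequence

# Values copied from the report.
TOTAL_RUNTIME_S = 10.5        # total runtime
GPU_KERNEL_S = 0.12           # RMS calculation on the GPU
GPU_MEM_USED_MIB = 4          # highest sampled GPU memory usage
GPU_MEM_TOTAL_MIB = 8192      # RTX 3050 memory
AUDIO_SAMPLES = 308_000_000   # "308M samples"
BYTES_PER_SAMPLE = 4          # 4 bytes per sample, as stated in the report
REPORTED_AVG_UTIL = 3.23      # average GPU utilization in percent
UTIL_THRESHOLD = 10.0         # pass/fail threshold in percent


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def average_utilization(samples: Sequence[float]) -> float:
    """Arithmetic mean of per-sample utilization readings (in percent)."""
    return sum(samples) / len(samples)


# Share of the total runtime spent in CUDA kernels.
gpu_time_share = percent(GPU_KERNEL_S, TOTAL_RUNTIME_S)

# Share of GPU memory in use at the highest sampled value.
mem_share = percent(GPU_MEM_USED_MIB, GPU_MEM_TOTAL_MIB)

# Size of the audio tensor in decimal gigabytes and in MiB.
audio_tensor_bytes = AUDIO_SAMPLES * BYTES_PER_SAMPLE
audio_tensor_gb = audio_tensor_bytes / 1e9
audio_tensor_mib = audio_tensor_bytes / 2**20

# ASSUMPTION (not from the report): 31 one-second samples, exactly one at 100%.
hypothetical_samples = [100.0] + [0.0] * 30
hypothetical_avg = average_utilization(hypothetical_samples)

# Apply the report's 10% threshold to the reported average.
verdict = "PASS" if REPORTED_AVG_UTIL >= UTIL_THRESHOLD else "FAIL"

if __name__ == "__main__":
    print(f"GPU kernel share of runtime: {gpu_time_share:.2f}%")
    print(f"GPU memory in use:           {mem_share:.3f}%")
    print(f"Audio tensor size:           {audio_tensor_gb:.3f} GB ({audio_tensor_mib:.0f} MiB)")
    print(f"Average of assumed samples:  {hypothetical_avg:.2f}%")
    print(f"Verdict at {UTIL_THRESHOLD:.0f}% threshold:    {verdict}")
```

With these inputs the kernel time comes to roughly 1.1% of the runtime, the sampled memory usage to about 0.05% of the card, and the audio tensor to about 1.23 GB. A single 100% reading among 31 samples would average out to the reported 3.23%, which suggests (but does not prove) that the monitoring window was longer than the ~10.5 second run.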