# GPU/CPU Monitoring Report - Twitch Highlight Detector

## System Information
- **GPU**: NVIDIA GeForce RTX 3050 (8192 MiB)
- **Driver**: 580.126.09
- **Device**: cuda (CUDA requested and available)

## Execution Summary
- **Total Runtime**: ~10.5 seconds
- **Process Completed**: Successfully
- **Highlights Found**: 1 (4819s - 4833s, duration: 14s)

## GPU Utilization Analysis

### Peak GPU Usage
- **Single Peak**: 100% GPU SM utilization in a single 1-second sample
- **Location**: RMS calculation phase
- **Memory Usage**: 0-4 MiB (negligible)

### Average GPU Utilization
- **Overall Average**: 3.23%
- **During Processing**: ~4% (excluding idle periods)
- **Memory Utilization**: ~0.05% (4 MiB / 8192 MiB)
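
As a rough illustration of how averages and peaks like these can be derived, the sketch below parses utilization samples from `nvidia-smi` and summarizes them. The helper names (`parse_sample`, `summarize`, `sample_once`) are hypothetical, and the report does not state which tool or sampling interval was actually used; a 1-second interval is assumed here.

```python
import subprocess

# Standard nvidia-smi query flags; prints one "util, mem" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_sample(line):
    """Parse one 'util, mem' CSV line into (utilization %, memory MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return float(util), float(mem)

def summarize(samples):
    """Return mean and peak utilization plus peak memory for the samples."""
    utils = [u for u, _ in samples]
    mems = [m for _, m in samples]
    return {
        "mean_util": sum(utils) / len(utils),
        "peak_util": max(utils),
        "peak_mem_mib": max(mems),
    }

def sample_once():
    """Query the first GPU once (requires nvidia-smi on PATH)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])
```

Calling `sample_once()` in a loop with `time.sleep(1)` and passing the collected list to `summarize` would yield figures of the same kind as those above. With coarse sampling, a kernel lasting 0.12s can show up as one full-utilization sample or be missed entirely.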

### Timeline Breakdown
1. **Chat Analysis**: < 0.1s (CPU bound)
2. **FFmpeg Audio Extraction**: 8.5s (CPU bound - FFmpeg threads)
3. **Audio Decode**: 9.1s (CPU bound - soundfile library; overlaps with FFmpeg extraction, so the phase times do not sum to the total runtime)
4. **CPU->GPU Transfer**: 1.08s (PCIe transfer)
5. **GPU Processing**:
   - Window creation: 0.00s (GPU)
   - RMS calculation: 0.12s (GPU - **100% spike**)
   - Peak detection: 0.00s (GPU)
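
The three GPU steps (window creation, RMS calculation, peak detection) can be sketched in plain Python to show the logic involved. This is a CPU-only illustration using the standard library, not the PyTorch CUDA code that was measured; the function names, the `window`/`hop` parameters, and the mean-plus-k-standard-deviations threshold are all assumptions.

```python
import math

def windowed_rms(samples, window, hop):
    """RMS of each window of `window` samples, advancing by `hop` samples."""
    values = []
    for start in range(0, len(samples) - window + 1, hop):
        chunk = samples[start:start + window]
        values.append(math.sqrt(sum(x * x for x in chunk) / window))
    return values

def loud_windows(rms, k=2.0):
    """Indices of windows whose RMS exceeds mean + k * std (assumed rule)."""
    mean = sum(rms) / len(rms)
    std = math.sqrt(sum((r - mean) ** 2 for r in rms) / len(rms))
    return [i for i, r in enumerate(rms) if r > mean + k * std]

# Quiet signal with one loud burst in the middle.
signal = [0.1] * 8 + [1.0] * 2 + [0.1] * 8
rms = windowed_rms(signal, window=2, hop=2)
peaks = loud_windows(rms)
```

Each window is independent, which is why this stage maps well to a GPU and finishes in a fraction of a second there.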

## CPU vs GPU Usage Breakdown

### CPU-Bound Operations (90%+ of runtime)
1. **FFmpeg audio extraction** (8.5s)
   - Process: ffmpeg
   - Type: Video/audio decoding
   - GPU usage: 0%

2. **Soundfile audio decoding** (9.1s, overlapping with FFmpeg extraction)
   - Process: Python soundfile
   - Type: WAV decoding
   - GPU usage: 0%

3. **Chat JSON parsing** (< 0.5s)
   - Process: Python json module
   - Type: File I/O + parsing
   - GPU usage: 0%

### GPU-Bound Operations (~1% of runtime)
1. **Audio tensor operations** (0.12s total)
   - Process: PyTorch CUDA kernels
   - Type: RMS calculation, window creation
   - GPU usage: 100% (brief spike)
   - Memory: Minimal tensor storage

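2. **GPU memory allocation (estimated)**
   - Audio tensor: ~1.2 GB (308M samples × 4 bytes); this conflicts with the 0-4 MiB reported by monitoring, so one of the two figures needs re-measurement
   - Chat tensor: < 1 MB
   - Calculation buffers: < 100 MB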
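
The memory figures can be checked with a few lines of arithmetic. The sketch below assumes "308M" means exactly 308,000,000 float32 samples; the helper names are hypothetical.

```python
MIB = 1024 ** 2  # bytes per MiB

def tensor_bytes(num_samples, bytes_per_sample=4):
    """Size in bytes of a dense tensor (4 bytes per sample for float32)."""
    return num_samples * bytes_per_sample

def percent_of(used_mib, total_mib):
    """Share of total memory in percent."""
    return 100.0 * used_mib / total_mib

audio_bytes = tensor_bytes(308_000_000)       # 1,232,000,000 bytes (~1.2 GB)
audio_mib = audio_bytes / MIB                 # roughly 1175 MiB
audio_share = percent_of(audio_mib, 8192)     # roughly 14% of the 8192 MiB card
observed_share = percent_of(4, 8192)          # roughly 0.05% for the observed 4 MiB
```

A tensor of this size would occupy about 14% of the card, far more than the 4 MiB peak recorded during monitoring, so the two measurements do not describe the same allocation.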

## Conclusion

### **FAIL: GPU underutilized**

**Reason**: Despite the code successfully using PyTorch CUDA for tensor operations, GPU utilization is minimal because:

1. **Bottleneck is CPU-bound operations**:
   - FFmpeg audio extraction (8.5s) - 0% GPU
   - Soundfile WAV decoding (9.1s) - 0% GPU
   - These operations cannot use GPU without CUDA-accelerated libraries

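2. **GPU processing is trivial**:
   - Only 0.12s of actual CUDA kernel execution, about 1% of the 10.5s runtime
   - The kernels are simple and finish almost immediately, so the 100% reading reflects one brief burst rather than sustained load
   - Memory bandwidth underutilized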

3. **Architecture mismatch**:
   - GPU audio processing pays off for large batches
   - Single-file processing does not expose enough parallel work
   - The single 0.12s kernel leaves the RTX 3050 idle for the rest of the run

## Recommendations

### To actually utilize GPU:
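1. **Use GPU-accelerated audio decoding (where available)**:
   - NVIDIA NVDEC decodes video only, so it does not replace FFmpeg for audio extraction
   - torchaudio decodes audio on the CPU; only the tensor operations that follow can run on CUDA
   - Implementing custom CUDA audio kernels is the remaining option, at significant development cost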

2. **Batch processing**:
   - Process multiple videos simultaneously
   - Accumulate audio batches for GPU
   - Increase tensor operation complexity

3. **Alternative: Accept CPU-bound nature**:
   - The current implementation is already close to optimal for a single file, since decoding dominates the runtime
   - GPU overhead (including the 1.08s transfer) may exceed the benefits for small workloads
   - Consider multi-threaded CPU processing instead
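
One way to apply the multi-threaded CPU suggestion is to extract audio from several videos in parallel. The sketch below is a minimal example under assumptions: the function names are hypothetical, and the mono 16 kHz output settings are illustrative rather than taken from the detector's actual pipeline.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def ffmpeg_audio_cmd(video, wav, sample_rate=16000):
    """Build an FFmpeg command that skips video (-vn) and writes mono WAV."""
    return [
        "ffmpeg", "-y", "-i", str(video),
        "-vn", "-ac", "1", "-ar", str(sample_rate),
        str(wav),
    ]

def extract_all(videos, out_dir, workers=4):
    """Run one FFmpeg process per video, up to `workers` at a time."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    jobs = [(Path(v), out_dir / (Path(v).stem + ".wav")) for v in videos]

    def run(job):
        video, wav = job
        # Threads suffice here: each worker just waits on an external process.
        subprocess.run(ffmpeg_audio_cmd(video, wav), check=True, capture_output=True)
        return wav

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, jobs))
```

The extracted WAV files could then be batched into a single GPU pass, which also addresses the batch-processing recommendation.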

## Metrics Summary
- **GPU utilization**: 3.23% average (FAIL - below 10% threshold)
- **CPU usage**: High during FFmpeg/soundfile phases
- **Memory usage**: 4 MiB GPU / 347 MB system
- **Process efficiency**: 1 highlight / 10.5 seconds