twitch-highlight-detector/6800xt.md

# Configuración para RX 6800 XT (16GB VRAM)

## Objetivo
Mejorar el detector de highlights para Twitch usando un modelo VLM más potente aprovechando los 16GB de VRAM de la RX 6800 XT.

## Hardware Target
- **GPU**: AMD Radeon RX 6800 XT (16GB VRAM)
- **Alternativa**: NVIDIA RTX 3050 (4GB VRAM) - configuración actual
- **RAM**: 32GB sistema
- **Almacenamiento**: SSD NVMe recomendado

## Modelos VLM Recomendados (16GB VRAM)

### Opción 1: Video-LLaMA 7B ⭐ (Recomendado)
```bash
# Descargar modelo
pip install git+https://github.com/DAMO-NLP-SG/Video-LLaMA.git

# O usar desde HuggingFace
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("DAMO-NLP-SG/Video-LLaMA-7B", device_map="auto")
```
**Ventajas**:
- Procesa video nativamente (no frames sueltos)
- Entiende contexto temporal
- Preguntas como: "¿En qué timestamps hay gameplay de LoL?"

### Opción 2: Qwen2-VL 7B
```bash
pip install transformers
from transformers import Qwen2VLForConditionalGeneration
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)
```
**Ventajas**:
- SOTA en análisis de video
- Soporta videos largos (hasta 2 horas)
- Excelente para detectar actividades específicas

### Opción 3: LLaVA-NeXT-Video 7B
```bash
from llava.model.builder import load_pretrained_model
model_name = "liuhaotian/llava-v1.6-vicuna-7b"
model = load_pretrained_model(model_name, None, None)
```

## Arquitectura Propuesta

### Pipeline Optimizado para 16GB

```
Video Input (2.3h)
    ↓
[FFmpeg + CUDA] Decodificación GPU
    ↓
[Scene Detection] Cambios de escena cada ~5s
    ↓
[VLM Batch] Procesar 10 frames simultáneos
    ↓
[Clasificación] GAMEPLAY / SELECT / TALKING / MENU
    ↓
[Filtrado] Solo GAMEPLAY segments
    ↓
[Análisis Rage] Whisper + Chat + Audio
    ↓
[Highlights] Mejores momentos de cada segmento
    ↓
[Video Final] Concatenación con ffmpeg
```

## Implementación Paso a Paso

### 1. Instalación Base
```bash
# Crear entorno aislado
python3 -m venv vlm_6800xt
source vlm_6800xt/bin/activate

# PyTorch con ROCm (para AMD RX 6800 XT)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6

# O para NVIDIA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Dependencias
pip install transformers accelerate decord opencv-python scenedetect
pip install whisper-openai numpy scipy
```

### 2. Descargar Modelo VLM
```python
# Descargar Video-LLaMA o Qwen2-VL
from huggingface_hub import snapshot_download

# Opción A: Video-LLaMA
model_path = snapshot_download(
    repo_id="DAMO-NLP-SG/Video-LLaMA-7B",
    local_dir="models/video_llama",
    local_dir_use_symlinks=False
)

# Opción B: Qwen2-VL
model_path = snapshot_download(
    repo_id="Qwen/Qwen2-VL-7B-Instruct",
    local_dir="models/qwen2vl",
    local_dir_use_symlinks=False
)
```

### 3. Script Principal
Crear `vlm_6800xt_detector.py`:

```python
#!/usr/bin/env python3
import torch
from transformers import AutoModel, AutoTokenizer
import cv2
import numpy as np
from pathlib import Path
import json

class VLM6800XTDetector:
    """Detector de highlights usando VLM en RX 6800 XT."""

    def __init__(self, model_path="models/video_llama"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"🎮 VLM Detector - {torch.cuda.get_device_name(0)}")
        print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

        # Cargar modelo
        print("📥 Cargando VLM...")
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        print("✅ Modelo listo")

    def analyze_video_segments(self, video_path, segment_duration=60):
        """
        Analiza el video en segmentos de 1 minuto.
        Usa VLM para clasificar cada segmento.
        """
        import subprocess

        # Obtener duración
        result = subprocess.run([
            'ffprobe', '-v', 'error',
            '-show_entries', 'format=duration',
            '-of', 'default=noprint_wrappers=1:nokey=1',
            video_path
        ], capture_output=True, text=True)

        duration = float(result.stdout.strip())
        print(f"\n📹 Video: {duration/3600:.1f} horas")

        segments = []

        # Analizar cada minuto
        for start in range(0, int(duration), segment_duration):
            end = min(start + segment_duration, int(duration))

            # Extraer frame representativo del medio del segmento
            mid = (start + end) // 2
            frame_path = f"/tmp/segment_{start}.jpg"

            subprocess.run([
                'ffmpeg', '-y', '-i', video_path,
                '-ss', str(mid), '-vframes', '1',
                '-vf', 'scale=512:288',
                frame_path
            ], capture_output=True)

            # Analizar con VLM
            image = Image.open(frame_path)

            prompt = """Analyze this gaming stream frame. Classify as ONE of:
1. GAMEPLAY_ACTIVE - League of Legends match in progress (map, champions fighting)
2. CHAMPION_SELECT - Lobby/selection screen
3. STREAMER_TALKING - Just streamer face without game
4. MENU_WAITING - Menus, loading screens
5. OTHER_GAME - Different game

Respond ONLY with the number (1-5)."""

            # Inferencia VLM
            inputs = self.processor(text=prompt, images=image, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            with torch.no_grad():
                outputs = self.model.generate(**inputs, max_new_tokens=10)

            classification = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Parsear resultado
            is_gameplay = "1" in classification or "GAMEPLAY" in classification

            segments.append({
                'start': start,
                'end': end,
                'is_gameplay': is_gameplay,
                'classification': classification
            })

            status = "🎮" if is_gameplay else "❌"
            print(f"{start//60:02d}m-{end//60:02d}m {status} {classification}")

            Path(frame_path).unlink(missing_ok=True)

        return segments

    def extract_highlights(self, video_path, gameplay_segments):
        """Extrae highlights de los segmentos de gameplay."""
        # Implementar análisis Whisper + Chat + Audio
        # Solo en segmentos marcados como gameplay
        pass

if __name__ == '__main__':
    detector = VLM6800XTDetector()

    video = "nuevo_stream_360p.mp4"
    segments = detector.analyze_video_segments(video)

    # Guardar
    with open('gameplay_segments_vlm.json', 'w') as f:
        json.dump(segments, f, indent=2)
```

## Optimizaciones para 16GB VRAM

### Batch Processing
```python
# Procesar múltiples frames simultáneamente
batch_size = 8  # Ajustar según VRAM disponible

frames_batch = []
for i, ts in enumerate(timestamps):
    frame = extract_frame(ts)
    frames_batch.append(frame)

    if len(frames_batch) == batch_size:
        # Procesar batch completo en GPU
        results = model(frames_batch)
        frames_batch = []
```

### Mixed Precision
```python
# Usar FP16 para ahorrar VRAM
model = model.half()  # Convertir a float16

# O con accelerate
from accelerate import Accelerator
accelerator = Accelerator(mixed_precision='fp16')
```

### Gradient Checkpointing (si entrenas)
```python
model.gradient_checkpointing_enable()
```

## Comparación de Modelos

| Modelo | Tamaño | VRAM | Velocidad | Precisión |
|--------|--------|------|-----------|-----------|
| Moondream 2B | 4GB | 6GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Video-LLaMA 7B | 14GB | 16GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Qwen2-VL 7B | 16GB | 20GB* | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| LLaVA-NeXT 7B | 14GB | 16GB | ⭐⭐⭐ | ⭐⭐⭐⭐ |

*Requiere quantization 4-bit para 16GB

## Configuración de Quantization (Ahorrar VRAM)

```python
# 4-bit quantization para modelos grandes
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModel.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)
```

## Testing

```bash
# Verificar VRAM disponible
python3 -c "import torch; print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')"

# Test rápido del modelo
python3 test_vlm.py --model models/video_llama --test-frame sample.jpg
```

## Troubleshooting

### Problema: Out of Memory
**Solución**: Reducir batch_size o usar quantization 4-bit

### Problema: Lento
**Solución**: Usar CUDA/ROCm graphs, precisión FP16, o modelo más pequeño

### Problema: Precision baja
**Solución**: Aumentar resolución de frames de entrada (512x288 → 1024x576)

## Referencias

- [Video-LLaMA GitHub](https://github.com/DAMO-NLP-SG/Video-LLaMA)
- [Qwen2-VL HuggingFace](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
- [LLaVA Documentation](https://llava-vl.github.io/)
- [ROCm PyTorch](https://pytorch.org/get-started/locally/)

## Notas para el Desarrollador

1. **Prueba primero con Moondream 2B** en la RTX 3050 para validar el pipeline
2. **Luego migra a Video-LLaMA 7B** en la RX 6800 XT
3. **Usa batch processing** para maximizar throughput
4. **Guarda checkpoints** cada 10 minutos de análisis
5. **Prueba con videos cortos** (10 min) antes de procesar streams de 3 horas

## TODO

- [ ] Implementar decodificación GPU con `decord`
- [ ] Agregar detección de escenas con PySceneDetect
- [ ] Crear pipeline de batch processing eficiente
- [ ] Implementar cache de frames procesados
- [ ] Agregar métricas de calidad de highlights
- [ ] Crear interfaz CLI interactiva
- [ ] Soporte para múltiples juegos (no solo LoL)
- [ ] Integración con API de Twitch para descarga automática

---

**Autor**: IA Assistant
**Fecha**: 2024
**Target Hardware**: AMD RX 6800 XT 16GB / NVIDIA RTX 3050 4GB