Complete highlight detection system with VLM and gameplay analysis

- Implementation of a hybrid detector (Whisper + Chat + Audio + VLM)
- Detection of actual gameplay vs. talking segments
- Scene detection with FFmpeg
- Support for RTX 3050 and RX 6800 XT
- Complete guide in 6800xt.md for the next AI
- Visual filtering and context analysis scripts
- Automated video generation pipeline
2026-02-19 17:38:14 +00:00
parent c1c66a7d9a
commit 00180d0b1c
45 changed files with 10636 additions and 260 deletions
--- a/6800xt.md
+++ b/6800xt.md
@@ -0,0 +1,348 @@
+# Configuration for RX 6800 XT (16GB VRAM)
+
+## Goal
+Improve the Twitch highlight detector by using a more capable VLM that takes advantage of the RX 6800 XT's 16GB of VRAM.
+
+## Target Hardware
+- **GPU**: AMD Radeon RX 6800 XT (16GB VRAM)
+- **Alternative**: NVIDIA RTX 3050 (4GB VRAM), the current setup
+- **RAM**: 32GB system memory
+- **Storage**: NVMe SSD recommended
+
+## Recommended VLM Models (16GB VRAM)
+
+### Option 1: Video-LLaMA 7B ⭐ (Recommended)
+```python
+# Install first (shell): pip install git+https://github.com/DAMO-NLP-SG/Video-LLaMA.git
+# (assumes the repository is pip-installable; check its README)
+
+# Or load from HuggingFace (assumes this repo ID exists and supports AutoModel)
+from transformers import AutoModel, AutoTokenizer
+model = AutoModel.from_pretrained("DAMO-NLP-SG/Video-LLaMA-7B", device_map="auto")
+```
+**Advantages**:
+- Processes video natively (not isolated frames)
+- Understands temporal context
+- Supports questions such as: "At which timestamps is there LoL gameplay?"
+
+### Option 2: Qwen2-VL 7B
+```python
+# Install first (shell): pip install transformers
+import torch
+from transformers import Qwen2VLForConditionalGeneration
+model = Qwen2VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen2-VL-7B-Instruct",
+    torch_dtype=torch.float16, device_map="auto"
+)
+```
+**Advantages**:
+- State-of-the-art video analysis
+- Supports long videos (the Qwen2-VL model card cites 20+ minutes, so multi-hour streams still need to be split)
+- Excellent at detecting specific activities
+
+### Option 3: LLaVA-NeXT-Video 7B
+```python
+from llava.model.builder import load_pretrained_model
+model_path = "liuhaotian/llava-v1.6-vicuna-7b"  # LLaVA-NeXT image checkpoint
+tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, "llava-v1.6-vicuna-7b")
+```
+
+## Proposed Architecture
+
+### Pipeline Optimized for 16GB
+
+```
+Video Input (2.3h)
+    ↓
+[FFmpeg + CUDA] GPU decoding
+    ↓
+[Scene Detection] Scene changes roughly every 5s
+    ↓
+[VLM Batch] Process 10 frames at once
+    ↓
+[Classification] GAMEPLAY / SELECT / TALKING / MENU
+    ↓
+[Filtering] GAMEPLAY segments only
+    ↓
+[Rage Analysis] Whisper + Chat + Audio
+    ↓
+[Highlights] Best moments from each segment
+    ↓
+[Final Video] Concatenation with ffmpeg
+```
+
+## Step-by-Step Implementation
+
+### 1. Base Installation
+```bash
+# Create an isolated environment
+python3 -m venv vlm_6800xt
+source vlm_6800xt/bin/activate
+
+# PyTorch with ROCm (for the AMD RX 6800 XT)
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
+
+# Or for NVIDIA
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+
+# Dependencies (the Whisper package on PyPI is named openai-whisper)
+pip install transformers accelerate decord opencv-python scenedetect
+pip install openai-whisper numpy scipy
+```
+
+### 2. Download the VLM Model
+```python
+# Download Video-LLaMA or Qwen2-VL
+from huggingface_hub import snapshot_download
+
+# Option A: Video-LLaMA
+model_path = snapshot_download(
+    repo_id="DAMO-NLP-SG/Video-LLaMA-7B",
+    local_dir="models/video_llama",
+    local_dir_use_symlinks=False
+)
+
+# Option B: Qwen2-VL
+model_path = snapshot_download(
+    repo_id="Qwen/Qwen2-VL-7B-Instruct",
+    local_dir="models/qwen2vl",
+    local_dir_use_symlinks=False
+)
+```
+
+### 3. Main Script
+Create `vlm_6800xt_detector.py`:
+
+```python
+#!/usr/bin/env python3
+import torch
+from transformers import AutoModel, AutoProcessor, AutoTokenizer
+from PIL import Image
+from pathlib import Path
+import json
+
+class VLM6800XTDetector:
+    """Highlight detector using a VLM on the RX 6800 XT."""
+
+    def __init__(self, model_path="models/video_llama"):
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"  # ROCm builds also report "cuda"
+        if self.device == "cuda":
+            print(f"🎮 VLM Detector - {torch.cuda.get_device_name(0)}")
+            print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
+
+        # Load the model and processor (assumes the checkpoint ships transformers-compatible ones)
+        print("📥 Loading VLM...")
+        self.model = AutoModel.from_pretrained(
+            model_path,
+            torch_dtype=torch.float16,
+            device_map="auto"
+        )
+        self.processor = AutoProcessor.from_pretrained(model_path)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+        print("✅ Model ready")
+    
+    def analyze_video_segments(self, video_path, segment_duration=60):
+        """
+        Analyze the video in 1-minute segments.
+        Use the VLM to classify each segment.
+        """
+        import subprocess
+
+        # Get the duration
+        result = subprocess.run([
+            'ffprobe', '-v', 'error',
+            '-show_entries', 'format=duration',
+            '-of', 'default=noprint_wrappers=1:nokey=1',
+            video_path
+        ], capture_output=True, text=True)
+        
+        duration = float(result.stdout.strip())
+        print(f"\n📹 Video: {duration/3600:.1f} hours")
+        
+        segments = []
+        
+        # Analyze each minute
+        for start in range(0, int(duration), segment_duration):
+            end = min(start + segment_duration, int(duration))
+
+            # Extract a representative frame from the middle of the segment
+            mid = (start + end) // 2
+            frame_path = f"/tmp/segment_{start}.jpg"
+
+            # Placing -ss before -i makes ffmpeg seek instead of decoding from the start
+            subprocess.run([
+                'ffmpeg', '-y', '-ss', str(mid), '-i', video_path,
+                '-vframes', '1',
+                '-vf', 'scale=512:288',
+                frame_path
+            ], capture_output=True)
+
+            # Analyze with the VLM
+            image = Image.open(frame_path).convert("RGB")
+            
+            prompt = """Analyze this gaming stream frame. Classify as ONE of:
+1. GAMEPLAY_ACTIVE - League of Legends match in progress (map, champions fighting)
+2. CHAMPION_SELECT - Lobby/selection screen
+3. STREAMER_TALKING - Just streamer face without game
+4. MENU_WAITING - Menus, loading screens
+5. OTHER_GAME - Different game
+
+Respond ONLY with the number (1-5)."""
+
+            # VLM inference
+            inputs = self.processor(text=prompt, images=image, return_tensors="pt")
+            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+            with torch.no_grad():
+                outputs = self.model.generate(**inputs, max_new_tokens=10)
+
+            # Decode only the new tokens: many VLMs echo the prompt, which itself contains "1" and "GAMEPLAY"
+            reply = outputs[0][inputs["input_ids"].shape[1]:]
+            classification = self.tokenizer.decode(reply, skip_special_tokens=True).strip()
+            is_gameplay = classification.startswith("1") or "GAMEPLAY" in classification
+            
+            segments.append({
+                'start': start,
+                'end': end,
+                'is_gameplay': is_gameplay,
+                'classification': classification
+            })
+            
+            status = "🎮" if is_gameplay else "❌"
+            print(f"{start//60:02d}m-{end//60:02d}m {status} {classification}")
+            
+            Path(frame_path).unlink(missing_ok=True)
+        
+        return segments
+    
+    def extract_highlights(self, video_path, gameplay_segments):
+        """Extract highlights from the gameplay segments."""
+        # Implement the Whisper + Chat + Audio analysis
+        # Only on segments marked as gameplay
+        pass
+
+if __name__ == '__main__':
+    detector = VLM6800XTDetector()
+    
+    video = "nuevo_stream_360p.mp4"
+    segments = detector.analyze_video_segments(video)
+    
+    # Save
+    with open('gameplay_segments_vlm.json', 'w') as f:
+        json.dump(segments, f, indent=2)
+```
+
+## Optimizations for 16GB VRAM
+
+### Batch Processing
+```python
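+# Process multiple frames at once.
+# Fragment: `timestamps`, `extract_frame` and `model` come from the surrounding pipeline.
+batch_size = 8  # Adjust to the available VRAM
+
+results = []
+frames_batch = []
+for ts in timestamps:
+    frames_batch.append(extract_frame(ts))
+
+    if len(frames_batch) == batch_size:
+        # Run the full batch on the GPU
+        results.extend(model(frames_batch))
+        frames_batch = []
+
+# Flush the last, partially filled batch
+if frames_batch:
+    results.extend(model(frames_batch))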
+```
+
+### Mixed Precision
+```python
+# Use FP16 to save VRAM
+model = model.half()  # Convert to float16
+
+# Or with accelerate
+from accelerate import Accelerator
+accelerator = Accelerator(mixed_precision='fp16')
+```
+
+### Gradient Checkpointing (if training)
+```python
+model.gradient_checkpointing_enable()
+```
+
+## Model Comparison
+
+| Model | Size | VRAM | Speed | Accuracy |
+|--------|--------|------|-----------|-----------|
+| Moondream 2B | 4GB | 6GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
+| Video-LLaMA 7B | 14GB | 16GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| Qwen2-VL 7B | 16GB | 20GB* | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| LLaVA-NeXT 7B | 14GB | 16GB | ⭐⭐⭐ | ⭐⭐⭐⭐ |
+
+*Requires 4-bit quantization to fit in 16GB
+
+## Quantization Configuration (Saving VRAM)
+
+```python
+# 4-bit quantization for large models (requires the bitsandbytes package)
+import torch
+from transformers import AutoModel, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.float16,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+)
+
+model = AutoModel.from_pretrained(
+    model_path,
+    quantization_config=quantization_config,
+    device_map="auto"
+)
+```
+
+## Testing
+
+```bash
+# Check total VRAM
+python3 -c "import torch; print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')"
+
+# Quick model test
+python3 test_vlm.py --model models/video_llama --test-frame sample.jpg
+```
+
+## Troubleshooting
+
+### Problem: Out of Memory
+**Solution**: Reduce `batch_size` or use 4-bit quantization
+
+### Problem: Slow
+**Solution**: Use CUDA/ROCm graphs, FP16 precision, or a smaller model
+
+### Problem: Low accuracy
+**Solution**: Increase the input frame resolution (512x288 → 1024x576)
+
+## References
+
+- [Video-LLaMA GitHub](https://github.com/DAMO-NLP-SG/Video-LLaMA)
+- [Qwen2-VL HuggingFace](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+- [LLaVA Documentation](https://llava-vl.github.io/)
+- [ROCm PyTorch](https://pytorch.org/get-started/locally/)
+
+## Notes for the Developer
+
+1. **Test first with Moondream 2B** on the RTX 3050 to validate the pipeline
+2. **Then migrate to Video-LLaMA 7B** on the RX 6800 XT
+3. **Use batch processing** to maximize throughput
+4. **Save checkpoints** every 10 minutes of analysis
+5. **Test with short videos** (10 min) before processing 3-hour streams
+
+## TODO
+
+- [ ] Implement GPU decoding with `decord`
+- [ ] Add scene detection with PySceneDetect
+- [ ] Build an efficient batch processing pipeline
+- [ ] Implement a cache for processed frames
+- [ ] Add highlight quality metrics
+- [ ] Create an interactive CLI
+- [ ] Support multiple games (not only LoL)
+- [ ] Integrate with the Twitch API for automatic downloads
+
+---
+
+**Author**: AI Assistant  
+**Date**: 2024  
+**Target Hardware**: AMD RX 6800 XT 16GB / NVIDIA RTX 3050 4GB
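
The classification and filtering steps described in 6800xt.md can be sketched without loading any model. The following is a minimal sketch, not the commit's implementation. It assumes the VLM reply has already been decoded without the echoed prompt, and that segments use the `start` / `end` / `is_gameplay` keys that `vlm_6800xt_detector.py` writes to `gameplay_segments_vlm.json`. The helper names `parse_label` and `merge_gameplay` and the `max_gap` parameter are hypothetical.

```python
import re

# Labels as numbered in the classification prompt of vlm_6800xt_detector.py
LABELS = {
    1: "GAMEPLAY_ACTIVE",
    2: "CHAMPION_SELECT",
    3: "STREAMER_TALKING",
    4: "MENU_WAITING",
    5: "OTHER_GAME",
}


def parse_label(reply):
    """Map a raw VLM reply to one of the five labels, or None if unparseable.

    Assumes `reply` holds only the generated text, not the prompt.
    """
    text = reply.strip().upper()
    # Prefer an explicit label name if the model wrote one out
    for name in LABELS.values():
        if name in text:
            return name
    # Otherwise accept a standalone digit 1-5 (so "10" does not count as "1")
    match = re.search(r"\b([1-5])\b", text)
    return LABELS[int(match.group(1))] if match else None


def merge_gameplay(segments, max_gap=0):
    """Merge gameplay segments into (start, end) ranges, in seconds.

    Segments separated by at most `max_gap` seconds are joined.
    """
    ranges = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if not seg["is_gameplay"]:
            continue
        if ranges and seg["start"] - ranges[-1][1] <= max_gap:
            ranges[-1][1] = max(ranges[-1][1], seg["end"])
        else:
            ranges.append([seg["start"], seg["end"]])
    return [tuple(r) for r in ranges]


if __name__ == "__main__":
    segments = [
        {"start": 0, "end": 60, "is_gameplay": True},
        {"start": 60, "end": 120, "is_gameplay": True},
        {"start": 120, "end": 180, "is_gameplay": False},
        {"start": 180, "end": 240, "is_gameplay": True},
    ]
    print(merge_gameplay(segments))  # [(0, 120), (180, 240)]
```

Merging adjacent one-minute segments yields longer gameplay ranges on which the Whisper + Chat + Audio analysis could run. A non-zero `max_gap` bridges a single misclassified minute at the cost of occasionally including non-gameplay footage.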