- Installed the official notion-client SDK for a robust integration
- Refactored `services/notion_service.py` to use the official Notion SDK
  - Rate limiting with retry and exponential backoff
  - Markdown → Notion blocks parser (headings, bullets, paragraphs)
  - Support for pages and databases
  - Robust error handling
- Automatic integration in `document/generators.py`
  - PDFs are uploaded to Notion automatically after generation
  - Full summary content formatted as blocks
  - Rich metadata (file type, path, date)
- Notion configuration in `main.py`
  - Automatic initialization on service startup
  - Credential validation
- Updated `config/settings.py`
  - Added `load_dotenv()` to load variables from `.env`
  - Notion configuration (NOTION_API, NOTION_DATABASE_ID)
- Utility scripts created:
  - `test_notion_integration.py`: Notion upload test
  - `test_pipeline_notion.py`: full-pipeline test
  - `verify_notion_permissions.py`: permission verification
  - `list_notion_pages.py`: list accessible pages
  - `diagnose_notion.py`: full diagnostics
  - `create_notion_database.py`: create the database automatically
  - `restart_service.sh`: service restart script
- Full documentation in `opus.md`:
  - Exhaustive codebase analysis (42 Python files)
  - Critical bugs identified, with fixes
  - Security improvements (authentication, rate limiting, CORS, CSP)
  - Performance optimizations (Celery, Redis, PostgreSQL, WebSockets)
  - Testing plan (structure, examples, 80% coverage goal)
  - Implementation roadmap (6 detailed sprints)
  - Advanced Notion integration documented

Status: Notion working correctly; PDFs are uploaded automatically.
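The Markdown → Notion blocks parser mentioned above can be sketched roughly as follows. This is a minimal illustration, not the actual `notion_service.py` implementation; the block shapes follow the Notion API's block object format, and the helper names are made up for this example.

```python
# Minimal sketch of a Markdown -> Notion blocks converter (headings,
# bullets, paragraphs), as described in the changelog above.
# Block dict shapes follow the Notion API; helper names are illustrative.

def _rich_text(content: str) -> list:
    """Wrap plain text in a Notion rich_text array."""
    return [{"type": "text", "text": {"content": content}}]

def markdown_to_blocks(markdown: str) -> list:
    """Convert headings, bullets, and paragraphs into Notion block objects."""
    blocks = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith("### "):
            blocks.append({"type": "heading_3",
                           "heading_3": {"rich_text": _rich_text(stripped[4:])}})
        elif stripped.startswith("## "):
            blocks.append({"type": "heading_2",
                           "heading_2": {"rich_text": _rich_text(stripped[3:])}})
        elif stripped.startswith("# "):
            blocks.append({"type": "heading_1",
                           "heading_1": {"rich_text": _rich_text(stripped[2:])}})
        elif stripped.startswith(("- ", "* ")):
            blocks.append({"type": "bulleted_list_item",
                           "bulleted_list_item": {"rich_text": _rich_text(stripped[2:])}})
        else:
            blocks.append({"type": "paragraph",
                           "paragraph": {"rich_text": _rich_text(stripped)}})
    return blocks
```

The resulting list can be passed as `children` when creating a page with the official SDK.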
# 🚀 CBCFacil - Improvement and Optimization Plan

**Date:** January 26, 2026

**Project:** CBCFacil v9

**Documentation:** Improvements, bug fixes, recommendations, and Notion integration
---
## 📋 TABLE OF CONTENTS

1. [Executive Summary](#executive-summary)
2. [Critical Bugs to Fix](#critical-bugs-to-fix)
3. [Security Improvements](#security-improvements)
4. [Performance Optimizations](#performance-optimizations)
5. [Code and Maintainability Improvements](#code-and-maintainability-improvements)
6. [Advanced Notion Integration](#advanced-notion-integration)
7. [Testing Plan](#testing-plan)
8. [Implementation Roadmap](#implementation-roadmap)
---
## 📊 EXECUTIVE SUMMARY

CBCFacil is a well-architected AI document-processing system, but it needs critical improvements in security, testing, and scalability before it can be considered production-ready.

### Overall Rating

```
Architecture:   ████████░░ 8/10
Code:           ███████░░░ 7/10
Security:       ████░░░░░░ 4/10
Testing:        ░░░░░░░░░░ 0/10
Documentation:  █████████░ 9/10
Performance:    ██████░░░░ 6/10

TOTAL:          ██████░░░░ 5.7/10
```

### Priorities

- 🔴 **CRITICAL:** Basic security + fundamental tests (Sprint 1)
- 🟡 **HIGH:** Performance and scalability (Sprint 2)
- 🟢 **MEDIUM:** Frontend modernization and advanced features (Sprints 3-4)
---
## 🐛 CRITICAL BUGS TO FIX

### 1. 🔴 Notion API Token Exposed in `.env.example`

**Location:** `config/settings.py:47`, `.env.example`

**Problem:**
```bash
# .env.example contains a real Notion token
NOTION_API_TOKEN=secret_XXX...REAL_TOKEN...XXX
```

**Risk:** High - token publicly exposed in the repository

**Solution:**
```bash
# .env.example
NOTION_API_TOKEN=secret_YOUR_NOTION_TOKEN_HERE_replace_this
NOTION_DATABASE_ID=your_database_id_here
```

**Immediate Action:**

1. Rotate the Notion token from the Notion console
2. Update `.env.example` with a placeholder
3. Verify that `.env` is listed in `.gitignore`
4. Scan the Git history for exposed tokens
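Step 4 above (scanning the Git history) can be automated. The sketch below feeds the output of `git log -p --all` through a regex; the token patterns are assumptions about token formats (Notion internal-integration tokens have historically started with `secret_`), and dedicated tools such as gitleaks or trufflehog do this far more thoroughly.

```python
import re
import subprocess

# Assumed token shapes; extend the alternation for other providers
TOKEN_RE = re.compile(r"(secret_[A-Za-z0-9]{20,}|ntn_[A-Za-z0-9]{20,})")

def find_token_leaks(text: str) -> list:
    """Return token-like strings found in text."""
    return TOKEN_RE.findall(text)

def scan_git_history(repo_path: str = ".") -> list:
    """Scan every commit on every branch for leaked tokens."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_token_leaks(log)
```

If a match is found, rotating the token is the only real fix; rewriting history only helps before the repository is shared.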
---
### 2. 🔴 Path Traversal Vulnerability in `/downloads`

**Location:** `api/routes.py:142-148`

**Problem:**
```python
@app.route('/downloads/<path:filepath>')
def serve_file(filepath):
    safe_path = os.path.normpath(filepath)
    # Insufficient validation - can be bypassed with symlinks
    if '..' in filepath or filepath.startswith('/'):
        abort(403)
```

**Risk:** High - unauthorized access to files on the system

**Solution:**
```python
from pathlib import Path

from werkzeug.security import safe_join
from werkzeug.utils import secure_filename

@app.route('/downloads/<path:filepath>')
def serve_file(filepath):
    # Sanitize the filename
    safe_filename = secure_filename(filepath)

    # Use safe_join to prevent path traversal
    base_dir = settings.LOCAL_DOWNLOADS_PATH
    safe_path = safe_join(str(base_dir), safe_filename)

    if safe_path is None:
        abort(403, "Access denied")

    # Verify that the resolved path stays inside the allowed directory
    resolved_path = Path(safe_path).resolve()
    if not str(resolved_path).startswith(str(base_dir.resolve())):
        abort(403, "Access denied")

    if not resolved_path.exists() or not resolved_path.is_file():
        abort(404)

    return send_file(resolved_path)
```
---
### 3. 🔴 SECRET_KEY Generated Randomly

**Location:** `api/routes.py:30`

**Problem:**
```python
# A random SECRET_KEY is generated when none is configured
app.config['SECRET_KEY'] = os.getenv('SECRET_KEY', os.urandom(24).hex())
```

**Risk:** Medium - sessions are invalidated on every restart; insecure in production

**Solution:**
```python
# config/settings.py
@property
def SECRET_KEY(self) -> str:
    key = os.getenv('SECRET_KEY')
    if not key:
        raise ValueError(
            "SECRET_KEY is required in production. "
            "Generate one with: python -c 'import secrets; print(secrets.token_hex(32))'"
        )
    return key

# api/routes.py
app.config['SECRET_KEY'] = settings.SECRET_KEY
```

**Action:**
```bash
# Generate a secure secret key and append it to .env
python -c 'import secrets; print("SECRET_KEY=" + secrets.token_hex(32))' >> .env
```
---
### 4. 🔴 Imports Inside Functions

**Location:** `main.py:306-342`

**Problem:**
```python
def process_audio_file(audio_path: Path):
    from processors.audio_processor import audio_processor  # imports inside
    from document.generators import DocumentGenerator       # the function
    # ...
```

**Risk:** Medium - performance hit, hides circular-import problems

**Solution:**
```python
# main.py (top level)
from processors.audio_processor import audio_processor
from processors.pdf_processor import pdf_processor
from document.generators import DocumentGenerator

# Remove all imports from inside functions
def process_audio_file(audio_path: Path):
    # Use the module-level imports
    result = audio_processor.process(audio_path)
    # ...
```
---
### 5. 🔴 No Authentication on the API

**Location:** `api/routes.py` (all endpoints)

**Problem:** Any user can reach every endpoint without authenticating.

**Risk:** Critical - data exposure and unauthorized control

**Solution with an API Key:**

```python
# config/settings.py
@property
def API_KEY(self) -> Optional[str]:
    return os.getenv('API_KEY')

@property
def REQUIRE_AUTH(self) -> bool:
    return os.getenv('REQUIRE_AUTH', 'true').lower() == 'true'

# api/auth.py (new file)
from functools import wraps

from flask import request, abort, jsonify

from config import settings

def require_api_key(f):
    """Decorator to require API key authentication"""
    @wraps(f)
    def decorated_function(*args, **kwargs):
        if not settings.REQUIRE_AUTH:
            return f(*args, **kwargs)

        api_key = request.headers.get('X-API-Key')
        if not api_key:
            abort(401, description='API key required')

        if api_key != settings.API_KEY:
            abort(403, description='Invalid API key')

        return f(*args, **kwargs)
    return decorated_function

# api/routes.py
from api.auth import require_api_key

@app.route('/api/files')
@require_api_key
def get_files():
    # ...
```

**Solution with JWT (more robust):**

```bash
# requirements.txt
PyJWT>=2.8.0
flask-jwt-extended>=4.5.3
```

```python
# api/auth.py
from flask import request, jsonify, abort

from flask_jwt_extended import JWTManager, create_access_token, jwt_required, get_jwt_identity

jwt = JWTManager(app)

@app.route('/api/login', methods=['POST'])
def login():
    username = request.json.get('username')
    password = request.json.get('password')

    # Validate credentials (use bcrypt in production)
    if username == settings.ADMIN_USERNAME and password == settings.ADMIN_PASSWORD:
        access_token = create_access_token(identity=username)
        return jsonify(access_token=access_token)

    abort(401)

@app.route('/api/files')
@jwt_required()
def get_files():
    current_user = get_jwt_identity()
    # ...
```
---
### 6. 🟡 Text Truncation in Summaries

**Location:** `document/generators.py:38, 61`

**Problem:**
```python
bullet_prompt = f"""...\nTexto:\n{text[:15000]}"""   # truncates at 15k chars
summary_prompt = f"""...\n{text[:20000]}\n..."""     # truncates at 20k chars
```

**Risk:** Medium - information loss on long documents

**Solution - Intelligent Chunking:**

```python
def _chunk_text(self, text: str, max_chunk_size: int = 15000) -> List[str]:
    """Split text into intelligent chunks by paragraphs"""
    if len(text) <= max_chunk_size:
        return [text]

    chunks = []
    current_chunk = []
    current_size = 0

    # Split by double newlines (paragraphs)
    paragraphs = text.split('\n\n')

    for para in paragraphs:
        para_size = len(para)

        if current_size + para_size > max_chunk_size:
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = []
                current_size = 0

        current_chunk.append(para)
        current_size += para_size

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks

def generate_summary(self, text: str, base_name: str):
    """Generate summary with intelligent chunking"""
    chunks = self._chunk_text(text, max_chunk_size=15000)

    # Process each chunk and combine
    all_bullets = []
    for i, chunk in enumerate(chunks):
        self.logger.info(f"Processing chunk {i+1}/{len(chunks)}")
        bullet_prompt = f"""Analiza el siguiente texto (parte {i+1} de {len(chunks)})...\n{chunk}"""
        bullets = self.ai_provider.generate_text(bullet_prompt)
        all_bullets.append(bullets)

    # Combine all bullets
    combined_bullets = '\n'.join(all_bullets)

    # Generate a unified summary from the combined bullets
    # ...
```
---
### 7. 🟡 Cache Key Uses Only 500 Characters

**Location:** `services/ai_service.py:111`

**Problem:**
```python
def _get_cache_key(self, prompt: str, model: str = "default") -> str:
    content = f"{model}:{prompt[:500]}"  # only the first 500 chars
    return hashlib.sha256(content.encode()).hexdigest()
```

**Risk:** Medium - cache collisions between prompts that share a prefix

**Solution:**
```python
def _get_cache_key(self, prompt: str, model: str = "default") -> str:
    """Generate cache key from full prompt hash"""
    content = f"{model}:{prompt}"  # hash the entire prompt
    return hashlib.sha256(content.encode()).hexdigest()
```
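To make the collision risk concrete, here is a standalone demonstration (the key functions are reproduced outside their class; names are illustrative). Two prompts that share their first 500 characters, e.g. a long shared instruction prefix followed by different documents, collide under the truncating scheme, so the second request would silently receive the first request's cached answer:

```python
import hashlib

def truncated_key(prompt: str, model: str = "default") -> str:
    # The buggy scheme: only the first 500 characters are hashed
    return hashlib.sha256(f"{model}:{prompt[:500]}".encode()).hexdigest()

def full_key(prompt: str, model: str = "default") -> str:
    # The fixed scheme: hash the entire prompt
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

shared_prefix = "Resume el siguiente texto:\n" + "x" * 500
prompt_a = shared_prefix + "\ndocument A"
prompt_b = shared_prefix + "\ndocument B"

# Collision: document B would get document A's cached summary
assert truncated_key(prompt_a) == truncated_key(prompt_b)
# After the fix, the keys are distinct
assert full_key(prompt_a) != full_key(prompt_b)
```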
---
### 8. 🟡 Bloom Filter Uses MD5

**Location:** `storage/processed_registry.py:24`

**Problem:**
```python
import hashlib

def _hash(self, item: str) -> int:
    return int(hashlib.md5(item.encode()).hexdigest(), 16)  # MD5 is not secure
```

**Risk:** Low - MD5 is obsolete; collisions are possible

**Solution:**
```python
def _hash(self, item: str) -> int:
    """Use SHA256 instead of MD5 for better collision resistance"""
    return int(hashlib.sha256(item.encode()).hexdigest(), 16) % (2**64)
```
---
## 🔒 SECURITY IMPROVEMENTS

### 1. Implement Rate Limiting

**Install flask-limiter:**
```bash
pip install flask-limiter
```

**Implementation:**
```python
# api/routes.py
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(
    app=app,
    key_func=get_remote_address,
    default_limits=["200 per day", "50 per hour"],
    storage_uri="redis://localhost:6379"  # or memory:// for testing
)

@app.route('/api/files')
@limiter.limit("30 per minute")
@require_api_key
def get_files():
    # ...

@app.route('/api/regenerate-summary', methods=['POST'])
@limiter.limit("5 per minute")  # stricter for expensive operations
@require_api_key
def regenerate_summary():
    # ...
```
---
### 2. Configure Restrictive CORS

**Location:** `api/routes.py:25`

**Problem:**
```python
CORS(app)  # allows every origin (*)
```

**Solution:**
```python
# config/settings.py
@property
def CORS_ORIGINS(self) -> List[str]:
    origins_str = os.getenv('CORS_ORIGINS', 'http://localhost:5000')
    return [o.strip() for o in origins_str.split(',')]

# api/routes.py
from flask_cors import CORS

CORS(app, resources={
    r"/api/*": {
        "origins": settings.CORS_ORIGINS,
        "methods": ["GET", "POST", "DELETE"],
        "allow_headers": ["Content-Type", "X-API-Key", "Authorization"],
        "expose_headers": ["Content-Type"],
        "supports_credentials": True,
        "max_age": 3600
    }
})
```

**.env configuration:**
```bash
# Production
CORS_ORIGINS=https://cbcfacil.com,https://app.cbcfacil.com

# Development
CORS_ORIGINS=http://localhost:5000,http://localhost:3000
```
---
### 3. Implement a Content Security Policy (CSP)

**New functionality:**
```python
# api/security.py (new file)
from flask import make_response

def add_security_headers(response):
    """Add security headers to all responses"""
    response.headers['Content-Security-Policy'] = (
        "default-src 'self'; "
        "script-src 'self' 'unsafe-inline' https://fonts.googleapis.com; "
        "style-src 'self' 'unsafe-inline' https://fonts.googleapis.com; "
        "font-src 'self' https://fonts.gstatic.com; "
        "img-src 'self' data: https:; "
        "connect-src 'self'"
    )
    response.headers['X-Content-Type-Options'] = 'nosniff'
    response.headers['X-Frame-Options'] = 'DENY'
    response.headers['X-XSS-Protection'] = '1; mode=block'
    response.headers['Strict-Transport-Security'] = 'max-age=31536000; includeSubDomains'
    return response

# api/routes.py
from api.security import add_security_headers

@app.after_request
def apply_security_headers(response):
    return add_security_headers(response)
```
---
### 4. Sanitize Inputs and Outputs

**New functionality:**
```python
# core/sanitizer.py (new file)
import html
from pathlib import Path

class InputSanitizer:
    """Sanitize user inputs"""

    @staticmethod
    def sanitize_filename(filename: str) -> str:
        """Remove dangerous characters from filename"""
        # Remove path separators
        filename = filename.replace('/', '_').replace('\\', '_')

        # Remove null bytes
        filename = filename.replace('\x00', '')

        # Limit length
        filename = filename[:255]

        # Remove leading/trailing dots and spaces
        filename = filename.strip('. ')

        return filename

    @staticmethod
    def sanitize_html(text: str) -> str:
        """Escape HTML to prevent XSS"""
        return html.escape(text)

    @staticmethod
    def sanitize_path(path: str, base_dir: Path) -> Path:
        """Ensure path is within base directory"""
        from werkzeug.security import safe_join

        safe_path = safe_join(str(base_dir), path)
        if safe_path is None:
            raise ValueError("Invalid path")

        resolved = Path(safe_path).resolve()
        if not str(resolved).startswith(str(base_dir.resolve())):
            raise ValueError("Path traversal attempt")

        return resolved

# Usage in api/routes.py
from core.sanitizer import InputSanitizer

@app.route('/api/transcription/<filename>')
@require_api_key
def get_transcription(filename):
    # Sanitize the filename
    safe_filename = InputSanitizer.sanitize_filename(filename)
    # ...
```
---
### 5. Filter Sensitive Information from Logs

**Implementation:**
```python
# core/logging_filter.py (new file)
import logging
import re

class SensitiveDataFilter(logging.Filter):
    """Filter sensitive data from logs"""

    PATTERNS = [
        (re.compile(r'(api[_-]?key["\']?\s*[:=]\s*["\']?)([^"\']+)(["\']?)', re.I), r'\1***REDACTED***\3'),
        (re.compile(r'(token["\']?\s*[:=]\s*["\']?)([^"\']+)(["\']?)', re.I), r'\1***REDACTED***\3'),
        (re.compile(r'(password["\']?\s*[:=]\s*["\']?)([^"\']+)(["\']?)', re.I), r'\1***REDACTED***\3'),
        (re.compile(r'(secret["\']?\s*[:=]\s*["\']?)([^"\']+)(["\']?)', re.I), r'\1***REDACTED***\3'),
    ]

    def filter(self, record):
        message = record.getMessage()

        for pattern, replacement in self.PATTERNS:
            message = pattern.sub(replacement, message)

        record.msg = message
        record.args = ()

        return True

# main.py
from core.logging_filter import SensitiveDataFilter

# Attach the filter to every handler
for handler in logging.root.handlers:
    handler.addFilter(SensitiveDataFilter())
```
---
### 6. Use HTTPS with a Reverse Proxy

**nginx configuration:**
```nginx
# /etc/nginx/sites-available/cbcfacil

# Note: limit_req_zone must be declared at the http {} level (e.g. in nginx.conf)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    listen 80;
    server_name cbcfacil.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name cbcfacil.com;

    # SSL Configuration
    ssl_certificate /etc/letsencrypt/live/cbcfacil.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/cbcfacil.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Security Headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Rate limiting (zone declared above, in the http context)
    limit_req zone=api burst=20 nodelay;

    # Proxy to Flask
    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }

    # Static files caching
    location /static/ {
        alias /home/app/cbcfacil/static/;
        expires 1y;
        add_header Cache-Control "public, immutable";
    }
}
```
---
## ⚡ PERFORMANCE OPTIMIZATIONS

### 1. Implement a Queue System with Celery

**Current Problem:** Synchronous processing blocks the main loop.

**Installation:**
```bash
pip install celery redis
```

**Configuration:**
```python
# celery_app.py (new file)
from celery import Celery

from config import settings

celery_app = Celery(
    'cbcfacil',
    broker=settings.CELERY_BROKER_URL,
    backend=settings.CELERY_RESULT_BACKEND
)

celery_app.conf.update(
    task_serializer='json',
    accept_content=['json'],
    result_serializer='json',
    timezone='UTC',
    enable_utc=True,
    task_track_started=True,
    task_time_limit=3600,       # 1 hour
    task_soft_time_limit=3300,  # 55 minutes
)

# tasks/processing.py (new file)
from pathlib import Path

from celery_app import celery_app
from processors.audio_processor import audio_processor
from processors.pdf_processor import pdf_processor
from document.generators import DocumentGenerator

@celery_app.task(bind=True, max_retries=3)
def process_audio_task(self, audio_path: str):
    """Process audio file asynchronously"""
    try:
        result = audio_processor.process(Path(audio_path))
        if result.success:
            generator = DocumentGenerator()
            generator.generate_summary(result.data['text'], result.data['base_name'])
        return {'success': True, 'file': audio_path}
    except Exception as e:
        self.retry(exc=e, countdown=60)

@celery_app.task(bind=True, max_retries=3)
def process_pdf_task(self, pdf_path: str):
    """Process PDF file asynchronously"""
    try:
        result = pdf_processor.process(Path(pdf_path))
        if result.success:
            generator = DocumentGenerator()
            generator.generate_summary(result.data['text'], result.data['base_name'])
        return {'success': True, 'file': pdf_path}
    except Exception as e:
        self.retry(exc=e, countdown=60)

# main.py
from tasks.processing import process_audio_task, process_pdf_task

def process_new_files(files: List[Path]):
    """Queue files for processing"""
    for file in files:
        if file.suffix.lower() in ['.mp3', '.wav', '.m4a']:
            task = process_audio_task.delay(str(file))
            logger.info(f"Queued audio processing: {file.name} (task_id={task.id})")
        elif file.suffix.lower() == '.pdf':
            task = process_pdf_task.delay(str(file))
            logger.info(f"Queued PDF processing: {file.name} (task_id={task.id})")

# config/settings.py
@property
def CELERY_BROKER_URL(self) -> str:
    return os.getenv('CELERY_BROKER_URL', 'redis://localhost:6379/0')

@property
def CELERY_RESULT_BACKEND(self) -> str:
    return os.getenv('CELERY_RESULT_BACKEND', 'redis://localhost:6379/0')
```

**Running the workers:**
```bash
# Terminal 1: Flask app
python main.py

# Terminal 2: Celery worker
celery -A celery_app worker --loglevel=info --concurrency=2

# Terminal 3: Celery beat (for scheduled tasks)
celery -A celery_app beat --loglevel=info
```
---
### 2. Implement Redis for Distributed Caching

**Problem:** The in-memory LRU cache is lost on every restart.

**Installation:**
```bash
pip install redis hiredis
```

**Implementation:**
```python
# services/cache_service.py (new file)
import json
import logging
from typing import Optional, Any

import redis

from config import settings

logger = logging.getLogger(__name__)

class CacheService:
    """Distributed cache with Redis"""

    def __init__(self):
        self.redis_client = redis.Redis(
            host=settings.REDIS_HOST,
            port=settings.REDIS_PORT,
            db=settings.REDIS_DB,
            decode_responses=True,
            socket_connect_timeout=5,
            socket_timeout=5
        )
        self.default_ttl = 3600  # 1 hour

    def get(self, key: str) -> Optional[Any]:
        """Get value from cache"""
        try:
            value = self.redis_client.get(key)
            if value:
                return json.loads(value)
            return None
        except Exception as e:
            logger.error(f"Cache get error: {e}")
            return None

    def set(self, key: str, value: Any, ttl: Optional[int] = None) -> bool:
        """Set value in cache"""
        try:
            ttl = ttl or self.default_ttl
            serialized = json.dumps(value)
            return self.redis_client.setex(key, ttl, serialized)
        except Exception as e:
            logger.error(f"Cache set error: {e}")
            return False

    def delete(self, key: str) -> bool:
        """Delete key from cache"""
        try:
            return bool(self.redis_client.delete(key))
        except Exception as e:
            logger.error(f"Cache delete error: {e}")
            return False

    def get_or_compute(self, key: str, compute_fn, ttl: Optional[int] = None):
        """Get from cache or compute and store"""
        cached = self.get(key)
        if cached is not None:
            return cached

        value = compute_fn()
        self.set(key, value, ttl)
        return value

cache_service = CacheService()

# services/ai_service.py
from services.cache_service import cache_service

class AIService:
    def generate_text(self, prompt: str, model: str = "default") -> str:
        cache_key = self._get_cache_key(prompt, model)

        # Use the Redis cache
        def compute():
            return self.ai_provider.generate_text(prompt)

        return cache_service.get_or_compute(cache_key, compute, ttl=3600)

# config/settings.py
@property
def REDIS_HOST(self) -> str:
    return os.getenv('REDIS_HOST', 'localhost')

@property
def REDIS_PORT(self) -> int:
    return int(os.getenv('REDIS_PORT', '6379'))

@property
def REDIS_DB(self) -> int:
    return int(os.getenv('REDIS_DB', '0'))
```
---
### 3. Migrate Metadata to PostgreSQL

**Problem:** `processed_files.txt` does not scale and lacks ACID guarantees.

**Installation:**
```bash
pip install psycopg2-binary sqlalchemy alembic
```

**Schema:**
```python
# models/database.py (new file)
from contextlib import contextmanager
from datetime import datetime

from sqlalchemy import create_engine, Column, Integer, String, DateTime, Boolean, JSON, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

from config import settings

Base = declarative_base()

class ProcessedFile(Base):
    __tablename__ = 'processed_files'

    id = Column(Integer, primary_key=True)
    filename = Column(String(255), unique=True, nullable=False, index=True)
    filepath = Column(String(512), nullable=False)
    file_type = Column(String(50), nullable=False)   # audio, pdf, text
    status = Column(String(50), default='pending')   # pending, processing, completed, failed

    # Timestamps
    created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
    processed_at = Column(DateTime)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Processing results
    transcription_text = Column(Text)
    summary_text = Column(Text)

    # Generated files
    markdown_path = Column(String(512))
    docx_path = Column(String(512))
    pdf_path = Column(String(512))

    # Metadata
    file_size = Column(Integer)
    duration = Column(Integer)    # for audio files
    page_count = Column(Integer)  # for PDFs

    # Notion integration
    notion_uploaded = Column(Boolean, default=False)
    notion_page_id = Column(String(255))

    # Metrics
    processing_time = Column(Integer)  # seconds
    error_message = Column(Text)
    retry_count = Column(Integer, default=0)

    # Additional metadata ("metadata" is a reserved attribute on SQLAlchemy
    # declarative models, so a different name is required)
    extra_metadata = Column(JSON)

# Database session
engine = create_engine(settings.DATABASE_URL)
SessionLocal = sessionmaker(bind=engine)

@contextmanager
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

# storage/processed_registry.py (refactor)
from models.database import ProcessedFile, get_db

class ProcessedRegistry:
    def is_processed(self, filename: str) -> bool:
        with get_db() as db:
            return db.query(ProcessedFile).filter_by(
                filename=filename,
                status='completed'
            ).first() is not None

    def mark_processed(self, filename: str, metadata: dict):
        with get_db() as db:
            file_record = ProcessedFile(
                filename=filename,
                filepath=metadata.get('filepath'),
                file_type=metadata.get('file_type'),
                status='completed',
                processed_at=datetime.utcnow(),
                transcription_text=metadata.get('transcription'),
                summary_text=metadata.get('summary'),
                markdown_path=metadata.get('markdown_path'),
                docx_path=metadata.get('docx_path'),
                pdf_path=metadata.get('pdf_path'),
                notion_uploaded=metadata.get('notion_uploaded', False),
                processing_time=metadata.get('processing_time'),
                extra_metadata=metadata
            )
            db.add(file_record)
            db.commit()

# config/settings.py
@property
def DATABASE_URL(self) -> str:
    return os.getenv(
        'DATABASE_URL',
        'postgresql://cbcfacil:password@localhost/cbcfacil'
    )
```

**Migrations with Alembic:**
```bash
# Initialize Alembic
alembic init migrations

# Create a migration
alembic revision --autogenerate -m "Create processed_files table"

# Apply it
alembic upgrade head
```
---
### 4. WebSockets for Real-Time Updates

**Installation:**
```bash
pip install flask-socketio python-socketio eventlet
```

**Implementation:**
```python
# api/routes.py
from flask_socketio import SocketIO, emit, join_room

socketio = SocketIO(app, cors_allowed_origins=settings.CORS_ORIGINS, async_mode='eventlet')

@socketio.on('connect')
def handle_connect():
    emit('connected', {'message': 'Connected to CBCFacil'})

@socketio.on('subscribe_file')
def handle_subscribe(data):
    filename = data.get('filename')
    # Join a room to receive updates for this file
    join_room(filename)

# tasks/processing.py
from api.routes import socketio

@celery_app.task(bind=True)
def process_audio_task(self, audio_path: str):
    filename = Path(audio_path).name

    # Notify start
    socketio.emit('processing_started', {
        'filename': filename,
        'status': 'processing'
    }, room=filename)

    try:
        # Progress updates
        socketio.emit('processing_progress', {
            'filename': filename,
            'progress': 25,
            'stage': 'transcription'
        }, room=filename)

        result = audio_processor.process(Path(audio_path))

        socketio.emit('processing_progress', {
            'filename': filename,
            'progress': 75,
            'stage': 'summary_generation'
        }, room=filename)

        generator = DocumentGenerator()
        generator.generate_summary(result.data['text'], result.data['base_name'])

        # Notify completion
        socketio.emit('processing_completed', {
            'filename': filename,
            'status': 'completed',
            'progress': 100
        }, room=filename)

    except Exception as e:
        socketio.emit('processing_failed', {
            'filename': filename,
            'status': 'failed',
            'error': str(e)
        }, room=filename)
        raise
```

```javascript
// templates/index.html
const socket = io('http://localhost:5000');

socket.on('connect', () => {
    console.log('Connected to server');
});

socket.on('processing_started', (data) => {
    showNotification(`Processing started: ${data.filename}`);
});

socket.on('processing_progress', (data) => {
    updateProgressBar(data.filename, data.progress, data.stage);
});

socket.on('processing_completed', (data) => {
    showNotification(`Completed: ${data.filename}`, 'success');
    refreshFileList();
});

socket.on('processing_failed', (data) => {
    showNotification(`Failed: ${data.filename} - ${data.error}`, 'error');
});

// Subscribe to a specific file
function subscribeToFile(filename) {
    socket.emit('subscribe_file', { filename: filename });
}
```
---
## 📝 MEJORAS DE CÓDIGO Y MANTENIBILIDAD
|
|
|
|
### 1. Agregar Type Hints Completos
|
|
|
|
**Problema:** No todos los métodos tienen type hints
|
|
|
|
**Solución:**
|
|
```python
|
|
# Usar mypy para verificar
|
|
pip install mypy
|
|
|
|
# pyproject.toml
|
|
[tool.mypy]
|
|
python_version = "3.10"
|
|
warn_return_any = true
|
|
warn_unused_configs = true
|
|
disallow_untyped_defs = true
|
|
disallow_incomplete_defs = true
|
|
|
|
# Ejecutar
|
|
mypy cbcfacil/
|
|
```

---

### 2. Implement Log Rotation

**Problem:** `main.log` can grow without bound.

**Solution:**
```python
# main.py
import logging
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler

# Rotate by size (max 10MB, 5 backups)
file_handler = RotatingFileHandler(
    'main.log',
    maxBytes=10*1024*1024,  # 10MB
    backupCount=5
)

# Or rotate daily
file_handler = TimedRotatingFileHandler(
    'main.log',
    when='midnight',
    interval=1,
    backupCount=30  # Keep 30 days
)

file_handler.setFormatter(formatter)  # formatter defined elsewhere in main.py
logging.root.addHandler(file_handler)
```

---

### 3. Add Advanced Health Checks

```python
# core/health_check.py (improved)
from datetime import datetime
from typing import Any, Dict

from sqlalchemy import text  # SQLAlchemy 2.0 requires text() for raw SQL


class HealthCheckService:
    def get_full_status(self) -> Dict[str, Any]:
        """Get comprehensive health status"""
        return {
            'status': 'healthy',
            'timestamp': datetime.utcnow().isoformat(),
            'version': settings.APP_VERSION,
            'checks': {
                'database': self._check_database(),
                'redis': self._check_redis(),
                'celery': self._check_celery(),
                'gpu': self._check_gpu(),
                'disk_space': self._check_disk_space(),
                'external_apis': {
                    'nextcloud': self._check_nextcloud(),
                    'notion': self._check_notion(),
                    'telegram': self._check_telegram(),
                    'claude': self._check_claude(),
                    'gemini': self._check_gemini(),
                }
            },
            'metrics': {
                'processed_files_today': self._count_processed_today(),
                'queue_size': self._get_queue_size(),
                'avg_processing_time': self._get_avg_processing_time(),
                'error_rate': self._get_error_rate(),
            }
        }

    def _check_database(self) -> Dict[str, Any]:
        try:
            from models.database import engine
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
            return {'status': 'healthy'}
        except Exception as e:
            return {'status': 'unhealthy', 'error': str(e)}

    def _check_redis(self) -> Dict[str, Any]:
        try:
            from services.cache_service import cache_service
            cache_service.redis_client.ping()
            return {'status': 'healthy'}
        except Exception as e:
            return {'status': 'unhealthy', 'error': str(e)}

    def _check_celery(self) -> Dict[str, Any]:
        try:
            from celery_app import celery_app
            stats = celery_app.control.inspect().stats()
            active = celery_app.control.inspect().active()

            return {
                'status': 'healthy' if stats else 'unhealthy',
                'workers': len(stats) if stats else 0,
                'active_tasks': sum(len(tasks) for tasks in active.values()) if active else 0
            }
        except Exception as e:
            return {'status': 'unhealthy', 'error': str(e)}
```
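
The `_check_disk_space` helper referenced above is not shown in the snippet. A minimal sketch using only the standard library, written as a standalone function for clarity (the 10% free-space threshold is an assumption, not a value from the codebase):

```python
import shutil
from typing import Any, Dict

def check_disk_space(path: str = "/", min_free_ratio: float = 0.10) -> Dict[str, Any]:
    """Report disk usage; flag unhealthy below the free-space threshold (assumed 10%)."""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    return {
        'status': 'healthy' if free_ratio >= min_free_ratio else 'unhealthy',
        'free_gb': round(usage.free / 1024**3, 2),
        'total_gb': round(usage.total / 1024**3, 2),
        'free_percent': round(free_ratio * 100, 1),
    }
```

The same dict shape as the other `_check_*` methods keeps the `get_full_status()` payload uniform.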

---

### 4. Modularize the Frontend

**Problem:** `index.html` is 2,500+ lines long.

**Solution - migrate to React:**

```bash
# Create a modern frontend
npx create-react-app frontend
cd frontend
npm install axios socket.io-client recharts date-fns
```

**Proposed structure:**
```
frontend/
├── src/
│   ├── components/
│   │   ├── Dashboard/
│   │   │   ├── StatsCards.jsx
│   │   │   ├── ProcessingQueue.jsx
│   │   │   └── SystemHealth.jsx
│   │   ├── Files/
│   │   │   ├── FileList.jsx
│   │   │   ├── FileItem.jsx
│   │   │   └── FileUpload.jsx
│   │   ├── Preview/
│   │   │   ├── PreviewPanel.jsx
│   │   │   ├── TranscriptionView.jsx
│   │   │   └── SummaryView.jsx
│   │   ├── Versions/
│   │   │   └── VersionHistory.jsx
│   │   └── Layout/
│   │       ├── Sidebar.jsx
│   │       ├── Header.jsx
│   │       └── Footer.jsx
│   ├── hooks/
│   │   ├── useWebSocket.js
│   │   ├── useFiles.js
│   │   └── useAuth.js
│   ├── services/
│   │   ├── api.js
│   │   └── socket.js
│   ├── store/
│   │   └── store.js (Redux/Zustand)
│   ├── App.jsx
│   └── index.jsx
└── package.json
```

---

## 🔗 ADVANCED NOTION INTEGRATION

### Current State

The Notion integration is **partially implemented** in `services/notion_service.py` and `document/generators.py`. Currently:

- ✅ PDF upload to a Notion database
- ✅ Page creation with title and status
- ⚠️ Upload via base64 (limited to 5MB by the Notion API)
- ❌ No bidirectional synchronization
- ❌ Existing pages are not updated
- ❌ Notion rate limits are not handled
- ❌ No webhook for changes made in Notion

### Proposed Improvements

#### 1. Migrate to the Official Notion Client

**Problem:** Direct use of `requests` with no rate-limit handling.

**Solution:**
```bash
pip install notion-client
```

```python
# services/notion_service.py (refactored)
import logging
import time
from datetime import datetime
from pathlib import Path
from typing import Optional, Dict, Any, List

from notion_client import Client
from notion_client.errors import APIResponseError


class NotionService:
    """Enhanced Notion integration service"""

    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self._client: Optional[Client] = None
        self._database_id: Optional[str] = None
        self._rate_limiter = RateLimiter(max_requests=3, time_window=1)  # 3 req/sec

    def configure(self, token: str, database_id: str) -> None:
        """Configure Notion with the official SDK"""
        self._client = Client(auth=token)
        self._database_id = database_id
        self.logger.info("Notion service configured with official SDK")

    @property
    def is_configured(self) -> bool:
        return bool(self._client and self._database_id)

    def _rate_limited_request(self, func, *args, **kwargs):
        """Execute request with rate limiting and retry"""
        max_retries = 3
        base_delay = 1

        for attempt in range(max_retries):
            try:
                self._rate_limiter.wait()
                return func(*args, **kwargs)
            except APIResponseError as e:
                if e.code == 'rate_limited':
                    delay = base_delay * (2 ** attempt)  # Exponential backoff
                    self.logger.warning(f"Rate limited, waiting {delay}s")
                    time.sleep(delay)
                else:
                    raise

        raise Exception("Max retries exceeded")

    def create_page(self, title: str, content: str, metadata: Dict[str, Any]) -> Optional[str]:
        """Create a new page in the Notion database"""
        if not self.is_configured:
            self.logger.warning("Notion not configured")
            return None

        try:
            # Build page properties
            properties = {
                "Name": {
                    "title": [
                        {
                            "text": {
                                "content": title
                            }
                        }
                    ]
                },
                "Status": {
                    "select": {
                        "name": "Procesado"
                    }
                },
                "Tipo": {
                    "select": {
                        "name": metadata.get('file_type', 'Desconocido')
                    }
                },
                "Fecha Procesamiento": {
                    "date": {
                        "start": metadata.get('processed_at', datetime.utcnow().isoformat())
                    }
                }
            }

            # Optional fields
            if metadata.get('duration'):
                properties["Duración (min)"] = {
                    "number": round(metadata['duration'] / 60, 2)
                }

            if metadata.get('page_count'):
                properties["Páginas"] = {
                    "number": metadata['page_count']
                }

            # Create the page
            page = self._rate_limited_request(
                self._client.pages.create,
                parent={"database_id": self._database_id},
                properties=properties
            )

            page_id = page['id']
            self.logger.info(f"Notion page created: {page_id}")

            # Append the content as blocks
            self._add_content_blocks(page_id, content)

            return page_id

        except Exception as e:
            self.logger.error(f"Error creating Notion page: {e}")
            return None

    def _add_content_blocks(self, page_id: str, content: str) -> bool:
        """Add content blocks to a Notion page"""
        try:
            # Split the content into sections
            sections = self._parse_markdown_to_blocks(content)

            # The Notion API caps each request at 100 blocks
            for i in range(0, len(sections), 100):
                batch = sections[i:i+100]
                self._rate_limited_request(
                    self._client.blocks.children.append,
                    block_id=page_id,
                    children=batch
                )

            return True

        except Exception as e:
            self.logger.error(f"Error adding content blocks: {e}")
            return False

    def _parse_markdown_to_blocks(self, markdown: str) -> List[Dict]:
        """Convert markdown to Notion blocks"""
        blocks = []
        lines = markdown.split('\n')

        for line in lines:
            line = line.strip()

            if not line:
                continue

            # Headings
            if line.startswith('# '):
                blocks.append({
                    "object": "block",
                    "type": "heading_1",
                    "heading_1": {
                        "rich_text": [{"type": "text", "text": {"content": line[2:]}}]
                    }
                })
            elif line.startswith('## '):
                blocks.append({
                    "object": "block",
                    "type": "heading_2",
                    "heading_2": {
                        "rich_text": [{"type": "text", "text": {"content": line[3:]}}]
                    }
                })
            elif line.startswith('### '):
                blocks.append({
                    "object": "block",
                    "type": "heading_3",
                    "heading_3": {
                        "rich_text": [{"type": "text", "text": {"content": line[4:]}}]
                    }
                })
            # Bullet points
            elif line.startswith('- ') or line.startswith('* '):
                blocks.append({
                    "object": "block",
                    "type": "bulleted_list_item",
                    "bulleted_list_item": {
                        "rich_text": [{"type": "text", "text": {"content": line[2:]}}]
                    }
                })
            # Paragraph
            else:
                # Notion caps rich_text at 2000 chars
                if len(line) > 2000:
                    chunks = [line[i:i+2000] for i in range(0, len(line), 2000)]
                    for chunk in chunks:
                        blocks.append({
                            "object": "block",
                            "type": "paragraph",
                            "paragraph": {
                                "rich_text": [{"type": "text", "text": {"content": chunk}}]
                            }
                        })
                else:
                    blocks.append({
                        "object": "block",
                        "type": "paragraph",
                        "paragraph": {
                            "rich_text": [{"type": "text", "text": {"content": line}}]
                        }
                    })

        return blocks

    def upload_file_to_page(self, page_id: str, file_path: Path, file_type: str = 'pdf') -> bool:
        """Attach a file to a Notion page as an external file"""
        if not file_path.exists():
            self.logger.error(f"File not found: {file_path}")
            return False

        try:
            # Notion does not support direct uploads; the file needs external hosting
            # Option 1: upload to Nextcloud and get a public share link
            # Option 2: use S3/MinIO
            # Option 3: use a dedicated hosting service

            # Assuming a public endpoint exists for the file
            file_url = self._get_public_url(file_path)

            if not file_url:
                self.logger.warning("Could not generate public URL for file")
                return False

            # Append as a file block
            self._rate_limited_request(
                self._client.blocks.children.append,
                block_id=page_id,
                children=[
                    {
                        "object": "block",
                        "type": "file",
                        "file": {
                            "type": "external",
                            "external": {
                                "url": file_url
                            }
                        }
                    }
                ]
            )

            return True

        except Exception as e:
            self.logger.error(f"Error uploading file to Notion: {e}")
            return False

    def _get_public_url(self, file_path: Path) -> Optional[str]:
        """Generate a public URL for a file (via Nextcloud or S3)"""
        # Implement according to your infrastructure
        # Option 1: Nextcloud share link
        from services.webdav_service import webdav_service

        # Upload to Nextcloud if it is not already there
        remote_path = f"/cbcfacil/{file_path.name}"
        webdav_service.upload_file(file_path, remote_path)

        # Generate a share link (requires an additional Nextcloud API call)
        # return webdav_service.create_share_link(remote_path)

        # Option 2: use your API's downloads endpoint
        # (settings comes from the project's config module)
        return f"{settings.PUBLIC_API_URL}/downloads/{file_path.name}"

    def update_page_status(self, page_id: str, status: str) -> bool:
        """Update page status"""
        try:
            self._rate_limited_request(
                self._client.pages.update,
                page_id=page_id,
                properties={
                    "Status": {
                        "select": {
                            "name": status
                        }
                    }
                }
            )
            return True
        except Exception as e:
            self.logger.error(f"Error updating page status: {e}")
            return False

    def search_pages(self, query: str) -> List[Dict]:
        """Search pages in the database"""
        try:
            results = self._rate_limited_request(
                self._client.databases.query,
                database_id=self._database_id,
                filter={
                    "property": "Name",
                    "title": {
                        "contains": query
                    }
                }
            )
            return results.get('results', [])
        except Exception as e:
            self.logger.error(f"Error searching Notion pages: {e}")
            return []

    def get_page_content(self, page_id: str) -> Optional[str]:
        """Get page content as markdown"""
        try:
            blocks = self._rate_limited_request(
                self._client.blocks.children.list,
                block_id=page_id
            )

            markdown = self._blocks_to_markdown(blocks.get('results', []))
            return markdown

        except Exception as e:
            self.logger.error(f"Error getting page content: {e}")
            return None

    def _blocks_to_markdown(self, blocks: List[Dict]) -> str:
        """Convert Notion blocks to markdown"""
        markdown_lines = []

        for block in blocks:
            block_type = block.get('type')

            if block_type == 'heading_1':
                text = self._extract_text(block['heading_1'])
                markdown_lines.append(f"# {text}")
            elif block_type == 'heading_2':
                text = self._extract_text(block['heading_2'])
                markdown_lines.append(f"## {text}")
            elif block_type == 'heading_3':
                text = self._extract_text(block['heading_3'])
                markdown_lines.append(f"### {text}")
            elif block_type == 'bulleted_list_item':
                text = self._extract_text(block['bulleted_list_item'])
                markdown_lines.append(f"- {text}")
            elif block_type == 'paragraph':
                text = self._extract_text(block['paragraph'])
                markdown_lines.append(text)

        return '\n\n'.join(markdown_lines)

    def _extract_text(self, block_data: Dict) -> str:
        """Extract text from Notion rich_text"""
        rich_texts = block_data.get('rich_text', [])
        return ''.join(rt.get('text', {}).get('content', '') for rt in rich_texts)


# Rate limiter helper
class RateLimiter:
    def __init__(self, max_requests: int, time_window: float):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = []

    def wait(self):
        """Wait if rate limit is reached"""
        now = time.time()

        # Remove old requests
        self.requests = [r for r in self.requests if now - r < self.time_window]

        # Wait if limit reached
        if len(self.requests) >= self.max_requests:
            sleep_time = self.time_window - (now - self.requests[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
            self.requests = []

        self.requests.append(now)


# Global instance
notion_service = NotionService()
```

---

#### 2. Bidirectional Synchronization

**Implement webhooks to receive changes from Notion:**

```python
# api/webhooks.py (new file)
from flask import Blueprint, request, jsonify
from services.notion_service import notion_service
from tasks.sync import sync_notion_changes

webhooks_bp = Blueprint('webhooks', __name__)

@webhooks_bp.route('/webhooks/notion', methods=['POST'])
def notion_webhook():
    """Handle Notion webhook events"""
    # Verify signature (if Notion supports it)
    # signature = request.headers.get('X-Notion-Signature')
    # if not verify_signature(request.data, signature):
    #     abort(403)

    data = request.json

    # Process the event
    event_type = data.get('type')

    if event_type == 'page.updated':
        page_id = data.get('page_id')
        # Queue a task to sync the changes
        sync_notion_changes.delay(page_id)

    return jsonify({'status': 'ok'}), 200

# tasks/sync.py (new file)
import logging
from datetime import datetime

from celery_app import celery_app
from services.notion_service import notion_service
from models.database import ProcessedFile, get_db

@celery_app.task
def sync_notion_changes(page_id: str):
    """Sync changes from Notion back to the local database"""
    logger = logging.getLogger(__name__)

    try:
        # Fetch the updated content from Notion
        content = notion_service.get_page_content(page_id)

        if not content:
            logger.error(f"Could not fetch Notion page: {page_id}")
            return

        # Look up the local record
        with get_db() as db:
            file_record = db.query(ProcessedFile).filter_by(
                notion_page_id=page_id
            ).first()

            if file_record:
                file_record.summary_text = content
                file_record.updated_at = datetime.utcnow()
                db.commit()
                logger.info(f"Synced changes from Notion for {file_record.filename}")
            else:
                logger.warning(f"No local record found for Notion page {page_id}")

    except Exception as e:
        logger.error(f"Error syncing Notion changes: {e}")
```

**Setting up the webhook in Notion:**

```python
# Note: Notion currently has no native webhooks.
# Alternatives:
# 1. Periodic polling (every 5 min)
# 2. Third-party services such as Zapier/Make
# 3. Polling scheduled with Celery beat

# tasks/sync.py
@celery_app.task
def poll_notion_changes():
    """Poll Notion for changes (scheduled task)"""
    # Query for recently modified pages
    # ...
```
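
The polling task above is only stubbed out. A minimal sketch of the query it would run, with the client passed in so the logic can be exercised without network access (the helper name and pagination handling are illustrative, not part of the existing codebase); it uses the Notion API's built-in `last_edited_time` timestamp filter:

```python
from typing import Any, Dict, List

def find_recently_edited(client: Any, database_id: str, since_iso: str) -> List[Dict]:
    """Return every page in the database edited after `since_iso`, following pagination."""
    pages: List[Dict] = []
    cursor = None
    while True:
        kwargs: Dict[str, Any] = {
            "database_id": database_id,
            # Notion's timestamp filter on last_edited_time
            "filter": {
                "timestamp": "last_edited_time",
                "last_edited_time": {"after": since_iso},
            },
        }
        if cursor:
            kwargs["start_cursor"] = cursor
        response = client.databases.query(**kwargs)
        pages.extend(response.get("results", []))
        if not response.get("has_more"):
            return pages
        cursor = response.get("next_cursor")
```

Each returned page id would then be handed to `sync_notion_changes.delay(page_id)`, with the `since` watermark persisted between runs.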

---

#### 3. Complete Notion Integration Pipeline

**Flow diagram:**

```
┌─────────────────────────────────────────────┐
│              CBCFacil Pipeline              │
└─────────────────────────────────────────────┘
                      │
                      ▼
        ┌─────────────────────────────┐
        │ 1. File detected in         │
        │    Nextcloud                │
        └─────────────────────────────┘
                      │
                      ▼
        ┌─────────────────────────────┐
        │ 2. Process (Audio/PDF)      │
        │    - Transcription          │
        │    - OCR                    │
        └─────────────────────────────┘
                      │
                      ▼
        ┌─────────────────────────────┐
        │ 3. Generate AI summary      │
        │    - Claude/Gemini          │
        │    - Formatting             │
        └─────────────────────────────┘
                      │
                      ▼
        ┌─────────────────────────────┐
        │ 4. Create documents         │
        │    - Markdown               │
        │    - DOCX                   │
        │    - PDF                    │
        └─────────────────────────────┘
                      │
          ┌───────────┴───────────┐
          ▼                       ▼
┌───────────────────┐   ┌───────────────────┐
│ 5a. Upload to     │   │ 5b. Store in      │
│     Notion        │   │     database      │
│  - Create page    │   │  - PostgreSQL     │
│  - Append content │   │  - Metadata       │
│  - Attach PDF     │   │  - notion_page_id │
└───────────────────┘   └───────────────────┘
          │                       │
          └───────────┬───────────┘
                      ▼
        ┌─────────────────────────────┐
        │ 6. Notify                   │
        │    - Telegram               │
        │    - Email (optional)       │
        │    - WebSocket (dashboard)  │
        └─────────────────────────────┘
```

**Implementation:**

```python
# document/generators.py (improved)
def generate_summary(self, text: str, base_name: str, file_metadata: Dict[str, Any]) -> Tuple[bool, str, Dict[str, Any]]:
    """Generate summary with full Notion integration"""

    try:
        # Steps 1-4: existing logic producing summary, markdown_path,
        # docx_path and pdf_path
        # ...

        # Step 5: upload to Notion with rich metadata
        notion_page_id = None
        if settings.has_notion_config:
            try:
                title = base_name.replace('_', ' ').title()

                # Build enriched metadata
                notion_metadata = {
                    'file_type': file_metadata.get('file_type', 'Desconocido'),
                    'processed_at': datetime.utcnow().isoformat(),
                    'duration': file_metadata.get('duration'),
                    'page_count': file_metadata.get('page_count'),
                    'file_size': file_metadata.get('file_size'),
                }

                # Create the Notion page
                notion_page_id = notion_service.create_page(
                    title=title,
                    content=summary,
                    metadata=notion_metadata
                )

                if notion_page_id:
                    self.logger.info(f"Notion page created: {notion_page_id}")

                    # Upload PDF to Notion page
                    notion_service.upload_file_to_page(
                        page_id=notion_page_id,
                        file_path=pdf_path,
                        file_type='pdf'
                    )

            except Exception as e:
                self.logger.warning(f"Notion integration failed: {e}")

        # Update response metadata
        metadata = {
            'markdown_path': str(markdown_path),
            'docx_path': str(docx_path),
            'pdf_path': str(pdf_path),
            'summary': summary,
            'notion_page_id': notion_page_id,
            'notion_uploaded': bool(notion_page_id),
        }

        return True, summary, metadata

    except Exception as e:
        self.logger.error(f"Document generation failed: {e}")
        return False, "", {}
```

---

#### 4. Notion Database Configuration

**Recommended schema for the Notion database:**

| Property | Type | Description |
|----------|------|-------------|
| **Name** | Title | Document name |
| **Status** | Select | Procesado / En Revisión / Aprobado |
| **Tipo** | Select | Audio / PDF / Texto |
| **Fecha Procesamiento** | Date | When it was processed |
| **Duración (min)** | Number | For audio files |
| **Páginas** | Number | For PDFs |
| **Tamaño (MB)** | Number | File size |
| **Calidad** | Select | Alta / Media / Baja |
| **Categoría** | Multi-select | Tags/categories |
| **Archivo Original** | Files & Media | Link to the file |
| **Resumen PDF** | Files & Media | Generated PDF |

**Script to create the database:**

```python
# scripts/setup_notion_database.py (new file)
from notion_client import Client

def create_cbcfacil_database(token: str, parent_page_id: str):
    """Create the Notion database for CBCFacil"""
    client = Client(auth=token)

    database = client.databases.create(
        parent={"type": "page_id", "page_id": parent_page_id},
        title=[
            {
                "type": "text",
                "text": {"content": "CBCFacil - Documentos Procesados"}
            }
        ],
        properties={
            "Name": {
                "title": {}
            },
            "Status": {
                "select": {
                    "options": [
                        {"name": "Procesado", "color": "green"},
                        {"name": "En Revisión", "color": "yellow"},
                        {"name": "Aprobado", "color": "blue"},
                        {"name": "Error", "color": "red"},
                    ]
                }
            },
            "Tipo": {
                "select": {
                    "options": [
                        {"name": "Audio", "color": "purple"},
                        {"name": "PDF", "color": "orange"},
                        {"name": "Texto", "color": "gray"},
                    ]
                }
            },
            "Fecha Procesamiento": {
                "date": {}
            },
            "Duración (min)": {
                "number": {
                    "format": "number_with_commas"
                }
            },
            "Páginas": {
                "number": {}
            },
            "Tamaño (MB)": {
                "number": {
                    "format": "number_with_commas"
                }
            },
            "Calidad": {
                "select": {
                    "options": [
                        {"name": "Alta", "color": "green"},
                        {"name": "Media", "color": "yellow"},
                        {"name": "Baja", "color": "red"},
                    ]
                }
            },
            "Categoría": {
                "multi_select": {
                    "options": [
                        {"name": "Historia", "color": "blue"},
                        {"name": "Ciencia", "color": "green"},
                        {"name": "Literatura", "color": "purple"},
                        {"name": "Política", "color": "red"},
                    ]
                }
            },
        }
    )

    print(f"Database created: {database['id']}")
    print(f"Add this to your .env: NOTION_DATABASE_ID={database['id']}")

    return database['id']

if __name__ == '__main__':
    token = input("Enter your Notion API token: ")
    parent_page_id = input("Enter the parent page ID: ")

    create_cbcfacil_database(token, parent_page_id)
```

**Run:**
```bash
python scripts/setup_notion_database.py
```

---

#### 5. Advanced Notion Features

**Auto-categorization with AI:**

```python
# services/notion_service.py
def auto_categorize(self, summary: str) -> List[str]:
    """Auto-categorize content using AI"""
    from services.ai import ai_provider_factory

    ai = ai_provider_factory.get_best_provider()

    # The prompt stays in Spanish so the reply matches the Spanish
    # category names defined in the Notion database
    prompt = f"""Analiza el siguiente resumen y asigna 1-3 categorías principales de esta lista:
- Historia
- Ciencia
- Literatura
- Política
- Economía
- Tecnología
- Filosofía
- Arte
- Deporte

Resumen: {summary[:500]}

Devuelve solo las categorías separadas por comas."""

    categories_str = ai.generate_text(prompt)
    categories = [c.strip() for c in categories_str.split(',')]

    return categories[:3]

def create_page(self, title: str, content: str, metadata: Dict[str, Any]):
    # ...

    # Auto-categorize
    categories = self.auto_categorize(content)

    properties["Categoría"] = {
        "multi_select": [{"name": cat} for cat in categories]
    }

    # ...
```

**Quality assessment:**

```python
def assess_quality(self, transcription: str, summary: str) -> str:
    """Assess document quality based on metrics"""

    # Criteria:
    # - Summary length (600+ words = Alta)
    # - Coherence (evaluated with AI)
    # - Presence of key data (dates, names)

    word_count = len(summary.split())

    if word_count < 300:
        return "Baja"
    elif word_count < 600:
        return "Media"
    else:
        return "Alta"
```

---

## ✅ TESTING PLAN

### Test Structure

```
tests/
├── unit/
│   ├── test_settings.py
│   ├── test_validators.py
│   ├── test_webdav_service.py
│   ├── test_vram_manager.py
│   ├── test_ai_service.py
│   ├── test_notion_service.py
│   ├── test_audio_processor.py
│   ├── test_pdf_processor.py
│   ├── test_document_generator.py
│   └── test_processed_registry.py
├── integration/
│   ├── test_audio_pipeline.py
│   ├── test_pdf_pipeline.py
│   ├── test_notion_integration.py
│   └── test_api_endpoints.py
├── e2e/
│   └── test_full_workflow.py
├── conftest.py
└── fixtures/
    ├── sample_audio.mp3
    ├── sample_pdf.pdf
    └── mock_responses.json
```

### Example Tests

```python
# tests/unit/test_notion_service.py
import pytest
from unittest.mock import Mock
from services.notion_service import NotionService

@pytest.fixture
def notion_service():
    service = NotionService()
    service.configure(token="test_token", database_id="test_db")
    return service

def test_notion_service_configuration(notion_service):
    assert notion_service.is_configured
    assert notion_service._database_id == "test_db"

def test_create_page_success(notion_service):
    # Replace the SDK client with a mock so no network call is made
    mock_client = Mock()
    mock_client.pages.create.return_value = {'id': 'page_123'}
    notion_service._client = mock_client

    page_id = notion_service.create_page(
        title="Test Page",
        content="# Test Content",
        metadata={'file_type': 'pdf'}
    )

    assert page_id == 'page_123'

def test_rate_limiter():
    from services.notion_service import RateLimiter
    import time

    limiter = RateLimiter(max_requests=3, time_window=1.0)

    # Should allow 3 requests immediately
    start = time.time()
    for _ in range(3):
        limiter.wait()
    elapsed = time.time() - start
    assert elapsed < 0.1

    # 4th request should wait
    start = time.time()
    limiter.wait()
    elapsed = time.time() - start
    assert elapsed >= 0.9

# tests/integration/test_notion_integration.py
@pytest.mark.integration
def test_full_notion_workflow(tmpdir):
    """Test complete workflow: process file -> create Notion page"""
    # Setup
    audio_file = tmpdir / "test_audio.mp3"
    # ... create test file

    # Process audio
    from processors.audio_processor import audio_processor
    result = audio_processor.process(audio_file)

    # Generate summary
    from document.generators import DocumentGenerator
    generator = DocumentGenerator()
    success, summary, metadata = generator.generate_summary(
        result.data['text'],
        'test_audio',
        {'file_type': 'Audio'}
    )

    assert success
    assert metadata.get('notion_page_id')

    # Verify the Notion page exists
    from services.notion_service import notion_service
    content = notion_service.get_page_content(metadata['notion_page_id'])
    assert content is not None
```

### Coverage Goal

```bash
# Run tests with coverage
pytest --cov=. --cov-report=html --cov-report=term

# Goal: 80% overall coverage
# - Unit tests: 90% coverage
# - Integration tests: 70% coverage
# - E2E tests: 60% coverage
```

---

## 📅 IMPLEMENTATION ROADMAP

### Sprint 1: Security and Critical Fixes (2 weeks)

**Week 1:**
- [ ] Rotate the Notion API token
- [ ] Fix path traversal vulnerability
- [ ] Fix SECRET_KEY generation
- [ ] Move imports to module level
- [ ] Implement API authentication (JWT)

**Week 2:**
- [ ] Configure restrictive CORS
- [ ] Add rate limiting (flask-limiter)
- [ ] Implement CSP headers
- [ ] Complete input sanitization
- [ ] Filter sensitive info out of logs

**Deliverables:**
- System with baseline security
- Critical vulnerabilities resolved
- Working authentication
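
To make the JWT item concrete without pinning the sprint to a framework, a minimal sketch of the signing scheme JWT is built on, using only the standard library (the claim names, lifetime, and key handling are illustrative; the real implementation would use flask-jwt-extended as listed in the requirements):

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"change-me"  # illustrative; load from settings in real code

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(user: str, ttl_seconds: int = 3600) -> str:
    """Sign a payload with HMAC-SHA256, JWT-style (header omitted for brevity)."""
    payload = _b64(json.dumps({"sub": user, "exp": time.time() + ttl_seconds}).encode())
    sig = _b64(hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).digest())
    return f"{payload}.{sig}"

def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature is valid and unexpired, else None."""
    try:
        payload, sig = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = _b64(hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        return None
    return claims
```

The constant-time `hmac.compare_digest` check is the part homegrown auth code most often gets wrong, which is one reason to prefer the library in production.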

---

### Sprint 2: Testing and Performance (2 weeks)

**Week 1:**
- [ ] Set up testing infrastructure
- [ ] Unit tests for services (50% coverage)
- [ ] Integration tests for the pipelines
- [ ] CI/CD with GitHub Actions

**Week 2:**
- [ ] Implement Celery + Redis
- [ ] Queue system for processing
- [ ] Distributed cache with Redis
- [ ] WebSockets for real-time updates

**Deliverables:**
- 50% code coverage
- Working asynchronous processing
- Real-time dashboard updates

---

### Sprint 3: Advanced Notion Integration (2 weeks)

**Week 1:**
- [ ] Migrate to the official notion-client
- [ ] Implement rate limiting for Notion
- [ ] Markdown to Notion blocks parser
- [ ] AI auto-categorization

**Week 2:**
- [ ] Bidirectional synchronization system
- [ ] Webhooks/polling for changes
- [ ] File hosting for attachments
- [ ] Notion metrics dashboard

**Deliverables:**
- Robust Notion integration
- Bidirectional synchronization
- Working auto-categorization

---

### Sprint 4: Database and Scalability (2 weeks)

**Week 1:**
- [ ] Set up PostgreSQL
- [ ] Schema design and migrations (Alembic)
- [ ] Migrate from processed_files.txt
- [ ] Implement the repository pattern
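
To illustrate the repository pattern the migration targets, a minimal sketch backed by SQLite in place of PostgreSQL (the table and column names are illustrative, not the final schema):

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessedFile:
    filename: str
    status: str
    notion_page_id: Optional[str] = None

class ProcessedFileRepository:
    """Hides SQL behind domain-level operations, so callers never touch the driver."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_files ("
            "filename TEXT PRIMARY KEY, status TEXT, notion_page_id TEXT)"
        )

    def add(self, record: ProcessedFile) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO processed_files VALUES (?, ?, ?)",
            (record.filename, record.status, record.notion_page_id),
        )

    def get(self, filename: str) -> Optional[ProcessedFile]:
        row = self.conn.execute(
            "SELECT filename, status, notion_page_id FROM processed_files "
            "WHERE filename = ?",
            (filename,),
        ).fetchone()
        return ProcessedFile(*row) if row else None

    def is_processed(self, filename: str) -> bool:
        return self.get(filename) is not None
```

The same interface can later be backed by SQLAlchemy sessions against PostgreSQL without touching callers, and migrating `processed_files.txt` becomes one loop of `repo.add(...)` over its lines.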

**Week 2:**
- [ ] Advanced health checks
- [ ] Prometheus metrics exporter
- [ ] Log rotation
- [ ] Error tracking (Sentry)

**Deliverables:**
- Production-ready database
- Full observability
- Scalable system

---

### Sprint 5: Frontend Modernization (3 weeks)

**Week 1:**
- [ ] Set up the React app
- [ ] Componentize the UI
- [ ] State management (Redux/Zustand)

**Week 2:**
- [ ] WebSocket integration
- [ ] Real-time updates
- [ ] File upload with progress

**Week 3:**
- [ ] Frontend testing (Jest)
- [ ] Responsive design
- [ ] Production deployment

**Deliverables:**
- Modern, maintainable frontend
- Improved UX
- Frontend tests

---

### Sprint 6: Advanced Features (2 weeks)

**Week 1:**
- [ ] i18n (internationalization)
- [ ] Plugin system
- [ ] Video processor (new)
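
The plugin system item can be grounded with a small registry sketch: processors self-register by file extension, so the planned video processor plugs in without touching the dispatcher (all names here are illustrative, not existing codebase APIs):

```python
from pathlib import Path
from typing import Callable, Dict

# Maps file extensions to processor callables
_PROCESSORS: Dict[str, Callable[[Path], str]] = {}

def register_processor(*extensions: str):
    """Decorator: register a processor for one or more file extensions."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        for ext in extensions:
            _PROCESSORS[ext.lower()] = func
        return func
    return decorator

def process(path: Path) -> str:
    """Dispatch to the processor registered for the file's extension."""
    handler = _PROCESSORS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"No processor registered for {path.suffix}")
    return handler(path)

@register_processor(".mp3", ".wav")
def process_audio(path: Path) -> str:
    return f"audio:{path.name}"

@register_processor(".mp4")
def process_video(path: Path) -> str:
    return f"video:{path.name}"
```

New file types then become a decorated function in a new module, with no changes to the watcher or queue code.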
|
|
|
|
**Semana 2:**
|
|
- [ ] Editor de prompts customizable
|
|
- [ ] Historial de versiones avanzado
|
|
- [ ] Reportes y analytics
|
|
|
|
**Entregables:**
|
|
- Sistema extensible
|
|
- Features premium
|
|
- Analytics dashboard
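One way the plugin system and the new video processor could fit together is a registry keyed by file extension, so new processors plug in without touching the dispatcher. This is a hypothetical sketch; the registry, decorator, and processor names are illustrative only.

```python
from typing import Callable, Dict

# Hypothetical plugin registry: processors register themselves
# under the file extension they can handle.
PROCESSORS: Dict[str, Callable[[str], str]] = {}

def register_processor(extension: str):
    """Decorator that registers a processor for a file extension."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        PROCESSORS[extension] = func
        return func
    return decorator

@register_processor(".mp4")
def process_video(path: str) -> str:
    # A real implementation would extract audio and transcribe it.
    return f"video summary for {path}"

def process(path: str) -> str:
    """Dispatch a file to whichever plugin claims its extension."""
    for ext, handler in PROCESSORS.items():
        if path.endswith(ext):
            return handler(path)
    raise ValueError(f"no processor registered for {path}")
```

Adding support for a new format then means writing one decorated function, with no changes to the core pipeline.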
---

## 🎯 SUCCESS METRICS

### Sprint 1-2 KPIs
- ✅ 0 critical vulnerabilities
- ✅ 50% code coverage
- ✅ 100% of endpoints behind authentication
- ✅ < 100 ms API response time

### Sprint 3-4 KPIs
- ✅ 95% uptime
- ✅ 80% code coverage
- ✅ < 5 min processing time (1 h of audio)
- ✅ 100% Notion synchronization success rate

### Sprint 5-6 KPIs
- ✅ < 2 s frontend load time
- ✅ 90% user satisfaction
- ✅ Support for 5+ languages
- ✅ 100+ files processed/day without degradation
---

## 📚 RESOURCES AND DOCUMENTATION

### Libraries to Add

```txt
# requirements.txt (additions)

# Security
PyJWT>=2.8.0
flask-jwt-extended>=4.5.3
flask-limiter>=3.5.0
werkzeug>=3.0.0

# Queue & Cache
celery>=5.3.4
redis>=5.0.0
hiredis>=2.2.3

# Database
psycopg2-binary>=2.9.9
sqlalchemy>=2.0.23
alembic>=1.13.0

# Notion
notion-client>=2.2.1

# WebSockets
flask-socketio>=5.3.5
python-socketio>=5.10.0
eventlet>=0.33.3

# Monitoring
prometheus-client>=0.19.0
sentry-sdk>=1.39.1

# Testing
pytest>=7.4.3
pytest-cov>=4.1.0
pytest-asyncio>=0.21.1
pytest-mock>=3.12.0
faker>=22.0.0

# Type checking
mypy>=1.7.1
types-requests>=2.31.0
```

### Useful Scripts

```bash
#!/bin/bash
# scripts/deploy.sh
set -e

echo "Deploying CBCFacil..."

# Pull latest code
git pull origin main

# Activate venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run migrations
alembic upgrade head

# Restart services
sudo systemctl restart cbcfacil
sudo systemctl restart cbcfacil-worker
sudo systemctl restart nginx

echo "Deployment complete!"
```
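The deploy script assumes systemd units exist for `cbcfacil` and `cbcfacil-worker`. A minimal unit for the main service could look like the fragment below; the paths, user, and entry point are illustrative assumptions, not the project's actual deployment layout.

```ini
# /etc/systemd/system/cbcfacil.service  (illustrative; adjust paths and user)
[Unit]
Description=CBCFacil web service
After=network.target

[Service]
User=cbcfacil
WorkingDirectory=/opt/cbcfacil
ExecStart=/opt/cbcfacil/.venv/bin/python main.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After editing the unit, run `sudo systemctl daemon-reload` once so systemd picks up the change before restarting.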
---

## 🏁 CONCLUSION

This document provides a complete roadmap for taking CBCFacil from a working prototype to a production-ready, enterprise-grade system.

### Immediate Next Steps

1. **DAY 1:** Rotate the Notion API token; fix critical vulnerabilities
2. **WEEK 1:** Implement authentication and rate limiting
3. **WEEK 2:** Set up the testing infrastructure
4. **MONTH 1:** Complete Sprints 1-2

### Implementation Priority

```
CRITICAL (now):
├── Basic security
├── Bug fixes
└── Fundamental tests

HIGH (2-4 weeks):
├── Performance (Celery + Redis)
├── Advanced Notion integration
└── Database migration

MEDIUM (1-2 months):
├── Frontend modernization
├── Full observability
└── Advanced features
```

**Expected End State:** A production-ready system with 80%+ coverage, robust security, advanced Notion integration, and a scalable architecture.

---

*Document generated on January 26, 2026*
*Version: 1.0*
*Author: CBCFacil Development Team*