feat: Docker para CBC con acceso limitado

- Dockerfile minimal con Python 3.11 - docker-compose.yml con volumenes controlados - .dockerignore para build eficiente - Script de inicio start.sh - .env.example para configuración Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: Mejora parser de tablas LaTeX
2026-02-25 17:44:09 +00:00 · 2026-02-25 17:32:18 +00:00 · 2026-02-25 17:28:04 +00:00 · 2026-02-25 17:12:00 +00:00
11 changed files with 332 additions and 11 deletions
--- a/core/process_manager.py
+++ b/core/process_manager.py
@@ -504,7 +504,7 @@ class ProcessManager:
        # Notificar resumen completado
        telegram_service.send_summary_complete(filepath.name, has_markdown=True)
-        # 4. Llamar a PDFGenerator.markdown_to_pdf()
+        # 4. Llam.markdown_to_pdfar a PDFGenerator()
        pdf_path = None
        try:
            from services.pdf_generator import PDFGenerator
@@ -514,7 +514,9 @@ class ProcessManager:
            pdf_generator = PDFGenerator()
            pdf_path = md_path.with_suffix(".pdf")
-            pdf_generator.markdown_to_pdf(str(md_path), str(pdf_path))
+            # Leer el contenido markdown y pasarlo al generator
            markdown_content = md_path.read_text(encoding="utf-8")
            pdf_generator.markdown_to_pdf(markdown_content, pdf_path)
            logger.info(
                "PDF generado",
--- a/docker/.dockerignore
+++ b/docker/.dockerignore
@@ -0,0 +1,16 @@
 # Excluir archivos innecesarios del build
 .git
 .gitignore
 __pycache__
 *.pyc
 .venv
 *.log
 downloads/
 transcriptions/
 .env
 .env.*
 !.env.example
 *.md
 !docker/README.md
 node_modules/
 .DS_Store
--- a/docker/.env.example
+++ b/docker/.env.example
@@ -0,0 +1,14 @@
 # CBCFacil - Configuración Docker
 # Copiar a .env y completar con tus credenciales
 # API Keys
 ANTHROPIC_AUTH_TOKEN=tu_token_aqui
 # Nextcloud
 NEXTCLOUD_URL=https://nextcloud.tudominio.com/remote.php/webdav
 NEXTCLOUD_USER=tu_usuario
 NEXTCLOUD_PASSWORD=tu_password
 # Telegram
 TELEGRAM_TOKEN=tu_token_bot
 TELEGRAM_CHAT_ID=tu_chat_id
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -0,0 +1,35 @@
 # CBC OpenClaw - Imagen Docker minimal
 # Solo acceso a /app y las tools necesarias
 FROM python:3.11-slim
 # Instalar dependencias del sistema
 RUN apt-get update && apt-get install -y \
    git \
    curl \
    wget \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
 # Crear usuario no-root para seguridad
 RUN useradd -m -s /bin/bash cbc && \
    mkdir -p /home/cbc && \
    chown -R cbc:cbc /home/cbc
 # Definir workspace
 WORKDIR /app
 # Copiar solo archivos necesarios del proyecto
 COPY --chown=cbc:cbc . .
 # Cambiar a usuario no-root
 USER cbc
 # Variables de entorno para el agente
 ENV ANTHROPIC_API_KEY=""
 ENV ANTHROPIC_BASE_URL="https://api.minimax.io/anthropic"
 ENV ANTHROPIC_MODEL="MiniMax-M2.5"
 ENV HOME=/app
 # El agente solo puede acceder a /app y sus subdirectorios
 # No tiene acceso a Internet directo (solo a través de variables de entorno)
--- a/docker/README.md
+++ b/docker/README.md
@@ -0,0 +1,10 @@
 # CBC OpenClaw - Dockerizado con acceso limitado
 ## Estructura
 ```
 docker/
 ├── Dockerfile
 ├── docker-compose.yml
 ├── .dockerignore
 └── README.md
--- a/docker/docker-compose.yml
+++ b/docker/docker-compose.yml
@@ -0,0 +1,34 @@
 version: '3.8'
 services:
  cbc-openclaw:
    build:
      context: ..
      dockerfile: docker/Dockerfile
    container_name: cbc-openclaw
    volumes:
      # Solo montar las carpetas necesarias
      - ../:/app
      # Montar credenciales desde variables de entorno o archivo seguro
      - ~/.env:/app/.env:ro
    environment:
      # API Keys - pasar desde host
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - ANTHROPIC_AUTH_TOKEN=${ANTHROPIC_AUTH_TOKEN}
      - ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
      - ANTHROPIC_MODEL=MiniMax-M2.5
      # Configuración CBC
      - NEXTCLOUD_URL=${NEXTCLOUD_URL}
      - NEXTCLOUD_USER=${NEXTCLOUD_USER}
      - NEXTCLOUD_PASSWORD=${NEXTCLOUD_PASSWORD}
      - TELEGRAM_TOKEN=${TELEGRAM_TOKEN}
      - TELEGRAM_CHAT_ID=${TELEGRAM_CHAT_ID}
    working_dir: /app
    command: ["python3", "main.py"]
    restart: unless-stopped
    networks:
      - cbc-network
 networks:
  cbc-network:
    driver: bridge
--- a/docker/start.sh
+++ b/docker/start.sh
@@ -0,0 +1,20 @@
 #!/bin/bash
 # CBC OpenClaw - Script de inicio Docker
 set -e
 # Cargar variables de entorno si existe .env
 if [ -f ".env" ]; then
    export $(cat .env | grep -v '^#' | xargs)
 fi
 echo "🟢 Iniciando CBC OpenClaw..."
 # Construir imagen si no existe
 docker compose -f docker/docker-compose.yml build
 # Iniciar contenedor
 docker compose -f docker/docker-compose.yml up -d
 echo "✅ CBC OpenClaw corriendo en http://localhost:5000"
 echo "📝 Ver logs: docker compose -f docker/docker-compose.yml logs -f"
--- a/main.py
+++ b/main.py
@@ -5,10 +5,15 @@ CBFacil - Sistema de transcripción de audio con IA y Notion
 Características:
 - Polling de Nextcloud vía WebDAV
 - Transcripción con Whisper (medium, GPU)
- Resúmenes con IA (GLM-4.7)
+- Resúmenes con IA (MiniMax)
 - Generación de PDF
 - Notificaciones Telegram
 """
 from dotenv import load_dotenv
 # Cargar variables de entorno desde .env
 load_dotenv()
 import logging
 import os
 import sys
@@ -321,6 +326,13 @@ class PollingService:
    def _on_file_downloaded(self, file_path: Path) -> None:
        """Callback when a file is downloaded."""
        # Verificar si ya fue procesado (existe transcripción con nombre exacto)
        transcriptions_dir = settings.TRANSCRIPTIONS_DIR
        txt_path = transcriptions_dir / f"{file_path.stem}.txt"
        if txt_path.exists():
            logger.info(f"Skipping already processed file: {file_path.name}")
            return
        self.queue_file_for_processing(file_path)
    def queue_file_for_processing(self, file_path: Path) -> None:
@@ -414,10 +426,18 @@ class PollingService:
        # Extensiones de audio soportadas
        audio_extensions = {".mp3", ".wav", ".m4a", ".mp4", ".webm", ".ogg", ".flac"}
-        pending_files = [
+        # Obtener transcripciones existentes - verificar por nombre EXACTO
-            f for f in downloads_dir.iterdir()
+        transcriptions_dir = settings.TRANSCRIPTIONS_DIR
-            if f.is_file() and f.suffix.lower() in audio_extensions and not f.name.startswith(".")
+
-        ]
+        # Filtrar solo archivos que NO han sido procesados
        pending_files = []
        for f in downloads_dir.iterdir():
            if f.is_file() and f.suffix.lower() in audio_extensions and not f.name.startswith("."):
                # Verificar si ya existe transcripción con el MISMO nombre
                txt_path = transcriptions_dir / f"{f.stem}.txt"
                if not txt_path.exists():
                    # No existe .txt, agregar a pendientes
                    pending_files.append(f)
        if not pending_files:
            logger.debug("No pending audio files to process")
--- a/services/ai_summary_service.py
+++ b/services/ai_summary_service.py
@@ -69,8 +69,39 @@ class AISummaryService:
            logger.debug("AISummaryService not configured, returning original text")
            return text
-        default_prompt = "Resume el siguiente texto de manera clara y concisa:"
+        # Prompt siguiendo código.md - resumen académico en español
-        prompt = prompt_template.format(text=text) if prompt_template else f"{default_prompt}\n\n{text}"
+        default_prompt = """Eres un asistente académico especializado en crear resúmenes de estudio de alta calidad.
 INSTRUCCIONES OBLIGATORIAS:
 1. Escribe ÚNICAMENTE en español
 2. El resumen debe seguir esta estructura:
   - Título y objetivo de estudio
   - Índice con 6-12 secciones
   - Desarrollo conceptual (definiciones, mecanismos)
   - Casos de aplicación (ejemplos concretos)
   - Errores frecuentes
   - Checklist de repaso
 3. Cada concepto debe explicar: qué es, por qué importa, cómo se aplica
 4. Evita listas sin explicación - siempre incluir el "por qué"
 5. Para TABLAS usa formato LaTeX tabular:
   \\begin{{tabular}}{{|c|l|l|}}
   \\hline
   Encabezado 1 & Encabezado 2 & Encabezado 3 \\\\
   \\hline
   dato1 & dato2 & dato3 \\\\
   \\hline
   \\end{{tabular}}
 6. NO uses tablas ASCII ni markdown con | pipes
 7. El resumen debe poder leerse en 15-25 minutos
 8. NO incluyas rutas de archivos ni referencias técnicas
 9. Sé conciso pero con densidad informativa útil para exámenes
 Transcripción de clase:
 {text}
 Genera el resumen siguiendo las instrucciones arriba."""
        prompt = prompt_template.format(text=text) if prompt_template else default_prompt.format(text=text)
        payload = {
            "model": self.model,
@@ -96,6 +127,29 @@ class AISummaryService:
            result = response.json()
            summary = result.get("choices", [{}])[0].get("message", {}).get("content", "")
            # Limpiar respuesta: eliminar thinking tokens y ruido
            # Buscar el primer encabezado markdown y cortar ahí
            first_header = summary.find("\n# ")
            if first_header == -1:
                first_header = summary.find("# ")
            if first_header > 0:
                summary = summary[first_header:]
            # Eliminar bloques de think/error si persisten
            lines = summary.split("\n")
            clean_lines = []
            skip = False
            for line in lines:
                if line.strip().startswith("<think>") or line.strip().endswith("</think>"):
                    skip = True
                    continue
                if skip and line.strip() and not line.startswith(" "):
                    skip = False
                if not skip:
                    clean_lines.append(line)
            summary = "\n".join(clean_lines)
            logger.info("Summarization completed successfully (output length: %d)", len(summary))
            return summary
--- a/services/pdf_generator.py
+++ b/services/pdf_generator.py
@@ -5,13 +5,13 @@ Utiliza reportlab para la generación de PDFs con soporte UTF-8.
 """
 import logging
 from pathlib import Path
-from typing import Union
+from typing import Optional, Union
 from reportlab.lib import colors
 from reportlab.lib.pagesizes import A4
 from reportlab.lib.styles import ParagraphStyle, getSampleStyleSheet
 from reportlab.lib.units import cm
-from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer
+from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer, Table, TableStyle
 logger = logging.getLogger(__name__)
@@ -64,6 +64,103 @@ class PDFGenerator:
            .replace("\n", "<br/>")
        )
    def _parse_latex_table(self, lines: list[str], start_idx: int) -> tuple[Optional[Table], int]:
        """
        Parsea una tabla LaTeX y la convierte a reportlab Table.
        Returns:
            (Table, end_index) - La tabla y el índice donde termina
        """
        # Buscar begin/end tabular
        table_lines = []
        i = start_idx
        in_table = False
        while i < len(lines):
            line = lines[i].strip()
            if "\\begin{tabular}" in line or "begin{tabular}" in line:
                in_table = True
                # Extraer especificaciones de columnas
                col_spec = "l"
                if "{" in line:
                    col_spec = line.split("{")[1].split("}")[0] if "}" in line else "l"
                table_lines.append({"type": "spec", "data": col_spec})
                i += 1
                continue
            if "\\end{tabular}" in line or "end{tabular}" in line:
                in_table = False
                break
            if in_table:
                # Saltar líneas de hline
                if "hline" in line.replace("\\", "").replace(" ", ""):
                    i += 1
                    continue
                # Procesar línea de tabla
                # Reemplazar & por separador
                row_data = line.replace("&", "|")
                # Eliminar comandos LaTeX
                row_data = row_data.replace("\\", "").replace("\\\\", "").replace("hline", "")
                cells = [c.strip() for c in row_data.split("|") if c.strip()]
                # Filtrar celdas vacías
                cells = [c for c in cells if c and c != "|"]
                if cells and len(cells) > 1:  # Al menos 2 columnas para ser tabla válida
                    table_lines.append({"type": "row", "data": cells})
            i += 1
        if not table_lines:
            return None, start_idx
        # Convertir a Table de reportlab
        data = []
        col_widths = None
        for tl in table_lines:
            if tl["type"] == "row":
                # Limpiar celdas de LaTeX
                row = []
                for cell in tl["data"]:
                    cell = cell.strip()
                    # Eliminar comandos LaTeX restantes (manejar {contenido})
                    import re
                    # Eliminar \textbf{...}, \textit{...}, \emph{...}
                    cell = re.sub(r'\\textbf\{([^}]*)\}', r'\1', cell)
                    cell = re.sub(r'\\textit\{([^}]*)\}', r'\1', cell)
                    cell = re.sub(r'\\emph\{([^}]*)\}', r'\1', cell)
                    cell = cell.replace("\\", "").replace("{", "").replace("}", "")
                    cell = cell.strip()
                    if cell:
                        row.append(cell)
                if row:
                    data.append(row)
        if not data:
            return None, start_idx
        # Crear tabla
        try:
            num_cols = len(data[0]) if data else 1
            table = Table(data)
            table.setStyle(TableStyle([
                ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
                ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
                ('ALIGN', (0, 0), (-1, -1), 'LEFT'),
                ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
                ('FONTSIZE', (0, 0), (-0, -1), 10),
                ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
                ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
                ('GRID', (0, 0), (-1, -1), 1, colors.black),
                ('VALIGN', (0, 0), (-1, -1), 'TOP'),
            ]))
            return table, i
        except Exception as e:
            logger.warning(f"Error parsing LaTeX table: {e}")
            return None, start_idx
    def _parse_markdown_basic(self, markdown: str) -> list[Paragraph]:
        """
        Convierte markdown básico a una lista de Paragraphs de reportlab.
@@ -101,6 +198,14 @@ class PDFGenerator:
            # Línea horizontal
            elif line == "---" or line == "***":
                elements.append(Spacer(1, 0.2 * cm))
            # Tabla LaTeX
            elif "begin{tabular}" in line or "begin{tabular" in line:
                latex_table, end_idx = self._parse_latex_table(lines, idx)
                if latex_table:
                    elements.append(Spacer(1, 0.3 * cm))
                    elements.append(latex_table)
                    elements.append(Spacer(1, 0.3 * cm))
                    idx = end_idx - 1  # Saltar las líneas de la tabla
            # Lista con guiones
            elif line.startswith("- ") or line.startswith("* "):
                text = self._escape_xml(line[2:])
--- a/watchers/folder_watcher.py
+++ b/watchers/folder_watcher.py
@@ -200,6 +200,17 @@ class RemoteFolderWatcher:
        remote_path = f"{self.remote_path}/{filename}"
        local_path = self.local_path / filename
        # Verificar si ya existe el archivo localmente
        if local_path.exists():
            # Verificar si ya fue procesado (existe transcripción con nombre EXACTO)
            stem = local_path.stem
            transcriptions_dir = self.local_path.parent / "transcriptions"
            txt_path = transcriptions_dir / f"{stem}.txt"
            if txt_path.exists():
                self.logger.info(f"Skipping already processed file: {filename}")
                return
        self.logger.info(f"Downloading: {remote_path}")
        if self.webdav.download_file(remote_path, local_path):
Author	SHA1	Message	Date
renato97	a7726365d7	feat: Docker para CBC con acceso limitado - Dockerfile minimal con Python 3.11 - docker-compose.yml con volumenes controlados - .dockerignore para build eficiente - Script de inicio start.sh - .env.example para configuración Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 17:44:09 +00:00
renato97	d50772d962	fix: Mejora parser de tablas LaTeX - Elimina líneas hline duplicadas - Mejora limpieza de comandos LaTeX en celdas - Usa regex para manejar {contenido} - Filtra celdas vacías Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 17:32:18 +00:00
renato97	d902203b59	fix: Detección estricta de archivos ya procesados - Compara nombres exactos (stem.txt) en vez de substrings - Agrega verificación en callback de descarga - Evita re-procesamiento de archivos Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 17:28:04 +00:00
renato97	1f6bfa771b	fix: Mejoras en generación de PDFs y resúmenes - Corrige PDFGenerator para pasar contenido (no ruta) - Agrega prompt siguiendo código.md (español, estructura académica) - Limpia thinking tokens de respuesta AI - Agrega skip de archivos ya procesados en watcher - Implementa tablas LaTeX en PDFs (reportlab Table) - Agrega load_dotenv() en main.py - Actualiza .env con MiniMax config - Agrega transcriptions/ a .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 17:12:00 +00:00