feat: Implement mathematical summaries with LaTeX and Pandoc

## ✨ What's New

- **LaTeX Support**: Generates PDF and DOCX files with correctly rendered mathematical formulas using Pandoc.
- **Automatic Sanitization**: Corrects stray Unicode characters (Greek/Cyrillic) and LaTeX syntax to avoid compilation errors.
- **Claude/GLM Priority**: Switches the default AI provider to Claude/GLM for better stability and reasoning capability.
- **Formatting Improvements**: The final formatting pass of the summary now uses the main model (GLM) instead of Gemini, for consistency.

## 🛠️ Technical Changes

- `document/generators.py`: Replaced manual generation with `pandoc`. Added the `_sanitize_latex` function.
- `services/ai/claude_provider.py`: Improved support for Z.ai environment variables.
- `services/ai/provider_factory.py`: Adjusted priority to `Claude > Gemini`.
- `latex/`: Added reference documentation for the LaTeX pipeline.
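Because `document/generators.py` now shells out to `pandoc` (with `pdflatex` as the PDF engine) instead of using Python libraries, both binaries have to be on `PATH` wherever this runs. A minimal sketch of a preflight check follows; the helper name `missing_tools` is hypothetical and is not part of this commit.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # pandoc performs the conversion; pdflatex is the engine passed to it for PDFs.
    missing = missing_tools(["pandoc", "pdflatex"])
    if missing:
        print(f"Missing external tools: {', '.join(missing)}")
    else:
        print("pandoc and pdflatex are available")
```

Running a check like this at startup would surface a missing binary once, rather than as a per-document failure logged as non-critical.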
2026-01-26 23:40:16 +00:00
parent f9d245a58e
commit 915f827305
4 changed files with 384 additions and 178 deletions
--- a/document/generators.py
+++ b/document/generators.py
@@ -3,6 +3,7 @@ Document generation utilities
 """

 import logging
+import subprocess
 import re
 from pathlib import Path
 from typing import Dict, Any, List, Tuple
@@ -49,17 +50,24 @@ Texto:

            # Step 2: Generate Unified Summary
            self.logger.info("Generating unified summary...")
-            summary_prompt = f"""Eres un profesor universitario experto en historia del siglo XX. Redacta un resumen académico integrado en español usando el texto y los bullet points extraídos.
+            summary_prompt = f"""Eres un profesor universitario experto en historia y economía. Redacta un resumen académico integrado en español usando el texto y los bullet points extraídos.

-REQUISITOS ESTRICTOS:
+REQUISITOS ESTRICTOS DE CONTENIDO:
 - Extensión entre 500-700 palabras
 - Usa encabezados Markdown con jerarquía clara (##, ###)
- Desarrolla los puntos clave con profundidad y contexto histórico
+- Desarrolla los puntos clave con profundidad y contexto histórico/económico
 - Mantén un tono académico y analítico
 - Incluye conclusiones significativas
 - NO agregues texto fuera del resumen
 - Devuelve únicamente el resumen en formato Markdown

+REQUISITOS ESTRICTOS DE FORMATO MATEMÁTICO (LaTeX):
+- Si el texto incluye fórmulas matemáticas o económicas, DEBES usar formato LaTeX.
+- Usa bloques $$ ... $$ para ecuaciones centradas importantes.
+- Usa $ ... $ para ecuaciones en línea.
+- Ejemplo: La fórmula del interés compuesto es $A = P(1 + r/n)^{{nt}}$.
+- NO uses bloques de código (```latex) para las fórmulas, úsalas directamente en el texto para que Pandoc las renderice.
+
 Contenido a resumir:
 {text[:20000]}

@@ -72,31 +80,29 @@ Puntos clave a incluir obligatoriamente:
                self.logger.error(f"Raw summary generation failed: {e}")
                raise e

-            # Step 3: Format with Gemini (using GeminiProvider explicitly)
-            self.logger.info("Formatting summary with Gemini...")
-            format_prompt = f"""Revisa y mejora el siguiente resumen en Markdown para que sea perfectamente legible:
+            # Step 3: Format with AI (using the main provider instead of Gemini)
+            self.logger.info("Formatting summary with AI...")
+            format_prompt = f"""Revisa y mejora el siguiente resumen en Markdown para que sea perfectamente legible y compatible con Pandoc:

 {raw_summary}

 Instrucciones:
- Corrige cualquier error de formato
+- Corrige cualquier error de formato Markdown
 - Asegúrate de que los encabezados estén bien espaciados
 - Verifica que las viñetas usen "- " correctamente
 - Mantén exactamente el contenido existente
 - EVITA el uso excesivo de negritas (asteriscos), úsalas solo para conceptos clave
+- VERIFICA que todas las fórmulas matemáticas estén correctamente encerradas en $...$ (inline) o $$...$$ (display)
+- NO alteres la sintaxis LaTeX dentro de los delimitadores $...$ o $$...$$
 - Devuelve únicamente el resumen formateado sin texto adicional"""

-            # Use generic Gemini provider for formatting as requested
-            from services.ai.gemini_provider import GeminiProvider
-
-            formatter = GeminiProvider()
-
            try:
-                if formatter.is_available():
-                    summary = formatter.generate_text(format_prompt)
+                # Use the main provider (Claude/GLM) for formatting too
+                if self.ai_provider.is_available():
+                    summary = self.ai_provider.generate_text(format_prompt)
                else:
                    self.logger.warning(
-                        "Gemini formatter not available, using raw summary"
+                        "AI provider not available for formatting, using raw summary"
                    )
                    summary = raw_summary
            except Exception as e:
@@ -108,8 +114,20 @@ Instrucciones:

            # Create document
            markdown_path = self._create_markdown(summary, base_name)
-            docx_path = self._create_docx(summary, base_name)
-            pdf_path = self._create_pdf(summary, base_name)
+
+            docx_path = None
+            try:
+                docx_path = self._create_docx(markdown_path, base_name)
+            except Exception as e:
+                self.logger.error(f"Failed to create DOCX (non-critical): {e}")
+
+            pdf_path = None
+            try:
+                # Sanitize LaTeX before PDF generation
+                self._sanitize_latex(markdown_path)
+                pdf_path = self._create_pdf(markdown_path, base_name)
+            except Exception as e:
+                self.logger.error(f"Failed to create PDF (non-critical): {e}")

            # Upload to Notion if configured
            from services.notion_service import notion_service
@@ -123,7 +141,7 @@ Instrucciones:
                    # Create the page with the full summary content
                    notion_metadata = {
                        "file_type": "Audio",  # Or 'PDF', depending on the source
-                        "pdf_path": pdf_path,
+                        "pdf_path": pdf_path if pdf_path else Path(""),
                        "add_status": False,  # Do not use Status/Tipo (they do not exist in the DB)
                        "use_as_page": False,  # Use as a database, not a page
                    }
@@ -149,9 +167,9 @@ Instrucciones:

            metadata = {
                "markdown_path": str(markdown_path),
-                "docx_path": str(docx_path),
-                "pdf_path": str(pdf_path),
-                "docx_name": Path(docx_path).name,
+                "docx_path": str(docx_path) if docx_path else "",
+                "pdf_path": str(pdf_path) if pdf_path else "",
+                "docx_name": Path(docx_path).name if docx_path else "",
                "summary": summary,
                "filename": filename,
                "notion_uploaded": notion_uploaded,
@@ -164,6 +182,53 @@ Instrucciones:
            self.logger.error(f"Document generation process failed: {e}")
            return False, "", {}

+    def _sanitize_latex(self, markdown_path: Path) -> None:
+        """Sanitize LaTeX syntax in Markdown file to prevent Pandoc errors"""
+        try:
+            content = markdown_path.read_text(encoding="utf-8")
+
+            # 1. Unescape escaped dollar signs which are common LLM errors for math
+            content = content.replace(r"\$", "$")
+
+            # 2. Replace Cyrillic letters, typographic punctuation, and Greek letters that sneak in via LLMs
+            replacements = {
+                "ч": "ch",
+                "в": "v",
+                "к": "k",
+                "м": "m",
+                "н": "n",
+                "т": "t",
+                "—": "-",
+                "–": "-",
+                "“": '"',
+                "”": '"',
+                "’": "'",
+                "Δ": "$\\Delta$",
+                "δ": "$\\delta$",
+                "Σ": "$\\Sigma$",
+                "σ": "$\\sigma$",
+                "π": "$\\pi$",
+                "Π": "$\\Pi$",
+                "α": "$\\alpha$",
+                "β": "$\\beta$",
+                "γ": "$\\gamma$",
+                "θ": "$\\theta$",
+                "λ": "$\\lambda$",
+                "μ": "$\\mu$",
+            }
+
+            # Be careful not to double-replace already correct LaTeX
+            for char, repl in replacements.items():
+                if char in content:
+                    # Checking whether the char is already inside math mode would be complex,
+                    # so for now we assume raw Unicode Greek chars should become LaTeX
+                    content = content.replace(char, repl)
+
+            markdown_path.write_text(content, encoding="utf-8")
+            self.logger.info(f"Sanitized LaTeX in {markdown_path}")
+        except Exception as e:
+            self.logger.warning(f"Failed to sanitize LaTeX: {e}")
+
    def _generate_filename(self, text: str, summary: str) -> str:
        """Generate intelligent filename"""
        try:
@@ -173,11 +238,10 @@ Summary: {summary}

 Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""

-            topics_text = (
-                self.ai_provider.sanitize_input(prompt)
-                if hasattr(self.ai_provider, "sanitize_input")
-                else summary[:100]
-            )
+            try:
+                topics_text = self.ai_provider.generate_text(prompt)
+            except Exception:
+                topics_text = summary[:100]

            # Simple topic extraction
            topics = re.findall(r"\b[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+\b", topics_text)[:3]
@@ -192,7 +256,7 @@ Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""

        except Exception as e:
            self.logger.error(f"Filename generation failed: {e}")
-            return base_name[: settings.MAX_FILENAME_BASE_LENGTH]
+            return "documento"

    def _create_markdown(self, summary: str, base_name: str) -> Path:
        """Create Markdown document"""
@@ -217,154 +281,72 @@ Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""

        return output_path

-    def _create_docx(self, summary: str, base_name: str) -> Path:
-        """Create DOCX document with Markdown parsing (Legacy method ported)"""
-        try:
-            from docx import Document
-            from docx.shared import Inches
-        except ImportError:
-            raise FileProcessingError("python-docx not installed")
-
+    def _create_docx(self, markdown_path: Path, base_name: str) -> Path:
+        """Create DOCX document using pandoc"""
        output_dir = settings.LOCAL_DOCX
        output_dir.mkdir(parents=True, exist_ok=True)

        output_path = output_dir / f"{base_name}_unificado.docx"

-        doc = Document()
-        doc.add_heading(base_name.replace("_", " ").title(), 0)
+        self.logger.info(
+            f"Converting Markdown to DOCX: {markdown_path} -> {output_path}"
+        )

-        # Parse and render Markdown content line by line
-        lines = summary.splitlines()
-        current_paragraph = []
-
-        for line in lines:
-            line = line.strip()
-            if not line:
-                if current_paragraph:
-                    p = doc.add_paragraph(" ".join(current_paragraph))
-                    p.alignment = 3  # JUSTIFY alignment (WD_ALIGN_PARAGRAPH.JUSTIFY=3)
-                    current_paragraph = []
-                continue
-
-            if line.startswith("#"):
-                if current_paragraph:
-                    p = doc.add_paragraph(" ".join(current_paragraph))
-                    p.alignment = 3
-                    current_paragraph = []
-                # Process heading
-                level = len(line) - len(line.lstrip("#"))
-                heading_text = line.lstrip("#").strip()
-                if level <= 6:
-                    doc.add_heading(heading_text, level=level)
-                else:
-                    current_paragraph.append(heading_text)
-            elif line.startswith("-") or line.startswith("*") or line.startswith("•"):
-                if current_paragraph:
-                    p = doc.add_paragraph(" ".join(current_paragraph))
-                    p.alignment = 3
-                    current_paragraph = []
-                bullet_text = line.lstrip("-*• ").strip()
-                p = doc.add_paragraph(bullet_text, style="List Bullet")
-                # Remove bold markers from bullets if present
-                if "**" in bullet_text:
-                    # Basic cleanup for bullets
-                    pass
-            else:
-                # Clean up excessive bold markers in body text if user requested
-                clean_line = line.replace(
-                    "**", ""
-                )  # Removing asterisks as per user complaint "se abusa de los asteriscos"
-                current_paragraph.append(clean_line)
-
-        if current_paragraph:
-            p = doc.add_paragraph(" ".join(current_paragraph))
-            p.alignment = 3
-
-        doc.add_page_break()
-        doc.add_paragraph(f"*Generado por CBCFacil*")
-
-        doc.save(output_path)
-        return output_path
-
-    def _create_pdf(self, summary: str, base_name: str) -> Path:
-        """Create PDF document with Markdown parsing (Legacy method ported)"""
        try:
-            from reportlab.lib.pagesizes import letter
-            from reportlab.pdfgen import canvas
-            import textwrap
-        except ImportError:
-            raise FileProcessingError("reportlab not installed")
+            cmd = [
+                "pandoc",
+                str(markdown_path),
+                "-o",
+                str(output_path),
+                "--from=markdown",
+                "--to=docx",
+            ]

+            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
+
+            self.logger.info("DOCX generated successfully with pandoc")
+            return output_path
+
+        except subprocess.CalledProcessError as e:
+            self.logger.error(f"Pandoc DOCX conversion failed: {e.stderr}")
+            raise FileProcessingError(f"Failed to generate DOCX: {e.stderr}")
+        except Exception as e:
+            self.logger.error(f"Error generating DOCX: {e}")
+            raise FileProcessingError(f"Error generating DOCX: {e}")
+
+    def _create_pdf(self, markdown_path: Path, base_name: str) -> Path:
+        """Create PDF document using pandoc and pdflatex"""
        output_dir = settings.LOCAL_DOWNLOADS_PATH
        output_dir.mkdir(parents=True, exist_ok=True)

        output_path = output_dir / f"{base_name}_unificado.pdf"

-        c = canvas.Canvas(str(output_path), pagesize=letter)
-        width, height = letter
-        margin = 72
-        y_position = height - margin
+        self.logger.info(
+            f"Converting Markdown to PDF: {markdown_path} -> {output_path}"
+        )

-        def new_page():
-            nonlocal y_position
-            c.showPage()
-            c.setFont("Helvetica", 11)
-            y_position = height - margin
+        try:
+            cmd = [
+                "pandoc",
+                str(markdown_path),
+                "-o",
+                str(output_path),
+                "--pdf-engine=pdflatex",
+                "-V",
+                "geometry:margin=2.5cm",
+                "-V",
+                "fontsize=12pt",
+                "--highlight-style=tango",
+            ]

-        c.setFont("Helvetica", 11)
+            result = subprocess.run(cmd, capture_output=True, text=True, check=True)

-        # Title
-        c.setFont("Helvetica-Bold", 16)
-        c.drawString(margin, y_position, base_name.replace("_", " ").title()[:100])
-        y_position -= 28
-        c.setFont("Helvetica", 11)
+            self.logger.info("PDF generated successfully with pandoc")
+            return output_path

-        summary_clean = summary.replace(
-            "**", ""
-        )  # Remove asterisks globally for cleaner PDF
-
-        for raw_line in summary_clean.splitlines():
-            line = raw_line.rstrip()
-
-            if not line.strip():
-                y_position -= 14
-                if y_position < margin:
-                    new_page()
-                continue
-
-            stripped = line.lstrip()
-
-            if stripped.startswith("#"):
-                level = len(stripped) - len(stripped.lstrip("#"))
-                heading_text = stripped.lstrip("#").strip()
-                if heading_text:
-                    font_size = 16 if level == 1 else 14 if level == 2 else 12
-                    c.setFont("Helvetica-Bold", font_size)
-                    c.drawString(margin, y_position, heading_text[:90])
-                    y_position -= font_size + 6
-                    if y_position < margin:
-                        new_page()
-                    c.setFont("Helvetica", 11)
-                continue
-
-            if stripped.startswith(("-", "*", "•")):
-                bullet_text = stripped.lstrip("-*•").strip()
-                wrapped_lines = textwrap.wrap(bullet_text, width=80) or [""]
-                for idx, wrapped in enumerate(wrapped_lines):
-                    prefix = "• " if idx == 0 else "  "
-                    c.drawString(margin, y_position, f"{prefix}{wrapped}")
-                    y_position -= 14
-                    if y_position < margin:
-                        new_page()
-                continue
-
-            # Body text - Justified approximation (ReportLab native justification requires Paragraph styles, defaulting to wrap)
-            wrapped_lines = textwrap.wrap(stripped, width=90) or [""]
-            for wrapped in wrapped_lines:
-                c.drawString(margin, y_position, wrapped)
-                y_position -= 14
-                if y_position < margin:
-                    new_page()
-
-        c.save()
-        return output_path
+        except subprocess.CalledProcessError as e:
+            self.logger.error(f"Pandoc PDF conversion failed: {e.stderr}")
+            raise FileProcessingError(f"Failed to generate PDF: {e.stderr}")
+        except Exception as e:
+            self.logger.error(f"Error generating PDF: {e}")
+            raise FileProcessingError(f"Error generating PDF: {e}")
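The replacement loop in `_sanitize_latex` wraps every Greek letter in `$...$`, so a letter that already sits inside math is wrapped again: `$x + β$` becomes `$x + $\beta$$`. The following is a rough sketch of a math-aware variant, not the code in this commit. It assumes math spans can be found with a simple regex; `GREEK`, `MATH_SPAN`, and `replace_greek` are hypothetical names, and the mapping is only a subset of the one above. Text with literal currency dollars (for example `$5 and $10`) would be misread as math by this pattern.

```python
import re

# Hypothetical subset of the Greek mapping used in _sanitize_latex.
GREEK = {"α": r"\alpha", "β": r"\beta", "Δ": r"\Delta", "π": r"\pi"}

# Display math ($$...$$) is tried first, then single-line inline math ($...$).
MATH_SPAN = re.compile(r"(\$\$.+?\$\$|\$[^$\n]+?\$)", re.DOTALL)


def replace_greek(content: str) -> str:
    """Replace Unicode Greek letters, adding $...$ only outside existing math."""
    parts = MATH_SPAN.split(content)
    for i, part in enumerate(parts):
        inside_math = i % 2 == 1  # the captured math spans land on odd indices
        for char, command in GREEK.items():
            if inside_math:
                # The empty group ends the command name, so it cannot merge
                # with a following letter and leaves no space before a closing $.
                part = part.replace(char, command + "{}")
            else:
                part = part.replace(char, f"${command}$")
        parts[i] = part
    return "".join(parts)


if __name__ == "__main__":
    print(replace_greek("El coeficiente β y $x + β$"))
```

The empty group is used instead of a trailing space because Pandoc does not treat a `$` preceded by whitespace as a closing math delimiter.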