feat: Implement mathematical summaries with LaTeX and Pandoc

## What's New
- **LaTeX Support**: Generates PDF and DOCX files with correctly rendered mathematical formulas using Pandoc.
- **Automatic Sanitization**: Fixes Unicode characters (Greek/Cyrillic) and LaTeX syntax to avoid compilation errors.
- **GLM/Claude Priority**: Switches the default AI provider to Claude/GLM for better stability and reasoning capability.
- **Formatting Improvements**: The final formatting pass of the summary now uses the main model (GLM) instead of Gemini, for consistency.

## 🛠️ Technical Changes
- `document/generators.py`: Replaced manual generation with `pandoc`. Added the `_sanitize_latex` function.
- `services/ai/claude_provider.py`: Improved support for Z.ai environment variables.
- `services/ai/provider_factory.py`: Adjusted priority to `Claude > Gemini`.
- `latex/`: Added reference documentation for the LaTeX pipeline.
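
To illustrate the sanitization step, here is a minimal sketch of a math-aware variant. It is not the code in this commit: `replace_greek_outside_math`, `GREEK_TO_LATEX`, and `MATH_SPAN` are hypothetical names, and the mapping is only a small subset. It assumes math is delimited by `$...$` or `$$...$$`; Greek letters in prose are wrapped in `$...$`, while letters already inside math become bare macros so the delimiters are not doubled.

```python
import re

# Hypothetical subset of a Greek-letter mapping; not the project's full table.
GREEK_TO_LATEX = {
    "Δ": r"\Delta",
    "σ": r"\sigma",
    "π": r"\pi",
    "α": r"\alpha",
}

# Matches display math ($$...$$) first, then single-line inline math ($...$).
MATH_SPAN = re.compile(r"(\$\$.+?\$\$|\$[^$\n]+?\$)", re.DOTALL)


def replace_greek_outside_math(text: str) -> str:
    """Wrap Greek letters in $...$ in prose; use bare macros inside math."""
    # re.split with one capturing group alternates: prose, math, prose, ...
    parts = MATH_SPAN.split(text)
    out = []
    for index, part in enumerate(parts):
        inside_math = index % 2 == 1  # odd indices are the captured math spans
        for char, macro in GREEK_TO_LATEX.items():
            # Inside math, a trailing space keeps the macro from merging
            # with a following letter (e.g. \Delta x, not \Deltax).
            replacement = f"{macro} " if inside_math else f"${macro}$"
            part = part.replace(char, replacement)
        out.append(part)
    return "".join(out)
```

For example, `replace_greek_outside_math("El cambio Δ es $Δx$")` leaves the existing math span delimited once and wraps only the letter that appears in prose.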
renato97
2026-01-26 23:40:16 +00:00
parent f9d245a58e
commit 915f827305
4 changed files with 384 additions and 178 deletions


@@ -3,6 +3,7 @@ Document generation utilities
"""
import logging
import subprocess
import re
from pathlib import Path
from typing import Dict, Any, List, Tuple
@@ -49,17 +50,24 @@ Texto:
# Step 2: Generate Unified Summary
self.logger.info("Generating unified summary...")
summary_prompt = f"""Eres un profesor universitario experto en historia del siglo XX. Redacta un resumen académico integrado en español usando el texto y los bullet points extraídos.
summary_prompt = f"""Eres un profesor universitario experto en historia y economía. Redacta un resumen académico integrado en español usando el texto y los bullet points extraídos.
REQUISITOS ESTRICTOS:
REQUISITOS ESTRICTOS DE CONTENIDO:
- Extensión entre 500-700 palabras
- Usa encabezados Markdown con jerarquía clara (##, ###)
- Desarrolla los puntos clave con profundidad y contexto histórico
- Desarrolla los puntos clave con profundidad y contexto histórico/económico
- Mantén un tono académico y analítico
- Incluye conclusiones significativas
- NO agregues texto fuera del resumen
- Devuelve únicamente el resumen en formato Markdown
REQUISITOS ESTRICTOS DE FORMATO MATEMÁTICO (LaTeX):
- Si el texto incluye fórmulas matemáticas o económicas, DEBES usar formato LaTeX.
- Usa bloques $$ ... $$ para ecuaciones centradas importantes.
- Usa $ ... $ para ecuaciones en línea.
- Ejemplo: La fórmula del interés compuesto es $A = P(1 + r/n)^{{nt}}$.
- NO uses bloques de código (```latex) para las fórmulas, úsalas directamente en el texto para que Pandoc las renderice.
Contenido a resumir:
{text[:20000]}
@@ -72,31 +80,29 @@ Puntos clave a incluir obligatoriamente:
self.logger.error(f"Raw summary generation failed: {e}")
raise e
# Step 3: Format with Gemini (using GeminiProvider explicitly)
self.logger.info("Formatting summary with Gemini...")
format_prompt = f"""Revisa y mejora el siguiente resumen en Markdown para que sea perfectamente legible:
# Step 3: Format with AI (using main provider instead of Gemini)
self.logger.info("Formatting summary with AI...")
format_prompt = f"""Revisa y mejora el siguiente resumen en Markdown para que sea perfectamente legible y compatible con Pandoc:
{raw_summary}
Instrucciones:
- Corrige cualquier error de formato
- Corrige cualquier error de formato Markdown
- Asegúrate de que los encabezados estén bien espaciados
- Verifica que las viñetas usen "- " correctamente
- Mantén exactamente el contenido existente
- EVITA el uso excesivo de negritas (asteriscos), úsalas solo para conceptos clave
- VERIFICA que todas las fórmulas matemáticas estén correctamente encerradas en $...$ (inline) o $$...$$ (display)
- NO alteres la sintaxis LaTeX dentro de los delimitadores $...$ o $$...$$
- Devuelve únicamente el resumen formateado sin texto adicional"""
# Use generic Gemini provider for formatting as requested
from services.ai.gemini_provider import GeminiProvider
formatter = GeminiProvider()
try:
if formatter.is_available():
summary = formatter.generate_text(format_prompt)
# Use the main provider (Claude/GLM) for formatting too
if self.ai_provider.is_available():
summary = self.ai_provider.generate_text(format_prompt)
else:
self.logger.warning(
"Gemini formatter not available, using raw summary"
"AI provider not available for formatting, using raw summary"
)
summary = raw_summary
except Exception as e:
@@ -108,8 +114,20 @@ Instrucciones:
# Create document
markdown_path = self._create_markdown(summary, base_name)
docx_path = self._create_docx(summary, base_name)
pdf_path = self._create_pdf(summary, base_name)
docx_path = None
try:
docx_path = self._create_docx(markdown_path, base_name)
except Exception as e:
self.logger.error(f"Failed to create DOCX (non-critical): {e}")
pdf_path = None
try:
# Sanitize LaTeX before PDF generation
self._sanitize_latex(markdown_path)
pdf_path = self._create_pdf(markdown_path, base_name)
except Exception as e:
self.logger.error(f"Failed to create PDF (non-critical): {e}")
# Upload to Notion if configured
from services.notion_service import notion_service
@@ -123,7 +141,7 @@ Instrucciones:
# Create a page with the full summary content
notion_metadata = {
"file_type": "Audio", # Or 'PDF', depending on the source
"pdf_path": pdf_path,
"pdf_path": pdf_path if pdf_path else Path(""),
"add_status": False, # Do not use Status/Tipo (they do not exist in the DB)
"use_as_page": False, # Use as a database, not a page
}
@@ -149,9 +167,9 @@ Instrucciones:
metadata = {
"markdown_path": str(markdown_path),
"docx_path": str(docx_path),
"pdf_path": str(pdf_path),
"docx_name": Path(docx_path).name,
"docx_path": str(docx_path) if docx_path else "",
"pdf_path": str(pdf_path) if pdf_path else "",
"docx_name": Path(docx_path).name if docx_path else "",
"summary": summary,
"filename": filename,
"notion_uploaded": notion_uploaded,
@@ -164,6 +182,53 @@ Instrucciones:
self.logger.error(f"Document generation process failed: {e}")
return False, "", {}
def _sanitize_latex(self, markdown_path: Path) -> None:
"""Sanitize LaTeX syntax in Markdown file to prevent Pandoc errors"""
try:
content = markdown_path.read_text(encoding="utf-8")
# 1. Unescape escaped dollar signs which are common LLM errors for math
content = content.replace(r"\$", "$")
# 2. Fix common Cyrillic and Greek characters that sneak in via LLMs
replacements = {
"ч": "ch",
"в": "v",
"к": "k",
"м": "m",
"н": "n",
"т": "t",
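"\u2013": "-",  # en dash
"\u2014": "-",  # em dash
"\u201c": '"',  # left double quotation mark
"\u201d": '"',  # right double quotation mark
"\u2019": "'",  # right single quotation mark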
"Δ": "$\\Delta$",
"δ": "$\\delta$",
"Σ": "$\\Sigma$",
"σ": "$\\sigma$",
"π": "$\\pi$",
"Π": "$\\Pi$",
"α": "$\\alpha$",
"β": "$\\beta$",
"γ": "$\\gamma$",
"θ": "$\\theta$",
"λ": "$\\lambda$",
"μ": "$\\mu$",
}
# Be careful not to double-replace already correct LaTeX
for char, repl in replacements.items():
if char in content:
# Check if it's already inside math mode would be complex,
# but for now we assume raw unicode greek chars should become latex
content = content.replace(char, repl)
markdown_path.write_text(content, encoding="utf-8")
self.logger.info(f"Sanitized LaTeX in {markdown_path}")
except Exception as e:
self.logger.warning(f"Failed to sanitize LaTeX: {e}")
def _generate_filename(self, text: str, summary: str) -> str:
"""Generate intelligent filename"""
try:
@@ -173,11 +238,10 @@ Summary: {summary}
Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""
topics_text = (
self.ai_provider.sanitize_input(prompt)
if hasattr(self.ai_provider, "sanitize_input")
else summary[:100]
)
try:
topics_text = self.ai_provider.generate_text(prompt)
except Exception:
topics_text = summary[:100]
# Simple topic extraction
topics = re.findall(r"\b[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+\b", topics_text)[:3]
@@ -192,7 +256,7 @@ Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""
except Exception as e:
self.logger.error(f"Filename generation failed: {e}")
return base_name[: settings.MAX_FILENAME_BASE_LENGTH]
return "documento"
def _create_markdown(self, summary: str, base_name: str) -> Path:
"""Create Markdown document"""
@@ -217,154 +281,72 @@ Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""
return output_path
def _create_docx(self, summary: str, base_name: str) -> Path:
"""Create DOCX document with Markdown parsing (Legacy method ported)"""
try:
from docx import Document
from docx.shared import Inches
except ImportError:
raise FileProcessingError("python-docx not installed")
def _create_docx(self, markdown_path: Path, base_name: str) -> Path:
"""Create DOCX document using pandoc"""
output_dir = settings.LOCAL_DOCX
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / f"{base_name}_unificado.docx"
doc = Document()
doc.add_heading(base_name.replace("_", " ").title(), 0)
self.logger.info(
f"Converting Markdown to DOCX: {markdown_path} -> {output_path}"
)
# Parse and render Markdown content line by line
lines = summary.splitlines()
current_paragraph = []
for line in lines:
line = line.strip()
if not line:
if current_paragraph:
p = doc.add_paragraph(" ".join(current_paragraph))
p.alignment = 3 # JUSTIFY alignment (WD_ALIGN_PARAGRAPH.JUSTIFY=3)
current_paragraph = []
continue
if line.startswith("#"):
if current_paragraph:
p = doc.add_paragraph(" ".join(current_paragraph))
p.alignment = 3
current_paragraph = []
# Process heading
level = len(line) - len(line.lstrip("#"))
heading_text = line.lstrip("#").strip()
if level <= 6:
doc.add_heading(heading_text, level=level)
else:
current_paragraph.append(heading_text)
elif line.startswith("-") or line.startswith("*") or line.startswith("•"):
if current_paragraph:
p = doc.add_paragraph(" ".join(current_paragraph))
p.alignment = 3
current_paragraph = []
bullet_text = line.lstrip("-*• ").strip()
p = doc.add_paragraph(bullet_text, style="List Bullet")
# Remove bold markers from bullets if present
if "**" in bullet_text:
# Basic cleanup for bullets
pass
else:
# Clean up excessive bold markers in body text if user requested
clean_line = line.replace(
"**", ""
) # Removing asterisks as per user complaint "se abusa de los asteriscos"
current_paragraph.append(clean_line)
if current_paragraph:
p = doc.add_paragraph(" ".join(current_paragraph))
p.alignment = 3
doc.add_page_break()
doc.add_paragraph(f"*Generado por CBCFacil*")
doc.save(output_path)
return output_path
def _create_pdf(self, summary: str, base_name: str) -> Path:
"""Create PDF document with Markdown parsing (Legacy method ported)"""
try:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
import textwrap
except ImportError:
raise FileProcessingError("reportlab not installed")
cmd = [
"pandoc",
str(markdown_path),
"-o",
str(output_path),
"--from=markdown",
"--to=docx",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
self.logger.info("DOCX generated successfully with pandoc")
return output_path
except subprocess.CalledProcessError as e:
self.logger.error(f"Pandoc DOCX conversion failed: {e.stderr}")
raise FileProcessingError(f"Failed to generate DOCX: {e.stderr}")
except Exception as e:
self.logger.error(f"Error generating DOCX: {e}")
raise FileProcessingError(f"Error generating DOCX: {e}")
def _create_pdf(self, markdown_path: Path, base_name: str) -> Path:
"""Create PDF document using pandoc and pdflatex"""
output_dir = settings.LOCAL_DOWNLOADS_PATH
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / f"{base_name}_unificado.pdf"
c = canvas.Canvas(str(output_path), pagesize=letter)
width, height = letter
margin = 72
y_position = height - margin
self.logger.info(
f"Converting Markdown to PDF: {markdown_path} -> {output_path}"
)
def new_page():
nonlocal y_position
c.showPage()
c.setFont("Helvetica", 11)
y_position = height - margin
try:
cmd = [
"pandoc",
str(markdown_path),
"-o",
str(output_path),
"--pdf-engine=pdflatex",
"-V",
"geometry:margin=2.5cm",
"-V",
"fontsize=12pt",
"--highlight-style=tango",
]
c.setFont("Helvetica", 11)
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
# Title
c.setFont("Helvetica-Bold", 16)
c.drawString(margin, y_position, base_name.replace("_", " ").title()[:100])
y_position -= 28
c.setFont("Helvetica", 11)
self.logger.info("PDF generated successfully with pandoc")
return output_path
summary_clean = summary.replace(
"**", ""
) # Remove asterisks globally for cleaner PDF
for raw_line in summary_clean.splitlines():
line = raw_line.rstrip()
if not line.strip():
y_position -= 14
if y_position < margin:
new_page()
continue
stripped = line.lstrip()
if stripped.startswith("#"):
level = len(stripped) - len(stripped.lstrip("#"))
heading_text = stripped.lstrip("#").strip()
if heading_text:
font_size = 16 if level == 1 else 14 if level == 2 else 12
c.setFont("Helvetica-Bold", font_size)
c.drawString(margin, y_position, heading_text[:90])
y_position -= font_size + 6
if y_position < margin:
new_page()
c.setFont("Helvetica", 11)
continue
if stripped.startswith(("-", "*", "•")):
bullet_text = stripped.lstrip("-*•").strip()
wrapped_lines = textwrap.wrap(bullet_text, width=80) or [""]
for idx, wrapped in enumerate(wrapped_lines):
prefix = "• " if idx == 0 else "  "
c.drawString(margin, y_position, f"{prefix}{wrapped}")
y_position -= 14
if y_position < margin:
new_page()
continue
# Body text - Justified approximation (ReportLab native justification requires Paragraph styles, defaulting to wrap)
wrapped_lines = textwrap.wrap(stripped, width=90) or [""]
for wrapped in wrapped_lines:
c.drawString(margin, y_position, wrapped)
y_position -= 14
if y_position < margin:
new_page()
c.save()
return output_path
except subprocess.CalledProcessError as e:
self.logger.error(f"Pandoc PDF conversion failed: {e.stderr}")
raise FileProcessingError(f"Failed to generate PDF: {e.stderr}")
except Exception as e:
self.logger.error(f"Error generating PDF: {e}")
raise FileProcessingError(f"Error generating PDF: {e}")
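
The PDF path above depends on both `pandoc` and `pdflatex` being installed. As a hedged sketch (not part of this commit), the conversion could check for those executables before running and bound the subprocess with a timeout. The names `build_pandoc_pdf_command`, `missing_tools`, and `convert_to_pdf` are hypothetical, and the 120-second timeout is an assumed value.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_pandoc_pdf_command(markdown_path: Path, output_path: Path) -> List[str]:
    """Assemble a pandoc argument list for PDF output via pdflatex."""
    return [
        "pandoc",
        str(markdown_path),
        "-o",
        str(output_path),
        "--pdf-engine=pdflatex",
        "-V",
        "geometry:margin=2.5cm",
        "-V",
        "fontsize=12pt",
    ]


def missing_tools(tools: Sequence[str] = ("pandoc", "pdflatex")) -> List[str]:
    """Return the required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def convert_to_pdf(markdown_path: Path, output_path: Path, timeout: int = 120) -> Path:
    """Run pandoc, failing early with a clear message if a tool is absent."""
    missing = missing_tools()
    if missing:
        raise RuntimeError(f"Missing required tools: {', '.join(missing)}")
    cmd = build_pandoc_pdf_command(markdown_path, output_path)
    # check=True raises CalledProcessError on a non-zero exit status;
    # timeout raises TimeoutExpired if LaTeX compilation hangs.
    subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=timeout)
    return output_path
```

Checking `shutil.which` first turns a generic `FileNotFoundError` into a message that names the missing tool, which may be easier to act on in deployment logs.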