feat: Implement mathematical summaries with LaTeX and Pandoc

## ✨ What's New
- **LaTeX support**: Generates PDF and DOCX files with correctly rendered mathematical formulas using Pandoc.
- **Automatic sanitization**: Fixes stray Unicode characters (Greek/Cyrillic) and LaTeX syntax to avoid compilation errors.
- **GLM/Claude first**: Switches the default AI provider to Claude/GLM for greater stability and reasoning capability.
- **Formatting improvements**: The final formatting pass on the summary now uses the main model (GLM) instead of Gemini, for consistency.

## 🛠️ Technical Changes
- `document/generators.py`: Replaced manual generation with `pandoc`. Added the `_sanitize_latex` function.
- `services/ai/claude_provider.py`: Improved support for Z.ai environment variables.
- `services/ai/provider_factory.py`: Adjusted priority to `Claude > Gemini`.
- `latex/`: Added reference documentation for the LaTeX pipeline.
@@ -3,6 +3,7 @@ Document generation utilities
 """
 
 import logging
+import subprocess
 import re
 from pathlib import Path
 from typing import Dict, Any, List, Tuple
@@ -49,17 +50,24 @@ Texto:
 
             # Step 2: Generate Unified Summary
             self.logger.info("Generating unified summary...")
-            summary_prompt = f"""Eres un profesor universitario experto en historia del siglo XX. Redacta un resumen académico integrado en español usando el texto y los bullet points extraídos.
+            summary_prompt = f"""Eres un profesor universitario experto en historia y economía. Redacta un resumen académico integrado en español usando el texto y los bullet points extraídos.
 
-REQUISITOS ESTRICTOS:
+REQUISITOS ESTRICTOS DE CONTENIDO:
 - Extensión entre 500-700 palabras
 - Usa encabezados Markdown con jerarquía clara (##, ###)
-- Desarrolla los puntos clave con profundidad y contexto histórico
+- Desarrolla los puntos clave con profundidad y contexto histórico/económico
 - Mantén un tono académico y analítico
 - Incluye conclusiones significativas
 - NO agregues texto fuera del resumen
 - Devuelve únicamente el resumen en formato Markdown
 
+REQUISITOS ESTRICTOS DE FORMATO MATEMÁTICO (LaTeX):
+- Si el texto incluye fórmulas matemáticas o económicas, DEBES usar formato LaTeX.
+- Usa bloques $$ ... $$ para ecuaciones centradas importantes.
+- Usa $ ... $ para ecuaciones en línea.
+- Ejemplo: La fórmula del interés compuesto es $A = P(1 + r/n)^{{nt}}$.
+- NO uses bloques de código (```latex) para las fórmulas, úsalas directamente en el texto para que Pandoc las renderice.
+
 Contenido a resumir:
 {text[:20000]}
 
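A note on the `^{{nt}}` in the example line above: the prompt is an f-string, so braces that must reach the model literally have to be doubled, while `{text[:20000]}` is interpolated. A minimal sketch of that behaviour, where `text` and `prompt` are illustrative stand-ins and not the values used in the repository:

```python
# Illustrative only: shows why the prompt writes ^{{nt}} rather than ^{nt}.
text = "contenido de ejemplo"  # hypothetical stand-in for the real input text

# Inside an f-string, {{ and }} produce literal braces,
# while {text[:20000]} is evaluated and interpolated.
prompt = f"Ejemplo: $A = P(1 + r/n)^{{nt}}$\nContenido a resumir:\n{text[:20000]}"

print(prompt)
```

Writing `^{nt}` with single braces would instead make Python try to evaluate a variable named `nt` when the prompt is built.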
@@ -72,31 +80,29 @@ Puntos clave a incluir obligatoriamente:
                 self.logger.error(f"Raw summary generation failed: {e}")
                 raise e
 
-            # Step 3: Format with Gemini (using GeminiProvider explicitly)
-            self.logger.info("Formatting summary with Gemini...")
-            format_prompt = f"""Revisa y mejora el siguiente resumen en Markdown para que sea perfectamente legible:
+            # Step 3: Format with IA (using main provider instead of Gemini)
+            self.logger.info("Formatting summary with IA...")
+            format_prompt = f"""Revisa y mejora el siguiente resumen en Markdown para que sea perfectamente legible y compatible con Pandoc:
 
 {raw_summary}
 
 Instrucciones:
-- Corrige cualquier error de formato
+- Corrige cualquier error de formato Markdown
 - Asegúrate de que los encabezados estén bien espaciados
 - Verifica que las viñetas usen "- " correctamente
 - Mantén exactamente el contenido existente
 - EVITA el uso excesivo de negritas (asteriscos), úsalas solo para conceptos clave
+- VERIFICA que todas las fórmulas matemáticas estén correctamente encerradas en $...$ (inline) o $$...$$ (display)
+- NO alteres la sintaxis LaTeX dentro de los delimitadores $...$ o $$...$$
 - Devuelve únicamente el resumen formateado sin texto adicional"""
 
-            # Use generic Gemini provider for formatting as requested
-            from services.ai.gemini_provider import GeminiProvider
-
-            formatter = GeminiProvider()
-
             try:
-                if formatter.is_available():
-                    summary = formatter.generate_text(format_prompt)
+                # Use the main provider (Claude/GLM) for formatting too
+                if self.ai_provider.is_available():
+                    summary = self.ai_provider.generate_text(format_prompt)
                 else:
                     self.logger.warning(
-                        "Gemini formatter not available, using raw summary"
+                        "AI provider not available for formatting, using raw summary"
                     )
                     summary = raw_summary
             except Exception as e:
@@ -108,8 +114,20 @@ Instrucciones:
 
             # Create document
             markdown_path = self._create_markdown(summary, base_name)
-            docx_path = self._create_docx(summary, base_name)
-            pdf_path = self._create_pdf(summary, base_name)
+
+            docx_path = None
+            try:
+                docx_path = self._create_docx(markdown_path, base_name)
+            except Exception as e:
+                self.logger.error(f"Failed to create DOCX (non-critical): {e}")
+
+            pdf_path = None
+            try:
+                # Sanitize LaTeX before PDF generation
+                self._sanitize_latex(markdown_path)
+                pdf_path = self._create_pdf(markdown_path, base_name)
+            except Exception as e:
+                self.logger.error(f"Failed to create PDF (non-critical): {e}")
 
             # Upload to Notion if configured
             from services.notion_service import notion_service
@@ -123,7 +141,7 @@ Instrucciones:
                     # Crear página con el contenido completo del resumen
                     notion_metadata = {
                         "file_type": "Audio",  # O 'PDF' dependiendo del origen
-                        "pdf_path": pdf_path,
+                        "pdf_path": pdf_path if pdf_path else Path(""),
                         "add_status": False,  # No usar Status/Tipo (no existen en la DB)
                         "use_as_page": False,  # Usar como database, no página
                     }
@@ -149,9 +167,9 @@ Instrucciones:
 
             metadata = {
                 "markdown_path": str(markdown_path),
-                "docx_path": str(docx_path),
-                "pdf_path": str(pdf_path),
-                "docx_name": Path(docx_path).name,
+                "docx_path": str(docx_path) if docx_path else "",
+                "pdf_path": str(pdf_path) if pdf_path else "",
+                "docx_name": Path(docx_path).name if docx_path else "",
                 "summary": summary,
                 "filename": filename,
                 "notion_uploaded": notion_uploaded,
@@ -164,6 +182,53 @@ Instrucciones:
             self.logger.error(f"Document generation process failed: {e}")
             return False, "", {}
 
+    def _sanitize_latex(self, markdown_path: Path) -> None:
+        """Sanitize LaTeX syntax in Markdown file to prevent Pandoc errors"""
+        try:
+            content = markdown_path.read_text(encoding="utf-8")
+
+            # 1. Unescape escaped dollar signs which are common LLM errors for math
+            content = content.replace(r"\$", "$")
+
+            # 2. Fix common Cyrillic and Greek characters that sneak in via LLMs
+            replacements = {
+                "ч": "ch",
+                "в": "v",
+                "к": "k",
+                "м": "m",
+                "н": "n",
+                "т": "t",
+                "—": "-",
+                "–": "-",
+                "“": '"',
+                "”": '"',
+                "’": "'",
+                "Δ": "$\\Delta$",
+                "δ": "$\\delta$",
+                "Σ": "$\\Sigma$",
+                "σ": "$\\sigma$",
+                "π": "$\\pi$",
+                "Π": "$\\Pi$",
+                "α": "$\\alpha$",
+                "β": "$\\beta$",
+                "γ": "$\\gamma$",
+                "θ": "$\\theta$",
+                "λ": "$\\lambda$",
+                "μ": "$\\mu$",
+            }
+
+            # Be careful not to double-replace already correct LaTeX
+            for char, repl in replacements.items():
+                if char in content:
+                    # Check if it's already inside math mode would be complex,
+                    # but for now we assume raw unicode greek chars should become latex
+                    content = content.replace(char, repl)
+
+            markdown_path.write_text(content, encoding="utf-8")
+            self.logger.info(f"Sanitized LaTeX in {markdown_path}")
+        except Exception as e:
+            self.logger.warning(f"Failed to sanitize LaTeX: {e}")
+
     def _generate_filename(self, text: str, summary: str) -> str:
         """Generate intelligent filename"""
         try:
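The comment in `_sanitize_latex` notes that checking for math mode "would be complex", so a Greek letter already inside `$...$` also gets wrapped in a second pair of dollar signs. One possible way to tell the two cases apart is sketched below. It is not the repository's implementation: `GREEK`, `MATH_SPAN` and `sanitize` are hypothetical names, the mapping is a small subset, and the regex is deliberately simplistic (for example, it would mistake two currency amounts on one line for a math span).

```python
import re

# Hypothetical subset of the mapping in the diff, stored without the $ wrappers.
GREEK = {"Δ": r"\Delta", "σ": r"\sigma", "π": r"\pi", "α": r"\alpha", "μ": r"\mu"}

# Matches $$...$$ first, then single-line $...$ spans.
MATH_SPAN = re.compile(r"(\$\$.+?\$\$|\$[^$\n]+\$)", re.DOTALL)


def sanitize(content: str) -> str:
    """Replace raw Greek letters, wrapping them in $...$ only outside math spans."""
    parts = MATH_SPAN.split(content)
    out = []
    for i, part in enumerate(parts):
        inside_math = i % 2 == 1  # re.split puts the captured spans at odd indexes
        for char, macro in GREEK.items():
            # Inside math a braced macro is enough; outside, add the delimiters.
            replacement = "{" + macro + "}" if inside_math else "$" + macro + "$"
            part = part.replace(char, replacement)
        out.append(part)
    return "".join(out)


print(sanitize("La media μ y $σ^2$"))
```

Inside math spans the macro is wrapped in braces rather than followed by a space, because Pandoc only recognises inline math when the closing `$` has a non-space character immediately to its left.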
@@ -173,11 +238,10 @@ Summary: {summary}
 
 Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""
 
-            topics_text = (
-                self.ai_provider.sanitize_input(prompt)
-                if hasattr(self.ai_provider, "sanitize_input")
-                else summary[:100]
-            )
+            try:
+                topics_text = self.ai_provider.generate_text(prompt)
+            except Exception:
+                topics_text = summary[:100]
 
             # Simple topic extraction
             topics = re.findall(r"\b[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+\b", topics_text)[:3]
@@ -192,7 +256,7 @@ Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""
 
         except Exception as e:
             self.logger.error(f"Filename generation failed: {e}")
-            return base_name[: settings.MAX_FILENAME_BASE_LENGTH]
+            return "documento"
 
     def _create_markdown(self, summary: str, base_name: str) -> Path:
         """Create Markdown document"""
@@ -217,154 +281,72 @@ Return only the topics separated by hyphens, max 20 chars each, in Spanish:"""
 
         return output_path
 
-    def _create_docx(self, summary: str, base_name: str) -> Path:
-        """Create DOCX document with Markdown parsing (Legacy method ported)"""
-        try:
-            from docx import Document
-            from docx.shared import Inches
-        except ImportError:
-            raise FileProcessingError("python-docx not installed")
-
+    def _create_docx(self, markdown_path: Path, base_name: str) -> Path:
+        """Create DOCX document using pandoc"""
         output_dir = settings.LOCAL_DOCX
         output_dir.mkdir(parents=True, exist_ok=True)
 
         output_path = output_dir / f"{base_name}_unificado.docx"
 
-        doc = Document()
-        doc.add_heading(base_name.replace("_", " ").title(), 0)
+        self.logger.info(
+            f"Converting Markdown to DOCX: {markdown_path} -> {output_path}"
+        )
 
-        # Parse and render Markdown content line by line
-        lines = summary.splitlines()
-        current_paragraph = []
-
-        for line in lines:
-            line = line.strip()
-            if not line:
-                if current_paragraph:
-                    p = doc.add_paragraph(" ".join(current_paragraph))
-                    p.alignment = 3  # JUSTIFY alignment (WD_ALIGN_PARAGRAPH.JUSTIFY=3)
-                    current_paragraph = []
-                continue
-
-            if line.startswith("#"):
-                if current_paragraph:
-                    p = doc.add_paragraph(" ".join(current_paragraph))
-                    p.alignment = 3
-                    current_paragraph = []
-                # Process heading
-                level = len(line) - len(line.lstrip("#"))
-                heading_text = line.lstrip("#").strip()
-                if level <= 6:
-                    doc.add_heading(heading_text, level=level)
-                else:
-                    current_paragraph.append(heading_text)
-            elif line.startswith("-") or line.startswith("*") or line.startswith("•"):
-                if current_paragraph:
-                    p = doc.add_paragraph(" ".join(current_paragraph))
-                    p.alignment = 3
-                    current_paragraph = []
-                bullet_text = line.lstrip("-*• ").strip()
-                p = doc.add_paragraph(bullet_text, style="List Bullet")
-                # Remove bold markers from bullets if present
-                if "**" in bullet_text:
-                    # Basic cleanup for bullets
-                    pass
-            else:
-                # Clean up excessive bold markers in body text if user requested
-                clean_line = line.replace(
-                    "**", ""
-                )  # Removing asterisks as per user complaint "se abusa de los asteriscos"
-                current_paragraph.append(clean_line)
-
-        if current_paragraph:
-            p = doc.add_paragraph(" ".join(current_paragraph))
-            p.alignment = 3
-
-        doc.add_page_break()
-        doc.add_paragraph(f"*Generado por CBCFacil*")
-
-        doc.save(output_path)
-        return output_path
-
-    def _create_pdf(self, summary: str, base_name: str) -> Path:
-        """Create PDF document with Markdown parsing (Legacy method ported)"""
         try:
-            from reportlab.lib.pagesizes import letter
-            from reportlab.pdfgen import canvas
-            import textwrap
-        except ImportError:
-            raise FileProcessingError("reportlab not installed")
+            cmd = [
+                "pandoc",
+                str(markdown_path),
+                "-o",
+                str(output_path),
+                "--from=markdown",
+                "--to=docx",
+            ]
+
+            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
+
+            self.logger.info("DOCX generated successfully with pandoc")
+            return output_path
+
+        except subprocess.CalledProcessError as e:
+            self.logger.error(f"Pandoc DOCX conversion failed: {e.stderr}")
+            raise FileProcessingError(f"Failed to generate DOCX: {e.stderr}")
+        except Exception as e:
+            self.logger.error(f"Error generating DOCX: {e}")
+            raise FileProcessingError(f"Error generating DOCX: {e}")
 
+    def _create_pdf(self, markdown_path: Path, base_name: str) -> Path:
+        """Create PDF document using pandoc and pdflatex"""
         output_dir = settings.LOCAL_DOWNLOADS_PATH
         output_dir.mkdir(parents=True, exist_ok=True)
 
         output_path = output_dir / f"{base_name}_unificado.pdf"
 
-        c = canvas.Canvas(str(output_path), pagesize=letter)
-        width, height = letter
-        margin = 72
-        y_position = height - margin
+        self.logger.info(
+            f"Converting Markdown to PDF: {markdown_path} -> {output_path}"
+        )
 
-        def new_page():
-            nonlocal y_position
-            c.showPage()
-            c.setFont("Helvetica", 11)
-            y_position = height - margin
+        try:
+            cmd = [
+                "pandoc",
+                str(markdown_path),
+                "-o",
+                str(output_path),
+                "--pdf-engine=pdflatex",
+                "-V",
+                "geometry:margin=2.5cm",
+                "-V",
+                "fontsize=12pt",
+                "--highlight-style=tango",
+            ]
 
-        c.setFont("Helvetica", 11)
+            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
 
-        # Title
-        c.setFont("Helvetica-Bold", 16)
-        c.drawString(margin, y_position, base_name.replace("_", " ").title()[:100])
-        y_position -= 28
-        c.setFont("Helvetica", 11)
+            self.logger.info("PDF generated successfully with pandoc")
+            return output_path
 
-        summary_clean = summary.replace(
-            "**", ""
-        )  # Remove asterisks globally for cleaner PDF
-
-        for raw_line in summary_clean.splitlines():
-            line = raw_line.rstrip()
-
-            if not line.strip():
-                y_position -= 14
-                if y_position < margin:
-                    new_page()
-                continue
-
-            stripped = line.lstrip()
-
-            if stripped.startswith("#"):
-                level = len(stripped) - len(stripped.lstrip("#"))
-                heading_text = stripped.lstrip("#").strip()
-                if heading_text:
-                    font_size = 16 if level == 1 else 14 if level == 2 else 12
-                    c.setFont("Helvetica-Bold", font_size)
-                    c.drawString(margin, y_position, heading_text[:90])
-                    y_position -= font_size + 6
-                    if y_position < margin:
-                        new_page()
-                    c.setFont("Helvetica", 11)
-                continue
-
-            if stripped.startswith(("-", "*", "•")):
-                bullet_text = stripped.lstrip("-*•").strip()
-                wrapped_lines = textwrap.wrap(bullet_text, width=80) or [""]
-                for idx, wrapped in enumerate(wrapped_lines):
-                    prefix = "• " if idx == 0 else " "
-                    c.drawString(margin, y_position, f"{prefix}{wrapped}")
-                    y_position -= 14
-                    if y_position < margin:
-                        new_page()
-                continue
-
-            # Body text - Justified approximation (ReportLab native justification requires Paragraph styles, defaulting to wrap)
-            wrapped_lines = textwrap.wrap(stripped, width=90) or [""]
-            for wrapped in wrapped_lines:
-                c.drawString(margin, y_position, wrapped)
-                y_position -= 14
-                if y_position < margin:
-                    new_page()
-
-        c.save()
-        return output_path
+        except subprocess.CalledProcessError as e:
+            self.logger.error(f"Pandoc PDF conversion failed: {e.stderr}")
+            raise FileProcessingError(f"Failed to generate PDF: {e.stderr}")
+        except Exception as e:
+            self.logger.error(f"Error generating PDF: {e}")
+            raise FileProcessingError(f"Error generating PDF: {e}")
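Both new methods shell out to `pandoc`, and the PDF path also depends on `pdflatex` being installed. If either binary is missing, `subprocess.run` raises `FileNotFoundError`, which the generic `except Exception` branch reports. A rough sketch of checking for the tools up front follows. The helper names `build_pandoc_pdf_cmd` and `missing_tools` and the file names are hypothetical; only the pandoc arguments are taken from the diff, and making the engine a parameter is an assumption (an engine such as `xelatex` is sometimes preferred for Unicode-heavy input).

```python
import shutil
from pathlib import Path
from typing import List


def build_pandoc_pdf_cmd(
    markdown_path: Path, output_path: Path, engine: str = "pdflatex"
) -> List[str]:
    """Assemble pandoc arguments like those in the diff, with the engine as a parameter."""
    return [
        "pandoc",
        str(markdown_path),
        "-o",
        str(output_path),
        f"--pdf-engine={engine}",
        "-V",
        "geometry:margin=2.5cm",
        "-V",
        "fontsize=12pt",
    ]


def missing_tools(*names: str) -> List[str]:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


cmd = build_pandoc_pdf_cmd(Path("resumen.md"), Path("resumen.pdf"))
missing = missing_tools("pandoc", "pdflatex")
if missing:
    print(f"Cannot build PDF, missing: {', '.join(missing)}")
else:
    print("Would run:", " ".join(cmd))
```

Checking once at startup gives a clearer error than a failed conversion per document, at the cost of one extra PATH lookup.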