Fix training log race conditions and audit event error

Critical fixes for training session logging:

1. Training log race condition fix:
   - Add explicit session commits after creating training logs
   - Handle duplicate key errors gracefully when multiple sessions
     try to create the same log simultaneously
   - Implement retry logic to query for existing logs after
     duplicate key violations
   - Prevents "Training log not found" errors during training

2. Audit event async generator error fix:
   - Replace incorrect next(get_db()) usage with proper
     async context manager (database_manager.get_session())
   - Fixes "'async_generator' object is not an iterator" error
   - Restores audit logging for training job creation, which
     previously fell through to the "Failed to log audit event"
     warning

These changes address race conditions in concurrent database
sessions and ensure training logs are properly synchronized
across the training pipeline.
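
The create-then-recover pattern from item 1 can be sketched as follows. This is a minimal illustration and not the code from this commit: it uses the standard-library sqlite3 module in place of the project's async sessions, and the training_logs schema and the get_or_create_training_log helper are hypothetical names chosen for the example.

```python
import sqlite3
from typing import Tuple


def get_or_create_training_log(conn: sqlite3.Connection, job_id: str) -> Tuple[int, bool]:
    """Return (row id, created) for job_id, tolerating a competing insert.

    Hypothetical helper: the real table and column names are not shown
    in the commit.
    """
    try:
        cur = conn.execute(
            "INSERT INTO training_logs (job_id, status) VALUES (?, 'pending')",
            (job_id,),
        )
        # Commit explicitly so other sessions can see the new row.
        conn.commit()
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Duplicate key: another session created the log first.
        # Roll back the failed insert and read the existing row instead.
        conn.rollback()
        row = conn.execute(
            "SELECT id FROM training_logs WHERE job_id = ?", (job_id,)
        ).fetchone()
        if row is None:
            raise  # the violation was not caused by an existing log
        return row[0], False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE training_logs ("
    "id INTEGER PRIMARY KEY, job_id TEXT UNIQUE NOT NULL, status TEXT NOT NULL)"
)
first = get_or_create_training_log(conn, "job-1")   # inserts the row
second = get_or_create_training_log(conn, "job-1")  # finds the existing row
```

The second call hits the unique constraint and returns the row created by the first call, so the caller gets a usable log either way instead of an error.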
Claude
2025-11-05 13:24:22 +00:00
parent 15025fdf1d
commit 8df90338b2
2 changed files with 71 additions and 35 deletions

@@ -236,27 +236,27 @@ async def start_training_job(
     # Log audit event for training job creation
     try:
-        from app.core.database import get_db
-        db = next(get_db())
-        await audit_logger.log_event(
-            db_session=db,
-            tenant_id=tenant_id,
-            user_id=current_user["user_id"],
-            action=AuditAction.CREATE.value,
-            resource_type="training_job",
-            resource_id=job_id,
-            severity=AuditSeverity.MEDIUM.value,
-            description=f"Started training job (tier: {tier})",
-            metadata={
-                "job_id": job_id,
-                "tier": tier,
-                "estimated_dataset_size": estimated_dataset_size,
-                "quota_usage": quota_result.get('current', 0) if quota_result else 0,
-                "quota_limit": quota_limit if quota_limit else "unlimited"
-            },
-            endpoint="/jobs",
-            method="POST"
-        )
+        from app.core.database import database_manager
+        async with database_manager.get_session() as db:
+            await audit_logger.log_event(
+                db_session=db,
+                tenant_id=tenant_id,
+                user_id=current_user["user_id"],
+                action=AuditAction.CREATE.value,
+                resource_type="training_job",
+                resource_id=job_id,
+                severity=AuditSeverity.MEDIUM.value,
+                description=f"Started training job (tier: {tier})",
+                metadata={
+                    "job_id": job_id,
+                    "tier": tier,
+                    "estimated_dataset_size": estimated_dataset_size,
+                    "quota_usage": quota_result.get('current', 0) if quota_result else 0,
+                    "quota_limit": quota_limit if quota_limit else "unlimited"
+                },
+                endpoint="/jobs",
+                method="POST"
+            )
     except Exception as audit_error:
         logger.warning("Failed to log audit event", error=str(audit_error))
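
The failure this hunk removes can be reproduced in isolation. The sketch below assumes get_db is an async generator function, which is what the error message in the commit description implies. Both get_db and get_session here are hypothetical stand-ins, not the project's implementations.

```python
import asyncio
from contextlib import asynccontextmanager


async def get_db():
    """Hypothetical stand-in for an async-generator dependency."""
    yield "session"


@asynccontextmanager
async def get_session():
    """Hypothetical stand-in for database_manager.get_session()."""
    session = {"open": True}
    try:
        yield session
    finally:
        # Cleanup runs when the `async with` block exits.
        session["open"] = False


def call_next_on_async_generator() -> str:
    """Show why the old code failed: next() needs a synchronous iterator."""
    gen = get_db()
    try:
        next(gen)
    except TypeError as exc:
        return str(exc)
    return "no error"


async def use_context_manager() -> dict:
    """Show the replacement pattern: enter and exit via `async with`."""
    async with get_session() as db:
        assert db["open"]  # the session is usable inside the block
    return db


error_message = call_next_on_async_generator()
# error_message == "'async_generator' object is not an iterator"
closed_session = asyncio.run(use_context_manager())
```

Calling next() on an async generator raises TypeError before any session is produced, which is why the old audit call always fell into the except branch. The context-manager form yields the session and also guarantees the cleanup step runs.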