DOCX Import Service API
Baseline Endpoints
- POST /v1/import/docx/qas
- POST /v1/import/docx/qas/stream
- POST /v1/import/docx/qas/jobs
- POST /v1/import/docx/docx-fast-jobs
- GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft
- POST /v1/import/docx/docx-fast-jobs/{id}/reprocess
- GET /v1/import/docx/teacher-library
- GET /v1/import/docx/jobs
- GET /v1/import/docx/jobs/{id}
- GET /v1/import/docx/jobs/{id}/status
- GET /v1/import/docx/jobs/{id}/events
- GET /v1/import/docx/jobs/events
- PATCH /v1/import/docx/jobs/{id}/review
- POST /v1/import/docx/jobs/{id}/approve
- GET /healthz
- GET /readyz
Current Behavior
The service validates multipart field file, requires .docx, checks the ZIP-based Office magic header, enforces a 50MB upload limit, and returns latency_ms.
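The upload checks above can be sketched as a small Go function. This is an illustration only, not the service's code: the helper name `validateUpload`, the error values, and the reading of "50MB" as 50 MiB are assumptions. The ZIP local-file-header signature `PK\x03\x04` is the standard magic for a non-empty ZIP archive, which is what a DOCX package is.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// maxUploadBytes mirrors the documented 50MB limit (assumed here to mean 50 MiB).
const maxUploadBytes = 50 << 20

// zipMagic is the local-file-header signature that starts a non-empty ZIP archive.
var zipMagic = []byte{'P', 'K', 0x03, 0x04}

// Hypothetical error values used only by this sketch.
var (
	errNotDocx  = errors.New("file must have a .docx extension")
	errTooLarge = errors.New("file exceeds the upload limit")
	errNotZip   = errors.New("file does not start with the ZIP magic header")
)

// validateUpload checks the file name, the declared size, and the leading bytes.
func validateUpload(filename string, size int64, head []byte) error {
	if !strings.EqualFold(filepath.Ext(filename), ".docx") {
		return errNotDocx
	}
	if size > maxUploadBytes {
		return errTooLarge
	}
	if !bytes.HasPrefix(head, zipMagic) {
		return errNotZip
	}
	return nil
}

func main() {
	fmt.Println(validateUpload("exam.docx", 1024, []byte("PK\x03\x04rest")))
	fmt.Println(validateUpload("exam.pdf", 1024, []byte("%PDF-1.7")))
}
```

Checking the extension before reading bytes keeps the cheap rejection first; the order of the three checks in the real service is not stated in this document.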
Parser mode:
- If GO_FORMULA_DOCX_URL or GO_FORMULA_DOCX_BASE_URL is set, the service streams the upload to {baseUrl}/v1/import/docx/simple with legacy-compatible fields include_html=1, include_meta=1, and include_solution=1.
- Local Compose, raw offline K8s, and Helm now set GO_FORMULA_DOCX_URL to http://go-formula-docx:8080 by default and run the existing parser binary as an internal go-formula-docx runtime. The legacy parser source stays read-only.
- If no Go Formula URL is configured, the baseline parser returns a warning-only response with PARSER_ENGINE_NOT_CONNECTED instead of silently dropping content.
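The parser-mode selection can be sketched as follows. The function name `resolveParserURL` is hypothetical, and the precedence of GO_FORMULA_DOCX_URL over GO_FORMULA_DOCX_BASE_URL is an assumption; the document only says that either variable enables the upstream parser.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyFields are the legacy-compatible form fields sent with the upload.
var legacyFields = map[string]string{
	"include_html":     "1",
	"include_meta":     "1",
	"include_solution": "1",
}

// resolveParserURL returns the upstream parse URL, or ok=false when no
// Go Formula URL is configured. Variable precedence is an assumption.
func resolveParserURL(getenv func(string) string) (string, bool) {
	for _, key := range []string{"GO_FORMULA_DOCX_URL", "GO_FORMULA_DOCX_BASE_URL"} {
		if base := strings.TrimSpace(getenv(key)); base != "" {
			return strings.TrimRight(base, "/") + "/v1/import/docx/simple", true
		}
	}
	return "", false
}

func main() {
	env := map[string]string{"GO_FORMULA_DOCX_URL": "http://go-formula-docx:8080"}
	target, ok := resolveParserURL(func(k string) string { return env[k] })
	fmt.Println(target, ok, len(legacyFields))

	// With nothing configured, the caller would fall back to the warning-only response.
	if _, ok := resolveParserURL(func(string) string { return "" }); !ok {
		fmt.Println("PARSER_ENGINE_NOT_CONNECTED")
	}
}
```

Passing `getenv` as a function keeps the sketch testable without touching the process environment; in a real binary `os.Getenv` would fit the same signature.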
Media materialization:
- If DOCUMENT_SERVICE_URL or DOCUMENT_SERVICE_BASE_URL is set, the service extracts referenced DOCX package images, uploads bytes through document-service /v1/storage/presigned-upload, creates metadata through /v1/storage/media-assets, and rewrites media references to /api/storage/media-assets/{id}/content.
- The materializer requires X-Organization-Id for upload ownership and passes X-User-Id as the owner when available.
- If a referenced image cannot be extracted, uploaded, or registered, the original media reference is preserved and the response includes GO_FORMULA_DOCX_IMAGE_MATERIALIZE_FAILED; the service must not silently drop the image.
- stats.total_images uses Go Formula renderableImageCount when available so WMF/OLE formula artifacts are not counted as normal renderable images; the raw media references remain available for review warnings/materialization.
Required Parser Follow-Up
The Go Formula adapter now maps direct service output into the QAS response, can materialize extractable package images into document-service, applies payload-level DOCX style answer hints, can build annotated fallback questions when upstream returns zero questions, and has native OOXML passes for template-gated DOCX files. Physics 28Q, Math 22Q, English 40Q, and DGNL 102Q corpus QAS shapes are covered by an opt-in runtime integration test.
Classifier baseline:
- internal/classifier provides a deterministic question-type fallback for obvious shapes: explicit type hints, true/false sub-items, multiple correct answers, single-choice options, short numeric answers, essay prompts, and media/table/graph review cases.
- This is not a replacement for ai-classifier-service; low-confidence or media/table/graph classifications return warnings so review can keep the source evidence visible.
Fallback and answer-hint parity:
- If the Go Formula payload includes styleAnswerHints or docxStyleAnswerHints, the adapter applies them to QAS correct_answer and records answer_evidence=DOCX_STYLE_HINT in source_location.
- If Go Formula returns zero questions but includes annotatedResult.segments or annotatedSegments, the adapter builds warning-safe QAS questions from those annotated segments and emits GO_FORMULA_DOCX_ANNOTATED_FALLBACK.
- The native OOXML style-hint pass reads word/document.xml from the uploaded DOCX, detects red/underlined answer runs for Math/English templates and yellow-highlighted answer runs for DGNL templates, then applies hints with DOCX_NATIVE_STYLE_ANSWER_HINT_APPLIED.
- If the DGNL 102Q runtime payload returns zero QAS questions, the native fallback builds DGNL_SINGLE_CHOICE questions from DOCX A-D option groups, answer-only -> rows, and review-required unlabeled choice rows, then emits DOCX_NATIVE_DGNL_FALLBACK.
- make test-docx-corpus runs the opt-in Physics/Math/English/DGNL QAS parity checks when GO_FORMULA_DOCX_URL and HOCTAPAZ_DOCX_CORPUS_DIR point to the local runtime/corpus.
- make test-docx-materialization runs an opt-in Physics 28Q runtime check through live docx-import-service plus document-service, asserts 7 materialized MediaAsset references, and fetches each asset back through /v1/storage/media-assets/{id}/content.
- make test-docx-warning-capture plus make test-docx-warning-parity compares captured native Physics/Math/English/DGNL response warnings against legacy output/import-audit-20260702/api artifacts. The filtered comparator passes with DOCX_WARNING_PARITY_ALLOW_EXTRA_NATIVE=1, because native emits additional review-safety warnings rather than claiming exact no-extra warning counts.
Job Model
The job endpoints use a repository interface:
- If DATABASE_URL is configured and reachable, the service uses the pgx-backed docx_import_jobs and docx_import_events tables from services/docx-import-service/migrations/000002_import_jobs.sql, plus payload_ref from services/docx-import-service/migrations/000003_import_payload_spool.sql.
- If no database is configured, the service uses an in-memory repository so local skeleton runs remain possible.
Job status values:
- PENDING
- PROCESSING
- COMPLETED
- FAILED
Events currently emitted:
- job.queued
- job.started
- job.completed
- job.failed
- approval.completed
- auto_approval.completed
- auto_approval.failed
POST /v1/import/docx/qas/jobs now writes the uploaded DOCX to the service-local payload spool, stores only an internal payload_ref on the PENDING job, and returns quickly. A bounded in-process worker queue then moves the job to PROCESSING, opens the payload reference, parses it in the background, and deletes the spool file after COMPLETED or FAILED. DOCX_IMPORT_PAYLOAD_DIR sets the spool directory; if unset, the service uses a temp-directory fallback. DOCX_IMPORT_WORKER_CONCURRENCY controls the native worker count; if unset, the service falls back to ALGORITHM_IMPORT_WORKER_CONCURRENCY and then default 2.
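The worker-count fallback chain described above can be sketched like this. The function name `workerConcurrency` is hypothetical, and treating an unparsable or non-positive value the same as an unset one is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// workerConcurrency resolves the native worker count: the DOCX-specific
// variable first, then the algorithm-import variable, then the default of 2.
// Invalid or non-positive values are skipped (an assumption of this sketch).
func workerConcurrency(getenv func(string) string) int {
	keys := []string{
		"DOCX_IMPORT_WORKER_CONCURRENCY",
		"ALGORITHM_IMPORT_WORKER_CONCURRENCY",
	}
	for _, key := range keys {
		if n, err := strconv.Atoi(getenv(key)); err == nil && n > 0 {
			return n
		}
	}
	return 2
}

func main() {
	env := map[string]string{"ALGORITHM_IMPORT_WORKER_CONCURRENCY": "4"}
	fmt.Println(workerConcurrency(func(k string) string { return env[k] })) // prints 4
	fmt.Println(workerConcurrency(func(string) string { return "" }))       // prints 2
}
```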
On startup, the service scans PENDING and PROCESSING jobs with payload refs and re-enqueues them. If the payload reference is missing, it marks the job FAILED with DOCX_IMPORT_PAYLOAD_MISSING instead of leaving it stuck.
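A minimal sketch of the startup recovery pass follows. The `job` struct, the `recoverJobs` name, and the `payloadExists` callback are hypothetical; the sketch also assumes that "missing" covers both an empty payload reference and a spool file that no longer exists.

```go
package main

import "fmt"

// job is a hypothetical, trimmed-down view of an import job row.
type job struct {
	ID         string
	Status     string
	PayloadRef string
	ErrorCode  string
}

// recoverJobs returns the ids to re-enqueue and marks unrecoverable jobs
// FAILED in place instead of leaving them stuck.
func recoverJobs(jobs []job, payloadExists func(ref string) bool) []string {
	var requeued []string
	for i := range jobs {
		j := &jobs[i]
		if j.Status != "PENDING" && j.Status != "PROCESSING" {
			continue // finished jobs are left untouched
		}
		if j.PayloadRef == "" || !payloadExists(j.PayloadRef) {
			j.Status = "FAILED"
			j.ErrorCode = "DOCX_IMPORT_PAYLOAD_MISSING"
			continue
		}
		requeued = append(requeued, j.ID)
	}
	return requeued
}

func main() {
	jobs := []job{
		{ID: "a", Status: "PENDING", PayloadRef: "spool/a"},
		{ID: "b", Status: "PROCESSING", PayloadRef: "spool/gone"},
		{ID: "c", Status: "COMPLETED"},
	}
	requeued := recoverJobs(jobs, func(ref string) bool { return ref == "spool/a" })
	fmt.Println(requeued, jobs[1].Status, jobs[1].ErrorCode)
}
```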
POST /v1/import/docx/docx-fast-jobs is the legacy-compatible JSON create adapter for POST /api/exam-import/docx-fast-jobs. It accepts questionStorageKey, questionFileName/fileName, title, and autoApproveToQuestionBank, plus optional subjectId and gradeLevel metadata. When Idempotency-Key is present, the service trims it, caps it at 120 characters, and returns the recent matching native job within 15 minutes for the same organization, actor, and questionStorageKey before fetching storage or enqueueing again. Otherwise it fetches the uploaded DOCX bytes from document-service through GET /v1/storage/objects/content, validates the DOCX extension and ZIP magic, and then reuses the native queue/payload spool path. When autoApproveToQuestionBank is true, a successfully parsed native DOCX Fast job triggers asynchronous QUESTION_BANK approval and records either auto_approval.completed or auto_approval.failed without changing the normal job completion status. The response is a legacy success envelope with a camelCase job summary so the frontend api<T>() helper can continue to unwrap data.
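The Idempotency-Key handling can be sketched as below. The `jobRecord` struct and both function names are hypothetical, and counting the 120-character cap in runes rather than bytes is an assumption. The fixed `exampleStart` timestamp exists only to make the example deterministic.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const (
	maxKeyLen      = 120
	idempotentSpan = 15 * time.Minute
)

// exampleStart is an arbitrary fixed instant used only by this example.
var exampleStart = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

// normalizeKey trims the header value and caps it at 120 characters.
func normalizeKey(raw string) string {
	runes := []rune(strings.TrimSpace(raw))
	if len(runes) > maxKeyLen {
		runes = runes[:maxKeyLen]
	}
	return string(runes)
}

// jobRecord is a hypothetical view of the fields used for matching.
type jobRecord struct {
	ID, OrgID, ActorID, StorageKey, IdemKey string
	CreatedAt                               time.Time
}

// findRecent returns a job created within the window for the same
// organization, actor, storage key, and idempotency key.
func findRecent(jobs []jobRecord, want jobRecord, now time.Time) (jobRecord, bool) {
	if want.IdemKey == "" {
		return jobRecord{}, false
	}
	for _, j := range jobs {
		if j.IdemKey == want.IdemKey && j.OrgID == want.OrgID &&
			j.ActorID == want.ActorID && j.StorageKey == want.StorageKey &&
			now.Sub(j.CreatedAt) <= idempotentSpan {
			return j, true
		}
	}
	return jobRecord{}, false
}

func main() {
	stored := []jobRecord{{ID: "job-1", OrgID: "org", ActorID: "u1",
		StorageKey: "uploads/exam.docx", IdemKey: "abc", CreatedAt: exampleStart}}
	want := jobRecord{OrgID: "org", ActorID: "u1",
		StorageKey: "uploads/exam.docx", IdemKey: normalizeKey("  abc  ")}
	hit, ok := findRecent(stored, want, exampleStart.Add(5*time.Minute))
	fmt.Println(hit.ID, ok)
}
```

Doing the lookup before any storage fetch matches the order described above: a repeated request returns the existing job without re-reading the DOCX or enqueueing again.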
GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft is the legacy-compatible materialized temp-draft read branch. For completed native DOCX Fast jobs with a stored parse result, it returns token, jobId, expiresAt, parseResult, and an empty assets array with message DOCX Fast materialized draft. If a reviewed parse result has been saved through PATCH /v1/import/docx/jobs/{id}/review, the temp-draft read returns the reviewed payload so manual question-type overrides survive reload.
PATCH /v1/import/docx/jobs/{id}/review is the native backend foundation for legacy review save. It accepts parseResult plus optional title and sourceText, stores the reviewed parse result JSON verbatim in the service-owned job row, and returns a legacy success envelope with parseResultJson. When sourceText is provided together with an object-shaped reviewed parseResult, compatibility responses overlay it as parseResultJson.sourceText so the source editor can reload the same text after save while the raw reviewed JSON remains unchanged in storage. This preserves frontend metadata such as sourceMetadataJson.questionTypeManualOverride, sourceMetadataJson.questionTypeReviewed, and sourceMetadataJson.questionTypeAutoDetected=false without coercing the payload through the QAS struct. The endpoint records review.updated events. When sourceText is provided without parseResult, the native endpoint runs a baseline source-text reparse for simple Câu n questions with A/B/C/D options, * answer markers, Đáp án:, and Lời giải:. The generated parseResultJson includes a warning that complex legacy parser parity is still pending. When docxFastTempDraftToken is provided, the native endpoint accepts only the materialized token shape returned by native temp-draft reads: materialized-{jobId}. That token is treated as an idempotent acknowledgement that the native DOCX Fast parse payload is already materialized in the import job; arbitrary legacy temp-draft tokens and client-side temp assets remain on legacy routes until separate parity work.
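Two of the rules above lend themselves to a short sketch: the sourceText overlay that leaves the stored JSON untouched, and the materialized-{jobId} token check. Both function names are hypothetical. The overlay uses the standard encoding/json package and re-marshals a map, so the response may order keys differently from the stored bytes; that is a property of this sketch, not a documented behavior of the service.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// overlaySourceText returns a response copy of the stored parse result with
// sourceText added. Non-object payloads are returned as-is, and the stored
// bytes are never modified.
func overlaySourceText(stored json.RawMessage, sourceText string) (json.RawMessage, error) {
	if sourceText == "" {
		return stored, nil
	}
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(stored, &obj); err != nil || obj == nil {
		return stored, nil // not object-shaped: no overlay
	}
	encoded, err := json.Marshal(sourceText)
	if err != nil {
		return nil, err
	}
	obj["sourceText"] = encoded
	return json.Marshal(obj)
}

// isMaterializedToken accepts only the token shape returned by native
// temp-draft reads for this job.
func isMaterializedToken(token, jobID string) bool {
	return jobID != "" && token == "materialized-"+jobID
}

func main() {
	stored := json.RawMessage(`{"sourceMetadataJson":{"questionTypeManualOverride":true}}`)
	out, err := overlaySourceText(stored, "Câu 1. ...")
	fmt.Println(string(out), err)
	fmt.Println(string(stored)) // unchanged
	fmt.Println(isMaterializedToken("materialized-42", "42"))
}
```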
GET /v1/import/docx/jobs/{id} returns the legacy-compatible detail reload shape with camelCase fields such as questionFileName, fileName, parseStatus, matchStatus, reviewStatus, packagingJson, and parseResultJson. If a reviewed result has been saved, parseResultJson returns the reviewed payload instead of the original parser output. Completed native DOCX Fast jobs include packagingJson.docxFastTempDraft with status MATERIALIZED, token materialized-{id}, and assets: [], matching what temp-draft reads return; this lets the editor skip legacy client temp-draft materialization for native materialized jobs.
GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft/assets/{tempAssetId}/content is the materialized asset content adapter. It does not create a native non-materialized temp-draft store. Instead, it accepts only completed native DOCX Fast jobs, verifies that tempAssetId matches a materialized media reference in the stored parse result (media_asset_id, media id, go_asset_id, or the /api/storage/media-assets/{id}/content URL), then streams the permanent media asset through document-service /v1/storage/media-assets/{id}/content with the legacy private cache header.
POST /v1/import/docx/docx-fast-jobs/{id}/reprocess is the legacy-compatible DOCX Fast reprocess adapter. It re-reads the source DOCX from the job's stored questionStorageKey through document-service, saves a fresh payload under the same native job id, clears previous parse output, requeues the native worker, and returns message DOCX Fast import reprocessed. The foundation rejects jobs that are already active or waiting in the native queue.
ImportJobService publishes these lifecycle events to a service-owned event bus. Without REDIS_URL, the bus is in-process only. When REDIS_URL is configured and reachable, the service wraps the in-process bus with Redis pub/sub on DOCX_IMPORT_EVENTS_CHANNEL (default docx-import:jobs) so subscribers on one service process can receive job.updated events from another process. Redis payloads redact the heavy parsed result body because the SSE summary does not need it. Uploaded DOCX bytes and payload refs are not published to Redis, SSE, or logs.
Legacy-Compatible Status And Events
Phase P3-005 maps legacy import status/history surfaces without cutting over the default public route table:
- Legacy /api/exam-import/teacher-library maps to native GET /v1/import/docx/teacher-library.
- Legacy POST /api/exam-import/docx-fast-jobs maps to native POST /v1/import/docx/docx-fast-jobs in the non-default gateway create-job example.
- Legacy GET /api/exam-import/docx-fast-jobs/{id}/temp-draft maps to native GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft only for the materialized read branch in the same non-default gateway example.
- make test-import-temp-draft-live is the read-only live smoke for the non-default DOCX Fast materialized temp-draft route. It requires a completed native DOCX Fast job with stored parse output, auth/org context, and a gateway running with the native import-create route table.
- Legacy /api/exam-import/docx-fast-jobs/{id}/temp-draft/assets/{tempAssetId}/content maps to native GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft/assets/{tempAssetId}/content only for materialized media already referenced by the completed native job.
- make test-import-temp-asset-live is the read-only live smoke for the non-default DOCX Fast materialized temp asset content route. It requires a completed native DOCX Fast job, a materialized temp asset id present in the stored parse result, auth/org context, and a gateway running with the native import-create route table.
- Legacy POST /api/exam-import/docx-fast-jobs/{id}/reprocess maps to native POST /v1/import/docx/docx-fast-jobs/{id}/reprocess in the same non-default gateway example.
- make test-import-reprocess-live is the opt-in write smoke for the non-default DOCX Fast reprocess route. It requires explicit confirmation, a native DOCX Fast job with a stored source DOCX, auth/org context, and a gateway running with the native import-create route table.
- make test-import-create-live is the opt-in write smoke for the non-default DOCX Fast create route. It requires explicit confirmation, a real uploaded DOCX questionStorageKey, auth/org context, and a gateway running with the native import-create route table.
- make test-import-create-browser is the opt-in Playwright smoke for the real import surface. It can either upload a caller-supplied DOCX through the UI file input or use an existing questionStorageKey, then verifies the browser-observed gateway route headers and legacy success envelope.
- Legacy PATCH /api/exam-import/jobs/{id}/review maps to native PATCH /v1/import/docx/jobs/{id}/review only in the non-default review-save gateway example; all sibling job actions remain legacy.
- Legacy GET /api/exam-import/jobs/{id} maps to native GET /v1/import/docx/jobs/{id} only in the non-default detail reload gateway example; deeper sibling paths such as /status and /review remain legacy unless separately routed.
- Legacy /api/exam-import/jobs/{id}/status and /api/imports/{id}/status map to native GET /v1/import/docx/jobs/{id}/status.
- Legacy /api/exam-import/algorithm-jobs/events maps to native GET /v1/import/docx/jobs/events.
The compatibility list and SSE snapshot use the existing frontend summary fields: id, jobKind, source, title, questionFileName, fileName, status, fileStatus, extractionStatus, parseStatus, matchStatus, reviewStatus, progress, optional queue, errorSummary, and timestamps. The queue object is attached only while native jobs are PENDING or PROCESSING, with legacy-shaped fields such as status, waiting, active, totalPending, waitingPosition, jobsAhead, and workerConcurrency. The status endpoint returns the legacy status counters: status, totalQuestions, parsedQuestions, pendingEquations, pendingImages, warnings, and errors.
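The rule that the queue object appears only for PENDING or PROCESSING jobs maps naturally onto an optional JSON field. The sketch below is illustrative: the JSON field names come from the list above, but the Go struct names, the field types, and the trimmed-down summary are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queueSnapshot carries the legacy-shaped queue fields (types assumed).
type queueSnapshot struct {
	Status            string `json:"status"`
	Waiting           int    `json:"waiting"`
	Active            int    `json:"active"`
	TotalPending      int    `json:"totalPending"`
	WaitingPosition   int    `json:"waitingPosition"`
	JobsAhead         int    `json:"jobsAhead"`
	WorkerConcurrency int    `json:"workerConcurrency"`
}

// jobSummary is a trimmed-down summary; the real one has more fields.
type jobSummary struct {
	ID     string         `json:"id"`
	Status string         `json:"status"`
	Queue  *queueSnapshot `json:"queue,omitempty"`
}

// withQueue attaches the snapshot only while the job is still queued or running.
func withQueue(s jobSummary, q queueSnapshot) jobSummary {
	if s.Status == "PENDING" || s.Status == "PROCESSING" {
		s.Queue = &q
	} else {
		s.Queue = nil
	}
	return s
}

func main() {
	q := queueSnapshot{Status: "waiting", Waiting: 3, TotalPending: 3,
		WaitingPosition: 2, JobsAhead: 1, WorkerConcurrency: 2}
	for _, status := range []string{"PENDING", "COMPLETED"} {
		body, _ := json.Marshal(withQueue(jobSummary{ID: "job-1", Status: status}, q))
		fmt.Println(string(body))
	}
}
```

Using a pointer with `omitempty` means completed jobs serialize without a queue key at all, rather than with an empty object.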
This is a read/status compatibility layer over native docx_import_jobs and docx_import_events. The SSE route now replays stored jobs, keeps the stream open, sends periodic heartbeat events, and forwards live job.updated events for job start/completion/failure/approval changes from the current process or Redis fanout. Browser cutover remains deferred until the gateway auth/header adapter, worker topology, and browser verification are ready. The payload spool is already wired to a named Compose volume and a docx-import-payloads PersistentVolumeClaim in offline K8s/Helm. A read-only BullMQ snapshot bridge can be enabled with DOCX_IMPORT_BULLMQ_BRIDGE_ENABLED=1 to report legacy exam-import-algorithm queue depth/position when a native job is not present in the in-process queue; it does not enqueue, remove, retry, or claim legacy jobs. Full shared worker execution with the legacy BullMQ queue remains deferred. make test-docx-bullmq-status is an opt-in live smoke for this bridge path; it checks the non-default status route table and the teacher-library row queue shape for a caller-supplied pending/processing job id. make test-docx-import-library-browser adds browser-route evidence for the real /teacher/exams/library page and can optionally search for the same job id to assert the queue snapshot in the browser-observed response.
Native Approval Boundary
POST /v1/import/docx/jobs/{id}/approve approves a completed job result into native services. It is an internal /v1 contract. A non-default API Gateway route-table example can adapt legacy-compatible POST /api/exam-import/jobs/{id}/approve to this endpoint, but the default route table keeps public /api/exam-import/* on legacy until gateway auth/RBAC and browser review parity are verified.
Question-bank approval:
```json
{
  "target": "QUESTION_BANK",
  "questionMetadata": {
    "subjectId": "math",
    "gradeLevel": 12,
    "tags": ["approved"]
  }
}
```

The service calls question-bank-service POST /v1/questions/import-docx-output, forwarding X-Organization-Id and X-User-Id, and adds sourceImportJobId, sourceFileName, and approval metadata.
Draft exam approval:
```json
{
  "target": "EXAM_DRAFT",
  "questionMetadata": {
    "subjectId": "math",
    "gradeLevel": 12
  },
  "exam": {
    "title": "Approved draft",
    "subjectId": "math",
    "gradeLevel": 12,
    "durationMinutes": 45
  }
}
```

For EXAM_DRAFT, the request must include either examId or an exam create payload. The service creates the draft when needed and then calls PUT /v1/exams/{examId}/question-snapshots with snapshots built from the QAS output and imported question IDs.
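The EXAM_DRAFT requirement can be sketched as a request validator. The struct and function names are hypothetical, questionMetadata is omitted for brevity, and rejecting targets other than the two documented ones is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// examCreate mirrors the exam create payload shown in the example request.
type examCreate struct {
	Title           string `json:"title"`
	SubjectID       string `json:"subjectId"`
	GradeLevel      int    `json:"gradeLevel"`
	DurationMinutes int    `json:"durationMinutes"`
}

// approveRequest holds only the fields this sketch validates.
type approveRequest struct {
	Target string      `json:"target"`
	ExamID string      `json:"examId"`
	Exam   *examCreate `json:"exam"`
}

var (
	errMissingExam   = errors.New("EXAM_DRAFT requires examId or an exam create payload")
	errUnknownTarget = errors.New("unsupported approval target")
)

// validateApprove decodes the body and enforces the per-target rules.
func validateApprove(body []byte) (approveRequest, error) {
	var req approveRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, fmt.Errorf("invalid approval body: %w", err)
	}
	switch req.Target {
	case "QUESTION_BANK":
		return req, nil
	case "EXAM_DRAFT":
		if req.ExamID == "" && req.Exam == nil {
			return req, errMissingExam
		}
		return req, nil
	default:
		return req, errUnknownTarget
	}
}

func main() {
	_, err := validateApprove([]byte(`{"target":"EXAM_DRAFT"}`))
	fmt.Println(err)
	req, err := validateApprove([]byte(`{"target":"EXAM_DRAFT","exam":{"title":"Approved draft","durationMinutes":45}}`))
	fmt.Println(req.Exam.Title, err)
}
```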
Approval is idempotent for identical requests: the service records an approval.completed event with an approval key and returns the stored result on retry instead of reinserting questions.
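One way to picture the approval key is a hash over the job id and a canonical form of the request body. This document does not say how the service derives its key, so the derivation below (SHA-256 over the job id plus re-marshaled JSON) and the in-memory `approvalStore` are assumptions made purely for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// approvalKey hashes the job id together with a canonical encoding of the
// body. Re-marshaling a decoded value sorts object keys, so key order and
// whitespace in the request do not change the result.
func approvalKey(jobID string, body []byte) (string, error) {
	var decoded interface{}
	if err := json.Unmarshal(body, &decoded); err != nil {
		return "", err
	}
	canonical, err := json.Marshal(decoded)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(append([]byte(jobID+"\n"), canonical...))
	return hex.EncodeToString(sum[:]), nil
}

// approvalStore is a hypothetical stand-in for the approval.completed events.
type approvalStore struct{ results map[string]string }

// approve runs write once per key and returns the stored result on retry.
func (s *approvalStore) approve(jobID string, body []byte, write func() string) (string, bool, error) {
	key, err := approvalKey(jobID, body)
	if err != nil {
		return "", false, err
	}
	if stored, ok := s.results[key]; ok {
		return stored, true, nil
	}
	result := write()
	s.results[key] = result
	return result, false, nil
}

func main() {
	store := &approvalStore{results: map[string]string{}}
	write := func() string { return "imported 28 questions" }
	first, replayed, _ := store.approve("job-1", []byte(`{"target":"QUESTION_BANK"}`), write)
	fmt.Println(first, replayed)
	second, replayed, _ := store.approve("job-1", []byte(`{ "target": "QUESTION_BANK" }`), write)
	fmt.Println(second, replayed)
}
```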
Gateway live approval smoke:
```bash
cd go-platform
IMPORT_APPROVAL_LIVE_CONFIRM=approve-native \
IMPORT_APPROVAL_JOB_ID=<completed-job-id> \
IMPORT_APPROVAL_AUTHORIZATION='Bearer <token>' \
IMPORT_APPROVAL_ORGANIZATION_ID=<org-id> \
make test-import-approval-live
```

Use IMPORT_APPROVAL_BODY_JSON or IMPORT_APPROVAL_BODY_FILE to pass the exact approval payload. The script defaults to {"target":"QUESTION_BANK"} and will not run without the explicit confirmation flag because it performs native target writes.