
DOCX Import Service API

Baseline Endpoints

  • POST /v1/import/docx/qas
  • POST /v1/import/docx/qas/stream
  • POST /v1/import/docx/qas/jobs
  • POST /v1/import/docx/docx-fast-jobs
  • GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft
  • POST /v1/import/docx/docx-fast-jobs/{id}/reprocess
  • GET /v1/import/docx/teacher-library
  • GET /v1/import/docx/jobs
  • GET /v1/import/docx/jobs/{id}
  • GET /v1/import/docx/jobs/{id}/status
  • GET /v1/import/docx/jobs/{id}/events
  • GET /v1/import/docx/jobs/events
  • PATCH /v1/import/docx/jobs/{id}/review
  • POST /v1/import/docx/jobs/{id}/approve
  • GET /healthz
  • GET /readyz

Current Behavior

The service validates the multipart field named file, requires a .docx extension, checks the ZIP-based Office magic header, enforces a 50MB upload limit, and returns latency_ms.
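The same checks can be mirrored on the client before uploading. The following is a minimal sketch, not the service's own code: the function name check_docx is hypothetical, and it assumes that "50MB" means 50 × 1024 × 1024 bytes and that the extension is compared case-sensitively.

```bash
# Hypothetical pre-upload check mirroring the documented server-side validation.
# Assumption: "50MB" means 50 * 1024 * 1024 bytes.
MAX_DOCX_BYTES=$((50 * 1024 * 1024))

check_docx() {
  file=$1
  # 1. Extension must be .docx (case-sensitive in this sketch).
  case "$file" in
    *.docx) ;;
    *) echo "not a .docx file"; return 1 ;;
  esac
  # 2. The first four bytes must be the ZIP local-file signature 50 4b 03 04.
  magic=$(od -An -tx1 -N4 "$file" | tr -d ' \n')
  if [ "$magic" != "504b0304" ]; then
    echo "missing ZIP magic header"; return 1
  fi
  # 3. The size must not exceed the upload limit.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$MAX_DOCX_BYTES" ]; then
    echo "file exceeds upload limit"; return 1
  fi
  echo "ok"
}
```

A file that passes this check can still be rejected by the service, which remains the authority on what it accepts.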

Parser mode:

  • If GO_FORMULA_DOCX_URL or GO_FORMULA_DOCX_BASE_URL is set, the service streams the upload to {baseUrl}/v1/import/docx/simple with legacy-compatible fields include_html=1, include_meta=1, and include_solution=1.
  • Local Compose, raw offline K8s, and Helm now set GO_FORMULA_DOCX_URL to http://go-formula-docx:8080 by default and run the existing parser binary as an internal go-formula-docx runtime. The legacy parser source stays read-only.
  • If no Go Formula URL is configured, the baseline parser returns a warning-only response with PARSER_ENGINE_NOT_CONNECTED instead of silently dropping content.
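The parser-mode decision above can be sketched as a small shell function. This is an illustration only: resolve_parser_target is a hypothetical name, and the precedence of GO_FORMULA_DOCX_URL over GO_FORMULA_DOCX_BASE_URL when both are set is an assumption, since the text only says that either variable enables the upstream parser.

```bash
# Assumption: GO_FORMULA_DOCX_URL wins when both variables are set.
resolve_parser_target() {
  base=${GO_FORMULA_DOCX_URL:-${GO_FORMULA_DOCX_BASE_URL:-}}
  if [ -z "$base" ]; then
    # No upstream configured: the baseline parser answers with a warning only.
    echo "PARSER_ENGINE_NOT_CONNECTED"
    return 0
  fi
  # Strip one trailing slash so the joined URL does not contain "//".
  base=${base%/}
  echo "$base/v1/import/docx/simple"
}
```

With the Compose default of http://go-formula-docx:8080, the function prints http://go-formula-docx:8080/v1/import/docx/simple.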

Media materialization:

  • If DOCUMENT_SERVICE_URL or DOCUMENT_SERVICE_BASE_URL is set, the service extracts referenced DOCX package images, uploads bytes through document-service /v1/storage/presigned-upload, creates metadata through /v1/storage/media-assets, and rewrites media references to /api/storage/media-assets/{id}/content.
  • The materializer requires X-Organization-Id for upload ownership and passes X-User-Id as the owner when available.
  • If a referenced image cannot be extracted, uploaded, or registered, the original media reference is preserved and the response includes GO_FORMULA_DOCX_IMAGE_MATERIALIZE_FAILED; the service must not silently drop the image.
  • stats.total_images uses Go Formula renderableImageCount when available so WMF/OLE formula artifacts are not counted as normal renderable images; the raw media references remain available for review warnings/materialization.

Required Parser Follow-Up

The Go Formula adapter now maps direct service output into the QAS response, can materialize extractable package images into document-service, applies payload-level DOCX style answer hints, can build annotated fallback questions when upstream returns zero questions, and has native OOXML passes for template-gated DOCX files. Physics 28Q, Math 22Q, English 40Q, and DGNL 102Q corpus QAS shapes are covered by an opt-in runtime integration test.

Classifier baseline:

  • internal/classifier provides a deterministic question-type fallback for obvious shapes: explicit type hints, true/false sub-items, multiple correct answers, single-choice options, short numeric answers, essay prompts, and media/table/graph review cases.
  • This is not a replacement for ai-classifier-service; low-confidence or media/table/graph classifications return warnings so review can keep the source evidence visible.

Fallback and answer-hint parity:

  • If the Go Formula payload includes styleAnswerHints or docxStyleAnswerHints, the adapter applies them to QAS correct_answer and records answer_evidence=DOCX_STYLE_HINT in source_location.
  • If Go Formula returns zero questions but includes annotatedResult.segments or annotatedSegments, the adapter builds warning-safe QAS questions from those annotated segments and emits GO_FORMULA_DOCX_ANNOTATED_FALLBACK.
  • The native OOXML style-hint pass reads word/document.xml from the uploaded DOCX, detects red/underlined answer runs for Math/English templates and yellow-highlighted answer runs for DGNL templates, then applies hints with DOCX_NATIVE_STYLE_ANSWER_HINT_APPLIED.
  • If the DGNL 102Q runtime payload returns zero QAS questions, the native fallback builds DGNL_SINGLE_CHOICE questions from DOCX A-D option groups, answer-only -> rows, and review-required unlabeled choice rows, then emits DOCX_NATIVE_DGNL_FALLBACK.
  • make test-docx-corpus runs the opt-in Physics/Math/English/DGNL QAS parity checks when GO_FORMULA_DOCX_URL and HOCTAPAZ_DOCX_CORPUS_DIR point to the local runtime/corpus.
  • make test-docx-materialization runs an opt-in Physics 28Q runtime check through live docx-import-service plus document-service, asserts 7 materialized MediaAsset references, and fetches each asset back through /v1/storage/media-assets/{id}/content.
  • make test-docx-warning-capture plus make test-docx-warning-parity compares captured native Physics/Math/English/DGNL response warnings against legacy output/import-audit-20260702/api artifacts. The filtered comparator passes with DOCX_WARNING_PARITY_ALLOW_EXTRA_NATIVE=1, because native emits additional review-safety warnings rather than claiming exact no-extra warning counts.

Job Model

The job endpoints use a repository interface:

  • If DATABASE_URL is configured and reachable, the service uses the pgx-backed docx_import_jobs and docx_import_events tables from services/docx-import-service/migrations/000002_import_jobs.sql, plus payload_ref from services/docx-import-service/migrations/000003_import_payload_spool.sql.
  • If no database is configured, the service uses an in-memory repository so local skeleton runs remain possible.

Job status values:

  • PENDING
  • PROCESSING
  • COMPLETED
  • FAILED

Events currently emitted:

  • job.queued
  • job.started
  • job.completed
  • job.failed
  • approval.completed
  • auto_approval.completed
  • auto_approval.failed

POST /v1/import/docx/qas/jobs now writes the uploaded DOCX to the service-local payload spool, stores only an internal payload_ref on the PENDING job, and returns quickly. A bounded in-process worker queue then moves the job to PROCESSING, opens the payload reference, parses it in the background, and deletes the spool file after COMPLETED or FAILED. DOCX_IMPORT_PAYLOAD_DIR sets the spool directory; if unset, the service uses a temp-directory fallback. DOCX_IMPORT_WORKER_CONCURRENCY controls the native worker count; if unset, the service falls back to ALGORITHM_IMPORT_WORKER_CONCURRENCY and then to a default of 2.
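The worker-count fallback chain can be expressed with nested parameter expansion. This is a sketch with a hypothetical function name; it treats an empty variable the same as an unset one, which is an assumption about the service.

```bash
# Mirrors the documented fallback chain for the native worker count.
# Assumption: an empty value falls through to the next candidate.
worker_concurrency() {
  echo "${DOCX_IMPORT_WORKER_CONCURRENCY:-${ALGORITHM_IMPORT_WORKER_CONCURRENCY:-2}}"
}
```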

On startup, the service scans PENDING and PROCESSING jobs with payload refs and re-enqueues them. If the payload reference is missing, it marks the job FAILED with DOCX_IMPORT_PAYLOAD_MISSING instead of leaving it stuck.
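The startup recovery rule can be read as a three-way decision per job. The sketch below is illustrative: recovery_action is a hypothetical name, and it assumes the payload reference resolves to a file path that can be tested for existence.

```bash
# Decide what startup recovery does with one job.
# Assumption: the payload reference is a filesystem path.
recovery_action() {
  status=$1
  payload_path=$2
  case "$status" in
    PENDING|PROCESSING) ;;          # only unfinished jobs are scanned
    *) echo "skip"; return 0 ;;
  esac
  if [ -n "$payload_path" ] && [ -f "$payload_path" ]; then
    echo "requeue"
  else
    # A missing payload fails the job instead of leaving it stuck.
    echo "fail:DOCX_IMPORT_PAYLOAD_MISSING"
  fi
}
```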

POST /v1/import/docx/docx-fast-jobs is the legacy-compatible JSON create adapter for POST /api/exam-import/docx-fast-jobs. It accepts questionStorageKey, questionFileName/fileName, title, and autoApproveToQuestionBank, plus optional subjectId and gradeLevel metadata.

When Idempotency-Key is present, the service trims it, caps it at 120 characters, and returns the recent matching native job within 15 minutes for the same organization, actor, and questionStorageKey before fetching storage or enqueueing again. Otherwise it fetches the uploaded DOCX bytes from document-service through GET /v1/storage/objects/content, validates the DOCX extension and ZIP magic, and then reuses the native queue/payload spool path.

When autoApproveToQuestionBank is true, a successfully parsed native DOCX Fast job triggers asynchronous QUESTION_BANK approval and records either auto_approval.completed or auto_approval.failed without changing the normal job completion status.

The response is a legacy success envelope with a camelCase job summary so the frontend api<T>() helper can continue to unwrap data.
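The Idempotency-Key handling described above (trim, cap at 120 characters, 15-minute window) can be sketched as follows. Both function names are hypothetical, the inclusive window boundary is an assumption, and cut -c may count bytes rather than characters for non-ASCII keys on some systems.

```bash
# Trim surrounding whitespace, then cap the key at 120 characters.
normalize_idempotency_key() {
  printf '%s' "$1" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | cut -c1-120
}

# Assumption: a job created exactly 15 minutes ago still counts as recent.
IDEMPOTENCY_WINDOW_SECONDS=$((15 * 60))

within_idempotency_window() {
  created_epoch=$1
  now_epoch=$2
  [ $((now_epoch - created_epoch)) -le "$IDEMPOTENCY_WINDOW_SECONDS" ]
}
```

A lookup would combine the normalized key with the organization, actor, and questionStorageKey, and reuse the stored job only when within_idempotency_window succeeds.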

GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft is the legacy-compatible materialized temp-draft read branch. For completed native DOCX Fast jobs with a stored parse result, it returns token, jobId, expiresAt, parseResult, and an empty assets array with message DOCX Fast materialized draft. If a reviewed parse result has been saved through PATCH /v1/import/docx/jobs/{id}/review, the temp-draft read returns the reviewed payload so manual question-type overrides survive reload.

PATCH /v1/import/docx/jobs/{id}/review is the native backend foundation for legacy review save. It accepts parseResult plus optional title and sourceText, stores the reviewed parse result JSON verbatim in the service-owned job row, and returns a legacy success envelope with parseResultJson.

When sourceText is provided together with an object-shaped reviewed parseResult, compatibility responses overlay it as parseResultJson.sourceText so the source editor can reload the same text after save while the raw reviewed JSON remains unchanged in storage. Storing the reviewed JSON verbatim preserves frontend metadata such as sourceMetadataJson.questionTypeManualOverride, sourceMetadataJson.questionTypeReviewed, and sourceMetadataJson.questionTypeAutoDetected=false without coercing the payload through the QAS struct. The endpoint records review.updated events.

When sourceText is provided without parseResult, the native endpoint runs a baseline source-text reparse for simple Câu n questions with A/B/C/D options, * answer markers, Đáp án:, and Lời giải:. The generated parseResultJson includes a warning that complex legacy parser parity is still pending.

When docxFastTempDraftToken is provided, the native endpoint accepts only the materialized token shape returned by native temp-draft reads: materialized-{jobId}. That token is treated as an idempotent acknowledgement that the native DOCX Fast parse payload is already materialized in the import job; arbitrary legacy temp-draft tokens and client-side temp assets remain on legacy routes until separate parity work.
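The token rule for docxFastTempDraftToken amounts to an exact string comparison against the job id. A minimal sketch, with a hypothetical function name:

```bash
# Accept only the token shape returned by native temp-draft reads.
is_materialized_token() {
  token=$1
  job_id=$2
  [ -n "$job_id" ] && [ "$token" = "materialized-$job_id" ]
}
```

For example, is_materialized_token materialized-42 42 succeeds, while a token minted for a different job id or an arbitrary legacy token fails.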

GET /v1/import/docx/jobs/{id} returns the legacy-compatible detail reload shape with camelCase fields such as questionFileName, fileName, parseStatus, matchStatus, reviewStatus, packagingJson, and parseResultJson. If a reviewed result has been saved, parseResultJson returns the reviewed payload instead of the original parser output. Completed native DOCX Fast jobs include packagingJson.docxFastTempDraft with status MATERIALIZED, token materialized-{id}, and assets: [], matching temp-draft reads; this lets the editor skip legacy client temp-draft materialization for native materialized jobs.

GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft/assets/{tempAssetId}/content is the materialized asset content adapter. It does not create a native non-materialized temp-draft store. Instead, it accepts only completed native DOCX Fast jobs, verifies that tempAssetId matches a materialized media reference in the stored parse result (media_asset_id, media id, go_asset_id, or the /api/storage/media-assets/{id}/content URL), then streams the permanent media asset through document-service /v1/storage/media-assets/{id}/content with the legacy private cache header.
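The tempAssetId check can be illustrated by comparing the requested id against one stored media reference, which may be a bare id or a content URL. Both helper names are hypothetical, and reading the references out of the stored parse result is left out of this sketch.

```bash
# Pull {id} out of /api/storage/media-assets/{id}/content; fail on any other shape.
media_asset_id_from_url() {
  url=$1
  case "$url" in
    /api/storage/media-assets/*/content)
      id=${url#/api/storage/media-assets/}
      id=${id%/content}
      # Reject empty ids and ids that still contain a path separator.
      case "$id" in
        ""|*/*) return 1 ;;
      esac
      echo "$id"
      ;;
    *) return 1 ;;
  esac
}

# A tempAssetId matches a reference that is either the bare id or its content URL.
temp_asset_matches() {
  temp_asset_id=$1
  reference=$2
  [ "$temp_asset_id" = "$reference" ] && return 0
  extracted=$(media_asset_id_from_url "$reference") || return 1
  [ "$temp_asset_id" = "$extracted" ]
}
```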

POST /v1/import/docx/docx-fast-jobs/{id}/reprocess is the legacy-compatible DOCX Fast reprocess adapter. It re-reads the source DOCX from the job's stored questionStorageKey through document-service, saves a fresh payload under the same native job id, clears previous parse output, requeues the native worker, and returns message DOCX Fast import reprocessed. The foundation rejects jobs that are already active or waiting in the native queue.

ImportJobService publishes these lifecycle events to a service-owned event bus. Without REDIS_URL, the bus is in-process only. When REDIS_URL is configured and reachable, the service wraps the in-process bus with Redis pub/sub on DOCX_IMPORT_EVENTS_CHANNEL (default docx-import:jobs) so subscribers on one service process can receive job.updated events from another process. Redis payloads redact the heavy parsed result body because the SSE summary does not need it. Uploaded DOCX bytes and payload refs are not published to Redis, SSE, or logs.
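The bus selection above depends on two environment variables. The sketch below shows only the configuration side: event_bus_mode is a hypothetical name, an empty REDIS_URL is assumed to behave like an unset one, and the reachability check the service performs is not modeled.

```bash
# Report which event bus the configuration asks for.
# Assumption: an empty REDIS_URL is treated the same as an unset one.
event_bus_mode() {
  if [ -n "${REDIS_URL:-}" ]; then
    echo "redis:${DOCX_IMPORT_EVENTS_CHANNEL:-docx-import:jobs}"
  else
    echo "in-process"
  fi
}
```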

Legacy-Compatible Status And Events

Phase P3-005 maps legacy import status/history surfaces without cutting over the default public route table:

  • Legacy /api/exam-import/teacher-library maps to native GET /v1/import/docx/teacher-library.
  • Legacy POST /api/exam-import/docx-fast-jobs maps to native POST /v1/import/docx/docx-fast-jobs in the non-default gateway create-job example.
  • Legacy GET /api/exam-import/docx-fast-jobs/{id}/temp-draft maps to native GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft only for the materialized read branch in the same non-default gateway example.
  • make test-import-temp-draft-live is the read-only live smoke for the non-default DOCX Fast materialized temp-draft route. It requires a completed native DOCX Fast job with stored parse output, auth/org context, and a gateway running with the native import-create route table.
  • Legacy /api/exam-import/docx-fast-jobs/{id}/temp-draft/assets/{tempAssetId}/content maps to native GET /v1/import/docx/docx-fast-jobs/{id}/temp-draft/assets/{tempAssetId}/content only for materialized media already referenced by the completed native job.
  • make test-import-temp-asset-live is the read-only live smoke for the non-default DOCX Fast materialized temp asset content route. It requires a completed native DOCX Fast job, a materialized temp asset id present in the stored parse result, auth/org context, and a gateway running with the native import-create route table.
  • Legacy POST /api/exam-import/docx-fast-jobs/{id}/reprocess maps to native POST /v1/import/docx/docx-fast-jobs/{id}/reprocess in the same non-default gateway example.
  • make test-import-reprocess-live is the opt-in write smoke for the non-default DOCX Fast reprocess route. It requires explicit confirmation, a native DOCX Fast job with a stored source DOCX, auth/org context, and a gateway running with the native import-create route table.
  • make test-import-create-live is the opt-in write smoke for the non-default DOCX Fast create route. It requires explicit confirmation, a real uploaded DOCX questionStorageKey, auth/org context, and a gateway running with the native import-create route table.
  • make test-import-create-browser is the opt-in Playwright smoke for the real import surface. It can either upload a caller-supplied DOCX through the UI file input or use an existing questionStorageKey, then verifies the browser-observed gateway route headers and legacy success envelope.
  • Legacy PATCH /api/exam-import/jobs/{id}/review maps to native PATCH /v1/import/docx/jobs/{id}/review only in the non-default review-save gateway example; all sibling job actions remain legacy.
  • Legacy GET /api/exam-import/jobs/{id} maps to native GET /v1/import/docx/jobs/{id} only in the non-default detail reload gateway example; deeper sibling paths such as /status and /review remain legacy unless separately routed.
  • Legacy /api/exam-import/jobs/{id}/status and /api/imports/{id}/status map to native GET /v1/import/docx/jobs/{id}/status.
  • Legacy /api/exam-import/algorithm-jobs/events maps to native GET /v1/import/docx/jobs/events.
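A subset of the mappings above can be sketched as a lookup function. It covers only the teacher-library, status, and events routes; the function names are hypothetical, and treating the legacy teacher-library and status routes as GET requests is an assumption, because the list gives their paths without a method.

```bash
# Map a legacy method and path to the native route; return 1 for anything else.
# Assumption: the legacy status and teacher-library routes are GET requests.
native_route() {
  method=$1
  path=$2
  case "$method $path" in
    "GET /api/exam-import/teacher-library")
      echo "GET /v1/import/docx/teacher-library" ;;
    "GET /api/exam-import/algorithm-jobs/events")
      echo "GET /v1/import/docx/jobs/events" ;;
    "GET /api/exam-import/jobs/"*"/status")
      id=${path#/api/exam-import/jobs/}
      id=${id%/status}
      native_status_route "$id" ;;
    "GET /api/imports/"*"/status")
      id=${path#/api/imports/}
      id=${id%/status}
      native_status_route "$id" ;;
    *)
      return 1 ;;
  esac
}

# Build the native status route, rejecting empty or multi-segment ids.
native_status_route() {
  case "$1" in
    ""|*/*) return 1 ;;
  esac
  echo "GET /v1/import/docx/jobs/$1/status"
}
```

Routes that return 1 here stay on legacy in this sketch, which matches the note that sibling job actions are not routed unless listed separately.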

The compatibility list and SSE snapshot use the existing frontend summary fields: id, jobKind, source, title, questionFileName, fileName, status, fileStatus, extractionStatus, parseStatus, matchStatus, reviewStatus, progress, optional queue, errorSummary, and timestamps. The queue object is attached only while native jobs are PENDING or PROCESSING, with legacy-shaped fields such as status, waiting, active, totalPending, waitingPosition, jobsAhead, and workerConcurrency. The status endpoint returns the legacy status counters: status, totalQuestions, parsedQuestions, pendingEquations, pendingImages, warnings, and errors.
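The rule for when the queue object appears reduces to a status check. A minimal sketch with a hypothetical function name:

```bash
# The queue object is attached only while a native job is PENDING or PROCESSING.
should_attach_queue() {
  case "$1" in
    PENDING|PROCESSING) return 0 ;;
    *) return 1 ;;
  esac
}
```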

This is a read/status compatibility layer over native docx_import_jobs and docx_import_events. The SSE route now replays stored jobs, keeps the stream open, sends periodic heartbeat events, and forwards live job.updated events for job start/completion/failure/approval changes from the current process or Redis fanout.

Browser cutover remains deferred until the gateway auth/header adapter, worker topology, and browser verification are ready. The payload spool is already wired to a named Compose volume and a docx-import-payloads PersistentVolumeClaim in offline K8s/Helm.

A read-only BullMQ snapshot bridge can be enabled with DOCX_IMPORT_BULLMQ_BRIDGE_ENABLED=1 to report legacy exam-import-algorithm queue depth/position when a native job is not present in the in-process queue; it does not enqueue, remove, retry, or claim legacy jobs. Full shared worker execution with the legacy BullMQ queue remains deferred.

make test-docx-bullmq-status is an opt-in live smoke for this bridge path; it checks the non-default status route table and the teacher-library row queue shape for a caller-supplied pending/processing job id. make test-docx-import-library-browser adds browser-route evidence for the real /teacher/exams/library page and can optionally search for the same job id to assert the queue snapshot in the browser-observed response.
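A client of the SSE route described above needs to split the stream into events. The following sketch parses standard Server-Sent Events framing from stdin; read_sse_events is a hypothetical name, and the exact event names and payload fields the service sends are not assumed beyond what the text states.

```bash
# Read an SSE stream on stdin and print one "event<TAB>data" line per event.
# Events without a data field are not dispatched, per the SSE framing rules.
read_sse_events() {
  cr=$(printf '\r')
  event=message
  data=
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line%"$cr"}                 # tolerate CRLF line endings
    case "$line" in
      "")                              # a blank line dispatches the event
        if [ -n "$data" ]; then
          printf '%s\t%s\n' "$event" "$data"
        fi
        event=message
        data=
        ;;
      :*) ;;                           # comment line, often used as keep-alive
      event:*)
        event=${line#event:}
        event=${event# }
        ;;
      data:*)
        chunk=${line#data:}
        chunk=${chunk# }
        if [ -n "$data" ]; then
          data="$data
$chunk"
        else
          data=$chunk
        fi
        ;;
    esac
  done
}
```

In practice the function would sit behind a streaming HTTP client, for example curl -N against the native events route, with whatever auth and organization headers the deployment requires.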

Native Approval Boundary

POST /v1/import/docx/jobs/{id}/approve approves a completed job result into native services. It is an internal /v1 contract. A non-default API Gateway route-table example can adapt legacy-compatible POST /api/exam-import/jobs/{id}/approve to this endpoint, but the default route table keeps public /api/exam-import/* on legacy until gateway auth/RBAC and browser review parity are verified.

Question-bank approval:

```json
{
  "target": "QUESTION_BANK",
  "questionMetadata": {
    "subjectId": "math",
    "gradeLevel": 12,
    "tags": ["approved"]
  }
}
```

The service calls question-bank-service POST /v1/questions/import-docx-output, forwarding X-Organization-Id and X-User-Id, and adds sourceImportJobId, sourceFileName, and approval metadata.
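For direct calls to the internal approve endpoint, a thin curl wrapper might look like the sketch below. Everything beyond the endpoint path and the two header names is an assumption: approve_job and DOCX_IMPORT_BASE_URL are hypothetical names, the localhost:8080 default is a placeholder, and sending the organization and user headers on the inbound request is inferred from the fact that the service forwards them.

```bash
# Hypothetical wrapper around POST /v1/import/docx/jobs/{id}/approve.
# DOCX_IMPORT_BASE_URL and its default are placeholders, not service settings.
approve_job() {
  job_id=$1
  org_id=$2
  user_id=$3
  body=$4
  if [ -z "$body" ]; then
    body='{"target":"QUESTION_BANK"}'
  fi
  curl -sS -X POST \
    -H "Content-Type: application/json" \
    -H "X-Organization-Id: $org_id" \
    -H "X-User-Id: $user_id" \
    --data "$body" \
    "${DOCX_IMPORT_BASE_URL:-http://localhost:8080}/v1/import/docx/jobs/$job_id/approve"
}
```

Because approval writes to native targets, a wrapper like this is best kept behind the same kind of explicit confirmation the live smoke script uses.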

Draft exam approval:

```json
{
  "target": "EXAM_DRAFT",
  "questionMetadata": {
    "subjectId": "math",
    "gradeLevel": 12
  },
  "exam": {
    "title": "Approved draft",
    "subjectId": "math",
    "gradeLevel": 12,
    "durationMinutes": 45
  }
}
```

For EXAM_DRAFT, the request must include either examId or an exam create payload. The service creates the draft when needed and then calls PUT /v1/exams/{examId}/question-snapshots with snapshots built from the QAS output and imported question IDs.

Approval is idempotent for identical requests: the service records an approval.completed event with an approval key and returns the stored result on retry instead of reinserting questions.

Gateway live approval smoke:

```bash
cd go-platform
IMPORT_APPROVAL_LIVE_CONFIRM=approve-native \
IMPORT_APPROVAL_JOB_ID=<completed-job-id> \
IMPORT_APPROVAL_AUTHORIZATION='Bearer <token>' \
IMPORT_APPROVAL_ORGANIZATION_ID=<org-id> \
make test-import-approval-live
```

Use IMPORT_APPROVAL_BODY_JSON or IMPORT_APPROVAL_BODY_FILE to pass the exact approval payload. The script defaults to {"target":"QUESTION_BANK"} and will not run without the explicit confirmation flag because it performs native target writes.

Go-platform documentation is generated from repository Markdown.