fix HWPX char offsets around interleaved controls #213
Closed
yl-star7 wants to merge 15 commits into
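Release v0.7.2: VS Code context menu + Hancom shortcuts + command palette + form controls
Release v0.7.2: prepare npm package publishing (@rhwp/core + @rhwp/editor + VS Code 0.7.2)
Fix the backslash (\) path-separator problem that occurs when the cfb crate runs on Windows by converting each separator to a forward slash (/), so that paths are consistent on every OS.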
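The separator fix described in the commit above can be illustrated with a minimal sketch. The helper name `normalize_cfb_path` is hypothetical and is not taken from the rhwp source; it only shows the idea of rewriting every backslash to a forward slash before a stream path is compared or looked up.

```rust
/// Hypothetical helper (not the rhwp implementation): rewrite Windows-style
/// backslash separators to forward slashes so that CFB stream paths look the
/// same on every OS.
fn normalize_cfb_path(path: &str) -> String {
    // `str::replace` accepts a `char` pattern and a `&str` replacement.
    path.replace('\\', "/")
}
```

Paths that already use forward slashes pass through unchanged, so the helper is safe to apply unconditionally.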
Problem
-------
HwpxReader::read_file and read_file_bytes called read_to_string /
read_to_end directly on zip::ZipFile readers with no size ceiling. ZIP
allows extreme compression ratios, so a few-KB .hwpx can claim a single
section.xml entry that expands to multi-GB, OOMing the host. Any
downstream (CLI, WASM, browser extension, VS Code) inherits the risk
since all of them funnel through parse_hwpx → HwpxReader.
Fix
---
Introduce a small read_limited(reader, max) helper using Read::take so
we never allocate past max+1 bytes. Apply:
- MAX_XML_SIZE = 32 MB for text entries (section/header/hpf)
- MAX_BINDATA_SIZE = 64 MB for binary entries (images, fonts)
Real-world Korean government HWPX (press releases, statutes, case law)
stays well under these, so legitimate files are unaffected. Over-limit
entries surface as HwpxError::ZipError with "decompression bomb" in the
message, so callers see a clean parse error instead of a crash.
Rationale for thresholds
------------------------
MDM (https://github.com/seunghan91/markdown-media) has been running the
same 32/64 MB caps in production across ~500 real HWPX fixtures with
zero false positives; see MDM commit a94b459 for the original
application of this pattern across HWP/HWPX/PDF parser paths.
Tests
-----
Five unit tests in parser::hwpx::reader:
- under / at / over cap boundary for read_limited
- a synthetic zip-bomb entry (MAX_XML_SIZE + 1 bytes of 'A', compressed
  to <1 MB) is rejected with ZipError
Full suite: 789 passed, 0 failed (baseline + 5 new).
No public API change: HwpxError variants unchanged, method signatures
unchanged. MAX_XML_SIZE / MAX_BINDATA_SIZE are exposed as pub const so
integrators can read the current caps.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
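The read_limited helper described in the commit above might look roughly like the sketch below. This is not the code from the PR: the constant types (`u64`) are an assumption, and `std::io::Error` stands in for the `HwpxError::ZipError` variant the commit mentions. The sketch only shows how `Read::take` bounds the allocation at max+1 bytes.

```rust
use std::io::{self, Read};

// Caps named in the commit message; the `u64` type is an assumption.
pub const MAX_XML_SIZE: u64 = 32 * 1024 * 1024;
pub const MAX_BINDATA_SIZE: u64 = 64 * 1024 * 1024;

/// Read at most `max` bytes from `reader`, failing if more are available.
/// Sketch only: the real helper reportedly returns HwpxError::ZipError.
fn read_limited<R: Read>(reader: R, max: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Reading up to max + 1 bytes distinguishes "exactly max" (accepted)
    // from "more than max" (rejected) without reading any further.
    reader.take(max.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > max {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decompression bomb: entry exceeds size cap",
        ));
    }
    Ok(buf)
}
```

Because the limit is enforced on the decompressed stream rather than on the size the ZIP header declares, an entry that lies about its size is still cut off at the cap.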
Problem
-------
parser/hwpx/header.rs:293 treats any shape != "NONE" | "3D" as a real
strikethrough:
cs.strikethrough = !matches!(val.as_str(), "NONE" | "3D");
This is a blacklist of known placeholders. It is correct for the 3D
case Hancom emits today, but fails open: any future placeholder value
Hancom adds (e.g. "4D", "Ghost", internal experiments) will be
misinterpreted as a real strike line, and entire body-text paragraphs
will render with strikethrough.
Fix
---
Flip the predicate to a whitelist of OWPML LineSym2 values that shape.rs
already recognises (SOLID, DASH, DOT, DASH_DOT, DASH_DOT_DOT, LONG_DASH,
CIRCLE, DOUBLE_SLIM, SLIM_THICK, THICK_SLIM, SLIM_THICK_SLIM, WAVE,
DOUBLE_WAVE — 13 entries, same set as the enum mapping immediately
below). Unknown values — including NONE, 3D, and any future Hancom
placeholder — are fail-closed to no-strike.
Extracted into a `pub(crate) fn is_real_strike_shape(&str) -> bool`
so the predicate is independently testable and can be reused if other
parsers (e.g. underline, which has the same blacklist risk) adopt the
same approach later.
Evidence
--------
MDM (https://github.com/seunghan91/markdown-media) hit the exact same
bug earlier this month. Hancom's HWPX exporter writes <hh:strikeout
shape="3D"/> as a placeholder default on body-text charPr definitions.
Treating "anything but NONE" as strike caused press-release bodies
(e.g. "251113 벤처투자 보도자료") to render wrapped in ~~...~~.
Our fix (MDM commit ae15dd8) moved to the whitelist approach and
verified across 21 MOIS HWPX fixtures: zero false strikes remained.
The same logic fits rhwp one-for-one, so we're upstreaming it.
Tests
-----
Five new unit tests in parser::hwpx::header::tests:
- all 13 valid OWPML LineSym2 shapes → true
- "NONE" → false
- "3D" → false (the original bug)
- fail-closed cases: "4D", "Ghost", "", "solid" (case-sensitive)
Full suite: 789 passed, 0 failed.
No behaviour change for files Hancom currently emits — the set of
shapes that produce strike=true is identical to the old blacklist
path for every value shape.rs knows about. The only difference is
forward-compatibility: unknown future placeholders no longer get
mis-rendered as strike.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
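The whitelist predicate described in the commit above can be sketched as follows. The function name and signature come from the commit message, and the 13 shape names are the ones it lists; the body is an assumption about how such a predicate could be written, not a copy of the rhwp source.

```rust
/// Returns true only for the 13 OWPML LineSym2 values listed in the commit
/// message. Everything else ("NONE", "3D", unknown future placeholders)
/// fails closed to "no strikethrough".
pub(crate) fn is_real_strike_shape(shape: &str) -> bool {
    matches!(
        shape,
        "SOLID"
            | "DASH"
            | "DOT"
            | "DASH_DOT"
            | "DASH_DOT_DOT"
            | "LONG_DASH"
            | "CIRCLE"
            | "DOUBLE_SLIM"
            | "SLIM_THICK"
            | "THICK_SLIM"
            | "SLIM_THICK_SLIM"
            | "WAVE"
            | "DOUBLE_WAVE"
    )
}
```

The comparison is case-sensitive, which matches the "solid" fail-closed case in the test list above.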
…eparator fix(parser): fix the CFB path separator error on Windows
…otection fix(hwpx): cap ZIP entry decompression to defeat zip bombs
…-whitelist refactor(hwpx): whitelist strikeout shapes instead of blacklist
Release v0.7.3 (library) / v0.2.0 (extension)
Hotfix v0.2.1 (extension): add the version missing from manual §5.1
Tidy version labels under the split library/extension versioning policy:
- Library v0.7.3 (Cargo, npm/core, npm/editor, vscode, studio)
- Extension v0.2.1 (chrome, safari): Chrome Web Store / Edge Add-ons review in progress
Documentation updates:
- README.md / README_EN.md / rhwp-chrome/README.md: changelog + upcoming plans + cumulative mention of 6 external contributors
- mydocs/orders/20260419.md: made consistent with the v0.2.1 label
- mydocs/plans/task_m100_196.md, working/stage3, report: v0.2.1 label
Store listing information (mydocs/feedback/, mydocs/release/):
- kor_desc_0.2.1.md, eng_desc_0.2.1.md: Korean/English store descriptions (HWPX beta notice + contributors + PR numbers)
- cert_notes_0.2.1.md: Notes for certification for the store review (1986 chars)
- add_desc_02.md: text for the additional change notes
- release_notes_v0.7.3.md: GitHub Release notes
Next cycle: unify on v0.7.5 (single-version policy for library + extension)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs + CodeQL fix: v0.7.3 retrospective report + build.mjs alert edwardkim#16 fix
Add a roundtrip regression test that verifies paragraphs containing line breaks keep their text and char offsets after serialize and reparse.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keep HWPX paragraph offsets and char shape positions aligned with the original text/control order so linebreak and embedded control roundtrips remain stable.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Owner
@yl-star7 @jskang Thank you for the contribution. I cherry-picked the two commits, the actual fix (src/parser/hwpx/section.rs) and the test (task177_linebreak_preserved_on_roundtrip_ref_mixed), onto local/devel.
This PR branched from an old main, so its diff includes a large number of commits and documents merged since then (the v0.7.2, v0.7.3, and v0.2.1 releases plus #152/#153/#154/#196/#203). An admin merge risked overwriting the release documents, so I handled it by cherry-pick instead and am closing this PR. The author's (@jskang) commit authorship and the Co-Authored-By: Claude trailers are all preserved. In future, please run `git fetch origin && git rebase origin/devel` before opening a PR so that it is based on the latest devel and the diff stays clear. Thank you.
This was referenced Apr 20, 2026
edwardkim added a commit that referenced this pull request Apr 20, 2026
PR #181 (by @seunghan91) was written against the 2026-04-17 state of the tree; the following two PRs were merged into devel afterwards and changed the SVG output:
- #213 (cherry-picked by @jskang): HWPX interleaved char offsets correction
- #221 (by @planet6897): OLE/Chart/EMF native rendering + HWPX chart parser
Both fixes affect the form-002 rendering, which produced a diff against the original golden. Regenerated it with `UPDATE_GOLDEN=1 cargo test --test svg_snapshot` and confirmed determinism.
Verification:
- cargo test --test svg_snapshot → 2 passed (form_002_page_0 + render_is_deterministic_within_process)
- determinism holds across back-to-back reruns
Refs #181 · #173
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
edwardkim added a commit that referenced this pull request Apr 20, 2026
Wrap-up of the 2026-04-20~21 cycle:
PRs handled (9):
- admin merge 5: #209 #214 #215 #221 #224
- cherry-pick + close 2: #213 (+ duplicate #210 closed), #181 (+ golden regenerated)
- dependabot close 2: #211 #212 (manual bump on devel + target-branch=devel configured)
- on hold 1: #165 skia (separate cycle)
Issues closed (7): #173 #195 #202 #205 #207 #210 #222
New issue filed (1): #204 (table Undo/Redo)
Chrome Web Store / Edge Add-ons v0.2.1 passed review (2026-04-21).
local/task205 discarded: PRs #209 + #214 covered the same work, so the remaining unique contribution was hard to separate.
Thanks recorded for 7 contributors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
edwardkim added a commit that referenced this pull request Apr 21, 2026
Bring the section structure and content of the Korean README.md and the English README_EN.md into 1:1 agreement.
Main changes:
- Move the Roadmap + Milestones sections to the top (the roadmap/milestone position)
- Reflect the entire v0.2.1 cycle in the v0.5.0 ~ v0.7.x milestones (Chrome/Edge review passed, PR #213/#215/#221/#169/#209/#214/#224/#181)
- Add rhwp-firefox (v0.1.1, AMO preparation) + rhwp-safari (v0.2.1) sections
- Extend the contributor acknowledgements: add @seo-rii, @planet6897, @yl-star7 (9 people)
- Features: add the OLE/Chart/EMF native rendering item (#221)
- Project Structure: reflect src/emf/, src/ooxml_chart/, rhwp-firefox/, rhwp-shared/
- New Trademark section (same position as in the Korean version)
- AI pair programming section: fully reflect Git Workflow / Task Workflow / Debugging Protocol / Documentation Rules
- Correct the roadmap link to mydocs/eng/report/rhwp-milestone.md
- Test count 783 → 935
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
edwardkim added a commit that referenced this pull request Apr 22, 2026
Bring the section structure and content of the Korean README.md and the English README_EN.md into 1:1 agreement.
Main changes:
- Move the Roadmap + Milestones sections to the top (the roadmap/milestone position)
- Reflect the entire v0.2.1 cycle in the v0.5.0 ~ v0.7.x milestones (Chrome/Edge review passed, PR #213/#215/#221/#169/#209/#214/#224/#181)
- Add rhwp-firefox (v0.1.1, AMO preparation) + rhwp-safari (v0.2.1) sections
- Extend the contributor acknowledgements: add @seo-rii, @planet6897, @yl-star7 (9 people)
- Features: add the OLE/Chart/EMF native rendering item (#221)
- Project Structure: reflect src/emf/, src/ooxml_chart/, rhwp-firefox/, rhwp-shared/
- New Trademark section (same position as in the Korean version)
- AI pair programming section: fully reflect Git Workflow / Task Workflow / Debugging Protocol / Documentation Rules
- Correct the roadmap link to mydocs/eng/report/rhwp-milestone.md
- Test count 783 → 935
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `char_offsets` in original text/control order instead of shifting all controls to the front
- `char_shapes.start_pos` aligned with the same interleaved positions used by roundtrip serialization
- `<hp:lineBreak/>` and interleaved table control parsing
Test plan
- `cargo test --test hwpx_roundtrip_integration`
Notes
- `secPr`/`colPr` should also consume offset space in the same path
🤖 Generated with Claude Code
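To make the interleaving idea in the summary concrete, here is a minimal sketch of assigning offsets in document order. Everything in it is hypothetical: the `Item` enum, the `interleaved_offsets` function, and the `CONTROL_WIDTH` constant are illustrative names rather than rhwp types. It assumes that an embedded control such as a table occupies 8 UTF-16 code units of offset space and that a line break occupies 1.

```rust
/// Hypothetical model of one paragraph item, in original document order.
enum Item<'a> {
    Text(&'a str),
    /// A line break such as <hp:lineBreak/>; assumed to take 1 code unit.
    LineBreak,
    /// An embedded control such as a table; assumed to take 8 code units.
    Control,
}

/// Assumption: offset space consumed by one embedded control.
const CONTROL_WIDTH: u32 = 8;

/// Walk the items in order and return
/// (offset of every text character and line break, offset of every control).
fn interleaved_offsets(items: &[Item]) -> (Vec<u32>, Vec<u32>) {
    let mut pos = 0u32;
    let mut char_offsets = Vec::new();
    let mut control_positions = Vec::new();
    for item in items {
        match item {
            Item::Text(s) => {
                for ch in s.chars() {
                    char_offsets.push(pos);
                    // Offsets are counted in UTF-16 code units.
                    pos += ch.len_utf16() as u32;
                }
            }
            Item::LineBreak => {
                char_offsets.push(pos);
                pos += 1;
            }
            Item::Control => {
                // The control takes its slot where it appears, not at the front.
                control_positions.push(pos);
                pos += CONTROL_WIDTH;
            }
        }
    }
    (char_offsets, control_positions)
}
```

Under these assumptions, the sequence "ab", control, "c", line break, "d" places the control at offset 2 and the following characters at 10, 11 and 12. Shifting the control to the front would instead start the text at offset 8, and any character shape start position computed against the original order would no longer line up.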