fix: 수식 토크나이저 keyword prefix-split 구현 (#576) by oksure · Pull Request #579 · edwardkim/rhwp

oksure · 2026-05-04T07:38:05Z

문제

{a timesm} 스크립트에서 토크나이저가 timesm을 하나의 Command 토큰으로 반환.
파서에서 알려진 키워드로 인식되지 않아 이탤릭 텍스트 "timesm"으로 렌더링됨.

동일 패턴: simZ → "simZ" (기대값: ∼ + Z)

원인

read_command()가 연속 alphanumeric 시퀀스를 무조건 하나의 토큰으로 반환.
HWP 수식 스크립트에서 키워드와 식별자가 공백 없이 결합되는 경우를 처리하지 못함.

해결

Longest keyword prefix-split 방식 적용:

symbols.rs에 is_known_command(), longest_keyword_prefix_len() 추가
- 모든 알려진 키워드 테이블(OPERATORS, GREEK, FUNCTIONS, FONT_STYLES, DECORATIONS, 구조명령어)을 통합 조회
- 대소문자 무시 매칭 지원
tokenizer.rs의 read_command()에서:
- 전체 문자열이 알려진 키워드 → 그대로 반환
- 가장 긴 키워드 접두사가 있고 나머지가 남음 → 접두사만 반환, pos 되감기
- 매칭 없음 → 기존 동작 (전체를 하나의 토큰으로)

검증

timesm → times + m  (× 기호 + 변수 m)
simZ   → sim + Z    (∼ 기호 + 변수 Z)
rmX    → rm + X     (폰트 스타일 + 변수 X)
alphaX → alpha + X  (그리스 문자 + 변수 X)
alpha  → alpha      (완전 일치 → 분할 안 함)
xyz    → xyz        (미인식 → 분할 안 함)

cargo test 전체 통과 (기존 12 + 신규 6 = 18 tokenizer 테스트, symbols 2 추가)
cargo clippy -- -D warnings 경고 0

Closes #576

`timesm` → `times` + `m`, `simZ` → `sim` + `Z` 등 알려진 키워드 접두사를 인식하여 분할한다. 기존 read_command는 연속 alphanumeric 시퀀스를 하나의 토큰으로 반환했으나, HWP 수식 스크립트에서 키워드와 식별자가 공백 없이 붙어 있는 경우 올바르게 분리되지 않았다. 변경: - symbols.rs: is_known_command(), longest_keyword_prefix_len() 추가 - tokenizer.rs: read_command()에서 longest-prefix 매칭 후 분할 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the equation tokenizer to split a known command keyword from the front of a contiguous alphanumeric token (for example timesm → times + m) so the equation parser can render operator/style/symbol commands correctly when they are attached to following identifiers.

Changes:

Added is_known_command() and longest_keyword_prefix_len() in symbols.rs to centralize keyword detection across symbols, functions, decorations, styles, and structure commands.
Updated Tokenizer::read_command() to keep exact known commands intact, otherwise split off the longest known keyword prefix and rewind the remaining characters.
Added tokenizer and symbol tests covering prefix-splitting, exact-match preservation, and unknown-token behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`src/renderer/equation/tokenizer.rs`	Applies longest-known-keyword prefix splitting during command tokenization and adds regression tests for the new behavior.
`src/renderer/equation/symbols.rs`	Adds shared helpers for known-command detection and longest-prefix lookup, plus unit tests for the new lookup behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        // 알려진 키워드 접두사가 있으면 분할: 키워드만 반환하고 나머지는 되감기
+        if let Some(prefix_len) = symbols::longest_keyword_prefix_len(&value) {
+            self.pos = start + prefix_len;
+            let prefix = value[..prefix_len].to_string();
+            return Token::new(TokenType::Command, prefix, start);


+    let upper = cmd.to_ascii_uppercase();
+    is_structure_command(&upper) || FUNCTIONS.contains_key(upper.as_str())


edwardkim · 2026-05-04T09:11:43Z

@oksure 님, @planet6897 님 — 두 PR (#578 + #579) 이 동일 이슈 #576 의 다른 접근으로 동시 등록되어 있습니다. 메인테이너 검토 결과 양쪽 통합 이 가장 정합한 방향으로 판단됩니다. 두 분의 협업 PR 을 부탁드리는 첫 시도로 본 안내를 드립니다.

두 PR 의 정합한 영역 비교

PR	본질	정합 메모리 룰
#578 (@planet6897)	광범위 sweep (158 fixture / 563 unique 수식 scripts) + 명시적 4 키워드 (`times/sim/TIMES/SIM`) list 추가	`feedback_essential_fix_regression_risk` / `feedback_hancom_compat_specific_over_general`
#579 (@oksure)	일반적 인프라 (`is_known_command()`, `longest_keyword_prefix_len()` 헬퍼 + 모든 알려진 키워드 통합 조회)	`feedback_rule_not_heuristic`

통합 시 본질

두 영역 모두 가치 있고 본질적으로 보완:

PR fix: 수식 토크나이저 keyword prefix-split 구현 (#576) #579 의 인프라 — symbols.rs::is_known_command() + longest_keyword_prefix_len() 헬퍼는 향후 키워드 추가 시 자동 인식되는 확장성 있는 본질. 모든 알려진 키워드 (OPERATORS / GREEK / FUNCTIONS / FONT_STYLES / DECORATIONS / 구조명령어) 를 통합 조회하여 단일 룰 적용
PR Task #576: 수식 토크나이저 times/sim 키워드 prefix-split 정정 (closes #576) #578 의 sweep + case-specific — 158 fixture / 563 scripts 광범위 검증으로 발견한 4 영향 키워드 (times/sim/TIMES/SIM) — 회귀 위험 정밀 식별

일반화 알고리즘 (PR #579) 의 잠재 회귀:

PR #579 의 longest_keyword_prefix_len() 이 모든 알려진 키워드를 prefix 로 분리할 수 있어 다음과 같은 case 에 회귀 위험:

alphabet → alpha + bet (그리스 문자 prefix 충돌)
sqrtest → sqrt + est

PR #578 의 광범위 sweep 결과로 이런 회귀가 실제 fixture 에서 발현되는지 검증되어 4 키워드만 영향이 있다는 것이 확정됨.

통합 PR 제안

다음 통합 영역 권장:

PR fix: 수식 토크나이저 keyword prefix-split 구현 (#576) #579 의 인프라 (symbols.rs 헬퍼 + tokenizer.rs::read_command 정정) 채택
PR Task #576: 수식 토크나이저 times/sim 키워드 prefix-split 정정 (closes #576) #578 의 광범위 sweep 결과로 회귀 검증 — 158 fixture / 563 scripts 에서 일반화 알고리즘 적용 시 byte-identical 또는 의도된 정정만 발현되는지 점검
PR Task #576: 수식 토크나이저 times/sim 키워드 prefix-split 정정 (closes #576) #578 의 6 신규 unit tests 보존 (회귀 차단 가드)
회귀 발견 시: case-specific 분기 (예: alphabet 같은 알려진 충돌은 명시적 차단) 추가 또는 일반화 영역의 키워드 list 좁힘

협업 PR 자원 요청

본 안내는 두 컨트리뷰터의 커뮤니케이션 통한 PR 처리 시도 의 첫 사례입니다. 둘 중 한 분이 통합 PR 을 자원해주시거나, 함께 작업하실 방식 (예: PR #579 base 위에 PR #578 의 sweep 검증 영역 추가) 알려주시면 그에 따라 진행하겠습니다.

PR #578 / #579 의 OPEN 상태는 통합 PR 결정까지 유지합니다 (close 안 함).

본 사이클의 매우 활발한 컨트리뷰션 (3 컨트리뷰터, 10+ PR 처리 중) 의 진보 사례로 두 분의 협업을 부탁드립니다.

oksure · 2026-05-04T10:04:53Z

감사합니다. 통합 PR을 자원합니다.

Copilot 리뷰에서 지적된 alphabet → alpha + bet 회귀 문제를 인지하고 있습니다. @planet6897 님의 sweep 결과(4개 키워드만 실제 영향)를 기반으로 다음과 같이 통합하겠습니다:

symbols.rs의 인프라 헬퍼 (is_known_command, longest_keyword_prefix_len) 유지하되, prefix-split 대상을 연산자/관계 기호 (OPERATORS) 로 한정하여 그리스 문자(alpha 등)와의 충돌 방지
@planet6897 님의 6개 unit test 포함
158 fixture sweep 결과로 byte-identical 검증

@planet6897 님, PR #578 의 테스트 코드와 sweep 결과 참고하겠습니다.

@planet6897

Copilot 리뷰 + 메인테이너 피드백 반영: - `alphabet` → `alpha`+`bet` 회귀 방지 - `sinx` → `sin`+`x` 분할 방지 is_splittable_keyword()를 도입하여 prefix-split 대상을 OPERATORS, BIG_OPERATORS, SPECIAL_SYMBOLS, FONT_STYLES로 한정. 그리스 문자, 함수, 장식, 구조 명령어는 변수명 충돌 위험으로 제외. PR edwardkim#578 (@planet6897) sweep 결과와 일치: 실제 영향 키워드는 times/sim/TIMES/SIM 4개 (모두 OPERATORS). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

oksure · 2026-05-04T12:55:03Z

Copilot 리뷰 + 메인테이너 피드백 반영하여 2543722 push 했습니다.

변경 내용

is_splittable_keyword() 도입하여 prefix-split 대상을 축소:

포함 (분할 O)	제외 (분할 X)
OPERATORS (`times`, `sim`, `div` 등)	그리스 문자 (`alpha`, `beta` 등)
BIG_OPERATORS (`INT`, `SUM` 등)	함수 (`sin`, `cos`, `log` 등)
SPECIAL_SYMBOLS (`INF`, `DEG` 등)	장식 (`hat`, `bar` 등)
FONT_STYLES (`rm`, `it`, `bold`)	구조 명령어 (`OVER`, `SQRT` 등)

회귀 검증

alphabet → alphabet (분할 안 함 ✓)
sinx → sinx (분할 안 함 ✓)
timesm → times + m (분할 ✓)
simZ → sim + Z (분할 ✓)
rmX → rm + X (분할 ✓)

@planet6897 님의 158 fixture sweep 결과(영향 키워드 4개: times/sim/TIMES/SIM)와 정합합니다.

oksure · 2026-05-05T02:55:36Z

@edwardkim 님, 통합 제안 감사합니다.

검토 결과 제 #579의 일반화 인프라(longest_keyword_prefix_len)에 오분리 edge case가 있어서 (예: alphabet → alpha+bet, 대문자 함수 미인식 등), 가드 보강 없이는 회귀 위험이 있습니다.

planet6897 님의 #578이 158 fixture sweep으로 안전성 검증 완료된 상태이니, #578 먼저 머지 진행해주시고 제 #579는 close 하겠습니다. 일반화 인프라는 오분리 가드 설계 보강 후 별도 PR로 다시 올리겠습니다.

@planet6897

PR #578 (Task #576, @planet6897 / Jaeook Ryu) 1차 검토: - 본질 결함: tokenizer.rs::read_command (L104) prefix-split list 가 bold/it/rm 만 처리 → times/sim/TIMES/SIM 결합 식별자 단일 토큰화 (예: "a timesm" → "timesm" italic) - 본질 정정: list 확장 bold/it/rm → bold/it/rm/times/sim/TIMES/SIM (광범위 sweep 158 fixture / 563 scripts 결과 4 키워드만 발현) - 회귀 위험 0 — alpha/over/sqrt 등 다른 keyword 는 공백 구분 보존 - 단일 파일 본질 (tokenizer.rs +53 LOC + 6 신규 unit tests) PR #579 영역: 5/4 협업 PR 권유 후 @oksure self-close → PR #578 단독 처리 정합, @planet6897 답신 학습 의지 표명. 본 환경 임시 검증: - cherry-pick 충돌 0 - cargo test --lib --release 1140 passed (회귀 0) - task576 unit tests 6/6 passed (alpha 회귀 차단 가드 포함) - svg_snapshot 6/6 / clippy 0건 권장 처리: 옵션 A — 핀셋 cherry-pick + 결정적 검증 + 광범위 sweep + 작업지시자 시각 판정 (exam_science page 3/4).

@planet6897

…키워드 prefix-split — @planet6897 / Jaeook Ryu 1 commit + 시각 판정 ★ 통과) PR #578 (Task #576, @planet6897 PR / Jaeook Ryu commit author): - 본질 결함: tokenizer.rs::read_command (L104) prefix-split list 가 bold/it/rm 만 처리 → times/sim/TIMES/SIM 결합 식별자 단일 토큰화 (예: "a timesm" → "timesm" italic) - 본질 정정: list 확장 bold/it/rm → bold/it/rm/times/sim/TIMES/SIM - 광범위 sweep (158 fixture / 563 scripts) 결과 4 키워드만 발현 → alpha/over/sqrt 등 회귀 위험 0 - 단일 파일 본질 (tokenizer.rs +53 LOC + 6 신규 unit tests) PR #579 영역: 5/4 협업 PR 권유 후 @oksure self-close → PR #578 단독 처리 정합. 검증: - cargo test --lib --release 1140 passed (회귀 0) - task576 unit tests 6/6 passed (alpha 회귀 차단 가드 포함) - svg_snapshot 6/6 / clippy 0 / build --release - WASM 4,583,156 bytes (v0.7.10 baseline +1,691 bytes — tokenizer.rs +53 LOC 정합) - 광범위 페이지네이션 회귀 sweep: 164 fixture (158 hwp + 6 hwpx) / 1,614 페이지 / 페이지 수 회귀 0 - exam_science byte 차이: page 3/4 만 의도된 정정 (page 1/2 byte-identical) 작업지시자 웹 시각 판정 ★ 통과. closes #576.

Copilot AI review requested due to automatic review settings May 4, 2026 07:38

Copilot started reviewing on behalf of oksure May 4, 2026 07:38 View session

Copilot AI reviewed May 4, 2026

View reviewed changes

edwardkim mentioned this pull request May 4, 2026

Task #576: 수식 토크나이저 times/sim 키워드 prefix-split 정정 (closes #576) #578

Closed

7 tasks

oksure closed this May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: 수식 토크나이저 keyword prefix-split 구현 (#576)#579

fix: 수식 토크나이저 keyword prefix-split 구현 (#576)#579
oksure wants to merge 2 commits into
edwardkim:develfrom
oksure:contrib/fix-eq-tokenizer-prefix-split

oksure commented May 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

edwardkim commented May 4, 2026

Uh oh!

oksure commented May 4, 2026

Uh oh!

oksure commented May 4, 2026

Uh oh!

oksure commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		let upper = cmd.to_ascii_uppercase();
		is_structure_command(&upper) \|\| FUNCTIONS.contains_key(upper.as_str())

Uh oh!

Conversation

oksure commented May 4, 2026

문제

원인

해결

검증

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

edwardkim commented May 4, 2026

두 PR 의 정합한 영역 비교

통합 시 본질

통합 PR 제안

협업 PR 자원 요청

Uh oh!

oksure commented May 4, 2026

Uh oh!

oksure commented May 4, 2026

변경 내용

회귀 검증

Uh oh!

oksure commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants