Non-Latin text support via MTTextAtom #200

Merged: kostub merged 12 commits into master from worktree-non-latin on May 14, 2026
Conversation

kostub (Owner) commented May 10, 2026

Summary

Adds first-class non-Latin text support to iosMath via a new MTTextAtom (Approach E from the LLD): \text, \textrm, \textbf, \textit, \textsf, \texttt now capture their bodies as raw NSString and render through CoreText's system-font cascade. This unlocks CJK, Cyrillic, Devanagari, Hebrew, Arabic, etc. inside math layouts without relying on the math font's BMP coverage. The implementation also retires the legacy Cyrillic-as-Variable carve-out, since \text now handles arbitrary scripts uniformly.

  • New atom type kMTMathAtomText (= 19) with MTTextStyle enum (Roman/Bold/Italic/SansSerif/Typewriter), distinct from Ordinary so Rule 14 fusion is avoided.
  • New MTTextDisplay opaque sub-display embedded into the parent math line, x-height-aligned via the math font's \fontdimen5.
  • New MTFontManager +textCTFontForStyle:size: returning a CTFontRef with system font + symbolic traits.
  • Parser switch: the six \text* commands route to MTTextAtom; raw-body capture supports balanced braces and the standard 8 LaTeX escapes (\\, \{, \}, \_, \^, \%, \&, \#, \$).
  • 34 new tests; full suite 217 → 219 (the +2 are item 6's regression coverage for the dropped carve-out).

Plan & LLD

(These live under untracked `docs/` per the orchestrator setup; they're not in the diff but are the design reference for review.)

Commits

  • [item 1] feat: add MTTextAtom and MTTextStyle data model (`010565a`)
  • [item 2] feat: MTFontManager +textCTFontForStyle:size: (`abd89fe`)
  • [item 3] feat: MTTextDisplay for CoreText system-font runs (`9ceb5b4`)
  • [item 4] feat: typesetter renders MTTextAtom as MTTextDisplay (`436f286`)
  • [item 5] feat: parser routes \text* to MTTextAtom (`01e071b`)
  • [item 6] refactor: drop Cyrillic carve-out in atomForCharacter: (`d32b192`)
  • [item 7] docs(examples): add mixed text + math demonstrators (`23382fc`)

Each commit's tests pass independently (verified per-commit with `swift test`).

Test plan

  • `swift test` — 219/219 pass at HEAD and at every intermediate commit
  • Run iOS sample app (`iosMathExample`) on simulator and visually confirm entries 23–30 (Cyrillic, CJK, Devanagari, Hebrew, Arabic, mixed-style, scripts-on-text)
  • Run macOS sample app (`MacOSMathExample`) and confirm the same
  • Run `SwiftMathExample` (SwiftUI) and confirm the three new `NamedFormula` entries render
  • Spot-check that pre-existing math layouts (fractions, radicals, integrals) are byte-identical — `MTTextAtom` is opaque to the parent line so layout for non-text math should be unchanged
  • One known follow-up: `Example/iosMathExample/example/ViewController.m` and `MacOSMathExample/AppDelegate.m` carry hard-coded `demoHeights[]` arrays that stop at index 22; new entries 23–30 fall through to the 60pt default. If any clip in practice, extend the arrays.

kostub and others added 7 commits May 10, 2026 19:09
[item 1] feat: add MTTextAtom and MTTextStyle data model (`010565a`)

Introduces a new atom type kMTMathAtomText distinct from Ordinary so it
won't be fused by Rule 14, an MTTextStyle enum (Roman/Bold/Italic/
SansSerif/Typewriter), an MTTextAtom subclass with raw-Unicode body and
LaTeX round-trip with the eight standard escapes, and the
+textStyleWithName:/+commandNameForTextStyle: factory APIs that drive
parser and serializer lookups.

No parser/typesetter wiring yet; \text* still flows through the legacy
fontStyles path. Existing behavior unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
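
A minimal sketch of the name-to-style lookup this commit describes, written in plain C (which also compiles as Objective-C). The names `text_style_t`, `text_style_with_name` and `command_name_for_text_style` are hypothetical stand-ins, not the library's `+textStyleWithName:` / `+commandNameForTextStyle:` implementations. Only the six command names and the five styles come from the PR; mapping both `text` and `textrm` to Roman, and serializing Roman back as `text`, are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the five styles named in the PR. */
typedef enum {
    TEXT_STYLE_INVALID = -1,
    TEXT_STYLE_ROMAN,
    TEXT_STYLE_BOLD,
    TEXT_STYLE_ITALIC,
    TEXT_STYLE_SANS_SERIF,
    TEXT_STYLE_TYPEWRITER
} text_style_t;

typedef struct {
    const char *name;   /* command name without the leading backslash */
    text_style_t style;
} text_command_t;

/* One table drives both directions, so parser and serializer agree.
   Assumption: \text and \textrm both mean Roman; "text" comes first so it
   is the name used when serializing a Roman atom. */
static const text_command_t kTextCommands[] = {
    {"text",   TEXT_STYLE_ROMAN},
    {"textrm", TEXT_STYLE_ROMAN},
    {"textbf", TEXT_STYLE_BOLD},
    {"textit", TEXT_STYLE_ITALIC},
    {"textsf", TEXT_STYLE_SANS_SERIF},
    {"texttt", TEXT_STYLE_TYPEWRITER},
};
#define TEXT_COMMAND_COUNT (sizeof kTextCommands / sizeof kTextCommands[0])

/* Parser direction: command name -> style, or TEXT_STYLE_INVALID. */
text_style_t text_style_with_name(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < TEXT_COMMAND_COUNT; i++) {
        if (strcmp(kTextCommands[i].name, name) == 0) {
            return kTextCommands[i].style;
        }
    }
    return TEXT_STYLE_INVALID;
}

/* Serializer direction: style -> first matching command name, or NULL. */
const char *command_name_for_text_style(text_style_t style) {
    for (size_t i = 0; i < TEXT_COMMAND_COUNT; i++) {
        if (kTextCommands[i].style == style) {
            return kTextCommands[i].name;
        }
    }
    return NULL;
}
```

Keeping a single table means a new `\text*` command only has to be added in one place for both lookups to see it.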
[item 2] feat: MTFontManager +textCTFontForStyle:size: (`abd89fe`)

Returns a CT font configured for system text rendering with optional
bold/italic traits, or a system monospace font for typewriter style.
Caller owns the returned CTFontRef (CF_RETAINED).

No callers yet — wired into MTTextDisplay/typesetter in following commits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[item 3] feat: MTTextDisplay for CoreText system-font runs (`9ceb5b4`)

A new MTDisplay subclass that owns one CTLineRef built from raw text
and a CTFontRef, with an xHeightShift applied at draw time so the
text-font x-height can be aligned with the math x-height. Models
MTCTLineDisplay's lifecycle for color/draw but takes a CTFontRef
directly rather than an MTFont so it can carry a system text font.

Not yet wired into the typesetter — covered by the next chunk.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[item 4] feat: typesetter renders MTTextAtom as MTTextDisplay (`436f286`)

Adds case kMTMathAtomText: to createDisplayAtoms — flushes the current
math line, builds an MTTextDisplay backed by a system text CTFont, and
positions it like any other inline display. Inter-element spacing maps
kMTMathAtomText into the Ord row/column of the existing 8x8 matrix
without changing the matrix itself. Sub/superscripts attach via the
existing makeScripts: path with delta=0 (no italic correction).

Baseline alignment uses mathTable.accentBaseHeight (\fontdimen5 / TeX
x-height) minus CTFontGetXHeight(textFont) so the lowercase x-heights
line up — visually correct for Latin/Cyrillic/Greek; approximate for
CJK/Devanagari per LLD section 5.1.

Tests construct MTTextAtoms programmatically; \text* still flows
through the legacy parser path until the next commit flips the switch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
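
The spacing rule in this commit (text atoms reuse the Ord row and column of the existing 8x8 matrix) can be sketched as an index mapping. This is plain C under stated assumptions: `atom_kind_t`, `spacing_index`, `inter_element_space` and `kDemoSpacing` are hypothetical names, and the table holds only a few illustrative entries (1 thin, 2 medium, 3 thick) rather than the library's actual values.

```c
#include <assert.h>

/* Hypothetical atom kinds. The first eight index the spacing table;
   ATOM_TEXT deliberately has no row or column of its own. */
typedef enum {
    ATOM_ORDINARY, ATOM_LARGE_OP, ATOM_BINARY, ATOM_RELATION,
    ATOM_OPEN, ATOM_CLOSE, ATOM_PUNCTUATION, ATOM_INNER,
    ATOM_TEXT
} atom_kind_t;

#define SPACING_CLASSES 8

/* Illustrative table only: unspecified entries default to 0 (no space). */
static const int kDemoSpacing[SPACING_CLASSES][SPACING_CLASSES] = {
    [ATOM_ORDINARY] = { [ATOM_LARGE_OP] = 1, [ATOM_BINARY] = 2, [ATOM_RELATION] = 3 },
    [ATOM_RELATION] = { [ATOM_ORDINARY] = 3 },
};

/* Map an atom kind to its table index; text borrows Ordinary's index,
   so the table itself never has to grow. */
int spacing_index(atom_kind_t kind) {
    if (kind == ATOM_TEXT) {
        return (int)ATOM_ORDINARY;
    }
    assert((int)kind >= 0 && (int)kind < SPACING_CLASSES);
    return (int)kind;
}

/* Space between two adjacent atoms, looked up in a caller-supplied table. */
int inter_element_space(const int table[SPACING_CLASSES][SPACING_CLASSES],
                        atom_kind_t left, atom_kind_t right) {
    return table[spacing_index(left)][spacing_index(right)];
}
```

With this mapping, `\text{...} + y` is spaced exactly like `x + y`, which is what leaves layout for non-text math unchanged.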
[item 5] feat: parser routes \text* to MTTextAtom (`01e071b`)

\text, \textrm, \textbf, \textit, \textsf, \texttt now build a single
MTTextAtom from the raw {…} body. The new readTextArgument honors
balanced nested {…} as TeX-style grouping (the braces are stripped),
processes the eight standard escapes \\, \{, \}, \_, \^, \%, \&, \#,
\$, and rejects $ and any other backslash sequence as
MTParseErrorInvalidCommand. \textbf{你好}, \text{नमस्ते}, \text{مرحبا}
etc. now render correctly through CoreText cascade fallback.

Removes the six \text* keys (text, textrm, textbf, textit, textsf,
texttt) from MTMathAtomFactory.fontStyles so no fall-through path
remains. \math* keys are preserved — \mathbf{x} etc. still produce
Unicode-math-alphanumeric remapping unchanged.

Updates the existing testText to expect the new MTTextAtom shape and
adds parser/integration tests covering the body-capture grammar
(empty, ASCII, all five styles, CJK, Cyrillic, Devanagari, Hebrew,
Arabic, mixed scripts, escapes, NBSP, nested braces), scripts,
round-tripping, and parse-error cases.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
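
The body-capture grammar described in this commit can be sketched in plain C over a UTF-8 byte string. This is not the library's `readTextArgument`: the function name, status codes and buffer-based signature are hypothetical. It assumes nested braces are stripped as grouping, that an unescaped `$` is rejected, and it also accepts `\<space>`, which a later commit in this PR added. Non-ASCII bytes pass through untouched because every special character here is ASCII.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    TEXT_ARG_OK = 0,
    TEXT_ARG_MISSING_OPEN_BRACE,
    TEXT_ARG_MISSING_CLOSE_BRACE,
    TEXT_ARG_INVALID_COMMAND,
    TEXT_ARG_OVERFLOW
} text_arg_status_t;

/* Characters that may follow a backslash inside a \text* body. */
static const char kEscapable[] = "\\{}_^%&#$ ";

/* Reads a {...} body starting at in[*pos] into out (NUL-terminated).
   On success, *pos is advanced past the closing brace. */
text_arg_status_t read_text_argument(const char *in, size_t *pos,
                                     char *out, size_t out_cap) {
    size_t i = *pos;
    size_t n = 0;
    int depth = 1;

    if (in[i] != '{') {
        return TEXT_ARG_MISSING_OPEN_BRACE;
    }
    i++;  /* consume the opening brace */

    while (in[i] != '\0') {
        char c = in[i++];
        if (c == '\\') {
            char esc = in[i];
            /* strchr would match the terminator, so test for it first. */
            if (esc == '\0' || strchr(kEscapable, esc) == NULL) {
                return TEXT_ARG_INVALID_COMMAND;
            }
            i++;
            c = esc;  /* the escaped character is appended literally */
        } else if (c == '{') {
            depth++;  /* grouping brace: tracked but not copied */
            continue;
        } else if (c == '}') {
            if (--depth == 0) {
                if (n >= out_cap) {
                    return TEXT_ARG_OVERFLOW;
                }
                out[n] = '\0';
                *pos = i;
                return TEXT_ARG_OK;
            }
            continue;
        } else if (c == '$') {
            return TEXT_ARG_INVALID_COMMAND;  /* no math mode inside text */
        }
        if (n + 1 >= out_cap) {
            return TEXT_ARG_OVERFLOW;  /* keep room for the terminator */
        }
        out[n++] = c;
    }
    return TEXT_ARG_MISSING_CLOSE_BRACE;
}
```

For example, the input `{a{b}c}` yields the body `abc`, while `{a\alpha}` stops with `TEXT_ARG_INVALID_COMMAND` because `\a` is not in the escapable set.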
[item 6] refactor: drop Cyrillic carve-out in atomForCharacter: (`d32b192`)

The U+0411-U+044E carve-out was inconsistent: only one Cyrillic block,
off-by-one against U+0410-U+044F, and no equivalent for Greek, Hebrew,
Arabic, or CJK. Now that \text* offers a uniform path for non-Latin
text, the special case is more confusing than useful.

After this commit, raw math input is ASCII U+0021-U+007E only. Cyrillic
must be wrapped in \text*, \textbf{...} etc. (already exercised by
testTextCyrillic*).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
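
The off-by-one mentioned in this commit is easy to see with three range checks. This is an illustrative sketch in plain C; the function names are hypothetical and the ranges are taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* The removed carve-out, with the bounds quoted in the commit message. */
bool in_legacy_carve_out(uint32_t cp) {
    return cp >= 0x0411 && cp <= 0x044E;
}

/* The range it was off by one against: U+0410 (capital A) to U+044F (small ya). */
bool in_basic_cyrillic(uint32_t cp) {
    return cp >= 0x0410 && cp <= 0x044F;
}

/* After the change: raw math input is printable ASCII only. */
bool is_raw_math_input(uint32_t cp) {
    return cp >= 0x0021 && cp <= 0x007E;
}
```

Under the old range, U+0410 and U+044F were the only two letters of the basic alphabet that failed, which is the kind of inconsistency a uniform `\text*` path avoids.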
[item 7] docs(examples): add mixed text + math demonstrators (`23382fc`)

Adds demo entries covering Latin/Cyrillic/Chinese/Devanagari/Hebrew/
Arabic mixed with math, the five \text* styles side-by-side, and a
text block carrying scripts. Visible in iosMathExample,
MacOSMathExample, and SwiftMathExample.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

    self.width = (CGFloat)CTLineGetTypographicBounds(_line, NULL, NULL, NULL);
    CGRect bounds = CTLineGetBoundsWithOptions(_line, kCTLineBoundsUseGlyphPathBounds);
    self.ascent = MAX(0, CGRectGetMaxY(bounds));
Owner Author

Because draw: later sets the CT text position to self.position.y - _xHeightShift, the ink no longer lives in the coordinate space described by these unshifted metrics. For positive shifts, the real descent is -CGRectGetMinY(bounds) + _xHeightShift; displayBounds, label sizing/background fills, and script placement can all undercount the lower ink. Please fold _xHeightShift into the reported ascent/descent here, or expose shifted accessors like the existing shifted glyph displays do.

Owner Author

Already addressed in 3ff02bf: _xHeightShift was removed from MTTextDisplay entirely. draw: now uses self.position.y directly, so the unshifted metrics here are consistent with the draw position and no longer undercount lower ink.
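
For reference, the accounting requested in the first comment of this thread can be sketched as follows. This is plain C with hypothetical types (`ink_bounds_t`, `metrics_t`) standing in for the CGRect glyph bounds. It assumes a y-up coordinate system relative to the baseline and a positive shift that moves the ink down. A shift of zero reproduces the unshifted metrics.

```c
/* Glyph ink extent relative to the baseline, y pointing up. */
typedef struct {
    double min_y;
    double max_y;
} ink_bounds_t;

typedef struct {
    double ascent;
    double descent;
} metrics_t;

static double clamp_non_negative(double v) {
    return v > 0.0 ? v : 0.0;
}

/* Ink drawn at (baseline - shift) sits `shift` lower, so the reported
   ascent shrinks and the reported descent grows by the same amount. */
metrics_t shifted_metrics(ink_bounds_t bounds, double shift) {
    metrics_t m;
    m.ascent = clamp_non_negative(bounds.max_y - shift);
    m.descent = clamp_non_negative(-bounds.min_y + shift);
    return m;
}
```

With ink spanning -2 to 7 and a shift of 1.5, the ascent becomes 5.5 and the descent 3.5, whereas unshifted metrics would report a descent of only 2.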

Comment thread iosMath/lib/MTMathListBuilder.m Outdated
    case '^': case '%': case '&': case '#': case '$':
        [body appendFormat:@"%C", esc];
        break;
    default:
Owner Author

This now rejects \<space> (a backslash followed by a space) inside text bodies. The old text/font-style path accepted that single-character space command, and it is common LaTeX, so existing inputs such as \text{hello\ world} or \textbf{hello\ world} now fail with MTParseErrorInvalidCommand. Please preserve that compatibility, for example by adding case ' ': to append a literal space and covering it with a regression test.

Owner Author

Fixed in f3d4e12. readTextArgument now treats \<space> as a forced literal space (matching the legacy single-char command behavior), and testTextBackslashSpace covers both \text{hello\ world} and \textbf{hello\ world}.

kostub and others added 4 commits May 14, 2026 01:11
The original LLD called for an x-height shift so text x-height aligned
with the math x-height, implemented using MTFontMathTable.accentBaseHeight.
That property is TeX's math axis height (\fontdimen5), not the x-height,
so the shift was both poorly motivated and incorrectly computed.

Drop the xHeightShift parameter from MTTextDisplay and the helper from
MTTypesetter; \text{...} now shares the surrounding math baseline, which
matches TeX semantics. Pin the contract with an equality assertion in
testTextInMixedLine.

Refresh demos 23-30 to "x + \text{...} + y = ..." compositions that make
the baseline alignment visually verifiable across Latin, Cyrillic, CJK,
Devanagari, Arabic, Hebrew, and the five \text* styles.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Share the \text* body's escapable-character set via
  MTTextAtom.latexEscapableCharacterSet so the LaTeX writer
  (MTTextAtom.appendLaTeXToString:) and the parser
  (MTMathListBuilder.readTextArgument) cannot drift apart.
- Add testTextSubAndSuperscript covering combined ^ and _ on a
  text atom (parse + round-trip).
- Expand MTTextStyle doc comments to match the MTFontStyle style.
- Replace the fixture-shaped "x + \\text{...} + y = ..." demo rows
  with realistic compositions (vector-calculus definition, area-
  of-a-circle in five non-Latin scripts, multi-style textbook
  definition, Cyrillic Pythagoras label) and bump demo 23 from
  60 to 70 to fit its new fraction.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
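
The idea of sharing one escapable-character set between the writer and the parser can be sketched in plain C. The names `kTextEscapable`, `text_is_escapable`, `text_escape` and `text_unescape` are hypothetical, not the library's `latexEscapableCharacterSet` API, and the unescape step is simplified: it ignores brace grouping and the `\<space>` case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Single source of truth for characters that need a backslash in a body. */
static const char kTextEscapable[] = "\\{}_^%&#$";

bool text_is_escapable(char c) {
    return c != '\0' && strchr(kTextEscapable, c) != NULL;
}

/* Writer side: escape `text` into `out`. Returns false if `out` is too small. */
bool text_escape(const char *text, char *out, size_t cap) {
    size_t n = 0;
    for (; *text != '\0'; text++) {
        size_t need = text_is_escapable(*text) ? 2 : 1;
        if (n + need >= cap) {
            return false;  /* keep room for the terminator */
        }
        if (need == 2) {
            out[n++] = '\\';
        }
        out[n++] = *text;
    }
    if (n >= cap) {
        return false;
    }
    out[n] = '\0';
    return true;
}

/* Parser side: undo text_escape using the same predicate. */
bool text_unescape(const char *latex, char *out, size_t cap) {
    size_t n = 0;
    for (; *latex != '\0'; latex++) {
        char c = *latex;
        if (c == '\\') {
            c = *++latex;
            if (!text_is_escapable(c)) {
                return false;  /* unknown escape or trailing backslash */
            }
        }
        if (n + 1 >= cap) {
            return false;
        }
        out[n++] = c;
    }
    if (n >= cap) {
        return false;
    }
    out[n] = '\0';
    return true;
}
```

Because both directions consult `text_is_escapable`, adding a character to the set changes the writer and the parser together, so a round trip keeps returning the original text.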
CTFontCreateUIFontForLanguage with kCTFontUIFontSystem /
kCTFontUIFontUserFixedPitch does not return NULL on supported
platforms. The dead fallback also bypassed the bold/italic
trait-application that follows, so a hit would have silently
served unstyled Helvetica.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The legacy text path treated \<space> as a single-character command that
mapped to an ordinary atom with a literal space, so inputs like
\text{hello\ world} and \textbf{hello\ world} worked. The new
\text* parser in readTextArgument restricts backslash escapes to the
\{}_^%&#$ set, so those inputs now fail with MTParseErrorInvalidCommand.

Accept \<space> as a forced literal space in \text* bodies and add a
regression test covering both \text and \textbf.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kostub (Owner, Author) commented May 14, 2026

@gemini-code-assist review code

kostub (Owner, Author) commented May 14, 2026

/gemini review

gemini-code-assist (Bot) left a comment

Code Review

This pull request implements native support for \text* commands by introducing MTTextAtom and MTTextDisplay, allowing verbatim Unicode text capture and rendering via system fonts. The parser and typesetter were updated to handle these blocks, including support for non-Latin scripts. Feedback highlights a potential crash on iOS 6 from using CTFontCreateUIFontForLanguage and a regression where braces are now strictly required for \text* arguments, breaking standard LaTeX single-character compatibility.

break;
}

CTFontRef baseFont = CTFontCreateUIFontForLanguage(base, size, NULL);

high

CTFontCreateUIFontForLanguage was introduced in iOS 7.0. Since the library maintains compatibility with iOS 6.0 (as indicated by isIos6Supported in MTMathListDisplay.m), using this function will cause a crash on older systems. CTFontCreateUIFontType is a safer alternative available since iOS 3.0 and macOS 10.5.

    CTFontRef baseFont = CTFontCreateUIFontType(base, size, NULL);

Owner Author

@gemini-code-assist Current repo support is iOS 18+ and macOS 15+, not iOS 6

You are correct, my apologies for the oversight regarding the minimum supported OS version. Given that the project now targets iOS 18+ and macOS 15+, CTFontCreateUIFontForLanguage is indeed the appropriate and modern API to use. Thank you for the clarification.

Comment thread iosMath/lib/MTMathListBuilder.m Outdated
Comment on lines +303 to +311
    if (![self hasCharacters] || [self getNextCharacter] != '{') {
        // Roll the position back so the error highlights the right spot.
        if ([self hasCharacters] || _currentChar > 0) {
            [self unlookCharacter];
        }
        [self setError:MTParseErrorCharacterNotFound
               message:@"Missing { for \\text* argument"];
        return nil;
    }

medium

The new \text* commands now strictly require an opening brace {. In standard LaTeX, and in previous versions of this library for font styles, commands like \textbf can be used without braces for a single following character (e.g., \textbf x). This change breaks compatibility for such inputs. Consider allowing a single character or command as an argument if no opening brace is found, while maintaining the raw capture logic for the braced case.

kostub (Owner, Author) commented May 14, 2026

Review findings

  1. iosMath/lib/MTMathList.m: +[MTMathAtom atomWithType:value:] does not create an MTTextAtom for kMTMathAtomText. The public factory falls through to a plain MTMathAtom, but MTTypesetter later casts every text atom to MTTextAtom and reads textStyle/text. Programmatic callers using [MTMathAtom atomWithType:kMTMathAtomText value:@"..."] will crash at render time. Please add a factory case returning [[MTTextAtom alloc] initWithText:value ?: @"" style:kMTTextStyleRoman] and cover it with a test.

  2. iosMath/lib/MTMathList.h: MTTextAtom.text is mutable, but MTMathList.m only syncs text to nucleus during init/copy/finalize. If a caller mutates text, rendering and LaTeX serialization use the new value while stringValue, description, and consumers reading nucleus still see the old value. Please either make text readonly or override setters to keep text and nucleus consistent.

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for text-mode commands such as \text, \textbf, and \textit by implementing a new MTTextAtom and MTTextDisplay rendering path using CoreText. This allows for raw Unicode text runs, including non-Latin scripts like Arabic and Chinese, to be rendered within math equations. The feedback identifies potential crashes on older iOS versions due to the use of APIs like CTLineGetBoundsWithOptions (iOS 12.0+) and CTFontCreateUIFontForLanguage (iOS 7.0), and suggests improving LaTeX compatibility by allowing single-character arguments without braces.

(__bridge CFAttributedStringRef)_attributedString);

self.width = (CGFloat)CTLineGetTypographicBounds(_line, NULL, NULL, NULL);
CGRect bounds = CTLineGetBoundsWithOptions(_line, kCTLineBoundsUseGlyphPathBounds);

high

CTLineGetBoundsWithOptions is only available on iOS 12.0+ and macOS 10.14+. Since this library includes compatibility checks for older versions (e.g., isIos6Supported), this call will cause a crash on devices running older OS versions. Consider adding a version guard and providing a fallback mechanism for calculating glyph bounds, similar to the computeDimensions: method used in MTCTLineDisplay.

Owner Author

@gemini-code-assist Current repo support is iOS 18+ and macOS 15+, not iOS 6.

Thank you for the clarification regarding the supported OS versions. Since the project targets iOS 18+ and macOS 15+, CTLineGetBoundsWithOptions is indeed safe to use. I apologize for the confusion regarding the legacy support constraints.

Comment thread iosMath/lib/MTMathListBuilder.m Outdated
Comment on lines +303 to +311
    if (![self hasCharacters] || [self getNextCharacter] != '{') {
        // Roll the position back so the error highlights the right spot.
        if ([self hasCharacters] || _currentChar > 0) {
            [self unlookCharacter];
        }
        [self setError:MTParseErrorCharacterNotFound
               message:@"Missing { for \\text* argument"];
        return nil;
    }

medium

The readTextArgument method strictly expects an opening brace {. In standard LaTeX, commands like \textbf or \textit can also accept a single character or token as an argument without braces (e.g., \textbf x). This implementation change may break existing LaTeX strings that do not use braces for these commands. If full LaTeX compatibility is desired, consider allowing a single-token fallback if the next character is not {.

break;
}

CTFontRef baseFont = CTFontCreateUIFontForLanguage(base, size, NULL);

medium

CTFontCreateUIFontForLanguage was introduced in iOS 7.0. If the library is intended to support iOS 6 (as suggested by isIos6Supported in MTMathListDisplay.m), this call will result in a crash on those devices. Consider using CTFontCreateWithName with a fallback to the system font name if the current OS version is below 7.0.

Owner Author

@gemini-code-assist Current repo support is iOS 18+ and macOS 15+, not iOS 6.

Thank you for the clarification, kostub. I appreciate the context regarding the project's supported platform versions. Since the minimum deployment target is iOS 18+ and macOS 15+, the use of CTFontCreateUIFontForLanguage is perfectly appropriate. I will keep this in mind for future reviews.

@kostub kostub merged commit 89a1881 into master May 14, 2026
1 check passed
@kostub kostub deleted the worktree-non-latin branch May 14, 2026 07:46