Consume @technical-1/email-archive-parser v3 (retire duplicated parser/detector code) #1

Open
Technical-1 wants to merge 9 commits into main from consume-email-archive-parser
Technical-1 (Owner) commented Jun 2, 2026

Summary

Replaces EmailAnalyzer's duplicated, drift-prone parser/detector logic with the published library @technical-1/email-archive-parser@^3.0.0, ending the fork drift. The Web Worker keeps its streaming/messaging shell and IndexedDB persistence; parsing + detection now come from the library.

What changed:

  • Worker (parserWorker.ts): MBOX, OLM, and Gmail-Takeout paths now delegate to the library's MBOXParser/OLMParser via a new toAppEmail adapter. All inline parsing/MIME helpers removed (~−950 lines net across the migration).
  • Detection (importPipeline.ts): uses the library's four detectors; small createXFromEmail factories inlined (with lastActivityDate parity preserved).
  • Gmail Takeout: the .zip.mbox walk + dedup + folder mapping stays app-side (packaging logic) and delegates per-.mbox to the library.
  • Nullable dates: adopts the library's Date | null semantics app-wide (no more fabricated now() for missing headers); undated emails are guarded everywhere and excluded from time-based aggregations.
  • Deletions: removed 8 duplicated service modules and 13 detector/parser unit-test files (72 tests; the library owns that testing now); pruned dead utils.

Inherited for free from library v3: correct non-US locale money parsing (€1.234,56), honest subscription monthlyAmount/frequency, byte-accurate Email.size, tightened newsletter heuristics, and the nullable-date correctness.
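
To illustrate the locale issue only, here is a minimal sketch of one common heuristic. It is not the library's implementation, and `parseLocaleAmount` is a hypothetical helper: the last `.` or `,` is treated as the decimal separator when one or two digits follow it, and every other separator is treated as grouping.

```typescript
// Hypothetical helper, NOT part of @technical-1/email-archive-parser.
// Heuristic: the last '.' or ',' is the decimal separator when exactly
// one or two digits follow it; all other separators are grouping marks.
function parseLocaleAmount(raw: string): number | null {
  // Keep only digits and the two candidate separators.
  const cleaned = raw.replace(/[^\d.,]/g, "");
  if (!/\d/.test(cleaned)) return null;

  const lastSep = Math.max(cleaned.lastIndexOf("."), cleaned.lastIndexOf(","));
  const digitsAfter = lastSep === -1 ? 0 : cleaned.length - lastSep - 1;
  const hasDecimals = lastSep !== -1 && digitsAfter >= 1 && digitsAfter <= 2;

  // Strip grouping separators from the integer part.
  const integerPart = (hasDecimals ? cleaned.slice(0, lastSep) : cleaned).replace(/[.,]/g, "");
  const fractionPart = hasDecimals ? cleaned.slice(lastSep + 1) : "";

  const value = Number(fractionPart ? `${integerPart}.${fractionPart}` : integerPart);
  return isFinite(value) ? value : null;
}

console.log(parseLocaleAmount("€1.234,56")); // 1234.56
console.log(parseLocaleAmount("$1,234.56")); // 1234.56
```

The heuristic is ambiguous for inputs such as "1.234", which it reads as 1234. A real parser would probably need currency or locale hints to resolve those.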

Test Plan

  • npx tsc -b clean
  • npm run build (tsc + vite) succeeds; library bundles into the parserWorker chunk
  • npm run test:run — 257 pass, 0 fail (count down from 329: 72 duplicated detector/parser unit tests removed)
  • Manual E2E before merge (worker path is not unit-tested): in npm run dev, upload (a) a small .mbox, (b) a .olm, (c) a Gmail Takeout .zip; confirm progress advances, emails/contacts/subscriptions/newsletters/purchases populate IndexedDB, an email with a missing Date: header shows "Unknown date" (not 1970/today), and a €1.234,56 subscription stores 1234.56.

Notes / follow-ups

  • Library parseStreaming has no cancellation hook, so "cancel" stops importing but the library finishes parsing the current file in the background (documented in-code; candidate for a future AbortSignal lib enhancement).
  • A sender/thread whose emails ALL lack dates renders "Jan 1, 1970" in a couple of spots (those fields are still non-null Date); this is a rare edge case and was left as is to keep the change minimal.
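
The cancellation behaviour in the first note can be sketched as a wrapper around the batch callback. This is a sketch under assumptions: `createCancellableHandler` and `BatchHandler` are illustrative names, not library APIs, and the parser itself keeps running after `cancel()`. The wrapper only stops forwarding batches.

```typescript
// Illustrative only: the names here are assumptions, not library APIs.
type BatchHandler<T> = (batch: T[]) => Promise<void> | void;

function createCancellableHandler<T>(onBatch: BatchHandler<T>) {
  let cancelled = false;
  let dropped = 0; // items parsed after cancel() and discarded

  return {
    cancel(): void {
      cancelled = true;
    },
    droppedCount(): number {
      return dropped;
    },
    // Pass this as the streaming callback; it becomes a no-op once cancelled.
    handler(batch: T[]): Promise<void> | void {
      if (cancelled) {
        dropped += batch.length;
        return;
      }
      return onBatch(batch);
    },
  };
}
```

An AbortSignal hook in the library would let the parse itself stop early. Until then, the cost of a cancel is the remaining parse time for the current file.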

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Added graceful handling for emails with missing dates—now displays "Unknown date" instead of breaking displays.
    • Improved date filtering and sorting to safely handle null or undefined timestamps.
  • Chores

    • Updated email archive parsing dependency.
    • Refactored internal email processing pipeline.

vercel Bot commented Jun 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| email-analyzer | Ready | Preview, Comment | Jun 2, 2026 6:40pm |

coderabbitai Bot commented Jun 2, 2026

Walkthrough

This PR migrates the email analytics app from in-house email parsing and detection logic to an external @technical-1/email-archive-parser library. The change introduces nullable dates throughout the system, updates the type contract, refactors the worker and import pipeline to delegate to library parsers and detectors, and adds comprehensive null-safe date handling across all UI, service, and database layers.

Changes

Migrate to external library with nullable date support

Layer / File(s): Summary

  • Type system foundation for nullable dates (web/src/types/index.ts): Email.date, Account.signupDate, Account.lastActivityDate, Contact.lastEmailDate, Subscription.lastRenewalDate, and Newsletter.lastEmailDate are now Date | null instead of non-null Date.
  • Database persistence for nullable dates (web/src/db/database.ts): DBEmail, DBAccount, DBContact, DBSubscription, and DBNewsletter store date fields as number | null. Insert functions persist null when source dates are absent; mapping functions convert null storage values back to null (not Date) and only construct Date objects for non-null timestamps. updateContactEmailCount signature updated to accept lastDate: Date | null.
  • External library integration and mapper (web/package.json, web/src/workers/toAppEmail.ts, web/src/services/__tests__/library-smoke.test.ts): Added @technical-1/email-archive-parser dependency. New toAppEmail() maps library Email to app Email row shape, deriving snippet, defaulting attachments to [], setting emailType to 'regular', and preserving date nullability. Smoke test verifies integration.
  • Worker refactoring to use library parsers (web/src/workers/parserWorker.ts): parseMBOXFile delegates to MBOXParser.parseStreaming, batches via toAppEmail, and tracks progress. parseOLMFile uses OLMParser().parse with compressed-size enforcement, batches parsed emails, and yields between batches. parseGmailTakeoutFile parses Takeout MBOX entries via library, deduplicates using toAppEmail output, maps folder names, and always sends final empty batch with isLast: true.
  • Import pipeline detection refactoring (web/src/services/importPipeline.ts): Uses instantiated detectors from library (AccountDetector, PurchaseDetector, SubscriptionDetector, NewsletterDetector). Purchase deduping now requires non-null email.date. Subscription renewal checks safely handle missing dates. Newsletter lastEmailDate updates only advance for non-null dates. Batch date coercion preserves null (instead of forcing new Date(...)).
  • Service layer null-safe date handling (web/src/services/backupService.ts, web/src/services/threadingService.ts, web/src/services/searchParser.ts): exportBackup converts dates to Date | null based on presence. ThreadingService.createThread computes firstMessageDate/lastMessageDate from emails with valid dates only, defaulting to epoch. Date-filter searches in filterEmails exclude undated emails early.
  • UI layer null-safe date rendering (web/src/components/*, web/src/pages/*): All components and pages conditionally format dates and display 'Unknown date' when absent. Sorting comparators use optional chaining with -Infinity fallback for missing dates. Examples: EmailCard, ThreadView, ContactModal, AnalyticsPage, AttachmentsPage, ContactsPage, EmailDetailPage, EmailsPage, NewslettersPage, SendersPage, SubscriptionsPage, SenderEmailsPage, AccountsPage, HomePage.
  • Test fixture updates for nullable dates (web/src/__tests__/phase-9/*): Regression test fixtures use storeEmail.date! non-null assertions where dates are required. Deprecated test suites for in-house mboxParser, accountDetector, purchaseDetector, subscriptionDetector, newsletterDetector, and domainMatch were removed.
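
The UI-layer pattern summarized above (an "Unknown date" label plus comparators with a -Infinity fallback) can be sketched roughly as follows. The helper names and the `Dated` shape are assumptions for illustration, not the app's actual code.

```typescript
// Illustrative helpers; names and shapes are assumptions.
interface Dated {
  date: Date | null;
}

// Render a date, falling back to the "Unknown date" label for undated
// (or invalid) values.
function formatEmailDate(date: Date | null | undefined): string {
  if (!date || isNaN(date.getTime())) return "Unknown date";
  return date.toISOString().slice(0, 10);
}

// Newest first; undated items sort last because they compare as -Infinity.
function byDateDesc(a: Dated, b: Dated): number {
  const ta = a.date?.getTime() ?? -Infinity;
  const tb = b.date?.getTime() ?? -Infinity;
  if (ta === tb) return 0; // also avoids -Infinity - -Infinity, which is NaN
  return tb > ta ? 1 : -1;
}
```

The comparator returns explicit signs rather than `tb - ta`, because subtracting two -Infinity fallbacks yields NaN and gives an inconsistent sort order.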

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 The parsers have left for the library shelf,
Dates now dance with null in the code itself,
From database to UI, a safety net spread,
"Unknown date" whispers when timestamps have fled.
One library to parse them, one contract to keep—
The refactor is done; now we reap what we sowed deep.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.17% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main objective: consuming the @technical-1/email-archive-parser v3 library and retiring duplicated parser/detector code. It is specific and clearly conveys the primary change. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
web/src/db/database.ts (1)

218-221: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Fix Dexie orderBy('date') silently dropping undated emails (date: null).

web/src/db/database.ts (getEmails and similarly getEmailHeaders/folder paginated queries) relies on orderBy('date') over a secondary index. Dexie treats secondary indices as sparse, so records whose indexed key value is null/undefined (non-indexable) are omitted from orderBy() results. That means imported emails with date: null won’t appear in getEmails()/getEmailHeaders() and will also be missing from getEmailsByFolderPaginated() where [folderId+date] is used, preventing the “Unknown date” UI goal.

Options to unblock:

  • Ensure the indexed date column is always indexable at write time (e.g., store a sentinel like 0/-Infinity for “unknown” and map it back in the app), or
  • Keep a separate non-indexed nullable domain field for “unknown” and base sorting/filtering on an indexable value, or
  • Avoid using the date index for list loading—fetch and sort in-memory.

Add/confirm an import test case for “undated” emails to ensure they remain visible after persistence and sorting.
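
A minimal sketch of the first option (an indexable sentinel) follows. The field name `dateIndex` and both helpers are hypothetical. -Infinity is chosen here because it is a valid IndexedDB key and cannot collide with a real timestamp the way 0 can.

```typescript
// Hypothetical mapping between the domain value (Date | null) and an
// always-indexable numeric key. Not taken from the PR's code.
const UNKNOWN_DATE_KEY = -Infinity;

// Write path: every record gets an indexable number.
function toDateIndex(date: Date | null): number {
  return date ? date.getTime() : UNKNOWN_DATE_KEY;
}

// Read path: the sentinel maps back to "unknown".
function fromDateIndex(key: number): Date | null {
  return key === UNKNOWN_DATE_KEY ? null : new Date(key);
}

// Example row shape: `date` stays nullable for the domain,
// `dateIndex` is what the index and orderBy() would use.
interface IndexedEmailRow {
  id: number;
  date: number | null;
  dateIndex: number;
}

function toRow(id: number, date: Date | null): IndexedEmailRow {
  return { id, date: date ? date.getTime() : null, dateIndex: toDateIndex(date) };
}
```

One caveat with this choice: JSON cannot represent -Infinity (JSON.stringify turns it into null), so a JSON backup path would need to export the nullable `date` field rather than `dateIndex`.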

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/src/db/database.ts` around lines 218 - 221, getEmails (and related
loaders getEmailHeaders / getEmailsByFolderPaginated) currently use Dexie
secondary index ordering (db.emails.orderBy('date') /
orderBy('[folderId+date]')) which silently omits records with date null; to fix,
persist an indexable sentinel value for missing dates (e.g., add/maintain a
dateIndex numeric field set to epoch 0 or -Infinity when date is null and
index/sort on dateIndex) or alter those loaders to read all relevant records and
perform an in-memory sort by a mapped date (treating null as oldest) instead of
relying on the sparse index; update the read/write code paths that create Email
records (where date is set) to populate the new indexable field and add/adjust
tests to assert imported “undated” emails remain visible and correctly ordered
after persistence and retrieval.
web/src/services/backupService.ts (1)

327-391: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Import path corrupts nullable dates to epoch.

The export path (lines 100, 116, 136, 167, 179) correctly preserves null dates, but the import path calls toTimestamp without null guards. Since toTimestamp (line 445) executes new Date(null).getTime(), it converts null to 0 (epoch), breaking round-trip integrity.

Impact: A backup exported with email.date = null will be imported as date = 0 (Jan 1, 1970), corrupting the nullable-date contract and causing "Unknown date" emails to display as 1970.

🛡️ Proposed fix: guard toTimestamp calls with null checks
 const dbEmails: DBEmail[] | null = emails && emails.length > 0
-  ? (emails.map((e) => ({ ...e, date: this.toTimestamp(e.date, 'email.date') })) as DBEmail[])
+  ? (emails.map((e) => ({ ...e, date: e.date == null ? null : this.toTimestamp(e.date, 'email.date') })) as DBEmail[])
   : null;
 const dbAccounts: DBAccount[] | null = accounts && accounts.length > 0
   ? (accounts.map((a) => ({
       ...a,
-      signupDate: this.toTimestamp(a.signupDate, 'account.signupDate'),
+      signupDate: a.signupDate == null ? null : this.toTimestamp(a.signupDate, 'account.signupDate'),
       lastActivityDate: a.lastActivityDate
         ? this.toTimestamp(a.lastActivityDate, 'account.lastActivityDate')
         : undefined,
     })) as unknown as DBAccount[])
   : null;
 const dbContacts: DBContact[] | null = contacts && contacts.length > 0
   ? (contacts.map((c) => ({
       ...c,
-      lastEmailDate: this.toTimestamp(c.lastEmailDate, 'contact.lastEmailDate'),
+      lastEmailDate: c.lastEmailDate == null ? null : this.toTimestamp(c.lastEmailDate, 'contact.lastEmailDate'),
     })) as DBContact[])
   : null;
 const dbSubscriptions: DBSubscription[] | null = subscriptions && subscriptions.length > 0
   ? (subscriptions.map((s) => ({
       ...s,
-      lastRenewalDate: this.toTimestamp(s.lastRenewalDate, 'subscription.lastRenewalDate'),
+      lastRenewalDate: s.lastRenewalDate == null ? null : this.toTimestamp(s.lastRenewalDate, 'subscription.lastRenewalDate'),
       nextRenewalDate: s.nextRenewalDate
         ? this.toTimestamp(s.nextRenewalDate, 'subscription.nextRenewalDate')
         : undefined,
       emailIds: JSON.stringify(s.emailIds || []),
     })) as unknown as DBSubscription[])
   : null;
 const dbNewsletters: DBNewsletter[] | null = newsletters && newsletters.length > 0
   ? (newsletters.map((n) => ({
       ...n,
-      lastEmailDate: this.toTimestamp(n.lastEmailDate, 'newsletter.lastEmailDate'),
+      lastEmailDate: n.lastEmailDate == null ? null : this.toTimestamp(n.lastEmailDate, 'newsletter.lastEmailDate'),
     })) as DBNewsletter[])
   : null;
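
The coercion behind this finding can be reproduced in isolation. In the sketch below, `unguardedTimestamp` mimics the failure mode and `toTimestampOrNull` is a hypothetical guarded wrapper. Neither is the real `toTimestamp` from backupService.ts, which is not reproduced here.

```typescript
// Mimics the failure mode: new Date(null) is the epoch, so this yields 0.
function unguardedTimestamp(value: unknown): number {
  return new Date(value as number).getTime();
}

// Hypothetical guarded variant: null/undefined stay null instead of
// being coerced to 0. Treating unparseable input as null is a design
// choice of this sketch, not something the PR specifies.
function toTimestampOrNull(value: Date | string | number | null | undefined): number | null {
  if (value == null) return null;
  const ms = value instanceof Date ? value.getTime() : new Date(value).getTime();
  return isNaN(ms) ? null : ms;
}

console.log(unguardedTimestamp(null)); // 0, i.e. Jan 1, 1970
console.log(toTimestampOrNull(null)); // null
```
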
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/src/services/backupService.ts` around lines 327 - 391, The import mapping
calls this.toTimestamp unguarded and new Date(null) turns null into 0; update
each mapping in the import path (the dbEmails, dbAccounts, dbPurchases,
dbContacts, dbEvents, dbFolders, dbSubscriptions, dbNewsletters mappings shown)
to first check for null/undefined before calling this.toTimestamp and preserve
null/undefined values (e.g., if e.date == null leave as null, if
a.lastActivityDate == null leave undefined/null) so toTimestamp is only invoked
for actual date values; ensure usages reference the toTimestamp method name
exactly and keep JSON.stringify for emailIds unchanged.
🧹 Nitpick comments (1)
web/src/pages/AnalyticsPage.tsx (1)

18-19: 💤 Low value

Consider simplifying redundant Date wrapping.

After filtering email.date != null, TypeScript knows email.date is a Date object, so wrapping it in new Date(...) creates an unnecessary copy:

  • Line 19: new Date(email.date).getFullYear() → email.date.getFullYear()
  • Line 27: new Date(e.date).getFullYear() → e.date.getFullYear()
  • Line 40: new Date(e.date) >= thirtyDaysAgo → e.date >= thirtyDaysAgo
  • Line 48: new Date(e.date as Date).getTime() → e.date.getTime() (also removes redundant type assertion)
  • Line 70: new Date(email.date) → email.date

The null guards themselves are correct and essential.

♻️ Proposed simplification
   emails.forEach((email) => {
     if (!email.date) return; // undated emails contribute no year
-    years.add(new Date(email.date).getFullYear());
+    years.add(email.date.getFullYear());
   });
   if (selectedYear === 'all') return emails;
-  return emails.filter(e => e.date != null && new Date(e.date).getFullYear() === selectedYear);
+  return emails.filter(e => e.date != null && e.date.getFullYear() === selectedYear);
-  const recentEmails = filteredEmails.filter(e => e.date != null && new Date(e.date) >= thirtyDaysAgo);
+  const recentEmails = filteredEmails.filter(e => e.date != null && e.date >= thirtyDaysAgo);
   const sortedDates = filteredEmails
     .filter(e => e.date != null)
-    .map(e => new Date(e.date as Date).getTime())
+    .map(e => e.date.getTime())
     .sort((a, b) => a - b);
   filteredEmails.forEach((email) => {
     if (!email.date) return; // undated emails excluded from volume aggregation
-    const date = new Date(email.date);
+    const date = email.date;
     const key = `${date.getFullYear()}-${String(date.getMonth() + 1).padStart(2, '0')}`;
   filteredEmails.forEach((email) => {
     if (!email.date) return; // undated emails excluded from activity heatmap
-    const date = new Date(email.date);
+    const date = email.date;
     const day = date.getDay();

Also applies to: 27-27, 40-40, 46-49, 69-71, 135-136
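
One caveat worth checking against the project's compiler settings: with strictNullChecks, an inline property check such as `.filter(e => e.date != null)` generally does not narrow the element type for a following `.map`, so `e.date.getTime()` may still be reported as possibly null. A user-defined type guard is one way to drop both the `new Date(...)` copy and the assertion. `EmailLike` and `hasDate` below are illustrative names, not the app's code.

```typescript
// Minimal stand-in for the app's Email type; only `date` matters here.
interface EmailLike {
  id: number;
  date: Date | null;
}

// User-defined type guard: tells the compiler that `date` is non-null.
function hasDate(e: EmailLike): e is EmailLike & { date: Date } {
  return e.date != null;
}

function sortedTimestamps(emails: EmailLike[]): number[] {
  return emails
    .filter(hasDate) // element type is now EmailLike & { date: Date }
    .map((e) => e.date.getTime()) // no `as Date` or `!` needed
    .sort((a, b) => a - b);
}
```

Inside a single callback, an early `if (!email.date) return;` does narrow `email.date`, so the forEach suggestions above are unaffected.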

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/src/pages/AnalyticsPage.tsx` around lines 18 - 19, Replace redundant new
Date(...) wrappings on email date values with direct Date property access: in
the block that adds to years use years.add(email.date.getFullYear()) instead of
new Date(email.date).getFullYear(); in any map/filter using e.date use
e.date.getFullYear() and comparisons like e.date >= thirtyDaysAgo instead of new
Date(e.date) >= thirtyDaysAgo; replace new Date(e.date as Date).getTime() with
e.date.getTime(); and use email.date directly where a Date is needed (e.g.,
remove new Date(email.date)). Keep the existing null/undefined guards on
email.date but remove the extra Date construction and the redundant type
assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@web/src/db/database.ts`:
- Around line 18-23: DBEmail.date being nullable breaks existing indexes (used
by orderBy('date') in getEmails/getEmailHeaders) because indexed fields cannot
be null; fix by keeping the indexed field numeric and mapping nulls to a
sentinel before storage: change DBEmail.date to number (timestamp) for
storage/indexing or add a separate indexed field (e.g., indexedDate: number) and
ensure the persistence layer (where emails are created/updated) converts null ->
0 (or another sentinel) and reads 0 back to null when returning Email objects;
update any index definitions and references (orderBy('date') or
orderBy('indexedDate')) and the getEmails/getEmailHeaders plumbing to use the
indexed numeric field.

In `@web/src/workers/parserWorker.ts`:
- Around line 229-244: The dedup key currently uses e.threadId which collapses
Gmail conversation-level threads and drops distinct messages; update the key
generation in the parser.parseStreaming callback to prefer the message-id from
the libEmail (e.g., use libEmail.messageId or e.messageId) instead of threadId,
falling back to the existing `${e.subject}|${e.sender}|...` composite when
message-id is missing; modify the line that assigns key (and any related
seenEmailKeys logic) to use messageId-first deduplication so each unique message
is preserved.

In `@web/src/workers/toAppEmail.ts`:
- Line 19: The mapper in toAppEmail.ts currently hard-codes attachments: [],
dropping parsed attachments; update the mapper that converts LibEmail -> Email
(the toAppEmail function) to map LibEmail.attachments into the app Attachment
shape instead of an empty array, preserving fields like filename, contentType,
size and including the attachment base64 payload (e.g., data) when present so
the persistence layer (insertEmail / bulkInsertEmails which split
email.attachments and emailBodies.attachmentData) can store metadata and payload
correctly; ensure the mapped property names match the app Attachment/interface
expected by the database layer.
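
A rough sketch of such a mapper is shown below. Both interfaces are assumptions made for illustration; the field names come from the comment above, and they would need to be checked against the library's exported types and the app's Attachment interface.

```typescript
// Assumed shapes; verify against the real library and app types.
interface LibAttachmentLike {
  filename?: string;
  contentType?: string;
  size?: number;
  data?: string; // base64 payload, when the parser provides one
}

interface AppAttachment {
  filename: string;
  contentType: string;
  size: number;
  data?: string;
}

function mapAttachments(source: LibAttachmentLike[] | undefined): AppAttachment[] {
  return (source ?? []).map((a, i) => ({
    filename: a.filename ?? `attachment-${i + 1}`,
    contentType: a.contentType ?? "application/octet-stream",
    size: a.size ?? 0,
    // Only carry the payload through when it is actually present.
    ...(a.data !== undefined ? { data: a.data } : {}),
  }));
}
```
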


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb48665c-6ad2-4cf0-8325-5b0157007238

📥 Commits

Reviewing files that changed from the base of the PR and between 6f0ecac and 4435893.

⛔ Files ignored due to path filters (1)
  • web/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (50)
  • web/package.json
  • web/src/__tests__/phase-7/mboxParser.test.ts
  • web/src/__tests__/phase-9/accountDetector.detect.test.ts
  • web/src/__tests__/phase-9/accountDetector.domain.test.ts
  • web/src/__tests__/phase-9/bucket-d-regression.test.tsx
  • web/src/__tests__/phase-9/domainMatch.test.ts
  • web/src/__tests__/phase-9/newsletterDetector.classify.test.ts
  • web/src/__tests__/phase-9/newsletterDetector.domain.test.ts
  • web/src/__tests__/phase-9/purchaseDetector.currency.test.ts
  • web/src/__tests__/phase-9/purchaseDetector.detect.test.ts
  • web/src/__tests__/phase-9/purchaseDetector.domain.test.ts
  • web/src/__tests__/phase-9/purchaseDetector.locale.test.ts
  • web/src/__tests__/phase-9/snippet-render.test.tsx
  • web/src/__tests__/phase-9/subscriptionDetector.billing.test.ts
  • web/src/__tests__/phase-9/subscriptionDetector.detect.test.ts
  • web/src/__tests__/phase-9/subscriptionDetector.domain.test.ts
  • web/src/components/AttachmentGallery.tsx
  • web/src/components/ContactModal.tsx
  • web/src/components/EmailCard.tsx
  • web/src/components/ThreadView.tsx
  • web/src/db/database.ts
  • web/src/pages/AccountsPage.tsx
  • web/src/pages/AnalyticsPage.tsx
  • web/src/pages/AttachmentsPage.tsx
  • web/src/pages/ContactsPage.tsx
  • web/src/pages/EmailDetailPage.tsx
  • web/src/pages/EmailsPage.tsx
  • web/src/pages/HomePage.tsx
  • web/src/pages/NewslettersPage.tsx
  • web/src/pages/SenderEmailsPage.tsx
  • web/src/pages/SendersPage.tsx
  • web/src/pages/SubscriptionsPage.tsx
  • web/src/services/__tests__/library-smoke.test.ts
  • web/src/services/accountDetector.ts
  • web/src/services/backupService.ts
  • web/src/services/domainMatch.ts
  • web/src/services/gmailTakeoutParser.ts
  • web/src/services/importPipeline.ts
  • web/src/services/mboxParser.ts
  • web/src/services/newsletterDetector.ts
  • web/src/services/olmParser.ts
  • web/src/services/purchaseDetector.ts
  • web/src/services/searchParser.ts
  • web/src/services/subscriptionDetector.ts
  • web/src/services/threadingService.ts
  • web/src/types/index.ts
  • web/src/utils/emailUtils.ts
  • web/src/workers/__tests__/toAppEmail.test.ts
  • web/src/workers/parserWorker.ts
  • web/src/workers/toAppEmail.ts
💤 Files with no reviewable changes (22)
  • web/src/__tests__/phase-9/domainMatch.test.ts
  • web/src/services/subscriptionDetector.ts
  • web/src/__tests__/phase-9/subscriptionDetector.detect.test.ts
  • web/src/services/accountDetector.ts
  • web/src/__tests__/phase-9/accountDetector.detect.test.ts
  • web/src/services/domainMatch.ts
  • web/src/__tests__/phase-9/accountDetector.domain.test.ts
  • web/src/__tests__/phase-9/purchaseDetector.locale.test.ts
  • web/src/__tests__/phase-9/purchaseDetector.detect.test.ts
  • web/src/__tests__/phase-9/purchaseDetector.domain.test.ts
  • web/src/services/newsletterDetector.ts
  • web/src/services/mboxParser.ts
  • web/src/services/purchaseDetector.ts
  • web/src/services/olmParser.ts
  • web/src/__tests__/phase-9/purchaseDetector.currency.test.ts
  • web/src/services/gmailTakeoutParser.ts
  • web/src/__tests__/phase-9/newsletterDetector.domain.test.ts
  • web/src/__tests__/phase-9/subscriptionDetector.billing.test.ts
  • web/src/__tests__/phase-7/mboxParser.test.ts
  • web/src/__tests__/phase-9/subscriptionDetector.domain.test.ts
  • web/src/__tests__/phase-9/newsletterDetector.classify.test.ts
  • web/src/utils/emailUtils.ts

Comment thread web/src/db/database.ts
Comment on lines 18 to 23
 export interface DBEmail extends Omit<Email, 'date' | 'body' | 'htmlBody'> {
   id: number;
-  date: number;
+  date: number | null;
   body?: string;
   htmlBody?: string;
 }

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

date remains indexed while becoming nullable — see the orderBy('date') exclusion flagged above.

The DBEmail.date: number | null widening is correct for storage, but date is declared as an index across schema versions 1–6 (Lines 78, 87, 97, 107, 120, 173) and in three compound indexes. Records with date: null won't be enumerable through these indexes. This is the root of the query-exclusion issue noted on getEmails/getEmailHeaders.

Comment thread web/src/workers/parserWorker.ts
Comment on lines +229 to +244
  await parser.parseStreaming(mboxFile, undefined, async (batch) => {
  if (ctx.isCancelled) return;
  for (const libEmail of batch) {
  const e = toAppEmail(libEmail);
  const key = e.threadId || `${e.subject}|${e.sender}|${e.date ? e.date.getTime() : 'nodate'}`;
  if (seenEmailKeys.has(key)) continue;
  seenEmailKeys.add(key);
- currentBatch.push(email);

+ currentBatch.push({ ...e, folderId });
  if (currentBatch.length >= BATCH_SIZE) {
- sendEmailBatch(currentBatch, batchNumber, false);
+ sendEmailBatch(currentBatch, batchNumber++, false);
  ctx.totalEmailsParsed += currentBatch.length;
- batchNumber++;
  currentBatch = [];
- await new Promise(resolve => setTimeout(resolve, 0));
+ await new Promise((r) => setTimeout(r, 0));
  }
  }
  }
  });
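
The dedup key in the snippet above prefers `threadId`, which can be shared by every message in a conversation. A message-ID-first variant might look like the sketch below. `messageId` is an assumed field: whether the library or the app's Email type exposes it needs to be verified, and `DedupInput`, `dedupKey`, and `dedupe` are hypothetical names.

```typescript
// `messageId` is an assumption; the other fields mirror the composite key
// already used in the worker.
interface DedupInput {
  messageId?: string;
  subject: string;
  sender: string;
  date: Date | null;
}

// Prefer the per-message identifier; fall back to the composite key.
function dedupKey(e: DedupInput): string {
  const id = e.messageId?.trim();
  if (id) return `id:${id}`;
  return `fallback:${e.subject}|${e.sender}|${e.date ? e.date.getTime() : "nodate"}`;
}

// Keep the first occurrence of each key, preserving input order.
function dedupe<T extends DedupInput>(emails: T[]): T[] {
  const seen = new Set<string>();
  return emails.filter((e) => {
    const key = dedupKey(e);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

The `id:` and `fallback:` prefixes keep the two key spaces apart, so a message ID can never collide with a composite key.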

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect how the library assigns threadId for MBOX parsing (thread-scoped vs message-scoped),
# and whether a message-id field is exposed on the parsed Email.
pkg=$(curl -s https://registry.npmjs.org/@technical-1%2Femail-archive-parser | jq -r '."dist-tags".latest')
url=$(curl -s https://registry.npmjs.org/@technical-1%2Femail-archive-parser | jq -r --arg v "$pkg" '.versions[$v].dist.tarball')
tmp=$(mktemp -d); curl -sL "$url" -o "$tmp/p.tgz"; tar -xzf "$tmp/p.tgz" -C "$tmp"
rg -nP -C3 'threadId|messageId|message-id|X-GM-THRID' "$tmp/package" || echo "no matches"

Repository: Technical-1/EmailAnalyzer

Length of output: 8835


🏁 Script executed:

#!/bin/bash
# Inspect parserWorker.ts around the reported dedup logic to identify which parsing path it belongs to.
file="web/src/workers/parserWorker.ts"
if [ ! -f "$file" ]; then
  echo "Missing $file"
  exit 1
fi
sed -n '200,280p' "$file" | nl -ba | sed -n '1,120p'

Repository: Technical-1/EmailAnalyzer

Length of output: 110


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="web/src/workers/parserWorker.ts"
sed -n '1,120p' "$file" | nl -ba | sed -n '1,120p'
echo "----"
sed -n '120,220p' "$file" | nl -ba | sed -n '1,120p'
echo "----"
sed -n '220,270p' "$file" | nl -ba | sed -n '1,80p'
echo "----"
sed -n '270,360p' "$file" | nl -ba | sed -n '1,120p'

Repository: Technical-1/EmailAnalyzer

Length of output: 112


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="web/src/workers/parserWorker.ts"
wc -l "$file"

Repository: Technical-1/EmailAnalyzer

Length of output: 105


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="web/src/workers/parserWorker.ts"
sed -n '210,260p' "$file" | nl -ba | sed -n '1,120p'

Repository: Technical-1/EmailAnalyzer

Length of output: 112


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="web/src/workers/parserWorker.ts"
rg -n "seenEmailKeys|threadId|messageId|parseStreaming" "$file"

Repository: Technical-1/EmailAnalyzer

Length of output: 589


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="web/src/workers/parserWorker.ts"
sed -n '180,260p' "$file" | cat -n

Repository: Technical-1/EmailAnalyzer

Length of output: 4224


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="web/src/workers/parserWorker.ts"
sed -n '210,260p' "$file" | nl -ba | sed -n '1,120p'

Repository: Technical-1/EmailAnalyzer

Length of output: 112


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "function toAppEmail|toAppEmail\\s*\\(" web/src/workers/parserWorker.ts
rg -n "type Email\\b|interface Email\\b|export type Email\\b|from '.*Email'" web/src/workers/parserWorker.ts web/src -S

Repository: Technical-1/EmailAnalyzer

Length of output: 726


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="web/src/workers/parserWorker.ts"
# Show the toAppEmail definition (and nearby mappings) without dumping the whole file
start=$(rg -n "function toAppEmail\\b" "$file" | head -n1 | cut -d: -f1 || true)
if [ -n "${start}" ]; then
  from=$((start-30)); to=$((start+80))
  sed -n "${from},${to}p" "$file" | nl -ba | head -n 200
else
  # fallback: show a reasonable region around where toAppEmail is referenced
  ref=$(rg -n "toAppEmail\\(" "$file" | head -n1 | cut -d: -f1)
  from=$((ref-60)); to=$((ref+60))
  sed -n "${from},${to}p" "$file" | nl -ba | head -n 200
fi

Repository: Technical-1/EmailAnalyzer

Length of output: 113


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Find the app Email model/type and show messageId/originalId fields (if present)
fd -t f "Email.ts|email.*\\.ts|types.*Email.*\\.ts" web/src -a 2>/dev/null | head -n 50

Repository: Technical-1/EmailAnalyzer

Length of output: 459


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Search for messageId/originalId in the app codebase
rg -n "messageId\\b|originalId\\b" web/src

Repository: Technical-1/EmailAnalyzer

Length of output: 51


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file1="web/src/workers/toAppEmail.ts"
file2="web/src/types/index.ts"

echo "---- $file1 ----"
awk 'NR>=1 && NR<=220 {print NR ":" $0}' "$file1" | sed -n '1,220p'

echo "---- $file2 ----"
awk 'NR>=1 && NR<=220 {print NR ":" $0}' "$file2" | sed -n '1,220p'

Repository: Technical-1/EmailAnalyzer

Length of output: 7029


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Search specifically within app for messageId/originalId property usage in email mapping
rg -n "messageId|originalId" web/src/workers/toAppEmail.ts web/src/types web/src -S

Repository: Technical-1/EmailAnalyzer

Length of output: 51


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Re-show just the parseGmailTakeoutFile dedup key area with line numbers via awk (no nl)
file="web/src/workers/parserWorker.ts"
awk 'NR>=1 && NR<=1 {print}' "$file" >/dev/null
awk 'NR>=200 && NR<=260 {print NR ":" $0}' "$file" | sed -n '200,260p'

Repository: Technical-1/EmailAnalyzer

Length of output: 51


Fix Gmail Takeout dedup to avoid conversation-scoped threadId collisions

In parseGmailTakeoutFile, the dedup key is e.threadId || .... When threadId comes from Gmail's x-gm-thrid value, which is conversation-level rather than per-message, only the first email in each thread is kept and every later, distinct message is dropped. Key on libEmail.messageId (the Message-ID header) instead of threadId, and use the fallback only when the Message-ID is missing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/src/workers/parserWorker.ts` around lines 229 - 244, the dedup key
currently uses e.threadId, which collapses Gmail conversation-level threads and
drops distinct messages. Update the key generation in the parser.parseStreaming
callback to prefer the Message-ID from libEmail (e.g., libEmail.messageId or
e.messageId) over threadId, falling back to the existing
`${e.subject}|${e.sender}|...` composite when the Message-ID is missing. Change
the line that assigns key (and any related seenEmailKeys logic) so that
deduplication is messageId-first and each unique message is preserved.
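
As a rough illustration of the messageId-first key described above, the sketch below uses a hypothetical `ParsedEmail` shape and hypothetical `dedupKey` / `dedupe` helpers; none of these names come from the worker or the library. It assumes the parsed email exposes a `messageId` field, which the analysis above did not confirm, and the third component of the fallback key is only a stand-in for whatever the existing composite actually uses.

```typescript
// Hypothetical shape: only the fields the dedup key needs. The real library
// and app types may differ; `messageId` in particular is an assumption.
interface ParsedEmail {
  messageId?: string | null;
  threadId?: string | null;
  subject: string;
  sender: string;
  date: Date | null;
}

// Build a per-message key: Message-ID first, composite fallback second.
// threadId is deliberately ignored because every message in a Gmail
// conversation shares it.
function dedupKey(e: ParsedEmail): string {
  const id = e.messageId?.trim();
  if (id) return `id:${id}`;
  // Illustrative fallback only; the timestamp component is an assumption.
  const when = e.date ? String(e.date.getTime()) : "undated";
  return `fallback:${e.subject}|${e.sender}|${when}`;
}

// Keep the first email seen for each key and drop later repeats.
function dedupe(emails: readonly ParsedEmail[]): ParsedEmail[] {
  const seen = new Set<string>();
  const kept: ParsedEmail[] = [];
  for (const e of emails) {
    const key = dedupKey(e);
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(e);
  }
  return kept;
}
```

With a key like this, two replies in the same conversation survive because their Message-IDs differ, while the same message exported under two Takeout labels still collapses to one row.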

date: e.date, // Date | null
body: e.body,
htmlBody: e.htmlBody,
attachments: [],

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm whether attachments are handled anywhere else in the worker/import path.
rg -nP --type=ts -C3 '\battachments\b' -g 'web/src/workers/**' -g 'web/src/services/**'
# Look for any consumer that maps library attachments into app Attachment.
ast-grep --pattern 'attachments: $_'

Repository: Technical-1/EmailAnalyzer

Length of output: 3992


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the parser worker implementation and where attachments might be produced.
fd -a "parserWorker.ts" web/src/workers -t f -x cat -n | head -n 260

# Inspect the mapper in toAppEmail.ts around the attachments field.
sed -n '1,120p' web/src/workers/toAppEmail.ts | cat -n

# Find any usage of `e.attachments` or `attachments` produced by the parsing library types.
rg -n --type=ts --glob 'web/src/workers/**' --glob 'web/src/services/**' '\be\.attachments\b|\battachments\s*:\s*e\.attachments\b|\battachment(s)?\b' 

# Check app/domain type definition for Attachment and where Email.attachments is expected.
fd -a "*Attachment*" web/src -t f -x rg -n --type=ts -S 'export (type|interface|class) Attachment|attachments:\s*Attachment\[\]' || true
fd -a "*Email*" web/src -t f -x rg -n --type=ts -S 'attachments:\s*Attachment\[\]|attachments\??:' || true

# Look for any code that populates attachments after mapping (e.g., via attachmentService or later enrichment).
rg -n --type=ts 'attachmentService|downloadMultiple\(|createAttachment|extractAttachment|enrich.*attachment|populate.*attachments' web/src/services web/src/workers

Repository: Technical-1/EmailAnalyzer

Length of output: 16138


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect app Email/Attachment types.
fd -t f "types.ts" web/src -x sed -n '1,260p'
fd -a "*types*" web/src -t f -x rg -n --type=ts 'export (type|interface|class) Email|export (type|interface|class) Attachment|attachments:\s*Attachment\[\]'

# Find any later enrichment/persistence steps that touch `email.attachments`.
rg -n --type=ts '\.attachments\b|attachments:' web/src --glob '!**/node_modules/**' | head -n 200

# Find how attachments are persisted/created from downloaded data.
rg -n --type=ts 'downloadMultiple\(|downloadAttachment\(|createBlob\(|mimeType|filename' web/src/services web/src/workers | head -n 200

Repository: Technical-1/EmailAnalyzer

Length of output: 2790


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# List likely types directory/files.
ls -la web/src || true
ls -la web/src/types || true

# Find and print the app Email/Attachment type definitions.
fd -t f -g 'types*' web/src | head -n 50
fd -t f -g '*attachment*' web/src/types web/src | head -n 50
fd -t f -g '*email*' web/src/types web/src | head -n 50

# Search within types for Attachment/Email definitions.
fd -t f -g '*ts' web/src/types 2>/dev/null | xargs -r rg -n --type=ts --fixed-string 'interface Attachment' --fixed-string 'type Attachment' --fixed-string 'class Attachment' --fixed-string 'attachments:'
fd -t f -g '*ts' web/src/types 2>/dev/null | xargs -r rg -n --type=ts --fixed-string 'interface Email' --fixed-string 'type Email' --fixed-string 'class Email' --fixed-string 'attachments:'

# Also search outside types in case Attachment/Email types live elsewhere.
fd -t f -g '*ts' web/src | xargs -r rg -n --type=ts --fixed-string 'interface Attachment' --fixed-string 'type Attachment' --fixed-string 'attachments: Attachment' --fixed-string 'attachments:'

fd -t f -g '*ts' web/src | xargs -r rg -n --type=ts --fixed-string 'interface Email' --fixed-string 'type Email' --fixed-string 'attachments:'

Repository: Technical-1/EmailAnalyzer

Length of output: 2518


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show what's in web/src/types and list candidate type files.
ls -la web/src/types
fd -t f -g '*.ts' web/src/types

# Search within app types for Attachment and Email attachments shape.
fd -t f -g '*.ts' web/src/types | xargs rg -n --type=ts --fixed-strings 'Attachment' --fixed-strings 'attachments' --fixed-strings 'interface Email' --fixed-strings 'type Email' --fixed-strings 'interface Attachment' --fixed-strings 'type Attachment'

# Also search within web/src for Email/Attachment types if they are re-exported elsewhere.
fd -t f -g '*.ts' web/src | xargs rg -n --type=ts --fixed-strings 'attachments:' --fixed-strings 'attachments.' --fixed-strings 'Attachment' | head -n 200

# Open types/index.ts for re-exports.
sed -n '1,240p' web/src/types/index.ts | cat -n

Repository: Technical-1/EmailAnalyzer

Length of output: 725


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Open the relevant sections of the app type definitions.
sed -n '1,220p' web/src/types/index.ts | cat -n

# Narrow views around Attachment and Email definitions.
rg -n --fixed-strings 'export interface Attachment' web/src/types/index.ts
rg -n --fixed-strings 'attachments:' web/src/types/index.ts
rg -n --fixed-strings 'export interface Email' web/src/types/index.ts || true
rg -n --fixed-strings 'interface Email' web/src/types/index.ts || true

Repository: Technical-1/EmailAnalyzer

Length of output: 7088


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where attachmentData is read/written.
rg -n --type=ts --fixed-strings 'attachmentData' web/src | head -n 200

# Find where Attachment objects (id/filename/mimeType/size) are created or persisted.
rg -n --type=ts '\bAttachment\b|attachment(s)?\.' web/src | head -n 200

# Find DB layer / store usage for attachments.
fd -t f -g '*.ts' web/src/db web/src/store web/src/services web/src/pages | xargs -r rg -n --type=ts 'attachments|attachmentData' || true

# Look for any import pipeline code that might populate attachments after worker mapping.
rg -n --type=ts 'EMAIL_BATCH|parse.*(worker|import)|toAppEmail\(' web/src | head -n 200

Repository: Technical-1/EmailAnalyzer

Length of output: 14916


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show where EMAIL_BATCH from worker is persisted to DB.
rg -n --type=ts "'EMAIL_BATCH'|EMAIL_BATCH" web/src/pages/HomePage.tsx web/src -S | head -n 200

# Inspect the HomePage handling that writes emails to DB.
sed -n '1,260p' web/src/pages/HomePage.tsx | cat -n

# Inspect database persistence logic for attachments + attachmentData.
sed -n '1,260p' web/src/db/database.ts | cat -n
sed -n '260,520p' web/src/db/database.ts | cat -n

# Inspect the DBEmail -> slimAttachments / attachmentData extraction.
rg -n --type=ts 'slimAttachments|attachmentData|attachments:' web/src/db/database.ts

# Re-open toAppEmail.ts snippet for context.
sed -n '1,120p' web/src/workers/toAppEmail.ts | cat -n

Repository: Technical-1/EmailAnalyzer

Length of output: 37832


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect import pipeline entrypoints used by HomePage when receiving EMAIL_BATCH.
sed -n '1,260p' web/src/services/importPipeline.ts | cat -n

# Search within importPipeline for attachments/attachmentData.
rg -n --type=ts 'attachments|attachmentData|Attachment' web/src/services/importPipeline.ts web/src/services | head -n 200

# Inspect any helper used by importPipeline that might rebuild attachments.
rg -n --type=ts 'processEmailBatch|createImportCounts|insertEmail' web/src/services/importPipeline.ts web/src/services | head -n 200

Repository: Technical-1/EmailAnalyzer

Length of output: 13578


Fix imported email attachments being dropped by the mapper

web/src/workers/toAppEmail.ts hard-codes attachments: [], so any attachments produced by the archive parser are never carried into the app Email row.

The persistence layer (web/src/db/database.ts insertEmail / bulkInsertEmails) splits email.attachments into:

  • attachment metadata stored in emails.attachments, and
  • base64 payload stored in emailBodies.attachmentData.

With attachments: [], both the attachment count and attachment previews/downloads will be empty for newly imported emails. Map LibEmail attachments into the app Attachment shape (and include data when available), or document where attachments are intentionally reconstructed later.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@web/src/workers/toAppEmail.ts` at line 19, the toAppEmail mapper (LibEmail ->
Email) hard-codes attachments: [], which drops parsed attachments. Map
LibEmail.attachments into the app Attachment shape instead of an empty array,
preserving fields like filename, contentType, and size, and include the base64
payload (e.g., data) when present so that the persistence layer (insertEmail /
bulkInsertEmails, which split email.attachments into metadata and
emailBodies.attachmentData) can store both correctly. Ensure the mapped property
names match the app Attachment interface expected by the database layer.
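
One possible shape for that mapping is sketched below. Both interfaces and the `toAppAttachments` / `base64ByteLength` helpers are hypothetical: the field names (`filename`, `contentType`, `mimeType`, `size`, `data`) are taken loosely from the comment above and would need to be checked against the library's real attachment type and the app's Attachment interface before use.

```typescript
// Hypothetical library-side attachment shape (an assumption, not the real API).
interface LibAttachment {
  filename?: string;
  contentType?: string;
  size?: number;
  data?: string; // base64 payload, when the parser kept it
}

// Hypothetical app-side shape; property names are assumed for illustration.
interface AppAttachment {
  id: string;
  filename: string;
  mimeType: string;
  size: number;
  data?: string;
}

// Estimate the decoded byte length of a base64 string without decoding it.
function base64ByteLength(b64: string | undefined): number {
  if (!b64) return 0;
  const clean = b64.replace(/\s+/g, "");
  const padding = clean.endsWith("==") ? 2 : clean.endsWith("=") ? 1 : 0;
  return Math.floor((clean.length * 3) / 4) - padding;
}

// Map parsed attachments into the app shape instead of returning [].
function toAppAttachments(
  emailId: string,
  attachments: readonly LibAttachment[] | undefined,
): AppAttachment[] {
  return (attachments ?? []).map((a, i) => {
    const mapped: AppAttachment = {
      id: `${emailId}-att-${i}`,
      filename: a.filename?.trim() || `attachment-${i + 1}`,
      mimeType: a.contentType?.trim() || "application/octet-stream",
      size: a.size ?? base64ByteLength(a.data),
    };
    // Carry the payload only when the parser provided one.
    if (a.data !== undefined) mapped.data = a.data;
    return mapped;
  });
}
```

In the mapper this would replace the literal `attachments: []` with something like `attachments: toAppAttachments(id, e.attachments)`, leaving the existing metadata/payload split in the database layer untouched.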
