Short description. EduHelp predicts a 16-way MBTI-style label from free-form text using a CountVectorizer → multinomial Logistic Regression classifier and uses that label to steer GPT-4 summaries. It accepts multi-modal input (text, image/PDF, voice). The serving path is intentionally lightweight for Vercel serverless deployment. A Flair TextClassifier and clustering notebooks are kept in `ReferenceCode/` and as Jupyter notebooks for R&D; they are not on the default request path.
Project from the MadData Hackathon. For a fuller explanation of the project goal, check out the DevPost page.
- Capture input (browser).
  Users can type text, upload images (OCR via Tesseract.js), or dictate via voice (Web Speech API). Extracted text is:
  - sent to `/execute` for personality prediction, and
  - forwarded to GPT-4 as the content to summarize.
- Classify personality (Python).
  `use_models.py` loads `count_vec.joblib` + `log_model.joblib`, converts text to a sparse vector, and returns one of 16 uppercase MBTI-style labels.
- Steer summarization (Node.js + OpenAI).
  The UI composes a prompt: “I am of personality type <TYPE>. Strictly summarize the following in a way someone of my personality would best understand: <TEXT>. Extra instructions (if any): <prefs>.”
  `server.js` maintains per-socket conversation history, calls the GPT-4 ChatCompletion API, and streams the reply back to the browser over Socket.IO.
- Deploy (Vercel).
  Static assets are served from `public/`. The Node entry handles chat + classification. `vercel.json` configures `@vercel/node` with a catch-all route.
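The prompt template from the summarization step can be sketched as a small function. This is only an illustration in Python: the project composes the prompt in the browser UI, and the name `compose_prompt` and the sample inputs are assumptions made for this sketch.

```python
def compose_prompt(mbti_type: str, text: str, prefs: str = "") -> str:
    """Fill the summarization template with the label, content, and preferences.

    Hypothetical helper: mirrors the template wording quoted in the README,
    not the project's actual front-end code.
    """
    return (
        f"I am of personality type {mbti_type}. "
        "Strictly summarize the following in a way someone of my personality "
        f"would best understand: {text}. "
        f"Extra instructions (if any): {prefs}."
    )


# Made-up example inputs.
prompt = compose_prompt(
    "INTP",
    "Photosynthesis converts light into chemical energy",
    "use bullet points",
)
print(prompt)
```

Keeping the label, the content, and the user preferences as separate parameters makes it easy to swap any one of them without touching the template.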
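Let $x$ be the bag-of-words count vector that CountVectorizer produces for the input text, with one entry per vocabulary term.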
With parameters $\{(w_k, b_k)\}_{k=1}^{16}$:

$$ p(y=k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{16} \exp(w_j^\top x + b_j)}. $$

We output:

$$ \hat{y} = \arg\max_k p(y=k \mid x), \qquad \text{conf} = \max_k p(y=k \mid x). $$
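The softmax and argmax above can be traced by hand on a toy problem. The following is a minimal pure-Python sketch, not the project's code: it uses three classes and a four-term vocabulary instead of 16 classes, and the weights, biases, and labels are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weights, biases, labels):
    """Return (label, confidence, all probabilities) for count vector x."""
    # Linear score per class: w_k^T x + b_k
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    k = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return labels[k], probs[k], probs


# Toy parameters (assumed values, three classes instead of 16).
labels = ["INTP", "ENFJ", "ISTJ"]
weights = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.2, 0.2, 0.2, 0.2],
]
biases = [0.0, 0.0, 0.0]
x = [2, 0, 1, 0]  # term counts for a made-up input

label, conf, probs = predict(x, weights, biases, labels)
print(label, round(conf, 3))
```

Here the class scores are 2.5, 0.0, and 0.6, so the first class wins and `conf` is its softmax probability.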
Why Logistic Regression (statistical + operational):
- Calibrated probabilities help steer prompts.
- Strong performance in sparse, high-dimensional settings.
- Minimal latency and memory — ideal for serverless deployment.
- `ReferenceCode/models/use_models.py` shows an ensemble of Flair + LR, returning whichever is more confident.
- `final-model.pt` (~250 MB) is disabled in the deployed path to avoid Vercel cold-start costs.
- Training log (`ReferenceCode/resources/loss.tsv`) shows dev metrics plateau while train loss continues to drop → classic overfitting signal, supporting LR as default.
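The "return whichever is more confident" rule can be sketched as below. This is an illustrative reconstruction, not the code in `ReferenceCode/`: the `Prediction` type, the function name, and the tie-breaking choice are all assumptions.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    label: str         # MBTI-style label, e.g. "INTP"
    confidence: float  # model's top-class probability
    source: str        # which model produced it


def pick_more_confident(lr: Prediction, flair: Prediction) -> Prediction:
    """Return the prediction with the higher confidence.

    Ties go to LR, the lighter default model (an assumption of this sketch).
    """
    return flair if flair.confidence > lr.confidence else lr


# Made-up example scores.
lr_pred = Prediction("INTP", 0.41, "lr")
flair_pred = Prediction("INFP", 0.63, "flair")
chosen = pick_more_confident(lr_pred, flair_pred)
print(chosen.label, chosen.source)
```

Note that confidences from two differently trained models are not necessarily on a comparable scale, so a rule like this tends to favor whichever model is more overconfident.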
- Jupyter notebooks (`clustering.ipynb`, `Personality_Data.ipynb`) explore unsupervised groupings of text embeddings.
- Goal: test whether learned clusters offer better stylistic cues than MBTI categories.
- Not integrated into serving (`use_models.py`); R&D only.
- Server API:
  - `POST /execute` shells out to Python: `` exec(`python3 "${pythonScriptPath}" "${text}"`) ``. Because the user text is interpolated into a shell command string, crafted input can inject commands. Safer alternative: `spawn('python3', [pythonScriptPath, text]);`, which passes the text as a plain argument without invoking a shell.
  - Socket.IO channel persists conversation history per connection.
- Front-end:
  - OCR via Tesseract.js.
  - Voice dictation via Web Speech API.
  - UI merges predicted label, extracted text, and user instructions into a single prompt.
- Deployment (Vercel):
  - `vercel.json` builds with `@vercel/node` targeting `server.js`.
  - Cold-start discipline: Flair is disabled; only the lightweight LR artifacts are loaded.
- Artifacts:
  - `count_vec.joblib` and `log_model.joblib` (Git-LFS tracked).
  - `final-model.pt` (Flair) and loss logs included only in `ReferenceCode/`.
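The `/execute` hand-off passes user text to a child process, and the difference between a shell string and an argument list is easy to demonstrate. The sketch below shows the same principle in Python rather than Node: `subprocess.run` with a list plays the role of `spawn`. The child script is a stand-in that echoes its argument (not `use_models.py`), and `MAX_CHARS` is a hypothetical limit.

```python
import subprocess
import sys

MAX_CHARS = 5000  # hypothetical length limit for this sketch

# Stand-in for the classifier script: prints its first argument unchanged.
CHILD = "import sys; print(sys.argv[1])"


def run_classifier(text: str) -> str:
    """Pass text to a child process as a single argv entry (no shell)."""
    if len(text) > MAX_CHARS:
        raise ValueError("input too long")
    result = subprocess.run(
        [sys.executable, "-c", CHILD, text],  # list form: no shell parsing
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return result.stdout.strip()


# Input that would break out of a quoted shell string.
hostile = 'hello"; echo pwned; "'
echoed = run_classifier(hostile)
print(echoed)
```

Because the text never passes through a shell, the quotes and semicolons arrive in the child as literal characters instead of being interpreted as a second command.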
The MBTI label conditions the LLM output. It shifts phrasing style — analogy density, concreteness, pacing — while keeping factual content unchanged. This nudging helps align summaries with users’ preferred learning styles.
It is context, not a gate: no information is withheld; the label only changes presentation.
- `server.js`: Express + Socket.IO; `/execute` → Python; GPT-4 loop.
- `use_models.py`: loads LR artifacts; prints MBTI label.
- `count_vec.joblib`, `log_model.joblib`: serving artifacts.
- `public/index.html`: UI with OCR, voice, chat, and prompt composition.
- `config/open-ai.js`: OpenAI client init.
- `vercel.json`: serverless config.
- `ReferenceCode/`: Flair ensemble, alternate models, training logs.
- BoW limitation: ignores syntax/semantics, relies only on lexical cues.
- Serverless: large neural models (Flair) strain cold starts; LR balances speed with acceptable accuracy.
- Future extensions:
- Persist an embedding-based clusterer and return both MBTI label + cluster ID.
- Explore distilled neural classifiers to capture semantics without Flair’s overhead.
- Calibrate LR probabilities with temperature scaling.
- Harden the `/execute` endpoint (prefer `spawn`, length limits, rate limiting).
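Of the extensions listed, temperature scaling is the simplest to illustrate: divide the class scores by a scalar $T$ before the softmax and choose $T$ to minimize negative log-likelihood on held-out data. The sketch below uses a plain grid search instead of a gradient-based optimizer, and the logits and labels are made up; none of it comes from the project.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over scores divided by a temperature (T > 1 softens)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, true_idx, temperature):
    """Mean negative log-likelihood of the true classes at this temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, true_idx)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, true_idx, grid=None):
    """Pick the temperature on a grid (0.5 to 5.0) that minimizes held-out NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]
    return min(grid, key=lambda t: nll(logit_rows, true_idx, t))


# Made-up held-out logits: confident everywhere, but wrong on the second row.
held_out_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
held_out_labels = [0, 1, 1, 0]

T = fit_temperature(held_out_logits, held_out_labels)
print(round(T, 1))
```

Scaling by a positive $T$ never changes the argmax, so the predicted label stays the same; only the reported confidence is adjusted.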
Text / OCR / voice → /execute (Python LR) → MBTI label → GPT-4 via Socket.IO → personality-aware summary.