DeboJp/EduHelp

EduHelp: Personality-Aware Summarization

Short description. EduHelp predicts a 16-way MBTI-style label from free-form text using a CountVectorizer → multinomial Logistic Regression classifier, then uses that label to steer GPT-4 summaries. It accepts multimodal input (text, image/PDF, voice). The serving path is intentionally lightweight for Vercel serverless deployment. A Flair TextClassifier and clustering experiments live in ReferenceCode/ and in Jupyter notebooks for R&D; they are not on the default request path.

EduHelp was built at the MadData Hackathon. For a fuller explanation of the project's goals, check out: DevPost


1) End-to-End Flow

  1. Capture input (browser).
    Users can type text, upload images (OCR via Tesseract.js), or dictate via voice (Web Speech API). Extracted text is:

    • sent to /execute for personality prediction, and
    • forwarded to GPT-4 as the content to summarize.
  2. Classify personality (Python).
    use_models.py loads count_vec.joblib + log_model.joblib, converts text to a sparse vector, and returns one of 16 uppercase MBTI-style labels.

  3. Steer summarization (Node.js + OpenAI).
    The UI composes a prompt:

    “I am of personality type <TYPE>. Strictly summarize the following in a way someone of my personality would best understand: <TEXT>. Extra instructions (if any): <prefs>.”

    server.js maintains per-socket conversation history, receives the prompt over Socket.IO, calls the GPT-4 ChatCompletion API, and streams the reply back to the client.

  4. Deploy (Vercel).
    Static assets are served from public/. The Node entry handles chat + classification. vercel.json configures @vercel/node with a catch-all route.
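
The prompt template from step 3 can be sketched as a small function. This is a Python rendering for illustration only: in the project the prompt is composed in the browser UI, and the function name `compose_prompt` and the choice to drop the "Extra instructions" clause when no preferences are given are assumptions, not the repository's code.

```python
def compose_prompt(mbti_type: str, text: str, prefs: str = "") -> str:
    """Build the steering prompt from the predicted label, the content, and optional preferences."""
    mbti_type = mbti_type.strip().upper()  # the classifier returns uppercase labels
    prompt = (
        f"I am of personality type {mbti_type}. "
        "Strictly summarize the following in a way someone of my personality "
        f"would best understand: {text.strip()}."
    )
    # Assumption: omit the trailing clause when the user gave no extra instructions.
    if prefs.strip():
        prompt += f" Extra instructions (if any): {prefs.strip()}."
    return prompt


if __name__ == "__main__":
    print(compose_prompt("intj", "Photosynthesis converts light into chemical energy", "use bullet points"))
```

Keeping the label, the content, and the preferences as separate arguments makes it easy to swap the label source (for example, a cluster ID) without touching the template.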


2) Classification Model (What, How, Why)

Bag-of-Words Encoding

Let the vocabulary size be $V$. For input text, the CountVectorizer produces: $$ x \in \mathbb{N}^V, \quad x_j = \text{count of token } j. $$
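
As a worked example of the encoding, the sketch below counts tokens against a fixed vocabulary using only the standard library. The five-word vocabulary is a made-up stand-in (the real one lives inside count_vec.joblib), and the lowercasing and two-or-more-character token pattern mirror scikit-learn's CountVectorizer defaults, which the project may or may not have overridden.

```python
import re
from collections import Counter
from typing import List, Sequence

# Hypothetical toy vocabulary; the real one is stored inside count_vec.joblib.
VOCAB = ["ideas", "love", "plans", "quiet", "theory"]

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def bow_vector(text: str, vocab: Sequence[str] = VOCAB) -> List[int]:
    """Return x where x[j] is the number of times vocab[j] occurs in text."""
    counts = Counter(TOKEN_RE.findall(text.lower()))
    # Counter returns 0 for tokens that never appeared, so no special-casing is needed.
    return [counts[token] for token in vocab]


if __name__ == "__main__":
    print(bow_vector("I love ideas, big ideas and quiet plans."))  # [2, 1, 1, 1, 0]
```

Words outside the vocabulary ("big", "and") and single-character tokens ("I") contribute nothing to $x$, which is why the representation is sparse for a realistic $V$.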

Multinomial Logistic Regression (16-class softmax)

With parameters $\{(w_k, b_k)\}_{k=1}^{16}$: $$ p(y=k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{16} \exp(w_j^\top x + b_j)}. $$ We output: $$ \hat{y} = \arg\max_k p(y=k \mid x), \qquad \text{conf} = \max_k p(y=k \mid x). $$
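
The formulas above can be sketched in plain Python. The example uses three classes and hand-picked weights purely for readability; the deployed model has 16 classes and learned parameters stored in log_model.joblib, and the function names here are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x: Sequence[float], weights: Sequence[Sequence[float]],
            biases: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return (y_hat, conf) for the class with the largest softmax probability."""
    # One logit per class: w_k . x + b_k
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)
    return labels[k], probs[k]


if __name__ == "__main__":
    # Toy parameters (assumed for illustration, not the trained model).
    label, conf = predict(
        x=[2, 1, 0],
        weights=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        biases=[0.0, 0.0, 0.0],
        labels=["INTJ", "ENFP", "ISTP"],
    )
    print(label, round(conf, 4))
```

Subtracting the maximum logit leaves the probabilities unchanged but avoids overflow when word counts, and therefore logits, are large.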

Why Logistic Regression (statistical + operational):

  • Outputs class probabilities that are typically reasonably calibrated, which gives a usable confidence score alongside the label.
  • Strong performance in sparse, high-dimensional settings.
  • Minimal latency and memory — ideal for serverless deployment.

3) Flair Classifier (Reference Only)

  • ReferenceCode/models/use_models.py shows an ensemble of Flair + LR, returning whichever is more confident.
  • final-model.pt (~250 MB) is disabled in the deployed path to avoid Vercel cold-start costs.
  • Training log (ReferenceCode/resources/loss.tsv) shows dev metrics plateau while train loss continues to drop → classic overfitting signal, supporting LR as default.
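
The "return whichever is more confident" rule can be sketched as follows. The two classifier functions are hypothetical stubs with made-up outputs standing in for the LR and Flair models; the actual logic in ReferenceCode/models/use_models.py may differ.

```python
from typing import Callable, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, confidence)


def most_confident(text: str, classifiers: Sequence[Callable[[str], Prediction]]) -> Prediction:
    """Run every classifier on the text and keep the prediction with the highest confidence."""
    predictions = [clf(text) for clf in classifiers]
    # On ties, max() keeps the earliest classifier in the sequence.
    return max(predictions, key=lambda p: p[1])


# Hypothetical stand-ins; the confidences below are invented for the example.
def lr_stub(text: str) -> Prediction:
    return ("INTJ", 0.41)


def flair_stub(text: str) -> Prediction:
    return ("INFP", 0.77)


if __name__ == "__main__":
    print(most_confident("I enjoy quiet evenings with a book.", [lr_stub, flair_stub]))
```

One caveat with this rule: raw confidences from different model families are not necessarily comparable, so an overconfident model can dominate the ensemble.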

4) Clustering (Offline Exploration)

  • Jupyter notebooks (clustering.ipynb, Personality_Data.ipynb) explore unsupervised groupings of text embeddings.
  • Goal: test whether learned clusters offer better stylistic cues than MBTI categories.
  • Not integrated into serving (use_models.py); R&D only.

5) Runtime & Deployment Details

  • Server API:

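    • POST /execute shells out to Python:
      exec(`python3 "${pythonScriptPath}" "${text}"`)
      Interpolating user text into a shell string allows command injection. Safer alternative (arguments are passed as an array, with no shell parsing):
      spawn('python3', [pythonScriptPath, text]);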
    • Socket.IO channel persists conversation history per connection.
  • Front-end:

    • OCR via Tesseract.js.
    • Voice dictation via Web Speech API.
    • UI merges predicted label, extracted text, and user instructions into a single prompt.
  • Deployment (Vercel):

    • vercel.json builds with @vercel/node targeting server.js.
    • Cold-start discipline: Flair is disabled; the LR path adds minimal latency.
  • Artifacts:

    • count_vec.joblib and log_model.joblib (Git-LFS tracked).
    • final-model.pt (Flair) and loss logs included only in ReferenceCode/.
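
The argument-array principle behind the spawn recommendation can be demonstrated in Python, where `subprocess.run` with a list behaves the same way: the text reaches the child as a single argv entry and no shell ever parses it. This is an illustration of the idea, not the project's server code (which is Node.js); the inline child script stands in for use_models.py, and `MAX_CHARS` and the timeout are hypothetical limits.

```python
import subprocess
import sys

# Stand-in for use_models.py: a child process that echoes its first argument.
CHILD_CODE = "import sys; print(sys.argv[1])"

MAX_CHARS = 10_000  # hypothetical input cap


def run_child(text: str) -> str:
    """Pass text to a child Python process as one argument, without a shell."""
    if len(text) > MAX_CHARS:
        raise ValueError("input too long")
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, text],  # list form: no shell parsing
        capture_output=True,
        text=True,
        check=True,   # raise if the child exits with a non-zero status
        timeout=30,   # avoid hanging the request on a stuck child
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Shell metacharacters arrive in the child as literal text.
    print(run_child('hello"; echo pwned; "'))
```

With a shell-interpolated command, the same input would close the quoted argument and run `echo pwned` as a second command.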

6) Why the Steering Matters

The MBTI label conditions the LLM output. It is intended to shift phrasing style (analogy density, concreteness, pacing) while keeping factual content unchanged. This nudging helps align summaries with users’ preferred learning styles.

It is context, not a gate: no information is withheld; the label only changes presentation.


7) File Map

  • server.js — Express + Socket.IO; /execute → Python; GPT-4 loop.
  • use_models.py — Loads LR artifacts; prints MBTI label.
  • count_vec.joblib, log_model.joblib — serving artifacts.
  • public/index.html — UI: OCR, voice, chat, prompt composition.
  • config/open-ai.js — OpenAI client init.
  • vercel.json — serverless config.
  • ReferenceCode/ — Flair ensemble, alternate models, training logs.

8) Math Recap

$$ x \in \mathbb{N}^V \;\;\Rightarrow\;\; p(y=k \mid x) = \frac{e^{w_k^\top x + b_k}}{\sum_{j=1}^{16} e^{w_j^\top x + b_j}}, \quad \hat{y} = \arg\max_k p(y=k \mid x). $$


9) Limits & Future Directions

  • BoW limitation: ignores syntax/semantics, relies only on lexical cues.
  • Serverless: large neural models (Flair) strain cold starts; LR balances speed with acceptable accuracy.
  • Future extensions:
    • Persist an embedding-based clusterer and return both MBTI label + cluster ID.
    • Explore distilled neural classifiers to capture semantics without Flair’s overhead.
    • Calibrate LR probabilities with temperature scaling.
    • Harden /execute endpoint (prefer spawn, length limits, rate limiting).
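
For the temperature-scaling item, a minimal sketch is shown below: logits are divided by a scalar $T$ before the softmax, and $T$ is chosen to minimize negative log-likelihood on held-out data. The grid search, the function names, and the toy logits are assumptions for illustration; nothing here is taken from the repository.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature; T > 1 flattens the distribution, T < 1 sharpens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows: Sequence[Sequence[float]], targets: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the true classes at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[t]) for row, t in zip(logit_rows, targets)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows: Sequence[Sequence[float]], targets: Sequence[int],
                    grid: Optional[Sequence[float]] = None) -> float:
    """Pick the temperature on a simple grid that minimizes held-out NLL."""
    if grid is None:
        grid = [0.25 * i for i in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logit_rows, targets, t))


if __name__ == "__main__":
    # Toy held-out set: the model is about 96% confident in class 0 but right only 3 times out of 4.
    rows = [[4.0, 0.0, 0.0]] * 4
    targets = [0, 0, 1, 0]
    t = fit_temperature(rows, targets)
    print(t, softmax(rows[0], t)[0])
```

Dividing by a positive $T$ never changes the arg max, so the predicted label stays the same and only the reported confidence moves.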

10) Data Flow

Text / OCR / voice → /execute (Python LR) → MBTI label → GPT-4 via Socket.IO → personality-aware summary.

About

EduHelp personalizes learning by tailoring complex topics into engaging, customized summaries that match each user’s style and interests.
