EduHelp: Personality-Aware Summarization

Short description. EduHelp predicts a 16-way MBTI-style label from free-form text using a CountVectorizer → multinomial Logistic Regression classifier and uses that label to steer GPT-4 summaries. It accepts multi-modal input (text, image/PDF, voice). The serving path is intentionally lightweight for Vercel serverless deployment. A Flair TextClassifier and clustering experiments are kept in ReferenceCode/ and in Jupyter notebooks for R&D; they are not on the default request path.

Project from the MadData Hackathon. For a fuller explanation of the project goals, check out the DevPost page.

1) End-to-End Flow

Capture input (browser).
Users can type text, upload images (OCR via Tesseract.js), or dictate via voice (Web Speech API). Extracted text is:
- sent to /execute for personality prediction, and
- forwarded to GPT-4 as the content to summarize.
Classify personality (Python).
use_models.py loads count_vec.joblib + log_model.joblib, converts text to a sparse vector, and returns one of 16 uppercase MBTI-style labels.
Steer summarization (Node.js + OpenAI).
The UI composes a prompt:

“I am of personality type <TYPE>. Strictly summarize the following in a way someone of my personality would best understand: <TEXT>. Extra instructions (if any): <prefs>.”

server.js receives the prompt over Socket.IO, maintains per-socket conversation history, calls the GPT-4 ChatCompletion API, and streams the reply back to the browser.
Deploy (Vercel).
Static assets are served from public/. The Node entry handles chat + classification. vercel.json configures @vercel/node with a catch-all route.
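
The classification step in this flow can be sketched as below. This is a minimal sketch, not the project's actual use_models.py: the function names `load_artifacts` and `predict_label` are hypothetical, and it assumes the two joblib files sit in the working directory and that the fitted model's `predict` returns the label as a string.

```python
def load_artifacts(vec_path="count_vec.joblib", model_path="log_model.joblib"):
    """Load the fitted vectorizer and classifier (default paths are assumptions)."""
    # joblib is third-party; importing it here keeps predict_label dependency-free.
    import joblib
    return joblib.load(vec_path), joblib.load(model_path)


def predict_label(vectorizer, model, text):
    """Vectorize one text and return the predicted label in uppercase."""
    features = vectorizer.transform([text])  # one row of token counts
    label = model.predict(features)[0]       # the single prediction for that row
    return str(label).upper()
```

Typical use would be `vectorizer, model = load_artifacts()` followed by `print(predict_label(vectorizer, model, text))`, where `text` is the command-line argument passed in by /execute.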

2) Classification Model (What, How, Why)

Bag-of-Words Encoding

Let the vocabulary size be $V$. For input text, the CountVectorizer produces: $$ x \in \mathbb{N}^V, \quad x_j = \text{count of token } j. $$
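
To make the encoding concrete, here is a small pure-Python sketch of a count vector. It imitates scikit-learn's default CountVectorizer tokenization (lowercasing, tokens of two or more word characters); whether the shipped count_vec.joblib was fitted with those defaults is an assumption, and the two-sentence corpus is made up for illustration.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(corpus):
    """Map each distinct token to a column index j (alphabetical order)."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: j for j, tok in enumerate(tokens)}


def count_vector(text, vocabulary):
    """Return x with x[j] = count of token j; unseen tokens are ignored."""
    x = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        j = vocabulary.get(tok)
        if j is not None:
            x[j] = n
    return x


corpus = ["I love planning ahead", "I love love spontaneous trips"]
vocab = build_vocabulary(corpus)  # ahead, love, planning, spontaneous, trips
print(count_vector(corpus[1], vocab))  # -> [0, 2, 0, 1, 1]
```

Note that the single-character token "I" is dropped by the pattern, and words outside the fitted vocabulary contribute nothing to $x$.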

Multinomial Logistic Regression (16-class softmax)

With parameters $\{(w_k, b_k)\}_{k=1}^{16}$: $$ p(y=k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{16} \exp(w_j^\top x + b_j)}. $$ We output: $$ \hat{y} = \arg\max_k p(y=k \mid x), \qquad \text{conf} = \max_k p(y=k \mid x). $$
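
The softmax, argmax, and confidence computation can be sketched in a few lines of plain Python. The weights, biases, and three-label set below are toy values chosen for illustration, not the trained parameters; the real model has 16 classes and a vocabulary-sized weight vector per class.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weights, biases, labels):
    """Return (y_hat, conf) for one count vector x."""
    logits = [
        sum(w_j * x_j for w_j, x_j in zip(w_k, x)) + b_k
        for w_k, b_k in zip(weights, biases)
    ]
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return labels[k], probs[k]


# Toy parameters (illustrative only): 3 classes, vocabulary size 2.
labels = ["INTJ", "ENFP", "ISTP"]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.0]
label, conf = predict([2, 0], weights, biases, labels)
print(label, round(conf, 3))  # -> INTJ 0.665
```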

Why Logistic Regression (statistical + operational):

Outputs class probabilities directly, which gives a confidence signal that can help steer prompts.
Strong performance in sparse, high-dimensional settings.
Minimal latency and memory — ideal for serverless deployment.

3) Flair Classifier (Reference Only)

ReferenceCode/models/use_models.py shows an ensemble of Flair + LR, returning whichever is more confident.
final-model.pt (~250 MB) is disabled in the deployed path to avoid Vercel cold-start costs.
Training log (ReferenceCode/resources/loss.tsv) shows dev metrics plateau while train loss continues to drop → classic overfitting signal, supporting LR as default.
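
The "whichever is more confident" rule can be sketched as follows. The helper name `pick_more_confident` is hypothetical and the tie-breaking toward LR is an assumption; the actual logic lives in ReferenceCode/models/use_models.py and may differ.

```python
def pick_more_confident(lr_pred, flair_pred):
    """Each argument is a (label, confidence) pair.

    Returns the pair with the higher confidence; ties go to the LR
    prediction (an assumption made for this sketch).
    """
    return flair_pred if flair_pred[1] > lr_pred[1] else lr_pred


print(pick_more_confident(("INTJ", 0.41), ("ENFP", 0.73)))  # -> ('ENFP', 0.73)
```

One caveat with this rule: confidence scores from two different models are not necessarily on a comparable scale, so an overfit model that reports inflated confidence would win more often than it should.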

4) Clustering (Offline Exploration)

Jupyter notebooks (clustering.ipynb, Personality_Data.ipynb) explore unsupervised groupings of text embeddings.
Goal: test whether learned clusters offer better stylistic cues than MBTI categories.
Not integrated into serving (use_models.py); R&D only.

5) Runtime & Deployment Details

Server API:
- POST /execute shells out to Python:
```
exec(`python3 "${pythonScriptPath}" "${text}"`)
```
  Safer alternative (passes the text as a single argument, with no shell interpolation):
```
spawn('python3', [pythonScriptPath, text]);
```
- Socket.IO channel persists conversation history per connection.
Front-end:
- OCR via Tesseract.js.
- Voice dictation via Web Speech API.
- UI merges predicted label, extracted text, and user instructions into a single prompt.
Deployment (Vercel):
- vercel.json builds with @vercel/node targeting server.js.
- Cold-start discipline: Flair disabled; the LR artifacts are small and quick to load.
Artifacts:
- count_vec.joblib and log_model.joblib (Git-LFS tracked).
- final-model.pt (Flair) and loss logs included only in ReferenceCode/.
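
The difference between the two /execute invocations above comes down to whether user text passes through a shell. The sketch below shows the same principle in Python with `subprocess.run` and an argument list, as an analogue of the Node `spawn` call rather than the project's server code. The helper `run_classifier` and the echo one-liner standing in for use_models.py are hypothetical.

```python
import subprocess
import sys


def run_classifier(script_args, text, timeout=10):
    """Run a Python child process, passing `text` as exactly one argv element."""
    result = subprocess.run(
        [sys.executable, *script_args, text],  # list form: no shell is involved
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


# Stand-in for use_models.py: a one-liner that echoes its first argument.
ECHO = ["-c", "import sys; print(sys.argv[1])"]
hostile = "INTJ notes; echo injected $(whoami)"
print(run_classifier(ECHO, hostile) == hostile)  # the text arrives unmodified
```

Because the text is never parsed by a shell, characters such as `;` and `$(...)` reach the child process as ordinary data.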

6) Why the Steering Matters

The MBTI label conditions the LLM output. It shifts phrasing style — analogy density, concreteness, pacing — while keeping factual content unchanged. This nudging helps align summaries with users’ preferred learning styles.

It is context, not a gate: no information is withheld; the label only changes presentation.
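
As an illustration of "context, not a gate", here is the prompt template from section 1 written as a Python function. The project composes this string in the browser UI; the function name `compose_prompt` and its parameter names are hypothetical.

```python
def compose_prompt(personality_type, text, prefs=""):
    """Fill the README's prompt template; the source text is passed through unchanged."""
    return (
        f"I am of personality type {personality_type}. "
        "Strictly summarize the following in a way someone of my personality "
        f"would best understand: {text}. "
        f"Extra instructions (if any): {prefs}."
    )


source = "Mitosis splits one cell into two identical cells"
intj_prompt = compose_prompt("INTJ", source, "use bullet points")
enfp_prompt = compose_prompt("ENFP", source, "use bullet points")
```

Swapping the label changes only the opening sentence; the full source text appears in both prompts, so the model always sees the same content.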

7) File Map

server.js — Express + Socket.IO; /execute → Python; GPT-4 loop.
use_models.py — Loads LR artifacts; prints MBTI label.
count_vec.joblib, log_model.joblib — serving artifacts.
public/index.html — UI: OCR, voice, chat, prompt composition.
config/open-ai.js — OpenAI client init.
vercel.json — serverless config.
ReferenceCode/ — Flair ensemble, alternate models, training logs.

8) Math Recap

$$ x \in \mathbb{N}^V \;\;\Rightarrow\;\; p(y=k \mid x) = \frac{e^{w_k^\top x + b_k}}{\sum_{j=1}^{16} e^{w_j^\top x + b_j}}, \quad \hat{y} = \arg\max_k p(y=k \mid x). $$

9) Limits & Future Directions

BoW limitation: ignores syntax/semantics, relies only on lexical cues.
Serverless: large neural models (Flair) strain cold starts; LR balances speed with acceptable accuracy.
Future extensions:
- Persist an embedding-based clusterer and return both MBTI label + cluster ID.
- Explore distilled neural classifiers to capture semantics without Flair’s overhead.
- Calibrate LR probabilities with temperature scaling.
- Harden /execute endpoint (prefer spawn, length limits, rate limiting).
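
For the temperature-scaling item, a minimal sketch follows. Temperature scaling divides the logits by a scalar $T > 0$ before the softmax and picks $T$ by minimizing negative log-likelihood on held-out data; it never changes the argmax. The grid search, the function names, and the toy logits are assumptions for illustration, not results from this project.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / T, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, true_indices, temperature):
    """Average negative log-likelihood of the true classes at temperature T."""
    losses = [
        -math.log(softmax(row, temperature)[k])
        for row, k in zip(logit_rows, true_indices)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, true_indices, grid):
    """Pick the T from `grid` with the lowest held-out NLL."""
    return min(grid, key=lambda t: mean_nll(logit_rows, true_indices, t))


# Toy held-out set (made up): the model is very confident but wrong on row 2.
rows = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
truth = [0, 1, 1, 0]
best_t = fit_temperature(rows, truth, grid=[0.5, 1.0, 2.0, 4.0])
print(best_t)  # -> 2.0
```

A fitted $T > 1$ softens overconfident probabilities, which matters if the confidence score is used to decide how strongly to steer the prompt.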

10) Data Flow

Text / OCR / voice → /execute (Python LR) → MBTI label → GPT-4 via Socket.IO → personality-aware summary.
