Skip to content

Aka-Nine/Reddit-Persona-Generator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🧠 Reddit User Persona Generator

An AI-powered Python tool that scrapes a Reddit user's public activity and generates a detailed psychological and behavioral persona — complete with Big Five personality scores, citations, and an interactive Streamlit viewer.

Python PRAW Groq Gemini Streamlit NLP License: MIT


📋 Table of Contents


🧠 Overview

The Reddit User Persona Generator is a modular, LLM-powered analysis pipeline that turns a Reddit profile URL into a rich psychological and behavioral profile. It combines traditional NLP techniques (sentiment analysis, entity extraction, keyword scoring) with a large language model to produce structured, cited persona reports.

The system is designed to be provider-agnostic — you can swap between Groq (LLaMA3-70B / Mixtral-8x7B for ultra-fast inference) and Google Gemini simply by changing an environment variable. An optional Streamlit frontend renders the .txt persona output as an interactive, expandable web dashboard.

Use cases include: user research, behavioral analysis, community moderation insights, LLM fine-tuning dataset creation, and UX persona development.


🏛️ Architecture Diagram

System Architecture

The pipeline flows through four layers: Input (CLI + config), Scraping (Reddit API via PRAW), NLP + AI (multi-library NLP enrichment → LLM persona synthesis → citation linking), and Output (structured .txt report + Streamlit viewer).


✨ Features

  • Reddit Scraping — Fetches up to 100 posts and 200 comments from any public Reddit profile via PRAW with built-in rate limiting
  • Multi-library NLP Pipeline — NLTK tokenization, TextBlob sentiment, VADER polarity, spaCy NER, and keyword extraction all run before the LLM call
  • Dual LLM Support — Groq (LLaMA3-70B / Mixtral-8x7B-32768) as default, Google Gemini as alternate — switchable via environment variable
  • Big Five Personality Scoring — Generates Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism scores (0.0–1.0)
  • Cited Persona Traits — Every inferred trait links back to specific posts or comments (up to 3 citations per trait, configurable)
  • Structured Template Output — Consistent .txt persona format via persona_template.txt, saved to output/{username}_persona.txt
  • Interactive Streamlit Viewervisualizer.py parses the output file and renders it as a web dashboard with expandable sections and a summary sidebar
  • Modular Architecture — Clean separation across reddit_scraper, data_processor, persona_analyzer, citation_manager, and output_generator
  • Validated Configconfig.py validates required env vars at startup and throws helpful errors for missing keys

🛠️ Tech Stack

Core

Technology Role
Python 3.9+ Core language
PRAW Reddit API wrapper — fetches posts and comments
python-dotenv Loads environment variables from .env

NLP

Technology Role
NLTK Tokenization, stopword removal, frequency distribution
TextBlob Polarity + subjectivity sentiment scoring
VADER (vaderSentiment) Fine-grained social-media-optimized sentiment analysis
textstat Readability scoring (used in text_utils.py)
text_utils.py Custom keyword extraction and readability scoring

LLM Providers

Provider Model Notes
Groq (default) llama-3.3-70b-versatile, llama-3.1-8b-instant, openai/gpt-oss-120b See Groq models; older IDs (e.g. Mixtral) may be decommissioned
Google Gemini (alternate) gemini-pro Switch via LLM_PROVIDER=google

Frontend & API

Technology Role
Flask Web UI + POST /analyze API (server.py, static/) — use same host/port as the page to avoid CORS issues
Streamlit Optional viewer for persona .txt output (pip install -r requirements-streamlit.txt)

Deploy

Technology Role
Docker / gunicorn Dockerfile, docker-compose.yml, Procfile for production
GitHub Actions CI (pytest + image build) and CD (push to ghcr.io on tags) — see CI/CD

📁 Project Structure

Reddit-Persona-Generator/
│
├── main.py                        # CLI — orchestrates the full pipeline
├── server.py                      # Flask API + static web UI
├── config.py                      # Centralized config & env var validation
├── visualizer.py                  # Optional Streamlit persona viewer
├── requirements.txt               # App dependencies (production)
├── requirements-streamlit.txt    # Optional Streamlit stack
├── Dockerfile                     # Container image
├── docker-compose.yml             # Local/prod compose (env_file: `.env`)
├── railway.json                   # Railway: Dockerfile + /health (clear bogus $PORT start cmds)
├── Procfile                       # Heroku-style process entry
├── runtime.txt                    # Python version hint (e.g. Heroku)
├── docker_entrypoint.py           # Reads PORT from env; starts gunicorn (no shell)
├── .env.example                   # Sample env (copy to `.env`)
├── .github/workflows/             # CI/CD (pytest, Docker build, GHCR push)
├── .gitignore                     # Excludes `.env`, `__pycache__`, etc.
│
├── src/                           # Core pipeline modules
│   ├── reddit_scraper.py          # PRAW-based Reddit data fetcher
│   ├── data_processor.py          # Text cleaning, filtering, normalization
│   ├── persona_analyzer.py        # NLP analysis + LLM persona generation
│   ├── citation_manager.py        # Links persona traits to source content
│   └── output_generator.py        # Formats and writes final persona report
│
├── utils/
│   ├── text_utils.py              # Keyword extraction, readability scoring
│   ├── validation.py              # URL validation, input sanitization
│   └── reddit_url.py              # Shared Reddit username/URL parsing (CLI + server)
│
├── static/
│   └── index.html                 # Web UI (served by Flask)
│
├── templates/
│   └── persona_template.txt       # Output format template
│
├── output/                        # Generated persona `.txt` files
│   └── …                          # e.g. `spez_persona.txt`
│
├── tests/
│   ├── test_scraper.py
│   └── test_analyser.py
│
├── info.txt                       # Project notes
└── Readme.md

🔄 How It Works

Full Pipeline (Step by Step)

1. INPUT
   User runs: python main.py https://www.reddit.com/user/spez
       │
       ├── config.py validates env vars (raises ValueError if missing)
       └── URL parsed → username extracted ("spez")

2. SCRAPING  (reddit_scraper.py)
       │
       ├── Authenticate with Reddit OAuth2 (read-only)
       ├── Fetch up to MAX_POSTS=100 recent submissions
       ├── Fetch up to MAX_COMMENTS=200 recent comments
       ├── Apply SCRAPING_DELAY=1.0s between requests
       └── Return raw post/comment objects

3. CLEANING  (data_processor.py + utils/)
       │
       ├── Remove deleted/removed content
       ├── Filter short content: MIN_TEXT_LENGTH=10
       ├── Truncate long content: MAX_TEXT_LENGTH=4000
       ├── Strip URLs, markdown, special characters
       └── Return clean structured text corpus

4. NLP ENRICHMENT  (persona_analyzer.py)
       │
       ├── NLTK: tokenize, stopword removal, frequency dist
       ├── TextBlob: polarity + subjectivity per post
       ├── VADER: compound sentiment score per comment
       ├── spaCy: extract named entities (ORG, GPE, PERSON)
       ├── text_utils.py: top keywords, readability score
       └── Pre-score Big Five traits from linguistic signals

5. LLM PERSONA GENERATION  (persona_analyzer.py → Groq / Gemini)
       │
       ├── Bundle NLP features + raw text sample → prompt
       ├── Send to LLM (Groq default, Gemini alternate)
       ├── LLM returns structured persona fields:
       │   demographics, personality, interests, style,
       │   motivations, frustrations, goals, quote
       └── Parse LLM response

6. CITATION LINKING  (citation_manager.py)
       │
       ├── For each inferred trait, find supporting posts/comments
       ├── Score relevance, pick top CITATION_LIMIT=3
       └── Attach source links and excerpts to each trait

7. OUTPUT  (output_generator.py)
       │
       ├── Apply persona_template.txt formatting
       ├── Insert all fields + citations
       ├── Write to output/{username}_persona.txt
       └── Print summary to stdout

8. OPTIONAL: STREAMLIT VIEWER  (visualizer.py)
       │
       ├── streamlit run visualizer.py
       ├── Load persona .txt file
       ├── Parse into sections via regex
       ├── Render expandable panels + sidebar stats
       └── Serve at http://localhost:8501

📊 Persona Output Format

Each generated persona report follows this structured template:

USERNAME
========

DEMOGRAPHICS
============
Estimated Age:       25–34
Occupation:          Software Engineer (inferred)
Location:            United States (inferred)
Relationship Status: Not specified

PERSONALITY TRAITS
==================
Big Five Scores:
  Openness:           0.78
  Conscientiousness:  0.62
  Extraversion:       0.41
  Agreeableness:      0.55
  Neuroticism:        0.33

PRIMARY INTERESTS
=================
1. Technology / Programming
2. Gaming
3. Finance / Investing
[Citations: post_id_1, post_id_2]

WRITING STYLE
=============
Tone:           Analytical, occasionally sarcastic
Avg Sentiment:  0.14 (mildly positive)
Complexity:     High readability score
Common Terms:   ["API", "latency", "async", ...]

MOTIVATIONS & GOALS
===================
...

FRUSTRATIONS
============
...

ACTIVITY SUMMARY
================
Primary Interest:  technology
Activity Level:    Very Active (high post frequency)
Top Subreddits:    r/programming, r/investing, r/gaming
Confidence Score:  0.84

CITATIONS
=========
[1] r/programming – "The async/await pattern in Python is..."
[2] r/personalfinance – "I've been DCAing into index funds..."

📸 Sample Output

Username: spez
Primary Interest: technology
Big Five Traits:
  Openness: 0.65
  Conscientiousness: 0.58
  Extraversion: 0.72
  Agreeableness: 0.44
  Neuroticism: 0.29

Top Communities: r/ModSupport, r/modnews, r/announcements
Activity Level: Extremely Active
Motivational Quote: "Empowering others through knowledge and community"
Confidence Score: 0.81

Full persona files are in the output/ directory.


⚙️ Environment Variables

Create a .env file in the project root. All required variables are validated by config.py at startup.

# ── Reddit API (required) ──────────────────────────────────────────────
# Get these from: https://www.reddit.com/prefs/apps → "create another app"
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_client_secret
REDDIT_USER_AGENT=PersonaGenerator:v1.0.0 (by /u/your_reddit_username)

# ── LLM Provider (choose one) ─────────────────────────────────────────
LLM_PROVIDER=groq                  # 'groq' (default) or 'google'
GROQ_API_KEY=your_groq_api_key     # https://console.groq.com
GOOGLE_API_KEY=your_google_api_key # https://ai.google.dev (if using Gemini)

# ── Model Selection ───────────────────────────────────────────────────
GROQ_MODEL=llama-3.3-70b-versatile # or: llama-3.1-8b-instant, openai/gpt-oss-120b
GOOGLE_MODEL=gemini-pro

# ── Scraping Limits ───────────────────────────────────────────────────
MAX_POSTS=100
MAX_COMMENTS=200
SCRAPING_DELAY=1.0                 # Seconds between API calls

# ── Analysis Settings ─────────────────────────────────────────────────
MIN_TEXT_LENGTH=10                 # Ignore very short posts
MAX_TEXT_LENGTH=4000               # Truncate very long posts
CONFIDENCE_THRESHOLD=0.7           # Minimum confidence for a trait to be included

# ── Output Settings ───────────────────────────────────────────────────
OUTPUT_DIR=output
INCLUDE_CITATIONS=True
CITATION_LIMIT=3                   # Max citations per trait

# ── Logging ───────────────────────────────────────────────────────────
LOG_LEVEL=INFO
LOG_FILE=persona_generator.log

🚀 Installation

Step 1 — Clone the Repository

git clone https://github.com/Aka-Nine/Reddit-Persona-Generator.git
cd Reddit-Persona-Generator

Step 2 — Create a Virtual Environment

python -m venv venv

# Activate:
source venv/bin/activate        # macOS / Linux
venv\Scripts\activate           # Windows

Step 3 — Install Dependencies

pip install -r requirements.txt

Step 4 — Download NLP Models

# NLTK data
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('vader_lexicon')"

# spaCy English model
python -m spacy download en_core_web_sm

Step 5 — Create a Reddit App

  1. Go to reddit.com/prefs/apps
  2. Click "create another app"
  3. Choose "script" type
  4. Set redirect URI to http://localhost:8080
  5. Copy your Client ID (under the app name) and Client Secret

Step 6 — Configure .env

cp .env.example .env
# Edit .env with your API keys

💻 Usage

Generate a Persona (CLI)

# Using a full Reddit profile URL
python main.py https://www.reddit.com/user/spez

# Output saved to: output/spez_persona.txt

Multiple Users

python main.py https://www.reddit.com/user/spez
python main.py https://www.reddit.com/user/GallowBoob
python main.py https://www.reddit.com/user/kojied

Switch LLM Provider

# Use Google Gemini instead of Groq
LLM_PROVIDER=google python main.py https://www.reddit.com/user/spez

# Or set in .env:
LLM_PROVIDER=google
GOOGLE_API_KEY=your_key_here

🌐 Web UI & Docker

Run the Flask app so the browser and POST /analyze share the same origin (avoids CORS / empty JSON responses):

python server.py

Open http://127.0.0.1:5000/ (or the URL shown in the terminal). Use the same .env variables as the CLI.

Health checks: GET /health or GET /healthz.

Docker

docker build -t persona-gen .
docker run --rm --env-file .env -p 8080:8080 persona-gen

Or: docker compose up --build (loads .env, sets OUTPUT_DIR=/tmp/output in the container).

Windows: requirements.txt must be UTF-8 (not UTF-16). If pip or Docker shows \x00 in errors, re-save the file as UTF-8.

PaaS: On Heroku, Railway, or Render, set env vars from .env.example, use the Dockerfile or Procfile, and set OUTPUT_DIR=/tmp/output if only /tmp is writable. Persona files: set PERSONA_WRITE_TO_DISK=false (default) so the API does not write output/*.txt on the server; the web UI keeps the generated text in sessionStorage for that tab only. Set PERSONA_WRITE_TO_DISK=true if you want files on disk (e.g. local CLI-style persistence). CORS: use CORS_ORIGINS=https://your-frontend.com when the UI is on another origin. Timeouts: persona runs can exceed 30s — the image and Procfile use gunicorn --timeout 180.

Railway: Do not add a variable PORT with value $PORT (that passes a literal string). Railway injects PORT automatically. In the service settings, clear any Custom Start Command that references $PORT so the Docker ENTRYPOINT (docker_entrypoint.py) runs. This repo includes railway.json with a /health check.


CI/CD (GitHub Actions)

Workflow When What
CI (.github/workflows/ci.yml) Push / PR to main or master pytest + Docker build (no push)
CD (.github/workflows/cd.yml) Git tag v* (e.g. v1.0.0) or Run workflow in Actions Build and push image to GHCR (ghcr.io/<owner>/<repo>)
Dependabot (.github/dependabot.yml) Weekly / monthly PRs for pip and GitHub Actions updates

One-time: Repo Settings → Actions → General → Workflow permissions — allow read and write (or packages: write) so GITHUB_TOKEN can push to GHCR.

Release: git tag v1.0.0 && git push origin v1.0.0 → image tags include v1.0.0 and latest. Manual run: Actions → CDRun workflow (optional extra tag via input).

Adding new checks: edit .github/workflows/ci.yml (e.g. add ruff, matrix Python versions, or a job that runs docker compose run).


🌐 Streamlit Viewer

Launch the interactive persona viewer to browse generated profiles in a web UI:

streamlit run visualizer.py

The viewer will open at http://localhost:8501. Features:

  • File picker — enter the path to any output/*.txt persona file
  • Expandable sections — each persona category (Demographics, Personality, Interests, etc.) is collapsible
  • Auto-expanded — "Analysis Summary" and "Personality Traits" sections open by default
  • Sidebar — shows Primary Interest, Activity Level, and Confidence Score at a glance

⚙️ Configuration Reference

All settings are managed via config.py with environment variable overrides:

Variable Default Description
LLM_PROVIDER groq LLM backend: groq or google
GROQ_MODEL llama-3.3-70b-versatile Groq model ID (supported models)
GOOGLE_MODEL gemini-pro Gemini model ID
MAX_POSTS 100 Max submissions to fetch
MAX_COMMENTS 200 Max comments to fetch
SCRAPING_DELAY 1.0 Seconds between Reddit API calls
MIN_TEXT_LENGTH 10 Skip posts shorter than this
MAX_TEXT_LENGTH 4000 Truncate posts longer than this
CONFIDENCE_THRESHOLD 0.7 Minimum trait confidence to include
INCLUDE_CITATIONS True Attach source references to traits
CITATION_LIMIT 3 Max citations per trait
OUTPUT_DIR output Directory for persona .txt files
LOG_LEVEL INFO Logging verbosity

🧪 Testing

Run the unit tests with pytest:

pip install pytest
pytest tests/ -v

Test coverage includes:

  • test_scraper.py — PRAW fetching, rate limiting, error handling (404, private profiles)
  • test_processor.py — Text cleaning, filtering, length constraints
  • test_analyzer.py — NLP scoring correctness, LLM prompt construction
  • test_output.py — Template rendering, file writing, citation formatting

Run a specific test file:

pytest tests/test_scraper.py -v

⚖️ Ethical Considerations

This tool only accesses publicly available Reddit content. No private messages, direct messages, or non-public data are ever fetched.

Please keep the following in mind when using this tool:

  • Generated personas are AI inferences, not definitive psychological profiles. They should be interpreted with appropriate skepticism.
  • Do not use generated personas to harass, discriminate against, or make decisions about real individuals.
  • Respect Reddit's API Terms of Service and User Agreement. Do not scrape at rates that violate these terms.
  • If you are analyzing a profile and the user wishes to opt out, respect their request and delete the generated data.
  • The SCRAPING_DELAY default of 1 second is intentionally conservative — do not reduce this significantly.

⚠️ Known Limitations

  • Private profiles return no data — the script exits gracefully with a clear error message
  • Low-activity users (few posts/comments) may produce low-confidence or incomplete personas
  • LLM accuracy varies — inferred demographics (age, location, occupation) are estimates and may be wrong
  • Groq rate limits apply on free-tier accounts — processing very large datasets may require pausing between users
  • spaCy NER is English-only by default; non-English content may produce inaccurate entity extractions
  • The output is plain text — no JSON export yet (on the roadmap)

🔮 Roadmap

  • JSON / structured output — Export personas as JSON for downstream use
  • Batch processing — Analyze multiple users from a list file in one run
  • Trend analysis — Track persona evolution over time with historical snapshots
  • Multi-language support — spaCy model selector for non-English users
  • Web UI upload — Streamlit form to paste URL and generate persona in-browser
  • Async scraping — Parallel fetching for faster data collection
  • Comparative analysis — Diff two personas side by side
  • PDF export — Render the persona report as a formatted PDF
  • LLM agent architecture — Specialized sub-agents per personality dimension (demographics agent, interests agent, etc.)

🤝 Contributing

Contributions are welcome! To get started:

# Fork the repo, then:
git clone https://github.com/<your-username>/Reddit-Persona-Generator.git
cd Reddit-Persona-Generator

# Create a feature branch
git checkout -b feature/your-feature-name

# Install dependencies
pip install -r requirements.txt

# Make your changes and run tests
pytest tests/ -v

# Commit and push
git commit -m "feat: describe your change"
git push origin feature/your-feature-name

# Open a Pull Request on GitHub

Please ensure tests pass before opening a PR. Add new tests for any new functionality.


📜 License

This project is licensed under the MIT License.


🙏 Acknowledgements

  • PRAW — the Python Reddit API Wrapper that makes scraping clean and easy
  • Groq — blazing-fast LLM inference for Mixtral and LLaMA3
  • Google Gemini — powerful alternate LLM provider
  • NLTK · spaCy · TextBlob · VADER — the NLP backbone
  • Streamlit — rapid interactive web UI with zero frontend code

Turning Reddit activity into actionable psychological insights — powered by NLP and LLMs.

⭐ Star this repo · 🐛 Report a bug · 💡 Request a feature