diff --git a/.gitignore b/.gitignore index 731d4ae..f041309 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,4 @@ _site .tmp/ vendor/bundle .bundle +__pycache__/ diff --git a/scripts/README.md b/scripts/README.md index 871661c..ca6c5e2 100644 --- a/scripts/README.md +++ b/scripts/README.md @@ -1,6 +1,6 @@ -# Transcript Conversion Scripts +# Transcript Management Scripts -This directory contains scripts for managing and converting transcripts for Gary's Economics. +This directory contains scripts for managing, converting, and validating transcripts for Gary's Economics. ## convert_transcripts.py @@ -43,3 +43,98 @@ The script will: - Special characters in titles and descriptions are properly escaped - Multi-line descriptions use YAML literal block scalar format - The script requires Python 3.6+ + +## process_video.py + +This script downloads VTT caption files for a specific YouTube video and creates a corresponding Jekyll post file. + +### What it does: + +1. Takes a YouTube video ID as a command-line argument +2. Downloads video metadata using yt-dlp +3. Downloads all available English VTT subtitle files (manual or auto-generated) +4. Copies VTT files to `_includes/captions/` named by YouTube ID +5. Creates a Jekyll post markdown file in `_posts/` with YAML front matter + +### Usage: + +```bash +python3 scripts/process_video.py VIDEO_ID +``` + +For example, to process a video with ID `Ja9dTjY3uWU`: + +```bash +python3 scripts/process_video.py Ja9dTjY3uWU +``` + +The script will: +- Download video metadata from YouTube +- Download VTT subtitle files (prioritizes manual over auto-generated) +- Copy VTT files to `_includes/captions/VIDEO_ID.vtt` +- Create a post file in `_posts/YYYY-MM-DD-slug.md` +- Report progress and any issues encountered + +### Output Format: + +**Post files** (`_posts/YYYY-MM-DD-slug.md`): +- YAML front matter with complete metadata (title, date, youtube_id, view_count, like_count, tags, categories, description, etc.) +- Properly escaped titles and descriptions for YAML compatibility +- Reference to caption file + +**Caption files** (`_includes/captions/YOUTUBE_ID.vtt`): +- WebVTT format subtitle files +- Named by YouTube video ID +- Additional language variants named as `YOUTUBE_ID.lang.vtt` if available + +### Notes: + +- Requires yt-dlp to be installed (`pip install yt-dlp`) +- Will not overwrite existing post or caption files +- Prioritizes manual subtitles over auto-generated ones +- Downloads English subtitles in order of preference: en-GB, en-orig, en +- The script requires Python 3.6+ + +## linter.py + +This script validates consistency between `_posts` and `_includes/captions` directories. + +### What it does: + +1. Scans all VTT files in `_includes/captions/` +2. Scans all post files in `_posts/` +3. Verifies that each VTT file has a corresponding post with matching YouTube ID +4. Verifies that each post has at least one corresponding VTT file +5. Reports any inconsistencies or missing files + +### Usage: + +```bash +python3 scripts/linter.py +``` + +The script will: +- Check all VTT files have corresponding posts +- Check all posts have corresponding VTT files +- Report posts with multiple VTT language variants +- Provide a summary of findings + +### Output: + +The linter will report: +- Number of unique YouTube IDs with VTT files +- Number of posts with YouTube IDs +- Any VTT files without corresponding posts (errors) +- Any posts without corresponding VTT files (errors) +- Posts with multiple VTT language variants (informational) + +### Exit Codes: + +- `0`: All checks passed +- `1`: Errors found or exception occurred + +### Notes: + +- Should be run after adding new content to verify consistency +- Useful for CI/CD pipelines to ensure data integrity +- The script requires Python 3.6+ diff --git a/scripts/linter.py b/scripts/linter.py new file mode 100755 index 0000000..9abf56c --- /dev/null +++ b/scripts/linter.py @@ -0,0 +1,179 @@ +#!/usr/bin/env python3 +""" +Linter to check consistency between _posts and _includes/captions. + +This script verifies that: +1. For each VTT file in _includes/captions, there is a corresponding _posts file + with the same YouTube ID in its frontmatter +2. For each _posts file, there is at least one VTT file in _includes/captions + with the YouTube ID as the base name +""" + +import os +import re +import sys +import traceback +from pathlib import Path + + +def extract_youtube_id_from_post(post_path): + """Extract YouTube ID from post file's frontmatter.""" + try: + with open(post_path, 'r', encoding='utf-8') as f: + content = f.read() + + # Extract frontmatter (between --- markers) + frontmatter_match = re.search(r'^---\s*\n(.*?)\n---', content, re.MULTILINE | re.DOTALL) + if not frontmatter_match: + return None + + frontmatter = frontmatter_match.group(1) + + # Extract youtube_id field + youtube_id_match = re.search(r'^youtube_id:\s*(.+?)\s*$', frontmatter, re.MULTILINE) + if youtube_id_match: + return youtube_id_match.group(1).strip() + + return None + except Exception as e: + print(f"āš ļø Error reading {post_path}: {e}") + return None + + +def get_all_vtt_files(captions_dir): + """Get all VTT files and their base YouTube IDs.""" + vtt_files = {} + + if not os.path.exists(captions_dir): + return vtt_files + + for filename in os.listdir(captions_dir): + if filename.endswith('.vtt'): + # Extract YouTube ID from VTT filename + # Files can be named like: + # - VIDEO_ID.vtt (primary) + # - VIDEO_ID.lang.vtt (language variants) + # Remove .vtt extension and split on '.' to get base ID + name = filename.replace('.vtt', '') + parts = name.split('.') + youtube_id = parts[0] + + if youtube_id not in vtt_files: + vtt_files[youtube_id] = [] + vtt_files[youtube_id].append(filename) + + return vtt_files + + +def get_all_posts(posts_dir): + """Get all post files and their YouTube IDs.""" + posts = {} + + if not os.path.exists(posts_dir): + return posts + + for filename in os.listdir(posts_dir): + if filename.endswith('.md'): + post_path = os.path.join(posts_dir, filename) + youtube_id = extract_youtube_id_from_post(post_path) + if youtube_id: + posts[youtube_id] = filename + + return posts + + +def lint(): + """Run linting checks.""" + # Change to repository root + script_dir = os.path.dirname(os.path.abspath(__file__)) + repo_root = os.path.dirname(script_dir) + os.chdir(repo_root) + + captions_dir = os.path.join('_includes', 'captions') + posts_dir = '_posts' + + print("Linting transcript consistency...") + print("=" * 60) + + # Get all VTT files and posts + vtt_files = get_all_vtt_files(captions_dir) + posts = get_all_posts(posts_dir) + + print(f"šŸ“Š Found {len(vtt_files)} unique YouTube IDs in VTT files") + print(f"šŸ“Š Found {len(posts)} posts with YouTube IDs") + print("=" * 60) + + errors = [] + warnings = [] + + # Check 1: For each VTT file, ensure there's a corresponding post + print("\nšŸ” Checking VTT files have corresponding posts...") + vtt_without_post = [] + for youtube_id, files in vtt_files.items(): + if youtube_id not in posts: + vtt_without_post.append((youtube_id, files)) + errors.append(f"VTT file(s) {files} for YouTube ID '{youtube_id}' has no corresponding post") + + if vtt_without_post: + print(f"āŒ Found {len(vtt_without_post)} VTT file(s) without corresponding posts:") + for youtube_id, files in vtt_without_post: + print(f" - {youtube_id}: {', '.join(files)}") + else: + print("āœ… All VTT files have corresponding posts") + + # Check 2: For each post, ensure there's at least one VTT file + print("\nšŸ” Checking posts have corresponding VTT files...") + posts_without_vtt = [] + for youtube_id, post_file in posts.items(): + if youtube_id not in vtt_files: + posts_without_vtt.append((youtube_id, post_file)) + errors.append(f"Post '{post_file}' for YouTube ID '{youtube_id}' has no corresponding VTT file") + + if posts_without_vtt: + print(f"āŒ Found {len(posts_without_vtt)} post(s) without corresponding VTT files:") + for youtube_id, post_file in posts_without_vtt: + print(f" - {youtube_id}: {post_file}") + else: + print("āœ… All posts have corresponding VTT files") + + # Check 3: Report on posts with multiple VTT variants + print("\nšŸ“‹ Posts with multiple VTT language variants:") + multi_vtt = [(youtube_id, files) for youtube_id, files in vtt_files.items() if len(files) > 1] + if multi_vtt: + for youtube_id, files in multi_vtt: + print(f" - {youtube_id}: {', '.join(files)}") + else: + print(" (none)") + + # Summary + print("\n" + "=" * 60) + print("šŸ“Š Linting Summary") + print("=" * 60) + + if errors: + print(f"āŒ Found {len(errors)} error(s):") + for error in errors: + print(f" - {error}") + print("\nāŒ Linting failed!") + return False + else: + print("āœ… All checks passed!") + print(f" - {len(vtt_files)} YouTube IDs with VTT files") + print(f" - {len(posts)} posts with YouTube IDs") + print(f" - {len(multi_vtt)} videos with multiple VTT variants") + return True + + +def main(): + """Main entry point.""" + try: + success = lint() + sys.exit(0 if success else 1) + except Exception as e: + print(f"āŒ Error: {e}") + traceback.print_exc() + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/scripts/process_video.py b/scripts/process_video.py new file mode 100755 index 0000000..61582ae --- /dev/null +++ b/scripts/process_video.py @@ -0,0 +1,302 @@ +#!/usr/bin/env python3 +""" +Process a YouTube video: download VTT files and create Jekyll post. + +This script downloads VTT caption files for a YouTube video using yt-dlp +and creates a corresponding Jekyll post file in _posts/ with YAML front matter. +""" + +import os +import sys +import json +import re +import argparse +import traceback +from subprocess import run +from datetime import datetime +import shutil + + +def slugify(s): + """Convert string to URL-friendly slug.""" + s = s.lower().strip() + s = re.sub(r"[^\w\s-]", "", s) + s = re.sub(r"[\s_-]+", "-", s) + s = re.sub(r"^-+|-+$", "", s) + return s + + +def download_metadata(video_id): + """Download video metadata using yt-dlp.""" + url = f"https://www.youtube.com/watch?v={video_id}" + dl_path = ".tmp/dl" + + # Create temp directory if it doesn't exist + os.makedirs(".tmp", exist_ok=True) + + # Remove existing metadata file if present + try: + os.remove(f"{dl_path}.info.json") + except FileNotFoundError: + pass + + # Download metadata + run( + [ + "yt-dlp", + "--skip-download", + "--write-info-json", + "-o", + dl_path, + "--", + url, + ], + check=True, + ) + + if not os.path.exists(f"{dl_path}.info.json"): + raise RuntimeError("Failed to download metadata") + + with open(f"{dl_path}.info.json", encoding="utf-8") as f: + return json.load(f) + + +def download_subtitles(video_id, metadata): + """Download VTT subtitle files for all available English variants.""" + url = f"https://www.youtube.com/watch?v={video_id}" + dl_path = ".tmp/dl" + + # Determine which subtitles to download + subtitles = metadata.get("subtitles", {}) + auto_captions = metadata.get("automatic_captions", {}) + + # Priority order for English subtitles + lang_priority = ["en-GB", "en-orig", "en"] + + downloaded_files = [] + + # Try manual subtitles first + for lang in lang_priority: + if lang in subtitles: + # Remove existing file if present + try: + os.remove(f"{dl_path}.{lang}.vtt") + except FileNotFoundError: + pass + + run( + [ + "yt-dlp", + "--write-sub", + "--sub-lang", + lang, + "--skip-download", + "-o", + dl_path, + "--", + url, + ], + check=True, + ) + + if os.path.exists(f"{dl_path}.{lang}.vtt"): + downloaded_files.append((f"{dl_path}.{lang}.vtt", lang, False)) + + # If no manual subtitles, try auto-generated + if not downloaded_files: + for lang in lang_priority: + if lang in auto_captions: + # Remove existing file if present + try: + os.remove(f"{dl_path}.{lang}.vtt") + except FileNotFoundError: + pass + + run( + [ + "yt-dlp", + "--write-auto-sub", + "--sub-lang", + lang, + "--skip-download", + "-o", + dl_path, + "--", + url, + ], + check=True, + ) + + if os.path.exists(f"{dl_path}.{lang}.vtt"): + downloaded_files.append((f"{dl_path}.{lang}.vtt", lang, True)) + break # Only download first auto-caption variant + + return downloaded_files + + +def create_post(metadata, video_id): + """Create Jekyll post file with front matter.""" + title = metadata.get("title", "") + author = metadata.get("uploader", "Garys Economics") + timestamp = metadata.get("timestamp", 0) + date = datetime.fromtimestamp(timestamp).date().isoformat() + view_count = metadata.get("view_count", 0) + like_count = metadata.get("like_count", 0) + duration_seconds = metadata.get("duration", 0) + tags = metadata.get("tags", []) + categories = metadata.get("categories", []) + description = metadata.get("description", "") + thumbnails = metadata.get("thumbnails", []) + thumbnail = thumbnails[-1]["url"] if thumbnails else f"https://i.ytimg.com/vi/{video_id}/maxresdefault.jpg" + + # Generate post filename + slug = slugify(title) + post_filename = f"{date}-{slug}.md" + post_path = os.path.join("_posts", post_filename) + + # Check if post already exists + if os.path.exists(post_path): + print(f"āš ļø Post already exists: {post_path}") + return False, post_path + + # Create _posts directory if it doesn't exist + os.makedirs("_posts", exist_ok=True) + + # Create Jekyll post with front matter + with open(post_path, "w", encoding="utf-8") as f: + f.write("---\n") + f.write("layout: post\n") + + # Quote title if it contains special YAML characters + if ":" in title or "#" in title or "@" in title or "[" in title or "]" in title: + safe_title = title.replace('"', '\\"') + f.write(f'title: "{safe_title}"\n') + else: + f.write(f"title: {title}\n") + + f.write(f"author: {author}\n") + f.write(f"date: {date}\n") + f.write(f"youtube_url: https://www.youtube.com/watch?v={video_id}\n") + f.write(f"youtube_id: {video_id}\n") + f.write(f"view_count: {view_count}\n") + f.write(f"like_count: {like_count}\n") + f.write(f"duration_seconds: {duration_seconds}\n") + + if tags: + f.write("tags:\n") + for tag in tags: + f.write(f"- {tag}\n") + + if categories: + f.write("categories:\n") + for category in categories: + f.write(f"- {category}\n") + + # Write description + if description: + # Use YAML literal block scalar for multi-line or special descriptions + if "\n" in description or len(description) > 80 or ":" in description or "#" in description: + f.write("description: |\n") + for line in description.split("\n"): + f.write(f" {line}\n") + else: + f.write(f"description: {description}\n") + + f.write(f"thumbnail: {thumbnail}\n") + f.write("channel_url: https://www.youtube.com/@garyseconomics\n") + f.write(f"caption_file: captions/{video_id}.vtt\n") + f.write("---\n") + + return True, post_path + + +def copy_vtt_files(downloaded_files, video_id): + """Copy VTT files to _includes/captions directory.""" + captions_dir = os.path.join("_includes", "captions") + os.makedirs(captions_dir, exist_ok=True) + + copied_files = [] + + for file_path, lang, is_auto in downloaded_files: + # Primary file uses just the video ID + if not copied_files: + dest_path = os.path.join(captions_dir, f"{video_id}.vtt") + else: + # Additional files include language code + dest_path = os.path.join(captions_dir, f"{video_id}.{lang}.vtt") + + if os.path.exists(dest_path): + print(f"āš ļø Caption file already exists: {dest_path}") + else: + shutil.copy2(file_path, dest_path) + copied_files.append(dest_path) + print(f"āœ… Copied caption: {dest_path} ({'auto' if is_auto else 'manual'})") + + return copied_files + + +def process_video(video_id): + """Process a YouTube video: download VTT and create post.""" + # Change to repository root + script_dir = os.path.dirname(os.path.abspath(__file__)) + repo_root = os.path.dirname(script_dir) + os.chdir(repo_root) + + print(f"Processing video: {video_id}") + print("=" * 60) + + # Download metadata + print("šŸ“„ Downloading metadata...") + metadata = download_metadata(video_id) + print(f"āœ… Video: {metadata.get('title', 'Unknown')}") + + # Download subtitles + print("šŸ“„ Downloading VTT files...") + downloaded_files = download_subtitles(video_id, metadata) + + if not downloaded_files: + print("āŒ No English subtitles found for this video") + return False + + print(f"āœ… Downloaded {len(downloaded_files)} VTT file(s)") + + # Copy VTT files to captions directory + print("šŸ“ Copying VTT files to _includes/captions/...") + copied_files = copy_vtt_files(downloaded_files, video_id) + + # Create post + print("šŸ“ Creating Jekyll post...") + created, post_path = create_post(metadata, video_id) + + if created: + print(f"āœ… Created post: {post_path}") + + print("=" * 60) + print("āœ… Processing complete!") + + return True + + +def main(): + """Main entry point.""" + parser = argparse.ArgumentParser( + description="Download VTT files and create Jekyll post for a YouTube video" + ) + parser.add_argument( + "video_id", + help="YouTube video ID (e.g., Ja9dTjY3uWU)" + ) + + args = parser.parse_args() + + try: + success = process_video(args.video_id) + sys.exit(0 if success else 1) + except Exception as e: + print(f"āŒ Error: {e}") + traceback.print_exc() + sys.exit(1) + + +if __name__ == "__main__": + main()