Content for workshops on computer vision @ HPI's AI Service Center
Updated Nov 9, 2024 - Jupyter Notebook
Harnessing Large Language Models for Curated Code Reviews
Implementation of TSDS: Data Selection for Task-Specific Model Finetuning. An optimal-transport framework for selecting domain-specific and task-specific training data to improve LLM finetuning and instruction tuning.
HyperView curates datasets and provides model introspection in hyperbolic and Euclidean geometries.
[ACL 2024 (Findings)] ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation
Image description/tagging tool
NAACL 2025 | How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?
An image deduplication GUI, made for image generation models dataset deduplication using CLIP.
Biomedical Image Processing BAP (Scientific Research Project) - Piri Reis University
Manage and process paired RGB and depth images, with options to view, export, and exclude images using various colormaps.
Comprehensive framework for curating and validating biomedical datasets for clinical AI applications
AIWG training-complete framework — corpus-to-dataset pipeline with SKILL.md agentic surface and optional Python runtime backend. Marketplace plugin for AIWG.
A local-first self-improvement runtime for language systems. Record. Learn. Rewrite.
Workflow and validation toolkit for human review of wildlife AI outputs
ML/AI Data Curation Functional Setup
Two-stage video captioning pipeline: a Vision-Language model produces a rich description, then a text-only LM rewrites it through a task-specific prompt (e.g. for LoRA training datasets, retrieval, summarization).
Efficient, reproducible dataset curation for LLM fine-tuning: scripts and best practices for preparing code datasets without repository bloat.
🪘 Tabla Drum Image Generator – AI-powered tabla drum image generation using Stable Diffusion & GANs. Features custom dataset curation, ML training pipeline, and scalable API deployment.
GPT-2 fine-tuned model for myth generation using curated mythological datasets and structured NLP preprocessing.
End-to-end object detection project for telecom infrastructure using real-world field data with challenging conditions such as rust, occlusion, and adverse weather