Building Indonesian Language Intelligence, from scratch.
Linguist · AI Engineer · Open Source Builder · Security-Aware Developer
status: "Preparing graduate research applications"
focus: "Language Technology × Computational Sociolinguistics"
next: "Graduate research in Language Technology & Computational Sociolinguistics"
open_to: "Collaborators, compute resources, research mentors"
building: "Aibys2: next-gen Indonesian LLM (tokenizer · training · SFT · tool calling · vision)"
recent: "Aibys AI tools suite (research, medical, legal, invoice) · ArLface Recognition"
learning: "Sociolinguistics research methodology, academic writing EN"
- 🇮🇩 Indonesian NLP: anything that makes Bahasa Indonesia better represented in AI
- AI for underserved communities, especially rural or low-resource contexts
- Low-resource language modeling: training, fine-tuning, evaluation
- Security-aware AI systems: threat modeling, robust architecture
- Language technology research, if you're a researcher looking for a motivated collaborator
Got compute? I've got the pipeline.
I'm Syahril Haryono, an Indonesian developer at an unusual intersection: German linguistics × AI engineering × grassroots community work.
I started in tech at 13, not through courses or bootcamps but by probing the edges of systems: exploring web vulnerabilities and learning how things break. That early obsession with how systems fail became the foundation of how I build today: with an instinct for failure modes and security implications, and a conviction that robust architecture must be designed in from day one, not bolted on as an afterthought.
I studied German Language Education at Universitas Negeri Jakarta, which gave me something most engineers lack: a rigorous understanding of how language is structured, how meaning is encoded, and how communication breaks down across cultures and communities.
Along the way, two experiences left me with questions I still can't stop thinking about.
The first: building Aibys, an Indonesian LLM from scratch, made me realize how severely underrepresented Bahasa Indonesia is in the global NLP landscape. It has 270 million speakers, yet most language models treat it as an afterthought. Why does this gap exist, and what does it actually take to close it properly?
The second: leading a digital transformation program in a rural village in Karawang, where I trained 20–50 locals and then watched the platform get abandoned within a year, made me realize that the problem isn't just technical. Why do communities with genuine needs still reject technology built "for" them? What is the actual barrier: is it linguistic, cultural, or something deeper?
These aren't questions I can answer alone, or with just more coding.
They're the questions I intend to bring into graduate research, and I'm actively looking for the right academic environment to pursue them.
An open-source pipeline for building a Large Language Model for Bahasa Indonesia, entirely from scratch.
| Repo | What it does |
|---|---|
| Aibys-Data-Collector | Collect, clean, shuffle & prepare Indonesian text datasets. Streaming mode for 50GB+ corpora. Estimated corpus: ~13B tokens. |
| Indonesian-LLM-Starter | Decoder-only Transformer written from scratch in PyTorch: RMSNorm · RoPE · SwiGLU · Flash Attention 2 · GGUF export. |
| Indonesian-LLM-Finetune | LoRA fine-tuning pipeline: turns a pre-trained checkpoint into a conversational Bahasa Indonesia assistant. |
| aibys-tokenizer | BPE tokenizer · 32K vocab · trained on 10M sentences · weighted sampling optimized for Bahasa Indonesia. |
| Aibys2 | Next-gen runnable LLM starter: tokenizer · training · checkpointing · SFT scaffolding · tool calling · vision dataset support. |
- Aibys Data Collector (corpus pipeline) → ~13B-token corpus
- Indonesian LLM Starter (pre-training) → aibys_final.pt
- Indonesian LLM Finetune (instruction tuning) → chat-ready model 🇮🇩
- Aibys2 (next iteration: SFT · tool calling · vision)
Current status: Full pipeline functional. Proof-of-concept training completed (20K steps → coherent Indonesian text generation). Aibys2 actively in development. Full training pending compute resources.
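
The "clean, shuffle, streaming mode" step of the data collector can be illustrated with a bounded shuffle buffer. This is only a minimal sketch of the general technique, not the code in Aibys-Data-Collector: the names `clean_line` and `stream_shuffled`, the whitespace-only cleaning rule, and the default `buffer_size` are all assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, Optional


def clean_line(line: str) -> Optional[str]:
    """Hypothetical cleaning rule: collapse whitespace, drop empty lines."""
    text = " ".join(line.split())
    return text or None


def stream_shuffled(
    lines: Iterable[str], buffer_size: int = 10_000, seed: int = 0
) -> Iterator[str]:
    """Approximately shuffle a stream while holding at most buffer_size lines.

    Memory stays bounded, so the input can be far larger than RAM.
    The shuffle is only local: a line can move forward by any distance,
    but it is never emitted before the buffer has filled once.
    """
    rng = random.Random(seed)
    buffer = []
    for raw in lines:
        text = clean_line(raw)
        if text is None:
            continue
        if len(buffer) < buffer_size:
            buffer.append(text)
            continue
        # Buffer is full: emit a random resident and store the new line in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = text
    # Input exhausted: flush whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    sample = ["  halo   dunia ", "", "apa kabar", "selamat pagi", "terima kasih"]
    for sentence in stream_shuffled(sample, buffer_size=2, seed=1):
        print(sentence)
```

Because `lines` can be any iterable, an open file handle can be passed directly and the corpus is never loaded in full. A larger buffer gives a better shuffle at the cost of memory.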
Local-first, privacy-preserving AI tools, all powered by Ollama and running fully on your machine.
| Repo | What it does |
|---|---|
| aibys-research-summarizer | Turns PDF/TXT research papers into structured plain-language summaries, key results, limitations, follow-up questions, and exportable reports. |
| aibys-medical-explainer | Explains medical reports from PDF/TXT/image uploads, highlights notable results, and saves JSON/CSV/Markdown history. |
| aibys-legal-analyzer | Summarizes contracts, highlights risky clauses, scores risk, and saves local JSON/CSV/Markdown reports. |
| aibys-invoice-extractor | Extracts structured data from invoice/receipt PDFs and images. Export to CSV. Vision-powered, runs fully local. |
| Repo | What it does |
|---|---|
| ArLface-Recognition | Open-source face recognition system built with FastAPI and Python. Uses AuraFace (ArcFace) for embeddings; all application logic is built from scratch. Real-time, OpenCV-powered. |
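
Embedding-based face recognition usually comes down to comparing two vectors. The sketch below shows one common way to do that, cosine similarity with a decision threshold. It is not taken from ArLface-Recognition: the function names and the `0.5` threshold are hypothetical, and a real threshold would have to be tuned on validation data for the embedding model in use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no match".
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_person(
    a: Sequence[float], b: Sequence[float], threshold: float = 0.5
) -> bool:
    """Hypothetical decision rule: match when similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


if __name__ == "__main__":
    enrolled = [0.9, 0.1, 0.4]  # toy 3-dimensional "embeddings"
    probe = [0.8, 0.2, 0.5]
    print(is_same_person(enrolled, probe))
```

Real face embeddings have hundreds of dimensions and would normally be compared with a vectorized library, but the decision rule has the same shape.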
Separate from my AI projects, but these shaped how I think about who technology is actually built for.
Community service project in collaboration with Universitas Negeri Jakarta.
- Designed and deployed a digital platform for a rural village in Karawang, West Java
- Conducted a one-week on-site digital literacy training for 20–50 local residents
- Platform was eventually discontinued, not because of technical failure but because of low adoption
This experience raised questions I haven't stopped thinking about: Why does a working platform, with trained users, still get abandoned? Is it the interface? The language? The relevance to their daily lives? Or is it something about how we define "digital readiness" that's fundamentally wrong?
Science exhibition: "UNIVERSUM · MENSCH · INTELLIGENZ" at Perpustakaan Nasional RI. Assisted visitors exploring interactive installations on AI, the universe, and human intelligence.
AI / ML & NLP
PyTorch · HuggingFace Transformers · SentencePiece · LoRA / PEFT · Flash Attention 2
GGUF · Ollama · llama.cpp · OpenCV · ArcFace / AuraFace
Claude API · MCP (Model Context Protocol)
Microsoft Azure AI · Google Cloud Vertex AI · Amazon Bedrock
Systems & Backend
Python · Go · Rust · PHP · Node.js / Bun
FastAPI · Gin · Echo · Laravel · Express · Hono
Frontend
React · Next.js · Vue · Nuxt.js · TypeScript · Tailwind CSS · Vanilla JS
Databases
PostgreSQL · MySQL · MongoDB · Redis · SQLite
Human Languages
| Language | Level |
|---|---|
| 🇮🇩 Bahasa Indonesia | Native |
| 🇬🇧 English | Professional working proficiency |
| 🇩🇪 Deutsch | B2; studied 3+ years, volunteered at Goethe-Institut Jakarta |
Anthropic: 10 certificates
- Claude 101 · Building with the Claude API · Claude Code in Action
- Introduction to Model Context Protocol · MCP: Advanced Topics
- AI Fluency: Framework & Foundations · Teaching AI Fluency · AI Fluency for Educators
- Claude with Google Cloud's Vertex AI · Claude in Amazon Bedrock
Microsoft: 5 certificates
- Foundations of AI and Machine Learning
- AI and Machine Learning Algorithms and Techniques
- Microsoft Azure for AI and Machine Learning
- Advanced AI and Machine Learning Techniques and Capstone
- Building Intelligent Troubleshooting Agents · Full-Stack Developer Capstone
IBM: 3 certificates
- Machine Learning with Python
- Python for Data Science, AI & Development
- Full Stack Software Developer Assessment
Google Cloud: 3 certificates
- Google Cloud Fundamentals: Core Infrastructure
- Developing a REST API with Go and Cloud Run
- Process Documents with Python Using the Document AI API
Amazon · Duke · Meta · others
- Amazon: Generative AI in Software Development · Full Stack Web Development
- Duke University: Rust Fundamentals
- Meta: Programming with JavaScript · Version Control · Introduction to Front-End Development
- Started hacking systems at 13; now I build them with security in mind from day one
- Studying German language education while building an Indonesian LLM (yes, both at the same time)
- Built a 13B-token corpus pipeline on a laptop that couldn't finish the training run
- Got a whole village to use a digital platform in one week, then watched it die in one year
- Believes the most interesting problems in AI are not technical; they're linguistic and social
- Built a face recognition system from scratch because "just use a library" felt like cheating
- Powered by questions that don't have Stack Overflow answers
[2014] ──── Age 13. First contact with the internet's underbelly.
│ Explored web vulnerabilities, network weaknesses, defacing.
│ Not malice: pure curiosity about how systems work.
│ → Gained something no course teaches:
│   an instinct for where systems fail,
│   and why security must be designed in, not bolted on.
│
[2018] ──── Channeled that energy into building, not breaking.
│ Joined an IT community. Co-founded ByteDevCode.
│ Started developing real products for real users.
│
[2022] ──── Enrolled in German Language Education @ UNJ.
│ Studied linguistics, pedagogy, cross-cultural communication.
│ → Language became a new lens: how humans and machines
│   communicate, and why they so often fail to.
│
[2024] ──── Volunteered at Goethe-Institut Jakarta (UNIVERSUM · MENSCH · INTELLIGENZ).
│
│ Led digital transformation at Desa Medalsari, Karawang.
│ Built the platform. Trained 20–50 locals in one week.
│ Platform abandoned within a year.
│ → Left with more questions than answers.
│   That discomfort became a research direction.
│
[2025] ──── Started building Aibys, an Indonesian LLM from scratch.
│ Trained BPE tokenizer (32K vocab, 10M sentences).
│ Built ~13B-token corpus pipeline.
│ First training run: 20K steps → coherent Indonesian text
│ → More questions: why is Bahasa Indonesia so underrepresented
│   in global NLP? What would it take to change that?
│
[2026] ──── Open-sourced the full Aibys ecosystem.
│ Built Aibys2: next-gen LLM starter with tool calling & vision.
│ Shipped Aibys AI tools suite:
│   research summarizer · medical explainer ·
│   legal analyzer · invoice extractor
│ Built ArLface Recognition: open-source face recognition
│   from scratch with ArcFace embeddings.
│ Certifications: Anthropic · Microsoft · IBM · Google · Amazon
│
[NEXT] ──── The questions accumulated.
  Solo projects and self-study can only go so far.
  The next step is finding the right research environment
  to investigate them properly, and the right people
  to investigate them with. 🇮🇩

