A Production-Grade AI Music Creation Environment
Bridging the gap between generative AI and professional music production
The Neural Audio Workstation (NAW) is a next-generation music production tool that integrates state-of-the-art AI generation with professional DAW workflows. Unlike consumer "text-to-music" tools that output monolithic stereo files, NAW provides stem-aware generation, surgical editability, and multi-modal conditioning for professional creators.
Current AI music tools (Suno, Udio, Stable Audio) excel at generative consumption - creating complete songs from text prompts. NAW pioneers generative production, offering:
- Stem-Level Generation: Isolated tracks (drums, bass, vocals, other) that are phase-aligned and mix-ready
- Surgical Inpainting: Regenerate specific regions of audio using bidirectional diffusion
- Multi-Modal Control: Text prompts, MIDI data, audio references, and rhythmic masks
- Real-Time Workflow: Hybrid AR/Diffusion architecture for fast preview and high-quality rendering
- Professional Mixing: Full mixer view with per-stem volume, solo, and mute controls
4-stem arrangement (Drums, Bass, Vocals, Other) with Semantic Planner, Inpainting Tools, Control Adapters, and CLAP Audio Reference in the left sidebar.
Full mixer with per-stem volume faders, solo/mute, and real-time level monitoring.
Piano Roll visualization for melodic and harmonic editing of individual stems.
NAW implements a Hybrid "Compose-then-Render" Architecture inspired by cutting-edge research:
STAGE 1: Semantic Planner (Autoregressive Transformer)
    Input:  Text Prompt + BPM + Control Signals
    Output: Coarse Musical Skeleton (Structure, Rhythm, Pitch)
    Speed:  Fast (~2 seconds for 32 bars)
        ↓
STAGE 2: Acoustic Renderer (Flow Matching / Diffusion)
    Input:  Semantic Skeleton + Text Prompt
    Output: High-Fidelity Audio with Realistic Timbre
    Speed:  Slower (~10 seconds for 32 bars)
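The compose-then-render split can be sketched as two functions chained together. Every name below (`SemanticSkeleton`, `planStub`, `renderStub`, `composeThenRender`, `FRAMES_PER_TOKEN`) is hypothetical and not NAW's actual API; the bodies return placeholder data only, in the spirit of the project's stub implementations.

```typescript
// Hypothetical types for illustration: a coarse plan per stem, then audio per stem.
type StemName = "DRUMS" | "BASS" | "VOCALS" | "OTHER";

interface PlanRequest {
  text: string;
  bpm: number;
  bars: number;
}

interface SemanticSkeleton {
  bpm: number;
  bars: number;
  // One token stream per stem; token values are placeholders here.
  tokens: Record<StemName, number[]>;
}

interface RenderedStem {
  stem: StemName;
  samples: Float32Array;
}

const STEM_NAMES: StemName[] = ["DRUMS", "BASS", "VOCALS", "OTHER"];
const FRAMES_PER_TOKEN = 512; // arbitrary value chosen for this sketch

// Stage 1 stub: emits one placeholder token per bar for every stem.
function planStub(req: PlanRequest): SemanticSkeleton {
  const perStem = (): number[] => Array.from({ length: req.bars }, (_, i) => i);
  return {
    bpm: req.bpm,
    bars: req.bars,
    tokens: { DRUMS: perStem(), BASS: perStem(), VOCALS: perStem(), OTHER: perStem() },
  };
}

// Stage 2 stub: allocates a silent buffer sized from each stem's token stream.
function renderStub(skeleton: SemanticSkeleton): RenderedStem[] {
  return STEM_NAMES.map((stem) => ({
    stem,
    samples: new Float32Array(skeleton.tokens[stem].length * FRAMES_PER_TOKEN),
  }));
}

// The fast planner runs first; the slower renderer consumes its output.
function composeThenRender(req: PlanRequest): RenderedStem[] {
  return renderStub(planStub(req));
}
```

Keeping the skeleton as an explicit intermediate value is what would let a UI preview or edit the plan before paying for the slower render.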
Implemented:
UI/UX Layer:
- Semantic Planner using Gemini 2.5 Flash (simulates AR Transformer)
- Stem-aware track architecture (4-stem separation: DRUMS, BASS, VOCALS, OTHER)
- Spectrogram Editor with inpainting UI
- Piano Roll view for MIDI-style visualization
- Prompt Timeline for temporal style control
- Professional mixer with per-stem controls
- Project save/load (JSON format)
- Real-time playback simulation
Neural Engine (Architecture Complete, 67 tests):
- Phase 2 Components:
  - DAC Codec (audio compression/decompression with RVQ)
  - Semantic Planner (autoregressive structure generation)
  - Acoustic Renderer (flow matching for high-fidelity audio)
  - Vocoder (Vocos/DisCoder/HiFiGAN support)
- Phase 3 Advanced Features:
  - ControlNet adapters (melody, rhythm, dynamics, timbre, harmony)
  - CLAP audio-text conditioning (reference-based generation)
  - Spectrogram inpainting (surgical editing with discrete diffusion)
  - Multi-track export (WAV stems, Ableton/Logic/Pro Tools)
- Phase 4 Production Features:
  - ASIO audio backend architecture (low-latency I/O)
  - TensorRT optimization config (<100ms latency target)
  - VST/AU plugin architecture (JUCE framework)
  - Commercial licensing structure (dual-license model)
Testing & Documentation:
- Comprehensive test suite (67 tests, 100% passing)
- Interactive demo workflow
- Complete API documentation
- Architecture documentation
- Performance benchmarks
In Development (See ROADMAP.md):
The neural engine architecture is complete and tested, with stub implementations that demonstrate the full pipeline. Next steps involve:
- Phase 2: Real neural model integration
  - Integrate actual DAC/EnCodec audio codec models
  - Train/integrate Transformer-XL semantic planner
  - Train/integrate Flow Matching acoustic renderer
  - Model optimization (ONNX export, INT8 quantization)
- Phase 3: Advanced feature implementation
  - Implement ControlNet with real neural networks
  - Integrate LAION CLAP model for audio conditioning
  - Implement discrete diffusion for inpainting
  - Complete multi-format export functionality
- Phase 4: Production deployment
  - Build VST/AU plugin with JUCE framework
  - Implement ASIO backend for real-time audio
  - TensorRT optimization for <100ms latency
  - Commercial licensing platform launch
Generate music as 4 independent, synchronized stems instead of a single stereo file:
- Drums: Kick, snare, hi-hats, percussion
- Bass: Sub-bass, bass guitar, bass synth
- Vocals: Lead vocals, harmonies, vocal effects
- Other: Melody, chords, atmosphere, FX
Each stem can be:
- Solo'd or muted independently
- Adjusted in volume
- Edited or regenerated separately
- Exported as individual WAV files
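How solo, mute, and volume combine into a per-stem gain can be sketched as below. The `StemChannel` shape and the `effectiveGains` function are hypothetical, and the rule used (any solo silences non-soloed stems, and mute wins over solo) is a common DAW convention assumed for this sketch rather than confirmed NAW behavior.

```typescript
// Hypothetical mixer channel state for one stem.
interface StemChannel {
  id: string;
  volume: number; // linear gain, expected in the range 0..1
  muted: boolean;
  soloed: boolean;
}

// Assumed rule: if any channel is soloed, only soloed channels are audible;
// a muted channel is always silent, even when it is also soloed.
function effectiveGains(channels: StemChannel[]): Record<string, number> {
  const anySolo = channels.some((c) => c.soloed);
  const gains: Record<string, number> = {};
  for (const c of channels) {
    const audible = !c.muted && (!anySolo || c.soloed);
    // Clamp the fader value so out-of-range input cannot boost the signal.
    gains[c.id] = audible ? Math.min(1, Math.max(0, c.volume)) : 0;
  }
  return gains;
}
```

In a browser build, each resulting value could be written to a Web Audio `GainNode`'s `gain.value`.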
Paint directly on the spectrogram to mask regions for regeneration:
- Brush Tool: Mask specific frequency ranges or time regions
- Bidirectional Context: AI sees both past and future context for seamless edits
- Latent Visualization: Toggle view to see the model's internal representation
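A brushed region can be turned into a binary mask over the spectrogram grid before it is handed to a model. The sketch below assumes a linear frequency axis and uses hypothetical names (`RegionMask`, `SpectrogramGrid`, `rasterizeMask`); NAW's real grid layout may differ.

```typescript
// Time/frequency rectangle selected by the brush (seconds and Hz).
interface RegionMask {
  startTime: number;
  endTime: number;
  freqMin: number;
  freqMax: number;
}

// Hypothetical description of the spectrogram grid.
interface SpectrogramGrid {
  frames: number;          // number of time frames
  bins: number;            // number of frequency bins (linear spacing assumed)
  secondsPerFrame: number; // hop duration
  maxFreqHz: number;       // frequency just above the top bin (e.g. Nyquist)
}

// Returns a frames x bins mask in row-major order: 1 = regenerate, 0 = keep.
function rasterizeMask(region: RegionMask, grid: SpectrogramGrid): Uint8Array {
  const mask = new Uint8Array(grid.frames * grid.bins);
  const hzPerBin = grid.maxFreqHz / grid.bins;
  for (let f = 0; f < grid.frames; f++) {
    const t = f * grid.secondsPerFrame;
    if (t < region.startTime || t >= region.endTime) continue;
    for (let b = 0; b < grid.bins; b++) {
      const hz = b * hzPerBin;
      if (hz >= region.freqMin && hz <= region.freqMax) {
        mask[f * grid.bins + b] = 1;
      }
    }
  }
  return mask;
}
```

Cells left at 0 are the "past and future context" the model can condition on; only cells set to 1 would be regenerated.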
Musical prompts that change over time:
Bar 1-8: "Lo-fi vinyl crackle, mellow jazz piano"
Bar 9-16: "Build tension, add strings"
Bar 17-24: "Drop, aggressive dubstep bass"
Bar 25-32: "Outro, fade to ambient"
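One way to model such a timeline is a list of keyframes, each holding the bar at which its prompt takes effect. The `PromptKeyframe` type and `promptAtBar` function below are hypothetical illustrations, not part of NAW's documented API.

```typescript
// Hypothetical keyframe: the prompt applies from startBar until the next keyframe.
interface PromptKeyframe {
  startBar: number; // 1-based bar number
  prompt: string;
}

// Returns the prompt active at the given bar, or undefined before the first keyframe.
function promptAtBar(keyframes: PromptKeyframe[], bar: number): string | undefined {
  const sorted = [...keyframes].sort((a, b) => a.startBar - b.startBar);
  let active: string | undefined;
  for (const kf of sorted) {
    if (kf.startBar > bar) break;
    active = kf.prompt;
  }
  return active;
}

// The four-section example above, expressed as keyframes.
const exampleTimeline: PromptKeyframe[] = [
  { startBar: 1, prompt: "Lo-fi vinyl crackle, mellow jazz piano" },
  { startBar: 9, prompt: "Build tension, add strings" },
  { startBar: 17, prompt: "Drop, aggressive dubstep bass" },
  { startBar: 25, prompt: "Outro, fade to ambient" },
];
```

With this layout, `promptAtBar(exampleTimeline, 10)` resolves to the "Build tension, add strings" section.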
- Spectrogram View: Frequency-over-time representation for audio editing
- Piano Roll View: MIDI-style note grid for melodic/harmonic editing
- Node.js (v20 or later)
- pnpm (v9 or later): install via npm install -g pnpm or see pnpm.io
- Gemini API Key (for AI generation)

- Clone the repository
  git clone https://github.com/GizzZmo/NAW.git
  cd NAW
- Install dependencies
  pnpm install
- Set up environment: create a .env.local file in the root directory:
  VITE_GEMINI_API_KEY=your_gemini_api_key_here
  Get your API key from: https://ai.google.dev/
- Run the development server
  pnpm run dev
- Open in browser: navigate to http://localhost:5173
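Vite exposes variables prefixed with `VITE_` to client code through `import.meta.env`. A small guard like the one below can fail early when the key is missing or still set to the placeholder; the function name `requireGeminiKey` is hypothetical, and it takes a plain object so it can run outside Vite.

```typescript
// In a Vite app the argument would be `import.meta.env`; a plain record is
// accepted here so the check is easy to exercise in isolation.
function requireGeminiKey(env: Record<string, string | undefined>): string {
  const raw = env["VITE_GEMINI_API_KEY"];
  const key = raw === undefined ? "" : raw.trim();
  if (key === "" || key === "your_gemini_api_key_here") {
    throw new Error("VITE_GEMINI_API_KEY is not set; add it to .env.local");
  }
  return key;
}
```

Because `VITE_`-prefixed values are bundled into client code, a key supplied this way is visible to anyone who loads the app, which fits local prototyping better than public deployment.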
pnpm run build
pnpm run preview

# Run neural engine tests (67 tests)
pnpm test
# Run Phase 1-4 integration tests (45 tests)
pnpm run test:phase1-4
# Run integration tests
pnpm run test:integration
# Run interactive demo
pnpm run demo

The neural engine is a complete implementation with stub neural models, demonstrating the full pipeline from text prompt to multi-stem audio generation. See neural-engine/README.md for detailed API documentation.
The Neural Engine is the core AI/ML pipeline that powers NAW's generative capabilities. It implements a two-stage hybrid architecture inspired by state-of-the-art research.
INPUT: Text Prompt + BPM + Control Signals
        ↓
STAGE 1: Semantic Planner (Autoregressive Transformer)
    • Generates coarse musical structure
    • Multi-stream prediction (4 stems)
    • Fast generation (~2 seconds for 32 bars)
    Output: Semantic Tokens (structure, rhythm, pitch)
        ↓
STAGE 2: Acoustic Renderer (Flow Matching / Diffusion)
    • Paints high-fidelity audio onto skeleton
    • Text-conditioned generation (CLAP embeddings)
    • Slower but higher quality (~10 seconds for 32 bars)
    Output: Acoustic Tokens (timbre, texture)
        ↓
STAGE 3: Vocoder (Vocos / DisCoder / HiFiGAN)
    • Converts latent tokens to audio waveforms
    • Multiple quality presets (fast/balanced/high)
    • Real-time capable (25x on RTX 3090)
    Output: High-Fidelity Audio (44.1kHz stereo)
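The stage timings can be put in context with a little arithmetic. The sketch below assumes 4/4 time and treats the quoted ~2 s and ~10 s figures as targets rather than measurements; the vocoder stage is left out because no per-clip time is quoted for it. The function names are hypothetical.

```typescript
// Length of a clip in seconds, given bars and tempo (4 beats per bar assumed).
function durationSeconds(bars: number, bpm: number, beatsPerBar = 4): number {
  return (bars * beatsPerBar * 60) / bpm;
}

// How many seconds of audio are produced per second of compute.
function realTimeFactor(audioSeconds: number, computeSeconds: number): number {
  return audioSeconds / computeSeconds;
}

// 32 bars at 128 BPM: 32 * 4 * 60 / 128 = 60 seconds of audio.
const clipSeconds = durationSeconds(32, 128);

// Planner (~2 s) plus renderer (~10 s) gives 60 / 12 = 5x real time.
const plannerPlusRenderer = realTimeFactor(clipSeconds, 2 + 10);
```

The same 32 bars last longer at slower tempos, so the factor depends on BPM if generation time scales with bar count rather than with audio length.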
import { generateMusic } from './neural-engine';

// Simple generation - automatically handles all stages
const stems = await generateMusic({
  text: "Uplifting house track with energetic drums",
  bpm: 128,
  bars: 32,
  quality: 'balanced', // 'fast' | 'balanced' | 'high'
});

// Result: 4 stems (DRUMS, BASS, VOCALS, OTHER)
console.log(`Generated ${stems.length} stems`);

ControlNet - Fine-grained control over generation:
import { ControlNet, ControlType, generateMusic } from './neural-engine';

const controlNet = new ControlNet();
await controlNet.initialize();

// Extract melody from reference audio
const melodySignal = await controlNet.extractControlSignal(
  referenceAudio,
  ControlType.MELODY
);

// Generate with melody control
const stems = await generateMusic({
  text: "Synthwave track",
  controlSignals: [melodySignal],
  controlStrength: 0.8
});

CLAP - Reference-based generation:
import { CLAP, generateMusic } from './neural-engine';

const clap = new CLAP();
await clap.initialize();

// Encode reference audio
const audioEmbed = await clap.encodeAudio(referenceAudio);

// Generate with audio reference
const stems = await generateMusic({
  text: "Similar vibe but faster",
  audioReference: audioEmbed,
  audioReferenceWeight: 0.6 // 60% audio, 40% text
});

Inpainting - Surgical audio editing:
import { SpectrogramInpainter } from './neural-engine';

const inpainter = new SpectrogramInpainter();
await inpainter.initialize();

// Regenerate specific region (e.g., remove snare)
const mask = {
  startTime: 2.0,
  endTime: 2.5,
  freqMin: 200,
  freqMax: 8000
};
const result = await inpainter.inpaint(audio, mask);

| Component | Status | Description |
|---|---|---|
| DAC Codec | Architecture | Audio compression with RVQ |
| Semantic Planner | Architecture | AR Transformer for structure |
| Acoustic Renderer | Architecture | Flow Matching for quality |
| Vocoder | Architecture | Latent-to-audio conversion |
| ControlNet | Architecture | Fine-grained control signals |
| CLAP | Architecture | Audio-text conditioning |
| Inpainting | Architecture | Surgical editing |
All components have complete TypeScript interfaces, stub implementations, and working tests (67 tests passing). See neural-engine/README.md for full API documentation.
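The `audioReferenceWeight: 0.6` comment in the CLAP example ("60% audio, 40% text") suggests a weighted mix of two embeddings. How NAW actually combines them is not stated; the sketch below assumes a simple linear interpolation followed by L2 normalization, with a hypothetical function name.

```typescript
// Assumed blending rule: w * audio + (1 - w) * text, then scale to unit length.
function blendEmbeddings(audio: number[], text: number[], audioWeight: number): number[] {
  if (audio.length !== text.length) {
    throw new Error("embedding dimensions differ");
  }
  const w = Math.min(1, Math.max(0, audioWeight)); // clamp weight to 0..1
  const mixed = audio.map((a, i) => w * a + (1 - w) * (text[i] ?? 0));
  const norm = Math.sqrt(mixed.reduce((sum, v) => sum + v * v, 0));
  // Leave an all-zero vector untouched to avoid dividing by zero.
  return norm === 0 ? mixed : mixed.map((v) => v / norm);
}
```

Normalizing after the mix keeps the conditioning vector on the unit sphere, which matches how contrastive embeddings such as CLAP's are typically compared (cosine similarity).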
- Enter a Prompt
  - In the left sidebar, type a musical prompt (e.g., "Uplifting house track, 128 BPM, energetic")
  - Click "Generate Skeleton"
- Review Generated Stems
  - The timeline shows 4 tracks with generated patterns
  - Each track has clips containing musical events
- Adjust in Mixer
  - Switch to Mixer view (top-right toggle)
  - Adjust volume, solo/mute individual stems
  - Monitor levels in real-time
- Edit with Spectrogram
  - Switch back to Timeline view
  - Change visualization to Spectrogram mode
  - Use the Brush Tool to mask regions for inpainting (future feature)
- Add Prompt Keyframes
  - Click anywhere on the timeline to add a style change
  - Enter a new prompt for that section
- Save/Load Projects
  - Click Download icon to save project as JSON
  - Click Upload icon to load a saved project
  - Browse Presets for example projects
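Saving and loading a project as JSON amounts to a serialize/parse round trip with some validation on the way back in. The schema below (`ProjectFile`, `ProjectTrack`) is a hypothetical minimal shape for illustration; NAW's real project format is not documented here and is likely richer.

```typescript
// Hypothetical minimal project schema.
interface ProjectTrack {
  stem: string;
  volume: number;
  muted: boolean;
}

interface ProjectFile {
  version: number;
  bpm: number;
  tracks: ProjectTrack[];
}

// Pretty-printed JSON keeps saved projects readable and diff-friendly.
function serializeProject(project: ProjectFile): string {
  return JSON.stringify(project, null, 2);
}

// Parses untrusted JSON and checks the top-level fields before returning it.
function parseProject(json: string): ProjectFile {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("project must be a JSON object");
  }
  const candidate = data as Partial<ProjectFile>;
  if (
    typeof candidate.version !== "number" ||
    typeof candidate.bpm !== "number" ||
    !Array.isArray(candidate.tracks)
  ) {
    throw new Error("project is missing version, bpm, or tracks");
  }
  return { version: candidate.version, bpm: candidate.bpm, tracks: candidate.tracks };
}
```

A `version` field like the one assumed here makes it possible to migrate older files if the format changes.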
- React 19 - UI framework with hooks and modern patterns
- TypeScript - Type-safe development
- Vite - Fast build tool and dev server
- Tailwind CSS - Utility-first styling (via inline classes)
- Lucide React - Icon library
- Gemini 2.5 Flash - Semantic planner (simulates AR Transformer)
- Planned: DAC/EnCodec for neural audio codec
- Planned: Flow Matching for acoustic rendering
- Planned: ControlNet for fine-grained control
- Web Audio API - Browser-native audio synthesis (current simulation)
- Planned: TensorRT/ONNX for optimized inference
- Planned: JUCE framework for VST/AU plugin
NAW's architecture is based on state-of-the-art research:
- MusicGen (Meta): Multi-stream transformer for music generation
- Fish Speech 4: DualAR architecture for better prosody modeling
- Transformer-XL: Segment-level recurrence for long-context coherence
- Stable Audio Open: Latent diffusion conditioned on text
- Flow Matching: Faster convergence than standard diffusion
- AudioLDM: Diffusion in compressed latent space
- DAC (Descript): 44.1kHz stereo, superior fidelity
- EnCodec (Meta): Residual Vector Quantization (RVQ)
- AudioDec: Ultra-low latency streaming codec
- ControlNet: Zero-initialized adapters for conditioned generation
- MuseControlLite: Parameter-efficient fine-tuning (PEFT)
- CLAP: Contrastive Language-Audio Pretraining for audio reference
- Discrete Diffusion: Bidirectional inpainting for surgical edits
See ARCHITECTURE.md for detailed technical documentation.
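Residual Vector Quantization, the scheme used by the codecs listed above, quantizes a vector in stages: each stage picks the nearest codebook entry to whatever residual the previous stages left behind. The sketch below uses tiny hand-written codebooks and hypothetical function names; real codecs learn their codebooks and operate on encoder latents.

```typescript
type Codebook = number[][]; // entries x dimensions

// Index of the codebook entry closest to vec (squared Euclidean distance).
function nearestIndex(vec: number[], codebook: Codebook): number {
  let best = 0;
  let bestDist = Infinity;
  for (let idx = 0; idx < codebook.length; idx++) {
    const entry = codebook[idx] ?? [];
    let dist = 0;
    for (let i = 0; i < vec.length; i++) {
      const diff = (vec[i] ?? 0) - (entry[i] ?? 0);
      dist += diff * diff;
    }
    if (dist < bestDist) {
      bestDist = dist;
      best = idx;
    }
  }
  return best;
}

// Encode: one code per stage, each quantizing the residual of the previous stage.
function rvqEncode(vec: number[], codebooks: Codebook[]): number[] {
  let residual = vec.slice();
  const codes: number[] = [];
  for (const codebook of codebooks) {
    const idx = nearestIndex(residual, codebook);
    const chosen = codebook[idx] ?? [];
    codes.push(idx);
    residual = residual.map((r, i) => r - (chosen[i] ?? 0));
  }
  return codes;
}

// Decode: sum the selected entry from every stage.
function rvqDecode(codes: number[], codebooks: Codebook[], dim: number): number[] {
  const out: number[] = new Array<number>(dim).fill(0);
  codes.forEach((idx, stage) => {
    const entry = (codebooks[stage] ?? [])[idx] ?? [];
    for (let i = 0; i < dim; i++) {
      out[i] = (out[i] ?? 0) + (entry[i] ?? 0);
    }
  });
  return out;
}
```

Each added stage refines the approximation, which is why dropping later codebooks trades fidelity for bitrate.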
- Core UI and timeline
- Stem-aware architecture
- Semantic planner integration
- Basic playback simulation
- Project save/load
- Architecture design and stubs
- DAC codec architecture with working tests
- Semantic planner architecture with working tests
- Acoustic renderer architecture with working tests
- End-to-end pipeline demonstration
- Integrate DAC audio codec (real neural model)
- Implement Flow Matching renderer (real neural model)
- Real audio generation (not simulation)
- Architecture design and stubs
- ControlNet adapters architecture
- CLAP audio conditioning architecture
- Spectrogram inpainting architecture
- Multi-track export architecture
- ControlNet adapters (actual implementation)
- CLAP-based audio conditioning (actual implementation)
- Spectrogram inpainting (actual implementation)
- Outpainting for loop generation (actual implementation)
- Multi-track export (actual implementation)
- Architecture design and documentation
- VST/AU plugin architecture (JUCE)
- ASIO audio backend design
- TensorRT optimization configuration
- Commercial licensing structure
- VST/AU plugin implementation
- ASIO audio backend implementation
- TensorRT optimization
- Real-time inference (<100ms latency)
- Commercial licensing launch
See ROADMAP.md for complete details.
Contributions are welcome! This project is in active development.
Areas of Interest:
- Neural audio codec integration
- Flow matching implementation
- ControlNet adapters for music
- VST/AU plugin development
- Documentation and examples
Please read CONTRIBUTING.md before submitting PRs.
This project is available under a dual-license model:
Free for personal, educational, and non-commercial use. See LICENSE file for details.
Required for commercial deployment, monetization, and production use. See COMMERCIAL_LICENSE.md for details.
Tiers:
- Individual Producer: $99/year (independent artists)
- Professional Studio: $499/year (up to 5 users)
- Enterprise: $2,999/year + custom (unlimited usage)
Note on AI Model Licensing:
- The current Gemini integration is for research/prototyping
- Future neural models will use permissive licenses (Apache 2.0, MIT)
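- Models whose pretrained weights are released under non-commercial terms, such as MusicGen (CC-BY-NC), require a separate commercial license from the rights holder (Meta, in MusicGen's case)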
- NAW is committed to "Fairly Trained" certification for ethical AI
For commercial licensing inquiries: commercial@naw-audio.com
This project is inspired by cutting-edge research from:
- Meta AI (MusicGen, EnCodec, Flow Matching)
- Stability AI (Stable Audio)
- Descript (DAC Codec)
- Google (Gemini)
Special thanks to the open-source audio ML community.
- Repository: https://github.com/GizzZmo/NAW
- Issues: https://github.com/GizzZmo/NAW/issues
- AI Studio: https://ai.studio/apps/drive/1-xaWTqoGbGwitJJFN2c2WEeIpX1XDH2x
- Documentation: ARCHITECTURE.md | ROADMAP.md
Built with ❤️ for the future of music production



