Project Dolly Shield

This is Dolly, bae's $1,500 sheep furniture fixture

Overview

Dolly is the name of a sheep furniture fixture we have in our home. Mao-Lin bought her for $1,500 before he met me and named her Dolly, after the famous cloned sheep from the '90s.

When he first explained this to me, I remember thinking to myself "well OK, that's def weird, but sure." Dolly has since grown into one of my favorite things in our home and I would do anything to protect her. She's also famous amongst my fitness friends because she had to be moved from the bedroom to the living room to make space for my Peloton when I moved in.

Project Dolly Shield is named in her honor.

Cool story, bro, but what's going on here?

I'm glad you asked! I set out in March to build something with a large language model, but I didn't want to just feed data into a closed-box API and get a response back. I took the longer, harder road of learning the underlying math and design principles that make these models work. I'm now in a place where I'm ready to start working with an open-source model directly, which will let me change some of the underlying parameters and build a neural network of my own on top of it.

I have some broad guiding goals:

| Goal | Description |
| --- | --- |
| Gather structured and unstructured data about electronic music from various sources | Use APIs, web scraping, and other methods to collect text, image, and audio data on electronic music. |
| Preprocess data using modern techniques | Handle missing data, normalization, tokenization, and augmentation. |
| Fine-tune modality-specific open-source large language models | Each model gets additional training on the data it's designed to understand. |
| Create a user-facing output | Develop a non-commercial product where users can type text inputs describing music and get recommendations or audio outputs in return. |
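To make the preprocessing goal concrete, here's a minimal sketch of what the text side might look like. The records, field names, and the GPT-2 tokenizer stand-in are all placeholders for illustration, not my actual data or model:

```python
from transformers import AutoTokenizer

# Placeholder records standing in for scraped data; real inputs would
# come from APIs and web scraping.
records = [
    {"artist": "Four Tet", "description": "Warm, melodic house with chopped vocal samples"},
    {"artist": None, "description": "  Dark   rolling techno,   140 BPM "},
]

# Handle missing fields and normalize whitespace and casing.
cleaned = []
for r in records:
    artist = r["artist"] or "unknown artist"
    description = " ".join(r["description"].split()).lower()
    cleaned.append(f"{artist}: {description}")

# Tokenize into the integer IDs a language model actually sees.
# GPT-2's tokenizer is a stand-in; any model's tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
batch = tokenizer(cleaned, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (num_records, longest_sequence)
```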

These are some of the models I've been looking at:

| Use Case | LLM | Description |
| --- | --- | --- |
| Text and images | Meta's Llama-3.2-11B-Vision | Multimodal model that supports text and images. |
| Audio | Magenta's Music Transformer | Generates multiple sounds concurrently that sound like a song. |
| Audio | MusicGen from Meta | Generates sound based on users' text input. |
| Audio | MuseGAN from Academia Sinica | Generates outputs like MIDI clips for music creation. |
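Of these, MusicGen is the easiest to poke at because it ships with the Hugging Face `transformers` library. Here's a minimal sketch based on the library's documented usage; the prompt and output filename are just examples:

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Turn a text description into model inputs.
inputs = processor(
    text=["deep house groove with a warm bassline"],
    padding=True,
    return_tensors="pt",
)

# ~256 new tokens is roughly five seconds of audio for this model.
audio_values = model.generate(**inputs, max_new_tokens=256)

# Save the generated waveform at the model's native sample rate.
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("sample.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```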

Something that's changed since I first set out is that, thanks to my electronic music obsession, I already have a lot of raw music-making files (MIDI clips, samples, WAVs) that I purchased for my music-making side hobby. This means I can take these files, turn them into something that can be used to train one of the audio models, and try to get it to produce sounds based on what I give it.
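A first step would be standardizing those files so a model can consume them. A minimal sketch, assuming a folder of WAV samples (the folder name is hypothetical, and the 32 kHz target is an assumption borrowed from MusicGen's audio codec):

```python
from pathlib import Path

import librosa
import numpy as np

SAMPLE_DIR = Path("samples")  # hypothetical folder of purchased WAV samples
TARGET_SR = 32000             # assumption: matches MusicGen's 32 kHz codec

clips = []
for wav_path in sorted(SAMPLE_DIR.glob("*.wav")):
    # Load as mono and resample so every clip shares one sample rate.
    audio, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    # Peak-normalize so loudness differences don't dominate training.
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    clips.append(audio)

print(f"loaded {len(clips)} clips at {TARGET_SR} Hz")
```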

To do this, I would need to separately train a text model on text data and an audio model on audio data, and then stitch them together. Technically, MusicGen from Meta already does a version of this, but I'm in it for the learning, so I want to try to do it myself. Multimodal models are becoming more and more in vogue but are still challenging. For instance, have you ever noticed that ChatGPT's text chat and image generation tool are two separate things? The difficulty of multimodal models is why: it's quite hard to train a single model that can understand and generate outputs across different types of inputs.
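To make the stitching idea concrete, here's a deliberately tiny PyTorch sketch. Every dimension and name here is invented for illustration; the real version of this in MusicGen pairs a frozen T5 text encoder with an audio-token decoder, connected through cross-attention:

```python
import torch
import torch.nn as nn

class TextConditionedAudioDecoder(nn.Module):
    """Toy stitch: a text model's embeddings condition an autoregressive
    decoder over discrete audio tokens via cross-attention."""

    def __init__(self, text_dim=512, audio_vocab=1024, hidden=512):
        super().__init__()
        self.audio_embed = nn.Embedding(audio_vocab, hidden)
        self.project_text = nn.Linear(text_dim, hidden)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_vocab = nn.Linear(hidden, audio_vocab)

    def forward(self, text_embeddings, audio_tokens):
        memory = self.project_text(text_embeddings)   # (batch, text_len, hidden)
        tgt = self.audio_embed(audio_tokens)          # (batch, audio_len, hidden)
        # Causal mask keeps each audio position from peeking ahead;
        # cross-attention to `memory` is where text meets audio.
        mask = nn.Transformer.generate_square_subsequent_mask(audio_tokens.size(1))
        out = self.decoder(tgt=tgt, memory=memory, tgt_mask=mask)
        return self.to_vocab(out)                     # next-audio-token logits

# Fake inputs, just to show the shapes line up.
text_emb = torch.randn(1, 12, 512)            # e.g. from a frozen text encoder
audio_tok = torch.randint(0, 1024, (1, 50))   # e.g. EnCodec codebook indices
logits = TextConditionedAudioDecoder()(text_emb, audio_tok)
print(logits.shape)  # torch.Size([1, 50, 1024])
```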

If your reaction to that was this:

Then I made a drawing to follow along:

Chapter 1

Part 1: The Road to Tokenization

- Project Dolly Shield, Chapter 1, Part 1: The Road to Tokenization (technical overview of chapter 1, part 1)
- Embedded in a High Dimensional Space (thoughts on chapter 1, part 1)