Embedded in a high dimensional space

Thoughts on chapter 1, part 1 of Project Dolly Shield
Project Dolly Shield, Chapter 1, Part 1: The Road to Tokenization
Technical overview of chapter 1, part 1 of Project Dolly Shield
GitHub - uwsthoughts/dolly_shield: Official repository of Project Dolly Shield

Math talk

Tokenization. That's a new term for me. It's the process of taking text data, in whatever form you have it, and breaking it into smaller chunks called tokens. The model you use determines what those tokens look like.

Your mileage will vary across models, but tokenization can generally be broken into two steps:

  1. Take pre-processed data and convert it into text sentences
  2. Break those sentences into smaller chunks of words called tokens

I’m using the Transformers library from Hugging Face to access and build with large language models. After jumping through a few hoops, I managed to get approval from Meta to access all the Llama models. These models are open source, which lets me access and change some of the underlying parameters (like weights and biases). Closed models, like Google Gemini and ChatGPT, don't allow you to make any changes, so you're stuck with what you're given and the model is just a big black box. Additionally, open-source models are free, whereas the closed models charge for API access, often by the token. When you're a small hobbyist like me, it's far cheaper to go this route and pay for the occasional on-demand GPU.

Each LLM has its strengths and weaknesses. I chose Llama because it's one of the best open-source models out there and has a robust support community. The code below uses AutoTokenizer from Hugging Face to declare which LLM to use for tokenization.

from transformers import AutoTokenizer

# Load the tokenizer that pairs with the Llama 3.2 11B Vision model
llama_3211b = "meta-llama/Llama-3.2-11B-Vision"
llama_3211b_tokenizer = AutoTokenizer.from_pretrained(llama_3211b)

# .... batching logic omitted ....

# Tokenize a batch of text strings into TensorFlow tensors, truncating long
# sequences and padding short ones so everything in the batch is the same length
tokenized_data = llama_3211b_tokenizer(batch_texts, return_tensors="tf", truncation=True, padding=True)
tokenized_batches.append(tokenized_data)
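
For reference, that call returns a batch encoding holding the integer token IDs plus an attention mask marking which positions are real tokens and which are padding. A quick way to peek at one batch (just a sketch; the exact shapes depend on your data):

# tokenized_data comes from the tokenizer call above with return_tensors="tf"
print(tokenized_data["input_ids"].shape)       # (batch size, longest sequence in the batch)
print(tokenized_data["attention_mask"].shape)  # same shape; 1 = real token, 0 = padding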

After you set up the access, you take data from a typical structured table like this (transposed here so each column sits on its own line, with two example rows):

column            row 0                                       row 1
release_date      2022-06-24                                  2022-06-24
artist_name       Preja                                       Sidney Saige Ausama
title             Movha                                       Movha
label_name        Supadjs Projects                            Supadjs Projects
duration          6:51                                        6:51
genre_name        Amapiano                                    Amapiano
bpm               112.0                                       112.0
key_description   Natural A Minor                             Natural A Minor
mix               Main                                        Main
is_remixed        f                                           f
is_remixer        f                                           f
mode              0.0                                         0.0
valence           0.367                                       0.367
danceability      0.732                                       0.732
energy            0.492                                       0.492
speechiness       0.2620                                      0.2620
loudness          -10.961                                     -10.961
liveness          0.2970                                      0.2970
instrumentalness  0.00000                                     0.00000
acousticness      0.247000                                    0.247000
isrc              GBKQU2257859                                GBKQU2257859
artist_id         1063851                                     1063852
artist_url        beatport.com/artist/preja/1063851           beatport.com/artist/sidney-saige-ausama/1063852
track_id          16636568                                    16636568
track_url         beatport.com/track/movha/16636568           beatport.com/track/movha/16636568
label_id          40460                                       40460
label_url         beatport.com/label/supadjs-projects/40460   beatport.com/label/supadjs-projects/40460
genre_id          98                                          98
genre_url         /genre/amapiano/98                          /genre/amapiano/98
key_id            8.0                                         8.0

That data gets converted into a string representation, with each field name and its value appended one after another:

text_values = (
    "Track ID: 16636568, Title: Movha, Artist: Preja, Artist ID: 1063851, "
    "Genre: Amapiano, Genre ID: 98, Label: Supadjs Projects, Label ID: 40460, "
    "Release Date: 2022-06-24, Track URL: beatport.com/track/movha/16636568, "
    "Mix: Main, Remix: No, Remixer: No, Duration: 6:51 minutes, BPM: 112.0, "
    "Key ID: 8.0, Mode: 0.0, Valence: 0.367, Danceability: 0.732, Energy: 0.492, "
    "Speechiness: 0.2620, Loudness: -10.961, Liveness: 0.2970, Instrumentalness: 0.00000, "
    "Acousticness: 0.247000, ISRC: GBKQU2257859, Artist URL: beatport.com/artist/preja/1063851, "
    "Label URL: beatport.com/label/supadjs-projects/40460, Genre URL: /genre/amapiano/98."
)
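
Here's a rough sketch of how a row from that table can be flattened into this kind of string with pandas. The column names match the table above, but the helper function and variable names are just illustrative, not the project's actual code:

import pandas as pd

def row_to_text(row: pd.Series) -> str:
    # Build a "Field: value" string from a handful of the table's columns;
    # the real version would cover every column you want the model to see
    return (
        f"Track ID: {row['track_id']}, Title: {row['title']}, "
        f"Artist: {row['artist_name']}, Artist ID: {row['artist_id']}, "
        f"Genre: {row['genre_name']}, Genre ID: {row['genre_id']}, "
        f"Label: {row['label_name']}, Label ID: {row['label_id']}, "
        f"Release Date: {row['release_date']}, BPM: {row['bpm']}"
    )

# df is assumed to be the structured table shown earlier
# texts = df.apply(row_to_text, axis=1).tolist()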

Show me the tokens

Feed that into the Llama tokenizer, and you'll get something like the output below—these are the tokens. A few things to note:

  • The Ġ prefix is added by tokenizers like Llama's to indicate that a token follows a space
    • For example, ĠID means the word "ID" is preceded by a space
  • Punctuation and numbers are treated as individual tokens
  • URLs are often treated as single tokens or split into recognizable parts
tokens = [
    'Track', 'ĠID', ':', 'Ġ16636568', ',', 'ĠTitle', ':', 'ĠMovha', ',', 
    'ĠArtist', ':', 'ĠPreja', ',', 'ĠArtist', 'ĠID', ':', 'Ġ1063851', ',', 
    'ĠGenre', ':', 'ĠAmapiano', ',', 'ĠGenre', 'ĠID', ':', 'Ġ98', ',', 
    'ĠLabel', ':', 'ĠSupadjs', 'ĠProjects', ',', 'ĠLabel', 'ĠID', ':', 'Ġ40460', ',', 
    'ĠRelease', 'ĠDate', ':', 'Ġ2022-06-24', ',', 
    'ĠTrack', 'ĠURL', ':', 'Ġbeatport.com/track/movha/16636568', ',', 
    'ĠMix', ':', 'ĠMain', ',', 'ĠRemix', ':', 'ĠNo', ',', 
    'ĠRemixer', ':', 'ĠNo', ',', 'ĠDuration', ':', 'Ġ6:51', 'Ġminutes', ',', 
    'ĠBPM', ':', 'Ġ112.0', ',', 'ĠKey', 'ĠID', ':', 'Ġ8.0', ',', 
    'ĠMode', ':', 'Ġ0.0', ',', 'ĠValence', ':', 'Ġ0.367', ',', 
    'ĠDanceability', ':', 'Ġ0.732', ',', 'ĠEnergy', ':', 'Ġ0.492', ',', 
    'ĠSpeechiness', ':', 'Ġ0.2620', ',', 'ĠLoudness', ':', 'Ġ-10.961', ',', 
    'ĠLiveness', ':', 'Ġ0.2970', ',', 'ĠInstrumentalness', ':', 'Ġ0.00000', ',', 
    'ĠAcousticness', ':', 'Ġ0.247000', ',', 'ĠISRC', ':', 'ĠGBKQU2257859', ',', 
    'ĠArtist', 'ĠURL', ':', 'Ġbeatport.com/artist/preja/1063851', ',', 
    'ĠLabel', 'ĠURL', ':', 'Ġbeatport.com/label/supadjs-projects/40460', ',', 
    'ĠGenre', 'ĠURL', ':', 'Ġ/genre/amapiano/98', '.'
]
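
A list like the one above comes from the tokenizer's tokenize() method, which splits the string into sub-word tokens without converting anything to numbers yet. A quick sketch, reusing llama_3211b_tokenizer and text_values from earlier:

# Split the raw string into sub-word tokens (strings, not IDs)
tokens = llama_3211b_tokenizer.tokenize(text_values)
print(tokens[:10])  # the first handful, along the lines of ['Track', 'ĠID', ':', ...]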

Before the model can do anything with those tokens, each one is mapped to an integer ID from the tokenizer's vocabulary; that's what the list below shows. Inside the model, an embedding layer then turns each ID into an embedding: a numeric representation of the token in a high-dimensional space. Each token is represented by a list of numbers called a vector, which can contain hundreds or thousands of values depending on the model and embedding size. These vectors capture semantic information about the tokens, allowing the model to understand relationships between words, such as similarities in meaning and context.

token_ids = [
    16042, 3110, 25, 220, 4645, 5245, 5313, 11, 11106, 25, 31048, 40961, 11, 
    29459, 25, 34248, 44027, 11, 29459, 3110, 25, 220, 6849, 20555,
    # ..... removed a few rows to save space
    11, 41395, 25, 76212, 611, 18682, 17829, 11, 41395, 3110, 25, 220, 975, 
    11, 916, 14, 1530, 38666, 351, 299, 1009, 1474, 11785, 14, 10132, 3174, 
    11, 41395, 5665, 25, 611, 34713, 45273, 2931, 6953, 752, 42357, 14, 975, 13
]
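
To go from those IDs to the actual embedding vectors, you can pull the model's input embedding layer and use it as a lookup table. Loading an 11B-parameter model just for this takes a lot of memory, so the sketch below uses the small text-only Llama-3.2-1B checkpoint purely as an illustration; the same calls apply to the larger Llama 3.2 models:

from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # small stand-in, chosen only to keep the example light
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The embedding layer is a lookup table of shape (vocabulary size, embedding dimension)
embedding_layer = model.get_input_embeddings()
print(embedding_layer.weight.shape)

# Turn a short string into token IDs, then look up the dense vector for each ID
ids = tokenizer("Genre: Amapiano", return_tensors="pt")["input_ids"]
vectors = embedding_layer(ids)
print(vectors.shape)  # (1, number of tokens, embedding dimension)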


When I say "high-dimensional space," I mean that a 500-value vector exists in a space with 500 dimensions, with each value giving the vector's position along one of those dimensions. In other words, a token's embedding is a point v = (v1, v2, ..., v500) sitting somewhere in a 500-dimensional space, and tokens with related meanings end up near each other in that space.
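
A common way to measure that closeness is cosine similarity, which compares the direction two vectors point in. Here's a toy sketch with made-up 4-value vectors standing in for real embeddings (which would have hundreds or thousands of values):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point the same way, 0.0 means unrelated, -1.0 means opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors pretending to be the embeddings of three tokens
house = np.array([0.9, 0.1, 0.3, 0.7])
techno = np.array([0.8, 0.2, 0.4, 0.6])
invoice = np.array([-0.5, 0.9, -0.2, 0.1])

print(cosine_similarity(house, techno))   # close to 1: related dance-music concepts
print(cosine_similarity(house, invoice))  # much lower: unrelated concepts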

Pretty charts

Here are some charts I made. The first is a word cloud based on the bios of 55,000 artists, which I gathered by scraping Beatport. These artists are the ones who earned Beatport points over the past few years.

This is from some exploratory analysis I did on the data, which is also part of a final project I'm working on for school.

Top 20 most common words: I used Scikit-Learn to remove "stop words" (e.g., "and," "if," etc.) and then plotted the top 20 words by frequency across the bios.

Word length distributions: Exactly as it sounds—showing the distribution of word lengths.

Top 20 most frequent 3-word combos: Counts the most common three-word combinations and displays their frequencies. This is commonly called a "trigram."

Sentiment distribution: A scale from -1 to +1, representing most negative to most positive sentiment.
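
For the word-frequency and trigram charts, scikit-learn's CountVectorizer does most of the heavy lifting. Here's a rough sketch of the approach; the bios list is a tiny stand-in for the roughly 55,000 scraped artist bios, and the helper name is just for illustration:

from sklearn.feature_extraction.text import CountVectorizer

bios = [
    "Amapiano producer and DJ based in Johannesburg",
    "House and techno DJ with releases on several labels",
]  # stand-in for the scraped artist bios

def top_terms(texts, ngram_range=(1, 1), top_n=20):
    # Drop common English stop words, then count how often each term appears across all texts
    vectorizer = CountVectorizer(stop_words="english", ngram_range=ngram_range)
    counts = vectorizer.fit_transform(texts)
    totals = counts.sum(axis=0).A1  # total occurrences of each term
    terms = vectorizer.get_feature_names_out()
    return sorted(zip(terms, totals), key=lambda pair: pair[1], reverse=True)[:top_n]

print(top_terms(bios))                      # top single words
print(top_terms(bios, ngram_range=(3, 3)))  # top three-word combos (trigrams)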

The future is now

Back in March, after I applied for grad school, I did some initial research and set a goal: By late fall, I wanted to fine-tune an open-source large language model using data that needed significant prep work. At the time, I didn’t know Python, nor did I understand much about how these models actually worked or the levers you could pull to adjust their behavior. Honestly, this website didn’t even exist back then.

Several months later, that goal is now within reach. Part one of Project Dolly Shield took me back to the start of my work with the Beatport and Spotify data I’ve been using. Preparing data for modeling is still the most time-consuming and crucial part of the process. You always end up revisiting and making changes, but at some point, the data is ready enough for the first big training push. I’m just about at that point with the data I have. I've also started on part two, which fixes some mistakes I made earlier and lays the groundwork for the training attempt.

The future is here. What are you waiting for?