Project Dolly Shield, Chapter 1, Part 1: The Road to Tokenization
Overview
Part one of Project Dolly Shield starts close to where I started Project Ventoux, with a more sophisticated tool kit and a continued quest to fine-tune an open source large language model. Along the way, I had to confront a lot of thorny decisions around what data should mean and why.
My setup for accessing and securing data is more robust this time around, with a continued shift to cloud computing infrastructure. Hilariously, Google Cloud Platform denied my request to access an instance with NVIDIA GPUs and wanted me to hop on a call with a sales person instead. While Google's GPUs are more expensive than other providers', GCP had the benefit of being home to most of my other cloud-based activities. After I cheerfully declined the invite from Google, I spun up a RunPod A100 instance in about 20 minutes.
Getting set up
I authenticate into my GCP service account using impersonation by running this command from my MacBook terminal:
gcloud auth application-default login
from google.auth import default, impersonated_credentials

credentials, project_id = default()
service_account_email =
target_scopes = ['https://www.googleapis.com/auth/cloud-platform']
sa_credentials = impersonated_credentials.Credentials(
    source_credentials=credentials,
    target_principal=service_account_email,
    target_scopes=target_scopes
)
After impersonation authentication, I can then connect to GCP Secret Manager and bring in my Hugging Face API token:
from google.cloud import secretmanager

secret_client = secretmanager.SecretManagerServiceClient()
project_id =
secret_id =
version_id = "latest"
secrets_name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
response = secret_client.access_secret_version(name=secrets_name)
HUGGING_FACE_READ_TOKEN_LLAMA = response.payload.data.decode("UTF-8")
I can also access the GCP bucket I use for this project:
from io import StringIO
from google.cloud import storage
import pandas as tian  # pandas, under this project's alias

storage_client = storage.Client(credentials=credentials)
love_uwsthoughts = "love-uwsthoughts"
bp_csvs_folder = 'bp_csvs'
bp_artist_bios_folder = 'bp_artist_bios'
uwsthoughts_bucket = storage_client.bucket(love_uwsthoughts)
def gcp_download(bucket_name, gcp_folder, file_name):
    bucket = storage_client.bucket(bucket_name)
    blob_name = f"{gcp_folder}/{file_name}".strip("/")
    file_blob = bucket.blob(blob_name)
    file_data = file_blob.download_as_text()
    gcp_df = tian.read_csv(StringIO(file_data))
    return gcp_df
Data pre-processing
Much of my previous work has relied on using various IDs to establish relationships between data, with text values only coming in at the very end to help identify outputs. Now that I'm using a large language model (LLM) to generate responses, I prefer text values instead of IDs so the model can tokenize and understand their meaning. The goal is to fine-tune the model with previously unseen music data, allowing users to discover music recommendations they haven't heard before.
To achieve this, I identified a set of replacements based on the most common occurrences of different types of values. The original dataset contained over 400,000 distinct values, but the approach below allows me to consolidate 90% of them into just 20 terms.
mix_replacements = {
"continuous": "Set Mixed", "live": "Set Mixed", "remastered": "Remastered Mix",
"orginal": "Original Mix", "ambient": "Ambient Mix", "chill": "Ambient Mix",
"lounge": "Ambient Mix", "rework": "Remastered Mix", "remix": "Remix",
"original": "Original Mix", "club": "Club Mix", "dub": "Dub Mix",
"extended": "Extended Mix", "instrumental": "Instrumental Mix",
"radio": "Radio Mix", "vip": "Remix", "album": "Album Mix",
"Continuous DJ Mix": "Mixed", "Mix Cut": "Set Mixed", "Mixed": "Set Mixed",
"Intro Mix": "Set Mixed", "Edit": "Radio Edit", "Main Mix": "Original Mix",
"Album Version": "Album Mix", "Deep Mix": "Remix", "House Mix": "Remix",
"Tribal Mix": "Remix", "Intro": "Set Mixed", "Edit Mix": "Radio Mix",
"Bonus Track": "Album Mix"
}
This section replaces values where applicable, while preserving the original values if no replacement is found.
def clean_mix_values(mix):
    if isinstance(mix, str):
        for key, value in mix_replacements.items():
            if key.lower() in mix.lower():
                return value
    return mix

bpmeta_audio_df['mix'] = bpmeta_audio_df['mix'].apply(clean_mix_values)
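One subtlety worth knowing about this cleaner: matching is substring-based and first-key-wins, so the order of the dictionary matters. A title containing both "radio" and "remix" resolves to whichever key comes first. Here's a minimal, self-contained sketch that re-declares a trimmed version of the mapping (just three entries, not the real dictionary above) to show the behavior:

```python
# Trimmed re-declaration of the mapping, only to demonstrate ordering behavior.
mix_replacements_demo = {
    "remix": "Remix", "extended": "Extended Mix", "radio": "Radio Mix",
}

def clean_mix_values_demo(mix):
    # non-strings (NaN, None) pass through untouched
    if isinstance(mix, str):
        for key, value in mix_replacements_demo.items():
            # first matching key wins, so dict order decides ties
            if key.lower() in mix.lower():
                return value
    return mix
```

Because "remix" is listed before "radio" in this demo, a value like "Radio Remix" maps to "Remix", not "Radio Mix"; values with no match fall through unchanged.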
Combining different track metadata
I have separate files containing the artist, label, and key information for the tracks, and I want to create a single file that presents all of this data in a human-readable format.
Track and label metadata
The first step is straightforward: I have separate CSV files for artist and label metadata, and I want to see a label along with its associated artists. For example: "Anjunadeep includes Marsh and Eli & Fur as artists." I kept the IDs and URLs but moved them to the end, as I have a feeling those URLs might be useful later on.
label_artist_temp = tian.merge(bp_label_artist_df, bp_label_df, on='label_id', suffixes=('', '_label'))
bp_artist_label_names_df = tian.merge(label_artist_temp, bp_artist_df, on='artist_id', suffixes=('', '_artist'))
bp_artist_label_names_df = bp_artist_label_names_df[[ 'label_name', 'artist_name', 'label_id', 'label_url', 'artist_id', 'artist_url']]
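Getting from that merged table to the readable sentence above ("Anjunadeep includes Marsh and Eli & Fur as artists") is a short groupby away. This is a hypothetical sketch, not code from the project: the column names match bp_artist_label_names_df, but the sample rows are made up.

```python
import pandas as tian  # pandas, under this project's alias

# Made-up sample standing in for bp_artist_label_names_df
sample = tian.DataFrame({
    'label_name': ['Anjunadeep', 'Anjunadeep'],
    'artist_name': ['Marsh', 'Eli & Fur'],
})

# One sentence per label: group on label_name and join the artist names
sentences = [
    f"{label} includes {' and '.join(names)} as artists."
    for label, names in sample.groupby('label_name')['artist_name']
]
```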
Track and key metadata
bp_key_df is a table where fields like key_id, key_name, chord, and whether it's sharp or flat are in separate columns. That's great for a typical database, but I need something a bit more human, so I'm going to make a new dataframe that has key_id and a new key_description field that brings together the key name, chord, and sharps/flats into a single value.
bp_key_text_df = bp_key_df.copy()

def key_mapper(row):
    if row['is_sharp'] == 't':
        return f"{row['key_letter']}-sharp {row['chord_name']}"
    elif row['is_flat'] == 't':
        return f"{row['key_letter']}-flat {row['chord_name']}"
    else:
        return f"Natural {row['key_letter']} {row['chord_name']}"

bp_key_text_df['key_description'] = bp_key_text_df.apply(key_mapper, axis=1)
bp_key_text_df = bp_key_text_df[['key_id', 'key_description']]
bp_key_text_df
Four to the floor
With the data I have, I want the model to generate something like:
On August 18th, 2023, Mira released 'Celo' on Kiosk ID. It was her first release of 2023 and also her melodic house & techno debut, showcasing a new, edgier side to her growing repertoire.
That led me to a schema that looks like this:
release_date | artist_name | title | label_name | genre_name | bpm | key_description | mix | is_remixed | is_remixer | mode | valence | danceability | energy | speechiness | loudness | liveness | instrumentalness | acousticness | isrc | artist_id | artist_url | track_id | track_url | label_id | label_url | genre_id | genre_url
All text values are listed first, with IDs and URLs placed at the end. I also included descriptive text for Spotify audio metrics, making it easier to see how they translate from something a computer understands to something a human does. Spotify's metrics are generally unimpressive, and I want to move away from relying on them eventually.
After the joins, I ended up with a table where each track ID has a distinct row for each artist featured on the track. I'll need to find a way to collapse these rows into a single row per track.
I'm sure everything will be fine until then
Narrator's voice: everything was not fine until then
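When I do get around to collapsing those duplicate rows, the usual pandas move is a groupby over track_id that joins the artist names and keeps the first value of everything else. A minimal sketch of that idea, with a made-up sample frame and only a couple of the real columns:

```python
import pandas as tian  # pandas, under this project's alias

# Made-up sample: two artists share track 1, track 2 has one artist
sample = tian.DataFrame({
    'track_id': [1, 1, 2],
    'artist_name': ['Preja', 'Sidney Saige Ausama', 'LSDee'],
    'title': ['Movha', 'Movha', 'Ayeye'],
})

# Collapse to one row per track: join artist names, keep first of the rest
collapsed = (
    sample
    .groupby('track_id', as_index=False)
    .agg({'artist_name': ', '.join, 'title': 'first'})
)
```

In the real table, every non-artist column would get the 'first' treatment, since they're identical across a track's rows anyway.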
bp_track_artist_merge = tian.merge(bpmeta_audio_shield_df, bp_artist_track_df, on='track_id', suffixes=('', '_artist'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_merge, bp_artist_df, on='artist_id', suffixes=('', '_artist_info'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_label_merge, bp_label_df, on='label_id', suffixes=('', '_label'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_label_merge, bp_genre_df, on='genre_id', suffixes=('', '_genre'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_label_merge, bp_key_text_df, on='key_id', suffixes=('', '_key'))
bp_text_values_df = bp_track_artist_label_merge[[
'release_date', 'artist_name', 'title', 'label_name', 'duration', 'genre_name', 'bpm', 'key_description', 'mix', 'is_remixed', 'is_remixer',
'mode', 'valence', 'danceability', 'energy', 'speechiness', 'loudness', 'liveness', 'instrumentalness', 'acousticness', 'isrc', 'artist_id',
'artist_url', 'track_id', 'track_url', 'label_id', 'label_url', 'genre_id', 'genre_url', 'key_id'
]]
This resulted in a beautifully massive table. At first glance, it looks just as I designed it. Sooner rather than later, I'll need to address similar fields like is_remixed and is_remixer, since they're conveying the same information from different perspectives.
bp_text_values_df
release_date | artist_name | title | label_name | duration | genre_name | bpm | key_description | mix | is_remixed | is_remixer | mode | valence | danceability | energy | speechiness | loudness | liveness | instrumentalness | acousticness | isrc | artist_id | artist_url | track_id | track_url | label_id | label_url | genre_id | genre_url | key_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-06-24 | Preja | Movha | Supadjs Projects | 6:51 | Amapiano | 112.0 | Natural A Minor | Main | f | f | 0.0 | 0.367 | 0.732 | 0.492 | 0.2620 | -10.961 | 0.2970 | 0.00000 | 0.247000 | GBKQU2257859 | 1063851 | beatport.com/artist/preja/1063851 | 16636568 | beatport.com/track/movha/16636568 | 40460 | beatport.com/label/supadjs-projects/40460 | 98 | /genre/amapiano/98 | 8.0 |
1 | 2022-06-24 | Sidney Saige Ausama | Movha | Supadjs Projects | 6:51 | Amapiano | 112.0 | Natural A Minor | Main | f | f | 0.0 | 0.367 | 0.732 | 0.492 | 0.2620 | -10.961 | 0.2970 | 0.00000 | 0.247000 | GBKQU2257859 | 1063852 | beatport.com/artist/sidney-saige-ausama/1063852 | 16636568 | beatport.com/track/movha/16636568 | 40460 | beatport.com/label/supadjs-projects/40460 | 98 | /genre/amapiano/98 | 8.0 |
2 | 2022-01-28 | LSDee | Ayeye | One Night Stand Distribution | 6:20 | Amapiano | 112.0 | Natural F Minor | Original Mix | f | f | 1.0 | 0.504 | 0.875 | 0.556 | 0.0700 | -7.754 | 0.0819 | 0.55300 | 0.090300 | ZARQO2200001 | 213465 | beatport.com/artist/lsdee/213465 | 16115373 | beatport.com/track/ayeye/16115373 | 99208 | beatport.com/label/one-night-stand-distributio... | 98 | /genre/amapiano/98 | 4.0 |
3 | 2021-10-01 | Babalwa M | LalaBy | Mavuso Business Solutions | 9:01 | Amapiano | 114.0 | G-flat Minor | Outro | f | f | 0.0 | 0.261 | 0.852 | 0.445 | 0.0514 | -16.397 | 0.0366 | 0.14800 | 0.007190 | ZAC012100398 | 1039645 | beatport.com/artist/babalwa-m/1039645 | 16223521 | beatport.com/track/lalaby/16223521 | 100342 | beatport.com/label/mavuso-business-solutions/1... | 98 | /genre/amapiano/98 | 34.0 |
4 | 2021-10-01 | Babalwa M | Jaiva | Mavuso Business Solutions | 6:15 | Amapiano | 113.0 | G-flat Minor | Original Mix | f | f | 0.0 | 0.343 | 0.842 | 0.702 | 0.0667 | -13.397 | 0.0253 | 0.03780 | 0.005390 | ZAC012100395 | 1039645 | beatport.com/artist/babalwa-m/1039645 | 16223515 | beatport.com/track/jaiva/16223515 | 100342 | beatport.com/label/mavuso-business-solutions/1... | 98 | /genre/amapiano/98 | 34.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9522072 | 2009-05-29 | Steve O Steen | Breakfast In Illinois | Robsoul Recordings | 4:37 | Jackin House | 127.0 | Natural A Minor | Original Mix | f | f | 1.0 | 0.677 | 0.806 | 0.773 | 0.0726 | -11.024 | 0.1110 | 0.88000 | 0.000038 | FR48Z0900019 | 107441 | beatport.com/artist/steve-o-steen/107441 | 859924 | beatport.com/track/breakfast-in-illinois/859924 | 2204 | beatport.com/label/robsoul-recordings/2204 | 97 | /genre/jackin-house/97 | 8.0 |
9522073 | 2009-05-29 | Sam Karlson | Robsoul Boy | Robsoul Recordings | 5:30 | Jackin House | 127.0 | Natural B Major | Original Mix | f | f | 1.0 | 0.912 | 0.890 | 0.772 | 0.1660 | -5.031 | 0.0685 | 0.00071 | 0.180000 | FR48Z0900018 | 9863 | beatport.com/artist/sam-karlson/9863 | 859923 | beatport.com/track/robsoul-boy/859923 | 2204 | beatport.com/label/robsoul-recordings/2204 | 97 | /genre/jackin-house/97 | 13.0 |
9522074 | 2009-05-29 | Steve O Steen | Robsoul Boy | Robsoul Recordings | 5:30 | Jackin House | 127.0 | Natural B Major | Original Mix | f | f | 1.0 | 0.912 | 0.890 | 0.772 | 0.1660 | -5.031 | 0.0685 | 0.00071 | 0.180000 | FR48Z0900018 | 107441 | beatport.com/artist/steve-o-steen/107441 | 859923 | beatport.com/track/robsoul-boy/859923 | 2204 | beatport.com/label/robsoul-recordings/2204 | 97 | /genre/jackin-house/97 | 13.0 |
9522075 | 2007-08-07 | Colette | Hypnotized | OM Records | 6:40 | Jackin House | 126.0 | Natural F Major | Jason Hodges Mix | t | f | 0.0 | 0.879 | 0.804 | 0.795 | 0.0715 | -10.096 | 0.0494 | 0.82400 | 0.000029 | USOM80795203 | 3201 | beatport.com/artist/colette/3201 | 349708 | beatport.com/track/hypnotized/349708 | 351 | beatport.com/label/om-records/351 | 97 | /genre/jackin-house/97 | 19.0 |
9522076 | 2007-08-07 | Jason Hodges | Hypnotized | OM Records | 6:40 | Jackin House | 126.0 | Natural F Major | Jason Hodges Mix | t | t | 0.0 | 0.879 | 0.804 | 0.795 | 0.0715 | -10.096 | 0.0494 | 0.82400 | 0.000029 | USOM80795203 | 431 | beatport.com/artist/jason-hodges/431 | 349708 | beatport.com/track/hypnotized/349708 | 351 | beatport.com/label/om-records/351 | 97 | /genre/jackin-house/97 | 19.0 |
9522077 rows × 30 columns
All of that work makes it much easier to find the artists I'm looking for. Text values as IDs aren't ideal due to the risk of overlaps and duplicates. For example, "Mira (Berlin)" has "(Berlin)" added to her name because there's another "Mira" on Beatport, and someone needed to make a small distinction.
bp_text_values_df[bp_text_values_df['artist_name'] == 'Mira (Berlin)'].sort_values(by='release_date', ascending=False).head(10)
release_date | artist_name | title | label_name | duration | genre_name | bpm | key_description | mix | is_remixed | is_remixer | mode | valence | danceability | energy | speechiness | loudness | liveness | instrumentalness | acousticness | isrc | artist_id | artist_url | track_id | track_url | label_id | label_url | genre_id | genre_url | key_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7620604 | 2023-08-18 | Mira (Berlin) | Jero | Kiosk ID | 10:05 | Melodic House & Techno | 116.0 | Natural D Major | Original Mix | f | f | 0.0 | 0.0378 | 0.760 | 0.728 | 0.0653 | -11.191 | 0.1030 | 0.899 | 0.028600 | DECY52301345 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993565 | beatport.com/track/jero/17993565 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 22.0 |
7698958 | 2023-08-18 | Mira (Berlin) | Cleo | Kiosk ID | 6:43 | Melodic House & Techno | 92.0 | Natural G Major | Remix | t | f | 1.0 | 0.1200 | 0.806 | 0.783 | 0.0461 | -10.205 | 0.0660 | 0.946 | 0.077000 | DECY52301349 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993576 | beatport.com/track/cleo/17993576 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 21.0 |
7698955 | 2023-08-18 | Mira (Berlin) | Jero | Kiosk ID | 8:38 | Melodic House & Techno | 119.0 | Natural D Major | Remix | t | f | 1.0 | 0.2670 | 0.830 | 0.350 | 0.0784 | -9.705 | 0.0999 | 0.903 | 0.007660 | DECY52301346 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993567 | beatport.com/track/jero/17993567 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 22.0 |
7698952 | 2023-08-18 | Mira (Berlin) | Jero | Kiosk ID | 5:56 | Melodic House & Techno | 120.0 | Natural D Minor | Argia Powerbeats Version | t | f | 1.0 | 0.0876 | 0.795 | 0.837 | 0.0449 | -10.095 | 0.1050 | 0.833 | 0.000121 | DECY52301347 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993570 | beatport.com/track/jero/17993570 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 7.0 |
7660383 | 2023-08-18 | Mira (Berlin) | Cleo | Kiosk ID | 6:18 | Melodic House & Techno | 114.0 | Natural G Minor | Original Mix | f | f | 0.0 | 0.1760 | 0.806 | 0.699 | 0.0451 | -11.390 | 0.1200 | 0.939 | 0.014300 | DECY52301348 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993573 | beatport.com/track/cleo/17993573 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 6.0 |
5034902 | 2022-11-18 | Mira (Berlin) | Murmeli | Kiosk ID | 6:32 | Organic House / Downtempo | 110.0 | Natural C Major | Remix | t | t | 1.0 | 0.1020 | 0.797 | 0.871 | 0.0544 | -10.162 | 0.1500 | 0.896 | 0.056400 | DECY52201887 | 385397 | beatport.com/artist/mira-berlin/385397 | 17067088 | beatport.com/track/murmeli/17067088 | 80379 | beatport.com/label/kiosk-id/80379 | 93 | /genre/organic-house-downtempo/93 | 20.0 |
5034956 | 2022-11-04 | Mira (Berlin) | Siriema | The Magic Movement | 7:19 | Organic House / Downtempo | 108.0 | Natural G Minor | Remix | t | t | 0.0 | 0.7120 | 0.800 | 0.492 | 0.0649 | -10.809 | 0.0840 | 0.894 | 0.081500 | QM6P42229087 | 385397 | beatport.com/artist/mira-berlin/385397 | 16935012 | beatport.com/track/siriema/16935012 | 39677 | beatport.com/label/the-magic-movement/39677 | 93 | /genre/organic-house-downtempo/93 | 6.0 |
5034953 | 2022-11-04 | Mira (Berlin) | Siriema | The Magic Movement | 7:19 | Organic House / Downtempo | 108.0 | Natural G Minor | Remix | t | f | 0.0 | 0.7120 | 0.800 | 0.492 | 0.0649 | -10.809 | 0.0840 | 0.894 | 0.081500 | QM6P42229087 | 385397 | beatport.com/artist/mira-berlin/385397 | 16935012 | beatport.com/track/siriema/16935012 | 39677 | beatport.com/label/the-magic-movement/39677 | 93 | /genre/organic-house-downtempo/93 | 6.0 |
5034958 | 2022-10-14 | Mira (Berlin) | Siriema | The Magic Movement | 7:19 | Organic House / Downtempo | 108.0 | Natural G Minor | Remix | f | f | 0.0 | 0.7120 | 0.800 | 0.492 | 0.0649 | -10.809 | 0.0840 | 0.894 | 0.081500 | QM6P42229087 | 385397 | beatport.com/artist/mira-berlin/385397 | 16947157 | beatport.com/track/siriema/16947157 | 39677 | beatport.com/label/the-magic-movement/39677 | 93 | /genre/organic-house-downtempo/93 | 6.0 |
5060891 | 2022-09-09 | Mira (Berlin) | Higher Than Me | Frau Blau | 7:21 | Organic House / Downtempo | 114.0 | Natural D Major | Remix | t | t | 1.0 | 0.2090 | 0.809 | 0.798 | 0.0613 | -10.140 | 0.0666 | 0.840 | 0.077800 | IL4612200064 | 385397 | beatport.com/artist/mira-berlin/385397 | 16780144 | beatport.com/track/higher-than-me/16780144 | 74969 | beatport.com/label/frau-blau/74969 | 93 | /genre/organic-house-downtempo/93 | 22.0 |
Organizing artist biographies
In parallel, I scraped artist bios from Beatport in batches of 1,000, initially storing them in separate files. At this stage, I combined all of them and removed duplicates. I'll use this data later on—part one is just about getting organized.
Here's a snippet of how the scraping was handled:
import requests
from bs4 import BeautifulSoup as bs

def bioscrape(artist_url_extension):
    base_url = "https://www.beatport.com"
    url = base_url + artist_url_extension
    payload = {
        "api_key": SCRAPING_FISH_API_KEY,
        "url": url,
    }
    try:
        response = requests.get("https://scraping.narf.ai/api/v1/", params=payload, timeout=30)
        if response.status_code == 200:
            soup = bs(response.content, 'html.parser')
            artist_name_tag = soup.find('h1', class_="HeadingWithBreadcrumb-style__Title-sc-8549a8e9-2 hGvWKT")
            artist_name = artist_name_tag.text.strip() if artist_name_tag else "N/A"
            bio_tag = soup.find('meta', property="og:description")
            bio = bio_tag['content'].strip() if bio_tag and bio_tag.get('content') else "N/A"
            return artist_url_extension, artist_name, bio
        else:
            return artist_url_extension, "N/A", "N/A"
    except requests.Timeout:
        print(f"Timeout occurred while scraping {url}. Skipping to the next URL.")
        return artist_url_extension, "N/A", "N/A"
    except Exception as e:
        print(f"Error scraping {url}: {e}. Skipping to next bio.")
        return artist_url_extension, "N/A", "N/A"
Next, I set up parallelism using ThreadPoolExecutor and deployed the script to my GCP instance, where it ran in the background and took about 10 hours to scrape 55,000 pages.
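The parallel wrapper around bioscrape() looks roughly like the sketch below. This is an assumption-laden reconstruction, not the deployed script: the function name scrape_all, the worker count, and the pass-the-scraper-in design are mine (the scrape function is a parameter so the sketch stands on its own).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(scrape_fn, url_extensions, max_workers=8):
    # submit one scrape per URL extension and collect results as they finish;
    # as_completed yields futures in completion order, not submission order
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_fn, ext) for ext in url_extensions]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Threads suit this job because it's I/O-bound: each worker spends most of its time waiting on the scraping API, so the GIL isn't a bottleneck.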
Combining the files
In the end, I had 55 CSV files that needed to be combined. Below, I downloaded the files from their GCP bucket and performed a "glorified copy and paste" to merge them.
# sort folder by created date and then grab
def gcp_folder_sort(bucket, folder):
    blobs = list(bucket.list_blobs(prefix=folder))
    sorted_blobs = sorted(blobs, key=lambda x: x.time_created)
    return sorted_blobs

# combine csv files into one df
def artists_united(bucket, folder):
    sorted_blobs = gcp_folder_sort(bucket, folder)
    artists_united_df = tian.DataFrame()
    for blob in sorted_blobs:
        if blob.name.endswith('.csv'):
            the_drop = gcp_download(bucket.name, folder, blob.name.split('/')[-1])
            the_drop = the_drop[['beatport_artist_id', 'artist_name', 'beatport_bio']]
            artists_united_df = tian.concat([artists_united_df, the_drop], ignore_index=True)
    artists_united_df = artists_united_df.drop_duplicates(ignore_index=True)
    return artists_united_df

# save df back to gcp
def paranoid_guard(df, bucket_name, gcp_folder, file_name):
    bucket = storage_client.bucket(bucket_name)
    csv_data = df.to_csv(index=False)
    blob_name = f"{gcp_folder}/{file_name}".strip("/")
    blob = bucket.blob(blob_name)
    blob.upload_from_string(csv_data, content_type='text/csv')

bppoints_artist_bios_csv = "bppoints_artist_bios.csv"
bppoints_artist_bios_df = artists_united(uwsthoughts_bucket, bp_artist_bios_folder)
paranoid_guard(bppoints_artist_bios_df, love_uwsthoughts, bp_artist_bios_folder, bppoints_artist_bios_csv)
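One design note on artists_united(): calling tian.concat inside the loop re-copies the accumulated frame on every file, which gets quadratic as the file count grows. The standard pandas idiom is to collect the frames in a list and concatenate once. A small sketch of that variant (taking already-loaded frames as input, since the GCP download side is assumed from the code above):

```python
import pandas as tian  # pandas, under this project's alias

def artists_united_once(frames):
    # one concat over the whole list instead of one per file,
    # then a single dedupe pass at the end
    combined = tian.concat(frames, ignore_index=True)
    return combined.drop_duplicates(ignore_index=True)
```

For 55 files this is a nicety rather than a necessity, but it's a habit that pays off once the file count climbs.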
I kept only the artists for whom I had bios, resulting in 48,000 out of the original 55,000.
Here's a pretty word cloud for a random 20% sample of those bios. The larger the word, the more frequently it appears in the sample.
import csv
import random
import matplotlib.pyplot as plt
from wordcloud import WordCloud as wc

file_path = '/Users/uwsthoughts/Desktop/bp_spotify_raw_data/csv_data/artist_bios_df.csv'
sample_size = 10000

# total length minus the header
with open(file_path, 'r', encoding='utf-8') as file:
    total_lines = sum(1 for line in file) - 1

# random sample generator
sample_lines = set(random.sample(range(1, total_lines + 1), min(sample_size, total_lines)))

# use sample for word cloud
sampled_text = ""
with open(file_path, 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for i, row in enumerate(reader):
        if i in sample_lines:
            sampled_text += row['beatport_bio'] + " "

wordcloud = wc(width=800, height=400, background_color='white').generate(sampled_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word cloud from sample of artist bio data', fontsize=16)
plt.show()
A slice of reality carved into language
Tokenization is the process of losing your mind in new and unexpected ways—ways you never knew existed (doctor scribbles furiously into a notebook in the corner, sirens wailing in the background). Not quite about tokenization but this overview of sentence embeddings is a good primer on the related topic of embeddings.
To actually access and use Meta's Llama models, I have an API key for Hugging Face, allowing me to use their Transformers tools. By passing my Hugging Face API token to Transformers, I can call Meta's Llama-3.2-11B-Vision Model.
This establishes the connection to Meta's new 11-billion-parameter model capable of handling both text and images. AutoTokenizer from the Hugging Face library is used to convert text into tokens for the specified model.
from transformers import AutoTokenizer

llama_3211b = "meta-llama/Llama-3.2-11B-Vision"
llama_3211b_tokenizer = AutoTokenizer.from_pretrained(llama_3211b)
melodies() takes a row of table data as input and returns a structured text output. This part will need to change in the future, as processing text and numeric data together isn't the right approach. Instead, I should separate these into two distinct tokenization processes, which can then be combined after tokenization is complete.
def melodies(row):
    text = (
        f"Track ID: {row['track_id']}, Title: {row['title']}, "
        f"Artist: {row['artist_name']}, Artist ID: {row['artist_id']}, "
        f"Genre: {row['genre_name']}, Genre ID: {row['genre_id']}, "
        f"Label: {row['label_name']}, Label ID: {row['label_id']}, "
        f"Release Date: {row['release_date']}, Track URL: {row['track_url']}, "
        f"Mix: {row['mix']}, Remix: {'Yes' if row['is_remixed'] else 'No'}, "
        f"Remixer: {'Yes' if row['is_remixer'] else 'No'}, Duration: {row['duration']} minutes, "
        f"BPM: {row['bpm']}, Key ID: {row['key_id']}, "
        f"Mode: {row['mode']}, Valence: {row['valence']}, Danceability: {row['danceability']}, "
        f"Energy: {row['energy']}, Speechiness: {row['speechiness']}, "
        f"Loudness: {row['loudness']}, Liveness: {row['liveness']}, "
        f"Instrumentalness: {row['instrumentalness']}, Acousticness: {row['acousticness']}, "
        f"ISRC: {row['isrc']}, Artist URL: {row['artist_url']}, Label URL: {row['label_url']}, "
        f"Genre URL: {row['genre_url']}."
    )
    return text
rows_to_text() converts the rows of table data to text using melodies(). The output from this function is then used in tokenize_texts(), which uses the llama_3211b_tokenizer defined earlier to convert the rows of text into numeric tokens in equal-sized batches. I also included a console printout of a random sample, which is shown further down, along with logging the entire output to a file for later use. When I paused the processing to consider a different approach, I was about 40% complete, having generated an 85GB file.
from tqdm import tqdm

def rows_to_text(data_set, batch_size):
    num_batches = (len(data_set) // batch_size) + 1
    processed_texts = []
    for i in range(0, num_batches, 20):
        final_batch = min(i + 20, num_batches)
        for batch in tqdm(range(i, final_batch), desc=f"Processing batches {i+1}-{final_batch}"):
            first_index = batch * batch_size
            last_index = min((batch + 1) * batch_size, len(data_set))
            batch_df = data_set.iloc[first_index:last_index]
            # convert each row in the batch to its text description
            batch_results = [melodies(row) for _, row in tqdm(batch_df.iterrows(), total=len(batch_df), desc=f"Processing batch {batch+1}/{num_batches}")]
            processed_texts.extend(batch_results)
    return processed_texts
import logging

def tokenize_texts(texts, batch_size, sample_indices=None):
    num_batches = (len(texts) // batch_size) + 1
    tokenized_batches = []
    all_tokens = []
    token_lengths = []
    for i in range(0, num_batches, 20):
        last_batch = min(i + 20, num_batches)
        for batch_index in tqdm(range(i, last_batch), desc=f"Tokenizing batches {i+1}-{last_batch}"):
            start_index = batch_index * batch_size
            end_index = min((batch_index + 1) * batch_size, len(texts))
            batch_texts = texts[start_index:end_index]
            tokenized_data = llama_3211b_tokenizer(batch_texts, return_tensors="tf", truncation=True, padding=True)
            tokenized_batches.append(tokenized_data)
            for sample_i, text in enumerate(batch_texts):
                tokens = llama_3211b_tokenizer.tokenize(text)
                token_ids = llama_3211b_tokenizer.convert_tokens_to_ids(tokens)
                all_tokens.extend(tokens)
                token_lengths.extend([len(token) for token in tokens])
                logging.info(f"Text: {text}\nTokens: {tokens}\nEmbeddings: {token_ids}")
                global_i = start_index + sample_i
                if sample_indices and global_i in sample_indices:
                    print("Tokens:", tokens)
                    print("Token IDs:", token_ids)
    return tokenized_batches
I made my computer crash
Before I got the above to run successfully (with the output shown below), I decided to f'around and find out what happens when you try to tokenize a nine-million-row dataframe all at once without any preparation. Surprise! It caused the kernel to crash—another setback for overconfidence. After 46 minutes, I ended up here in my tracking:
Generating texts: 100%|██████████| 9522077/9522077 [04:59<00:00, 31748.96it/s]
Tokenizing texts... This may take a while.
When I got this error message:
The Kernel crashed while executing code in the current cell or a previous cell.
Please review the code in the cell(s) to identify a possible cause of the failure.
Click [here](https://github.com/microsoft/vscode-jupyter/wiki/Kernel-crashes) for more info.
View Jupyter log for further details.
So what's a gurl to do other than twirl down to the Jupyter logs and find out what's going on? Come with me:
13:48:29.846 [error] Disposing session as kernel process died ExitCode: undefined, Reason:
13:49:13.917 [error] Failed to write data to the kernel channel shell [
<Buffer 3c 49 44 53 7c 4d 53 47 3e>,
<Buffer 38 64 39 65 37 36 61 37 64 33 61 34 64 62 66 38 38 32 33 32 34 64 63 34 62 34 66 35 65 34 34 36 30 31 37 36 36 36 33 38 34 62 39 66 64 34 61 34 30 30 ... 14 more bytes>,
<Buffer 7b 22 64 61 74 65 22 3a 22 32 30 32 34 2d 30 39 2d 32 39 54 31 37 3a 34 39 3a 31 33 2e 39 31 37 5a 22 2c 22 6d 73 67 5f 69 64 22 3a 22 38 31 65 63 36 ... 177 more bytes>,
<Buffer 7b 7d>,
<Buffer 7b 7d>,
<Buffer 7b 22 73 69 6c 65 6e 74 22 3a 66 61 6c 73 65 2c 22 73 74 6f 72 65 5f 68 69 73 74 6f 72 79 22 3a 66 61 6c 73 65 2c 22 75 73 65 72 5f 65 78 70 72 65 73 ... 1058 more bytes>
] [Error: Socket is closed
at a.postToSocket (/Users/~/.cursor/extensions/ms-toolsai.jupyter-2024.6.0-darwin-arm64/dist/extension.node.js:304:8043)
at /Users/~/.cursor/extensions/ms-toolsai.jupyter-2024.6.0-darwin-arm64/dist/extension.node.js:304:7787] {
errno: 9,
code: 'EBADF'
}
My kernel imploded from being overloaded, more or less. The technical term for this is, "lol fml."
A word cloud as a consolation prize
I also created a word cloud for a sample of the logs I generated. Since I had an 85GB file (lol), I took another small sample. The issue was that I included a lot of repeated text on each line as part of melodies(). Below the image are samples from the logs that generated the word cloud. My cheerful website designer is helping me reconfigure this page so the text wraps within code boxes rather than just extending straight from left to right. For now, you'll need to scroll quite a bit to the right.
Next steps
I learned a tremendous amount very quickly about how data needs to be prepared for fine-tune training with an LLM. For one, just giving an LLM a continuous metric (like 'valence' or 'energy' with values between 0 and 1) without any context to guide it will result in the song metadata being completely severed from its numeric qualities. In part two, I'll need to normalize the numeric values and encode them into embeddings (large numeric vectors representing words) using projections, which is a linear algebra term for mapping data into a space that fits what a model can understand, also known as a vibe check. This will allow the numeric data to be integrated with text embeddings in a, like, super embedding or something. I'm thinking calculus, but bigger? If I do this right, the songs will be reunited with their metrics and everyone will vibe some more. Essentially, numbers will be everything, everywhere, all at once (IYKYK) in a multidimensional space.
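For the curious, here's a rough numpy sketch of the normalize-then-project plan, with made-up numbers: min-max scale the continuous metrics per column, then map them into a hypothetical 16-dimensional embedding space with a linear projection. In part two the projection matrix would be learned alongside the model; here it's just randomly initialized to show the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# two tracks x two metrics (e.g. valence, energy), values made up
metrics = np.array([[0.527, 0.928],
                    [0.480, 0.484]])

# min-max normalization per column, so each metric lands in [0, 1]
lo, hi = metrics.min(axis=0), metrics.max(axis=0)
normalized = (metrics - lo) / (hi - lo)

# linear projection: 2 metric dims -> a hypothetical 16-dim embedding space
projection = rng.normal(size=(2, 16))
numeric_embeddings = normalized @ projection  # shape (2, 16)
```

Once the numeric side lives in the same width as the text embeddings, the two can be combined (concatenated, summed, or cross-attended) instead of forcing the tokenizer to chew on raw decimals.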
Addendum: result logs from partial tokenization
This is what a TQDM tracking log looks like. I had 191 batches, so I got a progress bar for each one. In work I started after wrapping up part one, I switched to tracking batches of 20, similar to what I did for tokenization below.
farm_trips = 50000
bp_text_values = rows_to_text(bp_text_values_df, farm_trips)
Processing batch 1/191: 100%|██████████| 50000/50000 [00:01<00:00, 32801.32it/s]
Processing batch 2/191: 100%|██████████| 50000/50000 [00:01<00:00, 32199.59it/s]
Processing batch 3/191: 100%|██████████| 50000/50000 [00:01<00:00, 32956.93it/s]
Processing batch 4/191: 100%|██████████| 50000/50000 [00:01<00:00, 31053.80it/s]
Processing batch 5/191: 100%|██████████| 50000/50000 [00:01<00:00, 33235.40it/s]
.....
Processing batch 189/191: 100%|██████████| 50000/50000 [00:01<00:00, 32466.98it/s]
Processing batch 190/191: 100%|██████████| 50000/50000 [00:01<00:00, 32640.58it/s]
Processing batch 191/191: 100%|██████████| 22077/22077 [00:00<00:00, 32620.70it/s]
Processing batches 181-191: 100%|██████████| 11/11 [00:16<00:00, 1.46s/it]
bp_text_tokens = tokenize_texts(bp_text_values, farm_trips)
Tokenizing batches 1-20: 15%|█▌ | 3/20 [01:17<07:18, 25.81s/it]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '178', '056', '49', ',', 'ĠTitle', ':', 'ĠBang', 's', 'ĠIn', 'ĠThe', 'ĠHead', ',', 'ĠArtist', ':', 'ĠVal', 'eri', 'ø', 'ĠInn', 'ør', 'ta', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '647', '817', ',', 'ĠGenre', ':', 'ĠHard', 'ĠTechn', 'o', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '2', ',', 'ĠLabel', ':', 'ĠCar', 'bone', 'ĠRecords', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '591', '59', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '202', '3', '-', '06', '-', '30', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/b', 'angs', '-in', '-the', '-head', '/', '178', '056', '49', ',', 'ĠMix', ':', 'ĠOriginal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '5', ':', '20', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '80', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '3', '.', '0', ',', 'ĠMode', ':', 'Ġ', '0', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '527', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '686', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '928', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '173', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '5', '.', '653', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '153', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '175', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '000', '329', ',', 'ĠIS', 'RC', ':', 'ĠNL', 'CK', '422', '320', '42', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/', 'val', 'eri', '-in', 'nr', 'ta', '/', '647', '817', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/car', 'bone', '-', 'records', '/', '591', '59', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/h', 'ard', '-', 'techn', 'o', '/', '2', '.']
Embeddings IDs: [16042, 3110, 25, 220, 11256, 25921, 2491, 11, 11106, 25, 17343, 82, 763, 578, 11452, 11, 29459, 25, 4196, 31803, 6282, 17382, 17545, 2629, 11, 29459, 3110, 25, 220, 22644, 25528, 11, 41395, 25, 11481, 7146, 78, 11, 41395, 3110, 25, 220, 17, 11, 9587, 25, 3341, 20337, 22293, 11, 9587, 3110, 25, 220, 24380, 2946, 11, 17836, 2696, 25, 220, 2366, 18, 12, 2705, 12, 966, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 3554, 65587, 3502, 10826, 27488, 14, 11256, 25921, 2491, 11, 19771, 25, 17674, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 20, 25, 508, 4520, 11, 89319, 25, 220, 1490, 13, 15, 11, 5422, 3110, 25, 220, 18, 13, 15, 11, 14904, 25, 220, 15, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 22369, 11, 30704, 2968, 25, 220, 15, 13, 22347, 11, 12634, 25, 220, 15, 13, 25001, 11, 39841, 1918, 25, 220, 15, 13, 11908, 11, 80648, 2136, 25, 482, 20, 13, 21598, 11, 445, 13071, 25, 220, 15, 13, 9800, 11, 43405, 278, 2136, 25, 220, 15, 13, 10005, 11, 6515, 35415, 2136, 25, 220, 15, 13, 931, 18196, 11, 3507, 7532, 25, 33260, 3096, 16460, 9588, 2983, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 14, 838, 31803, 3502, 20191, 2629, 14, 22644, 25528, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 66759, 20337, 12, 27295, 14, 24380, 2946, 11, 41395, 5665, 25, 611, 34713, 7682, 569, 12, 26522, 78, 14, 17, 13]
.....
Tokenizing batches 41-60: 80%|████████ | 16/20 [09:47<02:34, 38.56s/it]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '101', '180', '85', ',', 'ĠTitle', ':', 'ĠSeven', 'ĠSteps', ',', 'ĠArtist', ':', 'ĠRico', 'ĠMartinez', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '104', '454', ',', 'ĠGenre', ':', 'ĠMinimal', 'Ġ/', 'ĠDeep', 'ĠTech', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '14', ',', 'ĠLabel', ':', 'ĠDat', 'ag', 'ro', 'ove', 'ĠMusic', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '156', '41', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '201', '8', '-', '01', '-', '18', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/se', 'ven', '-st', 'eps', '/', '101', '180', '85', ',', 'ĠMix', ':', 'ĠOriginal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '5', ':', '48', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '122', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '12', '.', '0', ',', 'ĠMode', ':', 'Ġ', '0', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '48', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '814', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '484', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '061', '8', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '9', '.', '41', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '070', '7', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '811', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '000', '776', ',', 'ĠIS', 'RC', ':', 'ĠCA', '5', 'KR', '170', '754', '6', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/', 'rico', '-m', 'art', 'inez', '/', '104', '454', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/dat', 'ag', 'ro', 'ove', '-m', 'usic', '/', '156', '41', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/min', 'imal', '-de', 'ep', '-tech', '/', '14', '.']
Embeddings: [16042, 3110, 25, 220, 4645, 5245, 5313, 11, 11106, 25, 31048, 40961, 11, 29459, 25, 34248, 44027, 11, 29459, 3110, 25, 220, 6849, 20555, 11, 41395, 25, 76212, 611, 18682, 17829, 11, 41395, 3110, 25, 220, 975, 11, 9587, 25, 22362, 351, 299, 1009, 10948, 11, 9587, 3110, 25, 220, 10132, 3174, 11, 17836, 2696, 25, 220, 679, 23, 12, 1721, 12, 972, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 60687, 1055, 5594, 7270, 14, 4645, 5245, 5313, 11, 19771, 25, 17674, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 20, 25, 2166, 4520, 11, 89319, 25, 220, 8259, 13, 15, 11, 5422, 3110, 25, 220, 717, 13, 15, 11, 14904, 25, 220, 15, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 2166, 11, 30704, 2968, 25, 220, 15, 13, 25498, 11, 12634, 25, 220, 15, 13, 20339, 11, 39841, 1918, 25, 220, 15, 13, 23324, 23, 11, 80648, 2136, 25, 482, 24, 13, 3174, 11, 445, 13071, 25, 220, 15, 13, 17819, 22, 11, 43405, 278, 2136, 25, 220, 15, 13, 22588, 11, 6515, 35415, 2136, 25, 220, 15, 13, 931, 23823, 11, 3507, 7532, 25, 9362, 20, 62984, 8258, 23952, 21, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 14, 56347, 1474, 472, 39395, 14, 6849, 20555, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 38666, 351, 299, 1009, 1474, 11785, 14, 10132, 3174, 11, 41395, 5665, 25, 611, 34713, 45273, 2931, 6953, 752, 42357, 14, 975, 13]
Tokenizing batches 81-100: 35%|███▌ | 7/20 [10:51<21:21, 98.56s/it]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '592', '498', '1', ',', 'ĠTitle', ':', 'ĠWhat', 'ĠU', 'ĠSay', ',', 'ĠArtist', ':', 'ĠCarlo', 'ĠCal', 'dar', 'eri', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '147', '415', ',', 'ĠGenre', ':', 'ĠHouse', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '5', ',', 'ĠLabel', ':', 'ĠSim', 'ma', 'ĠBlack', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '318', '07', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '201', '4', '-', '11', '-', '03', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/', 'what', '-u', '-s', 'ay', '/', '592', '498', '1', ',', 'ĠMix', ':', 'ĠOriginal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '6', ':', '26', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '125', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '6', '.', '0', ',', 'ĠMode', ':', 'Ġ', '1', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '804', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '803', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '974', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '082', '9', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '6', '.', '009', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '104', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '736', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '001', '17', ',', 'ĠIS', 'RC', ':', 'ĠQ', 'MS', 'NZ', '146', '036', '6', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/c', 'arlo', '-c', 'ald', 'ar', 'eri', '/', '147', '415', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/s', 'im', 'ma', '-black', '/', '318', '07', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/h', 'ouse', '/', '5', '.']
Embeddings: [16042, 3110, 25, 220, 20128, 21962, 16, 11, 11106, 25, 3639, 549, 25961, 11, 29459, 25, 58870, 3400, 35223, 31803, 11, 29459, 3110, 25, 220, 10288, 18136, 11, 41395, 25, 4783, 11, 41395, 3110, 25, 220, 20, 11, 9587, 25, 4567, 1764, 5348, 11, 9587, 3110, 25, 220, 17592, 2589, 11, 17836, 2696, 25, 220, 679, 19, 12, 806, 12, 2839, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 14, 12840, 46481, 1355, 352, 14, 20128, 21962, 16, 11, 19771, 25, 17674, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 21, 25, 1627, 4520, 11, 89319, 25, 220, 6549, 13, 15, 11, 5422, 3110, 25, 220, 21, 13, 15, 11, 14904, 25, 220, 16, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 20417, 11, 30704, 2968, 25, 220, 15, 13, 20899, 11, 12634, 25, 220, 15, 13, 26007, 11, 39841, 1918, 25, 220, 15, 13, 24996, 24, 11, 80648, 2136, 25, 482, 21, 13, 13858, 11, 445, 13071, 25, 220, 15, 13, 6849, 11, 43405, 278, 2136, 25, 220, 15, 13, 23969, 11, 6515, 35415, 2136, 25, 220, 15, 13, 4119, 1114, 11, 3507, 7532, 25, 1229, 4931, 71030, 10465, 23110, 21, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 2971, 62028, 1824, 4852, 277, 31803, 14, 10288, 18136, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 2754, 318, 1764, 38046, 14, 17592, 2589, 11, 41395, 5665, 25, 611, 34713, 7682, 1559, 14, 20, 13]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '505', '081', '5', ',', 'ĠTitle', ':', 'ĠJust', 'ĠA', 'ĠGirl', ',', 'ĠArtist', ':', 'ĠErin', 'ĠLeah', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '616', '75', ',', 'ĠGenre', ':', 'ĠHouse', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '5', ',', 'ĠLabel', ':', 'ĠQuant', 'ize', 'ĠRecord', 'ings', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '246', '24', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '201', '4', '-', '01', '-', '20', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/', 'just', '-a', '-girl', '/', '505', '081', '5', ',', 'ĠMix', ':', 'ĠDj', 'ĠS', 'pen', 'ĠVocal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '7', ':', '48', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '123', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '6', '.', '0', ',', 'ĠMode', ':', 'Ġ', '0', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '473', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '736', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '642', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '038', '6', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '7', '.', '595', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '611', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '822', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '000', '161', ',', 'ĠIS', 'RC', ':', 'ĠGB', '3', 'T', 'Q', '120', '037', '4', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/', 'erin', '-le', 'ah', '/', '616', '75', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/', 'quant', 'ize', '-record', 'ings', '/', '246', '24', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/h', 'ouse', '/', '5', '.']
Embeddings: [16042, 3110, 25, 220, 17786, 22534, 20, 11, 11106, 25, 4702, 362, 11617, 11, 29459, 25, 56914, 67961, 11, 29459, 3110, 25, 220, 21379, 2075, 11, 41395, 25, 4783, 11, 41395, 3110, 25, 220, 20, 11, 9587, 25, 32541, 553, 13896, 826, 11, 9587, 3110, 25, 220, 14205, 1187, 11, 17836, 2696, 25, 220, 679, 19, 12, 1721, 12, 508, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 14, 4345, 7561, 63970, 14, 17786, 22534, 20, 11, 19771, 25, 52162, 328, 2821, 98403, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 22, 25, 2166, 4520, 11, 89319, 25, 220, 4513, 13, 15, 11, 5422, 3110, 25, 220, 21, 13, 15, 11, 14904, 25, 220, 15, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 21505, 11, 30704, 2968, 25, 220, 15, 13, 23969, 11, 12634, 25, 220, 15, 13, 22266, 11, 39841, 1918, 25, 220, 15, 13, 24462, 21, 11, 80648, 2136, 25, 482, 22, 13, 22754, 11, 445, 13071, 25, 220, 15, 13, 20973, 11, 43405, 278, 2136, 25, 220, 15, 13, 23105, 11, 6515, 35415, 2136, 25, 220, 15, 13, 931, 10718, 11, 3507, 7532, 25, 19397, 18, 51, 48, 4364, 23587, 19, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 14, 85509, 31307, 1494, 14, 21379, 2075, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 14, 31548, 553, 57263, 826, 14, 14205, 1187, 11, 41395, 5665, 25, 611, 34713, 7682, 1559, 14, 20, 13]
Tokenizing batches 101-120: 40%|████ | 8/20 [14:28<20:59, 105.00s/it]
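A quick note on reading these logs: the `Ġ` prefix on tokens like `ĠID` and `ĠTitle` is the byte-level BPE marker for a leading space, and the integer lists (labeled "Embeddings" by my logging) are the token IDs, i.e. indices into the model's embedding table, not the embedding vectors themselves. The real `tokenize_text` wraps a Hugging Face tokenizer; the sketch below shows just the batching-and-logging pattern under stated assumptions, with a hypothetical `encode_fn` callable standing in for the actual tokenizer so the structure is clear:

```python
from typing import Callable, List

def tokenize_text(
    texts: List[str],
    encode_fn: Callable[[str], List[int]],
    batch_size: int = 50_000,
) -> List[List[int]]:
    """Tokenize texts in fixed-size batches, logging progress like the
    'Tokenizing batches' lines above. encode_fn maps one string to its
    token IDs (in practice, a Hugging Face tokenizer's encode method)."""
    all_ids: List[List[int]] = []
    n_batches = -(-len(texts) // batch_size)  # ceiling division
    for b in range(n_batches):
        batch = texts[b * batch_size:(b + 1) * batch_size]
        all_ids.extend(encode_fn(t) for t in batch)
        print(f"Tokenizing batch {b + 1}/{n_batches}: {len(batch)} rows")
    return all_ids

# Toy encode_fn: a tiny whitespace vocabulary, just to exercise the batching.
vocab = {"Track": 0, "ID:": 1, "123,": 2, "Title:": 3, "Demo": 4}
ids = tokenize_text(
    ["Track ID: 123, Title: Demo"],
    lambda t: [vocab[w] for w in t.split()],
)
print(ids)  # [[0, 1, 2, 3, 4]]
```

Keeping the tokenizer behind a plain callable also makes the batching logic trivially testable without downloading model weights; in the real pipeline the same loop runs with the Llama tokenizer pulled via the Hugging Face token from Secrets Manager.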