Project Dolly Shield, Chapter 1, Part 1: The Road to Tokenization
Overview
Part one of Project Dolly Shield starts close to where I started Project Ventoux, with a more sophisticated tool kit and a continued quest to fine-tune an open source large language model. Along the way, I had to confront a lot of thorny decisions around what data should mean and why.
My setup for accessing and securing data is more robust this time around, with a continued shift to cloud computing infrastructure. Hilariously, Google Cloud Platform denied my request to access an instance with NVIDIA GPUs and wanted me to hop on a call with a sales person instead. While Google's GPUs are more expensive than other providers', GCP had the benefit of being home to most of my other cloud-based activities. After I cheerfully declined the invite from Google, I spun up a RunPod A100 instance in about 20 minutes.
Getting set up
I authenticate into my GCP service account using impersonation by running this command from my MacBook terminal:
gcloud auth application-default login
from google.auth import default, impersonated_credentials

credentials, project_id = default()
service_account_email =
target_scopes = ['https://www.googleapis.com/auth/cloud-platform']
sa_credentials = impersonated_credentials.Credentials(
    source_credentials=credentials,
    target_principal=service_account_email,
    target_scopes=target_scopes
)
After impersonation authentication, I can then connect to GCP Secret Manager and bring in my Hugging Face API token:
from google.cloud import secretmanager

secret_client = secretmanager.SecretManagerServiceClient()
project_id =
secret_id =
version_id = "latest"
secrets_name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
response = secret_client.access_secret_version(name=secrets_name)
HUGGING_FACE_READ_TOKEN_LLAMA = response.payload.data.decode("UTF-8")
I can also access the GCP bucket I use for this project:
from io import StringIO
from google.cloud import storage
import pandas as tian  # pandas, under this project's alias

storage_client = storage.Client(credentials=credentials)
love_uwsthoughts = "love-uwsthoughts"
bp_csvs_folder = 'bp_csvs'
bp_artist_bios_folder = 'bp_artist_bios'
uwsthoughts_bucket = storage_client.bucket(love_uwsthoughts)
def gcp_download(bucket_name, gcp_folder, file_name):
    bucket = storage_client.bucket(bucket_name)
    blob_name = f"{gcp_folder}/{file_name}".strip("/")
    file_blob = bucket.blob(blob_name)
    file_data = file_blob.download_as_text()
    gcp_df = tian.read_csv(StringIO(file_data))
    return gcp_df
Data pre-processing
Much of my previous work has relied on using various IDs to establish relationships between data, with text values only coming in at the very end to help identify outputs. Now that I'm using a large language model (LLM) to generate responses, I prefer text values instead of IDs so the model can tokenize and understand their meaning. The goal is to fine-tune the model with previously unseen music data, allowing users to discover music recommendations they haven't heard before.
To achieve this, I identified a set of replacements based on the most common occurrences of different types of values. The original dataset contained over 400,000 distinct values, but the approach below allows me to consolidate 90% of them into just 20 terms.
mix_replacements = {
"continuous": "Set Mixed", "live": "Set Mixed", "remastered": "Remastered Mix",
"orginal": "Original Mix", "ambient": "Ambient Mix", "chill": "Ambient Mix",
"lounge": "Ambient Mix", "rework": "Remastered Mix", "remix": "Remix",
"original": "Original Mix", "club": "Club Mix", "dub": "Dub Mix",
"extended": "Extended Mix", "instrumental": "Instrumental Mix",
"radio": "Radio Mix", "vip": "Remix", "album": "Album Mix",
"Continuous DJ Mix": "Mixed", "Mix Cut": "Set Mixed", "Mixed": "Set Mixed",
"Intro Mix": "Set Mixed", "Edit": "Radio Edit", "Main Mix": "Original Mix",
"Album Version": "Album Mix", "Deep Mix": "Remix", "House Mix": "Remix",
"Tribal Mix": "Remix", "Intro": "Set Mixed", "Edit Mix": "Radio Mix",
"Bonus Track": "Album Mix"
}
This section replaces values where applicable, while preserving the original values if no replacement is found.
def clean_mix_values(mix):
    if isinstance(mix, str):
        for key, value in mix_replacements.items():
            if key.lower() in mix.lower():
                return value
    return mix

bpmeta_audio_df['mix'] = bpmeta_audio_df['mix'].apply(clean_mix_values)
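One subtlety worth knowing about this cleaner: matching is substring-based and first-key-wins, so the order of the dictionary matters. A title containing both "radio" and "remix" resolves to whichever key comes first. Here's a minimal, self-contained sketch that re-declares a trimmed version of the mapping (just three entries, not the real dictionary above) to show the behavior:

```python
# Trimmed re-declaration of the mapping, only to demonstrate ordering behavior.
mix_replacements_demo = {
    "remix": "Remix", "extended": "Extended Mix", "radio": "Radio Mix",
}

def clean_mix_values_demo(mix):
    # non-strings (NaN, None) pass through untouched
    if isinstance(mix, str):
        for key, value in mix_replacements_demo.items():
            # first matching key wins, so dict order decides ties
            if key.lower() in mix.lower():
                return value
    return mix
```

Because "remix" is listed before "radio" in this demo, a value like "Radio Remix" maps to "Remix", not "Radio Mix"; values with no match fall through unchanged.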
Combining different track metadata
I have separate files containing the artist, label, and key information for the tracks, and I want to create a single file that presents all of this data in a human-readable format.
Track and label metadata
The first step is straightforward: I have separate CSV files for artist and label metadata, and I want to see a label along with its associated artists. For example: "Anjunadeep includes Marsh and Eli & Fur as artists." I kept the IDs and URLs but moved them to the end, as I have a feeling those URLs might be useful later on.
label_artist_temp = tian.merge(bp_label_artist_df, bp_label_df, on='label_id', suffixes=('', '_label'))
bp_artist_label_names_df = tian.merge(label_artist_temp, bp_artist_df, on='artist_id', suffixes=('', '_artist'))
bp_artist_label_names_df = bp_artist_label_names_df[[ 'label_name', 'artist_name', 'label_id', 'label_url', 'artist_id', 'artist_url']]
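Getting from that merged table to the readable sentence above ("Anjunadeep includes Marsh and Eli & Fur as artists") is a short groupby away. This is a hypothetical sketch, not code from the project: the column names match bp_artist_label_names_df, but the sample rows are made up.

```python
import pandas as tian  # pandas, under this project's alias

# Made-up sample standing in for bp_artist_label_names_df
sample = tian.DataFrame({
    'label_name': ['Anjunadeep', 'Anjunadeep'],
    'artist_name': ['Marsh', 'Eli & Fur'],
})

# One sentence per label: group on label_name and join the artist names
sentences = [
    f"{label} includes {' and '.join(names)} as artists."
    for label, names in sample.groupby('label_name')['artist_name']
]
```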
Track and key metadata
bp_key_df is a table where fields like key_id, key_name, chord, and whether it's sharp or flat are in separate columns. That's great for a typical database, but I need something a bit more human, so I'm going to make a new dataframe that has key_id and a new key_description field that brings together the key name, chord, and sharps/flats into a single value.
bp_key_text_df = bp_key_df.copy()

def key_mapper(row):
    if row['is_sharp'] == 't':
        return f"{row['key_letter']}-sharp {row['chord_name']}"
    elif row['is_flat'] == 't':
        return f"{row['key_letter']}-flat {row['chord_name']}"
    else:
        return f"Natural {row['key_letter']} {row['chord_name']}"

bp_key_text_df['key_description'] = bp_key_text_df.apply(key_mapper, axis=1)
bp_key_text_df = bp_key_text_df[['key_id', 'key_description']]
bp_key_text_df
Four to the floor
With the data I have, I want the model to generate something like:
On August 18th, 2023, Mira released 'Celo' on Kiosk ID. It was her first release of 2023 and also her melodic house & techno debut, showcasing a new, edgier side to her growing repertoire.
That led me to a schema that looks like this:
release_date | artist_name | title | label_name | genre_name | bpm | key_description | mix | is_remixed | is_remixer | mode | valence | danceability | energy | speechiness | loudness | liveness | instrumentalness | acousticness | isrc | artist_id | artist_url | track_id | track_url | label_id | label_url | genre_id | genre_url
All text values are listed first, with IDs and URLs placed at the end. I also included descriptive text for Spotify audio metrics, making it easier to see how they translate from something a computer understands to something a human does. Spotify's metrics are generally unimpressive, and I want to move away from relying on them eventually.
After the joins, I ended up with a table where each track ID has a distinct row for each artist featured on the track. I'll need to find a way to collapse these rows into a single row per track.
I'm sure everything will be fine until then
Narrator's voice: everything was not fine until then
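When I do get around to collapsing those duplicate rows, the usual pandas move is a groupby over track_id that joins the artist names and keeps the first value of everything else. A minimal sketch of that idea, with a made-up sample frame and only a couple of the real columns:

```python
import pandas as tian  # pandas, under this project's alias

# Made-up sample: two artists share track 1, track 2 has one artist
sample = tian.DataFrame({
    'track_id': [1, 1, 2],
    'artist_name': ['Preja', 'Sidney Saige Ausama', 'LSDee'],
    'title': ['Movha', 'Movha', 'Ayeye'],
})

# Collapse to one row per track: join artist names, keep first of the rest
collapsed = (
    sample
    .groupby('track_id', as_index=False)
    .agg({'artist_name': ', '.join, 'title': 'first'})
)
```

In the real table, every non-artist column would get the 'first' treatment, since they're identical across a track's rows anyway.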
bp_track_artist_merge = tian.merge(bpmeta_audio_shield_df, bp_artist_track_df, on='track_id', suffixes=('', '_artist'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_merge, bp_artist_df, on='artist_id', suffixes=('', '_artist_info'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_label_merge, bp_label_df, on='label_id', suffixes=('', '_label'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_label_merge, bp_genre_df, on='genre_id', suffixes=('', '_genre'))
bp_track_artist_label_merge = tian.merge(bp_track_artist_label_merge, bp_key_text_df, on='key_id', suffixes=('', '_key'))
bp_text_values_df = bp_track_artist_label_merge[[
'release_date', 'artist_name', 'title', 'label_name', 'duration', 'genre_name', 'bpm', 'key_description', 'mix', 'is_remixed', 'is_remixer',
'mode', 'valence', 'danceability', 'energy', 'speechiness', 'loudness', 'liveness', 'instrumentalness', 'acousticness', 'isrc', 'artist_id',
'artist_url', 'track_id', 'track_url', 'label_id', 'label_url', 'genre_id', 'genre_url', 'key_id'
]]
This resulted in a beautifully massive table. At first glance, it looks just as I designed it. Sooner rather than later, I'll need to address similar fields like is_remixed and is_remixer, since they're conveying the same information from different perspectives.
bp_text_values_df
release_date | artist_name | title | label_name | duration | genre_name | bpm | key_description | mix | is_remixed | is_remixer | mode | valence | danceability | energy | speechiness | loudness | liveness | instrumentalness | acousticness | isrc | artist_id | artist_url | track_id | track_url | label_id | label_url | genre_id | genre_url | key_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-06-24 | Preja | Movha | Supadjs Projects | 6:51 | Amapiano | 112.0 | Natural A Minor | Main | f | f | 0.0 | 0.367 | 0.732 | 0.492 | 0.2620 | -10.961 | 0.2970 | 0.00000 | 0.247000 | GBKQU2257859 | 1063851 | beatport.com/artist/preja/1063851 | 16636568 | beatport.com/track/movha/16636568 | 40460 | beatport.com/label/supadjs-projects/40460 | 98 | /genre/amapiano/98 | 8.0 |
1 | 2022-06-24 | Sidney Saige Ausama | Movha | Supadjs Projects | 6:51 | Amapiano | 112.0 | Natural A Minor | Main | f | f | 0.0 | 0.367 | 0.732 | 0.492 | 0.2620 | -10.961 | 0.2970 | 0.00000 | 0.247000 | GBKQU2257859 | 1063852 | beatport.com/artist/sidney-saige-ausama/1063852 | 16636568 | beatport.com/track/movha/16636568 | 40460 | beatport.com/label/supadjs-projects/40460 | 98 | /genre/amapiano/98 | 8.0 |
2 | 2022-01-28 | LSDee | Ayeye | One Night Stand Distribution | 6:20 | Amapiano | 112.0 | Natural F Minor | Original Mix | f | f | 1.0 | 0.504 | 0.875 | 0.556 | 0.0700 | -7.754 | 0.0819 | 0.55300 | 0.090300 | ZARQO2200001 | 213465 | beatport.com/artist/lsdee/213465 | 16115373 | beatport.com/track/ayeye/16115373 | 99208 | beatport.com/label/one-night-stand-distributio... | 98 | /genre/amapiano/98 | 4.0 |
3 | 2021-10-01 | Babalwa M | LalaBy | Mavuso Business Solutions | 9:01 | Amapiano | 114.0 | G-flat Minor | Outro | f | f | 0.0 | 0.261 | 0.852 | 0.445 | 0.0514 | -16.397 | 0.0366 | 0.14800 | 0.007190 | ZAC012100398 | 1039645 | beatport.com/artist/babalwa-m/1039645 | 16223521 | beatport.com/track/lalaby/16223521 | 100342 | beatport.com/label/mavuso-business-solutions/1... | 98 | /genre/amapiano/98 | 34.0 |
4 | 2021-10-01 | Babalwa M | Jaiva | Mavuso Business Solutions | 6:15 | Amapiano | 113.0 | G-flat Minor | Original Mix | f | f | 0.0 | 0.343 | 0.842 | 0.702 | 0.0667 | -13.397 | 0.0253 | 0.03780 | 0.005390 | ZAC012100395 | 1039645 | beatport.com/artist/babalwa-m/1039645 | 16223515 | beatport.com/track/jaiva/16223515 | 100342 | beatport.com/label/mavuso-business-solutions/1... | 98 | /genre/amapiano/98 | 34.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9522072 | 2009-05-29 | Steve O Steen | Breakfast In Illinois | Robsoul Recordings | 4:37 | Jackin House | 127.0 | Natural A Minor | Original Mix | f | f | 1.0 | 0.677 | 0.806 | 0.773 | 0.0726 | -11.024 | 0.1110 | 0.88000 | 0.000038 | FR48Z0900019 | 107441 | beatport.com/artist/steve-o-steen/107441 | 859924 | beatport.com/track/breakfast-in-illinois/859924 | 2204 | beatport.com/label/robsoul-recordings/2204 | 97 | /genre/jackin-house/97 | 8.0 |
9522073 | 2009-05-29 | Sam Karlson | Robsoul Boy | Robsoul Recordings | 5:30 | Jackin House | 127.0 | Natural B Major | Original Mix | f | f | 1.0 | 0.912 | 0.890 | 0.772 | 0.1660 | -5.031 | 0.0685 | 0.00071 | 0.180000 | FR48Z0900018 | 9863 | beatport.com/artist/sam-karlson/9863 | 859923 | beatport.com/track/robsoul-boy/859923 | 2204 | beatport.com/label/robsoul-recordings/2204 | 97 | /genre/jackin-house/97 | 13.0 |
9522074 | 2009-05-29 | Steve O Steen | Robsoul Boy | Robsoul Recordings | 5:30 | Jackin House | 127.0 | Natural B Major | Original Mix | f | f | 1.0 | 0.912 | 0.890 | 0.772 | 0.1660 | -5.031 | 0.0685 | 0.00071 | 0.180000 | FR48Z0900018 | 107441 | beatport.com/artist/steve-o-steen/107441 | 859923 | beatport.com/track/robsoul-boy/859923 | 2204 | beatport.com/label/robsoul-recordings/2204 | 97 | /genre/jackin-house/97 | 13.0 |
9522075 | 2007-08-07 | Colette | Hypnotized | OM Records | 6:40 | Jackin House | 126.0 | Natural F Major | Jason Hodges Mix | t | f | 0.0 | 0.879 | 0.804 | 0.795 | 0.0715 | -10.096 | 0.0494 | 0.82400 | 0.000029 | USOM80795203 | 3201 | beatport.com/artist/colette/3201 | 349708 | beatport.com/track/hypnotized/349708 | 351 | beatport.com/label/om-records/351 | 97 | /genre/jackin-house/97 | 19.0 |
9522076 | 2007-08-07 | Jason Hodges | Hypnotized | OM Records | 6:40 | Jackin House | 126.0 | Natural F Major | Jason Hodges Mix | t | t | 0.0 | 0.879 | 0.804 | 0.795 | 0.0715 | -10.096 | 0.0494 | 0.82400 | 0.000029 | USOM80795203 | 431 | beatport.com/artist/jason-hodges/431 | 349708 | beatport.com/track/hypnotized/349708 | 351 | beatport.com/label/om-records/351 | 97 | /genre/jackin-house/97 | 19.0 |
9522077 rows × 30 columns
All of that work makes it much easier to find the artists I'm looking for. Text values as IDs aren't ideal due to the risk of overlaps and duplicates. For example, "Mira (Berlin)" has "(Berlin)" added to her name because there's another "Mira" on Beatport, and someone needed to make a small distinction.
bp_text_values_df[bp_text_values_df['artist_name'] == 'Mira (Berlin)'].sort_values(by='release_date', ascending=False).head(10)
release_date | artist_name | title | label_name | duration | genre_name | bpm | key_description | mix | is_remixed | is_remixer | mode | valence | danceability | energy | speechiness | loudness | liveness | instrumentalness | acousticness | isrc | artist_id | artist_url | track_id | track_url | label_id | label_url | genre_id | genre_url | key_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7620604 | 2023-08-18 | Mira (Berlin) | Jero | Kiosk ID | 10:05 | Melodic House & Techno | 116.0 | Natural D Major | Original Mix | f | f | 0.0 | 0.0378 | 0.760 | 0.728 | 0.0653 | -11.191 | 0.1030 | 0.899 | 0.028600 | DECY52301345 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993565 | beatport.com/track/jero/17993565 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 22.0 |
7698958 | 2023-08-18 | Mira (Berlin) | Cleo | Kiosk ID | 6:43 | Melodic House & Techno | 92.0 | Natural G Major | Remix | t | f | 1.0 | 0.1200 | 0.806 | 0.783 | 0.0461 | -10.205 | 0.0660 | 0.946 | 0.077000 | DECY52301349 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993576 | beatport.com/track/cleo/17993576 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 21.0 |
7698955 | 2023-08-18 | Mira (Berlin) | Jero | Kiosk ID | 8:38 | Melodic House & Techno | 119.0 | Natural D Major | Remix | t | f | 1.0 | 0.2670 | 0.830 | 0.350 | 0.0784 | -9.705 | 0.0999 | 0.903 | 0.007660 | DECY52301346 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993567 | beatport.com/track/jero/17993567 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 22.0 |
7698952 | 2023-08-18 | Mira (Berlin) | Jero | Kiosk ID | 5:56 | Melodic House & Techno | 120.0 | Natural D Minor | Argia Powerbeats Version | t | f | 1.0 | 0.0876 | 0.795 | 0.837 | 0.0449 | -10.095 | 0.1050 | 0.833 | 0.000121 | DECY52301347 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993570 | beatport.com/track/jero/17993570 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 7.0 |
7660383 | 2023-08-18 | Mira (Berlin) | Cleo | Kiosk ID | 6:18 | Melodic House & Techno | 114.0 | Natural G Minor | Original Mix | f | f | 0.0 | 0.1760 | 0.806 | 0.699 | 0.0451 | -11.390 | 0.1200 | 0.939 | 0.014300 | DECY52301348 | 385397 | beatport.com/artist/mira-berlin/385397 | 17993573 | beatport.com/track/cleo/17993573 | 80379 | beatport.com/label/kiosk-id/80379 | 90 | /genre/melodic-house-techno/90 | 6.0 |
5034902 | 2022-11-18 | Mira (Berlin) | Murmeli | Kiosk ID | 6:32 | Organic House / Downtempo | 110.0 | Natural C Major | Remix | t | t | 1.0 | 0.1020 | 0.797 | 0.871 | 0.0544 | -10.162 | 0.1500 | 0.896 | 0.056400 | DECY52201887 | 385397 | beatport.com/artist/mira-berlin/385397 | 17067088 | beatport.com/track/murmeli/17067088 | 80379 | beatport.com/label/kiosk-id/80379 | 93 | /genre/organic-house-downtempo/93 | 20.0 |
5034956 | 2022-11-04 | Mira (Berlin) | Siriema | The Magic Movement | 7:19 | Organic House / Downtempo | 108.0 | Natural G Minor | Remix | t | t | 0.0 | 0.7120 | 0.800 | 0.492 | 0.0649 | -10.809 | 0.0840 | 0.894 | 0.081500 | QM6P42229087 | 385397 | beatport.com/artist/mira-berlin/385397 | 16935012 | beatport.com/track/siriema/16935012 | 39677 | beatport.com/label/the-magic-movement/39677 | 93 | /genre/organic-house-downtempo/93 | 6.0 |
5034953 | 2022-11-04 | Mira (Berlin) | Siriema | The Magic Movement | 7:19 | Organic House / Downtempo | 108.0 | Natural G Minor | Remix | t | f | 0.0 | 0.7120 | 0.800 | 0.492 | 0.0649 | -10.809 | 0.0840 | 0.894 | 0.081500 | QM6P42229087 | 385397 | beatport.com/artist/mira-berlin/385397 | 16935012 | beatport.com/track/siriema/16935012 | 39677 | beatport.com/label/the-magic-movement/39677 | 93 | /genre/organic-house-downtempo/93 | 6.0 |
5034958 | 2022-10-14 | Mira (Berlin) | Siriema | The Magic Movement | 7:19 | Organic House / Downtempo | 108.0 | Natural G Minor | Remix | f | f | 0.0 | 0.7120 | 0.800 | 0.492 | 0.0649 | -10.809 | 0.0840 | 0.894 | 0.081500 | QM6P42229087 | 385397 | beatport.com/artist/mira-berlin/385397 | 16947157 | beatport.com/track/siriema/16947157 | 39677 | beatport.com/label/the-magic-movement/39677 | 93 | /genre/organic-house-downtempo/93 | 6.0 |
5060891 | 2022-09-09 | Mira (Berlin) | Higher Than Me | Frau Blau | 7:21 | Organic House / Downtempo | 114.0 | Natural D Major | Remix | t | t | 1.0 | 0.2090 | 0.809 | 0.798 | 0.0613 | -10.140 | 0.0666 | 0.840 | 0.077800 | IL4612200064 | 385397 | beatport.com/artist/mira-berlin/385397 | 16780144 | beatport.com/track/higher-than-me/16780144 | 74969 | beatport.com/label/frau-blau/74969 | 93 | /genre/organic-house-downtempo/93 | 22.0 |
Organizing artist biographies
In parallel, I scraped artist bios from Beatport in batches of 1,000, initially storing them in separate files. At this stage, I combined all of them and removed duplicates. I'll use this data later on—part one is just about getting organized.
Here's a snippet of how the scraping was handled:
import requests
from bs4 import BeautifulSoup as bs

def bioscrape(artist_url_extension):
    base_url = "https://www.beatport.com"
    url = base_url + artist_url_extension
    payload = {
        "api_key": SCRAPING_FISH_API_KEY,
        "url": url,
    }
    try:
        response = requests.get("https://scraping.narf.ai/api/v1/", params=payload, timeout=30)
        if response.status_code == 200:
            soup = bs(response.content, 'html.parser')
            artist_name_tag = soup.find('h1', class_="HeadingWithBreadcrumb-style__Title-sc-8549a8e9-2 hGvWKT")
            artist_name = artist_name_tag.text.strip() if artist_name_tag else "N/A"
            bio_tag = soup.find('meta', property="og:description")
            bio = bio_tag['content'].strip() if bio_tag and bio_tag.get('content') else "N/A"
            return artist_url_extension, artist_name, bio
        else:
            return artist_url_extension, "N/A", "N/A"
    except requests.Timeout:
        print(f"Timeout occurred while scraping {url}. Skipping to the next URL.")
        return artist_url_extension, "N/A", "N/A"
    except Exception as e:
        print(f"Error scraping {url}: {e}. Skipping to next bio.")
        return artist_url_extension, "N/A", "N/A"
Next, I set up parallelism using ThreadPoolExecutor and deployed the script to my GCP instance, where it ran in the background and took about 10 hours to scrape 55,000 pages.
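The parallel wrapper around bioscrape() looks roughly like the sketch below. This is an assumption-laden reconstruction, not the deployed script: the function name scrape_all, the worker count, and the pass-the-scraper-in design are mine (the scrape function is a parameter so the sketch stands on its own).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(scrape_fn, url_extensions, max_workers=8):
    # submit one scrape per URL extension and collect results as they finish;
    # as_completed yields futures in completion order, not submission order
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_fn, ext) for ext in url_extensions]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Threads suit this job because it's I/O-bound: each worker spends most of its time waiting on the scraping API, so the GIL isn't a bottleneck.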
Combining the files
In the end, I had 55 CSV files that needed to be combined. Below, I downloaded the files from their GCP bucket and performed a "glorified copy and paste" to merge them.
# sort folder by created date and then grab
def gcp_folder_sort(bucket, folder):
    blobs = list(bucket.list_blobs(prefix=folder))
    sorted_blobs = sorted(blobs, key=lambda x: x.time_created)
    return sorted_blobs

# combine csv files into one df
def artists_united(bucket, folder):
    sorted_blobs = gcp_folder_sort(bucket, folder)
    artists_united_df = tian.DataFrame()
    for blob in sorted_blobs:
        if blob.name.endswith('.csv'):
            the_drop = gcp_download(bucket.name, folder, blob.name.split('/')[-1])
            the_drop = the_drop[['beatport_artist_id', 'artist_name', 'beatport_bio']]
            artists_united_df = tian.concat([artists_united_df, the_drop], ignore_index=True)
    artists_united_df = artists_united_df.drop_duplicates(ignore_index=True)
    return artists_united_df

# save df back to gcp
def paranoid_guard(df, bucket_name, gcp_folder, file_name):
    bucket = storage_client.bucket(bucket_name)
    csv_data = df.to_csv(index=False)
    blob_name = f"{gcp_folder}/{file_name}".strip("/")
    blob = bucket.blob(blob_name)
    blob.upload_from_string(csv_data, content_type='text/csv')

bppoints_artist_bios_csv = "bppoints_artist_bios.csv"
bppoints_artist_bios_df = artists_united(uwsthoughts_bucket, bp_artist_bios_folder)
paranoid_guard(bppoints_artist_bios_df, love_uwsthoughts, bp_artist_bios_folder, bppoints_artist_bios_csv)
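One design note on artists_united(): calling tian.concat inside the loop re-copies the accumulated frame on every file, which gets quadratic as the file count grows. The standard pandas idiom is to collect the frames in a list and concatenate once. A small sketch of that variant (taking already-loaded frames as input, since the GCP download side is assumed from the code above):

```python
import pandas as tian  # pandas, under this project's alias

def artists_united_once(frames):
    # one concat over the whole list instead of one per file,
    # then a single dedupe pass at the end
    combined = tian.concat(frames, ignore_index=True)
    return combined.drop_duplicates(ignore_index=True)
```

For 55 files this is a nicety rather than a necessity, but it's a habit that pays off once the file count climbs.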
I kept only the artists for whom I had bios, resulting in 48,000 out of the original 55,000.
Here's a pretty word cloud for a random 20% sample of those bios. The larger the word, the more frequently it appears in the sample.
import csv
import random
import matplotlib.pyplot as plt
from wordcloud import WordCloud as wc

file_path = '/Users/uwsthoughts/Desktop/bp_spotify_raw_data/csv_data/artist_bios_df.csv'
sample_size = 10000

# total length minus the header
with open(file_path, 'r', encoding='utf-8') as file:
    total_lines = sum(1 for line in file) - 1

# random sample generator
sample_lines = set(random.sample(range(1, total_lines + 1), min(sample_size, total_lines)))

# use sample for word cloud
sampled_text = ""
with open(file_path, 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for i, row in enumerate(reader):
        if i in sample_lines:
            sampled_text += row['beatport_bio'] + " "

wordcloud = wc(width=800, height=400, background_color='white').generate(sampled_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word cloud from sample of artist bio data', fontsize=16)
plt.show()
A slice of reality carved into language
Tokenization is the process of losing your mind in new and unexpected ways—ways you never knew existed (doctor scribbles furiously into a notebook in the corner, sirens wailing in the background). Not quite about tokenization but this overview of sentence embeddings is a good primer on the related topic of embeddings.
To actually access and use Meta's Llama models, I have an API key for Hugging Face, allowing me to use their Transformers tools. By passing my Hugging Face API token to Transformers, I can call Meta's Llama-3.2-11B-Vision Model.
This establishes the connection to Meta's new 11-billion-parameter model capable of handling both text and images. AutoTokenizer from the Hugging Face library is used to convert text into tokens for the specified model.
from transformers import AutoTokenizer

llama_3211b = "meta-llama/Llama-3.2-11B-Vision"
llama_3211b_tokenizer = AutoTokenizer.from_pretrained(llama_3211b)
melodies() takes a row of table data as input and returns a structured text output. This part will need to change in the future, as processing text and numeric data together isn't the right approach. Instead, I should separate these into two distinct tokenization processes, which can then be combined after tokenization is complete.
def melodies(row):
    text = (
        f"Track ID: {row['track_id']}, Title: {row['title']}, "
        f"Artist: {row['artist_name']}, Artist ID: {row['artist_id']}, "
        f"Genre: {row['genre_name']}, Genre ID: {row['genre_id']}, "
        f"Label: {row['label_name']}, Label ID: {row['label_id']}, "
        f"Release Date: {row['release_date']}, Track URL: {row['track_url']}, "
        f"Mix: {row['mix']}, Remix: {'Yes' if row['is_remixed'] else 'No'}, "
        f"Remixer: {'Yes' if row['is_remixer'] else 'No'}, Duration: {row['duration']} minutes, "
        f"BPM: {row['bpm']}, Key ID: {row['key_id']}, "
        f"Mode: {row['mode']}, Valence: {row['valence']}, Danceability: {row['danceability']}, "
        f"Energy: {row['energy']}, Speechiness: {row['speechiness']}, "
        f"Loudness: {row['loudness']}, Liveness: {row['liveness']}, "
        f"Instrumentalness: {row['instrumentalness']}, Acousticness: {row['acousticness']}, "
        f"ISRC: {row['isrc']}, Artist URL: {row['artist_url']}, Label URL: {row['label_url']}, "
        f"Genre URL: {row['genre_url']}."
    )
    return text
rows_to_text() converts the rows of table data to text using melodies(). The output from this function is then used in tokenize_texts(), which uses the llama_3211b_tokenizer defined earlier to convert the rows of text into numeric tokens in equal-sized batches. I also included a console printout of a random sample, which is shown further down, along with logging the entire output to a file for later use. When I paused the processing to consider a different approach, I was about 40% complete, having generated an 85GB file.
from tqdm import tqdm

def rows_to_text(data_set, batch_size):
    num_batches = (len(data_set) // batch_size) + 1
    processed_texts = []
    for i in range(0, num_batches, 20):
        final_batch = min(i + 20, num_batches)
        for batch in tqdm(range(i, final_batch), desc=f"Processing batches {i+1}-{final_batch}"):
            first_index = batch * batch_size
            last_index = min((batch + 1) * batch_size, len(data_set))
            batch_df = data_set.iloc[first_index:last_index]
            # convert each row in the batch to its text description
            batch_results = [melodies(row) for _, row in tqdm(batch_df.iterrows(), total=len(batch_df), desc=f"Processing batch {batch+1}/{num_batches}")]
            processed_texts.extend(batch_results)
    return processed_texts
import logging

def tokenize_texts(texts, batch_size, sample_indices=None):
    num_batches = (len(texts) // batch_size) + 1
    tokenized_batches = []
    all_tokens = []
    token_lengths = []
    for i in range(0, num_batches, 20):
        last_batch = min(i + 20, num_batches)
        for batch_index in tqdm(range(i, last_batch), desc=f"Tokenizing batches {i+1}-{last_batch}"):
            start_index = batch_index * batch_size
            end_index = min((batch_index + 1) * batch_size, len(texts))
            batch_texts = texts[start_index:end_index]
            tokenized_data = llama_3211b_tokenizer(batch_texts, return_tensors="tf", truncation=True, padding=True)
            tokenized_batches.append(tokenized_data)
            for sample_i, text in enumerate(batch_texts):
                tokens = llama_3211b_tokenizer.tokenize(text)
                token_ids = llama_3211b_tokenizer.convert_tokens_to_ids(tokens)
                all_tokens.extend(tokens)
                token_lengths.extend([len(token) for token in tokens])
                logging.info(f"Text: {text}\nTokens: {tokens}\nEmbeddings: {token_ids}")
                global_i = start_index + sample_i
                if sample_indices and global_i in sample_indices:
                    print("Tokens:", tokens)
                    print("Token IDs:", token_ids)
    return tokenized_batches
I made my computer crash
Before I got the above to run successfully (with the output shown below), I decided to f'around and find out what happens when you try to tokenize a nine-million-row dataframe all at once without any preparation. Surprise! It caused the kernel to crash—another setback for overconfidence. After 46 minutes, I ended up here in my tracking:
Generating texts: 100%|██████████| 9522077/9522077 [04:59<00:00, 31748.96it/s]
Tokenizing texts... This may take a while.
When I got this error message:
The Kernel crashed while executing code in the current cell or a previous cell.
Please review the code in the cell(s) to identify a possible cause of the failure.
Click [here](https://github.com/microsoft/vscode-jupyter/wiki/Kernel-crashes) for more info.
View Jupyter log for further details.
So what's a gurl to do other than twirl down to the Jupyter logs and find out what's going on? Come with me:
13:48:29.846 [error] Disposing session as kernel process died ExitCode: undefined, Reason:
13:49:13.917 [error] Failed to write data to the kernel channel shell [
<Buffer 3c 49 44 53 7c 4d 53 47 3e>,
<Buffer 38 64 39 65 37 36 61 37 64 33 61 34 64 62 66 38 38 32 33 32 34 64 63 34 62 34 66 35 65 34 34 36 30 31 37 36 36 36 33 38 34 62 39 66 64 34 61 34 30 30 ... 14 more bytes>,
<Buffer 7b 22 64 61 74 65 22 3a 22 32 30 32 34 2d 30 39 2d 32 39 54 31 37 3a 34 39 3a 31 33 2e 39 31 37 5a 22 2c 22 6d 73 67 5f 69 64 22 3a 22 38 31 65 63 36 ... 177 more bytes>,
<Buffer 7b 7d>,
<Buffer 7b 7d>,
<Buffer 7b 22 73 69 6c 65 6e 74 22 3a 66 61 6c 73 65 2c 22 73 74 6f 72 65 5f 68 69 73 74 6f 72 79 22 3a 66 61 6c 73 65 2c 22 75 73 65 72 5f 65 78 70 72 65 73 ... 1058 more bytes>
] [Error: Socket is closed
at a.postToSocket (/Users/~/.cursor/extensions/ms-toolsai.jupyter-2024.6.0-darwin-arm64/dist/extension.node.js:304:8043)
at /Users/~/.cursor/extensions/ms-toolsai.jupyter-2024.6.0-darwin-arm64/dist/extension.node.js:304:7787] {
errno: 9,
code: 'EBADF'
}
My kernel imploded from being overloaded, more or less. The technical term for this is, "lol fml."
A word cloud as a consolation prize
I also created a word cloud for a sample of the logs I generated. Since I had an 85GB file (lol), I took another small sample. The issue was that I included a lot of repeated text on each line as part of melodies(). Below the image are samples from the logs that generated the word cloud. My cheerful website designer is helping me reconfigure this page so the text wraps within code boxes rather than just extending straight from left to right. For now, you'll need to scroll quite a bit to the right.
Next steps
I learned a tremendous amount very quickly about how data needs to be prepared for fine-tune training with an LLM. For one, just giving an LLM a continuous metric (like 'valence' or 'energy' with values between 0 and 1) without any context to guide it will result in the song metadata being completely severed from its numeric qualities. In part two, I'll need to normalize the numeric values and encode them into embeddings (large numeric vectors representing words) using projections, which is a linear algebra term for mapping data into a space that fits what a model can understand, also known as a vibe check. This will allow the numeric data to be integrated with text embeddings in a, like, super embedding or something. I'm thinking calculus, but bigger? If I do this right, the songs will be reunited with their metrics and everyone will vibe some more. Essentially, numbers will be everything, everywhere, all at once (IYKYK) in a multidimensional space.
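For the curious, here's a rough numpy sketch of the normalize-then-project plan, with made-up numbers: min-max scale the continuous metrics per column, then map them into a hypothetical 16-dimensional embedding space with a linear projection. In part two the projection matrix would be learned alongside the model; here it's just randomly initialized to show the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# two tracks x two metrics (e.g. valence, energy), values made up
metrics = np.array([[0.527, 0.928],
                    [0.480, 0.484]])

# min-max normalization per column, so each metric lands in [0, 1]
lo, hi = metrics.min(axis=0), metrics.max(axis=0)
normalized = (metrics - lo) / (hi - lo)

# linear projection: 2 metric dims -> a hypothetical 16-dim embedding space
projection = rng.normal(size=(2, 16))
numeric_embeddings = normalized @ projection  # shape (2, 16)
```

Once the numeric side lives in the same width as the text embeddings, the two can be combined (concatenated, summed, or cross-attended) instead of forcing the tokenizer to chew on raw decimals.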
Addendum: result logs from partial tokenization
This is what a TQDM tracking log looks like. I had 191 batches, so I got a progress bar for each one. In work I started after wrapping up part one, I switched to tracking batches of 20, similar to what I did for tokenization below.
farm_trips = 50000
bp_text_values = rows_to_text(bp_text_values_df, farm_trips)
Processing batch 1/191: 100%|██████████| 50000/50000 [00:01<00:00, 32801.32it/s]
Processing batch 2/191: 100%|██████████| 50000/50000 [00:01<00:00, 32199.59it/s]
Processing batch 3/191: 100%|██████████| 50000/50000 [00:01<00:00, 32956.93it/s]
Processing batch 4/191: 100%|██████████| 50000/50000 [00:01<00:00, 31053.80it/s]
Processing batch 5/191: 100%|██████████| 50000/50000 [00:01<00:00, 33235.40it/s]
.....
Processing batch 189/191: 100%|██████████| 50000/50000 [00:01<00:00, 32466.98it/s]
Processing batch 190/191: 100%|██████████| 50000/50000 [00:01<00:00, 32640.58it/s]
Processing batch 191/191: 100%|██████████| 22077/22077 [00:00<00:00, 32620.70it/s]
Processing batches 181-191: 100%|██████████| 11/11 [00:16<00:00, 1.46s/it]
bp_text_tokens = tokenize_texts(bp_text_values, farm_trips)
Tokenizing batches 1-20: 15%|█▌ | 3/20 [01:17<07:18, 25.81s/it]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '178', '056', '49', ',', 'ĠTitle', ':', 'ĠBang', 's', 'ĠIn', 'ĠThe', 'ĠHead', ',', 'ĠArtist', ':', 'ĠVal', 'eri', 'ø', 'ĠInn', 'ør', 'ta', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '647', '817', ',', 'ĠGenre', ':', 'ĠHard', 'ĠTechn', 'o', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '2', ',', 'ĠLabel', ':', 'ĠCar', 'bone', 'ĠRecords', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '591', '59', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '202', '3', '-', '06', '-', '30', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/b', 'angs', '-in', '-the', '-head', '/', '178', '056', '49', ',', 'ĠMix', ':', 'ĠOriginal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '5', ':', '20', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '80', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '3', '.', '0', ',', 'ĠMode', ':', 'Ġ', '0', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '527', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '686', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '928', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '173', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '5', '.', '653', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '153', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '175', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '000', '329', ',', 'ĠIS', 'RC', ':', 'ĠNL', 'CK', '422', '320', '42', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/', 'val', 'eri', '-in', 'nr', 'ta', '/', '647', '817', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/car', 'bone', '-', 'records', '/', '591', '59', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/h', 'ard', '-', 'techn', 'o', '/', '2', '.']
Embeddings IDs: [16042, 3110, 25, 220, 11256, 25921, 2491, 11, 11106, 25, 17343, 82, 763, 578, 11452, 11, 29459, 25, 4196, 31803, 6282, 17382, 17545, 2629, 11, 29459, 3110, 25, 220, 22644, 25528, 11, 41395, 25, 11481, 7146, 78, 11, 41395, 3110, 25, 220, 17, 11, 9587, 25, 3341, 20337, 22293, 11, 9587, 3110, 25, 220, 24380, 2946, 11, 17836, 2696, 25, 220, 2366, 18, 12, 2705, 12, 966, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 3554, 65587, 3502, 10826, 27488, 14, 11256, 25921, 2491, 11, 19771, 25, 17674, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 20, 25, 508, 4520, 11, 89319, 25, 220, 1490, 13, 15, 11, 5422, 3110, 25, 220, 18, 13, 15, 11, 14904, 25, 220, 15, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 22369, 11, 30704, 2968, 25, 220, 15, 13, 22347, 11, 12634, 25, 220, 15, 13, 25001, 11, 39841, 1918, 25, 220, 15, 13, 11908, 11, 80648, 2136, 25, 482, 20, 13, 21598, 11, 445, 13071, 25, 220, 15, 13, 9800, 11, 43405, 278, 2136, 25, 220, 15, 13, 10005, 11, 6515, 35415, 2136, 25, 220, 15, 13, 931, 18196, 11, 3507, 7532, 25, 33260, 3096, 16460, 9588, 2983, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 14, 838, 31803, 3502, 20191, 2629, 14, 22644, 25528, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 66759, 20337, 12, 27295, 14, 24380, 2946, 11, 41395, 5665, 25, 611, 34713, 7682, 569, 12, 26522, 78, 14, 17, 13]
.....
Tokenizing batches 41-60: 80%|████████ | 16/20 [09:47<02:34, 38.56s/it]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '101', '180', '85', ',', 'ĠTitle', ':', 'ĠSeven', 'ĠSteps', ',', 'ĠArtist', ':', 'ĠRico', 'ĠMartinez', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '104', '454', ',', 'ĠGenre', ':', 'ĠMinimal', 'Ġ/', 'ĠDeep', 'ĠTech', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '14', ',', 'ĠLabel', ':', 'ĠDat', 'ag', 'ro', 'ove', 'ĠMusic', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '156', '41', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '201', '8', '-', '01', '-', '18', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/se', 'ven', '-st', 'eps', '/', '101', '180', '85', ',', 'ĠMix', ':', 'ĠOriginal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '5', ':', '48', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '122', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '12', '.', '0', ',', 'ĠMode', ':', 'Ġ', '0', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '48', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '814', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '484', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '061', '8', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '9', '.', '41', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '070', '7', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '811', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '000', '776', ',', 'ĠIS', 'RC', ':', 'ĠCA', '5', 'KR', '170', '754', '6', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/', 'rico', '-m', 'art', 'inez', '/', '104', '454', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/dat', 'ag', 'ro', 'ove', '-m', 'usic', '/', '156', '41', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/min', 'imal', '-de', 'ep', '-tech', '/', '14', '.']
Embeddings: [16042, 3110, 25, 220, 4645, 5245, 5313, 11, 11106, 25, 31048, 40961, 11, 29459, 25, 34248, 44027, 11, 29459, 3110, 25, 220, 6849, 20555, 11, 41395, 25, 76212, 611, 18682, 17829, 11, 41395, 3110, 25, 220, 975, 11, 9587, 25, 22362, 351, 299, 1009, 10948, 11, 9587, 3110, 25, 220, 10132, 3174, 11, 17836, 2696, 25, 220, 679, 23, 12, 1721, 12, 972, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 60687, 1055, 5594, 7270, 14, 4645, 5245, 5313, 11, 19771, 25, 17674, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 20, 25, 2166, 4520, 11, 89319, 25, 220, 8259, 13, 15, 11, 5422, 3110, 25, 220, 717, 13, 15, 11, 14904, 25, 220, 15, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 2166, 11, 30704, 2968, 25, 220, 15, 13, 25498, 11, 12634, 25, 220, 15, 13, 20339, 11, 39841, 1918, 25, 220, 15, 13, 23324, 23, 11, 80648, 2136, 25, 482, 24, 13, 3174, 11, 445, 13071, 25, 220, 15, 13, 17819, 22, 11, 43405, 278, 2136, 25, 220, 15, 13, 22588, 11, 6515, 35415, 2136, 25, 220, 15, 13, 931, 23823, 11, 3507, 7532, 25, 9362, 20, 62984, 8258, 23952, 21, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 14, 56347, 1474, 472, 39395, 14, 6849, 20555, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 38666, 351, 299, 1009, 1474, 11785, 14, 10132, 3174, 11, 41395, 5665, 25, 611, 34713, 45273, 2931, 6953, 752, 42357, 14, 975, 13]
Tokenizing batches 81-100: 35%|███▌ | 7/20 [10:51<21:21, 98.56s/it]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '592', '498', '1', ',', 'ĠTitle', ':', 'ĠWhat', 'ĠU', 'ĠSay', ',', 'ĠArtist', ':', 'ĠCarlo', 'ĠCal', 'dar', 'eri', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '147', '415', ',', 'ĠGenre', ':', 'ĠHouse', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '5', ',', 'ĠLabel', ':', 'ĠSim', 'ma', 'ĠBlack', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '318', '07', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '201', '4', '-', '11', '-', '03', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/', 'what', '-u', '-s', 'ay', '/', '592', '498', '1', ',', 'ĠMix', ':', 'ĠOriginal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '6', ':', '26', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '125', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '6', '.', '0', ',', 'ĠMode', ':', 'Ġ', '1', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '804', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '803', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '974', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '082', '9', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '6', '.', '009', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '104', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '736', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '001', '17', ',', 'ĠIS', 'RC', ':', 'ĠQ', 'MS', 'NZ', '146', '036', '6', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/c', 'arlo', '-c', 'ald', 'ar', 'eri', '/', '147', '415', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/s', 'im', 'ma', '-black', '/', '318', '07', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/h', 'ouse', '/', '5', '.']
Embeddings: [16042, 3110, 25, 220, 20128, 21962, 16, 11, 11106, 25, 3639, 549, 25961, 11, 29459, 25, 58870, 3400, 35223, 31803, 11, 29459, 3110, 25, 220, 10288, 18136, 11, 41395, 25, 4783, 11, 41395, 3110, 25, 220, 20, 11, 9587, 25, 4567, 1764, 5348, 11, 9587, 3110, 25, 220, 17592, 2589, 11, 17836, 2696, 25, 220, 679, 19, 12, 806, 12, 2839, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 14, 12840, 46481, 1355, 352, 14, 20128, 21962, 16, 11, 19771, 25, 17674, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 21, 25, 1627, 4520, 11, 89319, 25, 220, 6549, 13, 15, 11, 5422, 3110, 25, 220, 21, 13, 15, 11, 14904, 25, 220, 16, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 20417, 11, 30704, 2968, 25, 220, 15, 13, 20899, 11, 12634, 25, 220, 15, 13, 26007, 11, 39841, 1918, 25, 220, 15, 13, 24996, 24, 11, 80648, 2136, 25, 482, 21, 13, 13858, 11, 445, 13071, 25, 220, 15, 13, 6849, 11, 43405, 278, 2136, 25, 220, 15, 13, 23969, 11, 6515, 35415, 2136, 25, 220, 15, 13, 4119, 1114, 11, 3507, 7532, 25, 1229, 4931, 71030, 10465, 23110, 21, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 2971, 62028, 1824, 4852, 277, 31803, 14, 10288, 18136, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 2754, 318, 1764, 38046, 14, 17592, 2589, 11, 41395, 5665, 25, 611, 34713, 7682, 1559, 14, 20, 13]
Tokens: ['Track', 'ĠID', ':', 'Ġ', '505', '081', '5', ',', 'ĠTitle', ':', 'ĠJust', 'ĠA', 'ĠGirl', ',', 'ĠArtist', ':', 'ĠErin', 'ĠLeah', ',', 'ĠArtist', 'ĠID', ':', 'Ġ', '616', '75', ',', 'ĠGenre', ':', 'ĠHouse', ',', 'ĠGenre', 'ĠID', ':', 'Ġ', '5', ',', 'ĠLabel', ':', 'ĠQuant', 'ize', 'ĠRecord', 'ings', ',', 'ĠLabel', 'ĠID', ':', 'Ġ', '246', '24', ',', 'ĠRelease', 'ĠDate', ':', 'Ġ', '201', '4', '-', '01', '-', '20', ',', 'ĠTrack', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'track', '/', 'just', '-a', '-girl', '/', '505', '081', '5', ',', 'ĠMix', ':', 'ĠDj', 'ĠS', 'pen', 'ĠVocal', 'ĠMix', ',', 'ĠRemix', ':', 'ĠYes', ',', 'ĠRem', 'ixer', ':', 'ĠYes', ',', 'ĠDuration', ':', 'Ġ', '7', ':', '48', 'Ġminutes', ',', 'ĠBPM', ':', 'Ġ', '123', '.', '0', ',', 'ĠKey', 'ĠID', ':', 'Ġ', '6', '.', '0', ',', 'ĠMode', ':', 'Ġ', '0', '.', '0', ',', 'ĠVal', 'ence', ':', 'Ġ', '0', '.', '473', ',', 'ĠDance', 'ability', ':', 'Ġ', '0', '.', '736', ',', 'ĠEnergy', ':', 'Ġ', '0', '.', '642', ',', 'ĠSpeech', 'iness', ':', 'Ġ', '0', '.', '038', '6', ',', 'ĠLoud', 'ness', ':', 'Ġ-', '7', '.', '595', ',', 'ĠL', 'iveness', ':', 'Ġ', '0', '.', '611', ',', 'ĠInstrument', 'al', 'ness', ':', 'Ġ', '0', '.', '822', ',', 'ĠAc', 'oustic', 'ness', ':', 'Ġ', '0', '.', '000', '161', ',', 'ĠIS', 'RC', ':', 'ĠGB', '3', 'T', 'Q', '120', '037', '4', ',', 'ĠArtist', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'artist', '/', 'erin', '-le', 'ah', '/', '616', '75', ',', 'ĠLabel', 'ĠURL', ':', 'Ġbeat', 'port', '.com', '/', 'label', '/', 'quant', 'ize', '-record', 'ings', '/', '246', '24', ',', 'ĠGenre', 'ĠURL', ':', 'Ġ/', 'genre', '/h', 'ouse', '/', '5', '.']
Embeddings: [16042, 3110, 25, 220, 17786, 22534, 20, 11, 11106, 25, 4702, 362, 11617, 11, 29459, 25, 56914, 67961, 11, 29459, 3110, 25, 220, 21379, 2075, 11, 41395, 25, 4783, 11, 41395, 3110, 25, 220, 20, 11, 9587, 25, 32541, 553, 13896, 826, 11, 9587, 3110, 25, 220, 14205, 1187, 11, 17836, 2696, 25, 220, 679, 19, 12, 1721, 12, 508, 11, 20371, 5665, 25, 9567, 403, 916, 14, 13432, 14, 4345, 7561, 63970, 14, 17786, 22534, 20, 11, 19771, 25, 52162, 328, 2821, 98403, 19771, 11, 51127, 25, 7566, 11, 5031, 40114, 25, 7566, 11, 21722, 25, 220, 22, 25, 2166, 4520, 11, 89319, 25, 220, 4513, 13, 15, 11, 5422, 3110, 25, 220, 21, 13, 15, 11, 14904, 25, 220, 15, 13, 15, 11, 4196, 768, 25, 220, 15, 13, 21505, 11, 30704, 2968, 25, 220, 15, 13, 23969, 11, 12634, 25, 220, 15, 13, 22266, 11, 39841, 1918, 25, 220, 15, 13, 24462, 21, 11, 80648, 2136, 25, 482, 22, 13, 22754, 11, 445, 13071, 25, 220, 15, 13, 20973, 11, 43405, 278, 2136, 25, 220, 15, 13, 23105, 11, 6515, 35415, 2136, 25, 220, 15, 13, 931, 10718, 11, 3507, 7532, 25, 19397, 18, 51, 48, 4364, 23587, 19, 11, 29459, 5665, 25, 9567, 403, 916, 14, 19135, 14, 85509, 31307, 1494, 14, 21379, 2075, 11, 9587, 5665, 25, 9567, 403, 916, 14, 1530, 14, 31548, 553, 57263, 826, 14, 14205, 1187, 11, 41395, 5665, 25, 611, 34713, 7682, 1559, 14, 20, 13]
Tokenizing batches 101-120: 40%|████ | 8/20 [14:28<20:59, 105.00s/it]
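A quick note on reading these logs: the `Ġ` prefix on tokens like `ĠID` and `ĠTitle` is the byte-level BPE marker for a leading space, and the integer lists (labeled "Embeddings" by my logging) are the token IDs, i.e. indices into the model's embedding table, not the embedding vectors themselves. The real `tokenize_text` wraps a Hugging Face tokenizer; the sketch below shows just the batching-and-logging pattern under stated assumptions, with a hypothetical `encode_fn` callable standing in for the actual tokenizer so the structure is clear:

```python
from typing import Callable, List

def tokenize_text(
    texts: List[str],
    encode_fn: Callable[[str], List[int]],
    batch_size: int = 50_000,
) -> List[List[int]]:
    """Tokenize texts in fixed-size batches, logging progress like the
    'Tokenizing batches' lines above. encode_fn maps one string to its
    token IDs (in practice, a Hugging Face tokenizer's encode method)."""
    all_ids: List[List[int]] = []
    n_batches = -(-len(texts) // batch_size)  # ceiling division
    for b in range(n_batches):
        batch = texts[b * batch_size:(b + 1) * batch_size]
        all_ids.extend(encode_fn(t) for t in batch)
        print(f"Tokenizing batch {b + 1}/{n_batches}: {len(batch)} rows")
    return all_ids

# Toy encode_fn: a tiny whitespace vocabulary, just to exercise the batching.
vocab = {"Track": 0, "ID:": 1, "123,": 2, "Title:": 3, "Demo": 4}
ids = tokenize_text(
    ["Track ID: 123, Title: Demo"],
    lambda t: [vocab[w] for w in t.split()],
)
print(ids)  # [[0, 1, 2, 3, 4]]
```

Keeping the tokenizer behind a plain callable also makes the batching logic trivially testable without downloading model weights; in the real pipeline the same loop runs with the Llama tokenizer pulled via the Hugging Face token from Secrets Manager.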