Adventures with cloud authentication
I had spent the better part of an evening trying to set up a Google Cloud service account that could be used with impersonation from both my GCP VM instance and my local computer. All along the way, I kept running into errors like this:
google.api_core.exceptions.PermissionDenied: 403 Permission 'secretmanager.versions.access' denied for resource 'projects/thoughts-on-music
This is from my GCP instance, a "c2-standard-30" with 30 vCPUs and 120 GB of memory. My MacBook is plenty strong enough to handle a lot of deep learning tasks on its own (Apple M3 Max chip with 16‑core CPU, 40‑core GPU, 16‑core Neural Engine, 2TB of SSD storage, and 128GB of unified memory), but I wanted a cloud compute instance to run some web scraping in the background while I did other work on my computer. The scraping job I set up ran today from 7:30am to 7:30pm and scraped 55,000 distinct web pages.
Pause. Freeze frame. I'm sure some of you are thinking, "what the hell is impersonation and also what did you just say?"
Google Cloud Platform (GCP) allows you to set up a service account with impersonation, enabling secure authentication via the command line. Impersonation works by issuing short-lived session tokens that refresh automatically and eventually expire, requiring re-authentication. This feature is especially convenient if you're a one-person operation like me and you don't have a ton of free time to sort through the thousands of possible Identity and Access Management (IAM) permissions.
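To make that a little more concrete, here is a minimal Python sketch of impersonation using the google-auth library. The service account name `scraper-sa` and project ID `my-project` are hypothetical placeholders, not values from my setup. The sketch also assumes the identity you start from has been granted the Service Account Token Creator role on the target account.

```python
def service_account_email(account_name: str, project_id: str) -> str:
    """Build the email address of a user-managed service account."""
    return f"{account_name}@{project_id}.iam.gserviceaccount.com"


def build_impersonated_credentials(target_email: str, lifetime_seconds: int = 3600):
    """Return short-lived credentials that act as the target service account."""
    # Imported lazily so the helper above works without google-auth installed.
    import google.auth
    from google.auth import impersonated_credentials

    # Source credentials: whoever is logged in locally (for example via
    # `gcloud auth application-default login`) or the identity attached to the VM.
    source_credentials, _ = google.auth.default()

    return impersonated_credentials.Credentials(
        source_credentials=source_credentials,
        target_principal=target_email,
        target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
        lifetime=lifetime_seconds,  # how long each issued token stays valid
    )


# Hypothetical names, for illustration only.
email = service_account_email("scraper-sa", "my-project")
print(email)
```

The credentials object returned by `build_impersonated_credentials` can be passed to most Google Cloud client libraries through their `credentials` argument, so no key file ever needs to live on disk.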
Service accounts are meant to provide automated, controlled access to specific resources within an organization. In larger companies, their use is typically governed by data governance or system security policies. Analytics teams often use service accounts to access certain database tables required for dashboards and reporting. Instead of using personal corporate credentials, a service account offers an alternative authentication method that avoids hardcoding usernames or passwords.
The fun part about being a data product manager who is now doing his own development work is that I’m experiencing…what’s the term I’m looking for here…the joys of setting up automated access to different systems. As a data PM you have to know what service accounts are, how they work, their general strengths and weaknesses, and best practices for when and when not to use them. This is 101 stuff if you work in this space.
Going through the steps of actually setting them up in the backend and making sure everything is configured properly, though, is a different beast and is (more often than not) the domain of engineering. Depending on the flavor of engineering team you're working with, you may find that they either:
- Don't want product telling them a service account is needed or how it should be set up; they just want the requirements for what needs to be available to whom and when.
- Want a lot of specifics about what you need, including the specifics of the service account, what resources need to be accessed, and how they should be accessed.
As you're working through everything, there will be a series of decisions you have to make that (if my recent experience is any indication) you will mess up in one way or another: Where do you want it based? What types of things do you want it to be able to do? What types of things do you want others to be able to do with it? Is authentication through impersonation or will you store the access keys somewhere? How will you invoke the access keys in your script? Who do you want to be able to run your code?
You will run into a lot of issues when setting this up but, depending on your need, it could be worth the effort. If you’re only ever running code from your local computer while you’re sitting there with it open, then you can just use environment variables. If you want to run your code in the cloud while your computer is turned off or you’re otherwise not around, you’ll need a process for accessing the platforms that have the data you need.
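For the local-only case, a minimal sketch of reading a key from an environment variable might look like the following. The variable name `SCRAPING_FISH_API_KEY` and the dummy value are hypothetical, chosen only for illustration.

```python
import os


def get_api_key(var_name: str) -> str:
    """Read an API key from an environment variable, failing loudly if unset."""
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return value


# Simulate a variable exported in the shell; in practice you would set it
# outside the script so the key never appears in your code.
os.environ["SCRAPING_FISH_API_KEY"] = "dummy-key-for-illustration"
print(get_api_key("SCRAPING_FISH_API_KEY"))
```

Failing with an explicit error when the variable is missing tends to be easier to debug than passing an empty key along and getting a confusing rejection from the API later.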
For example, I use GCP Secret Manager to store and access both my Scraping Fish and Hugging Face API keys. Scraping Fish is an excellent API I use to manage all the intricacies of scraping a website, and Hugging Face is how I've set up my connection to a few large language models I'm trying out.
You will spend more time than you'd like getting these things properly set up, and you will almost certainly have to go back and redo things a few times as you get the hang of it. In the end, though, you'll have a more secure setup and feel more confident about what you're trying to do.
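As a rough sketch of what that lookup can look like in Python, the snippet below uses the google-cloud-secret-manager client library. The project ID `my-project` and secret ID `scraping-fish-api-key` are hypothetical placeholders, not my actual resource names. If the identity running the code lacks the Secret Manager Secret Accessor role on the secret, this is the kind of call that comes back with a 403 on 'secretmanager.versions.access'.

```python
def secret_version_name(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the full resource name Secret Manager expects for a secret version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def access_secret(project_id: str, secret_id: str, version: str = "latest",
                  credentials=None) -> str:
    """Fetch a secret's payload as text, optionally with explicit credentials."""
    # Imported lazily so the name helper works without the client library installed.
    from google.cloud import secretmanager

    # With credentials=None the client falls back to Application Default Credentials.
    client = secretmanager.SecretManagerServiceClient(credentials=credentials)
    response = client.access_secret_version(
        request={"name": secret_version_name(project_id, secret_id, version)}
    )
    return response.payload.data.decode("UTF-8")


# Hypothetical IDs, for illustration only.
name = secret_version_name("my-project", "scraping-fish-api-key")
print(name)  # projects/my-project/secrets/scraping-fish-api-key/versions/latest
```

Passing impersonated credentials through the `credentials` argument lets the same function run unchanged on a laptop and on a VM, as long as both identities are allowed to impersonate the service account.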