Beyond the Prompt: A Practical Guide to Fine-Tuning Gemini on Vertex AI

Rohit Das

Stop guessing if your model needs a tune-up. This guide cuts through the hype to show you exactly how to implement supervised fine-tuning using Vertex AI. From Hugging Face data prep to testing your final weights, you’ll master the "how" and "when" of leveling up Gemini’s performance.

Fine-tuning is one of those things that feels like magic until you actually have to debug a failed training job at 2 AM. Honestly, before you even touch a line of code, you’ve got to ask yourself if you actually need it.

Most of the time, a well-crafted prompt or a solid RAG (Retrieval-Augmented Generation) pipeline does the trick. But, if you need the model to mimic a very specific "vibe," follow a niche structural format, or understand deeply specialized terminology that it keeps hallucinating on, that’s when you pull the fine-tuning lever.

Here’s a walkthrough on how to get this done using Vertex AI and the Gemini 2.5 Flash-Lite model, based on some recent work I've been doing with Yelp review datasets.

Setting the Stage

First things first, you need to get your environment ready. We’re working in a Google Colab-style environment here. You’ll need to initialize Vertex AI with your specific project details.

import vertexai
from google import genai
from google.genai import types
from google.colab import userdata

PROJECT_ID = userdata.get('PROJECT_ID')
REGION = userdata.get('REGION')

vertexai.init(project=PROJECT_ID, location=REGION)
client = genai.Client(vertexai=True, project=PROJECT_ID, location=REGION)

I usually pull my credentials from userdata to keep things clean. Once the client is authenticated, you’re basically ready to start talking to the Vertex backend.

Grabbing the Data

We aren't building a dataset from scratch; who has time for that? Hugging Face is the go-to here. For this demo, I used the Yelp/yelp_review_full dataset. It’s a classic for sentiment analysis and rating prediction.

from datasets import load_dataset
import pandas as pd

ds = load_dataset("Yelp/yelp_review_full")
shuffled_ds = ds['train'].shuffle(seed=21)

# Grab a small, balanced sample for demo purposes
samples = []
for label in range(5):
    subset = shuffled_ds.filter(lambda x: x['label'] == label).select(range(5))
    samples.append(subset.to_pandas())

samples_df = pd.concat(samples).reset_index(drop=True)
# Shift labels up by one: the dataset encodes 1-5 star ratings as labels 0-4
samples_df['stars'] = samples_df['label'] + 1

A quick tip: always shuffle with a fixed seed. It makes your experiments reproducible, which is a lifesaver when you're trying to figure out why one version of your model is acting weirder than the other.

The Actual Fine-Tuning

Now for the meat of the process. Vertex AI makes this surprisingly streamlined with the tunings.tune method. You define your base model, in this case gemini-2.5-flash-lite, and point it toward your training and validation datasets stored in a Google Cloud Storage (GCS) bucket.

base_model = "gemini-2.5-flash-lite"
BUCKET_NAME = "your-bucket-name"  # GCS bucket holding your JSONL files
training_dataset = {"gcs_uri": f"gs://{BUCKET_NAME}/training_data.jsonl"}
validation_dataset = types.TuningValidationDataset(
    gcs_uri=f"gs://{BUCKET_NAME}/validation_data.jsonl"
)

sft_tuning_job = client.tunings.tune(
    base_model=base_model,
    training_dataset=training_dataset,
    config=types.CreateTuningJobConfig(
        tuned_model_display_name="Yelp_Reviews",
        validation_dataset=validation_dataset,
    ),
)

It’s worth noting that the data needs to be in JSON Lines (.jsonl) format. If you’ve got a CSV or a DataFrame, you’ll need to do a little bit of transformation before uploading it to your bucket.
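That transformation can be sketched like this: one JSON object per line, pairing the user prompt with the answer we want the model to produce. The `demo_df` below is a tiny stand-in for the `samples_df` we built earlier, and the exact schema keys should be double-checked against the Vertex AI tuning docs for your model version.

```python
import json
import pandas as pd

# Tiny stand-in for samples_df from earlier (needs "text" and "stars" columns).
demo_df = pd.DataFrame({
    "text": ["Great tacos, friendly staff.", "Cold food and a long wait."],
    "stars": [5, 1],
})

def df_to_tuning_jsonl(df, path):
    """Write one JSON object per line in the Gemini supervised-tuning shape.

    Each example pairs a user prompt with the answer the model should give.
    Verify the exact schema in the Vertex AI docs for your model version.
    """
    with open(path, "w") as f:
        for row in df.itertuples():
            example = {
                "contents": [
                    {"role": "user",
                     "parts": [{"text": f"Rate this review on a scale of 1 to 5 stars: {row.text}"}]},
                    {"role": "model",
                     "parts": [{"text": str(row.stars)}]},
                ]
            }
            f.write(json.dumps(example) + "\n")

df_to_tuning_jsonl(demo_df, "training_data.jsonl")
```

Keeping the model-side answer to a bare number is deliberate: it trains the model to respond with just the rating, which pays off at inference time.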

Testing the Results

Once the job finishes (which gives you enough time to grab a coffee or three), you’ll get an endpoint. You can then run your test set through this newly "educated" model.
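If you'd rather not babysit the console, you can poll the job from the notebook. Here's a minimal sketch; it assumes the `has_ended` property that the google-genai SDK exposes on its TuningJob objects, so confirm the attribute names against your SDK version.

```python
import time

def wait_for_job(fetch_job, poll_seconds=60):
    """Re-fetch a tuning job until it reports completion, then return it."""
    job = fetch_job()
    while not job.has_ended:
        time.sleep(poll_seconds)
        job = fetch_job()
    return job

# Against Vertex AI it would look something like:
# job = wait_for_job(lambda: client.tunings.get(name=sft_tuning_job.name))
# print(job.state)                 # e.g. JOB_STATE_SUCCEEDED
# print(job.tuned_model.endpoint)  # the endpoint path you'll query next
```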

In my tests, I noticed the accuracy hit 100% on a small sample set. While that sounds great, keep an eye out for overfitting—you want a model that understands the concept of a 4-star review, not one that just memorized your training data.

# Calling the tuned model endpoint
model_path = "projects/your-project/locations/us-west1/endpoints/your-endpoint-id"

yelp_response = []
for row in samples_df.itertuples():
    prompt = f"Rate this review on a scale of 1 to 5 stars: {row.text}"
    response = client.models.generate_content(
        model=model_path, contents=prompt
    ).text
    # strip() guards against stray whitespace; int() will still raise if the
    # model returns anything other than a bare number
    yelp_response.append(int(response.strip()))

The output usually looks like a clean list of integers: [1, 1, 2, 2, 3, 4, 5...]. If the model starts yapping about why it chose the number, you might need to go back and refine your training examples to be more concise.
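Scoring those predictions is a one-liner once you have the list. A small helper like this (names are my own, not from the SDK) compares the model's answers against the true star ratings:

```python
def exact_match_accuracy(preds, truths):
    """Fraction of predictions that match the true star rating exactly."""
    pairs = list(zip(preds, truths))
    return sum(p == t for p, t in pairs) / len(pairs)

# In the notebook this would be:
# exact_match_accuracy(yelp_response, samples_df["stars"])
```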

When Fine-Tuning is Overkill

I’ll be honest, fine-tuning is a heavy lift. Before you commit to the compute costs and the data prep, consider Few-Shot Prompting. Sometimes, just giving the model 5 or 10 solid examples in the prompt itself gets you 90% of the way there.
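For this star-rating task, a few-shot version might look like the sketch below. The example reviews are invented for illustration; the point is packing labeled examples and an "answer with the number only" instruction into the prompt itself.

```python
# Example reviews below are invented for illustration.
FEW_SHOT_TEMPLATE = """Rate each review on a scale of 1 to 5 stars. Answer with the number only.

Review: The food was cold and the waiter ignored us all night.
Stars: 1

Review: Solid burgers, nothing special, but fair prices.
Stars: 3

Review: Best ramen I've had outside of Tokyo. Incredible broth.
Stars: 5

Review: {review}
Stars:"""

prompt = FEW_SHOT_TEMPLATE.format(review="Decent coffee but the seating is cramped.")
# response = client.models.generate_content(
#     model="gemini-2.5-flash-lite", contents=prompt
# ).text
```

If the base model nails your task with a prompt like this, you've saved yourself a tuning job, a GCS bucket, and the compute bill.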

If that’s not enough, look into RAG. If your model is failing because it doesn't know specific facts, fine-tuning won't necessarily help as much as a search-based retrieval system would. Use fine-tuning for style and procedure; use RAG for knowledge.
