From Ratings to Sentiment: Few-Shot LLM Analysis of Amazon Video Game Reviews

Summary: Using a Hugging Face dataset of Amazon product reviews for video games, this project builds a sentiment classification pipeline. It combines data preparation, few-shot prompting, and evaluation to generate insights on how gamers express satisfaction or dissatisfaction.

👉 Full code & notebook: https://github.com/musicwil/Generative-AI-projects/tree/main/sentiment-analysis-classification

1. Introduction

This project was entirely self-initiated.
While exploring datasets on Hugging Face, I found one containing 15 years of Amazon video game reviews, spanning October 1999 to July 2014. The dataset included the full review text and star ratings, but no sentiment labels (positive/negative).

I wanted to see if a Large Language Model (LLM) could take only the review text and reliably classify each review’s sentiment without relying on the star ratings.

This led to a two-stage approach:

  • First, try a zero-shot prompt.
  • Then, improve it using few-shot prompting with 19 manually created “gold” examples.

We then added an optional enrichment step — mapping Amazon product ASINs to their corresponding video game titles — to present results in a more understandable, reader-friendly way.


2. Project Goals

  • Build a reproducible pipeline for classifying Amazon video game reviews as positive or negative.
  • Compare zero-shot vs few-shot prompting.
  • Enrich the data by mapping ASINs to actual game titles. This step was optional, but it makes the results more complete and easier to read.

3. Dataset Overview

  • Source: Hugging Face — LoganKells/amazon_product_reviews_video_games
  • Coverage: October 14, 1999 → July 22, 2014
  • Subset used: data_100min (pre-filtered) → 5,674 reviews
  • Fields used: asin, reviewText, overall (rating), and reviewTime

Example:

asin         reviewText                                                   overall   reviewTime
B00000JRSB   "When I first played this game, I was stunned..."            5         07 20, 2003
B00004YRQA   "It is important to have such a card to save your games."    4         06 11, 2001

4. Methodology

4.1 Data Loading

We loaded the data_100min subset directly from Hugging Face. This subset was already filtered, so no additional cleaning steps like deduplication, length filtering, or text normalization were applied in this workflow.
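
A minimal loading sketch with the Hugging Face datasets library (assuming data_100min is exposed as a named configuration of the dataset):

from datasets import load_dataset

# "data_100min" is assumed to be a named configuration of the dataset
# on the Hugging Face Hub; adjust if the repo exposes it differently.
ds = load_dataset("LoganKells/amazon_product_reviews_video_games",
                  "data_100min", split="train")
print(len(ds))             # expected: 5,674 reviews
print(ds[0]["reviewText"])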

4.2 Prompting Strategies

  • Zero-shot: A direct instruction prompt telling the model to classify reviews as positive or negative, without any examples (see the sketch after this list).
  • Few-shot: The same prompt, but now with 19 manually labeled “gold” examples from the dataset, showing exactly what qualifies as positive or negative.
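
A minimal zero-shot prompt in the same OpenAI message format could look like this (illustrative wording, not the notebook's exact prompt):

# Zero-shot: instruction only, no labeled examples.
# The wording below is illustrative, not the notebook's exact prompt.
def create_zero_shot_prompt(review_text):
    return [
        {'role': 'system',
         'content': ("Classify the following Amazon video game review "
                     "as 'positive' or 'negative'. Respond with one word.")},
        {'role': 'user', 'content': review_text},
    ]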

Example few-shot function from the notebook:

# Defining function to create few_shot_prompt

import json

def create_prompt(system_message, examples, user_message_template):
    """
    Return a prompt in the format expected by the OpenAI API.
    Loop through the examples and parse each one as a user message
    followed by an assistant message.

    Args:
        system_message (str): system message with instructions for sentiment analysis
        examples (str): JSON string with a list of examples
        user_message_template (str): string with a placeholder for customer reviews

    Returns:
        few_shot_prompt (list): a list of dictionaries in the OpenAI prompt format
    """
    # Start with the system instruction.
    few_shot_prompt = [{'role': 'system', 'content': system_message}]

    # Each gold example becomes a user/assistant message pair.
    for example in json.loads(examples):
        example_review = example['Review']
        example_sentiment = example['Ground_Truth']

        few_shot_prompt.append(
            {
                'role': 'user',
                'content': user_message_template.format(
                    reviewText=example_review
                )
            }
        )

        few_shot_prompt.append(
            {'role': 'assistant',
             'content': f"{example_sentiment}"}
        )

    return few_shot_prompt
  • Why it helps: Compared to zero-shot, few-shot reduced classification errors, especially for mixed-tone reviews where the model might otherwise misinterpret sentiment.
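
To classify a new review, the few-shot messages are sent to the chat completions endpoint with the review appended as the final user turn. A minimal sketch (the model name here is an assumption, not necessarily the one used in the notebook):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_review(review_text, few_shot_prompt, user_message_template):
    # Append the review to classify as the final user turn.
    messages = few_shot_prompt + [
        {'role': 'user',
         'content': user_message_template.format(reviewText=review_text)}
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; substitute the notebook's choice
        messages=messages,
        temperature=0,        # deterministic output for classification
    )
    return response.choices[0].message.content.strip()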

4.3 Evaluation

Since the dataset didn’t have gold sentiment labels, we couldn’t compute metrics like accuracy or F1 score. Instead, we manually reviewed randomly selected predictions from both approaches and compared results.
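
With no labels to score against, the spot-checking looked roughly like this (a sketch with hypothetical variable names; reviews, zero_shot_preds, and few_shot_preds are assumed to be parallel lists built earlier in the pipeline):

import random

random.seed(42)  # reproducible sample
for i in random.sample(range(len(reviews)), k=10):
    # Print both methods' predictions side by side for manual judging.
    print(f"REVIEW:    {reviews[i][:120]}...")
    print(f"ZERO-SHOT: {zero_shot_preds[i]}")
    print(f"FEW-SHOT:  {few_shot_preds[i]}")
    print("-" * 60)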

5. Results

  • Observation: Zero-shot produced many incorrect classifications. Few-shot significantly improved accuracy in our manual checks.
  • Method: Randomly sampled reviews were compared between the two methods, and correctness was judged manually.
  • Outcome: Few-shot handled mixed or nuanced reviews far better than zero-shot.

6. Insights & Recommendations

Technical

  • Few-shot prompting can greatly improve LLM performance when no labeled dataset exists.
  • Creating a gold set of examples — even a small one — helps anchor the model’s decision-making.

Example ASIN → product title enrichment function from the notebook, with Tavily client setup shown for completeness:

import os
from tavily import TavilyClient

tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])  # assumes the key is set

def get_title_from_asin(asin):
    # Look up the product title behind an ASIN via a web search.
    query = f"What is the title of the Amazon video game with ASIN {asin}?"
    try:
        response = tavily_client.search(query, search_depth="basic")
        return response['results'][0]['title'] if response['results'] else "Not found"
    except Exception as e:
        return f"Error: {str(e)}"

Business

  • Companies can mine customer feedback for actionable insights even without labeled data.
  • Sentiment trends could inform product design, marketing, and customer support focus.
  • Product title enrichment makes results easier for non-technical audiences to understand.

7. How to Reproduce
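
In outline (the linked repository contains the exact steps and the full notebook):

  • Clone the repository linked at the top of this article and open the project notebook.
  • Install the dependencies used in the notebook (datasets, openai, and tavily-python for the optional enrichment step).
  • Set your OpenAI API key (and, for enrichment, a Tavily API key) as environment variables.
  • Run the notebook cells in order: data loading, zero-shot and few-shot prompting, evaluation, and the optional ASIN enrichment.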

8. Next Steps (Future Development)

  • Add sentiment distribution charts and trends over time (a sketch follows this list).
  • Explore semi-supervised labeling to enable metric computation.
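
For example, a sentiment-over-time chart could be as simple as the sketch below (assuming a pandas DataFrame df with a predicted sentiment column alongside the dataset's reviewTime field; the df and sentiment names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

# Sketch: stacked bar chart of predicted sentiment per year.
# `df` is assumed to hold one row per review, with a hypothetical
# 'sentiment' column (model prediction) and the 'reviewTime' field.
df['year'] = pd.to_datetime(df['reviewTime'], format='%m %d, %Y').dt.year
counts = df.groupby(['year', 'sentiment']).size().unstack(fill_value=0)
counts.plot(kind='bar', stacked=True, figsize=(10, 4),
            title='Predicted sentiment by year')
plt.tight_layout()
plt.show()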

9. Conclusion

By applying few-shot LLM prompting to Amazon video game reviews, we converted raw customer text into structured sentiment insights without relying on ratings or existing sentiment labels. The process showed clear improvements over zero-shot classification and demonstrated the value of even a small set of gold examples.

Future work will enrich the analysis with visualizations and partial labeling so that quantitative metrics can also be reported.