
Captioning Pokémon Cards with Image-to-Text Models

In this post, we will look at how to take a pre-trained image-to-text neural model and fine-tune it to caption Pokémon cards.
Created on May 12 | Last edited on July 20

Introduction

Image captioning is the process of generating a textual description of visual content (primarily images), usually employing vision and language reasoning systems. These captions can range from just a handful of words to longer, more descriptive text.
For example: let's suppose I have the below image of a rooster I took during my trip to Hawaii.
Image of a rooster I took during my trip to Hawaii. Rooster is walking on the dirt ground with its left leg up.
I can caption this as "A rooster standing on a single leg" or "A vibrant, colorful rooster on a dirt patch, staring at something on the horizon, standing on a single leg." Captions are important as they allow us to describe images using text, which can be particularly helpful for those with visual impairments.
But enough about roosters. In this post, we move away from captioning realistic images and focus on captioning Pokémon cards. More specifically, we will fine-tune a publicly-available image-to-text model and use this to caption cards.
Example of Pokémon cards. The right-most card is a real card that I own; the remaining cards are from the 1K dataset used in this post. Pokémon from left to right: Larvesta, Eevee VMax, Darkrai, Ash's Pikachu.

So, why Pokémon cards? What makes them interesting?

First off, Pokémon cards can sometimes be difficult to read, and captioning can help with readability. One example is the above Darkrai card, where the dark background makes the text difficult for me to read. From a more scientific point of view, Pokémon cards are cartoonish and diverge substantially from realistic images. Pre-trained vision models, which are primarily trained on realistic images, will likely have a difficult time processing Pokémon cards. Taking the above Eevee and Darkrai as examples, we can see their eyes are drawn in a non-realistic manner (specifically, Eevee's eyes look like "anime eyes"). Cards also come in a variety of art styles. For example, the two cards below seem similar in form, but are illustrated by two different individuals:
Reuniclus and Yanmega Cards. Reuniclus is illustrated by Kagemaru Himeno while Yanmega is illustrated by Uta.
Finally, the content needed for captioning Pokémon cards (or games in general) is domain-specific; the content is not common knowledge. For example, the above Reuniclus' attack "Telekinesis of Nobility" is not a common term and thus not general knowledge (it doesn't even exist in the Pokémon games).
As such, they make for an interesting image captioning challenge: How well can neural models caption non-realistic images?
To answer this question, we will use the PokemonCards dataset found on HuggingFace. The original dataset contains roughly 13K instances, but for this post we create a 1,000-sample subset. We'll also leverage W&B Tables to understand the dataset and Sweeps to find good hyperparameters for fine-tuning the image-to-text model. Finally, we will demonstrate the fine-tuned model's ability to caption physical Pokémon cards that I own.
Quick disclaimer: I am not an expert on image captioning. I worked on this project to (1) better understand image-to-text models and (2) learn about ML workflows. If you are looking for expert knowledge about image captioning, I am unfortunately not that person. With that, let's start!

Show me the code!

The codebase for this blog post can be found on my GitHub. Feel free to clone the repo and follow along with this post. If you have any comments or suggestions, feel free to file a pull request or comment below and I'll get to it.

What is Pokémon anyway?

Pokémon is a media franchise adored by millions of fans around the world (spoiler alert: I'm one of them). Launched in 1996 with the original video games Pocket Monsters Red and Green, Pokémon has since expanded into a full video game series, an anime series, and even a collectible trading card game. Some people collect the cards, while others compete in tournaments held throughout the year.
Pokémon is also the name of the imaginary creatures in the series, which are usually based on real-world creatures (e.g., Pikachu is a mouse) and objects (e.g., Vanillite is a play on the vanilla spice or the classic ice cream flavor). Pokémon can be collected or used in battle. In both the card game and the video games, Pokémon have hit points that denote how much damage they can take before being unable to fight, as well as movesets for battling each other.

Pokemon Cards Dataset

Let's begin by diving into the Pokémon Cards dataset that we will use in this post. The original PokemonCards dataset can be found on HuggingFace and consists of 13,139 cards from 147 card sets, with the most recent set being Silver Tempest (see the Bulbapedia article List of Pokémon Trading Card Game expansions for the list of all English Pokémon sets). This means the dataset does not include Pokémon from the most recent mainline games, Scarlet and Violet. This is important to know because the captioning model we train in this post will not work well on Pokémon from those games.
Training, validating, and evaluating over the full 13K dataset is doable if you have sufficient computing resources, but I am going to assume not everyone does. Thus, from this 13K dataset, we take a random sample of 1K cards and use it for training, validating, and evaluating the image captioning model.
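For reference, here's roughly how such a subset can be drawn with HuggingFace's datasets library. The dataset id below is my assumption of the hub name; substitute whatever copy of PokemonCards you are using.

from datasets import load_dataset

# Load the full PokemonCards dataset from the HuggingFace Hub
# (assumed dataset id; swap in the one you are using)
full_dataset = load_dataset("TheFusion21/PokemonCards", split="train")

# Shuffle with a fixed seed and keep a random 1K subset
subset = full_dataset.shuffle(seed=1).select(range(1000))
print(subset)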
You can click through the W&B Table below to check out the subset we're using:

[W&B Table: the 1K card subset, with columns id, image, caption, name, set_name, hp, image_url, and split]
The four main columns we care about are caption, name, set_name, and image. There are a total of 691 unique Pokémon card names (name) in this dataset and 137 unique Pokémon sets (set_name).
Let's make note of a few characteristics of this dataset. First, there are two Pokémon with blank cards (Froakie (row 59) and Chesnaught (row 487)). It's unclear why these are blank, but we do not filter them out and simply treat them as noise (they end up in the training split of the dataset). Second, each Pokémon can have multiple variations (e.g., Cinderace vs. Cinderace V). Below are the variations of Cinderace. Notice that the artwork and captions differ, despite them depicting the same Pokémon.

[W&B Table: the four Cinderace variants in the dataset (rows 56, 195, 229, and 614)]
Finally, the set_name is contained in the caption the model needs to generate, but it does not exist on the card itself. This is important to note because it is impossible for the model to correctly predict that portion of the text for sets not seen during training as that knowledge is not contained on the card, which is the only input to the model.

Dataset Partitioning

This dataset has two attributes we could group by: set_name and name. We group the data by set_name. The motivation is that new Pokémon card sets come out every year, so generalizing to new sets is both realistic and important. We could also stratify by Pokémon, but in this 1K subset the number of cards per Pokémon is very sparse (frequently just one card per Pokémon), so we don't. The number of cards per Pokémon is less sparse in the full 13K dataset, so we leave stratifying by Pokémon over the full dataset to future work.
Under this grouping strategy, we split the 1K dataset into 80% training (800 instances), 10% validation (100 instances), and 10% testing (100 instances).
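Here's a rough sketch of how such a grouped split can be produced with scikit-learn's GroupShuffleSplit, assuming the 1K subset lives in a pandas dataframe df with a set_name column (the actual splitting code in the repo may differ, and the proportions are approximate since whole sets are kept together):

from sklearn.model_selection import GroupShuffleSplit

# Carve out ~80% of the cards for training, grouping by card set so that
# no set is shared between splits
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=1)
train_idx, rest_idx = next(gss.split(df, groups=df["set_name"]))

# Split the remaining ~20% evenly into validation and test, again by set
rest_df = df.iloc[rest_idx]
gss_rest = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=1)
val_idx, test_idx = next(gss_rest.split(rest_df, groups=rest_df["set_name"]))

df.loc[df.index[train_idx], "split"] = "train"
df.loc[rest_df.index[val_idx], "split"] = "valid"
df.loc[rest_df.index[test_idx], "split"] = "test"
print(df["split"].value_counts())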
Here's a table assigning each instance in the 1K dataset to train, valid, or test:

[W&B Table: the 1K subset with each card assigned to the train, valid, or test split]

Model Description

The model we use for this project is a pre-trained image-to-text model from HuggingFace which combines a Vision Transformer (ViT) encoder with a GPT-2 decoder. While other large multimodal models (e.g., GPT-4) are available, I chose this one because (1) it is publicly available, (2) I could train it locally, and (3) I wanted to evaluate a smaller model before moving on to a bigger one.
At a high level, the Vision Transformer is a Transformer Encoder trained on image data while GPT-2 is a 1.5B parameter Transformer Decoder trained on textual data. I'll mainly go into detail about the input and output of the full pipeline (ViT + GPT-2) as that is the most important for this post. Below is a visualization of the entire pipeline:
Image-to-Text Model Architecture. Outputs of ViT are provided as input to the Encoder-Decoder Self-Attention mechanism component within the GPT-2 decoder.

ViT Inputs:

The standard transformer encoder takes a sequence of 1D token embeddings (i.e., an element of ${\cal R}^{N \times D}$), where $N$ is the number of tokens and $D$ is the embedding dimension. However, images are not sequential and have the shape $H \times W \times C$, where $(H, W)$ is the resolution of the original image and $C$ is the number of channels.
To pass images into a transformer encoder, we need to convert the image into a sequential input of 1D embeddings. To do this, we break the image down into patches of size $P^2 \times C$, where $(P, P)$ is the resolution of each image patch. We can then aggregate these patches into a sequence to get an image patch sequence in ${\cal R}^{N \times (P^2 \times C)}$, where $N = \frac{HW}{P^2}$ is the number of patches extracted from the image (i.e., the input length).
Now that we have converted the image into a sequential input, we need to transform each patch into a 1D embedding. To do this, each patch is flattened and passed through a patch embedding layer that maps ${\cal R}^{P^2 \times C} \rightarrow {\cal R}^{D}$. This effectively turns the patches into the sequence of 1D embeddings that the Transformer takes as input. A learnable embedding ($\texttt{C}$ in the above figure) is prepended to the sequence of patch embeddings and functions much like BERT's $\texttt{[CLASS]}$ token: the encoder output at this position ($\texttt{CLS}$ in the above figure) contains a representation of the full input image and is primarily used for classification. It is not used by the transformer decoder, so it is not essential for this post, but it helps complete the picture of ViT's inputs and outputs. Finally, positional embeddings are added to the patch embeddings to provide positional information about each patch in the sequence.
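To make the patching step concrete, here is a toy PyTorch sketch of it. This is purely illustrative; HuggingFace's ViT implementation performs the same projection with a strided convolution.

import torch
import torch.nn as nn

# A 224x224 RGB image split into 16x16 patches, each flattened and projected to D dims
H, W, C, P, D = 224, 224, 3, 16, 768
image = torch.randn(1, C, H, W)                          # (batch, channels, height, width)

# Break the image into N = HW / P^2 non-overlapping patches of size P*P*C
patches = image.unfold(2, P, P).unfold(3, P, P)          # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, C * P * P)   # (1, N, P^2*C)

# Project each flattened patch to a D-dimensional embedding
patch_embed = nn.Linear(C * P * P, D)
tokens = patch_embed(patches)                            # (1, N, D)

# Prepend the learnable [CLASS]-style token and add positional embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, D))
encoder_input = torch.cat([cls_token, tokens], dim=1) + pos_embed
print(encoder_input.shape)                               # torch.Size([1, 197, 768])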

ViT Outputs:

The encoder outputs (1) contextualized embeddings for each image patch ($E_1, \dots, E_N$ in the above figure), and (2) the class embedding containing a representation of the full input image ($\texttt{CLS}$ in the above figure). The contextualized embeddings are then passed to the transformer decoder, where they are used to condition the text the decoder generates.

GPT-2 Input and Outputs:

The transformer decoder takes the embeddings from ViT and a start token $\texttt{[START]}$ as input, and autoregressively outputs a sequence of tokens. This output represents the text (in our case, a caption) generated by the decoder. More specifically, it generates the first token (in the above figure, this is "This") from both $\texttt{[START]}$ and the ViT embeddings, then uses the "This" it just generated as input to generate the second token ("is" in the figure). This is repeated until a maximum number of tokens is reached or an $\texttt{[END]}$ token is generated.
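To make the autoregressive loop concrete, here is a rough greedy-decoding sketch of what happens under the hood. In the actual training and evaluation code, predict_with_generate=True lets HuggingFace's generate() handle this; model, tokenizer, and pixel_values below are assumed to be the ViT+GPT-2 checkpoint, its tokenizer, and a preprocessed card image.

import torch

# Greedy decoding sketch: `model` is the ViT+GPT-2 VisionEncoderDecoderModel,
# `tokenizer` its GPT-2 tokenizer, and `pixel_values` a (1, 3, 224, 224) tensor
# produced by the feature extractor for a single card image
with torch.no_grad():
    encoder_outputs = model.encoder(pixel_values=pixel_values)      # ViT patch embeddings
    generated = torch.tensor([[tokenizer.bos_token_id]])            # the [START] token

    for _ in range(256):                                            # maximum caption length
        logits = model.decoder(
            input_ids=generated,
            encoder_hidden_states=encoder_outputs.last_hidden_state,
        ).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:             # the [END] token
            break

print(tokenizer.decode(generated[0], skip_special_tokens=True))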

Metrics

We use three metrics for evaluating our image captioning model: BLEU, Google BLEU, and BERTScore. BLEU and Google BLEU (called "GLEU Score" in the original paper) are metrics that focus on evaluating the syntactic structure of text while BERTScore focuses more on the semantics of text.
We'll use BLEU and Google BLEU when training and validating our model, and use BLEU, Google BLEU, and BERTScore when evaluating our model. According to the HuggingFace description, BLEU has some undesirable properties when evaluating single sentences that Google BLEU mitigates (BLEU is a corpus-level metric). This makes Google BLEU useful when evaluating each instance's caption in isolation and has the added benefit of correlating well with the BLEU metric on a corpus level, according to the original paper. Thus, we focus more on the Google BLEU score over the BLEU score, but include both in our evaluations.
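To make the metrics concrete, here's a small, self-contained example of computing all three with HuggingFace's evaluate library (the sentences are purely illustrative):

import evaluate

google_bleu = evaluate.load("google_bleu")
bleu = evaluate.load("sacrebleu")
bertscore = evaluate.load("bertscore")

# One predicted caption and one (list of) reference caption(s)
pred = ["A Basic Pokemon Card of type Lightning with the title Pikachu and 60 HP"]
refs = [["A Basic Pokemon Card of type Lightning with the title Pikachu and 70 HP"]]

print(google_bleu.compute(predictions=pred, references=refs))
print(bleu.compute(predictions=pred, references=refs))
print(bertscore.compute(predictions=pred, references=[r[0] for r in refs], lang="en"))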

Fine-Tuning Baseline

Let's start by creating a baseline model, to understand how well we can do by simply taking something off the shelf and training it. To this end, we download the image-to-text model from HuggingFace and fine-tune it without any hyperparameter optimization. I will first go through the important parts of the training and validation code and then provide baseline results. The code I will be going through can be found here: https://github.com/Teravolt/pokemon-cards-image-captioning/blob/main/train.py
First, we will define some important global variables, such as the model we are downloading, its tokenizer and feature extractor, evaluation metrics, and base configuration.
from argparse import Namespace

import torch
from torch.utils.data import Dataset

import wandb
import evaluate

from transformers import (AutoFeatureExtractor, AutoTokenizer,
                          EvalPrediction, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, VisionEncoderDecoderModel)

SEED = 1

# Define model
MODEL = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MODEL.to(DEVICE)

# Define image feature extractor and tokenizer
FEATURE_EXTRACTOR = AutoFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
TOKENIZER = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Define metrics
GOOGLE_BLEU_METRIC = evaluate.load('google_bleu')
BLEU_METRIC = evaluate.load('sacrebleu')
BERTSCORE_METRIC = evaluate.load('bertscore')

# Define validation/testing results table
FULL_RESULTS_TABLE = wandb.Table(columns=['eval_iter', 'image', 'pred_text', 'gt_text', 'google_bleu'])
EVAL_ITER = 0

VAL_DF = None

# Base training configuration
CONFIG = Namespace(
    predict_with_generate=True,
    include_inputs_for_metrics=False,
    report_to='wandb',
    run_name='fine_tuning',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    learning_rate=1e-3,
    push_to_hub=False,
    load_best_model_at_end=True,
    seed=SEED,
    output_dir='baseline-ft-model-output/',
    optim='adamw_torch',
    generation_max_length=256,
    generation_num_beams=1,
    train_limit=0,
    val_limit=0
)
Many of these global variables are straightforward, but some need a bit more explanation.
EVAL_ITER and VAL_DF are global variables that hold the current epoch iteration and the validation data. EVAL_ITER keeps track of the epoch iteration outside of HuggingFace's Seq2SeqTrainer, which we use for fine-tuning our model. VAL_DF is used when calculating and logging metrics, and we will see how it is used in a second.
A majority of the attributes in the base configuration CONFIG come from the Seq2SeqTrainingArguments class from HuggingFace. There are a few additional attributes added to the config via command-line arguments (train_limit, val_limit, log_model, log_full_results), roughly as sketched below. train_limit and val_limit limit the number of training and validation instances we use (for debugging), log_model lets us log a trained model to W&B, and log_full_results lets us log validation results from all epochs instead of just the final epoch.
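Here is a sketch of how those flags might be parsed and merged into CONFIG; the actual argument handling in train.py may differ.

import argparse

parser = argparse.ArgumentParser(description="Fine-tune the Pokemon card captioner")
parser.add_argument("--train-limit", type=int, default=0,
                    help="Cap the number of training instances (0 = use all)")
parser.add_argument("--val-limit", type=int, default=0,
                    help="Cap the number of validation instances (0 = use all)")
parser.add_argument("--log-model", action="store_true",
                    help="Log the trained model to W&B as an artifact")
parser.add_argument("--log-full-results", action="store_true",
                    help="Log validation results from every epoch, not just the last")
args = parser.parse_args()

# Merge the CLI overrides into the base CONFIG namespace
for key, value in vars(args).items():
    setattr(CONFIG, key, value)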
Next, we define the Pokémon Cards dataset. This dataset will convert our input images and captions into the format required for the model. Specifically, images are converted into image features while captions are tokenized and converted into a sequence of token ids.
class PokemonCardsDataset(Dataset):

    def __init__(self, images: list, captions: list, config) -> None:
        # Convert all images to RGB so the feature extractor always gets
        # a consistent number of channels
        self.images = []
        for image in images:
            image_ = image.image
            if image_.mode != "RGB":
                image_ = image_.convert(mode="RGB")
            self.images.append(image_)

        self.captions = captions
        self.config = config

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, index):
        image = self.images[index]
        caption = self.captions[index]

        # Convert the image into pixel values and the caption into token ids
        pixel_values = FEATURE_EXTRACTOR(images=image, return_tensors="pt").pixel_values[0]
        tokenized_caption = TOKENIZER.encode(
            caption, return_tensors='pt', padding='max_length',
            truncation='longest_first',
            max_length=self.config.generation_max_length)[0]

        output = {
            'pixel_values': pixel_values,
            'labels': tokenized_caption
        }

        return output
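The trainer we set up later also needs a data collator that stacks individual dataset items into a batch. The repo defines its own collate_fn; a minimal version along those lines looks like this:

import torch

def collate_fn(batch):
    # Stack per-item tensors from PokemonCardsDataset into batched tensors
    return {
        'pixel_values': torch.stack([item['pixel_values'] for item in batch]),
        'labels': torch.stack([item['labels'] for item in batch]),
    }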
Next, we have the metrics function, which takes in a batch of predictions and ground truths from the validation data and runs the Google BLEU and BLEU scores over the batch. These scores are then logged to W&B.
def compute_metrics(eval_obj: EvalPrediction):
    global EVAL_ITER

    pred_ids = eval_obj.predictions
    gt_ids = eval_obj.label_ids

    # Decode predicted and ground truth token ids back into text
    pred_texts = TOKENIZER.batch_decode(pred_ids, skip_special_tokens=True)
    pred_texts = [text.strip() for text in pred_texts]

    gt_texts = TOKENIZER.batch_decode(gt_ids, skip_special_tokens=True)
    gt_texts = [[text.strip()] for text in gt_texts]

    avg_google_bleu = []
    for i, (pred_text, gt_text) in enumerate(zip(pred_texts, gt_texts)):
        # Compute Google BLEU metric for this instance
        google_bleu_metric = \
            GOOGLE_BLEU_METRIC.compute(predictions=[pred_text], references=[gt_text])

        # Log the instance-level results to the W&B table
        FULL_RESULTS_TABLE.add_data(EVAL_ITER, VAL_DF['image'].values[i],
                                    pred_text, gt_text[0],
                                    google_bleu_metric['google_bleu'])

        avg_google_bleu.append(google_bleu_metric['google_bleu'])

    # BLEU is computed once over the full validation batch (corpus-level)
    bleu_metric = \
        BLEU_METRIC.compute(predictions=pred_texts, references=gt_texts)

    metrics = {
        'avg_google_bleu': sum(avg_google_bleu)/len(avg_google_bleu),
        'bleu_metric': bleu_metric['score']}

    EVAL_ITER += 1

    return metrics
In this metrics function, we log the epoch iteration, input images, predicted texts, ground truth texts, and Google BLEU scores to a W&B table. Many of these can be retrieved and computed within the function, except for the epoch iteration and input images. Unfortunately, the input images cannot be passed into the metrics function (Seq2SeqTrainingArguments has a way to pass inputs to the metrics functions, but I ran into errors), so we need to get it from somewhere else. This is where VAL_DF comes in. We'll get our input images from this global variable.
Finally, we have our training function. This function initializes a W&B run, downloads data from W&B, creates training & validation splits, and passes them to HuggingFace's trainer for training.
def train(config):
    """
    Training process
    """
    global VAL_DF

    run = wandb.init(project='pokemon-cards', entity=None, job_type="training", name=config.run_name)

    # Download the dataset table from W&B and convert it into a dataframe
    wandb_table = download_data(run)
    train_val_df = get_df(wandb_table)

    train_df = train_val_df[train_val_df.split == 'train']
    VAL_DF = train_val_df[train_val_df.split == 'valid']

    # Optionally limit the number of training/validation instances (for debugging)
    if config.train_limit > 0:
        train_df = train_df.iloc[:config.train_limit, :]
    if config.val_limit > 0:
        VAL_DF = VAL_DF.iloc[:config.val_limit, :]

    train_dataset = PokemonCardsDataset(
        train_df.image.values,
        train_df.caption.values,
        config)

    val_dataset = PokemonCardsDataset(
        VAL_DF.image.values,
        VAL_DF.caption.values,
        config)

    training_args = Seq2SeqTrainingArguments(
        predict_with_generate=config.predict_with_generate,
        include_inputs_for_metrics=config.include_inputs_for_metrics,
        report_to=config.report_to,
        run_name=config.run_name,
        evaluation_strategy=config.evaluation_strategy,
        save_strategy=config.save_strategy,
        per_device_train_batch_size=config.per_device_train_batch_size,
        per_device_eval_batch_size=config.per_device_eval_batch_size,
        num_train_epochs=config.num_train_epochs,
        learning_rate=config.learning_rate,
        push_to_hub=config.push_to_hub,
        metric_for_best_model="avg_google_bleu",
        load_best_model_at_end=config.load_best_model_at_end,
        seed=config.seed,
        output_dir=config.output_dir,
        optim=config.optim,
        generation_max_length=config.generation_max_length,
        generation_num_beams=config.generation_num_beams
    )

    trainer = Seq2SeqTrainer(
        model=MODEL,
        args=training_args,
        compute_metrics=compute_metrics,
        data_collator=collate_fn,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=FEATURE_EXTRACTOR,
    )

    train_results = trainer.train()

    if config.log_full_results:
        # Save full metrics table to wandb
        run.log({'full_results_table': FULL_RESULTS_TABLE})

    # Save final metrics table to wandb
    final_results_table = get_final_results(EVAL_ITER-1, FULL_RESULTS_TABLE)
    run.log({'final_results_table': final_results_table})

    if config.log_model:
        model_art = wandb.Artifact("pokemon-image-captioning-model", type="model")
        trainer.save_model(f"{config.output_dir}/best_model")
        model_art.add_dir(f"{config.output_dir}/best_model")
        run.log_artifact(model_art)

    run.finish()

    return train_results
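With all of these pieces in place, kicking off a baseline run boils down to parsing any command-line overrides into CONFIG (as sketched earlier) and calling train. A simplified entry point might look like this:

if __name__ == "__main__":
    # Parse CLI overrides into CONFIG (see the argparse sketch above), then train
    train(CONFIG)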

Baseline Results

We trained our model for five epochs and provide evaluation results over the 100 validation instances after each epoch.
Below, you'll find a) a few graphs containing the average Google BLEU score over the 100 validation instances and BLEU score for each epoch, and b) a table containing the predicted text and ground truth text for the epoch with the highest avg. Google BLEU (epoch 2).
These results show that the baseline model achieves a decent average Google BLEU and BLEU score. However, the model does not output the correct text for any given card; in fact, it outputs the exact same text for every card.

[W&B panels: baseline run validation metrics and predictions table]


Running a Model Sweep

Now that we have baseline results, we can run a hyperparameter sweep to see if we can improve model performance. For this, we leverage W&B Sweeps. We search for a parameter configuration that maximizes the Google BLEU score (we could also use the BLEU score, as both scores are relatively close to each other) by running our sweep 10 times with different configurations; a rough sketch of the sweep setup follows.
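I won't reproduce the exact sweep setup here, but a W&B sweep over these hyperparameters can be wired up roughly as follows. The search method, ranges, and metric name below are illustrative assumptions rather than the values from the original sweep; train and CONFIG are the function and base configuration from the previous section.

import wandb

# Illustrative sweep configuration: maximize average Google BLEU over 10 runs
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/avg_google_bleu', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {'min': 1e-5, 'max': 1e-3},
        'num_train_epochs': {'values': [5, 10, 15]},
        'per_device_train_batch_size': {'values': [4, 8, 16]},
    },
}

def sweep_train():
    # Each sweep run reads its sampled hyperparameters from wandb.config and
    # copies them into the base CONFIG before training. train() calls wandb.init
    # itself; inside a sweep agent this resolves to the active sweep run.
    run = wandb.init(project='pokemon-cards')
    for key, value in dict(run.config).items():
        setattr(CONFIG, key, value)
    train(CONFIG)

sweep_id = wandb.sweep(sweep_config, project='pokemon-cards')
wandb.agent(sweep_id, function=sweep_train, count=10)

Here are the results from the sweep: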

[W&B sweep panel: sweep lyr5qvzh, 10 runs]

These results indicate that a configuration with learning_rate=0.00004536, num_train_epochs=15, and per_device_train_batch_size=4 achieves the best performance. So, let's use that configuration to train a new model and evaluate it on the evaluation dataset.

[W&B panels: validation metrics and predictions table for the best sweep configuration]

The above plots and table contain validation results from the best-performing model. Here, the BERTScore is quite high compared to the Google BLEU and BLEU scores. Recall that BERTScore allows for a more semantic comparison, whereas Google BLEU and BLEU are syntactic. This implies that the model generates text with a similar meaning to the ground truth, but does not reproduce the exact wording. Additionally, the model gets the Pokémon type correct most of the time, and often even the rarity of the card. However, the model usually does not get the Pokémon itself correct (to its credit, at least it is not generating non-existent Pokémon). There is one case where it gets the Pokémon name correct (row 66):
  • Ground truth text: A Basic Pokemon Card of type Lightning with the title Pikachu and 60 HP of rarity Uncommon from the set Legendary Treasures and the flavor text: It occasionally uses an electric shock to recharge a fellow Pikachu that is in a weakened state. It has the attack Thundershock with the cost Lightning, the energy cost 1 and the damage of 10 with the description: Flip a coin. If heads, the Defending Pokemon is now Paralyzed. It has the attack Tail Whap with the cost Colorless, Colorless, the energy cost 2 and the damage of 20. It has weakness against Fighting 2.
  • Predicted text: A Basic Pokemon Card of type Lightning with the title Pikachu and 60 HP of rarity Common from the set Furious Fists and the flavor text: It generates electricity from its tail and uses it to generate shock waves. It has the attack Charge with the cost Colorless, the energy cost 1 and the damage of 10. It has the attack Tail Slap with the cost Lightning, Colorless, the energy cost 2 and the damage of 20 with the description: Flip a coin. If heads, the Defending Pokemon is now Paralyzed. It has weakness against Fighting 2.
This was rather surprising given that it failed to get other Pokémon correct! Looking at the training dataset, however, we see that it contains 11 Pikachu cards. Thus, it's possible that this was enough for the model to learn what a Pikachu is.

Results for Final Model


[W&B panels: evaluation metrics and predictions table for the final model]

We see that all metrics over the entire evaluation dataset are close to the validation metrics, implying that the validation data was a good representation of the evaluation data. Additionally, our model did not perform substantially worse on the evaluation data, implying that the model was able to generalize well to new, unseen Pokémon card sets. Similar to what we saw in the validation results, our model was able to almost correctly name the one Pikachu card in the evaluation dataset (row 55):
  • Ground truth text: A VMAX Pokemon Card of type Lightning with the title Flying Pikachu VMAX and 310 HP of rarity VM evolved from Flying Pikachu V from the set Celebrations. It has the attack Max Balloon with the cost Lightning, Colorless, Colorless, the energy cost 3 and the damage of 160 with the description: During your opponent's next turn, prevent all damage done to this Pokemon by attacks from Basic Pokemon. It has weakness against Lightning 2. It has resistance against Fighting -30.
  • Predicted text: A VMAX Pokemon Card of type Lightning with the title Pikachu VMAX and 320 HP of rarity Promo evolved from Pikachu V from the set SM Black Star Promos. It has the attack Lightning Storm with the cost Lightning, Colorless, the energy cost 2 and the damage of 50+ with the description: This attack does 50 more damage for each Lightning Energy attached to Pikachu VMAX. It has the attack Max Pikachu VMAX with the cost Lightning, Colorless, Colorless, Colorless, the energy cost 4 and the damage of 180 with the description: Discard all Lightning Energy from this Pokemon. It has weakness against Fighting 2.

Bonus: Testing Image Captioner against Physical Cards

The cards provided in the dataset are extracted from an API and are not physical cards. So, let's see if our image captioner can handle actual, physical cards. Physical cards are what we will see in practice, so making sure we can caption them is important.
To see what captions we generate from physical cards, I bought two booster packs of Pokémon cards from my local game store and took photos of each card using my phone. Please note: You do not need to go and buy cards as they can be rather expensive. This is something I did for fun.
Next, I created a Gradio demo of the image captioner (see https://github.com/Teravolt/pokemon-cards-image-captioning/blob/main/app.py for the demo). You can run the demo by running python app.py. I then took the photos and passed them to the captioner.
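Here's roughly what such a Gradio app looks like. This is a minimal sketch rather than the repo's app.py, and it loads the base checkpoint through a transformers pipeline as a placeholder for the fine-tuned model.

import gradio as gr
from transformers import pipeline

# Placeholder captioner: swap in the path to the fine-tuned checkpoint
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def caption_card(image):
    # The pipeline returns a list of dicts with a 'generated_text' field
    return captioner(image)[0]["generated_text"]

demo = gr.Interface(
    fn=caption_card,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="Pokemon Card Captioner",
)

if __name__ == "__main__":
    demo.launch()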
Below are some results from captioning these physical cards.

We can see that the model gets the Pokémon type correct, but does not get the rest of the content correct. Interestingly, this is a card of a Pikachu, and we know the model can get the name of a Pikachu card mostly correct. However, that is not the case here. I suspect the model is having trouble because the card is far from the camera and does not take up the entire image (there is a desk in the background).
For the next two physical cards, I bought one Silver Tempest and one Scarlet and Violet booster pack and selected one card from each. Recall that this dataset does not have any cards from Scarlet and Violet, so the card from that booster pack should be completely new to the model. Below are the results:
Caption of the Pokémon card Braixen from the Silver Tempest Booster Pack
Caption of the Pokémon card Varoom from the Scarlet and Violet Booster Pack
Unfortunately, the model did not get the Pokémon, rarity, or Pokémon type correct for these cards. :(

Conclusions

In this post, we fine-tuned an image-to-text model for captioning Pokémon cards, leveraging several W&B technologies for this, including Sweeps and Tables. We evaluated our model on held-out evaluation data and real, physical Pokémon cards. While the model was able to generalize to the evaluation data, it did not generalize well to physical cards.
Going back to our original question, "How well can neural models caption non-realistic images?", our initial set of experiments shows that they can do so decently well, but there is still room for improvement, particularly for captioning physical cards. Below are several ideas I came up with for next steps:
  1. (Near Term) Train on more data. We only trained on a 1K subset of the full 13K dataset.
  2. (Near Term) Generalize model to physical Pokémon cards
  3. (Near Term) Evaluate Optical Character Recognition (OCR) to text generation
  4. (Long Term) Study how to use other techniques to fine-tune image to text generation (e.g., Reinforcement Learning).
  5. (Long Term) Train model to generalize across different trading card games (e.g., Magic: The Gathering and Yu-Gi-Oh).
  6. (Long Term) Generalize model to caption Japanese Pokémon cards
I hope you enjoyed this post! If you have any questions or comments feel free to comment below! Thank you for reading!