How to Fine-Tune BERT for Text Classification

A code-first, reader-friendly kickstart to fine-tuning BERT for text classification, tf.data and tf.Hub
Created on January 31|Last edited on January 31
This report is a translation of "How to Fine-Tune BERT for Text Classification" by akshay uppal.
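Sections:
What is BERT?
Setting up BERT for text classification
Let's get the dataset
Let's explore
Taming the data
Let's BERT: get the pre-trained BERT model from TensorFlow Hub
Let the data flow: creating the final input pipeline using tf.data
Creating, training, and tracking our BERT classification model
Some training metrics and graphs
Saving and versioning the model
BERT text classification summary and code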
﻿
﻿
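What is BERT?
Bidirectional Encoder Representations from Transformers, better known as BERT, is a landmark paper from Google that advanced the state of the art on a range of NLP tasks and became a stepping stone for many other architectures.
It is fair to say that BERT set a new direction for the entire field. It shows the clear benefit of taking a model pre-trained on a huge dataset and transferring what it learned to downstream tasks, whatever those tasks are.
In this report, we look at how to use BERT for text classification and provide plenty of code and examples to get you started. If you would like to check the primary source yourself, a link to the annotated paper is here.
BERT classification model
Setting up BERT for text classification
First, install TensorFlow and the TensorFlow Model Garden: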
import tensorflow as tf
print(tf.version.VERSION)
!git clone --depth 1 -b v2.4.0 https://github.com/tensorflow/models.git
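We also clone the TensorFlow models GitHub repository. A few things to note:
--depth 1: while cloning, Git fetches only the latest copy of the relevant files, which saves a lot of space and time.
-b clones only the specified branch.
Make sure the branch matches your TensorFlow 2.x version.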
# install requirements to use tensorflow/models repository
!pip install -Uqr models/official/requirements.txt
# you may have to restart the runtime afterwards, also ignore any ERRORS popping up at this step
Now for a lot of imports:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import sys
sys.path.append('models')
from official.nlp.data import classifier_data_lib
from official.nlp.bert import tokenization
from official.nlp import optimization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import wandb
from wandb.keras import WandbCallback
A quick sanity check of the installed versions and dependencies:
print("TF Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")
Let's get the dataset
The dataset we use today comes from the Quora Insincere Questions Classification competition on Kaggle.
Download the training set from Kaggle, or use the link below to download train.csv from that competition:
https://archive.org/download/quora_dataset_train.csv/quora_dataset_train.csv.zip
The code below decompresses the data and reads it into a pandas DataFrame:
# TO LOAD DATA FROM ARCHIVE LINK
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
﻿
df = pd.read_csv('https://archive.org/download/quora_dataset_train.csv/quora_dataset_train.csv.zip', 
                 compression='zip',
                 low_memory=False)
print(df.shape)
df.head(10)
# label 0 == non toxic
# label 1 == toxic 
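After that, let's quickly visualize the data in a W&B Table.

Let's explore
The label distribution
Before diving into modeling, it is a good idea to understand the data you are working with. Here we look at the label distribution, the length of the data points, whether the train and test sets are well distributed, and a few other preliminary checks. First, let's look at the label distribution by running: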
print(df['target'].value_counts())
df['target'].value_counts().plot.bar()
plt.yscale('log');
plt.title('Distribution of Labels')
﻿
Label Distribution
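The same check can be sketched in plain Python, without pandas. The labels below are made-up toy values rather than the Quora data; the sketch only shows how the per-class counts, shares, and an imbalance ratio are derived.

```python
from collections import Counter

# Toy labels (hypothetical): 0 = sincere, 1 = insincere.
labels = [0] * 94 + [1] * 6

# Per-class counts, the same idea as df['target'].value_counts().
counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Ratio of the majority class to the minority class.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)
print(shares)
print(imbalance_ratio)
```

A large ratio is the reason the plot above uses a log scale on the y-axis: on a linear scale the minority bar would be barely visible.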
									
Word length and character length
Now let's run a few lines of code to understand the text data we are working with.
print('Average word length of questions in dataset is {0:.0f}.'.format(np.mean(df['question_text'].apply(lambda x: len(x.split())))))
print('Max word length of questions in dataset is {0:.0f}.'.format(np.max(df['question_text'].apply(lambda x: len(x.split())))))
print('Average character length of questions in dataset is {0:.0f}.'.format(np.mean(df['question_text'].apply(lambda x: len(x)))))
﻿
﻿
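Preparing training and test data for the BERT text classification task
A few notes on our approach here:
Training on the whole dataset takes a long time, so we use only a small portion of the data. You are of course free to include more data by changing train_size.
The dataset is heavily imbalanced, so we keep the same distribution in both the train and test sets by stratifying on the labels. In this section we analyze the data to check that this worked.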
train_df, remaining = train_test_split(df, random_state=42, train_size=0.1, stratify=df.target.values)
valid_df, _ = train_test_split(remaining, random_state=42, train_size=0.01, stratify=remaining.target.values)
print(train_df.shape)
print(valid_df.shape)  
(130612, 3)
(11755, 3)
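To make the idea of stratification concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements train_test_split; the helper name stratified_split and the toy labels are assumptions made for illustration. It samples the same fraction from each class separately, so the class shares carry over to the subset.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, train_size, seed=42):
    """Return (train_idx, rest_idx), sampling the same fraction from every class."""
    rng = random.Random(seed)

    # Group example indices by their label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, rest_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)
        train_idx.extend(indices[:cut])
        rest_idx.extend(indices[cut:])
    return train_idx, rest_idx

# Toy labels (hypothetical): 94% class 0, 6% class 1.
labels = [0] * 940 + [1] * 60
train_idx, rest_idx = stratified_split(labels, train_size=0.1)
train_counts = Counter(labels[i] for i in train_idx)
print(len(train_idx), train_counts)
```

With a plain random split, a 10% sample of a rare class can drift noticeably from its true share; sampling per class removes that source of variation.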
Getting the word and character lengths of the sampled sets
print("FOR TRAIN SET\n")
print('Average word length of questions in train set is {0:.0f}.'.format(np.mean(train_df['question_text'].apply(lambda x: len(x.split())))))
print('Max word length of questions in train set is {0:.0f}.'.format(np.max(train_df['question_text'].apply(lambda x: len(x.split())))))
print('Average character length of questions in train set is {0:.0f}.'.format(np.mean(train_df['question_text'].apply(lambda x: len(x)))))
print('Label Distribution in train set is \n{}.'.format(train_df['target'].value_counts()))
print("\n\nFOR VALIDATION SET\n")
print('Average word length of questions in valid set is {0:.0f}.'.format(np.mean(valid_df['question_text'].apply(lambda x: len(x.split())))))
print('Max word length of questions in valid set is {0:.0f}.'.format(np.max(valid_df['question_text'].apply(lambda x: len(x.split())))))
print('Average character length of questions in valid set is {0:.0f}.'.format(np.mean(valid_df['question_text'].apply(lambda x: len(x)))))
print('Label Distribution in validation set is \n{}.'.format(valid_df['target'].value_counts()))
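In short, the train and validation sets look similar in terms of class imbalance and the range of question text lengths.

Analyzing the distribution of question text length in words
# TRAIN SET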
train_df['question_text'].apply(lambda x: len(x.split())).plot(kind='hist');
plt.yscale('log');
plt.title('Distribution of question text length in words')
﻿
﻿
# VALIDATION SET
valid_df['question_text'].apply(lambda x: len(x.split())).plot(kind='hist');
plt.yscale('log');
plt.title('Distribution of question text length in words')
﻿
﻿
﻿
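Analyzing the distribution of question text length in characters
Another thing to check when digging into the train and validation sets is whether the question text lengths are roughly the same across the two. Having roughly similar distributions here is generally a sensible way to guard against bias or overfitting in the model.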
# TRAIN SET
train_df['question_text'].apply(lambda x: len(x)).plot(kind='hist');
plt.yscale('log');
plt.title('Distribution of question text length in characters')
﻿
﻿
# VALIDATION SET
valid_df['question_text'].apply(lambda x: len(x)).plot(kind='hist');
plt.yscale('log');
plt.title('Distribution of question text length in characters')
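Comparing two histograms by eye works, but the comparison can also be reduced to a single number. The sketch below is an addition for illustration, not part of the original workflow: it bins word counts and computes the total variation distance between the two binned distributions. The helper names and the toy sentences are assumptions.

```python
from collections import Counter

def length_histogram(texts, bin_width=5):
    """Share of texts falling into each word-count bin."""
    bins = Counter(len(t.split()) // bin_width for t in texts)
    total = sum(bins.values())
    return {b: n / total for b, n in bins.items()}

def total_variation(p, q):
    """0.0 for identical distributions, 1.0 for completely disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Toy questions (hypothetical), with matching length profiles.
train_texts = ["a b c", "a b c d e f", "a b"]
valid_texts = ["x y z", "x y z w v u", "x y"]
same = total_variation(length_histogram(train_texts), length_histogram(valid_texts))

# A set made only of long questions shares no bins with a set of short ones.
long_texts = ["one two three four five six seven eight nine ten eleven"] * 3
short_texts = ["one two"] * 3
different = total_variation(length_histogram(long_texts), length_histogram(short_texts))

print(same, different)
```

A value close to 0 on the real train and validation samples would support the visual impression from the plots.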
﻿
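Indeed, the distributions of question length in both words and characters are very similar across the two sets, so this looks like a good train/test split.

Taming the data
Next, we create the datasets and preprocess them on the CPU.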
with tf.device('/cpu:0'):
    train_data = tf.data.Dataset.from_tensor_slices((train_df['question_text'].values, train_df['target'].values))
    valid_data = tf.data.Dataset.from_tensor_slices((valid_df['question_text'].values, valid_df['target'].values))
    # lets look at 3 samples from train set
    for text,label in train_data.take(3):
        print(text)
        print(label)
﻿
print(len(train_data))
print(len(valid_data))
    130612
    11755
Now, let's BERT.
Let's BERT: get the pre-trained BERT model from TensorFlow Hub
We will use the uncased BERT available on TF Hub.
To prepare text for the BERT layer, we first need to tokenize the words. The tokenizer ships as a model asset and also handles lowercasing for us.
We set all parameters in a dictionary so that they are easy to change here if needed:
# Setting some parameters
﻿
config = {'label_list' : [0, 1], # Label categories
          'max_seq_length' : 128, # maximum length of (token) input sequences
          'train_batch_size' : 32,
          'learning_rate': 2e-5,
          'epochs':5,
          'optimizer': 'adam',
          'dropout': 0.5,
          'train_samples': len(train_data),
          'valid_samples': len(valid_data),
          'train_split':0.1,
          'valid_split': 0.01
         }
﻿
Get the BERT layer and tokenizer:
# All details here: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2
﻿
bert_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2',
                            trainable=True)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy() # checks if the bert layer we are using is uncased or not
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
﻿
Checking a few training samples and their tokenized IDs
input_string = "hello world, it is a wonderful day for learning"
print(tokenizer.wordpiece_tokenizer.tokenize(input_string))
print(tokenizer.convert_tokens_to_ids(tokenizer.wordpiece_tokenizer.tokenize(input_string)))
    ['hello', 'world', '##,', 'it', 'is', 'a', 'wonderful', 'day', 'for', 'learning']
    [7592, 2088, 29623, 2009, 2003, 1037, 6919, 2154, 2005, 4083]﻿
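The '##,' piece in the output above appears because the snippet calls the WordPiece tokenizer directly on whitespace-separated words, so "world," reaches it with the comma still attached. WordPiece splits a word greedily, longest match first, and marks continuation pieces with a ## prefix. The following is a minimal sketch of that matching rule with a tiny made-up vocabulary; it is not the Model Garden implementation, and the function name and vocabulary are assumptions.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one whitespace-delimited word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabulary; the real BERT vocabulary has about 30k entries.
toy_vocab = {"hello", "world", "##,", "learn", "##ing"}
for w in "hello world, learning".split():
    print(w, wordpiece(w, toy_vocab))
```

In the real vocabulary "learning" is a single token, as the output above shows; the toy vocabulary omits it only to demonstrate a split.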
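Let's prep that data: tokenizing and preprocessing text for BERT
Each row of the dataset consists of the question text and its label. Data preprocessing consists of transforming the text into BERT input features:
Input word IDs: the output of the tokenizer, which converts each sentence into a sequence of token IDs.
Input mask: because we pad every sequence to 128 (the maximum sequence length), we need a mask so that the padding does not interfere with the real text tokens. The mask holds 1 for real tokens and 0 for padding tokens, so only the real tokens are attended to.
Segment IDs: for our text classification task there is only one sequence, so segment_ids / input_type_ids is essentially a vector of 0s.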
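The three features can be illustrated with a small pure-Python sketch. This is not the classifier_data_lib code; build_features is a hypothetical helper, and the special-token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD]) are assumed to match the uncased English BERT vocabulary.

```python
def build_features(token_ids, max_seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Build (input_word_ids, input_mask, input_type_ids) for one single-sentence example."""
    # Reserve two positions for [CLS] and [SEP], truncating the text if needed.
    token_ids = token_ids[: max_seq_length - 2]
    input_word_ids = [cls_id] + token_ids + [sep_id]

    # 1 marks a real token, 0 marks padding.
    input_mask = [1] * len(input_word_ids)
    padding = max_seq_length - len(input_word_ids)
    input_word_ids += [pad_id] * padding
    input_mask += [0] * padding

    # A single sentence means every position belongs to segment 0.
    input_type_ids = [0] * max_seq_length
    return input_word_ids, input_mask, input_type_ids

# 7592 and 2088 are the IDs printed above for "hello" and "world".
ids, mask, type_ids = build_features([7592, 2088], max_seq_length=8)
print(ids)       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 0, 0, 0, 0]
print(type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists always have length max_seq_length, which is what lets the examples be stacked into fixed-shape batches later.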
BERT was trained on two tasks:
Fill in randomly masked words in a sentence.
Given two sentences, decide whether the second one actually follows the first.
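The first of these tasks, masked language modeling, can be sketched as follows. This is a simplified illustration: the BERT paper selects about 15% of tokens and replaces most, but not all, of the selected ones with [MASK], whereas this sketch always uses [MASK]. The function name, the seed, and the higher masking rate in the demo are assumptions chosen to keep the example short.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace randomly chosen tokens with [MASK] and remember the originals."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the model is trained to predict this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "hello world it is a wonderful day for learning".split()
# A higher rate than 15% so that a short sentence is likely to get some masks.
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

During pre-training the loss is computed only at the masked positions, which forces the model to use context from both the left and the right of each gap.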
# This provides a function to convert a row to input features and a label.
# It uses classifier_data_lib, a module from the TensorFlow Model Garden we installed earlier.
﻿
def create_feature(text, label, label_list=config['label_list'], max_seq_length=config['max_seq_length'], tokenizer=tokenizer):
    """
    converts the datapoint into usable features for BERT using the classifier_data_lib
﻿
    Parameters:
    text: Input text string
    label: label associated with the text
    label_list: (list) all possible labels
    max_seq_length: (int) maximum sequence length set for bert
    tokenizer: the tokenizer object instantiated by the files in model assets
﻿
    Returns:
    feature.input_ids: The token ids for the input text string
    feature.input_mask: The padding mask generated
    feature.segment_ids: essentially here a vector of 0s since classification
    feature.label_id: the corresponding label id from label_list [0, 1] here

    """

    # since we only have 1 sentence for classification purposes, text_b is None
    example = classifier_data_lib.InputExample(guid = None,
                                            text_a = text.numpy(), 
                                            text_b = None, 
                                            label = label.numpy())
    # since only 1 example, the index=0
    feature = classifier_data_lib.convert_single_example(0, example, label_list,
                                    max_seq_length, tokenizer)
﻿
    return (feature.input_ids, feature.input_mask, feature.segment_ids, feature.label_id)
﻿
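We need to apply this function to each element of the dataset using Dataset.map. Dataset.map runs in graph mode, and graph tensors do not have a value.
In graph mode you can only use TensorFlow ops and functions.
So we cannot .map this function directly; we need to wrap it in tf.py_function. tf.py_function passes regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function.
Wrapping the Python function into a TensorFlow op for eager execution
def create_feature_map(text, label):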
    """
    A tensorflow function wrapper to apply the transformation on the dataset.
    Parameters:
    Text: the input text string.
    label: the classification ground truth label associated with the input string
﻿
    Returns:
    A tuple of a dictionary and a corresponding label_id with it. The dictionary 
    contains the input_word_ids, input_mask, input_type_ids  
    """
﻿
    input_ids, input_mask, segment_ids, label_id = tf.py_function(create_feature, inp=[text, label], 
                                Tout=[tf.int32, tf.int32, tf.int32, tf.int32])
    max_seq_length = config['max_seq_length']
﻿
    # py_func doesn't set the shape of the returned tensors.
    input_ids.set_shape([max_seq_length])
    input_mask.set_shape([max_seq_length])
    segment_ids.set_shape([max_seq_length])
    label_id.set_shape([])
﻿
    x = {
        'input_word_ids': input_ids,
        'input_mask': input_mask,
        'input_type_ids': segment_ids
    }
    return (x, label_id)
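The final data point passed to the model is a dictionary as x, plus the label. The dictionary keys must match the names of the model's input layers.
Let the data flow: creating the final input pipeline using tf.data

Applying the transformation to the train and test datasets
# Now we will simply apply the transformation to our train and test datasets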
with tf.device('/cpu:0'):
  # train
  train_data = (train_data.map(create_feature_map,
                              num_parallel_calls=tf.data.experimental.AUTOTUNE)
                          .shuffle(1000)
                          .batch(32, drop_remainder=True)
                          .prefetch(tf.data.experimental.AUTOTUNE))
﻿
  # valid
  valid_data = (valid_data.map(create_feature_map, 
                               num_parallel_calls=tf.data.experimental.AUTOTUNE)
                          .batch(32, drop_remainder=True)
                          .prefetch(tf.data.experimental.AUTOTUNE)) 
The resulting tf.data.Datasets return (features, labels) pairs, as expected by keras.Model.fit.
# train data spec, we can finally see the input datapoint is now converted to the 
#BERT specific input tensor
train_data.element_spec
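The effect of drop_remainder=True on the number of training steps can be worked out with a short pure-Python sketch. The batches helper is a hypothetical stand-in for Dataset.batch, written only to show the rule; the sample counts are the ones printed earlier in this report.

```python
def batches(items, batch_size, drop_remainder=True):
    """Yield consecutive batches; optionally drop the last short batch."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            return
        yield batch

print([len(b) for b in batches(list(range(10)), batch_size=4)])                        # [4, 4]
print([len(b) for b in batches(list(range(10)), batch_size=4, drop_remainder=False)])  # [4, 4, 2]

# Steps per epoch for the sample sizes above, with a batch size of 32.
train_steps = 130612 // 32  # 4081 full batches, 20 examples dropped
valid_steps = 11755 // 32   # 367 full batches, 11 examples dropped
print(train_steps, valid_steps)
```

Dropping the remainder keeps every batch at exactly 32 examples, at the cost of skipping a handful of examples per epoch. Because the train set is shuffled before batching, the skipped training examples differ from epoch to epoch.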
﻿
﻿
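Creating, training, and tracking our BERT classification model
Let's model our way to glory!!!
Creating the model
There are two outputs from the BERT layer:
A pooled_output of shape [batch_size, 768] with representations for the entire input sequence.
A sequence_output of shape [batch_size, max_seq_length, 768] with representations for each input token in context.
For the classification task, we focus only on pooled_output.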
# Building the model, input ---> BERT Layer ---> Classification Head
def create_model():
    
    input_word_ids = tf.keras.layers.Input(shape=(config['max_seq_length'],),
                                           dtype=tf.int32,
                                           name="input_word_ids")

    input_mask = tf.keras.layers.Input(shape=(config['max_seq_length'],),
                                       dtype=tf.int32,
                                       name="input_mask")

    input_type_ids = tf.keras.layers.Input(shape=(config['max_seq_length'],),
                                           dtype=tf.int32,
                                           name="input_type_ids")
﻿
﻿
    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, input_type_ids])
    # for classification we only care about the pooled-output.
    # At this point we can play around with the classification head based on the 
    # downstream tasks and its complexity
﻿
    drop = tf.keras.layers.Dropout(config['dropout'])(pooled_output)
    output = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(drop)
﻿
    # inputs coming from the function
    model = tf.keras.Model(
      inputs={
        'input_word_ids': input_word_ids,
        'input_mask': input_mask,
        'input_type_ids': input_type_ids}, 
      outputs=output)
﻿
    return model
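What the classification head computes for a single example can be sketched in plain Python. The four-dimensional vector stands in for the 768-dimensional pooled_output, and the weights and bias are made-up numbers rather than learned values. Dropout is left out because Keras applies it only during training.

```python
import math

def classification_head(pooled_output, weights, bias):
    """Dense(1, activation='sigmoid') applied to one pooled vector."""
    logit = sum(x * w for x, w in zip(pooled_output, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

pooled = [0.5, -1.0, 2.0, 0.0]    # stand-in for the 768-d pooled_output
weights = [0.2, 0.1, 0.4, -0.3]   # hypothetical weights
prob = classification_head(pooled, weights, bias=-0.3)
print(round(prob, 4))  # 0.6225
```

The output is a single probability for class 1, which is why the labels can stay as plain 0/1 integers and the loss can be binary cross-entropy.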
﻿
Training the model
# Calling the create model function to get the keras based functional model
model = create_model()
﻿
# using adam with a lr of 2*(10^-5), loss as binary cross entropy as only 
# 2 classes and similarly binary accuracy
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=config['learning_rate']),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy(),
                       tf.keras.metrics.PrecisionAtRecall(0.5),
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
#model.summary()
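The loss and two of the metrics passed to compile can be written out by hand to show what they measure. This is a plain-Python sketch on made-up predictions, not the Keras implementation; the function names and the toy values are assumptions.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall for class 1 at a fixed decision threshold."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels and predicted probabilities (hypothetical).
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.7, 0.3, 0.8, 0.4]

loss = binary_cross_entropy(y_true, y_prob)
precision, recall = precision_recall(y_true, y_prob)
print(loss, precision, recall)

# A model that always predicts class 0 is right on 4 of 6 examples but finds no positives.
always_zero = precision_recall(y_true, [0.0] * len(y_true))
print(always_zero)  # (0.0, 0.0)
```

On a dataset as imbalanced as this one, accuracy alone can look good for a model that never predicts the minority class, which is why precision and recall are tracked alongside it.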
﻿
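Model summary
Model architecture summary
One drawback of TF Hub is that it imports the whole module as a single Keras layer, so the model summary does not show the individual layers inside it or their parameters.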
tf.keras.utils.plot_model(model=model, show_shapes=True, dpi=76, )
﻿
As the official TF Hub page states, "All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice," so we train the whole model without freezing anything.
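To get a feel for how many parameters "the whole model" means, the count for a standard BERT-base encoder can be derived from its architecture. The constants below (vocabulary of 30522, 512 positions, 2 segment types, intermediate size 3072) are assumed to be the usual values for the uncased L-12 H-768 A-12 model, and the Hub module may report a slightly different total if it carries extra variables.

```python
def bert_encoder_params(vocab_size=30522, hidden=768, layers=12,
                        intermediate=3072, max_positions=512, type_vocab=2):
    """Parameter count of a BERT-style encoder, including the pooler."""
    layer_norm = 2 * hidden  # scale and offset

    # Word, position and segment embeddings, followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections (weights plus biases), then a layer norm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (hidden -> intermediate -> hidden), then a layer norm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Dense layer that produces pooled_output.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler

encoder = bert_encoder_params()
head = 768 + 1  # Dense(1): one weight per pooled feature plus a bias
print(encoder)         # 109482240
print(encoder + head)  # 109483009
```

Roughly 110 million parameters are updated in every step, which helps explain why a small learning rate such as 2e-5 and only a few epochs are used for fine-tuning.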
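Experiment tracking
If you are reading this, you probably know Weights & Biases well already, but if not, read on :)
To start tracking experiments, we create a "run" in W&B.
wandb.init(): initializes a run with basic project information parameters:
project: the project name; creates a new project tab where all experiments for this project are tracked
config: a dictionary of all the parameters and hyperparameters we want to track
group: optional, but helps to group runs by different parameters later
job_type: describes the job type, which helps to group different experiments later, e.g. "train", "evaluate"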
# Update CONFIG dict with the name of the model.
config['model_name'] = 'BERT_EN_UNCASED'
print('Training configuration: ', config)
﻿
# Initialize W&B run
run = wandb.init(project='Finetune-BERT-Text-Classification', 
                 config=config,
                 group='BERT_EN_UNCASED', 
                 job_type='train')
Now, to log all the different metrics, we use the simple callback that W&B provides.
WandbCallback(): https://docs.wandb.ai/guides/integrations/keras
Yes, it is as simple as adding a callback :D
# Train model
# setting low epochs as It starts to overfit with this limited data, please feel free to change
epochs = config['epochs']
history = model.fit(train_data,
                    validation_data=valid_data,
                    epochs=epochs,
                    verbose=1,
                    callbacks = [WandbCallback()])
run.finish()
﻿
﻿
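Some training metrics and graphs

Let's evaluate!
Let's run an evaluation on the validation set and log the scores with W&B.
wandb.log(): logs a dictionary of scalars (metrics such as accuracy and loss) and other types of wandb objects. Here we pass the evaluation dictionary as-is and log it.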
# Initialize a new run for the evaluation-job
run = wandb.init(project='Finetune-BERT-Text-Classification', 
                 config=config,
                 group='BERT_EN_UNCASED', 
                 job_type='evaluate')
﻿
﻿
﻿
# Model Evaluation on validation set
evaluation_results = model.evaluate(valid_data,return_dict=True)
﻿
# Log scores using wandb.log()
wandb.log(evaluation_results)
﻿
# Finish the run
run.finish()
﻿
Saving and versioning the model
Finally, let's look at how to save a reproducible model with W&B, namely with Artifacts.
W&B Artifacts
To save the model and make it easier to track different experiments, we use W&B Artifacts, which are a way to store datasets and models.
Within a run, there are three steps for creating and saving a model artifact:
Create an empty artifact with wandb.Artifact().
Add the model file to the artifact with artifact.add_file().
Call run.log_artifact() to save the artifact.
# Save model
model.save(f"{config['model_name']}.h5")
﻿
# Initialize a new W&B run for saving the model, changing the job_type
run = wandb.init(project='Finetune-BERT-Text-Classification', 
                 config=config,
                 group='BERT_EN_UNCASED', 
                 job_type='save')
﻿
﻿
# Save model as Model Artifact
artifact = wandb.Artifact(name=f"{config['model_name']}", type='model')
artifact.add_file(f"{config['model_name']}.h5")
run.log_artifact(artifact)
﻿
# Finish W&B run
run.finish()
﻿
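A quick tour of the W&B dashboard
Things to note:
Grouping of experiments and runs.
Visualizations of all training logs and metrics.
Visualizations of system metrics, which can be useful when training on cloud instances or physical GPU machines.
Hyperparameter tracking in tabular form.
Artifacts: model versioning and storage.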
﻿
﻿
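BERT text classification summary and code
I hope this hands-on tutorial was useful, and that even readers already familiar with the material picked up something new.
The code for this post is available here.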
﻿