BERT-ranker baselines for CRR
The experimental results in this report were generated using transformer-rankers. The task is to rank responses for a given conversational context. We use three benchmarks for this task: MANTiS, MSDialog and Ubuntu DSTC8.
Conversation Response Ranking
The task of conversation response ranking concerns retrieving the best response given the dialogue context. Formally, let $D=\{(U_i, R_i, Y_i)\}_{i=1}^{N}$ be a dataset consisting of $N$ triplets: dialogue context, response candidates and response relevance labels. The dialogue context $U_i$ is composed of the previous utterances $\{u^1, u^2, ..., u^{\tau}\}$ at turn $\tau$ of the dialogue. The candidate responses $R_i = \{r^1, r^2, ..., r^k\}$ are either ground-truth responses or negative sampled candidates (using BM25 as the negative sampler), indicated by the relevance labels $Y_i = \{y^1, y^2, ..., y^k\}$. The task is then to learn a ranking function $f(\cdot)$ that is able to generate a ranked list for the set of candidate responses $R_i$ based on their predicted relevance scores $f(U_i, r)$.
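As a concrete illustration (not part of transformer-rankers; the class and field names below are hypothetical), a single instance and the ranking induced by a scoring function can be sketched as:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CRRInstance:
    """One triplet (U_i, R_i, Y_i) from the dataset D."""
    context: List[str]      # previous utterances u^1 ... u^tau
    candidates: List[str]   # k candidate responses: ground truth + BM25-sampled negatives
    labels: List[int]       # relevance labels y^1 ... y^k (1 = relevant, 0 = non-relevant)

def rank(instance: CRRInstance, f: Callable[[List[str], str], float]) -> List[str]:
    """Order the candidate responses R_i by their predicted relevance scores f(U_i, r)."""
    return sorted(instance.candidates, key=lambda r: f(instance.context, r), reverse=True)
```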
BERT ranker
BERT learns the function $f(U_i, r)$ based on the representation of the [CLS] token. The input to BERT is the concatenation of the context $U_i$ and the response $r$, separated by [SEP] tokens. This is the equivalent of early adaptations of BERT for ad-hoc retrieval, transported to conversation response ranking. Formally, the input sentence to BERT is

$$concat(U_i, r) = u^1 \; | \; [UTTERANCE\_SEP] \; | \; u^2 \; | \; [TURN\_SEP] \; | \; ... \; | \; u^{\tau} \; | \; [SEP] \; | \; r,$$

where $|$ indicates the concatenation operation. The utterances from the context $U_i$ are concatenated with the special separator tokens $[UTTERANCE\_SEP]$ and $[TURN\_SEP]$, which indicate the end of an utterance and of a turn, respectively. The response $r$ is concatenated with the context using BERT's standard sentence separator $[SEP]$. We fine-tune BERT on the target conversational corpus and make predictions as follows:
$$f(U_i, r) = \sigma(FFN(BERT_{CLS}(concat(U_i, r)))),$$
where $BERT_{CLS}$ is the pooling operation that extracts the representation of the [CLS] token from the last layer and $FFN$ is a feed-forward network that outputs logits for two classes (relevant and non-relevant). We pass the logits through a softmax transformation $\sigma$, which gives us a probability of relevance. We use the cross-entropy loss for training. The learned function $f(U_i, r)$ outputs a point estimate that is used to rank all the responses in $R_i$.
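Below is a minimal sketch of such a pointwise ranker using Hugging Face transformers, assuming BERT-base and treating the extra separators as added special tokens; the class and function names are hypothetical and the actual implementation in examples/pointwise_bert_ranker.py (used in the script below) may differ in its details.

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Assumption: the utterance/turn separators are registered as extra special tokens.
UTTERANCE_SEP, TURN_SEP = "[UTTERANCE_SEP]", "[TURN_SEP]"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": [UTTERANCE_SEP, TURN_SEP]})


class PointwiseBertRanker(torch.nn.Module):
    """Scores a (context, response) pair: BERT_CLS pooling -> FFN -> two-class logits."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.bert.resize_token_embeddings(len(tokenizer))  # account for the added separators
        self.ffn = torch.nn.Linear(self.bert.config.hidden_size, 2)  # non-relevant / relevant

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]  # BERT_CLS: [CLS] representation from the last layer
        return self.ffn(cls)               # logits; training uses torch.nn.CrossEntropyLoss


def score(model, context_utterances, response, max_seq_len=512):
    """f(U_i, r): probability that response r is relevant to context U_i (simplified: no TURN_SEP)."""
    context = f" {UTTERANCE_SEP} ".join(context_utterances)
    enc = tokenizer(context, response, truncation=True,
                    max_length=max_seq_len, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc)
    return torch.softmax(logits, dim=-1)[0, 1].item()  # sigma over the logits, take P(relevant)
```

At inference time, the candidate responses in $R_i$ are simply sorted by this score.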
Reproduce Results
The script we use is the following:
export CUDA_VISIBLE_DEVICES=3,4,5,6,7
source /ssd/gustavo/transformer_rankers/env/bin/activate
REPO_DIR=/ssd/gustavo/transformer_rankers
ANSERINI_FOLDER=/ssd/gustavo/anserini/
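# Validation frequency (in training steps) and the number of training instances used per task.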
VALIDATE_EVERY_X_STEPS=100
TRAIN_INSTANCES=300000
WANDB_PROJECT='library-crr-bert-baseline'
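# Train and evaluate the pointwise BERT ranker on each benchmark with 5 different random seeds.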
for SEED in 1 2 3 4 5
do 
    for TASK in 'mantis' 'msdialog' 'ubuntu_dstc8'
    do
        python ../examples/pointwise_bert_ranker.py \
            --task $TASK \
            --data_folder $REPO_DIR/data/ \
            --output_dir $REPO_DIR/data/output_data/ \
            --sample_data -1 \
            --max_seq_len 512 \
            --num_validation_batches 500 \
            --validate_every_epochs -1 \
            --validate_every_steps $VALIDATE_EVERY_X_STEPS \
            --train_negative_sampler bm25 \
            --test_negative_sampler bm25 \
            --num_epochs 1 \
            --num_training_instances $TRAIN_INSTANCES \
            --train_batch_size 8 \
            --val_batch_size 8 \
            --num_ns_train 9 \
            --num_ns_eval 9 \
            --seed $SEED \
            --anserini_folder $ANSERINI_FOLDER \
            --wandb_project $WANDB_PROJECT        
    done
done