Hello.
I have a question about the VLBertEmbeddings class.
In its forward function, a global image feature is added to the linguistic tokens.
The last token in the vision sequence is used as the global image feature, as shown below (volta/volta/embeddings.py, line 271 at commit 9e52021):
    text_visual_embeddings = final_feats[:, -1].repeat(1, seq_length).view(batch_size, seq_length, -1)
Using the last token seems reasonable for the original VLBert (vl-bert_base.json), because its add_global_imgfeat is set to "last",
but I think it should be the first token for the controlled VLBert (ctrl_vl-bert_base.json), whose add_global_imgfeat is set to "first".
Is there any reason that the last token is always used in the class?
I'm sorry if I have misunderstood how the embeddings classes work.
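To make the question concrete, here is a minimal sketch of the selection I would have expected. The helper name global_feat_index is hypothetical and not part of volta, and it assumes the global feature really sits at position 0 when add_global_imgfeat is "first" and at the end when it is "last":

```python
def global_feat_index(add_global_imgfeat):
    """Hypothetical helper (not part of volta): position of the global image
    feature in the vision sequence for a given config value."""
    if add_global_imgfeat == "first":
        return 0
    if add_global_imgfeat == "last":
        return -1
    raise ValueError(f"unexpected add_global_imgfeat: {add_global_imgfeat!r}")


# Toy vision sequences: a batch of one, three tokens each, as nested lists.
# With a tensor, the equivalent indexing would be final_feats[:, idx].
final_feats = [["g", "r1", "r2"]]
idx = global_feat_index("first")
picked = [seq[idx] for seq in final_feats]
```

With something like this, the line above would index with idx instead of the hard-coded -1.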
Thanks.