On the visual token added to linguistic tokens in VLBertEmbeddings class #10

Hello.

I have a question about the VLBertEmbeddings class.

In its forward function, a global image feature is added to the linguistic tokens.
The last token in the vision sequence is used as the global image feature, as shown below:

https://github.com/e-bug/volta/blob/9e5202141920600d58a9c5c17519ca453795d65d/volta/embeddings.py#L271

Using the last token seems reasonable for the original VLBert (vl-bert_base.json), because its add_global_imgfeat is set to last,
but I think it should be the first token for the controlled VLBert (ctrl_vl-bert_base.json), whose add_global_imgfeat is set to first.
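
To make the question concrete, here is a minimal sketch of the behaviour I would have expected. It is not the code from the repository: the function name select_global_imgfeat and the toy vis_seq are hypothetical, and I am assuming that add_global_imgfeat takes the values "first" or "last" as in the two config files.

```python
from typing import List, Optional, Sequence


def select_global_imgfeat(
    vis_seq: Sequence[List[float]],
    add_global_imgfeat: Optional[str],
) -> List[float]:
    """Pick the token of a vision sequence that holds the global image feature.

    Hypothetical helper for illustration only; `add_global_imgfeat` is assumed
    to be "first" or "last", mirroring the config option of the same name.
    """
    if add_global_imgfeat == "first":
        # Global feature was prepended to the region features.
        return vis_seq[0]
    if add_global_imgfeat == "last":
        # Global feature was appended after the region features.
        return vis_seq[-1]
    raise ValueError(f"unsupported add_global_imgfeat: {add_global_imgfeat!r}")


# Toy vision sequence of three tokens with hidden size 2.
vis_seq = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]]
first_feat = select_global_imgfeat(vis_seq, "first")
last_feat = select_global_imgfeat(vis_seq, "last")
```

With tensors of shape [batch, seq_len, hidden], the same choice would amount to indexing position 0 or position -1 along the sequence dimension, depending on the config.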

Is there any reason why the last token is always used in this class?

I'm sorry if I have misunderstood how the embeddings classes work.

Thanks.
