Linking Visual and Textual Entity mentions with Background Knowledge / Dost, Shahi. - (2021 May 26).

DOST, SHAHI
2021

Abstract

“A picture is worth a thousand words,” the adage reads. However, pictures cannot replace words in their ability to convey clear, (mostly) unambiguous, and concise knowledge efficiently. Images and text indeed reveal different and complementary information which, when combined, yields more than the sum of what each medium contains on its own. Visual and textual information can be combined by linking the entities mentioned in the text with those shown in the pictures. To further integrate this with an agent's background knowledge, an additional step is necessary: either finding the entities in the agent's knowledge base that correspond to those mentioned in the text or shown in the picture, or extending the knowledge base with the newly discovered entities. We call this complex task Visual-Textual-Knowledge Entity Linking (VTKEL). In this thesis, after providing a precise definition of the VTKEL task, we present two datasets, VTKEL1k* and VTKEL30k, consisting of images and their corresponding captions, in which both the visual and the textual mentions are annotated with the corresponding entities, typed according to the YAGO ontology. The datasets can be used for training and evaluating algorithms for the VTKEL task. We then developed an unsupervised baseline algorithm, VT-LinKEr (Visual-Textual-Knowledge-Entity Linker), for the VTKEL task and evaluated its performance on both datasets. We also developed a supervised algorithm, ViTKan (Visual-Textual-Knowledge-Alignment Network). During training, ViTKan takes as input (i) an image, to which an object detector is applied to predict visual objects and their types, and (ii) the corresponding captions, to which the knowledge-graph extraction tool PIKES is applied to recognize textual entity mentions and link them to the YAGO knowledge base for background-knowledge extraction. We trained the ViTKan model on the visual, textual, and ontological features of the VTKEL1k* dataset. At prediction time, ViTKan aligns (maps) the visual entities in an image with the textual entities in its captions with high accuracy. The evaluation of ViTKan on the VTKEL1k* and VTKEL30k datasets shows improved results with respect to state-of-the-art methods on the task of grounding (localizing) textual entities in images.
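To make the structure of the task concrete, the following is a minimal, illustrative Python sketch of the alignment step at the core of VTKEL, assuming the visual side has already been produced by an object detector and the textual side by PIKES with YAGO linking. All names in the sketch (VisualEntity, TextualEntity, type_compatible, align_entities) are hypothetical stand-ins, and the greedy type-matching rule is only a baseline-style heuristic, not the actual VT-LinKEr or ViTKan implementation.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class VisualEntity:
        box: Tuple[int, int, int, int]   # bounding box (x1, y1, x2, y2) in the image
        yago_type: str                   # ontological type predicted for the detected object

    @dataclass
    class TextualEntity:
        mention: str                     # surface form of the mention in the caption
        yago_type: str                   # ontological type of the linked entity
        kb_entity: Optional[str]         # YAGO entity IRI, or None for a newly discovered entity

    def type_compatible(v: VisualEntity, t: TextualEntity) -> bool:
        # A real system would check subsumption in the YAGO type hierarchy;
        # this sketch only compares type labels directly.
        return v.yago_type == t.yago_type

    def align_entities(visual: List[VisualEntity],
                       textual: List[TextualEntity]):
        # Greedy one-to-one alignment of textual mentions to detected objects by type.
        # ViTKan instead learns this mapping from visual, textual, and ontological features.
        alignments, used = [], set()
        for t in textual:
            for i, v in enumerate(visual):
                if i not in used and type_compatible(v, t):
                    alignments.append((t.mention, v.box, t.kb_entity))
                    used.add(i)
                    break
        return alignments

    # Illustrative usage with hand-written inputs standing in for the outputs of an
    # object detector (visual side) and PIKES + YAGO linking (textual side).
    visual = [VisualEntity(box=(10, 20, 200, 180), yago_type="yago:Dog"),
              VisualEntity(box=(150, 30, 320, 240), yago_type="yago:Person")]
    textual = [TextualEntity(mention="a man", yago_type="yago:Person", kb_entity=None),
               TextualEntity(mention="his dog", yago_type="yago:Dog", kb_entity=None)]
    print(align_entities(visual, textual))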
Linking Visual and Textual Entity mentions with Background Knowledge.
26 May 2021
Files in this record:

File: tesi_definitiva_Shahi_Dost.pdf
Description: tesi_definitiva_Shahi_Dost
Type: Doctoral thesis
Access: open access
Size: 4.73 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3500980