This thesis studies weakly supervised learning for information extraction in two settings: (1) unimodal weakly supervised learning, where annotated texts are augmented with a large corpus of unlabeled texts, and (2) multimodal weakly supervised learning, where images or videos are augmented with texts that describe their content.
In the <b>unimodal</b> setting we find that traditional semi-supervised methods based on generative Bayesian models are not suitable for the textual domain, because the assumptions these models make are violated there. We develop an unsupervised model, the latent words language model (LWLM), that learns accurate word similarities from a large corpus of unlabeled texts. We show that this model is a good model of natural language, offering better predictive quality on unseen texts than previously proposed state-of-the-art language models. In addition, the learned word similarities can be used to automatically expand words in the annotated training set with synonyms, where the correct synonyms are chosen depending on the context. We show that this approach improves classifiers for word sense disambiguation and semantic role labeling.
<br>
The second part of this thesis discusses weakly supervised learning in a <b>multimodal</b> setting. We develop information extraction methods that extract information from texts describing an image or video, and use this information as a weak annotation of the image or video. A first model for the prediction of entities in an image uses two novel measures: the salience measure captures the importance of an entity, depending on the position of that entity in the discourse and in the sentence, while the visualness measure captures the probability that an entity can be perceived visually, extracted from the WordNet database. We show that combining these measures results in an accurate prediction of the entities present in the image. We then discuss how this model can be used to learn a mapping from names in the text to faces in the image, and to retrieve images of a certain entity.
We then turn to the automatic annotation of video. We develop a model that annotates a video with the visual verbs and their visual arguments, i.e., actions and arguments that can be observed in the video. The annotations of this system are successfully used to train a classifier that detects and classifies actions in the video. A second system annotates every scene in the video with the location of that scene. This system comprises a multimodal scene cut classifier that combines information from the text and the video, an IE algorithm that extracts possible locations from the text, and a novel way to propagate location labels from one scene to another, depending on the similarity of the scenes in the textual and visual domains.
11. Example: WSD
Soft rules:
  If “kicked”: ball = “round object”
  If “goal”: ball = “round object”
  ...
  If “dance”: ball = “formal dance”
  If “gown”: ball = “formal dance”
  ...
Machine learning methods can combine many complementary and/or contradicting rules
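As a minimal sketch of how a learner can weigh such soft, partly contradicting cues, here is a tiny Naive Bayes classifier over cue firings. The contexts, counts and sense labels are toy values invented for illustration; the slides do not prescribe this particular learner.

```python
import math
from collections import defaultdict

# soft cues for the two senses of "ball"
cues = ["kicked", "goal", "dance", "gown"]

# toy training contexts, invented for illustration
train = [("he kicked the ball toward the goal", "round object"),
         ("the goal came from a long ball", "round object"),
         ("she wore a gown to the ball", "formal dance"),
         ("the ball opened with a waltz dance", "formal dance")]

counts = defaultdict(lambda: defaultdict(int))  # counts[sense][cue]
totals = defaultdict(int)                       # training contexts per sense
for ctx, sense in train:
    totals[sense] += 1
    for c in cues:
        if c in ctx:
            counts[sense][c] += 1

def classify(context):
    best, best_lp = None, -math.inf
    for sense in totals:
        lp = math.log(totals[sense] / len(train))            # prior
        for c in cues:
            p = (counts[sense][c] + 1) / (totals[sense] + 2)  # Laplace smoothing
            lp += math.log(p if c in context else 1 - p)
        if lp > best_lp:
            best, best_lp = sense, lp
    return best

print(classify("he kicked it past the goal"))  # -> round object
```

Even though "kicked" also appears near "goal" in only half the round-object contexts, the classifier combines both weak cues into a confident decision.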
12. Supervised machine learning
Current state-of-the-art machine learning methods
Strengths:
  Successful for many tasks
  Flexible, fast development for new tasks
  Machine learning method often independent of task, language or domain
  Only some expert knowledge needed
Weaknesses:
  Manually annotated corpus needed for every new task
  Features need to be manually engineered
  High variation of language limits performance even with large training corpora
13. Solution: use unlabeled data
Unlabeled data: cheap, available for many domains and languages
Semi-supervised learning
  Optimize a single function that incorporates labeled and unlabeled data
  Violated assumptions cause results to deteriorate when more unlabeled data is added
Unsupervised learning
  First learn a model on unlabeled data, then use that model in a supervised machine learning method
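The unsupervised-then-supervised recipe can be sketched in two steps: learn word similarities from unlabeled text, then hand each word's nearest neighbor to a supervised learner as an extra feature. The snippet below uses simple context-count vectors with cosine similarity as a stand-in for the LWLM; all sentences are toy data invented for illustration.

```python
import math
from collections import Counter

# step 1: learn word similarities from unlabeled text
unlabeled = ["the cat sat on the mat", "the dog sat on the rug",
             "a cat slept on the mat", "a dog slept on the rug"]

vectors = {}  # word -> Counter of context words (window of 2 on each side)
for sent in unlabeled:
    words = sent.split()
    for i, w in enumerate(words):
        ctx = words[max(0, i - 2):i] + words[i + 1:i + 3]
        vectors.setdefault(w, Counter()).update(ctx)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def nearest(word):
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

# step 2: a supervised feature function can now add nearest(word)
# as an extra feature for each word in the labeled training set
print(nearest("cat"))  # -> dog (same contexts in the toy corpus)
```

Because "cat" and "dog" occur in identical contexts in the toy corpus, the unlabeled data alone reveals their similarity, which a supervised learner can then exploit.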
18. Latent words language model
Observed:  We hope there is an increasing need for reform
Latent:    We hope there is an increasing need for reform
           I believe this was the enormous chance of restructuring
           They think that 's no important demand to change
           You feel it are some increased potential that peace
           ...
Automatically learned synonyms
19. Latent words language model
Time to compute all possible combinations of latent words:
  ~ very, very long...
Approximation: consider only the most likely combinations:
  ~ pretty fast
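The approximation can be sketched as a pruned Viterbi search: keep only the k most likely latent words per position instead of enumerating every combination. The candidate lists and the emission and bigram probabilities below are toy values, not the model's learned parameters.

```python
import math

# toy candidate latent words per observed word, with P(observed | latent)
candidates = {
    "need":   [("need", 0.6), ("demand", 0.3), ("potential", 0.1)],
    "reform": [("reform", 0.7), ("change", 0.2), ("restructuring", 0.1)],
}
# toy bigram probabilities P(latent_i | latent_{i-1}); unseen pairs get 1e-3
bigram = {("need", "reform"): 0.4, ("demand", "change"): 0.5}

def viterbi(observed, k=2):
    # prune: keep only the k most likely latent candidates per position
    pruned = [sorted(candidates[w], key=lambda x: -x[1])[:k] for w in observed]
    best = {w: math.log(p) for w, p in pruned[0]}
    for pos in pruned[1:]:
        new = {}
        for w, p in pos:
            new[w] = max(best[prev]
                         + math.log(bigram.get((prev, w), 1e-3))
                         + math.log(p)
                         for prev in best)
        best = new
    return max(best, key=best.get)

print(viterbi(["need", "reform"]))  # -> reform
```

With a vocabulary of size V and a sentence of length n there are V^n exact combinations; pruning to k candidates per position cuts this to k^n, which is what makes inference "pretty fast".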
21. LWLM for information extraction
Word sense disambiguation:
  standard 66.32%   + cluster features 66.97%   + hidden words 67.61%
Semantic role labeling:
  [Bar chart: scores of the standard, + clusters and + hidden words systems
  when trained on 5%, 20%, 50% and 100% of the training corpus]
Latent words: help with underspecification and ambiguity
24. Annotation of entities in images
Extract entities from descriptive news text that are present in the image.
Caption: "Former President Bill Clinton, left, looks on as an honor guard
folds the U.S. flag during a graveside service for Lloyd Bentsen
in Houston, May 30, 2006. Bentsen, a former senator and
former treasury secretary, died last week at the age of 85."
Extracted entities: Bill Clinton, Lloyd Bentsen, Houston, guard, flag, service, age, ...
25. Annotation of entities in images
Assumption:
Entity is present in image if important in
descriptive text and possible to perceive visually.
Salience:
Dependent on text
Combines analysis of discourse and syntax
Visualness:
Independent of text
Extracted from semantic database
27. Salience
Is the entity important in descriptive text?
Discourse model
Important entities are referred to by other entities
and terms.
Graph models entities, coreferents and other terms
Eigenvectors find most important entities
Syntactic model
Important entities appear high in parse tree
Important entities have many children in tree
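The discourse model can be sketched as eigenvector centrality on a small entity graph, computed by power iteration. The nodes and edge weights below are invented, loosely inspired by the Bentsen caption; they are not the thesis's actual graph construction.

```python
import math

nodes = ["Bill Clinton", "Lloyd Bentsen", "flag", "service"]
# symmetric edge weights: how strongly two mentions are linked in the text
# (coreference, shared terms, etc.); all values invented for illustration
A = [[0, 2, 1, 1],
     [2, 0, 1, 2],
     [1, 1, 0, 1],
     [1, 2, 1, 0]]

# power iteration converges to the dominant eigenvector of A
v = [1.0] * len(nodes)
for _ in range(50):
    v = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

salience = dict(zip(nodes, v))
print(max(salience, key=salience.get))  # -> Lloyd Bentsen
```

The entity with the most (and strongest) links to other mentions receives the largest eigenvector component, matching the intuition that important entities are referred to by many other entities and terms.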
28. Visualness
Can the entity be perceived visually?
Similarity measure on entities in WordNet:
  s(“car”, “truck”) = 0.88     s(“house”, “building”) = 0.91
  s(“horse”, “cow”) = 0.79     s(“car”, “horse”) = 0.38
  s(“car”, “house”) = 0.40     s(“thought”, “house”) = 0.23
Visual seeds: “person”, “vehicle”, “animal”, ...
Non-visual seeds: “thought”, “power”, “air”, ...
Visualness:
  combine similarity measure and seeds
  “entities close to visual seeds will be visual”
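The combination step can be sketched as follows: score an entity by its similarity to the closest visual seed versus the closest non-visual seed. The similarity table is a toy lookup (including an invented s("car", "thought") value) and the seeds are stand-ins chosen to match it; the thesis derives the actual similarities from WordNet.

```python
# toy similarity values, loosely matching the slide; the ("car", "thought")
# entry is invented so the non-visual side is not trivially zero
sim = {("car", "truck"): 0.88, ("car", "horse"): 0.38, ("horse", "cow"): 0.79,
       ("house", "building"): 0.91, ("car", "house"): 0.40,
       ("thought", "house"): 0.23, ("car", "thought"): 0.12}

def s(a, b):
    return sim.get((a, b), sim.get((b, a), 0.0))

visual_seeds = ["truck", "horse"]      # stand-ins for "person", "vehicle", ...
nonvisual_seeds = ["thought"]          # stand-in for "thought", "power", ...

def visualness(entity):
    vis = max(s(entity, seed) for seed in visual_seeds)
    nonvis = max(s(entity, seed) for seed in nonvisual_seeds)
    return vis / (vis + nonvis) if vis + nonvis else 0.5

print(round(visualness("car"), 2))  # -> 0.88
```

An entity much closer to the visual seeds than to the non-visual ones gets a visualness score near 1, implementing "entities close to visual seeds will be visual".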
32. Scene segmentation
Segment transcript and video into scenes
  Scene cut classifier in text
  Shot cut detector in video
Transcript (scene cuts fall between the blocks):
  Shot of Buffy opening the refrigerator and taking out a carton of milk.
  Buffy sniffs the milk and puts it on the counter. In the background we
  see Joyce drinking coffee and Dawn opening a cabinet to get out a box
  of cereal. ...
  Buffy & Riley move into the living room. They sit on the sofa.
  Buffy nods in resignation. Smooch. Riley gets up.
  Cut to a shot of a bright red convertible driving down the street. Giles
  is at the wheel, Buffy beside him and Dawn in the back. Classical
  music plays on the radio.
  ...
34. Scene segmentation
Segment transcript and video into scenes
  Scene cut classifier in text
  Shot cut detector in video
Segmented scenes:
  Shot of Buffy opening the refrigerator and taking out a carton of milk. ...
  Buffy & Riley move into the living room. They sit on the sofa. ...
  Cut to a shot of a bright red convertible driving down the street. ...
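A minimal sketch of the multimodal idea, assuming each candidate boundary already has a textual cue score and a visual shot-cut score: a tiny logistic regression, trained by gradient descent, combines the two modalities. All scores, labels and hyperparameters below are invented for illustration.

```python
import math

# each row: (text cue score, visual shot-cut score, 1 = true scene cut)
# invented training data: true cuts have strong cues in BOTH modalities
data = [(0.9, 0.8, 1), (0.7, 0.9, 1), (0.9, 0.9, 1), (0.8, 0.7, 1),
        (0.1, 0.9, 0), (0.8, 0.2, 0), (0.2, 0.1, 0), (0.3, 0.3, 0)]

w_text, w_vis, bias = 0.0, 0.0, 0.0
for _ in range(2000):  # plain stochastic gradient descent, learning rate 0.1
    for t, v, y in data:
        p = 1 / (1 + math.exp(-(w_text * t + w_vis * v + bias)))
        w_text += 0.1 * (y - p) * t
        w_vis += 0.1 * (y - p) * v
        bias += 0.1 * (y - p)

def is_scene_cut(text_score, vis_score):
    z = w_text * text_score + w_vis * vis_score + bias
    return 1 / (1 + math.exp(-z)) > 0.5

print(is_scene_cut(0.85, 0.9))  # strong cues in both modalities
print(is_scene_cut(0.2, 0.2))   # weak cues in both modalities
```

A shot cut with no textual support (or vice versa) falls below the decision threshold, which is the point of combining the two modalities rather than trusting either alone.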
36. Location annotation results
Scene cut classifier:  precision 91.71%   recall 97.48%   F1 85.16%
Location detector:     precision 68.75%   recall 75.54%   F1 71.98%
Location annotation:
  episode   only text   text + LDA   text + LDA + vision
  2         54.72%      58.89%       57.39%
  3         60.11%      65.87%       68.57%
37. Contributions 1/2
The latent words language model
  Best n-gram language model
  Unsupervised learning of word similarities
  Unsupervised disambiguation of words
Using the latent words for WSD
  Best WSD system
Using the latent words for SRL
  Improvement of state-of-the-art classifier
38. Contributions 2/2
Image annotation:
  First full analysis of entities in descriptive texts
  Visualness: capture knowledge from WordNet
  Salience: capture knowledge from syntactic properties
Location annotation:
  Automatic annotation of locations from transcripts
  Including new locations
  Including locations that are not explicitly mentioned