0:06

In general, we can use the empirical counts of events in the observed data to estimate the probabilities.

A commonly used technique is called the maximum likelihood estimate, where we simply normalize the observed counts.

So if we do that, we can compute these probabilities as follows.

For estimating the probability that we see a word occur in a segment, we simply normalize the count of segments that contain this word.

So let's first take a look at the data here.

On the right side, you see a list of some hypothetical data.

These are segments.

In some segments you see both words occur; they are indicated as ones in both columns.

In some other cases only one word occurs, so only that column has a one and the other column has a zero.

And of course, in some other cases neither of the words occurs, so they are both zeros.

And for estimating these probabilities, we simply need to collect three counts.

1:20

So the three counts are: first, the count of W1.

That's the total number of segments that contain word W1.

These are just the ones in the column for W1.

We can count how many ones we have seen there.

The second count is for word W2, and we just count the ones in the second column.

This will give us the total number of segments that contain W2.

The third count is for when both words occur.

So this time, we're going to count the segments where both columns have ones.

1:56

And this would give us the total number of segments where we have seen both W1 and W2.

Once we have these counts, we can just normalize them by N, which is the total number of segments, and this will give us the probabilities that we need to compute mutual information.
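The counting and normalization steps described above can be sketched in a few lines of Python. The toy data is hypothetical: each row stands for a segment, and each column indicates whether word W1 or W2 occurs in it.

```python
# Hypothetical binary occurrence data: one (w1, w2) pair per segment.
segments = [
    (1, 1),  # both W1 and W2 occur
    (1, 0),  # only W1 occurs
    (0, 1),  # only W2 occurs
    (0, 0),  # neither occurs
    (1, 1),
]

N = len(segments)                                        # total number of segments
count_w1 = sum(w1 for w1, _ in segments)                 # segments containing W1
count_w2 = sum(w2 for _, w2 in segments)                 # segments containing W2
count_both = sum(1 for w1, w2 in segments if w1 and w2)  # segments with both words

# Maximum likelihood estimates: simply normalize the counts by N.
p_w1 = count_w1 / N
p_w2 = count_w2 / N
p_both = count_both / N

print(p_w1, p_w2, p_both)  # 0.6 0.6 0.4
```

These three probabilities are exactly the inputs needed for the mutual information computation discussed in this lecture.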

Now, there is a small problem when we have zero counts sometimes.

In this case, we don't want a zero probability, because our data may be a small sample, and in general we would believe that it's potentially possible for a word to occur in any context.

So, to address this problem, we can use a technique called smoothing.

That's basically to add some small constant to these counts, so that we don't get a zero probability in any case.

Now, the best way to understand smoothing is to imagine that we actually observed more data than we really have, because we pretend we observed some pseudo-segments, as illustrated at the top right of the slide.

These pseudo-segments contribute additional counts of these words so that no event will have zero probability.

In particular, we introduce four pseudo-segments, each weighted at one quarter.

These represent the four different combinations of occurrences of the two words.

So now each event, each combination, will have at least a non-zero count from these pseudo-segments.

So, in the actual segments that we observe, it's okay if we haven't observed all of the combinations.

More specifically, you can see the 0.5 here comes from the two ones in the two pseudo-segments, because each is weighted at one quarter; we add them up, we get 0.5.

And similarly, the 0.25 comes from the single pseudo-segment that indicates the two words occur together.

4:09

And of course, in the denominator we add the total number of pseudo-segments that we added; in this case, we added four pseudo-segments.

Each is weighted at one quarter, so the total sum is one.

So, that's why in the denominator you'll see a one there.
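The smoothed estimates described above can be sketched as follows. The raw counts and N are hypothetical example values; the pseudo-count constants follow the four quarter-weighted pseudo-segments from the lecture.

```python
# Hypothetical raw counts from the observed segments.
count_w1, count_w2, count_both = 3, 3, 2
N = 5  # number of real (observed) segments

# Smoothing with four pseudo-segments, each weighted 1/4:
# - two of them contain W1 (and two contain W2), adding 2 * 1/4 = 0.5
#   to each single-word count;
# - only one contains both words, adding 1/4 = 0.25 to the joint count;
# - all four together add 4 * 1/4 = 1 to the denominator.
p_w1 = (count_w1 + 0.5) / (N + 1)
p_w2 = (count_w2 + 0.5) / (N + 1)
p_both = (count_both + 0.25) / (N + 1)

print(p_w1, p_w2, p_both)
```

With this smoothing, even a combination that never occurs in the real data receives a small non-zero probability.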

4:36

Now, to summarize: syntagmatic relations can generally be discovered by measuring correlations between occurrences of two words.

We've introduced three concepts from information theory.

Entropy, which measures the uncertainty of a random variable X.

Conditional entropy, which measures the entropy of X given that we know Y.

And mutual information of X and Y, which measures the entropy reduction of X due to knowing Y, or the entropy reduction of Y due to knowing X.

They are the same.
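These three quantities, and the symmetry just mentioned, can be verified with a small sketch. The joint distribution of the two binary variables here is hypothetical.

```python
import math

# Hypothetical joint distribution p(x, y) for binary X and Y.
joint = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.5}

def entropy(dist):
    """H = -sum p * log2(p) over a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions p(x) and p(y).
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

h_x = entropy(px)        # entropy H(X)
h_y = entropy(py)        # entropy H(Y)
h_xy = entropy(joint)    # joint entropy H(X, Y)
h_x_given_y = h_xy - h_y # conditional entropy H(X|Y)
h_y_given_x = h_xy - h_x # conditional entropy H(Y|X)

# Mutual information: entropy reduction of X due to knowing Y,
# which equals the entropy reduction of Y due to knowing X.
mi = h_x - h_x_given_y
assert abs(mi - (h_y - h_y_given_x)) < 1e-12  # symmetry holds

print(mi)
```

A positive mutual information here indicates that knowing one variable reduces our uncertainty about the other, which is exactly the signal used to detect syntagmatic relations.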

These three concepts are actually very useful for other applications as well.

That's why we spent some time explaining them in detail.

But in particular, they are also very useful for discovering syntagmatic relations.

In particular, mutual information is a principled way of discovering such relations.

It allows us to compute values on different pairs of words that are comparable, so we can rank these pairs and discover the strongest syntagmatic relations from a collection of documents.

Now, note that there is some relation between syntagmatic relation discovery and paradigmatic relation discovery.

We already discussed the possibility of using BM25 to achieve term weighting for terms in the context, which can potentially also suggest candidates that have syntagmatic relations with the candidate word.

But here, once we use mutual information to discover syntagmatic relations, we can also represent the context with these mutual information values as weights.

So this gives us another way to represent the context of a word, like 'cat'.

And if we do the same for all the words, then we can cluster these words or compare the similarity between these words based on their context similarity.

So this provides yet another way to do term weighting for paradigmatic relation discovery.
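One way to picture this connection is the following sketch: each word's context is a vector of mutual-information weights over context words, and two words are compared by the cosine similarity of these vectors. All words and weight values here are made up for illustration.

```python
import math

# Hypothetical MI-weighted context vectors (context word -> MI weight).
context_vectors = {
    "cat": {"eats": 0.8, "sleeps": 0.6, "fur": 0.4},
    "dog": {"eats": 0.7, "sleeps": 0.5, "barks": 0.9},
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    words = set(u) | set(v)
    dot = sum(u.get(w, 0.0) * v.get(w, 0.0) for w in words)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

sim = cosine(context_vectors["cat"], context_vectors["dog"])
print(sim)
```

A high similarity between two such context vectors suggests the words occur with similar companions, which is the signal for a paradigmatic relation.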

So, to summarize this whole part about word association mining: we introduced two basic associations, called paradigmatic and syntagmatic relations.

These are fairly general; they apply to any items in any language, so the units don't have to be words; they can be phrases or entities.

7:11

We introduced multiple statistical approaches for discovering them, mainly showing that pure statistical approaches are viable for discovering both kinds of relations.

And they can be combined to perform joint analysis as well.

These approaches can be applied to any text with no human effort, mostly because they are based on counting words, yet they can actually discover interesting relations between words.

7:44

We can also use different ways of defining context and segment, and this would lead us to some interesting variations of applications.

For example, the context can be very narrow, like a few words around a word, or a sentence, or maybe paragraphs; using different contexts allows us to discover different flavors of paradigmatic relations.

And similarly, we can count co-occurrences in different ways and use mutual information to discover syntagmatic relations.

We also have to define the segment, and the segment can be defined as a narrow text window or a longer text article.

This would give us different kinds of associations.

These discovered associations can support many other applications, in both information retrieval and text and data mining.

So here are some recommended readings if you want to know more about the topic.

The first is a book with a chapter on collocations, which is quite relevant to the topic of these lectures.

The second is an article about using various statistical measures to discover lexical atoms.

Those are phrases that are non-compositional; for example, 'hot dog' is not really a dog that's hot.