ML2 lecture transcripts

Table of Contents

1 Exponential Families

2 Conditional Independence

0 Exponential Families - Basics

Hi everyone, in this clip we'll talk about the basics of exponential families. So what are exponential families? Actually, most of the probability distributions we have seen so far are exponential families: the Gaussian distribution, the gamma or Poisson distribution, and also discrete distributions like the multinomial, Bernoulli or categorical distribution. Exponential families are sets of probability distributions with a certain structure, namely they relate the argument to the parameters in a specific way, so that the two only interact in a particular form. But we have also seen other distributions that are not of exponential family form, like the Student-t, Cauchy or uniform distributions. So let's have a look at the formal definition of exponential families.

So a set of probability distributions, parameterized by some theta, forms an exponential family if it can be written in a specific form, namely

p(x | theta) = h(x) * exp( eta(theta)^T T(x) - A(eta(theta)) ).

Here h(x) is a function only dependent on x, called the base measure; eta(theta) are the natural parameters; T(x) is a function also only dependent on the argument x, called the sufficient statistic; and A is a function only dependent on eta, called the log partition function. If we can write the distribution like this, then we say this family is an exponential family. Note that x and theta only interact in this

scalar product, i.e. bilinear, way. So again: the base measure is this prefactor h(x); the natural parameters are this eta, which is a function of the original parameters theta, so we can consider it a reparameterization of our distribution; then we have the sufficient statistics, these T's; and we have the log partition function A, which is a function only dependent on eta. Note that the natural parameters and the sufficient statistics are vectors with the same number of components, so that we can actually compute their scalar product.

This log partition function has its name (log partition function, or log normalizer) because it is the log of the normalizer of the distribution, given all the other quantities. If we start with h(x) exp(eta^T T(x)) and want a probability distribution proportional to this expression, then we have to normalize it, and exp(A(eta)) is exactly that normalizing constant:

A(eta) = log integral h(x) exp( eta^T T(x) ) dx.

So since we take the log of the normalizing constant, we call it the log normalizer or log partition function. The second thing we need to say: if we have a representation as an exponential family, then this representation is in general not unique.

We can always, for instance, artificially pad the natural parameters and sufficient statistics with zeros, ones or other redundant functions, so the representation is in general not unique. In general we are therefore aiming for a representation as an exponential family with a minimal number of components of the natural parameters and thus of the sufficient statistics. Furthermore, a confusion that always arises is between exponential families and exponential distributions, which sound very similar but are different concepts. Not every exponential family is a family of exponential distributions, but the family of exponential distributions is an exponential family; we will see that on the next slide.

So as an example, here we consider the exponential distributions. What are exponential distributions? Exponential distributions are probability distributions that are supported only on the non-negative part of the real line, so the density is zero on the negative side, and the parameter theta only takes positive values. Now we want to write this as an exponential family. The first thing is that we want to get rid of the case distinction, and for that we use an indicator function on the interval [0, infinity):

p(x | theta) = 1_[0,inf)(x) * theta * e^(-theta x).

So the question is: is this of exponential family form? We can check. We have here the indicator function, which is only dependent on x. We have here the term e^(-theta x), so we could say -theta could be our natural parameter and x our sufficient statistic. But we also have the prefactor theta, which depends on the parameter, and that was not allowed in front.

So what we do now is rewrite this theta as exp(log theta) and pull it inside the exponential. What do we have then?

p(x | theta) = 1_[0,inf)(x) * exp( -theta * x + log theta ).

We could say the indicator function is our h(x), -theta could be our eta (there is only one component here, but in general there can be several), x is our T(x), and log theta should be something like -A. But A is supposed to be a function of eta, not of theta, so we have to do something. So let's write it down: the base measure is the indicator function, the natural parameter is eta = -theta, the sufficient statistic is T(x) = x, and for A(eta) we cannot just take -log theta, because

this is a function of theta, so we have to express it in eta. Since we already have the formula eta = -theta, we can invert it: theta = -eta. That means A(eta) = -log theta = -log(-eta). (Mind the minus signs: eta is negative here, so -eta is positive and the log is well defined.) So now we have found a representation

of the exponential distribution as an exponential family. Let's write it down: in the natural parameterization this looks like

p(x | eta) = 1_[0,inf)(x) * exp( eta * x - ( -log(-eta) ) ),

so base measure times the exponential of eta times x minus the log partition function. A small numerical sanity check of this representation is sketched below.
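Here is a minimal sanity check in Python (using numpy and scipy, which are my own choice and not part of the lecture; the rate value is arbitrary) that the natural-parameter form above reproduces the usual density theta * e^(-theta x):

```python
import numpy as np
from scipy import stats

theta = 1.7          # rate parameter of the exponential distribution (arbitrary choice)
eta = -theta         # natural parameter eta = -theta, so eta < 0

def p_natural(x, eta):
    """Exponential density in exponential-family form:
    h(x) * exp(eta * T(x) - A(eta)), with h(x) = 1_[0,inf)(x),
    T(x) = x and A(eta) = -log(-eta)."""
    h = (x >= 0).astype(float)      # base measure (indicator of the support)
    A = -np.log(-eta)               # log partition function
    return h * np.exp(eta * x - A)

x = np.linspace(0.0, 5.0, 11)
# scipy parameterizes the exponential distribution by its scale = 1 / theta
assert np.allclose(p_natural(x, eta), stats.expon.pdf(x, scale=1 / theta))
print("natural parameterization matches theta * exp(-theta * x) on the support")
```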

So this simple example guides us in how to do this in general, and that is the next topic: if we have a set of probability distributions given with some parameterization, how do we find the base measure, natural parameters, sufficient statistics and log partition function? First of all, sometimes we have more parameters, let's say theta and alpha, and we need to be clear about which are the running parameters and which are fixed hyperparameters. This is important because the base measure is not allowed to depend on the (running) parameters, but it may depend on the fixed hyperparameters. So the first step: if you have several parameters, check which are running and which are fixed; we will see an example later. The second step is to find the support of the distribution, basically asking where the probability is zero and where it is not. So the support

S = { x : p(x | theta) > 0 }

is the set of all arguments x where the distribution is positive, and we want this for all parameters. What we want is that this S is not dependent on theta; if it were dependent on theta, we would get problems in the end. As we have seen in the exponential distribution case, the support showed up as an indicator function on the right-hand side that does not depend on the parameters. So this will be important. And then, if x is in the support, the probability is positive, and that means we can take its logarithm. So if we have our distribution given explicitly and we take a point

with positive probability, then we can look at the terms we get when taking the log, and what we do is factor everything out: we separate the terms that only involve the parameters from the terms that only depend on x, and we gather the mixed terms where x and theta interact, such that we end up with a minimal number of such functions. So hopefully, if it works out, we get something of the form

log p(x | theta) = B(x) + sum_k eta_k(theta) T_k(x) + C(theta),

with functions T_1, ..., T_s that only depend on x, functions eta_1, ..., eta_s that only depend on theta, a remainder B(x) that only depends on x, and a term C(theta) that

only depends on theta. If we have found this form, then we can just read off the natural parameters: these eta_k(theta) are the natural parameters, the T_k can be chosen as the sufficient statistics, and we can construct a base measure by taking the indicator function of the support times the exponential of B(x), i.e. h(x) = 1_S(x) exp(B(x)), which is then the prefactor in front of the distribution and only dependent on x. The next thing is that we want the remaining part to become the log partition function, but as we have seen in the previous example, it is expressed in theta and we want to express it in the natural parameters. So what we have to do is basically

find an inverse of the natural parameterization. So we have to express theta as a function of eta, and once we have found this theta(eta) we just plug it in, and then we get A(eta) = -C(theta(eta)) (the minus sign is just the convention, since A enters the exponent with a minus). Then we are done. So this is the general recipe, and we will see a lot of examples in this lecture; a small sketch applying the recipe to one more distribution is given below.
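As one extra illustration that is not worked out in the lecture (so treat it as my own sketch): applying the recipe to the Poisson distribution p(x | lambda) = lambda^x e^(-lambda) / x! gives h(x) = 1/x!, T(x) = x, eta = log(lambda) and A(eta) = e^eta. The following Python snippet checks this numerically:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

lam = 3.2                  # Poisson rate (arbitrary choice)
eta = np.log(lam)          # natural parameter eta = log(lambda)

def poisson_natural(x, eta):
    """Poisson pmf in exponential-family form:
    h(x) * exp(eta * T(x) - A(eta)), with h(x) = 1/x!,
    T(x) = x and A(eta) = exp(eta) = lambda."""
    log_h = -gammaln(x + 1.0)       # log(1/x!) via the log-gamma function
    A = np.exp(eta)                 # log partition function
    return np.exp(log_h + eta * x - A)

x = np.arange(0, 15)
assert np.allclose(poisson_natural(x, eta), stats.poisson.pmf(x, lam))
print("Poisson pmf recovered from its natural parameterization")
```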

Now we want to talk about when the recipe does not work, and a typical example is the Student-t distribution. The Student-t distribution in higher dimensions comes with a mean mu, a covariance-like matrix Sigma and a degrees-of-freedom parameter nu, and the density has a normalizing constant in front (involving gamma functions of nu/2 and (nu + D)/2 and the determinant of Sigma to the power one half). It looks very similar to the Gaussian, but now comes the difference to

the Gaussian: instead of an exponential of the quadratic form, we have

( 1 + (1/nu) (x - mu)^T Sigma^{-1} (x - mu) )^( -(nu + D)/2 ).

The term inside looks very similar to the argument of the Gaussian, but it is not inside an exponential, it sits inside this polynomial (or inverse polynomial) function of some degree. And this is not an exponential family. Note that the recipe we wrote down on the previous slide is not really a proof; it is more like a sufficient criterion. If the recipe works,

we have found an exponential family representation, but if the recipe does not work, that does not necessarily mean it is not an exponential family. Still, it gives us a hint why the Student-t distribution is not an exponential family: if we take the logarithm of this density, we get a term log(1 + (1/nu)(x - mu)^T Sigma^{-1}(x - mu)), and because of this log of a sum we cannot separate the terms into a part depending only on x and a part depending only on the parameters. You cannot pull it apart, and that is what makes it difficult. This is kind of unfortunate, because Student-t distributions actually look very similar to Gaussians, and we will see later that the Gaussians are an exponential family. The Student-t distributions have a bit

heavier tails than the Gaussian and a very similar form, but because of this polynomial form instead of an exponential form, they are not of exponential family type.

1 Exponential Families - Example: Multinomial distributions (fixed K, M)

Hi everyone, in this clip we want to show how the recipe from the previous clip works for multinomial distributions, to represent them as exponential families. So what is the multinomial distribution? The multinomial distribution comes with a parameter K, where we have K + 1 classes, so K + 1 is the number of classes. It is a discrete distribution, and over these classes we have a discrete probability distribution pi = (pi_0, ..., pi_K). So it is a probability

vector with K + 1 components, such that all the pi_k are greater than or equal to zero and they sum up to one, sum_{k=0}^{K} pi_k = 1. This is basically my parameter space, and I call it Delta_K. Now, if we have one probability distribution with these parameters, we draw from it M times; M is the number of trials. So I draw M times from these classes, and what I record is the number

of successes per class, my x. This x is the random variable (or the value of the random variable, the argument of our distribution): it records how many times each of the K + 1 classes was drawn, so x = (x_0, ..., x_K), and if we add the counts up we get M, i.e. sum_{k=0}^{K} x_k = M. So now, what is the probability distribution of a multinomial? The multinomial distribution has the argument x and the parameters pi we just introduced, and we can write it very shortly as

p(x | pi) = (M choose x) * pi^x,

where both factors are to be read with vectors. What do we mean by this? The multinomial coefficient (M choose x) is M factorial divided by the product of the x_k factorials if the x_k sum up to M, and it is zero otherwise. And by pi^x we mean the product of the pi_k with the corresponding powers x_k, i.e. prod_{k=0}^{K} pi_k^{x_k}. So this is our distribution.

So the first thing we want to do is look at the support: when is this greater than zero? You can already see that all x whose components do not sum to M are excluded and not in the support. So we only have to look at x in N_0^{K+1} that satisfy sum_{k=0}^{K} x_k = M, and in the next steps we will only look at such x. Then we can take logarithms; we want to go through the recipe and

look at the logarithm of the probability distribution. So let's take the logarithm of p(x | pi) for x where it is not vanishing:

log p(x | pi) = log (M choose x) + sum_{k=0}^{K} x_k log pi_k.

So we could now say: stop, we have everything we want,

because if you look at it, this log (M choose x) is my B(x), only dependent on x. Actually it is dependent on M, but we fixed M, and that is exactly why we fixed it: if M is fixed, then M is not a running parameter in this setting, so B(x) is not dependent on pi, and that is the main point. Then I could say the x_k are my sufficient statistics and the log pi_k are my natural parameters, and I can just add a zero and say A = 0 is my log partition function. The sum here

goes from k = 0 to K, so we have K + 1 sufficient statistics and K + 1 natural parameters, and this would actually be a valid representation as an exponential family. But there is some redundancy: we know that the x_k add up to M, and the pi_k add up to one, so we can use these constraints to remove one of the natural parameters and one of the sufficient statistics, and that is what we do next. So we erase this version and try to find a minimal representation, with a minimal number of sufficient statistics and natural parameters.

So we go on and write log (M choose x), and now we take out the zeroth term x_0 log pi_0 and let the sum only run from one:

log p(x | pi) = log (M choose x) + x_0 log pi_0 + sum_{k=1}^{K} x_k log pi_k.

What do we know now? As I said already, we know that x_0 = M minus the other x_k, and we also know that pi_0 = 1 minus the other pi_k. So I just plug this in

and then we get

log p(x | pi) = log (M choose x) + ( M - sum_{k=1}^{K} x_k ) * log( 1 - sum_{k=1}^{K} pi_k ) + sum_{k=1}^{K} x_k log pi_k,

where now every sum

only runs from k = 1 to K. So now you can see we have sums of the x_k here and here, and we have eliminated x_0. Now we have to gather the terms, and this is done by collecting everything multiplying x_k: we have log pi_k with a plus sign and log(1 - sum_j pi_j) with a minus sign. So what we end up with is first M times log( 1 - sum_{k=1}^{K} pi_k ),

which is this term, and then we can summarize the rest as a sum over x_k times one single log of a fraction (we change the inner index to j so as not to get confused with the outer sum):

log p(x | pi) = log (M choose x) + M log( 1 - sum_{j=1}^{K} pi_j ) + sum_{k=1}^{K} x_k log( pi_k / ( 1 - sum_{j=1}^{K} pi_j ) ).

So now we can look at what is what. We can say x_k is our k-th sufficient statistic, the log of this fraction is our k-th natural parameter eta_k, and the term M log(1 - sum_j pi_j), which only depends on the parameters, is our C(pi). So are we done? No. The problem is that this C is expressed in terms of pi again, and we have to express it in terms of the natural parameters. So we have to ask: what is pi in terms of the natural parameters? Let us go to the next slide and write down again what these

natural parameters were. They were

eta_k = log( pi_k / ( 1 - sum_{j=1}^{K} pi_j ) ),   k = 1, ..., K,

and we have K of them. So how do we express this the other way around? We know that the pi_k satisfy the sum constraint, and we can exploit that. We take the exponential, the log drops away and we get e^{eta_k}; then we can add them up. Let me directly write down what 1 + sum_{k=1}^{K} e^{eta_k} is.

This is

1 + sum_{k=1}^{K} pi_k / ( 1 - sum_{j=1}^{K} pi_j ).

Then I can replace the leading one by (1 - sum_j pi_j) divided by itself, so what we get is

( 1 - sum_{j=1}^{K} pi_j + sum_{k=1}^{K} pi_k ) / ( 1 - sum_{j=1}^{K} pi_j ),

and as you can see these sums cancel out. So what do we get? We get

1 + sum_{k=1}^{K} e^{eta_k} = ( 1 - sum_{j=1}^{K} pi_j )^{-1}.

And, surprise surprise, this is exactly the quantity that appears in our natural parameters and in C(pi). So we can write pi_k

in terms of the etas. Let us write it the other way around to avoid getting confused: we have

e^{eta_k} = pi_k / ( 1 - sum_{j=1}^{K} pi_j ),

and we have just seen that the denominator equals 1 / ( 1 + sum_{j=1}^{K} e^{eta_j} ). So we can replace it in this equation and get

pi_k = e^{eta_k} / ( 1 + sum_{j=1}^{K} e^{eta_j} ).

So now we have pi_k expressed purely in the etas. What does this mean? It means we can finally write pi_k as a function of eta, and this quantity is just a softmax-type output:

we write pi = sigma(eta), where sigma_k(eta) = e^{eta_k} / (1 + sum_j e^{eta_j}) is this softmax-type map (a softmax with the zeroth logit fixed to zero). With this we can now write what the log partition function looks like. The log partition function was A(eta) = -C(pi(eta)), and C(pi) was M log( 1 - sum_{j=1}^{K} pi_j ), so

A(eta) = -M log( 1 - sum_{j=1}^{K} pi_j(eta) ),

and we have seen that 1 - sum_j pi_j = ( 1 + sum_{j=1}^{K} e^{eta_j} )^{-1} once we plug the etas in. Pulling the minus sign into the log, what we get is

A(eta) = M * log( 1 + sum_{j=1}^{K} e^{eta_j} ).

So this is now our log partition function, meaning we have represented the multinomial distribution fully as an exponential family. Written out, the distribution in exponential family form is

p(x | eta) = (M choose x) * exp( sum_{k=1}^{K} eta_k x_k - M log( 1 + sum_{j=1}^{K} e^{eta_j} ) ).

What needs to be emphasized is that the sums now start from one instead of zero. So this is a representation as an exponential family with a minimal number of natural parameters eta and sufficient statistics; a small numerical check of this representation is sketched below.
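A quick sanity check of this minimal representation in Python (scipy assumed available, numbers chosen arbitrarily):

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

M = 10                                  # fixed number of trials
pi = np.array([0.2, 0.5, 0.3])          # pi_0, pi_1, pi_2, so K = 2
eta = np.log(pi[1:] / pi[0])            # eta_k = log(pi_k / (1 - sum_j pi_j)), k = 1..K

def multinomial_natural(x, eta, M):
    """Multinomial pmf in minimal exponential-family form:
    (M choose x) * exp(sum_k eta_k x_k - M * log(1 + sum_j exp(eta_j)))."""
    log_coef = gammaln(M + 1) - gammaln(x + 1).sum()   # log of the multinomial coefficient
    A = M * np.log1p(np.exp(eta).sum())                # log partition function
    return np.exp(log_coef + eta @ x[1:] - A)

x = np.array([3, 5, 2])                 # counts x_0, x_1, x_2 summing to M
assert np.isclose(multinomial_natural(x, eta, M),
                  stats.multinomial.pmf(x, n=M, p=pi))
print("minimal natural parameterization matches the multinomial pmf")
```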

So this is the multinomial distribution as an exponential family. What else can we do? Another example is the binomial distribution. We can write it down: p(x | pi) = (n choose x) pi^x (1 - pi)^(n - x). This is the usual description of the binomial distribution, but it is the same as the multinomial where we set M = n and K = 1. Since M and K were fixed, we can just use them as hyperparameters and find the same

form as before. So this is also an exponential family, with the same quantities as before. Another example is the Bernoulli distribution, and again I don't write it out in detail: this is just the multinomial with K = 1, so we have two classes, and M = 1, we have just one trial. Then again, this is also an exponential family with the same quantities as before; maybe you have to change

the indexing a bit, in that instead of getting rid of pi_0 you get rid of pi_1. A tiny sketch of this Bernoulli special case is given below. Okay, that's it for these examples, let's stop here.
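For the Bernoulli special case (K = 1, M = 1) the single minimal natural parameter is the log-odds eta = log(pi / (1 - pi)), the log partition function is A(eta) = log(1 + e^eta), and inverting the natural parameterization gives the logistic sigmoid. A tiny sketch of this (my own illustration, arbitrary pi):

```python
import numpy as np

pi = 0.8                               # success probability (arbitrary choice)
eta = np.log(pi / (1 - pi))            # natural parameter: the log-odds
A = np.log1p(np.exp(eta))              # log partition function A(eta) = log(1 + e^eta)

# pmf in exponential-family form: h(x) = 1, T(x) = x, so p(x) = exp(eta * x - A)
for x in (0, 1):
    p_natural = np.exp(eta * x - A)
    p_usual = pi**x * (1 - pi)**(1 - x)
    assert np.isclose(p_natural, p_usual)

# inverting the natural parameterization recovers pi via the sigmoid
assert np.isclose(1 / (1 + np.exp(-eta)), pi)
print("Bernoulli: eta is the log-odds, and its inverse map is the sigmoid")
```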

2 Exponential Families - Example: Gaussian distributions

Hi everyone, in this clip we want to use the recipe for exponential families to show that Gaussian distributions form exponential families. So let's start with the D-dimensional Gaussian distribution, where D is considered a fixed hyperparameter. What does the density of a D-dimensional Gaussian look like? We have a prefactor (2 pi)^(-D/2), then one over the square root of the determinant of the

covariance matrix, and then an exponential:

N(x | mu, Sigma) = (2 pi)^(-D/2) * |Sigma|^(-1/2) * exp( -1/2 (x - mu)^T Sigma^{-1} (x - mu) ),

where mu is the mean and Sigma is the covariance matrix. This is our Gaussian. We know that the Gaussian density, because of this exponential and because of the assumptions we make on the covariance matrix, is always positive. That means our base measure can just be chosen to be this constant. You could also take one as the base measure, but

then the 2 pi would occur in other functions. So we can already say that h(x) is the constant (2 pi)^(-D/2). We don't have to consider the support, the support is all of R^D. So we can directly look at the logarithm:

log N(x | mu, Sigma) = -D/2 log(2 pi) - 1/2 log|Sigma| - 1/2 (x - mu)^T Sigma^{-1} (x - mu).

The first term is already covered by the base measure, so we are not so interested in that; the last term is the log of the exponential, meaning what we get is

just what is written in the exponent, and the only thing we need to do now is multiply it out. Expanding (x - mu)^T Sigma^{-1} (x - mu) we get the x-x, mu-mu, x-mu and mu-x terms, so

-1/2 (x - mu)^T Sigma^{-1} (x - mu) = -1/2 x^T Sigma^{-1} x - 1/2 mu^T Sigma^{-1} mu + mu^T Sigma^{-1} x,

where the last term comes from the two cross terms, -2 times -1/2 times

mu^T Sigma^{-1} x. So what is what now? Let's rearrange these terms a little bit: we have something which depends on x and something which does not depend on x.

So what is what? First of all, we have x appearing linearly here, with the

vector of parameters Sigma^{-1} mu in front; then we have a term which also depends on x, but where x is entangled with the parameter matrix in a quadratic way; and then we have something which does not depend on x at all. So we would like to call x itself our T_1(x), which is not just one component but a whole D-dimensional vector, with eta_1 = Sigma^{-1} mu as the corresponding natural parameter, and we would like to call the term only dependent on the parameters mu and Sigma our C(mu, Sigma). For the quadratic term we have to think a bit: we want to put it into a form where there is a scalar product just

between the parameters and a sufficient statistic. How can we do this? We can do this by a trick called the trace trick. Let's write it out. x^T Sigma^{-1} x is a 1-by-1 matrix, and a 1-by-1 matrix we can consider as just a scalar, so it is equal to the trace of this 1-by-1 matrix. And the trace has the nice property that we can cycle matrices around: a matrix on the left can go to the right. So the x^T can go to the right; then it is no longer

a 1-by-1 matrix inside, but the trace is still the same:

-1/2 x^T Sigma^{-1} x = tr( -1/2 Sigma^{-1} x x^T ).

And now we can say that x x^T is our second sufficient statistic and -1/2 Sigma^{-1} is our second natural parameter. One thing we maybe need to mention: if we have two matrices, multiply them and take the trace, this is the same as taking the entries of the first matrix, multiplying them elementwise against the entries of the second matrix and adding everything up. In other words, if you vectorize this matrix, basically

turning the matrix into one long vector, and vectorize the other one accordingly, then the trace of the product is the scalar product between these two vectors. So we can indeed say that T_2(x) = x x^T is our second sufficient statistic and eta_2 = -1/2 Sigma^{-1} is our second natural parameter. Okay, so are we done now? Again, no: the problem is that the log partition function is still expressed in the original parameters, and we have to re-express it in the natural parameters. So what we have to find is mu and Sigma as a function

of the natural parameters; we have to invert our relation. Let's go to the next slide. What do we have? We have

eta_1 = Sigma^{-1} mu   and   eta_2 = -1/2 Sigma^{-1}.

Now we want to invert this. First you see that the second equation does not depend on mu, so it is easy to invert:

Sigma = -1/2 eta_2^{-1}, or, shuffling the inverses around, Sigma^{-1} = -2 eta_2. Then we look at the first equation: multiplying by Sigma we get mu = Sigma eta_1, which is

mu = -1/2 eta_2^{-1} eta_1.

Then, what was our log partition function? A is -C(mu, Sigma), and from the log density above the term only dependent on the parameters is C(mu, Sigma) = -1/2 mu^T Sigma^{-1} mu - 1/2 log|Sigma|, so

A = 1/2 mu^T Sigma^{-1} mu + 1/2 log|Sigma|

(the minuses disappear because of the minus in -C). Now we plug in: Sigma^{-1} mu is just eta_1, so we can replace mu^T Sigma^{-1} mu by mu^T eta_1 with mu = -1/2 eta_2^{-1} eta_1. Maybe we don't have enough space here, so I continue

on the next line. For the first term we get

1/2 mu^T Sigma^{-1} mu = 1/2 (-1/2 eta_2^{-1} eta_1)^T (-2 eta_2) (-1/2 eta_2^{-1} eta_1) = -1/4 eta_1^T eta_2^{-1} eta_1,

using that eta_2 (like Sigma^{-1}) is symmetric, and for the second term

1/2 log|Sigma| = 1/2 log| -1/2 eta_2^{-1} | = -1/2 log det( -2 eta_2 ).

Taking care of all the transposes, we get the log partition function

A(eta_1, eta_2) = -1/4 eta_1^T eta_2^{-1} eta_1 - 1/2 log det( -2 eta_2 ).

Okay, so we have found the log partition function. Let's write down how the D-dimensional Gaussian distribution looks as an exponential family. What we've got is

p(x | eta) = (2 pi)^(-D/2) * exp( eta_1^T x

+ tr( eta_2 x x^T ) - A(eta_1, eta_2) ),   with   A(eta_1, eta_2) = -1/4 eta_1^T eta_2^{-1} eta_1 - 1/2 log det( -2 eta_2 ).

So we have now seen that the D-dimensional Gaussian distributions form an exponential

family, where both parameters mu and Sigma are running parameters. Now we can ask: what about the one-dimensional Gaussian? I don't really need to write this down, it is just the D-dimensional Gaussian with D = 1, and we get the same equations where Sigma is just sigma squared. I leave this as an exercise; a small numerical check of the natural parameterization is sketched below.
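Here is a short numerical sketch in Python (scipy assumed available, example numbers arbitrary) that checks the exponential-family form of the Gaussian we just derived:

```python
import numpy as np
from scipy import stats

D = 2
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

# natural parameters of the Gaussian
eta1 = np.linalg.solve(Sigma, mu)        # eta_1 = Sigma^{-1} mu
eta2 = -0.5 * np.linalg.inv(Sigma)       # eta_2 = -1/2 Sigma^{-1}

def gaussian_natural(x, eta1, eta2):
    """Gaussian density as an exponential family:
    (2 pi)^{-D/2} * exp(eta1^T x + tr(eta2 x x^T) - A(eta1, eta2)),
    with A = -1/4 eta1^T eta2^{-1} eta1 - 1/2 log det(-2 eta2)."""
    A = (-0.25 * eta1 @ np.linalg.solve(eta2, eta1)
         - 0.5 * np.linalg.slogdet(-2 * eta2)[1])
    quad = np.trace(eta2 @ np.outer(x, x))           # tr(eta_2 x x^T)
    return (2 * np.pi) ** (-D / 2) * np.exp(eta1 @ x + quad - A)

x = np.array([0.3, 0.7])
assert np.isclose(gaussian_natural(x, eta1, eta2),
                  stats.multivariate_normal.pdf(x, mean=mu, cov=Sigma))
print("natural parameterization matches the Gaussian density")
```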

3 Exponential Families - Conjugate Prior, Bayesian Update

Hello everyone. This clip will talk about conjugate priors for exponential families. You might ask why we do all this with exponential families, and the reason is that they have good properties. One of them is that they have well-behaved conjugate priors, so Bayesian statistics becomes easy with them. So what is a conjugate prior? First, let's consider a family of probability distributions, i.e. a statistical model, and say we now want to do Bayesian statistics, so we want to set up a prior distribution over the parameters. A conjugate prior is then a

family of priors; we say "the conjugate prior" or "a conjugate prior", but it is actually a family of priors. We say that it is conjugate with respect to the model M if all possible posteriors also lie in this family F, so they have the same form: whatever data I have, if I take the posterior, it has the same form as the prior, it lies in the family of priors again. Such a family of distributions is called a conjugate prior. Just to remind you what the posterior is: the posterior is first the data likelihood

given our model, p(D | theta); often it is i.i.d. data, so it is the product over all individual data points given theta; then times the prior p(theta | alpha), where the hyperparameter here is alpha; and then divided by the evidence, where the evidence is just the integral over likelihood times prior:

p(theta | D, alpha) = p(D | theta) p(theta | alpha) / p(D | alpha).

If we don't want to talk about the evidence, we can roughly say that the posterior is proportional

to just the data likelihood times the prior; we can do this because the argument of the posterior is theta, and theta only appears in the numerator. So let's give them names again: this is the posterior, this is the data likelihood, this is the prior and this is the evidence. Okay, now that we have defined what a conjugate prior to a statistical model is, we can ask what happens with exponential families. Let's assume we have an exponential family given in its natural parameterization.

So here eta are the natural parameters, and we consider the set of all these distributions

p(x | eta) = h(x) exp( eta^T T(x) - A(eta) ),

where eta runs through some subset of a real vector space. This is our exponential family in the natural parameterization, of the form we have seen before. Then my claim is that we can construct a conjugate prior by the following formula, where the prior has eta as its argument

and tau and nu as hyperparameters:

p(eta | tau, nu) proportional to exp( eta^T tau - nu * A(eta) ).

As you can see, we are basically resembling the exponent of the likelihood, where we replace the data-dependent T(x) by a hyperparameter tau, and we allow a multiplicative hyperparameter nu in front of A(eta). This is defined up to a normalizing constant, and that normalizing constant would depend on tau and nu. This is the claim, and we can try to prove it.

Okay, let's write it down. Saying the prior p(eta | tau, nu) is proportional to exp( eta^T tau - nu A(eta) ), we now look at the posterior p(eta | D, tau, nu) for some data D = (x_1, ..., x_N). This is proportional to the data likelihood times the prior: for i.i.d. data we have the product over n of p(x_n | eta) times the prior. Now we know that the likelihood terms come from an exponential family, so

we can write this as

prod_{n=1}^{N} h(x_n) exp( eta^T T(x_n) - A(eta) )   times   exp( eta^T tau - nu A(eta) ).

As you can see, the product turns the exponents into a sum: we sum over the eta^T T(x_n), and we also sum over the A(eta) terms, but there is no index n on A(eta), so it just gets multiplied by the number of data points N. We also have here very

similar terms as in the prior, so we can simplify. Dropping the h(x_n) factors (they do not depend on eta), the posterior is proportional to

exp( eta^T ( sum_{n=1}^{N} T(x_n) + tau ) - (N + nu) A(eta) ).

So this is the posterior: we started from the posterior, wrote down the terms explicitly, and now we see it is proportional to this quantity.

If we now do a simple substitution and call sum_n T(x_n) + tau =: tau tilde and N + nu =: nu tilde, then this is the same as p(eta | tau tilde, nu tilde), which shows that the posterior is of the same form as the prior we started with, so it lies in the family of priors. So we see that it is indeed conjugate, and we have shown that we can construct a conjugate prior for exponential families if we know the log partition function and the sufficient statistics. So, to sum up what we have just derived: if we look at the prior

hyperparameters and the posterior hyperparameters, then in the prior we had tau and nu, and in the posterior we have tau plus the sum of the sufficient statistics of all data points, and nu plus N. That means if we use an exponential family with the conjugate prior we just constructed, the Bayesian update rule just updates the hyperparameters of the prior: we evaluate the data points on the sufficient statistics and add them to tau, and for nu we just need the number of data points. So you only need this data summary, and you know what the posterior

is. Usually, if you remember, doing a full Bayesian update is extremely complicated, but for exponential families with a conjugate prior it reduces to this simple addition and evaluation. So now let us look at a simple example: let's recall the multinomial distribution with fixed M and K, and recall that we had two different parameterizations of it as an exponential family. The first was not minimal but it was symmetric, and the second was minimal but not very symmetric, because we had to choose one class and express it through the

others. Let's just look at the symmetric one. In that parameterization we basically wrote

p(x | eta) = (M choose x) exp( sum_{k=0}^{K} eta_k x_k ),   with eta_k = log pi_k for k = 0, ..., K,

and in this parameterization A(eta) = 0. So this is very easy. How would the conjugate prior look? The prior would be proportional to

exp( sum_{k=0}^{K} tau_k eta_k - nu * 0 ).

Since A is zero here, the nu term drops out and the prior does not actually depend on nu; it only depends on tau, where tau has the same number of components as T and eta. So this is a very simple conjugate prior. If we go back to the original parameterization (I don't strictly need to do it, but I want to relate it), and we take

alpha_k to be tau_k + 1 (we just shift by one, it is only a change of notation), then the exponential of eta_k is pi_k again and this tau_k turns into alpha_k - 1, so the prior

p(pi | alpha) is proportional to prod_{k=0}^{K} pi_k^{alpha_k - 1},

and this is what we know as the Dirichlet distribution. This is here not in the

natural parameterization anymore, but you could say that the Dirichlet distribution is the conjugate prior of the multinomial in its original version, and we have derived it through the general formula for exponential families. And maybe a last word: if we want to do an update from prior to posterior, then in the natural hyperparameters the rule reads, component by component, tau_k goes to

tau_k + sum_{n=1}^{N} x_{k,n}, for each k = 0, ..., K (and nu goes to nu + N, although nu does not matter here). That is the Bayesian update rule. In the original, non-natural parameterization this corresponds to the same rule, just shifted by one: alpha_k goes to alpha_k + sum_{n=1}^{N} x_{k,n}. So we are basically just adding up the counts for each component separately

and adding them to these hyperparameters, in the natural as well as in the original parameterization. A small numerical sketch of this update is given below.
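A minimal numerical sketch of this update rule for the multinomial example (Python; prior values and data are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

M = 5                                    # trials per multinomial observation (fixed)
pi_true = np.array([0.2, 0.5, 0.3])      # true class probabilities, K + 1 = 3 classes

# prior hyperparameters in the natural parameterization (assumed values)
tau = np.ones(3)                         # pseudo-counts for the sufficient statistics
nu = 1.0                                 # prior count (does not matter here since A = 0)

# simulate N i.i.d. multinomial observations
N = 100
X = rng.multinomial(M, pi_true, size=N)  # shape (N, 3), each row sums to M

# Bayesian update: add summed sufficient statistics and the number of data points
tau_post = tau + X.sum(axis=0)           # tau_tilde = tau + sum_n T(x_n)
nu_post = nu + N                         # nu_tilde  = nu + N

# in the original parameterization this is the Dirichlet update, alpha = tau + 1
alpha_post = tau_post + 1
print("posterior Dirichlet parameters:", alpha_post)
print("posterior mean of pi:", alpha_post / alpha_post.sum())   # close to pi_true
```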

4 Exponential Families - MLE and MAP estimation

Hi everyone, in this clip we want to derive the maximum likelihood estimator and the MAP estimator for exponential families. Let's recap: an exponential family is of the form p(x | eta) = h(x) exp( eta^T T(x) - A(eta) ), where we have the base measure, the natural parameters, the sufficient statistics and the log partition function, and we consider them given. If we use a conjugate prior, which is of the same form as the likelihood but now as a function of eta, where the hyperparameter tau plays the role of the sufficient statistics and nu is basically a count of how many times the log partition function appears, then we have seen that the posterior is of the same form as the conjugate prior, but where

the new parameters, the posterior parameters, are given by the simple update rule that adds the sufficient statistics of the i.i.d. data to tau and the number of data points to nu. So we have the posterior explicitly, and that means with a simple derivative we can get the posterior mode, i.e. the MAP estimate for the natural parameters. Let's go through it: we take the derivative of the log of our posterior and set it to zero; this is the condition for the MAP estimator. The normalizing constant in front only depends on the two hyperparameters, so its log falls away when we differentiate with respect to eta, and then

we are left with the log of the exponential, i.e. eta^T tau tilde - nu tilde A(eta). The derivative of the first term with respect to eta is just tau tilde, and the derivative of A we call A'(eta); if eta has several components, this is really the gradient, with the same number of components. Setting the derivative to zero means

tau tilde - nu tilde A'(eta) = 0,   i.e.   A'(eta) = tau tilde / nu tilde,

which is just a rescaling of tau tilde. So this is the condition for the MAP estimator. Now we make the assumption that we

can invert this function A' (as a vector-valued function; we do not mean the inverse of a matrix, but really the inverse of a nonlinear map). Then we get

eta_MAP = (A')^{-1}( tau tilde / nu tilde ) = (A')^{-1}( (tau + sum_n T(x_n)) / (nu + N) ).

The maximum likelihood estimator is actually even easier: we basically set tau and nu to zero, or equivalently we just look at the likelihood, multiplied over the data points, and we get exactly the same formula where in the exponent we have the sum of sufficient

statistics and N times the log partition function; setting the derivative to zero we arrive at eta_MLE = (A')^{-1}( sum_n T(x_n) / N ). Let's look at a simple example, say the categorical distribution. The categorical distribution can be written as an exponential family with base measure h(x) = 1, log partition function A(eta) = log( 1 + sum_{k=1}^{K} e^{eta_k} ) and sufficient

statistics T_k(x) = x_k for k = 1, ..., K. Actually, let us do the multinomial distribution directly, with K and M fixed; then A(eta) = M log( 1 + sum_{k=1}^{K} e^{eta_k} ), and we just need to plug in. Now we take the derivative of A with respect to eta_k, and this is M times 1 / ( 1 + sum_{j=1}^{K} e^{eta_j} )

times the inner derivative e^{eta_k}. So we have to look at this as a whole vector: A'(eta) is the vector with components M e^{eta_k} / ( 1 + sum_j e^{eta_j} ), i.e. M times the softmax-type map sigma(eta) from before. What we need to do now is invert it. So let's say s = A'(eta) with components s_1, ..., s_K, and we have to express eta_1, ..., eta_K in terms of s. As you can see, s is M times

this softmax-type function of eta, so s / M = sigma(eta), and inverting gives

eta = sigma^{-1}( s_1 / M, ..., s_K / M ).

What this means is

that (A')^{-1}(s) is the inverse of the softmax-type map, applied to the arguments divided by M. So we can now compute the MAP estimator as

eta_MAP = sigma^{-1}( ( tau + sum_{n=1}^{N} x_n ) / ( (nu + N) * M ) ),

where the sufficient statistics are the count vectors x_n (restricted to components 1, ..., K), we add the tau,

we divide by nu + N, and now we also have to divide by M. This is the MAP estimator of the multinomial distribution in natural parameters; a small numerical sketch of this formula is given below.
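A small numerical sketch of this MAP formula (Python, my own example numbers; class 0 is used as the reference class of the minimal parameterization):

```python
import numpy as np

rng = np.random.default_rng(1)

M = 5                                     # trials per observation (fixed)
pi_true = np.array([0.2, 0.5, 0.3])       # classes 0, 1, 2; class 0 is the reference
N = 200
X = rng.multinomial(M, pi_true, size=N)   # count vectors, shape (N, 3)

# minimal sufficient statistics: T(x) = (x_1, ..., x_K), class 0 dropped
T = X[:, 1:]

# conjugate-prior hyperparameters (assumed values); tau = 0, nu = 0 would give the MLE
tau = np.ones(T.shape[1])
nu = 1.0

# MAP condition: A'(eta) = (tau + sum_n T(x_n)) / (nu + N) =: s
s = (tau + T.sum(axis=0)) / (nu + N)

# invert A'(eta)_k = M * exp(eta_k) / (1 + sum_j exp(eta_j)):
# eta_k = log( s_k / (M - sum_j s_j) )
eta_map = np.log(s / (M - s.sum()))

# map back to class probabilities to compare with pi_1, ..., pi_K
pi_map = np.exp(eta_map) / (1 + np.exp(eta_map).sum())
print("MAP natural parameters:", eta_map)
print("implied probabilities:", pi_map, "vs true", pi_true[1:])
```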

5 Exponential Families - Deriving Moments

Hi everyone, in this clip we will talk about how to derive the moments of exponential families. My claim is: if you have an exponential family with log partition function A, and you assume you have a minimal representation (a minimal number of natural parameters), then you can derive the expectation value of the sufficient statistics, the covariance matrix of the sufficient statistics, and all higher centered moments of the sufficient statistics just by taking derivatives of this log partition function. So my first claim is that

grad_eta A(eta) = E[ T(X) ],

the gradient of the log partition function with respect to eta is the expectation value of the sufficient statistics. And second, to make it explicit, if I take second derivatives with respect to the natural parameters, I get the covariances of the corresponding components of the sufficient statistics,

d^2 A / d eta_i d eta_j = Cov( T_i(X), T_j(X) ),

and I claim you can even continue to higher moments with higher derivatives. Okay, let's prove this. What we first need is to remember that our log partition function was

the log normalizer, so we can write

A(eta) = log integral h(x) exp( eta^T T(x) ) dx.

This is the formula we will use. Now we take the derivative d/d eta_k of this expression. First we have an outer derivative, then an inner derivative. The derivative of the log is one over its argument, so we get the same integral

in the denominator; then we take the derivative inside the integral (we can pull it inside), and the derivative of the exponential is the exponential itself times the inner derivative, which with respect to eta_k is just the coefficient in front of eta_k, namely T_k(x). So we get

dA/d eta_k = ( integral T_k(x) h(x) exp( eta^T T(x) ) dx ) / ( integral h(x) exp( eta^T T(x) ) dx ).

Now we look at the denominator: this is nothing else than exp(A(eta)), so one over it is exp(-A(eta)), and we can pull it into the exponential in the numerator. So this whole thing is

dA/d eta_k = integral T_k(x) h(x) exp( eta^T T(x) - A(eta) ) dx,

and h(x) exp( eta^T T(x) - A(eta) ) is just

the probability density with parameter eta. So basically we get

dA/d eta_k = integral T_k(x) p(x | eta) dx,

and this is just the expectation value E[ T_k(X) ].

And that is exactly what we wanted to show. Now we have the second claim, and we check this as well. What we need to do now is take second derivatives, d/d eta_i d/d eta_j of A(eta). We can use what we have done before: this is

d/d eta_i of integral T_j(x) h(x) exp( eta^T T(x) - A(eta) ) dx.

Now we want to take the derivative

inside the integral, and we have to look at what depends on eta and what does not: there is a dependence on eta in the exponent, both through eta^T T(x) and through A(eta), while h(x) and T_j(x) do not depend on eta. So we have an outer factor T_j(x) which does not change, and then the inner derivative of the exponent: this is first the i-th component T_i(x), and then minus the derivative of A(eta) with respect to eta_i, which we have just seen is the expectation value of the statistic,

and the rest stays the same and is again the probability distribution p(x | eta). So what we get is

d^2 A / d eta_i d eta_j = integral T_j(x) ( T_i(x) - E[ T_i(X) ] ) p(x | eta) dx = E[ T_j(X) ( T_i(X) - E[ T_i(X) ] ) ].

The expectation value is linear, so we can pull the constant E[ T_i(X) ] out, and we end up with

E[ T_j(X) T_i(X) ] - E[ T_j(X) ] E[ T_i(X) ],

and this is nothing else than Cov( T_i(X), T_j(X) ), which is what we wanted to show. Now we go through some examples as a sanity check. Let's take the exponential distribution again. As an exponential distribution we have seen that p(x | lambda) was

1_[0,inf)(x) * lambda * e^(-lambda x), and the natural parameterization was, very similarly,

p(x | eta) = 1_[0,inf)(x) * exp( eta x - A(eta) ),   with eta = -lambda and A(eta) = -log(-eta),

and the sufficient statistic was T(x) = x. So now, what is

the expectation value of the sufficient statistic? By what we just said, this is the derivative of A with respect to eta. We first have a minus sign, then the derivative of the log, which is 1/(-eta), and then the inner derivative, which is another minus sign. So dA/d eta = 1/(-eta), and since eta = -lambda this is 1/lambda. From simple calculations about exponential distributions we know that E[X] = 1/lambda, so this sanity check holds. What about the variance of T(X) = X?

For this we have to look at the second derivative. According to what we just derived, we take the derivative of 1/(-eta) once more, which gives 1/eta^2 = 1/lambda^2, and this is indeed the variance of the exponential distribution, so this also holds true. A small numerical check of these two formulas is sketched below.
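This is easy to verify numerically; a short sketch (Python, central finite differences, my own illustration) for the exponential distribution:

```python
import numpy as np

def A(eta):
    """Log partition function of the exponential distribution, valid for eta < 0."""
    return -np.log(-eta)

lam = 2.5
eta = -lam
h = 1e-4                                  # step size for central finite differences

mean = (A(eta + h) - A(eta - h)) / (2 * h)                # dA/deta   = E[X]
var = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2       # d2A/deta2 = Var[X]

print(mean, "should be close to", 1 / lam)
print(var, "should be close to", 1 / lam**2)
```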

As the next example we do a one-dimensional Gaussian. So we look at N(x | mu, sigma^2), and in the natural parameterization we had

A(eta) = -eta_1^2 / (4 eta_2) - 1/2 log( -2 eta_2 ).

We also know what the natural parameters were: eta_1 = mu / sigma^2 and eta_2 = -1 / (2 sigma^2). And the sufficient statistics are T(x) = (x, x^2), so we are basically looking at the first and second moment. Now we want to compute their expectation values. So what is the expectation value of X under the normal distribution? This is the expectation

value of T_1(X), and by the claim this is the derivative of A with respect to the first argument. If we differentiate this A with respect to eta_1, the second part drops out because it does not depend on eta_1, and we get -eta_1 / (2 eta_2). Does this make sense? We can just plug in the old parameters and see. So we get minus

mu / sigma^2 divided by 2 eta_2, and since -1/(2 eta_2) is sigma^2, maybe the minus signs are a bit irritating, so

let's write it like this: -eta_1 / (2 eta_2) = (mu / sigma^2) * sigma^2 = mu, and this checks out. So the first claim is correct here. Now let's look at the variance of X, which is the variance of T_1(X). That means we take the second derivative with respect to eta_1. If we differentiate -eta_1 / (2 eta_2) by eta_1, the eta_1 just drops out and we get -1 / (2 eta_2), and if we plug in again we get sigma^2,

which is true, right? For a one-dimensional Gaussian we know that the variance is sigma^2. Check. Now we can even go further: what if I want to derive the expectation value of X^2? That is the expectation value of our second sufficient statistic, so I take the first derivative of A with respect to the second argument eta_2. What do we get? We get

dA/d eta_2 = eta_1^2 / (4 eta_2^2) - 1 / (2 eta_2),

and plugging in the old parameters

we can see that the first term is mu^2 and the second term is again sigma^2,

so E[X^2] = mu^2 + sigma^2. And if you look at the variance, the variance is E[X^2] minus E[X]^2, so this also checks out. Now we can even compute things which are not so easy to compute otherwise, for instance the variance of X^2, i.e. the variance of T_2(X). This is the second derivative of A with respect to eta_2 (note the plus sign on the second term):

d^2 A / d eta_2^2 = -eta_1^2 / (2 eta_2^3) + 1 / (2 eta_2^2),

and if we plug in the old parameters we get Var(X^2) = 4 mu^2 sigma^2 + 2 sigma^4. Whether that is true is not so obvious, but how else would you derive these kinds of formulas? A numerical check of all four formulas is sketched below.
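The same kind of numerical check for the one-dimensional Gaussian (a sketch with finite differences, not lecture material); it reproduces E[X] = mu, E[X^2] = mu^2 + sigma^2, Var[X] = sigma^2 and Var[X^2] = 4 mu^2 sigma^2 + 2 sigma^4:

```python
import numpy as np

def A(eta1, eta2):
    """Log partition function of the 1-D Gaussian in natural parameters (eta2 < 0)."""
    return -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)

mu, sigma2 = 1.3, 0.7
eta1, eta2 = mu / sigma2, -1 / (2 * sigma2)
h = 1e-4                                   # finite-difference step size

E_x = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)                     # E[X]
E_x2 = (A(eta1, eta2 + h) - A(eta1, eta2 - h)) / (2 * h)                    # E[X^2]
Var_x = (A(eta1 + h, eta2) - 2 * A(eta1, eta2) + A(eta1 - h, eta2)) / h**2  # Var[X]
Var_x2 = (A(eta1, eta2 + h) - 2 * A(eta1, eta2) + A(eta1, eta2 - h)) / h**2 # Var[X^2]

print(E_x, "should be close to", mu)
print(E_x2, "should be close to", mu**2 + sigma2)
print(Var_x, "should be close to", sigma2)
print(Var_x2, "should be close to", 4 * mu**2 * sigma2 + 2 * sigma2**2)
```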

6 Exponential Families - Summary

So let's recap what we have done as a summary. We have introduced exponential families; they are families of distributions that can be brought into the form p(x | eta) = h(x) exp( eta^T T(x) - A(eta) ), where h(x) is the base measure, eta is the vector of natural parameters, T(x) is the vector of sufficient statistics, and A(eta) is the log partition function. Then we have shown a general recipe for bringing a given family of distributions into this exponential family form, if it is possible. We have also seen that there are distributions, like the Student-t distributions, that cannot be brought into this form, at least not using the recipe.

Then we have seen that exponential families have very convenient conjugate priors, given by the formula p(eta | tau, nu) proportional to exp( eta^T tau - nu A(eta) ), where A is the log partition function from above, eta are the natural parameters, and we introduce two hyperparameters, a vector tau and a number nu. Given an exponential family and this conjugate prior, we can do Bayesian updating just by updating the hyperparameters, and we have shown that this is done by adding the summed sufficient statistics of the data points to the first hyperparameter, while the second one is updated just by the number of data points. Furthermore,

we have seen that the moments of the sufficient statistics can be computed just by taking partial derivatives of the log partition function: the expectation by first derivatives, the covariance by second derivatives, and higher moments as well.

2 Conditional Independence

0 Conditional Independence

Hi everyone, in this clip we'll talk about conditional independence. As a short recap, let us say what the marginals and the conditionals of a joint distribution are. Say we have random variables X, Y and Z with a joint distribution p(x, y, z), which is basically a function of three arguments. Now I want to say what the distribution over Z is: the variables come with a marginal distribution, but how do I compute it? As a recap, you take the joint p(x, y, z) and marginalize out the other variables by integrating over them,

p(z) = integral integral p(x, y, z) dx dy.

This is for a probability density; if you have discrete variables, you just replace

the integrals by sums, and these are just symbols. So now we have the marginal distribution over Z. Next, what is the conditional distribution of X, Y given Z? For this we have to look at two cases: one case is where p(z) is greater than zero, and the other is where p(z) is equal to zero. In the first case, as you know, we just compute

p(x, y | z) = p(x, y, z) / p(z),

which is well defined only if p(z) is greater than zero. For the second case we have several choices of how to define it.

The main reason is that if the probability of this specific value z is zero, then basically we have any choice we want, but a very convenient choice is to set the conditional to zero whenever p(z) is zero; then one usually does not have to take care of special cases. This is just a convention, one could also argue for other conventions. So what is independence? We had independence already in other courses: say we have two random variables X and Y (there is no Z here); we say they are independent if for all values the joint factorizes into the product of

the marginals, p(x, y) = p(x) p(y), where the marginals are computed as on the previous slide. In symbols we write X ⊥ Y, using this independence symbol. So now we come to conditional independence, which is very similar, just that a third variable is at play. We have random variables X, Y, Z with a joint distribution in three arguments. Conditional independence of X and Y conditioned on Z then means that for all values x, y and z we have

p(x, y | z) = p(x | z) *

p(y | z). Actually, if we used a different convention for the conditionals, we would only check this for p(z) > 0; with our convention we just check it for all values, which actually makes it easier. In symbols we write X ⊥ Y | Z, the independence symbol with a conditioning bar. So these are the formal definitions. A few remarks are necessary. First of all, we consider the unconditional case of independence as a special case of conditional independence.

That does not mean that it doesn't matter what we condition on; it does, and in conditional independence we always have to say what we are conditioning on. But I can say that I either condition on nothing, in this kind of language, or that I condition on a constant random variable; that would all be equivalent to just saying X is independent of Y. So X ⊥ Y can be read as X independent of Y conditioned on nothing, or X independent of Y conditioned on any constant random variable. So what do these notions mean?

Heuristically, independence means that if I look at Y, I don't learn anything about X, and the other way around: the value of one does not give me any clue about the value of the other. Conditional independence is a bit more tricky: it says that X and Y are independent given Z, meaning that if I already know Z, if I already have all the information contained in Z, then the question is about the additional information that goes beyond Z. The information in Y that goes beyond the information in Z cannot help to determine the state of X. So let

us come to two small lightning examples. Say Alice and Bob play a game and each of them has a coin with two sides, heads or tails, a classical Bernoulli experiment; it can be fair or unfair, it doesn't matter. Alice throws her coin and doesn't show it to Bob, Bob throws his coin and doesn't show it to Alice, and they try to guess what the other has in their hand. As this is set up, it is clear that if you throw your own coin and look at it, you have no clue about the coin the other person holds; you can look at your own coin as much as you want, throw it again, it doesn't help. So what does this mean? It means that

the coin of Alice and the coin of Bob are independent. This is fine, nothing out of the ordinary. But now say there is a third person, Eve, who happens to see both coins, and she tells Bob: "I won't tell you what Alice has, but the two outcomes are different." So even though the coins are independent, someone now gives the information that the two coins are different. If you have heads, you then know the other person has tails; Bob knows with 100% certainty what the other person has in their hand. And this means: we had independence, but if we condition on the information

if just gave then the conditional independence is broken. So we are here have the case that A. And B. Are independent. But as soon as a condition on another random variable here meaning the random variable. If the coins are equal or not, then um independence can turn into dependence conditioned on this new information or condition on this new random variable. This is the first example. So it's clear that independence does not imply conditional independence. We have also an example that goes the other way around. Let's say we have no three persons. It's becoming a bit more complicated if Ellis bob Casper and each throw coin. And Ellis the game is now that Ellis tries to get the number of tails under these three coins. But she can only basically look at her
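To make this first example concrete, here is a minimal sketch in Python (my own illustration, not from the lecture) that enumerates the joint distribution of two fair coins A and B together with Z = 1 if the coins are equal, and checks both factorizations numerically.

from itertools import product

# Joint distribution of two independent fair coins A, B and Z = 1 if A == B else 0.
joint = {}
for a, b in product([0, 1], repeat=2):
    joint[(a, b, int(a == b))] = 0.25      # each coin combination has probability 1/4

def marg(axes):
    # Marginal distribution over the listed axes (0 = A, 1 = B, 2 = Z).
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in axes)
        out[k] = out.get(k, 0.0) + p
    return out

pA, pB, pAB = marg([0]), marg([1]), marg([0, 1])
pZ, pAZ, pBZ = marg([2]), marg([0, 2]), marg([1, 2])

# Unconditional independence holds: p(a, b) == p(a) * p(b) for all values.
print(all(abs(pAB[(a, b)] - pA[(a,)] * pB[(b,)]) < 1e-12
          for a, b in product([0, 1], repeat=2)))          # True

# Conditional independence given Z fails: p(a, b | z) != p(a | z) * p(b | z).
a, b, z = 0, 0, 1                                           # both coins equal
lhs = joint[(a, b, z)] / pZ[(z,)]                           # p(a, b | z) = 0.5
rhs = (pAZ[(a, z)] / pZ[(z,)]) * (pBZ[(b, z)] / pZ[(z,)])   # 0.5 * 0.5 = 0.25
print(lhs, rhs)                                             # 0.5 vs 0.25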

We also have an example that goes the other way around; it is a bit more complicated. Now there are three persons, Alice, Bob and Caspar, and each of them throws a coin. The game is that Alice tries to guess the number of tails among these three coins, but she can only look at her own coin. So if Alice sees that she has heads, what can she tell about the number of tails in these three coin throws? She can actually say something: if she has heads, there can be at most two tails, so she can rule out the event that there are three tails. Her coin therefore carries information about the total number of tails in these three coins, so Alice's coin is dependent on the number of tails; she can say something, so they are not independent. There is an information flow from A to T, and you can also see it the other way around: if you knew that the number of tails were zero, you would know that A has to be heads. So there is actually dependence here.

Now let's say Eve sees the coins of Alice and Bob and tells us how many tails the two of them show together, say two. Can we then determine whether the total is two or three tails? The answer is no, because we don't know what Caspar threw, heads or tails. And if we ask the further question, what information goes beyond what Eve said, does Alice's coin tell us anything more about the total number of tails, the answer is again no. So in this case the coin of Alice is independent of the total number of tails given the information Eve provided. Put differently, if Z is the random variable representing the number of tails of Alice and Bob together, then, because Caspar's coin was thrown independently of Alice's, the only uncertainty remaining in the sum once Z is known is Caspar's coin, and this remaining information is independent of A. Knowing A provides no information about the total number of tails beyond the number we already know from Z; we still don't know whether it is two or three.

These are two typical examples showing the relationship, or rather non-relationship, between conditional and unconditional independence: neither implies the other. So what kind of relations can we expect for conditional independence? There are such relations, and they are called the separoid axioms. If you have time, I advise you to try to prove them just from the definitions. First, you can show that if you condition on X, then any variable is independent of X, without further assumptions. Secondly, as you can see from the definition, you may switch X and Y: if X is independent of Y given Z, then Y is also independent of X given Z; this is just the symmetry of the product. Then there is decomposition: if you have four variables and X is independent of the joint variable (Y, W) given Z, you can integrate out W and see that X is also independent of Y given Z. Then there is weak union, saying that if X is independent of the joint variable (Y, W) given Z, you can move either of the two variables into the conditioning part. And finally there is contraction, in a sense the reverse of the last two rules together: if both of these statements hold, which is what is written here with an 'and' in the middle, then the independence with the joint variable also holds. So X is here, Z is always in the conditioning set, and this Y here is pushed over the conditioning bar, and you end up with the combined statement; you could try proving it, it is not so difficult. We might encounter these rules later on.

As a summary of this chapter: we defined conditional independence, in symbols, as the requirement that for all values x, y and z the joint conditional p(x, y | z) factorizes into p(x | z) times p(y | z). Example one shows that independence does not imply conditional independence in general, and the other example shows that conditional independence does not imply independence.

1 Entropy

Hi everyone, in this video we'll talk about entropy and related information-theoretic quantities. Let's start with a typical game everyone knows, called 'Who am I?'. The game works as follows: someone writes the name of a character, a real or fictional person, on a sticky note and puts it on another person's forehead, and that person has to guess who is written on the note. To explain the rules a bit: the only thing the guessing person is allowed to do is ask yes/no questions. They can ask 'Am I Shakespeare?' and get a yes or no, then 'Am I Marie Curie?' and again get a yes or no. As you can see, this is not a very efficient way to figure out who you are, because you cannot just go through every person in history, every movie character and every other fictional figure, asking one by one 'Am I this person?'. You have to be more strategic, more systematic. People who play this game a few times usually figure out good strategies: they try to eliminate as many possibilities as possible at once, for instance by asking 'Is the person alive?', which already excludes most people who ever lived, or by asking about the continent the person lives on, 'Does this person live in Europe? In America?', which again eliminates a lot of candidates from the list. So the most efficient way to do this is usually to think about all possible characters, divide them into two halves, and ask a question that eliminates one half from the batch. This is very close to the optimal strategy; indeed, if the characters are uniformly distributed, it is essentially the best one. So now we can ask: for a given person, what is the minimal number of yes/no questions required to determine that person? And this is already the main question

behind entropy: what is the minimum number of yes/no questions needed to find a specific person, averaged over all possible characters? This is already the notion of entropy. What you need to know, if I play this game with you, is basically the whole space of possible characters, and if you want to be even better at it, you can also use the frequencies with which people choose these characters and put them on sticky notes. So we can already state the informal definition of entropy. Start with a random variable X with some distribution p(X). The entropy of X is then the minimum number of bits, which is basically the same as yes/no questions, needed to encode the state of X. For a given outcome of X you ask how many yes/no questions you need to figure out this specific value; but since you don't know the value in advance, you average over all possible states of X, weighted by their probability or frequency. So the informal, or heuristic, definition is: take the minimum number of bits needed to encode the state, or location, of X, and average over all outcomes of X. Okay, now let's make

our life a little bit easier and consider a checkerboard. Let's start with a small one with just two fields, and let's say Alice selects a field; that is basically the equivalent of 'Who am I?'. Say it is this field, and Bob has to ask yes/no questions to find the field Alice had in mind. Here we explicitly see the whole state space: we have two fields, left and right, and we have to figure out which one it is. Say this is the field Alice took; then Bob can ask 'Is it black or white?', and with one question he already knows which one it is. To be precise, he has to ask a yes/no question, so he would ask 'Is it black?', Alice would say yes, and he would have found the field; even if he had asked 'Is it white?' and Alice had said no, Bob would also know which field it is. So let's write down how many bits are needed to encode this field: call it A1 and write down its number of bits; we have seen it is one bit. And if Alice draws the field randomly, here uniformly, what is the probability of getting this field? It is one half, because there are two fields. We can ask the same question for the other field, A2: what is the number of bits for this field? The same as before, one bit, and the probability is again one half. So what is the entropy? The entropy is the expectation value of the number of bits: in this case it is p(A1) times the number of bits of A1 plus p(A2) times the number of bits of A2, which is one half times one bit plus one half times one bit, so the average number of bits needed is just one.

Now let's do the same for the square checkerboard with four fields. This time Bob picks the field and Alice has to figure out which one it is. Say Bob takes this field here; how many bits would you need? Of course, as discussed before, Alice could ask 'Is it this field?', yes/no, one bit, 'Is it this field?', yes/no, one bit, 'this field?', yes/no,

one bit, and so on. So you would need, say, three answers before you know the field; that would be three bits. But you can be more efficient: you can ask 'Is it on the left side?', Bob says yes, 'Is it at the bottom?', Bob says yes, and with just two questions we have narrowed down where it is. Of course the same holds for every other field in this simple example, so the number of bits for, say, B3 would be two. And what is the probability of getting this field if you draw randomly? It would be 1/4, because there are four fields. Of course it is no coincidence that this is one over a power of two. We can do the same for this field, this field and this field, and for every i we come to the same conclusion. Calling the selected field B, the entropy H(B) is the expectation value of the number of bits, which is the same for every i: it is the sum over i from 1 to 4 of p(B_i) times the number of bits needed for B_i, which is four times 1/4 times 2, so we get two bits.

Okay, now we can go over to the next example, which is just a bigger board, but otherwise the same as before. Alice picks again, let's say this field here, and how many yes/no questions are needed? We use the same strategy as

before: we ask 'Is it on the left or on the right?', yes/no, then we know the side; 'Is it at the bottom?', yes/no, so we are down to a few fields; then we can ask 'Is it black or white?', and finally 'Is it the left one or the right one?', and then we have it. So that was one, two, three, four questions, which means the number of bits for this field is four. Now, what is the probability of getting this field? We have four times four fields, so sixteen in total, and the probability is 1/16, which is the same as 1/2 to the power of 4. As you can see, this is no coincidence: if you go on like this, the number of bits in this setting is nothing else than the exponent appearing here, as long as you can always subdivide into two halves. So besides computing the entropy, which is again easily seen to be the number of bits, four for each field and hence four on average, what we also got here is the relationship that the probability p(A) is 1/2 to the power of the number of bits. Taking the logarithm to base two, this means that the number of bits for A is minus log2 of p(A). This formula can be shown to generalize beyond these power-of-two cases. Let us first look at a slightly unequal state space: say we have three regions, this field here, this field here,

and then the whole remaining part, which we treat as one region. Now Alice chooses one of these regions at random, where the size of each region reflects the probability that it is drawn: here that would be 1/4, here 1/4, and here 1/2. So how many questions do I need to identify these regions? Let's start with the first one: what is the number of bits for A1? As before we can create half-half situations: we ask 'Is it left or right?' and then 'Is it up or down?', and then we have figured it out, so here we need two bits. If you take the number of bits for A2, what would it be? We can use the same partitioning, left/right and then up/down, and we also get two bits. If you look at the probability of getting these regions, it is again the same as before, one over two to the power of the number of bits: p(A1) is 1/4, which is 1/2 to the power of two, and likewise p(A2) is 1/4, again one over two to the number of bits. The question is whether this still holds in the uneven case for the last, larger region, and we can check that it does, which means the formula is more general than we might have thought in the beginning. So what is the number of bits needed there? We can simply ask whether we are in this bigger region or not; if the answer is yes we are already done, and if it is no we also know where it is not. This region is bigger, but that also means we need fewer questions. So here the number of bits to figure out whether we are in A3 is 1, and the probability of A3 is 1/2, which is again one over two to the power of the number of bits. So even in this case the formula holds.

Now if we want to compute the entropy, it looks a bit less clean, because we have a sum over i equals 1, 2, 3 of the probability of A_i times the number of bits of A_i. Writing this down, it is 1/4 times two plus 1/4 times two plus 1/2 times one. Adding this up, we get one half plus one half plus one half, so the entropy in this case is three halves, meaning that the average number of bits needed to encode the state is not two, but also not one; it is somewhere in between, and the exact number is of course given by the formula. So even for this

unevenly partitioned state space we were able to derive a value for the entropy. Now think about something that does not follow this one-over-two-to-the-power-of-something pattern; what do we do then? We first look at the probabilities, or frequencies, of how often we would get each specific field; with nine fields and a uniform distribution, the probability of each of them is 1/9. Now the question is how many bits I need, and this is hard to say directly: we cannot just use the halving trick anymore, we would have to partition the board differently. Of course we could partition it into smaller and smaller sub-fields whose sizes are powers of two and approximate, and in the end what we would get is exactly the formula written above: the number of bits for A is roughly minus the logarithm to base two of p(A). So we use this formula, which goes back to Shannon, and you also find it under the name Shannon or Shannon-Fano coding. Now we can compute the entropy, the expected number of bits, which is what we wanted: we just plug the formula in, and if we know the distribution, we know the bits. In this case the distribution is uniform with probability 1/9, so each field needs minus log2 of 1/9 bits, which is log2 of 9, roughly 3.17 bits, and since this is the same for every field it is also the entropy.
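As a small sanity check, here is a sketch (my own, not from the lecture) of the formula H = minus the sum of p_i log2 p_i applied to the examples above: the uneven 1/4, 1/4, 1/2 partition, the uniform nine-field board, and the 4-by-4 checkerboard.

import math

def entropy_bits(probs):
    # Shannon entropy in bits: H = -sum p * log2(p), skipping zero-probability states.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.25, 0.25, 0.5]))   # 1.5 bits, as computed above
print(entropy_bits([1 / 9] * 9))         # log2(9), roughly 3.17 bits
print(entropy_bits([1 / 16] * 16))       # 4.0 bits for the 4x4 board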

So now we can come to the formal definition as it is used nowadays. Say we have a random variable X; we also have to say that it comes with a distribution p, and actually we only need the distribution. Then the entropy of X is formally defined as the expectation value of minus log p(X). This is the formal definition of entropy, and we have to make a short remark: this p(x) means something different for discrete and continuous variables. For discrete variables it is a probability mass function, and in that case everything follows exactly as we described before. In the continuous case it is not a probability mass function but a probability density; we still use the same formula, but the two behave differently, because densities can go above one while probability mass functions cannot. For discrete X this is called the Shannon entropy, and for continuous X it is called the differential entropy, but in both cases the same formula holds: for discrete variables the expectation is a sum, for continuous variables it is an integral over the probability density.

Another question is which logarithm to use, base two or the natural logarithm. This is not so important, because logarithms can be converted into each other by multiplying with a constant. If you want bits as the output, you take base two; if you want nats as the output, you take the natural logarithm with base e, Euler's constant. In the end, whichever one you choose, just be consistent. In probability theory we usually go with base e, because then the log and the exponentials that come up, say in the Gaussian, cancel nicely; for discrete variables, especially if we really want to count bits as in computer science, we often take base two. But be aware that it doesn't really matter; just be consistent with the logarithm.

Now a few remarks. What does entropy measure? Entropy measures uncertainty, or how the distribution is spread across the whole space, and it is quite similar to the variance. The variance measures how much the variable deviates from the mean, but in contrast to the variance, entropy doesn't need

a center of mass. The variance only measures how far the variable is from the mean, while entropy is also well defined and meaningful for something multimodal; you can compute the variance of a multimodal distribution, but it is less meaningful. Say we have a bimodal distribution p(x) with two peaks far away from zero; the center of mass, the expectation value, lies in between the two peaks. In this sense the entropy better measures the spread across the space: the variance here would be very large, while the entropy is rather small, because the mass is concentrated around two points. If we draw the peaks even sharper, the distribution has high variance but low entropy: what we basically need to encode is just which of the two peaks we are in, so it is very close to one bit, but the variance is large because the variable deviates from the mean a lot.

So, for the interpretation, as with the variance: if the entropy is small, X is concentrated in a small area, and if it is big, X is scattered around; here it is small, since the mass sits at two points. The entropy of discrete variables, the Shannon entropy, is always greater than or equal to zero. This is not true in general for the differential entropy, and the reason, as I already said, is that densities can exceed one, so minus log can become negative, and then the average can also go negative. Probability mass functions are bounded by one, so minus log stays non-negative and the entropy of discrete variables is always positive or zero, while in general this might not

be true. So let's compute the entropy for a few distributions we have seen so far. We start with exponential families, since almost all distributions we know are exponential families; if we compute the entropy for exponential families, we have computed it basically for all of them. We plug in the formula: the entropy is the expectation of minus log p(x | eta), with the expectation of course taken with respect to the distribution with parameter eta. What we get is the expectation value of minus log h(x), and then minus whatever is written inside the exponential, since the log and the exp cancel if we take the natural logarithm; so here we just take the natural logarithm, and let me make this clear by writing it out a bit more simply. What we are left with is minus the expectation value of log h(X), then minus eta transpose times the expectation value of T(X), plus A(eta), which has no dependence on x. We have seen that the expectation of T(X) is exactly the gradient of A at eta, so we can write the entropy as A(eta) minus eta transpose times the gradient of A(eta), minus the expectation of log h(X). So we can compute almost everything just from the log normalizer; only the term minus E[log h(X)] remains, where h was the base measure, and often the base measure is constant, so that the dependence on x is gone and this term is not in the formula anymore. Then we are left with something computable purely from the log normalizer and its derivative. So if you have an exponential family in its natural parameterization, you can take the log partition function, take its derivative, plug it in here, take the base measure, take its logarithm, compute the expectation value, and you are done. But since Gaussians are so important, we will compute the Gaussian case explicitly as well.
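Written compactly (my own summary of the derivation just described, using the exponential family form p(x | eta) = h(x) exp(eta^T T(x) - A(eta))):

H(X) \;=\; \mathbb{E}_{\eta}\!\left[-\log p(X \mid \eta)\right]
\;=\; A(\eta) \;-\; \eta^{\top}\,\nabla A(\eta) \;-\; \mathbb{E}_{\eta}\!\left[\log h(X)\right],
\qquad \text{using } \mathbb{E}_{\eta}\!\left[T(X)\right] = \nabla A(\eta).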

Again we use the formula: the entropy is the expectation value of minus log p(x | mu, Sigma). Taking minus the log, and using the natural logarithm to get rid of the exponential, we get one half log det(2 pi Sigma), and from the exponential term the log just leaves its argument; the minus in front and the minus in the exponent cancel, so we get plus one half times the expectation value of (X minus mu) transpose Sigma inverse (X minus mu). Here we use the trace trick and write this as the trace of Sigma inverse times (X minus mu)(X minus mu) transpose. The trace is linear and the expectation value is also linear, so we can exchange them: this is one half log det(2 pi Sigma) plus one half times the trace of Sigma inverse times the expectation of (X minus mu)(X minus mu) transpose. We know that this expectation is just the covariance matrix Sigma, that is the definition of the covariance, so Sigma inverse and Sigma cancel and we are left with the trace of the identity matrix, which just adds up the ones on the diagonal and gives D, the dimension. So we get one half log det(2 pi Sigma) plus D over two.

This can be simplified further by noting that D over two is D over two times one, and one is just the natural logarithm of e; alternatively, we could have kept the logarithm general from the start, in which case a factor of log e appears everywhere, which equals one for the natural logarithm. Either way we can simplify: we have a one half here and a one half there, a log here and a log there, and D times log e; we can write this as one half times a single log, put the D into the exponent to get a product of D factors of e, and pull that inside the determinant. What we get is one half log det(2 pi e Sigma), and we're done. This is the formula for the entropy of a D-dimensional Gaussian distribution, where you still have the freedom to choose the logarithm: if you take base two, you measure the entropy in bits, and if you take base e, you measure it in nats.
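As a quick numerical check (my own sketch, not part of the lecture), one can compare the closed-form 1/2 log det(2 pi e Sigma) against a Monte Carlo estimate of E[-log p(X)] for some covariance matrix:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Closed form in nats: H = 1/2 * log det(2 * pi * e * Sigma)
H_closed = 0.5 * np.linalg.slogdet(2 * np.pi * np.e * Sigma)[1]

# Monte Carlo estimate of E[-log p(X)] with X ~ N(mu, Sigma)
X = rng.multivariate_normal(mu, Sigma, size=200_000)
diff = X - mu
Sinv = np.linalg.inv(Sigma)
log_p = (-0.5 * np.einsum('ni,ij,nj->n', diff, Sinv, diff)
         - 0.5 * np.linalg.slogdet(2 * np.pi * Sigma)[1])
H_mc = -log_p.mean()

print(H_closed, H_mc)   # the two values agree up to Monte Carlo error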

2 The Maximum Entropy Principle

Hi everyone, in this short clip I want to introduce you to the maximum entropy principle. The maximum entropy principle simply says: whenever you don't know which probability distribution to pick, take the distribution that expresses the highest amount of uncertainty, meaning entropy, given all known constraints. This means you try to solve the following optimization problem: you take the entropy as a function of the distribution, and you maximize it over all probability distributions satisfying the constraints. This can be used for choosing a prior in Bayesian statistics, if you know what kind of constraints you have, and it is also used as a basis of classical statistical mechanics in physics. As maybe a small exercise, you can derive exponential families from this principle: if you have functions that impose constraints on your random variable, in the sense that the expectation of each function should equal some given number, then you can express this as an optimization problem with the following Lagrangian. You want to maximize the entropy, here you have the expectation constraints, and you also need the constraint that it is actually a probability distribution, meaning it adds up to one; the Lagrange multipliers appear here and here, and I use a suggestive notation. If you solve this optimization problem, you will find that the distribution you get out of the maximum entropy principle is of exponential family form. I'll switch this off now so you can read it.
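For reference, the Lagrangian mentioned above can be written roughly as follows (my own transcription, with lambda_0 and eta_i as the multipliers, T_i the constraint functions and c_i the prescribed expectations):

\mathcal{L}[p] \;=\; -\int p(x)\,\log p(x)\,dx
\;+\; \lambda_0\Big(\int p(x)\,dx - 1\Big)
\;+\; \sum_i \eta_i\Big(\int p(x)\,T_i(x)\,dx - c_i\Big).

Setting the functional derivative with respect to p to zero gives p(x) proportional to exp(sum_i eta_i T_i(x)), that is, an exponential family with natural parameters eta and sufficient statistics T.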

3 Relative Entropy

Hi everyone, in this clip we'll talk about relative entropy. What the relative entropy is we will come to in a moment, but first we recap Jensen's inequality. Think of a real-valued random variable X and some convex function phi; convexity here is essential. A typical example is the absolute value, or consider a twice-differentiable function whose second derivative is greater than or equal to zero. Then we always have the inequality that phi applied to the expectation value of the random variable is smaller than or equal to the expectation value of phi of X. For instance, taking the absolute value as phi, the absolute value of the expectation of the random variable is always smaller than or equal to the expectation of the absolute value. Furthermore, if we have a strictly convex function, we can also say when equality holds. A typical example of a strictly convex function is one whose second derivative is strictly positive everywhere. In that case equality holds if and only if X is almost surely a constant random variable, and then it takes exactly this expectation value as its value. So: equality if and only if X is a constant random variable. We will use this in a bit.

Now we come to the actual definition of the relative entropy, which is also called the Kullback-Leibler divergence; relative entropy and Kullback-Leibler divergence are the same thing. We assume that we have two probability distributions p(x) and q(x), living on the same space, otherwise we cannot compare them, and we define the Kullback-Leibler divergence between p and q as follows.

Let's write it out for continuous variables first: we take the expectation value, which is an integral over x of p(x) times the logarithm of p(x) divided by q(x), dx. In the discrete case we can write it as the sum over x of p(x) times the log of p(x) divided by q(x). Altogether we can write it as the expectation value under p of log of p divided by q, or, if you really want to write the arguments, we say that X is distributed according to p and take the expectation of log of p(X) divided by q(X); this covers both cases. Another way to decompose it is to use that the log of the ratio is the difference of the logs: we can write it as the expectation under p of minus log q minus the expectation under p of minus log p, that is, the cross entropy between p and q minus the entropy of p; the first term is a cross entropy and the second is an entropy.

A short remark: since probabilities can be zero and the log of zero would blow up, we make the convenient conventions that zero times infinity is zero, one over infinity is zero, one over zero is infinity, and zero divided by zero is defined to be zero, only in this context. Now that we have defined it, let's talk about what it can actually do. Assume again that we have two probability distributions p and q on the same space of the random variable X. Then we always have the inequality that the Kullback-Leibler divergence between p and q is greater than or equal to zero. This is an important inequality;

furthermore, we know when equality holds: we have an equality sign if and only if the two distributions are the same. So we can measure how far the distribution q deviates from p just by computing this divergence, which can be expressed in bits; if there is no difference in bits between p and q, then they are equal. So we are measuring the distance between distributions in bits or in nats. Why is this true? Let us prove it; it is actually very simple with Jensen's inequality. The only observation we have to make is that minus log is convex. Let's write it out: the Kullback-Leibler divergence between p and q was the expectation under p of log of p over q, which we can of course write as the expectation under p of minus log of q over p. Now minus log is a convex function, so by Jensen's inequality this is greater than or equal to minus log of the expectation under p of q over p. But what is this? Inside the minus log we have the integral over x of p(x) times q(x) over p(x); as you can see, the p(x) cancels and we are left with the integral of q(x), and since q is a probability distribution this integrates to one. So what we get is minus log of one, which is zero. Furthermore, we know that equality holds if and only if q over p is a constant random variable, that is, q(X) over p(X) is constant, and the constant is its expectation value, which we just saw is one; this is equivalent to q(x) = p(x) for all x. That is already the proof; it follows directly from Jensen's inequality, but it has a lot of consequences.

Now a few remarks. The Kullback-Leibler divergence is not symmetric in the arguments p and q. The reason, as you can see on the last slide, is that even though p and q enter the log in a kind of symmetric way, up to a minus sign, the expectation is taken with respect to the first distribution and not the second, so if you switch p and q you are also changing the expectation. This is the reason

it is not symmetric, and you can construct distributions where you see the difference in actual numbers. Furthermore, because we have logarithms in there, things can go wrong and the value can actually blow up: the Kullback-Leibler divergence is greater than or equal to zero, but it can be infinite. A small numerical illustration of both points is sketched below.
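Here is a minimal sketch (my own) of the discrete Kullback-Leibler divergence, showing non-negativity, the asymmetry, and how it blows up when q puts no mass where p does.

import math

def kl(p, q):
    # Discrete KL divergence D(p || q) in bits, with the 0 * log(0/...) = 0 convention.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) := 0
        if qi == 0:
            return math.inf     # p has mass where q has none -> infinite divergence
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(kl(p, q), kl(q, p))                      # both >= 0, and not equal (asymmetry)
print(kl(p, p))                                # 0.0: equality iff the distributions agree
print(kl([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))    # inf: q misses mass that p has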

So what does the Kullback-Leibler divergence mean heuristically? Heuristically, it measures the average additional number of bits needed to encode the location of X, which was sampled from the true distribution p(x), when we encode it using the proposal distribution q(x): we make errors, and these additional bits are the price of those errors. This is the interpretation of the divergence. Furthermore, if we have a distribution and want to find something that approximates it well, the principle of minimum relative entropy basically says: minimize the relative entropy, meaning the Kullback-Leibler divergence. We will see on the next slide how this can be used to derive maximum likelihood estimation.

For this, consider that you have i.i.d. data from a true distribution, call it q(x), and we want to learn what this true distribution is. If we knew q(x), we could write down what the best approximation would be. We approximate it with a parametrized model: we have a statistical model, meaning a set of distributions p(x | theta) indexed by some parameter theta, and we want to find the parameter that best approximates the true distribution. The principle of minimum relative entropy tells us to take the theta that minimizes the Kullback-Leibler divergence, where in the first argument we always write the better, here the true, distribution. So the first argument is q(x), we approximate it with our model p(x | theta), the minimization runs over the whole parameter space, and we pick the parameter such that the corresponding model makes the least error towards the true distribution, in bits or nats, whatever you want. But there is one problem: we cannot compute this, because the true distribution is usually not known. This is a bit unfortunate. So let me write it out: the divergence is the expectation under q of log of q(x) divided by

p(x | theta), and this is our divergence, but we don't know the true distribution q. What we can do is use the law of large numbers to approximate the expectation value: it is roughly one over N times the sum over n from 1 to N, where we plug in the samples, log of q(x_n) divided by p(x_n | theta). So instead of minimizing the Kullback-Leibler divergence we can minimize this empirical approximation. What does that mean? First, the q part does not depend on theta. What we are left with is minus one over N times the sum over n from 1 to N of log p(x_n | theta), plus a term that is constant in theta. Up to the sign, minimizing this is the same as doing maximum likelihood estimation, which we write down here: maximize over theta one over N times the sum over n from 1 to N of log p(x_n | theta).

So what we have just arrived at is this: if we start from information theory, from the principle of minimum relative entropy, and we are willing to approximate the true distribution using a large sample and the law of large numbers, which justifies this, then we arrive at maximum likelihood estimation. Before, we always took maximum likelihood estimation as the learning principle; now we have derived, or at least justified, this principle from the principle of minimum relative entropy and the law of large numbers. A tiny sketch of this equivalence is given below.
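A tiny sketch (my own, with a Bernoulli model as a stand-in) of this equivalence: minimizing the empirical KL objective over theta differs from the negative average log-likelihood only by a theta-independent constant, so both are optimized by the same parameter.

import numpy as np

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.7, size=1_000)        # i.i.d. samples from an unknown Bernoulli

thetas = np.linspace(0.01, 0.99, 99)
# Average log-likelihood (1/N) * sum_n log p(x_n | theta) for the Bernoulli model
avg_loglik = np.array([np.mean(data * np.log(t) + (1 - data) * np.log(1 - t))
                       for t in thetas])
# The maximizer is (up to the grid resolution) the sample mean, i.e. the MLE.
print(thetas[np.argmax(avg_loglik)], data.mean())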

Okay, let us next compute the Kullback-Leibler divergence between two distributions from the same exponential family, written in its natural parameterization. We pick two etas, eta_1 and eta_2, corresponding to two different distributions; here I don't mean the components of one eta, I really mean two different vectors of natural parameters. Then I want to compute the Kullback-Leibler divergence between them and see what comes out. First we write it as the expectation value, under eta_1, of the log of p(x | eta_1) divided by p(x | eta_2). Now we just plug in the formula. The base measure h(x) does not depend on eta, so it cancels between the two densities, and inside the log ratio the log and the exponential cancel; we are basically left with the exponent for eta_1 minus the exponent for eta_2. What we get is the expectation value with respect to eta_1, meaning under the distribution corresponding to eta_1, of (eta_1 minus eta_2) transpose T(x), minus A(eta_1) plus A(eta_2). Note that the expectation goes over X, the random variable, not over eta; the A terms do not depend on x, only T(X) does. So this equals (eta_1 minus eta_2) transpose times the expectation under eta_1 of T(X), minus A(eta_1) plus A(eta_2). And we know that this expectation is nothing else than the gradient of A evaluated at eta_1. So we can now write that the Kullback-Leibler divergence is A(eta_2) minus A(eta_1) minus (eta_2 minus eta_1) transpose times the gradient of A at eta_1. We have derived a formula for the divergence between two distributions from the same exponential family, purely in terms of the log partition function evaluated at the two parameter vectors and its derivative at eta_1. On the next slide we will give a geometric interpretation; a small numerical check of the formula is sketched below.
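As a small check (my own sketch), take the family of exponential distributions, where eta = -lambda, T(x) = x and A(eta) = -log(-eta); the formula above should reproduce the textbook value log(lambda1/lambda2) + lambda2/lambda1 - 1.

import math

def A(eta):
    # Log partition function of the exponential distribution in natural parameters.
    return -math.log(-eta)

def gradA(eta):
    return -1.0 / eta

def kl_expfam(eta1, eta2):
    # KL(eta1 || eta2) = A(eta2) - A(eta1) - (eta2 - eta1) * grad A(eta1)
    return A(eta2) - A(eta1) - (eta2 - eta1) * gradA(eta1)

lam1, lam2 = 2.0, 0.5
bregman = kl_expfam(-lam1, -lam2)
closed = math.log(lam1 / lam2) + lam2 / lam1 - 1.0
print(bregman, closed)   # the two values agree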

So what does it mean geometrically? Let's draw something: an axis here and an axis here, and we draw our log partition function A, a convex curve that looks something like this. Now we have two points: one point lies here, this is eta_1, and another point lies here, this is eta_2, with function values A(eta_1) here and A(eta_2) here. Now we draw the tangent to the curve at the first point; okay, another try, this is much better. So this is the tangent of A through the point eta_1. At eta_2 this tangent line has the value A(eta_1) plus the gradient of A at eta_1 times the difference between the points; writing it down properly, it is A(eta_1) plus (eta_2 minus eta_1) transpose times the gradient of A at eta_1. So at eta_2 we have the tangent value and the function value A(eta_2), and the gap between the two is exactly the Kullback-Leibler divergence between eta_1 and eta_2, meaning between the corresponding distributions: the gap between the function evaluated at eta_2 and the linear prolongation of A from eta_1 to that point. If you do it the other way around, you put the tangent through the point eta_2, look at the gap on the other side, and obtain the reversed Kullback-Leibler divergence. You can see that whether these two quantities are equal when you flip eta_1 and eta_2 depends on the curvature of the curve, and in general they are not equal: the curve could be much flatter on one side and steeper on the other, so one gap would be huge while the other stays small. This is the geometric interpretation of the Kullback-Leibler divergence.

4 Conditional Mutual Information

Hi everyone, in this clip we'll talk about conditional mutual information and a few more information-theoretic quantities. First of all, let's consider two random variables X and Y with some joint distribution. Then we can talk about the joint entropy, which is defined basically as the entropy of the pair (X, Y) seen as one variable. Since it only depends on the distribution, we can use this notation, or, if you really want to write it out, it is again the expectation value over the joint distribution of minus the logarithm of p(x, y), basically the same formula as before, just treating the two variables as one variable with two entries.

Then we can talk about the conditional entropy. This is basically the uncertainty that remains in X after we know the value of Y. There are different ways to define it. We can say it is the expectation over Y of the entropy of the conditional distribution of X given Y, meaning that for every single y we look at the conditional distribution, compute its entropy, and then take the expectation value. There are other ways to write it: we can also write it as the joint entropy minus the marginal entropy, H(X | Y) = H(X, Y) minus H(Y), and analogously for H(Y | X) once it is defined, by adding the marginal of X and subtracting the marginal of Y, so that these equations hold; checking them is a small exercise. Let me write it out more explicitly for continuous variables, so you see what I mean: we take the expectation value over y, inside that the expectation value over x given y, and inside that we use the entropy formula with minus log p(x | y), integrating over x and then over y. Of course you can also combine the two outer expectations into one over the joint distribution; then it reads as the integral of p(x, y) times minus log p(x | y) dx dy, and this is the equation.

Then we can talk about conditional mutual information, and for this let us consider three variables X, Y and Z with a joint distribution. First we can talk about the mutual information, which basically says how much information X and Y share; it is again something like the covariance, but without a reference point like the mean. The mutual information can be defined, and there are again several equivalent ways to do it, as the Kullback-Leibler divergence between the joint distribution of X and Y and the product of its marginals. As soon as you have written this down, you can also show that it equals the marginal entropy minus the conditional entropy, I(X; Y) = H(X) minus H(X | Y), and even that it is symmetric, also equal to H(Y) minus H(Y | X).

So far these were just two variables. We can also define a conditional mutual information, telling us how much information X and Y share beyond the information already encoded in Z. Again we can write this in several ways. One way is to take the expectation value over Z of the Kullback-Leibler divergence between the joint conditional p(x, y | z) and the product of the conditionals p(x | z) p(y | z): we compute the Kullback-Leibler divergence for every single z individually, look at what comes out, and then average over z. You could also try to write it as a mutual information between X given Z and Y given Z, but I think that would only cause confusion, so I'll leave it with this formula. As soon as it is defined, you can show that more relations hold: it is the same as the conditional entropy H(X | Z) minus the conditional entropy H(X | Y, Z), and you can also write it the other way around, because it is symmetric in X and Y, which of course needs to be proven by pushing the equations around; I leave this as an exercise. So what do these quantities encode? The mutual information of X and Y is the information they share, and the conditional mutual information is the information they share beyond the information already contained in Z. A small numerical sanity check of the two-variable identities is sketched below.
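Here is a minimal sketch (my own) checking on a small joint table that the mutual information computed as the KL divergence between the joint and the product of marginals equals H(X) - H(X | Y).

import numpy as np

# A small joint distribution p(x, y) as a 2x3 table (rows: x, columns: y).
pxy = np.array([[0.20, 0.15, 0.05],
                [0.10, 0.20, 0.30]])
px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)

def H(p):
    # Shannon entropy in bits of a probability table.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

I_kl  = np.sum(pxy * np.log2(pxy / (px * py)))   # KL( p(x,y) || p(x) p(y) )
I_ent = H(px) + H(py) - H(pxy)                   # = H(X) - H(X|Y), since H(X|Y) = H(X,Y) - H(Y)
print(I_kl, I_ent)                               # identical, and >= 0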

Now, Shannon established a fundamental result: if we have three random variables with some joint distribution, then we always have the inequality that the conditional mutual information is greater than or equal to zero, and furthermore equality holds if and only if X is conditionally independent of Y given Z. This can easily be proved, and we do it in a moment, but before that I want to mention that the analogous statement also holds for the unconditional mutual information and unconditional independence, simply by taking Z to be a constant variable.

The proof goes as follows. The conditional mutual information is the expectation value over Z of the Kullback-Leibler divergence between p(x, y | z) and p(x | z) p(y | z), and we have seen before, from the fundamental information-theoretic inequality, that each of these divergences is greater than or equal to zero. If I take the expectation value over quantities that are all greater than or equal to zero, it can only be zero if they are all zero. So we have equality if and only if all these Kullback-Leibler divergences are zero, and we have seen that such a divergence is zero if and only if its arguments agree, that is, if and only if p(x, y | z) equals p(x | z) times p(y | z) for all x, y and z, which is nothing else than conditional independence. Conversely, if conditional independence holds, then this factorization holds for all x, y and z, so the Kullback-Leibler divergence is zero for every fixed z, and equality holds. So: equality holds if and only if all the divergences are zero, if and only if their arguments are equal, if and only if conditional independence holds. Now I also want to visualize how all these quantities fit together, because that is usually easier to understand.

Let us draw some circles and consider them as the information content of the variables. For two variables X and Y we draw two overlapping circles; the areas of the regions correspond to the number of bits in the various quantities. The whole first circle is H(X), the whole second circle is H(Y), the overlap is the mutual information, here we have H(X | Y), and here we have H(Y | X). I think it is clear what I mean; let me hatch the regions again so you can see which part is which. This part here, the overlap, is the mutual information. This is how the information diagram looks for two variables.

Now let us look at three variables: H(X) is here, H(Y) is here, and H(Z) is here. We can now look at all these regions. This region is the mutual information between Y and Z given X, this region is the mutual information between X and Z given Y, and this one is the mutual information of X and Y given Z. You can also write down what is in here: this region is H(Y | X, Z), meaning this part here, and so on. Then let me mention something we haven't seen yet: the inner part in the middle is the higher-order mutual information between X, Y and Z, and one has to be careful with it, because it can actually be negative. So it doesn't quite follow the intuition that we are looking at areas whose size corresponds to bits; this part in the middle can be negative, which violates that interpretation

here. So it is just a heuristic. And lastly, I want to write down the chain rule for conditional mutual information, and for this we consider four variables. What we get is: if we take X in the first argument and the joint variable (Y, W) in the second argument, then one can show that this is the same as the conditional mutual information between X and Y given Z, plus the conditional mutual information between X and W given Y and Z. And we know that all these quantities are greater than or equal to zero. For instance, one could use this equality to show the separoid axioms for conditional independence. The chain rule is written out once more below.
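In formulas (my own transcription of the rule just described):

I(X;\, Y, W \mid Z) \;=\; I(X;\, Y \mid Z) \;+\; I(X;\, W \mid Y, Z),
\qquad \text{with every term} \;\ge\; 0.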

5 Information Theory - Wrap-up

Hi everyone, we want to wrap up the information theory part, and I'll just give you an overview of the most important concepts we have learned so far. What did we learn? We introduced entropy and its conditional version, with the formula basically telling us: take the number of bits, or here the conditional number of bits, and take the expectation value over them. Then we had the relative entropy, basically measuring the additional bits needed to encode P when using Q as a proposal; this was the formula where the expectation is taken with respect to the first distribution, which therefore appears twice, while the second distribution only appears inside the logarithm. We have shown, or at least seen, that it satisfies a fundamental inequality: the Kullback-Leibler divergence is always greater than or equal to zero, and equality holds if and only if the two distributions are the same; if it is zero, we know the two agree.

Then we had the conditional mutual information, defined basically as the Kullback-Leibler divergence between the joint distribution and the product of the marginals, where in the conditional setting everything is conditioned on Z and in the end we average over Z. This is the conditional mutual information, saying how much information X and Y share beyond the information in Z. Then we have seen that from the fundamental inequality we could easily derive Shannon's result that the conditional mutual information is always greater than or equal to zero, and equality holds if and only if X and Y are conditionally independent given Z. Finally, we have seen the chain rule of conditional mutual information, where Y and W are considered as one variable with two components, and we can split it into a term involving X, Y and Z, and then a term that accounts for the rest through W, but now conditioning on Y and Z as well. These are the most important concepts of the information theory part of this course.

6 Independent Component Analysis

Hi everyone, in this video we'll talk about independent component analysis. We start with the objective and the usual assumptions that are made. Let's consider a simple example: you are at a party, you see a lot of people talking to each other, and you would like to know what they say. The problem is that everything they say is mixed up, so you hear several voices at the same time. The question is: if you have several microphones and place them at certain positions, are you able to de-mix these measurements and get the sources back, to reconstruct them? One algorithm for this is called ICA, independent component analysis. To make this work, we have to make several assumptions, for instance that the mixing is linear. So we have some mixing matrix: the input of the first person enters with some weight a_11, and this makes up one part of the output of microphone one; then a second person enters with a different weight a_12, also ending up in microphone one, and so on up to a_1D. If you look at the second microphone, we again have weights a_21, a_22, and so on. So in the end we consider the sources and the outputs to be linearly related by some matrix, and we want to de-mix them. If we have just one microphone, we might not be able to de-mix them, but if we have several microphones, say at least as many microphones as persons, then we have a chance. We can make this more formal with the mathematical assumptions of independent component analysis. First we assume that the components, the sources, are independent, and we have K of them. So we have K random sources, they are independent, and we measure them over time; the independence needs to hold for every time step separately, so every source k is independent of all the others, which I indicate

with the minus k here, and this holds for all k. So we have K sources, each giving one value per time point. This means we can collect all the measurements into one K times T matrix, with K the number of sources and T the number of time steps. Then we also assume linear, noiseless measurements. In general one could add noise at the end, something we cannot control, but you can also treat the noise as a source of its own; so to some degree we can assume noiseless measurements by regarding external noise as just another source. This doesn't always work, but let's assume it is true, or that there is no noise. The second thing is the linearity. On the measurement side we have M signals, the microphones, and for every signal and every time step we measure one real number, so we can put them into an M times T matrix. You might have noticed that T indexes the columns here: usually we would put the data points into the rows, but here we order them by column. The linearity assumption is then that there is a matrix A such that if I multiply it with the sources, I get the signals, and what's important here is that this mixing matrix is time-independent: no matter which time step, the intensity with which each person reaches each microphone is the same. This alone would already be a problem in its own right, so we have to make further assumptions to get reliable results. One of them is completeness, meaning that we don't have more microphones than speakers, and also not fewer: we have exactly the same number of sources as signals. We also want to be able to invert the mixing matrix, otherwise how could we recover the sources? So it needs to be invertible,

and the numbers need to match this is often called completeness. The other cases are called over complete or under complete depending of which of the numbers is bigger. And then another thing is you need to assume non Gaussian itty. So we assume there's a distribution for the sources and we assume that non Gaussian why that? So Gaussian distribution is a distribution of noise, but we want a signal. So we assume that the distributes are non Gaussian. So there's actually signal to recover. For instance, we could assume that they're all super boston that they have tails heavier than causing distributions. And in this non glossy energy distribution I said none of them should be gaussians actually were allowed to have at most one of the sources can be noise. This was a

So we are working in a very restrictive setting; what can we gain from it? First, what is the problem we actually want to solve? We want to recover both the mixing matrix A and the sources S from the data. This is different from, for instance, supervised learning, where you have the data and also the response: there you would learn the linear relationship A from having both S and X. Here neither A nor S is given, only the data and the strong assumptions I just listed, and those assumptions

are actually strong enough to allow exactly that. Under the mentioned assumptions one can recover the sources up to sign, scale and permutation. The sign and the scale we genuinely cannot recover, because both can always be absorbed into the matrix A. The permutation cannot be recovered either: if I permute the sources and correspondingly permute the columns of A, I get the same signals. So the ordering, the scale and the sign of the sources are not

recovered. But if we fix an arbitrary sign, scale and permutation, then we can identify the sources and the matrix from the signals. In other words, since the scale cannot be recovered anyway, we may as well fix it: we assume the variance of every source is one, and since the sources are independent, their covariance matrix is the identity, written component-wise as delta_ij; the ones come from the fixed unit variance and the zeros from the independence. We can make further reductions, not through additional assumptions but through preprocessing of the data.

Most ICA algorithms benefit a lot from some preprocessing, and there are two typical steps. The first is centering the data by subtracting the mean: if, say, we had solved the ICA problem on the centered data, we can simply add the mean back at the end, so no information is lost as long as we store what the mean was. Since everything is linear, I can take expectations on both sides of X = A S, the matrix A moves out of the expectation, and I get the same relation between the expectation values. Subtracting gives the same linear relation between the centered data and the centered sources, so I can just call the centered data my new X and the centered sources my new S, keep the same linear relationship, and have gained

that they are centered. The second step is whitening the data. What does whitening mean? We do a full PCA, i.e. we find the principal components of our data, rotate the coordinate system so that we are aligned with these principal components, some of which are longer and some shorter, and then rescale them all to length one. Rotating and rescaling so that the covariance matrix of the new representation is the identity matrix is called whitening. In other words, there is a matrix B, which we can compute by PCA, such that B times the covariance of X times B transpose is diagonal, in fact the identity,

and if I multiply B with X, the transformed data has identity covariance. We can again write X for our new B X, but because we have multiplied the equation X = A S by B from the left, we also have to write A for the new B A. And we get something out of this beyond mere computation: first, as intended, the covariance of the new X is the identity; but if we plug in what X is in terms of the sources, we get A times the covariance matrix of S times A transpose, and on the previous slide we said that without

loss of generality the covariance of S is the identity, thanks to the independence and the fixed unit variance. That means A A^T is the identity as well, i.e. the new A is orthogonal. Before whitening, A was an arbitrary K x K matrix with K^2 parameters; after whitening we have this orthogonality constraint, leaving only K(K-1)/2 parameters. In this sense a full-rank PCA whitening is basically halfway through ICA.
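A minimal sketch of these two preprocessing steps (my own illustration, assuming a full-rank empirical covariance; the symmetric choice of whitening matrix is one of several possible conventions):

```python
# Minimal sketch (illustrative): centering and PCA whitening of the signal matrix X
# (shape M x T, columns are time steps).
import numpy as np

def center_and_whiten(X):
    """Return whitened data Z with empirical covariance ~ identity, plus the whitening matrix B."""
    Xc = X - X.mean(axis=1, keepdims=True)               # centering: subtract the mean of each signal
    cov = Xc @ Xc.T / Xc.shape[1]                        # empirical covariance, M x M (assumed full rank)
    eigvals, eigvecs = np.linalg.eigh(cov)               # full PCA via eigendecomposition
    B = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # whitening matrix (symmetric choice)
    Z = B @ Xc                                           # whitened data: Cov(Z) is (approximately) identity
    return Z, B
```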

7 ICA - Maximum Likelihood and Natural Gradient

hi everyone in this clip I want to show you how to derive the ICA algorithm from a maximum likelihood objective; we will also use natural gradients. Okay, let's start: how do we formulate this as a typical maximum likelihood problem? First, under the previous assumptions the signals are X = A S, where S are the sources. We write W for A inverse, which we assumed exists, and it is more convenient to parametrize things in terms of W than in terms of A. As you will see, the sources are then given by the data multiplied from the left by W, i.e. s = W x. Secondly, we had independence of

the sources. That means the joint distribution of the sources factorizes into its marginals, so we only need the marginals to have full control over the joint distribution of the sources. Note that we could write this for every time step separately, but I will omit the index t to keep the notation less cluttered. So what are the parameters of our model? Usually we want to infer everything that is not known, and that is actually a lot here: the parameters consist of the matrix W, but also of the marginal densities p_{s_1}, ..., p_{s_K}, which might be parametrized or entirely unknown.

So it is not just a handful of numbers, it is whole distributions; these are unknown and we treat them as parameters of our statistical model. We now want a model for x that depends on these parameters, that is, on W and on the source densities. How do we set this up? We use the equation from above: x is a deterministic function of s via A, so if I have a distribution on s, I can push it forward through A to get the distribution on x, and that is what we will do. The distribution on the space of x is the push-forward of the distribution of s under A, and

we have a formula for how densities change under such a map. What will it be? The density of x is p_S evaluated at W x, times a volume correction given by the absolute value of the determinant of the Jacobian of the map x -> s = W x. That Jacobian is simply W, so we get p_X(x) = p_S(W x) |det W|. Now we can write this out more explicitly.

Using the component-wise factorization of p_S, and writing w_k^T for the k-th row of W, we get p_X(x) = p_{s_1}(w_1^T x) * ... * p_{s_K}(w_K^T x) * |det W|. So we have found a representation of our statistical model. The next step, if we want to do maximum likelihood, is to take the logarithm and

maximize it, or at least attempt to. For this I abbreviate the empirical distribution by an expectation with a hat, Ê: it just means that, given the data matrix, I plug in all observed values and average; writing it this way keeps the notation readable. So what do we need to do? We take the model density from above, take its logarithm and average over the data points. The product turns into a sum, so we get L(W) = Ê[ sum_{k=1}^{K} log p_{s_k}(w_k^T x) ]

plus log |det W|. This is already our log-likelihood, and the maximum likelihood objective is to maximize this quantity with respect to theta; theta is not just W, it also contains the source densities, but first we concentrate on W. For W we could use plain gradient ascent, but what we actually want to do is natural gradient ascent. A small sketch of this objective is given below.
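As a minimal sketch (my own illustration, not the lecture's code), here is the log-likelihood for one concrete choice of super-Gaussian marginals; taking every marginal to be the hyperbolic-secant density p(s) = 1/(pi*cosh(s)) is an assumption made only for this example.

```python
# Minimal sketch (illustrative): ICA log-likelihood
# L(W) = E_hat[ sum_k log p_sk(w_k^T x) ] + log|det W|,
# with each marginal p(s) = 1/(pi*cosh(s)), so log p(s) = -log(pi) - log cosh(s).
import numpy as np

def ica_log_likelihood(W, X):
    """X has shape K x T (columns are data points); W is the K x K unmixing matrix."""
    Y = W @ X                                             # y_t = W x_t, shape K x T
    log_marginals = -np.log(np.pi) - np.log(np.cosh(Y))   # log p_sk(w_k^T x) for all k, t
    data_term = log_marginals.sum(axis=0).mean()          # empirical average over data points
    return data_term + np.log(np.abs(np.linalg.det(W)))
```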

So let us recap what natural gradient ascent is. Consider any objective function we want to optimize; I will phrase it as maximization, so all the updates get a plus sign. Plain gradient ascent updates the parameter by adding the gradient of the objective, scaled by some learning rate. We have also seen Newton's method, where instead of the plain gradient we multiply by the inverse Hessian, which lets the update bend a little towards the curvature of the objective. A slightly more general approach is to allow not just the inverse Hessian but any symmetric positive definite matrix M reflecting some geometry of the parameter space; it is always debatable what the right geometry of which space is, and this matrix M is usually called a preconditioner. If the Hessian is positive definite, then Newton's method is a special case of this preconditioned update.
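A minimal sketch of such a preconditioned ascent step (illustrative only; the objective, the preconditioner M and the learning rate are placeholders):

```python
import numpy as np

def preconditioned_ascent_step(theta, grad, M, lr=0.1):
    """One update theta <- theta + lr * M @ grad; M = identity gives plain gradient ascent,
    M = inverse Hessian (when positive definite) gives a Newton step."""
    return theta + lr * (M @ grad)
```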

Why is this justified? Think of the contour lines of L. At a point, the gradient tells you which directions go uphill and which go downhill; in fact it defines a whole half-space of directions in which, for a small enough step, the objective increases, and the opposite half-space where it decreases. So we are not forced to step exactly along the gradient: any direction in that half-space also brings us closer to the goal, and a positive definite preconditioner simply tilts the gradient towards one of these directions that we believe is better. Of course the straight line towards the optimum would be best, but if you are at a point where the plain gradient is not the smartest direction, the preconditioned direction can be much better, and that is exactly what natural gradient ascent tries to accomplish. Now back to our ICA: we first work out the ordinary gradient and then later the natural gradient.

So let's see. This was our log-likelihood from the last slide; now we want to derive its gradient. Just as a reminder, theta contains all the entries of W, and W_ij is one of those components. So take the partial derivative dL/dW_ij. What depends on W_ij? The rows w_k depend on it, and the log-determinant depends on it. The empirical expectation and the sum over k are linear, so we can pull the derivative inside. Let's write this down: we get the empirical expectation of a sum over k = 1, ..., K of

d/dW_ij log p_{s_k}(w_k^T x), plus d/dW_ij log |det W|. For the second term there is a formula you need to know: the derivative of log |det W| with respect to W_ij is the (j, i) entry of the inverse matrix, i.e. (W^{-1})_{ji}.

For the first term we introduce an abbreviation: phi_k(s) = d/ds log p_{s_k}(s), the derivative of the log-density with respect to its

argument. This phi_k is the outer derivative in a chain rule: the outer derivative is phi_k evaluated at w_k^T x, and the inner derivative is the derivative of w_k^T x with respect to W_ij. If W_ij does not appear in w_k, i.e. k is different from i, everything is zero; if W_ij is contained in w_k, i.e. k = i, then

we get the corresponding entry of x as a prefactor, namely delta_ki times x_j. So the derivative of the data term is Ê[ sum_k phi_k(w_k^T x) delta_ki x_j ]; and once more, phi_k(s) = d/ds log p_{s_k}(s) is the abbreviation

we use here and will approximate later. So what do we end up with? We have a sum over k and a delta, so only the term with k = i survives and the rest vanishes. The gradient is dL/dW_ij = Ê[ phi_i(w_i^T x) x_j ] + (W^{-1})_{ji}; note that W inverse is actually A, so the second term is the (j, i) component of A.

Now let us turn to these phi_k, the activation functions. They depend on the unknown densities p_{s_k}, so we do not know much about them exactly. But we did make an assumption about the source distributions, namely that they are non-Gaussian, and that gives us good approximations of these log-density derivatives even though we do not know the exact values or distributions. So here are our activation functions and their approximations. If a source s_k

is super-Gaussian, meaning it has heavier tails than a Gaussian (audio data is typically super-Gaussian, so this is the recommended default), we approximate phi_k(s) by minus tanh(s). If you think a source is sub-Gaussian, i.e. its tails are at most as heavy as a Gaussian's, you can use the activation function tanh(s) minus the identity instead. These approximate activation functions are all the input we need about the unknown source densities; a small sketch of the resulting gradient is given below.
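A minimal sketch (my own illustration, under the assumptions above) of the two activation functions and of the ordinary gradient dL/dW = Ê[ phi(Wx) x^T ] + (W^{-1})^T:

```python
# Minimal sketch (illustrative): activation functions and the ordinary gradient
# of the ICA log-likelihood, dL/dW = E_hat[ phi(Wx) x^T ] + (W^{-1})^T.
import numpy as np

def phi_super(y):            # for super-Gaussian sources (tails heavier than Gaussian)
    return -np.tanh(y)

def phi_sub(y):              # for sub-Gaussian sources (tails at most Gaussian)
    return np.tanh(y) - y

def ica_gradient(W, X, phi=phi_super):
    """X is K x T; returns the K x K gradient of the log-likelihood w.r.t. W."""
    T = X.shape[1]
    Y = W @ X                                  # y_t = W x_t for all data points
    data_term = phi(Y) @ X.T / T               # empirical average of phi(y) x^T
    return data_term + np.linalg.inv(W).T
```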

Now we want the natural gradient, and for that we need a preconditioner. First remember that this gradient is the gradient of a scalar function with respect to a matrix, so it is itself a K x K matrix. If we want to multiply it by a preconditioner, we have to flatten it into a vector with K^2 components, and the preconditioner M is then a K^2 x K^2 matrix; that is how we should think of them. So we formally flatten the gradient matrix and multiply it by

a matrix living in this much higher-dimensional space, but everything will boil down to ordinary matrix products in the end. There are a lot of heuristics for coming up with a good preconditioner; one is to use the inverse Hessian, which is data dependent, and then to use further heuristics to factorize it or remove the data dependence. These heuristics are not very precise in my opinion, but one can find good data-independent approximations, and I will just write one down and then we will see what it gives. One good data-independent choice is built from W itself:

we take the entries M_{(l,m),(i,j)} = delta_{li} * sum_nu W_{nu m} W_{nu j}, i.e. essentially W^T W arranged into this K^2 x K^2 matrix, and we then contract it against all the (i, j) components of the flattened gradient.

So what do we get for the natural gradient? Writing it out, the natural gradient keeps two free indices l and m, and we sum the preconditioner against all components of the flattened gradient: (natural grad)_{lm} = sum over i, j and nu of W_{nu m} W_{nu j} delta_{li} times the gradient component (i, j).

The delta_{li} removes the sum over i and turns i into l, so we are left with a sum over j and nu of W_{nu m} W_{nu j} times dL/dW_{lj}, and here we plug in the gradient we derived, dL/dW_{lj} = Ê[ phi_l(w_l^T x) x_j ] + (W^{-1})_{jl}.

For the first term, summing W_{nu j} x_j over j simply produces the scalar w_nu^T x, so that term becomes the sum over nu of W_{nu m} Ê[ phi_l(w_l^T x) (w_nu^T x) ].

For the second term, summing W_{nu j} (W^{-1})_{jl} over j gives (W W^{-1})_{nu l} = delta_{nu l}, so the sum over nu collapses and

this last part is simply W_{lm}. Reading the result from right to left, we get (natural grad)_{lm} = W_{lm} + sum over nu of W_{nu m} Ê[ phi_l(w_l^T x) (w_nu^T x) ]; in matrix notation, with y = W x, this is (I + Ê[ phi(y) y^T ]) W. And with that we are done deriving the natural gradient. Now we can write down

the update rule in terms of the matrices themselves. It is an online update rule: if I get a data point x_tau at time tau, I update W by adding the learning rate times the natural gradient from before, now written directly with matrices. The update reads W <- W + eta * ( W + phi(W x_tau) x_tau^T W^T W ), where phi is applied component-wise, so phi(W x_tau) is the vector of activation functions evaluated at the current source estimates; note that phi(W x_tau) x_tau^T W^T W is the same expression as phi(y) y^T W with y = W x_tau.

So we sample data points, run each one through this update rule after initialization, take the next data point, and repeat; this converges, and after convergence we reconstruct our signals, i.e. our sources, from the signals simply via the W we get out of this algorithm. After convergence we have an estimate W hat, which is our maximum likelihood estimator,

and we also have an estimate of A: A hat is simply the inverse of W hat, so we have reconstructed the mixing matrix up to sign, scale and permutation. But we are not really interested mainly in A; we are more interested in W, because we can use it directly to reconstruct the sources from the signals at each time t, either written component-wise, s_hat_{k,t} = w_hat_k^T x_t, or without writing the

components explicitly, S hat = W hat X. A minimal code sketch of this update step follows below. And with that we are done.
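Here is a minimal sketch of one online natural-gradient update (my own illustration of the rule above, assuming super-Gaussian sources so that phi(y) = -tanh(y)):

```python
# Minimal sketch (illustrative): one online natural-gradient ICA update,
# W <- W + lr * (I + phi(y) y^T) W  with y = W x and phi(y) = -tanh(y)
# (the super-Gaussian choice of activation function).
import numpy as np

def natural_gradient_step(W, x, lr=0.01):
    """W is the current K x K unmixing matrix, x a single K-dimensional data point."""
    y = W @ x                                       # current source estimate for this sample
    K = W.shape[0]
    return W + lr * (np.eye(K) + np.outer(-np.tanh(y), y)) @ W
```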

8 ICA - Summary

hi everyone. We briefly want to summarize what we have done in this ICA lecture. First we talked about the assumptions we have to make: our sources are independent; we have linear and noiseless measurements, so our signals are a linear mixing of our sources; we have completeness, which is just another word for invertibility of our mixing matrix together with it being square; and we have non-Gaussianity, i.e. the distributions of our sources are non-Gaussian, or we allow at most one of them to be Gaussian. Then we stated a theorem (we did not prove it) that under those assumptions both the mixing matrix and all the sources can be identified

up to sign, scale and permutation. Then we derived an algorithm using maximum likelihood estimation and a natural gradient approach. Our algorithm looks like this: we pick a learning rate; we take an activation function for each source, basically minus tanh if we assume they are all super-Gaussian, or tanh minus the identity if they are all sub-Gaussian; we initialize our W, the estimate of the inverse of A; and then at every iteration we pick a data point and do the following

until convergence, which is just a structured way of writing what we derived before: we multiply the data point x_t by the matrix W to get y_t = W x_t; we apply the activation functions component-wise, written in vector form as phi(y_t); we multiply this by the transpose of y_t to obtain a rank-one matrix; together with the identity this, times W, gives the update direction, which we scale by the learning rate and add to W. We repeat this until convergence, obtain our estimate W hat, and then we can reconstruct

our sources from the data points simply as s_hat_t = W_hat x_t, and then we are done. So this is our covariant online ICA algorithm; a compact end-to-end sketch is given below.
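Putting the pieces of this summary together, here is a compact end-to-end sketch (illustrative only: synthetic Laplace sources, a hand-picked mixing matrix and learning rate, and the super-Gaussian activation -tanh are all assumptions made for the example).

```python
# Minimal end-to-end sketch of the covariant online ICA algorithm (illustrative).
import numpy as np

rng = np.random.default_rng(1)
K, T, lr = 2, 20_000, 0.005
S = rng.laplace(size=(K, T))                    # independent super-Gaussian sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])          # mixing matrix (ground truth, only used to simulate data)
X = A @ S                                       # observed signals

X = X - X.mean(axis=1, keepdims=True)           # preprocessing: centering
W = np.eye(K)                                   # initialize the unmixing matrix
for t in range(T):                              # one pass over the data (online updates)
    y = W @ X[:, t]                             # current source estimate y_t = W x_t
    W = W + lr * (np.eye(K) + np.outer(-np.tanh(y), y)) @ W

S_hat = W @ X                                   # reconstructed sources, up to sign/scale/permutation
```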

3 Graphical Models

0 Graphical Models - Bayesian Networks

Hi everyone, in this video we'll talk about Bayesian networks: the basic definitions and how we can represent them. Let's start with an example. Say we want to represent variables or events and the relations between them. Suppose you are a doctor interested in the health status of your patient. We represent this as a circle with "health status" written in it, and we want to reason about what affects it. Maybe the patient has some medical preconditions; we represent those by another

circle and draw an arrow towards health status. There may be other reasons why the health status could change: for instance, if the patient takes medicine, we add another circle and again draw an arrow in that direction. Furthermore, the health status could be affected by the living conditions, and in this case the living conditions influence the health status, but the living conditions could themselves also change

if the person takes medicine. We could add more and more of these variables and end up with a huge diagram of what influences what, and this can generally be described as a Bayesian network. One ingredient of these Bayesian networks are graphs, and we will assume here that the graph is directed, meaning it has directed edges, and that there are no directed cycles. Formally, a directed acyclic graph, also called a DAG, consists of a set of nodes (the circles from before, also called vertices) and a set of directed edges between these nodes. Between

all these nodes we could in principle have all kinds of directed edges, and in a DAG we have to say which of these edges are present and which are not; this is the set of directed edges. We then make the assumption that there are no directed cycles; this is the acyclicity part of the definition: if you start from one node and follow some arrows, you can never get back to the starting point, so there is no feedback in this sense. Let us draw a little example. Say we have a directed graph like this; you might say there is a cycle here, and that is true, but such cycles are allowed. What

we do not allow is following the arrow directions around and back to the start. So cycles are allowed, but directed cycles are not; this would still be a valid DAG. Let's draw a few more. Now suppose I start with a node nu. What are its parents? The parents are the nodes that have an arrow pointing directly towards this node: for this node nu, these two nodes are its parents; for that node, this one is a parent, and so on.

In the other direction we can define children: if you look at this node, then that node is its child and there are no others; for this node, these two nodes are children. Similarly we can define ancestors: the ancestors are the parents, the parents of the parents, the parents of the parents of the parents, and so on, and we also include the node itself. Starting from this node, these are its parents, their parents, and so on; this whole set is the ancestral set of nu, including nu itself. Similarly we define descendants:

the descendants of nu contain its children, the children of the children, and so on, again including nu itself; for this node the descendants would be that node and itself, and those would be the descendants of this one. A small sketch of these notions in code is given below.
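A minimal sketch (my own illustration) of a DAG stored as parent sets, together with the parent/child/ancestor/descendant notions; the edges here are a made-up example, not the exact graph drawn on the slide.

```python
# Minimal sketch (illustrative): a DAG given by parent sets, plus ancestors/descendants.
parents = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c"]}

def children(v):
    return [u for u, pa in parents.items() if v in pa]

def ancestors(v):                      # parents, parents of parents, ..., including v itself
    result, stack = {v}, [v]
    while stack:
        for p in parents[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def descendants(v):                    # children, children of children, ..., including v itself
    result, stack = {v}, [v]
    while stack:
        for c in children(stack.pop()):
            if c not in result:
                result.add(c)
                stack.append(c)
    return result

print(ancestors("c"), descendants("b"))   # {'a', 'b', 'c'}  {'b', 'c', 'd'}
```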

The graph is only one ingredient of a Bayesian network. The other one is that we want to represent events, or real-life quantities, which we model with random variables. So we start with some random variables with a joint distribution; this is what we want to model, and we call the index set of these random variables V (we could also directly say that V is the set of these random variables). The main point now is that we want to add a graph structure expressing which random variable influences which. So in addition we need a DAG whose nodes correspond to the random variables, i.e. the set of indices of the random variables corresponds to the nodes of the DAG. If we stop there, however, there is still no formal relationship between the random variables and the graph: the joint distribution gives us relations between the variables, but nothing ties those relations to the graph.

If we only match the index set, we really still have to look at the edges, because the edges are what should express which random variable influences which; we have to make this formal. That means we have to say when the joint distribution is compatible with the graph, and we say formally that the joint distribution factorizes over G if it has a specific factorization. For this we first introduce a small notational shorthand.

If A is a subset of V and I write x_A, I mean the tuple of all x_nu with nu in A; this is just an abbreviation. The factorization property then says, by definition, that the joint distribution factorizes over the graph if it equals the product over all nodes nu of the conditional distribution of x_nu given x_{pa(nu)}, the variables at the parents of nu (a set of nodes, using the convention above). Let us visualize this a little. Say we have a graph like this

and we look at this node nu; its parents are the nodes with arrows into it. The factorization property says that the joint distribution is obtained by taking, for each node, its conditional distribution given its parents and multiplying everything together: the last node given its parents, times this node given its parents, times that node given its parents, and so on. We can compare this product with the actual joint distribution, and

requiring that the distribution factorizes over G means precisely that these two joint distributions are the same. And just to remind you again, the only place where the graph enters is in the notion of parents; the conditioning is only on the parents, and this is what gives meaning to the edges. So now we are ready to define a Bayesian network: a Bayesian network consists of a DAG and a joint distribution such that the joint distribution factorizes over the graph; that is all. A tiny numerical sketch of such a factorization is given below.
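A minimal sketch (my own illustration, with made-up binary variables and probabilities) of the factorization p(x) = prod_nu p(x_nu | x_parents(nu)) for a tiny DAG A -> B, A -> C:

```python
# Minimal sketch (illustrative): evaluating a factorised joint from conditional tables.
parents = {"A": [], "B": ["A"], "C": ["A"]}
# cpt[v][(values of parents of v, value of v)] = conditional probability
cpt = {
    "A": {((), 0): 0.6, ((), 1): 0.4},
    "B": {((0,), 0): 0.9, ((0,), 1): 0.1, ((1,), 0): 0.2, ((1,), 1): 0.8},
    "C": {((0,), 0): 0.7, ((0,), 1): 0.3, ((1,), 0): 0.5, ((1,), 1): 0.5},
}

def joint(x):
    """x maps node name -> value; returns p(x) under the factorisation over the DAG."""
    p = 1.0
    for v, pa in parents.items():
        p *= cpt[v][(tuple(x[u] for u in pa), x[v])]
    return p

print(joint({"A": 1, "B": 0, "C": 1}))   # 0.4 * 0.2 * 0.5 = 0.04
```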

To repeat: we have a joint distribution and a DAG, the variables correspond to the nodes of the graph, and the distribution factorizes, i.e. it is the product over all nodes of the distribution of each node given its parents. Often we represent a Bayesian network just by its graph, but a graph alone does not carry any statistical or probabilistic meaning unless we give it one; when we draw such a graph, we implicitly assume that there is a distribution that factorizes over it. Just to make

this clear, the factorization property is the important part here: even if you only write down the graph, the factorization is what has to be checked, and without it the picture is essentially meaningless. Now there are different ways to represent Bayesian networks graphically. Let us start with a couple of nodes at the top and several nodes on the bottom, and give them names: say one of the top nodes is Z,

the other is Y, and the bottom nodes are X1, X2, up to XN, with several in between. So we now have a lot of nodes, and they all have the same parents; maybe they even have the same children, say two more nodes that we call W and U. Let us also draw the arrows. Then it quickly becomes complicated to see what kind of relations we have; maybe it looks a bit

like this. What we can do now is simplify the representation, and this is called plate notation. Let me write Z and Y again at the top, W and U at the bottom, and just one node X in the middle, with arrows as before. Then we write the number of X's we have, and we use a plate, a box around the X node, to indicate that this part is

copied several times. So this N corresponds to the number of copies, and the plate just says that we copy this node. You can think of it as a stack of sheets of paper: we have N sheets stacked behind each other, one X per sheet, and we are looking at the stack from the side. What it means is that these top nodes affect every single sheet, and every single sheet affects the bottom nodes; this is what arrows going into or out of a plate mean.

You can also nest plates, i.e. put several plates inside each other. For instance, we can have a representation like this, maybe even with one node outside: we have a couple of nodes here,

two nodes there, one out here, and then we write, say, N on the outer plate and M on the inner one, with arrows like this. This then means that we have an outer index n, so N of these outer sheets, and each of the sheets contains M smaller sheets indexed by a second index m running from 1 to M; we can give the nodes names like X, Y,

W and so on. Furthermore, people like to augment these representations. Take a simple example: we have X here, a probabilistic model, and X is influenced by parameters. What people like to do is represent these parameters directly inside the graphical model: they either draw them as a small dot, say with the parameter alpha written next to it, or they prefer drawing a full node,

a circle with, say, beta written inside; both mean the same thing, some people prefer the dots and some the circles. We can also have another variable, say Y, which is influenced by its own parameter, say eta; we should probably settle for one of these conventions. Another thing people like to do, if they have data for X but not for Y, is to indicate in the graphical model which variables are observed.

They do this by shading the nodes: X is observed, so it is shaded; Y is not observed, so it stays unshaded. We have some parameters influencing these variables; this part may be our training set and that part our prediction set, so we use X to infer the parameters and then use them to make predictions about Y. I am not a big fan of these shadings, I find them more misleading than helpful, but I do like including the parameters in these representations, and you can of course also combine this with plate notation, for instance by putting more training data inside an N plate. So you can combine all these kinds of representations

like this.

1 Graphical Models - Bayesian Networks - Examples

Hi everyone, in this video we want to go through some examples of Bayesian networks. Let's start with a fairly simple one, the Markov chain. I think everyone knows what a Markov chain is, but you can also express it as a graphical model: we have some nodes all lined up in a chain, X1, X2, X3, X4, X5 and so on, and we draw the DAG of the Markov chain. What does this mean for the probability distribution? The probability distribution looks like this:

we said it should be the product over the nodes of each node given its parents. In this example, where do we start? We start with X1: p(x1) has no parents, so we just keep the marginal; then we have p(x2 | x1), since the parent of X2 is X1; then p(x3 | x2), then p(x4 | x3), then p(x5 | x4),

and so on. This is the distribution of a Markov chain. We can see, for instance, that if I condition on X2, then X1 and X3 are independent given X2: the state of X3 depends only on X2 and not on X1 anymore. Formally, you can see that conditioning on X2 leaves one factor that is a function of X3 and one that is a function of X1, so the conditional joint separates into a product of the two. A small sampling sketch of this factorization is given below.
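A minimal sketch (my own illustration, with made-up probabilities) of sampling from the Markov-chain factorization p(x1) p(x2|x1) p(x3|x2) ...:

```python
# Minimal sketch (illustrative): sampling a short binary Markov chain.
import numpy as np

rng = np.random.default_rng(0)
p_x1 = np.array([0.5, 0.5])                 # marginal of the first node
P = np.array([[0.9, 0.1],                   # P[i, j] = p(x_{t+1} = j | x_t = i)
              [0.3, 0.7]])

x = [rng.choice(2, p=p_x1)]
for _ in range(4):                          # x2 ... x5, each depends only on its parent
    x.append(rng.choice(2, p=P[x[-1]]))
print(x)
```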

Another example is the state-space model, also called the hidden Markov model, which we will also see later. At first this looks like a Markov chain as well, namely the latent chain Z1, Z2, Z3, Z4, Z5 and so on, but then we have further variables X1, X2, X3, X4, X5 computed from these latent states. The probability distribution factorizes as follows: first we have

the distribution of Z1, which has no parents; then the emission p(x1 | z1); then I go to the next state, where Z2 has parent Z1, so the transition p(z2 | z1) happens here, and then again the emission p(x2 | z2), and so on. We basically get one such block per time step: for the third step it is the transition p(z3 | z2) times the emission p(x3 | z3), and so on. A small sampling sketch of this factorization is given below.
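A minimal sketch (my own illustration; the binary latent states, Gaussian emissions and all numbers are assumptions made only for this example) of sampling from the state-space factorization p(z1) p(x1|z1) * prod_t p(z_t|z_{t-1}) p(x_t|z_t):

```python
# Minimal sketch (illustrative): sampling a short state-space / HMM trajectory.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])                   # p(z1)
P = np.array([[0.8, 0.2], [0.2, 0.8]])      # transition p(z_t | z_{t-1})
mu = np.array([-1.0, 1.0])                  # emission means: x_t | z_t ~ N(mu[z_t], 1)

z = [rng.choice(2, p=pi)]
x = [rng.normal(mu[z[0]], 1.0)]
for t in range(1, 5):
    z.append(rng.choice(2, p=P[z[-1]]))     # transition block
    x.append(rng.normal(mu[z[-1]], 1.0))    # emission block
print(z, x)
```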

So much for the state-space model, which underlies the hidden Markov model. The next model we can look at is the naive Bayes model, used for naive Bayes classification. Say we have one node representing the class, and then we have some feature nodes:

the class here, and the features X1, X2, up to XM. We can also write this in plate notation: we have the class node, one feature node inside a plate labelled M for the M features, and sometimes we even have several observations, say N of them,

and then we draw an outer plate labelled N around everything; this is the plate notation for the naive Bayes model. Let us look at the probability distribution, first just for the inner part. It says that the joint distribution factorizes as follows: first the probability of the class (the class has no parents, so there is no conditioning), and then each feature given the class, p(x1 | c) up to p(xM | c).

That is the distribution for one observation. If we are interested in the whole distribution over all N observations, there is no arrow going into the outer plate, which means we have independent copies, so we can write it as the product over n from 1 to N of the same expression, with an index n everywhere. This is the naive Bayes model represented as a Bayesian network; a small sketch of the classification rule it induces is given below.
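A minimal sketch (my own illustration, with made-up binary features and probabilities) of the classification rule implied by this factorization, p(c | x) proportional to p(c) * prod_m p(x_m | c):

```python
# Minimal sketch (illustrative): naive Bayes posterior over two classes.
import numpy as np

p_c = np.array([0.4, 0.6])                      # prior over 2 classes
# p_x_given_c[c, m] = p(x_m = 1 | class c), for M = 3 binary features
p_x_given_c = np.array([[0.8, 0.1, 0.5],
                        [0.3, 0.7, 0.5]])

def posterior(x):
    """x is a length-M vector of 0/1 features; returns p(c | x)."""
    lik = np.prod(np.where(x == 1, p_x_given_c, 1.0 - p_x_given_c), axis=1)
    unnorm = p_c * lik
    return unnorm / unnorm.sum()

print(posterior(np.array([1, 0, 1])))
```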

The next example is Bayesian regression; let us do Bayesian linear regression, although the picture works for regression in general. What do we usually have? We have data coming from a variable X, together with labels Y that we want to predict, so we want to predict Y from X. Then we have parameters

W which influence Y, so this is a prediction model, and then we may even have hyperparameters which also affect the prediction: a hyperparameter alpha for the prior and a noise level sigma squared. We do not have just one data point, we have N data points, so we put the per-observation nodes in a plate labelled N. As you can see, we draw X with a square box; the reason is that we are usually not trying to model the distribution of X,

we are only trying to model the predictive distribution of Y given X, so the X are usually treated like parameters in these regression models. Moreover, I drew a round node (a circle) for W, the parameters: the reason is that we are in the Bayesian setting, where parameters are random variables, whereas in a frequentist model we would draw a box. Then we have the hyperparameters: alpha parametrizes a family of priors, so it influences our prior on W, and sigma squared is the noise level we assume, which also influences Y. Let us see if it all makes sense by writing down the distribution.

The joint distribution is: first the parameters given the hyperparameters, p(w | alpha), times the product over n from 1 to N of the probability of y_n given x_n, the parameters and the noise level, p(y_n | x_n, w, sigma^2). This is how it factorizes, and this is exactly how we know it from the previous course;

it represents the joint distribution of y_1, ..., y_N and w, conditioned on all the x_n, alpha and sigma squared. Now let us assume we want to do the same, but with a prediction step added. How do we set this up?

First let us draw the model again: the input X, the output Y, the parameters W with the hyperparameter alpha, and the noise level sigma squared, with a plate around the per-observation nodes. This is the model we have just seen, our known Bayesian regression model, and now we have to

introduce new data: say we have test data, so we add Y-star and X-star with the same relations, maybe also with a plate around them labelled N-prime, or we just have a single test point, or a whole bunch of them. Some people now like to shade everything that is observed,

and here the observed nodes would be the training X and Y and the test input X-star, so those would be shaded, while Y-star is not observed. We then usually use the training data to infer the latent variables, namely the parameters, and from these we make the prediction of Y-star given its input. As said before, this representation says something about how the probability density factorizes, and we get the same factorization as before, just with, in addition, a product over the distribution of the test points: here the training data

and here the test data. A minimal sketch of the factorized log-joint for Bayesian linear regression follows below.
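A minimal sketch (my own illustration, assuming a Gaussian prior w ~ N(0, alpha^{-1} I) and Gaussian observation noise; the specific prior and the values of alpha and sigma^2 are assumptions for the example) of the factorized log-joint log p(w | alpha) + sum_n log p(y_n | x_n, w, sigma^2):

```python
# Minimal sketch (illustrative): log-joint of Bayesian linear regression.
import numpy as np
from scipy.stats import norm

def log_joint(w, X, y, alpha=1.0, sigma2=0.1):
    """X has shape N x D, y shape N, w shape D; alpha is the prior precision."""
    log_prior = norm.logpdf(w, loc=0.0, scale=alpha ** -0.5).sum()   # p(w | alpha)
    log_lik = norm.logpdf(y, loc=X @ w, scale=np.sqrt(sigma2)).sum() # prod_n p(y_n | x_n, w, sigma^2)
    return log_prior + log_lik
```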

2 Graphical Models - Bayesian Networks - the 3 building blocks

Hi everyone, in this video we'll talk about Bayesian networks and their three building blocks. We are interested in how information can flow between the nodes in a Bayesian network, and these three building blocks give us intuition about that. So what are the three building blocks? If you look at any path in a Bayesian network, then locally you will encounter either a chain, a fork, or a collider.

We are always interested in how information can flow from X to Y with Z in the middle: X -> Z -> Y is called a chain, X <- Z -> Y is called a fork, and X -> Z <- Y is called a collider. Sometimes the first two together are called non-colliders, if you want to

distinguish them from the collider. Let's start with the fork, which looks like this: X <- Z -> Y. How does the joint probability distribution of X, Y, Z look? It factorizes according to the usual Bayesian network formula, each node

given its parents: p(x, y, z) = p(x | z) p(y | z) p(z), where the last factor is the node Z, which has no parents. Generally, can we assume that X and Y are independent? No. Think of a variable Z that you simply copy down to X and copy down to Y: then X and Y are identical copies of each other, so there is a direct deterministic relation between them. So in general X and Y are dependent; we cannot expect them to be independent. Now let us look at conditional independence:

I claim that it is always true that X is independent of Y given Z. Why? You can see directly that the joint contains the product p(x | z) p(y | z); dividing the factorization by p(z) gives p(x, y | z) = p(x | z) p(y | z), which is the defining equation for conditional independence.

So in summary, in general X and Y are dependent, but if I condition on Z, the information between X and Y is separated: conditioning on Z blocks the information flow between X and Y. A quick simulation of this behaviour is sketched below.
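A minimal sketch (my own illustration, with made-up conditional probabilities) checking the fork behaviour numerically: X and Y are clearly dependent marginally, but approximately independent once we condition on Z.

```python
# Minimal sketch (illustrative): fork z -> x, z -> y with binary variables.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, n)
x = np.where(rng.random(n) < np.where(z == 1, 0.9, 0.1), 1, 0)   # p(x=1|z) depends on z
y = np.where(rng.random(n) < np.where(z == 1, 0.8, 0.2), 1, 0)   # p(y=1|z) depends on z

print(np.corrcoef(x, y)[0, 1])                    # clearly non-zero: x and y are dependent
print(np.corrcoef(x[z == 1], y[z == 1])[0, 1])    # approximately zero: independent given z = 1
```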

Let us come to our next case, the chain, which looks like this: X -> Z -> Y. Again we are interested in how information can flow from X to Y, with and without conditioning on Z. First let us see how the probability distribution factorizes. In this case the Bayesian network formula gives p(x, y, z) = p(y | z) p(z | x) p(x). First question: can we expect X and Y to be independent? Consider the corner case where Z is just an identical copy of X and Y is a copy of Z: then X and Y are deterministically related and cannot be independent. So again, in general X and Y

are dependent. Next I claim that X is independent of Y given Z. Why? Look at the factorization: p(x, y, z) = p(y | z) p(z, x), and there is no restriction on the joint of Z and X

beyond p(z, x) = p(x | z) p(z). So p(x, y, z) = p(y | z) p(x | z) p(z), and dividing by p(z) gives the conditional distribution p(x, y | z) = p(x | z) p(y | z), which is the defining equation for conditional independence. So again: in general there is dependence between X and Y in such a chain, but if you condition on the middle node, no information flows from X to Y.

This is essentially the same behaviour as for the fork. Let us come to the next one, the collider, which looks like this: X -> Z <- Y. First we write down the probability distribution as before: p(x, y, z) = p(z | x, y), since the parents of Z are X and Y, times p(x), since X has no parents, so it is just the marginal,

and likewise p(y), since Y also has no parents. As you can see, if I now integrate over z, the factors p(x) and p(y) do not depend on z and the first factor integrates to one, so p(x, y) = p(x) p(y). This is the defining equation for independence, which means that in the collider case X is always independent of Y.

Now the question is: are X and Y also independent given Z? The answer is no; in general X is not independent of Y given Z. Why is that? We already had an example with coin flips: consider X and Y to be two independent coin flips with values 0 and 1, and let Z

be their sum. Then this is an instance of the collider Bayesian network: X is independent of Y, and Z is a function of X and Y, so it follows the factorization above. Now suppose Z equals zero: that means X is zero and Y is zero, which is the trivial case. But suppose Z equals one: if I additionally know that X equals zero, what can I say about Y? I know that Y

equals one, and the other way around. So if I condition on Z, knowing X gives me all the information about Y. This small instance already tells us that X and Y cannot be conditionally independent. The general case is that X and Y are dependent given Z; of course there can always be special distributions where they happen to be independent, but that is the exception. A simulation of this coin-flip example is sketched below.
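A minimal sketch (my own illustration) of the coin-flip collider: marginally X and Y are independent, but conditioned on Z = X + Y = 1 they are deterministically (negatively) related.

```python
# Minimal sketch (illustrative): collider x -> z <- y with z = x + y.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
z = x + y                                   # collider: z is a function of both parents

print(np.corrcoef(x, y)[0, 1])              # approximately zero: marginally independent
mask = z == 1
print(np.corrcoef(x[mask], y[mask])[0, 1])  # approximately -1: given z = 1, y = 1 - x
```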

Now let us see what a collider looks like with one more variable: take X -> Z <- Y and additionally Z -> W, so W is a descendant of Z. The joint distribution of X, Y, Z, W is then p(w | z), since Z is the parent of W, times p(z | x, y), times p(x), times p(y). The question is now: is X in general independent of Y given W? And again the answer will be no in general; conditioning on a descendant of a collider also opens the path, as we will see when we state the blocking rules in the next clip.

3 Graphical Models - Bayesian Network - Global Markov Property

Hi everyone, in this video we'll talk about the global Markov property of Bayesian networks. This is a property that tells us when the information flow between variables in a Bayesian network is blocked, and for this we derive a very simple graphical criterion. First we have to say when a path in a Bayesian network, or rather in a DAG, is blocked; for this we only need the graphical structure. So we start with any DAG — at this moment we are not talking about probability distributions — and we fix a subset C of the nodes. Furthermore we take two nodes and look at a path between them

in this DAG. Note that in this definition a path can go against or with the direction of the edges: a path is just a sequence of nodes connected by edges, and it does not matter which way each edge points; all of these are called paths. We then say that this path is blocked by our set C — everything is relative to C —

if at least one of the following cases holds. The first case concerns the end nodes: if the starting point or the end point of the path lies in C, the path is already blocked. The remaining cases are inspired by the three building blocks and distinguish non-colliders (forks and chains in either direction) from colliders.

The non-collider case: the path contains at least one fork or chain whose middle node is in C; then the path is blocked there. The collider case: the path contains a collider, i.e. a node with both arrowheads pointing towards it, and we require that this middle node is not an ancestor of any node in C. This is a bit counterintuitive at first, but we have seen in the collider case that conditioning on the middle node lets information flow, and we have also seen, in the

extended collider, that conditioning on a node below, a descendant of the collider, also lets information flow. That is why the condition involves the ancestors of C: we need the collider node not to be an ancestor of C for it to block the path. So it is the opposite of the non-collider case. If any of these cases holds for some node of the path, the path is blocked; if none of them holds, we call the path C-open.

Please note that everything here depends on the choice of C; C is always the context. Consider also the case where C is the empty set: then the rules simplify, and a path is blocked by the empty set if and only if it contains at least one collider. The reason is that with C empty the end-node case cannot hold and the non-collider case cannot hold, so the only way to block the path is a collider, which with C empty is automatically blocking. Let's look at some examples; let's draw a DAG.

Okay, let us give these nodes names: X, Y, Z, V, W. Now we want to know which paths are blocked by which sets. For instance, we can look at the path X - Z - Y. It is blocked by the empty set, i.e. if we do not condition on anything, because Z is a collider on it; but it would be Z-open: if we were

conditioning on Z, then information could flow, and we would call the path Z-open. Moreover, even though W is not on the path, the path would also be W-open. Why is that? Because Z is an ancestor of W, and the collider case checks whether the collider lies in the ancestor set of the conditioning set or not. So this path is also W-open. We can go even further and look at the longer path from X to V: as you can see, this one would

also be blocked when conditioning on nothing, because there is at least one collider on it; and it would likewise be Z-open, W-open, and also open when conditioning on both Z and W. But if we now condition on, say, Z, W and Y together, it becomes blocked again. The reason is that even though the collider part is now open, there is a non-collider on this path, namely Y, that we are conditioning on, and by the non-collider rule this node blocks the path, so the whole path is blocked, and so on.

Conditioning on Y alone would also block this path. Now we want to look at how information can flow between arbitrary subsets of the graph, and for this we introduce the notion of d-separation. Again we start with a DAG with some set of nodes and edges, and we consider any three subsets of nodes A, B, C. We say that A and B are d-separated by C if every path from any node v in A to any node w in B is blocked by C.

This means that all possible paths along which information could flow are blocked. In symbols we write that A is d-separated from B given C; alternatively, people write the d-separation symbol with a subscript indicating which graph they are looking at, since sometimes one compares different graph structures. Note that we define d-separation without any conditioning set to mean conditioning on the empty set; this is our convention. The question is now: what kind of rules does this d-separation satisfy?

Similar to conditional independence, it satisfies a set of rules which you can prove directly from the definition. For instance, for any sets A and B, if we condition on A itself, then A and B are always d-separated, because every path starting in A starts at a node of the conditioning set and is therefore blocked. Symmetry is clear, because the path relation is symmetric. Then there is decomposition: if A is d-separated from the union of B and F given C, then every path from A to B is blocked by C, so A is d-separated from B given C; we can simply forget the part that ends in F. There is also weak union: from

the same statement one can show that every path between A and F is blocked either by B or by C, i.e. A is d-separated from F given B together with C. And finally contraction: if we have these two properties, we get the combined statement back. In total we get the same system of rules as for conditional independence, just in a different notation and with a different meaning; d-separation is a graphical version of conditional independence. And this is exactly what the next theorem is about: the global Markov property for Bayesian networks. As a reminder,

a Bayesian network consists of a DAG together with a joint distribution over some random variables indexed by the nodes of the graph, such that the distribution factorizes according to the DAG. This factorization property is all we know, and the theorem says: for any three subsets of the nodes, whenever they satisfy d-separation in the graph G, then, without having to check anything by hand, the corresponding random variables are conditionally independent.

So we just look at the graph and check whether the d-separation holds; if yes, then we know the random variables are conditionally independent, meaning there cannot be any information flow between X_A and X_B given X_C: there is no information in X_A about X_B beyond the information already contained in X_C. In other words, d-separation implies the corresponding conditional independence. Furthermore, the reverse implication is not true:

conditional independence does not imply d-separation. I leave the proof as an exercise for everyone who is interested in a little challenge. You can do it by induction over the nodes: remove a childless node — that is the main hint. Then you can show that the remaining variables again form a Bayesian network, i.e. the factorization property still holds. You then check that a certain conditional independence holds, namely that the removed node W is independent of all other variables besides W given its parents.

You can also check that the three sets can be assumed to be pairwise disjoint, and then you have four cases: W lies in none of them, W lies in A, W lies in B, or W lies in C. You then make slightly different case-dependent arguments; by symmetry the case W in B is the same as W in A, and if W is in the conditioning set you can reduce it to the first two cases. So you are essentially left with one case, and for that you need the induction hypothesis together with the d-separation axioms.

It has now come to a few examples and similar before. Let's promise some basic network deck at least and be like this. And we give them now names and that gives them numbers 123456. And now let us ask for instance if

so we have that's for the north before is this independent D separated of B five. And we have to check every path. So let's start with this four, then we go to two, then we go to five. So this is open, this is a calculator, so this is not true. So what is the condition on extra. Alright, you too, me too. Then we know that this path here it's blocked but this is not the only path. You have to check all path we have this path as well. And as you can see he has a collider utu

A collider means this path to node 5 is blocked when we condition on node 2, because the collider is not an ancestor of the conditioning set, so here this is true. So we might think about whether there are different routes, but if I go here and here I am basically going the same route as before; if I go back and forth, and here as well, I always end up going the same routes. So all the paths are blocked, so this is really true. But then the global Markov property tells us that X4 is independent of X5 given X2. And I haven't written down any probability distribution, but I

know just by looking at the graph that this will be true. And I have many choices to make this a Bayesian network, because the only thing I need to do is specify the conditional distributions given the parents; those could be any distributions, normal, gamma, Poisson, discrete or non-discrete, whatever you can come up with. If it factorises according to this DAG, then you know this holds, so this holds for any Bayesian network of this form. So maybe let's look at another example: node 3, is it d-separated from node 6? As you can see there is an open path, so this is not the case. So what if

we condition on node 5? Then this path going from here to there is blocked. Why? Because this node is a non-collider, and non-colliders are blocked by conditioning on them. But now we have a different route: we can go here, here, here and here, and you would say, okay, there is a collider, so it is blocked, but no, we are now conditioning on node 5, so this collider suddenly becomes open. So this path is now open, so this is still not true. So what if we now condition in addition on one of the non-colliders, that could help. If we condition on node 2 we could block it, or we condition on node 4.

So if we now block with node 4, then let us check again. This route is blocked, its non-collider is conditioned on; the route via 3 and 5 is then blocked here, and this one is blocked by the collider. So this is true, and it implies the corresponding conditional independence of the variables X3 and X6 given X5 and X4. And again, I didn't need to specify any distribution; the only thing I need to know is that this here is a Bayesian network. You can go on with all kinds of examples, you

can condition on more. We can even ask about sets now, maybe not just a single node but sets of nodes. For instance, let us take nodes 1 and 2 and check if this set is d-separated from, let's say, node 6. As you can see, we have a path 1-4-6, this is open; also 2-4-6 is open; also 2-5-6 is open, so this is not true. Maybe let us condition on some variables, for instance on node 4. Now 1-4-6 is closed because node 4 is a non-collider, but suddenly this other path is open; but it doesn't matter, because

node 2 is anyway in this set, so the influence of 1 towards 6 is blocked here. Now if we look at 2 and go via 4 to 6, it is also blocked here, because node 4 is also a non-collider, but we still have this path here, so this is also not true. So let us condition on node 5 as well, and then, as you can see, this path is blocked by node 4, which is a non-collider; this path is blocked by node 4 as well; now we go here and here, and this is also a non-collider, blocked by conditioning on node 5; also here. What we get is that X1, X2 are independent of X6 given X4 and X5.

Furthermore, we could argue even further; we could add more variables to this set, for instance node 3. If we now look at node 3, then we have this path to node 6, and it is blocked by node 5. So node 3 is also d-separated from 6, and then we get an even stronger independence: the independence of X6 from X3, X1, X2 given X4, X5. Oh, I checked the other thing.

Okay, I checked it the other way around: I checked whether this set is d-separated from node 6, so node 3 goes on this side and not on the other side, but then we get this independence, and so on.
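To make the checking procedure concrete, here is a minimal Python sketch of a d-separation test using the standard moralized-ancestral-graph criterion: restrict the DAG to the ancestors of A, B and C, moralize it, and test whether A and B are still connected once C is removed. The DAG at the bottom is only a hypothetical stand-in that is consistent with the checks above, not necessarily the exact graph from the slides.

from collections import deque

def ancestors(dag, nodes):
    """All ancestors of the given nodes (including the nodes themselves)."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for u, children in dag.items():
            if v in children and u not in result:
                result.add(u)
                stack.append(u)
    return result

def d_separated(dag, A, B, C):
    """Check d-separation of A and B given C via the moralized ancestral graph."""
    A, B, C = set(A), set(B), set(C)
    keep = ancestors(dag, A | B | C)
    # Build the moral graph on the ancestral node set: keep edges, marry parents.
    und = {v: set() for v in keep}
    for v in keep:
        parents = [u for u in keep if v in dag.get(u, ())]
        for u in parents:
            und[u].add(v); und[v].add(u)
        for i, p in enumerate(parents):        # connect co-parents
            for q in parents[i + 1:]:
                und[p].add(q); und[q].add(p)
    # BFS from A that never enters C; if we reach B, they are not d-separated.
    seen, queue = set(A), deque(A)
    while queue:
        v = queue.popleft()
        if v in B:
            return False
        for w in und[v] - seen - C:
            seen.add(w)
            queue.append(w)
    return True

# Hypothetical DAG given as {node: set of children}, just to illustrate the call.
dag = {1: {4}, 2: {4, 5}, 3: {5}, 4: {6}, 5: {6}, 6: set()}
print(d_separated(dag, {4}, {5}, {2}))    # True under this made-up graph
print(d_separated(dag, {3}, {6}, set()))  # False: the path 3 -> 5 -> 6 is open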

4 Graphical Models - Bayesian Networks - Construction of BNs

Hi everyone, in this video I will talk about how to construct Bayesian networks for a given probability distribution. Let's first start with the question: what are Bayesian networks good for? Say we have a joint distribution over some random variables, they are all discrete, and every random variable can take at most K values. Then if I wanted to represent this probability distribution naively in a table, with one entry for every single value combination, it means I have K values for X1, K values for X2, K values for XM and so on. So if I naively put them into a table, then I would get K to the power of M

entries in this table. This quickly blows up: as you can see, one variable more and I multiply everything by K. If I have 100 classes, just one random variable more and I have to multiply by 100. On the contrary, let's assume I now have a Bayesian network and we make the sparsity assumption that we only have L parents per node, and let's say L is much smaller; say we have 1000 variables and every variable has only 10 parents. Then, because of the factorisation property, I only need to represent each of these factors, and each factor is a function of X_ν and of the parents of X_ν. For X_ν I need K values, and each of

the parents also needs K values, so with L parents plus the node itself I have to store K to the power of L plus 1 entries per factor. And how often do I have to do this? I have M nodes, so I have to store M of these tables and end up with M times K to the power of L plus 1 entries. If L is, say, 10 and M is 1000, then this is much, much smaller. So Bayesian networks first, as we have seen, allow us to model independence relations, but they also allow for a sparse representation if we have only a small number of parents. But if the number of parents L is high,

let's say L is close to M, then we don't lose anything either, because then we are again in the range of K to the power of M. So in any case we are not worse off representing joint distributions as Bayesian networks. Now the question is: if I start with a distribution and I want to find a representation as a Bayesian network, how do I approach this? Let's start with a distribution; note that no graph is given yet, we want to construct a graph such that this becomes a Bayesian network and we can exploit this sparse representation. What we use first is just the product rule. The product rule allows us to iteratively

take factors conditioned on the variables before, so I have here a product over m from 1 to M of p(x_m | x_1, ..., x_{m-1}). This is the general product rule, which always holds. That means if I use these as the factors in the Bayesian network, where the predecessors are the parents, then I have done my job. So if I say that the nodes 1 to m-1 are the parents of node m, then I can draw a Bayesian network and I'm done. The problem becomes clear when I draw it.

Let's say we have one variable, then a second one. The rule says I first represent X1 by a marginal distribution, then I represent X2 given X1, and if I have a third variable, drawn here, then I represent X3 with edges coming from node 1 and node 2, and if I have another variable, let's say X4, then I draw this edge, this edge and this edge. And

as you can see, if I go on with this I will have a fully connected DAG, and this fully connected DAG turns the probability distribution into a Bayesian network. The problem is that this is not sparse. Just look at the last factor, p(x_M | x_1, ..., x_{M-1}); this alone already needs K to the power of M entries to be represented. That means this

form of Bayesian network doesn't help us. One short remark: of course you can reorder the X_m's and give them different indices; then the graph would also change, if this were X2 and this were X1 the arrows would go in different directions, but it would still be fully connected. So reordering the nodes doesn't help. Okay, so this was a bad way to do it. What is a better way? One better way is to exploit conditional independence relations between the variables when constructing the parent sets, and this goes as follows: we inductively take minimal subsets. In the m-th step

we look at all the indices before and take a minimal subset P_m of this set, such that our variable X_m, given X_{P_m}, is conditionally independent of all its predecessors. I want a conditioning set such that X_m becomes independent of everything before. The first question is: does such a set exist? Yes, it does: if you take P_m to be the full set of predecessors, then this conditional independence trivially holds. So such a set always exists, and if one exists, we can take a minimal one by checking for conditional independencies in this distribution, and then we exploit this

to write this factor, which on the previous slide appeared as p(x_m | x_1, ..., x_{m-1}), as p(x_m | x_{P_m}). Why is that? The reason is exactly this conditional independence, meaning that the distribution of X_m given all predecessors only depends on this conditioning set, which is a subset of those random variables. So this gives us the possibility to write the probability distribution in a more compact way. If I have

p(x_V), we have seen this is the product over m = 1 to M of p(x_m | x_1, ..., x_{m-1}), and we have now seen that, using this conditional independence, each factor is p(x_m | x_{P_m}). So from all these variables we only keep the ones in P_m, and since this is a minimal subset, we have a minimal representation. Now, if I use this set P_m to define the parents of node m, then this becomes a DAG. Why is that? Going inductively through the set means all edges always point from predecessors

to new ones, so we will never get cycles. Then, by definition, the parent set of node m is this P_m that we just constructed via these conditional independencies, and as you can see from this factorisation, this is then a Bayesian network, since per definition it factorises over the graph; we constructed the graph from this factorisation. So we get a Bayesian network with a minimal number of parents for each node. And the same remark as before: we started with a total ordering of the random variables, and the construction depends on it. We could of course take a totally different ordering, and this would result in a different product and maybe in a better or worse representation.

But given this total ordering, this is in a sense the optimal representation. So this is how we construct a DAG, and with it a Bayesian network, from a joint distribution by exploiting conditional independence.
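As a small illustration of the storage argument from the beginning of this clip, here is a minimal Python sketch with made-up values for K, M and L; it compares the naive table size with the factored size and writes down the fully connected chain-rule parent sets:

# Minimal sketch of the storage comparison; K, M, L are made-up example values.
K = 10      # number of states per discrete variable
M = 1000    # number of variables / nodes
L = 10      # assumed maximum number of parents per node

naive_entries = K ** M                 # one entry per joint configuration
sparse_entries = M * K ** (L + 1)      # one table of size K^(L+1) per node
print(naive_entries > sparse_entries)  # True: the factored form is far smaller

# The plain chain rule corresponds to the fully connected DAG:
# parents of node m are all predecessors 1, ..., m-1.
chain_rule_parents = {m: set(range(1, m)) for m in range(1, M + 1)}
# The last factor alone would need K^M entries, which is why we instead look
# for minimal parent sets P_m with X_m independent of its other predecessors
# given X_{P_m}.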

5 Graphical Models - Bayesian Networks - Summary

Hi everyone. Let me shortly summarize what we have done so far with Bayesian networks. First we defined Bayesian networks: a Bayesian network is a directed acyclic graph together with a joint distribution over random variables indexed by the set of nodes, such that this distribution factorises over the graph, given by this formula, where the dependence on the graph is through the parent sets. So every node only depends on its parents. Then we introduced the plate notation, also mixed with parameters and with indicating observations. And then we went on to define what d-separation means.

D-separation meant: if you have three sets of nodes A, B and C, then A and B are d-separated given C if every path from A to B in this graph either contains a non-collider which lies in C, or it contains a collider that is not an ancestor of C. Either a non-collider in C, or a collider not in the ancestors of C. Then we have seen the global Markov property of Bayesian networks, which basically says: if I find, just by looking at the graph, that A and B are d-separated given C, then the corresponding

random variables are conditionally independent; so d-separation implies conditional independence. Then we talked about the construction of Bayesian networks for a given probability distribution, where the graph is not given yet. The way to do it is to inductively take a minimal subset of the predecessors such that this conditional independence holds (I use here a shortened Python-like slice notation), and these are then declared the parents of node ν. This is done inductively, and we have shown that we get a Bayesian network, so the construction works. And we have seen that, in summary, Bayesian networks model independence relations, or conditional independence relations, by this global Markov property,

and they allow for sparse representations by exploiting this product representation.

4 Graphical Models

0 Graphical Models - Markov Random Fields - Definition

Hi everyone, in this video we'll talk about Markov random fields. Markov random fields are kind of an undirected version of Bayesian networks. It is actually not easy to come up with an example that is not inspired by physics, where they actually started, but let me try. Think about this mannequin: you can shake it and put it back on the table, and now our random variables are the positions of all its body parts, like the head, the body, the right arm, the right hand, the left arm, the left hand, the right leg, the left leg, the right foot and the left foot. So now you can see, here is the hand, here is the arm.

If the arm is up here, then you cannot expect the hand to be down here, for instance. So we have a strong correlation between where the hand is and where the arm is. Also, if this hand is, let's say, up here, then the left hand cannot be just down here, so there is also a correlation between this hand and this hand. But now let us fix the body: if the body is fixed to some position, then I can move this hand independently of the others, conditioned on the position of the body. This means we have local correlations between all these body parts which basically propagate to the

other body parts. So we would represent this as an undirected graphical model where we connect all these variables, and then we can find a basic graphical criterion for when these parts are independent of the other parts conditioned on, for instance, this middle one; we will find such a separation criterion later on. A physical reason why the hand and the arm cannot be moved independently is that they are connected and there are forces between them which cannot be ignored. You could also say there is some potential energy between the hand and the arm, and this gives us a motivation for how to define Markov random fields.

For this we need to talk about undirected graphs. What is an undirected graph? It's actually simpler than a directed graph, because we just start with a set of nodes, or vertices, together with a set of edges, and these edges are undirected. For instance, we could just draw this here, this could be an undirected graph. Then we need to know what the neighbors of a node are. For directed graphs we had the notion of parents or children; here, since

there are no arrowheads, every node in the neighborhood is treated the same. So what are the neighbors? The neighbors of a node v, abbreviated with this partial sign, are defined to be all nodes w such that the edge between w and v lies in E, meaning that there is an edge between them. Let's make an example: if this is my v, then the neighbors are just these two nodes here, because there is an edge between them and that's all. Maybe we should say that in this lecture we are not allowing

edges going from a node to itself; we exclude that for the moment. The next important definition for an undirected graph is the notion of a clique. What is a clique? We again start with an undirected graph. Then a clique is a subgraph of G in which every pair of nodes has an edge. For instance, let's look at the graph we have seen before, this one here. A clique would be, for instance, this subgraph here. Why? Because this node and this node have an edge. But if you take these two nodes, they don't form a clique,

because there is no edge between them. Furthermore, if you take these three nodes here, that is also not a clique: even though we have these two edges, this edge is missing. So this is a clique, this is a clique, this is a clique, this is a clique, this is a clique; furthermore we have this three-node clique and this other clique. These are all the cliques in this graph. Then we have to say what a maximal clique is. A maximal clique is a clique that cannot be made bigger by adding nodes. For instance, we said that this is a clique, but it is not maximal. Why not? You can just

add this node to these two and it is still a clique. That is the reason why this is not a maximal clique; also this is not a maximal clique, and this is not a maximal clique, but the three together build a maximal clique. Why is that? The reason is that we cannot include this node here because this edge is missing. So since these four don't form a clique, the three-node one is maximal, and as you can see, a maximal clique is not unique in the graph, as this is also a maximal clique. Now we have all the ingredients from undirected graph theory to go back to probability theory. Just as with Bayesian networks, which are directed acyclic graphs with a factorisation property, we now also have a factorisation property for undirected

graphs. We start in the same setting, namely that our set of nodes corresponds to some random variables with a joint distribution. As before we use the index V, a capital V, to mean the whole tuple of all the variables, and x_V means the tuple of all values corresponding to these variables. So we come with a joint distribution as before, and now we also come with an undirected graph. Again the set of nodes of the undirected graph corresponds to the index set of our variables; this is the first correspondence, and at first, as before, there is no relation between the edges and

the variables. So how do the edges come into play? We now say that this joint distribution factorises according to the undirected graph if there is a set of cliques in G, this was the notion from graph theory, and real-valued functions which are non-negative, one for each of the cliques in this set. Okay, let me start over. We say that this joint distribution factorises over G if we have a set of cliques in this graph and for

each of the cliques we have a non-negative real-valued function, indexed by this clique and only dependent on the variables that correspond to the nodes of this clique, such that the joint distribution can be written as a product over these functions. I wrote the proportionality sign, so what is missing is the normalizing constant. Now some remarks. First, we can always assume that the set of cliques in this definition consists of the maximal cliques of

our graph. Why is that? The reason is that every clique lies in a maximal one, and every function of two variables can also be considered as a function of three variables where one is just ignored; then you can gather these factors, in a non-unique way, into functions of the maximal cliques. Secondly, in comparison to Bayesian networks: the Bayesian network factorisation uses conditional distributions, and here we don't have those anymore. These factors don't have a direct interpretation as conditional distributions. Nevertheless, if you wanted to compute conditional distributions, you could just marginalize out some of these variables and then use the usual formula, joint

divided by a marginal, and then you get a conditional distribution. These functions are sometimes called factors or clique factors, or sometimes potentials; some people actually refer to plus or minus the log of such a function as the potential. So now that we have this factorisation property for undirected graphs, we can directly say what a Markov random field is. A Markov random field consists of two things: it consists of an undirected graph

where the set of nodes corresponds to random variables coming with a joint distribution, such that this joint distribution factorises over the graph. So again, a Markov random field comes with an undirected graph and a joint distribution that factorises according to this graph. And as before, we might represent a Markov random field graphically, for instance like this, but the main property is that a probability distribution is attached to it and it has a factorisation property according to the cliques of this undirected graph. A Markov random field we

call positive, or we sometimes give it its own name, Gibbs random field, if the distribution is strictly positive for all values. So we are not allowing any vanishing probability or density values.
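To make the factorisation concrete, here is a minimal Python sketch of a tiny discrete Markov random field on a hypothetical three-node chain 1 - 2 - 3, with made-up clique potentials; the normalizing constant is computed by brute force over all configurations:

import itertools
import numpy as np

# Hypothetical chain 1 - 2 - 3 with binary variables; the maximal cliques are
# the edges {1,2} and {2,3}, each with a made-up non-negative potential table.
psi_12 = np.array([[4.0, 1.0], [1.0, 4.0]])   # psi_{12}(x1, x2)
psi_23 = np.array([[2.0, 1.0], [1.0, 2.0]])   # psi_{23}(x2, x3)

def unnormalized(x1, x2, x3):
    """Product of clique factors; p(x) is proportional to this."""
    return psi_12[x1, x2] * psi_23[x2, x3]

# Normalizing constant (partition function) by summing over all states.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

# Sanity check: the probabilities sum to one.
print(sum(p(*x) for x in itertools.product([0, 1], repeat=3)))  # ~1.0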

1 Graphical Models - Markov Random Fields - Global Markov Property for MRFs

Hi everyone, in this video I want to talk about the global Markov property for Markov random fields. For this we need to define what separation means in undirected graphs. Let's start with an undirected graph and a subset C of the nodes; everything will depend on this subset. Then let's say we have a path in this undirected graph, which just means a sequence of nodes which are each connected by an edge in the graph. We say that this path is blocked by C, or C-blocked, if at least one of the nodes lies in the set C; that could be the beginning or the end, it could be any node in the middle, it could also

be several of them. If at least one of these nodes lies in C, then we say the path is blocked; otherwise we call it C-open. If you want to make the comparison to Bayesian networks, then this is just the non-collider case, since there are no arrowheads here. Furthermore, if you now have three subsets, we want to say what separation means. We say that A and B are separated by C if every path from A to B, from any of their nodes, is blocked by C. This just means,

in correspondence to Bayesian networks, that there is no possibility of information flow from A to B. We also write this in symbols, exactly as before: we write A separated from B given C, and sometimes we write the G below it to say in which graph this holds. And again, if we write A and B without a conditioning set, then this just means A separated from B given the empty set. Now one can ask again: what kind of rules does this separation

notion satisfy? And you can prove, similar to Bayesian networks, the rules of redundancy, symmetry, decomposition, weak union and contraction, which can be summarized in this equivalence, but I will leave that to you. So now that we have the separation criterion, which is purely graphical, we can talk about probability distributions and random variables and the connection to the graph again. Now assume we have a Markov random field, which is, I remind you, an undirected graph plus a distribution that factorises according to the cliques of the undirected graph. So we have random variables with a joint distribution that factorises over the graph.

Again we abbreviate: by X with a set as an index I mean the tuple of random variables that are indexed by this set. Then, if we have some subsets of the nodes and we know that they are separated in this undirected graph, then also the corresponding random variables are conditionally independent; meaning, separation in undirected graphs implies conditional independence, in total analogy to Bayesian networks. And again, the reverse implication is not true in general. Now let us look at a few examples:

consider an undirected graph looking like this. Now let us check whether some subsets are separated or not. For instance, let us look at node 1. First of all we see that every node is connected to every other node in this graph without conditioning, so we cannot expect any unconditional independence here. But we can look at conditional independencies, or conditional separation. For instance,

node 1 is separated from 3 and 6 given 2 and 5. So here is node 1, here is node 6, and we now condition on these two middle ones. Why is that? We can just look at every path we can come up with: this path is blocked, we condition on this node; this path is also blocked, we condition again on this node. What about this path? It is blocked because we condition on this one. And what about this path? This is blocked because we condition on this one. Every path we can come up with is blocked by this node or this node. And this means

that we have this separation in the undirected graph. But now, by the global Markov property, we even know that if this undirected graph corresponds to a Markov random field, then the corresponding variables are conditionally independent given X2, X5. Maybe, shortly, another example: consider node 3, and we want to check whether node 3 is separated from node 5. Again, since everything is connected here through some path, we cannot expect this separation to hold without conditioning. Now let's condition on something. What about

conditioning on node 6? Let us check. If I go this path, conditioning on 6, this is blocked. But on this path you see it's open, because nothing on it is in the conditioning set, no node of this path is in the set. So maybe, if we want to have some conditional independence, we need to condition on more, for instance on node 2 as well. And now you can see: if I go this path, it is blocked here; if I go this path, it is blocked here; and all other paths I can come up with go either through this node or through this node. So the separation is true, and this implies the conditional independence of

the corresponding variables. Furthermore, if you are interested in more variables or sets of variables: we could for instance note that this set of variables is separated from this set of variables conditioned on this middle one here, and we would get the conditional independence of these variables and these variables, conditioned on those. We can make more and more examples. In the end I just want to talk about the famous theorem which gives us a converse to the global Markov property. It says: if I start with an undirected graph and a strictly positive joint distribution such that I have the pairwise

Markov property, meaning that for every pair of non-adjacent nodes (this is crucial) I have the conditional independence of the first variable from the second variable given everything else. And this here is not a separation statement in the graph, this is a conditional independence coming from the probability distribution; the only thing that relates to the graph is the non-adjacency of these nodes. So let's assume that this property holds and the density is strictly positive. Then the conclusion, and this is the claim of the theorem, is that the joint density factorises over this graph; in other words, the tuple (G, p) is a Markov random field. So in short, what

we have seen so far is: Markov random fields imply the global Markov property, and they clearly imply the pairwise Markov property as a special case, where we just use the corresponding separation, which implies this one. And the Hammersley-Clifford theorem states that if I have this pairwise Markov property and a positive density, then the reverse also holds.
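Here is a minimal Python sketch of the undirected separation check used in the examples above: A and B are separated by C exactly when a graph search from A that is never allowed to enter C cannot reach B. The grid-like graph at the bottom is a made-up example consistent with the checks discussed, not necessarily the slide graph.

from collections import deque

def separated(edges, A, B, C):
    """Undirected separation: every path from A to B passes through C."""
    A, B, C = set(A), set(B), set(C)
    adj = {}
    for u, v in edges:                      # build adjacency lists
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set(A) - C
    queue = deque(seen)
    while queue:                            # BFS that never enters C
        v = queue.popleft()
        if v in B:
            return False
        for w in adj.get(v, set()) - seen - C:
            seen.add(w)
            queue.append(w)
    return True

# Made-up grid-like undirected graph on six nodes, just to illustrate the call.
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (1, 4), (2, 5), (3, 6)]
print(separated(edges, {1}, {3, 6}, {2, 5}))  # True for this made-up graph
print(separated(edges, {3}, {5}, {6}))        # False: the path 3 - 2 - 5 is open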

2 Graphical Models - Markov Random Fields - Examples of MRFs

Hi everyone, in this clip I want to talk about examples of Markov random fields. The first example that comes to mind is the Gaussian Markov random field. We consider an M-dimensional Gaussian, which might be very high dimensional, and we parameterize this Gaussian not by the covariance matrix but by the precision matrix, which is just the inverse of the covariance matrix. Then we look at an undirected graph, which is a set of nodes and edges; here the set of nodes is just the numbers 1 to M, corresponding to the dimensions of our argument. The edges are either given, or we can define them to be this set here: we take all index pairs v and w such that the entry of the precision matrix is not

zero; these are the edges, or at least a subset of the edges, that's what we require. Then the claim is that this graph, which we basically constructed from the entries of the precision matrix, and this Gaussian distribution form a Markov random field, a positive one indeed. So let us write this out. How does the distribution look? Up to a normalizing factor, the Gaussian is just the exponential of minus one half times (x minus mu) transpose, then the precision matrix, which as usual is the inverse of the covariance matrix, times (x minus mu).

Now you can just expand this, so we can write it as diagonal terms times off-diagonal terms. This is a product over all the nodes of the exponential of minus one half times (x_ν minus mu_ν) squared times Omega_νν, so these are the diagonal elements, times, and now we take the product over all edges, that is, over exactly those pairs where the entry of the precision

matrix does not vanish, and then we have the exponential of minus (x_v minus mu_v) times (x_w minus mu_w) times Omega_vw. You can see the one half is gone here, because this matrix is symmetric and we have two terms, one for (v, w) and one for (w, v), and we just gather them together. And as you can see, we consider the single-element sets as cliques, and also the edges are cliques, and all the terms

where this entry is equal to zero you can just ignore, because they are just a factor of one. As you can see, this factorises clearly according to some cliques. So from the previous videos we know that this is a Markov random field; this is a factorisation property, and by definition it is positive, as all these factors are exponentials of something, which is always bigger than zero. We know now that it satisfies the global Markov property. That means: if an entry of the precision matrix is zero, and let's assume we take the edges to be exactly this set as we constructed it here, then we know that the corresponding

variables are conditionally independent given the rest of the variables. This might come as surprising, as we usually think of the covariance matrix as saying something about correlations: if an entry of the covariance matrix vanishes, then the variables are uncorrelated. Here the precision matrix tells us something about conditional independencies, and this can of course be taken further, because we can use the graph and the global Markov property to derive all kinds of other independence relations. Let's come to our next example. This example is called the Ising model; it comes from physics. The idea is that one has a lattice, which is an undirected graph; here are the nodes, and at each node there is a small magnet, and the magnet can point upwards or downwards. We have

small magnets, let's say in iron, and they sit on this grid. Our random variables can now take either plus or minus one, depending on whether the north pole points up or down. Between these magnets there are forces, and they can be expressed, at least approximately, by some potential energy. This is then written down in terms of probabilities: the probability is proportional to the exponential of a sum over all edges, where we have some prefactor times x_ν times x_w, each of which is only plus or minus one,

and then there are often also non-interaction or self-interaction terms: here we have another value h_ν times x_ν. This is the Ising model. On the side here we consider a very, very large lattice with very fine-grained spacing, and if you sample from this distribution, from the Ising model, then you get a picture something like this: you see a lot of white and a lot of black areas, which come from these local correlations. If one magnet points upwards, the next might also point upwards, and at some point it flips, so you get all these areas of

the same color. These constants here say something about how the nodes interact with each other, whether it is ferromagnetic or antiferromagnetic. And of course this is a Markov random field, as we can pull the sum in the exponent apart into a product, and similar to before we then see that the edges are cliques and all the single-element sets are considered as cliques, so you get a product over cliques. In applications in physics, people usually rely on some tricks involving the normalizing constant, parameterized by these quantities, to get to expectation values and covariance matrices.
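As an illustration of the sampled pictures mentioned above, here is a minimal Python sketch that draws an approximate sample from a small Ising model on a grid with a simple Gibbs sampler; the grid size, the coupling J, the field h and the number of sweeps are made-up values chosen only to show the idea, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(0)
N, J, h = 32, 0.6, 0.0                 # made-up grid size, coupling and field
x = rng.choice([-1, 1], size=(N, N))   # random initial spin configuration

def neighbor_sum(x, i, j):
    """Sum of the four grid neighbors (with wrap-around boundary)."""
    return (x[(i - 1) % N, j] + x[(i + 1) % N, j]
            + x[i, (j - 1) % N] + x[i, (j + 1) % N])

for sweep in range(100):               # Gibbs sampling: update one spin at a time
    for i in range(N):
        for j in range(N):
            field = J * neighbor_sum(x, i, j) + h
            p_up = 1.0 / (1.0 + np.exp(-2.0 * field))   # p(x_ij = +1 | neighbors)
            x[i, j] = 1 if rng.random() < p_up else -1

print((x == 1).mean())                 # fraction of "up" spins in the sample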

We have already formalized this kind of structure under the name of exponential families, so it is not surprising that we can write such distributions as exponential families, here a little bit more generally. Let's say we have an undirected graph again, our random variables are all discrete, and for each node we fix the state space in which the random variable takes values; before we just had plus and minus one, but you can also have more classes. Then we take the product of these spaces over the index set, and now we consider all positive Markov random fields over this graph and consider the factors as parameters. Why can we do this? This is the usual trick with discrete variables: we can just take the whole vector of

probabilities for each of the classes; each is a real number, and we can put them into a big vector. The parameterisation then basically comes from these functions, so in the discrete case we can consider each function just as the vector of its values and then treat them as parameters. Now we consider all positive Markov random fields where the graph is the same and the state space is the same, so we always have the same kind of discrete distributions, and we put these vectors together into one parameter vector. Then we can write this set of distributions as an exponential family by just saying what the natural parameters are. The natural parameters

are as follows: we have one vector of natural parameters for each maximal clique, namely the log of this factor evaluated at each possible value in the state space of the clique. Note that this is a finite set, so we have just finitely many values; we put them into one vector, and we have such a vector for each clique. These are the natural parameters, and the sufficient statistics are now just indicator functions telling me which of these values the argument takes. So we can use the usual trick we use for discrete distributions like the multinomial

and binomial, by just taking a one-hot encoding depending on the variable in the argument and then multiplying it against this whole vector to pick out the one value we need. So the sufficient statistic is just a one-hot encoding over all the values of the clique state space, and then we also get a log-normalizer out of it, by just taking the log of the normalizer and expressing the functions in terms of their logs; this is possible but not nice to write down. The base measure is just one. In this form we are able to write any positive discrete Markov random field as an exponential family, and now we can use all the tricks to derive expectation values and covariances and conjugate priors

and so on
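Here is a minimal Python sketch of this exponential-family construction for a made-up tiny Markov random field with two cliques of binary variables: the natural parameters are the logs of the clique potential tables, the sufficient statistics are one-hot indicators of the clique configuration, and the inner products reproduce the product of factors.

import itertools
import numpy as np

# Hypothetical tiny MRF: binary variables 0, 1, 2 with cliques {0,1} and {1,2}.
# psi[C] is a table of strictly positive clique potentials (made-up values).
cliques = [(0, 1), (1, 2)]
psi = {
    (0, 1): np.array([[1.0, 2.0], [0.5, 1.5]]),
    (1, 2): np.array([[2.0, 1.0], [1.0, 3.0]]),
}

# Natural parameters: one vector per clique, the log of the potential table,
# flattened over all clique configurations.
eta = {C: np.log(psi[C]).ravel() for C in cliques}

def sufficient_stats(x, C):
    """One-hot indicator of which configuration the clique C takes under x."""
    t = np.zeros(psi[C].size)
    idx = np.ravel_multi_index(tuple(x[v] for v in C), psi[C].shape)
    t[idx] = 1.0
    return t

# Check: exp(sum_C <eta_C, T_C(x)>) reproduces the product of clique factors.
for x in itertools.product([0, 1], repeat=3):
    lhs = np.exp(sum(eta[C] @ sufficient_stats(x, C) for C in cliques))
    rhs = np.prod([psi[C][tuple(x[v] for v in C)] for C in cliques])
    assert np.isclose(lhs, rhs)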

3 Graphical Models - Markov Random Fields

Hi everyone, in this video we'll talk about how Markov random fields relate to Bayesian networks. Let's start with a Bayesian network. A Bayesian network consists of a DAG and a probability distribution that factorises over it; that means the joint distribution can be written as a product over all the nodes of the conditional distribution of the single node's value given its parents in this graph. Now you might think: we already have a factorisation here, why can't we write this as a Markov random field? In fact you could, if you just make sure that each factor is a function of a clique. But this is a directed graph and not an undirected

graph. So if we want to construct an undirected graph, we have to make sure that each variable together with its parents forms a clique. This process is called moralisation: it takes a DAG, kind of marries the parents, and we end up with an undirected graph where these sets are cliques. So formally, what do we need to do? We connect all parents with undirected edges in the DAG, and then replace all directed edges with undirected ones. As you can see, the conditional distributions are then the natural factors in this Markov random field, because they are functions of exactly these sets, and we just turned those into cliques. It's

as simple as that to turn a Bayesian network into a Markov random field. One drawback might occur, for instance, if you are trying to check independencies: in a Bayesian network we had d-separation, in the Markov random field we have separation, and the process of moralisation, going from the Bayesian network to the Markov random field, can actually lose some conditional independencies if you just use separation afterwards. We will see this in a moment. But first let us illustrate moralisation. Consider the DAG on the left side. Now look at this node here: this node has two parents, this node and this node. And by the rule of moralisation

we have to marry the parents by an undirected edge; we draw this on the right. And the second rule was that directed edges turn into undirected edges; we draw this as well. Then we look at the next node, this node, and we see there are three parents. Now we have to connect all these three parents pairwise, and next we have to connect all these parents back to the node, because we have to replace each directed edge by an undirected one.
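A minimal Python sketch of this moralisation step, with the DAG given as a made-up dictionary mapping each node to its set of children:

def moralize(dag):
    """Moral graph of a DAG: marry all co-parents, then drop edge directions."""
    nodes = set(dag) | {c for children in dag.values() for c in children}
    edges = set()
    for u, children in dag.items():
        for v in children:
            edges.add(frozenset((u, v)))          # keep every edge, undirected
    for v in nodes:
        parents = [u for u in dag if v in dag[u]]
        for i, p in enumerate(parents):           # connect ("marry") co-parents
            for q in parents[i + 1:]:
                edges.add(frozenset((p, q)))
    return edges

# Hypothetical collider DAG 1 -> 3 <- 2: moralisation adds the edge 1 - 2.
print(sorted(tuple(sorted(e)) for e in moralize({1: {3}, 2: {3}, 3: set()})))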

So if we are now interested in independence relations, you can see that in the DAG this node and this node are d-separated from each other, because the only path goes this way and there is a collider. So we have v here and w here, and we have the unconditional d-separation of v and w. But here, in the moral graph, we have connected v and w. And we have seen that if this represents a Bayesian network, then this

represents a valid Markov random field. That means if we start from the Bayesian network, then we know that this d-separation implies the conditional independence, here the unconditional independence, of the corresponding random variables. But in the moral graph we have this connection, and from it we could not tell whether this independence holds or not. This shows that the process of moralisation might destroy the ability to detect independencies. Let us look at a few more simple examples. For instance,

look at this DAG on the left. We can turn it into an undirected graph by moralisation, as before, by just marrying the parents, and this turns out to be just this undirected graph. Let's give the nodes numbers 1, 2, 3, 4 here and 1, 2, 3, 4 there. As you can see there are not many conditional independencies in either of these graphs, because most of the nodes are connected to the other nodes. So what conditional independence or separation can be found?

For instance, node 1 separated from node 4 given 2 and 3. And this holds true in both the directed and the undirected graph. Here, the paths from here to here: only this one, a non-collider, this one, a non-collider, this one also only non-colliders, all blocked by 2 and 3. And here the same: we only have non-colliders, and 2 and 3 block node 1 from node 4. Now let's look at the collider case, like this; let's give the nodes numbers 1, 2, 3.

What we know here is that in this upper graph the first node is d-separated from the second node. But if you take the moralisation, you see that first we keep the same edges as before, but now we have to add an undirected edge between the parents, and in this lower graph you don't have this separation anymore. So we lose the ability to detect one of the independencies; you cannot represent all independence relations of a Bayesian network with a Markov random field by this procedure. But also the other way around: let's see how this

would relate to a Bayesian network. This here would be a valid Markov random field, where these are the cliques, the maximal cliques. What we would have is an independence of 2 and 3 given 1 and 4, and of 1 and 4 given 2 and 3. Let's also assume that 1 and 2 are really dependent, 3 and 4 are dependent through this edge, 1 and 3 are dependent, and 2 and 4 are dependent. Now, if we wanted to find a Bayesian network that can represent this: first of all, no node of the DAG could have more than one parent, because otherwise two of these variables

would be connected after moralisation, and that is not the case here. So one of the variables must be the only parent of the other variable, for instance like this, but then we cannot draw an edge here, because otherwise during moralisation we would get this edge, and we would also get a dependence relation between 1 and 4 conditioned on 2 and 3, which contradicts the independence in this graph. So the only thing we could do is point this arrow in this direction; for the same reason as before we have to make the arrow go in this direction, and then for the same reason we would have to make this arrow go in this direction. But then this is not acyclic anymore, this is not a DAG.

So we see that this procedure, going from a Markov random field to a Bayesian network, is in general not possible. But we can shortly discuss when it is possible, and this is the notion of a perfect elimination ordering. We start with an undirected graph and a joint distribution such that they form a Markov random field; then the probability distribution factorises according to the cliques, like this, with some functions. And now we say that this undirected graph has a perfect elimination ordering if we

can order our nodes v_1 to v_M, so we give them new labels, new numbers, such that if I look at the neighbors of a node that are also predecessors of this node, so only the ones which come before, then these form a clique, for every single node. Then we call this new ordering a perfect elimination ordering. If we have this, then we can just define a directed graph, a DAG, by calling these nodes the parents, meaning that we draw a directed edge from each of these elements to v_m. Then one can easily check that this is a DAG,

that the distribution factorises as a Bayesian network over this DAG, and that if I take the moralisation of this newly constructed DAG, I get my old undirected graph back. But here is a big if: not every undirected graph has a perfect elimination ordering. As we have seen here, for instance, this graph doesn't have a perfect elimination ordering, because if I look at this node, then the predecessors that are also neighbors are this set, and they don't form a clique; and any other node, if I start with this one or end with these, has the same problem. On the contrary, if I look at this graph here,

then this is already in perfect elimination order 1, 2, 3, 4. I look at node 4: its neighbors which come before are connected by an edge, so I can declare them parents, which is already drawn here with these arrows. Then I remove it and am left with this graph. Then I look at node 3, and I look at the predecessors that are also neighbors, 1 and 2; they are connected by an edge, so they form a clique, and I can declare them parents, as you can see with these arrows. Lastly, I look at node 2: its neighbors are 1, 3 and 4, but only 1 is a predecessor and a neighbor.

So I declare node 1 the parent of node 2 and draw this edge; node 1 itself is considered a clique on its own and gets no parents. So for this graph it is possible to go back to a Bayesian network, and for this one unfortunately not. If you look at the middle one, you might think: okay, 1 and 2 are the predecessors, they are connected by an edge, I can declare them parents, so far so good. But as soon as you go to node 2 and look at node 1, then you need to declare 1 the parent of 2, so we would need to draw this edge. That would be possible, so we would end up with a Bayesian network looking like this,

but we would not get back the independence we started from. This is not a problem if you just start from a generic Markov random field: then you can go up to this Bayesian network and you have found a representation. But this example shows that the representation as a Bayesian network is not unique, and that if you start from the Bayesian network, go down and then go back, you can end up with fewer conditional independencies, since this edge would make this independence no longer hold.

4 Graphical Models - Markov Random Fields - Summary

Okay, let's summarize what we have learned so far about Markov random fields. We have learned that Markov random fields consist of an undirected graph and a probability distribution, a joint distribution over some state space, such that it factorises according to the maximal cliques of this graph, written in this formula, where we need to assume the existence of such functions whose arguments only depend on the components indexed by these cliques. Then we defined separation, what separation means: if you have three sets of nodes, then A is separated

from B given C if every path from A to B contains a node in C, every path, not just one. So every possible way of information flow is blocked. Then we had the global Markov property for Markov random fields, saying that if I can find in the graph three sets that are separated, so if A is separated from B given C, then the conditional independence of the corresponding random variables holds as well. We also had kind of a reverse of this claim, and that is the Hammersley-Clifford theorem; the reverse means that if this holds plus

we have a positive density, then I get the factorisation back. And it can even be weaker: we only need to assume the pairwise Markov property and a strictly positive density, and then the graph and the probability distribution form a Markov random field, that is, these functions exist. Then we defined what moralisation means, and this is just connecting the parents and then replacing directed edges with undirected ones; if you do this, then every Bayesian network turns into a Markov random field, with a possible loss of conditional independence relations. We have also seen the other way around: if we start with a Markov random field and we have a perfect elimination ordering, then we can turn it into a Bayesian network

by recursively defining parents via the cliques given by the preceding neighbors. Then, as a general remark, we have seen that Bayesian networks and Markov random fields in general model different conditional independence relations, and the two important examples where you could see this were the collider case, where going from a Bayesian network to a Markov random field we lost independencies, and the Markov random field given by this square cycle, which cannot be turned into a Bayesian network because of the acyclicity assumption.

5 Graphical Models - Factor Graphs

Hi everyone, in this video we'll talk about another graphical model called factor graphs. For this we need to introduce a new kind of graph again, and what we introduce now are bipartite graphs. Bipartite graphs come with two kinds of nodes: V is a set of vertices or nodes, and then we have another set of vertices or nodes F. The nodes in V we will represent with this round shape, while the nodes in F we will represent as boxes. These don't build a graph yet; we also need edges,

and these are undirected edges, and the edges only go from nodes in V to nodes in F. So we don't have connections between two nodes in V, and we don't have connections between two factor nodes in F; we only have such connections. This is a bipartite graph: two kinds of nodes and edges between them. Then we have to say what the neighbors are. Consider this bipartite graph up here, and let's say this factor alpha is connected to v, w,

up to some node z. Then the neighbors of this factor alpha are nodes from the other kind of set; here they will all be variable nodes in V. And if I look here at a node from V, then all its neighbors will be factors from the other set. So we can write that the neighbors of v are all the alphas in F such that there is an edge between them, and the same the other way around: the neighbors of alpha are all the v's connected to it. That's it,

these are the neighbors. Now let us define again how this bipartite graph can relate to probability distributions, and this is the notion of a factor graph. A factor graph is just a bipartite graph together with some joint probability distribution that has the following factorisation: it is a product over the alphas in F, which is basically the index set, of some functions psi_alpha. These psi_alpha are some functions, like in Markov random fields; they just have this index coming from

the set of factors, and the argument is x_alpha, where x_alpha now means only those components of x_V that are related to alpha, namely the neighbors of alpha; x_alpha means all the components whose index is a neighbor of this factor. With this definition we can now turn Markov random fields and Bayesian networks into factor graphs. So consider a DAG or an undirected graph such that there is some distribution making it a Bayesian network or a Markov random field,

respectively. In the second case we have a factorisation according to the cliques or maximal cliques, and in the first case a factorisation according to the nodes, with the conditional distributions given the parents; these are the two cases. Then, for each of these factors, we can just declare a node in this new set F: we give it an index, let's call it alpha, and this one here for instance also alpha, or we

call it alpha prime if we want to make a distinction between the function and the index, like this psi_alpha. Then we can define a factor graph by just taking the variable nodes as before; the factor nodes correspond to these alphas, roughly to these functions, or we can take this function here, or, if you have a different factorisation, because we can gather factors together, then you can also take a different

factorisation. So this is not a unique way to do it. Then we connect each of the variable nodes with all the alphas, meaning all the factors, such that x_v occurs as an argument of that function. Maybe we give a little example here: consider a collider case like this, then we have a joint distribution p(x1, x2, x3) equal to p(x1)

times p(x2) times p(x3 | x1, x2). Now we can write this as a factor graph by representing each of these functions as its own box: box one, box two, box three, and then the variable nodes here, again x1, x2, x3. Then we have here p(x1), meant as a function

of x1, here p(x2), and here p(x3 | x1, x2). Then we connect this one, because x1 is an argument of this function; x2 needs to be connected to this function, because x2 is an argument of this function; and this conditional distribution has three arguments, so we connect all three of these variable nodes to this factor node. If you wanted, you could call this alpha, you could call this beta, and we could call this one gamma, and use these as indices for the factors. Here

is also the Bayesian network, and this here is the factor graph. Now let's look at a different example. Let's consider a Markov random field, say this one, with nodes 1, 2, 3, 4, where the distribution factorises according to these cliques, and let's assume the factorisation only involves the maximal cliques. Then the maximal cliques will correspond to the new factors; we can make two boxes, box one and box two,

and then we need to connect the nodes, where these are again the variable nodes 1 to 4 from before; let's call the factors alpha and beta, and now we connect node 1. This one here occurs as part of this clique, so we have these connections; node 1 is also connected to alpha, like this. So we have turned the Markov random field into a factor graph. Please note that it's not just a graphical conversion;

we actually have to look at all the factors, so at the probability distribution and its factorisation.
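A minimal Python sketch of the collider example above, representing the factor graph simply as a mapping from factor names to the variable nodes they touch; the labels alpha, beta and gamma are just the names used in the example:

# Factor graph for p(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2),
# stored as: factor name -> tuple of variable nodes it is connected to.
factor_graph = {
    "alpha": ("x1",),             # box for p(x1)
    "beta":  ("x2",),             # box for p(x2)
    "gamma": ("x1", "x2", "x3"),  # box for p(x3 | x1, x2)
}

def neighbors_of_variable(fg, v):
    """All factor nodes connected to the variable node v."""
    return {name for name, variables in fg.items() if v in variables}

print(neighbors_of_variable(factor_graph, "x1"))  # {'alpha', 'gamma'}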

6 Graphical Models - Learning in (discrete) Bayesian Networks

Hi everyone, in this video I want to talk about how to fit a Bayesian network to data when the graph is known. The setting is as follows. We start with a Bayesian network, where we assume some underlying true distribution that factorises according to the DAG here. We also assume that all the random variables have a discrete, finite state space, so the joint one is then also still finite. From this Bayesian network we now assume that we have sampled N times independently, and this is our data set; each of these x's comes with the features that are indexed by the nodes of G. Now we want to recover or estimate

these conditional probabilities. If we have these, then we have the joint by the factorisation property. So first of all, what are the parameters here? The parameters are exactly these probabilities, meaning for every value here and every value here I have one parameter, of course with the constraint that they sum up to one. But basically every such pair of values gives me one parameter. So our parameters, if you want, can be written as theta with indices z_ν and z_pa(ν), one index for the value of the node and one for the values of the parents.

This runs over all nodes ν in V, over all values z_ν in the state space X_ν, and over all values z_pa(ν) in the product of the state spaces of the parents of ν. So we have that many parameters, and we have some constraints: the constraint is that if we sum over z_ν, then we get one; this is a constraint, and the sum is over all z_ν

in this state space. Then we can use the factorisation property to write down a likelihood; what we want to do in the end is the maximum likelihood approach. So what is the likelihood? The likelihood is the probability of the data given this vector of parameters, and since we know that the data is i.i.d., we get a product

over n equals 1 to N of p(x^(n) | theta). Then we know that we have this factorisation property, with the thetas as the parameters, but we have to make sure we only pick out the ones that correspond to our data point. For this we first take a product over all possible values that the node and its parents

can have, and now we want only the probability given by our x's to survive. How can we do this? We can do this by using indicator functions, meaning theta_{z_ν, z_pa(ν)} raised to the power of the indicator that z_ν equals x_ν^(n) times the indicator that z_pa(ν) equals x_pa(ν)^(n), evaluated at the data point, so that the correct parameter is picked out,

and all of this together is supposed to be theta evaluated at x_ν^(n) and x_pa(ν)^(n) for data point n. So we have a product over the data points, then the factorisation property gives the product over all features, and then we have this conditional probability; this is our likelihood. If we want to set up the Lagrangian, we have to take into account that we have constraints, so it will be the log-likelihood plus some Lagrange multipliers times the constraints. So the Lagrangian

will be a sum over n from 1 to N, these are the data points, then a sum over all nodes nu, then a sum over all values z_nu and also a sum over all values z_pa(nu) of the parents. The indicator functions come in front: the indicator that z_nu equals x^n_nu times the indicator that z_pa(nu) equals x^n_pa(nu), and then times the log of theta

of z_nu given z_pa(nu). This is the log-likelihood, and now we have to incorporate the constraints with Lagrange multipliers; these are equality constraints so it doesn't matter whether we take a plus or a minus. Then we have a sum over all nodes nu, a sum over all values of the parents of nu, and a Lagrange multiplier lambda which depends on both nu and z_pa(nu), and then we have to encode that if we add up the thetas we get one, meaning the sum over all z_nu

of theta of z_nu given z_pa(nu), minus one. Okay, that becomes a little bit messy, so let me write it here again. We have our likelihood coming from the data contributions, and since we parameterise every single entry of our conditional probability distributions, we have to evaluate them at all possible values and then sum over them; each of them is one parameter. So we have the data points, the factorisation property over the nodes, then all possible states, and we add this up; this is the log-likelihood. Then we have to

add the constraints, for which we introduce one multiplier for each node and for each value of its parents. So we have a lot of these, and then here the corresponding constraint. Now what we need to do is set the derivatives to zero: we want to maximise, and maximum likelihood estimation means we have to find the argmax of our Lagrangian. How do we do this? We look at all the partial derivatives and set them to zero.

If we do this, it is a straightforward calculation: we just sum these indicator functions, the derivative of the log of theta is just one over theta, and only the matching terms survive. What we get is, written like this, the number of occurrences N(z_nu, z_pa(nu)) divided by theta of z_nu given z_pa(nu),

and then minus the corresponding Lagrange multiplier lambda of nu and z_pa(nu). Here we introduced the count N(z_nu, z_pa(nu)), which is just the number of occurrences in the data of these values appearing together, obtained by summing up the indicator functions. So it is just the number of

occurrences of these values in the data. We can now use this equation, set it to zero and solve: lambda of nu and z_pa(nu) times theta of z_nu given z_pa(nu) equals the count, and summing over z_nu, using the constraint that the thetas sum to one, we find that the Lagrange multiplier equals the count N(z_pa(nu)). So we have found the multiplier; what does it mean? It means we can just plug it back in

and we get a formula for our parameter: theta equals the number of occurrences N(z_nu, z_pa(nu)) divided by the number of occurrences N(z_pa(nu)). What does this mean? It means we just take the frequency in the data of the joint configuration divided by the frequency in the data of the parent configuration, so this represents the probability of z_nu given z_pa(nu). From a frequentist point of view this is very

plausible: if we want to fit a Bayesian network, then for every value of z_pa(nu) and every value of z_nu we check how many data points have these values jointly and how many have the parent values, divide the two, and then we have an estimate for these conditional probabilities.
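To make this counting estimator concrete, here is a minimal Python sketch (my own illustration, not from the lecture), assuming a hypothetical three-node graph A -> C <- B with binary states; it estimates every conditional probability table exactly as the ratio of joint to parent counts derived above.

```python
import numpy as np
from collections import Counter

# Hypothetical DAG: A -> C <- B, all variables binary.
parents = {"A": [], "B": [], "C": ["A", "B"]}

rng = np.random.default_rng(0)
data = []
for _ in range(1000):
    a = int(rng.integers(2))
    b = int(rng.integers(2))
    c = int(rng.random() < (0.9 if a == b else 0.2))   # toy generating CPT
    data.append({"A": a, "B": b, "C": c})

def fit_cpts(parents, data):
    """Maximum likelihood CPTs: theta[node][(pa_values, value)] = N(value, pa) / N(pa)."""
    cpts = {}
    for node, pa in parents.items():
        joint = Counter()   # counts N(z_node, z_pa)
        marg = Counter()    # counts N(z_pa)
        for row in data:
            pa_vals = tuple(row[p] for p in pa)
            joint[(pa_vals, row[node])] += 1
            marg[pa_vals] += 1
        cpts[node] = {key: cnt / marg[key[0]] for key, cnt in joint.items()}
    return cpts

cpts = fit_cpts(parents, data)
print(cpts["C"])   # estimated P(C | A, B), close to the generating probabilities
```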

7 Graphical Models - Learning in (positive discrete) Markov Random Fields

hi everyone, in this video I want to talk about how to fit a Markov random field to data. For this it is useful to cast Markov random fields as an exponential family. First consider an undirected graph, and assume we have finite state spaces for each of the variables indexed by the nodes, and we consider the graph fixed. Then we consider all positive distributions over this Markov random field. They come with factor functions whose arguments correspond to the cliques, and we may just consider maximal cliques here without loss of generality.

These functions can be considered the parameters of this Markov random field; we can put them all together in one big vector. And since we are considering finite state spaces, a function psi_C basically corresponds to the vector of its values evaluated at all possible arguments; in this finite setting a function is nothing else than the vector of all its possible values. Then we can write this whole set of positive Markov random fields with fixed finite state space as an exponential family, just by taking the log of this distribution

and writing an exp in front of it. Then we get the log of these factors and the log of the partition function, and as I said before we represent these functions as vectors. So the natural parameters are the logs of all these factor values, and the interaction with the data comes from a one-hot encoding indicating which entry of each factor is used according to this x: if x takes a specific value, the sufficient statistic just picks that entry out of the vector. And this is how we cast the family of positive discrete Markov random fields as

an exponential family. Now we will use this to fit to data. Consider a Markov random field as the underlying true distribution, from which we have some i.i.d. data sampled, and here we make the assumption of strict positivity so we can use the representation from before. We consider the graph known, so we basically know the cliques, and the cliques correspond to these natural parameters. So what are our parameters? Our parameters are these etas, indexed by C running through all maximal cliques and z_C running through all values

of the clique variables. Then we can talk about the likelihood. We know the likelihood has exponential family form, so we only need to plug in the data. Let me write this down: the probability of the data given eta is, because of i.i.d.-ness, the product over n from 1 to N of the exp of

the sum over cliques C of eta_C transposed times the sufficient statistic of x_n, minus the log-partition function. Then we have the Lagrangian, which here is just the log-likelihood since there are no constraints on the etas: the Lagrangian is a sum over all data points and then a sum over all cliques C, and if we want we can

write it out more explicitly as a sum over all values z_C, with eta_C(z_C) times the indicator that z_C equals x^n_C, and then minus N times the log-partition function A(eta), where the factor N comes from the sum over the data points. So now we have the Lagrangian and we can do maximum likelihood estimation. Let's go to the

next slide, where it is pasted again. Our maximum likelihood estimation is the argmax over all etas of this log-likelihood, and for this we have a necessary condition: if we take the derivative of this L with respect to each of the parameters, it should be zero. You can see that if I take the derivative with respect to this eta, then from the first term I just get the indicator functions summed over the data points, which is just the number of occurrences of

the specific value z_C in the data, let us call this N(z_C), and then minus N times the partial derivative of A(eta) with respect to eta_C(z_C). Now, we wrote it as an exponential family precisely to use the relation between the derivatives of the log-partition function and expectation values. So what we know is

that this derivative is the expectation value of the sufficient statistic of x, which here is just an indicator function of z_C equalling x_C. So I can write it as the probability of z_C given eta, where we marginalise out all other variables. So we get that this whole derivative is N(z_C) minus

N times the marginal probability of z_C under the model: the first term is the frequency in the data and the second would be the idealised frequency in the model. So if we can compute this model marginal explicitly, we could just plug it in and we are basically done. If we cannot compute it, then we could try using sampling methods, which we haven't introduced yet. If we can sample from the model, we can just take the relative frequency, so this would be roughly N(z_C) minus N times the relative frequency

N_model(z_C) divided by N_model, where N_model is just the number of samples: we randomly sample from the model and count how many of these turn out to have z_C in the C component. As we can see, we cannot explicitly solve for eta here; maybe in special cases we can rearrange this formula, but in general it doesn't seem possible. We can, however, at least roughly evaluate the gradient if we are able to sample from this distribution. Setting the gradient to zero doesn't help so much, but what we could do is

use an update rule with the gradient: the new eta is the old one plus some learning rate times these quantities. This update rule comes from gradient ascent applied to each component. And of course we can always draw the same number of samples from the model as we have real data points; then the gradient is just the difference between the count in the data and the count from the model.

If we do this until convergence, we have fit our model.
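As an illustration of this fitting procedure, here is a small sketch (my own, not the lecture's) for a toy Markov random field with two binary variables and a single pairwise clique, fitted by gradient ascent on the log-likelihood. Because the example is tiny, the model marginals are computed by exact enumeration; in the general case one would replace them by relative frequencies of samples drawn from the current model, as discussed above.

```python
import numpy as np
import itertools

rng = np.random.default_rng(0)

# Toy MRF: two binary variables and a single (maximal) clique {0, 1}.
# Natural parameters eta[z0, z1] = log psi(z0, z1), one entry per clique configuration.
true_eta = np.array([[1.0, -1.0], [-1.0, 1.0]])

def model_probs(eta):
    """Exact model probabilities p(z0, z1) proportional to exp(eta[z0, z1])."""
    unnorm = np.exp(eta)
    return unnorm / unnorm.sum()

# Draw toy i.i.d. data from the true model.
states = list(itertools.product([0, 1], repeat=2))
idx = rng.choice(len(states), size=500, p=model_probs(true_eta).flatten())
data_counts = np.zeros((2, 2))
for i in idx:
    z0, z1 = states[i]
    data_counts[z0, z1] += 1
N = len(idx)

# Gradient of the log-likelihood: N(z_C) - N * p_model(z_C); divided by N for a stable step size.
eta = np.zeros((2, 2))
for _ in range(2000):
    grad = data_counts / N - model_probs(eta)
    eta += 0.5 * grad    # gradient ascent; with sampling, model_probs would be replaced by
                         # relative frequencies of samples drawn from the current model

print(np.round(model_probs(eta), 3))   # fitted clique marginals
print(np.round(data_counts / N, 3))    # empirical frequencies; the two should match
```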

5 Inference

0 Inference - Motivation

hi everyone, this video is about inference. I want to talk about what it is and give some examples. So what is inference? In our context, inference means that we are trying to derive some property of a probability distribution. Let's say we have a probability distribution given here, represented in the computer, say as a probability table, or as a normal distribution where we know the mean and the variance. So we have different forms of representing probability distributions on a computer. Inference now means that we are trying to derive, or compute a representation of, certain properties of this distribution. These could be, and typically are, marginals

or conditional distributions, sometimes the mode of the distribution, and so on. The real challenge here is not that we do not know the mathematical formulas; the problem is that we need to compute them, so that we can access them by computer with feasible computations. A simple example in the Gaussian setting makes this very clear. Think about this graphical model where we have a Gaussian random variable Z and a Gaussian random variable X that is affected by Z. We can represent the distribution of Z: for instance we can say Z is normal with mean mu and covariance Sigma, where Z may be multi-dimensional, and then we can say that X, conditionally on Z, is

normally distributed as well. So let me write it down: we have the marginal distribution of Z, and we also have the conditional distribution of X given Z, obtained for instance by transforming Z linearly, say A Z plus b, and then adding another source of noise with covariance S. This clearly gives us the joint distribution, just by multiplying the marginal of Z times the conditional distribution of X given Z.

Since we have a representation of the marginal for all values and a representation of the conditional for all values, we have a full representation of the joint distribution. So we are done, right? No. The problem is that even though we have a full mathematical description, with no loss of information, we cannot directly read off certain quantities of this distribution. For instance, suppose I am interested in the joint distribution of (X, Z) written as one normal distribution with some mean M and some covariance T. Then I can ask: what is this mean? What is this covariance?

I don't know how familiar you are with this, but I cannot directly read M and T off these two building blocks; I have to calculate something. For instance, for the mean I just compute the expectation value of Z and the expectation value of X. If we take the expectation value of Z, we know it is mu; if we take the expectation value of X, we know it is A times the expectation value of Z, plus b. So at least we know what the expectation value is: we know that M

is mu in the Z component and A mu plus b in the X component. We can also calculate what T is, block by block: we know that the Z-Z block of the covariance is just Sigma, and for the X-X block we might first write down S, the noise covariance. For the cross-covariance between X and Z we know that if you take something with covariance Sigma and multiply it by A, then

A also multiplies the covariance, so the cross block is A Sigma. Now we have all parameters: from the full description we were able to compute the joint. If we want to do this on the computer, we just have to represent the pieces separately as before, but now we also have to compute and store this A times Sigma. So the next question is: what is a marginal? This was the joint representation, and the marginal is given from the joint just by looking at the corresponding components. So if I now want to know what p of X is, then I only

need to look at this mean component and this covariance block. So the marginal of X is a normal distribution with mean A mu plus b, and, here I made a mistake before, some variance propagates through to X as well, so the X-X block is S plus A Sigma A transpose, and this is of course also what we get in the marginal. So this is the marginal. What about the conditional? Let's go

to the next slide. Here we have written the joint again, and now we are interested in deriving the conditional, the other way around: before we had X given Z, now we want Z given X. There is a formula for this which tells us that it is again a normal distribution in Z, with a mean that we can write down: it is mu plus Sigma A transpose times the inverse of (A Sigma A transpose plus S), times the difference

between X and the mean of X, which is A mu plus b. So this is the mean of the conditional; it looks already complicated but is not dramatic. Now let us write down the covariance. This follows a similar rule: we take the covariance Sigma and subtract Sigma A transpose times the inverse of (A Sigma A transpose plus S), let me write it

down here, times A Sigma. So this is the formula for the conditional Gaussian. Mathematically we could write everything down: we have the joint, we had the marginal, and now we have the conditional. If we want to compute this, we have to compute it from the given matrices: A, Sigma and S were all given and represented before. But now, if I am interested in computing this distribution, meaning given X I want to infer what the distribution of Z is, then I

need to invert this matrix, and if X lives in a very high dimensional space, then this is infeasible: even though we can write it down in mathematical terms, computing it is infeasible because of this inversion. There you can see that inference is inherently difficult and sometimes infeasible, even though all ingredients are given without loss of information.
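To make the feasible part of this concrete, here is a small numpy sketch (my own, with made-up low-dimensional A, Sigma, S) that computes the marginal of X and the conditional of Z given X with exactly the formulas above; the point of the lecture remains that the matrix inversion becomes the bottleneck once X is high dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model: Z ~ N(mu, Sigma),  X | Z ~ N(A Z + b, S)   (small, made-up dimensions)
d_z, d_x = 2, 3
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
A = rng.normal(size=(d_x, d_z))
b = rng.normal(size=d_x)
S = 0.5 * np.eye(d_x)

# Joint of (Z, X): mean and covariance blocks.
m_x = A @ mu + b
T_zx = Sigma @ A.T                  # cross-covariance of Z and X
T_xx = A @ Sigma @ A.T + S          # marginal covariance of X

# Conditional Z | X = x:
x = rng.normal(size=d_x)
K = T_zx @ np.linalg.inv(T_xx)      # this inverse is the expensive step for large d_x
cond_mean = mu + K @ (x - m_x)
cond_cov = Sigma - K @ T_zx.T

print("marginal of X:", m_x, "\n", T_xx)
print("conditional Z|x:", cond_mean, "\n", cond_cov)
```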

Let's have another example, this time discrete. Take a Markov random field, for instance a Markov chain like this, with variables X_1, X_2, X_3, X_4 up to X_M. This distribution comes with a factorisation: p of x is some normalising constant times a product over factors corresponding to the cliques; here the maximal cliques are the pairs, so we have factors psi_i of (x_i, x_{i+1}) for i from 1 to M minus 1. These are the factors, and now let's assume each variable has K states,

capital K states, I mean. And the question now is: what is the marginal distribution of X_m, the one here in the middle? By this formula, again everything is given: the whole distribution is given for all values, at least as a function, so we don't have any loss of information here. Everything is given to us. The problem is, if I ask for this marginal, it is not really clear how to derive it. So what is the general formula?

The general formula is that we have to marginalise all other variables out. That means we have to sum over all possible values of X_1, all possible values of X_2 and so on, all possible values of X_{m-1}, and then all possible values of X_{m+1} up to all possible values of our last variable X_M, of the joint. It is a huge sum. Maybe we make this clearer by writing more sum symbols

explicitly. This means I have to sum X_1 over K states, X_2 over K states and so on. So we have K times K times K and so on, that is, we sum over K to the power of M minus 1 states. As this representation suggests, if K is big and M is big, then this is infeasible to do, so we cannot just add up all the states and then derive the marginal

in this naive way. But we can exploit the graphical structure to make this easier, because not all of these factors depend on all the variables. So we can be smarter. First let's write this again: all these sums over X_1 up to X_M, but leaving out X_m in the middle, and then one over Z times the product over i of psi_i of (x_i, x_{i+1}). Let us have a closer look at the next slide. Here is the formula again, and we can now use the distributive

law. We write the sum over X_1 first, and so on, and then at the very end we have psi_{M-1} of (x_{M-1}, x_M), with the sum over X_M pushed all the way inside; before it we have psi_{M-2} of (x_{M-2}, x_{M-1}), with the sum over X_{M-1} pushed just outside of that, and it goes on like this.

Similarly from the other side: psi_1 of (x_1, x_2) with the sum over X_1 innermost, and so on. Somewhere in the middle we have the factor psi_{m-1} of (x_{m-1}, x_m) and the factor psi_m of (x_m, x_{m+1}), but there is no sum over X_m, only the sums over X_{m-1} and X_{m+1}. Let me tidy this up a little bit.

What we have here now is a sum over X_{m+1} of psi_m of (x_m, x_{m+1}) times the term coming from further out, and on the other side a sum over X_{m-1} of psi_{m-1} of (x_{m-1}, x_m), and so on; we can highlight the x_m's here. Now we can argue how much this costs. If we look at the innermost sum: for any value

of X_{M-1} we have to compute this sum, the sum adds up K elements, and we have to do this for K values, so this innermost sum costs roughly K squared operations. Once we have this done and have represented the result as one function, that is, as a vector, we can evaluate the next sum in the same way. We have to repeat this about M times, which leaves us with a total cost of roughly M times K squared, which is much, much smaller than K to the power of M minus 1.
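Here is a small numpy sketch (my own, with random factor tables) of this computation on a short chain: it computes the marginal of a middle variable once by brute force over all configurations and once by eliminating the variables from both ends, and checks that the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, m = 6, 4, 3           # chain length, states per variable, index of the queried variable
psi = [rng.random((K, K)) for _ in range(M - 1)]   # pairwise factors psi_i(x_i, x_{i+1})

# Naive marginal: build the full joint table with K**M entries, then sum.
joint = np.einsum('ab,bc,cd,de,ef->abcdef', *psi)
naive = joint.sum(axis=tuple(i for i in range(M) if i != m))
naive /= naive.sum()

# Smart marginal: eliminate variables one by one from both ends, cost O(M * K**2).
left = np.ones(K)
for i in range(m):                       # sum out x_0, ..., x_{m-1}
    left = left @ psi[i]                 # left[x_{i+1}] = sum_{x_i} left[x_i] * psi_i(x_i, x_{i+1})
right = np.ones(K)
for i in range(M - 2, m - 1, -1):        # sum out x_{M-1}, ..., x_{m+1}
    right = psi[i] @ right               # right[x_i] = sum_{x_{i+1}} psi_i(x_i, x_{i+1}) * right[x_{i+1}]
smart = left * right
smart /= smart.sum()

print(np.allclose(naive, smart))         # True: same marginal, vastly fewer operations
```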

So here you had the first example of how to do inference, here for the marginal of one of the variables, by just using the distributive law to go from an infeasible time complexity to a feasible one, simply by doing it in a smart way. We will see more of this later. After this introduction, I want to talk about what kinds of inference there are. First of all, what we just did was exact inference: we want exact formulas, exact representations of our marginals or conditionals.

This is desirable because it is exact, but as we have seen it is computationally often infeasible. What we can do is exploit, for instance, sparsity in graphical models, as we just did, to do exact inference; still, the sparsity assumption of this representation was crucial, namely that not all the variables depend on each other. The second way, if we cannot do exact inference, for instance if we have a dense graphical model or no graphical model at all, is to deal with it in an approximate way. There are two big topics here: one is variational inference, which basically means we want an analytical or numerical approximation to our distribution or marginal, and the other one is using sampling methods

or Monte Carlo methods. For variational inference we will have three topics: expectation maximization, then mean field approximation, and variational autoencoders. For sampling methods we will see a lot of them, like importance sampling, rejection sampling and so on, and the famous Metropolis-Hastings Markov chain Monte Carlo algorithm.

1 Graphical Models - Exact Inference - Variable Elimination Algorithm

hi everyone, in this video I will walk you through the variable elimination algorithm. Let's directly start with an example, namely this Bayesian network. For simplicity I just mark the nodes with different letters, and also for simplicity I will mark the states with the same letters. In this Bayesian network we have a factorisation as follows: we start with the probability of C, which has no parents, then we take P of D given C, and again I am just writing C and D for the nodes and also for the states instead of writing x_C and so on, just as an abbreviation so it stays

easy to read. Then we have P of I, then P of G given I and D, then P of S given I, then P of L given G, P of J given L and S, and P of H given G and J. And what I am now interested in is: what is the marginal distribution of J?

The variable elimination algorithm now goes through the variables in a certain order and marginalises them out by just summing over all their states, and all factors in which they occur are summarised by a new factor. First, we could also transform this network into a Markov random field by replacing the directed edges with undirected ones and connecting parents with each other; so we moralise the graph: this edge here, this one here, and then we still have to connect these parents here

and so on, for instance H has G and J as parents. In this setting we would have the corresponding factors according to the cliques, or we just translate the conditional probabilities into factors, which just means using different letters: we have a factor on C, one on C and D, one on I, one on G, I and D, one on S and I, one on L and G, one on J, L and S, and one on H, G and J. And the product of these is supposed to be the joint distribution.

Let's call the whole bunch of all these letters x. Let us copy that to the next slide. As I said, we are now interested in deriving the marginal of J, which means that we formally have to sum out L, S, G, H, I, D and C from this joint distribution, where x stands for the vector of all the letters, over all states of all these letters. Now,

these variables have several states, so this would mean we have to sum over all these states and these states and so on. If each has K states, that is K to the power of 1, 2, 3, 4, 5, 6, 7, so seven variables, K to the power of seven additions we would have to do, which is close to infeasible in our context here. But we have this factorisation, which we are trying to exploit: we want to plug it in and then do the sums piece by piece. Let us write down what this is; let's start with the sums over L and S:

which of the terms still depend on L and S? We have an L here, we have L and S here in the factor on J, L and S, and there is still a G in this one, and we have the factor on H, G and J; the remaining factors, psi of S and I, psi of G, I and D and so on, stay further inside, down to the sum over D of psi of D and C and the sum over C of psi of C.

Sorry, let me explain what just happened. First we have to choose an order in which we want to eliminate, and we want to evaluate the innermost sum first. So we have to pick a node; let's take C. The good thing is that not many factors depend on C: only the factor corresponding to C itself and this one, the factor on D and C. So we start with the sum over C, we look for all the functions which depend on it, we write them to the back, that's what happened here, and then we sum them up, this is what happened here. This is the first step, and we can consider the result a new factor

and this new factor we can write, for instance, as phi_1, and it is now only dependent on D. So we have eliminated C: we erase it, we eliminate the node and we eliminate the factors, this is what happened, and we are now left with a smaller graph and a smaller number of factors. If you look at this expression, these factors are now gone and we have the phi_1 here instead. What is our next variable? This is D. Then we look through all the factors

that depend on D. We write them to the end, this is psi of G, I and D and our new factor phi_1, and then we sum D out. So let us do this: first we erase it here, and then we can define the result to be our new factor, phi_2 say, which is now dependent on G and I. Our next variable in line to eliminate is I. We look at all the factors that depend on I, this one here and this one here. We take this factor and this new factor,

then we do this sum, and we get a new factor, our new factor phi_3, which is now only dependent on S and G, and we have eliminated I, so we have to remove I from the picture. After I we take H. Let's clean up a little bit; if you unfold this representation we have already gotten to here. Now, the next step is H. We

take all the factors that depend on H and put them at the end. So this is this one here, and all the rest is gathered here; as you can see, the rest is actually not dependent on H, so we can pull it out of the sum. Let us do this: this is not dependent on H, so it goes out here, we reorder, and then we can sum this up. This is now phi_4, depending on G and J, multiplying the rest. Now we eliminate

H and connect what remains. Next we have G left, and these factors all depend on G; summing it out we get a new factor, phi_5 say, on L, S and J, and we get this expression here. Then we have to

sum over L and S; both factors depend on them. We can do it step by step, but we can also do it directly now, and then we get this factor here, phi_6, which only depends on J. So what we wanted to compute was P of J; we summed over all the states, and we did it in such an iterative way that we could avoid summing over all state spaces at once: we eliminated C, then D, then I, then H, then G, and finally

S and L, until we were only left with J. Now we need to make a remark: the order matters. And let us have a closer look at how the elimination looks graphically. It is not just eliminating the node and all the corresponding edges; we also have to create new ones. Let's look at the graph again and consider eliminating a different node first. This node here in the middle, G, is very central, and you can see it is involved in all the factors it is connected to; that means all these factors here, this factor and this factor, have a variable in common

and they currently appear as a product: this factor times this factor times the other factor. As soon as I sum over G, they are not a product anymore, they become one single factor, and to represent this, after we eliminate G, we have to connect every node that G was connected to with every other one. G is connected to D, to I, to H, to J and to L. That means after eliminating G we would have to connect them all. Let's do this: first let's copy this, now we have to erase this middle node and all its connections.

But now we have to connect all the nodes that G was connected to with each other. So we have to connect D with H, D with L, D with J, then J with H, which is already the case, J with L, which is also the case, then L with H; am I missing one? Oh, we also have to connect I with H, I with L and I with J. As you can see, if we do this, then we get a factor which depends on all these variables that the orange lines connect.

So this would be this, this, this, this and this variable. Therefore it is not a good idea to start by eliminating G first, because then we lose the sparsity we wanted to exploit in the first place. It is a good idea to start with C first, and then with D, because after eliminating C, D has only two neighbours, and so on. So the general variable elimination algorithm looks as follows, with a lot of choices to fill in. We start with a Bayesian network or a Markov random field; we can also first transform a Bayesian network into a Markov random field. Then we have to find somehow

a good elimination ordering of the nodes, and it is known that finding an optimal one is an NP-hard problem; the variable elimination algorithm in general has exponential time complexity. Even if finding a good elimination ordering is NP-hard, looking for one, even if it is not optimal, is still a good idea, because with a bad ordering it is just exponential time. So there are some heuristics one can use: you can try to anticipate, if I eliminate this node, how many edges are created later, and we have seen one example where this goes horribly wrong and one example where it works very well. Or one can think about creating the smallest possible factor at each step, such that the factors stay small. And then there is a bunch of heuristics for

Bayesian networks. For instance, if one doesn't know what to do, one can just do the moralisation and use the ideas from before; or, if one has access to the directed edges, one can come up with a topological ordering. It is sometimes good to take the ancestral set of the marginals of interest, eliminate everything outside of it first and then the rest; or one eliminates first the nodes that have no children, since these can often just disappear into a factor. So using these kinds of heuristics, for instance eliminating childless nodes first, is promising.

And then, after we have found such an elimination ordering, we basically just go step by step: we take a node, maybe a childless node, we eliminate it, then the next one, and at each step we combine exactly the factors that involve this node into a new factor and go on. How are these factors represented? In the discrete case often as probability tables or factor tables where all values are listed, basically as a big array. And that is already the variable elimination algorithm; not the most efficient one, but a very simple one to think about.
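Below is a minimal sketch of variable elimination over discrete factor tables (my own illustration, not the lecture's notation): factors are pairs of a variable list and a numpy array, variables are summed out in a given order, and the unnormalised marginal of the query variable is returned and normalised at the end.

```python
import numpy as np
from functools import reduce

def multiply(f1, f2):
    """Pointwise product of two factors (vars, table), aligning shared variables."""
    v1, t1 = f1
    v2, t2 = f2
    out_vars = list(v1) + [v for v in v2 if v not in v1]
    def expand(vs, t):
        # broadcast table t (over vs) to the axes of out_vars
        vs = list(vs)
        shape = [t.shape[vs.index(v)] if v in vs else 1 for v in out_vars]
        perm = [vs.index(v) for v in out_vars if v in vs]
        return np.transpose(t, perm).reshape(shape)
    return out_vars, expand(v1, t1) * expand(v2, t2)

def eliminate(factors, order, query):
    """Sum out variables in `order`; return the normalised marginal of `query`."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        vs, t = reduce(multiply, involved)          # combine only the factors touching var
        new_vars = [v for v in vs if v != var]
        factors = rest + [(new_vars, t.sum(axis=vs.index(var)))]
    vs, t = reduce(multiply, factors)
    marg = t.sum(axis=tuple(i for i, v in enumerate(vs) if v != query))
    return marg / marg.sum()

# Tiny example: chain A - B - C with pairwise factors, marginal of C.
rng = np.random.default_rng(0)
factors = [(["A", "B"], rng.random((2, 2))), (["B", "C"], rng.random((2, 2)))]
print(eliminate(factors, order=["A", "B"], query="C"))
```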

2 Graphical Models - Exact Inference - Factor Trees

hi everyone, in this short video we'll talk about factor trees, that is, tree-shaped factor graphs. So what are factor trees? A factor graph that doesn't have any cycle is called tree-shaped, or a factor tree. Since we can convert Bayesian networks and Markov random fields into factor graphs, we can always ask whether the result is tree-shaped or not, and both answers might be counterintuitive. If we start from a Bayesian network, which is by definition acyclic, the resulting factor graph might still not be tree-shaped; the reason is that the acyclicity assumption is only concerned with directed cycles,

while, if we ignore the arrowheads, we can still have undirected cycles. For Markov random fields, on the other hand, we can have cycles, but in the corresponding factor graph the cycles might disappear; the reason is that cliques, inside which there might be cycles, can be summarised by one single factor, resolving the whole cycle. So when we talk about Bayesian networks or Markov random fields, we always have to check whether the factor graph we construct is tree-shaped. Let us look at an example: let's draw a Bayesian network like this, and now let's compute the moralisation,

like this. How would the factor graph now look? The factors coming from these, these and these edges would involve two or three variables or even more. Let us move this around and draw the factors like this. As you can see, the original graph is a DAG, so it has no directed cycles, but here you can see we have a cycle inside the factor graph.

Let's look at a different example: say we have this Bayesian network, then what would the moralisation look like? Like this. As you can see, even though we started with something acyclic, even tree-shaped, we end up with something which has cycles. Now let's go to the factor graph representation, with for instance a factor here and a factor here.

So even though we had a cycle here in the moralised graph, it disappeared, and in the end we end up with something which is tree-shaped. In all cases, if we want to run an algorithm on Bayesian networks or Markov random fields, and we want to convert them into a factor graph first, and we need that this factor graph is tree-shaped, then we have to check this: this one is not tree-shaped and this one is tree-shaped.
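As a small aside, one way to check tree-shapedness programmatically (my own sketch, under the assumption that factor names and variable names are distinct) uses the fact that a connected graph is a tree exactly when it has one edge fewer than it has nodes:

```python
def is_factor_tree(factors):
    """factors: dict mapping factor name -> list of variable names it touches.
    The factor graph is a tree iff it is connected and #edges == #nodes - 1."""
    variables = {v for scope in factors.values() for v in scope}
    nodes = set(factors) | variables
    edges = [(f, v) for f, scope in factors.items() for v in scope]
    if len(edges) != len(nodes) - 1:
        return False
    adjacency = {n: set() for n in nodes}        # connectivity check via depth-first search
    for f, v in edges:
        adjacency[f].add(v)
        adjacency[v].add(f)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adjacency[n] - seen)
    return seen == nodes

# The star-shaped example from the next video: alpha(x1,x2), beta(x2,x3), gamma(x2,x4).
print(is_factor_tree({"alpha": ["x1", "x2"], "beta": ["x2", "x3"], "gamma": ["x2", "x4"]}))  # True
print(is_factor_tree({"a": ["x1", "x2"], "b": ["x2", "x3"], "c": ["x3", "x1"]}))             # False
```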

3 Graphical Models - Exact Inference - Sum-Product Algorithm 


hi everyone, in this video we'll talk about the sum-product algorithm, also called belief propagation. This is only possible in factor graphs that are tree-shaped, so-called factor trees. The goal is that we want to be able to compute all the marginals of this distribution, where the factors are given according to this graph, and x_alpha denotes the values of all the variables that are adjacent to the factor alpha. We want to compute single-variable marginals or even factor marginals, for instance of x_beta, which here would be almost all variables, but imagine the graph continues; then these are the marginals of a small group of variables.

Let's focus on the marginals. The algorithm works as follows. First we take these factors; they correspond to these functions here and depend on all their neighbouring variables. Each factor is a function, and we send this function, as a vector evaluated at all possible values, to the variable; we also take these other leaf factors and send them to their variables. Note that we start from the leaves here. Then these variables take all incoming messages, aggregate them, and send them further. Then this node here will wait

until the messages from the leaves that pass through here are aggregated. The central node also waits for messages from these other leaves, then it aggregates all the information and sends out messages back towards the leaves; the messages get here and then turn back. So we have two phases: one phase is passing messages from the leaves towards the root node, and then, as soon as the root node has received all messages, it passes messages back to the leaves. After that, every single node in this graph has

received information from all the others, because it travelled to the root node and back, and the root node has gathered all information about the whole graph and sent it back; from that point on they all have the full information, and we are able to compute the marginals just by aggregating the messages. We'll see this more formally on the next slide. Let's just summarise what I just said. We have a factor graph given with a joint distribution, and our goal is to find the normalising constant, which is also computationally needed, and the marginals; this is maybe the main important thing, we are trying to compute the single-variable marginals and the factor marginals.

For this we pick a root node, somewhere in the middle for instance; ideally it should be roughly equidistant from all the leaves. Then we pass messages only once back and forth, from the leaves towards the root and back. Here these are all leaves; if the graph is bigger then we have many more nodes inside the graph. We start from the leaves, pass messages towards the root, and we have two different kinds of messages:

one is from a factor to a variable and the second is from a variable to a factor. We also have to distinguish whether the sender is a leaf or not, and after everything has gone back and forth once, we are able to compute the beliefs, and the beliefs here are the marginals. So now it's time to say what the messages are. Let's start with the leaf nodes, say this variable here, and we want to know what the first message is that this variable can send. What it sends is just the constant one: since it didn't receive anything, it will just

send the value one for all values. We write x_v here, but what we mean is that we are sending the whole vector over all values, and this is one in all entries. If we instead start with a factor leaf, for instance here or here, then, since this corresponds to a factor, we can just send the factor, that is, the function itself. Note that this function is in general dependent on x_alpha, meaning all the variables in the neighbourhood of the factor, but since it is a leaf it has only one neighbour, so it only depends on one variable.

That's why, in this case, we can write the message from the factor, let's call this one alpha and this one beta, as a function of the single neighbouring variable. So now we have the first messages: here the message one, here the message given by the factor alpha, here a message one, here one given by another leaf factor. Now it's time to talk about the messages inside the graph. Let's look at this variable: what message does it send to beta as soon as it has received the messages from these other nodes? What it does is

it just multiplies them: for a variable-to-factor message, the variable takes the product, over all adjacent factors that are not equal to the target factor beta, of the incoming messages, and we call the result mu from v to beta, a function of x_v. Note that all the messages, no matter whether they go from alpha to v or from v to beta, so either factor-to-variable or variable-to-factor, always depend on the corresponding variable. So you see here, this mu depends

on x_v, and also if it goes the other way around it only depends on the variable. So we aggregate by just multiplying: we have the message from alpha here and another one here, we multiply them, and this is the message we send to beta. Similarly, let's say beta has received all its messages; then it can send messages onwards, to w say, and in all other directions. This is a similar rule, but now we also aggregate by summing over values: the factor

takes the factor values times the product of the messages that arrived at beta from all neighbours other than w; we multiply all these messages, multiply them by the factor, and then sum over the values of all the other variables. You see here that x_beta is partly x_w and partly the other variables, so this and this together is x_beta. Maybe we can write it more simply:

we write beta without w, here also beta without w, and what we mean is of course all variables adjacent to beta that are not w. So these are the messages. As I said before, we assume we start with the leaves, propagate all the messages towards the root with these update rules, aggregate at the root node, and then we go back to all the leaves. As soon as we have done this, the algorithm basically stops and we can now evaluate the beliefs, the marginals we want. For this we claim that the following

formulas hold. Let's say we are at this variable v here, and we want to compute its marginal: it is a normalising constant times the product, over all factor neighbours of v, of the incoming messages, where the normalising constant can be found just by summing exactly this expression over all values of x_v. Here you see that to evaluate this, you need not just one value of these

messages, you need the whole vector of messages, so the whole function of the variable. Secondly, see that we have to normalise here. Normalising the whole joint distribution is usually a huge problem if we want to compute it, but locally this is just a summation over the values of a single variable, this variable, and not over any others anymore. So we usually do all the message passing first and then, at the very end, we normalise. If we want to have factor beliefs, this means: if this is a factor, then the factor belief means computing the marginal of all the variables attached to it; it

would be all variables in this small example, but think of a huge network with many more variables; then this would be just a small group of variables. How do we compute it? It is very similar: we have a normalising constant, then we have the factor itself evaluated at x_alpha, and then we multiply all messages that were sent to this factor, taking the product over all variables adjacent to that factor. And the normalising constant is again obtained by summing the same quantity over all possible values of all variables included in this factor.

So this is how you compute the beliefs: variable beliefs are just single-variable marginals, and factor beliefs correspond to the marginal over a whole factor's variables. And this is already the sum-product algorithm, or belief propagation. Now the question is: why does it hold, why do these equations hold? We can compute this quantity here, and what we are claiming is that it equals this marginal, and the same for what we have written before; this needs to be shown, and we leave it as an exercise. The next question is: what if I have this factor graph, this factor tree, given again, but I now want to

compute conditionals, that is, the marginals conditioned on some observed values. What do we do then? We can use a very simple trick called evidence factors, where we just introduce indicator functions that check whether these variables correspond to the given values we have fixed, and we represent this by introducing extra factors into the factor graph. Say our observed variable is w here; then we introduce a factor over here connected to w, which corresponds to the indicator of this observed value, let's write it as z_w. And of course we could also just merge

this new factor with this existing factor: the new one depends only on w, the existing one on w and more, so we can merge these two into one factor, and then we get the joint distribution conditioned on z_w by this formula, where the normalising constant looks like this. So what we can do now is just run the sum-product algorithm as before, only with these additional factors, and then we get the conditionals; the same we can do for conditional factor beliefs. But what if we don't have a tree-shaped factor graph, for instance if we have a cycle here, or more cycles like this

and like this? Then what people often do is what is called loopy belief propagation: they just use the algorithm as is, without the need for leaf nodes. If there are cycles, it is not even clear that we have a leaf node, so what people do is initialise all messages to one and then run the sum-product algorithm until convergence, with the same update rules for the messages. The question is in which order one should do this, and there are different schemes; often one just randomly draws a node, updates its messages, the incoming products for a variable and the product with the factor and the sum for a factor, and so on, randomly

samples these nodes and updates according to the usual sum-product rules we had, and then after convergence we compute the beliefs. This is of course an approximate algorithm; we are not doing any exact inference here, this is now in the realm of approximate inference, and there are not many theoretical guarantees, but it is reported that in practice it converges quickly and the approximation is often good, though sometimes not. An alternative to loopy belief propagation is trying to find a tree structure where variables are aggregated into one variable and

factors are aggregated into one factor and so on, to avoid cycles. This is what is called the junction tree algorithm, but we will not treat it here. So let us summarise the sum-product algorithm for factor trees: we have a factor tree given with these factors and a normalising constant, then we pick a root, that's what we said, and then we pass messages from the leaves to the root and back. We have these update rules: the variable-to-factor message is just the product of all other incoming messages, and this product is sent out; the factor-to-variable message takes all other incoming messages,

multiplies them by the factor, sums all other variables out, and sends the result onwards. For the leaves we had these initial conditions: it's either the constant one, if it is a variable, or the factor itself, if it is a factor. After everything went back and forth once, you can compute the marginal of every variable just by multiplying the incoming messages and normalising, and the same for the factors: multiply all incoming messages, multiply by the factor, and normalise. And that's the sum-product algorithm, also called belief propagation.
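Since the update rules were only described verbally, here they are once more as formulas, in my own notation but consistent with the description above, for a variable v and a factor alpha in a factor tree:

```latex
\begin{align*}
\text{variable to factor:}\quad
  \mu_{v \to \alpha}(x_v) &= \prod_{\beta \in N(v) \setminus \{\alpha\}} \mu_{\beta \to v}(x_v), \\
\text{factor to variable:}\quad
  \mu_{\alpha \to v}(x_v) &= \sum_{x_{\alpha \setminus v}} \psi_\alpha(x_\alpha)
      \prod_{w \in N(\alpha) \setminus \{v\}} \mu_{w \to \alpha}(x_w), \\
\text{leaves:}\quad
  \mu_{v \to \alpha}(x_v) &= 1, \qquad
  \mu_{\alpha \to v}(x_v) = \psi_\alpha(x_v), \\
\text{beliefs:}\quad
  p(x_v) &\propto \prod_{\alpha \in N(v)} \mu_{\alpha \to v}(x_v), \qquad
  p(x_\alpha) \propto \psi_\alpha(x_\alpha) \prod_{v \in N(\alpha)} \mu_{v \to \alpha}(x_v).
\end{align*}
```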

4 Graphical Models - Exact Inference - Sum-Product Algorithm - Example

hi everyone, in this video I want to walk you through a simple example. Let's look at this graph; it is a bipartite graph, and as a factor graph it comes with a probability distribution. Let's write this down: it has arguments x_1 to x_4, a normalising constant, and then it is a product of factors: psi_alpha, which depends on x_1 and x_2, psi_beta, which comes with x_2 and x_3, and another factor psi_gamma with x_2 and x_4. Now let's do message

passing. First we have a message going from node 1 to alpha: mu from 1 to alpha, dependent on x_1, and this is just one because it is a leaf node. Now alpha has received this message and can aggregate it and send a message to node 2. How does it do that? We get a message mu from alpha to 2, dependent on x_2, given by taking the product of all incoming messages, which is just this mu from 1 to alpha of x_1, multiplying it by the factor, which depends on x_1 and x_2,

and then aggregating out all other variables, that is, summing over all values of x_1. Of course we can plug in that the incoming message is one, so this is just the sum over x_1 of psi_alpha of x_1 and x_2. So now we have a message sent to node 2 from this side. Now we can do the same from the other side; let's start with the message from node 4 to gamma:

mu from 4 to gamma of x_4, which is just one because it is a leaf node. Then we can compute the message gamma sends to node 2: mu from gamma to 2, dependent on x_2, is the sum over x_4 of psi_gamma of x_2 and x_4 times the message it receives from 4, mu from 4 to gamma of x_4, and we know this is one. So here also we just get the sum over x_4

of psi_gamma of x_2 and x_4. Let's move this up here. The next question is: what is the message from node 3? As you guessed, it is the same as before, mu from 3 to beta is the constant one. Now that beta has received the message from 3, we can compute the message beta sends to node 2: mu from beta to 2, dependent on x_2, is the sum over all values of x_3 of psi_beta of x_2 and x_3 times the message from 3 to beta. And this, as we know,

is again one. So here we end up with the sum over x_3 of psi_beta of x_2 and x_3. Now we have computed all messages towards node 2, and since we consider 2 to be the root node, we could now do the backwards phase. But if we are only interested in the marginal of x_2, then, since node 2 has already received all messages, we are already in the position to compute it. By the algorithm, p of x_2 is supposed to be a normalising constant, one over Z_2 say, times the product of all incoming messages: mu from alpha

to 2 of x_2, times mu from beta to 2 of x_2, times mu from gamma to 2 of x_2. That's the algorithm. What we want to do now is a sanity check: this was the claim of the algorithm, and in this simple example we can just check it directly. If we plug in, this is the normalising constant times what we wrote: the first message was the sum over x_1 of psi_alpha of x_1 and x_2, then we had the sum over x_3 of psi_beta of x_2 and x_3, and the last message was the sum over x_4

of psi_gamma of x_2 and x_4. As you can see, we can now rewrite this as one sum over x_1, x_3 and x_4 of psi_alpha of x_1 and x_2 times psi_beta of x_2 and x_3 times psi_gamma of x_2 and x_4, and this is the sum over x_1, x_3 and x_4 of the joint of x_1, x_2, x_3, x_4. So this indeed equals the marginal of x_2; check.

Now let's compute the marginal of x_1. For this we still need the messages from 2 to alpha and from alpha to 1. So let's do this. The message mu from 2 to alpha, dependent on x_2, is just the product of the other incoming messages, so this is mu from gamma to 2 of x_2 times mu from beta to 2 of x_2. Then we have a final message going from alpha to 1, dependent on x_1, and this is the

sum over x_2 of psi_alpha of x_1 and x_2 times the message from 2 to alpha of x_2, the one we have just computed. Then we can compute the marginal of x_1: one over Z_1 times the product of all incoming messages, and node 1 only has one incoming message, mu from alpha to 1 of x_1. Let's do a sanity check again. The algorithm tells us that this is correct, where Z_1 is the sum

over x_1 of mu from alpha to 1 of x_1. Now let's just plug in. We have mu from alpha to 1, this was this sum here. Let's unfold it: this is the sum over x_2 of psi_alpha of x_1 and x_2 times mu from 2 to alpha of x_2, and this mu from 2 to alpha is a product of two messages, which I will just write in here. So

this message here was the sum over x_4 of psi_gamma of x_2 and x_4, and the other message was the sum over x_3 of psi_beta of x_2 and x_3. You can see that if I again pull out the sums and rearrange, then this is one over Z times the sum over x_2, x_3 and x_4 of psi_alpha of x_1 and x_2 times psi_beta of x_2 and x_3 times psi_gamma of x_2 and x_4. Again, this is the sum over x_2, x_3 and x_4 of the probability

distribution of x_1, x_2, x_3, x_4. So again, the question was whether the marginal really corresponds to these quantities, and we have seen, just by multiplying it out, that it checks out. And just to be clear, this last computation we won't do in practice; it is just a sanity check that everything holds when the computer evaluates all the quantities according to the sum-product algorithm. All the other variables go analogously and don't give more insight, so we leave this example here.
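For completeness, here is a small numpy sketch (my own) that runs exactly these message computations for the factor graph of this example, with random tables for psi_alpha, psi_beta and psi_gamma, and compares the resulting beliefs with brute-force marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                   # number of states per variable
psi_a = rng.random((K, K))              # psi_alpha(x1, x2)
psi_b = rng.random((K, K))              # psi_beta(x2, x3)
psi_c = rng.random((K, K))              # psi_gamma(x2, x4)

# Leaf messages are all ones, so the messages towards the root x2 are:
mu_a2 = psi_a.sum(axis=0)               # sum_{x1} psi_alpha(x1, x2)
mu_b2 = psi_b.sum(axis=1)               # sum_{x3} psi_beta(x2, x3)
mu_c2 = psi_c.sum(axis=1)               # sum_{x4} psi_gamma(x2, x4)

belief_2 = mu_a2 * mu_b2 * mu_c2
belief_2 /= belief_2.sum()              # marginal of x2

# Backwards phase towards node 1:
mu_2a = mu_b2 * mu_c2                   # product of the other incoming messages at node 2
mu_a1 = psi_a @ mu_2a                   # sum_{x2} psi_alpha(x1, x2) * mu_{2->alpha}(x2)
belief_1 = mu_a1 / mu_a1.sum()          # marginal of x1

# Brute-force check against the full joint table.
joint = np.einsum('ab,bc,bd->abcd', psi_a, psi_b, psi_c)
joint /= joint.sum()
print(np.allclose(belief_2, joint.sum(axis=(0, 2, 3))))   # True
print(np.allclose(belief_1, joint.sum(axis=(1, 2, 3))))   # True
```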

5 Graphical Models - Exact Inference - Sum-Product Algorithm - Summary

hi everyone, I will just shortly summarise what we did. We were concerned with the sum-product algorithm; for this we needed a factor tree, so the main important part was that it is tree-shaped. As such, it comes with a probability distribution with factors. Then we pick a root inside the graph which is roughly equidistant from all the leaves, then we do message passing, starting from the leaves towards the root and then going back. The message passing goes as follows: if it is a variable-to-factor message, then the variable just multiplies all other incoming messages and sends

this out as a new message. If it is a factor-to-variable message, then the factor also takes all other incoming messages, multiplies them by the factor, sums all other variables out and then sends this message. For the leaves we have these special starting conditions: variables always start with one and factors always start with the corresponding factor. As soon as we went to the root node and back, we are able to compute all the marginals, single-variable marginals and factor marginals, by the formula that we just take the product of all incoming messages and then normalise afterwards; for the factors it is very similar, we take all incoming messages, multiply by the factor and then normalise, where the normalising procedure is just summing up.

Then we talked about two extensions. One was including evidence factors into the procedure to be able to compute conditionals: if we want to compute conditionals, then we need evidence factors. And in case we have cycles, so the graph is not tree-shaped, what we can use is loopy belief propagation, which is a more practical approach: we just initialise all messages to one, use the usual sum-product update rules from the algorithm we had, run this until convergence, and use the normal formula to compute the marginals. In practice this works quite okay. And an alternative, which we just mentioned

but didn't go into, is that one could use the junction tree algorithm to avoid these cycles.

6 Graphical Models

0 Graphical Models - Exact Inference - Mode Finding - Ideas and Problems

hi everyone, in this video we'll talk about how to find modes of probability distributions that come from factor graphs. We will first discuss a few ideas and the problems with these ideas. Let's start. Say we are in the same setting as before, the setting of a factor tree, which is given by a probability distribution with these factors. What we are trying to find is a mode, and a mode is just an element that maximises the probability distribution; for instance, if you have a Gaussian, then the mean is a mode, but in general the mean and the mode might be different. One thing we always do when we want to find a maximising element of a probability distribution is to take logarithms, since we know that taking

the argmax of log p of x gives the same result, because the logarithm is monotonic. Now our probability distribution consists of several factors, so we can plug this in: we can write either the argmax over x of one over Z times the product over alpha of the factors psi_alpha of x_alpha, where x_alpha means the components related to the factor alpha, or, using the log, the argmax over x of the sum over alpha of the log of these factors psi_alpha, and then minus log Z, which we actually don't need as it is not dependent on x.

As you can see, this can already become very complicated. What we would like to do is pick the mode in a kind of iterative way: instead of looking at the whole distribution with all its components, we go factor by factor, or node by node, pick a maximum, then take the next maximum and so on. Of course this reminds us of the sum-product algorithm, so let's compare, maybe we find something. Let's compare the problem of finding a marginal, for which there is the sum-product algorithm, with finding the mode, both in factor trees. So take the same factor tree, the same probability distribution. The marginal was defined by summing over all other

components and then taking this product of factors. As you can see, we have here the sum and here the product, and we pushed the sums inside smartly and then made only local updates, message passing, to compute it. So the main point was that we had a sum of products, which led to the sum-product algorithm by doing it in a smart way. Here we now have a maximum and a product, and the maximum over x consists of several components, just like the sum over several components did. The sum we were able to push inside and aggregate locally; we can also do this with a maximum, push it inside and aggregate locally. That means the only thing we need to do is basically

replace the sum by a maximum and keep the product as is; then the rules, the updates and the message passing should work the same. But first we can go to log space to make things numerically more stable, and then the product becomes a more convenient sum. So the main idea is to replace the sum by a max and the product by a sum, and then we get the max-sum algorithm. The formal relationship between the two problems is that at the mode we have a maximum over a sum of these log factors, and the idea is just to run the sum-product algorithm with the following small changes.
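The key identity behind this replacement is that the maximum distributes over a sum of log factors the same way the sum distributed over the product, for example

\[
\max_{x_1,x_2}\big[f(x_1) + g(x_1,x_2)\big] \;=\; \max_{x_1}\Big[f(x_1) + \max_{x_2} g(x_1,x_2)\Big],
\]

which plays the same role for max-sum that \(\sum_{x_2} f(x_1)\,g(x_1,x_2) = f(x_1)\sum_{x_2} g(x_1,x_2)\) played for sum-product.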

First we take the log of all quantities, then we replace all sums by maxima and all products by sums, and then we should be able to find the maximum, the mode. Unfortunately there is a problem with this idea. Let's start with a probability distribution in just two variables, and we want to find the two-component vector which maximizes this distribution. We know that the mode of this distribution satisfies the following: it maximizes the value. So if we take the maximum over x2 first of this function,

and then the maximum over x1 of the resulting function, then this needs to equal the value at the maximizing point. We could also do it the other way around: for every value of x2 we take the maximum over x1, and then of the resulting function we take the maximum over x2. These are all equivalent descriptions of the mode of this function. The problem now is if we try to do this component by component. Say for fixed x2 this function takes a maximum over x1; this is then a function of x2 and we pick a maximizer, which should be a component of the mode. The same if we do it the other way around: we take the maximum over x2 while holding x1 fixed, which gives a function of x

one, and then we take a maximizer of this function in just one variable, and we should get a component of the mode. The problem is that if we do this independently, picking x1 and x2 with these separate objectives, then we might get a point which is not a global maximizer anymore; we will illustrate this on the next slide. The problem is that the choices were uncoordinated, so we can have a mode component for x1 here and a mode component for x2 there, and together they give something which is not a mode. So we need to keep track of and coordinate these coordinates. What we do instead is: first, for every value of x2,

we look at this function of x1 with x2 fixed, we look at all the maximizers of this function for that fixed x2, and we pick one. For every x2 we pick a maximizing element, and we have to do it for all x2 because we don't know yet what x2 will be in the end, what the maximizing x2-component is. After we have this, we look at the function which, for each x2, gives the maximum with respect to x1, and of this function we take a maximizer; this is then a component of the mode, because this x2-star maximizes this function here. After we have this,

we know the x2-component, and because we kept track of a maximizing element for every single value of x2, not just for the mode component, we now pick x1 as the recorded element for this maximizing x2-component. That means we pick an element that maximizes this function of x1 with x2 fixed to x2-star, and then it is clear that if you take these two components together, the point is globally maximizing. Let's try to illustrate this. Okay, let's look at this probability distribution; let's consider it as a mixture, or a discretized mixture, of Gaussians, where here are the modes,

four modes, and let's say on the x-axis we take x2 and on the y-axis we have x1. Now what we're doing is: for every single value of x2, for instance here, we look at this line, or we can look at this line and this line, but we could also look at

any among these slices, like here: for every single value of x2 we look at the values along the slice and we record what the maximal element is and what the maximal value is. This slice is supposed to go through here; maybe we draw another one,

We have five values here, and then we want to know what the maximizing value is. Let's say we look at this slice first and we want to know the maximal value on this slice; say it is five. Now let's look at these slices: we know we have modes here, so the value must be higher, for instance 10, and so on; we get 10 here, 10 here, and 5 again here. So that means for every single value of x2 we look at what

the maximizing value is and record it here on this x2-axis, and now that we have all these values, we can just look along this x2-line for the maximizing ones. As you can see we have here three values that are maximal, so we have three candidates: one, two, three. We could now do the same for x1, and what we would get is maybe 10 here, 10 here, 5 here and 5 here. As you can see, if I picked this x2 here, which is part of a mode, and I take this x1 here, which is also part of a mode, then this point here would lead not

to a mode. So if we pick x2 independently of x1, then this coordinate and this coordinate, which are each part of some mode, together are not a mode, and we might end up here. So what we need to do is: when we go through these slices and look at the maximal element, we have to record what the maximizing component of x1 is. We have to record something like this: we say, for instance, this is the maximizing element which leads to the value five, then here we pick this element, then

here I pick this maximizing element, here I take this and here I take this, for instance, and then I can do this again: I look at the maximal values, I pick one, for instance the 10, and then I just go up and see, aha, here is the recorded point, this is the right x1-component. Instead of taking x1 and x2 independently and ending up in a non-mode, we pick x2, we have recorded the x1-value, and then we arrive at a mode. You could also pick this other one, and because of the coordination you would know this is a mode as well; if you are here you pick this one, and so on. This gives us the opportunity to coordinate

between the different coordinates, and in this way we can basically optimize component-wise, as long as we record for every value of x2 a maximizing x1, and then we end up in a mode.
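As a minimal sketch of this coordination idea (the 4x5 table of values and the variable names are made up for illustration, not part of the lecture):

```python
import numpy as np

# Hypothetical discretized joint table: rows index x1, columns index x2.
p = np.array([
    [1, 2, 10, 2, 5],
    [5, 1,  3, 1, 2],
    [2, 10, 1, 3, 1],
    [1, 2,  2, 10, 4],
])

# Uncoordinated component-wise choice: may not be a mode.
x1_bad = p.max(axis=1).argmax()   # best row by its own row-maximum
x2_bad = p.max(axis=0).argmax()   # best column by its own column-maximum
print("uncoordinated:", (x1_bad, x2_bad), "value", p[x1_bad, x2_bad])

# Coordinated choice with bookkeeping: for every x2, record a maximizing x1.
best_x1_for = p.argmax(axis=0)    # backpointer table x1*(x2)
m = p.max(axis=0)                 # m(x2) = max over x1 of p(x1, x2)
x2_star = m.argmax()              # pick the best x2 of the reduced function
x1_star = best_x1_for[x2_star]    # backtrack the stored x1
print("coordinated:  ", (x1_star, x2_star), "value", p[x1_star, x2_star])
assert p[x1_star, x2_star] == p.max()
```

On this table the uncoordinated pick lands on a low value, while the coordinated pick recovers the global maximum.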

1 Graphical Models - Exact Inference - Max-Sum Algorithm 


hi everyone, in this video we'll talk about the max-sum algorithm to find modes in factor trees. Okay, what is the setting? The setting is as before: we start with a factor tree, a tree-shaped factor graph, which comes with this factorization of the probability distribution according to the factors, illustrated here in this graphic. We want to find an element that maximizes the probability, the mode. At the mode we would have this equation: the maximum of the sum of the logarithms of the factors. Then, first as before, we need to pick a

root; it could be this w here in the middle of our graph, and then we start again passing messages from the leaves towards the root. The root you only have one of, so it could be here, and then we pass messages here, then we do something else, and then we go back, we backtrack from the root to the leaves. Here our root needs to be a variable node. Okay, how do the messages look, what do we pass? As said before, what we do is we take the same messages

as in the sum-product algorithm, take the log, and replace all sums by maxima and all products by sums. That means if we start with a leaf node that is a variable node, then our message is the log of one, which is zero; and if we start with a factor leaf node, then in the sum-product algorithm we had psi of this factor evaluated at the only variable node in the neighborhood of the factor, and since we take the log, the message is now the log of this factor. Then the variable-to-factor

messages in the sum-product algorithm were the product over the incoming messages, and since we take the log we now take the sum: the sum over all factors gamma in the neighborhood of our variable v, but without the receiving factor beta. So if we are here and we want to send a message to this factor, then we take the incoming messages, add them up, and send the result out as this message nu from v to beta; we are just replacing the product by a sum. Then the factor-to-variable messages down here also

follow the same rule. We now take the maximum over all values of x, and what we write as beta without w means all the components adjacent to beta that are not w; this is a short notation for that. Then we have the logarithm of the factor psi_beta. Before we had this factor multiplied by the other messages, but now we add, so here is a sum, and then we have a sum over all variable nodes that are adjacent to beta but not w, and we take the messages

nu from these nodes to beta. So instead of having a product here we now have a sum. This is basically, as explained, the same algorithm as before with these replacements. But as we have seen, we need to coordinate the mode components, and this is why we need to track, for every value of x_w, a maximizer, not just the maximum value; we need to know what a maximizer is. So we pick the arg max of the same quantity, which means picking tuple components: for every node adjacent to beta we pick one value, all nodes except w.

So if this is beta here, then I pick a value; if this is beta here, I pick a value, or this and this, but not w. So when sending the message, here is our beta, I have to pick a maximizer here and here, for each value of x_w; what is written here is the same as the line above, just with arg max instead of max. So now we have defined all the messages: the variable-to-factor messages and the factor-to-variable messages.
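Written out, as a transcription of the rules just described (the symbols nu, mu and ne(.) follow the sum-product notation used earlier):

\[
\nu_{v \to \beta}(x_v) = \sum_{\gamma \in \mathrm{ne}(v)\setminus\{\beta\}} \mu_{\gamma \to v}(x_v),
\qquad
\mu_{\beta \to w}(x_w) = \max_{x_{\beta\setminus w}} \Big[\log \psi_\beta(x_\beta) + \sum_{v \in \mathrm{ne}(\beta)\setminus\{w\}} \nu_{v \to \beta}(x_v)\Big],
\]

with leaf initializations \(\nu_{v\to\beta}=0\) and \(\mu_{\beta\to w}=\log\psi_\beta\), and with a backpointer \(x^{\beta}_{\beta\setminus w}(x_w) \in \arg\max_{x_{\beta\setminus w}}[\,\cdot\,]\) stored for every value of \(x_w\).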

Here you have zero messages at the variable leaves and log factors at the factor leaves; they get passed, aggregated, passed, aggregated, and so on until everything reaches the root node. What do we do at the root node? At the root we aggregate all the messages. What we do is we define a function depending on the root-node variable, and we just add up all the surrounding factor messages, for all gamma in the neighborhood of w, all the messages

from these factors to w, evaluated at x_w. This is very similar to the algorithm where we multiplied all the messages to compute the marginal; it is basically the same evaluation, just with a sum, and then we pick a maximizer. But first we should say what this quantity actually is, because it looks a bit mysterious. Similarly to the problem of finding marginals with the sum-product algorithm, one can show that this is basically the maximum value of our probability distribution over

all values of the variables other than w, as a function of x_w, up to the normalizing constant, or rather its log. If you replaced the max here with a sum over all values except w, you would get the marginal up to this normalizing factor, which was the same in the marginal problem; so this is totally analogous to the sum-product algorithm. Now we have this function, and we know that a maximizer of it in x_w gives the w-component of a mode, so we just take a maximizing component, the arg max over x_w of

this function of x_w. That means we have already found the w-component of the mode. Now the question is how do we get all the other components corresponding to the other variables, and this is easy now because we did the bookkeeping. We can go back from this w to these variables, look at the values we stored for each value, and we are basically done. We just declare this stored value to be the next component of the mode: at the root we had this maximizing component for w, now we plug it in here, for instance into this factor,

and then we get the values of these variables out, and the same holds here as well, so we get the maximizing values here and here, and so on. If the tree were longer, then we iteratively plug these values into the next factors and so on, and since this is all tree shaped, we do not run into problems. This means that with our bookkeeping technique we can coordinate all these components and in the end we get a mode. We run it from the root towards the leaves, we get a mode, and we are done. As in the sum-product algorithm, for this max-sum algorithm I leave it as an exercise to show that the result is actually a mode.

The first thing to check is the equation we had here at the root node; this is an exercise. Summary of the max-sum algorithm: what do we do? We start with a factor tree, a tree-shaped factor graph, then we pick a root in the middle of our graph, which should be a variable node, and our goal is to find a mode. What we do is forward message passing towards the root, with these formulas: variable-to-factor message passing, which is just summing up the incoming messages, and factor-to-variable messages, where we sum the incoming messages

plus the log factor and then take the maximizing value; for backtracking purposes we also store, for the same quantity, a maximizer, a value which maximizes it. For the leaves we have initial messages of zero here and the log factors here. Then at the root we aggregate the messages by summing them up, we pick a maximizer there, and then we do backtracking with

the stored values: if we go back to all the variables, then we get all the components, these components are coordinated, and we get a mode. This is the max-sum algorithm on factor trees for finding modes.
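As a minimal sketch, here is max-sum on a toy star-shaped factor tree with root x2, leaves x1, x3, x4 and pairwise factors (the same shape as the example in the next clip); the random tables, value range K and variable names are assumptions for illustration only:

```python
import numpy as np

# Toy star-shaped factor tree: root x2, leaves x1, x3, x4; pairwise log-factors
# log_psi[leaf][leaf value, x2 value]. Tables are made up for illustration.
K = 3
rng = np.random.default_rng(0)
log_psi = {v: rng.normal(size=(K, K)) for v in ("x1", "x3", "x4")}

# Forward pass: leaf variable messages are 0, so each factor-to-root message is
# mu(x2) = max over the leaf of log_psi[leaf, x2]; store the argmax as backpointer.
mu, back = {}, {}
for v, table in log_psi.items():
    mu[v] = table.max(axis=0)
    back[v] = table.argmax(axis=0)

# At the root: aggregate messages, pick the maximizing x2, then backtrack the leaves.
q = sum(mu.values())
x2_star = int(q.argmax())
mode = {"x2": x2_star, **{v: int(back[v][x2_star]) for v in log_psi}}

# Sanity check against brute force over the full joint (feasible only for tiny K).
best = None
for x1 in range(K):
    for x2 in range(K):
        for x3 in range(K):
            for x4 in range(K):
                val = (log_psi["x1"][x1, x2] + log_psi["x3"][x3, x2]
                       + log_psi["x4"][x4, x2])
                if best is None or val > best[0]:
                    best = (val, {"x1": x1, "x2": x2, "x3": x3, "x4": x4})
assert abs(q[x2_star] - best[0]) < 1e-12
print(mode, best[1])
```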

2 Graphical Models - Exact Inference - Max-Sum Algorithm - Example

hi everyone, in this video we want to go through a very simple example of the max-sum algorithm. Let's consider this example we have already seen before, where the leaf nodes here are variables, here is our root node, and here we have the factors alpha, beta and gamma. We assume that our probability distribution factorizes accordingly: alpha depends on 1 and 2, beta depends on 2 and 3, and gamma depends on 2 and 4. Now let's start with message passing. We want to find a mode, so our goal is to find a maximizing element of this probability distribution. Let's start with passing messages from node 1. Our first message goes from 1 to alpha, it is a function of

x1, and since this is a leaf node and a variable node, the message is zero. Now let's take the next message, going from alpha to 2, now a function of x2. This was defined to be the maximum over all variables in the neighborhood of alpha which are not 2, so over x1: it is the maximum over all values of x1 of the factor log psi_alpha(x1, x2) plus the message from 1 to alpha at x1. We know that this message is zero, so we can write that this is just the maximum over x

one of log psi_alpha(x1, x2). Now we need to be careful: this is a factor node and more variables are involved, so we need to do the backtracking. I now use a slightly different notation: let's call this x1^alpha(x2) and let it be a maximizer of the same function as here, so a maximizer of log psi_alpha(x1, x2).

Let us remind ourselves what this means. It means that whenever I take log psi_alpha and plug in this x1^alpha(x2) and x2, then this is always the maximum over x1 of log psi_alpha(x1, x2). This is what it means to be a maximizer: no matter what x1 I chose, this value is always at least as big, for every given x2, and this holds for all values of x2. Okay, so we have passed the message from 1 to alpha and from

alpha to 2 and did some backtracking, and now it's time to look at the other branches here and here, and you will not be surprised that this follows the same procedure. So what do we have? We have a message from 3 to beta as a function of x3, which we know is zero, and we know that the message from beta to 2, as a function of x2, is now the maximum, similar to before, over x3 of log psi_beta(x2, x3), same structure, and now we use this notation to note our choice of

maximizer: this is the arg max over x3 of log psi_beta(x2, x3). Similarly, we can pass the message from 4 to gamma, which is a function of x4 and is zero, and then we have gamma to 2 as a function of x2, and as before this is the maximum over x4 of log psi_gamma(x2, x4). And we also use the notation here to record a maximizer for backtracking: the arg max over x4 of log psi_gamma(x2, x4).

Now we have all these ingredients, let us aggregate; we are at the root node already. We have this message, this message and this message, one from before and the other two, and now we aggregate. We have to say what this function of x2 is, and it just adds all the messages: the message from alpha to 2 at x2, plus the message from beta to 2 at x2, plus the message from gamma to 2 at x2. And now finally we can pick our x2 as a maximizer of this q(x2); maybe

we just write out what this q function looks like. The message here was the maximum over x1 of log psi_alpha(x1, x2), plus the maximum over x3 of log psi_beta(x2, x3), plus the maximum over x4 of log psi_gamma(x2, x4). So this is what the function is, and now we choose a maximizer, this x2-star. Let's write out what this means:

if I plug it in, it means q(x2-star) is the maximum value of q over all x2 values; that's what it means. Now we do the backtracking: we pick x1-star as x1^alpha(x2-star), we pick x3-star as x3^beta(x2-star), and we pick x4-star as x4^gamma(x2-star). This is how we set all of these.

Now we want to do a sanity check that this is really a maximizer. For this we have to evaluate our probability distribution at this value x-star, which is, say, (x1-star, x2-star, x3-star, x4-star). Let's write this down. If we take x-star here, maybe we directly evaluate the logarithm: the logarithm of p(x-star) is the logarithm of psi_alpha(x1-star, x2-star) plus the logarithm of

psi_beta(x2-star, x3-star) plus the logarithm of psi_gamma(x2-star, x4-star) minus the log of the normalizing constant. Now we use what we know from the definition of the choice of x1-star; maybe we write it more clearly: this value here was chosen such that, for every single value of x2, it maximizes this function,

so this here is the maximum over x1 of log psi_alpha(x1, x2-star). Similarly, this function was chosen such that for every value of x2 it maximizes this quantity, so it also holds for this x2-star: this is the maximum over x3 of log psi_beta(x2-star, x3). Similarly, this function was chosen so that it maximizes this term for every value of x2, so this is the maximum over x4

of log psi_gamma(x2-star, x4), minus log Z. And I forgot some logs, okay, so I add the logs here. Now this was the definition of these functions, which we use because they are maximizing for all values. So this here is exactly the q function, evaluated at x2-star, and by the definition of the choice of x2-star this is now the maximum over x2 of the max over x1 and so on,

and then minus the log Z. The x2 we have chosen maximizes this q function; x2-star was chosen such that it maximizes this quantity, so this is equal to the maximum over x2. Now this function here depends on x1 but all the other functions do not depend on x1; similarly this one depends on x3 but the other two do not depend on x3, and likewise for x4. So what we can do is write max over x2, max over x1, max over x3, max over x4

of these quantities: log psi_alpha(x1, x2) plus log psi_beta(x2, x3) plus log psi_gamma(x2, x4) minus log Z. But as you can see, this is nothing else than log p(x). So what this is, and this is the main point,

is the maximum over x of log p(x), and this is what we wanted to show. This value here, which was constructed by the algorithm, is, as this calculation shows, actually the maximum of this function. So this is what we claimed, and the sanity check holds. This shows a special case of the algorithm: an example of max-sum where we have seen our factor graph, where we did the backtracking and so on.

3 Graphical Models - Exact Inference - Max-Sum Algorithm - Summary

hi everyone, in this short video we want to summarize what we have done with the max-sum algorithm. We always started with a factor tree, a tree-shaped factor graph, and what we want is to compute a mode of the probability distribution attached to this factor graph. For this we pick a root node in the middle of the graph, shown here, and then we pass messages from the leaves towards the root. These messages either involve summing up previous messages, or summing up previous messages plus a log factor and then maximizing over the other values; at the leaves they are zero for variables or the log factor for factors. For

backtracking we kept maximizers, so we needed not just the maximum value but also at least one of the maximizing values. Then at the root we pick a maximizing element of the aggregation of all the messages which come into this root node: we maximize this, take this value, and then we use the functions which we introduced for bookkeeping to pick all the other values. So we do backtracking with these functions, and we have seen, or claimed, that if we take the values constructed by this backtracking rule, then the result actually maximizes our probability distribution.

Two remarks. First, the max-sum algorithm is very similar to the sum-product algorithm, with three or four changes: we take the log of all quantities, we replace all sums in the sum-product algorithm by max and all products by sums, and we need to keep track of maximizers, the bookkeeping; with that, the max-sum algorithm is basically the sum-product algorithm. Another remark: in case our factor graph has cycles, so it is not a tree, then one could use the junction tree algorithm, which finds something like a spanning tree by aggregating some of these variables

and then runs the max-sum algorithm on this tree. This avoids cyclic updates, which are not allowed in our setting. So this is the summary.

7 Information Theory

0 Information Theory - Relative Entropy - revisited

hi everyone, in this short clip we want to recap a few things about relative entropy. Okay, what was relative entropy? Relative entropy, or the Kullback-Leibler divergence, is defined between two distributions that live on the same observation space. We have q(x) and p(x) and we want to know what the difference is, and we can quantify it by counting the number of bits needed to encode the difference between these distributions. So the Kullback-Leibler divergence measures the additional bits needed to encode the distribution q when a proposal distribution p is used, or, if we

take the natural logarithm, these are nats. We have seen the fundamental inequality that the Kullback-Leibler divergence is always bigger or equal to zero, and it is equal to zero if and only if the two distributions are the same; so it really measures the distance, in the information-theoretic sense, between the distributions q(x) and p(x). Now we want to talk about the chain rule and the data-processing inequality, in their Kullback-Leibler versions. So consider we have two joint distributions in two variables, living on the same product space; we have two joint distributions

and then we have the chain rule, which says that the Kullback-Leibler divergence between these two distributions in two variables can be split into two parts: one part is just between the marginal distributions q(x) and p(x), plus a contribution which comes from the conditionals q(z|x) and p(z|x). Since this divergence measures the distance between q(z|x) and p(z|x), but for every single x we have a

new distribution, we need to also take the expectation over all x, and before this we take the q distribution, the first one; the expectation value is always with respect to the first argument. From this we get, as a direct consequence, the data-processing inequality. The main point is that if you look at these two parts, both contributions are bigger or equal to zero, and each is zero exactly when its two arguments are the same. So since this is bigger or equal to zero, that means this

part is bigger or equal to this part, and this is already the data-processing inequality for the divergences; equality holds if and only if the two conditionals are the same, and this can be seen from exactly this equation: equality here means this term is zero, and this means the divergence is zero for every x, so the two conditional distributions are the same for every x. And how do we prove the chain rule? It is actually very simple, it just follows from the rule that the log of q(x, z) divided by

p(x, z) is just the sum of log q(x)/p(x) plus log q(z|x)/p(z|x), and then we take the expectation value of these quantities with respect to the joint distribution q.
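Written out, this is the chain rule just described, in the notation used above:

\[
D_{\mathrm{KL}}\big(q(x,z)\,\|\,p(x,z)\big)
= D_{\mathrm{KL}}\big(q(x)\,\|\,p(x)\big)
+ \mathbb{E}_{q(x)}\Big[D_{\mathrm{KL}}\big(q(z\mid x)\,\|\,p(z\mid x)\big)\Big],
\]

which follows from \(\log\frac{q(x,z)}{p(x,z)} = \log\frac{q(x)}{p(x)} + \log\frac{q(z\mid x)}{p(z\mid x)}\) and taking the expectation under \(q(x,z)\). Since both terms on the right are non-negative, the joint divergence is at least the marginal divergence, which is the data-processing inequality.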

Another remark is that we have the principle of minimal relative entropy, saying that if I start with a true distribution, which may be unknown, and I have a model trying to approximate this true distribution, then the principle says you can do this by using the KL divergence: try to minimize the KL divergence between the true distribution and the model distribution, and then you get a good approximation, in the information-theoretic sense, of your true distribution. Of course this true distribution is not known, but we have seen that if we replace the expectation value under the true distribution by the empirical one, using samples, then we arrive at maximum likelihood estimation,

meaning that we just plug the data points into the model distribution, and the minimization becomes a maximization because of the minus sign inside. So minimizing the Kullback-Leibler distance and choosing the empirical distribution for the expectation value gives us maximum likelihood estimation. This was an information-theoretic approach to justify maximum likelihood.
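As a compact restatement of this argument (the empirical average over samples x_1, ..., x_N is the only added ingredient):

\[
\arg\min_\theta D_{\mathrm{KL}}\big(q(x)\,\|\,p_\theta(x)\big)
= \arg\max_\theta \mathbb{E}_{q(x)}[\log p_\theta(x)]
\;\approx\;
\arg\max_\theta \frac{1}{N}\sum_{n=1}^N \log p_\theta(x_n),
\]

since the entropy term \(\mathbb{E}_q[\log q]\) does not depend on \(\theta\) and the expectation under \(q\) is replaced by the empirical average.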

1 Variational Inference - Ideas and Principles

hi everyone, in this video I want to talk about ideas and principles behind variational inference. First let's make two remarks. Consider a probability distribution which is given, or a marginal of it, and we want to approximate it; we basically want to find a computationally feasible representation of this distribution. A good idea is to start with a model distribution and then measure how far this model distribution deviates from our distribution of interest. To do this one can use a divergence, for instance the Kullback-Leibler divergence, and there are many other divergences. This divergence quantifies the difference between these two distributions,

as the Kullback-Leibler divergence does, in bits or in nats. So generally it is a good idea to just minimize this divergence, given a sensible divergence, with respect to the parameters of the model. We are basically approximating q(x) with a model distribution in this divergence, and this is a general approach to approximate inference: if I want to infer a distribution q(x), I take a statistical model, take a divergence, and run this optimization program; if this is minimal, the model distribution is very close to my target distribution and I have a good approximation. A small value of the divergence

means a good approximation; this was about approximate inference. A second remark, about latent variable models and the Bayesian setting. We have a joint distribution in two variables, x and z; I am not saying yet what z is. I could write how it factorizes: I have a marginal distribution of z and then a conditional of x given z, and maybe I have data given for x. I could now consider this distribution from a frequentist point of view and say: my x is the data, so the actual distribution of interest is over x, and the z is just a latent variable. Then I have a latent variable model because it contains the latents,

the z are considered latent variables, and the theta here are parameters, the ones I want to optimize over. I would call this the marginal latent distribution, and if I plug in my data then I would call this the likelihood or marginal likelihood, and this is what maximum likelihood estimation optimizes: I try to find the parameter that maximizes this likelihood. This conditional of x given z I could call the conditional distribution given the latents; the other direction is also just a conditional distribution, and they of course play different roles: this is the latents given the observations and here the observations given the latent variables. Fitting my parameter

with respect to the data set I would call maximum likelihood estimation. Now I could also reinterpret this whole model and say that z here is actually a parameter and this p(z) is a prior distribution over this parameter, and the theta are just hyperparameters; this would be the Bayesian setting. In this setting I would not call the z latent variables, they are just parameters, and I would call the theta hyperparameters; I would not call this the marginal latent distribution, I would say this is my prior distribution, or a family of prior distributions parameterized by theta. Then this quantity would not be called the likelihood, it would be called the evidence, and this here would be the likelihood;

here we call this the likelihood, and then we have a posterior distribution given the data. Fitting theta to our data is usually not part of a typical Bayesian framework; these are hyperparameters, and the framework is computing the posterior and marginalizing out to get the evidence. But if we still have hyperparameters and we don't know how to choose between them, one could do something which is called empirical Bayes, or maximum likelihood estimation of higher order, where we choose between these theta by looking at which one gives a higher evidence. As you can see, the mathematical

formulas did not change, we just changed the interpretation, and in what is about to come all these interpretations will blur together, but this doesn't matter too much, it's just names and interpretations. Some people think about it as a Bayesian setting, some people consider it a latent variable model, but the main point is that the math is the same; the principles, the math and the computations are all the same. So whether people call this a maximum likelihood approach or latent variable models, or they call it empirical Bayes or a Bayesian setting and so on, we don't care so much, since all the formulas are the same. You could also have a mixed approach where the z could be

partially latent variables and partially parameters. So we treat them all on equal footing, we don't make any distinction here; the main point is that the math is the same. Let us quickly remind ourselves what a latent variable model might look like. Consider that we have features, here x1 and x2, and we also have a color; what would the color be? We would have a data table: if we have observations of x1 and x2, this is what our data table looks like, with observations 1, 2, 3, 4 and so on, and every row

corresponds to one data point. This one would be blue, and then we have the coordinates written here, and so on. Now consider that we lost the information about the color, and our data set over here just looks like x1 and x2, with the data point numbers here. This is how the world looks: we consider the color to be given somewhere, but we cannot observe it; we don't know which color all these points have, they all look the same. What we do then is model this data set as coming from that one, and what we are trying to do is infer

the color from the data points. As you can see, one could do this with the features, since the features still carry some information about how these clusters might look. So this is a latent variable model: the z would be the color here, but we cannot observe it, so we might need to infer it. Let us formalize this a little bit. Say we have again a true distribution, we call it q, the one of interest; that might be the data points on the right-hand side from before. Now we model this with a latent variable model, maybe as a mixture of Gaussians, where the mixture components

are given by the color. So we could have three classes, and for each of the classes a Gaussian, which before looked something like this. So we model it with a latent variable model. Now let us compare this true distribution and our model distribution. For the true distribution we only write q(x), because this is the only distribution we can draw samples from, the only thing we can observe, and this here is just a model. Now there are generally two approaches for how we could compare a distribution

given only in x with a distribution which has x and the latent variables. First, we could marginalize the model distribution to p(x) by integrating over z, and then just compare the marginal distributions; so this model marginal is what we try to fit to this q(x). That would be one approach. The other approach is to introduce an inference distribution, a variational family; variational just means it is varying with respect to some parameter. This inference distribution is there to complement our q(x), to make it a joint distribution over z and x, and as soon as we have this, we have a

joint distribution for q. Then we can compare this joint distribution with the joint distribution of the latent variable model, and we minimize the joint Kullback-Leibler divergence: instead of minimizing the marginal one, we now minimize the joint one. Now the question is what kind of distribution we take here, and we take the one which minimizes this whole divergence. So we are minimizing over theta and w, where theta is from the latent variable model and w is from the inference distribution. We have seen that the first approach leads to maximum likelihood

estimation: as soon as I replace this unknown true distribution with the empirical one, if I replace the expectation values by empirical means, then I end up with maximum likelihood estimation; we have seen that. Now I can do the same for this joint distribution: we don't know it, but we can replace it by data points, or try at least, and what we end up with is variational inference. Variational inference starts not from the marginal, it starts from here, with this family of inference distributions; this one leads to maximum likelihood estimation, this one leads to variational inference, after we replace the expectations with empirical means. So now let's talk about how the two approaches differ

and how the two divergences, this one and that one, are related; they are related by the chain rule. The joint divergence can be written as a sum of the marginal divergence and the expected divergence between the conditional distributions. This here is what we minimize in variational inference, this is what we implicitly minimize in the maximum likelihood approach, and this part here is really the inference part,

and as you can see this is the only part where the inference distribution enters: this was the true distribution, there is no parameter w in it, this here is the inference distribution, and here is theta and here is theta, but only this part depends on the inference distribution. Now assume our inference distribution is so good that it can equal this conditional distribution; then this Kullback-Leibler term simply vanishes, which means these two quantities are the same, meaning that if we start minimizing this one, we are actually minimizing that one. That means even if you do variational inference, if this term goes to zero by having good inference distributions,

then you are basically degenerating to maximum likelihood estimation. Or, if we write it as an inequality, we know that all these quantities are bigger or equal to zero by the fundamental inequality; then we have the inequality that this is bigger or equal to this, which is bigger or equal to zero. So we sandwich the Kullback-Leibler divergence between the marginals between zero and the Kullback-Leibler divergence of the joints. If you minimize the joint one with respect to both w and theta and make it smaller and smaller, close to zero, then this one must also be close to zero, so minimizing this quantity

implicitly also minimizes the marginal divergence, which means implicitly maximizing the likelihood. And we know that equality here holds if and only if the conditional distributions are the same, meaning that if my inference distribution is good enough to make them equal, then we are back to the maximum likelihood setting.
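In formulas, using the same chain rule as in the relative-entropy recap, where q_w is the inference distribution and p_theta the latent variable model:

\[
D_{\mathrm{KL}}\big(q_w(x,z)\,\|\,p_\theta(x,z)\big)
= D_{\mathrm{KL}}\big(q(x)\,\|\,p_\theta(x)\big)
+ \mathbb{E}_{q(x)}\Big[D_{\mathrm{KL}}\big(q_w(z\mid x)\,\|\,p_\theta(z\mid x)\big)\Big]
\;\ge\; D_{\mathrm{KL}}\big(q(x)\,\|\,p_\theta(x)\big) \;\ge\; 0,
\]

so driving the joint divergence towards zero also drives the marginal divergence towards zero, with equality in the first step exactly when \(q_w(z\mid x) = p_\theta(z\mid x)\).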

But this was all in the idealized setting: our q(x) is an underlying true distribution, and we don't have access to this distribution besides samples, so we need to approximate it with samples. And because we want to relate it to maximum likelihood estimation, we take a minus sign and get rid of the terms which only depend on q and not on the parameters. Let us write this down: we take minus the KL divergence and subtract the entropy of q. What do we get? We have minus the KL divergence, which is minus the expectation value of log q(x)/p(x), and then minus the expectation

value of minus log q(x). As you can see, this quantity and this quantity are the same, once with a minus sign and once with a plus sign, so we are left with the expectation under q(x) of log p(x), and we got rid of the term which was not dependent on the parameters. Now we can approximate this with the empirical distribution

of x, and this means that x is no longer integrated over the true distribution; it is integrated over the empirical distribution, which just means we plug the data into this argument. So this is nothing but the log-likelihood of our data. This empirical approximation is justified by the law of large numbers, and as you can see, the Kullback-Leibler divergence is related to the log-likelihood in this compact way. Now we want

to do the same for the joint distribution, and to compare them on an equal footing we take a minus sign and subtract the entropy of this q(x), which we don't have access to. So let's write this down: we have minus the expectation, under q_w(z|x) q(x), of the log of q_w(z|x) q(x) divided by p_theta(z, x); let us just keep it like this,

minus the expectation of minus log q(x), which is plus the expectation of log q(x); we would use the empirical distribution here later, but let's leave it like this for the moment. Now we use that the log is additive, so we can split the log of this quotient into several terms. What do we get? We get minus the expectation, under q_w(z|x) q(x), of log q(x),

minus the expectation of log q_w(z|x), plus the expectation of log p_theta(z, x), and then plus the expectation under q(x) of log q(x). Now this first term with log q(x) is not dependent on z, so the expectation over

z vanishes there; there is no dependence on z, and it cancels against the plus expectation of log q(x). So we are left with these two terms, and we can write them as two nested expectation values: one expectation over x under q(x), and inside it one expectation over z under q_w(z|x), of log p_theta(x, z) minus log q_w(z|x). And now we replace the q(x) here with an empirical distribution; this is the law of large

numbers, the empirical distribution. For this quantity, log p_theta(x, z), it just means we plug in the data values for x, and the same for the inner expectation under q_w(z|x),

where we also plug the data in for x. So we approximate this by the expectation, under q_w(z|x-tilde), of log p_theta(x-tilde, z) minus log q_w(z|x-tilde), with the data x-tilde plugged in, and this quantity is called the evidence lower bound,

or in short the ELBO. Some people also call this the variational lower bound, and we have seen that this quantity is always smaller or equal to the other one; since we now have minus signs, we get the inequality in the other direction: the log-likelihood, which is also called the evidence (some people call it evidence, some people call it log-likelihood), bounds it from above, or better, this quantity here, which is called the evidence lower bound, is a lower bound to the evidence.
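Written out, with x-tilde denoting the observed data plugged in for x, in the notation used above:

\[
\mathrm{ELBO}(w,\theta)
= \mathbb{E}_{q_w(z\mid \tilde x)}\!\left[\log \frac{p_\theta(\tilde x, z)}{q_w(z\mid \tilde x)}\right]
\;\le\; \log p_\theta(\tilde x),
\]

that is, the ELBO is a lower bound on the evidence, the log-likelihood of the data.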

But of course we had two approximations here, and the question is whether this inequality still holds with these approximations, and this is the case. We can formulate this as a lemma; it is fundamental in variational inference. Here again we look at the latent variable model, we have some data, and we have some inference model; for the moment you can forget about the parameters, they don't matter, you can just ignore them. This inference model is meant to approximate, if you want, the posterior distribution over z given the data, where z is either a latent variable

or a parameter, depending on how you interpret it. The dependence on x is actually not really needed: if this is a static data set, then we only want to approximate this distribution, it is not a function of x anymore because the data is fixed. So actually we could consider this x-tilde not to be an input at all; if we just fix it to the data set, then this is just one distribution over z. Then we have the evidence lower bound defined, and this evidence lower bound is actually a lower bound to the evidence, also known as the log-likelihood; it means that this quantity, the evidence lower bound,

the ELBO, is smaller or equal to the log-likelihood, and equality holds if and only if the posterior is equal to this inference distribution. How do we prove this? We can prove it with Jensen's inequality, or use similar arguments as before. Maybe let's quickly prove it; we start, forgetting about the parameters, with p(x-tilde).

Then the log of p(x-tilde) is the log of an integral over z of p(z, x-tilde). We can now multiply and divide by q(z|x-tilde); these cancel out, so we can write this as the log of the expectation, under q(z|x-tilde), of p(z, x-tilde) divided by q(z|x-tilde).

Now Jensen tells us that for a convex function, the function of the expectation is smaller or equal to the expectation of the function; the log is concave, so the inequality goes in the other direction for concave functions. So this is bigger or equal to the expectation under q(z|x-tilde) of the log of this quantity, and that was the ELBO, so we are done.
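As a compact version of this proof (Jensen's inequality for the concave log):

\[
\log p_\theta(\tilde x)
= \log \int p_\theta(\tilde x, z)\,dz
= \log \mathbb{E}_{q(z\mid\tilde x)}\!\left[\frac{p_\theta(\tilde x, z)}{q(z\mid\tilde x)}\right]
\;\ge\;
\mathbb{E}_{q(z\mid\tilde x)}\!\left[\log \frac{p_\theta(\tilde x, z)}{q(z\mid\tilde x)}\right]
= \mathrm{ELBO},
\]

with equality if and only if \(q(z\mid\tilde x) = p_\theta(z\mid\tilde x)\).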

Let us summarize the framework of variational inference. We assume we have an underlying true distribution of interest, q(x), and we have data sampled from it. Then we have a latent variable model, or, as I maybe forgot to say again, it could also be a Bayesian setting where the z are parameters and we have a joint distribution because we incorporated a prior over z. Then, instead of maximizing the evidence or log-likelihood, which is the maximum likelihood approach, or the maximum evidence, empirical Bayes approach in the Bayesian setting, instead of doing this, in variational inference we introduce a family of inference distributions we can handle

well computationally, and then we maximize the evidence lower bound: instead of maximizing the evidence, the log-likelihood, we maximize the ELBO, which is this quantity here, with respect to both sets of parameters, this is important, all of the parameters. This is already the variational inference framework, and as we have derived, it can be justified either by using the minimal relative entropy principle applied to the joint distribution, plus an empirical approximation

of the expectation value, which was one justification, or directly starting from maximum likelihood estimation and showing that the log-likelihood is bigger or equal to the ELBO, and when it equals the ELBO. It is the same in the sense that we either maximize the ELBO or minimize the joint Kullback-Leibler divergence, and as said before, we can also consider variational inference as an approximate Bayesian scheme: in the Bayesian setting this would be the posterior, and we approximate it with this inference distribution instead of using the real posterior. So this is the variational inference framework.

2 Variational Inference - Expectation-Maximization Algorithm

hi everyone, in this video we'll talk about the expectation-maximization algorithm, the first algorithm for variational inference in this course. Let's start. Consider again that we have an underlying true distribution we can sample data from, and then we model our distribution with a latent variable model, so there are some observations which we assume are there but we cannot see in the data, for instance in clustering, where we don't know which cluster a point belongs to. Then, as is typical in the variational inference framework, we also set up an inference distribution, a family of inference distributions, which is supposed to

complement our true distribution and to match the model, and we have seen that in the variational inference framework, instead of maximizing the likelihood, which is often infeasible, we instead maximize the ELBO. So here you see we have our latent variable model where we plug in the data, then we have the inference model, which, if you interpret it in a Bayesian sense, would be an approximate posterior, and then we take the expectation value over this. So now, for the EM algorithm, the question is how do we optimize this, and in the EM algorithm we basically alternate, optimizing this quantity with respect to these parameters and then those parameters,

and that's already the EM algorithm: we initialize the parameters and then we just optimize them one by one and repeat. In the E-step we take the arg max over w of the ELBO, where for theta we take the value from before, and in the M-step we take the arg max with respect to the other argument, theta, where for w we take the value we just computed.
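Written as the alternating updates just described:

\[
\text{E-step: } w^{t+1} = \arg\max_{w} \ \mathrm{ELBO}(w, \theta^{t}),
\qquad
\text{M-step: } \theta^{t+1} = \arg\max_{\theta} \ \mathrm{ELBO}(w^{t+1}, \theta).
\]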

First, you can see that if you do this one by one, then in each step we are getting better. Consider for instance that we have just assigned the value w^{t+1}; we can plug it in, and the ELBO at (w^{t+1}, theta^t) is a fixed value, and in the next step we take the theta which maximizes the ELBO with this w^{t+1} fixed. So we pick something which is at least as good as before: this was just one value, and the new one is the maximal value with w^{t+1} fixed, so in this step we already improve over the previous step, and then we repeat. The same holds for

the E-step: if I have here theta^{t+1}, and in the next step I keep theta^{t+1} fixed, this was just one value of w, but then we pick the w which maximizes, so again we improve. Secondly, sometimes we have constraints on the parameters; for instance, part of the parameter space could be probability distributions which need to sum to one, say class probabilities. Then what you often need to add here is a Lagrange multiplier and the corresponding constraints in these optimization problems. So this is already the very

general expectation-maximization algorithm. Given this generality, one could also think about other versions of the EM algorithm: instead of the ELBO, which was basically minus the Kullback-Leibler distance between the joint distributions, we could take any divergence between these joint distributions and then alternate between optimizing the inference distribution and the latent variable model. But now, in this setting, let's simplify a little bit: since we are optimizing step by step, some of the terms are not needed in every step. We had, in the E-step, the arg max over w

of the ELBO at (w, theta^t), which is the arg max over w of the expectation, under q_w(z|x-tilde), of log p_{theta^t}(z, x-tilde) minus log q_w(z|x-tilde). As you can see, we are optimizing here over w, which means the value of the evidence term does not matter, so I can also just take the conditional here instead of the joint, because the evidence

plays no role in this optimization: it does not depend on w, so I can just pull it out. If I take the conditional, then I have a Kullback-Leibler distance again, between this distribution over z and that distribution over z, up to a minus sign, and I can write this as the arg min over w of the KL divergence between q_w(z|x-tilde) and p_{theta^t}(z|x-tilde). That makes sense: we see that the E-step is just about the inference step,

we are trying to adjust this to this posterior. Let us now think about the M-step. The M-step was maximizing the ELBO with respect to theta, where we already have w^{t+1}: this was again the expectation, under q_{w^{t+1}}(z|x-tilde), of the logarithm of p_theta(z, x-tilde) divided by q_{w^{t+1}}(z|x-tilde), and as you can see, here we are optimizing over theta,

which means that this denominator plays no role in this optimization: if we take the log and split it, that term will not be part of the optimization, so it doesn't matter what value it has. So we can write this as the arg max over theta of the expectation value, under q_{w^{t+1}}(z|x-tilde), of the logarithm of the joint p_theta(z, x-tilde),

and in this optimization, as before, the data is plugged in and the z is integrated over with this distribution; z here is a random variable and we are computing the expectation value of this function of the random variable under this distribution. Now that we have simplified the E-step and the M-step, we can think about how they simplify further if we can do the E-step in a perfect way. So on this slide again we have the E-step and the M-step, and we have seen that the E-step is the arg min

over w of the divergence between q_w(z|x-tilde) and p_{theta^t}(z|x-tilde), while the M-step was the arg max over theta of the expectation, under q_{w^{t+1}}(z|x-tilde), of log p_theta(z, x-tilde). Okay, so now let's look at the E-step: it

is minimizing the Kullback-Leibler distance between this, let's say tractable, inference distribution and the rather intractable conditional of our latent variable model. Let's assume for a moment that we set up our inference distribution such that we can do this exactly, meaning that we can make this distance zero; the best option here would be to minimize the distance all the way to zero, meaning that this equals this. In this setting q_{w^{t+1}}(z|x-tilde) would be the conditional distribution

of the latent variable model at the previous parameters, and that means if we plug this into the M-step, then the expectation over q_{w^{t+1}}(z|x-tilde) of log p_theta(z, x-tilde) becomes the expectation, under the conditional at the previous value of theta, z given x-tilde, of the log of p_theta(z, x-tilde),

and here you see that here we have the previous value while here we have the current value. We call this function Q(theta, theta^t). This means that in the M-step, if we can do the E-step perfectly, meaning that we have a latent variable model where we can evaluate this conditional, it is really the inference step from the joint to the conditional. If we can do this, for instance in a linear Gaussian model, where we know the conditionals are also Gaussian and we only need the mean and covariance matrix, or for something like a discrete multinomial, where the conditional is also a discrete multinomial, if

we can do these kinds of things, such that we can evaluate the corresponding expectation value of functions in z, then we can directly go for the best E-step, basically just computing the conditional; it is just an inference step in this model, so we don't need an approximating class, we just take the conditional, and then in the M-step we are maximizing this function Q(theta, theta^t), the M-step is just

maximizing this, and in the E-step we basically have to do this inference, more or less evaluating this function. This means that in this case our EM algorithm can be written as follows: in the E-step we evaluate Q(theta, theta^t), the expectation value, under p_{theta^t}(z|x-tilde), of log p_theta(z, x-tilde); meaning we have to do the inference to the conditional and we have to basically evaluate the expectation value of all the

random variables occurring in this expression, which we consider to be some formula, and then we take the expectation value of all these quantities. Then we have this function as a function of theta, with theta^t fixed, and then we compute the next value of theta as the arg max over theta of Q(theta, theta^t). And now you see why the EM algorithm is called the EM algorithm: the E-step, the expectation step, is evaluating this expectation with respect to the previous parameters, and the M-step, the maximization step, is taking the maximum of this function, the evaluated expectation;

maximization of that function. Furthermore, don't forget that if you do this maximization you sometimes have to add constraints; parameters live in a constrained space, and this optimization can be solved in all kinds of ways: sometimes you can just take the gradient, set it to zero and find a formula for the next step, and sometimes this is not feasible, and then you can try all kinds of other optimization techniques, like gradient methods, taking the gradient and using gradient descent or ascent. This is how the EM algorithm works.
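As a minimal numerical sketch of these two steps, here is EM for a one-dimensional mixture of two Gaussians, a special case of the mixtures treated in the next clip; the closed-form M-step updates are the standard weighted maximum-likelihood estimates, while the synthetic data and variable names are made up for illustration:

```python
import numpy as np

# EM for a 1-D mixture of two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])

K = 2
pi = np.full(K, 1.0 / K)      # mixing weights
mu = np.array([-1.0, 1.0])    # means
var = np.ones(K)              # variances

def log_gauss(x, mu, var):
    # log N(x | mu_k, var_k) for every data point and every component, shape (N, K)
    return -0.5 * (np.log(2 * np.pi * var) + (x[:, None] - mu) ** 2 / var)

for _ in range(50):
    # E-step: responsibilities gamma_nk = p(z_n = k | x_n) under the current parameters.
    log_r = np.log(pi) + log_gauss(x, mu, var)
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_r)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: maximize Q(theta, theta_t) in closed form (weighted ML estimates).
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print("weights:", pi, "means:", mu, "variances:", var)
```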

3 Variational Inference - Expectation-Maximization Algorithm - Example: Mixtures of Exponential Families

Hi everyone, in this video I want to go through an example of the expectation-maximization algorithm for mixtures of exponential families. Okay, let's set this up again. We assume we have a true distribution q(x), and we take x_1 up to x_N as i.i.d. samples from it. Now we want to set up a latent variable model where the latent variable of an x says which cluster it belongs to. So we have first p(z | π), with parameters π, saying that z belongs to cluster

k with probability π_k. We can write this as p(z | π) = ∏_{k=1}^{K} π_k^{z_k}, taking the product over the K classes, where the small z_k is an indicator of whether z is class k. z is one of these classes, so if z equals k, then only this term remains. So this is a probability, and we now assume that we have

N copies of this distribution, N i.i.d. copies, with z_{n,k} the indicator for z_n, and z_n is considered to belong to x_n. But of course these are not observed and we need to infer them. Then we assume that we have a distribution of x given z, let's say z equals k, and then we have a big vector of natural parameters. Here we assume that we have an exponential family for the k-th component,

which has its own natural parameter vector η_k, a sufficient statistic T_k, a log-normalizer A_k, and maybe some base measure b_k, so p(x | z = k, η_k) = b_k(x) exp(η_k^T T_k(x) - A_k(η_k)). So now we wrote this for one k. If we want to write it in a form where we don't need to single out the k, we do what we always do: we raise each factor to the power z_k and take the product over k, so that only one of these factors remains. We are separating

the data part, the argument part, and the distribution part, picking out the k with z_k = 1. Okay, so now we can write down the joint distribution. Let's call our data x̂ = (x_1, ..., x_N), and the latents are now all the z_{n,k} for n = 1, ..., N and k = 1, ..., K; let's collect them in a capital Z. Then our joint distribution looks as follows:

p(Z, x̂ | π, η) is a product over n = 1 to N (this is the i.i.d. part), then a product over k = 1 to K, and inside we have π_k times, maybe writing it directly into the exp, b_k(x_n) exp(η_k^T T_k(x_n) - A_k(η_k)), all of this to the power z_{n,k},

which we could also write inside the exp. And don't forget to make all of these dependent on x_n. This comes from multiplying this p(z | π) with this p(x | z, η) and then taking the i.i.d. product. And then we have to say what the log

is: log p(Z, x̂ | π, η). Now the products just turn into sums, and we get the sum over n and over k of z_{n,k} [ log π_k + log b_k(x_n) + η_k^T T_k(x_n) - A_k(η_k) ]. So again, this is the log of the joint distribution

of our model. Now we want to compute this Q function so we can do the E- and M-steps, so let's go to the next slide. Here it is. We're basically in the E-step, and what we need to do in the E-step is take the expectation over, I just write it here with the tilde denoting the old values, p(Z | x̂, π̃, η̃).

To make it clear: this is p(Z | x̂, π̃, η̃), where the tildes are the old values, so as you can see the expectation value is taken with respect to Z here. And in this expression, the only part that depends on Z is this z_{n,k} here. So what we need to do is find an expression for this expectation value. We define γ_k

of x_n, γ_k(x_n), to be the expectation value, under p(z_n | x_n, π̃, η̃), of z_{n,k}. Maybe we write down again what the joint distribution of z_n and x_n was: p(z_n, x_n | π̃, η̃) is the product over k of [ π̃_k

times b_k(x_n) exp(η̃_k^T T_k(x_n) - A_k(η̃_k)) ], to the power z_{n,k}. That was the joint distribution, and from it we now have to compute the conditional distribution, so we have to condition z_n on x_n. For this we need to divide this here

by the marginal p(x_n | π̃, η̃), where I forgot the tildes, and this marginal is obtained by just summing over all values of z_n. That means that instead of the product with the indicator, which leaves just one of these terms, we now sum over all values: we sum over k from 1 to K of π̃_k b_k(x_n) exp(η̃_k^T T_k(x_n)

- A_k(η̃_k)). This is now really the mixture distribution, which basically represents our q(x), but this q we don't know, and this is basically our model's marginal. So we can now compute the conditional distribution by just taking this joint and dividing it by this marginal. But first, maybe we look at the expectation value: z_{n,k} is an indicator variable of whether x_n, or rather its z_n, is of class k, so what is this expectation? We can also write it as the probability

that z_n = k given x_n, π̃ and η̃. This means that in this distribution we only need to look at the k-th component, that is, this part divided by this part. Maybe we squeeze a little bit in here and then continue writing: γ_k(x_n) is now equal to π̃_k b_k(x_n)

exp(η̃_k^T T_k(x_n) - A_k(η̃_k)), that's the k-th component, and now we have to divide by the sum over j, so that there's no confusion with the k, of π̃_j b_j(x_n) exp(η̃_j^T T_j(x_n) - A_j(η̃_j)).

So this is basically the inference step. Oh, I forgot the tildes here and here. So if we have the old parameters π̃_k and η̃_k, and we have our value x_n, we just plug it into these functions, which are given by the exponential families: the sufficient statistic T_k(x_n), the base measure b_k(x_n), the log-normalizer. We just plug it in; we can compute this because we have the old parameters and we have our values, so this can be computed just from these functions. We plug it in, we calculate it, we sum it up, we divide,

and then we have our γ_k(x_n). And this basically concludes the E-step, because this is the only quantity here that is needed.
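
As a small illustration, here is a minimal sketch in Python of how these responsibilities γ_k(x_n) could be computed from the old parameters for a generic exponential-family mixture. The functions log_b, T and A are assumed to be supplied by whoever defines the component family; the names are mine, not from the lecture.

    import numpy as np

    def responsibilities(x_n, pis_old, etas_old, log_b, T, A):
        """E-step responsibility gamma_k(x_n) for a mixture of exponential families (sketch).

        log_b(x), T(x) and A(eta) are the log base measure, sufficient statistic and
        log-normalizer of the component family; pis_old and etas_old are the old parameters.
        """
        # unnormalized: pi_k * b(x_n) * exp(eta_k^T T(x_n) - A(eta_k)), computed in log space
        log_unnorm = np.array([np.log(pi) + log_b(x_n) + eta @ T(x_n) - A(eta)
                               for pi, eta in zip(pis_old, etas_old)])
        log_unnorm -= log_unnorm.max()          # subtract the max for numerical stability
        unnorm = np.exp(log_unnorm)
        return unnorm / unnorm.sum()            # divide by the marginal (the sum over k)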

This means that we can now compute the Q function, Q(π, η; π̃, η̃), given the old π̃ and η̃: we take this log joint from before, copy-paste it, and write it, maybe directly in the form we intend to use. The main point is this expectation value here: it is now γ_k(x_n), just a function of x_n,

where γ_k(x_n) equals the quantity on the previous slide; γ is defined to be exactly this. So this quantity here is determined by that, and this is the E-step. Now, in the M-step, we have to maximize this function with respect to the new parameters, so let's go to

the M-step. We take π and η, maybe call them π* and η*, as the argmax of this Q function given π̃ and η̃, the previous parameters. So this means the argmax of this function here. How do we do it? We can take the derivatives with respect to the parameters, set them to zero and see if we get something out. So we require that zero equals the derivative of Q

with respect to η_k, for instance; let's do η_k first. What we get is Σ_n γ_k(x_n) [ T_k(x_n) - ∇A_k(η_k) ] = 0, where we sum over all data points,

and we know that this gradient ∇A_k(η_k) is the expectation of T_k(x) under the k-th component. The problem is that in general we cannot invert this. So suppose we can invert it; let's just write the gradient as A_k'. On the one hand, this term depends only on η_k and not on n, so what we get is first

A_k'(η_k) equal to the sum over n = 1 to N of γ_k(x_n) T_k(x_n), divided by the sum over n of γ_k(x_n); so it is a mean of the sufficient statistics, weighted by these responsibilities. And if we can invert this function, then we get η_k equal to A_k'^{-1} applied to

this weighted mean, Σ_n γ_k(x_n) T_k(x_n) / Σ_n γ_k(x_n), and this would be the first part of the M-step. Then we need to derive the update for π. Maybe before we start with π: if this A_k' is not invertible, what we can do instead is an iterative update rule, η_k updated by a step in the gradient direction until convergence, since we are maximizing

our function, and the gradient is of course Σ_n γ_k(x_n) [ T_k(x_n) - ∇A_k(η_k) ], this quantity here. So either we can do this analytically by taking the derivative, or, if we cannot evaluate this expectation ∇A_k, we can try to sample from this k-th component distribution and take the mean of T_k. Either you compute this and use an update rule, or you draw samples to approximate this quantity.

But maybe for simplicity let's assume we can do this. Okay, this is our objective here; this is our Q, which we now write only as a function of the new parameters, all the parameters are gathered in here. Now we have to be careful: we have constraints, and the constraint is that if we sum up the π_k we get one. That means our Lagrangian is Q(π, η) plus λ ( 1 - Σ_{k=1}^{K} π_k ). This is the Lagrangian we have to take

into account. So now let's see what we get if we differentiate our Lagrangian

with respect to π_k and set it to zero. So what do we get? Where does π appear? π appears only here in the Q function and over here in the constraint, so what

do we get here? We get 1/π_k times this quantity, Σ_n γ_k(x_n), and let's not forget the sum over n from 1 to N.

So this is the first contribution, and the second contribution comes from here, which is minus λ. That means that if we solve for π_k, observing that the sum here is over n, so this π_k is not bound by the sum, we just bring this to the other side, multiply by π_k and divide by λ, so what we get is π_k = (1/λ)

Σ_n γ_k(x_n). And now we need to find out what this λ is, and for this we can use the constraint. Maybe it's easier to write it like this: λ π_k = Σ_n γ_k(x_n), and then λ = λ · 1 = λ times the sum over k = 1 to K of π_k. Now we plug this in, so we get a sum over k and a sum over n of

γ_k(x_n). And if we first sum over k, then since the γ_k(x_n) are probabilities they sum to one, and here n goes from 1 to N, so what we end up with is λ = N. And this means we have found π_k, and it is just 1/N times the sum

over n of γ_k(x_n). So, summarizing what we do in the M-step: we have π_k = (1/N) Σ_{n=1}^{N} γ_k(x_n) over the data points, and we know that η_k = A_k'^{-1} applied to Σ_n γ_k(x_n) T_k(x_n) / Σ_n γ_k(x_n),

the weighted mean of the k-th sufficient statistic evaluated at the x_n. This is the M-step. So maybe we write it with iteration indices: these new parameters are the ones at step t+1, and these gammas are the ones computed in E-step t+1 as well. And the E-step here, the E-

step, basically says that we take, let's copy-paste this, γ_k^{(t+1)}(x_n) = π_k^{(t)} b_k(x_n) exp(η_k^{(t)T} T_k(x_n) - A_k(η_k^{(t)})),

divided by the sum over j of π_j^{(t)} b_j(x_n) exp(η_j^{(t)T} T_j(x_n) - A_j(η_j^{(t)})). So this is now the EM algorithm for mixtures of exponential families. In the E-step we update this γ here, which has an index k and an index n through x_n, from these quantities here, which are given by the exponential families and the previous

parameters. Then in the M-step we compute the weighted means over all these gammas from this step and obtain the new parameters from this quantity here, so we need to be able to take the derivative and the inverse of this function A_k, and then we have our new parameters. If we can do this, then we iterate until convergence, and then we have a representation of our clusters and parameters for each of the components of our exponential families. Maybe we make clear that all of this is for k = 1, ..., K, and with this

here also for every k.
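
To make the whole loop concrete, here is a minimal sketch in Python for the special case of a one-dimensional Gaussian mixture, where the inverse of A_k' has the familiar closed form (responsibility-weighted mean and variance). This is only an assumed illustrative instance; the function and variable names are mine, and x is assumed to be a 1-D NumPy array.

    import numpy as np
    from scipy.stats import norm

    def em_gaussian_mixture(x, pis, mus, sigmas, n_iters=50):
        """EM for a 1-D Gaussian mixture (sketch): alternate E-step and M-step."""
        for _ in range(n_iters):
            # E-step: responsibilities gamma[n, k] from the previous parameters
            joint = np.stack([pi * norm.pdf(x, mu, sig)
                              for pi, mu, sig in zip(pis, mus, sigmas)], axis=1)
            gamma = joint / joint.sum(axis=1, keepdims=True)
            # M-step: pi_k from the mean responsibility, mu_k and sigma_k from weighted moments
            nk = gamma.sum(axis=0)
            pis = nk / len(x)
            mus = (gamma * x[:, None]).sum(axis=0) / nk
            sigmas = np.sqrt((gamma * (x[:, None] - mus) ** 2).sum(axis=0) / nk)
        return pis, mus, sigmas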

8 Variational Inference

0 Variational Inference - Variational AutoEncoders (VAE)

Hi everyone, in this video I want to talk about variational autoencoders. The typical setting is a high-dimensional one. We assume that we have a distribution which lives in a high-dimensional space, where q(x) here is the true distribution, and then we have samples. A typical example or application area is images; that could be for instance MNIST, where each pixel is one dimension, so we have a lot of pixels, meaning high dimensions. Then we consider that the support of this distribution lies on a lower-dimensional so-called manifold, meaning that in pixel space, if you randomly sample pixels, this won't be an image; images are quite structured. So if you wanted to parameterize the space of all

images, this would be much lower-dimensional than the pixel space. So we now model this lower-dimensional space, which basically parameterizes our data manifold, with latent variables: we describe it by some lower-dimensional latent variables, and we call this latent space Z. Now, since this space of images might be very nonlinear, we want to use nonlinear models, and for this we usually take the framework of deep neural networks. If we make them deep or wide enough, just big enough, then we hope that they will approximate the data manifold very well. And this is based on

the universal approximation property: you can basically approximate any probability distribution with a neural network if it is just big enough. So this is the intuition here, and then we need to optimize. The maximum likelihood approach doesn't really work, as we cannot solve these integrals. So what we do is use variational inference, meaning we maximize the ELBO, and for this we also need inference networks, and of course we use neural networks for those as well. So the setting is as follows: we model our world here as a graphical model, a very

simple graphical model. X here is the observed images, N of them if you want, and each of these images is constructed, or thought of as constructed, from a low-dimensional z, and then we have a probability distribution on z. Since neural networks are considered to be very flexible, we take just a normal distribution, a standard normal distribution; we believe that the neural network can transform the normal distribution into any other distribution. So we have a distribution on z; then we need a conditional distribution of x given z, and for this we just consider a normal distribution parameterized by a mean and a covariance, which we consider here.

This distribution is often called the generator or the decoder, and it transforms this latent code into images. This is how we think about it. Then, for variational inference, we need an inference network. So we have a network going back from x to z, which is here represented by these dotted arrows, and this is called the encoder. Here we also consider a normal distribution, with a mean and a covariance, which here are also assumed to take this simple form. Now you might think: that's just a normal distribution, what about it? The main point is that these parameters here are not just parameters, they are functions,

and these functions are neural networks. So for these functions, the parameters of the distributions, we use deep neural networks. Now that we have set up this latent variable model here and this inference model, we can optimize the ELBO, and this can be done by backpropagation. So we need to maximize it: we take the data, go through these networks here, compute the loss, and then do backpropagation and update the parameters with stochastic gradient descent. The choice of this distribution here could be anything; usually this one is just used for simplicity. One could come up with a discrete distribution here, for instance a high-dimensional Bernoulli if these are

black-and-white pictures, or one could even consider other parameterized distributions like exponential families. So then, one point about the optimization is a difficulty, and the difficulty lies in the ELBO, where we have to evaluate this expectation value here. This expectation value is in general an integral which we cannot solve directly, because of the high nonlinearity and so on. But what we can do is try to approximate it with samples: this expectation here is the problem, and we want to use an empirical version based on samples. And the question now is, if we use samples, how do we deal with stochastic gradient descent? A sample is just a value; how can we get a gradient out of it? And the

solution is called the reparameterization trick. If I want a sample from this encoding distribution, which has these parameters here, and if the sample is just a value, I can still get gradients by writing my samples as functions of this mean and the variance, or standard deviation, here, together with something which is sampled from a unit Gaussian. So I start the other way around: I sample ε from the unit Gaussian and then I define my z as this function of the sample, the standard deviation and the mean, z = μ + σ ε. The point is that this separates the parameters from the randomness; there is no parameter inside the noise. So we now know how to take gradients, just

like with ordinary functions: we can compute this gradient, we can compute this gradient, and this remaining piece has no parameters, it is just noise, so there is no gradient with respect to it; all the gradients sit outside of the noise. And one can of course check that if I take something from a standard normal distribution, multiply it by a scale and add a bias term, then these samples follow exactly this distribution. So the main point is that we separate parameters from noise, and this allows us to do backpropagation. That means that if we want to approximate this quantity here, we can approximate it with these samples, just by sampling z like this, plugging it in here and

taking a sum, where M needs to be big enough that the approximation is good enough. Explicitly it looks like this: this is now a function of μ, σ and all the parameters occurring in this neural network, and the ε are just values which at most pop up in the derivatives.
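
Here is a minimal sketch of this trick in Python with NumPy: the noise is drawn once, parameter-free, and the sample is a deterministic function of μ and σ, so an autodiff framework could differentiate through it. The function name and the example expectation are my own illustration, not from the slides.

    import numpy as np

    def reparameterized_samples(mu, sigma, m, rng=np.random.default_rng(0)):
        """Draw m samples z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).

        The noise eps contains no parameters; mu and sigma enter only through the
        deterministic transformation, which is what makes gradients possible."""
        eps = rng.standard_normal(m)      # parameter-free randomness
        return mu + sigma * eps           # z follows N(mu, sigma^2)

    # Monte Carlo estimate of E_{z ~ N(mu, sigma^2)}[z^2]; true value is mu^2 + sigma^2 = 4.25
    z = reparameterized_samples(mu=0.5, sigma=2.0, m=100000)
    print(np.mean(z ** 2))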

So now we can describe how the training works. What we want to do is again maximize the ELBO, and for this we do the following until convergence, after initialization. We first sample a mini-batch from our data distribution; these are the actual images, for instance from MNIST. Then, for each single value of the mini-batch, we sample another mini-batch of M samples from this encoding distribution, and the important part is that we do this with the reparameterization trick, so we can compute gradients. Then we approximate our ELBO with a sample version, where we now have the N data samples, and for every such sample we have M of these latent samples, and we plug the values in here, here and here into this distribution. We parameterize them with normal distributions plus neural networks here and here, which means we have an analytic form, so we can really compute gradients with automatic differentiation tools. Then we compute the gradient, and once

the gradient is computed, we have a learning rate and we update the parameters: the new parameters are the old parameters plus the learning rate times the gradient. Or you can also use other optimization tools, with momentum or something like that. Okay, and this is how you train. Now let us talk about how to interpret the ELBO in this setting. The ELBO can be written as follows: we have an expectation over x, and then an expectation over our encoding distribution q(z | x).

And then we can write log p(x | z), where usually we have the joint distribution here and this is what I write next, minus the Kullback-Leibler divergence between my encoding distribution q(z | x) and my distribution on the latent space, often called the prior in the latent variable or Bayesian interpretation. Okay, let's look at this. This is how you can write the ELBO in general; this here is the generator or decoder,

and this is the encoder. So what this does is: it takes a sample from the data distribution, the sample goes in here and produces a latent code, and the latent code goes in here and produces another sample in the x space. We start with an x here, this leads us to a z, and this leads us to an x̂. And if you look at this term, since we are assuming a normal distribution, taking the log gives something like a squared distance, roughly (x - x̂)². So this term here can be interpreted as a reconstruction term, maybe with a minus sign because we are in the maximizing

business; so this can be considered a reconstruction term. This other term means that our encoder should stay close to this distribution on the latent space: while this one could overfit to samples, this term tries to prevent that overfitting by pushing this distribution to be independent of x and to match the distribution on z. So while this is reconstruction, this can be considered regularization. So the variational autoencoder, or

in general these variational objectives with the ELBO, have a naturally built-in regularization term. Let's look at some pictures. If we want to visualize this: we have here our input space x, which might be images; we encode with two neural networks and get a μ and a σ out; we sample some unit Gaussian noise, put it together, and get our latent code z. Then the latent code goes into our decoder, which spits out a reconstruction here. The first term in this ELBO kind of enforces that x is close to our reconstruction, and this distribution term here makes sure that if I have several of these distributions, they kind

of look like a Gaussian here and that they don't collapse too easily towards pure reconstruction, overfitting on our data. If we now look at, for instance, MNIST and we want to probe the latent space and visualize it, the latent space maybe looks like a Gaussian here, and the different classes are separated according to the digits. Here you see the digits 0 to 9, and the colors correspond to these classes, and if you translate this into the image space of black-and-white MNIST, it would look like this: in this region here we have zeros, in this region ones, in these regions sevens, fours, fives, threes, twos, sixes,

eights and so on. And this is how the variational autoencoder operates, and the objective follows directly from our general variational inference objective. So in summary, we just take our variational inference objective and choose all models to be neural networks plus some Gaussian noise, then we maximize the ELBO with stochastic gradient descent and mini-batches, and we're done.
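
To tie the pieces of this section together, here is a minimal sketch of the per-example negative ELBO as just interpreted: a reconstruction term plus the KL divergence between the Gaussian encoder and the standard normal prior. It assumes a unit-variance Gaussian decoder; encode and decode are placeholder networks returning NumPy arrays, not a specific library API.

    import numpy as np

    def negative_elbo(x, encode, decode, m=1, rng=np.random.default_rng(0)):
        """Monte Carlo estimate of -ELBO(x) for a VAE-style model (sketch).

        encode(x) -> (mu, sigma): parameters of the Gaussian encoder q(z | x)
        decode(z) -> x_hat:       mean of a unit-variance Gaussian decoder p(x | z)
        The prior p(z) is standard normal, so the KL term has a closed form.
        """
        mu, sigma = encode(x)
        eps = rng.standard_normal((m,) + mu.shape)                           # reparameterization trick
        z = mu + sigma * eps                                                  # m samples from q(z | x)
        recon = np.mean([0.5 * np.sum((x - decode(zi)) ** 2) for zi in z])    # ~ -log p(x | z), up to constants
        kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))   # KL(q(z|x) || N(0, I))
        return recon + kl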

1 Variational Inference - Variational Bayes / Mean Field

Hi everyone, in this video we want to talk about the variational Bayes approximation, also called the mean field approximation. Let's recap: in the variational inference framework we again start with the true underlying distribution, which we assume exists, and we consider that we have samples from it, and we model these with a latent variable model here, some z and some x. Or we consider this in a Bayesian setting, where these z are actually the parameters, this distribution is the prior, and these are just hyperparameters. So this is just a family of models with priors and hyperparameters. And then, instead of maximizing the log likelihood,

we are going to maximize the ELBO, the evidence lower bound, and for this, as we have seen, we need a family of inference distributions which are supposed to approximate the posterior; so these are supposed to approximate this here. And in the ELBO, here is the model and here is the inference distribution, and this needs to be maximized with respect to the parameters of both models, p and q. And if we train this, then after training

this is smaller than or equal to the log evidence log p(x̂), and after training it is considered to be very close to it. So the left-hand side is, after training, an approximation of the right-hand side; it is supposed to be close to it and approximates it from below. So in the variational Bayes framework, the mean field approximation, what we do is basically assume that our posterior p(z | x̂) is approximated by a product distribution. z here is now a higher-dimensional vector, or at least has a few components;

these may be one-dimensional components, but we could also block them, this block could be three dimensions and that one five dimensions and so on. So we block them into parts, and we make the assumption that these blocks are independent in the posterior. So posterior independence; that means that our inference distributions are defined to be only product distributions. This is how we approach it. This is called the variational Bayes approximation or mean field approximation. These are all product distributions; since you can choose the inference distribution, you can also restrict yourself to these kinds of distributions, and of course there is nothing wrong with taking product distributions as inference distributions. One

could then ask why we call it an approximation if this is basically just a choice of inference distribution. The main point is, as I explained, that people consider this as an approximation to the true posterior, in the sense that one wants to do a Bayesian approach: you need the posterior, and then you need to approximate that posterior. But one could also argue for a direct variational approach in its own right, and then this is just a valid choice of distributions, not considered an approximation of anything. But if you start from the Bayesian point of view, then this is considered an approximation of the posterior, and it is called the mean field approximation, by this product. One

remark is that if you consider this an approximation to our posterior with, let's say, the marginals fixed, then this is a high-entropy approximation: the joint entropy is smaller than or equal to the sum of the marginal entropies, and if the joint is a product then they are equal. So if the marginals are fixed, then this product has the highest entropy of all distributions with those marginals. The idea now is that if we have these product distributions, maybe we can find a closed-form solution for our ELBO, or at least find out what these

distributions need to look like in terms of our model. We have data and we have this latent variable model, and this needs to be fitted, and we basically want to get rid of the optimization over q by doing it explicitly, using this approximation. So what we do first is abbreviate the distribution like this: we drop this w and we drop this data, so we consider the distribution implicitly dependent on w and on our data, so it is allowed to depend on our data, and we also drop the parameters in our p. So then the Lagrangian,

or rather first the ELBO. We now write down our ELBO in terms of this distribution, and then we have some constraints, namely that these are probability distributions. Let's write down the ELBO: it is the expectation under q(z) of log( p(z, x̂) / q(z) ). This is our ELBO; as I said, we just dropped the dependence on all these parameters so that it is easier to write. And then you can rewrite this

as E_{q(z)}[ log p(z, x̂) ] minus E_{q(z)}[ log q(z) ]. So far so good. Now, this first term is an integral over q_1(z_1) ... q_K(z_K) times log p(z, x̂), which contains all these components and x̂. And for the second term, what we get is, since this is a product here

and the log turns it into a sum, minus the sum over k = 1 to K of the integral of q_k(z_k) log q_k(z_k) dz_k. So this is our ELBO, and we want to vary it with respect to this q and see what comes out; we want to maximize the ELBO. That means we wiggle here and see where the maximum is, so we take the functional derivative with respect to these components and set it to zero. So

you might wonder why this term here only involves q_k(z_k). The main point is: if this is a product, then the log turns into a sum of logs, each of these terms depends on only one z_k, and the other variables drop out by being integrated over; the expectation over, say, q_j of a term that does not depend on z_j just integrates to one. So we really have just a sum of single terms here. And now this is not everything; we still need to add a constraint. The constraints are that these are probability distributions, so let us add them.

So that means: for k = 1 to K we add a multiplier λ_k times ( the integral of q_k(z_k) dz_k minus one ), because each of them is normalized on its own. So we have such a term for every k: plus Σ_{k=1}^{K} λ_k ( ∫ q_k(z_k) dz_k - 1 ), requiring that the integral of

q_k(z_k) is one. So this is now our Lagrangian. Now let's compute derivatives. Here I wrote down the Lagrangian again, and now we want to take the functional derivative with respect to q_k and set it to zero. So what is this derivative? We want to differentiate with respect to this value q_k(z_k), so for every z_k we basically have one variable; that's how we need to think about it. Now, we have the product here, and one of these factors is q_k, so if I take the derivative, that factor drops out and the rest stays. So this is our first term:

the integral over the product of all q_j(z_j) with j not equal to k, times log p(z, x̂), integrated over dz_{-k}, where I write -k to mean all components that are not k. Okay, this comes from this contribution. Then we have the contribution from taking the derivative here, using the product rule: the first part gives minus log

q_k(z_k), which comes from this, and then we take the derivative of this one, which gives q_k times 1/q_k, so minus one. And finally we get a derivative here where only this λ_k remains. Okay, this is the functional derivative. Let's rearrange: now we have this equation, we set it to zero, and we want an equation for q_k. So we bring this to the other side and take the exp; bring this to this side,

absorb the constants, and exponentiate. What we get is q_k(z_k) = exp(λ_k - 1) times exp( ∫ ∏_{j ≠ k} q_j(z_j) log p(z, x̂) dz_{-k} ), where we integrate over all the not-k variables, and that's it. Now we can simplify a little. We have here a constant, a normalizing constant, given by the Lagrange multiplier, so we cannot derive this any further besides

really integrating over this and then setting it to one. So q_k(z_k) is now proportional to this here, and this is a probability distribution over all the others. So we can write it as exp of the expectation value, under q_{-k}(z_{-k}), of log p, where z_k is the free variable, the z_{-k} are averaged out, and the data is given. And that's it: we derived a formula for q_k explicitly. But wait a moment.

Now you think we have a formula, but we have one big problem. The big problem is that this is a recursive dependence: q_k depends on all the other q_j, and even though I have this equation now, all the q_k are coupled to each other in it. So here is the big problem. But this gives us a condition, and the condition is that any distribution I could come up with needs to satisfy it, so we can use these as fixed point equations. It is a fixed point equation where the "point" is actually a whole distribution, a fixed distribution equation if you want.

And if we have a parametric model for this q_k, for instance a normal distribution or something like that, then we have parameters here and parameters there, and what we get is a fixed point equation for the parameters. This often leads to update rules for the parameters, or sometimes one can explicitly solve the system: we get, say, parameter one here and parameter two there in the next equation, maybe two and one here, with some coefficients, and then there is a system of equations which can be solved. Or at least I can initialize, say a Gaussian with a Gaussian, and then use these equations as update rules until they converge, and then I have also found an approximation. So this might lead either to iterative update rules

or to a system of equations that can be solved directly. So this is the variational Bayes mean field approximation approach: you assume that your posterior factorizes as a product distribution, you arrive at this fixed point equation, and you try to solve it. After you solve it, you have an approximation of your posterior, and if you plug it into your ELBO you have an approximation of your evidence. This is the program of the variational Bayes mean field approximation, and it has to be said that this is difficult to do in practice by hand.
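
A minimal sketch of the resulting coordinate scheme, assuming each factor has been given a parametric form and its fixed-point update has been worked out by hand; the update callables are hypothetical placeholders for those derived rules.

    def mean_field_cavi(update_rules, params_init, n_iters=100):
        """Coordinate updates for variational Bayes / mean field (sketch).

        params holds the variational parameters of each factor q_k(z_k).
        update_rules[k](params) returns new parameters for q_k from the fixed-point equation
        q_k(z_k) proportional to exp( E_{q_{-k}}[ log p(z, x_hat) ] ),
        evaluated with the current parameters of all the other factors.
        """
        params = dict(params_init)
        for _ in range(n_iters):                  # iterate the fixed-point equations to convergence
            for k, update in update_rules.items():
                params[k] = update(params)        # update q_k with the other factors held fixed
        return params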

2 Variational Inference - Variational Bayes / Mean Field - Example

Hi everyone, in this video I want to go through an example of the variational Bayes mean field approximation. Okay, let's start. Let's assume we have a very simple one-dimensional model, and our distribution is given by a normal distribution: the probability distribution of x given the parameters, where the parameters are a mean μ and a precision τ. It's a normal distribution in x; this is our model for x. And we assume that from this model we actually take the product, the product of this model, i.i.d. So we assume we have x_1

up to x_N sampled i.i.d. from this, and we call this our data, our x̂. Now we want to do a Bayesian approach, so we specify priors for these two parameters. Our distribution for μ, which will depend on a μ_0 and a κ_0 and on τ, will be a normal distribution as well, centered around μ_0 and with precision κ_0 τ. Then we have a distribution for τ,

which will depend on a_0 and b_0, a gamma distribution. If we want to draw this as a graphical model, we would draw it like this: we have here the x, we have here μ and we have here τ,

then arrows from μ and τ into this x, and τ also points into μ. Then we also have μ_0 and κ_0 here pointing into μ, and we have a_0 and b_0 pointing into τ, like this,

μ_0 and κ_0, and then we have i.i.d. samples: we have a plate like this, with N observations inside. So this would be the graphical model. In the following we might suppress the dependence on the hyperparameters a_0, b_0, κ_0 and μ_0. Now we have a joint distribution; for the joint distribution let's directly take the log: log p(x̂, μ, τ), this is data, μ and τ, where, as I said, we suppress the dependence on the hyperparameters,

and then this can be written as log p(x̂ | μ, τ), which is a product over the data points, plus log p(μ | τ) plus log p(τ), where again I omit the hyperparameters. Then you can write this out; the first piece is a product of normal distributions. So what is it? It is N/2 times log τ (usually there is the variance here, inverted, but we have the precision, so we have the precision here) minus τ/2 times the sum over all

n = 1 to N of (x_n - μ)²; this is the first term, plus some constant. Then for the second term we get ½ log(κ_0 τ) minus (κ_0 τ / 2)(μ - μ_0)², plus some constant. And now for the gamma distribution we need to know a little bit about the gamma: its log density has a log τ term and a term linear in τ, and the rest is just some constant. So the main point about the gamma is that if I take the log of the gamma density,

I get a log τ term and a τ term with the corresponding coefficients: (a_0 - 1) log τ minus b_0 τ, plus some constants. And what is our goal? Our goal is to find the posterior of μ and τ given x̂, and our approximation is q_1(μ) times q_2(τ), and in the following we will use E_μ and E_τ to denote

the expectation values with respect to these distributions. Now the fixed point equations, the variational Bayes mean field fixed point equations, say that log q_1(μ) is the expectation over τ, so E_τ, of log p(x̂, μ, τ), plus some constant, this normalizing constant, and also that log q_2(τ) is

the expectation value over μ, E_μ, of log p(x̂, μ, τ), plus some other constant. Maybe I'll drop these indices when the dependence on μ and τ is clear. Now let us go through this bit by bit. First, the first fixed point equation: our log inside was given by this formula; this was the Gaussian part, we had this Gaussian here, and here we have the log of the gamma. Now we have to take the expectation value with respect to τ, up to constants which we can omit. Note

that we want here a distribution over μ, so everything which does not depend on μ we can just push into the constant. So if we take this, it means that log q_1(μ) is equal to, and now let's see what drops out: you see this does not depend on μ, this does not depend on μ, this does not depend on μ, and this does not depend on μ either. So we have only two terms that depend on μ. What we get is the expectation value over τ

of: minus τ/2 times [ the sum over n = 1 to N of (x_n - μ)² plus κ_0 (μ - μ_0)² ], plus some constants. Now, the dependence on τ here is linear, so it is basically a prefactor which can be taken out. So what we get is, I'll write it like this:

minus E[τ]/2 times [ κ_0 (μ - μ_0)² plus the sum over n = 1 to N of (x_n - μ)² ], plus some constant. As you can see, we have squares in μ, and we can complete the square, and I now claim that I can write this as minus (τ_N / 2)(μ - μ_N)², a precision times (μ minus some mean) squared. I can just factor this out, define my prefactor like this, and the rest is a

constant. And one can easily check what this τ_N and μ_N are: this μ_N is (κ_0 μ_0 + Σ_{n=1}^{N} x_n) divided by (κ_0 + N), and τ_N is (κ_0 + N) times this expectation value E[τ]. So this part is known, while this part here is still unknown. But what we know now is, so we have derived,

that log q_1(μ) is minus (τ_N / 2)(μ - μ_N)² plus constants, and this means that q_1(μ) is a normal distribution with mean μ_N and variance τ_N^{-1}. This is what we know now. Also, since this is the log of a Gaussian, we don't need to track all the constants, because we know they will add up correctly. And I'll write it down again: μ_N is (κ_0 μ_0 plus the sum of the x_n)

divided by (κ_0 + N), and this τ_N is (κ_0 + N) times the expectation of τ. Okay, we found this out. Now let's look at the τ part. q_2(τ) is given by this fixed point equation here, and again this log of the data and parameters is given by this, so again we want to push everything that does not depend on τ into the constant. But basically everything depends on τ: here's

τ, here's τ, here's τ, there's τ out here and τ here. So we get that log q_2(τ) is, taking the expectation value over μ wherever it appears, N/2 times log τ plus ½ log τ, minus τ/2 times the expectation over μ of [ the sum over n = 1 to N of (x_n - μ)² plus κ_0 (μ - μ_0)² ]; this is where the dependence on μ sits. And maybe

we move this down a bit and write here all the terms that depend only on τ: we also have (a_0 - 1) log τ minus b_0 τ, and don't forget plus constants. Okay, and now we can write this as ( a_0 + (N+1)/2 - 1 ) log τ

minus [ b_0 plus one half times the expectation value over μ of ( Σ_{n=1}^{N} (x_n - μ)² + κ_0 (μ - μ_0)² ) ] times τ, plus some constants, just rearranging a bit and absorbing constants. So we can now write this as

a_N times log τ minus b_N times τ, plus constants, which means that log q_2(τ) is equal to... oh, small mistake: let's call this coefficient here a_N - 1,

so just this part here is a_N - 1. The reason is that we can then write this as (a_N - 1) log τ minus b_N τ plus a constant, and now we know that q_2(τ) is a gamma distribution with parameters a_N and b_N, where a_N is a_0 + (N+1)/2, which is known, and b_N

is b_0 plus κ_0/2 times the expectation of (μ - μ_0)², plus one half times the sum over n = 1 to N of the expectation of (x_n - μ)². Now we can multiply this out: this is b_0 plus κ_0/2 times [ the expectation of μ² plus μ_0² minus two μ_0 times the expectation of μ ], plus

one half times the sum over n = 1 to N of [ x_n² plus the expectation of μ² minus two x_n times the expectation of μ ]. So here we have unknown quantities given by these expectation values, but we already know the expectation values of μ and μ², because we know that q_1(μ)

is a normal distribution with mean μ_N and variance τ_N^{-1}. That means we know that E[μ] is μ_N and that E[μ²] is 1/τ_N plus μ_N², which is just the variance plus the mean squared, and now we can plug these in. Also, we know from the gamma distribution that the expectation of τ is a_N divided by b_N.

So this gives us new equations: we had τ_N equal to (κ_0 + N) times the expectation value of τ, and we can now replace that expectation by a_N divided by b_N. μ_N and a_N are given directly by the data, so there are no implicit relationships there, but this τ_N depends on this b_N. So we now have basically τ_N^{-1} equal to (κ_0 + N)^{-1}

times b_N divided by a_N, with the other quantities known. And we know that b_N is b_0 plus κ_0/2 times [ 1/τ_N plus μ_N² plus μ_0² minus two μ_N μ_0 ], plus one half

times the sum over n = 1 to N of x_n², plus N/2 times [ 1/τ_N plus μ_N² ], minus μ_N times the sum of the x_n. So let us look at the 1/τ_N terms: what do they add up to? We get κ_0/2 plus N/2, times 1/τ_N, that is, (κ_0 + N)/(2 τ_N). Again we

express this now through b_N. Maybe first we write it in terms of τ_N^{-1}: this term is (κ_0 + N)/2 times τ_N^{-1}, and we had τ_N^{-1} in terms of b_N and a_N. So we take that formula and plug it in. What we get is

b_N divided by a_N, with a one half in front, so b_N/(2 a_N), plus all the remaining terms with b_0, κ_0, μ_N, μ_0 and the x_n, which we can just collect again. So what we have now is an equation of the form b_N equals this collection of terms plus b_N/(2 a_N), and now we have the chance to solve for b_N: we bring this part over to the left, factor b_N out, and bring the factor to the other side,

so we get b_N times (1 - 1/(2 a_N)) equal to this collected quantity, and we have found a solution for b_N. Then we can also solve for τ_N, which is (κ_0 + N) times a_N divided by b_N; we could plug this in, but I won't do that. So finally we have found our q_2(τ), which is a gamma distribution Gamma(τ; a_N,

b_N), where a_N was given by a_0 + (N+1)/2, b_N was given by (1 - 1/(2 a_N))^{-1} times this quantity, with μ_N = (κ_0 μ_0 plus the sum of all the x_n) divided by (κ_0 + N), and finally τ_N given by (κ_0 + N) times a_N divided by b_N. And q_1(μ) is a normal distribution

with this mean μ_N and this precision τ_N. So we have found both distributions from these fixed point equations, and finally this means that p(μ, τ | x̂), the posterior given our data, is approximated by this normal distribution N(μ; μ_N, τ_N^{-1}) times this gamma distribution Gamma(τ; a_N, b_N), with the quantities from above. And furthermore,

the evidence p(x̂), where these parameters are marginalized out, is now roughly given by the ELBO of this model and this q, and this closes the variational Bayes mean field approximation. Maybe we write here: ELBO

approximately equals the evidence. Often we cannot find explicit formulas for this, but we have seen that we have these fixed point equations, so what we could have done in this setting is: initialize τ_N with some value, initialize b_N with some value, and then use this equation and this equation here and iterate until convergence to get values for τ_N and b_N out. Then we would also have approximate values. Here, in the end, we were lucky and able to do it all explicitly. Okay, this is the variational Bayes mean field approximation.
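
Here is a minimal sketch of these updates in Python, following the iterative route just mentioned (initialize, then iterate the two fixed point equations until convergence); the closed-form solution for b_N is equivalent. The function name and the fixed iteration count are my own choices, and x is assumed to be a 1-D NumPy array.

    import numpy as np

    def vb_normal_gamma(x, mu0, kappa0, a0, b0, n_iters=100):
        """Mean-field VB for the 1-D Gaussian with unknown mean mu and precision tau (sketch).

        q(mu, tau) = q1(mu) * q2(tau), with q1 = Normal(mu_n, 1/tau_n), q2 = Gamma(a_n, b_n).
        """
        n = len(x)
        mu_n = (kappa0 * mu0 + np.sum(x)) / (kappa0 + n)   # fixed, depends only on the data
        a_n = a0 + 0.5 * (n + 1)                           # also fixed
        b_n = b0                                           # initial guess, so E[tau] = a_n / b_n
        for _ in range(n_iters):
            e_tau = a_n / b_n                              # E_{q2}[tau]
            tau_n = (kappa0 + n) * e_tau                   # precision of q1(mu)
            e_mu, e_mu2 = mu_n, 1.0 / tau_n + mu_n ** 2    # E[mu], E[mu^2] under q1
            b_n = (b0
                   + 0.5 * kappa0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2)
                   + 0.5 * np.sum(x ** 2 - 2 * x * e_mu + e_mu2))
        return mu_n, tau_n, a_n, b_n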

3 Variational Inference - Universal Variational Inference

Hi everyone, in this video I want to give you a bird's-eye perspective on all the learning and inference frameworks we have seen so far, and also a recipe for you to create your own. So what are we aiming at? Let's say again we have some underlying true distribution and some data set considered to be sampled from it, and what we want is to learn what this data distribution looks like, and also to do predictions. Also, it's often important that we can quantify model uncertainty, incorporate prior knowledge, regularize our distributions, and it should be computationally efficient. That means we need a lot of ingredients that are flexible and can be adjusted to our needs. So let's talk about the ingredients of this very general learning framework of universal variational inference, which is just an umbrella term

here. What we need to do first is specify a statistical model. We always did this; for instance a simple distribution, that could be a normal distribution, it could be an exponential family, or it could be a neural network, or a neural network with some softmax output and so on. But it could also be given by so-called implicit distributions or likelihood-free distributions, or even by probabilistic programs, so basically sample-based distributions. And here we assume that θ are the parameters: both the real parameters as we usually think about them and the latent variables are incorporated into this θ, so we don't specify the distribution of the latent variables separately and instead incorporate them here into the parameters,

and these distributions are meant to represent the true underlying distribution we assumed. Then we need to specify a loss function, telling us how we measure success. And the special part here about the loss function is that we are measuring the goodness of fit of the data with respect to the distribution. So we are not comparing to, let's say, analytically given distributions, because learning is always about discrete data: we always have to go from data to a model, and the model could be an analytic form or a probabilistic program or whatever you can come up with. But the main point is that data is always discrete and you have to abstract from it. So the loss function needs to compare discrete data points with an implicit or explicit

model. Then we need to specify the inference model. This is very much in the flavor of variational inference: we come up with a model that is supposed to infer the parameters, and this could be very flexible or very restrictive, depending on the computational constraints you usually have, or on whether you have some hard constraint on the distribution, some expectation of what you assume this distribution to be, or some simplifying assumptions. The main point here is that you write this down: you want to be explicit about it, you really have to say, if you do this, what the statistical model is, what the loss function is, and what inference model we used. As the next step we have to regularize our distribution during learning, and this is given by some

functional that can be anything which restricts our distribution during learning in some soft way. For instance, it could be that it should have high or low entropy, or that we're trying to penalize the complexity of our model by putting more mass on simple models, and so on. And then we basically also have to say something about what optimizers we use, say stochastic gradient descent or some other version, and we also have to say how we evaluate specific integrals, but this is rather minor for this framework. So after we have this set up, after we have specified the model, the loss, the inference network and our regularizer, what we need to do is optimize the following problem.

So this is now the key objective. The key objective is minimizing the loss between our data and our model. That's very typical: what is the best parameter such that this distribution fits the data, and that we regularize; that's also not a new thing, regularized regression looks very similar to this. The difference here is that we are not optimizing for the best parameters; we are optimizing for the best distribution over the parameters. We are doing some model averaging, so this is very much in the Bayesian sense, and this distribution here gives us the possibility to quantify model uncertainty. So that means we need to solve

this, and this is where this function class comes into play. Generally this might be infeasible to do, so we basically need a parameterized version of this, or some simplifying assumption so that we can solve it. So here the computational bottleneck is choosing this Q such that we can do it. And then we want to do predictions after this, and predictions are done with model averaging: after we have trained our model by minimizing this with some optimizer and we get our q out, we use this q to make predictions. If I want to know how probable a specific value x is, then I plug it into my model and I average over all my parameters. This is very

much in the flavor of the Bayesian approach, where we average the model over the posterior; this q is something like an approximate posterior if you want, and this is then the predictive distribution. And what we want, of course, is that this is close to the true distribution; that is the goal, a good approximation of the truth. So these are the six steps: specify the statistical model, the loss function, the inference model and the regularizing functional, then solve this optimization problem, and then do predictions with model averaging.
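
A minimal sketch of the model-averaging prediction just described, with hypothetical callables standing in for the learned inference distribution and the statistical model:

    import numpy as np

    def predictive_density(x_new, sample_theta, likelihood, n_samples=1000):
        """Model averaging (sketch): p(x_new) ~ (1/S) * sum_s p(x_new | theta_s), theta_s ~ q.

        sample_theta()        -> one draw from the learned distribution q over parameters
        likelihood(x, theta)  -> p(x | theta) under the statistical model
        """
        thetas = [sample_theta() for _ in range(n_samples)]
        return np.mean([likelihood(x_new, th) for th in thetas])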

So then the question is: what kinds of losses can we use? First of all, we wrote the loss here with the whole data set and this model, and often, under the i.i.d. assumption, it is just the sum of pointwise losses. The pointwise loss in the most typical case is the log loss, which is minus the logarithm of the probability of that point: we take the specific point here and we look at how likely it is with respect to this probability distribution. Another way to interpret it: if this is the binary logarithm, these are the number of bits needed to encode the state x using the coding scheme given by this model, the number of bits to encode this data point given this model. This is the most common loss we used in the course of our

lectures. Of course you can also use a square loss when you compare a point with a distribution, meaning: if this is the distribution and this is my point, then you compare the point with the mean of the distribution, you basically look at the mean, the center of mass, of the distribution, and take the squared difference between your point and that mean; this is the typical square loss. You can also take a quantile loss, where you compare with the alpha-quantile of your distribution; then you get the alpha-quantile out. And for classification you can of course use, for instance, the misclassification rate, which is about the worst loss you can think of, with very bad practical and theoretical properties, so it should be avoided at all costs.
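
A tiny sketch of the first two pointwise losses, using a Gaussian purely as an assumed example of the model distribution:

    import numpy as np
    from scipy.stats import norm

    x_point = 1.3                                # one observed data point
    model = norm(loc=0.0, scale=1.0)             # an assumed model distribution p(x | theta)

    log_loss = -np.log2(model.pdf(x_point))      # bits to encode the point under this model (density version)
    square_loss = (x_point - model.mean()) ** 2  # compare the point with the model's mean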

The log loss is much better, or, if you want to focus on very underrepresented classes, some people nowadays use the focal loss, where you have a power here, and then you can do classification with rare events. Other very popular choices are MMDs, maximum mean discrepancies, optimal transport divergences or Stein divergences; we can use all of these here as losses. Then let's talk about the inference model. What kind of inference models can you take? What we can do is just take point estimators, meaning that instead of learning a distribution over the parameters, we are basically just learning one parameter, and this is what we usually did in the maximum likelihood approach,

the maximum a posteriori approach, or empirical risk minimization: we usually only learn one parameter, and this corresponds to taking here delta peaks, distributions that put all their mass on just one point. And of course we could use some distribution parameterized by neural networks. Then we have seen in the variational Bayes mean field approximation that we use product distributions, and in the Bayesian setting one would use all distributions: if you think about Bayes' rule, then we have to compute one conditional out of the other, this is inference, and in the Bayesian setting you always assume that you can do this perfectly, which means the inference class consists of all distributions. What regularizing functionals can you choose? You can either choose just

not to regularize, just take zero, which would be the maximum likelihood approach, or, as in the standard variational inference framework or in the Bayesian setting, to pick a prior over your parameters and then use the Kullback-Leibler divergence between your inference distribution q and your prior; note that this q is supposed to be something like an approximate posterior in the Bayesian setting, and you get this regularization term in the variational inference setting as well. And then there's kind of a beta version of this, where the weight is usually one but you can also choose other values. You can also use the cross entropy, or the entropy itself, like high-entropy or low-entropy regularization,

depending on what you'd like to enforce: for instance, in semi-supervised learning you want low-entropy regularization, and if you want to obtain uncertainty estimates, rather high-entropy regularization. What people also sometimes try is to consider a base distribution on the observation space and then say that the models they learn shouldn't deviate from it too much: they specify how much it may deviate with some divergence, and then they regularize the parameters by choosing the ones which are not far away, as given by this expectation. These are possible regularizers. Then, what about the supervised learning setting? In the supervised learning setting our q(x) is now basically a q of

y given x, so our interest is in the labels y, and we want to learn this conditional. And we can press it into the same framework by basically saying that our model distribution is given by this discriminative distribution here, times the distribution of x, the features, which we are actually not interested in. We assume either that we know it, that it is given, or we totally ignore it, or we say it is basically the empirical version; in any case there is usually nothing to learn here, we just take it as is. But if you plug this in, then this cancels out everywhere, and the same framework also holds in this supervised setting here. So this gives you a recipe to come

up with your own learning frameworks as you want: you just have to specify all these six steps, and then everything is transparent and you can go on learning from data.
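
Finally, a minimal sketch of the optimization problem that these six choices define, an expected loss under the inference distribution plus a regularizing functional, estimated with samples; all callables are hypothetical placeholders for whichever concrete choices one makes:

    import numpy as np

    def uvi_objective(q_sample, data, pointwise_loss, regularizer, lam=1.0, n_samples=100):
        """Universal variational inference objective (sketch):
        E_{q(theta)}[ sum_n loss(x_n, theta) ] + lambda * R(q), estimated by Monte Carlo.

        q_sample()                -> one draw theta ~ q(theta)           (inference model)
        pointwise_loss(x, theta)  -> loss of one data point under the statistical model
        regularizer(q_sample)     -> R(q), e.g. an estimate of KL(q || prior)
        """
        thetas = [q_sample() for _ in range(n_samples)]
        expected_loss = np.mean([sum(pointwise_loss(x, th) for x in data) for th in thetas])
        return expected_loss + lam * regularizer(q_sample)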

4 Variational Inference - Universal Variational Inference - recovering special cases

I won in this video, I want to go through some corner cases which have already encountered in the past and show that they're basically special cases of our framework and just taking specific choices Okay let's go through with the first example let's start with a standard variational inference we have seen before let's see if you can press this into this framework, we have just said here we just started with distribution P. X. Given theta then our loss function can just be used as a low close the theater of X and if you want we can write it as a P at given theta then our distribution influence distribution can be any or neuralnetworks

any choice we can make, so this is not restricted at this point. Our regularizing functional is the Kullback-Leibler divergence between Q and the prior pi. Note that we might have used z here before; we now use theta as the variable, so that the z's are included as well. Then let's write out what the objective is. The objective is the expectation value under Q(theta)

of minus log p(x | theta), plus the KL divergence between Q(theta) and the prior, i.e. the expectation of log Q(theta) divided by the prior pi(theta). Let me rewrite this: it is the expectation value under Q(theta) of

minus log p(x | theta) minus log pi(theta) plus log Q(theta). Combining the first two terms into the joint distribution, and with Q as the inference distribution, we can write this as minus E_Q[log p(x, theta) - log Q(theta)] — and as you can see, this is minus the ELBO.

So minimizing here — minus times minus — means maximizing the ELBO. You can now see that with these choices, the log loss and this KL regularization term, we recover the old variational inference we have seen so far. We did not talk about predictions there, so the model averaging step is just taken as is. Next example: the Bayesian approach. For the full Bayesian approach we again take the log loss, minus log p(x | theta), and now it is important that we take all distributions, all possible distributions, as the inference model. Then

the regularizing functional is again the KL divergence between Q and the prior pi(theta), and we have already seen that this is then the argmax over all distributions Q of the ELBO. We have seen on the last slide that the ELBO is E_Q of the log of

the joint of the data and theta minus log Q(theta); we had a minus sign in front, and we can erase the minus sign by turning the argmax around into an argmin. Now what we can also do is write the joint as the conditional times the evidence, so we can write it as the argmin over all distributions Q of E_Q[log Q(theta) / p(theta | x)]

minus log p(x-hat), and since the evidence does not depend on the parameters it can be pulled out. What remains is the argmin over all distributions of the KL divergence between Q(theta) and the posterior p(theta | x), and this argmin is attained exactly when the KL divergence is zero. Since we optimize over all distributions, we can always match the posterior, so the optimal Q is the posterior.
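To make this identity concrete, here is a small numerical check — not from the lecture, all numbers invented — with a discrete parameter theta taking two values: the loss-plus-KL objective, minus the ELBO, and KL(Q || posterior) minus the log evidence all coincide, so minimizing one is maximizing the other.

```python
import numpy as np

prior = np.array([0.3, 0.7])            # pi(theta) over two parameter values
lik   = np.array([0.8, 0.1])            # p(x | theta) for the observed x
q     = np.array([0.6, 0.4])            # some inference distribution Q(theta)

joint     = lik * prior                  # p(x, theta)
evidence  = joint.sum()                  # p(x)
posterior = joint / evidence             # p(theta | x)

def kl(a, b):
    return np.sum(a * np.log(a / b))

objective = np.sum(q * -np.log(lik)) + kl(q, prior)   # E_Q[-log p(x|theta)] + KL(Q, prior)
neg_elbo  = -np.sum(q * (np.log(joint) - np.log(q)))  # minus the ELBO
alt_form  = kl(q, posterior) - np.log(evidence)       # KL to posterior minus log evidence

print(objective, neg_elbo, alt_form)     # all three numbers are equal
```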

Then our model averaging is p(x | theta) integrated against the posterior — and as you can see this checks out: with these choices we recover the full Bayesian approach within this universal variational inference setting. Next, let us think about maximum likelihood estimation. For maximum likelihood estimation we again take the log loss on the data, and as inference distributions we only take the point estimators, the delta peaks, which put all their mass on one value — the delta is only nonzero when theta is

equal to this w. This w is now a parameter, and we want to find the best one. We take no regularization, so the objective is an argmin over w. Think about what this means: taking the expectation under a delta peak just means plugging in w, so we take the argmin over w of minus log p(x | w) plus zero, and since we are minimizing a minus log, this is the same as an argmax — it is

the same as the argmax over w of log p(x | w). So this is really the maximum likelihood approach — check. Now what are the predictive distributions? First, what we get out is theta-hat, the maximum likelihood estimate; or, written a bit more properly, the optimal Q among the delta distributions is the delta distribution located at theta_MLE. So basically this corresponds just to the typical

maximum likelihood estimator, now written a little bit fancy, but it carries the same information. Now we plug this in: if we take this delta and plug it into the model averaging, we are again integrating over a delta peak, so what we get is p(x | theta_MLE), which is really the typical predictive distribution of maximum likelihood estimation — we just plug in one value. Let's go to the next one: maximum a posteriori. Again we take the log loss on the data, and again we consider delta peaks. Now

as regularizing functional we take the cross entropy between Q and the prior pi over the parameters. What we basically get, restricted to the deltas, is minus log p(x | w) plus the cross entropy, which is just the expectation value under the delta of minus log pi — that is, minus log pi(w).

This is what you get if you take the cross entropy and plug in this delta distribution, and it can be written as the argmax over all w of log p(x | w) plus log pi(w). As you can see, we can also write it the other way round: this is also the argmax of log pi(w | x) plus log p(x), where the evidence term

does not matter because it does not depend on w. What we get out of this is really the MAP estimator — this is the objective for MAP estimation, i.e. looking at the posterior mode. Now we can plug this in: it is again a delta peak, so what we get is p(x | theta_MAP), and again we have recovered the maximum a posteriori estimation framework. Then empirical risk minimization — I think we did not talk much

about it. Here we can take any loss; as inference model we would again take the point estimators only, the delta peaks, and no regularization, so the objective is the argmin over w of the empirical risk — just plugging in and minimizing the empirical risk. What we get out is the empirical risk minimizer theta_ERM, and the prediction is again p(x | theta_ERM) —

that is, prediction with the point estimate. Next example: variational Bayes with the mean-field approximation. Again we take the log loss on the data, and as inference model we now take all distributions Q(w) that can be written as a product over the components, each component being an arbitrary distribution. Then we again take the KL divergence to the prior as regularizer, and as before we see that this is

the argmax over product distributions Q of the ELBO, similar to the variational inference framework, and then we do model averaging for prediction, which we had not talked about there before. So with this very general framework of universal variational inference, where we have to specify a statistical model, a loss, an inference model and a regularization term, and by specifying these in all kinds of possible combinations, we get all the learning frameworks we have seen so far. This is a very flexible learning framework — meaningful and flexible enough without getting

into too many details, seen from this bird's-eye perspective.

5 Variational Inference - Summary

Hi everyone, let's summarize what we have done so far in this variational inference part. So what was variational inference? The setting was: we assume there is some unknown true distribution and we are given data, samples from it. Then we either have a latent variable model or we are in the Bayesian setting — we interpret the quantities differently, but the same rules apply. We start with some latent variable model, where we consider the latent variable to correspond to information that was lost in the data and that we are trying to recover or infer. To do this, we introduced an inference model, a family of inference distributions,

parameterized by some w, which is meant to approximate the posterior, say p(z | x). The key idea was that instead of maximizing the log likelihood of the evidence — which is often intractable to optimize directly — we maximize the evidence lower bound (ELBO). The ELBO is given in terms of the joint distribution, so we do not need to marginalize anything out: we plug in the data, probe the latent z with our inference network, and maximize this with respect to w. First of all, sometimes in

the Bayesian setting theta is a hyperparameter, so it is fixed or not even there, and then we only optimize with respect to w; but if we consider a latent variable model, then we also optimize for theta. Maximizing this gives as close an approximation to the real evidence, the log likelihood, as we want. This is the basic setting of variational inference, and we have seen that equality holds — let me write it here — equality holds if and only if Q(z | x) equals

the posterior. Then we used this framework to derive a few algorithms in different settings. One of them was the expectation maximization (EM) algorithm, which basically optimizes the ELBO once with respect to w and once with respect to theta, alternating between these two steps, the E step and the M step. When we looked at it more closely we were able to simplify the ELBO, so we only need to evaluate this one quantity, where the optimal inference distribution appears, with respect to the new parameters — and this is now a joint, there is no conditioning

in here, which makes it easy when the joint is given. Then in the M step we maximize this with respect to theta to get a new estimate, and sometimes we have to add constraints here if the parameters live in a constrained space. We repeat this until it converges, and then we have a good approximation of the posterior. The second approach was variational autoencoders, where what we basically did is parameterize the latent variable model and the inference network with deep neural networks with normally distributed outputs and a normally distributed latent space, and then we just

maximized the ELBO with stochastic gradient descent — everything is differentiable. There was just one small problem: we had to evaluate an integral, the expectation appearing here, and we solved this by sampling. Since we considered this to be a normal distribution, we were able to use the reparameterization trick to separate the parameters from the random noise, and then we could backpropagate through the samples. This was the variational autoencoder algorithm. The last one was the variational Bayes mean-field approximation, where we basically approximated the true posterior with a product, a factorized inference

distribution, the product of the marginals. If we maximize the ELBO with respect to distributions of this type, we found that they need to satisfy a system of fixed-point equations: each marginal satisfies an equation in which the k-th marginal depends on all the marginals other than k. This often leads to recursive or cyclic dependencies of the parameters — if these are parametric distributions, the parameters here depend on the parameters there and vice versa — and we get a whole system, 1, 2, ..., K, of these equations, so we have a

system of equations for the parameters, and we either solve it explicitly or we iteratively update until convergence in a fixed-point scheme. Finally, we talked about a bird's-eye view on all the learning and inference frameworks so far, where we basically need to specify a statistical model, a loss function, an inference model and a regularizing functional. Then we have to solve an optimization problem which looks very similar to a regularized regression problem, with the only difference that instead of one value we optimize over distributions — these are the inference distributions —

and then we do prediction by model averaging. This resembles the predictive distribution in the Bayesian setting, and because this Q is a distribution we even have model uncertainty quantified, saying which of the models are likely and which are less likely. So this was the general framework of universal variational inference, which can be used for learning by specifying all these ingredients bit by bit, and which is general enough to recover at least all the learning frameworks we have seen so far from one perspective.

9 Probability Theory

0 Probability Theory - Transition Probabilities

Hi everyone, in this video I want to talk about transition probabilities, also called Markov kernels. But before we start with that, we have to talk about the notation we use in machine learning. Consider we want to talk about conditional distributions: we usually start from a joint distribution for random variables X and Y, divide by the marginal of Y, and write the probability of x given y. What we mean by this is usually that the small x in the argument corresponds to the random variable capital X, while the small y corresponds to the random variable capital Y. For the moment I am ignoring the case where the marginal could be zero, and whether we are actually talking about a density instead of a probability mass function. What concerns me is that we are implicitly referring to the

corresponding random variables in this notation, and then the outcome becomes ambiguous when we use this heuristic assignment. The problem is: what do we mean if we swap the arguments, changing x to y and y to x? One answer is that p(y | x) should then be the probability that the random variable X takes the value y, while Y is fixed to x — because mathematically, if you have a function and you change the arguments, you just change the arguments; the slots stay attached to the same variables. This would be, let's say, the purely mathematical approach; the correct mathematical

way of interpreting it would be like this. But of course we introduced this simplifying notation to keep things easier to read — less cluttered, but less correct on the other side. Moreover, strictly speaking, once we put in arguments this is a value, a real number, yet we often also use the same notation when we actually mean the function, just to highlight which arguments we use. Then we get confused between a value and a function, which are two different mathematical objects. What we usually mean, when we use this notation

p(y | x) in this course, is the probability of the random variable Y given the random variable X: we implicitly change the random variables when we change the notation for the arguments, so where we would actually substitute x we also swap Y for X. This is how it was used in this course and before, and also in the books. And not only that: we often also mean the function, just highlighting the arguments, and we mix this up. We have mixed this up in the course — sometimes we meant the value, sometimes the function — and even sometimes,

when we consider, say, a likelihood function, the y is the data point, considered fixed, and we treat the expression as a function of x alone; it is not written here, but there was a fourth version being used in combination with the same notation, so the context always gave us a hint how to interpret it. We could also consider it as a function taking x and mapping it to a probability distribution: x maps to the probability distribution that maps y to this conditional. And again, we could argue for swapping the arguments because of the substitution rule in mathematics — so there is a lot of potential confusion. Why do I talk about this? Because what I actually want to talk about is

the function where we switch — where we just substitute x for y and y for x. Why would I do this? Well, I want to be able to talk about homogeneous Markov chains, and for that I will actually need this. So we now need a solution for this course to be able to talk about what I just said. We make, for now, a small convention: if I use an index on my probability distribution, then I am not, when changing the arguments, also changing the random variables

of the probability distribution; the same holds if I use an unusual letter for the probability, say K. So we could write, for instance, an index "X given Y" on the probability, and then I mean that the first argument slot refers to the random variable X and the conditioning slot to the random variable Y, whatever letters I plug in. So p_{X|Y}(x, y) is the probability that X = x given Y = y, and of course we define this quantity to be zero if the conditioning probability is zero. And if I swap the arguments, writing y and x,

it is still clear what comes first: this is the probability that X = y given Y = x. Or, if I have a Markov chain, each time step has an index, and if I put the index on the probability, we can talk about the n-th variable given the (n-1)-st variable. Still, with this notation we do not know whether we are talking about a fixed value, about the function of both arguments, of only one of the arguments, or whether we consider it as consecutive functions where we first consider it as a function from x to the probability space and then choose an argument

for y — but this is still deducible from the context. So let's now talk about transition probabilities, or Markov kernels. A transition probability, also called a Markov kernel — the same thing, just a different word — from a space X to a space Y is a function from X to the space of all probability distributions over Y, meaning that for every x we get a probability distribution over Y. That means we can take an argument y and say what probability corresponds to it. When we use this for densities or probability mass functions, we have two properties: K(y | x) is greater than or equal to zero, and

if we integrate over y, over the whole space Y, we get one. These are the conditions, and they hold for all x and y. To keep it simple, we again have some abuse of notation here: we often just write K(y | x) when we actually mean the function, sometimes we write K(y | x) to mean just the value, sometimes we mean probability masses and sometimes probability densities, and we often also just write p(y | x),
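As a small illustration — not from the lecture, just a minimal sketch with made-up state counts and numbers — a discrete Markov kernel from a 3-element space X to a 2-element space Y can be stored as a row-stochastic matrix: row x is the distribution K(· | x).

```python
import numpy as np

# A discrete Markov kernel K from X = {0,1,2} to Y = {0,1}:
# row x holds the distribution K(. | x) over Y.
K = np.array([
    [0.9, 0.1],   # K(y | x=0)
    [0.5, 0.5],   # K(y | x=1)
    [0.2, 0.8],   # K(y | x=2)
])

# The two defining properties: non-negativity and normalization over y.
assert np.all(K >= 0)
assert np.allclose(K.sum(axis=1), 1.0)

def kernel(x):
    """For every x, return a probability distribution over Y."""
    return K[x]

def sample_y(x, rng):
    """Interpret the kernel as a noisy channel: sample y ~ K(. | x)."""
    return rng.choice(len(K[x]), p=K[x])

rng = np.random.default_rng(0)
print(kernel(1), sample_y(1, rng))
```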

in line with conditional probabilities. So the representation of Markov kernels looks basically the same as that of conditional probabilities. What we mean here, though, is that we can substitute values freely, and when we use Markov kernels we usually do not change the associated random variables when we swap the arguments. The difference between Markov kernels — or transition probabilities — and conditional probabilities is therefore a very slight one. For conditional distributions, conditional probabilities, what we usually start from is a joint distribution,

written with the usual abuse of notation as p(x, y), and then we write the conditional distribution by just dividing by the marginal — if, as always, p(x) is bigger than zero, and defining it to be zero if p(x) is zero. If I define a conditional probability like this, then I have a well-defined transition probability from X to Y. So if I have a joint distribution, I can define a transition probability just by conditioning. On the other hand, I can start with a transition probability, a Markov kernel,

without referring to any joint distribution or any marginal distribution. But what I could also do is define a joint distribution from it: say I have a distribution p(x); then I can define a joint distribution p(y, x) — and this is now a definition — as K(y | x) times p(x). So we define a joint distribution by starting with a distribution over X (which then turns out to be the marginal of this joint) and combining

it with the transition probability to get a joint distribution. So the difference is only where we are coming from: if we come from a joint distribution we can get a conditional, and that defines a transition probability; but we can also start with a transition probability and then, given some distribution p(x), get a joint — and the kernel itself is independent of this p(x), I could use any p(x) to define a joint. There are different interpretations of Markov kernels or transition probabilities. Say we start again with a transition probability K(y | x) going from the space X to the space Y. Then we can consider it as a family of distributions: for every single x,

I can consider it as a function of y, and this is a probability distribution over Y — this was the definition — so it is a family of distributions indexed by x. Then, as we said, it can be considered as a conditional distribution for which the marginal is not yet specified, or is arbitrary, or where we basically consider all marginals at the same time and want one conditional that works for all of them. Another important interpretation is considering a transition probability as a noisy channel, or as a probabilistic map

from X to Y: we basically take an input x, and y is then an output sampled from this distribution. This means we are assuming that there is a deterministic map g taking x and a second argument e, which is random noise independent of our input. This interpretation usually comes from noisy channel coding. The correspondence between the kernel K and this probabilistic map is: if I take the probability — here the probability is over E —

that the function g(x, E) equals the value y, then this is exactly the kernel K(y | x). All these interpretations are basically equivalent, and the standard way to construct this g out of K is via the conditional cumulative distribution function, or rather its inverse. Let's look at some corner cases and examples. Say we have a deterministic map g in an argument x. We can consider it as a Markov kernel which assigns probability one to

the function value g(x) and zero otherwise — a delta peak at g(x) with no probability mass anywhere else. So we can consider any deterministic map as a Markov kernel by interpreting it in this way. Then, something we have already seen: if we have a statistical model parameterized by an index theta, we often already wrote it as a conditional distribution — sometimes we made theta an index, but often we wrote it as p(x | theta) — and in the frequentist setting we did not even specify a distribution over theta. That means for every pick of theta we had a model over X: the interpretation was a family of distributions

over X. In the Bayesian setting, by contrast, we specify a prior over the parameters and then define the joint p(x, theta) to be this model times the prior. And as you can see, we did not start with a joint distribution: we started by specifying the model and then a prior, so we actually started by specifying a transition probability, a Markov kernel. That means probabilistic maps fall into the framework of

Markov kernels, and statistical models also fall into the framework of Markov kernels. We can moreover reinterpret plain probability distributions as constant Markov kernels: if I have a probability distribution over the space Y that does not depend on x, I can consider it as a Markov kernel from X to Y that returns the same distribution for every x — a constant-valued Markov kernel. In this sense, probability distributions are also Markov kernels, just constant ones. Another example, using the interpretation as a probabilistic map: consider a normal distribution in x with mean mu, and let's

consider the covariance as a hyperparameter. Then I can consider this as a Markov kernel in mu and x: I take mu as an input and sample x from this distribution given mu, and I get a new x. Or, written in other terms, I can take mu, sample e from a standard normal (or one with this covariance), and just add it: mu plus e is then, if I call it x, a sample from this distribution. So a normal distribution can be considered a Markov kernel from mu to x, for instance from the space X to X, where the first argument is the mean and the second is the value x.
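A minimal sketch of this view (values made up): the kernel takes a mean mu and returns the distribution N(mu, sigma^2), and sampling from it is the same as adding independent Gaussian noise to mu.

```python
import numpy as np

SIGMA = 0.5  # covariance treated as a fixed hyperparameter

def gaussian_kernel_sample(mu, rng):
    """Markov kernel from mu to x: x ~ N(mu, SIGMA^2),
    written as the probabilistic map x = mu + e with e ~ N(0, SIGMA^2)."""
    e = rng.normal(0.0, SIGMA)   # independent noise
    return mu + e

rng = np.random.default_rng(1)
samples = [gaussian_kernel_sample(2.0, rng) for _ in range(5)]
print(samples)  # five draws from N(2.0, 0.25)
```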

Another example of a probabilistic map: say we train a neural network with dropout. At test time all the weights are fixed, but if we plug in a test point, our e corresponds to the Bernoulli masks, which are random. So for every x we sample an output by sampling a dropout mask, and we can consider this a Markov kernel by looking at the distribution of the function value under resampling of the dropout mask. This gives us a distribution on the output space, and taking the input x to this distribution on the output space defines a Markov kernel.
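A hypothetical sketch of this dropout-as-kernel view — not from the lecture, with an invented tiny one-layer network and made-up weights: the hidden units are masked by Bernoulli noise at test time, so repeated evaluation at the same x yields a distribution over outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(8, 1))   # fixed ("already trained") weights, made up here
W2 = rng.normal(size=(1, 8))
P_KEEP = 0.8                   # dropout keep probability

def dropout_net(x, rng):
    """Probabilistic map x -> y: the randomness e is the Bernoulli mask."""
    h = np.maximum(0.0, W1 @ np.array([[x]]))      # ReLU hidden layer
    mask = rng.binomial(1, P_KEEP, size=h.shape)   # e ~ Bernoulli(P_KEEP)
    return (W2 @ (h * mask) / P_KEEP).item()

# For a fixed input x, resampling the mask gives a distribution over outputs:
ys = np.array([dropout_net(0.5, rng) for _ in range(1000)])
print(ys.mean(), ys.std())     # summary of the induced distribution K(. | x=0.5)
```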

So why did we do all this? Because we want to be able to talk about Markov chains. We say that a Markov chain — let me draw it like this — is called homogeneous if the transition probability going from X_{n-1} to X_n is the same as going from X_n to X_{n+1}, for all possible n. This means, first, that all these variables need to live on the same space X, and second, that there is one Markov kernel which is used at all places, at all time steps. So that means we start with

— let's start at zero — we start with p(x_0), and then we have the product of K(x_{n+1} | x_n) over n. And be aware that here we use no abuse of notation: this x_0 corresponds to the random variable X_0, this value x_n to the random variable X_n — that is what is meant here. I take these values and plug them into my function, the transition probability; here there is no reference to any random

variable — I just take the value, plug it in, and get a distribution over X, and to know the probability of a particular next value I plug that in too. The statement is that the transition probability at each step is the same. You can also try to formulate this in the usual conditional-probability language, by saying that the conditional of X_{n+1} given X_n is somehow "the same" as that of X_n given X_{n-1}. And now you see why I cannot express this with our usual notation: because we implicitly say that the first argument belongs to

X_{n+1} and the second to X_n, while in the other expression the arguments belong to different random variables — so setting the two expressions "equal" does not say what I want. This is the reason we cannot express homogeneity with our usual abuse of notation; we have to make a little more effort and write: the probability that X_{n+1} = x given X_n = y equals the probability that X_n = x given X_{n-1} = y, for all n and for all x and y. This is why it is convenient to talk about transition probabilities — otherwise I

cannot write this down without these huge notations: I either have to write it like this or I have to use transition probabilities. And when these conditionals are all equal, they are equal to the transition probability of our Markov chain, and in that case the chain is called homogeneous, or time-homogeneous. (A small typo here: the product starts from zero, of course.)
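A minimal sketch of this (states and numbers invented for illustration): a homogeneous discrete Markov chain is fully specified by an initial distribution p(x_0) and one transition kernel K that is reused at every step, and its joint probability is exactly the factorization just written.

```python
import numpy as np

p0 = np.array([1.0, 0.0, 0.0])          # initial distribution p(x_0)
K = np.array([[0.7, 0.3, 0.0],          # one kernel K(x_{n+1} | x_n),
              [0.0, 0.6, 0.4],          # reused at every time step
              [0.0, 0.0, 1.0]])         # (homogeneity)

def sample_chain(n_steps, rng):
    """Sample x_0 ~ p0, then x_{n+1} ~ K(. | x_n) for every n."""
    x = rng.choice(3, p=p0)
    path = [x]
    for _ in range(n_steps):
        x = rng.choice(3, p=K[x])
        path.append(x)
    return path

def log_prob(path):
    """log p(x_0) + sum_n log K(x_{n+1} | x_n): the chain factorization."""
    lp = np.log(p0[path[0]])
    for a, b in zip(path[:-1], path[1:]):
        lp += np.log(K[a, b])
    return lp

rng = np.random.default_rng(3)
path = sample_chain(10, rng)
print(path, log_prob(path))
```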

1 Sequential Data Models - Markov Chains

Hi everyone, in this video I want to talk about sequential data models, and first we will have a look at Markov chains. Sequential data models usually refer to time series, where the data points are ordered by time — but it does not need to be time that orders our sequences; it could, for instance, also be a language model where the words come in a specific order. Let's first look at an example. This is a time series of the daily new corona cases in the Netherlands over the last weeks: there is a rise, then it falls, and then there is a rise again. Two things are interesting here. First, if a point is high, then it is very likely that the next point is also high,

and if a point is low, then the next point is likely also low. This means that samples which are close by are highly correlated, so we do not have an i.i.d. assumption: every point depends on the points just before and after it. The second thing is that, since this is historical data, we cannot resample it. We cannot rerun the experiment, where we basically reset the Netherlands and watch how the virus would spread again in some parallel universe. So this is not possible: we have only one sample per time point, and only one time series — we cannot repeat the experiment. Another point: say we are now in the summer,

the corona case numbers are down, and we condition on that and on the past: we would not know whether the numbers go up again or stay down. That means that whether the curve rises or falls, conditioned on the present (the summer), can be considered independent of what was going on before — knowing that it is low now, there is no further clue from whether it was up or down earlier. So we have a conditional independence of the future and the past given the present — something like a Markov property. So we could try modeling this with a Markov

chain model. But first let's summarize what we just said. First, we have correlated samples, so the i.i.d. assumption does not hold in general. Second, we usually have only one sample of the time series and we usually cannot rerun it. Sometimes we can try to argue that we can look at other countries, or at a very similar epidemic years before, but for other time series — say gross domestic product — we cannot reset the economy just to see how it would evolve again. So from a

frequentist point of view, where probabilities reflect long-run frequencies obtained by rerunning the experiment over and over again, assuming a true underlying probability distribution — or even modeling a time series probabilistically at all — is debatable. In the Bayesian setting this is less of a problem because of the notion of subjective probability: we can say we have some uncertainty about basically everything, we talk about a model at each time point, and we factor this in as prior knowledge or model assumptions. But under the frequency interpretation, a time series with no i.i.d. assumption and only one sample is problematic. To still make sense of it, we have

to state clearly what we assume — we usually need strong assumptions, and we need to be transparent about them. For instance, we may assume time stationarity: we have no i.i.d. assumption, but with some time stationarity we can still say something; or we assume stable transition probabilities, or a structured latent space, or something like that. It is a little bit like in regression, where we had to argue that points which are nearby are connected by some smooth function plus additive random noise. This kind of assumption we also have to make in

time series and sequential data models. So let us remind ourselves what Markov chains are. A Markov chain is a graphical model, a Bayesian network, given by a DAG which just looks like a chain up to some index N, and a probability distribution which can be written as p(x_0) times p(x_1 | x_0), and so on, until p(x_N | x_{N-1}). Now assume this is discrete, meaning that every x_n takes one of the values 1 to K. Then the question is: how many

probabilities do we have to specify? With N+1 variables, each taking K states, a full joint table would need on the order of K^(N+1) probabilities, whereas the chain is much simpler to specify via these conditional distributions. But what we are now interested in is: how much information does the past carry about the future, given the present? So

we have, for instance, that X_t is independent of X_0 given some X_n in between — and in general, if I condition on X_n, then the variables after it are independent of all variables before it: this is the future, this is the past, and that is the present. This means that if I want to predict the future — say we condition on X_{t-1}, just the node before — then the past is not helpful: the future is independent of the past given the present,

so the past has no additional information about the future beyond the present. And this is fine if it reflects reality. But we could argue, for instance in the example we just discussed, that even if I know the present state, there is still information from before: for instance, the number of people who have already died, or who have already been infected, can have an influence on how many

people can still get infected or die in the future — so the currently infected people alone might not be enough information to say something about the infected people in the future. One could therefore argue that there is information in the past, beyond the single present point, that tells us something about the future. But let us first give an example of a Markov chain, an example where we could argue that a Markov chain makes sense. Let's consider a simplified SIR-type model where we have one person only, and we say that through time

this person is in one of three states: either susceptible to the corona virus — meaning still healthy, not infected yet, but also not yet gone through the disease — or infected, or recovered. So there are three classes, and this one person can go from one of these classes to another. What we then have is a discrete time series where at each time point the person is in one of three classes, and transitions between classes happen with some transition probabilities. Say the transition probability of going from susceptible to infected

is given by beta, and this can depend on the time: at each time point it could vary, for instance as the number of infected people in the country changes — then this one person gets infected with a higher or lower probability. And say gamma is the rate, the probability, with which an infected person recovers. Then we can just fill out all the other quantities in our model. First assume the person is susceptible: there are three cases, we can go from S to S, from S to I, or from S to R. We say a person cannot go directly from susceptible to recovered, so that is zero; from S to I we just said the probability is beta_t;

and then from S to S we assign 1 minus beta_t as the probability to just stay in that status. Once a person is infected there is no way back in this model — we cannot go back from infected to susceptible — but we can go from infected to recovered, with probability gamma_t, and we stay infected with probability 1 minus gamma_t. Then the question is: from recovered, can I go back? In this model we just say this is zero — once recovered, we stay recovered. So this is a very simplified, individual-level SIR model where we model

each transition with these probabilities, and these transition probabilities could depend on t. If they do not depend on t, then this Markov chain is homogeneous. We can argue that this is unrealistic unless the country is in a stable position where the number of infected people is stable; in that case we could argue that this is a homogeneous Markov chain, and at each time step the transition

probabilities are all the same. Often we put all these numbers into a matrix, the transition matrix, using these probabilities as entries. This is a 3x3 matrix, because there are three states we can come from and three states we can go to — a 3x3 transition matrix.
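As a small illustration (the beta and gamma values are invented), the 3x3 transition matrix of this simplified SIR chain, a check that each row is a distribution, and — using homogeneity — the state distribution after t days obtained as the initial distribution times the t-th matrix power:

```python
import numpy as np

beta, gamma = 0.1, 0.05            # made-up infection / recovery probabilities

# Rows: current state, columns: next state, order (S, I, R).
A = np.array([
    [1 - beta, beta,      0.0  ],  # from S: stay susceptible or get infected
    [0.0,      1 - gamma, gamma],  # from I: stay infected or recover
    [0.0,      0.0,       1.0  ],  # from R: stay recovered
])
assert np.allclose(A.sum(axis=1), 1.0)         # each row is a distribution

# Under homogeneity, the state distribution after t days is p0 @ A^t.
p0 = np.array([1.0, 0.0, 0.0])                 # the person starts susceptible
for t in (0, 10, 30, 100):
    print(t, np.round(p0 @ np.linalg.matrix_power(A, t), 3))
```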

There are now several graphical representations, different from the usual graphical model representation, to visualize this. We could say: this is S, this is I, and this is R, recovered, each drawn at the different time points. Then we draw an edge, usually only when there is a non-zero probability, and the weight of this edge corresponds to the probability of going from this state to that state. For instance, we can stay in S, or go

from S to I, but we consider it not possible to go directly from S to R, so we would usually not draw those arrows. Then we can go from I to I, i.e. stay there, or from I to R, from infected to recovered, while going back is considered not possible; and if we are recovered we stay recovered, again with no way back. So we actually only draw the arrows that are possible in this diagram. If we have a homogeneous Markov chain, then we can also

just make one box per state and draw the transition probabilities directly in this diagram. In our model it is not possible to go from S directly to R, and the reverse transitions are also not possible, so those entries are zero. We can stay in I, and we can also stay in S,

and we can go from I to R but not back, so that is also zero; and then we just stay in the recovered phase, so that self-loop probability is one. This is how you can represent homogeneous discrete Markov chains and the transitions between their states. Now we can also consider higher-order Markov chains. First, let us consider a usual Markov chain: there we saw that if I take a time point and condition on it, then all the past is independent of the future given

this point in between. If we want to model time series with longer history dependence, what we can do is just add more dependencies: not just going one step back, but two steps back — what about this arrow? These are called higher-order Markov chains, depending on how many steps back the arrows go. Our probability distribution would then again factorize with p(x_0), then p(x_1 | x_0), and then probability distributions of x_n given

not only x_{n-1} — which was the usual Markov chain condition — but also x_{n-2}, and this runs from n = 2 to N. So this is order m = 2, but we could of course consider more dependencies by adding more and more arrows. For higher orders we go

back in the conditioning until x_{n-m}, and we then need m initial distributions for the starting positions. If we model it like this, what we get is that the future, say X_t, is independent of the past given the m consecutive values in between: we

have the future, we have the past, and we need m present values so that we can discard the past. But maybe it is more important to state what the dependence is: if I condition on fewer values, say only on x_{n+1}, then there is still dependence — for instance for a Markov chain of order two or higher. So we now have an intermediate kind of dependency: the future depends on the previous m values, and beyond that it is independent. In the context of

our corona example this might be a bit more realistic: if we have a value here in the future and we condition on, say, these m values, then it is independent of what happened further back. Of course this can still be criticized: one could say this is a very sensitive system, so it will always depend on the starting position, and then what we would need are dependencies that go all the way back. One way to try this is just to increase the order m, going back further and further to get a better and better model.

The problem with that is: if we have K states for each of these variables, then the transition probability needs a table with K to the power of m plus one entries, and this scales exponentially in m if we keep making it bigger. For example, with K = 3 states, order m = 1 needs a 3 x 3 = 9-entry table, but order m = 5 already needs 3^6 = 729 entries. Let us summarize what we just said. Say we have a Markov chain of higher order m with K states per variable; then we have seen that the future is independent of the past given

m consecutive intermediate values. That is what we just said, but it also means we lose all information of the past beyond these m time steps. To overcome this we need to increase m; the problem is then that we need to encode the transition probability of x_n given x_{n-1} down to x_{n-m}, and even under a homogeneity assumption — the transition probabilities all being the same — we still have something that scales exponentially in m. One way out is studying state

space models, which in the discrete case are called hidden Markov models and in the linear Gaussian case linear dynamical systems, and where the dependencies go basically all the way back to the starting point. So to summarize: if we want to model time series, we can use Markov chains or higher-order Markov chains, but if we want a model that captures longer time dependencies, it is inefficient to use higher-order Markov chains, as they scale exponentially in the order m.

2 Sequential Data Models - State Space Model

Hi everyone, in this video I want to talk about the state space model. The state space model is a latent variable model and a graphical model — a Bayesian network with latent variables. The z's on top are considered latent, not observed, and the x's are usually the data points that can be observed. You can see that on top we have a standard Markov chain, and we assume homogeneity: the transition probabilities are always the same, given by the same function. Then we have emission probabilities going from the latent space to the observation space. The main idea here is that there is no direct relation between the observations — the observations are

just mere projections of the latent space, and everything that happens happens in the latent space. You can also see this graphically: all the variables are dependent through this chain, and the paths cannot be blocked by, for instance, conditioning on another observation. Even if I condition on this observation, this path is still open, and this one as well — I can condition on all of them and all these paths through the latent variables stay open. If we were able to block this path by conditioning on one of the latent variables, then of course we could block the information flow between the observed

variables. But this is the point about latent variables: we do not observe them, so we cannot actually condition on them. So let's write down the probability distribution of this model. Writing all the variables of the distribution — call the observed ones capital X and the latents capital Z — it is given by the distribution over the initial latent, then the emission for x_0 given z_0, and then we always have transition, emission, transition, emission:

we can write the product over n of p(z_n | z_{n-1}) times p(x_n | z_n), up to capital N. This is the factorization of this distribution. We call the first factor the initial distribution; then we have the transition probabilities, and we usually assume homogeneity, meaning that there is one function — call it T(z_n, z_{n-1}) — which is basically a

function of two variables, and the transition probability is given by this same function at every step — here, here and here — that is homogeneity. Then we have emission probabilities E(x_n | z_n), which, if we again assume homogeneity, are also all given by one function E(x_n, z_n). So, under homogeneity, the names are: the initial distribution, the transition probabilities in the latent chain, and the ones going from latent space to observed space, which are called the emission probabilities.
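To make this factorization concrete, here is a minimal sketch (a discrete toy example, all numbers invented) that evaluates the joint probability of a latent path and an observation sequence exactly as the product written above: the initial distribution, then alternating transition and emission factors, with the same T and E at every step.

```python
import numpy as np

p0 = np.array([0.5, 0.5])           # initial distribution over 2 latent states
T  = np.array([[0.9, 0.1],          # transition kernel T(z_n | z_{n-1})
               [0.2, 0.8]])
E  = np.array([[0.7, 0.3],          # emission kernel E(x_n | z_n), 2 observation symbols
               [0.1, 0.9]])

def joint_prob(zs, xs):
    """p(x_{0:N}, z_{0:N}) = p(z_0) E(x_0|z_0) * prod_n T(z_n|z_{n-1}) E(x_n|z_n),
    with the same T and E at every step (homogeneity)."""
    p = p0[zs[0]] * E[zs[0], xs[0]]
    for n in range(1, len(zs)):
        p *= T[zs[n - 1], zs[n]] * E[zs[n], xs[n]]
    return p

print(joint_prob(zs=[0, 0, 1, 1], xs=[0, 0, 1, 1]))
```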

So, again, a state space model is called homogeneous if all the transition probabilities use the same Markov kernel and all the emission probabilities also use one common kernel. Then we have two corner cases: the model is called a hidden Markov model if we assume homogeneity and a discrete latent space, and a linear dynamical system if we assume a Gaussian latent space and linear transition probabilities.

As for the emission probabilities, even though they all use the same Markov kernel, they can be fairly arbitrary, and if we want to be very general we could for instance use exponential families. Once we have defined the state space model, we want to do something with it — all the usual things we do with such a model. We want to infer the latent states: given the data, we want to infer what the latent state distributions look like, for instance the marginals, or the pairwise distributions of latent variables that follow each other.

We also want to know what the most likely latent state sequence is given the observations — for a whole observation sequence, we want to know from which latent states it most likely came — and we want to be able to do this for any parameter value, so if the model is parameterized, for every parameter. Of course we also want to fit the state space model to the data, which can be done for instance with a maximum likelihood approach, taking the parameterization that maximizes the likelihood of the observations. But, as we know, in latent variable

models this is very complicated, so we rely on variational inference approximations or variational inference methods, maximizing the ELBO, for instance with an EM-type scheme. And of course we want to do predictions: we want a predictive distribution after fitting, or for any value of the parameters — usually for the one we just fitted. For this, inference on the last latent state is very helpful, because as soon as we know the parameters, either fitted or given, we can just use the transition probabilities and emission probabilities to get the distribution of the next data point. So we could

take the emission of x_{N+1} given the latent z_{N+1} with parameters theta, then the transition to z_{N+1} from z_N, also with the same theta, and then we only need the distribution of z_N given the data and the parameters, and we marginalize over z_{N+1} and z_N. And as you can see, if we want this, we can use inference methods to infer this last latent distribution; the remaining part is easier, as it is just a Markov chain going to the last

node: we then use this distribution, the transition probability and the emission, and we have to solve this integral — either analytically, when it is given analytically, or with some numerical approximation or some sampling methods you might learn about later.
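A small sketch of this one-step-ahead prediction for a discrete latent space (numbers invented), assuming the filtered distribution p(z_N | x_{0:N}) has already been obtained by some inference method: marginalize z_N through the transition kernel, then push the result through the emission probabilities.

```python
import numpy as np

T = np.array([[0.9, 0.1],        # transition kernel T(z_{n+1} | z_n)
              [0.2, 0.8]])
E = np.array([[0.7, 0.2, 0.1],   # emission probabilities E(x | z) over 3 symbols
              [0.1, 0.3, 0.6]])

def predictive(filtered_zN):
    """p(x_{N+1} | x_{0:N}) = sum_{z_{N+1}, z_N}
       E(x_{N+1} | z_{N+1}) T(z_{N+1} | z_N) p(z_N | x_{0:N})."""
    z_next = filtered_zN @ T        # marginalize z_N: p(z_{N+1} | x_{0:N})
    return z_next @ E               # marginalize z_{N+1}: distribution over x_{N+1}

# Assumed output of the inference step on the last latent state:
filtered = np.array([0.75, 0.25])
print(predictive(filtered))         # a distribution that sums to one
```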

3 Sequential Data Models - Hidden Markov Models - Inferring the latent states

Hi everyone, in this video we will look at hidden Markov models and how to infer the latent states. We have seen — we defined — the hidden Markov model as a state space model which is homogeneous and where the latent states are discrete. So here we assume that the latents z_n take values in the classes 0, 1, ..., K, i.e. K + 1 classes. Then we use the one-hot encoding as always: z_{nk} is the indicator that z_n is of class k. We have an initial distribution which is discrete, so we have a categorical distribution there. If we

write the initial distribution p(z_0) with parameters pi, where pi_k is the probability of each class — this is a parameter — then we can write it as the product over k from 0 to K of pi_k to the power z_{0k}, with parameters pi_0 up to pi_K, which we abbreviate with pi. So this was the starting distribution. Now let's consider the transition part, going from one latent to the next, and use homogeneity, so we basically have one transition

matrix, and all the probabilities are given just by looking at its entries. So we have the probability of z_n given z_{n-1} and a transition matrix, call it A, and we write the product over all classes k from 0 to K and all classes j from 0 to K — representing the class of z_{n-1} and of z_n respectively — of the probability of going from class j

to class k, which is written in the matrix: we just take the entry A_jk and raise it to the power z_{n-1,j} times z_{nk}. Here A is the parameter, the transition matrix, and A_jk is the probability of going from j to k, meaning that if z_{n-1} is in state j, the probability of

ending up in state k for z_n is given by A_jk. Of course we have the condition that the sum over k of A_jk equals one — each row sums to one. So we have now successfully separated the arguments from the parameters.
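A tiny check of this one-hot product formula (matrix values invented): with z_{n-1} and z_n one-hot encoded, the double product over j and k picks out exactly the single entry A[j, k] of the transition matrix.

```python
import numpy as np

A = np.array([[0.7, 0.2, 0.1],     # transition matrix, rows sum to one
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

z_prev = np.array([0, 1, 0])       # one-hot: previous latent state is class 1
z_curr = np.array([0, 0, 1])       # one-hot: current latent state is class 2

# p(z_n | z_{n-1}) = prod_{j,k} A[j,k] ** (z_prev[j] * z_curr[k])
prob = np.prod(A ** np.outer(z_prev, z_curr))
print(prob, A[1, 2])               # both 0.1: the product selects one matrix entry
```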

Then we have the emission probabilities, where we do not need to commit to a specific kind of model: we write p(x_n | z_n, parameters), and I might assume that for every class in the latent space this is an exponential family. So, for instance, we first check which class z_n is in, say class k, and then we use h_k(x_n) times the exponential of the natural parameters eta_k dotted with the sufficient statistic, minus the log partition function A_k — it is a bit unfortunate that this A is the same letter as the transition matrix, but it is a different object. So this is an exponential family per class; we check which of the distributions to use, and

then we take z_{nk} as the exponent. This is just one example — we could use other emission probabilities as well (and by the way, an index n is missing here). The joint distribution is then just the product of all these conditional distributions, as it is a Bayesian network consisting of these components. So what would be an interpretation of this model? We could think that we have only K classes in the latent space — say red, blue and green — and we assume that each of the K classes gives us a normal distribution with its own variance

or covariance matrix in the observed space: the mean is here for this class, here for that class, and here for the third, and each latent class has its own mean and covariance in the observed space. These are the emission probabilities given class 1, 2 or 3. Then one data point may start here: z_0 is, say, "green" — z_0 has no position in the x space, it is just the discrete label — and x_0 is then sampled from the green emission distribution and could land at this point. Then for the next step we

sample z_1 from the transition probability: it could stay green, or go to one of the other classes — say it transitions to blue, so z_1 is blue. We then do an emission by sampling from the blue distribution and end up at this point. Then we transition again, again sample blue, sample from the emission again and land here, and so on — we stay blue, blue, blue, since transitions do not happen very often. At some point, say z_10, there is a transition to red, and sampling the observation from the red emission distribution gives this point; we then stay red for a while, sampling observation after observation from the red emission. After, say, 20 steps there is a transition in the latent space to green, we sample from the green emission and end up over here, and so on. So this is how we can interpret the discreteness of the latent space: we basically have a mixture, where each component, each class, has its own emission probability, which could for instance be a Gaussian.
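A small sketch of this generative picture (all parameters invented): three latent classes with a sticky transition matrix, each class emitting from its own Gaussian, so consecutive observations cluster around the current class mean until a latent transition happens.

```python
import numpy as np

pi    = np.array([1.0, 0.0, 0.0])           # start in class 0 ("green")
A     = np.array([[0.95, 0.04, 0.01],       # sticky transitions: switches are rare
                  [0.01, 0.95, 0.04],
                  [0.04, 0.01, 0.95]])
means = np.array([[0.0, 0.0],               # per-class emission means in 2-D
                  [4.0, 0.0],
                  [2.0, 3.0]])
STD = 0.5                                    # shared emission std, for simplicity

def sample_hmm(n_steps, rng):
    """Sample latent classes z_n and observations x_n ~ N(means[z_n], STD^2 I)."""
    z = rng.choice(3, p=pi)
    zs, xs = [z], [rng.normal(means[z], STD)]
    for _ in range(n_steps):
        z = rng.choice(3, p=A[z])
        zs.append(z)
        xs.append(rng.normal(means[z], STD))
    return np.array(zs), np.array(xs)

rng = np.random.default_rng(6)
zs, xs = sample_hmm(30, rng)
print(zs)            # runs of the same class, occasionally switching
```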

Let's talk about an example. Consider we have a person and we want to know the corona status of that person, which is considered

unknown, and we assume a very simplified SIR model for it. So the status of our person at time t could be susceptible, infected, or recovered — either infected, recovered, or neither yet. We do not observe this; what we observe are the symptoms of the person, recorded every day, and the person could have no symptoms, a cough, or fever. So here the emission probabilities are also discrete, and the observed states are discrete as well — everything is discrete. We have transition probabilities between S, I and R,

so that is the 3x3 transition matrix, and then we have an emission probability, which is also a 3x3 matrix, from the three latent states to the three observed symptoms. Now, how can we infer, just from the observations recorded every day, what corona status the person has? First of all, someone who is not infected with corona can still have a cough or fever, and a person who is infected may have no symptoms at all — all of these observations occur with some probability in every state. So what we do is make the simplifying assumption that we have a homogeneous Markov chain, the simplified SIR model, which for the moment does not change over time —

let's say the corona situation is stable for a few weeks and we assume this is a good approximation to reality — and then we have the emission probabilities. The task is then to infer the corona state at each time step given the symptoms observed during those days: we are inferring the latent states, since the corona status is latent. Maybe we even want to know, for two consecutive time steps, how the states are correlated, how they depend on each other — so we do not only want the marginals, we also want the pairwise marginals of consecutive latent states. Let us write this down formally.

So we have a hidden Markov model, which comes with an initial distribution p(z_0 | π), with an emission probability p(x_0 | z_0, θ), and with the product over n of the transitions p(z_n | z_{n-1}, A), given the transition matrix A, and the emissions p(x_n | z_n, θ). The joint distribution is p(x_{0:N}, z_{0:N}) = p(z_0 | π) p(x_0 | z_0, θ) ∏_{n=1}^{N} p(z_n | z_{n-1}, A) p(x_n | z_n, θ).

The parameters are π, A and θ. If we were doing learning we would fit them, but we assume now, for the inference part, that they are already known; say for the corona case the transitions are given. Then we have data, and as you can see, for every single time step we have only one observation. So we are not i.i.d., there is really just one observation per time step. We assume there are latent variables z_n corresponding to the data points, but we do not know them, and they are classes k = 0, ..., K. We have the transition matrix A_{jk}, with j and k going from 0 to K, the

initial class parameters π_k, and the emission parameters θ_k for each class, which could be a vector for each k. The goal of inference, as we said, is to compute the distribution p(z_n | x_0, ..., x_N), which we can consider a posterior distribution: after observing the x's we want to infer the latent distribution. We also want the pairwise marginals p(z_{n-1}, z_n | x_0, ..., x_N), to see how consecutive latents depend on each other and to get some information about the transitions. The π, as I said, is given, so on the next slide we might just drop it from the notation to make things easier to write down.

So let's have a look. We begin with the hidden Markov model joint distribution, and we can mark what we actually observe: this x is observed, this x is observed, and so on. Now we want to turn it into a factor graph so we can do inference and get these posterior distributions. With the dependencies on the data the model looks complicated, but as you can see, each x_n enters through exactly one factor: x_0 appears only here, in a factor that depends on a single

latent variable, and every other x_n appears in a factor that depends on two consecutive latents. So we can join these terms into factors: we define ψ_0(z_0) = p(z_0 | π) p(x_0 | z_0, θ) and ψ_n(z_{n-1}, z_n) = p(z_n | z_{n-1}, A) p(x_n | z_n, θ). Then we can just draw the factor graph. For the factor graph we need the maximal cliques after

moralization, and here there are no colliders among the latents, so everything just turns into undirected edges, with these factors included. So we basically get a linear-chain factor graph.

We have ψ_0 connected to z_0, then a factor ψ_1 which depends on z_0 and z_1, then z_1, then ψ_2 and z_2, and so on, until z_N. Now we could declare any of the nodes a root node and start message passing from both sides independently towards it, and as soon as a node has both messages we can compute its marginal; with the evidence factors included, this marginal is the posterior distribution given the data, the

marginal distribution of the latents given the data. That is basically the sum-product algorithm; we already know how message passing works, so we are essentially done. What we can do now is write it down and simplify it a bit. For computational reasons we often declare the last node the root node, and instead of going from both sides independently we first do a forward pass and then a backward pass, even though in principle we could start from either side independently and compute the marginal, the belief, as soon as a node has messages from both sides.

So let's write down the messages first. Consider the messages at the leaves. There is a message from ψ_0 to z_0, and this is just the factor: μ_{ψ_0→z_0}(z_0) = ψ_0(z_0). And the message from z_N, which is a leaf variable node, into its factor is just 1. Then we have the messages from the

variable nodes, for example from z_{n-1} to the next factor ψ_n, which depends on the variables z_{n-1} and z_n. What a variable node does is aggregate the incoming messages and send their product further; but here there is only one incoming message, so it basically just takes this message and sends it on, and the same happens in the other direction towards z_0. This means

that the message going from z_{n-1} to ψ_n is just the incoming message, the message from the previous factor ψ_{n-1} to z_{n-1}, sent on as a function of z_{n-1}. Next we have to look at the messages going from a factor to the next variable, from ψ_n to z_n. The factor takes the incoming message, multiplies the factor with it,

and then marginalizes out the old variable and sends the result on as a function of the new variable. Here we have discrete distributions, so the marginalization is a sum, as in the sum-product algorithm: μ_{ψ_n→z_n}(z_n) = Σ_{z_{n-1}} ψ_n(z_{n-1}, z_n) μ_{z_{n-1}→ψ_n}(z_{n-1}). And since the variable-to-factor message equals the previous factor-to-variable message, we can plug it in directly and always jump over one step: instead of passing through the variable node we go directly from one

factor-to-variable message to the next. So we only ever consider factor-to-variable messages. For this application people call them alpha, and

that is just a definition: α_n(z_n) := μ_{ψ_n→z_n}(z_n). Then we have α_n(z_n) = Σ_{z_{n-1}} ψ_n(z_{n-1}, z_n) α_{n-1}(z_{n-1}), the alpha recursion, which is just a very special case of our sum-product algorithm, with α_0(z_0) = ψ_0(z_0). We run this for n = 1 up to N, and we call it the forward pass.

Now let's look at the backward pass. First we have a variable node at the end, a leaf, so the message from z_N into the last factor is 1. Then we have a factor-to-variable message, from ψ_n to z_{n-1}: μ_{ψ_n→z_{n-1}}(z_{n-1}) = Σ_{z_n} ψ_n(z_{n-1}, z_n) μ_{z_n→ψ_n}(z_n), where we take the factor ψ_n(z_{n-1}, z_n) and

the message which came in from the right, from ψ_{n+1} to z_n. The variable node z_n just copies this incoming message, the one from the factor one step higher, and passes it on

to ψ_n as a function of z_n. We call these messages beta, β_n(z_n), and what we get is the factor ψ_n(z_{n-1}, z_n) times β_n(z_n), summed over z_n. Turning this around a bit, we get

β_{n-1}(z_{n-1}) = Σ_{z_n} ψ_n(z_{n-1}, z_n) β_n(z_n). This is called the backward pass, and we run it for n = N down to 1, with the initial message β_N(z_N) defined to be 1. So let's write this all together.

We have our factor chain, the forward pass α_0(z_0) = ψ_0(z_0), α_n(z_n) = Σ_{z_{n-1}} ψ_n(z_{n-1}, z_n) α_{n-1}(z_{n-1}), and the backward pass β_N(z_N) = 1, β_{n-1}(z_{n-1}) = Σ_{z_n} ψ_n(z_{n-1}, z_n) β_n(z_n). We can also say explicitly what these factors are: ψ_0(z_0) = p(z_0 | π) p(x_0 | z_0, θ), and ψ_n(z_{n-1}, z_n) is

the transition probability times the emission, p(z_n | z_{n-1}, A) p(x_n | z_n, θ). And then we finally get the variable beliefs: p(z_n | x_{0:N}) is proportional to α_n(z_n) times β_n(z_n). For the pairwise marginals we use

the factor beliefs: p(z_{n-1}, z_n | x_{0:N}) is proportional to α_{n-1}(z_{n-1}) times the factor involving z_{n-1} and z_n, which is given by p(z_n | z_{n-1}, A) p(x_n | z_n, θ), times the other incoming message, which is β_n(z_n).
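Before turning to the normalization constant, here is a minimal NumPy sketch of these unscaled alpha-beta recursions and belief formulas; the function name and input layout are my own choices for illustration, and the emission probabilities p(x_n | z_n = k) are assumed to be pre-evaluated at the observed data points:

```python
import numpy as np

def forward_backward(pi, A, emission_probs):
    """Unscaled alpha/beta recursions for a discrete HMM.

    pi:  (K,) initial class probabilities p(z_0 = k)
    A:   (K, K) transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j)
    emission_probs: (N+1, K) array with p(x_n | z_n = k) evaluated at the data.
    Returns per-step marginals p(z_n | x_{0:N}) and pairwise marginals.
    """
    N1, K = emission_probs.shape
    alpha = np.zeros((N1, K))
    beta = np.ones((N1, K))

    # forward pass: alpha_0 = psi_0, alpha_n = sum_{z_{n-1}} psi_n * alpha_{n-1}
    alpha[0] = pi * emission_probs[0]
    for n in range(1, N1):
        alpha[n] = emission_probs[n] * (alpha[n - 1] @ A)

    # backward pass: beta_N = 1, beta_{n-1} = sum_{z_n} psi_n * beta_n
    for n in range(N1 - 1, 0, -1):
        beta[n - 1] = A @ (emission_probs[n] * beta[n])

    # variable beliefs, normalized at the end
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # factor beliefs p(z_{n-1}, z_n | x) for n = 1..N
    xi = np.zeros((N1 - 1, K, K))
    for n in range(1, N1):
        xi[n - 1] = alpha[n - 1][:, None] * A * (emission_probs[n] * beta[n])[None, :]
        xi[n - 1] /= xi[n - 1].sum()
    return gamma, xi
```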

We also know that the normalization constant in all these cases can be computed locally, and it is always the same, namely the evidence of the data given the parameters: we marginalize out the latents, and the claim is that it can

be computed as Z = Σ_{z_n} α_n(z_n) β_n(z_n). This is the normalizing constant for each of these beliefs, and it is the same for all n. So in total what we do is: we run the alpha recursion forward along the factor graph according to this equation, then we pass backwards using these equations, then we just multiply the messages and we get our posterior beliefs of the latent

states, and also the pairwise ones by multiplying the two incoming messages with the factor in between; there is nothing new about it, and the normalizing constant we always get afterwards by normalizing. That's basically it, but this algorithm has some numerical instabilities when the chain becomes very long, because these numbers get extremely small. There is a version which gets rid of these numerical issues by normalizing the alphas at every step: at every step we divide by the sum over all states and pass the normalized message on. That's what we write down next.

So what we do is: we define α̃_0(z_0) := ψ_0(z_0). Then we declare c_0 to be the normalizing constant, which is what we need to add up, c_0 = Σ_{z_0} α̃_0(z_0), and we declare α̂_0(z_0) := (1/c_0) α̃_0(z_0). This way it becomes normalized and it becomes a probability

distribution in z_0. Then we declare α̃_n(z_n) := Σ_{z_{n-1}} ψ_n(z_{n-1}, z_n) α̂_{n-1}(z_{n-1}). This is basically the same as before, just that instead of the previous α we use the rescaled version, which differs only by a constant, and this constant does not depend on any variable, so it simply pulls out of the sum. And now

we declare c_n := Σ_{z_n} α̃_n(z_n) (maybe we could have written c_0 more clearly in the same way), and then α̂_n(z_n) := (1/c_n) α̃_n(z_n), which again normalizes it, so α̂_n is a probability distribution in z_n.

This per-step rescaling is numerically more stable. We do this for n = 1, ..., N, and then we come to the backward pass. For the backward pass we take β̂_N(z_N) := 1, and then for n going from N down to 1 we set β̂_{n-1}(z_{n-1}) to be the sum over z_n of

ψ_n(z_{n-1}, z_n) β̂_n(z_n), as before, but now at each step we additionally rescale by the factor 1/c_n. For that reason the c_n need to be stored in the forward pass: we really need to store these quantities, and in the backward pass we rescale with them again. Let me write this down.

What we then get is, first, p(z_n | x_{0:N}, θ) = α̂_n(z_n) β̂_n(z_n); everything is already normalized. And then our factor beliefs, where we have to take c_n into account for the scaling: p(z_{n-1}, z_n | x_{0:N}, θ) = (1/c_n) α̂_{n-1}(z_{n-1}) times

the factor as before, the transition p(z_n | z_{n-1}, A) times the emission p(x_n | z_n, θ), times the other belief β̂_n(z_n). We also know that if we want the evidence, we can just multiply all the c_n together: p(x_{0:N} | θ) = ∏_{n=0}^{N} c_n. This concludes the rescaled sum-product algorithm for the hidden Markov model.
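A sketch of this rescaled version, under the same assumptions and naming conventions as the previous code block:

```python
import numpy as np

def scaled_forward_backward(pi, A, emission_probs):
    """Rescaled alpha/beta recursions; numerically stable for long chains.

    Same inputs as before; returns alpha_hat, beta_hat, the scaling constants
    c_n, and the log evidence log p(x_{0:N} | theta) = sum_n log c_n.
    """
    N1, K = emission_probs.shape
    alpha_hat = np.zeros((N1, K))
    beta_hat = np.ones((N1, K))
    c = np.zeros(N1)

    # forward pass with per-step normalization
    alpha_tilde = pi * emission_probs[0]
    c[0] = alpha_tilde.sum()
    alpha_hat[0] = alpha_tilde / c[0]
    for n in range(1, N1):
        alpha_tilde = emission_probs[n] * (alpha_hat[n - 1] @ A)
        c[n] = alpha_tilde.sum()
        alpha_hat[n] = alpha_tilde / c[n]

    # backward pass, rescaled with the stored c_n
    for n in range(N1 - 1, 0, -1):
        beta_hat[n - 1] = (A @ (emission_probs[n] * beta_hat[n])) / c[n]

    log_evidence = np.log(c).sum()
    return alpha_hat, beta_hat, c, log_evidence
```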

As an exercise, you can prove that these rescaled quantities in the end turn out to be exactly the right posteriors. The last thing about inference in hidden Markov models is inferring the most likely joint hidden state: we want to know the most probable configuration of all the latents together, not just the marginals, i.e. we want to maximize p(z_{0:N} | x_{0:N}). For this we use the same factor tree as before and run the max-sum algorithm with basically no adjustments, so essentially the same as before; in this special case it is also called the Viterbi algorithm, so there is not much to change and not much more to say. This was

the inference part about hidden Markov models; the details are left as an exercise.
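As a small illustration of the max-sum (Viterbi) variant just mentioned, here is a sketch in log space; again the names are hypothetical and the emission probabilities are assumed pre-evaluated:

```python
import numpy as np

def viterbi(pi, A, emission_probs):
    """Max-sum (Viterbi) in log space: most likely joint state sequence.

    Returns the arg-max path z_0, ..., z_N as an array of class indices.
    """
    N1, K = emission_probs.shape
    log_pi, log_A = np.log(pi), np.log(A)
    log_em = np.log(emission_probs)

    omega = np.zeros((N1, K))              # max-sum messages
    backptr = np.zeros((N1, K), dtype=int)

    omega[0] = log_pi + log_em[0]
    for n in range(1, N1):
        scores = omega[n - 1][:, None] + log_A   # (K, K): from class j to class k
        backptr[n] = scores.argmax(axis=0)
        omega[n] = scores.max(axis=0) + log_em[n]

    # backtrack the most likely path
    path = np.zeros(N1, dtype=int)
    path[-1] = omega[-1].argmax()
    for n in range(N1 - 1, 0, -1):
        path[n - 1] = backptr[n, path[n]]
    return path
```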

4 Sequential Data Models - Hidden Markov Models - Fitting HMMs to data

Hi everyone, in this video we will talk about how we can fit a hidden Markov model to data. Let's recall what our model looks like. A hidden Markov model is a state space model where we have latent variables that we cannot observe, and observed variables, the data, with only one data point per time step. We parameterize it as follows: it comes first with an initial distribution with parameter π; in the hidden Markov model all latent variables are considered discrete, with K + 1 states, and we use the one-hot encoding for the z_n.

Our parameters are given by the initial class parameters π_k we have seen here, the emission parameters θ_k, and the transition parameters A_{jk}, with j and k going from 0 to K: A_{jk} parameterizes the probability of going from class j to class k, collected in the transition matrix. The emission parameters θ_k, for k = 0 to K, can be vectors; each class has its own set, and we might consider

them as the natural parameters of an exponential family, so each emission distribution is an exponential family. What we want to do now is find the best set of parameters for our observations, and this is usually a maximum likelihood approach: we maximize the likelihood with respect to this set of parameters, with the given data plugged into our likelihood function. But since we have a latent variable model, this approach is usually not feasible; it is infeasible. So instead we use the EM algorithm.

The EM algorithm is based on variational inference and on maximizing the ELBO, the evidence lower bound. But first let us write down what the joint probability distribution looks like: p(x_{0:N}, z_{0:N} | θ) = p(z_0 | π) p(x_0 | z_0, θ) ∏_{n=1}^{N} p(z_n | z_{n-1}, A) p(x_n | z_n, θ), the initial distribution, the emission probability of x_0, and then the product of transitions, given the transition matrix, times the emission probabilities. So now we need to write down

the E step and the M step of the EM algorithm, and the objectives there. Let's do this, and let's consider the E step first. For the E step we need to compute the function Q(θ, θ_old), where θ_old denotes the old parameters of our model. This is an expectation value over z, given our data and the old distribution, the old parameters,

of the log joint distribution with the new parameters: Q(θ, θ_old) = E_{p(z | x, θ_old)}[ log p(x, z | θ) ]. Now, we have just seen what the joint distribution looks like, so let's copy it over here. If we take the log, the product splits up: we get log p(z_0 | π) plus log p(x_0 | z_0, θ), plus

the sum over n = 1 to N of log p(z_n | z_{n-1}, A) plus log p(x_n | z_n, θ), all inside the expectation brackets. So how do all these terms look? We always want to separate the parameters from the arguments. With the one-hot encoding, the initial term is the product over k = 0 to K of π_k^{z_{0k}}, with z_{0k} acting as an indicator;

the emission probability, which we do not want to specify further for now, can be written as the product over k of p(x_n | θ_k)^{z_{nk}}; and the transition term can be written as A_{jk}^{z_{n-1,j} z_{nk}}, with the product over j = 0 to K and over k = 0 to K.

Then we can go on with turning everything into sums by taking the logs of these products. We get the sum over k = 0 to K of z_{0k} log π_k, plus the emission terms, which together give the sum over n = 0 to N and k = 0 to K of z_{nk} log p(x_n | θ_k), plus, from the transitions, the sum over n = 1 to N,

j = 0 to K and k = 0 to K of z_{n-1,j} z_{nk} log A_{jk}. (For the emission terms we have rewritten p(x_n | z_n, θ) with the indicators z_{nk} pulled out in front and a sum over all k: whenever z_{nk} = 1 the corresponding term survives, and otherwise it just drops out.) Now, the expectation value is linear, and the randomness with respect to the old parameters sits in the z's, which occur here, here and here.

So the expectation simply goes through the sums and acts on each z term. What do we get? We get the sum over k = 0 to K of E[z_{0k} | x, θ_old] log π_k, plus

the sum over n and k of E[z_{nk} | x, θ_old] times the emission term log p(x_n | θ_k) for class k, plus the sum over n, j and k of E[z_{n-1,j} z_{nk} | x, θ_old] times log A_{jk}.

Now we make some abbreviations: we call this quantity γ_{0k}, this quantity γ_{nk}, and this quantity ξ_{n,jk}, i.e. γ_{nk} := E[z_{nk} | x, θ_old] and ξ_{n,jk} := E[z_{n-1,j} z_{nk} | x, θ_old]. Then our objective looks as follows.

Maybe we put hats or tildes on them to indicate that it is the old parameters that play a role here. Then we can write our Q(θ, θ_old) as the sum over k of γ_{0k} log π_k, plus the sum over n = 0 to N and k = 0 to K of γ_{nk} log p(x_n | θ_k),

plus the sum over n = 1 to N, j = 0 to K and k = 0 to K of ξ_{n,jk} log A_{jk}. So what are our quantities? γ_{0k} is the expectation E[z_{0k} | x, θ_old] of an indicator variable, so it is 1 exactly when

the class of z_0 is k. That means γ_{0k} is just the probability p(z_0 = k | x, θ_old), the probability that z_0 equals k given our data and our old parameters. In the same way γ_{nk} = E[z_{nk} | x, θ_old] is just the probability p(z_n = k | x, θ_old). And the ξ_{n,jk}:

what are these? This is the expectation E[z_{n-1,j} z_{nk} | x, θ_old] of a product of two indicators, which is 1 when both of them are 1 and 0 otherwise, and they are both 1 exactly when z_{n-1} is in class j and z_n is in class k. So we get ξ_{n,jk} = p(z_{n-1} = j, z_n = k | x, θ_old). How do I compute them? Surprise, we can compute them using

the rescaled sum-product algorithm for hidden Markov models. This is really the inference step we did before, and exactly these quantities: the variable beliefs give all the γ_{nk} for all k, and the factor beliefs give all the ξ_{n,jk} for all j and k. It is very convenient that one run of message passing gives us all these values at once. That means, at every EM iteration,

to compute this objective function we need to infer these quantities, and they can be efficiently computed using the sum-product algorithm for hidden Markov models, preferably the rescaled version for stability. Note that the parameters here are fixed during this step, and the inference algorithm works for arbitrary θ: it did not need to be the best one or anything, it works for any given parameters. So with this step our E step is done.
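A small sketch of this E step, computing γ and ξ from the rescaled messages of the earlier code block (names are my own; alpha_hat, beta_hat and c are assumed to come from the scaled forward-backward pass above):

```python
import numpy as np

def hmm_e_step(A, emission_probs, alpha_hat, beta_hat, c):
    """E step quantities from the rescaled messages:
    gamma[n, k]   = p(z_n = k | x, theta_old)
    xi[n-1, j, k] = p(z_{n-1} = j, z_n = k | x, theta_old)
    """
    gamma = alpha_hat * beta_hat                    # already normalized per step
    N1, K = emission_probs.shape
    xi = np.zeros((N1 - 1, K, K))
    for n in range(1, N1):
        xi[n - 1] = (alpha_hat[n - 1][:, None] * A
                     * (emission_probs[n] * beta_hat[n])[None, :]) / c[n]
    return gamma, xi
```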

Now we come to the M step. For the M step we have to write down a Lagrangian, namely L(θ, λ) = Q(θ, θ_old) + λ (1 - Σ_k π_k) + Σ_{j=0}^{K} λ_j (1 - Σ_k A_{jk}), because the π_k need to add up to one and also each row of A needs to sum to one, so we need one multiplier λ for π and one multiplier λ_j for each j.

The Q we have computed before looks like this; here we have the constraints and here we have the Q, with the quantities γ and ξ we derived just before in the E step. As a reminder, the θ we want to optimize consists of π, the emission parameters θ_k, and the transition probability matrix A. Now let's take the derivative of our Lagrangian with respect to π_k, for instance, and set it to zero.

What do we get? Out of Q we get γ_{0k} times 1/π_k, and from the constraint we get minus λ, so the condition is γ_{0k}/π_k - λ = 0, i.e. γ_{0k} = λ π_k. Summing this over k, and using that the π_k sum to one, gives λ = Σ_k γ_{0k}.

That means π_k = γ_{0k} / Σ_{l=0}^{K} γ_{0l}, so we have already found the update for the initial parameters. Now let us take the derivative with respect to A_{jk}: again the same Lagrangian, the same objective, and we set the derivative to zero.

Where does A_{jk} occur? It occurs in the Lagrange multiplier term and it occurs exactly once in the double sum: we get -λ_j + Σ_{n=1}^{N} ξ_{n,jk} (1/A_{jk}) = 0. If we bring λ_j to the other side and multiply by A_{jk} (the sum does not depend on it, so we can just move it over), we get A_{jk} λ_j = Σ_{n=1}^{N} ξ_{n,jk}.

Now we can use the normalization Σ_{k=0}^{K} A_{jk} = 1: summing the last equation over k gives λ_j = Σ_{k=0}^{K} Σ_{n=1}^{N} ξ_{n,jk}.

If I now bring this λ_j to the other side, I get A_{jk} = Σ_{n=1}^{N} ξ_{n,jk} / Σ_{l=0}^{K} Σ_{n=1}^{N} ξ_{n,jl}, where the denominator uses a different summation index l so that we do not get confused.
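These two closed-form updates can be sketched directly in code; assuming γ and ξ come from the E step above:

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M step for the initial distribution and the transition matrix."""
    pi_new = gamma[0] / gamma[0].sum()          # pi_k = gamma_{0k} / sum_l gamma_{0l}
    A_new = xi.sum(axis=0)                      # sum_n xi_{n, j, k}
    A_new /= A_new.sum(axis=1, keepdims=True)   # normalize each row j over k
    return pi_new, A_new
```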

So we now have the initial distribution and the transition matrix fitted to the data for this M step. The last thing we need to do are the emission parameters θ_k. As you can see, there is no constraint on the θ_k, so we only need to look at the emission part of Q, and we have to say what the emission probabilities are. Here we make the assumption that p(x | θ_k) is given by an exponential family

with natural parameters θ_k: p(x | θ_k) = h_k(x) exp( θ_kᵀ T_k(x) - A_k(θ_k) ). Unfortunately we have another A here, the log normalizer A_k, but it is not the transition matrix, and it is different for every k. So we assume an exponential family, and now we want to fit this distribution, i.e. maximize Q with respect to θ_k, so we set the derivative with respect to θ_k to zero.

We can also write it as a gradient, ∇_{θ_k} Q = 0. Anyway, where does θ_k occur? We take the log of the emission; θ_k does not occur in the π term, it does not occur in the A term, it only occurs in the emission term, and there only in the terms belonging to this particular k. Everything else is not dependent on θ_k, so it drops out and we are left with the sum over n = 0 to N of

γ_{nk} [ θ_kᵀ T_k(x_n) - A_k(θ_k) + log h_k(x_n) ].

Taking the gradient with respect to θ_k, the log h_k(x_n) term plays no role, and we get Σ_{n=0}^{N} γ_{nk} T_k(x_n) - Σ_{n=0}^{N} γ_{nk} A_k'(θ_k) = 0, where I just write a prime for the derivative to make it simple to read, although it is meant as a gradient and may have several components. As you can see, A_k'(θ_k) does not depend on n, so we can pull it out of the sum, bring it to the other side and divide. That means that A_k',

our local normalizing function's derivative, evaluated at θ_k, equals a ratio of two sums; we then invert this function, and the argument is the one sum divided by the other. We have seen the same thing already for mixtures of exponential families.

Explicitly, θ_k = (∇A_k)⁻¹( Σ_{n=0}^{N} γ_{nk} T_k(x_n) / Σ_{n=0}^{N} γ_{nk} ). So basically this says it is a class mean: the class mean of the sufficient statistic, evaluated at every data point and weighted by the responsibilities computed in the E step. That means we have computed all the quantities; let us write them down again together.

We have π_k = γ_{0k} / Σ_l γ_{0l}; then A_{jk} = Σ_{n=1}^{N} ξ_{n,jk} / Σ_{l=0}^{K} Σ_{n=1}^{N} ξ_{n,jl}, for j and k both going from 0 to K; and for each k from 0 to K the emission update θ_k = (∇A_k)⁻¹( Σ_n γ_{nk} T_k(x_n) / Σ_n γ_{nk} ), where we make the assumption of an exponential family with natural parameters θ_k. This is the M step.
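As one concrete instance of this exponential-family update (not the general form used in the lecture), here is a sketch for Gaussian emission distributions, where the weighted mean of sufficient statistics reduces to responsibility-weighted sample means and covariances:

```python
import numpy as np

def m_step_gaussian_emissions(X, gamma):
    """Emission M step for the special case of Gaussian emissions.

    X:     (N+1, D) observed data points
    gamma: (N+1, K) responsibilities from the E step
    Returns per-class means and covariances, weighted by the responsibilities.
    """
    Nk = gamma.sum(axis=0)                       # total weight per class
    means = (gamma.T @ X) / Nk[:, None]          # (K, D)
    K, D = means.shape
    covs = np.zeros((K, D, D))
    for k in range(K):
        diff = X - means[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return means, covs
```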

So let us summarize how to fit a hidden Markov model to data with the EM algorithm. In the E step we compute these responsibilities γ_{nk} and these pairwise coefficients ξ_{n,jk} using the rescaled sum-product algorithm with the old parameters. In the M step we then set the new parameters using these quantities: for π we take the component γ_{0k} divided by the sum over all components; for the transition probabilities we sum the pairwise terms over all data points and divide by the sum over all target classes as well; and if we use exponential families for the emission distributions, we take the weighted sum of the sufficient statistics for each class, divided by the total weight assigned to that class,

and apply the inverse of the derivative of the log partition function. And then we repeat: we go to the E step again, run the sum-product algorithm again, do the M step again, and so on, until convergence. Once converged, we have approximated the optimal parameters and we have fitted our model. From there on we can use these parameters to do inference again, for instance for the most probable states for our data, or we can make predictions by doing inference in the model up to the last step in the latent space, then using one more transition step to a new latent state and then the emission to a new

test data point, to make predictions. And this is how you fit a hidden Markov model to data using the EM algorithm.

5 Sequential Data Models - Hidden Markov Models - Summary

Hi everyone, let us shortly summarize what we have learned about hidden Markov models. First, we have learned that higher-order Markov chains scale exponentially with the order M when trying to model increasing time dependencies. Hidden Markov models avoid this by modeling a latent space with a first-order Markov chain that indirectly connects all observations and that cannot be blocked by conditioning on any intermediate observed variables. The observed variables appear just as projections from the latent space via the emission probabilities, and the latent space of a hidden Markov model is discrete. The distribution factorizes according

to this Bayesian network written here, where we have the initial distribution for this corner here, then emission probabilities given by some arbitrary distribution, for instance exponential families, and then the transition probabilities. We assume the chain is homogeneous, so the transition matrix is the same everywhere, and the emission distributions are also the same at every time step. Then we have investigated inference: given the data points, we want to know the marginal distributions of the latent states, and also these factor beliefs, the pairwise marginals.

We have seen that, using evidence factors, we can transfer this into a chain-structured factor graph and efficiently do the inference task with the standard sum-product algorithm. For numerical stability we have also used a rescaled version, where we scale every message to be a probability distribution by dividing by its sum. Then we have seen that, given the data, we can also infer the most likely joint latent state by using the same factor graph and running the max-sum algorithm, which in this setting is also called the Viterbi algorithm.

After we have investigated inference, we have investigated fitting our hidden Markov model to data via the EM algorithm. Since this is a latent variable model, the standard maximum likelihood approach does not work, but maximizing the ELBO using the EM algorithm does work. We have looked into the E step, and it turns out that the E step is just the same as the inference step we investigated before: at every EM iteration we run the sum-product algorithm with the parameters given at that time. Then, once we have these quantities of interest, we can update the corresponding parameters

using them, essentially by taking each quantity and dividing by the appropriate sums, i.e. normalizing them. We have also investigated how the parameters need to be updated when using exponential families as emission probabilities: the natural parameters are updated by this formula here, where this part is a weighted sum of sufficient statistics and this is the derivative of the log partition function of class k, and then its inverse. Finally we have talked about how to do prediction for a new test point.

This is done by using the parameters we have inferred and taking a transition from the last state in the chain to the next one, basically assuming the next test point will be just the next point in the chain: we first infer the latent state here, then transition over to the next state, and then use the emission with the parameters we have inferred, and so we get a prediction. And these are all the goals we wanted to achieve for the hidden Markov model.

10 Sequential Data Models

0 Sequential Data Models - Linear Dynamical Systems - Inference

Hi everyone, in this video I want to talk about linear dynamical systems and how to do inference in them efficiently. Consider for instance the task of tracking a car. Say the car is driving along this line here in a two-dimensional space on a map, but the only thing we can detect are GPS data, which are noisy, so we get a noisy version of the path of the car: at each time point, when the car is here, we observe this point; at the next time it is here, we observe this one, and so on. And we want to do this in a real-time manner, so it is kind of an online inference task. What we could do is compute, for the first

data point, the most likely point; maybe it is here, because we do not have any other point yet. When we have a second data point, we make a prediction for the second position of the car, but we also use the first point, so our estimate gets better and better the more data points we have; up to time point t we use all the data available up to time t to infer the location at time t. It would look like this: here again you have the real line, which we do not know; the green points are the data points; and now we make estimates, shown here as the red process, with confidence regions around them.

They indicate how confident we are that the car is there, and you can see that these confidence ellipses become smaller. The reason is that we get more and more data points, so the further we go, the more confident we are. But now you could say: if I stop recording data, then I could actually go back and correct my previous estimates, because I got more data later; the data that arrived after a given time point can benefit the estimate at that time point. To make this clear, let's look at this point here. As you can see, these measurements have a little trend downwards, and you might assume that the car

goes in this direction. If you look at the true line, you see this point is more like an outlier, but up to this data point we do not know that, so we assume the car continues downwards. Later we see that the next point basically jumps back on track, and it is actually a much closer measurement to the real track than this one. So from the next point on we kind of know that this cannot have been a good approximation. Knowing this, we would say the car was probably not here, it was more like here. If we want to do this, we basically have to condition on all the future data as well, and this can be done. So here we have the predictions where we only use the

online data up to the current time point to do the inference, and here on the right-hand side we see how the line looks after we have adjusted the previous points based on the later points as well. You see we get a much, much smoother curve, and smaller and smaller confidence regions, meaning we get more certain about these points. You can also see that the interior points are much more certain than the outlying points at the ends, which totally makes sense. The algorithm here on the left is called the Kalman filter; the algorithm on the right is called the Kalman smoother. For both we use linear dynamical systems.

Linear dynamical systems are basically state space models that we consider to be homogeneous, meaning the transition probabilities are the same for each time step and the emission probabilities are the same for each time step; furthermore, we assume all distributions are Gaussian and all maps are linear plus Gaussian noise, so linear Gaussian transition and emission probabilities. Since we have already seen the state space model, we know how this factorizes: a Bayesian network of this type, with an initial distribution and an emission for the zeroth time step, and then a transition

and an emission for all further time steps. For the initial distribution we have to say what its parameters are, and they are given by a mean and a covariance matrix on z_0; note that z_0 can be a vector, as in the car example, where it could have two components, one for the x and one for the y coordinate. Then we need the emission probabilities, modelling what happens during the measurement, and this is basically a linear transformation of the true point plus some Gaussian noise. And then we assume that the transition probabilities are also linear

Gaussian, so another linear map plus some noise. Written out: p(z_0) = N(z_0 | μ_0, V_0), the emission p(x_n | z_n) = N(x_n | C z_n + e, Σ), and the transition p(z_n | z_{n-1}) = N(z_n | A z_{n-1} + d, Γ). The parameters are basically all the quantities occurring in these Gaussian distributions, and we just put them into one big vector θ. Then we have data, measurements; these would be the GPS coordinates, which we consider noisy, generated by the emission probabilities. What we now want to do is infer the true position at time point n given all the data points up to time point n: this is the Kalman filter, and we want to do this in an online fashion.
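As a small illustration of the generative model just defined (argument names are my own, and the bias terms are omitted for brevity), one could sample a trajectory like this:

```python
import numpy as np

def sample_lds(mu0, V0, A, Gamma, C, Sigma, N, rng=None):
    """Sample a trajectory from the linear dynamical system:
    z_0 ~ N(mu0, V0),  z_n ~ N(A z_{n-1}, Gamma),  x_n ~ N(C z_n, Sigma)."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.multivariate_normal(mu0, V0)
    zs, xs = [z], [rng.multivariate_normal(C @ z, Sigma)]
    for _ in range(N):
        z = rng.multivariate_normal(A @ z, Gamma)
        zs.append(z)
        xs.append(rng.multivariate_normal(C @ z, Sigma))
    return np.array(zs), np.array(xs)
```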

Then we have the Kalman smoother, which starts after the Kalman filter has finished and goes back in time to adjust these estimates, so that they also take all the later data points into account. Here we want the full conditional distribution p(z_n | x_0, ..., x_N), given all data points, not just those up to n; and sometimes we also want to know the joint distribution of two consecutive latents and their cross-covariances. These are the Kalman filter and the Kalman smoother. Note that we are not fitting the parameters at the moment; we consider the parameters as given, one fixed parameter setting. Let us first think about how we can do this, the Kalman filter idea. First we have the distribution of z_0 up there, and

since the parameters are fixed I do not care about them here, so I will suppress them in the notation to make it easier to read. We have this one distribution, p(z_0), which is a normal distribution; then we have an emission probability p(x_0 | z_0), which is also Gaussian; and we know that then z_0 and x_0 are jointly Gaussian, and we have conditioning formulas which tell us what p(z_0 | x_0) looks like and also what the marginal looks like. The marginal is not so important for now, but p(z_0 | x_0) is already a quantity of interest.

If we call it α_0(z_0), this is the zeroth alpha, a quantity of interest, and we have it already. What we used is the Bayes rule for Gaussians, which we have seen already a few times. This is the first step, so we have basically solved this part, and the question is how we deal with the rest. The main point is that by forming the conditional we basically reverted the arrow in this direction, because the conditional now goes in the other direction; so we can redraw the graphical model, after reverting this edge, into a graphical model which looks like this.

Then the next step is marginalizing out z_0. X_0 is given, so we can write it like this, or use the box notation where we consider x_0 more like a parameter we condition on. If we marginalize out z_0, then our graphical model looks like this: basically the same as before, but now we have a new initial distribution, on z_1 instead of z_0.

For this marginalization we compute p(z_1 | x_0) by applying the transition and marginalizing out z_0. And then we are basically in the same setting as before, but with p(z_1 | x_0) as the new initial distribution.

Then we can do the same again: we look at x_1, we know this is Gaussian, this is Gaussian and this is Gaussian, we invert the edge, we marginalize out, we now have x_0 and x_1 pointing into the new initial distribution, and so on. And this is already the idea of the Kalman filter. To put this into equations, we need to recap Gaussians, the Bayes rule for Gaussians. How does this work? We start with a marginal Gaussian distribution, p(z) = N(z | μ, V), and we have a conditional given by a linear map, i.e. a matrix and some bias, plus added noise: p(x | z) = N(x | C z + e, Σ). This gives us a joint distribution of

x and z, and we have already seen a few times how this looks. In the joint distribution over z and x, the mean of the z block stays the same, μ; the mean of the x block is obtained by just plugging the mean in here, so we get C μ + e. The covariance matrix of the z block stays V, and the covariance matrix of the x block comes from multiplying V on both sides with the map, C V Cᵀ, and then adding the independent noise Σ on top: some error propagation plus some additional noise. And then we have the cross-covariance between them, which is given by the covariance V multiplied by the map, V Cᵀ.

So we have this joint distribution, and then we can look at the marginal of x, which we just read off from these components: p(x) = N(x | C μ + e, C V Cᵀ + Σ). We also have a conditional version, and the conditional is usually the complicated one: we need to invert this matrix and basically subtract correction terms. The mean changes according to the cross-covariance times the inverse of the x covariance, applied to x minus its mean, and then we add the mean of z to get the mean out here. For the covariance the same holds: we take V and correct it by the cross-covariance times this inverse times the cross-covariance again.
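Written out compactly in the notation used above, the Gaussian conditioning lemma we will rely on reads:

$$
p(z) = \mathcal{N}(z \mid \mu, V), \qquad p(x \mid z) = \mathcal{N}(x \mid C z + e,\, \Sigma)
$$
$$
\Rightarrow\quad p\!\begin{pmatrix} z \\ x \end{pmatrix} = \mathcal{N}\!\left( \begin{pmatrix} \mu \\ C\mu + e \end{pmatrix},\; \begin{pmatrix} V & V C^{\top} \\ C V & C V C^{\top} + \Sigma \end{pmatrix} \right)
$$
$$
p(x) = \mathcal{N}(x \mid C\mu + e,\; B), \qquad B = C V C^{\top} + \Sigma, \qquad K = V C^{\top} B^{-1}
$$
$$
p(z \mid x) = \mathcal{N}\big(z \mid \mu + K (x - C\mu - e),\; (I - K C)\, V \big).
$$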

These are the usual conditioning formulas, and to derive the Kalman filter and smoother we will not use more than this lemma, which we have seen several times; we do not need any Woodbury identities or anything else, but we will use this over and over and over again. It looks very complicated, but if we want to compute it we can do this in order. So here again the marginal, here the conditional. What we can do now is give names to the pieces: we call this the small b, b = C μ + e,

and these quantities we call the capital B, B = C V Cᵀ + Σ, which also appears here. And now we call this capital K, which is also called the Kalman gain matrix, K = V Cᵀ B⁻¹; once it is computed, including the inverse, we find it again in both places. Then we can compute the conditional mean, and you can see it is just: take the difference between x and the marginal mean, apply the gain matrix, and add it to the prior mean, μ + K (x - b). And the conditional covariance is (I - K C) V.

Here I is the identity matrix. So we can efficiently compute these quantities one after the other, and then the marginal can be written just with the abbreviations b and B, and the conditional can be written with the mean and covariance we just computed; and we will do this over and over again for the Kalman filter. The Kalman filter just uses these formulas. We start with the initial parameters; important is that ν_0 = μ_0 and V_0 are given, and then we go through n = 0 to N, the time points at which the data points come in. At each step we do exactly what we have seen before: we compute b_n = C ν_n + e and B_n = C V_n Cᵀ + Σ, the Kalman gain matrix K_n = V_n Cᵀ B_n⁻¹ as we have seen before, the conditional mean m_n = ν_n + K_n (x_n - b_n)

and the conditional covariance M_n = (I - K_n C) V_n; and with that we already get our quantity of interest, namely that the conditional distribution of the latent variable given all data points up to now is given by these matrices, p(z_n | x_0, ..., x_n) = N(z_n | m_n, M_n). Then, to repeat the steps, we have to do the transition, the prediction step: we use ν_{n+1} = A m_n + d and V_{n+1} = A M_n Aᵀ + Γ as the new initial parameters for the next step, and then we repeat this again and again.
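A sketch of the filter recursion just described, with hypothetical function and argument names and the biases d, e omitted:

```python
import numpy as np

def kalman_filter(mu0, V0, A, Gamma, C, Sigma, X):
    """Kalman filter sketch.

    X: (N+1, D_x) observations. Returns filtered means m_n, covariances M_n,
    and the predicted quantities nu_n, V_n needed later by the smoother.
    """
    N1 = X.shape[0]
    Dz = mu0.shape[0]
    m = np.zeros((N1, Dz)); M = np.zeros((N1, Dz, Dz))
    nu = np.zeros((N1, Dz)); V = np.zeros((N1, Dz, Dz))
    nu[0], V[0] = mu0, V0
    for n in range(N1):
        # measurement update via the Gaussian conditioning lemma
        b = C @ nu[n]
        B = C @ V[n] @ C.T + Sigma
        K = V[n] @ C.T @ np.linalg.inv(B)            # Kalman gain
        m[n] = nu[n] + K @ (X[n] - b)
        M[n] = (np.eye(Dz) - K @ C) @ V[n]
        # prediction step for the next time point
        if n + 1 < N1:
            nu[n + 1] = A @ m[n]
            V[n + 1] = A @ M[n] @ A.T + Gamma
    return m, M, nu, V
```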

So at each iteration there is an update step where the data point comes in, and then we make the prediction, and that is already the Kalman filter. Now we can talk about the Kalman smoother. After we have done the filtering we have these inference results up to each time n, step by step, and after we have processed all our data points we can go backwards and correct our old estimates, and even our certainty about them. This is called the Kalman smoother. What we do is initialize the smoothed mean and covariance with the last filtered ones: we start at the back, and they come from the forward pass, from the Kalman

filter, so μ̂_N = m_N and M̂_N = M_N. Then we go backwards in time. We compute, basically by the same formulas, a reverse Kalman gain matrix J_n = M_{n-1} Aᵀ V_n⁻¹ from quantities that come from the Kalman filter; these were stored there, so we can reuse them by computing this quantity here. Then we adjust the mean from the Kalman filter by this difference here, the smoothed minus the predicted mean, propagated through the reverse Kalman gain matrix: μ̂_{n-1} = m_{n-1} + J_n (μ̂_n - ν_n). And we also take our covariance matrix at time point n - 1 and adjust it by the later steps,

propagated back through the reverse Kalman gain matrix: M̂_{n-1} = M_{n-1} + J_n (M̂_n - V_n) J_nᵀ. As you can see, there is effectively a minus in here, meaning that the variances typically become smaller, meaning we are getting more certain about this mean. After we have done all this we have all these quantities, and the claim is that the conditional distribution of each latent variable, conditioned on all data points, is given at each step by a normal distribution with exactly these new means and new covariances, which were corrected relative to the Kalman filter: p(z_n | x_0, ..., x_N) = N(z_n | μ̂_n, M̂_n). And also the joint distribution of two consecutive latents is similarly given by these quantities, with the cross-covariance terms J_n M̂_n. So we do the Kalman filter first and get initial estimates

for the means and covariances; then we do the Kalman smoother: we go backwards and adjust all the means and covariances according to the later data points. And as you can see, the data points themselves do not need to be processed again; you only need the quantities you already had from the forward pass, the Kalman filter, and then we use these as better estimates.
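And a matching sketch of the smoother, consuming the quantities returned by the filter sketch above:

```python
import numpy as np

def kalman_smoother(m, M, nu, V, A):
    """Kalman smoother sketch, using the outputs of kalman_filter above.

    Returns smoothed means mu_hat, covariances M_hat, and the reverse gains J_n.
    """
    N1, Dz = m.shape
    mu_hat = m.copy(); M_hat = M.copy()
    J = np.zeros((N1, Dz, Dz))
    for n in range(N1 - 1, 0, -1):
        J[n] = M[n - 1] @ A.T @ np.linalg.inv(V[n])          # reverse Kalman gain
        mu_hat[n - 1] = m[n - 1] + J[n] @ (mu_hat[n] - nu[n])
        M_hat[n - 1] = M[n - 1] + J[n] @ (M_hat[n] - V[n]) @ J[n].T
    return mu_hat, M_hat, J
```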

1 Sequential Data Models - Linear Dynamical Systems - Kalman Smoother - Proof

Hi everyone, in this video we want to give a proof for the Kalman smoother. Remember the Kalman smoother: we first go through the Kalman filter and get all these quantities here, and then we go backwards to adjust the means and covariances from the Kalman filter, the forward pass, using the later means and covariances and the reverse Kalman gain matrix. The claim was that after we have computed all these quantities, in particular this mean μ̂_n and this covariance M̂_n for each time step, the distribution of each latent variable given all the data is a Gaussian distribution with exactly this mean and exactly this covariance; and furthermore, for

two consecutive time steps, the joint distribution of the latent variables is given by these covariance matrices together with these cross terms, the cross-covariance terms. So our goal is to find an expression for the joint distribution of these two latents given our whole data set, and we can write it with the product rule: p(z_{n-1}, z_n | x_0, ..., x_N) = p(z_{n-1} | z_n, x_0, ..., x_N) times p(z_n | x_0, ..., x_N).

As you can see, by induction we know the second factor: it is p(z_n | x_0, ..., x_N) = N(z_n | μ̂_n, M̂_n). We start over here at the last time step, where we already know this corrected version, because there it simply comes from the Kalman filter. By induction we then know what p(z_n | x_0, ..., x_N) is, starting from the last one, and now we want to go one step back. So what can we say about the first quantity,

p(z_{n-1} | z_n, x_0, ..., x_N)? As you can see, we are conditioning on z_n, so we are conditioning on this node here, and that means we are blocking an information path in this Bayesian network: conditioning on z_n blocks z_{n-1} off from all the later data points x_n, ..., x_N. That means we can now write p(z_{n-1} | z_n, x_0, ..., x_N) = p(z_{n-1} | z_n, x_0, ..., x_{n-1}).

Why is this true? We know that z_{n-1} is independent, or we can even say directly d-separated, from x_n, ..., x_N given z_n, and this holds for all n. So this means we only have to look at the distribution p(z_{n-1} | z_n, x_0, ..., x_{n-1}).

Instead of going for the conditional directly, we basically look at the joint again, now conditioned only on x_0, ..., x_{n-1} (or written the other way around, on x_{n-1}, ..., x_0). Let us write it down on the next slide. What can we do with this one? We can again use the product rule, but now the other way around: p(z_{n-1}, z_n | x_0, ..., x_{n-1}) = p(z_n | z_{n-1}, x_0, ..., x_{n-1}) times p(z_{n-1} | x_0, ..., x_{n-1}).

The second quantity we know from the Kalman filter: it is N(z_{n-1} | m_{n-1}, M_{n-1}), with the m_{n-1} and M_{n-1} from the Kalman filter. And what can we say about the first one? Again we can exploit conditional independencies; let us look at the graph again. We now have a distribution over z_n, and we are conditioning on z_{n-1} and on all of these observations here. By conditioning on z_{n-1} we know that z_n is independent of x_0, ..., x_{n-1},

because they are blocked by this z_{n-1}. So this distribution here is nothing else than the transition between these two, since z_n is d-separated from x_0, ..., x_{n-1} given z_{n-1}. Furthermore, we know how the transition looks, as this is a linear dynamical system: p(z_n | z_{n-1}) is a normal distribution, N(z_n | A z_{n-1} + d, Γ).

So what do we have? We have that this quantity is a product of two normal distributions: p(z_{n-1}, z_n | x_0, ..., x_{n-1}) = N(z_n | A z_{n-1} + d, Γ) times N(z_{n-1} | m_{n-1}, M_{n-1}). And we know by the usual formulas how the joint distribution looks, namely a normal distribution over (z_{n-1}, z_n)

with the corresponding means: for z_{n-1} it is m_{n-1}, and for z_n we get A m_{n-1} + d, which is exactly what we called ν_n in the Kalman filter. And then we have the covariance matrix: M_{n-1} in this corner, then M_{n-1} Aᵀ and A M_{n-1} as the off-diagonal blocks,

and then A M_{n-1} Aᵀ + Γ in the other corner, and we call this one V_n, because this is exactly the V_n from the filter. But we were interested not in the joint here; we were interested in p(z_{n-1} | z_n, x_0, ..., x_{n-1}). So let us write this down on the next slide: this is the joint distribution, with these means and covariances.

We want p(z_{n-1} | z_n, x_0, ..., x_{n-1}), and now we use the conditioning formula. The mean is m_{n-1}, but we have to adjust it: we adjust it by the cross-covariance M_{n-1} Aᵀ times the inverse of V_n, applied to z_n minus ν_n. So the conditional mean is m_{n-1} + M_{n-1} Aᵀ V_n⁻¹ (z_n - ν_n),

and for the covariance we get a similar formula: we take the covariance of this component, M_{n-1}, and adjust it by the cross-covariance, the inverse, and the cross-covariance again, M_{n-1} - M_{n-1} Aᵀ V_n⁻¹ A M_{n-1}ᵀ; and since M_{n-1} is symmetric, this is indeed a covariance matrix. Let us write this down here, and now we can just note that M_{n-1} Aᵀ V_n⁻¹

is exactly our reverse gain matrix J_n; you see the gain matrix here again. And if you look at the covariance correction, you can insert V_n V_n⁻¹ and write M_{n-1} Aᵀ V_n⁻¹ A M_{n-1} as J_n V_n J_nᵀ. So what we get is that p(z_{n-1} | z_n, x_0, ..., x_{n-1}), or,

as we have seen by conditional independence, actually p(z_{n-1} | z_n, x_0, ..., x_N) given all the data, is normally distributed: N(z_{n-1} | m_{n-1} + J_n (z_n - ν_n), M_{n-1} - J_n V_n J_nᵀ).

And what we wanted was p(z_{n-1}, z_n | x_0, ..., x_N), given all the data, which is p(z_{n-1} | z_n, x_0, ..., x_N) times p(z_n | x_0, ..., x_N). Now, we know that the second factor is normal by induction, N(z_n | μ̂_n, M̂_n), and the first factor we have just shown to be normal with this distribution, N(z_{n-1} | m_{n-1} + J_n (z_n - ν_n), M_{n-1} - J_n V_n J_nᵀ). So we have to compute the joint of these two.

As you can see, we again have a linear map applied to z_n, and z_n is Gaussian, so again our lemma applies and we can write down the joint distribution. That means p(z_{n-1}, z_n | x_0, ..., x_N) is also Gaussian. I write the z_n component on top, with mean μ̂_n,

and for z_{n-1} the mean is m_{n-1} + J_n (μ̂_n - ν_n). For the covariance matrix we have M̂_n here in the z_n block, and down here, for the z_{n-1} block, we have to squeeze this covariance matrix through the linear transformation and then add the conditional covariance on top: what we get is J_n M̂_n J_nᵀ + M_{n-1} - J_n V_n J_nᵀ. Furthermore, we get

the cross term M̂_n J_nᵀ here, and J_n M̂_n there, the cross-covariances. So this is already one of the things we wanted to compute, and the other one was the marginal, but this can now easily be read off from this representation. Very simple: the marginal p(z_{n-1} | x_0, ..., x_N) has mean m_{n-1} + J_n (μ̂_n - ν_n), and we just take the covariance block here and rewrite it a little bit:

this block is this term plus this term, namely J_n M̂_n J_nᵀ + M_{n-1} - J_n V_n J_nᵀ = M_{n-1} + J_n (M̂_n - V_n) J_nᵀ. And by definition we want this mean to be the μ̂_{n-1} we wrote down, and this covariance we call M̂_{n-1}. These are exactly the formulas, the update formulas, we had in the Kalman smoother,

and the claim is shown.

2 Sequential Data Models - Linear Dynamical Systems - Fitting LDS to data

Hi everyone, in this video we want to fit a linear dynamical system to data. First, let us recap linear dynamical systems. It is a homogeneous state space model where everything is linear Gaussian: the transition probabilities, the emission probabilities and also the initial distribution. It follows the Bayesian network over here, a Markov chain in the latent space, and then emission probabilities which we can interpret as measurements with errors. Then we know how the joint distribution over the observed and latent variables factorizes: some initial distribution, then the transition probabilities, and here the emission probabilities for the observations.

When we now parameterize this, for our initial distribution we use a Gaussian with a mean and a covariance; our transition probabilities are linear maps of the conditioning variable plus some Gaussian noise, and this gives the next state; similarly, the emission probabilities take the latent state, linearly transform it, add some Gaussian noise, and then we get our observation. The parameters we want to fit here are the initial parameters, then the emission parameters, and the transition parameters, and we want to fit them all to data. Note that at every time step we have only one data point.

So we have no chance of rerunning this whole chain again and then averaging or anything like that; we have to fit this whole dependent model with one data point per time step, and then we take them all together. How do we do this? We would like to do maximum likelihood estimation, but we know this is infeasible in latent variable models. So we maximize the ELBO from variational inference, and we know that we can do this with the EM algorithm. But first we want to simplify things a little bit: we have these bias terms d and e here, and I want to get rid of them. What we can do is replace our matrix A by appending the column d to A, and likewise

take e and C together into one matrix, and then extend all the z_n by a constant 1 on top and consider these extended vectors. If we do this, then without loss of generality we can assume that d = 0 and e = 0. So we have basically got rid of these two terms, or rather included them in the matrices, and we have two parameters less. If you did it by hand and left them in, you would see that you always have to estimate A and d, and C and e, together anyway.
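As a small illustration of this absorption trick (my notation, with the transition noise understood to leave the appended constant component untouched):

$$
\tilde z_n = \begin{pmatrix} z_n \\ 1 \end{pmatrix},\qquad
\tilde A = \begin{pmatrix} A & d \\ 0 & 1 \end{pmatrix},\qquad
\tilde C = \begin{pmatrix} C & e \end{pmatrix},
$$
$$
\tilde A \tilde z_{n-1} = \begin{pmatrix} A z_{n-1} + d \\ 1 \end{pmatrix},\qquad
\tilde C \tilde z_n = C z_n + e .
$$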

So let's start with the E step. In the E step we have to evaluate the function Q(θ, θ_old) by taking this expectation value of the log joint log p(z, x | θ) with the new parameters, and we can write this down as the log of the initial distribution plus

all the transitions, log N(z_n | A z_{n-1}, Γ), plus all the emissions, log N(x_n | C z_n, Σ), like this. Now we know this is normal, this is normal and this is normal. So what does this lead us to? It leads us to

quadratic expressions between z_{n-1} and z_n, so quadratic terms, and squares of these quantities, and then we take the expectation value of those. So what we need to do, and this is the main point now, is to efficiently evaluate three types of quantities in the posterior with the old parameters: first E[z_n | x, θ_old],

then E[z_n z_nᵀ | x, θ_old], and then the question is whether we also need the general cross terms E[z_n z_mᵀ]; yes, we have to do this, but only for z_{n-1} and z_n, we do not need to look beyond latents which follow each other.

that andw That in -1 transpose. So how do we do this? We know that this year is mary ann from the common smother and filter so the magic words are here that we use kalman logn filters plus Collins moses andw with respect to the old parameter. Right. rameters interchanging and then we get these quantities out of this forward backward algorithm given by common closest comments moses and then the quantities occurring there, we

called $\hat\mu_n$ and $\hat V_n$, and for the pairwise term we also need the reverse (smoother) gain matrix. So this is already the E step: these are the quantities that will occur, and we can compute them efficiently by running the Kalman filter and then the Kalman smoother with respect to the old parameters. Next we will use these quantities, computed from the old parameters, in the M step.
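As a compact reference, here is a minimal summary of how the three E-step expectations come out of the smoother outputs, following the standard Rauch-Tung-Striebel / Kalman-smoother presentation; the symbols $\hat\mu_n$, $\hat V_n$ for the smoothed means and covariances and $J_{n-1}$ for the reverse gain matrix are an assumption about the lecture's notation:

$$\mathbb E[z_n] = \hat\mu_n,\qquad \mathbb E[z_n z_n^\top] = \hat V_n + \hat\mu_n\hat\mu_n^\top,\qquad \mathbb E[z_n z_{n-1}^\top] = \hat V_n J_{n-1}^\top + \hat\mu_n\hat\mu_{n-1}^\top.$$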

To make the notation shorter, we define $L(\theta) := -2\,Q(\theta, \theta^{\text{old}})$, i.e. minus two times the Q function with respect to the old parameters. The factor $-2$ gets rid of all the $-\tfrac12$ factors from the log normal distributions, and this quantity now needs to be minimized instead of maximized with respect to the new parameters $\theta$. We will go through the parameters step by step, in this order, and since the terms are very long we will only write down the terms that are necessary for deriving each quantity. So let us start with $\mu_0$. What do we need to look at?

We need $0 = \partial L / \partial \mu_0$. This is an expectation value; let us make another abbreviation so we don't have to keep writing out the posterior: if we write $\mathbb E$, we mean the expectation with respect to the posterior under the old parameters. So we get the expectation value of the derivative with respect to

$\mu_0$. Let's look at our objective function: $\mu_0$ does not occur in the transition terms and does not occur in the emission terms, so we only need to look at the initial term, and we know the log of the normal is quadratic. What we get is $(z_0 - \mu_0)^\top V_0^{-1} (z_0 - \mu_0)$; all the rest does not depend on $\mu_0$, and this sits inside the expectation value. Now we use the product rule of differentiation:

the inner derivative gives a minus sign, so we get $-V_0^{-1}(z_0 - \mu_0) - V_0^{-\top}(z_0 - \mu_0)$; since $V_0^{-1}$ is symmetric, this is $-2\,V_0^{-1}\,\mathbb E[z_0 - \mu_0]$. Setting this to zero and multiplying through, we get that the parameter needs to be the expectation value of

$z_0$: $\mu_0^{\text{new}} = \mathbb E[z_0]$, and we have computed this in the E step. So we're done with $\mu_0$. Next is $0 = \partial L / \partial V_0$, again inside the expectation value. Let's write down where $V_0$ occurs: it does not occur in the transition or emission terms, so we can forget about those, and it occurs twice in the initial term, once in the quadratic form and once in the determinant. So what do we have? We have the log determinant, $\log|V_0|$, written with these bars to make

it easier, and then we have $(z_0 - \mu_0)^\top V_0^{-1} (z_0 - \mu_0)$, like this. Now we have to take the derivative with respect to $V_0$. We know that the derivative of the log determinant is the inverse matrix transposed, $\partial \log|V_0| / \partial V_0 = V_0^{-\top}$. Then we have to take the derivative of the quadratic form with respect to the matrix, and differentiating the inverse brings in two factors of the inverse. You know this from the scalar version, but in the matrix version the

order matters. The rule is that one inverse goes before and one goes after, the outer product goes in between, and there is a minus sign in front: $-V_0^{-\top} (z_0 - \mu_0)(z_0 - \mu_0)^\top V_0^{-\top}$. So setting the expectation of the whole derivative to zero, what we get is $V_0^{-\top} - V_0^{-\top}\,\mathbb E[(z_0 - \mu_0)(z_0 - \mu_0)^\top]\,V_0^{-\top} = 0$.

We know that $\mu_0$ is already the expectation value of $z_0$, so this expectation is just the posterior covariance of $z_0$; we could expand it in terms of the E-step quantities, or we leave it like this because it is exactly the covariance matrix of $z_0$, which is maybe more convenient. Then we multiply through with $V_0$ from the left and from the right, which removes the two inverse factors.

So what we get is $V_0^{\text{new}} = \mathbb E[(z_0 - \mu_0)(z_0 - \mu_0)^\top] = \mathbb E[z_0 z_0^\top] - \mathbb E[z_0]\,\mathbb E[z_0]^\top$, and in the notation of the Kalman smoother this is $\hat V_0$, the smoothed covariance at time zero. Next we are looking at $0 = \partial L / \partial A$, the transition matrix; we have an expectation value

of the derivative with respect to $A$. Going through the analogous steps, the result is $A^{\text{new}} = \big(\sum_{n=1}^{N} \mathbb E[z_n z_{n-1}^\top]\big)\big(\sum_{n=1}^{N} \mathbb E[z_{n-1} z_{n-1}^\top]\big)^{-1}$. The next step: we want $0 = \partial L / \partial \Gamma$, so we get the expectation value of the derivative with respect to $\Gamma$, and now

we have to look where $\Gamma$ occurs in the objective. $\Gamma$ does not occur in the initial term, it does not occur in the emission terms, but it occurs in all the transition terms, so this time we also have to take care of the log determinant. We have $\log|\Gamma|$, again with the determinant bars, and I almost forgot the sum $\sum_{n=1}^{N}$. Then we have the quadratic form

$(z_n - A z_{n-1})^\top \Gamma^{-1} (z_n - A z_{n-1})$, closing all the brackets. Now again, all of this sits inside an expectation value; we can take the sum $\sum_{n=1}^{N}$ outside and have an expectation value for each term. The derivative of the log determinant gives $\Gamma^{-\top}$, and for the quadratic form we have again the same rule as before: we get minus $\Gamma^{-\top}$ times $(z_n - A z_{n-1})$ from the first factor,

times $(z_n - A z_{n-1})^\top$ from the second factor, times $\Gamma^{-\top}$ again. So what do we get? The log-determinant part is a sum over something that does not depend on $n$, so it gives $N$ times $\Gamma^{-\top}$, and then we have minus $\Gamma^{-\top} \sum_{n=1}^{N} \mathbb E[(z_n - A z_{n-1})(z_n - A z_{n-1})^\top]\,\Gamma^{-\top}$. Now we have to expand this product: we get $z_n z_n^\top$,

then minus $z_n z_{n-1}^\top A^\top$, minus $A z_{n-1} z_n^\top$, and then plus $A z_{n-1} z_{n-1}^\top A^\top$, all inside the sum and the expectation. Now we multiply through with

$\Gamma$ from both sides; $\Gamma$ is symmetric, so the transposes are no concern, and then we divide by $N$. Note that $A$ was already computed, so we get $\Gamma^{\text{new}} = \frac1N \sum_{n=1}^{N}$ of this quantity. Let's write it out again: $\mathbb E[z_n z_n^\top] - \mathbb E[z_n z_{n-1}^\top]\,A^\top - A\,$

$\mathbb E[z_{n-1} z_n^\top] + A\,\mathbb E[z_{n-1} z_{n-1}^\top]\,A^\top$. Again, these expectations were computed with the Kalman filter and smoother in the E step, and $A$ was computed just before; that is why we do it in this order. We know $A$ already, we plug it in, average these quantities, and we get our $\Gamma$. The next step is the emission matrix: $0 = \partial L / \partial C$, so we have an expectation value of the derivative with respect to $C$. Now let's have

a look at which terms of the objective function depend on it. There is no $C$ in the initial term, no $C$ in the transition terms; $C$ only occurs in the emission terms, where we have $x_n$ and $z_n$, and we start from zero because we also have an emission on the initial state: instead of going from one, we go from zero here. So what do we get? $(x_n - C z_n)^\top \Sigma^{-1} (x_n - C z_n)$ is what we need to differentiate. And of course I forgot the sum,

which starts from zero: $\sum_{n=0}^{N}$. Taking the derivative with respect to $C$ inside the expectation, we get $-2\,\Sigma^{-1}(x_n - C z_n)\,z_n^\top$, where the $z_n^\top$ comes from the inner derivative. Actually we would have two terms, one with $\Sigma^{-1}$ and one with $\Sigma^{-\top}$, but since $\Sigma^{-1}$ is symmetric we just get the factor two, and the minus comes from the inner derivative of $-C z_n$.

So what do we get? $-2\,\Sigma^{-1} \sum_{n=0}^{N}\big(x_n\,\mathbb E[z_n]^\top - C\,\mathbb E[z_n z_n^\top]\big) = 0$. We can multiply through with $\Sigma$, so that factor is gone. Then we have one sum on each side: we bring one to the other side and divide by it, and we get $C^{\text{new}} =$

$\big(\sum_{n=0}^{N} x_n\,\mathbb E[z_n]^\top\big)\big(\sum_{n=0}^{N} \mathbb E[z_n z_n^\top]\big)^{-1}$, and again these quantities were computed in the E step with the Kalman filter and smoother. The next step is computing $\Sigma$, so $0 = \partial L / \partial \Sigma$: we have an expectation value of the derivative with respect to $\Sigma$ of

the sum $\sum_{n=0}^{N}$ of $\log|2\pi\Sigma| + (x_n - C z_n)^\top \Sigma^{-1} (x_n - C z_n)$. So what is this? Let us take the sum $\sum_{n=0}^{N}$ and the expectation outside. Then, as before, the log determinant gives $\Sigma^{-\top}$ and the quadratic form gives minus $\Sigma^{-\top}$ times $(x_n - C z_n)$

times $(x_n - C z_n)^\top$ times $\Sigma^{-\top}$. The sum starts from zero, so we have $N+1$ copies of the log-determinant term, $(N+1)\,\Sigma^{-\top}$, and then minus $\Sigma^{-\top} \sum_{n=0}^{N} \mathbb E[(x_n - C z_n)(x_n - C z_n)^\top]\,\Sigma^{-\top}$. We have to expand this product: we get $x_n x_n^\top$, then we get

minus $C\,\mathbb E[z_n]\,x_n^\top$, then minus $x_n\,\mathbb E[z_n]^\top C^\top$, and then plus $C\,\mathbb E[z_n z_n^\top]\,C^\top$, all still sandwiched between the two $\Sigma^{-\top}$ factors.

Okay, so now we multiply with $\Sigma$, which is symmetric so the transpose does not matter, from the right and from the left. That means we get one $\Sigma$ from the log-determinant part and the inverses in the other part are gone; we just have to divide by $N+1$ and bring it to the other side: $\Sigma^{\text{new}} = \frac{1}{N+1}$

$\sum_{n=0}^{N}$ of the expanded quantity above. Note that this $C$ was computed just in the step before, the expectations were computed in the E step, and the $x_n$ are fixed values from the data. This means we have now computed all quantities from the data using the EM algorithm, for this one iteration.

Okay, let us summarize. We alternate between the E step, where we compute for all $n$ the quantities $\mathbb E[z_n]$, $\mathbb E[z_n z_n^\top]$ and $\mathbb E[z_n z_{n-1}^\top]$ using the Kalman filter and smoother in a forward-backward pass, and the M step, where we update

$\mu_0, V_0, A, \Gamma, C, \Sigma$ in that order. The E-step quantities are always computed with respect to the old parameters via the Kalman filter and smoother, and after iterating we get an approximation of the maximum likelihood estimates.
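As a rough sketch of the M-step updates just derived, here is a minimal implementation under the assumption that the E-step statistics have already been computed by a Kalman filter and smoother; the function and variable names (`Ez`, `Ezz`, `Ezz1`) are made up for illustration:

```python
import numpy as np

def m_step(x, Ez, Ezz, Ezz1):
    """One M step for a linear dynamical system.

    x    : (N+1, d_x) observations x_0 ... x_N
    Ez   : (N+1, d_z) posterior means E[z_n]
    Ezz  : (N+1, d_z, d_z) posterior second moments E[z_n z_n^T]
    Ezz1 : (N+1, d_z, d_z) pairwise moments E[z_n z_{n-1}^T] (entry 0 unused)
    """
    N = x.shape[0] - 1

    # initial state parameters
    mu0 = Ez[0]
    V0 = Ezz[0] - np.outer(Ez[0], Ez[0])

    # transition matrix, then transition noise covariance (uses the new A)
    A = sum(Ezz1[1:]) @ np.linalg.inv(sum(Ezz[:-1]))
    Gamma = sum(Ezz[n] - Ezz1[n] @ A.T - A @ Ezz1[n].T + A @ Ezz[n - 1] @ A.T
                for n in range(1, N + 1)) / N

    # emission matrix, then emission noise covariance (uses the new C)
    C = sum(np.outer(x[n], Ez[n]) for n in range(N + 1)) @ np.linalg.inv(sum(Ezz))
    Sigma = sum(np.outer(x[n], x[n]) - C @ np.outer(Ez[n], x[n])
                - np.outer(x[n], Ez[n]) @ C.T + C @ Ezz[n] @ C.T
                for n in range(N + 1)) / (N + 1)

    return mu0, V0, A, Gamma, C, Sigma
```

The ordering mirrors the derivation above: $A$ is computed before $\Gamma$ and $C$ before $\Sigma$, since the noise updates reuse the freshly updated matrices.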

After convergence we have, let's say, variational inference estimates of the parameters. Then we can plug them into our linear dynamical system and use it for all kinds of things, inference or predictions and so on, and these are now dependent on the data.

3 Sequential Data Models - Linear Dynamical Systems - Summary

Hi everyone, in this video I want to summarize what we've learned about linear dynamical systems. First, a linear dynamical system is a special kind of state space model given by this latent variable model: we have the observations here on the bottom, and we have a Markov chain in the latent space, where the observations are considered to be noisy measurements of these latent states. In the linear dynamical system we assume homogeneity, meaning that the transition probabilities are all the same at each time step and also the emission probabilities are the same. Furthermore we assume that the initial distribution is Gaussian and

that all the transition and emission probabilities are linear Gaussian. Such a system can be used for noisy time-dependent measurements, for instance tracking with noisy sensors. We know that in such models the joint distribution of observations and latent variables factorizes like this, with the initial distribution, the transition probabilities and the emission probabilities. As said, the initial distribution is given by a normal distribution, the transition probability is given by a linear map plus some Gaussian noise, and the emission probabilities similarly: we have a linear map

from the latent space to the observation space and then add some noise. Our parameters then consist of the parameters of all these normal distributions, and as you can see, by the homogeneity assumption these are not dependent on $n$, so they are the same at each time step. Then we have data, one data point for each time step. So we are not able to repeat the experiment; we can only go forward in time, we cannot go back and repeat it. Then we can do two things: we can do inference or we can do learning. One is fitting the parameters, and the other is inferring the latent states. For inference, we have seen that version one

is the Kalman filter, where the aim is to make inference in an online fashion. We infer the latent state using the data points, the measurements, up to the current time step, and we do this by running through the computation steps from $n = 1$ to the last point $N$: we start from the initial parameters, propagate them, update them according to the emission mean and emission variance, and then we know what the marginal distribution of the next observation looks like; then we do the conditioning the other way around using Bayes' rule via the Kalman gain matrix,

which gives the conditional mean and the conditional variance, or if you want, posterior mean and posterior variance. Then we apply the transition and do the same step again, and we repeat this until the last time step, where we have the last latent state conditioned on all observations. This was the Kalman filter. The Kalman smoother now allows us to go back and adjust these old estimates with the data points the filter had not yet seen. For this we initialize the smoothed means and variances with the filter output at the last point, and then we go backwards in time by computing the reverse Kalman gain matrix, then computing

the smoothed posterior means by adjusting the filtered means, which depended only on the measurements up to that time step, with information from, basically, their future. We do the same with the variances, and you can see that the variance can go down, since new data has arrived. With this we are able to compute the conditional distribution of each marginal in the latents given the whole dataset, and we can also do this for two consecutive time steps, where we get the covariance matrices from the Kalman smoother. This is inference. Then you can also do learning. We start

with the parameters: we initialize them somehow. For simplicity we get rid of the biases by putting them into the matrices and appending a one to the latent vectors, to make it easier. Then we alternate, until convergence, the EM algorithm with its E step and M step. In the E step we infer the latent means given all the data points with respect to the old parameters, and we've seen that for this we can use the Kalman filter with the Kalman smoother on top; then we get these quantities, the posterior mean, posterior variance and posterior pairwise covariance, from the Kalman smoother, and then we can just update our parameters using these quantities.

So the initial distribution comes from the posterior mean and posterior covariance at time zero. Similarly, the transition matrix can be estimated like this, and then the covariance of the transition noise, if you want, where this $A$ was computed first. Then similarly the emission matrix can be computed like this, and the emission noise, the measurement noise, like that. After the M step we go back to the E step, use these parameters to compute the new posterior quantities, and so on until convergence. After convergence we can use

these parameters in our linear dynamical system to make predictions or again do some inference tasks.

11 Sampling Methods

0 Sampling Methods - Classical Monte Carlo Methods

Hi everyone, in this video I want to introduce you to the classical Monte Carlo method. Let's start with a simple example: estimating the mean of a function. Say we have a random variable $X$ distributed according to some distribution $Q$, possibly unknown or maybe even known, and we also have a real-valued function $f$. What we're interested in is the expectation value $\mathbb E_Q[f(X)]$; for simplicity you can also assume that $f$ is just the identity and you just want the expectation value of $X$. By definition this is the integral over the density: if $q(x)$ is the density, then $\mathbb E_Q[f(X)] = \int f(x)\, q(x)\, dx$, so we evaluate the function at all kinds of $x$ and integrate.

This is often very difficult to do if you don't have explicit, analytic formulas for these two quantities, and even if they are explicitly given, sometimes the integral is not doable. Instead we can estimate this quantity using samples. Let's assume we have i.i.d. samples $x_1, \dots, x_N$ from this distribution, not just one but a whole bunch, and we know they are sampled from $Q$. Then we can estimate the expectation value by the empirical mean, $\frac1N \sum_{n=1}^N f(x_n)$, and we have seen this already a few times. The only things we needed are samples and the ability to evaluate the function on these values, and then

we just need to add them up and divide by their number; this is a very cheap way to approximate the integral, so the expectation value. What properties does this estimator have? First, it is unbiased. This can be seen just from the linearity of the expectation value: $\mathbb E\big[\frac1N \sum_{n=1}^N f(X_n)\big] = \frac1N \sum_{n=1}^N \mathbb E[f(X_n)]$. Now we know that the expectation value of $f(X_n)$ is the same as the expectation value of $f(X)$, because they are identically distributed. So this is $\frac1N \sum_{n=1}^N \mathbb E[f(X)]$, and now the sum

runs over something that does not depend on $n$: we have $N$ times the same value, times $\frac1N$, so this is exactly $\mathbb E[f(X)]$. That means in expectation we get what we want. Next we can think about how big the error is that we're making. We need an error measure between the empirical mean and the quantity we want: we take the difference, square it, take the expectation and then take the square root; this is the root mean square error. What you can see is that this $\frac1N$ and

the sum can be pulled together with the target: since $\mathbb E[f(X)] = \frac1N \sum_n \mathbb E[f(X)]$, the difference is $\frac1N \sum_n \big(f(X_n) - \mathbb E[f(X)]\big)$. We pull the $\frac1N$ out of the square as $\frac1{N^2}$, and what remains is the expectation of the square of a sum of independent, centered random variables; this gives variance terms and covariance terms, the covariance terms are all zero because of independence, and we are left with the variance terms. What we get is

$N$ times the variance of $f(X)$, divided by $N^2$, and then the square root. So what is left is $\frac{1}{\sqrt N}\,\mathrm{Var}[f(X)]^{1/2}$, and here you can see that the error goes down like $1/\sqrt N$: it goes to zero for $N \to \infty$. This just means that if we have more and more samples, our approximation gets better, and we also know at what

rate it does so. Let's say I want an acceptable error $\epsilon$. Then this just means: take $N \geq \sigma^2 / \epsilon^2$ samples, where $\sigma^2$ is the variance $\mathrm{Var}[f(X)]$. So if we want at most an error of $\epsilon$, we have to take at least this many samples to push the root mean square error below $\epsilon$.
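As a small illustration, here is a minimal sketch; the target distribution, the function $f$ and the tolerance are arbitrary choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.square                        # function whose mean we want, f(x) = x^2
q_sample = rng.standard_normal       # sampler for Q (here: standard normal)

# choose N from the rule N >= sigma^2 / eps^2 (sigma^2 estimated from a pilot run)
eps = 0.01
pilot = f(q_sample(1_000))
n_needed = int(np.ceil(pilot.var() / eps**2))

samples = f(q_sample(n_needed))
estimate = samples.mean()            # empirical mean; E_Q[f(X)] = 1 for this choice
print(n_needed, estimate)
```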

This is what the analysis reveals. The last property basically follows from the second one: the estimator is consistent. Consistent means that it is asymptotically exact: if I let $N$ go to infinity, the estimator converges to the expectation value $\mathbb E[f(X)]$. This follows basically from the analysis above, because the root mean square error is the $L^2$ distance between these quantities, and since it goes to zero, the estimator converges to the expectation value in

the $L^2$ norm. If you are interested in other forms of convergence that appear in statistics and probability theory: the convergence also holds almost surely. So we have convergence in $L^2$, as seen above, and almost surely, and this is the law of large numbers; for it we only need the variance to be finite. So even though the method is approximate, we have quite precise control over how many samples we need to reach a given approximation error, we know the estimator is unbiased, and we know that taking more and more samples gives convergence to the right quantity. As I already said, the

convergence speed relies on the finiteness of the variance and the mean, in other words on the second moment. If, on the other hand, you have a fat-tailed distribution, then you might not have a second moment; some distributions don't even have a first moment. Fat-tailed means that a lot of the probability mass is very far away from the mean, out in the tails. Practically, in these cases the empirical mean effectively fails to converge to the expectation value; of course the law of large numbers still holds where it applies, but we need unrealistically many samples to see this convergence, so many

that it often makes no sense. If you are given a distribution, first check: does it have fat tails or not, does it have a second moment or not? In case it is fat tailed (we won't talk much about this in this course), you should not rely on sampling methods, at least not the vanilla versions like the empirical mean; already the plain mean is a bad estimator of the expectation value in that case. What can sometimes help are robust statistics like median-of-means estimators: you take small batches, compute the mean within each batch, and then take the median of those batch means. This is a biased estimator, but in the fat-tailed case it is often closer to the expectation value than the plain empirical mean.
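A minimal sketch of the median-of-means idea just described; the batch count and the heavy-tailed example distribution are illustrative choices:

```python
import numpy as np

def median_of_means(samples, n_batches=10):
    """Split the samples into batches, average each batch, return the median."""
    batches = np.array_split(np.asarray(samples), n_batches)
    return np.median([b.mean() for b in batches])

rng = np.random.default_rng(0)
heavy_tailed = rng.standard_t(df=2, size=10_000)   # finite mean, infinite variance
print(heavy_tailed.mean(), median_of_means(heavy_tailed))
```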

Okay, that said, we now ask ourselves: if I really want to use these Monte Carlo methods, where do I get my samples from? The general setup we will work with later is that we have something like a program, actually a function $\tilde p$, that spits out a value $\tilde p(x)$ when we confront it with an input $x$. There is no analytic form here, and for the moment no gradient information; we can just plug in a value and it spits out a number. That is the only thing we can do: it is to be considered as a program with an input and an output. This is how you have to think about it, and mathematically you think of it

as an unnormalized probability distribution. What we want are samples from the corresponding probability distribution; in the best case i.i.d. samples, but sometimes there will be dependencies between them. In the end we want a bunch of samples that reflect the distribution. What we also assume is that we have access to a pseudo-random generator that can give us (approximately) i.i.d. samples from the uniform distribution on the unit interval, and from those we want to get samples from the distribution that corresponds to $\tilde p$ after normalization.

In certain cases we will also have the normalizing constant at hand, given to us. Sometimes we also have gradient information for this function or its logarithm, where we can just plug in $x$ and it spits out the value of the gradient. Sometimes we also have the cumulative distribution function of our $p$, which is the integral $\int_{-\infty}^{x} p(t)\, dt$. I put the dot and the tilde in the notation to make sure you consider these really as functions, like a program: plug in a value, get something out. This is in contrast to the notation $p(x)$, where we always consider $p(x)$ the distribution that changes according to which letters I use for my variables. This is the general

setup for the sampling methods we want to develop.

1 Sampling Methods - Classical Monte Carlo Methods - Example Applications

Hi everyone, we've talked about classical Monte Carlo methods and now I want to give a few example applications where they can be used. But first, let's reflect on what we've done before. Think about a normal learning task: we want to minimize the generalization error between the true distribution and the model distribution, and we can learn the model, i.e. change its parameters, such that it minimizes the Kullback-Leibler divergence, written in this form, between the true distribution and our model distribution. The problem was always that we don't have access to this true distribution. So in the end we were relying on samples anyway, because all data is given as samples: the true distribution is only given to us via the samples in the dataset. But there we never thought about having

the real distribution at hand, or an unnormalized one; everything was already in terms of samples. So implicitly, or rather explicitly, we were already doing learning based on samples, and we couldn't optimize the true generalization error. What we did instead is exactly what we have now learned in Monte Carlo methods: we replaced the expectation value by the empirical mean and then minimized that, and we have seen that this is the same as maximizing the log-likelihood

of the whole dataset. This was our maximum likelihood estimation. Now the whole of learning theory is about the mismatch between these two: we have something we optimize based on samples, and over here is the true generalization error. So we need to introduce regularization to avoid overfitting and underfitting, we have to talk about the function classes we use and so on, just to get a grip on this. Also we had validation sets and test sets, just to be sure that in the end, instead of only optimizing the empirical quantity, we are actually doing well on the true one. That already took a lot of effort; this is just to give you a feeling for how difficult it can be to learn from samples. And if we wanted more samples here, what we would need to do

is sample from this true distribution, which means we have to do the experiment: the data points $x_n$ we are using come from an experiment, and sampling again from this distribution means that in real life we have to redo the experiment. So this was the setting from before. Let's go to another application of Monte Carlo methods. Say we have a distribution in two variables $x$ and $z$; for instance $z$ could be a latent variable and $x$ an observation variable. What we want is to compute or represent the marginal $p(x)$, and we assume that we have the conditionals $p(x \mid z)$. Then to get this marginal we can write it as the integral $p(x) = \int p(x \mid z)\, p(z)\, dz$, which

can be written as an expectation value over $z$ of $p(x \mid z)$. By the Monte Carlo method this is approximately $\frac1N \sum_{n=1}^N p(x \mid z_n)$, where $z_1, \dots, z_N$ are i.i.d. samples from $p(z)$. This means that the marginal is approximately the mean of all these conditional distributions.
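A minimal sketch of this marginal approximation; the choice of prior and conditional is illustrative, not from the lecture: $z \sim \mathcal N(0,1)$ and $x \mid z \sim \mathcal N(z, 1)$, so the true marginal is $\mathcal N(0, 2)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10_000
z = rng.standard_normal(N)               # z_n ~ p(z)

def marginal(x):
    # p(x) ≈ (1/N) Σ_n p(x | z_n)
    return norm.pdf(x, loc=z, scale=1.0).mean()

print(marginal(0.5), norm.pdf(0.5, scale=np.sqrt(2.0)))   # MC estimate vs exact value
```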

Because of this property, that you can compute marginals as a mean over empirical samples, sampling methods can be viewed as approximate inference methods, of course also when we want to compute expectation values and things like that. There might be more applications, but since we were interested in inference, computing marginals and conditionals and so on, we see here that sample-based methods can also be used to achieve this goal. One of the most important cases is Bayesian prediction. Think about having a statistical model, a prior and some data, and now we want to compute the posterior, which is, for i.i.d. data, the product $\prod_{n=1}^{N} p(x_n \mid \theta)$

times the prior $p(\theta)$, and then we have to divide by the integral over $\theta$, the evidence: $p(x_{1:N}) = \int \prod_{n=1}^{N} p(x_n \mid \theta)\, p(\theta)\, d\theta$. Then we can write down the predictive distribution. How does the predictive distribution look?

It is $p(x^* \mid x_{1:N})$, the distribution of a new point given the data, and this is an integral over the posterior: the statistical model $p(x^* \mid \theta)$ times the posterior $p(\theta \mid x_{1:N})$, integrated over $\theta$. Now the problem is: how do I compute this integral? One approach we had was variational inference, where, with latent variables, we were able to handle this by evaluating the joint and maximizing the ELBO. A sample-based approach would be a sum over $M$ posterior samples:

$\frac1M \sum_{m=1}^{M} p(x^* \mid \theta_m)$, where $\theta_1, \dots, \theta_M$ are sampled from the posterior. If you are able to do this, you can make a prediction just by taking parameters sampled from the posterior, evaluating the model at the $x^*$ we are interested in, and taking the mean; then we have an approximation of the predictive distribution, obtained by summing up predictions from each of these models sampled from the posterior. The problem now is: how do we sample from the posterior?
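A minimal sketch of this posterior-predictive approximation; it assumes we somehow already have posterior samples (for instance from one of the sampling methods discussed later), and the model and names are illustrative:

```python
import numpy as np
from scipy.stats import norm

# assume theta_samples are M draws from the posterior p(theta | x_1..x_N),
# obtained by some sampling method; here we just fake them for illustration
rng = np.random.default_rng(0)
theta_samples = rng.normal(loc=1.0, scale=0.1, size=1_000)

def predictive_density(x_new):
    # p(x* | data) ≈ (1/M) Σ_m p(x* | theta_m); toy model: x | theta ~ N(theta, 1)
    return norm.pdf(x_new, loc=theta_samples, scale=1.0).mean()

print(predictive_density(0.8))
```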

First of all, the posterior is already expensive to evaluate: for a single value of $\theta$ we have to evaluate the prior and the likelihood, and for each evaluation of the likelihood we need to go through the whole dataset. This is already costly if $N$ is large. And then the biggest problem is the normalizing constant, for which we would have to integrate over $\theta$ as well. So we would like a method where we don't need to evaluate the normalizing constant, where we just evaluate the unnormalized posterior and are still able to sample from the posterior. This is exactly the framework we are working in with these sampling methods. So the question is: can we sample

from the posterior, or even from an unnormalized posterior? Another application is when you want to evaluate an integral, say $\int_0^1 f(x)\, dx$, and this integral is just about the function; there is no probability distribution involved, it is just an ordinary integral. Of course we can write this as an integral of $f(x)$ against a density $q(x)$, where $q$ is the uniform distribution; let me write this with an

indicator function, $\mathbb 1_{[0,1]}(x)$, which corresponds to the uniform distribution with respect to the Lebesgue measure on the unit interval. That means we can approximate the integral by $\frac1N \sum_{n=1}^N f(x_n)$, where the $x_n$ are i.i.d. uniform on the interval, and in this way we can estimate such very general integrals. If you want it even more general, say we have a high-dimensional space, or any space $\mathcal X$, and we want to evaluate the integral of a function over it. What

we can do is introduce a density $q$, write the integral as $\int \frac{f(x)}{q(x)}\, q(x)\, dx$, sample $x_1, \dots, x_N$ i.i.d. from $q$ and evaluate the quotient $f(x_n)/q(x_n)$ at the samples. Then we have evaluated this integral. Here we need to be able to sample from $q$, and we need an analytic form of $q$, or let's say we need to be able to evaluate its value.
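A small sketch of both variants; the integrands and the proposal density are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Example 1: integral of f over [0, 1] with uniform samples
f = lambda x: np.sin(x) ** 2
u = rng.uniform(size=N)
print(f(u).mean())                       # ≈ ∫_0^1 sin(x)^2 dx

# Example 2: integral of g over the real line, using a density q we can sample
g = lambda x: np.exp(-np.abs(x))         # integrates to 2
x = rng.standard_normal(N)               # samples from q = standard normal
q_pdf = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
print((g(x) / q_pdf(x)).mean())          # ≈ 2
```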

These are typical examples where sampling methods can help, with the most prominent one being the Bayesian setting, where you really want to sample from the posterior, or rather from the unnormalized posterior.

2 Sampling Methods - Sampling with Quantile Functions

Hi everyone, in this video I will show you how you can use the quantile function of a probability distribution to sample from it. But first, let us recap what a cumulative distribution function is and what a quantile function is. We start with any real-valued random variable: it can be discrete, it can be continuous, it can be any combination of those or any other real-valued random variable you can come up with. The cumulative distribution function of $X$, $F(t) = P(X \leq t)$, is just the accumulated probability up to this value $t$: we plug in a value, we look at how much probability mass lies at or below that value, and that is the value we return. So it is always between zero and one. For a continuous distribution we can write this as an integral: we integrate

the density up to $t$; for a discrete distribution, where the values are still in $\mathbb R$, you can just sum the probabilities of all values smaller than or equal to $t$. And if you have any other combination, for instance something continuous with some jumps, the definition still works. Then, based on this cumulative distribution function, you can define the quantile function, sometimes also called the inverse cumulative distribution function of your distribution, or, if you like, of your random variable $X$. What it is: take the smallest value,

all right, the minimum over all $x$ in $\mathbb R$, let us include $\pm\infty$, such that $F(x)$ is greater than or equal to $u$: $Q(u) = \min\{x : F(x) \geq u\}$, where $u$ is a value in $(0, 1)$; then this is a well-defined value in $\mathbb R$, or let's say in $\mathbb R \cup \{\pm\infty\}$. Often this is also written as $F^{-1}(u)$, because if $F$ is continuous and strictly increasing, for instance, then this is just the inverse function. But if it is not, then you need this definition to replace the inverse,

and this function is called the quantile function because if you plug in, let's say, 5%, it spits out the $x$ such that 5% of the mass lies at or below this $x$: the 5% quantile. For example, if you want a probability interval containing 80%, you would say: 10% to the left I don't want, 10% to the right I don't want, and in the middle I want 80%. So you evaluate the quantile function at 10% and at 90%, and the interval between these two values covers the middle 80% of your probability distribution. That is how you can use a quantile function. Let's visualize this a little bit. Say we have a probability distribution; then our cumulative distribution function might look like this, but here we have a discrete point, while up to here everything is continuous.

Here we have a discrete point, so this distribution would be a mixture of continuous and discrete parts. The quantile function then maps the other way around: here we have $u$, here we have $x$, and this would be $Q$. If I start at this $u$ here, there is no $x$ with $F(x)$ exactly equal to $u$, but because $Q$ takes the minimal value such that $F(x) \geq u$, we go to this value here; so the quantile function maps back the other way around. This should be rather familiar. And now we can do sampling with it. Again we have a real-valued random variable

with some distribution $P$ and quantile function $Q$. The important thing now is that we assume we have the quantile function and can evaluate it by plugging something in. Then we take a uniformly distributed random variable $U$ on $[0,1]$ and plug it into the quantile function: $Q(U)$ is distributed like $X$, i.e. like $P$ if you want to use that notation, meaning the distribution of this random variable is the same as the distribution of that random variable, which is given by $P$. That means if I take i.i.d. uniform samples $u_1, \dots, u_N$ and plug them into the quantile function, then $Q(u_1), \dots, Q(u_N)$ are i.i.d. samples from $P$.

This is how you can use the quantile function to sample from a distribution; all you need is the quantile function. Let us shortly prove that this is true. Consider the probability that $Q(U)$ is smaller than or equal to a given $x$. Since the quantile function is something like the inverse of the cumulative distribution function, at least one-sided, this is the same as the probability that $U \leq F(x)$. Now $U$ is uniform on $[0, 1]$, so we can write this as

follows: since $U$ is uniformly distributed, we just have to measure the length of the interval $[0, F(x)]$, and the length of this interval is $F(x)$, because it goes from $0$ to $F(x)$. And $F(x)$ is by definition the probability that $X$ is smaller than or equal to $x$. This holds for all $x$, which means that $Q(U)$ has the same distribution as $X$. This is what we wanted to show. Now let's look at an example. Say we consider

the logistic distribution. The logistic distribution has the logistic sigmoid as its cumulative distribution function, which we have seen already a few times. Now what is the inverse of this function? The inverse is the logit function, and it looks like this: $Q(u) = \log\frac{u}{1-u}$. We have seen this inverse already a few times for these functions. This means: if I sample $u$ uniformly and then take $\log\frac{u}{1-u}$, this is a sample from the logistic distribution; and if I take $u_1, \dots, u_N$ i.i.d. uniform and set $x_n = \log\frac{u_n}{1-u_n}$, then $x_1, \dots, x_N$ are i.i.d. logistic.
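A minimal sketch of this inverse transform sampling for the logistic distribution; the comparison against scipy's logistic distribution is just a sanity check:

```python
import numpy as np
from scipy.stats import logistic

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)        # u_n ~ Uniform(0, 1)
x = np.log(u / (1 - u))              # quantile (logit) function -> logistic samples

print(x.mean(), x.std())                          # ≈ 0 and ≈ pi / sqrt(3)
print(logistic.cdf(1.0), (x <= 1.0).mean())       # CDF check at x = 1
```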

Note that here we didn't even write down the density, because all we need is the quantile function, which is here the better description of our distribution, or rather of the sampling process. This looks rather easy, but there are some remarks. Often we need numerical approximations to our quantile function, as we do not always have an analytic form that we can evaluate easily; already in the normally distributed case we know that the cumulative distribution function is an integral without a closed-form solution,

and then we even have to take the inverse. So even for the normal distribution this needs to be numerically approximated with sophisticated methods, and even now there are still papers appearing on how to approximate it efficiently. Another note: inverse transform sampling, which is just another word for sampling with quantile functions, can be viewed as a generalized version of the reparameterization trick. The reparameterization trick, for a family of distributions, separates the randomness from the parameters; this is basically the idea, and it is used to compute gradients with respect to the parameters. We have seen a specific version of

the reparameterization trick for the normal distribution. Now let's go through a few more examples of parametric distributions and how we can use their quantile functions to separate the parameters from uniform random noise. For example, consider the exponential distribution. The exponential distribution has the density $\lambda e^{-\lambda x}$ for $x \geq 0$. We can compute the cumulative distribution function just by integrating the density from $-\infty$ to $x$; for $x$ smaller than zero it is zero, so we can again use the indicator function, and the integral just goes from $0$ to $x$ of $\lambda$

$e^{-\lambda t}\, dt$. The antiderivative is $-e^{-\lambda t}$, where the minus comes from the inner derivative, and evaluating it at $0$ and $x$ we get $F(x) = 1 - e^{-\lambda x}$, still times the indicator function. This is the cumulative distribution function, and now we want to invert it. Since there is no probability mass on the negative values, we only need to look at the positive values, and inverting this

is basically just solving $u = 1 - e^{-\lambda x}$ for $x$: we bring things to the other side, $e^{-\lambda x} = 1 - u$, and this gives us $x = -\log(1 - u)/\lambda$. So we have inverted the function, and the quantile function is $Q(u) = -\log(1-u)/\lambda$.

This is the quantile function, which means: if I now sample $U$ uniformly and define $X = -\log(1-U)/\lambda$, then $X$ is distributed like the exponential distribution with parameter $\lambda$. And as you can see, the uniform distribution has no parameters: the parameter $\lambda$ is separated from the randomness, so we can take derivatives with respect to the parameter if we want to use it for stochastic gradients in learning. Here you can see that this is a reparameterization trick for the exponential distribution.
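A quick sketch of this; the rate value is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # rate parameter lambda
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u) / lam                 # X = -log(1 - U) / lambda

print(x.mean(), 1 / lam)                   # empirical mean vs the exact mean 1/lambda
```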

Another example: let us consider the Cauchy distribution with location parameter $m$ and scale $s$. The Cauchy density is essentially $\frac{1}{1 + x^2}$, that is the main point, and then we have a shift by the location and a scaling by the scale, and the scale also appears in front for normalization reasons. We want to integrate this from $-\infty$ to $x$. What is the antiderivative of $\frac{1}{1+x^2}$? It is the arctangent; together with the normalization constant $\frac1\pi$ we get the arctangent of

$(x - m)/s$. If we evaluate the antiderivative at $-\infty$, we get $-\pi/2$, which after multiplying by $\frac1\pi$ contributes $\frac12$. So the cumulative distribution function is $F(x) = \frac1\pi \arctan\!\big(\frac{x-m}{s}\big) + \frac12$, and now we set $u = \frac1\pi \arctan\!\big(\frac{x-m}{s}\big) + \frac12$ and solve for $x$. As you can see, this is just subtracting one half, multiplying by $\pi$, then taking the tangent,

then multiplying by $s$ and shifting by $m$. So our quantile function is $Q(u) = s \tan\!\big(\pi (u - \tfrac12)\big) + m$. If we now take something uniformly distributed again, we can just define $X = s \tan\!\big(\pi (U - \tfrac12)\big) + m$, and this is Cauchy distributed with these parameters. As you can see, we have again separated the noise from the parameters.
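Again a short sketch; the location and scale values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 1.0, 0.5                                   # location and scale
u = rng.uniform(size=100_000)
x = s * np.tan(np.pi * (u - 0.5)) + m             # Cauchy(m, s) samples

print(np.median(x))                               # median ≈ m (the mean does not exist)
```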

Another example: generalized extreme value distributions. As you know, if you take sums of random variables, they will converge to a normal distribution by the central limit theorem. Generalized extreme value distributions occur if instead of taking a sum you take the max over more and more i.i.d. samples; then the (rescaled) distribution usually converges to a generalized extreme value distribution. They come with three parameters, location, scale and shape, and I don't write down the cumulative distribution function or the density, which are more complicated than the quantile function itself. The quantile function is defined like this: there is a log inside, then you take it to the power of the shape, and then you have some scale and some shift. This holds for shape not equal to zero, and if the shape is zero you basically take the limit

of that, and then instead of this power term you get a $\log(-\log u)$ term. If you take something uniformly distributed and define your random variable like this, looking at the case of nonzero shape, then it is distributed like the extreme value distribution with these parameters; you can do the same for the zero-shape case, which is also called the Gumbel distribution and is nowadays also used in machine learning. Now, so far we have only treated one-dimensional distributions. What about the multidimensional case? What if I want to do the same for a multidimensional distribution, say our distribution depends on several variables in $\mathbb R^D$?

One trick to reduce this to the one-dimensional case is to factorize using the product rule, one variable after the other: $p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)$ and so on, until we get $p(x_D \mid x_1, \dots, x_{D-1})$. Now we sample them bit by bit: first the quantile function of the first factor, then of the second, and so on. We sample i.i.d. uniform variables $u_1, \dots, u_D$, then we take $x_1 = Q_1(u_1)$, where $Q_1$ is the inverse cumulative distribution function

of $x_1$; let us write it out: $Q_1(u_1) = \min\{x_1 : F_1(x_1) \geq u_1\}$. Then for $x_2$ we take $u_2$ and use the conditional quantile function given $x_1$:

$Q_2(u_2 \mid x_1) = \min\{x_2 : F_2(x_2 \mid x_1) \geq u_2\}$, and so on, up to the last one, $Q_D(u_D \mid x_1, \dots, x_{D-1}) = \min\{x_D : F_D(x_D \mid x_1, \dots, x_{D-1}) \geq u_D\}$, the conditional quantile function. The claim is that if I put these values into one big vector, it is distributed like the joint distribution, and the reason, of

course, is that we use the product formula. The first component is distributed like $p(x_1)$, and the second one, when I plug the first value in as the condition, is distributed like $p(x_2 \mid x_1)$; together they are distributed like $p(x_1, x_2)$, and then we take the third one and so on, and in the end we get the whole joint distribution. This is mathematically correct, but often you don't even have these conditional quantile functions, you cannot compute them, so often it doesn't help much that you have this mathematical formula, and in practice you often cannot do it. There is one special case where it can be used: if the joint is a product distribution, so it doesn't depend on the conditioning part, then it is a product and you just sample

each component independently, by taking its quantile function independently of the others. In this special case of independence you just apply the per-component quantile functions to the uniform variables independently, and then the vector is distributed like the product distribution $p(x_1) \cdots p(x_D)$. You can use this, for instance, to sample a $D$-dimensional Gaussian. What you do is: you want $N$ i.i.d. samples; for each $n$ and each component you draw one uniform variable, everything i.i.d., then you take the inverse cumulative

distribution function, i.e. the quantile function of the standard normal, and plug the value in; we know this component is then standard normally distributed. If you put them into one big vector, it is standard normal in $D$ dimensions. And then we know: if I take the Cholesky decomposition of my $\Sigma$, which is what the power one half means here, multiply it with this vector and shift by $\mu$, then this $x_n$ is distributed like a Gaussian distribution with $\mu$ as expectation value and $\Sigma$ as covariance matrix. So this is basically the reparameterization trick for the Gaussian distribution, and the first step was getting something standard normally distributed, for which we used the quantile function.
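A compact sketch of this two-step recipe: uniforms to standard normals via the normal quantile function, then Cholesky factor and shift; the particular $\mu$ and $\Sigma$ are made-up examples:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, D = 10_000, 2
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

u = rng.uniform(size=(N, D))           # i.i.d. uniforms, one per sample and component
z = norm.ppf(u)                        # standard normals via the normal quantile function
L = np.linalg.cholesky(Sigma)          # Sigma = L L^T
x = z @ L.T + mu                       # each row x_n ~ N(mu, Sigma)

print(x.mean(axis=0), np.cov(x.T))
```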

Okay, let us summarize what we did. We used the quantile function, or cumulative distribution function, of real-valued random variables: you take a uniform random variable, plug it in, and the result is distributed like $X$, i.e. like the distribution we would like to have samples from. We saw that this can be generalized to higher dimensions using conditional quantile functions, which in practice is often infeasible. And even in the one-dimensional case this only works for a small number of distributions, and even for those you often need numerical approximations, for instance for the normal distribution; but there exist routines to evaluate these to arbitrary precision. And then

we have seen that inverse transform sampling, i.e. sampling with quantile functions, can be seen as a generalized version of the reparameterization trick, where we separate the randomness, here the uniform variable, from the parameters, so we could compute gradients if we are interested in that.

3 Sampling Methods - Importance Sampling

Hi everyone, in this video we'll talk about importance sampling. Let's first discuss what we are trying to achieve. The setting is as follows: we have a target distribution $P$ and we have a proposal distribution $Q$. We can do two things: we can evaluate the unnormalized density $\tilde p$ of our target, and we can evaluate the unnormalized density $\tilde q$ of our proposal distribution. We would usually like to have samples from $P$, but we cannot do this; we can only take i.i.d. samples from the proposal distribution, which we can both evaluate and sample from. What we are trying to achieve is to estimate

the expectation value of a function under our target distribution, $\mathbb E_P[f(X)]$; and as a disclaimer, we will not end up with i.i.d. samples from this target distribution. Instead we'll make use of the proposal samples, and the goal is just to estimate this quantity, the expectation value of this function. There are two settings: one where both normalizing constants are known, and one where at least one of them is not known, and this results in two slightly different approaches. Let us first assume that we know the normalizing constants as well. That means we can basically evaluate the normalized distributions too, because if we can evaluate the unnormalized density and we know the constant,

we just divide by it and then we can evaluate the normalized one as well. So what we do is write the expectation as an integral, $\int p(x)\, f(x)\, dx$, and the idea is to bring in our proposal distribution: $\int \frac{p(x)}{q(x)}\, f(x)\, q(x)\, dx$. As you can see, the $q$ cancels, we just brought it in, and now we have written the quantity as an expectation value with respect to $Q$ of $\frac{p(X)}{q(X)} f(X)$. Now we use the usual Monte Carlo approximation for $Q$, and we get

$\frac1N \sum_{n=1}^N \frac{p(x_n)}{q(x_n)}\, f(x_n)$, where we assume that we have i.i.d. samples from $Q$ plugged in here. This is already importance sampling, and what we have now are these additional weights $\frac{p(x_n)}{q(x_n)}$; these are called importance weights. So instead of just having $\frac1N$ and the function values, where the $x_n$ are sampled

from $P$, we now have this additional factor when we sample the $x_n$ from $Q$. These weights correct for the fact that we sampled from the proposal distribution instead of from the target distribution. This is intuitively clear: if our target distribution has a lot of mass in some region, then we should have a lot of samples there; but if the proposal puts little mass on that region, we don't get so many samples there, and instead we reweight them, as if we had more of them: instead of having five samples there we have only one, but then we multiply it by five. Now let's look at the properties.
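A minimal sketch of this estimator in the known-normalizer case; the target, proposal and $f$ are illustrative choices (target $\mathcal N(3, 1)$, proposal $\mathcal N(0, 2^2)$, $f(x) = x$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000
f = lambda x: x

x = rng.normal(0.0, 2.0, size=N)                   # x_n ~ Q = N(0, 2^2)
w = norm.pdf(x, loc=3.0, scale=1.0) / norm.pdf(x, loc=0.0, scale=2.0)   # p(x_n)/q(x_n)

estimate = np.mean(w * f(x))                       # ≈ E_P[f(X)] = 3
print(estimate)
```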

First of all, I claim that this estimator is unbiased, and you can see this just by taking the expectation value. We have to take the expectation with respect to $Q$, because the samples are distributed like $Q$; so we take $\mathbb E_Q\big[\frac1N \sum_{n=1}^N \frac{p(X_n)}{q(X_n)} f(X_n)\big]$. The $X_n$ are all i.i.d. and the expectation is linear, so the $\frac1N$ and the sum cancel out, and what we get is $\mathbb E_Q\big[\frac{p(X)}{q(X)} f(X)\big]$. As you can see, writing this as an integral, the $q$ cancels and we have the expectation value over $P$,

as we've seen on the last slide in the derivation. So this is unbiased. I also claim it is consistent; consistency means that with more and more samples it converges to the target quantity, and this is just the law of large numbers: the average $\frac1N \sum_{n=1}^N \frac{p(X_n)}{q(X_n)} f(X_n)$ over i.i.d. terms converges to the expectation value of the corresponding random variable, which is again the expectation value

$\mathbb E_P[f(X)]$, the quantity we want; and again this holds by the law of large numbers. Then let's look at the mean square error. We take $\frac1N \sum_{n=1}^N \frac{p(X_n)}{q(X_n)} f(X_n)$ minus the expectation value, square it, and then take the expectation.

Here we have to use $Q$ again. Just as for the plain Monte Carlo estimator, the $\frac1N$ comes out squared and the independent terms leave only variance terms, so what remains is $\frac1N$ times the variance with respect to $Q$ of $\frac{p(X)}{q(X)} f(X)$. This means the root mean square error goes like $\frac{1}{\sqrt N}$ times $\mathrm{Var}_Q\big[\frac{p(X)}{q(X)} f(X)\big]^{1/2}$,

similar to the usual Monte Carlo estimator, and again this rate is independent of the dimension: $x$ might be high dimensional, but the rate does not depend on it. In practice, however, this variance can be huge. Since we can choose the proposal, and we want a good proposal, we can now ask which proposal would minimize the root mean square error; as you can see, a good proposal minimizes this variance. Let's write the variance down again: it is the variance with respect to $Q$ of the weighted function

$\frac{p(X)}{q(X)} f(X)$. What is it? It is the expectation of the square minus the square of the expectation: $\mathbb E_Q\big[\big(\frac{p(X)}{q(X)} f(X)\big)^2\big] - \big(\mathbb E_Q\big[\frac{p(X)}{q(X)} f(X)\big]\big)^2$, and we already know that the second term is $\big(\mathbb E_P[f(X)]\big)^2$. As you can see, that term does not depend on $Q$, so to minimize the variance we want to minimize the first term. And we know by Jensen's inequality that

$\mathbb E_Q\big[\big(\frac{p(X)}{q(X)} f(X)\big)^2\big] \geq \big(\mathbb E_Q\big[\frac{p(X)}{q(X)} f(X)\big]\big)^2 = \big(\mathbb E_P[f(X)]\big)^2$ (assume for simplicity that $f \geq 0$); this inequality is just the statement that the variance is greater than or equal to zero. That means the term we want to minimize is lower bounded by a quantity that does not depend on $Q$, and we know exactly when we have equality: equality holds when the random variable inside is constant, i.e. when this equals that.

So we can say: the variance-minimizing proposal is attained if and only if $\frac{p(x)}{q(x)} f(x) = \mathbb E_P[f(X)]$ for all $x$. This is a condition, and if we solve it for $q(x)$, multiplying through and bringing things to the other side, we get $q(x) = \frac{p(x)\, f(x)}{\mathbb E_P[f(X)]}$. This says

that if I take the probability mass of $p$ and multiply it with the function $f$, I get a new function with some peaks and some low regions, and I use this function, normalized, as a distribution. If I can sample from this, or approximate it very well, then I get a very low variance and a low root mean square error. So one should attempt to get something similar to this: we need a proposal that puts mass where the product of these two is big. This is basically the insight. Of course, the question is what happens if I want to reuse the proposal for other functions; then, of course,

$f$ could be anything; it could be constant one, which means $q$ should then just be close to $p$. This was the analysis for the case where the normalizing constants are known. Now let's look at the other case, with unknown constants; then we can use the same trick twice to get rid of the normalizing constants. Let us write it down. What did we do? We wrote the quantity as the integral $\int \frac{p(x)}{q(x)} f(x)\, q(x)\, dx$. What we now do is divide by one, where the one is written in the same way:

$\int \frac{p(x)}{q(x)}\, q(x)\, dx$. As you can see, the $q$ cancels, $p$ is a probability distribution, so this denominator is just one, and the whole thing is just the same expectation value as before. But why did I do this? The reason is: the normalizing constant of $p$, which I don't know, appears in the numerator and in the denominator, and it just cancels out; the same holds for the normalizing constant of $q$. So I can replace $p$ and $q$ by $\tilde p$ and $\tilde q$, which I can evaluate: we get $\frac{\int \frac{\tilde p(x)}{\tilde q(x)} f(x)\, q(x)\, dx}{\int \frac{\tilde p(x)}{\tilde q(x)}\, q(x)\, dx}$,

which is just $\frac{\mathbb E_Q[\frac{\tilde p(X)}{\tilde q(X)} f(X)]}{\mathbb E_Q[\frac{\tilde p(X)}{\tilde q(X)}]}$. Now I approximate the numerator with samples and the denominator with samples, so I'm using the same trick twice. This gives $\frac{\sum_{n=1}^N \frac{\tilde p(x_n)}{\tilde q(x_n)} f(x_n)}{\sum_{n=1}^N \frac{\tilde p(x_n)}{\tilde q(x_n)}}$,

where I would have a $\frac1N$ in the numerator, but I also have one in the denominator, so they cancel. This is the sample version, and we can write it as $\sum_{n=1}^N w_n\, f(x_n)$, where the weights $w_n$ are built from functions I can actually evaluate, and again the samples are samples from $Q$. My importance weights now use the unnormalized functions instead of the normalized ones,

and I make up for it by dividing by their sum: $w_n = \frac{\tilde p(x_n)/\tilde q(x_n)}{\sum_m \tilde p(x_m)/\tilde q(x_m)}$. These normalized weights sum to one, and because of this normalization both normalizing constants drop out. But this now leads to a few problems. One thing is that this is a biased estimator, because of the denominator: before, it was just $\frac1N$, but now we have randomness in the denominator, and this makes problems. If I take the expectation value, the numerator and denominator do not separate, so we get something biased. It is still consistent, though.
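A minimal sketch of this self-normalized estimator, using the same illustrative target and proposal as before, but now pretending we only have unnormalized densities:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

p_tilde = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)        # unnormalized target
q_tilde = lambda x: np.exp(-0.5 * (x / 2.0) ** 2)        # unnormalized proposal
f = lambda x: x

x = rng.normal(0.0, 2.0, size=N)                          # samples from Q = N(0, 2^2)
w = p_tilde(x) / q_tilde(x)
w /= w.sum()                                              # self-normalized weights

print(np.sum(w * f(x)))                                   # ≈ E_P[f(X)] = 3
```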

The reason is again the law of large numbers: the numerator, one over N times the sum from n equal 1 to N of p-tilde(X_n)/q-tilde(X_n) times f(X_n), converges to the expectation value under Q of p-tilde(X)/q-tilde(X) times f(X), and the denominator, one over N times the sum of p-tilde(X_n)/q-tilde(X_n), converges to the expectation value under Q of p-tilde(X)/q-tilde(X). Both converge separately, and then the quotient converges as well,

in this strong, almost sure sense of convergence. And if you add the normalizing constants back on both sides, the constants of Q drop out and we end up with exactly the expectation value under P of f(X) that we wanted. So we have a bias, but if I have enough samples, I converge to the right quantity. One problem we do have is that the error is difficult to estimate. It is difficult to estimate

because we cannot treat numerator and denominator separately; that does not help us with the joint term. We can separately analyze the variance of the numerator and the variance of the denominator, but that does not say anything about the variance of the quotient. The reason is that they are dependent, and even if they were independent, it is not clear how a variance estimate propagates through the one-over-denominator part. So the error is very difficult to estimate, and if you do it separately you get unreliable error estimates; people try heuristic arguments, but usually it does not work so well. Importance sampling also has problems in high dimensions: one thing is that it is prone to putting a huge amount of weight onto just a few samples

in high dimensions. How you can see this is by making a very crude argument. The importance weights are proportional to the quotient of the target distribution and the proposal distribution, and in high dimensions we can think of both as product distributions over the dimensions, with all factors roughly of the same size. This is just a rough argument; of course there are cases where it does not hold, and usually it does not hold exactly, but it captures how probability mass typically behaves in higher dimensions: it gets spread thinner and thinner just by having more dimensions

over which the mass needs to be put. If both distributions roughly factorize with factors of the same size, each weight is something to the power of D, where D is the dimension. Now you can see: if the base is close to one and you take the power of D, it stays close to one for a long time; if it is, say, one half, and you take the power of D for large D, then after this exponentiation it will be close to zero. That means if there is a slight difference between the weights of two different points X_n and X_m, this difference gets exponentiated: one weight stays

close to one and the other quickly goes to zero. Because we normalize, a small difference means the weight concentrates on a few samples: the ones that do not go to zero get almost all of the weight. So importance sampling is prone to putting weight on just a few samples; the relative weights quickly drift apart. To partially overcome this, some people, instead of using the weights directly in the average, normalize these weights, then draw N new samples from the discrete distribution given by these weights, and then use

the plain average over these resampled points as the estimator. There are still high weights on a few samples, but the others are not zero, so when you resample you can still get them, and you end up with a somewhat more reliable estimator. There was a lot of research years ago on modifying these weights to make this estimator better. Let us summarize a little what we want from a proposal. Our proposal should have mass where the target distribution and the target function together, as a product, have a lot of mass: q should be high where p times f is big. Of course, for this we need to know p and f at least to some degree, though we do not need to know the normalizing constants.
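
A minimal sketch of this resampling idea (often called sampling importance resampling); the particular target, proposal, and integrand here are just assumed toy choices:

import numpy as np

rng = np.random.default_rng(0)

# toy weighted sample: points from a Cauchy proposal, weighted towards a standard normal target
x = rng.standard_cauchy(100_000)
w = np.exp(-0.5 * x**2) * (1 + x**2)      # p_tilde / q_tilde, unnormalized importance weights
w_norm = w / w.sum()                      # normalized weights, sum to one

# resample according to the normalized weights (the L resampled points of the lecture)
idx = rng.choice(len(x), size=len(x), replace=True, p=w_norm)
x_resampled = x[idx]                      # approximately distributed like the target P
print(np.mean(x_resampled**2))            # plain average now, close to E_P[X^2] = 1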

We just need to know something about where the product is big. If we do not know the function f, then we should put the mass of q(x) where p(x) is big. And even if p is also not well known, what usually works is to try something which is close to a uniform distribution. A uniform distribution does not exist on, for instance, the real line or a Euclidean vector space, so instead you take something with heavy tails, which is close to a uniform distribution if you want: as uninformative as possible, trying to be equally good for all possible choices. This is a good default choice if you have no information about p and f.

To summarize: we found that for importance sampling, if the normalizing constants are known, we have a consistent, unbiased estimator with an error that goes down with one over the square root of N, but the variance can still be huge due to the variance of the terms p(X) f(X) / q(X). Then we have seen that if the normalizing constants are not known, we still have a consistent estimator, but it is biased and we have difficulties assessing the error. Also, in higher dimensions we get the problem of huge weights on a few samples, and we can still get a bad approximation with high variance

if the proposal differs just a little from the target distribution, because then the variance can blow up. Finally, it is called importance sampling, but in the end we did not sample from our target; what we did is sample i.i.d. from the proposal instead of the target. But you can rephrase it by saying we have weighted samples: we can consider the samples from Q as weighted samples, each carrying one weight, and if we put these weights on in the averaging, then we can

take the same samples, with different weights, and consider them as weighted samples from P, as if they were i.i.d. So we take the samples and the weights and use them to estimate the integral; we basically replace the expectation value under P, which would use i.i.d. samples, by normalizing these weights and summing over the weighted values. If you use this concept of weighted samples, we actually do not need to know from which distribution the points were sampled; we only need to know that, in this sense, they are weighted samples from p(x). Then you can replace the expectation by this

empirical average where we use these normalized weights. Finally, let's look at an example. Say we want to estimate the expectation of a function as before, and our target distribution is the normal distribution. The problem is sampling from the normal distribution: we already discussed doing this via the quantile function, but the cumulative distribution function of the normal is an integral over the density with no closed form, and we would then have to take its inverse. So usually we only have numerical approximations, and there are a lot of approximations

one could use; there are also other, sampling-based methods to estimate this. But here, just to give you a feeling for what we could do, we use an approximation of the cumulative distribution function by a sigmoid. The cumulative distribution function of a normal distribution looks almost exactly like a sigmoid, so you take the logistic sigmoid, scale it, and use it as an approximation; with the right scale you get the best approximation, which has been known since around 1963. And we know how to sample from this: we did this already with quantile sampling, also called inverse transform sampling. We sample a uniform value and apply the inverse of this function.

The inverse of the cumulative distribution function is the quantile function, and it is written here: the scale is one over r, we plug in the uniform sample, take the log, and this is our sample. Such a sample is a sample from the logistic distribution with scale parameter determined by r. Even though we do not know the normal cumulative distribution function in closed form, we do know the normal density, and we also know the density of the logistic distribution by just taking the derivative of its cumulative distribution function; the sigmoid has the nice derivative given by the function times one minus

the function. Now we take uniform samples, plug them in, and get samples from the logistic distribution; then we plug those into the quotient of the densities and evaluate it. We know all the quantities, including the normalizing constants, so we get our weights, and then we can estimate the integral. Because we know the normalizing constants, we can write the estimator as one over N times the sum of w_n times f(X_n), and this is an approximation of the integral

over the normal distribution. These weights will be close to one, so there is only a slight correction here, even though we actually sampled from the logistic distribution. This is an example of how you can use what you have learned: quantile-function sampling, and then this correction by the quotient of the density functions, to evaluate an integral. And we know the error goes down with the one over square root of N rate, so with, say, a thousand samples we already have only a few percent of mismatch between the estimate and the true value. This was just an example of how you can use weighted samples to approximate an integral or expectation value.
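
Here is a small sketch of this example, with a standard normal target and a logistic proposal sampled through its quantile function. The scale constant 1/1.702 is the commonly cited logistic approximation to the normal CDF; I am not certain it is the exact constant used in the lecture, so treat it as an assumption (the estimator is valid for any scale):

import numpy as np

rng = np.random.default_rng(1)
s = 1.0 / 1.702                            # assumed logistic scale approximating the normal CDF

def normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # target density, constants known

def logistic_pdf(x):
    z = np.exp(-x / s)
    return z / (s * (1 + z) ** 2)                     # proposal density (logistic, scale s)

def f(x):
    return x**2                                       # example integrand, E_P[X^2] = 1

N = 1000
u = rng.uniform(size=N)
x = s * np.log(u / (1 - u))                # quantile (inverse CDF) sampling from the logistic
w = normal_pdf(x) / logistic_pdf(x)        # importance weights, close to one in the bulk
estimate = np.mean(w * f(x))
print(estimate)                            # typically within a few percent of 1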

4 Sampling Methods - Selection Sampling

Hi everyone, in this video I want to talk about a simple idea that is implicitly used in a lot of sampling methods; I want to spell it out more explicitly. I gave it a name, I call it selection sampling. I actually do not know whether this terminology already exists, but I know it is a simple form of rejection sampling, and if I had called it rejection sampling there would be more confusion, since that terminology already exists. So let us look at a simple example: consider the task of sampling uniformly from the unit ball, here a disk. The unit ball is the set of all points (x, y), with x in one direction and y in the other, such that the distance from the origin

is smaller or equal to one. Over here we have zero, here zero, here minus one and minus one, here one and one, and here is the origin. Now, if you do not know how to sample directly from this kind of curved object, what we do is sample uniformly from the enclosing square, uniformly in each direction, and then we select only the points that lie inside this ball, this disk. That is why I call it selection sampling: we select only the points we are interested in.

So if I have a point inside the disk, we select it; a point outside gets rejected. Instead of sampling uniformly from the ball directly, we sample uniformly from the square. Let's write this down. We take X, a random variable sampled uniformly from minus one to one, and likewise Y. Then we define our selection variable Z as the indicator function of X squared plus Y squared being smaller or equal to one. So Z is one if X squared plus Y squared

is smaller or equal to one, and zero otherwise. So Z is a deterministic function of X and Y, and that means we can write down the conditional probability of Z given X and Y: it is a delta peak, a probability of one on Z being equal to the indicator of the ball evaluated at (X, Y), which we can itself write as an indicator function. Then clearly this probability is one if Z equals one and (X, Y) lies in the ball,

and if Z is zero and the point does not lie in the ball, this is also one; otherwise it is zero. Then you can write down the joint distribution, which is just the uniform distribution on minus one to one for X, times the uniform distribution on minus one to one for Y, times this delta term, written as the indicator that the indicator of the ball at (X, Y) equals Z.

So we have a joint distribution. Now we can look at the conditional given that Z equals one. Z is only one if X and Y lie in the ball; the random variable Z equals one if and only if (X, Y) lies in the ball. So this conditional is the probability of (X, Y), sampled uniformly

from the square, given that it also lies in the ball, which is the uniform distribution on the disk. Let's write it out formally: this is the joint probability of X, Y and Z equals one, divided by the probability that Z equals one. If you plug Z equals one into the joint, you are left with the uniform density on the square

multiplied by the indicator of the ball, and the uniform density is just the constant one quarter. Now we divide by the probability that Z equals one, which is the probability that (X, Y) lies in the ball under the uniform distribution on the square, and that is the volume of the ball

divided by the volume of the square. The volume of the square is four and the volume of the ball is pi, so the constants cancel and what we get is one over pi times the indicator of the ball at (x, y). This is exactly our target distribution, the uniform distribution on the disk. So, to summarize what we got: if we are able

to sample from this base distribution, then we construct a selection variable Z from X and Y (it could also be constructed in a totally different way), and together with the conditional distribution of Z given X and Y this gives us a joint distribution that we can sample from. How do we sample from it? We sample X and Y, and then we can just compute Z. Then we select the points where Z equals one, that is, we condition on Z equals one, and what we end up with is our target distribution. So

the target distribution is a conditional distribution of a higher-dimensional joint distribution, and this is the general idea. The general idea of selection sampling is: we want to sample from a target, and we try to interpret it as a conditional distribution of a higher-dimensional one. This means we construct random variables X and Z such that we can sample from their joint distribution, and such that the conditional given Z equals one is our target distribution. This is the important part, and

then we can sample from the joint. So what we do is sample i.i.d. from the joint, and then we keep only the samples where the second component, Z, is one. Since everything is i.i.d., conditioning one sample on Z equals one does not affect the other samples. So we take the pairs (X_n, Z_n) where Z_n equals one and keep only the X part, and because of the conditional equation these are i.i.d. samples from the target distribution. This is the general idea of selection sampling: we select only the samples from a higher-dimensional distribution

where Z is one; of course the conditioning value could also be something else, it could be zero, whatever value gives you your target distribution. To visualize this a bit, let's draw the x-axis and the z-axis and draw some contours of the joint distribution; our target distribution is then, for instance, the slice where Z equals one,

and the conditional distribution along that line equals our target distribution: formally, the joint evaluated at z equal to one, normalized, is supposed to be our target distribution. So the general idea is: we have our target distribution, we construct a joint distribution that we can sample from, such that the conditional is our target, and when we sample from this joint we get this conditional

just by selecting the samples where Z equals one.
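
To make the disk example concrete, here is a minimal sketch (the names and sample size are just illustrative):

import numpy as np

rng = np.random.default_rng(2)

N = 10_000
x = rng.uniform(-1.0, 1.0, size=N)          # sample uniformly from the square [-1, 1]^2
y = rng.uniform(-1.0, 1.0, size=N)
z = (x**2 + y**2 <= 1.0)                    # selection variable: is the point inside the unit disk?

x_disk, y_disk = x[z], y[z]                 # keep only the selected points: uniform on the disk
print(z.mean())                             # fraction selected, close to pi/4 (about 0.785)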

5 Sampling Methods - Rejection Sampling

Hi everyone, in this video we'll talk about rejection sampling. The general setup is very similar to the other sampling methods: we start with a target distribution that we want to sample from, and the only thing we can do with it is evaluate an unnormalized part, one value in, one value out. And we use a proposal distribution that we can actually sample from and whose unnormalized part we can also evaluate: we can plug a value in and get a value out. Furthermore, we have to assume one crucial thing in rejection sampling, namely that we can scale the proposal distribution to be bigger than the unnormalized part of our target; often this can be achieved by just multiplying the proposal by a constant k, which effectively multiplies its unknown normalizing constant

by k as well. We now assume this holds, and to be precise, what we actually need is that we can evaluate the quotient of the unnormalized target and the scaled unnormalized proposal, and that this quotient is at most one. Let's draw how it looks. First we make an axis, the x-axis, and then we draw the target density p-tilde, and in

another color our scaled proposal q-tilde. Rejection sampling works as follows: we sample from the proposal, the green curve; maybe we get a sample here, then we look at the vertical line through this point. This sample is a sample from the proposal with its probabilities, so it is not yet a sample from the target. And now the question is: do we keep this sample, yes or no? The probability

of keeping it depends on the two curve values at this point. What we do is sample uniformly between zero and the scaled proposal value: if we land below the target curve we accept, and if we land in the area above it we reject. This is very much in the flavor of importance sampling, where we took a proposal and weighted it with the quotient of these two values; here, instead of putting a weight on the sample, we make an accept/reject decision with that weight as the probability.

If we land up here, it is too much, and if we land down here, it is fine. So the central procedure looks as follows. We sample an X, written as a capital X since it is a random variable, from the proposal distribution. Then we sample a uniformly distributed random variable U on [0, 1], independently of X. Then we set the selection variable Z to the indicator of whether U is smaller than the quotient of the unnormalized target over the scaled unnormalized proposal at X. This is the part where we decide whether we are above or below: Z is one if U is smaller, zero if it is bigger, and if it is one we accept,

and if it is zero we reject. The accepted points will then be distributed like our target distribution. Intuitively this is clear if you look at the picture: we sample a point with probability q(x), and then we accept it with a probability proportional to p-tilde(x)/q-tilde(x), so the q part cancels and we are left with the p-tilde part, and since the result is a probability distribution it is automatically normalized, even though we started from the unnormalized version. This is the intuition behind the algorithm.
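
Before the formal argument, here is a minimal sketch of this procedure, with an unnormalized Gaussian target and a Cauchy proposal; the bound k = 2 is chosen generously by hand and is an assumption of this illustration, not something derived in the lecture:

import numpy as np

rng = np.random.default_rng(3)

def p_tilde(x):
    return np.exp(-0.5 * x**2)                 # unnormalized target (standard normal shape)

def q_tilde(x):
    return 1.0 / (1.0 + x**2)                  # unnormalized proposal (Cauchy shape)

k = 2.0                                        # chosen so that p_tilde(x) <= k * q_tilde(x) for all x

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.standard_cauchy()              # sample from the proposal
        u = rng.uniform()                      # independent uniform on [0, 1]
        if u < p_tilde(x) / (k * q_tilde(x)):  # accept with probability p_tilde / (k q_tilde)
            samples.append(x)
    return np.array(samples)

print(rejection_sample(5))                     # accepted points are (normalized) target samples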

But let's do it formally. What we do is write down the joint distribution of our X, our U and our Z, and we can factorize it as Q(z | x, u) times Q(u) times Q(x). We can write it like this because U is sampled independently of X, and Z is a deterministic function of the two. Formally, Q(u) is just the uniform density, written as an indicator function with respect to the base measure, and Q(z | x, u) is just a delta peak, because Z is

the deterministic function given by the indicator that z equals the indicator of U being smaller than p-tilde(x) divided by q-tilde(x); that is how Z was defined. So we have all the ingredients of the joint, and now we can look at marginals. The first thing we look at is the joint of X and Z equals one, because this is the acceptance case; this means marginalizing out U while X stays.

You can see that q(x) does not depend on u, so it comes out of the integral, and we plug in z equals one: the delta peak is then one exactly when the indicator condition holds, so we are left with integrating the uniform density of u from zero up to the value p-tilde(x)/q-tilde(x). So what we get is just q(x) times

p-tilde(x) divided by q-tilde(x). Now, as you can see, you can write in the normalizing constants: p-tilde is Z_P times p and q-tilde is Z_Q times q. Then the q and the one-over-q cancel, and we are left with Z_P divided by Z_Q times p(x). This is the marginal of X and Z equals one. Now we can even marginalize out X; what do we get then? We integrate this over x, so this is just Z_P over Z_Q times the

integral of p(x) dx, and that is just Z_P over Z_Q, a ratio of normalizing constants. Finally we look at the conditional: we are interested in Q(x | Z = 1), and this is just Q(x, Z = 1) divided by Q(Z = 1), the first marginal divided by the second, and that is just p(x), the normalized target. This is what we wanted to show: all the steps we did

really give the target back out in the end. Now let's talk about a second version, which we will only discuss informally, and that is called adaptive rejection sampling. Here we make the additional assumption that log p(x) is concave, so that it has this kind of shape, and as a proposal we now use a function whose logarithm, log q(x), is piecewise linear.

It comes with these support points here. What we do now is sample from this q; this is possible because piecewise linear on the log scale means we sample from an exponential distribution on each of the pieces and then piece them together. Then we do the normal rejection sampling step: for instance we sample this point x here, then we do the acceptance/rejection step, and in this case with high probability it will be accepted. But let's say it is rejected, say the uniform sample lands up

here, so the point is rejected. If this is the case, what I then do is use this rejected point: I evaluate log p at it, I draw the tangent line there, and now I get a better, much closer proposal above the target. So I take the rejected points and use them to improve my proposal, and if I keep doing this, I get a tighter and tighter proposal: maybe the next rejection is here, I add another line, and so on. This is adaptive rejection sampling. Now, whether rejection sampling, or this adaptive

version, is good or not always needs to be checked in high dimensions, because that is what we are interested in in machine learning. For this, think about a very simple example: a factorized D-dimensional Gaussian that we want to sample from as the target, and another factorized Gaussian as the proposal, where we assume we can sample from the proposal but not from the target. Here we have the standard deviations, and we assume that sigma_Q is bigger or equal to sigma_P; this is a requirement for the proposal to be scalable above the target. To find the scale, we look at the quotient of the two densities, and you will see there is always a (sigma_Q over sigma_P) factor raised to the power of D,

so we need to scale the proposal by the factor k equal to (sigma_Q over sigma_P) to the power of D to get p-tilde(x) smaller or equal to k times q-tilde(x) for all x. This is the optimal choice of k that just gets the proposal above the target, as you can see in the picture. Then I can look at the acceptance rate in high dimensions, and the acceptance rate also depends on this quotient.

The acceptance rate is how often we accept when we use this procedure, and we get that one over k, that is, (sigma_P over sigma_Q) to the power of D, is the acceptance rate. As you can see, one over k is smaller than one because sigma_Q is bigger, and if sigma_P over sigma_Q is just a little bit smaller than one and I take the power of D, it will quickly go to zero. That means in high dimensions, even if there is just a little bit of deviation between these two distributions, the probability of acceptance might be as low as one percent, so almost all samples will be rejected. This is very, very inefficient; what we hope for is that something like ninety-five percent of samples get accepted, but here it is

the other way around: only a very, very small number of samples can actually be used. In conclusion, what we want is of course a proposal with a low number of rejections, and that means rejection sampling works very badly in high dimensions. Now you could say: what about the adaptive version? But even the adaptive version works very badly in high dimensions, and the reason is that the support points also need to be scaled up: if you want a good adaptive proposal, the number of support points needs to grow exponentially with the number of dimensions. So we can basically only use rejection sampling in the one- or two-dimensional case, or as a building block

for more sophisticated methods, as we will see later. And this is already all about rejection sampling.
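
To make the dimensionality argument concrete, here is a tiny numeric illustration of the acceptance rate (sigma_P / sigma_Q)^D; the particular standard deviations are just an assumed example:

sigma_p, sigma_q = 1.0, 1.1                   # proposal only 10% wider than the target
for D in (1, 10, 50, 100):
    acceptance = (sigma_p / sigma_q) ** D     # = 1/k for the optimally scaled Gaussian proposal
    print(D, acceptance)
# D=1: ~0.91,  D=10: ~0.39,  D=50: ~0.0085,  D=100: ~7e-5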

12 Sampling Methods

0 Sampling Methods - Law of Large Numbers 


Hi everyone, in this video we want to talk about the law of large numbers for dependent variables. We need this because we want to develop Markov chain Monte Carlo methods. In those methods we have a Markov chain, which means we have samples that were constructed from other samples, so we have time dependence between all these samples. In these Monte Carlo methods we ultimately want to evaluate something like an integral, an expectation value, by using samples, and up to now we always used independent samples, because then we knew the empirical average would be a good approximation. In Markov chain Monte Carlo methods the dependence is now a problem,

unless we can show that the law of large numbers even holds for dependent variables, and exactly this is what the theorem says. It is called the ergodic theorem, or we can call it the law of large numbers for dependent variables. We need two conditions, which are not i.i.d. conditions but conditions that allow dependence, and which we will talk about in a moment: stationarity and ergodicity. Besides these two assumptions, we need our sequence of random variables to have finite moments, and we only need this for s equal to one or two: you could say we have a finite mean and a finite variance, which is the case s equal to two. This would already be

enough. Then the theorem states that even under these conditions we get convergence of the empirical mean towards the expectation value. Stationarity here implies that all the marginal distributions of these X_n are the same, so the expectation value of X_1 is the same for every n. Then we get a special case: the i.i.d. case. If the variables are i.i.d., then they are also stationary and ergodic, and then we also have this convergence, but we knew that before, under either the finite-moment conditions or

actually much weaker conditions, but we do not want to get into that. So this means that under stationarity and ergodicity you can evaluate, or approximate, an expectation value by an empirical mean. Now, what do these quantities mean in this case? We could interpret the expectation value from a frequentist point of view, as in the independent, i.i.d. case of the law of large numbers: there you basically say that we take an average of infinitely many independent random variables. So we draw X_1, X_2, X_3 and so on, and if we average them, and they are all independent,

then the averages converge to the expectation value. From a frequentist point of view this is the average over infinitely many independent restarts of the process: we reset the process one, two, three, four, five times and so on, and then we average only the starting points, and we get the expectation value. That was the law of large numbers for independent random variables. But here we now have something a little bit different: it is still the average over the X_n, but we now have dependent variables and only one run. We run one time series, for instance, and we average just the points on this one time series, and we claim that even though they are dependent, and even though it is one run, the average will converge to the right quantity. So in this sense the ergodic theorem says that the spatial average,

which would be the expectation value, equals the time average, which would be the average along the run, for large times. So now, what do these assumptions of stationarity and ergodicity actually mean? Here is a definition: a stochastic process, which is just a sequence of random variables, is said to be stationary if the distribution does not change when I shift it. So if I cut off a few of these random variables at the beginning and just shift the rest back, then I have the same distribution. For a shift by one this means the shifted process has the same distribution as the unshifted one, or, written differently, the probabilities of the shifted random variables are the same as the original ones: for instance, X_0 is dropped, so X_1 takes the place of X_0, X_2 takes the place of

X_1, and so on, and then the distribution should be the same. Or, if you use mass functions for discrete random variables, it means that if I set X_1 to some value, X_2 to another value and so on, the probability is the same as if I set X_2 to the first value, X_3 to the second value and so on. As you can see, I need to write down explicitly which random variables I mean and which values; I cannot just write something like p(x_1) equals p(x_2), because usually we abbreviate the variables by the symbols we use for the values. Maybe an illustration: this would be an example of a stationary process, meaning if I looked at this part, shifted it over here and compared it to that part, they would look the same.

Of course there are small gaps and so on, but we are talking about distributions: the distribution of these values would be the same as the distribution of those values. So this would be stationary. In contrast, if you look at this other series, which has a strong trend going up here, and you take a later piece which goes down, then you could not say the distributions are the same: one goes up, one goes down. So the first one would be stationary and this one would be non-stationary. For the law of large numbers for dependent variables we need this stationarity condition, so it does not hold for such trending time series. Note also that here we have dependence between time steps: for instance, if something is very high here, it is also high at the next step; it goes down bit by bit rather than jumping from here to there. It might sometimes jump, but there is still dependence

between the time points. Then the second condition is ergodicity. We say that a stochastic process, again just a sequence of random variables, is ergodic if every shift-invariant event for the whole process gets either probability zero or probability one. This may be easier to digest if you look at the picture down here. What it says is: I look at some region C, the same region for every variable, I start in this region, and then I look at the probability that at all following time steps the process stays in the same region. So I sample X_0, it is in C, then I take X_1,

it is in C, and so on: there is just one region and they are all in there. That is the event. Ergodicity says that this only occurs with probability zero, so it will not happen, unless, and this is a slightly odd corner case, the probability of this event is one, which essentially means C is the full space, because my process can never leave the full space. But if C is not the full space, then the process will eventually jump out of this region. The probability here is not along time; the probability is still meant as if I resampled the whole time series, which might then look a bit different. So, expressed differently:

either every realization of this process stays fully inside C, which is the case where the probability is one, it will never leave, or all realizations will eventually jump out of it. Of course you would say: for one single realization this is logical, either it stays fully in there or it leaves at some point. But we are saying something stronger: either every realization stays in there, or every realization will jump out. What is excluded is that some realizations stay in C and some realizations jump out. So what we are actually saying is: if C is not the full space, then an ergodic process will always jump out

of every such small event. If you have any such event, eventually the ergodic process will jump out of it; in other words, the process will explore the whole space over time, it does not stay stuck in some region. So if you have these two assumptions, stationarity, meaning if I shift the process it looks the same, and ergodicity, meaning it will always jump out of every small region and explore the whole space, then the time mean converges to the spatial mean, if you want.
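
As a small illustration of this theorem for a dependent but stationary and ergodic process, here is a sketch with a stationary AR(1) chain; the coefficient 0.9 and all names are just assumed choices for the example. The time average over one single run approaches the stationary mean, here zero:

import numpy as np

rng = np.random.default_rng(4)

T = 100_000
a = 0.9                                            # AR(1) coefficient, |a| < 1 gives stationarity and ergodicity
x = np.empty(T)
x[0] = rng.normal(scale=1.0 / np.sqrt(1 - a**2))   # start in the stationary distribution
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal()             # dependent samples: each one built from the previous one

print(x.mean())                                    # time average over one run, close to the stationary mean 0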

1 Sampling Methods - Convergence Theorem

Hi everyone, in this video we'll talk about the convergence theorem for Markov chains, another building block for Markov chain Monte Carlo methods. First let's remind ourselves what a Markov chain was. A Markov chain is a Bayesian network that looks like this chain here, and we are considering homogeneous Markov chains, where the transition probabilities are always given by the same Markov kernel, which we call T; that is what homogeneous means. We treat X_0, or rather its initial distribution, more like a parameter, so we only need to deal with this T and we do not want to commit directly to a distribution on X_0; we can always plug one in later.

And in fact we will consider two different distributions on X_0. The joint distribution of X_N, X_{N-1} and so on, down to X_1, given X_0, is given by the transition kernel: T(x_N | x_{N-1}) times T(x_{N-1} | x_{N-2}), down to T(x_2 | x_1) times T(x_1 | x_0). Note that on the left-hand side we use the usual abuse of notation, where this means the probability that X_N equals x_N, and so on down to X_1 equals x_1, given X_0 equals

x_0. That is what we mean, either with probability mass functions or with densities, while on the right-hand side we have one fixed function T that we evaluate on different arguments. So, just to recap: a Markov chain is the Bayesian network given by this chain graph, with a joint distribution, or in our case a conditional one, that factorizes according to that graph. And we say it is homogeneous if there exists a Markov kernel T such that every conditional distribution is given by this same kernel; as I said, the probabilities are meant for these variables, but the arguments are plugged

into this one function. Then we need a few more definitions. Say again we have a Markov kernel and a probability distribution pi. We consider this distribution as a function: we plug in a value and get some value greater or equal to zero out; if it is a mass function it is between zero and one, if it is a density it can be bigger. We say that this distribution is invariant with respect to T if it satisfies the following condition: if I integrate over the starting position weighted with pi, then the resulting distribution over the next state x is again given by pi. In short notation we could write

that T applied to pi equals pi. In the case of a finite state space, where everything is discrete, pi can be represented as a vector, evaluated at all possible values x_1, x_2 and so on, all the K classes; the integral turns into a sum, T is represented as a matrix also evaluated at all these classes, and the condition is just: matrix times vector equals the same vector. In other words, pi is an eigenvector of the operator T with eigenvalue one. Another definition: a Markov kernel is said to be reversible with respect to a distribution pi if, when I build the joint distribution over two variables by multiplying the kernel with pi, I get the same thing if I do it the other way around;

or, if you want to picture it: starting from a state z and transitioning to a state x has the same probability as going from x to z, where for this to hold we weigh the starting position with pi in both cases. Now, if I start with reversibility and integrate both sides over z, then on one side pi(x) does not depend on z, so the kernel just integrates to one, and the other side becomes exactly the invariance condition. So what you get is: if you have reversibility, then you get invariance as well. This equation is also called detailed balance

in the literature. It basically means that if I have a Markov chain with these distributions, then I can run it forward and backward and it follows the same rules. So with these definitions we now have two lemmas. The first: if a Markov kernel T is reversible with respect to some probability distribution pi, then pi is also invariant with respect to T; that is what we just said, and it follows by integrating out one of the variables.
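
A tiny numerical sanity check of this first lemma for a finite state space; the particular three-state chain here is an assumed toy example, constructed with a simple accept/reject rule so that it satisfies detailed balance with respect to pi:

import numpy as np

pi = np.array([0.5, 0.3, 0.2])                        # a distribution on 3 states

K = 3
T = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            T[i, j] = 0.5 * min(1.0, pi[j] / pi[i])   # propose j with prob 1/2, accept with min(1, pi_j/pi_i)
    T[i, i] = 1.0 - T[i].sum()                        # remaining mass stays at i

# detailed balance: pi_i T_ij == pi_j T_ji for all i, j
print(np.allclose(pi[:, None] * T, (pi[:, None] * T).T))   # True
# invariance follows: pi T == pi (pi is an eigenvector of T with eigenvalue one)
print(np.allclose(pi @ T, pi))                              # True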

The second lemma: consider a homogeneous Markov chain with transition kernel T and initial distribution pi on X_0. Then the chain is stationary, in the time-series sense defined before, if and only if pi is invariant with respect to T. So homogeneity plus an invariant initial distribution gives stationarity, and the other way around also holds: if you have a stationary homogeneous Markov chain, then the initial distribution is an invariant distribution. Now let's come to our main theorem, the convergence theorem for Markov chains. Here we assume we have an ergodic Markov kernel, ergodic in the typical sense but with a small amount of extra technical assumptions; the main point is that it is ergodic, and in some places the technical assumptions are folded into the definition of ergodicity, just

to be able to say: let's start with an ergodic Markov kernel. Then the first claim of the theorem is that it has a unique invariant distribution pi: there is one and only one. And then, if we take any initial distribution Q and create a Markov chain starting with Q and with T as transition probabilities, we get all the marginals: P_0 is Q, P_1 is one transition with T, marginalizing out X_0, for P_2 we start with X_1, transition with T and get X_2, and so on for all the marginals. We say this

to mean that each P_n is a function we can plug any value into, where we set X_n equal to this value. Then we have the convergence: remember, we start with Q and transition with T, and these marginals converge to the invariant distribution. No matter where we start, we will end up with the invariant distribution. This is the convergence theorem for Markov chains. It means that after some burn-in time, or mixing time as it is also called, or in machine learning we could just say a learning time, because it is like going from random initialization to our optimum through several learning steps, where applying T is the learning step. So what people call burn-in time, we could just say

is the learning time, from initialization to the optimum. After this time we have converged, and the marginals of the Markov chain are approximately pi; that means if we start from that time on, the chain is stationary and ergodic, and this means we can apply the ergodic theorem from there on. Also, what comes out of the theorem is that samples far apart are approximately independent: say we reach the burn-in time, then we wait, take another sample, wait, take another sample, and these samples are approximately independent. This property is called strong mixing, and this Markov chain also has this property. And of course

the convergence time is much lower if the initial distribution is already close to pi; then transitioning will not change much and we are already close in total variation. This brings us to the general strategy for Markov chain Monte Carlo sampling methods, where we turn the theorem around a little bit. We start with the target distribution, call it pi, and what we do is construct an ergodic Markov kernel such that it has pi as its invariant distribution and such that we are able to sample from it. In practice the ergodicity assumption is often a bit ignored; the main point is that we get the right invariant distribution, and then with simulations

it is checked how fast the chain converges anyway. Then, once we have this Markov kernel with our target distribution as invariant distribution and we can sample from it, we start from any proposal distribution, preferably one that is already close to pi, but the guarantees also hold if it is not close, and then we iteratively sample along the chain: we plug the current value in, sample a new value X_1, take this, plug it in, sample X_2, plug this in, sample X_3, and so on, on and on. Then after some burn-in time M we have samples X_{M

+1}, X_{M+2} and so on, and we just keep going with the sampling with T; from there on these are all distributed marginally like pi. One thing to be aware of, though, is that these are non-i.i.d. samples, but in any case they are stationary and ergodic, which means we can apply the ergodic theorem. So for approximating an integral with respect to pi, you take these samples, evaluate the function on each sample and average, and if we go far enough this is a good approximation. The theorem says that if I take the number of samples big enough, this is a good approximation of the integral. And in case we are interested in i.i.d. samples, we take,

after the burn-in phase, only every k-th sample. You throw away the intermediate samples; then each kept sample is approximately independent of the next, and so on, and they still marginally have the distribution pi: we get approximately i.i.d. samples. This is the general strategy for Markov chain Monte Carlo, and what is left to do is this one step: how do we get a Markov kernel that has pi as its invariant distribution and that we can sample from? To say it again: we start with the target distribution, we create a Markov kernel that has this target distribution as its invariant distribution, we sample from any proposal, preferably one close

to pi, and then we keep sampling from this Markov kernel over and over again. At some point it has converged, the samples will be close to being sampled from pi, and this can be used for averaging.

2 Sampling Methods - Markov Chain Monte Carlo - Metropolis-Hastings

Hi everyone, in this video we'll talk about Markov chain Monte Carlo, specifically the Metropolis-Hastings algorithm; this is maybe the most influential sampling algorithm in history so far. Let's recap. The general strategy for Markov chain Monte Carlo starts with the target distribution we want to sample from; then we construct a Markov kernel in such a way that it has our target distribution as its invariant distribution, and we need to be able to sample from it. So this kernel needs to be constructed. From then on, we sample from any proposal distribution, preferably one that is close to our target distribution, and then

we iteratively sample from our transition probability, on and on and on. After some amount of time we know there is convergence, by the convergence theorem for Markov chains; this means that the samples we get follow the distribution of our target, and we know that they are in general not i.i.d., but at least stationary, ergodic and strongly mixing. That means we can use these samples for estimating an expectation value by averaging; this is due to the ergodic theorem. Also, the strong mixing property allows us to take approximately i.i.d. samples by just leaving a large gap between the samples, one after the

other. So what is left to do is the construction part: how do we construct such a transition kernel? This is done with the Metropolis-Hastings construction. Here we again write our target distribution as an unnormalized part that we can evaluate, divided by a normalizing constant that is not known. This could for instance be the posterior of some Bayesian inference task, where this normalizing constant is the evidence, or some huge graphical model where it is the partition function. This is our target distribution, and what we want is to sample from it. As I said, we assume we can evaluate this unnormalized part, but that is all we

can do with p. Then we take any positive transition kernel as a proposal kernel K, and that could be as simple as a normal distribution centered at the current point, or, if you prefer heavy tails, a Cauchy distribution; basically anything, for instance anything you can write as a probability of x minus z, will do. We consider this transition kernel more like a proposal transition kernel, because we are about to change it, and we change it in the following way. First let us assume the old state z is given; this is a point, and we want to construct the next point. For a proposed point x we evaluate the unnormalized p: if

its value is higher, the point is more likely, and we compare it to our current point; the new point is compared to the old point. For the moment, just forget about the K factors in the ratio: if the proposal is symmetric, for instance a normal, they cancel anyway. Here you can see the connection to what we did in rejection sampling: if this ratio were high we accepted, and if it were low we would totally reject, based on a uniform sample. Then we integrate this accept term over x and subtract it from one. Why is that? The reason is that we want to use the quantity below as a transition kernel: we come up with this proposal and we want to correct it by the probabilities given by this acceptance quantity. We want to accept something that has higher probability with

respect to p, we want to move closer to the modes, so we have to weigh our proposed point with some acceptance probability. The problem now is that if you multiply the proposal kernel by this acceptance probability, it is not a probability distribution anymore: if I integrate the proposal over x, I get one, but once I weigh each point, the integral is no longer one. So I have to correct for all the probability mass that is missing, and this is the rejection probability, the rest of the probability that the accept term does not carry. Of course we could think about normalizing by dividing instead of adding something, but then we would have basically the same problem of trying to evaluate an integral of p-tilde, which

we cannot do; otherwise we would have done so already. So instead of normalizing by dividing, we normalize by adding mass at one particular point, and this point is just staying where we are, since no other point is distinguished. Let us put this here in a box: this is the main quantity of interest, our transition kernel. Now we show several things about it. The first thing we have to show is that it is actually a transition kernel, or Markov kernel, which is just a different word for the same thing; what we have to show is that if we integrate it over x, we get one. So let's do this: we integrate, over x, the accept term A(x, z) times K(x | z),

plus the rejection term r(z) times the delta at z, where r(z) does not depend on x, so integrating the delta over x just gives r(z). Now, if you look at the definition of r(z), it is just one minus the first integral, so we can write the whole thing as one minus r(z) plus r(z); as you can see, these cancel out and we get one. Also, the kernel is clearly greater or equal to zero, since the acceptance probability is

between zero and one, the proposal is greater or equal to zero, and the delta term is non-negative as well; in particular, the integral of the accept term also lies between zero and one, since we multiply something that integrates to one by something between zero and one, and that cannot go beyond one. So all quantities have the right sign and the kernel integrates to one. The second thing we need to show about this T is that it has our target distribution as an invariant distribution. How do we do it? We do it by checking that it is reversible; then we know that p is an invariant distribution with respect to T,

which is our target, up to normalization. So what do we have to show? We have to show that T(x | z) times p(z) is the same as the expression with the two arguments switched. Let's write this down: T(x | z) times p-tilde(z); for simplicity we use the unnormalized part, because the normalizing constant does not matter and it makes it easier to write. So what is this? It is A(x, z) times K(x | z) times p-tilde(z), plus r(z) times delta(x - z) times p-tilde(z).

Now let us write down what this A was: A(x, z) is the minimum, which I write just like this, of one and a fraction, and the fraction was p-tilde(x) times K(z | x) divided by p-tilde(z) times K(x | z). Now we multiply this with K(x | z) times p-tilde(z): as you can see, the denominator of the fraction cancels against this factor, while the one gets multiplied by it, so the first term becomes the minimum of p-tilde(z) K(x | z) and p-tilde(x) K(z | x).

Let us keep this on hold for now and just look at the other term, the one with r(z) times delta(x - z). Maybe we make one small adjustment here: this delta function is, loosely speaking, nonzero only where x equals z and zero otherwise, so this term contributes nothing when they are not equal; but when they are equal, I may just as well exchange the two points. And this is what I am going to do: I exchange the argument of r, replacing z by x, and

since the delta function is symmetric I can also write delta(z - x). So the second term becomes r(x) times delta(z - x) times p-tilde(x). And for the first term, instead of pulling out p-tilde(z) K(x | z) as we did above, I can just as well pull out the other product:

what I get is the minimum of one and p-tilde(z) K(x | z) divided by p-tilde(x) K(z | x), times K(z | x) times p-tilde(x), which is A(z, x) times K(z | x) times p-tilde(x), plus r(x) delta(z - x) p-tilde(x). As you can see, this now has the same functional form as what we started with, with x and z switched:

it is exactly T(z | x) times p-tilde(x), and this is what we wanted to show. From this it follows that p is invariant with respect to T, meaning we have constructed a transition kernel that really has p as an invariant distribution. The next thing is that we need to sample from this transition kernel; how can we sample from it? We can do it as follows. First, let z be the old state. Now what we do is sample from the proposal transition kernel, which we assumed we can sample from, for instance a normal

distribution or a Cauchy; we know how to do that. We sample just one point Y from it. Then we compute the acceptance probability A(Y, z) by just plugging in: we can evaluate p-tilde at both points and K in both argument orders; maybe we should have said clearly that we need to be able to evaluate the proposal kernel as well. Then we get a number: the old state is just a number, the proposed point has been sampled so it is just a number too, we plug them in and compute A. Then we sample a uniform random variable U independently, and we define the selection variable S as the indicator of U being smaller or equal to A(Y, z). We take the new point to be Y if S equals one, and the old state z if S equals zero. So this is just a fancy way

of writing: X equals Y if U is smaller than A(Y, z), and X is defined to be z if U is bigger. So we sample Y, we evaluate A, and then we pick the new point according to this rule. Okay, now we can do this, we can sample it. But the question is: why does this reflect the transition kernel we wrote down? This is because S is Bernoulli distributed with success probability A(Y, z), so it really

reflects the acceptance probability. If we have success, the probability of having landed at a particular new point is given by the proposal density at that point; if S is zero, we reject and stay at z. So the probability of moving to x via acceptance is: propose Y equal to x, which has probability K(x | z), and

accept it, which means we multiply by A(x, z). The probability of not moving at all is obtained by multiplying, for each possible proposal Y, the probability of proposing it with the probability of rejecting it, and integrating; that gives one minus the integral of A(y, z) K(y | z) dy, which is exactly r(z). Here we use that U was sampled independently of Y.

So the acceptance part contributes A(x, z) K(x | z), and the rejection part contributes r(z) times the delta at z, which together is exactly our transition kernel T.
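
A minimal sketch of this Metropolis-Hastings procedure with a symmetric Gaussian random-walk proposal, so the K factors cancel in the acceptance ratio; the unnormalized target, the step size, and all names are just assumed examples:

import numpy as np

rng = np.random.default_rng(5)

def p_tilde(x):
    return np.exp(-0.5 * (x - 3.0) ** 2)       # unnormalized target, a normal centered at 3

def metropolis_hastings(n_steps, step=1.0, x0=0.0):
    x = x0
    chain = []
    for _ in range(n_steps):
        y = x + step * rng.normal()            # symmetric proposal K(y | x)
        a = min(1.0, p_tilde(y) / p_tilde(x))  # acceptance probability (K factors cancel by symmetry)
        if rng.uniform() < a:
            x = y                              # accept: move to the proposed point
        # else: reject and stay at x (this is the delta part of the kernel)
        chain.append(x)
    return np.array(chain)

chain = metropolis_hastings(50_000)
print(chain[1000:].mean())                     # after burn-in, close to the target mean 3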

Now, a few remarks. One remark is: if we propose a point that has higher probability, up to these proposal factors, we always accept, and if it is smaller we only accept the point with the corresponding odds. Why are

we interested in Markov chain Monte Carlo at all? The main point is that it does not suffer from the curse of dimensionality so much, in particular in its convergence speed. We did not analyze this because it is a bit complicated, but the convergence speed essentially depends on the size of the second-largest eigenvalue of our transition probability, considered as a linear operator. If you have a discrete, finite state space, then T can be represented as a matrix with its eigenvalues; one eigenvalue is one, and the difference between one and the second-largest eigenvalue is called the spectral gap. The convergence behaves roughly like this second-largest eigenvalue, that is, one minus the spectral gap, raised to the power of the number of time steps,

and this eigenvalue does not directly depend on the dimension: if you have a higher-dimensional space with a similar transition probability, the eigenvalue might just be the same and the convergence speed might be the same. It is not directly tied to the dimension, and this is one of the big advantages of MCMC. Also, here in Metropolis-Hastings we have seen that we again have a rejection/acceptance step, so in this sense MCMC with Metropolis-Hastings can be seen as an iterative version of rejection sampling: instead of totally rejecting a point and then doing everything from scratch, we allow for re-

sampling of the increments. At a point we sample something; if it is not accepted, we fall back to this point instead of starting anew, then we sample another point and maybe accept it. Instead of totally throwing out the point, we re-sample the increment and take it if it gets better. So this is Metropolis-Hastings Markov chain Monte Carlo.

3 Sampling Methods - Markov Chain Monte Carlo - Gibbs Sampling

Hi everyone, in this video we'll talk about another Markov chain Monte Carlo sampling algorithm, namely Gibbs sampling. So let's hop directly in and see how it works. The setting is as follows: we again have a target distribution, and the variable has several components, so this might be a higher-dimensional space, and we can evaluate the target up to a normalizing constant; this unnormalized function is our target distribution. What we want to exploit is the fact that we can sample efficiently in one dimension. Here we can plug in any one-dimensional sampler we have access to, for one-dimensional, possibly unnormalized distributions; and if the conditional distributions that occur are directly accessible, we can also use

them directly. Then what we do is: we first sample from some initial distribution to get a starting point, and then we sample each of the components individually. Note that the time index is now a superscript on top, while the components are indexed with a subscript. After we have the initial point, we set the first component of the next point by sampling it conditioned on the previous values of all the other components, from this conditional distribution. This is just a one-dimensional distribution conditioned on all these values, and it is of course proportional to the same unnormalized function

where the other values are held fixed and only this one component varies; we can evaluate it because we can evaluate the unnormalized target and the other values are given. For instance, we could use adaptive rejection sampling, slice sampling, or Metropolis-Hastings in one dimension here. After we have the first component, we sample the second component of the next point, but now we already plug in the new value of x_1 that we just sampled, and we use the old values for the other components. And again

this is proportional to the same function, now with those values fixed and x_2 running, and so on. In the d-th component we plug in the d - 1 values we just sampled, and the remaining values are taken from the previous point; again this is proportional to the unnormalized distribution. We continue like this until we reach the last component, which is conditioned on all the components we sampled in this step, using the same unnormalized function as before. Then our next point

consists of all the components we just sampled, and we have another point, and then we repeat this process for the next n and so on. The claim is that, after some burn-in time, the points constructed in this way are distributed like the target distribution; again they are maybe not i.i.d., but we know they are stationary, ergodic and strongly mixing. A minimal sketch of this scheme is shown below.
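As an illustration (added, not from the lecture), here is a minimal systematic-scan Gibbs sampler for a two-dimensional Gaussian target, where both one-dimensional conditionals are available in closed form; the correlation value and starting point are placeholder choices.

```python
import numpy as np

def gibbs_2d_gaussian(n_samples, rho=0.8, rng=None):
    """Gibbs sampling for a zero-mean 2D Gaussian with correlation rho.

    The conditionals are x1 | x2 ~ N(rho * x2, 1 - rho^2) and
    x2 | x1 ~ N(rho * x1, 1 - rho^2), so every update is a 1D sample.
    """
    rng = np.random.default_rng(rng)
    x1, x2 = 0.0, 0.0                                 # initial point
    sd = np.sqrt(1.0 - rho**2)
    samples = np.empty((n_samples, 2))
    for n in range(n_samples):
        x1 = rho * x2 + sd * rng.standard_normal()    # sample x1 | x2 (uses old x2)
        x2 = rho * x1 + sd * rng.standard_normal()    # sample x2 | x1 (uses new x1)
        samples[n] = (x1, x2)
    return samples

samples = gibbs_2d_gaussian(5000)
```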

So now let us more or less prove why this works, that we can sample only these one-dimensional conditional distributions iteratively and still end up targeting the right distribution. For this we construct the Markov chain, see how the transition probabilities look, and show that the target distribution is an invariant distribution of this transition kernel. Let us draw the Markov chain. Maybe we use z for the old state and x for the new state, so we don't have so many indices. So we write down z_1, z_2, z_3, z_4; these are the components of one node of the chain, and here are the other nodes: let's call the next one x_1, x_2, x_3, x_4, and maybe y_1, y_2,

y_3, y_4 after that. Now, the sampling procedure samples x_1 conditioned on all the previous values z_2, ..., z_4. Then x_2 depends on x_1, and instead of drawing an arrow from z_2 we note that x_2 no longer depends on z_2; it only depends on x_1 and on z_3, z_4. Then x_3 depends on

x_2 but also on x_1 and on z_4, and the last one, x_4, depends on x_3 and on all the previously sampled x's. So this is one node of the chain, this is another node, and then we can do the same for the next step: again y_1 depends

on the components of the x node in the same way. So this is a Markov chain: this is node one, this is node two, this is node three, and these maps between the nodes constitute the transition kernel. So let us write down the transition kernel more formally.

The transition kernel depends on x_1, ..., x_D given z_1, ..., z_D, and it is a product over d = 1, ..., D of conditional distributions of x_d, each conditioned on the already updated components x_1, ..., x_{d-1} and on the old components z_{d+1}, ..., z_D.
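Writing this kernel out compactly (display added for readability, with D components):

$$
T(x_1,\dots,x_D \mid z_1,\dots,z_D) \;=\; \prod_{d=1}^{D} p\big(x_d \mid x_1,\dots,x_{d-1},\, z_{d+1},\dots,z_D\big).
$$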

Okay, so this is the transition kernel. Now we want to show that our target distribution is invariant with respect to T. What do we have to do? We have to multiply T by p(z), integrate over z, and see that the result is p(x).

So let us write this down. We have T(x_1, ..., x_D | z_1, ..., z_D) times p(z_1, ..., z_D). Now we write the product in a different order, going from d = D down to d = 1; it is still the same product, we just multiply the factors in a different order, and then multiply by p(z_1, ..., z_D). And now we can write this

as the product from d = D down to 2 of p(x_d | x_1, ..., x_{d-1}, z_{d+1}, ..., z_D), times the first factor taken out, which is p(x_1 | z_2, ..., z_D), times p(z_1, ..., z_D). And now, I forgot to integrate: we have to integrate over all the z's, which I haven't written down yet, so

we write the integrals here, with dz_1 up to dz_D. And as you can see, we can pull the integration over z_1 inside first, because none of the factors depends on z_1 except p(z_1, ..., z_D) itself. So what do we get if we integrate z_1 out? What we get

is p(z_2, ..., z_D), and as you can see these are exactly the variables that appear in the conditioning of the factor we took out. So we can combine p(x_1 | z_2, ..., z_D) times p(z_2, ..., z_D) into the joint p(x_1, z_2, ..., z_D).

Now we proceed similarly: from the remaining product, which runs from d = D down to 2, we take out the d = 2 factor, which is p(x_2 | x_1, z_3, ..., z_D), and multiply it by the joint p(x_1, z_2, ..., z_D) we just obtained. And again you can see what happens when we bring in the next integral:

the remaining factors do not depend on z_2, their conditioning only goes down to z_3, so none of them depends on z_2. I can therefore bring the integral over z_2 inside, and similarly to before, where we integrated out

z_1, we can now integrate out z_2. And what do we get? Integrating z_2 out of p(x_1, z_2, z_3, ..., z_D) gives p(x_1, z_3, ..., z_D), and that is exactly the conditioning set of the factor p(x_2 | x_1, z_3, ..., z_D). So together this is the joint p(x_2, x_1,

z_3, ..., z_D), and of course we can reorder the arguments to our usual notation. Now you can see how this works inductively: we marginalize out the next z, multiply with the corresponding conditional factor, and obtain the joint distribution in which one more z has been replaced by an x, then x_3, and so on, until everything is replaced by x's. So what we get in the end is p(x_1, ..., x_D).

And from this it follows that p is an invariant distribution of T, which means that the Markov chain converges to the target; that is what we wanted to show. So if we use the Gibbs sampling scheme, we get this transition probability, and if we multiply it by our target distribution and integrate over all the z's, we get the target distribution in the x's, and this is what invariance means in this setting.
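For reference, the first marginalization step of the argument above written out compactly (added display):

$$
\int p(x_1 \mid z_2,\dots,z_D)\, p(z_1,\dots,z_D)\, \mathrm{d}z_1
= p(x_1 \mid z_2,\dots,z_D)\, p(z_2,\dots,z_D)
= p(x_1, z_2,\dots,z_D),
$$

and repeating this with the factors for d = 2, 3, ... turns every z_d into an x_d, leaving p(x_1, ..., x_D).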

So we have shown that Gibbs sampling converges, modulo some considerations about ergodicity, to our target distribution. Now let's discuss another version of Gibbs sampling. It is the same setting as before: we have our unnormalized target distribution and we can sample in one dimension, but instead of going through the components in a fixed order we update randomly. We sample an index d uniformly, and then we update only that component by sampling from its conditional distribution as before. But now, as you can see, we always condition on the previous values of all the other components, because we do not have any newer ones; we sample this new x_d, and then our next point is obtained by just

replacing that component with the new value. We claim that even if we do it like that, with random updates, this converges to the target distribution; again the samples are not i.i.d., but they are at least stationary, ergodic and strongly mixing. Let's look at the transition kernel here. The transition kernel again depends on the new state x_1, ..., x_D and the old state z_1, ..., z_D, and we may write it just up to some normalizing constant. We know that the probability is zero if x and z differ

in more than one component, because this transition only changes one of the components: if I start with the old state and the proposed value differs in two components, the probability of getting there is zero. Otherwise, say they differ at most in dimension d; then I can write,

to make it simple: z is of the form (y_1, ..., y_{d-1}, z_d, y_{d+1}, ..., y_D), and x is of the same form, (y_1, ..., y_{d-1}, x_d, y_{d+1}, ..., y_D); they share all components except the d-th one.

Then we claim that T is reversible with respect to our target distribution. For this we look at T(x | z) times p(z). The kernel is 1/D, the probability of picking dimension d, times the conditional of x_d given y_1, ..., y_{d-1},

y_{d+1}, ..., y_D. This conditional can be written as a joint divided by a marginal: the joint probability of x divided by the probability of the conditioning part, that is, p(x) divided by p(y_1, ..., y_{d-1}, y_{d+1}, ..., y_D). And we multiply this by p(z), which written out is p(y_1, ..., y_{d-1}, z_d, y_{d+1}, ..., y_D).

As you can see, this expression is symmetric in x and z: I can just as well use the division by p(y_1, ..., y_{d-1}, y_{d+1}, ..., y_D) to turn the z-term into a conditional instead. So this is also 1/D times the conditional of z_d given all the y's with the d-th one missing, times p(x); in other words, this is T(z | x) times p(x).

From this detailed balance we know that p is invariant with respect to T, so our Gibbs sampler with random component updates also converges to our target distribution. Here we implicitly used that the case where x and z differ in more than one component has probability zero on both sides, which is symmetric as well. Okay, then let's have a short discussion. In case we have conditional independencies or graphical structure, the Gibbs sampler can be more efficient.

The reason is that we condition on a lot of variables, which introduces a lot of variance by having all these random values in the conditioning set. If we have conditional independencies, then some of these variables do not occur in the conditional and we have less noise in the updates. If we have graphical models, we can exploit the sparse structure to have fewer dependencies. For instance, if we have a Markov random field or a factor graph and we use the random update, we can condition just on the neighbors, meaning only the factors that depend on this x_d are involved. What we can also do, for

Bayesian networks, is to introduce a topological order, meaning parents come before all their children. Then I can go through this order to update my components in the Gibbs sampler, and you can see that all the values from the previous point actually disappear: we always sample given the parents in the conditional. This is also called ancestral sampling, as we sample all the ancestors of a node first by iteratively sampling the parents first. To summarize: Gibbs sampling uses one-dimensional samplers

to sample from the conditional distributions; we do this iteratively, in order or randomly, updating the components step by step, and then we repeat. In the end we get stationary, ergodic samples from our target distribution. That's the Gibbs sampler.

4 Sampling Methods - Auxiliary Variables Sampling

Hi everyone, in this video I want to talk about auxiliary variable sampling. This is a procedure used in many other sampling algorithms, like slice sampling and Hamiltonian Monte Carlo, and the idea is very simple. We want to sample from a target distribution, and the idea is to see the target distribution as the marginal of a higher-dimensional distribution, where the higher-dimensional distribution is much easier to sample from. If this is the case, then we can sample from the target distribution just by throwing away all the auxiliary variables. So what you do is try to construct variables x and z such that you can sample from the joint distribution and such that its marginal is the target.

Then we sample from the joint distribution, maybe i.i.d., so we get pairs (x_1, z_1), ..., (x_N, z_N), and then we throw out all the z's. If these conditions hold, then x_1, ..., x_N are i.i.d. from the target. Of course, if you want to apply this, you have to find such a joint distribution: starting from the distribution you have, you must construct some joint distribution with these properties. You might ask yourself why a higher-dimensional distribution would ever be easier to sample from than its marginal. A very simple example is already the Gaussian: we

have seen that we have problems sampling from the one-dimensional Gaussian using the quantile function, since the quantile function is the inverse of an integral with no closed form. One can show that it is much easier to sample directly from the two-dimensional Gaussian and then take one component of it to get the one-dimensional Gaussian. This is called the Box-Muller transform. Let's start with the distribution of (x_1, x_2), which has the density (1/2π) exp(-½(x_1² + x_2²)); this is exactly the density of a two-dimensional Gaussian. What we now use are polar coordinates.

Maybe let's visualize this a little bit. The Gaussian is rotationally symmetric around the origin. Now we take a point with coordinates x_1 and x_2. We can also describe this point by saying what its radius and angle are: we write x_1 = r cos θ and x_2 = r sin θ. And every point

in the plane can be written like this; every point has a radius and an angle. Maybe the only special point is the origin, where r is zero and the angle doesn't matter. So this is the two-dimensional space and this is our distribution. Now let us see how this distribution looks in polar coordinates. When we change variables, we need to compute the Jacobian to adjust for the volume change. So let us compute the Jacobian first: it consists of the partial derivatives of x_1 and x_2 with respect to r and with respect to θ.

Now we can compute it: the derivative of x_1 with respect to r is cos θ, of x_2 with respect to r is sin θ; the derivative of x_1 with respect to θ is -r sin θ, and of x_2 with respect to θ is r cos θ. This is already the Jacobian. What is its determinant? It is this times this minus this times this, and since cos²θ + sin²θ = 1, what we get out

is just r. So I'll write it down again: the density was (1/2π) exp(-½(x_1² + x_2²)), and the determinant of the Jacobian is just r. The distribution of r and θ is then the density evaluated at (r cos θ, r sin θ) times the determinant of the Jacobian, so

what do we get? Maybe we also write down the ranges in which they live: r lives between zero and infinity and θ lives between zero and 2π. The prefactor is 1/(2π), and since (r cos θ)² + (r sin θ)² = r², the exponential is just exp(-½ r²), and the Jacobian factor is r. So let us write this as r times

exp(-½ r²) times the indicator of r in [0, ∞), and then times 1/(2π) times the indicator of θ in [0, 2π]. As you can see, this factorizes into q(r) times q(θ), where q(r) is a probability distribution with the first density and q(θ) is a probability distribution with the second density, which is just the uniform distribution on the interval from zero to 2π. So how do we sample from those?

Look at q(θ): this is the uniform distribution on [0, 2π]. If u_1 is uniform on [0, 1], then 2π u_1 is uniform on [0, 2π]. So θ defined as 2π u_1 is a sample from this distribution. Now let us look at the other distribution, q(r). This was r exp(-½ r²) on [0, ∞).

Now let us look at its cumulative distribution function, call it F(r). This is the integral from minus infinity to r of q(r̃) dr̃; actually we can start directly from zero because of the indicator, so it is the integral from 0 to r of r̃ exp(-½ r̃²) dr̃. And what is special here? We have the extra factor r̃, which is exactly what we did not have in the one-dimensional normal distribution. But in the two-dimensional version, once

we take the radius, we have it, so we can easily write down the antiderivative. The antiderivative is -exp(-½ r̃²), evaluated from 0 to r. If we do this, plugging in zero gives one, so we get F(r) = 1 - exp(-½ r²). What is the quantile function F⁻¹(u)? What do we have to do?

We have to set F(r) = u and solve for r. Bringing things to the other side means that exp(-½ r²) equals 1 - u, and this means r equals the square root of -2 times the natural log of (1 - u). So this is the quantile function of r: F⁻¹(u) = sqrt(-2 ln(1 - u)).

So we have an explicit form for the quantile function, and we can sample with it. Let's do this. If u_2 is uniformly distributed on [0, 1], then we know that 1 - u_2 is also uniformly distributed. So if I take r = sqrt(-2 ln u_2), this is distributed like q(r), by the inverse transform sampling method. On the other hand we already had u_1,

and θ defined as 2π u_1 was distributed like the uniform distribution q(θ). So we are able to sample both q(r) and q(θ), and since u_1 and u_2 are independent, (r, θ) is a sample from q(r) q(θ). We know that x_1 was r cos θ and x_2 was r sin θ, so this (x_1, x_2) is distributed like the two-dimensional standard normal with mean zero.

So we have found a formula to sample directly from the two-dimensional Gaussian, and then of course we know that x_1 is a sample from the one-dimensional standard normal.

Let us write this down explicitly: sample u_1 and u_2 uniformly on [0, 1], i.i.d.; then set

x_1 = sqrt(-2 ln u_2) · cos(2π u_1)

and x_2 = sqrt(-2 ln u_2) · sin(2π u_1). Then x_1 and x_2 are sampled from N(0, 1) and are independent. So this is the Box-Muller transform, which shows

that sampling a Gaussian in two dimensions is easier than sampling it directly in one dimension; by projecting down to the individual dimensions you get two independent normally distributed random variables. This serves as an example of the auxiliary variable sampling method.
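A minimal implementation of the Box-Muller transform just derived (added as an illustration, vectorized with NumPy):

```python
import numpy as np

def box_muller(n, rng=None):
    """Draw n pairs of independent standard normal samples via Box-Muller."""
    rng = np.random.default_rng(rng)
    u1 = rng.uniform(size=n)            # angle variable
    u2 = rng.uniform(size=n)            # radius variable
    r = np.sqrt(-2.0 * np.log(u2))      # radius ~ q(r) via inverse transform
    theta = 2.0 * np.pi * u1            # angle ~ Uniform[0, 2*pi)
    x1 = r * np.cos(theta)              # each marginal is N(0, 1),
    x2 = r * np.sin(theta)              # and x1, x2 are independent
    return x1, x2

x1, x2 = box_muller(10_000)
```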

5 Sampling Methods - Markov Chain Monte Carlo - Slice Sampling

Hi everyone, in this video we'll talk about slice sampling. The general idea is as follows. Let us draw a probability density function, with the y-axis and the x-axis, and let's say we have a density like this, or this could be our unnormalized part p̃. Now assume we have an initial point given by some proposal; say we have a point here, this is our x_{n-1}. Then what we do is sample uniformly under the curve at that point.

So this height is p̃(x_{n-1}), and we sample uniformly from the interval between zero and this value. Then we get a height; this is now our y_n. And then we sample from the slice: we draw the horizontal slice at height y_n, cutting the curve here and here, and now we sample uniformly from the x points on this slice,

maybe here. This is then our next point, x_n. Then we again sample a height uniformly on the vertical segment under the curve at x_n, from zero up to p̃(x_n), and again we repeat the process: sampling a height uniformly and then slicing. And maybe the next point lands over here. Now,

just by coincidence, the following point could be here, this would be x_{n+1}, and now the slice may consist of several pieces, and we sample uniformly from their union, and so on. The idea is that doing this, in the limit, samples uniformly from the area under the curve. And sampling uniformly from the area under the curve is the same as sampling from the probability distribution we are looking at, or its unnormalized part. The reason is that the area under the

curve integrates to one (for the normalized density) and the height corresponds to the probability of falling near that x. If we sample uniformly from this area, we get points with y-coordinate y_n and x-coordinate x_n, produced by alternately picking a height and then sampling from the slice, meaning from all the x's where the curve lies above this height. This makes sure that we do not just end up at the high places once, but in proportion to the area under the curve. So in a low region we

only occasionally get a point, but in the high regions we end up much more often, so we sample much more frequently at those places. This is how we get the right frequency of x_n at each place, in the limit. This is the idea of slice sampling. Now let's look at the algorithm. Again we have our target distribution, and we can only evaluate the unnormalized part p̃; we initialize the point from any proposal distribution, or pick a good point close to the mode; then we uniformly sample the height y_n between zero and p̃ of the current point, and then we sample from the slice, which means we look at all the points whose unnormalized probability value is bigger or equal to this y_n.

These are all the x's such that the curve is above this value, and we can write this set as the preimage of the interval [y_n, ∞) under p̃. From this set we sample uniformly. Then, after some burn-in time, the claim is that if I throw away all these y's, the x_n will be not i.i.d., but they will be at least stationary, ergodic and strongly mixing, like with all the other Markov chain Monte Carlo methods. The next question is why this works. I have to say, slice sampling is kind of a mixture of auxiliary variable sampling and Gibbs sampling.

But what is our auxiliary distribution q(x, y)? We define it as 1/Z_p̃ times an indicator with two cases: it is one if y lies between zero and p̃(x), and zero otherwise. This means that under the curve every point gets probability one, and a point whose y-value lies outside, above the curve, gets zero probability; and then we normalize with Z_p̃, the same normalizing constant as for p̃(x). Now let's look at what happens

if we look at the marginal q(x). This is the integral of q(x, y) over y, which is 1/Z_p̃ times the integral over y of the indicator that y lies in the interval from zero to p̃(x). And what is this integral? It is just the length of that interval, so I can take that value directly:

this is 1/Z_p̃ times p̃(x), which is our target p(x). So this really falls under the auxiliary variable method: if we can sample from this joint distribution and then throw out the y's, we are really sampling from the target distribution. And now what we do is Gibbs sampling on q(x, y), meaning we iteratively sample from the conditional distributions. So we sample y_n from

the conditional q(y | x_{n-1}). What is this conditional distribution? It is the joint, which is 1/Z_p̃ times the indicator that y lies in [0, p̃(x_{n-1})], divided by the marginal, which is 1/Z_p̃ times p̃(x_{n-1}). As you can see, the normalizing constant cancels and we have

1/p̃(x_{n-1}) times the indicator in y, which is nothing else but the uniform distribution on [0, p̃(x_{n-1})]. Then we sample x_n from the other conditional, q(x | y_n). What is this? We can argue similarly: it is proportional to the same indicator function, but now y is fixed, so it is an indicator of the set of x's such

that p̃(x) is bigger or equal to y_n. It is the same indicator as before, but before we were looking at y and now we are looking at x with y_n fixed, at all those x's for which this holds. This is the indicator of the preimage of [y_n, ∞) under p̃, viewed as a function of x. So this conditional is the uniform distribution over that set.

And this is exactly what slice sampling does: it samples uniformly from the first conditional and then uniformly from the second one. We know that Gibbs sampling converges, so we know what the limit is, and we know that all the Markov chain properties hold, so we get samples that are not i.i.d. but are stationary and ergodic. So this shows that the auxiliary variable plus Gibbs sampling procedure is exactly slice sampling. A few remarks. The question now is how we can actually compute the second step, how we can compute this set. What one does is the following.

One knows that p̃(x_{n-1}) is bigger than this y_n, since y_n was sampled below it. So one starts with x_{n-1} and builds an interval around it until one hits the walls, until the curve falls below the slice; this is called stepping out: we extend the interval to the left and to the right of the previous point. Points close to x_{n-1} have p̃ values bigger or equal to y_n, so I keep going to the left and to the right, always incrementing by some step width, until the values actually fall below the slice. Then I sample uniformly from this interval, and of course it could be that

my interval overshoots the slice a little bit to the right and to the left, from some x_min to some x_max. But I can just sample from this interval, and if the sampled value falls below the slice, I reject and resample until I get something where the value is actually bigger. As you can see, strictly we would also be required to sample from the other pieces of the slice if the distribution is multimodal, but often one does not care about this part. So one starts with x_{n-1}, samples the height y_n, then samples the next x_n from this local interval, and

basically stays around this mode instead of going to the other one. But one can still reach the other mode: if one samples a point low enough, the slice connects both modes and one can jump over. So one is not trapped in a mode, but if one is close to a mode, one stays near that mode for a while. This is what is meant when we say the slice can be disconnected. Then the question is how to generalize this to higher dimensions. One way is to use one-dimensional slice sampling inside Gibbs sampling, where we have to sample from the one-dimensional conditional distributions; the question there was how one can do this, and we can do it with slice sampling. In this combination one can generalize slice sampling to higher dimensions, and it means one needs one of these auxiliary variables for each

dimension, or rather, each time I sample a component I introduce the auxiliary variable at that moment. So auxiliary variable sampling plus Gibbs sampling gives us slice sampling, also in higher dimensions. This was slice sampling.
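Below is a minimal one-dimensional slice sampler with the stepping-out procedure described above, rejecting and resampling on the stepped-out interval as in the lecture (added illustration; `p_tilde`, the step width `w`, and the starting point are placeholder choices).

```python
import numpy as np

def p_tilde(x):
    """Unnormalized target density; a toy bimodal example."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.6 * np.exp(-0.5 * (x + 2.0) ** 2)

def slice_sampler(p_tilde, x0, n_samples, w=1.0, rng=None):
    rng = np.random.default_rng(rng)
    x = x0
    samples = []
    for _ in range(n_samples):
        y = rng.uniform(0.0, p_tilde(x))      # height: uniform under the curve at x
        # Stepping out: extend an interval around x until both ends fall below the slice.
        left = x - w * rng.uniform()          # randomly position the initial interval
        right = left + w
        while p_tilde(left) >= y:
            left -= w
        while p_tilde(right) >= y:
            right += w
        # Sample uniformly on the interval; reject and resample if below the slice.
        while True:
            x_new = rng.uniform(left, right)
            if p_tilde(x_new) >= y:           # on the slice: accept as the next point
                x = x_new
                break
        samples.append(x)
    return np.array(samples)

samples = slice_sampler(p_tilde, x0=0.0, n_samples=5000)
```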

6 Sampling Methods - Markov Chain Monte Carlo - Hamiltonian Monte Carlo

Hi everyone, in this video I want to introduce you to Hamiltonian Monte Carlo, also called hybrid Monte Carlo, maybe one of the fastest sampling algorithms out there so far. Let's recap Metropolis-Hastings. In the Metropolis-Hastings algorithm we wanted to sample from a target distribution, and we did this with a proposal transition kernel, which needs to be positive; often a normal distribution or something of that kind is used. Then we sample from this proposal and accept with the acceptance probability; if not, we just copy the old state. The problem with Metropolis-Hastings is that it is sensitive to the choice of

this transition kernel: as you can see, it basically samples randomly, and one could say that Metropolis-Hastings behaves like a random walk. The acceptance probability is actually not so high, sometimes in the region of only about 20%. So if we want to do better, we have to understand what is going on, especially in higher-dimensional spaces, which are often the important case. Consider for instance a uniform distribution on a disk, and assume this is two-dimensional. Here is the middle, and

we have the outer radius R and a smaller radius r, and let's assume the small r is about 99% of the other one, r = 0.99 R. Now we want to know, in two dimensions, for a uniform distribution on this area, what the probability is that we land in the smaller disk. And this is just the ratio of the volumes,

so we take the volume of the small ball divided by the volume of the bigger one, and you will see this is about 98%, namely 0.99 squared. So in 2D, most of the probability mass is in the inner disk. If we now go to 500 dimensions and do the same, we compute the volume of the ball of radius 0.99 R divided by the volume of the ball of radius R in 500 dimensions,

and this will be smaller than 1%. This is of course due to the fact that 0.99 to the power of 500 is smaller than 1%: even though 0.99 is close to one, if I exponentiate it, it quickly goes down. So in higher dimensions the inner ball gets less than 1% of the mass. So where is the mass? The mass is out here, in the thin shell near the boundary.
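As a quick check of the numbers just quoted (added display):

$$
\frac{\operatorname{vol}(B_{0.99R})}{\operatorname{vol}(B_R)} = 0.99^{\,D} \approx
\begin{cases} 0.98 & D = 2,\\ 0.0066 & D = 500,\end{cases}
$$

so in 500 dimensions less than 1% of the mass of the uniform ball lies in the inner ball with 99% of the radius.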

Here we considered a uniform distribution, but if you take a Gaussian it is very, very similar. The important point is that the Gaussian mode sits in the middle, but the region around the mode also carries roughly 0% of the mass, less than 1%; most of the mass is really in a spherical shell at some intermediate radius. Now think about sampling: say we are already at a good point and we ask where we want to go next. Usually, if we have gradient information, we would follow the gradient, so we would go towards the mode, the highest point.

But if you draw the Gaussian relative to the dimension, then the region around the mode is so small you can almost not see it, yet the gradient points in that direction. So the gradient would lead us into a region where almost no probability mass is. In high dimensions the gradient can be very deceptive. What we want instead is to explore: for sampling we want representative samples, and the mode is not representative anymore. So what we want is to move around this sphere and find samples that are representative. Instead of going towards the mode, we want to go

along the shell, and this means we want to move perpendicular to the gradient, the gradient of log p̃, so that we stay close to this set. This is counterintuitive at first, but if you think about it like this, it makes sense to stay in this typical region of medium probability. So instead of optimizing we want to explore, and while doing this we want to make sure, given that we already have a good point, that the next point has similar probability mass: in a small volume, if I integrate the probability density, I

want to get a similarly high value. This means I want a similar probability density and I want to explore a similar volume element. So if I make a move, which is a transformation from this point to the next point, I want the volume to be preserved and the probability density to be preserved, such that the amount of probability mass I explore stays similar. Let us summarize what we want. First, we want to avoid random walk behavior. We want to increase the acceptance rate compared to Metropolis-Hastings, something better. Then we want to explore the typical probability regions rather than optimizing towards the modes; the modes might be deceptive in

high dimensions. As I said, instead of following the gradient we rather follow the contours; we want to move perpendicular to the gradient. We also want to preserve at least the volume and the probability density, such that we roughly explore the same amount of space with hopefully similar quality in terms of probability. And of course we still want convergence to our target distribution, so we need to make sure that we get an ergodic chain that converges to the target. Most of this is rather inconvenient to achieve in the original space, so what we do is introduce auxiliary variables.

With these auxiliary variables we get a higher-dimensional space in which all of this is much easier. This kind of setting has been studied extensively in physics and mathematics, under the terms symplectic geometry or cotangent space; we will not introduce that formalism here, but rather make things clear in direct terms. So the setting of Hamiltonian Monte Carlo is as follows. Again we want, as always, to sample from a target distribution where we can evaluate the unnormalized part, but not just that: this time we also want to use gradient information to tell us where the contours are, so we also need to be able to evaluate the gradient of this. Then, and this is just terminology from physics, we take the negative log of this

unnormalized distribution and call it the potential energy V(x). Then we augment the variable x, each component, with its own momentum variable, so we introduce the same number of dimensions again: we go from R^D to R^{2D} with these momentum variables y. Then we have to give them a distribution, and this distribution can for the moment be arbitrary; often one uses a normal distribution whose covariance may depend

on the position x; that is the fully geometric setting, if you want. But for simplicity, in the sampling procedure we usually just take the identity matrix, so this is a standard normal distribution which, even though we write it conditioned on x, does not actually depend on x. In the general setting we only have to make sure we can evaluate this distribution, evaluate its gradient, and sample from it, and the normal distribution allows for all of this. Then we define the kinetic energy: this is just minus the log of this momentum distribution, so it might depend on where we are if there is an x-dependence, but in the simple case here it is just half the squared L2 norm of y. Then we

add these energies up, and this is called the total energy or the Hamiltonian, from which this procedure gets its name. Then we know that the joint distribution, including the momentum variables, is proportional to the exponential of minus this total energy. This means that if we want to preserve probability density, we can equivalently just preserve the energy; note that high probability means low energy. And of course we have constructed this in such a way that the marginal of this joint distribution is our target: if we can sample efficiently from the joint distribution, we can also sample from the marginal just by throwing away all the momentum variables.
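Collecting the definitions just introduced in one place (added display; simple case with identity covariance for the momentum):

$$
V(x) = -\log \tilde p(x), \qquad K(y) = \tfrac{1}{2}\lVert y\rVert^2, \qquad H(x,y) = V(x) + K(y),
$$
$$
p(x,y) \;\propto\; \exp\big(-H(x,y)\big) = \tilde p(x)\, e^{-\frac{1}{2}\lVert y\rVert^2},
\qquad \int p(x,y)\,\mathrm{d}y \;\propto\; \tilde p(x).
$$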

So now let us think about how we could preserve the energy, because preserving the energy means we are running along the contours. For this, let us look at the energy. Say we are at the point (x, y), and maybe draw a little picture: we are here and we want to take the next step over there. How can we read off from the energy what the next step could be? What we do is a Taylor approximation. The Taylor approximation tells us that, to first order, the energy at the new point is the energy at the old point, plus the gradient of H with respect to x, evaluated

at (x, y), transposed, times the small step I take in the x direction, plus the gradient of H with respect to y at (x, y), transposed, times the small step in the y direction. So to first approximation, this first-order term is exactly what I need to make zero so that the new energy equals the old one. Preserving the energy, meaning running along the contours, means in first approximation that zero equals the gradient of H with respect to x times the small movement in the x direction, plus the gradient of H with respect to y times

the small movement in the y direction. So how can I choose the steps? The gradients are given, they can be evaluated; I want to choose the two steps such that this sum is zero. If you wanted to solve for it in one dimension, you would bring one term to the other side and see that one step must be proportional to the other gradient with a flipped sign; in higher dimensions you can use the same idea, except that you are not dividing, you are just making sure that the equation holds. In one dimension the step in y would be roughly

something like minus the gradient of H with respect to x, rescaled appropriately, and if you plug this in you see that the two terms cancel. So the idea is now to choose steps that satisfy this

automatically: the step in x equal to a small amount ε times the gradient of H with respect to y, and the step in y equal to minus the same small amount ε times the gradient of H with respect to x. As you can see, if I plug this in, I get the scalar product of the x-gradient with ε times the y-gradient, plus the scalar product of the y-gradient with minus ε times the x-gradient, and the two cancel, so the condition is satisfied. This is exactly what we wanted. Let us visualize this a bit.

Here is the mode, so the gradient would point towards the extremum; this axis is the x direction and this one the y direction, these are the contours of H, and this is the gradient of H, which we can project onto its x and y components.

So the gradient has an x component and a y component, and in our update the x component of the step is, up to the scalar ε, the y component of the gradient, while the y component of the step is the x component of the gradient with a flipped sign. What this means is that we are now moving along the contour; this is exactly what symplectic geometry is about: rotating the gradient by 90 degrees

so that the step runs parallel to the contours (the gradient itself is perpendicular to them), and with this minus sign on the y update we achieve exactly that. This is symplectic geometry, and this is the intuition behind it. So now we know what an update looks like: x is updated to x plus ε times the gradient of H with respect to y, and y to y minus ε times the gradient of H with respect to x. This is one step of our vanilla symplectic integrator. Of course, because of the finite ε, this is a numerical approximation, as we have seen: we took a first-order Taylor approximation, and only in the limit of smaller and smaller steps, as a continuous flow, would it be exact. But since we need to approximate anyway, this is one step. Now I claim that this update is also automatically volume preserving, up to a small approximation error.

The approximation error also goes away if you take ε smaller and smaller. To see this, we look at the Jacobian of the update map as a function of ε: we take the derivatives of the new x with respect to the old x and the old y, and of the new y with respect to the old x and the old y, and

we are interested in the determinant. So what is this? The derivative of the new x with respect to the old x is the identity plus ε times the second derivative of H with respect to x and y; with respect to y it is ε times the second derivative of H with respect to y and y; the derivative of the new y with respect to x is minus ε times the second derivative of H with respect to x and x; and with respect to y it is the identity minus

ε times the second derivative of H with respect to x and y. How does the determinant look? The identity contributes one, the diagonal ε-terms cancel at first order, and the off-diagonal terms only enter as products of ε with ε, so ε appears only in second order. Say it were one-dimensional: then you would multiply the off-diagonal entries, giving ε times ε. In

the one-dimensional case you get (1 + εa)(1 - εa) plus ε² times a product of second derivatives, so the determinant is roughly one plus something of order ε². If ε is very small, the volume change is one, so we see that this transformation is approximately volume preserving. That is what we wanted: probability preserving and volume preserving, up to numerical precision, let's say. It turns out there is a better symplectic integrator.

It is called an integrator because one could have formulated all of this in terms of differential equations, and an integrator numerically approximates their solution; but we found our update directly from the idea of preserving the Hamiltonian, the energy, and hence also the probability, with just a first-order Taylor approximation, and then we showed that it also roughly preserves the volume. Now I can motivate that there is actually a better symplectic integrator, called the leapfrog integrator. As you can see, it is very similar to what we just derived: the x update is similar to before, we take x and add the gradient with respect to y, and we also have two updates for y,

but with half steps, involving very similar quantities with a minus sign in both cases. So it basically does a half update of y, a full update of x, and another half update of y. And I think you believe me if I tell you that this is also approximately energy preserving and volume preserving, by the same Taylor argument: if these quantities are all roughly the same, then adding the two half steps gives approximately the same ε update as before. But it turns out this works better, it preserves the volume better, and it has a few good properties we will see later. This ε is a hyperparameter, something like a learning

rate that you have to set. We can then take L of these steps: we call the function which maps (x, y) to the new pair after one leapfrog step, give it an index, and compose L of them. So, starting here, I make one leapfrog step, another leapfrog step, another one, and so on, always staying close to the contour up to some numerical error, and after L steps I end up at a new pair (x̃, ỹ).
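A minimal sketch of one leapfrog step and its L-fold composition, assuming the simple kinetic energy K(y) = ½‖y‖² (added illustration; `grad_V` is whatever computes the gradient of V for the chosen target).

```python
import numpy as np

def leapfrog(x, y, grad_V, eps, L):
    """L leapfrog steps for H(x, y) = V(x) + 0.5 * ||y||^2."""
    x, y = np.copy(x), np.copy(y)
    for _ in range(L):
        y = y - 0.5 * eps * grad_V(x)   # half step in the momentum
        x = x + eps * y                 # full step in the position (grad of K is y)
        y = y - 0.5 * eps * grad_V(x)   # second half step in the momentum
    return x, y

# Example: standard normal target, V(x) = 0.5 * ||x||^2, so grad_V(x) = x.
x_new, y_new = leapfrog(np.zeros(2), np.array([1.0, 0.5]),
                        grad_V=lambda x: x, eps=0.1, L=20)
```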

One reason the leapfrog is better is that, if I use a Hamiltonian where the kinetic energy has no x-dependence, the leapfrog map is exactly invertible. The reason is that the gradient used in the momentum half steps only involves V, which does not depend on y, and the gradient used in the position step only involves K, which does not depend on x; so each sub-step can be undone on its own, meaning

if I end up with x̃ and ỹ, I can solve the last half step for the intermediate momentum, then solve the position step for the previous x, and then solve the first half step for the previous y, each time without needing anything from further back. So, with this factorized distribution, meaning this split Hamiltonian, the leapfrog map is exactly invertible, not just in the limit of small ε but for any ε, even as a numerical approximation. And we will see it has one more useful property when the kinetic energy is an even function. So what we want to do now is

start in our probability density, in our phase space where we have x and y, and use this leapfrog map after L steps as the proposal for a Metropolis-Hastings algorithm: I start here, do L steps, end up there, that is my proposal, and Metropolis-Hastings then computes the acceptance probability and says yes or no. If it is yes, this is my next point; if it is no, I fall back here and would maybe compute another trajectory, perhaps in the other direction. So this is what we want to do. One issue is that the leapfrog map is a deterministic function, there is no randomness in it, only numerical error.

So the result might not be exactly on the contour, it might be slightly off. It is deterministic and has numerical error; let's see what happens when we analyze deterministic transition proposals in general. Let's be a bit more general: say we have a deterministic function T(x), similar to the leapfrog, which takes an x and gives me another x, and p̃ is again our unnormalized probability distribution. We want to use T as the proposal for Metropolis-Hastings, so our proposal transition kernel, in terms of probability distributions, is a delta distribution

centered at T(x): z is sampled from it only if z equals T(x). So it is zero unless z equals T(x), or, depending on how you interpret it, infinity where z equals T(x), but if you integrate over it, it only has mass where z = T(x). In the Metropolis-Hastings algorithm we have to compute the acceptance probability, and this kernel goes into the numerator and the denominator of the fraction, together with the unnormalized distributions. So here they go in: here is the old state, and then we apply this deterministic function T

and check whether z equals this function value, yes or no. The unnormalized densities are just values, that is fine; the problem is with the delta functions. What we are doing is going from x to z via T, and T(z) might now go to some other point: if that happens, if z is not mapped back to x, then the reverse delta function is zero, which means we would never accept. So what we require is that applying T to T(x) is

guaranteed to give back x again. We need a function that satisfies such an equation: applying it twice gives the identity, and such a map is called an involution. It means that T inverse is the same as T; in particular, T is invertible. So what we need to make sure is that our leapfrog algorithm becomes an involution. We want to use the leapfrog because it is already nice, it preserves

the volume and so on, and we compose it with one more operation: after we run the L steps (not at each step, only after the L-th step), we just flip the sign of the momentum. My claim is that this composition is an involution under two conditions: the kinetic energy does not depend on x, and it is an even function of y. A very simple example with these two properties is the standard normal distribution. So I take the leapfrog and then, writing the state as a vector (x, y), I leave x unchanged and multiply y by minus one; that is what this matrix says, I just flip the sign of y. The claim is that this is an involution, and let us check this by looking at what the leapfrog does, one step at a time.

Take one step first, applied to (x, y). In the leapfrog algorithm we said y' is defined as y minus ε/2 times the gradient of H with respect to x, but we write directly the gradient of V because K does not depend on x. Then we had x̃ equal to x plus ε times the gradient of K with respect to y, evaluated at y'. And then we had ỹ, which was y' minus ε/2 times the gradient of V, evaluated at

x̃. And what do we return? We return (x̃, ỹ). We can do this L times, and then we flip the sign of y, so what we get out is (x̃, -ỹ). Now we apply the leapfrog again, with the same ε, and plug in x̃ and -ỹ. Then the first line, let me put an underscore on these new quantities, is now:

y' with underscore equals -ỹ minus ε/2 times the gradient of V at x̃. Now look at the earlier line for ỹ: bringing terms to the other side, I see that this is exactly minus y' from before. Next we compute x̃ with underscore, which is the argument x̃ plus ε times the gradient of K evaluated at y' with under-

score, which is minus y'. Since K is an even function, its gradient is odd, so the minus sign comes out: this is minus the gradient of K at y'. But now we can look at the second equation from before; bringing the term to the other side, we see that x̃ minus ε times the gradient of K at y' is exactly x. So x̃ with underscore equals x. The next step is ỹ with underscore,

which is y' with underscore, so minus y', minus ε/2 times the gradient of V at x̃ with underscore, but we know that this is already x. So this is minus y' minus ε/2 times the gradient of V at x. Now I look at the very first equation, and if I bring terms to the other side, what I get out here is exactly minus y. So what this returns is

(x, -y). Now I can do this L times; as you see, we always already plugged in the minus sign, so running the L leapfrog steps backwards and then flipping the sign of y again, what we get out is (x, y). You see: we started with (x, y), ran the algorithm, got the result, plugged it in again, and came back to (x, y). This shows that the leapfrog algorithm with this additional sign flip at the end is an involution. This means we now have our proposal transition kernel for Hamiltonian Monte Carlo.

So what we use now is just this delta function, because everything is deterministic: I use L leapfrog steps plus a sign flip, and the result is the proposed new state given the old state. In the acceptance probability we have to check whether the reverse delta function matches, and it does precisely because of the involution property; since the map is an involution, all these delta functions drop out, and the only thing that remains are the unnormalized probabilities. And this, as you can see, is exp(-H(x̃, ỹ)) divided by exp(-H(x, y)), and then,

because it is an exponential, the ratio becomes the exponential of the energy difference, with the old energy entering with a plus sign. So we have an acceptance probability like this. Now you might argue: we did all this work to make sure that our transformation from (x, y) to (x̃, ỹ) preserves the energy, so this difference should be zero and we should always accept. But as we already know, the leapfrog algorithm is just an approximation, and this step corrects for that: if the energies are not exactly equal, a rejection can occur. The involution property, on the other hand, was exact and not only in the limit. So this is really the exact acceptance probability for our deterministic proposal in a Metropolis-Hastings algorithm, and the leapfrog trajectory is just a proposal.
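Written out, the acceptance probability just described is (added display):

$$
a\big((\tilde x,\tilde y)\mid(x,y)\big) = \min\Big(1,\ \exp\big(H(x,y) - H(\tilde x,\tilde y)\big)\Big),
$$

which equals one whenever the leapfrog trajectory conserves the energy exactly and only rejects to correct the numerical integration error.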

So we propose this deterministically computed point and then accept it with this probability; if it is accepted, this point is the output, and if not, the old point is copied. All of this together, the deterministic proposal plus the additional acceptance/rejection step, gives us the transition kernel, and it only behaves probabilistically where the leapfrog algorithm makes bad approximations. So we have achieved what we wanted: H is roughly preserved when going from (x, y) to the proposed point with these leapfrog symplectic integrator steps.

The next problem is that, even though this is all invertible and so on, for our Markov chain to converge we need ergodicity; we have usually ignored this part, but we cannot ignore it here. Ergodicity means that we are not staying too long in one place, that we are exploring the space; but if we always preserve H, then we do not explore, we stay on one energy level. What we need is something that changes the energy level, otherwise we are not ergodic and our chain does not converge. To achieve this, at each step, before we run all these leapfrog steps, we randomly sample a new y, and then we have a chance that the energy

level changes. If we do this properly, we reach all energy levels and explore the whole space, and on each energy level we can then explore deterministically. What we are basically doing is sampling an energy level randomly, and then on this energy level making transitions with the leapfrog algorithm deterministically; so we have a separation into two transitions, one random in y and the other one deterministic in x and y. How do we justify this? First a small lemma. The lemma says: if I have two transition kernels with the same invariant distribution π, then their composition also has π as its invariant distribution. The composition means

that I multiply the corresponding matrices, if you want, in a continuous way: I marginalize out the intermediate variable. If this were discrete, then this would just be matrix multiplication, and in the continuous case we replace the sum with an integral. So this is the corresponding matrix multiplication of Markov kernels. The proof is very simple: it just uses that if T1 applied to pi gives pi and T2 applied to pi also gives pi, then by this chain of equalities the composition also has pi as its invariant distribution, and this is what we use. So we have T2 already determined: this was our deterministic symplectic integration step, which on its own is not ergodic.
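Written out, the composition and the invariance argument of the lemma read roughly as follows (a sketch in the notation used here, with T1, T2 the two kernels and pi their common invariant distribution):

```latex
(T_2 \circ T_1)(x'' \mid x) \;=\; \int T_2(x'' \mid x')\, T_1(x' \mid x)\, dx',
\qquad
\int (T_2 \circ T_1)(x'' \mid x)\,\pi(x)\,dx
  \;=\; \int T_2(x'' \mid x') \underbrace{\int T_1(x' \mid x)\,\pi(x)\,dx}_{=\,\pi(x')}\, dx'
  \;=\; \pi(x'').
```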

And now what we do is an additional Gibbs step: we resample from p(y | x), which in the simple case is a normal distribution. This means our transition probability T1(x tilde, y tilde | x, y) is given as follows: x tilde is the same, x is just copied, and y is sampled from this conditional distribution that we introduced for the momentum, which in the easiest case is just the normal distribution. And here we have already seen that it has the joint distribution as its invariant distribution. And you can

see this here again: we multiply this transition by the distribution we want to show is invariant, and then we integrate over y. As you can see, y drops out because nothing else depends on it, and then we are left with the marginal here, and because of this, with the tildes, we get back the joint. So the joint is invariant with respect to this transition kernel. And what we now do is: since this step explores the momentum, and the momentum is part of the Hamiltonian, the whole Hamiltonian value can change just by resampling this at each step. So as a transition kernel we use neither T2 alone nor T1 alone; we use them composed, first T1 and then T2. And this is

ergodic, so we have saved ergodicity; you can think of the momentum resampling as one Gibbs step. And finally, putting all this together, we get Hamiltonian Monte Carlo. So what we want to do is again to sample from a target distribution which is of this form; we write it directly in terms of this potential V, and we are allowed to evaluate this V(x), value in, value out, and its gradient, which is just a d-dimensional vector, not something huge. So we can evaluate this gradient, then we have to choose the step size and the path length; these are the hyperparameters of this algorithm. We initialize at any x0, we could pick any point or sample from a proposal distribution, and then for all

steps n, what we do is we sample from our p(y), which in the simplest case we take to be a standard Gaussian; these are the augmented, auxiliary variables. And then we do leapfrog steps, L of them: for these steps we take x from the previous point, plug it in as the starting point, and then do the L leapfrog steps. Maybe I draw this again: this is (x, y), we start here, say at x_{n-1}.

Then we sample y; this is y_{n-1}. And then we do the leapfrog steps, and what we get out is x tilde n and y tilde n, or maybe call it x_{n-1} prime and y_{n-1} prime. So here we basically sample the momentum at this point, then we do the leapfrog steps, go over here, and then we flip the sign, as I have written down here, and then we accept with the usual acceptance probability, where instead of writing the Hamiltonian I have written it out explicitly:

this V function which we introduced here, plus the kinetic energy with respect to this normal distribution, which is just the squared length of the momentum. We compute this ratio and check whether it is bigger or smaller than one. If it is bigger than one, we automatically accept. If it is smaller than one, then our integrator made a mistake, and we sample a uniformly distributed value and see whether the ratio stays above it or not. If it does, we accept this point; if not, we fall back to the previous point and keep it instead of the proposal. The reason we only apply the acceptance to the x part is that this acceptance probability is just part of the second transition operator; the first transition operator, the momentum resampling, already leaves the joint invariant by itself, so the acceptance belongs to the second one.
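Here is a minimal sketch of the whole procedure (illustrative, not the lecturer's code), assuming a target p(x) proportional to exp(-V(x)), standard Gaussian momenta, and hypothetical helpers V and grad_V; eps and L are the step size and path length hyperparameters.

```python
# Minimal Hamiltonian Monte Carlo sketch.
import numpy as np

def hmc(V, grad_V, x0, eps=0.1, L=20, n_samples=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = []
    for _ in range(n_samples):
        y = rng.standard_normal(x.shape)            # T1: resample momentum (Gibbs step)
        x_new, y_new = x.copy(), y.copy()
        y_new -= 0.5 * eps * grad_V(x_new)          # T2: L leapfrog steps ...
        for _ in range(L - 1):
            x_new += eps * y_new
            y_new -= eps * grad_V(x_new)
        x_new += eps * y_new
        y_new -= 0.5 * eps * grad_V(x_new)
        y_new = -y_new                              # ... plus the final sign flip
        # Metropolis-Hastings correction with H(x, y) = V(x) + 0.5*|y|^2
        dH = (V(x_new) + 0.5 * y_new @ y_new) - (V(x) + 0.5 * y @ y)
        if rng.uniform() < np.exp(-dH):             # accept with probability min(1, e^{-dH})
            x = x_new                               # the momentum is discarded either way
        samples.append(x.copy())
    return np.array(samples)

# Example usage: sample from a 2D standard Gaussian, V(x) = 0.5*|x|^2.
draws = hmc(lambda x: 0.5 * x @ x, lambda x: x, x0=np.zeros(2))
```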

Then, after some burn-in time, we throw away the y's (we could also keep them), and by the convergence theorem for Markov chains the chain is stationary and ergodic. The points might not be i.i.d., but usually they are strongly mixing, which means that samples far apart are approximately independent. So this is the Hamiltonian Monte Carlo sampling procedure. Now a few last remarks: we have these hyperparameters L and epsilon, and the question is how to choose them. What people found out is that you can actually automate this: instead of taking a fixed L, you make L adaptive.

This adaptive L stops when a specific criterion is met. So let us draw this again: we start at this point here and do all the leapfrog steps, here, here, here and so on. We know that these contour sets are compact, so the trajectory cannot go on infinitely; at some point it will turn back and make a U-turn. When it starts doing this U-turn, we can detect it by comparing the momentum with the displacement from the starting point, and once this goes beyond some angle, we just stop; this is called stopping at the U-turn.
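As a rough sketch of that check (a simplified heuristic, assuming we have the current position, the starting position and the current momentum from a leapfrog run; the full NUTS recursion is more involved than this):

```python
import numpy as np

def made_u_turn(x_start, x_current, y_current):
    """Heuristic U-turn check: stop once the momentum points back towards the start,
    i.e. once the dot product of the displacement with the current momentum is negative.
    (The actual no-U-turn sampler applies this test to both ends of a doubling tree.)"""
    return np.dot(x_current - x_start, y_current) < 0.0
```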

This is the no-U-turn criterion from the No-U-Turn Sampler (NUTS) paper: when the U-turn starts, you stop. This is very efficient because we no longer need to tune L as a hyperparameter; it is set automatically by this criterion, and it works very well. Then the question is how to choose the step size, and there are heuristics from convex optimization; dual averaging was proposed in this paper, but there is also other research on how to do this. And as a last remark: we can make the kinetic energy depend on x, giving a curved space, so we can do the no-U-turn sampler or Hamiltonian Monte Carlo on Riemannian manifolds if we wanted to. But maybe the best takeaway here is this:

this no-U-turn sampler, this Hamiltonian Monte Carlo, is maybe the fastest sampling algorithm we have for continuous variables where we can do function evaluations and gradient evaluations. So this was the video about Hamiltonian Monte Carlo.

7 Sampling Methods - Markov Chain Monte Carlo - Langevin Monte Carlo

Hi everyone, in this video I am going to give you a short overview of the Langevin Monte Carlo method. This can be approached in two ways: one way is via diffusion stochastic differential equations, discretizing them and using them for sampling; the other is via stochastic gradient descent, and I will use the latter. Let us look at gradient descent. We know that if we have a probability distribution of this type and we want to find a mode, not to sample from it, we can use gradient descent: start at any point and then at each step use an update rule that takes the current point and goes down the gradient, with a minus here because we want to minimize this potential function, which because of the minus in the exponent means maximizing the density,

you have seen this several times. Now we can ask: what if I do not want the mode, I just want to sample from this distribution? Is it possible to use the same update rule, just adding noise? In fact it is; this is called Langevin dynamics or the Langevin algorithm. Here we use the Metropolis-adjusted version. What we want to do is again to sample from such a distribution given by a potential function. The Metropolis-adjusted Langevin algorithm takes some step size, you can call it a learning rate, initializes at some point, and then we sample from a normal distribution and do this update here: a gradient descent step plus some noise,

so this is gradient descent with added noise. You can also see that the learning rate appears again under a square root in front of the noise. Then we use this as a proposal; since there is randomness in it, we use it as a proposal and accept with the typical Metropolis-Hastings accept-reject probability, and if it is rejected we just keep the old point. And if you plug the update in here, then the proposal is Gaussian distributed with this mean and this covariance matrix. That means the proposal transition kernel is just a normal distribution with these parameters.
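A minimal sketch of this Metropolis-adjusted Langevin algorithm (MALA), illustrative only; it assumes a target p(x) proportional to exp(-V(x)) with hypothetical helpers V and grad_V, and eps plays the role of the step size or learning rate.

```python
# Minimal MALA sketch: gradient step plus noise used as an asymmetric Gaussian proposal.
import numpy as np

def mala(V, grad_V, x0, eps=0.05, n_samples=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    # log density of the proposal q(xp | xc) = Normal(xc - eps*grad_V(xc), 2*eps*I), up to constants
    log_q = lambda xp, xc: -np.sum((xp - xc + eps * grad_V(xc)) ** 2) / (4 * eps)
    samples = []
    for _ in range(n_samples):
        noise = np.sqrt(2 * eps) * rng.standard_normal(x.shape)
        x_prop = x - eps * grad_V(x) + noise        # gradient descent step plus Gaussian noise
        # Metropolis-Hastings ratio with the asymmetric proposal density
        log_alpha = (-V(x_prop) + V(x)) + log_q(x, x_prop) - log_q(x_prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
        samples.append(x.copy())
    return np.array(samples)
```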

And here in the acceptance ratio we have the potential of x and the usual probabilities. The interesting part is that this looks like gradient descent plus noise, but if you look closer you can show that it is the same as Hamiltonian Monte Carlo where we only do one leapfrog step: after L leapfrog steps we usually sample a new momentum, and if L is one, we resample every round. So this is just Hamiltonian Monte Carlo with a single leapfrog step, giving us fresh noise at every step, or gradient descent with Gaussian noise, and because of this equivalence it converges to our target

distribution. Sometimes evaluating the acceptance step is very expensive. Another version, which does not directly follow from our Markov chain convergence theorem, is what is called the unadjusted Langevin algorithm. There we want to do the same thing, and again we have this gradient update step with Gaussian noise, but now with time-dependent step sizes, as you can see here, dependent on t. And if you scale these step sizes properly, then you do not need a Metropolis acceptance-rejection step and it will still converge to our target distribution. This was in fact used to do posterior sampling,

as an application. Let us say we want to sample from a posterior distribution, which is proportional to the prior times the likelihood. What we do is initialize, then repeatedly sample from a normal distribution and do this gradient update; this is stochastic gradient descent, where the main difference to the unadjusted Langevin dynamics is that here we take mini-batches, because evaluating these likelihoods on the full data set is very expensive. So we want to learn with mini-batches, and as you can see this is basically a normal stochastic gradient

descent update, usually converging to the maximum a posteriori estimator, but now we always add noise to our gradients. The gradient is already stochastic, but we add additional Gaussian noise. The interesting part is that this converges, even though we sample mini-batches, to posterior samples, if we choose the learning rates appropriately: they need to satisfy these conditions, for instance the typical one over t schedule, where t is the step counter, and then it converges to the posterior. And the good thing is we also do not need a Metropolis acceptance-rejection step, which would be very expensive.
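A minimal sketch of this stochastic gradient Langevin dynamics (SGLD) idea, illustrative only; the names grad_log_prior and grad_log_lik are hypothetical helpers, the full-data gradient is estimated from a mini-batch in the usual unbiased way, and the decaying step sizes follow a Robbins-Monro-type schedule.

```python
# Minimal SGLD sketch: mini-batch SGD on the log posterior plus injected Gaussian noise.
import numpy as np

def sgld(theta0, data, grad_log_prior, grad_log_lik, n_steps=10_000,
         batch_size=32, a=0.01, b=1.0, gamma=0.55, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    N = len(data)
    samples = []
    for t in range(1, n_steps + 1):
        eps_t = a * (b + t) ** (-gamma)                  # decaying step size: sum eps = inf, sum eps^2 < inf
        batch = data[rng.choice(N, size=batch_size, replace=False)]
        grad = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_log_lik(theta, x_i) for x_i in batch)   # unbiased estimate of the full-data gradient
        noise = np.sqrt(eps_t) * rng.standard_normal(theta.shape)
        theta = theta + 0.5 * eps_t * grad + noise       # gradient step plus additional Gaussian noise
        samples.append(theta.copy())
    return np.array(samples)                             # later iterates are approximate posterior samples
```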

Think about these x's as huge images, say ImageNet, where we have thousands of them: just for one acceptance step we would need to evaluate the likelihood of all of these images, which is not feasible, so it is an improvement that we do not need to do this if we choose the learning rates appropriately. Finally, one can go even further and look at underdamped Langevin equations, which are basically second-order diffusion equations, and discretize them, and it was shown that those converge with a better dependence on the dimension of our space. So this was all about Langevin dynamics and Langevin Monte Carlo.

13 Causality

0 Causality - Types of Correlation

Hi everyone, in this video I want to talk about causality and correlations. Let us look at this graphic. What you can see here is a plot where on the x-axis we have the chocolate consumption per person in a country, and on the y-axis we have the number of Nobel prizes that country won. As you can see, a linear regression line fits almost all these points very clearly. You can even test for the errors and how much the points deviate and so on, and what you get is a very, very small p-value. This shows that there is a very strong correlation between the chocolate consumption of a country and its number of Nobel prizes,

with Switzerland on top. So what does this mean for us? Does it mean that we shall eat more chocolate to increase our chances of winning the Nobel prize? Does correlation imply causation? Intuitively we would say no, that makes no sense, but why is that? So the next question is: where does this correlation come from then? If it is not causal, why is there a correlation, if chocolate consumption does not increase our ability to get Nobel prizes? And would this correlation still hold under different conditions or circumstances, and what would those circumstances or conditions be? These are typical questions when asking about causality. So let us explain what is happening here; I will come up with a bunch of different explanations, and they are all related somehow to causality.

First explanation: the number of Nobel prizes causes the chocolate consumption. How would we represent this? We draw a node here and a node there, draw an arrow between them, and write in here "number of Nobel prizes" and here "chocolate consumption"; that is how we represent that the number of Nobel prizes causes chocolate consumption to rise or fall. And we can actually come up with a possible story: every time a country wins a Nobel prize, the people tend to celebrate it with chocolate; every year they are so happy about the Nobel prize that they eat chocolate, and every person eats on average some amount of chocolate,

so more Nobel prizes means more chocolate consumption. This would be reflected in this graph, and it could make sense and explain our correlation. The data is the same, only the story is different; maybe I could come up with a better story, but it is a possible story which you cannot just rule out from the data. The second explanation is the other way around: the Nobel prizes are an effect of the chocolate consumption. So let us draw the nodes again; now we have it the other way around, here the Nobel prizes, here the chocolate consumption, and the cause-effect relation goes in the other direction, and we can also come up with such a story. Chocolate

contains brain-enhancing chemicals, so the country that eats more chocolate will have the best scientists. Then it is no surprise that Switzerland has the best scientists: they have a lot of chocolate consumption. And again the data cannot rule out this story, but we can come up with other explanations. Explanation three: actually, both could be true. The Nobel prizes and the chocolate consumption could live in some positive feedback loop, with the same explanations as before: if a country wins a Nobel prize, people tend to celebrate by eating chocolate, and if you eat chocolate, it enhances your brain activity, making you more

likely to get more Nobel prizes. So both could be true, and that would lead to a feedback loop; a feedback loop could be a reason for the correlation between these two variables. And again the data cannot rule this out. But I can come up with more explanations. Explanation four: I assume that this graph is not complete, that it is missing other points. There should be a point here, a point here, points here and here, maybe another one and another one, and they all correspond to other countries. The reason they were not displayed is that there is some selection bias in place which favors Western countries. So they did not

display the other countries that won Nobel prizes, because they just displayed Western countries. Maybe there are one or two or three exceptions, but most of these are European countries, so we can assume that someone was cherry-picking and just displayed mostly Western countries. If we looked at all the other countries that won a Nobel prize, which I do not know at the moment, but there might be others, the plot would actually look like this; then you would not get a line, and if you did an independence test, this and this would be independent. So there is no actual correlation between chocolate and Nobel prizes; it is just that due to this cherry-picking we see this line, and from the data we have, we cannot rule this out.

In graphical models this may be represented like this: here the Nobel prizes, here the chocolate consumption, and here we have a selection variable on which we condition. This is basically the collider case in a Bayesian network, the selection-bias case. What is missing is the selection variable itself: it would be equal to one for all the data points we have, but the other values are missing; we only have data for these,

and we cannot rule this out just from the data we have. But that is not the end: we can have other constraints, namely functional constraints that make the variables dependent. For instance, we could come up with a story that the United Nations decided to only allow countries to import chocolate in proportion to the number of Nobel prizes of their citizens, as a reward. This leads to a functional constraint where Nobel prizes and chocolate consumption are coupled and need to satisfy an equation equal to zero, for instance a linear equation, and the countries are basically not able to deviate from this because there is a rule in place that makes only these cases possible. But if we ask whether eating more chocolate makes more Nobel prizes, then this

would give us a bias. There is usually no standard graphical representation for this, but we can invent one, more as an illustration which is less rigorous. But the possible explanations do not stop there. Explanation six would say that Nobel prizes and chocolate consumption have a common latent confounder. What we are saying is that there is a third variable, the wealth of the country, and the wealth decides whether a country can spend money on science and on consumption, and wealthy countries tend to spend more on both. So we can say: more money means better science, which means a higher probability of winning a Nobel prize; also more

money means more consumption in general, for instance of chocolate. So we could also have displayed something else than chocolate, and in the graphical model we would represent this like this: here the number of Nobel prizes, here the consumption, and the wealth as a common parent, represented like this. But we could also come up with all kinds of other stories. We could say maybe we just did a simple regression and the sample size is too small, so if we had more samples, if we were able to sample more countries, this would not happen; it was just a coincidence. Or we have measurement noise, and just by coincidence

the points ended up in these places, or we used an unreliable independence test or regression method. The other way around you could also say: we did a correlation test, which is a linear independence test, but in real life the variables could be uncorrelated yet dependent in a more non-linear fashion. Or you could come up with your own story; there might just be other scenarios of correlation which we did not think about, so there might be even something else. As I demonstrated with all these explanations, you have to be careful when you think about correlations. The summary here is that correlation does not necessarily imply

causation, and the reason is that there are other sources of correlation. I repeat: there could be latent confounders, selection bias, feedback loops, functional constraints, measurement error and all kinds of other things you can think of, and in the field of causality we need to make sure we are not fooled by such correlations. A direct way to see the difference is that correlation is something which is symmetric, right? The covariance of X and Y is zero if and only if the covariance of Y and X is zero. So correlation is symmetric, but X causes Y does not imply

that Y causes X: correlation is symmetric, causation is asymmetric. As an exercise, here are a few other interesting scenarios where you can come up with different explanations based on the scenarios we had, the feedback loops and so on, and you can analyze the situation and see what you come up with. For instance, if the news claim that chewing gum causes diabetes, what other explanations could you have? Or, also interesting: suppose

we want to train self-driving cars with neural networks that try to imitate human drivers. They see video data, including the displays of the car in front, for instance its brake lights. You can think about whether it makes sense, and reason in this framework, that our car should brake when the brake lights of the car in front show up. You can again make up stories about this: you have to think about cause and effect, confounders, selection bias, feedback loops, functional constraints and the other kinds of correlation-inducing scenarios.

1 Causality - Experimental Causal Discovery

Hi everyone, in this video I want to talk about causality and what it means in experimental terms. Let us look at an example. Say you wake up every morning and look at your thermometer, and during the day you see that as the sun rises, the thermometer rises as well, and in the evening you see that as the sun goes down, your thermometer also goes down. So what you record is that your thermometer has the values up and down, say binarized, and that the sun can also be up or down. When you record this, what you find is: whenever the thermometer is up, the sun is up, and when the thermometer is down, the sun is down. So what you see here is a perfect correlation between your thermometer

and the sun being up or down. Now the question is: is this causal, is there a causal relationship? Simplified, we have two cases: here the thermometer and here the sun, and either the sun is causing the thermometer to go up, or, since the correlation is perfect, we could also think it goes the other way, that the thermometer makes the sun rise. And we have other cases as well; we have talked about the different types of correlation. Say we have these two hypotheses and we want to know which one holds. The problem is that the data does not give us any indication: we have perfect correlation, which is symmetric,

so is it this one or this one? Just from this table, from this data set, we cannot decide anything. The question is: what can we do? What would you do if you wanted to decide whether the sun goes up by itself or whether the thermometer is making it go up and down? What you could do is get rid of the glass cover, take the needle and force it to go to 60 degrees, just put it there, and then look at the sun and see whether it goes up or down just by doing this, or whether it stays the same. Of course, you know that if you do this, the sun will stay the same. This means you can rule out this hypothesis. But to do so you had to not just look at what is happening; you had to intervene in the system by changing

this thermometer. Now you ask the question: is the other hypothesis true or not? Then you would need to go to the sun, force it up and down, and see whether the thermometer goes up or down. This is of course a little bit tricky. You could try to do some surrogate experiments which try to simulate this: for instance, instead of making the sun go up and down, you could take this thermometer to a different country, maybe Australia, where the sun has a different position in the sky, and see if the relation still holds. You would see that this direction is the true one: the sun is producing heat and the thermometer is sensitive to heat. So note that

if you only observe the relationship between two variables, your thermometer and the sun maybe, then this is called an observational experiment: observing just what is happening without interfering. We could call it an idle state. The important part is that you are just observing, and then we usually get one data set, an observational data set with all the observations we had, and in this data set we can measure correlations. This is usually the standard case in data science, machine learning and

statistics. On the other hand, as we just said, if you have the possibility you would like to interact with the system: we want to set these variables and then measure what changes, what happens. This is called an interventional experiment, and you can do several of these. You have seen we could force the thermometer up, we could force it down and see how the sun changes; we could force the sun up and down and see how the thermometer changes. Already in this binary case we have four different experiments we can do. That means we get many interventional data sets, and what we get is not just correlations here, we get causal relations,

and one can say this is the standard case in the field of causality, often also in medicine for drug testing, checking whether a drug works, and often in reinforcement learning, where we have agents interacting with the environment, so there we have something interacting with the variables of the environment. After having said this, we can try to give a definition of causality. Say we have again two variables X and Y. Then we say that X has a causal effect on another variable Y if, when forcing X to different values

x0 and x1, the distribution of Y changes. We could also say that for every setting of X, x0 and x1, we get a different variable Y, so we have Y0 and Y1, and if they are different, if their distributions are different, we say that X has a causal effect on Y. Another notation is the do-notation, where "do" means intervening, forcing a variable to some value. So we are saying that the distribution of Y when forcing X to x0, written do(X = x0), is different from the distribution of Y when forcing X to x1. They could also be the same; there is cause and effect when they are different for at least some values.
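In symbols, a hedged restatement of the definition just given reads:

```latex
X \text{ has a causal effect on } Y
\quad :\Longleftrightarrow \quad
\exists\, x_0 \neq x_1 :\;
P\bigl(Y \mid do(X = x_0)\bigr) \;\neq\; P\bigl(Y \mid do(X = x_1)\bigr),
```

in contrast to the purely observational comparison of P(Y | X = x0) with P(Y | X = x1).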

This is in contrast to conditioning: the difference there is that Y changes when X is merely observed to take different values. That would be the case where we just observe the sun going up and the thermometer going up and down, and conditioning does not help us uncover the causal relation. So conditioning is not the same as intervening: here this is Y when we intervene on X, and this is Y when we observe X, and these are

different. How is this done in practice? The gold standard is called a randomized controlled trial, and it is used in medicine a lot. How does it work? If you want to check whether a drug works, you have a population, say of ill people who have a cough or something, maybe coronavirus, and you want to check whether the vaccine or cure works. Then you randomly split them 50/50 into two groups. With one of the groups you do nothing, or you give them a placebo, and the other group gets the drug; this would be setting X

to one value, and to make it clear that we are intervening we write it with the do-operator for the treatment group; and here the other group, which does not get the drug or gets a placebo, is the other do-setting. Then we do this, wait until they have recovered or not recovered, and count how many people recovered in the treatment group and how many in the control group. Here you see four people recovered and here two people recovered, so we see that there is some effect of using this drug, this cure, for these patients. Recovery would be Y equal to one, not recovering Y equal to zero, as depicted.
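A minimal sketch of such a trial in simulation (illustrative numbers, not the lecture's example): each person is assigned to treatment with probability one half, and we compare recovery rates between the two groups.

```python
# Simulated randomized controlled trial and difference-in-means effect estimate.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
treated = rng.uniform(size=n) < 0.5                  # do(treatment) vs do(placebo), randomized per person
# Hypothetical recovery probabilities: 0.7 under treatment, 0.4 under control.
recovered = rng.uniform(size=n) < np.where(treated, 0.7, 0.4)

effect = recovered[treated].mean() - recovered[~treated].mean()
print(f"estimated causal effect of the treatment on recovery: {effect:.2f}")  # roughly 0.3
```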

How it is done in real life is that you randomize who is assigned to which group. But you could also think about it as taking one individual person, who with a 50/50 chance is assigned to the intervention or to the control group. For us the groups are kind of fixed and we randomly put people in, but for a person, getting this or that is the randomized part. I tend to think of it more like this, because then it is clear that, even though we write it like this, we put a probability distribution on these two treatment values. There are some downsides to this method. The downsides are usually that you could have several values: say you have a drug and you can give it in doses of 50 g, 100 g, 150 g, and for every value you would have

your own interventional group, so it would not just be half and half, it would be a quarter, a quarter and so on, and for every experimental setting you would have to run the experiment. So it is time-intensive and costly. Also, often you cannot do this: for instance, if you want to check whether smoking causes cancer, you would have to force people to smoke, which would be unethical. Nevertheless, if you can do it, this is the gold standard for experimental causal discovery. There are of course more things to do, for instance reducing biases with double-blind assignment and so on, but we are focusing here on the causal part, not on the practical part. Specific applications: where could you use this? There are several examples

where this has been used, as I said, like drug testing. Then people try to use this for advertisement placement, for instance Amazon or Facebook: they place an advertisement in different areas of your screen, they assign these placements to different people, you do not know about it, but they do this and then they see where people click more often, and they earn more money by putting the ad in that area. Then one could use it for evaluating public policies. There was a very nice case, for which people got the Nobel prize: they did poverty research, went to some poor areas and wanted to know what they could do so that students go to school more often and their performance gets better, and they assigned different interventions

to different villages. Some would just get more textbooks, to see whether the children would be more willing to learn or to go to school; for others they did simple things like health checks or deworming, and they figured out that many of the children were more or less ill, and if they dewormed them they were willing to go to school and had much better performance. Interestingly, the randomized controlled trial has a long tradition in medicine and only recently went into the social sciences. So in summary: causality is not an operator on your observational data set. It is not like you have one data set and can just turn it into a causal one by applying some operations or preprocessing steps; causality usually takes place in the real world. This means that if you want to uncover causal relationships between variables,

you usually need to run many interventional experiments. Secondly, or thirdly, we said that conditioning, which lives on the observational data set, is not the same as intervening. Think about the sun and thermometer example: conditioning does not give us any information about causal relations, and an observational data set is not the same as an interventional one. So conditioning is not the same as intervening, conditioning here, intervening there, and we want to measure the distributional change in Y. It is very important that you really understand the difference and decide which of the two concepts you need in practice. A lot of people use conditioning in place of interventions because

they do not know the difference, and often they have never heard about it. And finally, the gold standard of experimental causal discovery is the randomized controlled trial.

2 Causality - Causal Modelling - Causal Bayesian Networks

Hi everyone, in this video we will talk about causal modelling and introduce causal Bayesian networks. But first let us talk about the different schools of causality. There are two major schools in causality: one is called the potential outcomes school and the other one the graphical models school. The potential outcomes school was basically founded by Rubin, and the graphical models school by Pearl. The main difference between these two versions of causality is that the potential outcomes school tries to be as general as possible, and for this they try to model every potential outcome as its own random variable. So for instance

Y0 and Y1 are the random variables corresponding to X being set to the value x0 and to x1, so each possible value of X gets its own random variable. Then of course some assumptions are needed to make this workable; often this is done by imposing conditional independence constraints on these potential outcome variables and so on. In the graphical models community of causality, on the other hand, people try to incorporate all the assumptions made into one big causal graph, meaning if you use a causal graph

for causality, then the graph encodes the assumptions you make. In many cases these two schools are equivalent, meaning if you make strong enough assumptions on the potential outcomes side, we can impose a graph and express everything in terms of a graph. But it is true that there are models that cannot be expressed as graphs or Bayesian networks or something like this, and one needs to keep in mind that the graphical world is not the end of causality. Nevertheless, we will stay in this framework, as we have already learned about Bayesian networks, and also because it is usually easier to grasp: it is visual and we can use it as a guiding tool for

getting all these conditional distributions and independencies and so on. Then we need to talk about what kind of assumptions we make by writing down a causal model. The main assumption is what is called the modularity assumption, meaning that we think of the causal model as several interacting causal mechanisms, and what I can do is switch individual mechanisms off by intervening in the system. Modularity says that if I switch off one part, then the other parts stay intact, so that interventions are local and precise: they just cut out a piece of the system while leaving everything else

intact. This is the modularity assumption: only the mechanisms targeted by the interventions are switched off, and whatever is not targeted stays intact, the same as before. What we want to use is basically a Bayesian network where the edges correspond to causal influence; people talk about causal Bayesian networks, and mathematically there is actually not a big difference. The word causal usually just means that when you use a Bayesian network to model something and you say causal, you interpret it causally; that is the main point, and it comes with some consequences. But let us just look at the mathematical description, which is almost the same, with only a slight difference. So here on the left we have our Bayesian network; let

us recall that it consists of a directed acyclic graph with nodes and edges and a joint distribution. This joint distribution needs to satisfy some properties, namely the factorization property, which I have written down here just to be clear. We start with the joint distribution, and from it we can compute the conditionals by this formula, or set them to zero wherever the conditioning event has probability zero. So basically the joint distribution gives us a tool to define and compute the conditionals, in theory, and then we take the product of all these conditionals, and we have two distributions, one on the left side and one on the right side, and the Bayesian network assumption says these should be equal.
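Written out in the usual Bayesian-network notation, with pa(v) denoting the parents of node v, the factorization property reads:

```latex
p(x_1, \dots, x_n) \;=\; \prod_{v \in V} p\bigl(x_v \mid x_{\mathrm{pa}(v)}\bigr),
```

where each conditional on the right is computed from the joint (and set to zero where the conditioning event has probability zero).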

If this holds, we talk about a Bayesian network. For a causal Bayesian network we need to be a little bit more careful, but only slightly. The reason concerns these conditionals: if these are densities, say normal distributions, we could change the density at finitely many individual points to arbitrary values, because integrating over single points of a continuous density does not contribute any probability mass. So we could change it at as many isolated points as we want. Also, the conditional at a point where the marginal is zero, meaning this value occurs with probability zero, we defined to be zero, but we have complete freedom there; we could make it anything whatsoever,

since, if you integrate over it, this will not make any difference. In the causal setting, on the other hand, we are allowed to intervene in the system and set variables to specific values, and then it very much matters what the conditional is at exactly those values. That is the reason why, instead of computing these conditionals, we have to give them clear values that are defined and fixed once and for all. This is why, instead of starting with a joint distribution, we start with Markov kernels, transition probabilities, and as I have indicated with an index here, for each node I have my own kernel depending on the parents of that node, and then we define the joint. So these kernels are given,

and then the joint is computed from them, if you want. Before, we started with the joint and computed the conditionals; here we start with the conditionals, fix them, and then compute the joint; it is done the other way around. Now we know roughly what a causal Bayesian network is, and we know there is not much of a difference, just that we start from here instead of from there; that is not a big issue, and honestly we could have defined Bayesian networks like this from the start and forgotten about the other approach. Now that we have this, before I go on I want to extend the definition a little bit, namely I want to talk about

causal Bayesian networks that also have latent variables and input nodes in them; that makes things easier later when we talk about other topics. So from now on, when we say causal Bayesian network, we actually mean a causal Bayesian network with latent variables and input variables. This consists by definition of observed nodes, which correspond to the random variables we measure; latent nodes, which correspond to the random variables we cannot measure but still model; and input nodes, which correspond either to parameters over which we have no probability distribution, or to settings of our experiments that are set by the experimenter. For instance, say you do an experiment

and you have to set the temperature of something; the temperature is not random, it is set by the experimenter, and it still affects other nodes, so we have to model it. We mark such nodes with a box, while the others get circles, meaning that there is a probability distribution over them given their parents. These input nodes do not have parents, meaning they do not depend on anything; we consider them as being set by the experimenter. On top of all these nodes, we define a DAG structure, which just takes all these nodes into one set and has directed edges between them, as you can see here. Furthermore, as I said before, we need a Markov kernel for each of the probabilistic variables,

always depending on the parents as before, and then we define the joint distribution by just taking the product, as you can see here. The input variables are always written on the conditioning side, so there is no distribution over them; we say this is the joint distribution given the input. This is very, very similar to what we did before, we just spell it out in more terms and more explicitly. Then a few remarks. Sometimes we are only interested in the observed distribution given the input. If that is the case, then we can actually make the latent space a bit smaller, reduce it a bit while still preserving the ancestral relations, meaning that if we have a directed path and we kick

out some of the intermediate nodes, like some of the latent variables, we still have a directed path. This can be done by iteratively marginalizing out latent variables that have parents: if a latent variable here has a parent and we do not know its value anyway, we can also directly model the causal relation without taking the detour over this latent variable. So without loss of generality we can assume that the latent variables, like our input variables, do not have parents, and this is what we will assume. Once we assume this, we can also gather latent variables together if they affect the same variables, like this; we do not need each individual latent variable, we can just merge them into

one combined latent node pointing to the others. Let us look at an example: look at this graph, which you have seen before, and assume we cannot measure this node, so it is latent, but it has a parent here which is observed. Instead of modelling it, which might not give us any benefit, we could also just erase it. Marginalizing it out means we have to connect its parents with its children, so what we have to draw are these arrows here; this is how we marginalize out one of them. And here you can see

two latent variables that affect two observed variables and also affect each other. Here, for instance, I could marginalize out this one, and instead draw it more like this, and then you see we are left only with latent variables that do not have parents, as we preferred. And if this one were latent from the beginning, then we would have one latent variable here. So the interpretation of our causal Bayesian network is the following.

First, all these directed edges are interpreted causally, meaning there is causal influence. So it is not just about what we thought before, conditional independence; it is not about interpreting the Bayesian network as merely encoding conditional independence relations, although those are also there. The main point is that the edges are interpreted in terms of interventions, as we will see. Then we can say what latent confounding means: latent confounding between two observed variables is when there exists an unobserved variable that affects both. So this is this case, right?

The case where we have V and W observed and U here: this is the typical case of a latent confounder, where U here is unobserved and these two are observed. And if all the latent variables only have one child, then we say there is no latent confounding; we have latent variables, but there is no confounding through some kind of fork coming from the latent space. Furthermore, if we use such a causal Bayesian network, we make some implicit assumptions. We implicitly assume, first, the modularity assumption, where the modular parts are the Markov kernels; you will see that in a moment.

Then we assume there is no unknown latent confounding, meaning we have modelled all the relevant latent variables: it cannot be that, if we do not draw it, there is still a hidden variable U that affects both. By writing the model down we are saying that we know all the latent confounders and all the latent variables. There are no feedback loops, because there is no cycle in the DAG. Furthermore, there is no selection bias: we have not conditioned on anything yet; even though some things are observed, we are not conditioning on them. We can model selection bias later by conditioning on some of these variables, but

in this form we did not specify any variable to be conditioned on. I hope I did not forget anything, but of course we also assume there is no measurement error; otherwise we would have modelled it explicitly by having a causal model plus some emission probabilities, like in a hidden Markov model, but we did not do this, so in this case there is no measurement error. These are the main assumptions: no unknown latent confounding, no feedback loops, no selection bias, no measurement error, and the causal modularity assumption. Now we set the definition up so that we can define interventions. What does it mean to intervene? To intervene means I take some of the nodes and I change their values, I set the values of these nodes.

Call this set W; it needs to be part of the observed variables. Then we have an intervened causal network constructed from this, and this is expressed with the do-operator. It changes these ingredients here, the graph and the Markov kernels; all these entries will be changed by the intervention, and I will write down how. First of all, we turn the intervened observed variables into input variables, meaning we now control them ourselves, by the experimenter. That is the first thing: we do not leave them alone and see how things evolve, we take them into our own hands and put values on them. This means we turn them into input nodes, which just means we remove them from

the observed variables and add them to the input variables; the latent variables stay the same, nothing changes there. Now what about our edges? Here we say: remove all the edges that point into W. That means all the causes of these variables in the system are overwritten by our intervention; nothing can change W besides the experimenter. Then we write down the joint distribution, which we now write with a do here; do means that these variables are under the control of the experimenter. So these values are under the control of the experimenter, and the J variables were already controlled,

but now we are also controlling W, and here we had to remove them: the values x_W moved from here to here, to the conditioning side. Now we have to make a definition, and this definition says: I take just the same Markov kernels as before and go over all observed and unobserved variables in this model except W, that is, over the set of observed and latent nodes without W, and then take the same product as before, like this. Let me write down what is missing: all the kernels p_v(x_v | x_pa(v)) for the nodes v in W.
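In formulas, this is the truncated factorization (stated here in the notation of this section, with O the observed, U the latent, J the input and W the intervened nodes; the factors for v in W are simply dropped):

```latex
p\bigl(x_{(O \cup U)\setminus W} \,\big|\, do(x_W),\, x_J\bigr)
 \;=\; \prod_{v \in (O \cup U)\setminus W} p_v\bigl(x_v \mid x_{\mathrm{pa}(v)}\bigr).
```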

These factors are missing, meaning we leave them out. They were present in the observational distribution, but now that we intervene on these variables, we cancel their mechanisms; this is what intervening does, it changes exactly these mechanisms. The modularity assumption is incorporated by saying that all the other kernels stay intact without change, while we remove only these factors from the observational distribution. Let us have a look at a simple example. Say we have this graphical model as before, and now our

observed variable W is the one we want to intervene on. What do we do? We look at it here, and we remove all the incoming arrows: we remove this arrow and also remove this arrow, just like this, and then we turn this node into a box. This is how it looks. So this is our G and this is our G_W, where W here is just this single node. And if we were to intervene on another variable, say this one, then we would remove this edge and turn that node into a box, and so on.
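A minimal sketch of the do-operator in code (illustrative; the node names and mechanisms are made up, not the lecture's graph): intervening on a node replaces its mechanism by a constant, while the downstream mechanisms stay intact by modularity.

```python
# Ancestral sampling from a toy causal chain V1 -> W -> V2, observational vs. do(W = w0).
import numpy as np

rng = np.random.default_rng(0)

def sample(do=None, n=10_000):
    """Sample from the toy model; 'do' optionally fixes W, cutting its incoming mechanism."""
    v1 = rng.normal(size=n)
    if do is None:
        w = 2.0 * v1 + rng.normal(size=n)        # natural mechanism p(w | v1)
    else:
        w = np.full(n, do)                       # intervention do(W = do): incoming edge removed
    v2 = -1.5 * w + rng.normal(size=n)           # downstream mechanism stays intact (modularity)
    return v1, w, v2

_, _, v2_obs = sample()                          # observational distribution of V2
_, _, v2_do = sample(do=1.0)                     # interventional distribution under do(W = 1)
print(v2_obs.mean(), v2_do.mean())               # the mean of V2 shifts to about -1.5
```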

And again, just to say it out loud: when we use such a causal Bayesian network, then we assume that the graphical model models the observational setting and that all the intervened versions model the interventional settings, meaning this is the model, this is the reality, this is the intervened model, this is the intervened reality. We have to make sure this makes sense in reality. For instance, we had the sun and thermometer experiment, where the sun causes the thermometer to change by changing

the heat, and the heat might be a latent variable here which we did not measure, so we model the effect directly. Now we intervene on T. What does that mean? I change the thermometer, the needle in the thermometer, and of course the sun does not change; the sun will move independently of how I set the value. And this mechanism does not do anything anymore, because we intervened and set the needle ourselves, so we override the natural mechanism into T by our own intervention. So we would have here T, here the sun, and here the arrow is gone; this is how it would look. And because in this case the sun moves independently of the thermometer, we were able, by

this experiment, to see whether there is a causal relationship or not. If we had it the other way around, say this is case one and it were case two, that is, if it really were the case that we could control the sun by changing the needle of the thermometer, then we would have the edge in this direction, like this: here the thermometer, here the sun. And then, if we intervened on T, only the incoming edges are removed, so we would still have this edge here, meaning that if we change

this here, we would see a change in the sun through this dependence here. So if it really were a causal relationship, that we could control the sun, then we would still see the effect when we intervened. So these two cases in our example would really model exactly what we wanted; of course this one is the correct model in reality and this one would not be, but the causal relationships would be well modelled by our causal Bayesian network. By the same argument, the randomized controlled trial works: this would be our treatment, this our outcome variable, and then randomizing the treatment here, setting T equal to zero or one, and seeing how the outcome changes tells us what the causal

effect of this treatment is. And if there is no causal effect, then the treatment does not work, but in this setting that is because there was no edge in the first place. So this is how we model systems in the real world with causal Bayesian networks.

3 Causality - Causal Modelling - Structural Causal Models

Hi everyone, in this video I want to show you that you can also model causal relations with structural causal models. Let us recap causal Bayesian networks: a causal Bayesian network consisted of three types of nodes, observed nodes, latent nodes and input nodes, then a DAG structure on these and Markov kernels such that, if I multiply them, I get my joint distribution out. Now, a structural causal model is almost the same; it is just a different flavor of modelling. Instead of using Markov kernels, transition kernels, I can also use deterministic functions, which you can see here. So instead

of sampling at each step from one of the nodes, if you think about it that way, I just let this node be a deterministic function of its parents. And if I have these functions, of course I also have deterministic transition kernels, these Markov kernels here, so every structural causal model can be interpreted in this way, as a causal Bayesian network. You might think: okay, if I only allow deterministic transition functions, this is more restrictive, but it is actually not; it is just that the stochasticity is modelled explicitly. Why is it not more restrictive? The reason is that we can always add noise nodes such that it becomes a structural causal model if we start with a

causal Bayesian network. How can we see this? We have seen this trick already: what we use is the conditional quantile function. We start with our Markov kernel and want to turn it into a deterministic map, and we do this by introducing a new variable which I call X_{u_v} here; X_{u_v} is uniformly distributed. Then we plug it into the conditional quantile function. We add such a new variable to each of our variables and add a new arrow, changing the graph a bit with the new variables. This turns

a probabilistic map, a Markov kernel, into a deterministic one, and our functions are then just these conditional quantile functions as functions of these two inputs, which will now be the parents, as we can see here in this new graph: these are the old parents, and if we add this noise variable, it will be a new parent. Let us just see how this works. Think about, for instance, this variable, where we have a Markov kernel. What we do is introduce another node, a latent noise node for it, add another arrow here, and with this addition this node becomes a deterministic function of these two variables.
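A minimal sketch of this conditional quantile trick (illustrative only; it assumes a toy kernel p(w | v) = Normal(2v, 1) and uses SciPy's normal quantile function): the kernel is rewritten as a deterministic structural equation of the parent and an extra uniform noise variable.

```python
# Turning a Markov kernel into a deterministic function of (parent, uniform noise)
# via the conditional quantile (inverse CDF) function.
import numpy as np
from scipy.stats import norm

def f(v, u):
    """Structural equation W = f(V, U): quantile function of p(w | v) evaluated at u."""
    return norm.ppf(u, loc=2.0 * v, scale=1.0)

rng = np.random.default_rng(0)
v = 0.5
u = rng.uniform(size=100_000)    # U ~ Uniform(0, 1), the new latent parent
w = f(v, u)                      # deterministic given (v, u), yet distributed as Normal(2v, 1)
print(w.mean(), w.std())         # close to 1.0 and 1.0
```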

Then we can go to the next one: we add one here, let us call it U_W, with a node and an arrow. Now, if we use these quantile functions as I just said, W is now a deterministic function of all these parent variables, and so on. We only need to change the nodes that are probabilistic: also for the observed ones that do not have parents, we can just introduce a new latent variable such that these observed ones become deterministic functions of something else. And finally, down here, we introduce one more noise variable

and an edge, and now we have turned this last node, via the quantile function, into a deterministic function of all its parents. So as you can see, causal Bayesian networks are roughly equivalent to structural causal models, and it is a matter of flavor how you want to model your causal model in a graphical way.

4 Causality - Causal Modelling - Do-Calculus

Hi everyone, in this video we will talk about the do-calculus for causal Bayesian networks. But first let us recap the rules of probability. For the rules of probability to hold, what you need is a joint distribution, or here, a little bit more general, a Markov kernel, where z is a conditioning variable that always stays on the conditioning side. Then we can define the marginal distribution by just integrating out the intermediate variable x, giving y given z. This is also sometimes called the sum rule, depending on whether you define the joint first and relate it to this, or just define the marginal to be this. Then you can define the conditional distribution by just dividing the joint by the marginal, or, as we always do, define it to be equal to zero

if the marginal is zero, always with this convention in mind. Then we have the product rule, which relates this marginal again with this conditional: it just says that the joint is the marginal times the conditional. We can also compute the other marginal by integrating out y here and plugging this in; this is a typical formula we use. Why am I telling you this? You know all of this already. The reason is that I want to make the point that for these rules to hold we usually only need a probability distribution. What we will now learn on the next slide is that we will have more structure,

like a graph: we have a causal Bayesian network, and we will have additional rules which complement these rules of probability, rules of causality if you want. In the rules of probability, all these quantities, marginal, conditional, joint and so on, are related to each other; these are the formulas for relating the joint, the marginal and the conditionals. Now let us go over to the causal Bayesian network setting. We have a causal Bayesian network; in our setting it consists of some graph with observed, unobserved and input variables and some directed edges, plus a choice of Markov kernels, and the underlying DAG here is G. From this we can define all the interventional distributions; this is the additional

ingredient compared to before, where we had joint, marginal and conditional distributions; now we have interventional distributions on top, and with this model given we can define them as follows. It is a little bit similar to the conditional distributions when you only have probability distributions, but now it becomes more complex: these are not conditional distributions, these are interventional distributions. This notation is defined as the product of all these Markov kernels, where W is the set we intervene on, and without loss of generality we assume that J is always inside W, since the J variables are input variables anyway. So we are basically replacing the input variables J by the input variables W; then we do not need to write W union J, and we only have one letter here for

convenience instead of two, then if we have defined these for all kinds of all kind of subsets of our observed and input distributions you can now again define conditional and conditional. marginals versions if you want will be marginalized out some of these distributions some of these variables. And then condition and here what we do first is to first define the interventional distribution and then define this conditioning here condition on xz by taking the joint here divided by this but first intervening, then conditioning. This is a convention for this notation. And here again if I had to be zero if p of X C two, is equals to zero in that case.
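A minimal sketch of these definitions in formulas, writing pa(v) for the parents of node v in G (reconstructed from the spoken definition; the slide notation may differ slightly):

```latex
% Truncated factorization: intervened variables become inputs, so their kernels drop out.
\[ p(x_{V \setminus W} \mid \mathrm{do}(x_W)) = \prod_{v \in V \setminus W} p(x_v \mid x_{\mathrm{pa}(v)}) \]
% Marginal interventional distributions: integrate out the remaining variables.
\[ p(x_A \mid \mathrm{do}(x_W)) = \int p(x_{V \setminus W} \mid \mathrm{do}(x_W))\, dx_{V \setminus (W \cup A)} \]
% Conditional interventional distributions: first intervene, then condition.
\[ p(x_A \mid x_C, \mathrm{do}(x_W)) = \frac{p(x_A, x_C \mid \mathrm{do}(x_W))}{p(x_C \mid \mathrm{do}(x_W))},
   \quad \text{defined to be } 0 \text{ whenever } p(x_C \mid \mathrm{do}(x_W)) = 0. \]
```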

Now the question is: how are all these distributions related to each other? There are three rules, and since this is a graphical model, the graph will play a role. There are different versions of how to formulate these rules; I chose the version I find most intuitive and easiest to remember. Again we have a causal Bayesian network, same notation as on the last slide, and now we introduce auxiliary intervention nodes: we augment our graph with more nodes, one for every observed variable.

So for every observed variable we add another node with an arrow pointing from it towards that variable. We consider these new nodes to be input nodes, and they will help us formulate the rules of causality and remember the d-separations. We now take subsets of the observed variables; for a subset B we write I_B for the corresponding set of intervention nodes. These I_B are nodes of the augmented graph, but they are neither observed variables nor ordinary input variables; they are these very specific intervention nodes. Then we have a conditioning set C, and we have the do(W) part.

What do we mean by this notation? It can be written in several different ways, but what is meant is that A is d-separated from I_B given C. The question is: in which graph? The graph we use is G where we have intervened on W, augmented with the intervention nodes I_B and their edges. Let us draw an example first, a very simple one with, say, four nodes.

Let us call these nodes X, Y, Z and W, and we want to check whether Y is d-separated from I_X given X and Z in the graph where we intervene on W. What does this mean? First we intervene on W in this graph; instead of deleting anything I copy the graph, so we have two versions: this one is G, and this one will be our G with do(W). Intervening on W

means that we turn W into an input node, which I indicate by drawing a box around it. Now we want to check whether the statement holds or not. For this we have to introduce the intervention node, call it I_X, with an edge pointing into X. This is the graph we check the statement in: first we did the intervention operation, then we augmented the graph with all the occurring I's. We could also add the other intervention nodes, but they are not needed:

they do not change the result here, so we only draw I_X. Now you can check the paths from I_X to Y, conditioning on X and Z. One path goes from I_X through X and onwards to Y; on that path X is a non-collider, and since we condition on it, the path is blocked. The other path also passes through X, but there X is a collider, so conditioning on X opens it; further along, however, we also condition on Z, which blocks it again. So both paths are blocked, the d-separation checks out and the statement holds (a mechanical way to run such checks is sketched below).
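Such d-separation checks in the augmented graph can also be done mechanically. Below is a small, self-contained sketch (not from the lecture) that tests d-separation via the standard ancestral-moral-graph construction; the example DAG at the bottom is a hypothetical stand-in, since the exact graph drawn on the board is not recoverable from the transcript.

```python
from collections import deque

def ancestors(dag, nodes):
    """All nodes with a directed path into `nodes`, including `nodes` themselves."""
    parents = {v: set() for v in dag}
    for v, children in dag.items():
        for c in children:
            parents[c].add(v)
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents[v]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, A, B, C):
    """True iff A is d-separated from B given C in the DAG (dict: node -> set of children)."""
    A, B, C = set(A), set(B), set(C)
    keep = ancestors(dag, A | B | C)                      # 1. ancestral subgraph
    undirected = {v: set() for v in keep}
    for v in keep:
        kids = dag[v] & keep
        for c in kids:
            undirected[v].add(c); undirected[c].add(v)    # 2a. drop edge directions
        for c in kids:                                    # 2b. moralize: marry co-parents
            for v2 in keep:
                if v2 != v and c in dag[v2]:
                    undirected[v].add(v2); undirected[v2].add(v)
    for v in C:                                           # 3. delete the conditioning set
        undirected.pop(v, None)
    for nbrs in undirected.values():
        nbrs -= C
    frontier, seen = deque(A - C), set(A - C)             # 4. reachability from A to B
    while frontier:
        v = frontier.popleft()
        if v in B:
            return False
        for n in undirected.get(v, ()):
            if n not in seen:
                seen.add(n); frontier.append(n)
    return True

# Hypothetical augmented graph: I_X -> X -> Y, plus a confounder U -> X and U -> Y.
dag = {"I_X": {"X"}, "U": {"X", "Y"}, "X": {"Y"}, "Y": set()}
print(d_separated(dag, {"Y"}, {"I_X"}, {"X"}))       # False: conditioning on the collider X opens I_X -> X <- U -> Y
print(d_separated(dag, {"Y"}, {"I_X"}, {"X", "U"}))  # True: also conditioning on U blocks that path again
```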

So this is what the notation means. Now, with these notations in place, I can formulate the three rules of do-calculus. First let us recap: we have our causal Bayesian network with its underlying DAG, we have W, our intervention set, which contains J, and we have A, B, C, disjoint sets of observed variables. Rule one is the insertion/deletion of observations. The question is: when are we allowed to insert or delete an observation? The rule states that the distribution of X_A

conditioned on X_B and X_C under do(X_W) is the same as the one conditioned only on X_C under do(X_W), i.e. without the X_B. So we insert or delete this X_B on the conditioning side. When are we allowed to do that? Exactly when A is d-separated from B given C in the graph where we intervene on W; note that no intervention node occurs in this criterion, both A and B are ordinary variables. So this is a graphical criterion, and it implies the corresponding equality of distributions,

and we know this already, because it is just an application of the global Markov property to this graph. Then rule two: when are we allowed to exchange conditioning with intervention? We have seen that in general this is not possible; conditioning is something totally different from intervention. But sometimes it is the same, and that is what you see here: on one side there is an intervention on B, on the other side a conditioning on B, so this is action versus observation, an exchange of intervention and conditioning if you want. All the other variables are the same on both sides,

everywhere the same: you always find A here and C and do(W) there, and the only thing that changes is the do(X_B) versus X_B part. So: if A is d-separated from the intervention nodes I_B given B, C and the rest, then I am allowed to get rid of the do operator on B and just condition on X_B. The last rule is the insertion/deletion of actions; before we had insertion/deletion of observations, now of actions. Here we again have I_B in the premise, but we do not condition on B: on one side there is a do(X_B), and on the other side B does not appear at all. If A is d-separated from I_B given C, we are allowed to remove this do(X_B) entirely, so that B no longer occurs on either side. The three rules are summarized below.
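Here are the three rules written out compactly, in the intervention-node formulation used in this lecture (a reconstruction from the spoken statements; G_do(W) denotes the graph with W turned into input nodes, augmented with the intervention nodes I_B):

```latex
% Rule 1 (insertion/deletion of observations):
\[ A \perp B \mid C \ \text{in } G_{\mathrm{do}(W)} \;\Rightarrow\;
   p(x_A \mid x_B, x_C, \mathrm{do}(x_W)) = p(x_A \mid x_C, \mathrm{do}(x_W)) \]
% Rule 2 (exchange of action and observation):
\[ A \perp I_B \mid B, C \ \text{in } G_{\mathrm{do}(W)} \;\Rightarrow\;
   p(x_A \mid \mathrm{do}(x_B), x_C, \mathrm{do}(x_W)) = p(x_A \mid x_B, x_C, \mathrm{do}(x_W)) \]
% Rule 3 (insertion/deletion of actions):
\[ A \perp I_B \mid C \ \text{in } G_{\mathrm{do}(W)} \;\Rightarrow\;
   p(x_A \mid \mathrm{do}(x_B), x_C, \mathrm{do}(x_W)) = p(x_A \mid x_C, \mathrm{do}(x_W)) \]
```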

Let us again look at an example, the same graph as before. What do we get? For instance, if we remove this part here, we see that Z is d-separated from I_X without any conditioning; so we have

that Z is independent of I_X given the empty conditioning set, always under do(W). Then we know, by rule number three, that p(z | do(x), do(w)) is equal to p(z | do(w)). Here I write a small w to mean the value and a capital W to mean the node. What else do we get? We already saw that Y is independent of I_X given X and Z

and do(W). From this we know, by rule two, that p(y | do(x), z, do(w)) is the same as p(y | x, z, do(w)). For rule two you have to look carefully at the two sides: you have to check whether X appears only once, as a conditioning variable, or on both sides. For rule one we do not yet have a good example in this graph, so let us draw one more node, call it S, entering somewhere here; the rest stays valid.

Then we have, for instance, that Y is d-separated from S given X and do(W), and by the first rule we get that p(y | s, x, do(w)) is the same as p(y | x, do(w)). So we have these three examples. This is how do-calculus extends the rules of probability: it also takes into account the graphical structure of your causal model and allows you to move between interventions and conditionings. Now you might ask: what is the proof? The proof actually does not need much. Rule number

one is really just the global Markov property, applied to this augmented graph in which the intervention nodes are present. Then we need one more ingredient, let me mark it with a star: p(x_A | ..., do(x_B), ...) is the same as p(x_A | ..., I_B = x_B, ...), i.e. setting the intervention node I_B to the value x_B; and it is also the same as

setting I_B to that value and additionally conditioning on X_B = x_B, since under the intervention X_B takes this value anyway. This is the second ingredient one has to realize and prove; it is a bit technical, you have to extend the Markov kernels to also include the intervention variables, but it can be done and the equality can be shown. Once we have this, the rules follow directly. The first rule was the global Markov property. For the second rule, the premise says that A is independent of I_B given B, C and do(W),

and by the global Markov property this means that the corresponding conditionals agree: p(x_A | I_B = x_B, x_B, x_C, do(x_W)) is the same as p(x_A | x_B, x_C, do(x_W)), because dropping the conditioning on I_B changes nothing.

Now, by the starred identity, the left-hand side, which carries both I_B = x_B and X_B = x_B, is just p(x_A | do(x_B), x_C, do(x_W)). Putting the two together gives p(x_A | do(x_B), x_C, do(x_W)) = p(x_A | x_B, x_C, do(x_W)), which is exactly what we wanted to show: we can exchange the intervention for a conditioning. For rule number three, the premise is that A is independent of I_B given C

and do(W). By the global Markov property, conditioning on I_B = x_B then changes nothing, so p(x_A | I_B = x_B, x_C, do(x_W)) equals p(x_A | x_C, do(x_W)). On the other hand, by the starred identity I can write p(x_A | do(x_B), x_C, do(x_W)) as p(x_A | I_B = x_B, x_C, do(x_W)).

And so it equals p(x_A | x_C, do(x_W)), which is exactly what we wanted to show: we can remove the do(x_B) entirely. So if you use these facts in this way you can prove all three rules; they basically rest on this starred identity plus the global Markov property (the derivation is written out once more below).
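A compact version of the derivation just sketched, writing I_B = x_B for "the intervention nodes of B are set to the value x_B" (my reconstruction of the argument as spoken, not a verbatim copy of the slides):

```latex
% Key identity (the "star"): intervening is conditioning on the intervention node,
% and under it X_B takes the value x_B anyway:
\[ p(x_A \mid \mathrm{do}(x_B), x_C, \mathrm{do}(x_W))
   = p(x_A \mid I_B{=}x_B, x_C, I_W{=}x_W)
   = p(x_A \mid I_B{=}x_B, X_B{=}x_B, x_C, I_W{=}x_W) \tag{$\star$} \]
% Rule 2: from A \perp I_B | B, C in G_do(W) and the global Markov property,
\[ p(x_A \mid I_B{=}x_B, X_B{=}x_B, x_C, I_W{=}x_W) = p(x_A \mid x_B, x_C, \mathrm{do}(x_W)),
   \ \text{hence by } (\star)\
   p(x_A \mid \mathrm{do}(x_B), x_C, \mathrm{do}(x_W)) = p(x_A \mid x_B, x_C, \mathrm{do}(x_W)). \]
% Rule 3: from A \perp I_B | C in G_do(W) and the global Markov property,
\[ p(x_A \mid I_B{=}x_B, x_C, I_W{=}x_W) = p(x_A \mid x_C, \mathrm{do}(x_W)),
   \ \text{hence}\
   p(x_A \mid \mathrm{do}(x_B), x_C, \mathrm{do}(x_W)) = p(x_A \mid x_C, \mathrm{do}(x_W)). \]
```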

5 Causality - Causal Inference - Identifiability

In this video I want to talk about causal inference and the identifiability of causal effects. We again consider a causal Bayesian network, given by observed, unobserved and input variables, some Markov kernels and directed edges, with an underlying causal DAG. We have seen that we can define interventional distributions for all subsets of our observed variables (without loss of generality including all the input variables) by multiplying all the involved Markov kernels, excluding the ones where we intervene; we turn the intervened variables W into input nodes. We can also do this for the conditional interventional distributions, by dividing by

the corresponding marginals. Now the question is: how can we compute such a conditional interventional effect, and under which circumstances is it possible to estimate it? We have the best situation in mind, where the graph is known and we have an infinite amount of data. You might say that then it is surely possible. By infinite data I mean infinite data from some of the distributions, if you have access to them. There are two corner cases. The best situation is

the one where we can sample from the interventional regime corresponding to the cause and effect of interest: we have do(W) here, and we can sample from a joint distribution in which the conditional distributions are exactly the interventional ones we want (let me mark this: the conditional equals the interventional distribution). If this is the case, then estimating the effect boils down to estimating a conditional distribution, i.e. to standard inference and density-estimation techniques we have already learned. So in this setting we are back to inference methods like variational inference, exact inference in graphical models,

or sampling methods, and so on. In this case it boils down to that: there is no new problem here, although it also does not get easier; but we have the data, we have the structure, and we have all the requirements the particular method needs. The more difficult, worse situation is when you cannot get this interventional data and only have observational data: before we had the interventional setting and data from it, and here we only have an observational setting, meaning we cannot perform the experiment with these particular interventions, be it for ethical reasons or because it is not

feasible; many reasons are possible. Now we only have the observational distribution, the joint over the observed and input variables, and the question is: under which circumstances is it still possible to get an estimate, or a formula, for the interventional distributions we are interested in? In general this is not possible, but there are special, well-behaved cases where it is, and we want to analyze some of them. Just be aware that the graph is

considered to be known, and that we assume we can get infinite data from these distributions, so we can work in the idealized case of probability theory if you want. Let us look at some simple examples. Let us draw just two variables; call them X and Y. We are interested in the interventional distribution p(y | do(x)). This corresponds to

the model X → Y: X is here, Y is there. So here we have the joint distribution, and we are interested in this quantity, and as you can see, in this special case it is the same as the conditional distribution, just by the standard do-calculus rules. What we need to check is that, once we add the intervention node, Y is d-separated from I_X given X, and you see that this is the case. So the second

rule applies here: the interventional distribution is the same as the conditional, and the conditional can be computed from the observational distribution, so here identification is possible. Now let us think about the case where we are interested in p(y | do(x)) but the arrow points the other way, from Y to X. We draw the box around X, and now there is no arrow between them anymore, because when we intervene on X we have to remove all arrows coming into X. What do we do? We apply the third rule of do-calculus: again we draw the intervention node I_X, and then it is d-separated from

Y. Why? Because X is a collider on the path between them. So the third rule applies, and that means p(y | do(x)) is just the marginal p(y). This also makes sense from the graphical model: if Y causally influences X, then pushing on X does not change anything about the distribution of Y. So in the first case the causal effect can be computed from the conditional, and in the second case from the marginal; in both cases identification is possible (both are summarized below).
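In formulas, the two identifiable two-variable cases just discussed (my own compact summary of the spoken argument):

```latex
% Case 1: X -> Y.  Rule 2 applies (Y is d-separated from I_X given X in the augmented graph):
\[ p(y \mid \mathrm{do}(x)) = p(y \mid x) \]
% Case 2: Y -> X.  Rule 3 applies (Y is d-separated from I_X, X being a collider on the path):
\[ p(y \mid \mathrm{do}(x)) = p(y) \]
```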

Let us now think about an example where it is not possible; it needs at least three variables. We have Z, X and Y, with Z pointing into both X and Y, and maybe an edge between X and Y in some direction. We consider Z to be unobserved and X, Y observed; the observational distribution, as usual, excludes the unobserved variables. Now we have latent confounding between X and Y, and no matter what we do, we do not get an estimate for p(y | do(x)) in terms of our observational distribution

p(x, y). It is a bit difficult to prove this rigorously, but if we try to use our d-separation criteria with the do-calculus rules, we see the problem. We cannot apply rule two, because to exchange do(x) for conditioning on x we would have to check that Y is d-separated from I_X given X; but conditioning on X opens the path through the latent confounder, and since the confounder is latent we cannot additionally condition on it to close it again. The other option was rule three, which needs Y d-separated from I_X without conditioning; but then the directed path from X to Y is open. So neither rule two nor rule three applies, and the first rule was never about do's, it only inserts or deletes observations.

At least we see that these two rules cannot be applied; identification is not possible here. Now let us talk about a criterion, the so-called backdoor criterion, which gives us a general procedure for finding a formula for a causal effect. The setting is as follows: again we have a standard causal Bayesian network as before, and now there are two conditions that need to be satisfied. We want the causal effect of B on A (or between two sets of variables), and we have a third set of variables, C, whose job

is to block the non-causal paths, the backdoor influences. What we want from this C is, first, the d-separation of C from the intervention nodes I_B, which translates to: no node in C is a descendant of a node in B (I usually prefer the d-separation formulation, but this is what it translates to); and second, that A is d-separated from the intervention nodes I_B when we condition on B and C, which translates to: every backdoor path from B to A is blocked by C. You will see this in an example in a moment; this is why it is called the backdoor criterion, because all the backdoors are blocked, and a set C satisfying it is called an adjustment set.

Then you can adjust for this C with the following formula, and the integral gives you the causal effect. Normally, to estimate such an effect you would need an experiment in the real world with interventional data: experiments where you intervene on B, look at what comes out for A, and use that data to estimate these quantities. The backdoor criterion allows us to use just observational data, together with some additional measured variables: we measure not only A and B but also a C satisfying the criterion, and then we can simply aggregate this observational dataset

and get an estimate for the interventional effect without ever doing the experiment. One caveat: the causal graph needs to be known, and the sets need to satisfy these conditions. In this very special corner case you can actually estimate the causal effect without an interventional experiment. You might think this is just the standard formula for computing conditionals, but compare: the adjustment formula reads p(x_A | do(x_B)) = ∫ p(x_A | x_B, x_C) p(x_C) dx_C, while the ordinary conditional would be p(x_A | x_B) = ∫ p(x_A | x_B, x_C) p(x_C | x_B) dx_C.

The difference to the conditional distribution is that in the adjustment formula you integrate against the marginal of x_C, while in the usual conditional setting you integrate against the conditional p(x_C | x_B). So here you see explicitly the difference between the causal effect and the conditional; the two formulas are not the same, you cannot just replace one by the other. Maybe we should also mention that A, B and C are all assumed to be observed (a small numerical illustration of this difference is given below).
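To make the difference concrete, here is a small, self-contained numerical sketch (not from the lecture): we simulate a binary confounder Z with Z → X, Z → Y and X → Y, and compare the naive conditional p(y=1 | x=1) with the backdoor-adjusted quantity sum_z p(y=1 | x=1, z) p(z), using Z as the adjustment set. The particular probabilities are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Hypothetical ground-truth model: Z -> X, Z -> Y, X -> Y (all binary).
z = rng.random(n) < 0.5                       # confounder
x = rng.random(n) < np.where(z, 0.8, 0.2)     # treatment depends strongly on Z
y = rng.random(n) < 0.3 + 0.2 * x + 0.4 * z   # outcome depends on X and Z

def p(event):                                  # empirical probability
    return event.mean()

# Naive conditional p(y=1 | x=1): implicitly integrates against p(z | x=1), so it is confounded.
naive = p(y[x])

# Backdoor adjustment: sum_z p(y=1 | x=1, z) p(z), i.e. integrating against the marginal p(z).
adjusted = sum(p(y[x & (z == v)]) * p(z == v) for v in (True, False))

# Interventional ground truth p(y=1 | do(x=1)): do(x=1) leaves p(z) untouched.
truth = 0.3 + 0.2 + 0.4 * 0.5

print(f"naive p(y=1 | x=1)     ~ {naive:.3f}")     # biased upwards by confounding (~0.82)
print(f"backdoor adjusted      ~ {adjusted:.3f}")  # close to the truth
print(f"true p(y=1 | do(x=1))  =  {truth:.3f}")    # 0.70
```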

As an example, let us draw again a very simple triangle: Z, X and Y, with Z → X, Z → Y and X → Y, and we assume they are all observed. Then we draw the box, the intervention node I_X, and we are interested in p(y | do(x)); writing it this way is a bit more intuitive now. What we need to check is the backdoor criterion, which

says that if I want to use Z as an adjustment set, I have to check that I_X is d-separated from Z, and this is the case: I_X is here, Z is there; one path is blocked because X is a collider on it, the other path is blocked because Y is a collider on it. So condition number one holds. The second condition: is it true that Y is d-separated from I_X given X and Z? We now look at the paths from I_X to Y; there are two. The first one I block by conditioning on X;

yes, that one is blocked. The second path goes over the confounder: conditioning on X opens it at the collider, but since we also condition on Z it is blocked there. So all backdoor paths are blocked, the backdoor criterion holds, because this backdoor path is blocked and Z is not a descendant of X (it sits upstream, not downstream). That means we can compute p(y | do(x)) as the integral of p(y | x, z) against the marginal of Z. So in this scenario we already have a backdoor

path, we can condition on the variable sitting on it, and adjust just like this. So this holds; it is the standard example for the backdoor criterion. Now maybe we ask ourselves how to prove it — the example is of course not a proof. So, the backdoor criterion: condition one says that C is independent of I_B. By the do-calculus rules, namely rule three, we then know that p(x_C | do(x_B)) is actually p(x_C).

That was the third rule of do-calculus. Second, condition two says that A is independent of I_B conditioned on B and C; by the second rule we get that p(x_A | do(x_B), x_C) equals p(x_A | x_B, x_C). Now, what are we interested in? We are interested in p(x_A | do(x_B)). By the normal rules

of probability theory we can write p(x_A | do(x_B)) = ∫ p(x_A | x_C, do(x_B)) p(x_C | do(x_B)) dx_C. As you can now see, the first factor is the quantity from condition two — call that equation (a) — and the second factor is the one from condition one — call that (b). So this equals the integral of p(x_A | x_B, x_C)

times p(x_C) over x_C, where the first replacement comes from (a) and the second from (b). And that is already the proof of the backdoor criterion: the only things we needed were the third and the second rules of do-calculus (the chain of equalities is written out below).
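The same proof, written as a chain of equalities (my reconstruction of the spoken derivation):

```latex
% (a) Condition 1, C \perp I_B, with rule 3:   p(x_C | do(x_B)) = p(x_C)
% (b) Condition 2, A \perp I_B | B, C, with rule 2:   p(x_A | do(x_B), x_C) = p(x_A | x_B, x_C)
\[ p(x_A \mid \mathrm{do}(x_B))
   = \int p(x_A \mid x_C, \mathrm{do}(x_B))\, p(x_C \mid \mathrm{do}(x_B))\, dx_C   % rules of probability
   = \int p(x_A \mid x_B, x_C)\, p(x_C)\, dx_C                                       % insert (b) and (a)
\]
```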

Finally, one remark: there is actually a sound and complete algorithm whose input is the known graph — the graph, again, is known — together with the subsets involved, i.e. any causal effect of this form, say p(x_A | x_C, do(x_B)), and it checks whether this effect is identifiable or not, where identifiable means it can be computed from the observational distributions alone. This ID algorithm can do that, and not only that: it spits out a formula, similar to the backdoor adjustment, basically a nested expression of conditionals and marginalizations that you have to aggregate; it is iterative and a bit complicated, so I will not write it here. One interesting feature is that the only things it needs are the three rules of do-calculus, so if you want, you can also do it by hand by applying these three rules over and over again, plus the rules of probability.

And by the observational distribution we make clear that we exclude, i.e. marginalize out, all the latent variables we could not measure. So we were left with one very special application, the backdoor criterion. This was all about the identifiability of causal effects.

6 Causality - Causal Reasoning - Example

Hi everyone, in this video I want to show you how causal reasoning can help you understand real-world scenarios. The example I am going to use is taken from Pearl's book "Causality: Models, Reasoning and Inference". The scenario is a conversation between a medical doctor and a tobacco company. The medical doctor claims that smoking causes cancer, and the tobacco industry says that this is not true. What the doctor did is measure features of patients and of other persons as well, such as: did the person smoke, yes or no; did they have lung cancer after some years; and, while they were at it, whether there was tar in the lungs. The conversation unfolds as follows. The doctor claims: smoking causes cancer.

The tobacco company asks: why would that be? Doctor: because smoking is strongly correlated with cancer. Tobacco company: correlation is not causation, do your homework; what you actually need is an interventional study which shows that smoking causes cancer. Doctor: isn't that what I said? That is the same model I have. Company: you monster, how many people did you force to smoke? You cannot do that. Doctor: we did not need to; if we have this causal model for the observational data, then we also know that when we intervene on smoking, it causes cancer. This is the

second rule of do-calculus. Company: but you cannot conclude that from the correlation between smoking and cancer alone. Doctor: sure, but we have already ruled out the other direction; cancer surely does not cause smoking. Company: why not? Doctor: because the cancer symptoms come only after many years of smoking, and we know that the effect always comes after the cause. Company: that is not good enough. Doctor: why not? You already admitted that cancer does not cause smoking, and we know that smoking and cancer are strongly correlated, so the only way to model this is that smoking causes cancer. Tobacco company: your correlation is just

coincidental, a spurious correlation. The actual reason they are correlated is a latent confounder: for instance, a gene mutation could be responsible for the craving for nicotine, which causes people to smoke, and the same mutation is very unhealthy and causes people to get cancer. Doctor: but how would we measure this gene G? Company: exactly, good luck with that. Doctor: wouldn't you say that your gene mutation would cause the tar in the lungs only indirectly? The genes make you smoke, and the smoke contains tar, so the tar comes via smoking and not directly from the gene. Company: that does not help you; unless you measure the gene mutations, nothing can be said about the cancer — whether G causes S, or G causes T,

or S causes cancer, or T causes cancer: it is the same structure and the same problem. Doctor: so you admit that tar causes cancer and that smoking causes tar; then, by transitivity, smoking causes cancer. Company: there is no simple transitivity law in causality, you cannot conclude that; and even if the effect of the G mutation on cancer is much, much bigger than the effect of smoking, you would still need to measure the gene mutations. Doctor: so at least we agree on a joint model: the model where the genes cause the craving for nicotine, which causes people to smoke, and the same genes cause cancer,

and where smoking causes tar in the lungs and the tar in the lungs causes cancer. The only remaining question is then how big all these effects are, which is bigger and which is not. Can you agree to that? Company: okay, fine, but still, good luck with measuring the gene mutations; as long as you do not measure them, nothing can be said, and as long as it is not measured, smoking does not cause cancer. Doctor: wait a moment, let us calculate this through first. Let us introduce the intervention variable I_S on smoking; if we were hypothetically able to intervene on smoking, we would have this augmented graph, and with this notation we can say that the distribution of cancer

under do(smoking) is the joint with T integrated out: we can write p(c | do(s)) as the integral over t of p(c | t, do(s)) times p(t | do(s)); this is just the normal product rule together with the integration. Now we know that C is d-separated from our intervention node here if we just condition on T. This means that p(c | t, do(s))

is actually p(c | t), by the third rule of do-calculus. Also, we have the d-separation between T and this intervention node given S: as you can see here, this path is blocked by S, and the other path is blocked because there is a collider on it. This means that p(t | do(s)) is the same as p(t | s); that is the second rule of do-calculus.

So let us put this together. This now implies that we can actually compute the effect of smoking on cancer from the observational data: p(t | s), how smoking produces the tar, is something we measure in all the patients, and p(c | t), how the tar relates to cancer, we can also measure in the observational data; these are just conditional distributions. So we are able to compute the causal effect of smoking on cancer with this graph. Let us plug in the data and see.
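For reference (this is not in the transcript): the general front-door adjustment formula from Pearl for the graph S → T → C with a latent confounder between S and C is given below; in the special case where C is conditionally independent of S given T in the observational distribution, the inner integral collapses to p(c | t), and the formula reduces to the expression derived in the dialogue, ∫ p(c | t) p(t | s) dt.

```latex
% Front-door adjustment (Pearl): S -> T -> C, latent confounder between S and C.
\[ p(c \mid \mathrm{do}(s)) = \int p(t \mid s) \left( \int p(c \mid t, s')\, p(s')\, ds' \right) dt \]
```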

14 Causality

0 Causality - Causal Discovery - Overview

Hi everyone, in this video I want to give you a short overview of the area of causal discovery. Causal discovery assumes, in the simplest case, a causal Bayesian network as the underlying model, and we consider the graph, the DAG, to be unknown: G is unknown. What you are trying to do is learn this graph from data. The question is: from what kind of data, interventional or observational? In practice it depends on what you have. The best case is that you have all the interventional data, interventional regimes for every intervention and every possible value you could set, with enough data for each; the worst case is that you only have observational data.

In real experiments you might have a mixture: some interventional regimes and some observational data. If you have all interventional data, you can actually consider the problem solved. For instance, take a DAG like this with causal relations like this; then I can just intervene on all the variables. Say I am interested in what affects this variable here; I intervene on all the other variables, and I will see that if we intervene on these variables, say W1

and W2, the parents, then the variable only depends on those and not on the others, because intervening on any other variable does not change it at all. So if you can intervene, you can in particular intervene on the parents of this node, you will find them out, and you will know that the variable only changes under those interventions. And since you can intervene on everything, you can do this for this, this and this variable. So in this regime, where all interventions are available, the problem can be considered more or less solved. Let's say more or less:

you can still have problems if there are unobserved confounders, but let us say there are none; then you intervene on everything and the problem can be considered solved. In between you have all the cases with a mixture of interventional and observational data, and they follow similar principles: the techniques you would use with purely observational data to infer the graph, or parts of it, can also be applied in these intermediate interventional regimes, combined with what you learn directly from the interventions.

So let us restrict ourselves to the worst case: trying to say something about the graph when you only have observational data. This is the worst-case scenario and the most challenging task, and it starts, of course, with the question of the identifiability of the causal graph: if the graph is not identifiable, then the task is not possible, or only partially possible. Let us look at some of the difficulties. First of all, the graph is not known — that was the definition of the problem — and the question arises whether it is identifiable at all, only partially, or not at all.

Then, all the techniques are very sensitive to measurement error, selection bias, latent confounders, feedback loops, functional constraints and so on; whether or not these are present will change your conclusions. If you have a latent confounder, a causal arrow in one direction might in fact be obscured by that confounder, and so on. So strong, restrictive assumptions are needed, and that of course makes things more complicated and less applicable: you have to assume things about the graph and accept these restrictive assumptions. Then, often the ground truth is not known: you do not actually have very good

datasets to check whether something works or not, and if you wanted to build such datasets you would have to ask people to label what is cause and what is effect; to my knowledge there are only one or two such datasets. That means you have essentially no supervision when doing causal discovery, and you are even missing a validation set; at the moment you could consider causal discovery an unsupervised learning problem. So we need to rely on simulated data. Simulated data is not as good as real-world data, but at least it is a good sanity check: if you start from your model and simulate data from it, can your causal discovery algorithm recover the causal graph?

Another difficulty is that the theory is hard, and the more you try to include measurement error, selection bias, latent confounders and so on, the more complex the theory becomes: you have to understand all these correlation-inducing scenarios to develop an algorithm that can handle them. If you want to reduce the complexity, you have to assume, for instance, that there are no feedback loops or no functional constraints; but if you make such assumptions without being sure about them, you quickly get a causal model mismatch, where your model does not actually reflect reality. And then there is a very big, more practical issue: the search space

scales super-exponentially with the number of nodes, and you somehow need to overcome this, either by relying on restrictive assumptions, by grouping graphs together into equivalence classes, or by effective approximation and search algorithms. How bad this is can be seen from the table of the number of DAGs as a function of the number of nodes: with one node there is exactly one DAG; with two nodes there are three (no edge, an edge in one direction, an edge in the other direction); with three nodes there are already 25; and for four and five nodes I cannot comfortably write the numbers down anymore. Even the number of digits needed to represent the number of DAGs grows more than linearly.
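These counts can be reproduced with Robinson's recursion for the number of labelled DAGs; below is a small sketch (not part of the lecture) that prints the first few values, matching the 1, 3, 25, ... mentioned above.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Number of labelled DAGs on n nodes (Robinson's recursion)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 6):
    print(n, num_dags(n))   # 1, 3, 25, 543, 29281
```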

So if you have a DAG with only 70 nodes, i.e. only 70 features, the search space already contains an astronomically large number of different DAGs. To deal with these problems and still get something out of the data, there are different approaches to causal discovery. One of the most important and most famous ones is score-based methods, where you basically put a prior on the graphs and then use the data to compute a score, or a Bayesian posterior, for each graph. Another is constraint-based methods: they try to reduce the search space by looking at conditional independence constraints,

and these constraints kick out a lot of graphs, because the graphs encode conditional independencies. There are other constraint-based methods where you have inequality constraints, often coming from information theory, Shannon-type inequalities, and with these you also try to throw out graphs. Then other approaches use complexity measures: they measure the complexity of your fitted model, including the functions and the data-generating processes, and prefer the least complex one; this is like Occam's razor,

a causal version of Occam's razor if you want. Another one is the independent-component-analysis-based strategy, where you model your functions, decouple them into independent noises, and then try to recover these functions. And the last one is invariant causal prediction, which basically says that the most causal model is the one that is most robust to nuisances, so you could call this a robustness approach. Just to recall: if you have seen causal modelling already, there the graph is known,

and because the graph is known you have very strong prior knowledge, very strong inductive biases, that help you learn in those settings and also help with causal reasoning; so when the causal graph is known, causality can be considered a success story, with applications in robustness, transfer learning and reliable machine learning. Here, in causal discovery, the graph is not known and has to be learned from the data, and there are not so many success stories and real-world applications

out there. Because of all these challenges, causal discovery is more of an active area of research that tries to overcome some or all of the difficulties I have summarized on this slide — there might be more. So this was an overview over causal discovery.

1 Causality - Causal Discovery - ICA based methods

Hi everyone, in this video we will talk about ICA-based causal discovery methods. First, let us recap ICA. ICA assumes that we have some independent sources which are mixed by a mixing matrix in a linear, noiseless fashion; we are only able to measure the resulting X's, and what we want is to reconstruct the sources and also the mixing matrix. We assumed that this matrix is invertible, which was called completeness in this setting, and that the sources actually carry proper signals, meaning they are non-Gaussian — or, if you want, that at most one of them is Gaussian. Under those assumptions we had a theorem saying that both the mixing matrix

and the sources can be identified, up to some ambiguities: sign, scale and permutation of the sources. And we had an algorithm: it initializes some estimate W of the inverse of the mixing matrix, and then, until convergence, it runs through updates where we take a data point, multiply it by W, apply an activation function, and use the result to update W; we repeat this over all data points. This converges to an estimate W, which we can then use to reconstruct the sources simply by multiplying it against the data points, giving the source values belonging to each data point. The algorithm involves an activation function and a learning rate (a sketch of this update loop is shown below).
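Below is a minimal sketch of one common natural-gradient ("covariant") online ICA update of this kind; it is my own illustration, and the exact nonlinearity and update used in the lecture may differ. The two-source demo data at the bottom are made up.

```python
import numpy as np

def online_ica(X, lr=0.01, n_epochs=10, seed=0):
    """Natural-gradient online ICA sketch.

    X: array of shape (n_samples, d), assumed centered.
    Returns an estimate W of the unmixing matrix, so that the sources are ~ X @ W.T.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(d) + 0.01 * rng.standard_normal((d, d))   # initialize near the identity
    for _ in range(n_epochs):
        for i in rng.permutation(n):                      # one data point at a time
            y = W @ X[i]                                  # current source estimate
            g = np.tanh(y)                                # activation / score function (super-Gaussian prior)
            W += lr * (np.eye(d) - np.outer(g, y)) @ W    # natural-gradient update
    return W

# Tiny demo with a hypothetical 2-source mixture (Laplace sources are non-Gaussian).
rng = np.random.default_rng(1)
S = rng.laplace(size=(5000, 2))
A = np.array([[1.0, 0.6], [0.4, 1.0]])                    # mixing matrix
X = S @ A.T
W = online_ica(X)
print(np.round(W @ A, 2))   # close to a scaled permutation matrix if the run converged
```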

So now we want to use this for causal discovery. Again we need to make the proper assumptions, and this leads to the linear, non-Gaussian, acyclic model, called LiNGAM. Spelled out explicitly, we assume: we have a DAG, that is the acyclicity assumption; only observed variables, which translates to there being no latent confounding — all the latent variables are independent and each of them points into only one variable; and this is expressed in a linear structural model: each of the measured variables is a linear combination of its parents (you see the index set runs only

over the parents), plus some independent noise on top, and there is only one noise term per x; this is where the "no latent confounding" sits. We assume that the stochastic disturbances are jointly independent; each of them is independent of the others, and that makes each disturbance independent of the parents of its variable, which follows by induction along the DAG. Furthermore, one of the main ingredients is the non-Gaussianity of all these disturbance distributions — or, if you want, at most one of them may be Gaussian. So the assumptions are: linearity, no latent confounding, acyclicity, non-Gaussianity and this joint independence.

Here is an example, taken from the original LiNGAM paper. The external stochastic disturbances — they are drawn as squares here — feed into the x's, and each x is a deterministic, linear function of its parents: for instance, x2 is a linear function of its disturbance and of x4, and the coefficients you see on the edges are the linear coefficients from the equations. We interpret them as causal strengths, because if we intervened on x4 and changed it by one unit, then x2 would change by 0.2 units,

and so on for the other edges. Here you see other DAGs, and you see there is exactly one disturbance per variable, and each disturbance points into only one node. Again, if you have such a linear structural causal model, then these coefficients can be interpreted as causal strengths: if you want, you can take the expectation value under the do operator, i.e. intervene on the parent, replace it by a fixed value, take the expectation of the child, and see how it changes;

the change per unit of the parent is exactly this coefficient. So this is how we interpret them as causal strengths: take the interventional expectation value and see how the output changes when the input is changed. Now we come to the theorem. The theorem says: if the LiNGAM assumptions hold and I have sufficiently many data points, then we can identify the causal DAG — and not just the DAG, but also all the causal strength coefficients and all the stochastic disturbances. You can identify basically everything in this model just from observational data, under these strong assumptions. Let us prove this. First, we know that G is a DAG, which means

we can put the nodes in a topological order. This means that if w is a parent of v, then w is smaller than v; it always comes before, so parents come before their children. Take one such order. We also have that x_v is the sum, over its parents w, of the causal strength coefficient times the parent variable, plus the stochastic disturbance. Now we define this coefficient to be zero for every w

which is not a parent; so we extend the coefficients to all pairs, setting them to zero elsewhere, and we let B be this matrix, with both indices running from 1 to d, the dimension. This is a d-by-d matrix and it looks like this: we have zeros on the diagonal, because only the parents appear on the right-hand side and the model is acyclic, so x_v does not show up in its own equation; and because the nodes are topologically ordered, all parents come before their children, so for a child v,

the nonzero entries b_vw all have w smaller than v, and on the other side of the diagonal there are only zeros. So this is how B looks: it is strictly lower triangular. Then we can write all this in vector-matrix form, and what we get is that the vector x equals the matrix B times x plus e. Note that the per-node equation holds for all

nodes v, so this vector equation is just all of them aggregated into vectors and matrices. This now implies that x = (I - B)^{-1} e, and note that I - B is lower triangular with ones on the diagonal, hence invertible, so we can indeed invert it. And this (I - B)^{-1} now plays the role of the mixing matrix A from ICA.
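In formulas, the step just made (my own write-up of the spoken derivation, with the d variables in a topological order):

```latex
% Structural equations, coefficients extended by zeros to non-parents:
\[ x_v = \sum_{w} b_{vw}\, x_w + e_v, \qquad b_{vw} = 0 \ \text{unless } w \in \mathrm{pa}(v), \]
% so B = (b_{vw}) is strictly lower triangular in the topological order. In matrix form:
\[ x = B x + e \;\Longrightarrow\; x = (I - B)^{-1} e =: A\, e, \]
% where I - B is lower triangular with unit diagonal, hence invertible, and A plays the role
% of the ICA mixing matrix, with the disturbances e as the sources.
```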

And these e's are our sources. So, by the ICA theorem, A is identifiable up to sign, scale and a permutation of the sources. Let us write this down again: A is defined to be (I - B)^{-1}, and ICA gives us A up to sign, scale and permutation. But this implies that B,

which equals I minus A^{-1}, has no remaining ambiguity: the scale and sign are fixed because I - B has ones on the diagonal, and the permutation is fixed by the topological order. So by our assumptions, by the DAG assumptions, all these ambiguities are resolved, which means B is uniquely identifiable, not just up to scale, sign and permutation;

there is no ambiguity left for B. We do need the DAG assumptions to resolve them, and this proves the identifiability; the rest follows just by multiplying with this matrix. So this now allows us to use the ICA we already had — we could use any ICA, actually, but we already have one — for causal discovery. We call the resulting method covariant online LiNGAM, because we use the covariant online ICA. Again we pick a learning rate and an activation function, initialize W, and run the same update equations as before until convergence. Having

converged, we get an estimate W-hat out, and then we post-process it: we permute its rows such that it becomes lower triangular, and then we multiply it from the left by a diagonal matrix such that all the diagonal elements become one — we take the diagonal of W, invert it, and multiply from the left — so the result is lower triangular with ones on the diagonal. This fixes the permutation,

and this fixes the scale and sign, because of this form. Then, if I take I minus this matrix, I get a strictly lower triangular matrix: recall that in our setting the normalized unmixing matrix is I - B, so I minus it has zeros on the diagonal and the causal coefficients below, and this is our reconstructed coefficient matrix. We can also reconstruct the stochastic disturbances, if we want, by multiplying I - B, which is this normalized W, against the data points, as before. So we get everything out, under the strong LiNGAM assumptions, with this algorithm (a sketch of the post-processing is given below).
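A minimal sketch of this post-processing step (my own illustration; real LiNGAM implementations handle the row-permutation search and the statistical pruning of near-zero coefficients more carefully):

```python
import numpy as np
from itertools import permutations

def lingam_postprocess(W_hat):
    """Turn an ICA unmixing estimate W_hat into a causal coefficient matrix B.

    1. Permute the rows so the matrix is as close to lower triangular as possible
       (brute force over permutations; fine only for a handful of variables).
    2. Rescale each row so the diagonal becomes 1 (fixes scale and sign).
    3. Read off B = I - W_tilde, which is then strictly lower triangular.
    """
    d = W_hat.shape[0]
    def upper_mass(P):                        # weight sitting above the diagonal
        M = W_hat[list(P), :]
        return np.sum(np.abs(np.triu(M, k=1)))
    best = min(permutations(range(d)), key=upper_mass)
    W_perm = W_hat[list(best), :]
    W_tilde = np.diag(1.0 / np.diag(W_perm)) @ W_perm    # unit diagonal
    B = np.eye(d) - W_tilde                              # causal coefficients
    return B

# Hypothetical example: x1 = e1, x2 = 0.8*x1 + e2, so the true unmixing matrix is
# [[1, 0], [-0.8, 1]]; W_hat below is a row-permuted, rescaled version of it.
W_hat = np.array([[1.6, -2.0],
                  [0.5,  0.0]])
print(np.round(lingam_postprocess(W_hat), 2))   # ~[[0, 0], [0.8, 0]]
```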

One remark: if you have a better ICA, then you also get a better causal discovery algorithm, by using exactly the same idea and the same method as before. For instance, one could use a nonlinear ICA to develop a nonlinear causal discovery algorithm, as has been done last year in the paper shown here. Okay, so this was all about linear causal discovery under the simplicity assumptions: no latent confounders, no feedback loops, no selection bias, and the particular functional constraint of linearity. If you make all these assumptions, plus non-Gaussianity of the stochastic disturbances, then you can identify the causal structure and the causal effects.

So this was all about ICA-based causal discovery.

2 Causality - Causal Discovery - Constraints based methods

Hi everyone, in this video I want to talk about constraint-based causal discovery methods. The general setup is as follows: we have data, usually given as some kind of data frame; we run conditional independence tests to get a list of all conditional independencies and dependencies that we can find in the data; and then we want to find the graph that is compatible with all these conditional independencies, in terms of d-separation. So the general setup is twofold, two steps: one from the data to the independence model, and one from the independence model to the graph. The first uses conditional independence tests; the second, starting from the independence

model, is where the causal discovery algorithm in this constraint-based setting actually takes place. There is ongoing research on getting better conditional independence tests, but that is not what we are going to talk about; we are talking about going from the independence model to the graph. Often we need assumptions here, and we can also add prior knowledge: sometimes we already know some causal relations, or some independencies that are not in the data, or we make assumptions; think of this as an extra box feeding in.

One of these assumptions is usually that not only the independencies in the data come from the graph, but also the dependencies. The Markov property only says: if there is a d-separation, then there is an independence. Now we also want to go the other way around, from an independence in the data to a d-separation in the graph, and to be able to do this we need an assumption — it is not a theorem. This assumption is called faithfulness, and it is typically needed in these constraint-based causal discovery methods. Let us look at an example. Suppose we have a DAG, a directed acyclic graph, with just three variables; let us draw them.

Let us call them X, Y and Z. The generic case is that they are all dependent — that is the most generic case — so let us reflect this with three edges. Now the question is: shall we remove some of these edges, and how shall we orient the rest? Assume that in our data we find that X is dependent on Z. We could think of scenarios where the edge between X and Z is not present; for instance, if this were a Markov chain X—Y—Z, then X and Z would be dependent, but conditioning on Y would block the path. So suppose we have, as a second

piece of information, that X and Z are dependent even when conditioning on Y; this means we cannot remove the edge between X and Z. Next, if Y and Z are dependent in the data, and Y and Z are also dependent given X, then we know the edge between Y and Z has to stay as well; we cannot remove it. Then, finally, suppose X and Y are dependent given Z, but X and Y are independent unconditionally.

The dependence given Z alone could still mean either an edge between X and Y or a collider at Z, and only the last piece, that X and Y are unconditionally independent, tells us that there is no edge between X and Y and that we are in the collider case. So the independence tells us clearly that the X—Y edge is too much, and the dependence given Z can then only hold if we have a collider with a clear orientation: X → Z ← Y. Such a collider, where there is no edge between the two parents, is called an unshielded collider; if there is an edge between the parents it is still a collider, but it is not called unshielded (the little simulation below illustrates exactly this independence pattern).
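As a small illustration (not from the lecture), here is a quick simulation of the collider X → Z ← Y, using a simple partial-correlation test as a stand-in for a proper conditional independence test; the numbers and the linear-Gaussian surrogate are only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z = x + y + 0.5 * rng.standard_normal(n)   # collider: X -> Z <- Y

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c (CI surrogate)."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return corr(ra, rb)

print(f"corr(X, Y)       ~ {corr(x, y):+.3f}")              # ~0: X and Y are independent
print(f"corr(X, Z)       ~ {corr(x, z):+.3f}")              # clearly nonzero
print(f"corr(Y, Z)       ~ {corr(y, z):+.3f}")              # clearly nonzero
print(f"pcorr(X, Y | Z)  ~ {partial_corr(x, y, z):+.3f}")   # nonzero: conditioning on the collider couples X and Y
```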

So, just from this list of conditional independencies, we were able to recover the graph, and we now want to generalize this to general independence models. I took this example because it is the most prevalent pattern and the one that is easiest to detect in the independence model; the others are more subtle. So let me introduce the PC algorithm. In the PC algorithm we make the assumptions that we have a DAG, a directed acyclic graph; that there are no latent confounders, i.e. every node corresponds to an observed variable; and that our distribution is faithful, meaning that the reverse Markov property holds: independencies in the data correspond one-to-one to d-separations in the DAG. But the DAG

itself is considered unknown; G is unknown. What we have instead, as input, are all the conditional independence relations from the data, given symbolically as triples (set of variables, set of variables, set of variables), and we can query them like an oracle: we can ask the independence model whether a given relation holds, yes or no. Then we go through the steps. The first step is to find the skeleton of the DAG, that is, all the edges without the arrowheads — basically, which nodes are neighbors of each other.

This is not to be confused with the moralization of the DAG; it really is just removing the arrowheads. We do this by trying to find separating sets between pairs of variables: given two nodes, we look for a set that makes them independent. The second step is to find all unshielded colliders; on the last slide we had a characterization of them, and we generalize it to general graphs and find all the unshielded colliders first. After that, in step three, we use orientation rules to orient some of the remaining edges. The main guiding principles there are that we do not want to create cycles — we stay

acyclic — and that we do not want to create new unshielded colliders, since we consider them all already found in step two. These are the two guiding principles. The output is unfortunately not always a DAG; sometimes it is a partial DAG, meaning that not all edges could be oriented in one direction or the other, and we just leave those undirected. This means that there were two graphs, one with the edge pointing one way and one the other way, and the algorithm would have come to the same conclusions for both. Nonetheless, even this partial graph is Markov equivalent to G, meaning: if I read off d-separations on

this output and d-separations in the true graph, they encode the same conditional independencies. So I could complete this partial graph to a DAG, and any completion I choose such that it becomes a DAG encodes the same conditional independencies via d-separation as G does. Let us go through the steps. Step one makes use of the following idea: in a DAG, two nodes are adjacent if and only if there is no separating set between them. It is not enough that they are dependent.

For instance, if you have a chain like this, look at the top and the bottom node: they are dependent, but they are not adjacent, they are not neighbors, because we can separate them. The characterization is: two nodes are neighbors if and only if there is no separating set, meaning that for every set we can think of they stay dependent even when conditioning on it; only then are they neighbors. This is what we want to exploit. The way it works is that we start with the fully connected undirected graph on all the nodes and then remove edges whenever there is a separating set.

There is basically a logical contrapositive of the statement above, but we start from the other side: we test in the data whether two nodes are independent given some set, and if there is any such set, then we know they cannot be connected — there is no edge between them if we can separate them. To do this more efficiently, we go through the conditioning sets in increasing cardinality. So we start with n = 0: we check all pairs of variables for unconditional independence, and whenever a pair is already unconditionally independent, we remove the edge between them. We go through all the pairs and check unconditional

independence, removing those edges, and only then, after all pairs have been checked against the empty set, do we go to n = 1 and check, for all remaining pairs, whether there is a conditioning set with exactly one element that makes them independent; then we go to n = 2 and check all remaining pairs, and so on. Furthermore, we only use neighbors as candidate conditioning sets — in the beginning every node is connected to every other one, so this is no restriction, but as we go on we only need to check subsets of the neighbors. The first set we find that makes a pair independent we record as the separating set of that pair of nodes. And if in the

end there is no separating set for a pair, the two nodes stay connected with an undirected edge, and the remaining undirected graph is called the skeleton of our independence model. The claim is then that this skeleton is the same as the skeleton of the true DAG (a small sketch of this skeleton phase is given below).
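A compact sketch of this skeleton phase (my own simplified illustration, not the lecture's code): it assumes an independence oracle indep(u, v, S) is available, e.g. backed by conditional independence tests on the data, and it returns the skeleton together with the recorded separating sets.

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """Skeleton phase of the PC algorithm (simplified).

    nodes: list of variable names.
    indep: callable (u, v, cond_set) -> bool, the conditional-independence oracle.
    Returns (adjacency dict, separating-set dict).
    """
    adj = {v: set(nodes) - {v} for v in nodes}            # start fully connected
    sepset = {}
    n = 0
    while any(len(adj[u] - {v}) >= n for u in nodes for v in adj[u]):
        for u in nodes:
            for v in list(adj[u]):
                if v <= u:                                # test each pair once
                    continue
                # candidate conditioning sets: size-n subsets of u's other neighbors
                for S in combinations(sorted(adj[u] - {v}), n):
                    if indep(u, v, set(S)):
                        adj[u].discard(v); adj[v].discard(u)
                        sepset[(u, v)] = sepset[(v, u)] = set(S)
                        break
        n += 1                                            # increase the cardinality
    return adj, sepset

# Hypothetical oracle for the collider X -> Z <- Y: the only independence is X ⟂ Y (given nothing).
def oracle(u, v, S):
    return {u, v} == {"X", "Y"} and len(S) == 0

adj, sep = pc_skeleton(["X", "Y", "Z"], oracle)
print(adj)   # edges X - Z and Y - Z remain: the true skeleton
print(sep)   # X and Y are recorded as separated by the empty set
```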

this and this is B this is you, this is W. And um what we want is also that there's basically here there could be an edge but we wanted, there's no edge and if this is the case, this is a case, then we turn it into turned into collider collider practice.

Um I forgot to mention the actual condition. So when we have this chain, this is given like here and there's no edge between these two. andw This middle note is not in the separating set. Yes, maybe I make a great, it's not in the separating set of these two notes, then you make it collider and until collider out of it, then we go for all triple where we can do this and make this and if you cannot do it, we just leave them alone then step three we use for orientation rules. Um and these orientation rules follow the

guideline, avoid cycles. Maybe I'll make it red again avoid cycles. And unshielded colliders and Gladys were treated and said to we don't want more and this is a guideline. And the rules, you can just better look here on these pictures is clearer than maybe these logical frameworks here, we have an arrow hat and if we don't have any and if there's no edge in between here, then we also put an arrowhead here, that's what you can see here, Rule two is similar and why why do we do this? If you put an error head here, then we'll create an unshielded collider so the first is directly avoid unshielded collider so we have only this choice. The second is if you have a chain of arrows going all in one direction,

If there is additionally an edge between the beginning and the end of such a directed path, we orient it from the beginning to the end; orienting it the other way would create a cycle, so that orientation is prohibited. Then there are further rules involving four nodes, where all pairs are connected except one, and we already have an unshielded collider there; then we also point the remaining undirected edge towards the collider node. The reason is again that otherwise you might run into the problem of creating cycles, so the safe choice is to orient it that way as well.

The last rule covers a similar situation but is a bit more complicated, and here again you want to avoid cycles: if you oriented the edge the other way, you would get a problem with one cycle or another, so the only safe orientation is the one shown; here you do see a collider, but it is not unshielded, because the two outer nodes are already connected. We apply the rules of step three until nothing can be oriented anymore, and then we have a partially directed DAG. The reason we do not get a full DAG is that ambiguities remain.
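
As a sketch of the first two orientation rules of step three (often called Meek's rules R1 and R2; my own reconstruction, applied until nothing changes), where edges is the skeleton and directed contains the arrows found so far:

    def apply_rules(nodes, edges, directed):
        adjacent = lambda a, b: frozenset((a, b)) in edges
        undirected = lambda a, b: adjacent(a, b) and (a, b) not in directed and (b, a) not in directed
        changed = True
        while changed:
            changed = False
            for a in nodes:
                for b in nodes:
                    if a == b or not undirected(a, b):
                        continue
                    # R1: c -> a, a - b, and c, b non-adjacent  =>  orient a -> b
                    if any((c, a) in directed and not adjacent(c, b)
                           for c in nodes if c not in (a, b)):
                        directed.add((a, b)); changed = True
                    # R2: a -> c -> b and a - b  =>  orient a -> b (avoid a cycle)
                    elif any((a, c) in directed and (c, b) in directed
                             for c in nodes if c not in (a, b)):
                        directed.add((a, b)); changed = True
        return directed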

A very simple example: say we have two variables X and Y, and our graph is just X -> Y. Then there is no conditional independence at all, and if my algorithm spat out the graph with the arrowhead in the other direction we would have to accept it, because using only conditional independencies we cannot distinguish these two cases; they are Markov equivalent. So we have to be content with that: the best we can do is find the Markov equivalence class, and this is what the algorithm does.

This is the best we can expect when relying only on conditional independencies. Now a few remarks. First of all, how many conditional independence tests do we need? We said we test them all, but actually we can go through the steps and only use the conditional independencies we actually ask for, and then you need at most a certain number of tests. This is an upper bound in which m is the number of nodes and l is the maximum number of neighbors any node can have, the maximum degree. You can see that if l is fixed and m grows, the bound is polynomial in m, and as l goes up you get a more complicated relationship, but it is an upper bound.
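
A rough counting argument for such a bound (my reconstruction, not necessarily the exact expression from the slides): every pair of nodes can be tested against all conditioning sets of size at most l drawn from the other m - 2 nodes, so the number of conditional independence tests is at most

    \[
      \binom{m}{2}\,\sum_{i=0}^{\ell}\binom{m-2}{i}
      \;=\; \mathcal{O}\!\left(m^{\ell+2}\right),
    \]

which is polynomial in m for fixed maximum degree l.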

The PC algorithm as we have introduced it (PC stands for its inventors, Peter Spirtes and Clark Glymour) is often used as a baseline that other algorithms are compared against; it is usually the first one you compare to. There are many extensions: some can include latent confounders, which are then usually represented with bidirected edges, and there are more stable versions that change the rules a little so that they are more robust to statistical noise, because this algorithm is actually very sensitive to the independence model. If you have one independence more or less, you get one edge more or less, and then you might have one unshielded collider more or less and run into further problems later on. So statistically it is not that stable.

But it is nice that, given a correct conditional independence model, you can actually recover the Markov equivalence class. If you now want to incorporate latent confounders, selection bias, cycles and so on, and also want to be more efficient, one can work with equivalence classes directly: we have seen that conditional independencies cannot distinguish between certain graphs, so we can embrace that and start directly from equivalence classes of graphs, where we know that this is the best we can do in terms of identifiability.

This is what the algorithm called fast causal inference (FCI) does. It starts from equivalence classes of graphs and represents them with new kinds of graphs that have more arrowhead types: you see small round arrowhead marks, filled arrowheads, star marks, bidirected edges, dotted lines and so on. These graphs have more edge and arrowhead types, and they represent equivalence classes of the usual DAGs with latent variables. Steps one and two of the FCI algorithm are very similar to the PC algorithm: you first find a skeleton, then you find the unshielded colliders, and then in step three you replace the orientation rules

with a set of ten rules. You go through them and end up with an equivalence class of graphs that preserves all ancestral relations that are identifiable from conditional independencies. There are also many extensions of this to make it more scalable, for instance greedy search variants, or to make it deal with latent confounders, selection bias or cycles. The next thing we want to talk about is another way to get better results in causal discovery: the framework of joint causal inference (JCI). The idea is very simple: take all the data you have, observational and interventional, and also from different contexts. Let's say an experiment is done

by university A and the same experiment is done by university B. Instead of evaluating each dataset separately and then trying to compare the results, you can simply pool the data by introducing context variables: you record all the experimental settings, which university the data came from, under which conditions it was gathered, and so on, in some meaningful way. You pool the data and keep track of all this information with new variables. Then you check whether certain assumptions hold, for instance that your measurements do not affect these experimental conditions, and then you use these context variables inside all the conditional independence tests as well, to see, for instance, whether a measurement is independent of the university where it was done.
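
A minimal sketch of this pooling idea (my own illustration; the file and column names are made up, not from the lecture or the JCI paper):

    import pandas as pd

    def pool_with_context():
        df_a = pd.read_csv("experiment_university_a.csv")   # hypothetical file
        df_b = pd.read_csv("experiment_university_b.csv")   # hypothetical file
        # context variables record where / under which regime the data was gathered
        df_a["university"] = "A"
        df_b["university"] = "B"
        pooled = pd.concat([df_a, df_b], ignore_index=True)
        # the context columns are then treated like ordinary variables inside the
        # conditional independence tests of PC / FCI
        return pooled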

Then you learn a causal graph, either with the PC algorithm or the FCI algorithm or any other one you can come up with, and this was shown to boost causal inference a bit and give better results. This framework was developed by Joris Mooij and colleagues, and you can find the paper on arXiv. Just to give you an overview, here is a table from the same paper: they were interested in causal discovery algorithms that can pool multiple contexts, and you can see a list of many causal discovery algorithms and, with pluses and minuses, which of them can deal with latent confounders and so on, so you can compare them, but we will not go into all the details.

Now some general remarks, pros and cons, about conditional independence (constraint) based causal discovery. These methods can rather easily handle latent confounders, selection bias, datasets from multiple contexts, cycles and so on. They are also fully nonparametric: we made no assumptions about the function class. Remember that for LiNGAM and other ICA-based methods we had to assume linear functions and non-Gaussian noise; we do not have such assumptions here. So being fully nonparametric and being able to address all these problems is an advantage of constraint-based methods.

The drawback is that the conditional independence tests are usually rather slow, or very simplistic (like partial correlation tests), or unreliable if you do not have enough data; they are also usually very data hungry. Really good and fast conditional independence tests are still to be developed. Also, even though we made all these simplifications with equivalence classes and tried to reduce the complexity, for example by trying separating sets of low cardinality first, there is still a combinatorial explosion: the conditional independence statements involve triples of sets, so in principle we range over something like a power set of the variables just to collect these conditional independencies.

This can be improved a bit, but then we still have to combine all these independence statements and check what they imply for the graph, and so on. So there is a kind of built-in restricted scalability, and this is really one of the bigger problems: computationally these methods do not go beyond a certain number of variables, and that number is usually quite low. Furthermore, we relied on the faithfulness assumption. Most of these constraint-based methods rely on faithfulness, which means that the reverse of the Markov property holds: every independence in the distribution corresponds to a separation in the graph, and this reverse direction does not need to hold in general.

3 Causality - Causal Discovery - Constraint-based methods - Example

Hi everyone, in this video I want to go through a simple example where we apply the PC algorithm. Let's recap: the PC algorithm assumes the true graph is a DAG, that there are no latent confounders, and that the distribution is faithful to the graph. We also assume that all independence relations are given to us. In the first step of this algorithm we want to find the skeleton of our independence model and the separating sets for all pairs of variables. For this we start with a fully connected undirected graph, and then for each pair of variables we search among the current neighbors, starting from the lowest cardinality, trying to find separating sets.

If this is possible, we remove the edge from our undirected graph; otherwise we keep it, and then we increase the cardinality and go through all pairs of variables again. In the next step, after we have found the skeleton, we find all unshielded colliders: these are V-shaped patterns that we can turn into a collider when the two outer variables stay dependent conditioned on the middle one, more precisely when the middle node is not in their separating set. In the third step we orient all other edges following the orientation rules, trying to avoid cycles and trying to avoid new unshielded colliders.

The output of our PC algorithm here is a DAG that is Markov equivalent to the true one; this can be proven. That means they encode the same d-separations and, by the faithfulness assumption, the same conditional independence relations. Let's start with the example. The example has five nodes A, B, C, D and E, and we connect them as drawn here. This is our true DAG; it is unknown to us, but we assume that this true DAG generates our independencies.

We then start with a fully connected graph: we draw an edge between every pair of the five nodes, adding whichever ones we missed until all pairs are connected. Now, the true DAG induces an independence model.

That means we check all independencies implied by this DAG and list them, and we assume faithfulness, so we could also list all the dependencies. Let's just list the important independencies. For instance, A and C are independent given B: one path is blocked by B, the other path is blocked by a collider. So we write A ⊥ C | B (I use this symbol for independence). What about A and D? They are also separated by B, so A ⊥ D | B. What about A and E? A and E are separated

by B as well, so A ⊥ E | B. Also C and D are separated by B: one path is blocked by B, which is a non-collider on it, and the other path is blocked by a collider, so C ⊥ D | B. Then we can also look at B and E: they are dependent, but if you condition on C and D together they become independent, so B ⊥ E | C, D. There might be a few more, but these are the most important ones, so let's stop there. Now the next step is to find the separating sets, and for this we need to look at all pairs of variables.
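
As a small sanity check (my own addition), one can verify these statements on a DAG that is consistent with the independence model described here; the exact orientations on the slide may differ within the same Markov equivalence class. This assumes a recent networkx with d-separation support.

    import networkx as nx

    # one DAG consistent with the listed independencies
    G = nx.DiGraph([("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")])

    checks = [
        ({"A"}, {"C"}, {"B"}),        # A ⊥ C | B
        ({"A"}, {"D"}, {"B"}),        # A ⊥ D | B
        ({"A"}, {"E"}, {"B"}),        # A ⊥ E | B
        ({"C"}, {"D"}, {"B"}),        # C ⊥ D | B
        ({"B"}, {"E"}, {"C", "D"}),   # B ⊥ E | C, D
    ]
    for x, y, z in checks:
        print(x, "⊥", y, "|", z, "->", nx.d_separated(G, x, y, z))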

Let's make a table of all pairs of variables: A,B; A,C; A,D; A,E; B,C; B,D; B,E; C,D; C,E and D,E. Now we start with n = 0 and check whether the corresponding unconditional independencies hold. Of course the algorithm only has the list of independence statements, but it is easier for us to look at the true graph.

There we can check whether the d-separation holds, whereas in the algorithm we would just look at the independence statements; for this exercise we can do either. So first: are A and B dependent? Yes, they are dependent. A and C? Dependent. A and D? Also dependent. A and E? Also dependent, there is a connecting path. B and C? Dependent. B and D? Dependent.

B and E? Dependent. C and D? Dependent. C and E? Dependent, and D and E as well. This means we have found no separating sets of cardinality zero. Now let's take cardinality one and extend the table a bit. So, can A and B be separated? No, they cannot.

Actually, let's remove those dependence entries from the table and just record the separating sets, if any. What about A and C? We can block the path with B, so we actually have the separating set B here. Looking at the undirected graph: A and C are separated by B, which means we remove the edge between A and C, this is what the algorithm does, that edge is gone. Then we look at A and D: they can also be separated by B, so we remove the edge between A and D, because we found a separating set.

Okay, next one: A and E. They can also be separated by B, so we remove the edge between A and E. Then B and C cannot be separated, and B and D can also not be separated. Then B and E: going through the candidates lexicographically, conditioning on C alone does not separate them, so with cardinality one B and E stay connected.

Then C and D: going lexicographically, conditioning on B works, so we find B again as a separating set and we remove the edge between C and D. Then C and E cannot be separated by any single variable, and D and E cannot be separated either; these pairs are not in our independence model. And of course we usually do not have access to the true graph; as I said, in the algorithm we would just look at the list of independencies. Next step.

Now n goes to two. We do not need to look at pairs involving A anymore: A has only one remaining neighbor, so we can stop the search there; for the other pairs with A we already have separating sets. Then B and C: B has three neighbors and C has two, so we have to check this pair, but we cannot find any separating set of size two, it is not on the list, so the edge stays. B and D is similar. Then B and E.

B and E: this pair is on the list, so we output the separating set, which is C and D, and we do not need to look any further because we have found something; we remove the edge between B and E (we had forgotten to cross it out). Then C and E: you cannot separate them with two variables. And D and E, the last pair: it is also not on the list and cannot be separated by two variables either. Now look at n = 3. We would only need to look at B and C,

but C has only two remaining neighbors, so we do not need to look at that pair anymore; D and E also have only two neighbors each, so we do not need to look at those pairs either. So we are done: we have crossed out everything we can and we have found all separating sets. This skeleton comes out, and we also have the list of separating sets. This was step one. Now let's look at step two, the unshielded colliders.
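
As a minimal check of step two on this example (my own addition), we can take the skeleton and separating sets from step one and look for unshielded colliders; the only triple that should qualify is C -> E <- D.

    from itertools import combinations

    edges = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]}
    sepset = {
        frozenset(("A", "C")): {"B"},
        frozenset(("A", "D")): {"B"},
        frozenset(("A", "E")): {"B"},
        frozenset(("C", "D")): {"B"},
        frozenset(("B", "E")): {"C", "D"},
    }
    nodes = "ABCDE"

    for w in nodes:
        neighbours = [v for v in nodes if frozenset((v, w)) in edges]
        for u, v in combinations(neighbours, 2):
            if frozenset((u, v)) not in edges and w not in sepset[frozenset((u, v))]:
                print(f"unshielded collider: {u} -> {w} <- {v}")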

So now we have a number of V-structures in the skeleton, here, here and here. Let us start with the first one; here are all the independencies we need and all the separating sets. Let's look at the triple A-B-C: this is an unshielded V-structure, there is no edge between A and C. Now we have to check whether B is in the separating set of A and C, in other words whether A ⊥ C | B holds or not.

And we have it in our list: it holds, which means this is no unshielded collider. Then we have the triple A-B-D; A ⊥ D | B is also in our independence model, so again no unshielded collider. Then we have the triple C-B-D; C ⊥ D | B is also in the list, so this one is not an unshielded collider either.

Then we have the triple B-C-E: B and E are still dependent given C alone, but C is contained in {C, D}, which is the separating set of B and E, so this is also not an unshielded collider. And we have the analogous triple down here, B-D-E: B and E are dependent given D alone, but for the same reason D is in the separating set {C, D}, so again no unshielded collider.

Finally we have the triple C-E-D: C and D are dependent given E, that independence is not in our model, and E is not in the separating set of C and D, which is just B. So this is an unshielded collider, meaning we orient C -> E <- D. With that we have found all unshielded colliders and we go to the next step.

The next step is to orient all other edges. First we see that none of the orientation rules applies yet, and you could actually stop here and use the result as an equivalence class of DAGs, but we can also start by arbitrarily orienting one of the remaining edges, for instance A -> B. Now the first rule applies: we have something pointing into B and an undirected edge B - C, and we are not allowed to orient it as C -> B because that would create a new unshielded collider. So we have to orient it as B -> C.

The same happens with B - D: something is pointing into B, the edge B - D is undirected, and the triple is unshielded, so the only option is B -> D. And this is then the output of our run. It may of course look different from the DAG we started with, but as a sanity check, let's see what conditional independencies come out of it, what its independence model is. The claim is not that we get the same DAG; the claim is that the output is Markov equivalent to the original DAG.

So let's see. All the adjacent pairs we do not have to check. What about A and C? They are separated by B, which is a non-collider on that path, so A ⊥ C | B. Now A and D: still separated by B. What about C and D? Before, they were separated by B, and again they are separated by B: B is a non-collider on one path and E is a collider on the other, so C ⊥ D | B. What about A and E? The paths go through B, so this pair is also separated by B, and again we get A ⊥ E | B.

What about B and E now? If I condition on one variable it is not enough, but if I condition on C and D, which are non-colliders here, then B ⊥ E | C, D holds again. Actually there are more independencies, for instance we could add further variables to these conditioning sets, but the same is true in the original graph, so if you go through all of this you see that the independence model is the same as

the one of G, meaning that this output, call it G tilde, is Markov equivalent to G. This is what we wanted to show, and this is how the PC algorithm works.

4 Causality - Summary

Hi everyone, in this video I want to summarize what we have learned about causality. First, we talked about causality in real life. What is causality? Causality means that if I force a variable X onto a value, then another variable Y changes; if this is the case, we talk about a causal effect of X on Y. This is just a definition; if you want to test it in real life and you are not sure, just repeat the interventional experiment several times and see whether you observe changes in the distribution of Y when you change X under the same conditions. Then we learned that correlation does not imply causation, and we also said why: the reason is that there are other possible explanations of a correlation.

For instance, causation is asymmetric, so it could just go the other direction, not A to B but B to A, whereas correlation is symmetric. Then we talked about latent confounding, selection bias, feedback loops, functional constraints and some other sources of spurious correlations. We have seen that given a correlation you can make up almost any story to explain it, and none of them can be ruled out by the correlation alone; you need either prior knowledge or interventional experiments. One major misconception is that the conditional p(X | Y) would be the causal effect of Y on X. This is not true, because p(X | Y) is observational, and what we want is p(X | do(Y)) or p(Y | do(X)).
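
A toy simulation (my own illustration, not from the lecture) of why conditioning is not intervening: here Z confounds X and Y, X has no causal effect on Y at all, and yet X and Y are clearly correlated observationally while an intervention on X changes nothing.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # observational regime: Z -> X and Z -> Y, no edge X -> Y
    z = rng.normal(size=n)
    x = z + 0.5 * rng.normal(size=n)
    y = z + 0.5 * rng.normal(size=n)
    print("observational corr(X, Y):", np.corrcoef(x, y)[0, 1])   # clearly nonzero

    # interventional regime do(X = x0): X is forced, the mechanism Z -> X is cut
    for x0 in (-2.0, 2.0):
        y_do = z + 0.5 * rng.normal(size=n)   # Y does not depend on the forced X
        print(f"mean of Y under do(X={x0}):", y_do.mean())   # roughly unchanged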

So conditioning is not the same as intervening. Often in observational studies people say that they adjusted for variables, but conditioning alone does not properly adjust if you want to make a causal claim. So in general, if you want to test causality, you need interventional experiments, many of them, and you usually cannot get around this. The gold standard we saw were randomized controlled trials: you take the population, randomly assign everyone to one of the intervention groups or the control group, measure the outcome and compare; if there is a change in distribution you know the effect is causal, and if not, then not.

Many people hope that you can learn all the causal relations without doing interventional experiments; in general this is not possible. But if you have, say, very many features that interact with each other, and you have very strong assumptions on the graph and a very special functional form, then you might be able to conclude something; this is kind of the hope of causal inference research, but in almost all cases I have seen so far you cannot get around interventional experiments. Then we talked about causal modeling. We introduced causal Bayesian networks, basically Bayesian networks with latent variables and input nodes where we interpret the arrows causally, and then we had structural causal models, which explain all

the stochasticity in the variables by external noise variables, so that all functions become deterministic; the two model classes are very similar. The main point of using a causal model like this is that we assume the observational regime and all interventional settings or regimes are modeled at the same time. Usually, if you take a Bayesian network, you assume that it reflects the observational setting, but if you use it in the causal sense you are making the assumption that all interventional settings are modeled by the same model, and we use the intervention (do) operator on it. If you use such a causal model you implicitly come with many causal assumptions.
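
A minimal structural-causal-model sketch (my own illustration): each variable is a deterministic function of its parents plus an external noise variable, and an intervention do(X = x0) simply replaces the mechanism for X while leaving all other mechanisms intact.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n, do_x=None):
        """Sample from the SCM Z -> X -> Y, Z -> Y; optionally intervene on X."""
        z = rng.normal(size=n)                      # Z := N_Z
        if do_x is None:
            x = 0.8 * z + rng.normal(size=n)        # X := f_X(Z, N_X)
        else:
            x = np.full(n, do_x)                    # do(X = x0): mechanism replaced
        y = 1.5 * x - z + rng.normal(size=n)        # Y := f_Y(X, Z, N_Y)
        return z, x, y

    _, _, y_obs = sample(100_000)
    _, _, y_do = sample(100_000, do_x=1.0)
    print("E[Y] observational:", y_obs.mean(), " E[Y | do(X=1)]:", y_do.mean())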

If you use, for instance, a causal Bayesian network, then you come with the modularity assumption, meaning that all interventions are only local and leave all other mechanisms intact: if you intervene on a node X, only the mechanisms that point into X are removed, and everything else, downstream and upstream, stays the same as before. If you use a DAG, that means you claim to have modeled all the latent variables, so you are assuming there are no other latent variables that could affect your measurements. In the special case where every latent variable only affects one observed

node, you are basically assuming there is no latent confounding at all, meaning there is nothing unknown that affects two observed variables. Then you are also assuming there are no feedback loops, because the graph is acyclic, and no selection bias, since selection bias corresponds to conditioning and in the vanilla case no conditioning is present. All functional constraints are given by the Markov kernels, or by the structural equations if you use a structural causal model; there are no other constraints like energy conservation or anything else that restricts the values jointly. Maybe there are also other implicit assumptions encoded here which you have not thought about

yet and that are not spelled out explicitly here, but you should think about them any time you want to use a causal Bayesian network to model something. Then we did do-calculus, or causal calculus, where we assumed the network is a DAG and we had three rules: rule one, which is about insertion and deletion of observations; rule two, which is about how one can interchange action and observation; and rule three, insertion and deletion of actions. These come in addition to all the ordinary rules of probability and of graphical models.
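
The three rules in their standard form (my reconstruction of what the slides stated, using Pearl's do-calculus notation, with G with an overbar on X denoting the graph with edges into X removed and an underbar denoting edges out of a set removed):

    \begin{align*}
      \text{Rule 1: } & p(y \mid \mathrm{do}(x), z, w) = p(y \mid \mathrm{do}(x), w)
        && \text{if } Y \perp Z \mid X, W \text{ in } G_{\overline{X}},\\
      \text{Rule 2: } & p(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = p(y \mid \mathrm{do}(x), z, w)
        && \text{if } Y \perp Z \mid X, W \text{ in } G_{\overline{X}\,\underline{Z}},\\
      \text{Rule 3: } & p(y \mid \mathrm{do}(x), \mathrm{do}(z), w) = p(y \mid \mathrm{do}(x), w)
        && \text{if } Y \perp Z \mid X, W \text{ in } G_{\overline{X}\,\overline{Z(W)}},
    \end{align*}

where Z(W) denotes the Z-nodes that are not ancestors of any W-node in the graph with edges into X removed.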

Each rule depends on the graph: you have a graphical criterion, and only if it is satisfied in the graph does the corresponding identity hold in the probability distribution. If the criterion for rule one holds, the observation can be dropped; if the criterion for rule two holds, you can exchange the do with ordinary conditioning; and if the criterion for rule three holds, the do term disappears entirely. That was do-calculus. Then we talked about causal inference in the sense of computing a causal effect, like that of X_A on X_B, and the question is how we can compute this from observational data alone. There we found the backdoor criterion, which says that under certain graphical conditions the interventional distribution can be written as an adjustment formula over observational distributions.
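
One common form of that adjustment (my reconstruction of the formula referred to here): if a set Z satisfies the backdoor criterion relative to (X_A, X_B), then

    \[
      p\!\left(x_B \mid \mathrm{do}(x_A)\right)
      \;=\; \sum_{z} p\!\left(x_B \mid x_A, z\right)\, p(z),
    \]

i.e. the interventional distribution is expressed purely in terms of observational conditionals.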

If the d-separation conditions of the criterion hold, you can compute the causal effect, the do-expression, from observational distributions alone, but you need to know the graph and you need to verify these conditions. This was the backdoor criterion. There is also a more general identification (ID) algorithm, but we did not talk much about it. All of this was about identifying causal effects, inferring effects from observational distributions alone. The next thing we talked about was the causal discovery setting, where what we want to do is learn the graph: the graph is unknown, it is not known from prior knowledge, and we want to learn it. This is generally difficult unless you have all interventional distributions.

In that case it is rather easy, because you can intervene on everything, and if you have enough data it is basically just a regression task. But it is rather difficult if you only have observational data, or if some of the interventional distributions you would actually need are missing. If you are in that situation, what you need are rather strong assumptions about function classes and about the structure of latent confounders, for instance that there are no latent confounders or no selection bias, that there is no measurement error, that your noise is non-Gaussian, or that you have linear functions, and so on. You also need some strategy to reduce the search space: the space of DAGs is huge and grows super-exponentially when you increase

the number of nodes. For instance, you could use equivalence relations on graphs and group graphs together that are equivalent in some respect: they could agree locally, so that in the one local area you are interested in they are the same but everywhere else they may deviate, or they could have the same independence model, or preserve the same ancestral relations, or preserve all ancestral relations within the same independence model, such things. Then we need good theory, which is still being developed and not finished yet, and we need effective algorithms that can do all this. Most of these things are still part of ongoing research. We then talked about different approaches to causal discovery.

We talked about ICA-based methods (independent component analysis) and had a closer look at LiNGAM, where we made the linear, non-Gaussian, acyclic assumptions. Then we had conditional independence (constraint) based causal discovery methods, especially the PC algorithm; we briefly talked about the joint causal inference (JCI) framework and the fast causal inference (FCI) algorithm. We did not go into details, but we have explained all the concepts in the PC algorithm, so you would understand the FCI algorithm if you read it up. There are more approaches, like score-based methods, which are essentially Bayesian: you try to score each edge, or each graph, by some posterior score.

Then there are complexity-based approaches, basically taking the model that is least complex in some measure, and there are several ideas and several measures out there. There is also the robustness approach, which basically says that the true causal model should be robust: it should be transferable and robust to all kinds of nuisances, noise and disturbances. So if you use the robustness literature and try to learn something very robust, robust to outliers and so on, the claim is that it is very likely to be causal. There are also approaches that use information theory, or information-theoretic inequalities, to reduce the search space by ruling a lot of models out. And there are other graph-theoretic and related approaches that try to

focus on interactions of variables and try to amplify the signal by copying, pasting and amplifying such structures, and there are other ideas out there which may not be listed here. The reason we restricted our attention to constraint-based and ICA-based methods was that we had already learned something in this course about ICA, conditional independencies and graphical models, so it was natural to look at these causal discovery approaches. Finally, we talked a little bit about causal validation. Let's say you used your discovery algorithm and you get a model; the question now is whether it is causal or not, and you could validate this with other methods. For instance, you learn something with ICA and then you do sanity checks on the conditional independencies in this model

and so on, so you can cross-check whether it is correct or not. If you have simulations, you can check whether the outcome of your model really reflects what you simulated, what you started with; you have kind of created a synthetic ground truth. But what you can always do is ask yourself a question, make a thought experiment, namely: do interventions in the model you just trained or obtained reflect interventions in real-life experiments? Just training a Bayesian network does not mean it is causal, even if you hope so. And note that I am not asking the question "is this causal?"; people often ask it that way, but I avoid it here because it is dangerous. People tend to think in models: if you are training a Bayesian network you might think, oh yes, all my causal relations are modelled in there. But I am not asking whether there is causality in the model; I am asking

whether your model reflects reality, and since a causal model does not only model the observed data you trained on but all interventional regimes, you really have to ask the question: do interventions in this model, with a do operator for instance, reflect interventions in real experiments? You can then answer this in different ways. You can say no; then it is not causal. Sometimes you cannot judge: a lot of people now try to do causal inference with neural networks, just taking a Bayesian network, parameterizing it with neural networks and training it on observational data, and they have no grip on whether it is causal or not, or what it even does. In my opinion this is usually just wishful thinking, and it is just as good

as saying, no, I have no idea whether this is causal or not. Also, they usually cannot or do not validate it. So if you cannot judge, it is usually not causal. Then people might just say yes, and then you need to ask a further question: how do I know? You have to explain it, and the answer you usually have, if you use a sensible algorithm, is "yes, under the assumptions made". For instance, LiNGAM makes very strong assumptions on the linear function class and on the non-Gaussian noise and disturbances, and of course these might not reflect reality; but if they do reflect reality, then we have the identifiability result, a mathematical result saying there is essentially only one model that fits, so it must model the interventions correctly.

If your answer is "under the assumptions made", then you have to ask the same kind of question again: how do I know that these assumptions are true in this case? You have to explain it, and you have to check whether your assumptions really reflect the real-life experiments. Otherwise, if there is a mismatch, the mathematical statement might be true but it does not reflect real life, because your model assumptions are wrong. This is at least what you need to do when you do causal discovery: you need to answer these questions, best by writing them down and elaborating a little on them. Often you then quickly arrive at a "no" or at a conditional answer. So: do interventions in this model reflect interventions in real-life experiments? This is the main question.

Author: Paul Lodder

Created: 2022-09-06 Tue 16:22
