《碎片记录》—— 2018-07-02 07:29:09


Date Unknown Interpretations Source
2018-05-16 09:14:18
2018-05-17 18:42:15
Interpretation of D_KL(P‖Q) under Bayesian inference 1. D_KL(P‖Q) is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P. In applications, P typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q typically represents a theory, model, description, or approximation of P. To find the distribution Q that is closest to P, we can minimize the KL divergence and compute an information projection.
2. Imagine a coder designed for a source that generates symbols according to a probability distribution Q. What happens if the source generates symbols drawn from a different distribution, P? If the coder had been designed for P (instead of for Q), it would need to generate H(P) bits per symbol. But in this case, our coder was designed for Q, so it ends up generating H(P,Q) bits per symbol. (This is the "cross entropy" between P and Q.) The difference between H(P,Q) and H(P) is the K-L divergence D(P||Q). In other words, the K-L divergence represents the number of extra bits necessary to code a source whose symbols were drawn from the distribution P, given that the coder was designed for a source whose symbols were drawn from Q.
Wikipedia
Quora
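The extra-bits interpretation above can be checked numerically. A minimal sketch (the distributions P and Q below are invented for illustration):

```python
import math

# Two discrete distributions over the same three symbols (illustrative values).
P = [0.5, 0.25, 0.25]   # the "true" source distribution
Q = [0.25, 0.25, 0.5]   # the distribution the coder was designed for

# Entropy H(P): average bits per symbol for a coder designed for P.
H_P = -sum(p * math.log2(p) for p in P)

# Cross entropy H(P, Q): average bits per symbol when symbols come from P
# but code lengths were chosen for Q.
H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))

# KL divergence D(P || Q): the extra bits per symbol.
D_PQ = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(H_P, H_PQ, D_PQ)  # D(P||Q) equals H(P,Q) - H(P)
```

For these values the coder wastes D(P||Q) = 0.25 bits per symbol, exactly H(P,Q) − H(P) = 1.75 − 1.5.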
2018-05-16 14:14:48 Prior distribution (先验分布) In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable. Wikipedia
2018-05-16 14:18:15 Posterior distribution (后验分布) In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined. Wikipedia
2018-05-16 14:44:51 MLE (maximum likelihood estimation, 最大似然估计) 1. Why can Bayes' theorem incorporate prior beliefs? This is hard to see from the formula alone. Borrowing the ice-cream-and-weather example again: let A be the event of selling ice cream and B the weather event. Our question is "Given the type of weather, what is the probability of selling ice cream?", written as P(A = ice cream sale | B = type of weather). The P(A) on the right-hand side of Bayes' theorem is called the prior probability. In our example, P(A = ice cream sale) is the marginal probability of selling ice cream (under any kind of weather). This probability is generally known in advance, which is why it is called the prior. For example, from the data I learn that 30 out of 100 people bought ice cream, so P(A = ice cream sale) = 30/100 = 0.3, known before seeing any information about the weather. Note: prior knowledge is not entirely objective; it may contain subjective elements or even be pure guesswork, and this affects the final conditional probability. 2. The probability (density) expresses how likely the sample random vector X = x is given θ, whereas the likelihood expresses how plausible a parameter value θ1 (relative to another value θ2) is given the sample X = x. We only speak of probability for the values of random variables; in non-Bayesian statistics a parameter is a real number rather than a random variable, so we generally do not speak of the probability of a parameter. 1. 机器之心
2. 知乎
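The ice-cream prior above can be plugged straight into Bayes' theorem. A toy sketch (the likelihood and evidence values are invented for illustration; only the prior 30/100 comes from the example):

```python
# Prior: 30 of 100 people bought ice cream, before any weather information.
prior = 30 / 100                 # P(A = ice cream sale)

# Invented values for illustration:
likelihood = 0.8                 # P(B = sunny | A = ice cream sale)
evidence = 0.6                   # P(B = sunny)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
posterior = likelihood * prior / evidence
print(posterior)  # ≈ 0.4 — belief updated upward after seeing the weather
```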

Date Unknown Interpretations Source
2018-05-18 07:06:13 Deterministic and random variables 1. Deterministic variable: If the outcome of a variable is fixed, i.e. if a variable will always have the exact same value, we call it a deterministic variable.
2. Random or stochastic variable: A random variable is a variable which may take a range of numerical outcomes, as its value is the result of a random phenomenon. Obviously the outcome is not fixed and may differ each time. A discrete random variable can take only a countable number of outcomes; a continuous random variable takes an infinite number of possible values.
An example of a continuous random variable is a measurement of a certain quantity: repeated observations will yield different outcomes. A measurement, also called ‘an observable’, is a random variable.
Link
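The three cases can be illustrated in a few lines (the die and the noisy measurement are stand-ins for any random phenomenon):

```python
import random

random.seed(0)

# Deterministic variable: the outcome is always the same.
def deterministic():
    return 42

# Discrete random variable: a die roll takes one of a countable set of outcomes.
def die_roll():
    return random.randint(1, 6)

# Continuous random variable: a repeated measurement with Gaussian noise
# can take uncountably many values ("an observable").
def measurement():
    return 20.0 + random.gauss(0, 0.1)

print({deterministic() for _ in range(5)})   # {42} — one value only
print({die_roll() for _ in range(100)})      # values from {1, ..., 6}
print(measurement(), measurement())          # two different outcomes
```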
2018-05-18 19:06:38 Multimodal The mode is not necessarily unique for a given discrete distribution, since the probability mass function may take the same maximum value at several points x1, x2, etc. The most extreme case occurs in uniform distributions, where all values occur equally frequently. When the probability density function of a continuous distribution has multiple local maxima, it is common to refer to all of the local maxima as modes of the distribution. Such a continuous distribution is called multimodal (as opposed to unimodal). wiki
2018-05-18 19:06:38 Mode (statistics) The mode of a set of data values is the value that appears most often. It is the value x at which the probability mass function takes its maximum value; in other words, it is the value that is most likely to be sampled. A mode of a continuous probability distribution is often taken to be any value x at which its probability density function has a local maximum, so any peak is a mode. wiki
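Python's standard library exposes exactly this notion, and `statistics.multimode` also covers the multimodal case where several values tie for the maximum frequency:

```python
from statistics import mode, multimode

# A unimodal data set: 3 appears most often.
data = [1, 2, 2, 3, 3, 3]
print(mode(data))          # 3

# A bimodal data set: 1 and 2 tie for the maximum frequency.
bimodal = [1, 1, 2, 2, 3]
print(multimode(bimodal))  # [1, 2]
```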

Date Unknown Interpretations Source
2018-05-23 06:36:45 Pointwise mutual information PMI refers to single events, whereas MI refers to the average over all possible events.
wikipedia
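The single-event vs. average distinction can be seen by computing both quantities from a small joint distribution (the joint table below is invented for illustration):

```python
import math

# Joint distribution p(x, y) over two binary variables (illustrative values).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y).
p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

# PMI refers to one particular pair of events (x, y).
def pmi(x, y):
    return math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# MI is the expectation of PMI over all possible events.
mi = sum(p * pmi(x, y) for (x, y), p in p_xy.items())

print(pmi(0, 0), pmi(0, 1), mi)  # individual PMI values can exceed the average
```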
Inverse functions (反函数) Not all functions have inverse functions. In order for a function f: X → Y to have an inverse, it must have the property that for every y in Y there is one, and only one, x in X such that f(x) = y. wikipedia
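For a finite domain, the one-and-only-one condition can be checked directly. A minimal sketch, with functions represented as dicts:

```python
# Two functions on a finite set, represented as dicts from X to Y.
f = {1: 'a', 2: 'b', 3: 'c'}   # invertible
g = {1: 'a', 2: 'a', 3: 'c'}   # not invertible: 1 and 2 both map to 'a'

def has_inverse(func, codomain):
    # f has an inverse iff every y in the codomain is hit by exactly one x.
    values = list(func.values())
    return all(values.count(y) == 1 for y in codomain)

def invert(func):
    # Swap keys and values; valid only when has_inverse(...) holds.
    return {y: x for x, y in func.items()}

print(has_inverse(f, {'a', 'b', 'c'}))  # True
print(has_inverse(g, {'a', 'b', 'c'}))  # False
print(invert(f))                        # {'a': 1, 'b': 2, 'c': 3}
```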
Mutual information The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons, more commonly called bits) obtained about one random variable through the other random variable.
[Figure: Venn diagram of entropies; the violet region is the mutual information I(X;Y).]
wikipedia
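The violet overlap in the diagram corresponds to the identity I(X;Y) = H(X) + H(Y) − H(X,Y), which can be verified on a small joint distribution (values invented for illustration):

```python
import math

# Joint distribution p(x, y) (illustrative values) and its uniform marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = [0.5, 0.5]
p_y = [0.5, 0.5]

def H(probs):
    # Shannon entropy in bits; terms with zero probability contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Mutual information directly from its definition ...
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# ... equals the overlap of the two entropy "circles".
print(mi, H(p_x) + H(p_y) - H(list(p_xy.values())))
```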
Self-information By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero.
wiki
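Self-information quantifies this: an event known with certainty carries zero information, and rarer events carry more. A minimal sketch:

```python
import math

def self_information(p):
    # I(x) = -log2 p(x), in bits; zero when the event is certain (p = 1).
    return -math.log2(p)

print(self_information(0.5))   # 1.0 — a fair coin flip carries one bit
print(self_information(0.25))  # 2.0 — a rarer event carries more information
```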
Entropy (information theory) Shannon defined the entropy Η (Greek capital letter eta) of a discrete random variable X with possible values {x1, …, xn} and probability mass function P(X) as
H(X) = −∑_{i=1}^{n} P(x_i) log P(x_i)
wiki
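The definition translates directly into code. A minimal sketch, using base-2 logarithms so the result is in bits:

```python
import math

def entropy(pmf):
    # H(X) = -sum_i P(x_i) * log2 P(x_i); zero-probability terms contribute 0.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 — a fair coin
print(entropy([1.0]))        # zero bits: no uncertainty at all
print(entropy([0.25] * 4))   # 2.0 — uniform over four outcomes
```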
Conditional entropy If H(Y|X=x) is the entropy of the discrete random variable Y conditioned on the discrete random variable X taking a certain value x, then H(Y|X) is the result of averaging H(Y|X=x) over all possible values x that X may take.
Property 1. Chain rule: Assume that the combined system determined by two random variables X and Y has joint entropy H(X,Y), that is, we need H(X,Y) bits of information on average to describe its exact state. Now if we first learn the value of X, we have gained H(X) bits of information. Once X is known, we only need H(X,Y) − H(X) bits to describe the state of the whole system. This quantity is exactly H(Y|X), which gives the chain rule of conditional entropy: H(Y|X) = H(X,Y) − H(X).
wiki
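The chain rule can be verified numerically on a small joint distribution (values invented for illustration):

```python
import math

# Joint distribution p(x, y) (illustrative values) and the marginal p(x).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}

# Joint entropy H(X, Y) and marginal entropy H(X).
H_XY = -sum(p * math.log2(p) for p in p_xy.values())
H_X = -sum(p * math.log2(p) for p in p_x.values())

# Conditional entropy from its definition:
# H(Y|X) = -sum_{x,y} p(x, y) * log2( p(x, y) / p(x) )
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

# Chain rule: H(Y|X) = H(X, Y) - H(X)
print(H_Y_given_X, H_XY - H_X)
```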


