Careful:
- "entropy" is information; "information", therefore, already is surprise; thus it's dangerous to re-define "surprise" as -log P(x), which is already part of the definition of surprise, as that leads to ambiguity and circularity;
- KL divergence is relative entropy (the added surprise from a second distribution, given a first, so _relative_ surprise);
- I would caution about terms like "expected surprise" for the same reason as I object to "dry water"...
OP is correct; surprisal is outcome-dependent and entropy is distribution-dependent
- entropy is E_p[informativeness of measuring outcome x]
- take n outcomes; a distribution over them lives on the simplex Δ^(n-1). You can lift this to R^n via the log odds map p_k -> x_k = log p_k -- now x ∈ R^n can describe a histogram with n-1 degrees of freedom
- in log odds space, measurement is literally a linear functional from the vector space of log probabilities onto the index of the outcome k.
- imo surprisal of some p(x) is best understood as "the length of a pointer", entropy "the rarity-weighted average length of a pointer", and collision entropy "how specific you would have to be to describe witnessing a specific outcome"
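The three quantities in that last bullet are easy to compute side by side; a minimal sketch (the distribution is just an illustrative example):

```python
import math

# Hypothetical distribution over four outcomes (illustration only).
p = [0.5, 0.25, 0.125, 0.125]

# Surprisal of each outcome: -log2 p(x), in bits ("the length of a pointer").
surprisal = [-math.log2(pk) for pk in p]  # [1.0, 2.0, 3.0, 3.0]

# Shannon entropy: the rarity-weighted (i.e. p-weighted) average pointer length.
H = sum(pk * s for pk, s in zip(p, surprisal))  # 1.75 bits

# Collision entropy (Renyi order 2): -log2 of the collision probability,
# i.e. how specific you must be to pin down witnessing one outcome twice.
H2 = -math.log2(sum(pk * pk for pk in p))  # ~1.54 bits
```

Note that H2 <= H always holds, with equality only for the uniform distribution.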
and in the same way, you might get by calling a single molecule of water dry
Hi, author here! Thanks for the feedback; as I mentioned, this is also to clarify things for myself, so this helps a lot.
Regarding your points:
- I'm not sure I get your meaning here. My understanding is that for a random variable X, the surprise is defined at the outcome level, I(x) = -log p(x), while the entropy is essentially just its average value, H(X) = -sum_x p(x) log p(x). So to me it does look like entropy is expected surprise, no? I do agree, though, that by being _expected_ surprise, entropy is itself a measure of surprise.
- I very much agree with that, which is why I used _excess_ surprise (maybe _relative_ is a better choice, but the intent is the same).
- That one I'm also confused about. It gets back to my first point: to me, surprise (or information) is always defined at the outcome level first, so taking a moment is not tautological; it's meaningful, no?
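To make the "entropy is expected surprise" and "KL is excess surprise" readings concrete, here is a small numerical sketch (the two distributions are hypothetical, chosen only to illustrate):

```python
import math

p = [0.5, 0.25, 0.25]        # "true" distribution (illustrative)
q = [1/3, 1/3, 1/3]          # model distribution (illustrative)

# Entropy as expected surprise: H(p) = E_p[-log2 p(x)]
H = sum(pk * -math.log2(pk) for pk in p)  # 1.5 bits

# Cross-entropy: expected surprise when outcomes follow p
# but you measure surprise against q.
H_pq = sum(pk * -math.log2(qk) for pk, qk in zip(p, q))

# KL(p || q): the *excess* (relative) surprise incurred by using q,
# i.e. cross-entropy minus entropy. Always >= 0.
kl = H_pq - H
```

This is exactly the "excess surprise" framing: KL(p||q) = E_p[(-log q(x)) - (-log p(x))], the average extra surprise per outcome from believing q when p is the case.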