
Friday, September 19, 2014

Entropy by Another Alphabet -- Computing for Everyone

Previous: Entropy and Bits

All right, enough about passwords and entropy. This stuff is supposed to show up everywhere--how about another example?

Here are some Wheel of Fortune Google Image results with upper bounds on the corresponding entropy, for fun.

[Image: Wheel of Fortune puzzle] 40.9 bits of entropy--luckily, that's a strict upper bound. I'm sure this guy's got something up his sleeve.... Good luck!

[Image: Wheel of Fortune puzzle] Oh no, an upper bound of 30.7 bits of entropy! Luckily, as someone who was in an English-speaking country as a 5-year-old, you have the extra information needed to solve the puzzle.

[Image: Wheel of Fortune puzzle] An upper bound of 3.8 bits, but as an English speaker you've solved the puzzle.

[Image: Wheel of Fortune puzzle] Only one possibility--exactly zero bits!
The bounds are based on any remaining letter being equally viable for each spot. The calculation uses knowledge of Wheel of Fortune's rules but no knowledge of English. With knowledge that these solutions are comprehensible in English, the entropy is significantly less--otherwise we'd never fit a show into 22 minutes!
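To make that concrete, here's the calculation behind the 40.9-bit caption (10 blank spaces, 17 letters not yet ruled out); the other captions presumably come from the same formula with their own blank and letter counts. A minimal sketch:

```python
from math import log2

def upper_bound_bits(blank_spaces, remaining_letters):
    """Upper bound on puzzle entropy: treat every remaining letter as
    equally likely in every blank space."""
    return blank_spaces * log2(remaining_letters)

bits = upper_bound_bits(10, 17)
print(f"{bits:.1f} bits")            # 40.9
print(f"{2 ** bits:,.0f} boards")    # about two trillion equally likely fills
```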

40.9 bits of entropy was an upper bound, remember ;-)
The entropy of each puzzle is way less than each captioned upper bound because not all guesses are equally plausible. My previous post mentioned that when we use bits to measure entropy, we mean bits that have a 50/50 chance of being either 0 or 1. This equal probability is crucial.
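To see why the 50/50 part matters, here's a small sketch of the Shannon entropy formula: the moment one outcome becomes more likely than another, each guess carries less than a full bit, and a sure thing carries zero (the "exactly zero bits" caption above).

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # 1.0   -- a genuine 50/50 bit is worth a full bit
print(entropy_bits([0.9, 0.1]))   # ~0.47 -- a lopsided guess is worth much less
print(entropy_bits([1.0]))        # 0.0   -- only one possibility: zero bits
```

Averaged over a whole board, that's exactly why English-shaped, lopsided letter choices carry far fewer bits than the uniform upper bound suggests.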

That 40.9-bit puzzle was an amazing performance by the contestant, but clearly my upper-bound calculation doesn't reflect the actual difficulty of the puzzle. Why? Because 41 bits of entropy would mean there were 2^41--about two trillion--equally plausible solutions to the puzzle. This guy's good, but not trillions-of-guesses good.

The any-of-17-letters-in-each-of-the-10-spaces upper bound is clearly too large a universe of possibilities. The contestant, knowing English, could work out that the first word was probably "new," especially since "t" and "o" were already off the board, so it couldn't be "net" or "neo."

That leaves the number of 4-letter English words that can be made from the remaining 17 letters, times the number of 5-letter English words from those same letters, cut down to the triplets that make sense in combination (unfortunately for the contestant, you could have a "new" pretty much anything--"new baby buggy" is as plausible as new any-other-possible-phrase).
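If you want to rough out that count yourself, here's a sketch. It assumes a word list at /usr/share/dict/words (common on Unix systems; substitute any list you have), uses a made-up stand-in for the 17 still-available letters since the post doesn't list them, and ignores the "makes sense after new" constraint, so it still overcounts.

```python
from math import log2

# Hypothetical stand-in for the 17 letters still in play -- the real set
# isn't listed here, only that there were 17 of them.
remaining = set("abcdfghijklmpquvy")

def count_words(length, letters, wordlist="/usr/share/dict/words"):
    """Count words of a given length built only from the available letters."""
    count = 0
    with open(wordlist) as f:
        for line in f:
            word = line.strip().lower()
            if len(word) == length and word.isalpha() and set(word) <= letters:
                count += 1
    return count

four = count_words(4, remaining)
five = count_words(5, remaining)
# Ignores whether the three words hang together, which is what really
# shrinks the count -- so this is still an overestimate.
print(four * five, "word pairs,", f"{log2(four * five):.1f} bits at most")
```

Whatever number your word list gives, it will sit many orders of magnitude below the two-trillion uniform bound.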

His situation is bad, but it's bad on the order of thousands of possibilities rather than trillions. Since the category is "thing," one of those words should be a noun, which whittles down the possibilities even further, etc. And lucky for our hero, the solution was near the beginning of the alphabet. An amazing performance.

But still, the second example--30.7 bits of entropy as an upper bound. Surely there aren't nearly two billion equally plausible solutions.

The same thing happens in computing--in guessing passwords, in back-solving logic puzzles, and in communication.

If any remaining letter were equally probable on a Wheel of Fortune board, the game couldn't exist. Not only would it be tremendously boring, but equally probable letters would mean there's nothing you can do to guess the solution other than trying every word. Thankfully, English doesn't work like that. This feature of the language is called redundancy. Redundancy is a measure of how many symbols, on average, can be missing from a message while the message still remains comprehensible. The "on average" is an important qualifier--as an exercise, think of some humorous situations where changing one letter in a word or one word in a phrase yields comprehensible English with a meaning far different from the original intent!

From this new perspective, we can look at Wheel of Fortune as a game where contestants press their luck to select letters that give them enough information to solve a puzzle before their opponents can.
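One rough way to put a number on redundancy: compare the entropy of the actual letter frequencies in a chunk of English to the log2(26) you'd get if every letter were equally likely. A sketch--point it at any plain-text English file you have lying around:

```python
import sys
from collections import Counter
from math import log2

def letter_entropy(path):
    """Per-letter entropy, in bits, from single-letter frequencies in a file."""
    with open(path, encoding="utf-8") as f:
        counts = Counter(ch for ch in f.read().lower() if ch.isalpha())
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())

h = letter_entropy(sys.argv[1])
print(f"per-letter entropy: {h:.2f} bits (uniform over a-z would be {log2(26):.2f})")
print(f"single-letter redundancy: {1 - h / log2(26):.0%}")
```

Single-letter frequencies only capture part of the story; Shannon's experiments with longer-range structure famously put English at roughly one bit per letter, which makes the real redundancy even higher than this estimate shows.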

Redundancy has implications for computing. You've probably heard of file compression, i.e. zipping files. This is the essence of what's happening when you zip a file: the compressor is measuring the statistics of what's in the file and coming up with an encoding scheme whereby any bit in the compressed file has about a 50/50 chance of being 0 or 1. And this is lossless compression, meaning when you unzip the compressed file on the other side, you get back exactly what you had before compression.
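A quick demonstration with Python's zlib module: highly repetitive (low-entropy) data shrinks dramatically, random (high-entropy) bytes barely shrink at all, and either way the original comes back bit for bit.

```python
import os
import zlib

repetitive = b"the quick brown fox jumps over the lazy dog " * 200
random_bytes = os.urandom(len(repetitive))

for label, data in [("repetitive", repetitive), ("random", random_bytes)]:
    packed = zlib.compress(data, level=9)
    assert zlib.decompress(packed) == data  # lossless: the original comes back exactly
    print(f"{label}: {len(data)} bytes -> {len(packed)} bytes")
```

The assert is the "lossless" part: whatever the statistics of the input, decompression reproduces it byte for byte.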

Lossless compression of text files contrasts with the lossy compression used for pictures, audio, and video. In that case, the uncompressed form isn't recoverable from the compressed form. We lose information, hence JPEG, MP3, and H.264 are all lossy. The trick with lossy compression of media files is to throw information away in a way where humans don't notice or care about the loss of fidelity. The compression software often exposes a bitrate setting that determines how aggressively detail gets smoothed over. The rabbit hole of lossy compression schemes is extremely deep, but it all comes back to entropy: making every bit transmitted over the internet count toward what you actually see or hear on the other end.
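As a toy illustration only (real codecs like JPEG and MP3 are far more sophisticated), here's a sketch of the lossy trade-off: quantizing samples to a handful of levels means each one fits in a few bits, but the exact original values are gone for good.

```python
from math import log2

def quantize(samples, levels):
    """Snap each sample in [0, 1] to the nearest of `levels` evenly spaced values."""
    return [round(s * (levels - 1)) / (levels - 1) for s in samples]

original = [0.12, 0.47, 0.51, 0.89, 0.90, 0.13]
coarse = quantize(original, levels=8)   # each sample now fits in log2(8) = 3 bits
print(coarse)                           # close to the originals, but not equal to them
print(f"{log2(8):.0f} bits per sample after quantizing")
```

Real codecs decide what to throw away using models of human hearing and vision rather than naive rounding, but the bits-versus-fidelity trade-off is the same.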
