As promised, an article on the fascinating concept of Entropy. A real go-getter, entropy seems to have its hands in a million different pies (probably because it knows the pies will inevitably become a massive pile of sloppy, mushy disorder, and who would want to eat that?). If you didn’t get that joke, hopefully you will by the end. Let me take you by the hand, and lead you through entropy land.
HEALTH WARNING: feel free to skip over any/all math formulae, just imagine they are not there.
A decision tree is a machine learning algorithm which tries to find the most efficient tests to make a prediction. For example, given information about tumours (height, volume, colour, growth rate etc.) the decision tree learns the questions it should ask to determine whether the tumour is malignant. So how does it identify these tests?
As I mentioned in my article on Lorenz Curves (https://remramryan.home.blog/2019/12/16/lorenz-curves-measuring-inequality/), the Gini Index (GI) is one measure used in Decision Tree algorithms. This is because GI measures the inequality within a set of data. Decision Tree algorithms leverage this in order to segment the data space into sub-sections which have lower inequality (they are more equal).
An alternative metric for calculating homogeneity of subspaces would be Entropy, as used in the above example. Defined as:
$H = -\sum_{i=1}^{n} p_i \log_2 p_i$
Where, given an observation is taken from the subspace, and $n$ possible ‘categories’, $p_i$ is the probability that the observation is of type $i$.
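To understand how this expression captures the relative homogeneity, or conversely, disorder, consider the case with only two possible states. Writing $p$ for the probability of the first state, we now have
$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$. So plotting $H$ as a function of $p$ we get a symmetric curve that is 0 at $p = 0$ and $p = 1$ and peaks at $p = 1/2$.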
In this example, a 50-50 mix of the two different states results in a maximum entropy, whereas if we have perfect homogeneity the entropy is 0. Right, so entropy measures disorder, but why does this function work? To understand this, we will need to take a quick dive into Information Theory.
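To make the homogeneity idea concrete, here is a minimal Python sketch that computes the entropy of the labels in a subspace, plus the size-weighted entropy of a candidate split. The function names (`entropy`, `split_entropy`) and the toy tumour labels are my own illustrations, not taken from any particular decision tree library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the class labels found in one subspace."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) is the same as -p * log2(p), written to avoid a "-0.0" result
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def split_entropy(left, right):
    """Average entropy of two child subspaces, weighted by their sizes."""
    total = len(left) + len(right)
    return (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)

# Hypothetical toy data: a 50-50 mix, and a test that separates it perfectly
mixed = ["malignant"] * 5 + ["benign"] * 5
left, right = ["malignant"] * 5, ["benign"] * 5

print(entropy(mixed))              # maximum disorder for two classes
print(split_entropy(left, right))  # perfectly homogeneous children
```

A tree-building algorithm in this spirit would prefer the test whose children have the lowest weighted entropy.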
Step up Claude Elwood Shannon, the (God)Father of Information Theory. We will pick up his story from WWII, where Shannon was heavily involved in studying cryptographic systems. During this time, Shannon came into contact with Alan Turing, the (God)Father of AI (and computing in general). His work on cryptographic systems and Alan Turing’s influence led Shannon to the creation* of Information Theory. Shannon also had many other hobbies, including inventing the first wearable computer, which was designed for cheating at roulette. What a guy.
*Is mathematics created or discovered? I hope to write on this at some point in the future.
So Information Theory, what is it? Information theory studies the quantification, storage, and communication of information (Wikipedia). Pretty intuitive, yet not very insightful. Well, the fundamental quantity studied in Information Theory is, you guessed it, entropy. However, we can define a quantity called Information (content), to which entropy is intrinsically related. Let $X$ be a random variable with probability mass function $p(x)$, then $I(x) = -\log_2 p(x)$. Can you now guess the relationship between entropy and information? Entropy is the average information gained after observing a random variable.
Information tells us the number of bits gained when we observe a particular outcome of a random variable. For instance, the information of observing heads, where $X$ is a fair coin, is $-\log_2 \frac{1}{2} = 1$ bit. (Or tautologically, observing a bit as being 1 carries 1 bit of information; this is in fact how bits are defined!) Therefore entropy gives us the expected information gain after observing a random variable.
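As a small sanity check, here is a Python sketch of information content and of entropy as its expected value. The names `information` and `entropy`, and the biased coin, are my own choices for illustration; logs are taken base 2 so that the answers come out in bits.

```python
import math

def information(p):
    """Information content, in bits, of an outcome that has probability p."""
    return math.log2(1 / p)

def entropy(pmf):
    """Expected information of a random variable given as {outcome: probability}."""
    return sum(p * information(p) for p in pmf.values() if p > 0)

fair_coin = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 0.9, "tails": 0.1}  # hypothetical example

print(information(fair_coin["heads"]))  # one bit per fair coin flip
print(entropy(fair_coin))               # average over both outcomes
print(entropy(biased_coin))             # less than one bit: less surprise on average
```

The rare outcome of the biased coin carries more information than a fair flip, but it turns up so seldom that the average drops below one bit.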
To gain a more intuitive feel as to why (Shannon) entropy is fundamental, consider this experiment measuring the communication potential of ants: http://reznikova.net/R-R-entropy-09.pdf. They estimate that ants can transfer ~1 bit of information per minute (the investigation involves starving ants and placing food at exactly one leaf of a binary tree). So entropy tells us something about the amount of information that could be contained in a given message. For the interested reader: Shannon showed that Entropy can be defined axiomatically. (The entropy function is, up to a constant factor, the unique function fitting a few desired axioms.)
Alright, it is now time to change lanes. If you have heard the word entropy before, it was most likely in relation to the fabled second law of thermodynamics. You may be wondering: are they related, are they the same, what is the connection?
The Second Law of Thermodynamics
Let us start with a definition. The second law of thermodynamics states that the total entropy of an isolated system can never decrease over time, and is constant if and only if all processes are reversible. Isolated systems spontaneously evolve towards thermodynamic equilibrium, the state with maximum entropy (Wikipedia). Or in English: ordered things will tend towards disorder. For example, ice melts. Molecules in ice have more ‘order’ than they do in water. The second law of thermodynamics states that this change is inevitable (given we are working in an isolated system). Basically, everything we know and love will eventually turn to mush, harrowing stuff.
For an intuitive notion of why this is true, consider the following thought experiment. There is a box containing a gas. However, all the gas particles are situated in the upper left-hand corner of the box (highly ordered). Over time, clearly the gas will spread out to fill the box (solely due to probability) and thus entropy (disorder) will increase.
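The thought experiment is easy to play with in code. Below is a toy Python sketch, under assumptions of my own: the box is a 4x4 grid of cells, every particle starts in one corner cell, and at each step each particle hops randomly to a neighbouring cell or stays put. The entropy tracked is the Shannon entropy of which cell a randomly chosen particle sits in, a coarse stand-in for the physical entropy rather than a real gas model.

```python
import math
import random
from collections import Counter

def occupancy_entropy(positions):
    """Entropy, in bits, of the distribution of particles over grid cells."""
    n = len(positions)
    counts = Counter(positions)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def simulate(n_particles=500, size=4, steps=200, seed=0):
    """Random-walk particles in a size x size box, returning entropy per step."""
    rng = random.Random(seed)
    particles = [(0, 0)] * n_particles  # everyone starts in the corner cell
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]
    history = [occupancy_entropy(particles)]
    for _ in range(steps):
        moved = []
        for x, y in particles:
            dx, dy = rng.choice(moves)
            # clamp to the walls so that no particle leaves the box
            moved.append((min(max(x + dx, 0), size - 1),
                          min(max(y + dy, 0), size - 1)))
        particles = moved
        history.append(occupancy_entropy(particles))
    return history

history = simulate()
print(history[0], history[-1], math.log2(16))
```

Starting from zero, the entropy climbs towards the maximum of log2(16) = 4 bits and then hovers just below it. Nothing pushes the particles apart; there are simply far more spread-out arrangements than cornered ones.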
Here entropy is defined over possible (micro)states of particles in the box: $S = -k_B \sum_i p_i \ln p_i$, where $k_B$ is the Boltzmann constant and $p_i$ is the probability of the system being in microstate $i$. Looks familiar, right? Here is the awaited connection: this thermodynamic entropy (under certain conditions) is, up to the constant factor $k_B \ln 2$, simply the number of Shannon bits needed to define the microstate of a system, given its macrostate (temperature, energy, number of particles etc.). Or as put by G. N. Lewis in 1930, ‘Gain in Entropy means loss of information, and nothing more’.
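To see the connection numerically, here is a short Python sketch. It assumes the standard Gibbs form of thermodynamic entropy, $S = -k_B \sum_i p_i \ln p_i$, and checks that it equals the Shannon entropy in bits multiplied by $k_B \ln 2$. The four-microstate distribution is made up purely for illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in joules per kelvin (SI defined value)

def shannon_bits(probs):
    """Shannon entropy in bits of a list of microstate probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def gibbs_entropy(probs):
    """Thermodynamic entropy in J/K, assuming the Gibbs formula."""
    return -K_B * sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical microstate probabilities
bits = shannon_bits(probs)
S = gibbs_entropy(probs)

print(bits)                     # bits needed, on average, to name the microstate
print(S)                        # the same quantity in J/K
print(K_B * math.log(2) * bits) # bits converted to J/K
```

The only difference between the two is the choice of logarithm base and the physical unit carried by the Boltzmann constant.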
Perpetual Motion: Knowledge is Power
One consequence of the second law of thermodynamics was to render impossible attempts to build a ‘perpetual motion’ machine. A perpetual motion machine is something that can simply power itself, with no need of any energy source. A perpetual motion machine would effectively solve the climate crisis by generating power. Sound too good to be true? Well, as you know, such a machine is not possible. If it were, we could place the machine in our example box from earlier, using it to force the gas molecules to stay in the upper left-hand corner. This would violate the second law of thermodynamics, thus the machine cannot exist. But if entropy is information, couldn’t we use information to power our perpetual motion machine? Bring on the Szilard engine!
The Szilard engine is a box with two compartments separated by a sliding wall. Let’s keep things simple and imagine there is just a single particle in the box. Suppose we knew which compartment the particle was in, which is exactly 1 Shannon bit of information (see the coin flip example from earlier). Well then, the pressure in this side of the box will force the sliding wall to the opposite side, thus doing work. Crucially, because we knew which way the wall would be forced to slide, we can set up a pulley with a weight and lift that weight, using the work done by the sliding wall. So knowledge really is power!
It would be remiss of me not to say that this is controversial. The necessity of knowledge for Szilard’s engine to function is debated, and its discussion anticipates the discussions of measurement in quantum mechanics. However, these engines certainly exist, see https://www.pnas.org/content/pnas/111/38/13786.full.pdf.
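How much work is one bit worth? The text above does not give a number, so the following Python sketch leans on a standard textbook result that I am assuming here: if the one-particle ‘gas’ expands isothermally from half the box to the whole box at temperature $T$, the work extracted is at most $k_B T \ln 2$. The function name `szilard_work` and the 300 K room temperature are illustrative choices.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in joules per kelvin (SI defined value)

def szilard_work(temperature_kelvin, bits=1.0):
    """Upper bound on work (joules) from `bits` of knowledge, assuming k_B*T*ln(2) per bit."""
    return bits * K_B * temperature_kelvin * math.log(2)

work = szilard_work(300.0)  # roughly room temperature
print(work)                 # joules per bit: a few zeptojoules
```

That is a few thousandths of a billionth of a billionth of a joule per bit, so do not expect to lift a very heavy weight.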
Okay, I feel like that is enough for today. There is, however, much much more to talk about! From the arrow of time, to black holes, to the Heisenberg uncertainty principle of quantum mechanics, entropy is everywhere. Hopefully I will get round to discussing these in a follow-up post, before those pies sink into soggy disorder that is!