# ENTROPY: Communication, Disorder and Perpetual Motion Machines.

As promised, an article on the fascinating concept of Entropy. A real go-getter, entropy seems to have its hands in a million different pies (probably because it knows the pies will inevitably become a massive pile of sloppy, mushy disorder, and who would want to eat that?). If you didn’t get that joke, hopefully you will by the end. Let me take you by the hand, and lead you through entropy land.

HEALTH WARNING: feel free to skip over any/all math formulae, just imagine they are not there.

## Decision Trees

A decision tree is a machine learning algorithm which tries to find the most efficient tests to make a prediction. For example, given information about tumours (height, volume, colour, growth rate etc.) the decision tree learns the questions it should ask to determine whether the tumour is malignant. So how does it identify these tests?

As I mentioned in my article on Lorenz Curves (https://remramryan.home.blog/2019/12/16/lorenz-curves-measuring-inequality/), the Gini Index (GI) is one measure used in Decision Tree algorithms. This is because GI measures the inequality within a set of data. (Strictly speaking, the variant used in decision trees, often called Gini impurity, is $1 - \sum_i p_i^2$ over the class proportions $p_i$; it shares a name with the Lorenz-curve Gini coefficient but is a different formula.) Decision Tree algorithms leverage this in order to segment the data space into sub-sections which have lower inequality (they are more equal).

Figure: A decision tree illustrating our intuition that geographical location is highly correlated with GDP.

We can pretty much determine a European country’s location with 3 questions about GDP per capita, although there are some notable misclassifications: Poland as Western and Romania as Southern, both of which are understandable.

An alternative metric for calculating the homogeneity of subspaces is Entropy, as used in the above example. It is defined as: $H(X) = - \sum_{i=1}^m p(x_i) \log p(x_i)$
where $X$ is an observation drawn from the subspace, $x_1, x_2, \dots, x_m$ are the $m$ possible ‘categories’, and $p(x_i)$ is the probability that the observation is of type $x_i$.
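To make the definition concrete, here is a minimal Python sketch that estimates entropy from a list of category labels and uses it to score a candidate decision tree test. The tumour data, the `growth_rate` feature values, and the function names are all made up for illustration, and I am assuming logarithms to base 2 so that the answer comes out in bits. A real decision tree library will do something more elaborate.

```python
from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Entropy (in bits) of the empirical category distribution of `labels`."""
    total = len(labels)
    # p(x_i) is estimated as n / total; p * log2(1 / p) equals -p * log2(p)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def information_gain(labels, left, right):
    """Drop in entropy when `labels` is split into the groups `left` and `right`."""
    weighted = (len(left) * shannon_entropy(left)
                + len(right) * shannon_entropy(right)) / len(labels)
    return shannon_entropy(labels) - weighted

# Hypothetical toy data: (growth_rate, label)
tumours = [(0.2, "benign"), (0.3, "benign"), (0.4, "benign"),
           (0.9, "malignant"), (1.1, "malignant"), (1.3, "malignant")]
labels = [label for _, label in tumours]

def split(threshold):
    """Apply the test 'is growth_rate < threshold?' and return both groups."""
    left = [label for rate, label in tumours if rate < threshold]
    right = [label for rate, label in tumours if rate >= threshold]
    return left, right

print(shannon_entropy(labels))                 # 1.0, a 50-50 mix
print(information_gain(labels, *split(0.5)))   # 1.0, both groups are pure
print(information_gain(labels, *split(0.35)))  # smaller, one group is still mixed
```

The tree simply prefers the test with the larger gain, which here is the threshold of 0.5 that separates the two classes perfectly.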

To understand how this expression captures the relative homogeneity, or conversely, disorder, consider the case with only two possible states. Now we have $H(p) = - p \log p - (1-p) \log(1-p)$. So plotting $H$ as a function of $p$ we get:

Figure: Entropy diagram for a state space of size 2.
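The shape of this curve can also be checked numerically. The sketch below tabulates the two-state entropy for a handful of values of $p$, assuming base 2 logarithms and using the usual convention that $0 \log 0 = 0$; the function name `binary_entropy` is my own.

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a two-state system with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log(0) is taken to be 0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:.2f}  H = {binary_entropy(p):.3f} bits")
```

The values rise from 0 at $p = 0$ to a peak of 1 bit at $p = 0.5$, then fall symmetrically back to 0 at $p = 1$.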

In this example, a 50-50 mix of the two different states results in a maximum entropy, whereas if we have perfect homogeneity the entropy is 0. Right, so entropy measures disorder, but why does this function work? To understand this, we will need to take a quick dive into Information Theory.

## Information Theory

Step up Claude Elwood Shannon, the (God)Father of Information Theory. We will pick up his story from WWII, where Shannon was heavily involved in studying cryptographic systems. During this time, Shannon came into contact with Alan Turing, the (God)Father of AI (and computing in general). His work on cryptographic systems and Alan Turing’s influence led Shannon to the creation* of Information Theory. Shannon also had many other hobbies, including inventing the first wearable computer, which was designed for cheating at roulette. What a guy.

*Is mathematics created or discovered? I hope to write on this at some point in the future.

So Information Theory, what is it? ‘Information theory studies the quantification, storage, and communication of information’ (Wikipedia). Pretty intuitive, yet not very insightful. Well, the fundamental quantity studied in Information Theory is, you guessed it, entropy. However, we can first define a quantity called Information (content), to which entropy is intrinsically related. Let $X$ be a random variable with probability mass function $p$; then the information content of an outcome $x$ is $I(x) = -\log p(x)$. Can you now guess the relationship between entropy and information? $H(X) = \mathbb{E} [I(X)]$. Entropy is the average information gained after observing a random variable.

Information tells us the number of bits gained when we observe a particular outcome of a random variable (taking logarithms to base 2). For instance, if $X$ is a fair coin, the information gained from observing heads, $I(\text{heads}) = -\log_2 \frac{1}{2}$, is 1 bit. (Or tautologically, observing a bit as being 1 carries 1 bit of information; this is in fact how bits are defined!) Therefore entropy gives us the expected information gain after observing a random variable.
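Here is a small Python sketch of that relationship, assuming base 2 logarithms. The coin distributions and the function names are illustrative choices of mine; the point is only that entropy is the probability-weighted average of the information content of each outcome.

```python
from math import log2

def information_content(p):
    """Self-information, in bits, of an outcome that occurs with probability p."""
    return -log2(p)

def entropy(distribution):
    """Expected information content of a distribution {outcome: probability}."""
    return sum(p * information_content(p) for p in distribution.values() if p > 0)

fair_coin = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 0.9, "tails": 0.1}

print(information_content(fair_coin["heads"]))    # 1.0 bit
print(information_content(biased_coin["tails"]))  # about 3.32 bits, rare outcomes are more surprising
print(entropy(fair_coin))                         # 1.0 bit
print(entropy(biased_coin))                       # about 0.47 bits
```

Notice that the biased coin has lower entropy than the fair one: most of the time it shows heads, which tells us very little.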

To gain a more intuitive feel as to why (Shannon) entropy is fundamental, consider this experiment measuring the communication potential of ants (http://reznikova.net/R-R-entropy-09.pdf). The authors estimate that ants can transfer ~1 bit of information per minute (the investigation involves starving ants and placing food at exactly one leaf of a binary tree). So entropy tells us something about the amount of information that could be contained in a given message. For the interested reader: Shannon showed that Entropy can be defined axiomatically. (The entropy function is, up to a constant factor, the unique function fitting a few desired axioms.)
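A binary tree is a natural setting for counting bits, because each left/right turn answers one yes/no question. The sketch below is my own illustration, not the method from the paper: it assumes the food is equally likely to be at any leaf, and the function names are hypothetical.

```python
from math import log2

def bits_to_locate_leaf(depth):
    """Entropy (bits) of a uniformly random leaf in a binary tree of the given depth."""
    leaves = 2 ** depth
    p = 1 / leaves
    # Every leaf is equally likely, so each contributes p * log2(1 / p)
    return sum(p * log2(leaves) for _ in range(leaves))

def path_to_leaf(leaf_index, depth):
    """Describe a leaf as a sequence of left (L) and right (R) turns from the root."""
    bits = format(leaf_index, f"0{depth}b")  # one binary digit per fork
    return "".join("R" if b == "1" else "L" for b in bits)

for depth in range(1, 7):
    print(depth, bits_to_locate_leaf(depth))

print(path_to_leaf(5, 3))  # RLR
```

Under these assumptions, a tree with $n$ forks on the way to each leaf needs exactly $n$ bits, one per turn, to say where the food is.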

Alright, it is now time to change lanes. If you had heard the word entropy before, it was most likely in relation to the fabled second law of thermodynamics. You may be wondering: are they related? Are they the same? What is the connection?

## The Second Law of Thermodynamics

Let us start with a definition. ‘The second law of thermodynamics states that the total entropy of an isolated system can never decrease over time, and is constant if and only if all processes are reversible. Isolated systems spontaneously evolve towards thermodynamic equilibrium, the state with maximum entropy.’ (Wikipedia). Or in English: ordered things will tend towards disorder. For example, ice melts. $H_2 O$ molecules in ice have more ‘order’ than they do in water. The second law of thermodynamics states that this change is inevitable (given we are working in an isolated system). Basically, everything we know and love will eventually turn to mush. Harrowing stuff.

For an intuitive notion of why this is true, consider the following thought experiment. There is a box containing a gas. However, all the gas particles are situated in the upper left-hand corner of the box (highly ordered). Over time, the gas will spread out to fill the box (solely due to probability: there are vastly more spread-out arrangements than bunched-up ones) and thus entropy (disorder) will increase. The particles naturally spread out, from a state of low entropy to a state of high entropy.
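This thought experiment is easy to play with in code. The following is a toy sketch, not real physics: I assume the box is an 8 by 8 grid of cells, that every particle hops one cell in a random direction at each step, and that 'entropy' means the Shannon entropy of how the particles are distributed over cells. The grid size, particle count, and function names are all arbitrary choices of mine.

```python
import random
from collections import Counter
from math import log2

def occupancy_entropy(positions):
    """Shannon entropy (bits) of the distribution of particles over grid cells."""
    total = len(positions)
    return sum((n / total) * log2(total / n) for n in Counter(positions).values())

def step(positions, size, rng):
    """Move every particle one cell in a random direction, staying inside the box."""
    moved = []
    for x, y in positions:
        dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        # Clamp to the walls so particles cannot leave the box
        moved.append((min(max(x + dx, 0), size - 1),
                      min(max(y + dy, 0), size - 1)))
    return moved

def simulate(n_particles=500, size=8, steps=200, seed=0):
    """Return the entropy after each step, starting with all particles in one corner."""
    rng = random.Random(seed)
    positions = [(0, 0)] * n_particles
    history = [occupancy_entropy(positions)]
    for _ in range(steps):
        positions = step(positions, size, rng)
        history.append(occupancy_entropy(positions))
    return history

history = simulate()
print(f"start: {history[0]:.2f} bits")
print(f"end:   {history[-1]:.2f} bits")
print(f"max:   {log2(8 * 8):.2f} bits (particles spread evenly over all 64 cells)")
```

Nothing in the rules pushes the particles outwards; they spread simply because almost every arrangement they can wander into is more spread out than the one they started in. The entropy can wobble slightly from step to step, but it climbs from zero towards the maximum and stays near it.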

Here entropy is defined over the possible (micro)states of the particles in the box: $H = -k_b \sum_{i} p_i \log(p_i)$, where $k_b$ is the Boltzmann constant, $p_i$ is the probability of the system being in microstate $i$, and $\log$ is the natural logarithm. Looks familiar, right? Here is the awaited connection: this thermodynamic entropy is (under certain conditions, and up to the constant factor $k_b \ln 2$) simply the number of Shannon bits needed to define the microstate of a system, given its macrostate (temperature, energy, number of particles etc.). Or as put by G. N. Lewis in 1930, ‘Gain in Entropy means loss of information, and nothing more’.
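The link between the two entropies can be checked numerically. In this sketch I assume a made-up system with 1024 equally likely microstates; the function names are mine, and the only physical input is the SI value of the Boltzmann constant. The thermodynamic formula uses the natural logarithm, so the two quantities differ by a factor of $k_b \ln 2$.

```python
from math import log, log2, isclose

K_B = 1.380649e-23  # Boltzmann constant in joules per kelvin (exact SI value)

def gibbs_entropy(probs):
    """Thermodynamic entropy -k_b * sum(p * ln p), in joules per kelvin."""
    return -K_B * sum(p * log(p) for p in probs if p > 0)

def shannon_bits(probs):
    """Shannon entropy -sum(p * log2 p), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical system: 1024 microstates, all equally likely
microstates = 1024
uniform = [1 / microstates] * microstates

bits = shannon_bits(uniform)
thermo = gibbs_entropy(uniform)

print(bits)                                    # 10.0 bits to name one microstate
print(thermo)                                  # the same quantity in J/K
print(isclose(thermo, K_B * log(2) * bits))    # True
```

So each bit of missing information about the microstate corresponds to $k_b \ln 2$ joules per kelvin of thermodynamic entropy, a tiny number, which is why we never notice it at human scales.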

## Perpetual Motion: Knowledge is Power

One consequence of the second law of thermodynamics was to render impossible attempts to build a ‘perpetual motion’ machine. A perpetual motion machine is something that can simply power itself, with no need of any energy source. Such a machine would effectively solve the climate crisis by generating power for free. Sound too good to be true? Well, as you know, such a machine is not possible. If it were, we could place the machine in our example box from earlier, using it to force the gas molecules to stay in the upper left-hand corner. This would violate the second law of thermodynamics, thus the machine cannot exist. But if entropy is information, couldn’t we use information to power our perpetual motion machine? Bring on the Szilard engine!