Adaptive Lossless Compression

Retrospection

Adaptive Lossless Compression (ALC) is something that I came up with between the end of my bachelor's and the first year of my master's studies. Back then I joined George Hotz's streams from time to time, and once I heard him say that the idea of compression leading to intelligence was very interesting. He added that even in physics or mathematics we try to invent formulas to represent our sensations or observations. At that moment, the idea of compression caught my attention. I thought about the numbers, functions, and sets in mathematics, and one question naturally came to my mind: Why do problems have mathematical solutions? Put differently, why do they have solutions that are mathematical at all? As I kept thinking about this question, I started realizing its true depth. I do not have an answer to give today, but I have developed some intuitions about it.

If you want to understand it, you must build it!

To solve a problem instance at hand, you have to understand it first. From my biased introspections, I have realized that understanding of something is what we claim to have when we are able to jump from that thing to the things that we already "understand". So understanding, in my opinion, is not solely about the problem instance but about the whole set of relations/connections/links between this instance and all others. The more connections we make, the more we understand. In some sense, having an understanding of something seems like an illusion tied to one's ability to manipulate things. Sometimes repetitively doing something may also give you the "illusion of understanding" what you are actually doing. Now, when you build interconnections between different concepts, those interconnections let you jump from one concept to another, or in other words, walk around in the realm of "abstract" concepts. One interesting thing is that we may also use concepts we have already formed in our ontology to build connections between other concepts.

I defined and used two main terms to look for an answer to the question above. The first is the notion of stability, and the second is the notion of a dynamic system. Stability refers to something that stays the same throughout the whole time, whereas dynamic systems are the ones that change over time. If we had a plot of, say, a thousand points in front of us and we were asked to find the relation between two successive points, then we would essentially have two approaches: (1) brute-force all the successive pairs and save them in a table, or (2) come up with a symbolic (one-line) representation that explains the relation. The former seems very obvious since it requires only a direct interaction with the points. The latter is a more indirect way of solving the problem, since it requires substituting a given input into your equation in order to obtain the next point on the plot. It is also worth mentioning that when you have to use and combine this relation with something else to obtain a new relation, it is easier to pick the latter representation rather than the former. The reason is simple: imagine going over the table with 999 pairs of points each time you need some kind of confirmation for your hypothesis - it is very repetitive and inefficient, whereas you could take the one-line representation and combine it with something else much more easily. Now, you might think that having an indirect one-line representation is always great. Well, yes and no, depending on the situation. If you want to construct new hypotheses and theories, then yes, it is the way to go. However, if you want to observe the "reality of it", then you need to do extra computation to get there.
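Here is a toy contrast of the two approaches in Python (a sketch of my own; the thousand points are hypothetically generated from $y = 2x + 1$):

```python
# Hypothetical points, just to contrast the two representations discussed above.
points = [2 * x + 1 for x in range(1000)]

# Approach (1): brute-force every successive pair and save it in a table.
table = {points[i]: points[i + 1] for i in range(len(points) - 1)}

# Approach (2): a one-line symbolic relation between successive points.
def next_point(p):
    return p + 2  # indirect: extra computation is needed to reach "the reality of it"

# Both answer the same question, but only (2) is easy to combine with further hypotheses.
assert all(next_point(p) == q for p, q in table.items())
```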

Abstraction is beautiful, and it comes with a cost! The cost of abstraction is that you need to take (extra) action(s) in order to get to the reality of it. Therefore, it is the ability to take an action that enables abstraction. Now, actions do not necessarily have to be observably physical; they can also be imaginational. Even when we look at two different pictures of an apple, we have to somehow act on those images to be able to say that those pictures have something in common, which is an apple. The word action can also be read as a synonym for computation.

Experiments with lossless compression algorithms

In 1952, David Huffman published an algorithm called Huffman coding [1] for constructing minimum-redundancy codes for any given string. To do that, the algorithm has a clever way of assigning each character a binary code that is also uniquely decodable. Basically, the idea is to create what is called a Huffman (binary) tree whose nodes hold the distinct characters of the given string and whose left and right edges are assigned the boolean values 0 and 1 respectively. The binary code for a character in the tree is obtained by concatenating all the bits assigned to the edges that take you from the root of the tree to the node holding the character you are interested in. Frequently occurring characters are located nearer the root than rarely occurring ones, which makes the codes shorter for the more frequent characters and longer for the less frequent ones. Although the idea was relatively simple to understand, I really enjoyed its very simplicity and how useful it was for reducing the size of big corpora.
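Here is a minimal Python sketch of the idea (my own toy illustration using heapq, not part of the original experiments):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Build a Huffman tree for `text` and return a character -> bit-string map."""
    tie = count()  # unique tie-breaker so the heap never compares trees directly
    heap = [(freq, next(tie), char) for char, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # take the two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))  # ...and merge them
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node: 0 for the left edge, 1 for the right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf node holds a distinct character
            codes[tree] = prefix or "0"      # corner case: a single-character corpus
    walk(heap[0][2], "")
    return codes

print(huffman_codes("abracadabra"))          # 'a' (most frequent) gets the shortest code
```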

Another algorithm that I got to know was Quadtree compression. This one basically takes a 2D array or an image and recursively divides it into 4 equal sub-images until some termination criterion is met. Then it assigns only one color code (e.g., obtained by taking the mean value of the current sub-image) to each smallest sub-image and, by doing so, reduces the size of the whole image. What I liked about this algorithm was that I could visually see the image blurring out as its size got reduced quite a bit, and it was at that moment that I realized the trade-off between reduced size and the integrity of images, or any data for that matter. You cannot compress something past some point unless you agree to give up its integrity or originality. So, if you take any piece of data and compress it in a lossless manner, then you end up with (compressed) data which is incompressible to any further extent and, therefore, irreducible to smaller parts.
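A rough sketch of the recursive splitting on a tiny square grayscale array (hypothetical pixel values; the deviation-based termination criterion and tolerance are choices of my own):

```python
def quadtree_compress(img, x0, y0, size, tol=10.0):
    """Recursively split the square region (x0, y0, size) into 4 sub-images until the
    region is nearly uniform, then overwrite it with its mean value (lossy past that point)."""
    block = [img[y][x] for y in range(y0, y0 + size) for x in range(x0, x0 + size)]
    mean = sum(block) / len(block)
    if size == 1 or max(abs(v - mean) for v in block) <= tol:   # termination criterion
        for y in range(y0, y0 + size):
            for x in range(x0, x0 + size):
                img[y][x] = mean                                # one color code per leaf
        return 1                                                # one leaf stored
    half = size // 2
    return sum(quadtree_compress(img, x0 + dx, y0 + dy, half, tol)
               for dx in (0, half) for dy in (0, half))         # 4 recursive sub-images

# A hypothetical 4x4 "image": the smoother the image, the fewer leaves we need to store.
image = [[10, 12, 200, 201],
         [11, 13, 198, 202],
         [90, 92, 50, 52],
         [91, 93, 51, 53]]
print(quadtree_compress(image, 0, 0, 4), "leaves instead of 16 pixels")
```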

When I was playing with LZW compression [3], I found out something very interesting. I was actually a bit annoyed by the fact that I had missed, years ago during my bachelor's, the thing I had just realized. We had a class on this and we were told to implement LZW as an exercise back then. I never played with it or explored the ideas presented in the LZW algorithm deeply, and that is probably why I never realized what I found out during my master's. The realization was the following: as the corpus size gets bigger, some real English words appear in the dictionary built by the algorithm. This was very fascinating to me because the algorithm has no (direct) intention of finding the actual English words that humans use. I had an intuition about why this happens - since English words are frequently encountered by the algorithm, it is natural for it to construct a dictionary slot for those strings/tokens. But what if it is the same for us humans? What if we invented the word "apple" just because we have seen many similar round-shaped edible things throughout our lifespans? What if our purpose is not directly inventing a new word but rather simply memorizing the observation(s)? An even better question: what if our purpose is simply predicting the next observation(s)?
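The effect is easy to reproduce with a textbook LZW encoder; the sketch below is my own and runs on a made-up corpus:

```python
def lzw_dictionary(text):
    """Run a textbook LZW encoder over `text` and return the dictionary it builds."""
    dictionary = {chr(i): i for i in range(256)}        # start from single characters
    w, codes = "", []
    for c in text:
        if w + c in dictionary:
            w += c                                      # keep extending the current match
        else:
            codes.append(dictionary[w])
            dictionary[w + c] = len(dictionary)         # memorize the newly seen string
            w = c
    if w:
        codes.append(dictionary[w])
    return codes, dictionary

corpus = "the apple and the apple and the red apple " * 100
_, d = lzw_dictionary(corpus)
# Some multi-character entries end up being (or containing) real English words,
# even though the algorithm never aimed for them explicitly.
print(sorted(s for s in d if len(s) > 4 and s.strip().isalpha()))
```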

Adaptive Lossless Compression

ALC is an algorithm that effectively learns what to predict next with the help of its previous observations. As the name suggests, its goal is to compress information without any loss, in an adaptive manner. Although the name has been kept for historical reasons, we will see that it can also perform lossy compression depending on the hyper-parameters of the algorithm. At a high level, there are three main steps to it:

  1. Construct
  2. Unify
  3. Compress

These steps are executed sequentially, once per epoch of the system's training process. Basically, the algorithm first tries to construct a new system for the single observation currently being observed, then it uses a rule/policy to unify the previously constructed system (if any) with the current one into a single system in a way that no information is lost, and finally it compresses that system to reduce its complexity and/or introduce generalization by using different compression techniques. Now, let's dive into each step in further detail.
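In pseudo-Python, one epoch might look like the sketch below; the helper names construct, unify and compress are placeholders for the three steps detailed in the following sections:

```python
def train_epoch(system, observations, construct, unify, compress, policy="mixed"):
    """One ALC training epoch: construct, unify, compress for each observation."""
    for X, y in observations:
        unit = construct(X, y, policy)     # 1. Construct a unit system for the current observation
        system = unify(system, unit, y)    # 2. Unify it with the existing system without losing information
        system = compress(system)          # 3. Compress to reduce complexity and/or generalize
    return system
```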

Let's first define the ALC system formally - an ALC system $S$ is a graph, i.e., a pair of sets representing nodes and the edges between them, that is, $S = (N, E)$, and for any such graph the following statements must hold:

$$ \forall n \in N\ T(n) = \texttt{In} \lor T(n) = \texttt{Out} \lor T(n) = \texttt{AND} \lor T(n) = \texttt{NOT} \lor T(n) = \texttt{OR} $$

$$ \forall n \in N\ T(n) = \texttt{In} \implies \exists m \in N\ (n, m) \in E $$

$$ \forall n \in N\ T(n) = \texttt{Out} \implies \exists! m \in N\ (m, n) \in E $$

$$ \forall n \in N\ T(n) = \texttt{AND} \implies \exists^2 m_1, m_2 \in N\ (m_1, n), (m_2, n) \in E $$

$$ \forall n \in N\ T(n) = \texttt{NOT} \implies \exists! m \in N\ (m, n) \in E $$

$$ \forall n \in N\ T(n) = \texttt{OR} \implies \exists^2 m_1, m_2 \in N\ (m_1, n), (m_2, n) \in E $$
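One possible in-memory encoding of such a system, together with a check of the constraints above, could look like this (a sketch of my own, not a reference implementation):

```python
from dataclasses import dataclass, field

NODE_TYPES = {"In", "Out", "AND", "OR", "NOT"}

@dataclass
class System:
    types: dict = field(default_factory=dict)   # node id -> type T(n)
    edges: set = field(default_factory=set)     # set of (source, target) pairs, i.e. E

    def fan_in(self, n):
        return [m for (m, t) in self.edges if t == n]

    def check(self):
        """Assert the structural constraints listed above."""
        for n, kind in self.types.items():
            assert kind in NODE_TYPES
            if kind == "In":
                assert any(src == n for (src, _) in self.edges)   # at least one outgoing edge
            elif kind in ("Out", "NOT"):
                assert len(self.fan_in(n)) == 1                   # exactly one incoming edge
            else:                                                 # AND / OR
                assert len(self.fan_in(n)) == 2                   # exactly two incoming edges

S = System({"x": "In", "y": "In", "g": "AND", "o": "Out"},
           {("x", "g"), ("y", "g"), ("g", "o")})
S.check()
```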

Before going any further, please refer to the table given below for all the notations and their descriptions used in this post:

| Notation | Type | Domain | Range | Description |
|---|---|---|---|---|
| $A(n)$ | Function | Set of nodes, i.e., $N$ | Boolean / String | Activation of node $n$ |
| $S(n, m)$ | (Binary) Relation | Pairs of nodes, i.e., $N^2$ | - | Similarity of nodes $n$ and $m$ in terms of their types and activations |
| $T(n)$ | Function | Set of nodes, i.e., $N$ | String | Type of node $n$ |
| $N_F$ | Set | - | - | Set of non-IO nodes |
| $S$ | System/Graph | - | - | A system which represents multiple observations |
| $S_k^\prime$ | System/Graph | - | - | A (unit) system/graph built to satisfy a single observation $X, y=k$ |
| $U(S, S_k^\prime)$ | Function | Pairs of systems, i.e., $\{(N, E)\} \times \{(N^\prime, E^\prime)\}$ | Set of systems, i.e., $\{(N, E)\}$ | Unification of two systems |
| $C(S)$ | Function | Set of systems, i.e., $\{(N, E)\}$ | Set of systems, i.e., $\{(N, E)\}$ | Compression of a system |
| $L(n, n^\prime)$ | (Binary) Relation | Pairs of nodes, i.e., $N^2$ | - | Linkage of two nodes, which states $L(n, n^\prime) \implies A(n) = A(n^\prime)$ |

1. Construct

The construction algorithm constructs a new ALC system for the single observation that is currently being observed. There is one property that such an algorithm must have: given a system $S^\prime$ newly constructed for the observation $(X^\prime, Y^\prime)$ by the construction algorithm, $\forall X, X^\prime\ X \neq X^\prime \implies \texttt{detect}_X(X) \neq \texttt{detect}_X(X^\prime)$ where $\texttt{detect}_X$ is an intermediate process used to detect a previously seen observation $X$ (i.e., it outputs a certain boolean value that distinguishes unseen observations from already seen ones). There are different construction policies, as shown below:

  • Priority-0, i.e., given an observation a system is constructed to output 0 upon the detection of the same observation
  • Priority-1, i.e., given an observation a system is constructed to output 1 upon the detection of the same observation
  • Mixed-priority, i.e., given an observation a system is constructed to output 0 or 1 upon the detection of the same observation

The priority-0 policy makes the construction algorithm convert all the 1 bits to zero first and then gather all of the bits (i.e., 0's) under a disjunction. So, when the input is even slightly different from the original one, the constructed system outputs 1 instead of 0, as intended. The priority-1 policy, in contrast, flips all the 0 bits to one first and then gathers all of the bits (i.e., 1's) under a conjunction. So, when the input is even slightly different from the original one, the constructed system outputs 0 instead of 1, which is again as intended. The mixed-priority policy makes the algorithm choose, randomly or deterministically, whether it wants to follow the priority-0 or the priority-1 policy for the current observation. The deterministic choice usually depends on the (final) output of the current observation: if the output is 0 then priority-0 is selected, and priority-1 is selected otherwise.
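Here is a minimal behavioral sketch of the two base policies, assuming the observational input $X$ is a vector of bits and representing a unit system simply by its output function:

```python
# A sketch of the two construction policies on a bit-vector observation X;
# the returned detector plays the role of the unit system's output node.
def construct(X, policy):
    X = tuple(X)
    if policy == "priority-1":
        # Flip every 0 bit and gather everything under a conjunction:
        # outputs 1 only when exactly X is observed again.
        return lambda Z: int(all(z if x == 1 else 1 - z for x, z in zip(X, Z)))
    if policy == "priority-0":
        # Flip every 1 bit and gather everything under a disjunction:
        # outputs 0 only when exactly X is observed again.
        return lambda Z: int(any(z if x == 0 else 1 - z for x, z in zip(X, Z)))
    raise ValueError(policy)

detect = construct([1, 0, 1], "priority-1")
assert detect([1, 0, 1]) == 1 and detect([1, 1, 1]) == 0
```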

It is also worth mentioning that when a newly constructed system outputs $Y^\prime$, this output is called a unique observational output, since it is an observational output computed by a system whose intermediate detection process outputs a unique value upon the detection of the corresponding observational input $X^\prime$. The concept of a unique observational input/output will be used in the next sections.

2. Unify

The unification algorithm unifies two ALC systems into a single system without any loss of information. The only condition for this algorithm to work properly is that one of the given systems (conventionally, the second one) must be constructed with the construction algorithm with no post-processing. That is to say, the second system given to the unification algorithm must respect the following property: $\forall X, X^\prime\ S_y(X^\prime) = S_y(X) \iff X^\prime = X$. Unification goes according to the following rule, which I have come to call "The Rule of Intelligence", or the ROI:

$$ S \leftarrow S \oplus_{Y^\prime} S^\prime $$

where $\oplus_x$ is a function which outputs $x$ when at least one of its inputs is $x$; so, we could safely say that $\oplus_0 = \land$ and $\oplus_1 = \lor$ would work just fine. Such a unification rule would always preserve all the past information. How? Let's try to exhaustively explore all possible cases which could occur for a given current observation $(I, O)$:

  1. If the old system $S$ outputs $O$ and the new system $S^\prime$ outputs $O=Y^\prime$ when given $I$ then prioritizing $Y^\prime$ obviously isn't a problem;
  2. If $S$ outputs $O$ and $S^\prime$ outputs $\lnot O$ then $O$ will be output since $S^\prime$ has not recognized the input $I$ and we choose the old and much wiser system's output as the final output;
  3. If $S$ outputs $\lnot O$ and $S^\prime$ outputs $O$ then $O$ will be output since $S^\prime$ has recognized the input $I$;
  4. If $S$ outputs $\lnot O$ and $S^\prime$ outputs $\lnot O$ then prioritizing $\lnot O$ obviously isn't a problem.

Recall that this only works under the assumption that $\forall X\ X \neq X^\prime \implies \texttt{detect}(X) \neq \texttt{detect}(X^\prime)$, which is called the detection-uniqueness property. Since any system, right after construction, uses either the basic detection mechanism's output or its inverse, the output also has the uniqueness property in terms of all observational outputs. By prioritizing the output of the new system for the observation upon which it was constructed in the first place, we essentially ignore the old system's output whenever the new system outputs the one value that it can only produce when the particular observation it was constructed upon is observed again. However, if the new system outputs a different value, then the current observation is not the one that this system was constructed upon and, therefore, we forward the old (much wiser) system's output to the final output.

The ROI essentially prevents information loss by ensuring a couple of things:

  1. If the current unique observational output is 0, then the p0 (AND) gate forwards 0 whenever the newly constructed system produces that unique output, and forwards whatever the previous system outputs otherwise. So, whenever the newly constructed system outputs 1 on a new observational input, we know that it is out of that system's hands and we have to pass it to the previously constructed-unified-compressed (bigger and wiser) system.
  2. If the current unique observational output is 1, then the p1 (OR) gate forwards 1 whenever the newly constructed system produces that unique output, and forwards whatever the previous system outputs otherwise. So, whenever the newly constructed system outputs 0 on a new observational input, we know that it is out of that system's hands and we have to pass it to the previously constructed-unified-compressed (bigger and wiser) system.

So, the unification function maps two given systems, one of which must be a unit system, to a single unified system as shown below:

$$ U(S, S_k^\prime) = (N \cup N^\prime \cup \{g_k\},\ (E \cup E^\prime \cup \{(n, g_k), (n^\prime, g_k), (g_k, m), (g_k, m^\prime)\}) \setminus \{(n, m), (n^\prime, m^\prime)\}) $$

where $(n, m) \in E$, $(n^\prime, m^\prime) \in E^\prime$, $T(m) = T(m^\prime) = \texttt{Out}$, $T(g_k) = \texttt{OR}$ if $k=1$ and $T(g_k) = \texttt{AND}$ if $k=0$

The unification process preserves all the information up to the current point, since the $g_k$ node properly controls which system's output is forwarded to the final output node(s) (i.e., $m$ and $m^\prime$) of the ALC system.
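Behaviorally, the ROI boils down to a single gate. The sketch below treats both systems as callables and takes $k = Y^\prime$ (the unit system's unique observational output) as given:

```python
# Behavioral sketch of U(S, S_k'): g_1 is an OR gate, g_0 is an AND gate,
# so the unit system's unique observational output k is always prioritized.
def unify(old_system, unit_system, k):
    if k == 1:
        return lambda X: unit_system(X) or old_system(X)
    return lambda X: unit_system(X) and old_system(X)

old = lambda X: 0                          # a stand-in for the wiser, previously built system
unit = lambda X: int(X == (1, 0, 1))       # priority-1 detector constructed for X' = (1, 0, 1)
S = unify(old, unit, 1)
assert S((1, 0, 1)) == 1 and S((0, 0, 0)) == 0
```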

3. Compress

The compression algorithm, being the final piece of the whole thing, is used to reduce the space-time complexity of the ALC system. Besides such complexity reduction, it also enables generalization within the system. Although I may refer to it as "the compression algorithm", that does not mean it only does one type of compression; in fact, there are multiple kinds of compression implemented under the hood. By default, the hyper-parameters are set such that the compression is lossless. However, one might change them to make it more flexible (i.e., lossy) if needed. Here is the basic description of the compression function:

$$ C \colon S=(N, E) \mapsto S^\prime=(N^\prime, E^\prime) $$

The following statements describe the compression more formally:

$$ \forall n_\text{out} \in N_{\in S}\ \exists! n_\text{out}^\prime \in N^\prime_{\in C(S)}\ L(n_\text{out}, n_\text{out}^\prime) $$

$$ \forall n^\prime, m^\prime \in N_F^\prime\ n^\prime \neq m^\prime \implies A(n^\prime) \neq A(m^\prime) $$

$$ \forall n^\prime \in N^\prime\ \exists n \in N\ A(n) = A(n^\prime) $$

The first statement states that the compression is lossless. The second one states that compression gets rid of all the duplicate nodes and, therefore, all the nodes in the compressed system are non-redundant. The third one states that no new information is created during the process and, therefore, compression is indeed in charge of only removing redundant nodes. Moreover, the second and the third statements together establish that the underlying process is indeed one of compression.

Static Compression

Static or structural compression operates at the level of the ALC system structure, in terms of the nodes and the connections among them. Deciding whether a node is redundant depends on the definition of the activation function $A(n)$, and in this case it is defined to reflect the structural hierarchy of a given node, rooted at the input nodes, for which there is no further expansion of activations.

$$ \forall n \in N\ T(n) = \texttt{AND} \implies \exists^2 (m_1, n), (m_2, n) \in E\ A(n) = [A(m_1) \land A(m_2)] $$

$$ \forall n \in N\ T(n) = \texttt{OR} \implies \exists^2 (m_1, n), (m_2, n) \in E\ A(n) = [A(m_1) \lor A(m_2)] $$

$$ \forall n \in N\ T(n) = \texttt{NOT} \implies \exists! (m, n) \in E\ A(n) = \lnot A(m) $$

So, the similarity (i.e., isomorphism) of two nodes is defined as follows:

$$ \forall n, m \in N\ S(n, m) \iff A(n) = A(m) $$

Finally, the activation function for the output node(s) is defined as follows:

$$ \forall n \in N\ T(n) = \texttt{Out} \implies \exists! (m, n) \in E\ A(n) = [A(m)] $$
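As a sketch of how static compression might detect isomorphic nodes (nodes are assumed to be stored as (type, fan-in) pairs; a real implementation would also canonicalize the commutative inputs of AND/OR):

```python
# Static (structural) compression sketch: A(n) is the bracketed symbolic expression
# built from a node's fan-in, and two nodes are isomorphic iff their expressions match.
def activation(n, graph):
    kind, inputs = graph[n]                       # graph: node -> (type, list of fan-in nodes)
    if kind == "In":
        return n                                  # input activations are not expanded further
    if kind == "NOT":
        return "~" + activation(inputs[0], graph)
    if kind == "Out":
        return "[" + activation(inputs[0], graph) + "]"
    op = " & " if kind == "AND" else " | "
    return "[" + activation(inputs[0], graph) + op + activation(inputs[1], graph) + "]"

def static_compress(graph):
    seen, representative = {}, {}
    for n in graph:
        representative[n] = seen.setdefault(activation(n, graph), n)   # redirect duplicates
    return representative                         # node -> structurally identical representative

g = {"x": ("In", []), "y": ("In", []),
     "a1": ("AND", ["x", "y"]), "a2": ("AND", ["x", "y"]), "o": ("Out", ["a1"])}
print(static_compress(g))                         # the redundant gate a2 collapses onto a1
```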

Dynamic Compression

Dynamic or behavioral compression operates at the level of the ALC system's (run-time) behavior. This means that this compression algorithm has no access to the structural information. To decide whether a given node is redundant in the dynamic regime, we modify the definitions of the activation function: it is now a boolean-valued function whose output depends on the underlying boolean function of the node.

$$ \forall n \in N\ T(n) = \texttt{AND} \implies \exists^2 (m_1, n), (m_2, n) \in E\ A(n) = f_\land(A(m_1), A(m_2)) $$

$$ \forall n \in N\ T(n) = \texttt{OR} \implies \exists^2 (m_1, n), (m_2, n) \in E\ A(n) = f_\lor(A(m_1), A(m_2)) $$

$$ \forall n \in N\ T(n) = \texttt{NOT} \implies \exists! (m, n) \in E\ A(n) = f_\lnot(A(m)) $$

The table for the boolean functions is given below:

| $x$ | $y$ | $f_\land(x, y)$ | $f_\lor(x, y)$ | $f_\lnot(x)$ |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 |

So, the similarity (i.e., isomorphism) of two given nodes is defined as follows:

$$ \forall n, m \in N\ S(n, m) \iff A(n) = A(m) $$

Finally, the activation function for the output node(s) is defined as follows:

$$ \forall n \in N\ T(n) = \texttt{Out} \implies \exists! (m, n) \in E\ A(n) = A(m) $$
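A behavioral counterpart could record each node's boolean activations over a batch of observed inputs and merge the nodes whose histories coincide; the sketch below is my own, and the merge is speculative, which is exactly where the lossy/lossless trade-off enters:

```python
# Dynamic (behavioral) compression sketch: run the system on recorded inputs,
# collect each node's boolean activation history, and merge nodes whose histories match.
def evaluate(graph, assignment):
    values = {}
    def val(n):
        if n not in values:
            kind, inputs = graph[n]
            if kind == "In":
                values[n] = assignment[n]
            elif kind == "NOT":
                values[n] = 1 - val(inputs[0])
            elif kind == "Out":
                values[n] = val(inputs[0])
            else:
                f = min if kind == "AND" else max      # f_and / f_or on {0, 1}
                values[n] = f(val(inputs[0]), val(inputs[1]))
        return values[n]
    return {n: val(n) for n in graph}

def dynamic_compress(graph, observations):
    history = {n: tuple(evaluate(graph, obs)[n] for obs in observations) for n in graph}
    seen, representative = {}, {}
    for n in graph:
        representative[n] = seen.setdefault(history[n], n)   # equal histories -> isomorphic nodes
    return representative

g = {"x": ("In", []), "n1": ("NOT", ["x"]), "n2": ("NOT", ["x"]), "o": ("Out", ["n1"])}
print(dynamic_compress(g, [{"x": 0}, {"x": 1}]))             # n2 (and even o) collapse onto n1
```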

Hyper-parameters

  1. Construction modality - number of different constructions/representations for a single observation (default, 1)
  2. Construction randomness - probability of constructing randomly structured representations for a single observation (default, 0.0)
  3. Construction urge - probability of constructing a new representation for the current observation when the system prediction is already correct (default, 1 - always construct)
  4. Compression aggressiveness - probability of removing isomorphic nodes at each iteration [for dynamic compression] (default, 1.0 - always remove)
  5. Compression loss - tolerated information loss (default, 0.0 - no information loss)
  6. Speculative similarity threshold - threshold for determining isomorphic nodes [for dynamic compression] (default, 1 - immediate or weak isomorphism, higher number means stronger isomorphism)
  7. Static compression - switch on/off static/structural compression (default, true - switched on)
  8. Dynamic compression - switch on/off dynamic/behavioral compression (default, true - switched on)
  9. Replace patience - probability of replacing old generation with new generation (default, 0.0)
  10. Replaceable difference - number of previous generations during which created nodes are allowed to be replaced with the new generation 100% of the time, i.e., all generations since the iteration curr_gen-replaceable_diff are considered current generation (default, 0 - each iteration defines a new generation)
  11. Replaceability - a function that returns a probability of replacing a node with another based on their creation timestamps (default, lambda it1, it2: 1.0, i.e., everything is replaceable)
  12. Support difference - minimum difference of supports of two nodes for one in order to be able to replace the other if needed (default, 0.1)
  13. Local memory size - local/short-term-memory size of each node (default, 4 bits)
  14. Global memory size - global/long-term-memory size of the ALC system (default, 1024 bits)
  15. Batch size - number of consecutive observations for which a unified representation is constructed (default, 1)
  16. Batch purity - probability of grouping batches of the [batch size] by their Y values (default, 0 - mixed samples in batches)
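If one were to bundle these knobs into a configuration object, it might look like the sketch below (field names are hypothetical; defaults mirror the list above):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ALCConfig:
    # Hypothetical field names; defaults mirror the hyper-parameter list above.
    construction_modality: int = 1
    construction_randomness: float = 0.0
    construction_urge: float = 1.0
    compression_aggressiveness: float = 1.0
    compression_loss: float = 0.0
    speculative_similarity_threshold: int = 1
    static_compression: bool = True
    dynamic_compression: bool = True
    replace_patience: float = 0.0
    replaceable_difference: int = 0
    replaceability: Callable[[int, int], float] = field(default=lambda it1, it2: 1.0)
    support_difference: float = 0.1
    local_memory_size: int = 4          # bits
    global_memory_size: int = 1024      # bits
    batch_size: int = 1
    batch_purity: float = 0.0
```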

Conclusion

ALC is an adaptive lossless compression algorithm that is able to learn to predict based on previous observations. The algorithm goes through 3 main steps - construction, unification, and compression. To build an ALC system, one needs to provide the base functions that become the fundamental building blocks of the system. We have also seen that there are a few constraints on the nature of such base functions: (1) the set of all base functions must be Turing-complete, and (2) there must be a construction guideline for each such function.

References

[1] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, 1952.

[3] T. A. Welch, "A Technique for High-Performance Data Compression," IEEE Computer, 1984.
