All posts by Rod

Learned Multidimensional Indexes

Papers and talks on the “learned indexes” concept introduced by Kraska et al. in 2017 often close by suggesting an exciting future research direction referred to as “multidimensional indexes”.

I described a multidimensional index concept about 10 years ago (unpublished) but referred to it as learning representations that were “simultaneously physically ordered on multiple uncorrelated dimensions”. The crucial property that allows informational items, e.g., records of a database consisting of multiple fields (dimensions, features), to be simultaneously ordered on multiple dimensions is that they have the semantics of extended bodies, as opposed to point masses.  Formally, this means that items must be represented as sets.

Why is this? Fig. 1 gives the immediate intuition. Let there be a coding field consisting of 12 binary units and let the representation, or code, which we will denote with the Greek letter φ, of an item be a subset consisting of 5 of the 12 units.  First consider Fig. 1a.  It shows a case where three inputs, X, Y, and Z, have been stored.  To the right of the codes, we show the pairwise intersections of the three codes.  In this possible world, PW1, the code of X is more similar to the code of Y than to that of Z.  We have not shown you the specific inputs and we have not described the learning process that mapped these inputs to these codes.  But we do assume that the learning process preserves similarity, i.e., it maps more similar inputs to more highly intersecting codes.  Given this assumption, we know that

sim(X,Y) > sim(X,Z) and also that sim(Y,Z) > sim(X,Z).


Fig. 1

That is, this particular pattern of code intersections imposes constraints on the statistical (and thus, physical) structure of PW1.  Thus, we have some partial ordering information over the items of PW1.  We don’t know the nature of the physical dimensions that have led to this pattern of code intersections (since we haven’t shown you the inputs or the input space).  We only know that there are physical dimensions on which items in PW1 can vary and that X, Y, and Z have the relative similarities, i.e., relative orders, given above.  But note that given only what has been said so far, we could attach names to these underlying physical dimensions (regardless of what they actually are).  That is, there is some dimension of the input space on which Y is more similar to X than Z is.  Thus, we could call this dimension “X-ness”.  Y has more X-ness than Z does.  Similarly, there is another physical dimension present that we can call “Y-ness”, and Z has more Y-ness than X does.  Or, we could label that dimension “Z-ness”, in which case we’d say that Y has more Z-ness than X does.

Now, consider Fig. 1b.  It shows an alternative set of codes for X, Y, and Z that would result if the world had a slightly different physical structure.  Actually, the only change is that Y has a slightly different code.  Thus, the only difference between PW2 and PW1 is that in PW2, whatever physical dimension X-ness corresponds to, item Y has more of it than it does in PW1.  That’s because |φ(X) ∩ φ(Y)| = 3 in PW2, but equals 2 in PW1.  ALL other pairwise relations are the same in PW2 as they are in PW1.  While it is true that this is a small example, I hope it is clear that this principle will scale to higher dimensions and much larger numbers of items.  Essentially, the principle elaborated here leverages the combinatorial space of set intersections (and intersections of intersections, etc.) to counteract the curse of dimensionality.
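The PW1/PW2 setup can be sketched directly in code. The specific unit indices below are my own assumptions, chosen only to reproduce the intersection pattern described in the text (|φ(X) ∩ φ(Y)| rising from 2 to 3 while all other pairwise intersections stay fixed); they are not the actual codes from the figure.

```python
# A coding field of 12 binary units; each code is a 5-unit subset.
# Codes are modeled as frozensets of unit indices (invented here).

def code(*units):
    return frozenset(units)

# Possible world 1: |X∩Y| = 2, |Y∩Z| = 1, |X∩Z| = 0
PW1 = {
    "X": code(0, 1, 2, 3, 4),
    "Y": code(3, 4, 5, 6, 7),
    "Z": code(7, 8, 9, 10, 11),
}

# Possible world 2: only Y's code changes, raising |X∩Y| to 3
# while leaving |Y∩Z| and |X∩Z| untouched.
PW2 = dict(PW1, Y=code(2, 3, 4, 6, 7))

def overlaps(world):
    """Pairwise code intersection sizes for one possible world."""
    pairs = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
    return {p: len(world[p[0]] & world[p[1]]) for p in pairs}

print(overlaps(PW1))  # X∩Y = 2, Y∩Z = 1, X∩Z = 0
print(overlaps(PW2))  # X∩Y = 3, Y∩Z = 1, X∩Z = 0
```

The point of the sketch is that a single local change to one code shifts one similarity relation while leaving every other pairwise relation intact, which is exactly the internal degree of freedom that point representations lack.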

The example of Fig. 1 shows that when items are represented as sets, they have the internal degrees of freedom to allow their degrees of physical similarity on one dimension to be varied while maintaining their degrees of similarity on another dimension.  We actually made a somewhat stronger claim at the outset, i.e., that items represented as sets can simultaneously exist in physical order on multiple uncorrelated dimensions.  Fig. 2 shows this directly, for the case where the items are in fact simultaneously ordered on two completely anti-correlated dimensions.  

In Fig. 2, the coding field consists of 32 binary units and the convention is that all codes stored in the field will consist of exactly 8 active units. We show the codes of four items (entities), A to D, which we have handpicked to have a particular intersection structure.  The dashed line shows that the units can be divided into two disjoint subsets, each representing a different “feature” (latent variable) of the input space, e.g., Height (H) and IQ (Q).  Thus, as the rest of the figure shows, the pattern of code intersections simultaneously represents both the order, A > B > C > D, for Height and the anti-correlated order, D > C > B > A, for IQ.


Fig. 2
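The Fig. 2 construction can be sketched as follows. The unit indices are assumptions (the figure’s actual codes are not reproduced here); what matters is the structure: a 32-unit field split into two 16-unit sub-fields, 8 active units per code, and sliding 4-unit windows within each sub-field that yield graded, anti-correlated orderings.

```python
# Height sub-field occupies units 0-15; IQ sub-field occupies units 16-31.
def h_units(*u):
    return frozenset(u)

def q_units(*u):
    return frozenset(i + 16 for i in u)

# Sliding 4-unit windows give graded intersections, i.e., an ordering:
# Height order A > B > C > D; IQ order D > C > B > A (anti-correlated).
codes = {
    "A": h_units(0, 1, 2, 3) | q_units(3, 4, 5, 6),
    "B": h_units(1, 2, 3, 4) | q_units(2, 3, 4, 5),
    "C": h_units(2, 3, 4, 5) | q_units(1, 2, 3, 4),
    "D": h_units(3, 4, 5, 6) | q_units(0, 1, 2, 3),
}

def subfield_sim(a, b, lo, hi):
    """Intersection size restricted to one sub-field [lo, hi)."""
    return len(codes[a] & codes[b] & set(range(lo, hi)))

# Height: A is closest to B, then C, then D.
print([subfield_sim("A", x, 0, 16) for x in "BCD"])   # [3, 2, 1]
# IQ: from D's vantage point, the same graded order runs the other way.
print([subfield_sim("D", x, 16, 32) for x in "CBA"])  # [3, 2, 1]
```

A downstream field wired only to units 0–15 would read out the Height ordering; one wired only to units 16–31 would read out the opposite IQ ordering, from the very same four codes.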

The units comprising this coding field may generally be connected, via weight matrices, to any number of other, downstream coding fields, which could “read out” different functions of this source field, e.g., access the ordering information on either of the two sub-fields, H or Q.

The point of these examples is simply to show that a set of extended objects, i.e., sets, can simultaneously be ordered on multiple uncorrelated dimensions.  But there are other key points including the following.

  1. Although we hand-picked the codes for these examples, the model, Sparsey, which is founded on using a particular format of fixed-size sparse distributed representation (SDR), and which gave rise to the realization described in this essay, is a single-trial, unsupervised learning model that allows the ordering (similarity) relations on multiple latent variables to emerge automatically.  Sparsey is described in detail in several publications: 1996 thesis, 2010, 2014, 2017 arxiv.
  2. While conventional, localist DBs use external indexes (typically trees, e.g., B-trees, KD-trees) to realize log time best-match retrieval, the set-based representational framework described here actually allows fixed-time (no serial search) approximate best-match retrieval on the multiple, uncorrelated dimensions (as well as allowing fixed-time insertion).  And, crucially, there are no external indexes: all the “indexing” information is internal to the representations of the items themselves.  In other words, there is no need for these set objects to exist in an external coordinate system in order for the similarity/ordering relations to be represented and used.
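To make the notion of approximate best-match retrieval on code intersections concrete, here is a minimal sketch. Note the hedge: the serial scan below is purely for exposition and is emphatically not the claimed mechanism; the point of the text is that a Sparsey-style system reads the best match out in fixed time, with no scan over stored items and no external index. The stored codes are invented for illustration.

```python
# Approximate best match = maximum code intersection with the query.
def best_match(query_code, stored):
    """Return the stored item whose code most overlaps the query's."""
    return max(stored, key=lambda name: len(stored[name] & query_code))

# Three invented 6-unit codes over a shared coding field.
stored = {
    "cat":  frozenset({0, 1, 2, 3, 4, 5}),
    "dog":  frozenset({3, 4, 5, 6, 7, 8}),
    "fish": frozenset({0, 6, 9, 10, 11, 12}),
}

# A noisy / partial probe still retrieves "cat": the ordering
# information lives entirely in the codes, with no external index.
probe = frozenset({0, 1, 2, 3, 9})
print(best_match(probe, stored))  # cat
```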

Finally, I underscore two major corollary realizations that bear heavily on understanding the most expedient way forward in developing “learned indexes”.

  1. A localist representation cannot be simultaneously ordered on more than one dimension.  That’s because localist representations have point mass semantics.  All commercial DBs are localist: the records of a DB are stored physically disjointly.  True, records may generally have fields pointing to other records, which can therefore be physically shared by multiple records.  But any record must have at least some portion that is physically disjoint from all other records.  The existence of that portion implies point mass semantics, and (ignoring the trivial case where two or more fields of the records of a DB are completely correlated) a set of points can be ordered (arranged) on at most one dimension at a time.  This is why a conventional DB generally needs a unique external index (typically some kind of tree structure) for each dimension or tuple of dimensions on which the records need to be ordered so as to allow fast, i.e., log time, best-match retrieval.
  2. In fact, dense distributed representations (DDR), e.g., vectors of reals, such as those present in the internal fields of most mainstream machine learning / deep learning models, also formally have point mass semantics.  Intersection is formally undefined for vectors over the reals.  Thus, any similarity measure between vectors (points) must also formally have point mass semantics, e.g., Euclidean distance.  Consequently, DDR also precludes simultaneous ordering on multiple uncorrelated dimensions.

Fig. 3 gives a final example showing the relation between viewing items as point representations and as set representations.  Here the three stored items are purple, blue, and green.  Fig. 3a begins by showing the three items as points with no internal structure, sitting in a vector space and having some particular similarity (distance) relationships, namely that purple and blue are close and both are far away from green.  In Fig. 3b, we now have set representations of the three items.  There is one coding field here, consisting of 6×7=42 binary units, and red units show intersection with the purple item’s representation.  Fig. 3c shows that the external coordinate system is no longer needed to represent the similarity (distance) relationships, and Fig. 3d just reinforces the fact that there is really only one coding field here and that the three codes are just different activation patterns over that single field.  The change from representing information formally as points in an external space to representing it as sets (extended bodies) that require no external space will revolutionize AI / ML.


Fig. 3

Not Copenhagen, Not Everett, a New Interpretation of Quantum Reality

I’ve claimed for some time that quantum computing can be realized on a single processor Von Neumann machine and published a paper explaining the basic intuition. All that is necessary is to represent information using sparse distributed codes (SDC) rather than localist codes. With SDC, all informational states (hypotheses) represented in a system (e.g., computer) are represented in physical superposition and therefore all such states are partially active in proportion to their similarity with the current most likely (most active) state. My Sparsey® system operates directly on superpositions, transforming the superposition over hypotheses at T into the superposition at T+1 in time that does not depend on the number of hypotheses represented (stored) in the system. This update occurs via passage of signals via a recurrent channel, i.e., a recurrent weight matrix. In addition to a serial (in time) updating of the superposition, mediated by the recurrent matrix, the system may also have an output channel (a different, non-recurrent weight matrix) that “taps” off, i.e., observes, the superposition at each T. Thus the system can generate a sequence of collapsed outputs (observations) which can be thought of as a sequence of maximum likelihood estimates, as well as continuously update the superposition (without requiring collapse).

This has led me to the following interpretation of the quantum theory of the physical universe itself.  Projecting a universe of objects in a low dimensional space, e.g., 3 dimensions, up into higher dimensional spaces, causes the average distance between objects to increase exponentially with the number of dimensions. (The same is true of sparse distributed codes living in a sparse distributed code space.) But now imagine that the objects in the low dimensional space are not point masses, but rather have extension. Specifically, let’s imagine that these objects are something like ball-and-stick lattices, or 3D graphs consisting of edges and nodes. The graph has extension in 3 dimensions, but is mostly just space.  Further, imagine that the graph edges simply represent forces between the nodes (and not constrained to be pairwise forces), where the nodes are the actual material constituents of objects (similar to how an atom is mostly space…and perhaps even a proton is mostly empty space).

Now suppose that the actual universe is of huge dimension, e.g., a million dimensions, or an Avogadro’s number of dimensions, but let’s stick with one million for simplicity. Furthermore, imagine that these are all macroscopic dimensions (as opposed to the Planck-scale rolled up dimensions of string theory). Now imagine that this million-D universe is filled with macroscopic “graph” objects.  They would have macroscopic extent on perhaps a large fraction or even all of those 1 million dimensions, but they would be almost infinitely sparse or diffuse, i.e., ghost-like, so diffuse that numerous, perhaps exponentially numerous such objects, could exist in physical superposition with each other, i.e., physically intermingled.  They could easily pass through each other.  But, as they did so, they would physically interact.

Suppose that we can consider two graphs to be similar in proportion to how many nodes they share in common. Thus two graphs that had a high fraction of their nodes in common might represent two similar states of the same object.

But suppose that instead of thinking of a single graph as representing a single object, we think of it as representing a collection of objects.  In this case, two graphs having a certain set of nodes in common (intersection) could be considered to represent similar world states in which some of the same objects are present, and perhaps where some of those objects have similar internal states and some of the inter-object relations are similar.  Suppose that such a graph, S, consisted of a very large number (e.g., millions) of nodes and that a tiny subset, for concreteness, say, 1000, of those nodes corresponded to the presence of some particular object x.  Then imagine another instance of the overall graph, S’, in which 990 of those nodes are present.  We could imagine that that might represent another state of reality in which x manifests almost identically as it did in the original instance; call that version of x, x’.  Thus, if S was present, and thus if x was present, we could say that x’ is also physically present, just with 990/1000 strength rather than with full strength.  In other words, the two states of reality can be said to be physically present, just with varying strength.  Clearly, there are an exponential number of states around x that could also be said to be partially physically present.
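The “partial physical presence” arithmetic above reduces to a one-line fraction, sketched here with invented node indices: an object x is realized by 1000 nodes of world-state graph S, and 990 of those nodes are active in S’, giving x’ a presence strength of 990/1000.

```python
# Nodes are modeled as integer indices; the particular ranges are
# invented purely to realize the 990-of-1000 overlap from the text.
x_nodes = frozenset(range(1000))            # nodes realizing object x in S
s_prime_nodes = frozenset(range(10, 1500))  # nodes active in world state S'

def presence(obj_nodes, world_nodes):
    """Fraction of an object's nodes active in the current world state."""
    return len(obj_nodes & world_nodes) / len(obj_nodes)

print(presence(x_nodes, s_prime_nodes))  # 0.99
```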

Thus we can imagine that the actual physical reality that we experience at each instant is a physical superposition of an exponentially large number of possible states, where that superposition, or world state, corresponds to an extremely diffuse graph, consisting of a very large number of nodes, living in a universe of vastly high dimension.

This constitutes a fundamentally new interpretation of physical reality in which, in contrast to Hugh Everett’s “many worlds” theory, there is only one universe.  In this single universe, objects do physically interact via all the physical forces we already know of.  It’s just that the space itself has such high dimension and these objects’ constituents are so diffuse that they can simply exist on top of each other, pass through each other, etc.  Hence, we have found a way for actual physical superposition to be the physical realization of quantum superposition.

Imagine projecting this 1 million dimensional space down into 3 dimensions. These “object-graphs”, which are exponentially diffuse in the original space, will appear dense in the low dimensional manifold. Specifically, the density of such objects increases exponentially with decreasing number of dimensions. I submit that what we experience (perceive) as physical reality is simply an extremely low dimensional, e.g. 3 or 4 dimensions, projection of a hugely-high dimensional universe, whose objects are macroscopic but extremely diffuse.  Note that these graphs (or arbitrary portions thereof) can have rigid structure (due to the forces amongst the nodes).

In particular, this new theory obviates the need for the exponentially large number of physically separate universes that Everett’s theory requires.  Any human easily understands the massive increase in space in going from 1-D to 2-D or from 2-D to 3-D.  There is nothing esoteric or spooky about it.  Anyone can also understand how this generalizes to adding an arbitrary number of dimensions.  In contrast, I submit that no human, including Everett, could offer a coherent explanation of what it means to have multiple, physically separate universes.  We already have the concept, which we are all easily taught in childhood, that the “universe” is all there is.  There is no physical room for any additional universes.  The “multiverse”—a hypothesized huge or infinite set of physical universes—is simply an abuse of language.

Copenhagen maintains that all possible physical states exist in superposition at once and that when we observe reality, that superposition collapses to one state. But Copenhagen never provided an intuitive physical explanation for this quantum superposition.  What Copenhagen simply does not explain, and what Everett solves by positing an exponential number of physically separate, low-dimensional universes, filled with dense, low-D objects, I solve by positing a single super-high dimensional universe filled with super-diffuse, high-D objects.

Slightly as an aside, this new theory helps resolve a problem I’ve always had with Schrodinger’s cat.  The two states, “cat alive” and “cat dead”, are constructed to seem very different.  This misleads people into thinking that at every instant and in every physical subsystem, a veritable infinity of states coexist in superposition.  I mean…why stop at just “cat alive” and “cat dead”?  What about the state in which a toaster has appeared, or a small nuclear-powered satellite?  I suppose it is possible that some vortex of physical forces, perhaps designed by a supercomputer, could instantly rearrange all the atoms in the box from a configuration in which there was a live cat to one in which there is a toaster, or a satellite.  But I think it is better to think of transformations like this as having zero probability.  My point here is that the number of physical states to which any physical subsystem might collapse at any given moment, i.e., the cardinality of the superposition that exists at that moment, is actually vastly smaller than one might naively think, having been misled by the typical exposition of Schrodinger’s cat.  Thus, it perhaps becomes more plausible that my theory can accommodate the number of physical states that actually do coexist in superposition.

Again, this theory of what physical reality actually is came to me by first understanding and constructing a similar theory about representing physical reality.  In that SDC theory, the universe is a high-D “codespace”, the “objects” are “representations” or “codes”, and these codes are high-D but are extremely diffuse (sparse) in that codespace.

Sparse distributed representations compute similarity relations exponentially more efficiently than localist representations

If concepts are represented localistically, the most straightforward thing to do is to place those representations in an external coordinate system (of however many dimensions are desired/needed).  The reason is that localist representations of concepts have the semantics of “point masses”.  They have no internal structure and no extension.  So it is natural to view them as points in an N-dimensional coordinate system.  You can then measure the similarity of two concepts (points) using, for example, Euclidean distance.  But note that placing a new point in that N-space does not automatically compute and store the distance between that new point and ALL points already stored in that N-space.  This requires explicit computation and therefore expenditure of time and power.

On the other hand, if concepts are represented using sparse distributed codes (SDCs), i.e., sets of co-active units chosen from a much larger total field of units, where the sets may intersect to arbitrary degrees, then it becomes possible to measure similarity (inverse distance) as the size of intersection between codes.  Note that in this case, the representations (the SDCs) fundamentally have extension…they are not formally equivalent to point masses.  Thus, there is no longer any need for an external coordinate system to hold these representations.  A similarity metric is automatically imposed on the set of represented concepts by the patterns of intersections of their codes.  I’ll call this an internal similarity metric.

Crucially, unlike the case for localist codes, creating a new SDC code (i.e., choosing a set of units to represent a new concept), DOES compute and store the similarities of the new concept to ALL stored concepts.  No explicit computation, and thus no additional computational time or power, is needed beyond the act of choosing/storing the SDC itself.

Consider the toy example below.  Here, the format is that all codes will consist of exactly 6 units chosen from the field.  Suppose the system has assigned the set of red cells to be the code for the concept, “Cat”.  If the system then assigns the yellow cells to be the code for “Dog”, then in the act of choosing those cells, the fact that three of the units (orange) are shared by the code for “Cat” implicitly represents (reifies in structure) a particular similarity measure of “Cat” and “Dog”.  If the system later assigns the blue cells to represent “Fish”, then in so doing, it simultaneously reifies in structure particular measures of similarity to both “Cat” and “Dog”, or in general, to ALL concepts previously stored.  No additional computation was done, beyond the choosing of the codes themselves, in order to embed ALL similarity relations, not just the pairwise ones, but those of all orders (though this example is really too small to show that), in the memory.

[Figure: SDC reified similarities]
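The Cat/Dog/Fish example can be sketched as follows. The unit indices are assumptions chosen to match the described overlaps (three units shared between “Cat” and “Dog”); the essential point is that the act of choosing a new code is itself the act of storing its similarity to every code already stored.

```python
# Each code is 6 units from a larger coding field, modeled as a frozenset.
codes = {}

def store(name, units):
    """Storing a code implicitly reifies its similarity to all prior codes.

    The returned dict is not 'computed and stored' separately: it is
    simply read off the intersections that the choice of code created.
    """
    new = frozenset(units)
    sims = {prev: len(new & old) for prev, old in codes.items()}
    codes[name] = new
    return sims

print(store("Cat",  {0, 1, 2, 3, 4, 5}))     # {}
print(store("Dog",  {3, 4, 5, 6, 7, 8}))     # {'Cat': 3}
print(store("Fish", {0, 6, 9, 10, 11, 12}))  # {'Cat': 1, 'Dog': 1}
```

Storing “Fish” simultaneously fixed its similarity to both “Cat” and “Dog” in a single act of code selection, with no comparison loop over stored concepts.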

This is why I talk about SDC as the coming revolution in computation. Computing the similarities of things is in some sense the essential operation that intelligent computers perform.  Twenty years ago, I demonstrated, in the form of the constructive proof that is my model TEMECOR, now Sparsey®, that choosing an SDC for a new input, which respects the similarity structure of the input space, can be done in fixed time (i.e., the number of steps, thus the compute time and power, remains constant as additional items are added).  In light of the above example, this implies that an SDC system computes an exponential number of similarity relations (of all orders) and reifies them in structure also in fixed-time.

Now, what about the possibility of using localist codes, not simply placed in an N-space, but stored in a tree structure?  This is, I would think, essentially how all modern databases are designed.  The underlying information, the fields of the records, is stored in localist fashion, and some number E of external tree indexes are constructed and point into the records.  Each individual tree index allows finding the best-matching item in the database in log time, but only with respect to the particular query represented by that index.  When a new item is added to the database, all E indexes must execute their insertion operations independently.  In the terms used above, each index computes the similarity relations of a new item to ALL N stored items and reifies them using only logN comparisons.  However, the similarities are only those specific to the manifold (subspace) corresponding to that index (query).  The total number of similarity relations computed is the sum across the E indexes, as opposed to the product.  But the deeper limitation is not this sheer quantitative difference; rather, it is that having predefined indexes precludes reification of almost all of the similarity relations that may in fact exist and be relevant in the input space.

Thus I claim that SDC admits computing similarity relations exponentially more efficiently than localist coding, even localist codes augmented by external tree indexes.  And, that’s at the heart of why in the future, all intelligent computation will be physically realized via SDC….and why that computation will be able to be done as quickly and power-efficiently as in the brain.

Quantum Computing in the Offing

Huge money is being invested in quantum computing (QC) by IBM, Intel, Microsoft, Google, China…  However, I propose that it will not arrive (or at least will not arrive first) by the path these giants are pursuing.  All the “mainstream” efforts to build QC are based on exotic physical mechanisms, e.g., extremely low temperature entities that can remain in superposition (i.e., do not decohere).  But this is not needed to realize QC.  As I’ve argued elsewhere, QC is physically realizable by a standard, single-processor Von Neumann computer, i.e., your desktop, your smart phone.  It’s all a matter of how information is represented in the system, not of the physical nature of the individual units (memory bits) that represent the information.  Specifically, the key to implementing QC lies in simply changing from representing information localistically (see below) to representing information using sparse distributed representations (SDR), a.k.a. sparse distributed coding (SDC).  We’ll use “SDC” throughout.  To avoid confusion, let me emphasize immediately that SDC is a completely different concept from the far better-known and (correctly) widely-embraced concept of “sparse coding” (Olshausen & Field, 1996), though they are completely compatible.

A localist representation is one in which each item of information (“concept”) stored in the system, e.g., the concept, ‘my car’, is represented by a single, atomic unit, and that physical unit is disjoint from the representations of all other concepts in the system.  We can consider that atomic representational unit to be a word of memory, say 32 or 64 bits.  No other concept, of any scale, represented in the database can use that physical word (representational unit).  Consequently, that single representational unit can be considered the physical representation of my car (since all of the information stored in the database, which together constitutes the full concept of ‘my car’, is reachable via that single unit).  This meets the definition of a localist representation…the representations of the concepts are physically disjoint.

In contrast to localism, we could devise a scheme in which each concept is represented by a subset of the full set of physical units comprising the system, or more specifically, comprising the system’s memory.  For example, if the memory consisted of 1 billion physical bits, we could devise a scheme in which the concept, ‘my car’, might be represented by a particular subset of, say, 10,000 of those 1 billion bits.  In this case, if the concept ‘my car’ was active in that memory, that set of 10,000 bits, and only that particular subset, would be active.

What if some other concept, say, ‘my motorcycle’, needs to become active?  Would some other subset of 10,000 bits that is completely disjoint from the 10,000 bits representing my car, become active?  No.  If our system was designed this way, it would again be a localist representation (since we’d be requiring the representations of distinct concepts to be physically disjoint).  Instead, we could allow the 10,000 bits that represent my motorcycle to share perhaps 5,000 bits in common with my car’s representation.  The two representations are still unique.  After all, they each have 5,000 bits—half their overall representations—not in common with each other.  But the atomic representational units, bits, can now be shared by multiple concepts, i.e., representations can physically overlap.  Such a representation in which a) each concept is represented by a small subset of the total pool of representational units and b) those subsets can intersect, is called a sparse distributed code (SDC).

With these definitions in mind, it is crucial (for the computer industry) to realize that to date, virtually all information stored electronically on earth, e.g., all information stored in fields of records of databases, is represented localistically.  Equivalently, to date there has been virtually no commercial use of SDC on earth.  Moreover, only a handful of scientists have thus far understood the importance of SDC: Kanerva (~1988), Rachkovskij & Kussul (late 90’s), myself (early 90’s, Thesis 1996), Hecht-Nielsen (~2000), Numenta (~2009), and a few others.  Only in the past year or so have the first few attempts at commercialization begun to appear, e.g., Numenta.  Thus, two things:

  1. The computer industry may want to at least consider (due diligence) that SDC may be the next major, i.e., once-in-a-century, paradigm shift.
  2. It could be that SDC = QC.

With SDC, it becomes possible for those 5,000 bits that the two representations (‘my car’ and ‘my motorcycle’) have in common to represent features (sub-concepts) that are common to both my car and my motorcycle.  In other words, similarity in the space of represented concepts can be represented by physical overlap of the representations of those concepts.  This is something that cannot be achieved with a localist representation (because localist representations don’t overlap).  And from one vantage point, it is the reason why SDC is so superior to localist coding, in fact, exponentially superior to localist coding.

But, the deep (in fact, identity) connection of SDC and QC is not that more similar concepts will have larger intersections.  Rather, it is that if all representable (by a particular memory/system) concepts are represented by subsets of an overall pool of units and if those subsets can overlap, then any single concept, i.e., any single subset, can be viewed as, and can function as, a probability (or likelihood) distribution over ALL representable concepts.  We’ll just use “probability”.  That is, any given active representation represents all representable hypotheses in superposition.  And if the model has enforced that similar concepts are assigned to more highly overlapping codes, then the probability of any particular concept at a given moment is the fraction of that concept’s bits that are active in the currently (fully) active code (making the reasonable assumption that for natural worlds, the probabilities of two concepts should correlate with their similarities).
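This read-out can be sketched in a few lines. The codes below are invented 6-unit subsets (scaled down from the 10,000-of-1-billion example for legibility); the point is that the single currently active code doubles as a distribution over every stored concept, each concept’s strength being the fraction of its bits currently active.

```python
# Invented 6-unit codes; 'my car' and 'my motorcycle' share half their
# bits, mirroring the 5,000-of-10,000 overlap described in the text.
concepts = {
    "my car":        frozenset({0, 1, 2, 3, 4, 5}),
    "my motorcycle": frozenset({0, 1, 2, 6, 7, 8}),
    "my toaster":    frozenset({9, 10, 11, 12, 13, 14}),
}

def superposition(active_code):
    """One active code read out as a distribution over every concept."""
    return {name: len(bits & active_code) / len(bits)
            for name, bits in concepts.items()}

# Fully activating 'my car' gives it strength 1.0, gives the motorcycle
# partial strength via the shared bits, and gives the toaster none.
print(superposition(concepts["my car"]))
```

Note that the cost of this read-out depends only on the number of units per code, not on how many concepts have been stored, which is the germ of the fixed-time claim developed next.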

This has the following hugely important consequence.  If there exists an algorithm that updates the probability of the currently active single concept in fixed time, i.e., in computational time that remains constant over the life of the system (more specifically, remains constant as more and more concepts are stored in the memory), then that algorithm can also be viewed as updating the probabilities of all representable concepts in fixed time.  If the number of concepts representable is of exponential order (i.e. exponential in the number of representational units), then we have a system which updates an exponential number of concepts, more specifically, an exponential number of probabilities of concepts (hypotheses), in fixed time. Expressed at this level of generality, this meets the definition of QC.

All that remains to do in order to demonstrate QC is to show that the aforementioned fixed time operation that maps one active SDC into the next—or equivalently, that maps one active probability distribution into the next—changes the probabilities of all representable concepts in a sensible way, i.e., in a way that accurately models the space of representable concepts (i.e., accurately models the semantics, or the dynamics, or the statistics, of that space).  In fact, such a fixed time operation has existed for some time (since about 1996).  It is the Sparsey® model (formerly TEMECOR; see thesis at pubs).  And, in fact, the reason why the updates to the probability distribution (i.e., to the superposition) can be sensible is, as suggested above, that similarity of concepts can be represented by degree of intersection (overlap) of their SDCs.

I realize that this view, that SDC is identically QC, flies in the face of dogma, where dogma can be boiled down to the phrase “there is no classical analog of quantum superposition”.  But I’m quite sure that the mental block underlying this dogma for so long has simply been that quantum scientists have been thinking in terms of localist representations.  I predict that it will become quite clear in the near future that SDC constitutes a completely plausible classical analog of quantum superposition.

…more to come, e.g., entanglement is easily and clearly explained in terms of SDC…