Vectors have become the new SEO magic. I see people writing and speaking about vectors everywhere, and few of them really know what they are talking about.
Vectors are not “math”. They are used in math but they are used in many other disciplines, too. We use math to precisely describe vectors for a specific purpose. The word vector is good, old-fashioned Latin – it means “carrier” or “one who carries” – and it originally had absolutely nothing to do with mathematics.
According to Wiktionary and other dubious sources, 19th-century mathematician William Rowan Hamilton adapted the word vector to mathematics when he developed his quaternion algebra. That’s equivalent to saying, “I just invented a new math – I’m going to call it Barney Algebra.” Literally. There is no connection between the original meaning of vector and the concept of drawing lines from point to point. The connection is purely abstract.
It’s a semantic evolution, which I find to be ironic given how much Web marketers are equating vectors with semantics (the two concepts are barely connected in any way).
A Simple Primer on Linear Algebra
If you want to understand vectors then take a class in linear algebra. That is “the algebra of lines”. An algebra is a system of rules for conducting mathematical operations on a class of quantities or entities.
One of my favorite math professors explained it this way (I paraphrase): “You could take a doggie, a tree, and a boat and define an algebra for them. Just come up with some rules for doing any kind of math.”
Most people think of algebra as that math class where you learn how to solve for X in the equation 4X + 7 = 23. But that’s not even an algebra. Rather, it’s an algebraic way of thinking.
In other words, don’t expect mathematicians to explain anything in plain, simple language because they have to overcomplicate every process and idea with semantically befuddled adaptations.
Suffice it to say that linear algebra is “the algebra we use to study and manipulate lines and line segments”. A line can be represented by an infinite vector. A line segment can be represented by a finite vector. There are other ways to represent both without using vectors.
Linear algebra students learn to define lines as sequences of points in a Cartesian plane. They represent these points in (x,y) format, referring to where on the X-axis and Y-axis the points are found as intersections of … well, lines. We don’t care about the vertical line crossing the X-axis at 1 or the horizontal line crossing the Y-axis at 23 – just where they meet. That meeting point is the point which we represent by the (x,y) coordinate.
The real line – the line we’re studying and manipulating – is a series of these (x,y) points and it looks like this:
[(0,1),(1,2),(2,3),(3,4)]
Okay, technically, that is a line segment because it’s a finite sequence. But it’s represented by a vector.
And this is a line segment, too:
[(doggie),(pony),(fishy),(tree)]
The difference between these line segments is that the first is mapped to a Cartesian plane defined by two infinite number lines and the second is mapped to an abstract algebra defined by an arbitrary list of things represented by at least those four words.
My point is that you can put anything into vector form. Or, rather, you can represent anything as a vector as long as you’re using a system of rules that make sense (to you if not to anyone else).
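As a minimal sketch of that idea, here are both “line segments” as plain Python lists. The only rule I’m imposing (arbitrarily, as the point above allows) is that the “length” of either vector is its element count:

```python
# Two "line segments" as ordered lists (vectors). One lives in a
# Cartesian plane; the other in an arbitrary abstract space.
cartesian_segment = [(0, 1), (1, 2), (2, 3), (3, 4)]
abstract_segment = ["doggie", "pony", "fishy", "tree"]

# One arbitrary rule of our made-up algebra: "length" means the
# number of elements, which applies equally to both vectors.
def vector_length(vec):
    return len(vec)
```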
Semantics Doesn’t Really Use Vectors
Michel Jules Alfred Bréal (1832-1915) invented the field of semantics, which in his view had something to do with the meaning of words. He didn’t use math but it didn’t take long for people to start migrating the idea into mathematics.
Skipping ahead to the modern era, computer scientists (who are strongly influenced by mathematics) took up the challenge of creating algorithms that understand language.
So far there are no algorithms that understand language. Remember that.
To understand something is a very complicated process. We don’t really know what that means. Animals have understanding. One of the coolest new findings in language science is the fact that prairie dogs have the ability to describe you or me in rather fine detail to each other. Yes, there is a bona fide prairie dog language and it has many dialects.
The fact they have language that allows them to describe abstract things such as “human in red walking through our colony” versus “mountain lion eating our friends” means that prairie dogs understand things. They understand things semantically. Michel Bréal would be both proud and astounded, I think.
And so far as we know prairie dogs don’t use vectors to understand anything. But I’m sure some computer scientist somewhere is pondering the challenge of representing prairie dog language in vector form.
Vectors are a tool used by computer science to analyze language. I won’t even dignify them by saying we use them to understand language. The similarity between these concepts is superficial.
The Best Computer Science Can Do is Look for Patterns
Pattern analysis is pretty easy to do. It’s much, much simpler than understanding language. And vectors make it pretty easy to do pattern analysis. Mathematicians and computer scientists have been using vectors for pattern analysis for decades, going back to at least the 1940s – maybe further.
Any line segment is a pattern. A complicated pattern consists of many different line segments that are somehow connected. Since you can represent a line segment as coordinates you can transform those coordinates.
In simple, somewhat misleading terms a transformation consists of adding, subtracting, dividing, or multiplying a value. So if you have a vector like [(0,1),(1,2),(2,3),(3,4)] you can transform it by moving it 3 points higher on the Y-axis, à la [(0,4),(1,5),(2,6),(3,7)], or 4 points farther out on the X-axis, à la [(4,1),(5,2),(6,3),(7,4)], or both: [(4,4),(5,5),(6,6),(7,7)].
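Those shifts can be sketched in a few lines of Python, assuming each point is an (x, y) tuple and a “transformation” just adds a constant to one or both coordinates:

```python
# A minimal sketch of the translations described above.
def translate(points, dx=0, dy=0):
    """Shift every (x, y) point by dx along X and dy along Y."""
    return [(x + dx, y + dy) for (x, y) in points]

segment = [(0, 1), (1, 2), (2, 3), (3, 4)]

shifted_up = translate(segment, dy=3)    # 3 points higher on the Y-axis
shifted_out = translate(segment, dx=4)   # 4 points farther out on the X-axis
shifted_both = translate(segment, dx=4, dy=3)
```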
Guess what? We have some patterns here.
We can use vector transformations to normalize the data and compare vectors to see if they follow the same patterns. In a Cartesian sense we’re just looking for similar length and angle among line segments. But since we don’t have to limit ourselves to Cartesian universes we can define vectors around any set of coherent rules and look for patterns according to those rules.
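The Cartesian comparison above can be sketched like this: two segments “follow the same pattern” when their displacement vectors have similar length (magnitude) and angle (direction, measured here with cosine similarity):

```python
import math

# Magnitude = the length of a vector.
def magnitude(v):
    return math.sqrt(sum(c * c for c in v))

# Cosine similarity compares direction: 1.0 means identical
# direction; 0.0 means perpendicular.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (magnitude(a) * magnitude(b))

v1 = (3, 3)  # displacement of the segment from (0, 1) to (3, 4)
v2 = (6, 6)  # same direction, twice as long
```

Here v1 and v2 point the same way (cosine similarity of 1.0) even though one is twice the length of the other – exactly the kind of pattern a normalization step is meant to surface.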
That’s what machine learning is all about. It’s looking for patterns.
Modern Machine Learning Owes Much to Optical Analysis
There are about 3,000,000 people around the world who have expressed enough interest in machine learning to join Kaggle, a community for machine learning and data science enthusiasts founded in 2010. I am not a member. Google acquired Kaggle in 2017, for what it’s worth. Kaggle may be best known for sponsoring machine learning competitions.
The Kaggle site also provides links to publicly available datasets that people can use for their experiments. There is even a 2019 Coronavirus dataset, so you can imagine how up to date the lists are kept.
Machine learning has become very popular but it wasn’t always so popular. When Google announced it was using neural matching algorithms in 2018, SEO bloggers rushed to dominate the search results with their amateurish explanations of what neural matching is. You guys did a great job, too. You flooded Google with nonsense, thus ensuring almost no one in the Web marketing world would understand what neural matching is.
I wrote about neural matching’s history and what it really means on my personal blog. I just wanted something I could refer to in the future whenever neural matching might come up again.
The U.S. government developed neural matching as a way to compare satellite photos to maps. It was basically converting images to systems of vectors and looking for patterns. You take two images from disparate sources (maps and photographs) and convert them into a similar format (systems of vectors) and then look for similarities.
It’s a bit more complicated than that, but you get the idea.
Neural matching is an attempt to simulate the human brain’s ability to compare a map to an aerial photograph and recognize the relationship between the two things even though they look completely different. Optical analysis turned to linear algebra to help make these comparisons.
The key takeaway here is that what is essentially analog information is converted into vectors.
Analog information, by the way, is much more complex and abstract than analog data. The fact the Earth’s climate is warming is information. All the temperatures recorded around the world proving that the Earth’s climate is warming are data. The semi-random connection between all those disparate temperature readings makes them a set of analog data.
This blog post is analog information. To analyze it, you could break out all the paragraphs and then break out all the sentences from each paragraph. That would be analog data. It’s not digital in the sense that it’s not numeric data organized according to the rules of mathematics. There’s no precise value to anything, and no well-defined distinctions. I have repeated several points of information in this post, but not using the same words.
You Can Digitize Any Analog Data
The simplest way to digitize analog data is to arbitrarily say “this part stops here” and “that part begins there”. If you pick up a simple tape measure you’ll have a set of analog data in your hands. The tape may be labeled digitally but the tape itself is analog. There is no way to definitively see where the first unit of measurement begins or ends.
Now cut the tape measure into many different segments and you have digitized it. This is an abstract way of describing digitization of analog data but don’t worry about how analog each segment remains. You have no way of defining the value of “7” no matter how hard you try. You think of it as a digit but it’s really a piece of analog information.
But this is where computer scientists brought the world of semantics (the meaning of words and phrases) together with the linear algebra of machine learning (pattern analysis). They decided to digitize our words and look for patterns in them.
The digitization is achieved by putting our words into vectors. There’s no simple rule for how to vectorize text. You pick a set of rules and write your programs around it and that’s that.
Here is an example of a vector:
[the,quick,brown,fox,jumped,over,the,lazy,dog]
Now, there’s not much a machine learning system can do with that but it’s a vector. Being a vector, the sentence can be transformed into something else:
[1,2,3,4,5,6,7,8,9]
That’s the same vector just in a different format. I’ve numbered the words in the sequence in which they occur. I could do this for a lot of words. And I could then compare many sentences and phrases and look for patterns. Maybe certain phrases occur over and over again, right?
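That word-numbering, and the hunt for repeated phrases, can be sketched like this (the second sentence is invented for illustration):

```python
# Number each word by its position, as in the example above.
def positional_vector(sentence):
    return list(range(1, len(sentence.split()) + 1))

# Find word pairs (bigrams) that occur in both sentences --
# a crude way of spotting phrases that "occur over and over again".
def shared_bigrams(a, b):
    bigrams = lambda s: set(zip(s.split(), s.split()[1:]))
    return bigrams(a) & bigrams(b)

vec = positional_vector("the quick brown fox jumped over the lazy dog")
common = shared_bigrams(
    "the quick brown fox jumped over the lazy dog",
    "a quick brown bear walked past the lazy dog",
)
```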
That’s still not very useful. To approximate something like understanding (which the algorithms cannot achieve), we need to find relationships between words and phrases.
These relationships might mean something in a linguistic sense – prairie dogs can talk about relationships to each other – but computers have very short-term memories. Today’s computer technology will never get relationships. All computer dating jokes aside, there’s just no way for an algorithm to truly understand relationships.
But we can annotate every word we look at and identify connections it has to other words.
BERT & Other Algorithms Tag Relationships
Google’s BERT (Bidirectional Encoder Representations from Transformers) has received a lot of commentary (and really bad explanation) from the Web marketing community. I’m not going to explain BERT here but I’ll point you to a great explanation below.
Think of BERT as the cooking oil you use to grease a sauce pan before you start preparing a meal. That’s BERT’s place in the world of computerized semantic analysis.
BERT studies groups of text and looks for patterns in them. I wrote this article, “How to Explain BERT to Your Mom”, for the SEO Theory Premium Newsletter in November 2019:
* HOW TO EXPLAIN BERT TO YOUR MOM *
Google’s BERT algorithm is revolutionary in machine learning technology because:
- It saves other machine learning tools time by pre-labeling words in a corpus (a large body of text)
- It only works with phrases and/or sentences
- It uses special stacks of special chips called TPU pods (using TPUv3 chips)
- Its accuracy and reliability increase as it processes more text
- It randomly blanks out (“masks”) about 15% of the words fed to it
- It compares the phrases and/or sentences to each other to find patterns that predict which words best fit into the blanks
- This training method was first developed for teaching school children in the 1950s
- It uses about 20-30% more processing power than other machine learning systems like ELMo
- BERT enables Google’s machine learning experiments to complete in minutes rather than months
- Other people using BERT will need to lease Google’s Cloud TPU technology to get similar performance (or invent their own equivalent technology)
- TPU chips have more on-board RAM than the most powerful gaming PCs
- TPU chips can only process certain types of matrix operations
I think Mom will be happy with these talking points.
If you wish, you could add that I think a more descriptive acronym/name for BERT would have been MIST [Matrix Inferred Semantic Transformation].
Because it’s using matrix operations, BERT is not really processing the text backwards and forwards. That isn’t what “bidirectional” means. This article from Toward Data Science [Cf. https://tinyurl.com/y5b5kstj ] describes it so: “[BERT] is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).”
CONCLUSION
We need to reduce the discussion of BERT to the simplest terms possible. Web marketers quickly mythologized it into something it’s not. BERT doesn’t produce any scores for documents (confirmed by Google – Cf. https://tinyurl.com/y4lf7nql ) and its results are not used to select and rank search results. BERT teaches other algorithms (including, presumably, the selection and ranking algorithms) to better understand the language of the documents they are processing.
“Teach” may not be the right word. It’s more like Google took all the common processes that machine learning algorithms must do (learn the language of the corpus) and combined them into one algorithm (BERT) whose results are then stored and passed on to the other algorithms.
The other algorithms pick up processing where BERT left off, completing their individual tasks in less time and with improved precision or accuracy. BERT is like a bus that carries people to work every day, saving them time and effort.
How do you know if someone understands what BERT is and does?
Well, if all they can do is quote a research paper about BERT – if they cannot put what BERT does into their own words (and get it close to right) – then they don’t really understand what BERT is supposed to be doing.
There are many, many different explanations of how BERT works among machine learning specialists. Each has his or her own way of visualizing what the algorithm does. In fact, the article I quote above is just one of many from our newsletter that discuss machine learning algorithms, and I’ve written much, much more about BERT than that.
I include it here because it is the best summation I could devise for what I want to say about BERT.
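The masking step from the talking points above can be sketched like this. Real BERT masks subword tokens and feeds them to a transformer for prediction; this only illustrates the blanking itself:

```python
import random

# Blank out roughly 15% of the words so a model can learn to
# predict them from the surrounding context.
def mask_words(words, rate=0.15, seed=42):
    rng = random.Random(seed)  # seeded for reproducibility
    masked, answers = [], {}
    for i, w in enumerate(words):
        if rng.random() < rate:
            masked.append("[MASK]")
            answers[i] = w  # the training target for this position
        else:
            masked.append(w)
    return masked, answers
```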
Truth be told, Google has already developed other, after-BERT algorithms that have a more direct impact on search results and no one is talking about them. BERT is so yesterday that by the time you first heard about it Google had already moved on.
A Machine Learning Vector is Rather Complex
Take the vectorized representation I present above of “the quick brown fox jumped over the lazy dog”:
[1,2,3,4,5,6,7,8,9]
You can’t do much with that but you can annotate each element in the vector with data that can be used by procedures and functions within the algorithms:
[1:some data,2:more data,3:etc,4:you get it,5:keep going,6:yup, more data,7:have some data,8:doing its thing,9:study this]
What goes after those colons?
That’s up to the person designing the software. And they don’t have to embed the annotations in the primary vectors. This is just one way to do it.
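One possible shape for those annotations is a separate lookup keyed by position. The fields here (part of speech, lemma) are invented purely for illustration – as noted above, the real fields are whatever the software designer decides:

```python
# The primary vector: words numbered by position.
words = ["the", "quick", "brown", "fox", "jumped",
         "over", "the", "lazy", "dog"]

# Annotations stored outside the primary vector, keyed by position
# (1-based, matching the numbering in the example above).
annotations = {
    1: {"word": "the", "pos": "DET"},
    2: {"word": "quick", "pos": "ADJ"},
    4: {"word": "fox", "pos": "NOUN"},
    5: {"word": "jumped", "pos": "VERB", "lemma": "jump"},
}
```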
Much machine learning these days is performed by people who have no idea whatsoever of what the base code is doing. They might be writing instructions in meta code. That’s not necessarily a bad thing. But the people developing the tools don’t say much about how they are coded.
To identify relationships between words or between phrases a machine learning system must record the connections. Some people still use the term “co-occurrences” and most marketers understand that. If two words occur in the same sentence they “co-occur” in that sentence.
Modern semantic analysis goes well beyond that. It wants to know how often a word like “quick” is associated with a specific class of nouns that includes “fox” (e.g., “quick horse”, “quick man”, “quick fish”, etc.).
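Counting those associations is straightforward. Here is a minimal sketch, over a tiny invented corpus, of tallying which words follow “quick” – note that deciding whether “fox”, “horse”, and “lime” belong to the same class of nouns is exactly the part the counting cannot do:

```python
from collections import Counter

corpus = [
    "the quick fox ran",
    "a quick horse galloped past the quick fox",
    "drop the quick lime in the kiln",
]

# Tally every word that immediately follows "quick".
follows_quick = Counter()
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        if a == "quick":
            follows_quick[b] += 1
```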
But who defines the class?
What is the difference between a quick fox and a quick runner? How does a quick man differ from quick lime?
Simple pattern analysis won’t get you there. Many machine learning experiments still rely on human classification because algorithms cannot understand language. One of the many goals of modern machine learning is to develop self-classifying systems, where the algorithms can pluck out accurate relationships from a huge blob of texts that use words in many different ways.
When Google announced its use of neural matching, it made a big deal about being able to identify the capitals of countries (and sub-regions) around the world just by analyzing texts. In computer science that IS a big deal. But it’s a far cry from figuring out whether “Paris” refers to a city, a person, the French government, or a political agreement.
Algorithms don’t know what cities are. Algorithms don’t know what people are. Algorithms don’t know what governments are. Algorithms don’t know what political agreements are.
The more data a machine learning system can associate with any word or phrase the better that system becomes at matching patterns.
Pattern Analysis Demands Huge Amounts of Data
In computer science the vectors used by machine learning algorithms are just data structures. In classic data processing vocabulary they are just data records like your customer account record at a bank or ecommerce site.
Matching two or three vectors to each other doesn’t really accomplish much. The objective with pattern analysis is to identify one or more rules by which you can make pretty good predictions.
I won’t even say “reliable” predictions. The goal is to make predictions you can work with but which don’t have to meet a standard of perfection.
Given a phrase like “the quick brown fox” a machine learning system should be able to predict that “jumped over the lazy dog” is likely to come next, or may be used within close proximity.
The predictions won’t always be right but the more texts an algorithm can process and annotate, the better its predictions become.
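A crude version of that kind of prediction can be sketched with nothing more than bigram counts – real systems use far richer models, but the pattern-counting idea is the same:

```python
from collections import Counter, defaultdict

# Count, for each word, which words follow it and how often.
def train(sentences):
    follows = defaultdict(Counter)
    for s in sentences:
        words = s.split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

# Predict the most frequently observed next word, if any.
def predict_next(follows, word):
    counts = follows.get(word)
    return counts.most_common(1)[0][0] if counts else None

model = train([
    "the quick brown fox jumped over the lazy dog",
    "the quick brown fox jumped over the fence",
    "the quick brown rabbit hid",
])
```

The more sentences you feed `train()`, the more often `predict_next()` agrees with what actually comes next – which is the whole point made above.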
These systems process hundreds of thousands or millions of vectors of digitized text. They can’t process everything and it wouldn’t make sense for them to do that.
Much like language itself, machine learning must also branch into dialects. To make better predictions for jargonistic language the algorithms must learn from – that is, identify the relationships between words and phrases in – jargonistic sources.
Every time I read that Google trains an algorithm on Wikipedia articles I just want to gag and throw my computer across the room. I understand why they do this (just to develop the algorithms, hopefully) but their data must be cleaned up later. Hopefully they find better sources of text to process for real-world applications.
Wikipedia is convenient because it publishes millions of articles that anyone can download. But Wikipedia speaks a language that even Wikipedia doesn’t understand. It’s not representative of real-world human speech and writing. Only one person wrote this blog post. Any Wikipedia article that has been around for years is likely to have been edited by many people.
A Word about Transformers
BERT and other algorithms have introduced the idea of transformers into the Web marketing lexicon but (thankfully) people haven’t tried to explain what they do very much. I guess the idea is too esoteric for most people to even guess at what a transformer does.
Technically, there are many different kinds of transformers. But as far as Google transformer algorithms go, I like Peter Bloem’s explanation very much. You don’t have to squelch through all the math. He explains it in simple language. And he explains BERT about as well as anyone immersed in jargon and code can explain it.
To condense it down to the simplest level, these systems are computing a lot of weighted averages. Those weighted averages help them normalize the data and look for patterns. A transformer is a black box application that converts a jumble of sources into a meaningful derivation. Think of it like turning wheat into pasta.
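Those weighted averages can be sketched at their simplest: each word’s new representation becomes a weighted average of all the word vectors around it, with the weights coming from dot-product similarity. This is the core of transformer self-attention, heavily simplified (no learned parameters, no scaling, no multiple heads):

```python
import math

# Turn raw similarity scores into weights that sum to 1.
def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Replace each vector with a weighted average of all vectors,
# weighted by how similar (dot product) each pair is.
def attend(vectors):
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(q))]
        out.append(mixed)
    return out
```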
Researchers don’t care about individual vectors of data. They have no meaning in the context of the larger set of vectors of similar data. The meaning comes from what you extract from all that. A transformer provides one way to extract meaning (or to create context) from the data. It’s not the only way but it’s the cool kid on the block.
Wrapping Up
When you type a query into Google’s search box it will be vectorized later for analysis. When Google “suggests” ways to complete the query they are showing you predictions their algorithms made.
They’re using vector analysis in a lot of interesting ways, more ways than one person can comprehend or explain. But what everyone needs to understand is that vectors are just a stepping stone on the way to something else. You don’t need to influence vector analysis to improve your rankings. And that’s a good thing because you can’t.
Vectors are not math. They are just things. The math represents what we do to the things, and kind of points toward what we hope to get from the things. The SEO discussion should never have become weighted down in vectors. But now that it is you should avoid all the nonsense that is being written about them.
You can’t just download Word2Vec and become an expert in vectors, machine learning, or semantic analysis.
A Great Video Introducing Vectors and Vector Math
Siraj Raval is a computer science graduate who has taught AI and launched the School of AI. In this video from 2017, he mixes math with intuitive representations of vectors and how they are used. He explains what Word2Vec does, and the video encapsulates much of what I wrote above.
WARNING: He has been accused of scamming people and copying other videos. I’m not selling his services. The video explains vectors very well in plain English. I may replace it later. It’s still a better explanation than you’ll get from the leading SEO “experts” talking about vectors.