When you don’t know something, what do you do? You Google it! Type in your topic, click the search button and… BAM! A second later, you have a list of web sites that will lead you to your answer, if it exists on the web. But how do they do it? How does Google get you the information you need so quickly and with such accuracy? I studied natural language processing for my undergraduate and graduate theses, so I know the general ways to achieve such a task, but I don’t know the details of how Google does it. Someone asked, though, so I’m going to BRIEFLY address it now…
There is one very simple thing that Google does to increase the loading speed of their web site: they do not bloat their homepage or results pages with tons of images or unnecessary HTML code. There is just the logo, a tool bar, a couple of buttons and a text box. To see what I mean, right-click on the Google homepage and select “view page source”. A window with some HTML code will pop up. Then do the same on Yahoo’s homepage. While you may not understand any of it, you’ll see at least twice as much code on Yahoo’s page as on Google’s. This alone doesn’t determine which one is faster or more accurate, but it does help with loading speed.
Another major factor is the machines behind the Google search engine. There are hundreds of thousands of machines all over the world. These machines process the Internet in segments to get the latest and greatest information out there and then cache the data. Cached pages are web pages that have been previously read and saved, and it is these billions and billions of cached pages, not the live web, that Google searches through. By using these previously read pages, Google eliminates the need to go out to the Internet and get page data for each individual search, which is what increases speed. Each machine updates its assigned cached pages on a regular basis (about once a day, according to a lecture from Google that I watched). So, even though Google doesn’t search the absolute latest version of each page, that is a small price to pay for a huge increase in speed.
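To make the caching idea concrete, here is a minimal sketch in Python. It is only my illustration of the general technique, not how Google actually does it: the `PageCache` class, the `fetch` function and the `clock` parameter are all names I made up, and the once-a-day refresh is just the figure from the lecture mentioned above.

```python
import time


class PageCache:
    """Toy cache: keeps a saved copy of each page and re-reads it only when stale."""

    def __init__(self, fetch, max_age_seconds=24 * 60 * 60, clock=time.time):
        self.fetch = fetch              # hypothetical function: url -> page text
        self.max_age = max_age_seconds  # assumed refresh interval (about once a day)
        self.clock = clock              # injectable so the example can fake time
        self.pages = {}                 # url -> (time_saved, page_text)

    def get(self, url):
        now = self.clock()
        saved = self.pages.get(url)
        if saved is not None and now - saved[0] < self.max_age:
            return saved[1]             # fast path: no trip to the Internet
        text = self.fetch(url)          # slow path: read the live page
        self.pages[url] = (now, text)
        return text


# Usage: a fake fetch function that records how often it is called.
calls = []

def fake_fetch(url):
    calls.append(url)
    return "contents of " + url

cache = PageCache(fake_fetch)
cache.get("example.com/dogs")
cache.get("example.com/dogs")
print(len(calls))  # 1: the second lookup was served from the cache
```

The trade-off is the same one described above: a lookup may return a copy that is up to a day old, but it never has to wait on the live web unless the saved copy has expired.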
In order to organize all the cached pages, Google’s computers use what is called an inverted index. Such an index maps keywords to the locations of the data that contain them. For example, the word “dog” would be mapped to the set of web pages that contain the word “dog”. Then, using a search algorithm, Google is able to give you the best results. It takes more time to create an inverted index than a non-inverted index (one that maps each location to its data), but it is a lot faster to query, which is what matters more to the user.
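Here is a small Python sketch of an inverted index, using the “dog” example from above. The function names (`build_inverted_index`, `search`) and the three sample pages are my own inventions for illustration, and splitting text on whitespace is a big simplification of what a real search engine would do.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """pages: dict of page location -> text. Returns word -> set of locations."""
    index = defaultdict(set)
    for location, text in pages.items():
        for word in text.lower().split():
            index[word].add(location)
    return index


def search(index, query):
    """Return the locations that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first set so the index itself is never modified.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up sample pages (a non-inverted view: location -> data).
pages = {
    "page1": "my dog likes the park",
    "page2": "cats and a dog",
    "page3": "the park is closed",
}
index = build_inverted_index(pages)
print(sorted(search(index, "dog")))       # ['page1', 'page2']
print(sorted(search(index, "dog park")))  # ['page1']
```

Building the index means reading every word of every page once, which is the slow part. After that, answering a query is just a few dictionary lookups and set intersections, with no need to scan the pages themselves.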
There are tons of papers written on the search algorithms that Google uses, and they are way beyond the scope of this blog post. But I can say that Google uses technologies and conducts research in the fields of machine learning, natural language processing, data management and parallel computing to improve these algorithms. Hundreds of engineers work on them every day to bring you faster, more accurate results. The basic idea behind determining accuracy, however, is this: if a web page has a good reputation (a lot of other sites have referenced it) and contains all or most of the words in your query, it gets a higher rank than pages without these properties. Google also uses data collected from their users to determine which sites are visited most often for a given search; even your browser’s history can be used to increase accuracy (there is some controversy about these practices due to privacy concerns).
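The basic ranking idea can be sketched as a toy scoring function in Python. To be clear, this is not Google’s actual algorithm: the formula (fraction of query words matched, multiplied by one plus the number of referencing sites), the function names and the sample pages are all assumptions I made up to illustrate “good reputation plus matching words gets a higher rank”.

```python
def score(page_words, inbound_links, query):
    """Toy score: share of query words on the page, scaled by reputation.

    The weighting is an arbitrary assumption for illustration only.
    """
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    matched = len(query_words & page_words) / len(query_words)
    return matched * (1 + inbound_links)


def rank(pages, query):
    """Return page names with a non-zero score, best first."""
    scores = {
        name: score(info["words"], info["inbound_links"], query)
        for name, info in pages.items()
    }
    matching = [name for name in scores if scores[name] > 0]
    return sorted(matching, key=lambda name: scores[name], reverse=True)


# Made-up pages: the words they contain and how many sites reference them.
pages = {
    "a": {"words": {"dog", "training", "tips"}, "inbound_links": 40},
    "b": {"words": {"dog", "food"}, "inbound_links": 50},
    "c": {"words": {"cat", "training"}, "inbound_links": 3},
    "d": {"words": {"bird", "seed"}, "inbound_links": 900},
}
print(rank(pages, "dog training"))  # ['a', 'b', 'c']
```

Page “a” wins because it contains both query words and is well referenced. Page “d” has the best reputation by far but matches nothing in the query, so it is left out entirely.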
So, there is my basic understanding of the way Google works. I may look into some of the published details of the search algorithms for future blog posts (if you are really interested, just ask!). But I think this is enough for everyone (including myself) to think about for now…