
Saturday, 5 October 2013

Matt Cutts Explains How Google Search Works


Matt Cutts, the head of the web spam team at Google, posted an eight-minute video on how Google Search works. From crawling and indexing to ranking, he details how Google’s search engine does its job.

In the video, Matt explains how PageRank is used, along with crawling timelines, frequencies, priorities, and the indexing and filtering processes within Google’s databases.


Here is the transcript as rendered by YouTube (a few illustrative code sketches follow it): 
0:00 MATT CUTTS: Hi, everybody.
0:01 We got a really interesting and very expansive question
0:04 from RobertvH in Munich.
0:06 RobertvH wants to know–
0:09 Hi Matt, could you please explain how Google’s ranking
0:12 and website evaluation process works starting with the
0:14 crawling and analysis of a site, crawling time lines,
0:18 frequencies, priorities, indexing and filtering
0:21 processes within the databases, et cetera?
0:25 OK.
0:25 So that’s basically just like, tell me
0:27 everything about Google.
0:28 Right?
0:29 That’s a really expansive question.
0:30 It covers a lot of different ground.
0:32 And in fact, I have given orientation lectures to
0:35 engineers when they come in.
0:37 And I can talk for an hour about all those different
0:40 topics, and even talk for an hour about a very small subset
0:43 of those topics.
0:45 So let me talk for a while and see how much of a feel I can
0:48 give you for how the Google infrastructure works, how it
0:51 all fits together, how our crawling and indexing and
0:53 serving pipeline works.
0:55 Let’s dive right in.
0:57 So there’s three things that you really want to do well if
0:59 you want to be the world’s best search engine.
1:01 You want to crawl the web comprehensively and deeply.
1:03 You want to index those pages.
1:05 And then you want to rank or serve those pages and return
1:08 the most relevant ones first.
1:10 Crawling is actually more difficult
1:11 than you might think.
1:13 Whenever Google started, whenever I joined back in
1:16 2000, we didn’t manage to crawl the web for something
1:18 like three or four months.
1:20 And we had to have a war room.
1:22 But a good way to think about the mental model is we
1:25 basically take page rank as the primary determinant.
1:28 And the more page rank you have– that is, the more
1:31 people who link to you and the more reputable those people
1:34 are– the more likely it is we’re going to discover your
1:37 page relatively early in the crawl.
1:39 In fact, you could imagine crawling in strict page rank
1:41 order, and you’d get the CNNs of the world and The New York
1:45 Times of the world and really very high page rank sites.
1:49 And if you think about how things used to be, we used to
1:51 crawl for 30 days.
1:53 So we’d crawl for several weeks.
1:56 And then we would index for about a week.
1:59 And then we would push that data out.
2:01 And that would take about a week.
2:04 And so that was what the Google dance was.
2:05 Sometimes you’d hit one data center that had old data.
2:07 And sometimes you’d hit a data center that had new data.
2:10 Now there’s various interesting tricks
2:13 that you can do.
2:13 For example, after you’ve crawled for 30 days, you can
2:16 imagine recrawling the high page rank guys so you can see
2:19 if there’s anything new or important that’s hit on the
2:21 CNN home page.
2:22 But for the most part, this is not fantastic.
2:25 Right?
2:25 Because if you’re trying to crawl the web and it takes you
2:28 30 days, you’re going to be out-of-date.
2:30 So eventually, in 2003, I believe, we switched as part
2:36 of an update called Update Fritz to crawling a fairly
2:40 interesting significant chunk of the web every day.
2:43 And so if you imagine breaking the web into a certain number
2:47 of segments, you could imagine crawling that part of the web
2:51 and refreshing it every night.
2:53 And so at any given point, your main base index would
2:58 only be so out of date.
3:00 Because then you’d loop back around and you’d refresh that.
3:03 And that works very, very well.
3:04 Instead of waiting for everything to finish, you’re
3:06 incrementally updating your index.
3:08 And we’ve gotten even better over time.
3:10 So at this point, we can get very, very fresh.
3:14 Any time we see updates, we can usually
3:16 find them very quickly.
3:18 And in the old days, you would have not just a main or a base
3:20 index, but you could have what were called supplemental
3:24 results, or the supplemental index.
3:26 And that was something that we wouldn’t crawl and refresh
3:28 quite as often.
3:29 But it was a lot more documents.
3:31 And so you could almost imagine having really fresh
3:35 content, a layer of our main index, and then more documents
3:40 that are not refreshed quite as often, but there’s a lot
3:42 more of them.
3:43 So that’s just a little bit about the crawl and how to
3:45 crawl comprehensively.
3:47 What you do then is you pass things around.
3:49 And you basically say, OK, I have crawled a large fraction
3:53 of the web.
3:54 And within that web you have, for example, one document.
3:58 And indexing is basically taking things in word order.
4:04 Well, let’s just work through an example.
4:06 Suppose you say Katy Perry.
4:10 In a document, Katy Perry appears right
4:13 next to each other.
4:14 But what you want in an index is which documents does the
4:18 word Katy appear in, and which documents does the word
4:20 Perry appear in?
4:22 So you might say Katy appears in documents 1, and 2, and 89,
4:26 and 555, and 789.
4:32 And Perry might appear in documents number 2, and 8, and
4:37 73, and 555, and 1,000.
4:42 And so the whole process of doing the index is reversing,
4:47 so that instead of having the documents in word order, you
4:50 have the words, and they have it in document order.
4:53 So it’s, OK, these are all the documents that a
4:54 word appears in.
4:56 Now when someone comes to Google and they type in Katy
4:59 Perry, you want to say, OK, what documents might match
5:02 Katy Perry?
5:03 Well, document one has Katy, but it doesn’t have Perry.
5:06 So it’s out.
5:08 Document number two has both Katy and Perry, so that’s a
5:11 possibility.
5:12 Document eight has Perry but not Katy.
5:15 89 and 73 are out because they don’t have the right
5:18 combination of words.
5:19 555 has both Katy and Perry.
5:22 And then these two are also out.
5:25 And so when someone comes to Google and they type in
5:27 Chicken Little, Britney Spears, Matt Cutts, Katy
5:29 Perry, whatever it is, we find the documents that we believe
5:32 have those words, either on the page or maybe in back
5:35 links, in anchor text pointing to that document.
5:38 Once you’ve done what’s called document selection, you try to
5:41 figure out, how should you rank those?
5:43 And that’s really tricky.
5:44 We use page rank as well as over 200 other factors in our
5:49 rankings to try to say, OK, maybe this document is really
5:52 authoritative.
5:53 It has a lot of reputation because it has
5:55 a lot of page rank.
5:56 But it only has the word Perry once.
5:58 And it just happens to have the word Katy somewhere else
6:01 on the page.
6:02 Whereas here is a document that has the word Katy and
6:04 Perry right next to each other, so there’s proximity.
6:07 And it’s got a lot of reputation.
6:09 It’s got a lot of links pointing to it.
6:12 So we try to balance that off.
6:13 You want to find reputable documents that are also about
6:16 what the user typed in.
6:18 And that’s kind of the secret sauce, trying to figure out a
6:20 way to combine those 200 different ranking signals in
6:23 order to find the most relevant document.
6:25 So at any given time, hundreds of millions of times a day,
6:30 someone comes to Google.
6:32 We try to find the closest data center to them.
6:34 They type in something like Katy Perry.
6:36 We send that query out to hundreds of different machines
6:38 all at once, which look through their little tiny
6:41 fraction of the web that we’ve indexed.
6:43 And we find, OK, these are the documents that
6:45 we think best match.
6:47 All those machines return their matches.
6:49 And we say, OK, what’s the creme de la creme?
6:52 What’s the needle in the haystack?
6:53 What’s the best page that matches this query across our
6:56 entire index?
6:57 And then we take that page and we try to show it with a
7:00 useful snippet.
7:01 So you show the key words in the context of the document.
7:03 And you get it all back in under half a second.
7:06 So that’s probably about as long as we can go on without
7:10 straining YouTube.
7:11 But that just gives you a little bit of a feel about how
7:13 the crawling system works, how we index documents, how things
7:16 get returned in under half a second through that massive
7:19 parallelization.
7:20 I hope that helps.
7:21 And if you want to know more, there’s a whole bunch of
7:23 articles and academic papers about Google, and page rank,
7:26 and how Google works.
7:28 But you can also apply to–
7:30 there’s jobs(at)google.com, I think, or google.com/jobs, if
7:34 you’re interested in learning a lot more about how search
7:36 engines work.
7:37 OK.
7:37 Thanks very much.
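
To make a few of the ideas in the talk concrete, here are some small sketches. First, Matt describes discovering high-PageRank pages early in the crawl, as if crawling in strict PageRank order. Here is a toy Python sketch of that idea (an illustration only, not Google's actual scheduler):

```python
import heapq

def crawl_in_pagerank_order(frontier):
    """frontier: {url: pagerank}. Yields URLs highest-PageRank first.
    heapq is a min-heap, so scores are pushed negated."""
    heap = [(-pr, url) for url, pr in frontier.items()]
    heapq.heapify(heap)
    while heap:
        _neg_pr, url = heapq.heappop(heap)
        yield url

# Hypothetical PageRank-like scores: reputable sites come out first,
# mirroring the "CNNs of the world" example in the video.
frontier = {"cnn.com": 9.1, "nytimes.com": 9.0, "tiny-blog.example": 1.2}
print(list(crawl_in_pagerank_order(frontier)))
# -> ['cnn.com', 'nytimes.com', 'tiny-blog.example']
```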
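
Next, the Katy Perry example. Indexing "reverses" documents so that each word maps to the list of document IDs it appears in, and document selection intersects those posting lists. A minimal sketch using the same document numbers as the video:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def select_documents(index, query):
    """Keep only documents that contain every query word, by
    intersecting the posting lists (document selection)."""
    postings = [set(index.get(word, ())) for word in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# The posting lists from the video: "katy" is in docs 1, 2, 89, 555,
# and 789, while "perry" is in docs 2, 8, 73, 555, and 1000.
index = {
    "katy":  [1, 2, 89, 555, 789],
    "perry": [2, 8, 73, 555, 1000],
}
print(select_documents(index, "Katy Perry"))  # -> [2, 555]
```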

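Finally, ranking. Google combines PageRank with over 200 other signals; the sketch below blends just two stand-ins, a reputation score and term proximity, purely to illustrate the balancing act Matt describes (the weights and formula are invented for this example):

```python
def proximity_score(tokens, w1, w2):
    """1.0 when the two query words are adjacent, smaller as the
    minimum gap between their positions grows (two-word queries only)."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    if not pos1 or not pos2:
        return 0.0
    return 1.0 / min(abs(a - b) for a in pos1 for b in pos2)

def rank(candidates, w1, w2):
    """candidates: list of (doc_id, text, reputation) tuples. Returns
    doc_ids ordered by an invented 50/50 blend of the two signals."""
    scored = sorted(
        ((0.5 * rep + 0.5 * proximity_score(text.lower().split(), w1, w2),
          doc_id)
         for doc_id, text, rep in candidates),
        reverse=True)
    return [doc_id for _score, doc_id in scored]

docs = [
    (2,   "katy was mentioned here and later perry appeared", 0.9),
    (555, "katy perry tour dates announced",                  0.4),
]
print(rank(docs, "katy", "perry"))  # -> [555, 2]: proximity beats raw reputation
```
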
Google Rolls Out an Engaging Infographic: A Look at Google's "How Search Works"


How the Google Search Engine Works

Last Friday afternoon, Google announced a great new interactive infographic named How Search Works.

This is an excellent way for a newcomer to learn the basics of how exactly Google crawls, indexes, ranks, and displays its search results. But how useful is it for a seasoned search marketer like yourself? I'll lay out the elements I find most beneficial to more knowledgeable SEOs and webmasters.

A Look at Examples of Spam Pages Removed From Google:

My favorite part is that Google is showing some real examples of spam pages pulled from its index within the past several hours. You can see those examples on this page, and Google tells us these are recent examples of pages that "appear to utilize aggressive spam methods such as automatically generated gibberish, cloaking and scraping content from other websites." Before clicking through and looking at them, be warned (by Google): "these screenshots are created automatically and are not manually filtered. While uncommon, you may see offensive, sexually explicit, or violent content."


Actually, I should have a page up shortly that archives all of these examples so you can look back at them historically. I'll share that page when it is done. 

Here is my favorite example of a spam page I've seen recently: 

Cat In The Hat Website Spam

Charts Of Manual Actions By Category, Reconsideration Requests & Notifications

Google has told us many times about the percentages and numbers of manual actions, reconsideration requests, and webmaster notifications it has sent out. But it never plotted that information for us, until now. I'll selfishly take credit for it (kidding), but I did ask Matt Cutts a couple of years ago if they could build these charts. I'm not sure I was the first to ask... 

So what did Google tell us with these charts? 

Manual Actions by Category: This chart shows the number of domains affected by a manual action over time, broken down by the different spam types. Google says only 0.22% of domains have been manually marked for removal. 

Google Spam Actions Chart


The types of spam recorded here include pure spam, legacy, hacked site, unnatural links from your site, automatically created content and infinite spaces, cloaking and/or sneaky redirects, thin content with little or no added value, unnatural links to your site, parked domains, user-generated spam, hidden text and/or keyword stuffing, and spammy free hosts and dynamic DNS providers. You can learn more about these over here.

Reconsideration Requests: This chart displays the weekly number of reconsideration requests since 2006. 
Google Reconsideration Requests Chart


Webmaster Notifications: This chart shows the number of spam notifications sent to site owners through Webmaster Tools since 2010. 

Google Notification Chart


Now you can plot these charts against any penalty you may have received, to make yourself feel better and maybe to figure out how to fix the issue. 

Published Google Quality Rater Guidelines:

Finally, Google itself published the PDF document of the quality rater guidelines. This document has been leaked many times; I guess it just made sense for Google to post it. What changed? Matt McGee covered that over here.

So those are the three things I would pull out as being most interesting for SEOs and webmasters to look at.

Interflora Returns to Google Rankings After 11-Day Google Penalty


Advertorial Spam = 11-Day Google Penalty for Interflora

Interflora Logo
About two weeks ago, I read that Interflora had been penalized by Google for advertorial link spam. What did that result in? Just an 11-day Google penalty. Think about that: 11 days. 

Matt McGee was the first person to notice Interflora’s return in Google after its short time-out. 

Presumably, Interflora was able to convince the UK newspapers to remove all the advertorial spam and then submitted a reconsideration request. For a small webmaster that would be nearly impossible, but if you are a giant, everything is possible. 
Now a search for their name [Interflora] shows they have regained their position.

Interflora Comes Back in Google


Naturally, this leads various SEOs to believe that bigger brands can get away with things that smaller brands can’t. 

Friday, 4 October 2013

Google Will Soon Ignore Links You Tell It To


Matt Cutts on Penalties vs. Algorithm Changes and the Link Denial Tool 

Matt Cutts on Ignoring Links



In his SMX Advanced keynote "You and A," Matt Cutts mentioned that Google is considering offering a tool that would let webmasters deny certain links. 

As you may know, Matt McGee is the executive news editor of Search Engine Land. He asked some questions about ranking in Google. Google has certainly committed to transparency, as the video proves.



During the presentation, Matt Cutts said the following, in reply to a question about negative SEO:
The story of this year has been more transparency, but we’re also trying to be better about enforcing our quality guidelines. People have asked questions about negative SEO for a long time. Our guidelines used to say it’s nearly impossible to do that, but there have been cases where that’s happened, so we changed the wording on that part of our guidelines.
Some have suggested that Google could deny links. Even though we put in a lot of protection against negative SEO, there’s been so much talk about it that we’re talking about being able to enable that, possibly in a month or two or three.


Plenty of websites wrote about Google’s wording change regarding negative SEO, which looked like an admission from the company that the practice is indeed possible. These words from Cutts appear to be further confirmation. 

When Google launches this tool, presuming it really does, it will be very interesting to watch how the rankings move. It should be an indication of just how significant links actually are these days. 

As you may know, Google has sent 700,000 warnings to webmasters this year, and such a tool would help users take quick “manual action” on links rather than spending lots of time sending link removal requests to other sites. 
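
One plausible shape for such a tool’s input, sketched here purely as an illustration, is a plain text file listing the links a site owner wants Google to ignore, one entry per line (all URLs below are made up):

```
# Lines starting with "#" are comments.
# Deny one specific spammy page:
http://spam.example.com/paid-links.html
# Deny every link from an entire domain:
domain:link-farm.example.net
```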

Still, according to Cutts, not many of those warnings were actually about links. 

Matt Cutts and Will Paccione

Update: Here’s his clarification:


Matt Cutts' Clarification on Google+


 